koldfront

OpenAI crawling websites #ai

2025-11-14
[Image: OpenAI logo with goatse hands grabbing it]

If you run a webserver, you have probably seen hits from OpenAI's crawlers.

They nicely announce themselves in the User-Agent header, and include a link to a page, https://openai.com/gptbot, from which I quote:

OpenAI uses the following robots.txt tags to enable webmasters to manage how their sites and content work with AI. [...] Disallowing GPTBot indicates a site’s content should not be used in training generative AI foundation models.

For my little calendar project, Sundial, the robots.txt has since day 0 said:

User-agent: *
Disallow: /

Because, as you can imagine, there are a lot of pages in a digital calendar.

So it was surprising to see the OpenAI crawler requesting page after page from all manner of years, a lot of them negative numbers, even.

The robots.txt clearly says that crawlers should not request anything!
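
For reference, here is a minimal sketch of how a well-behaved crawler would read those two lines, using Python's standard urllib.robotparser. The helper name is_allowed and the sample path are only illustrative; this is not a claim about what OpenAI's crawler actually runs:

```python
from urllib.robotparser import RobotFileParser

# The two lines quoted above from Sundial's robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, path: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    # parse() takes the file's lines; nothing is fetched over the network.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # The wildcard group applies to every crawler, GPTBot included.
    print(is_allowed("GPTBot", "/-1000/03/"))
```

With the wildcard rule, every user agent gets the same answer for every path, so a crawler that honours robots.txt has nothing left to request except robots.txt itself.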

Five days ago I extended the robots.txt to explicitly tell the OpenAI crawlers not to crawl the calendar, in case * wasn't explicit enough for OpenAI.

They are still at it:

74.7.227.55 - - [14/Nov/2025:23:23:42 +0100] "GET /-1000/03/ HTTP/1.1" 200 6031 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)"

So they are basically lying to people's faces when they say that robots.txt allows webmasters to manage how OpenAI crawls their websites.

(Yes, I realize they are trying to weasel-word their way around what robots.txt means, by writing that they interpret disallowing crawling as an indication that the content should not be used for training their models - but why crawl an infinite number of pages you won't be using!?!)

It is quite the shit show.

On the positive side, I have apparently unintentionally created an AI crawler tarpit.
