OpenAI crawling websites #ai

If you run a web server, you have probably gotten hits from OpenAI's crawlers.
They nicely announce themselves in the User-Agent header, and include a link to a page: https://openai.com/gptbot - from which I quote here:
OpenAI uses the following robots.txt tags to enable webmasters to manage how their sites and content work with AI. [...] Disallowing GPTBot indicates a site’s content should not be used in training generative AI foundation models.
For my little calendar project, Sundial, the robots.txt has since day 0 said:
User-agent: *
Disallow: /
Because, as you can imagine, there are a lot of pages in a digital calendar.
So it was surprising to see the OpenAI crawler requesting page after page from all manner of years, a lot of them negative numbers, even.
The robots.txt clearly says that crawlers should not request anything!
Five days ago I extended the robots.txt to explicitly tell the OpenAI crawlers not to crawl the calendar, in case * wasn't explicit enough for OpenAI.
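The post doesn't show the new rules, but an explicit opt-out along the lines OpenAI documents would look something like this (the GPTBot user-agent token is from OpenAI's page; the blanket rule is the one quoted above):

```
# Explicitly disallow OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# ...and everyone else, as before
User-agent: *
Disallow: /
```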
They are still at it:
74.7.227.55 - - [14/Nov/2025:23:23:42 +0100] "GET /-1000/03/ HTTP/1.1" 200 6031 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)"
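For what it's worth, Python's standard urllib.robotparser reaches the same conclusion a compliant crawler should: that URL is off limits. A small sketch, with example.org as a stand-in host:

```python
from urllib.robotparser import RobotFileParser

# The blanket rule the site has had since day 0.
ROBOTS_TXT = [
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT)

# The exact path GPTBot requested in the log line above.
allowed = parser.can_fetch("GPTBot/1.3", "https://example.org/-1000/03/")
print(allowed)  # False: a compliant crawler would skip this URL
```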
So they are basically lying to people's faces, when they say that robots.txt allows webmasters to manage how OpenAI crawls their websites.
(Yes, I realize they are trying to weasel-word their way around what robots.txt means, by writing that they interpret disallowing crawling as an indication that the content should not be used for training their models - but why crawl an infinite number of pages you won't be using!?!)
It is quite the shit show.
On the positive side, it seems I have unintentionally created an AI crawler tarpit.
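The tarpit effect falls out of how calendars work: every month page links to a previous and a next month, so a crawler that ignores robots.txt can walk backwards through the years forever. A sketch, using a hypothetical URL scheme modeled on the /-1000/03/ path in the log:

```python
def month_links(year: int, month: int) -> list[str]:
    """Return the previous/next month links a calendar page would carry.

    There is no first or last month, so following these links never
    terminates -- an accidental tarpit for a non-compliant crawler.
    """
    prev_y, prev_m = (year - 1, 12) if month == 1 else (year, month - 1)
    next_y, next_m = (year + 1, 1) if month == 12 else (year, month + 1)
    return [f"/{prev_y}/{prev_m:02d}/", f"/{next_y}/{next_m:02d}/"]

print(month_links(-1000, 3))  # ['/-1000/02/', '/-1000/04/']
```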
How to comment, in excruciating detail…
To avoid spam many websites make you fill out a CAPTCHA, or log in via an account at a corporation such as Facebook, Google or even Microsoft GitHub.
I have chosen to use a more old school method of spam prevention.
To post a comment here, you need to:
- Configure a newsreader¹ to connect to the server koldfront.dk on port 1119 using nntps (nntp over TLS)
- Open the newsgroup called lantern.koldfront and post a follow-up to the article.

¹ Such as Thunderbird, Pan, slrn, tin or Gnus (part of Emacs).