Do we need a robots.txt disallow for AI crawlers?

David Bosch
4 replies
I'm a big fan of ChatGPT, Stable Diffusion, etc., but I must admit that all their training is based on crawling "public" text and images published on the Internet. But "public" here means content that its authors published in the hope of monetizing it somehow, usually through visits to their sites. Is it fair that all this content is used to train these AIs without even citing the authors? I don't think so. Are these crawlers honouring any robots.txt directives? Can we block AI crawlers?

Replies

Kira Leigh
Good and important question. LMK if someone has an answer, because it's critical to be able to opt out if you don't want your work included.
Richard Gao
@kiraieigh I don't think there's any way to opt out unless the people doing the scraping decide to respect it. Don't take this as advice, but legally, training an AI on copyrighted images seems to be considered "transformative" in terms of fair use. Keep in mind the AI is not "stitching things together". It's learning much like a human does and then generating its own content.
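To make the "unless they decide to respect it" part concrete, here is a rough sketch of what a well-behaved crawler would do before fetching a page, using Python's standard urllib.robotparser. The robots.txt content and the example.com URL are made up for illustration. GPTBot and CCBot are the user-agent tokens published by OpenAI and Common Crawl; whether any other crawler honours rules like these is entirely up to that crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only. It asks two named
# crawlers to stay away and leaves the site open to everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    This is the check a polite crawler performs on itself. Nothing on
    the server enforces the answer.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot", "Mozilla/5.0"):
        print(agent, is_allowed(agent, "https://example.com/gallery/"))
```

The check runs on the crawler's side, so a scraper that skips it (or sends a different User-Agent string) is not stopped by robots.txt at all.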
Richard Gao
Web scrapers and crawlers all operate the same way. An AI one will not be any different from a non-AI one (they are so simple that AI isn't even needed). If you set up anything that prevents web scraping in general (cookies, JavaScript, or a CAPTCHA to determine whether the user is human), then you can stop your content being scraped for AI stuff. However, that does not stop a human from accessing your website, downloading the images, and training an AI on them.
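As a minimal sketch of the cookie idea, assuming a Python WSGI site: the wrapper below refuses any request that does not carry a cookie that a small piece of JavaScript on a landing page would have set. The cookie name, its value, and the gallery app are all hypothetical, and a real setup would use a signed or expiring token rather than a fixed string.

```python
from http.cookies import SimpleCookie

COOKIE_NAME = "human_check"  # hypothetical cookie name
COOKIE_VALUE = "ok"          # hypothetical; a real site would use a signed token


def require_cookie(app):
    """Wrap a WSGI app and reject requests that lack the expected cookie."""

    def gated(environ, start_response):
        cookies = SimpleCookie(environ.get("HTTP_COOKIE", ""))
        morsel = cookies.get(COOKIE_NAME)
        if morsel is None or morsel.value != COOKIE_VALUE:
            # Simple scrapers that neither run JavaScript nor keep cookies end up here.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Enable JavaScript and cookies to view this page.\n"]
        return app(environ, start_response)

    return gated


def gallery(environ, start_response):
    """Stand-in for the real site content."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"gallery content\n"]


app = require_cookie(gallery)
```

This only filters out the simplest scrapers. Anything that drives a real browser, or simply copies the cookie, gets through, which is the same limitation as a human downloading the images by hand.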