OpenAI on Monday introduced GPTBot, a web crawler designed to collect publicly available data from the internet to train artificial intelligence (AI) models.
With GPTBot, website administrators gain control over whether their content is collected, a measure intended to improve data privacy and accuracy in OpenAI's AI models.
Allowing the web crawler to access a site means the site's content contributes to the data pool used to improve the AI ecosystem, but website owners can now opt out of data collection. To block GPTBot from accessing a site, the owner can add a GPTBot exclusion to the site's robots.txt file.
The text to block GPTBot from accessing a site's content is: User-agent: GPTBot, followed on the next line by Disallow: /
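As a rough illustration of how such an exclusion behaves, the sketch below feeds a two-line rule to Python's standard-library robots.txt parser and checks which crawlers it blocks. The `is_allowed` helper, the `example.com` URL, and the `SomeOtherBot` name are hypothetical and exist only for this example; the `Disallow: /` line is the standard robots.txt directive for excluding an entire site.

```python
from urllib.robotparser import RobotFileParser

# Robots.txt content that excludes GPTBot from the whole site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URL and second crawler name, used only for illustration.
gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/page")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/page")
```

Under these assumptions, the rule applies only to GPTBot: crawlers with other user-agent names are unaffected because the file contains no rule for them.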
The move addresses recent controversies surrounding the practice of scraping websites without consent to power large language models (LLMs) like GPT-4.
OpenAI also published guidance on using the GPTBot token in robots.txt to allow access to only parts of a website.
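Partial access of this kind can be sketched with ordinary robots.txt Allow and Disallow lines under the GPTBot user agent. The example below is a minimal sketch, not OpenAI's own tooling: the `build_gptbot_rules` helper and the `/blog/` and `/members/` paths are hypothetical, and the rules are checked with Python's standard-library parser.

```python
from urllib.robotparser import RobotFileParser


def build_gptbot_rules(allowed: list[str], disallowed: list[str]) -> str:
    """Build robots.txt text granting GPTBot access to some paths only."""
    lines = ["User-agent: GPTBot"]
    lines += [f"Allow: {path}" for path in allowed]
    lines += [f"Disallow: {path}" for path in disallowed]
    return "\n".join(lines) + "\n"


# Hypothetical site layout: a public blog and a members-only area.
rules = build_gptbot_rules(allowed=["/blog/"], disallowed=["/members/"])

parser = RobotFileParser()
parser.parse(rules.splitlines())

blog_ok = parser.can_fetch("GPTBot", "https://example.com/blog/post-1")
members_ok = parser.can_fetch("GPTBot", "https://example.com/members/profile")
```

In this sketch, paths not covered by any rule remain accessible by default, so a site owner who wants a stricter policy would need to list every directory to exclude.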
OpenAI offers its models in two ways: through first-party consumer applications such as the ChatGPT app, and through an API platform for developers and businesses that includes models such as GPT-4 and GPT-3.5 Turbo, along with embeddings, fine-tuning, and more. The API enables organizations to incorporate OpenAI models directly into their products, applications, and services.