If you’ve used ChatGPT Search or Perplexity, you know that being able to search the web and see citations inline greatly improves these AI chatbots. Results are better when they involve timely information, and web search may reduce so-called hallucinations (i.e. when a generative AI outputs incorrect information).
That’s why French startup Linkup is building an API that lets developers access web content from premium, trusted sources and hand the results to a large language model (LLM) to enrich its answers. Many AI developers call this workflow Retrieval-Augmented Generation (or RAG).
Meanwhile, the future of scraping bots is uncertain. Without a pre-existing financial agreement between content publishers and the entities scraping web pages, these bots lift content from the open web without paying. Many people aren’t happy about that arrangement, which is increasing regulatory scrutiny around AI training.
There are also high-profile legal cases in play, such as the ongoing lawsuit between the New York Times and OpenAI, the maker of ChatGPT, so the situation around web scraping could change in the near future. That is why OpenAI has signed multi-year content licensing deals with major publishers such as AP, Axel Springer, Condé Nast, El País, the Financial Times, Le Monde, and others.
“We set up the company around the time when OpenAI was making deals with news sources… for training or inference purposes, to augment the answers from OpenAI models and their products. And we thought: ‘OK, this is great because we finally have AI companies that pay their sources,’” Linkup co-founder and CEO Philippe Mizrahi told TechCrunch, laying out what propelled the founders to set up a business to connect AI devs with content providers for — hopefully — their mutual benefit.
Currently, content publishers are faced with difficult decisions over what to do about GenAI’s thirst for data. They can block web scrapers using robots.txt, a metadata file that is not legally binding and tells crawlers which parts of a website they may access, including crawlers that collect data to train AI models. They can sue AI companies that they believe have breached their copyright. Alternatively, they could let bots index their content freely (er, YOLO?). Or they may be able to license content to AI devs to get some recompense for their intellectual property.
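To make the robots.txt option concrete, here is a minimal sketch of how a well-behaved crawler might check a publisher's rules before fetching a page. It uses Python's standard `urllib.robotparser` module. The rules and the `example.com` URL are made up for illustration; `GPTBot` is the user agent OpenAI documents for its crawler, and `SomeOtherBot` is a hypothetical name. Keep in mind that nothing forces a bot to run a check like this, which is exactly why the file is not binding.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block OpenAI's GPTBot crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"  # illustrative URL, not a real publisher
    for bot in ("GPTBot", "SomeOtherBot"):
        print(bot, "allowed:", is_allowed(ROBOTS_TXT, bot, page))
```

A crawler that ignores the result can still fetch the page, so the file works only as a signal of the publisher's wishes.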
But there are thousands of tech companies using AI that don’t have the scale and reach of OpenAI. At the same time, what’s great about the web is that there’s a long tail of content publishers. That also means a small content publisher usually doesn’t have the financial resources to file a lawsuit, and that switching millions of websites from a scraping model to a licensing model will be difficult.