The internet infrastructure company Cloudflare announced today that it will now default to blocking AI bots from visiting websites it hosts. Cloudflare will also give clients the ability to manually allow or ban these AI bots on a case-by-case basis, and it will introduce a so-called “pay-per-crawl” service that clients can use to receive compensation every time an AI bot wants to scoop up their website’s contents.
The bots in question are a type of web crawler: a program that systematically visits websites to digest and catalogue their content. In the past, web crawlers were most commonly associated with gathering data for search engines, but developers now use them to collect the data needed to build and run AI systems.
However, such systems don’t provide the same opportunities for monetization and credit as search engines historically have. AI models draw from a great deal of data on the web to generate their outputs, but these data sources are often not credited, limiting the creators’ ability to make money from their work. Search engines that feature AI-generated answers may include links to original sources, but they may also reduce people’s interest in clicking through to other sites and could even usher in a “zero-click” future.
“Traditionally, the unspoken agreement was that a search engine could index your content, then they would show the relevant links to a particular query and send you traffic back to your website,” Will Allen, Cloudflare’s head of AI privacy, control, and media products, wrote in an email to MIT Technology Review. “That is fundamentally changing.”
Generally, creators and publishers want to decide how their content is used, how it’s associated with them, and how they are paid for it. Cloudflare claims its clients can now allow or disallow crawling for each stage of the AI life cycle (in particular, training, fine-tuning, and inference) and whitelist specific verified crawlers. Clients can also set a rate that AI bots must pay to crawl their website.
In a press release from Cloudflare, media companies like the Associated Press and Time and forums like Quora and Stack Overflow voiced support for the move. “Community platforms that fuel LLMs should be compensated for their contributions so they can invest back in their communities,” Stack Overflow CEO Prashanth Chandrasekar said in the release.
Crawlers are supposed to obey a given website’s directions (provided through a robots.txt file) to determine whether they can crawl there, but some AI companies have been accused of ignoring these instructions.
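As a sketch of how this works, a site owner publishes a plain-text robots.txt file at the root of the domain listing rules per crawler. The directives below are standard Robots Exclusion Protocol syntax; GPTBot is OpenAI’s published crawler name, used here purely as an illustration of blocking one AI bot while leaving the rest of the site open:

```
# robots.txt — served at https://example.com/robots.txt
# Block one named AI crawler from the whole site
User-agent: GPTBot
Disallow: /

# All other crawlers may visit everything
User-agent: *
Allow: /
```

Compliance with these rules is voluntary, which is exactly the gap Cloudflare’s enforcement is meant to close.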
Cloudflare already has a bot verification system through which AI web crawlers can tell websites who they work for and what they want to do. For these verified crawlers, Cloudflare hopes its system can facilitate good-faith negotiations between AI companies and website owners. For the less honest crawlers, Cloudflare plans to draw on its experience stopping coordinated denial-of-service attacks from bots.
“A web crawler that is going across the internet looking for the latest content is just another type of bot—so all of our work to understand traffic and network patterns for the clearly malicious bots helps us understand what a crawler is doing,” wrote Allen.
Cloudflare had already developed other ways to deter unwanted crawlers, like allowing websites to send them down a path of AI-generated fake web pages to waste their efforts. While this approach will still apply for the truly bad actors, the company says it hopes its new services can foster better relationships between AI companies and content producers.
Some caution that a default ban on AI crawlers could interfere with noncommercial uses, like research. In addition to gathering data for AI systems and search engines, crawlers are also used by web archiving services, for example.
“Not all AI systems compete with all web publishers. Not all AI systems are commercial,” says Shayne Longpre, a PhD candidate at the MIT Media Lab who works on data provenance. “Personal use and open research shouldn’t be sacrificed here.”
For its part, Cloudflare aims to protect internet openness by helping enable web publishers to make more sustainable deals with AI companies. “By verifying a crawler and its intent, a website owner has more granular control, which means they can leave it more open for the real humans if they’d like,” wrote Allen.