Meta unleashes new web crawling bots with sneaky ways of avoiding a rule that blocks scraping of online content

Meta CEO Mark Zuckerberg hugs a bald man with a tattoo on his back.

  • Generative AI tools are based on models that use huge amounts of content scraped from the web.
  • Meta has trained several versions of its Llama AI models, which rely heavily on internet content.
  • The company recently rolled out new bots that crawl the web and suck up data.

Meta recently unleashed new bots that crawl the web and suck up data for its AI models and related products.

These bots have features that make it harder for website owners to block their content from being scraped and collected.

The Meta-ExternalAgent bot is “for use cases such as training AI models or improving products by indexing content directly,” according to the company.

A second one, called Meta-ExternalFetcher, is related to the company’s AI assistant offerings and collects web links to support specific product functions.

These bots first appeared sometime in July, according to archived Meta web pages analyzed by Originality.ai, a startup that specializes in detecting AI-generated content.

Robots.txt under fire

Startups and tech giants are racing to build the most powerful AI models. A key ingredient is high-quality training data. One of the main ways to amass this is to send bots onto the web to crawl and scrape online content. Google, OpenAI, Anthropic, and several other AI companies have these bots.

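If content owners want to block such bots, they use an established convention called robots.txt: a plain-text file, placed at the root of a website, that tells automated crawlers which parts of the site they may not access. It has been in use since the mid-1990s and is widely accepted as one of the unofficial rules supporting the web, but compliance is voluntary and depends on crawlers choosing to honor it.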
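
To illustrate how a robots.txt rule targets a specific bot, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt contents, the `example.com` URL, the `SomeSearchBot` name, and the `is_allowed` helper are assumptions made for illustration; only the Meta-ExternalAgent name comes from the reporting. The parser shows what a compliant crawler would conclude and cannot enforce anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site: refuse Meta's AI crawler,
# allow every other crawler.
ROBOTS_TXT = """\
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/story.html"  # placeholder URL
    for agent in ("Meta-ExternalAgent", "SomeSearchBot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, url) else "blocked"
        print(agent, verdict)
```

Because rules are keyed to the user-agent name, a site can only accept or refuse a named bot as a whole; it cannot accept one of that bot's purposes and refuse another.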

The thirst for AI training data has undermined this system, though. In June, OpenAI and Anthropic were found to be either ignoring or circumventing robots.txt.

Meta’s bot bypass

Meta may also be trying to skirt the robots.txt rule in subtle ways.

The company warns that one of its new bots, Meta-ExternalFetcher, “may bypass robots.txt rules.”

Meanwhile, the Meta-ExternalAgent bot performs two functions, which is unusual. One is to collect AI training data, while the other is to index content.

Website owners may wish to block Meta from sucking up their data for AI model training, but they may want the tech giant to index their sites so more human users visit.

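Combining both functions in a single bot makes it harder to block, because refusing the bot to keep content out of AI training also removes the site from Meta's indexing. The numbers suggest as much: only 1.5% of the top websites block the new Meta-ExternalAgent bot, according to Originality.ai.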

By comparison, an earlier Meta crawler called FacebookBot, which has been scraping online data for years to train Meta’s large language models and AI speech recognition technology, is blocked by almost 10% of the top websites, including Twitter and Yahoo, according to Originality.ai.

The other new Meta bot, Meta-ExternalFetcher, is being blocked by less than 1% of the top websites, according to Originality.ai.
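
A block-rate figure of this kind could be estimated by parsing each site's robots.txt and counting how many refuse a given bot. The Python sketch below is one assumed approach, not Originality.ai's actual method, which the reporting does not describe. The three sample sites and their robots.txt contents are made up, and `blocks` and `block_share` are hypothetical helper names.

```python
from urllib.robotparser import RobotFileParser

# Bot names taken from the article.
META_BOTS = ("FacebookBot", "Meta-ExternalAgent", "Meta-ExternalFetcher")

# Made-up robots.txt files for three imaginary sites; not real data.
SAMPLE_SITES = {
    "site-a.example": "User-agent: FacebookBot\nDisallow: /\n",
    "site-b.example": (
        "User-agent: FacebookBot\nDisallow: /\n\n"
        "User-agent: Meta-ExternalAgent\nDisallow: /\n"
    ),
    "site-c.example": "User-agent: *\nAllow: /\n",
}


def blocks(robots_txt: str, user_agent: str) -> bool:
    """Treat a site as blocking a bot if the bot may not fetch the root path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


def block_share(sites: dict, user_agent: str) -> float:
    """Fraction of the given sites whose robots.txt blocks the bot."""
    blocked = sum(blocks(text, user_agent) for text in sites.values())
    return blocked / len(sites)


if __name__ == "__main__":
    for bot in META_BOTS:
        print(f"{bot}: {block_share(SAMPLE_SITES, bot):.0%} of sample sites")
```

Defining "blocked" as a refusal of the root path is a simplification; a site that disallows only some directories would count as not blocking under this sketch.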

“Companies should provide the ability for websites to block their sites’ data from being used for training while not reducing the visibility of the websites’ content in its products,” said Jon Gillham, CEO of Originality.ai.

Meta comments

A Meta spokesperson countered this by saying that the company is trying “to make it easier for publishers to indicate their preferences.”

“Like other companies, we train our generative AI models on content that is publicly available online,” the spokesperson also wrote in an email to Business Insider. “We recognize that some publishers and web domain owners want options when it comes to their websites and generative AI.”

Meta has multiple web crawling bots to avoid “bundling all use cases under a single agent, providing more flexibility for web publishers,” the spokesperson added.

Meta publishes information for website owners on how to block its bots.

Read the original article on Business Insider

https://www.businessinsider.com/meta-web-crawler-bots-robots-txt-ai-2024-8