Baidu blocks Google and Bing from scraping content as demand for AI training data grows

Chinese internet search giant Baidu appears to have started blocking Alphabet’s Google and Microsoft’s Bing search engines from scraping content from the mainland Chinese company’s Wikipedia-like service, Baidu Baike, according to a survey by the Post.

A recent update to Baidu Baike’s robots.txt file – a file that tells search engine crawlers which Uniform Resource Locators (commonly known as web addresses) can be accessed from a site – has completely blocked the ability of Googlebot and Bingbot crawlers to index content from the Chinese platform.
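As a rough illustration of how such rules work, the sketch below uses Python's standard `urllib.robotparser` module to evaluate two robots.txt files. Both files, the host `encyclopedia.example.com` and the helper `is_allowed` are hypothetical and written only for this example; they are not copies of Baidu Baike's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
# These are NOT the rules Baidu Baike actually publishes.
PARTIAL_BLOCK = """\
User-agent: Googlebot
Disallow: /private/

User-agent: Bingbot
Disallow: /private/
"""

FULL_BLOCK = """\
User-agent: Googlebot
Disallow: /

User-agent: Bingbot
Disallow: /

User-agent: *
Allow: /
"""

# Hypothetical entry URL on a made-up host.
ENTRY_URL = "https://encyclopedia.example.com/item/sample-entry"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for label, rules in (("partial block", PARTIAL_BLOCK), ("full block", FULL_BLOCK)):
    for agent in ("Googlebot", "Bingbot", "ExampleBot"):
        print(label, agent, is_allowed(rules, agent, ENTRY_URL))
```

Under the first file, the named crawlers are kept out of one directory only; under the second, a single `Disallow: /` line shuts them out of every path while other crawlers remain unaffected. Compliance with robots.txt is voluntary, so the file signals a site's wishes rather than enforcing them.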

The update appears to have occurred sometime on August 8, according to records from the internet archive service Wayback Machine. Those records also showed that earlier the same day, Baidu Baike had still allowed Google and Bing to crawl and index its online repository of nearly 30 million entries, with only a portion of its site marked as off-limits.

The move shows Beijing-based Baidu’s increased efforts to protect its online resources as demand grows for the vast amounts of data used to train and build artificial intelligence (AI) models and applications.

It followed a move by US social media platform and forum Reddit in July, when it blocked several search engines, except Google, from indexing its online posts and discussions. Google has struck a multimillion-dollar deal with Reddit that gives it the right to mine the social media platform for data to train its AI services.

Since OpenAI released ChatGPT on November 30, 2022, major search platforms Google and Microsoft have been trying to get more data for their own generative artificial intelligence systems. Photo: Shutterstock

By comparison, the Chinese-language version of the online encyclopedia Wikipedia currently has 1.43 million entries, which are accessible to search engine crawlers.

After Baidu updated Baike’s robots.txt file, The Post’s survey of Google and Bing on Friday found that many entries from the Wikipedia-like service were still appearing in the U.S. search engines’ results – likely from older cached content.

Representatives of Baidu, Google and Microsoft did not immediately respond to requests for comment on Friday.

Nearly two years after OpenAI’s groundbreaking launch of ChatGPT, many major AI developers around the world are signing deals with content providers to gain access to high-quality content for their generative AI (GenAI) projects.

GenAI refers to the algorithms and services like ChatGPT that are used to create new content, including audio, code, images, text, simulations and videos.

In June, for example, OpenAI signed a contract with the American news magazine Time, giving it access to all archived content from the publication’s more than 100-year history.

By Bronte
