A recent update to Baidu Baike’s robots.txt file – the file that tells search engine crawlers which URLs (web addresses) on a site they may access – has completely blocked Googlebot and Bingbot crawlers from indexing content on the Chinese platform.
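Such a block typically takes the form of per-crawler “Disallow” rules in the robots.txt file. The sketch below, which uses Python’s standard urllib.robotparser module, shows how a rule set of this kind is interpreted; the file contents and the sample URL are illustrative assumptions, not the actual contents of Baidu Baike’s file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks Google's and Bing's crawlers entirely
# while leaving other user agents unrestricted (Baidu Baike's real file may differ).
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: Bingbot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check whether each crawler may fetch an arbitrary entry URL (sample URL is made up).
for agent in ("Googlebot", "Bingbot", "SomeOtherBot"):
    allowed = parser.can_fetch(agent, "https://baike.baidu.com/item/example")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Run as written, the script reports Googlebot and Bingbot as blocked from the sample page while other crawlers remain allowed, mirroring the behaviour described above.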
The update appears to have been made sometime on August 8, according to records from the internet archive service Wayback Machine. Those records also show that earlier the same day, Baidu Baike still allowed Google and Bing to search and index its online repository of nearly 30 million entries, with only a portion of the site marked as off limits.
The change follows a similar move by US social media platform and forum Reddit, which in July blocked several search engines, except Google, from indexing its online posts and discussions. Google has struck a multimillion-dollar deal with Reddit that gives it the right to mine the platform for data to train its AI services.
By comparison, the Chinese-language version of the online encyclopedia Wikipedia currently has 1.43 million entries accessible to search engine crawlers.
After Baidu updated Baike’s robots.txt file, The Post’s survey of Google and Bing on Friday found that many entries from the Wikipedia-like service still appeared in the US search engines’ results, likely drawn from older cached content.
Representatives of Baidu, Google and Microsoft did not immediately respond to requests for comment on Friday.
Generative AI (GenAI) refers to algorithms and services, such as ChatGPT, that are used to create new content, including audio, code, images, text, simulations and videos.
In June, for example, OpenAI signed a contract with the American news magazine Time, giving the AI firm access to archived content spanning the publication’s more than 100-year history.