US publishers demand Common Crawl stop scraping their content

Digital Content Next (DCN), a trade organization representing American digital publishers, sent a cease-and-desist letter to the Common Crawl Foundation.

The letter asks Common Crawl to stop collecting content from publishers and remove material already in its datasets.

DCN CEO Jason Kint announced the legal notice in a blog post, and Press Gazette reported additional details of the letter this week.

Common Crawl has crawled several billion new pages every month since 2007 to create a free public archive. These archives have been used to train many of the AI models in use today. OpenAI's GPT-3 paper listed filtered Common Crawl data as 60% of the model's training mix.

The dispute is important for any site that blocks AI crawlers. Blocking Common Crawl’s crawler, CCBot, stops future collection but does not affect content already in the archive, which anyone can still download.
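For site owners who want to confirm what their own robots.txt says about CCBot, the check can be sketched with Python's standard `urllib.robotparser` module. This is a minimal illustration, not a description of how Common Crawl evaluates robots.txt: the helper name `blocks_agent`, the sample rules, and the `example.com` URL are all assumptions made for the example. As noted above, a passing check covers future crawls only and says nothing about content already in the archive.

```python
from urllib.robotparser import RobotFileParser


def blocks_agent(robots_txt: str, agent: str = "CCBot",
                 url: str = "https://example.com/") -> bool:
    """Return True if the given robots.txt text disallows `agent` for `url`.

    Hypothetical helper for illustration; the URL is a placeholder.
    """
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


# Sample rules (an assumption for this sketch): block CCBot, allow everyone else.
SAMPLE_ROBOTS = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    print("CCBot blocked:", blocks_agent(SAMPLE_ROBOTS, "CCBot"))
    print("Googlebot blocked:", blocks_agent(SAMPLE_ROBOTS, "Googlebot"))
```

To test a live site, the text of its robots.txt file would be passed in place of `SAMPLE_ROBOTS`.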

What DCN demands

The letter calls on Common Crawl to stop “scraping, curating, or sharing copyrighted, paid, subscriber-only, or otherwise protected content from DCN member companies in its datasets” and to remove member content it has already collected.

DCN claims that Common Crawl “blatantly infringed” on copyrighted content by creating its datasets and sharing them with AI companies.

The letter asserts that “copyright law is not an opt-out regime.” In other words, DCN’s position is that publishers should not have to ask to be excluded. Common Crawl should need permission to include them.

Kint wrote that the notice:

“challenges the increasingly common assumption that content created through substantial investment can be collected, stored, reused and monetized simply because it is technically accessible.”

Why DCN doubts the removal process

DCN’s letter questions whether Common Crawl follows opt-out instructions and removes content when requested. According to Press Gazette, DCN’s lawyers are examining whether Common Crawl’s statements to publishers “could have been inaccurate or misleading.”

Common Crawl publishes a public registry of websites that have requested removal. It includes entries for the Associated Press, the BBC and a large News/Media Alliance submission covering hundreds of domains. Press Gazette reports that the list also includes other major publishers.

This is not the first time the removal process has been called into question. The Atlantic reported in November that content from The New York Times and Danish publishers was still available after Common Crawl agreed to remove it.

Response from Common Crawl

Common Crawl Executive Director Rich Skrenta declined to comment on the letter when contacted by Press Gazette.

He has previously pushed back against similar claims. In a November blog post responding to The Atlantic, Skrenta denied that the organization lied to publishers or removed paid material.

He stated that the archive's file format cannot be changed after publication without breaking its integrity. Instead, Common Crawl says it removes or filters affected URLs from future crawls and makes them inaccessible through its public tools and indexes:

“When a publisher asks us to remove previously crawled material, we respond quickly and initiate a removal process that reflects the technical design of our dataset.”

He added:

“No one at Common Crawl has ever claimed that this work is instantaneous or complete; rather, we have been open about its complexity and ongoing nature.”

In a forum post this week, Skrenta said Common Crawl is contributing to work on open standards for how websites express AI scraping preferences.

Why it matters

The DCN letter targets stored archives, not just future crawling, and argues that the burden of opting out should not fall on publishers in the first place.

Most publishers in a BuzzStream sample have already made the blocking decision, with 79% of the 100 news sites checked blocking at least one training bot. Cloudflare's Year in Review, which we covered in January, found CCBot among the bots most often fully disallowed in robots.txt directives across top domains. The question DCN raises is what these blocks accomplish if years of content remain available for training anyway.

Looking ahead

Whether DCN escalates depends on how Common Crawl responds, and Common Crawl has not said how it will respond. The two sides want different rules about who must act first.

Skrenta supports standardization work that would allow sites to indicate their scraping preferences, which keeps opt-out as the model. The UK CMA followed a similar path when it required Google to allow publishers to opt out of AI search features.

DCN argues that scrapers should need authorization first. If more trade groups adopt this argument, the pressure will shift from individual robots.txt files to the archives themselves.

Featured Image: Andre Boukreev/Shutterstock
