Google may expand list of unsupported robots.txt rules


Google may expand the list of unsupported robots.txt rules in its documentation based on analysis of real-world robots.txt data collected via HTTP Archive.

Gary Illyes and Martin Splitt described the project in a recent episode of Search Off the Record. The work began after a community member submitted a pull request to Google’s robots.txt repository proposing that two new rules be added to the unsupported list.

Illyes explained why the team expanded the scope beyond the two rules proposed in the pull request:

“We tried not to do things arbitrarily, but rather to collect data.”

Rather than adding just the two proposed rules, the team decided to look at the 10 or 15 most used unsupported rules. Illyes said the goal was “a decent starting point, a decent baseline” for documenting the most common unsupported rules in the wild.

How the analysis worked

The team used HTTP Archive to study the rules that websites use in their robots.txt files. HTTP Archive runs monthly scans on millions of URLs using WebPageTest and stores the results in Google BigQuery.

The first attempt hit a wall. The team “quickly realized that no one was actually asking for robots.txt files” during the default crawl, meaning that HTTP Archive datasets didn’t include robots.txt content.

After consulting with Barry Pollard and the HTTP Archive community, the team wrote a custom JavaScript parser that extracts robots.txt rules line by line. The custom metric was merged before the February crawl, and the resulting data is now available in the custom_metrics dataset in BigQuery.
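The episode does not reproduce the metric’s code, but a minimal sketch of the line-by-line extraction it describes, written in plain JavaScript, might look like the following. The function name, regex, and sample data are assumptions for illustration, not HTTP Archive’s actual implementation.

```javascript
// Illustrative sketch only; this is not the actual HTTP Archive custom metric.
// It walks a robots.txt file line by line and keeps anything that looks like a
// "field: value" rule, skipping comments and junk such as returned HTML.
function extractRobotsRules(robotsTxt) {
  const rules = [];
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop comments and whitespace
    if (!line) continue;
    const match = line.match(/^([A-Za-z][A-Za-z0-9_-]*)\s*:\s*(.*)$/);
    if (!match) continue; // e.g. HTML returned by a misconfigured server
    rules.push({ field: match[1].toLowerCase(), value: match[2].trim() });
  }
  return rules;
}

// Example: tally how often each field appears in one file.
const sample = [
  'User-agent: *',
  'Disallow: /private/',
  'Crawl-delay: 10',               // commonly seen, but not supported by Google
  '<html>not a robots.txt</html>', // junk line, silently skipped
].join('\n');

const counts = {};
for (const { field } of extractRobotsRules(sample)) {
  counts[field] = (counts[field] || 0) + 1;
}
console.log(counts); // { 'user-agent': 1, disallow: 1, 'crawl-delay': 1 }
```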

What the data shows

The parser extracts every line that matches a field-colon-value pattern. Illyes described the resulting distribution:

“After allow, disallow and user-agent, the drop is extremely drastic.”

Beyond these three fields, rule usage falls into a long tail of less common directives, as well as unwanted data from faulty files that return HTML instead of plain text.

Google currently supports four fields in robots.txt: user-agent, allow, disallow, and sitemap. The documentation states that other fields are “not supported” without listing the most common unsupported fields found in the wild.
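For reference, a minimal robots.txt that sticks to the four supported fields, plus crawl-delay as an example of a commonly seen field Google ignores, might look like this (the paths and sitemap URL are placeholders):

```
User-agent: *
Allow: /public/
Disallow: /private/
Sitemap: https://www.example.com/sitemap.xml

# Commonly seen in the wild, but ignored by Google
Crawl-delay: 5
```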

Google has previously clarified that unsupported fields are ignored. The current project extends that work by identifying the specific rules Google plans to document.

The 10 to 15 most used rules beyond the four supported fields would be added to Google’s documented list of unsupported rules. Illyes did not name the specific rules that would be included.

Typo tolerance may expand

Illyes said the analysis also revealed common misspellings of the disallow rule:

“I’ll probably increase the typos we accept.”

His wording implies that the parser already accepts some misspellings. Illyes did not commit to a timeline or name the specific typos that would be added.

Why it matters

Search Console already flags unrecognized robots.txt rules. If Google documents more unsupported directives, its public documentation could more accurately reflect the unrecognized rules people already see in Search Console.

Looking ahead

The planned update would affect Google’s public documentation and how disallow typos are handled. Anyone managing a robots.txt file with rules beyond user-agent, allow, disallow, and sitemap should watch for the updated documentation to see which rules have never worked for Google.

The HTTP Archive data is publicly queryable in BigQuery for anyone who wants to look at the distribution directly.


Featured image: Screenshot from YouTube.com/GoogleSearchCentral, April 2026.


