Google explains Googlebot's byte limits and crawling architecture

Gary Illyes of Google published a blog post explaining how Googlebot’s crawling systems work. The article covers byte limits, partial fetch behavior, and how Google’s crawl infrastructure is organized.

The post references episode 105 of the Search Off the Record podcast, where Illyes and Martin Splitt discussed the same topics. Illyes adds more details about the crawling architecture and byte-level behavior.

What’s new

Googlebot is one of the clients of a shared platform

Illyes describes Googlebot as “just a user of something that looks like a centralized crawling platform.”

Google Shopping, AdSense, and other products all send their crawl requests through the same system under different crawler names. Each client defines its own configuration, including user agent string, robots.txt tokens, and byte limits.
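To make the per-client setup concrete, here is a minimal Python sketch of how such a configuration could be modeled. It is not Google’s implementation: the `CrawlerClient` class, its field names, the `ExampleBot` client, and the reading of 1 MB as 1024 * 1024 bytes are all assumptions made for illustration. Only the 2 MB, 64 MB, and 15 MB figures come from the post.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024  # assumption: the post does not say whether MB means 10^6 or 2^20 bytes
PLATFORM_DEFAULT_LIMIT = 15 * MB  # default for clients that do not specify a limit


@dataclass(frozen=True)
class CrawlerClient:
    """Hypothetical per-client configuration on a shared crawling platform."""
    name: str
    user_agent: str
    robots_token: str
    byte_limit: Optional[int] = None      # None means "use the platform default"
    pdf_byte_limit: Optional[int] = None  # optional separate limit for PDFs

    def limit_for(self, content_type: str) -> int:
        """Return the byte limit this client applies to a given content type."""
        if content_type == "application/pdf" and self.pdf_byte_limit is not None:
            return self.pdf_byte_limit
        if self.byte_limit is not None:
            return self.byte_limit
        return PLATFORM_DEFAULT_LIMIT


# Googlebot (Google Search) with the limits described in the post.
googlebot = CrawlerClient(
    name="Google Search",
    user_agent="Googlebot",  # simplified; the real user agent string is longer
    robots_token="Googlebot",
    byte_limit=2 * MB,
    pdf_byte_limit=64 * MB,
)

# A made-up client that sets no limit and therefore gets the default.
example_client = CrawlerClient(
    name="Hypothetical client",
    user_agent="ExampleBot",
    robots_token="ExampleBot",
)

print(googlebot.limit_for("text/html") // MB)        # limit in MB for HTML
print(googlebot.limit_for("application/pdf") // MB)  # limit in MB for PDFs
print(example_client.limit_for("text/html") // MB)   # platform default in MB
```

The point of the sketch is that the limits belong to the client, not to the platform, which is why two Google crawlers can treat the same URL differently.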

When Googlebot appears in server logs, that is Google Search. Other clients appear under their own crawler names, which Google lists on its crawler documentation site.

How the 2MB limit works in practice

Googlebot fetches up to 2 MB for any URL, excluding PDFs. PDFs are limited to 64 MB. Crawlers that do not specify a limit default to 15 MB.

Illyes adds several details about what is happening at the byte level.

He says that HTTP request headers count toward the 2 MB limit. When a page exceeds 2 MB, Googlebot doesn’t reject it. The crawler stops at the cutoff and sends the truncated content to Google’s indexing systems and the Web Rendering Service (WRS).

These systems treat the truncated file as if it were complete. Anything past the 2 MB cutoff is never fetched, rendered, or indexed.
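The cutoff behavior can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Google’s code: the function name `truncate_at_limit`, the `header_bytes` parameter, and the way headers are subtracted from the budget are guesses at how the accounting might work, and 1 MB is taken to be 1024 * 1024 bytes.

```python
from typing import Tuple

MB = 1024 * 1024  # assumption: exact byte definition is not stated in the post


def truncate_at_limit(
    body: bytes, header_bytes: int = 0, limit: int = 2 * MB
) -> Tuple[bytes, bool]:
    """Return (bytes kept, whether anything was cut off).

    Headers are assumed to consume part of the same budget as the body,
    so the body only gets what is left after them.
    """
    budget = max(limit - header_bytes, 0)
    kept = body[:budget]  # the page is not rejected, only cut at the budget
    return kept, len(body) > budget


# A page of a bit over 3 MB with 500 bytes of headers.
page = b"<html>" + b"x" * (3 * MB)
kept, truncated = truncate_at_limit(page, header_bytes=500)
print(len(kept), truncated)

# A small page passes through untouched.
print(truncate_at_limit(b"<html></html>"))
```

Downstream systems in this model only ever see `kept`, which mirrors the post’s point that the truncated file is treated as if it were the whole document.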

Each external resource referenced in HTML, such as CSS and JavaScript files, is fetched with its own separate byte counter. These files do not count towards the 2 MB of the parent page. Media files, fonts, and what Google calls “some exotic files” are not fetched by WRS.

Rendering after the fetch

The WRS processes JavaScript and executes client-side code to understand the content and structure of a page. It retrieves JavaScript, CSS and XHR requests but does not request images or videos.

Illyes also notes that WRS operates statelessly, clearing local storage and session data between requests. Google’s JavaScript troubleshooting documentation covers the implications for JavaScript-dependent sites.

Best practices for staying under the limit

Google recommends moving heavy CSS and JavaScript to external files, as these have their own byte limits. Meta tags, title tags, link elements, canonical elements, and structured data should appear higher in the HTML. On large pages, content placed lower in the document may fall past the cutoff.

Illyes points to inline base64 images, large blocks of inline CSS or JavaScript, and oversized menus as examples of what could push pages beyond 2MB.
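One way to check a page against this advice is to measure where critical elements start in the raw HTML. The Python sketch below is a rough audit, not a tool Google provides: the `audit_html` function, the regular expressions, and the `header_bytes` parameter are assumptions for illustration, 1 MB is taken as 1024 * 1024 bytes, and regex matching is only an approximation of real HTML parsing.

```python
import re

MB = 1024 * 1024  # assumption: exact byte definition is not stated in the post

# Simplified patterns for elements the post says should appear early.
CRITICAL_PATTERNS = {
    "title": rb"<title[\s>]",
    "canonical": rb"<link[^>]*rel=[\"']?canonical",
    "meta description": rb"<meta[^>]*name=[\"']?description",
    "structured data": rb"<script[^>]*type=[\"']?application/ld\+json",
}


def audit_html(html: bytes, header_bytes: int = 0, limit: int = 2 * MB) -> dict:
    """Report whether each critical element starts before the byte cutoff.

    Only the start offset is checked, so an element that begins just
    before the cutoff may still be cut off partway through.
    """
    budget = limit - header_bytes
    report = {}
    for label, pattern in CRITICAL_PATTERNS.items():
        match = re.search(pattern, html, re.IGNORECASE)
        if match is None:
            report[label] = "missing"
        elif match.start() < budget:
            report[label] = "within limit"
        else:
            report[label] = "past cutoff"
    return report


# Example: a title near the top, structured data after 2 MB of filler.
html = (
    b"<html><head><title>Example</title></head><body>"
    + b"x" * (2 * MB)
    + b'<script type="application/ld+json">{}</script></body></html>'
)
for label, status in audit_html(html).items():
    print(f"{label}: {status}")
```

In practice the input would be the raw response bytes as served, since the limit applies to bytes rather than characters.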

The 2MB limit “is not set in stone and may change over time as the web evolves and HTML page sizes increase.”

Why it matters

The 2 MB limit and the 64 MB PDF limit were first documented as Googlebot-specific numbers in February. HTTP Archive data showed that most pages fall well below the threshold. This blog post adds the technical context behind these numbers.

The platform description explains why different Google crawlers behave differently in server logs and why the default of 15MB differs from Googlebot’s 2MB limit. These are separate settings for different clients.

HTTP header details are important for pages near the limit. Google states that headers consume part of the 2MB limit alongside HTML data. Most sites won’t be affected, but pages with large headers and markup may hit the limit sooner.

Looking to the future

Google has now covered Googlebot’s crawl limitations in documentation updates, a podcast episode, and a dedicated blog post in the span of two months. Illyes’ note that the limit may change over time suggests that these numbers are not permanent.

For sites with standard HTML pages, the 2MB limit is not a problem. Pages with heavy inline content, embedded data, or oversized navigation should ensure that their critical content is within the first 2 MB of the response.

Featured image: Sergey Elagin/Shutterstock
