Google explains why URLs blocked by Robots.txt can still be indexed


Google’s John Mueller answered a question about the curious circumstance in which Search Console reports thousands of URLs indexed despite being blocked by robots.txt. Mueller helped explain how this happens and how to fix it.

Content indexed despite being blocked by Robots.txt

A Reddit user asked for advice because Google Search Console was reporting over 51,000 pages as “Indexed, although blocked by robots.txt.” The affected URLs were primarily WooCommerce product URLs containing add-to-cart URL parameters such as “?add-to-cart=”.

Since the problem appeared suddenly, the site owner wondered if the rules in the robots.txt file themselves were responsible for creating the problem. They also wanted to know if removing the rules would help Google process canonical signals and eliminate URLs flagged by Search Console.

The person asked:

“I have a WooCommerce site and since last month we have been facing this problem: “Indexed, although blocked by robots.txt”

there are in total “Pages affected 51,000 pages”

at the end of the URL I mainly see ?page&post_type=product&product=slug&add-to-cart=98063,

After inspecting these URLs I found that they had an index tag setup and robots.txt had

* Disallow: /*?add-to-cart=
* Disallow: /*?*add-to-cart=

I removed these two rules from the robots.txt file and hope these pages will be fixed, as they have a canonical set to the product page. Will this fix the problem?

Or do I also need to configure noindex rules? Will this cost us our crawl budget? It’s a pretty big WooCommerce site. Let me know what you think if anyone has experience solving such a problem, and what the right method would be without hurting our SEO or losing functionality.”

Google says add-to-cart URLs don’t need to be indexed

Mueller responded that add-to-cart URLs do not need to be indexed and that blocking them via robots.txt is an acceptable approach.

He explained that even when Google reports these URLs as indexed, they are unlikely to appear in normal search results because they are blocked by robots.txt. According to Mueller, users typically don’t search for these URLs directly, making them poor candidates for search visibility.

John Mueller responded:

“You don’t need add-to-cart URLs to be indexed. Blocking them with robots.txt is fine. Even if they are “indexed” since they are blocked by robots.txt, they are unlikely to show up in search (unless you make specific queries for those URLs, which users don’t).”

I’m a little hesitant about Mueller’s statement that being blocked by robots.txt makes the URLs “unlikely” to be shown in search. Robots.txt does not prevent a URL from appearing in Google search. It only prevents Googlebot from crawling the page. So the statement is technically not entirely correct, and I’m a little surprised that Mueller would put it that way.
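To make the crawl-blocking part concrete, here is a minimal Python sketch of how the two patterns quoted in the question would match a URL, assuming the wildcard conventions Google documents for robots.txt ("*" matches any run of characters, a trailing "$" anchors the end). The helper names rule_to_regex and matching_rules are hypothetical, example.com is a placeholder domain, and the sketch ignores Allow rules, rule precedence, and percent-encoding.

```python
import re
from typing import List
from urllib.parse import urlsplit


def rule_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern into a regex.

    Assumes Google-style wildcards: '*' matches any run of characters
    and a trailing '$' anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def matching_rules(url: str, disallow_patterns: List[str]) -> List[str]:
    """Return the Disallow patterns that match the URL's path plus query."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return [p for p in disallow_patterns if rule_to_regex(p).match(target)]


# The two patterns from the question.
RULES = ["/*?add-to-cart=", "/*?*add-to-cart="]

# example.com is a placeholder; the query string is the one from the question.
cart_url = "https://example.com/?page&post_type=product&product=slug&add-to-cart=98063"
product_url = "https://example.com/product/slug/"

print(matching_rules(cart_url, RULES))     # ['/*?*add-to-cart=']
print(matching_rules(product_url, RULES))  # []
```

Under these assumptions, only the second pattern matches the URL from the question, because add-to-cart is not the first parameter after the question mark. A match means the URL is blocked from crawling. It says nothing about whether the URL can be indexed.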

Noindex is probably not a solution

One of the Redditors who answered the question suggested adding a robots noindex tag to the parameterized URLs. But this may not be a viable solution, because the page with the URL parameter and the page without it are essentially the same page, rendered from the same template. So unless WooCommerce can treat them differently and render the parameterized URL with a noindex and the normal page without one, it’s not a real solution.
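For illustration only, this is roughly the decision a shared template would have to make for the noindex approach to work. It is a Python sketch of the logic, not WooCommerce code: the function name robots_meta_content and the NOINDEX_PARAMS set are hypothetical, and a real WooCommerce site would need an equivalent hook in PHP.

```python
from urllib.parse import parse_qs, urlsplit

# Assumption: the query parameters that should never be indexed.
NOINDEX_PARAMS = {"add-to-cart"}


def robots_meta_content(url: str) -> str:
    """Return the robots meta tag content to render for the requested URL."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if NOINDEX_PARAMS.intersection(query):
        return "noindex"
    return "index, follow"


# example.com is a placeholder domain.
print(robots_meta_content("https://example.com/product/slug/"))
print(robots_meta_content(
    "https://example.com/?page&post_type=product&product=slug&add-to-cart=98063"
))
```

Keep in mind that Google can only see a noindex on a page it is allowed to crawl, so a tag like this would have no effect while the robots.txt block is in place.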

Why Google reports indexed URLs that it can’t crawl

Another Redditor offered a possible explanation for why so many URLs appear in Search Console. They suggested that Google likely discovered links containing add-to-cart parameters somewhere on the site and added those URLs to its systems.

My suggestion for the person who originally asked this question is to crawl the website with Screaming Frog, look at the internal links to identify where these URLs are linked from, and then take action, such as removing those links or adding a rel="nofollow" link attribute to them.

Probably the best solution is to use the robots.txt block to prevent crawling, provided it is understood that that’s all it does. If the person wants to be extra sure, they can also identify where these links exist and then add the nofollow link attribute as an extra layer, a clue to Google. Nofollow is not a directive, but it is a strong hint.
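A crawler like Screaming Frog does this at site scale, but the check itself is simple. Here is a minimal Python sketch, using only the standard library, that scans one page's HTML for links carrying an add-to-cart parameter and reports whether each already has rel="nofollow". The class name CartLinkFinder, the helper find_cart_links, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit


class CartLinkFinder(HTMLParser):
    """Collect <a> links whose href carries an add-to-cart parameter."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (href, has_nofollow) tuples

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href") or ""
        query = parse_qs(urlsplit(href).query, keep_blank_values=True)
        if "add-to-cart" in query:
            rel_values = (attributes.get("rel") or "").lower().split()
            self.found.append((href, "nofollow" in rel_values))


def find_cart_links(html: str):
    """Return (href, has_nofollow) for every add-to-cart link in the HTML."""
    finder = CartLinkFinder()
    finder.feed(html)
    return finder.found


# Made-up markup standing in for a product listing page.
sample_html = """
<a href="/product/slug/">Product page</a>
<a href="/?add-to-cart=98063">Add to cart</a>
<a href="/?add-to-cart=98064" rel="nofollow noopener">Add to cart</a>
"""

for href, has_nofollow in find_cart_links(sample_html):
    print(href, "nofollow" if has_nofollow else "followed")
```

Running something like this over the templates that output add-to-cart buttons would show which links still lack the nofollow hint.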

Search Console warnings don’t always indicate a search problem

One of the recurring challenges with Search Console reports is that they can reveal technical conditions that seem distressing but actually have little or no effect on search performance. For example, 404 error reporting is useful for a variety of reasons, but often a 404 response from the server is the correct response, and it’s not really an “error” that needs to be fixed.

Takeaway

Mueller’s response reinforces the idea that not every Search Console warning requires a fix, although in this specific case there may be something to fix in the form of internal links to URLs that use cart parameters. If those links are absolutely necessary, then adding a rel="nofollow" link attribute will give Google a strong hint not to follow them. The joy of technical SEO!

Featured image by Shutterstock/Orange Line Media


