Google Exposes Fundamental Flaw in LLMs.txt File

Google’s John Mueller and Martin Splitt spoke about LLMs.txt and markdown, with Mueller offering a surprising fact about the original purpose of LLMs.txt and also explaining why the proposed standards have serious flaws.

What is discovery and why it matters

In the context of information retrieval (search), discovery is the step in which a search engine learns that a specific web page exists. It is the first part of the search engine's overall architecture.

Search engine architecture:

Discovery
Discover the URL (add it to the crawl queue).
Crawling
Downloading and analyzing content.
Indexing
The process of analyzing raw data and storing it in a structured database optimized for retrieval.
Ranking
The part that interests everyone.
Serving
This is the final step of serving the ranked web pages in search results.

The above is a simplified overview of how search works. Discovery is the very first part of the process, which ultimately ends with ranking and serving links to websites.

The takeaway here is that discovery is an essential part of getting a web page queued to be crawled, indexed, ranked, and ultimately displayed in search results. Without discovery, a web page is invisible.
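To make the dependency on discovery concrete, here is a minimal toy sketch in Python. It is not how any real search engine is built: the `ToySearchEngine` class, its methods, and the example URLs are all hypothetical and exist only to show that a page which is never discovered cannot be crawled, indexed, ranked, or served, no matter how relevant its content is.

```python
from dataclasses import dataclass, field


@dataclass
class ToySearchEngine:
    """Hypothetical toy model; real search engines are far more complex."""
    known_urls: set = field(default_factory=set)  # filled by discovery
    index: dict = field(default_factory=dict)     # url -> page text

    def discover(self, url: str) -> None:
        # Discovery: learn that the URL exists and queue it for crawling.
        self.known_urls.add(url)

    def crawl_and_index(self, web: dict) -> None:
        # Crawling and indexing only ever touch URLs that were discovered.
        for url in self.known_urls:
            if url in web:
                self.index[url] = web[url]

    def serve(self, query: str) -> list:
        # Ranking and serving: naive ranking by how often the query appears.
        hits = [(text.lower().count(query.lower()), url)
                for url, text in self.index.items()]
        return [url for score, url in sorted(hits, reverse=True) if score > 0]


# A made-up two-page "web" for illustration.
web = {
    "https://example.com/photos": "Buy a photo. Every photo is printed to order.",
    "https://example.com/hidden": "Photo photo photo: the best photo page.",
}

engine = ToySearchEngine()
engine.discover("https://example.com/photos")  # /hidden is never discovered
engine.crawl_and_index(web)
results = engine.serve("photo")
print(results)
```

In this sketch the `/hidden` page mentions the query more often, yet it never appears in the results because it was never discovered.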

Now here’s why it’s important: Discovery is not part of the proposed LLMs.txt standard.

Original intent of the LLMs.txt file

John Mueller said he met with one of the people responsible for creating the LLMs.txt proposal. According to Mueller, the creator explained that LLMs.txt was never intended to make a site discoverable and was never meant to be part of that process.

This is an important point because many site owners spend time, money, and effort generating LLMs.txt files with the goal of being discovered and ranked in LLMs. This means that the reason people use LLMs.txt conflicts with its actual purpose, which has nothing to do with discovery.

Mueller explained:

“So I spoke with, I think, one of the people who created this proposal a while ago. And the idea was really not to create something that would make it easier for search engines or LLM systems to discover all your content, but almost more so that if an LLM already knows your site and wants to discover what else is here, then that might be one approach.

And I think using that as a way to optimize discovery by AI systems or discovery by search systems makes no sense.”

Mueller then noted that many people use LLMs.txt in hopes of making the discovery process easier, even though that is not its purpose.

He then highlighted that LLMs.txt files are not inherently trustworthy because they consist of a site owner’s own description of the site’s content, which may or may not match what’s in the actual HTML.

He continued:

“Because you’re basically telling these systems that I have the best website ever. And here are all the pages that everyone needs to go to. And you need to buy all of my products or whatever you put on there.

So in an LLM system, it…basically, by design, can’t trust what’s here as a way to differentiate between different websites.”
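The trust problem can be illustrated with a short Python sketch that compares what an LLMs.txt file claims about each page with what the page's HTML actually says. This is only an illustration under stated assumptions: it assumes the file uses Markdown link lines of the form `- [Title](url): note`, the sample file and pages are invented, the helper names (`claimed_titles`, `mismatches`) are hypothetical, and comparing titles exactly is a deliberate simplification. It does not describe how any LLM system actually evaluates trust.

```python
import re
from html.parser import HTMLParser

# Matches link lines such as "- [Title](url): optional note" (assumed format).
LINK_LINE = re.compile(r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)")


class TitleParser(HTMLParser):
    """Collects the text inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def claimed_titles(llms_txt: str) -> dict:
    """Return {url: claimed title} for each link line in the file."""
    claims = {}
    for line in llms_txt.splitlines():
        match = LINK_LINE.match(line.strip())
        if match:
            claims[match["url"]] = match["title"]
    return claims


def mismatches(llms_txt: str, pages: dict) -> list:
    """List URLs whose claimed title differs from the HTML <title>."""
    flagged = []
    for url, claimed in claimed_titles(llms_txt).items():
        parser = TitleParser()
        parser.feed(pages.get(url, ""))
        if parser.title.strip().lower() != claimed.strip().lower():
            flagged.append(url)
    return flagged


# Invented sample data: the file oversells one page.
llms_txt = """# Example Shop
> The best website ever.

## Products
- [Award-Winning Photo Prints](https://example.com/prints): buy them all
- [Contact](https://example.com/contact)
"""

pages = {
    "https://example.com/prints": "<html><head><title>Photo Prints</title></head></html>",
    "https://example.com/contact": "<html><head><title>Contact</title></head></html>",
}

flagged = mismatches(llms_txt, pages)
print(flagged)
```

A system would have to fetch and read the HTML to run even this simple check, which is why the self-description on its own cannot be used to tell websites apart.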

Agent Instructions

Mueller then said that some of these proposed standards could be useful for helping an AI agent, apparently referring to the Web Model Context Protocol (WebMCP).

He explained:

“If someone is already on your website, maybe some sort of automated system is helpful. If it’s like, I want to go to Martin Splitt’s website and buy a photo, then the LLM system can go to your website and look around, like, how do I buy a photo? Maybe it has some guidelines for me as an agent for buying photographs. That makes sense.

But by saying: I want to buy a photograph, which website has one, the system is not going to go to your website and five others and ask: who has automated information? But rather, they try, will try to find the best website…”

LLMs.txt is not about being discovered by AI

Mueller returned to how people misinterpret LLMs.txt as a way to be discovered by AI systems.

He elaborated on this point:

“I think from that perspective, optimization as a way to get discovered doesn’t make sense.

But what happens when an agent is on your website? I think this generally seems to be an open area of discussion at the moment, in that there is LLMs.txt as a proposal. There are different JSON files and well-known file types that are under discussion.

There’s WebMCP, which I think tries to do something similar, where they say, well, you’re on this page now, but we have an API for it, a specific URL added or a specific mechanism.

I think these are almost different discussions then.”

Discovery and ranking are still tied to HTML

Mueller concluded by emphasizing that discovery happens at the HTML level.

He explained:

“So the generic SEO angle of how do I find a website that sells me a photo will be almost entirely related to HTML pages and normal web pages.

And then, if a user decides to go to a specific service, within that service, then there’s a little more room to maybe help an agent or an LLM system find the right approach.

But what’s interesting, of course, is the multitude of ideas. And none of these things have fundamentally crystallized as being the one thing that everyone will use. So I’m sure over the next few months, I don’t know, six months, a year or maybe longer, it’s going to take a little bit. And some of these agent systems will sort of unify around a standard file type or mechanism or something like that.”

Mueller wasn’t pushing the WebMCP standard, but if AI agents become a way for users to interact with websites, it will be something like WebMCP and not LLMs.txt that will be useful for websites, especially e-commerce sites.

WebMCP is naturally better suited to e-commerce, as it aims to give AI agents actionable capabilities, such as how to filter products, search and identify products, help compare different products, and help the AI add a product to a cart.

AI agents can already navigate websites using HTML designed for humans. WebMCP aims to make it easier for AI agents to interact with a website, which LLMs.txt does not do.
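The general idea of giving an agent actionable capabilities can be sketched in a few lines of Python. To be clear, this is not the WebMCP API and is not based on its specification. Every name here (`CATALOG`, `TOOLS`, `call_tool`, `search_products`, `add_to_cart`) is hypothetical, and the products are invented. The sketch only shows the concept: a site declares named actions, and an agent calls them directly instead of working out the HTML interface.

```python
# Hypothetical in-memory catalog; names and prices are made up.
CATALOG = [
    {"id": "p1", "name": "Sunset photo", "price": 40},
    {"id": "p2", "name": "Mountain photo", "price": 25},
    {"id": "p3", "name": "Picture frame", "price": 15},
]
cart = []


def search_products(query: str, max_price: float = float("inf")) -> list:
    """Search and filter products by name and maximum price."""
    return [p for p in CATALOG
            if query.lower() in p["name"].lower() and p["price"] <= max_price]


def add_to_cart(product_id: str) -> bool:
    """Add a product to the cart; return False if the id is unknown."""
    for product in CATALOG:
        if product["id"] == product_id:
            cart.append(product)
            return True
    return False


# The site declares its actions by name so an agent need not scrape HTML.
TOOLS = {
    "search_products": search_products,
    "add_to_cart": add_to_cart,
}


def call_tool(name: str, **kwargs):
    """Entry point an agent would use to invoke a declared action."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)


# An agent that is already on the site searches, compares, and buys.
found = call_tool("search_products", query="photo", max_price=50)
cheapest = min(found, key=lambda p: p["price"])  # compare products by price
call_tool("add_to_cart", product_id=cheapest["id"])
print([p["name"] for p in cart])
```

Note that nothing in this sketch helps the agent find the site in the first place. The declared actions only matter once the agent has already arrived.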

Neither LLMs.txt nor WebMCP helps a website get discovered by AI, and neither was created for that purpose. Discovery, the first stage on the path to ranking, takes place in HTML.

Listen to Google’s Search Off the Record, episode 111

Featured image by Shutterstock/Master1305
