LLMs “wouldn’t exist” without Reddit data

Reddit CEO Steve Huffman said large language models “would not exist as we know them” without Reddit content. He called the platform’s user-generated data the “modern oil” for AI.

Huffman made the comments during an interview at Fast Company’s Most Innovative Companies Summit.

What Huffman said about Reddit’s value to AI

Huffman described the position Reddit data holds in the AI ecosystem.

Huffman said:

“LLMs would not exist as we know them without Reddit. Reddit is one of the largest sources of training data for LLMs and Reddit continues to be one of the largest sources of training data and we are also the most cited platform, the most cited in all models.”

He attributed the citation claim to Profound, a company that tracks AI citation data.

Huffman explained why AI companies depend on this content.

“There is no artificial intelligence without real intelligence. Ultimately, these models are quite simple. They regurgitate on an absolutely massive scale what they have consumed elsewhere and a lot of that consumption is actually just human conversation on Reddit because it’s natural and it covers virtually every topic imaginable.”

Deals for some, lawsuits for others

Reddit announced data licensing agreements with Google and OpenAI in 2024. Huffman called them Reddit’s first two AI data deals and announced no additional ones.

“Since we did the first two deals with Google and OpenAI, which was over two years ago, so we’ve learned a lot. They’ve learned a lot. The whole world has learned a lot. In particular, how valuable and useful Reddit data is. And so we’re, I think, very deliberate and selective. But yeah, we’re open and open for business.”

For companies that did not agree to licensing terms, Reddit took legal action. The company sued Anthropic in California Superior Court, alleging unauthorized use of Reddit content and violations of Reddit’s terms. Reddit also filed a federal lawsuit against Perplexity in the Southern District of New York, alongside three data scraping companies, alleging anti-circumvention violations of the DMCA and related claims.

Huffman drew a line between the two groups.

“With companies like Google and OpenAI, with whom we had good relationships, we can make an agreement and put safeguards in place on the use and access of our data on behalf of our users, but then collaborate on creating products for the next generation of the Internet.”

He added that “not all companies are willing to collaborate and so unfortunately we have to take the opposite route, that of legal action.”

Huffman told the audience that Reddit’s position on commercial use is simple. “Commercial use of our data requires commercial terms,” he said. Reddit started charging for commercial access to its API in 2023, a decision that preceded the current licensing agreements.

Huffman said Reddit still provides free data access to researchers and universities and tries to remain flexible for non-commercial use.

What changed Reddit’s openness

According to Huffman, Reddit’s willingness to freely share data changed when the AI industry moved away from open research. As SEJ previously reported, Reddit has restricted access for many search engine crawlers, while Google has remained an exception.

“Historically, Reddit has been like we were born out of an open Internet and Reddit has been open and very permissive about access to its data. And honestly, I think we would be in a different position today if AI companies were still fundamentally open source and doing open research.”

Huffman said the problem was that Reddit could no longer track how its data was being used. “People are using our data and we don’t know what it’s for,” he told the audience.

Beyond commercial terms, Huffman said Reddit wants to prevent its data from being used to identify users, target them with ads, or replace or disintermediate the platform.

Reddit’s own AI efforts

Huffman acknowledged what he called a “paradox.” Reddit’s content powers external AI systems, but the company also uses AI on its own platform.

The most visible product is Reddit Answers, an LLM-based search feature. It reads posts and comments, then organizes them into responses built from verbatim user quotes. Huffman noted that it is designed for questions without definitive answers.

“What Reddit Answers does is a few things that are unique to Reddit. One, it basically only answers with verbatim quotes from real people. And then the second thing it does is try to present multiple perspectives, because the bottom line is, if you’re on Reddit, you want the human perspective.”

Behind the scenes, Reddit uses AI for content moderation and classification. LLMs can assess whether a comment crosses into bullying, a judgment Huffman said was previously difficult to automate because of the subjectivity involved.

Huffman presented AI moderation as a way to reduce exposure to the worst content, not as a replacement for Reddit’s community moderation model.

“It used to be that the worst job on the Internet was looking at the worst content on the Internet and deciding whether it could be online or not,” Huffman said. “This job is disappearing.”

The gray area of AI-written posts

Huffman also addressed the challenge of users writing content with AI tools and pasting it into Reddit. This is different from automated bot activity, he emphasized.

“The most annoying thing I see not only on Reddit, but all over the internet, is someone who wrote their post or comment with ChatGPT and then pasted it into Reddit. Like, is this a robot? It certainly looks like a robot, but there’s a human behind the idea.”

Huffman framed the issue as one of intent. “It’s very important to us that there is a human behind the idea, behind the content, behind the prompt,” Huffman said. But he also noted that “the writing sucks” when users rely on AI to craft their posts.

Rather than creating a policy to address this, Huffman said Reddit would let its community handle it. Users are already downvoting AI-written content and calling it out in the comments. Huffman said Reddit will “further empower users and subreddits to completely reject this type of content.”

He compared the broader question to calculators in math classes. “Kids these days are just learning to write with AI. What are we going to do about it?” he said. “We kind of have to learn, I think, along with everyone else.”

Why it matters

Huffman’s comments reinforce Reddit’s argument that user discussions are a vital input to AI systems.

The AI-written content problem Huffman described is one that SEJ covered as part of a broader investigation into AI slop on YouTube. Reddit’s decision to let community voting manage AI-generated posts, rather than building detection tools, is a different path from the one taken by platforms that have deployed automated labeling.

Looking to the future

Huffman told Fast Company that Reddit was “in the market and talking to people all the time” about new data deals, although he did not hint at a third deal.

Reddit’s lawsuits against Anthropic and Perplexity are both ongoing. The Anthropic case had a remand hearing in federal court in March.
