Reddit Sues Perplexity AI for “Industrial-Scale” Data Theft


Key Points:

  • Reddit filed lawsuit in New York federal court on October 22, 2025, against Perplexity AI and three data-scraping firms
  • Perplexity allegedly bypassed security protections to illegally scrape Reddit user posts for training AI search engine
  • Reddit caught Perplexity “red-handed” using honeypot test post that appeared in Perplexity results within hours
  • Citations of Reddit content in Perplexity increased 40-fold after cease-and-desist letter was issued
  • Co-defendants include Lithuanian firm Oxylabs, former Russian botnet AWMProxy, and Texas-based SerpApi
  • Reddit has licensing agreements with Google and OpenAI but Perplexity operates without authorization
  • Second major AI lawsuit after Reddit sued Anthropic in June 2025; both cases allege copyright violations

New Delhi: Reddit initiated legal action against Perplexity AI and three data-scraping companies in Manhattan federal court on October 22, 2025, alleging “industrial-scale, unlawful” data theft from its platform. The lawsuit specifically targets San Francisco-based Perplexity, which operates an AI-powered “answer engine” competing with Google Search and ChatGPT, along with Lithuanian data-scraping company Oxylabs UAB, web domain AWMProxy (described by Reddit as a “former Russian botnet”), and Texas-based startup SerpApi. Reddit alleges these companies conspired to circumvent both Reddit’s anti-scraping protections and Google’s security controls to illegally harvest user-generated content from Google search engine results pages.

“Marked Bills” Sting Operation

Reddit claims it caught Perplexity in a digital sting operation using what it described as “the equivalent of marked bills”. According to the lawsuit, Reddit created a test post that was exclusively accessible through Google search results and nowhere else online. Within hours of posting this honeypot content, the material appeared in responses generated by Perplexity’s answer engine. “The only way Perplexity could have obtained that Reddit content and then used it in an ‘answer’ is if Perplexity and/or the Co-Defendants scraped the SERPs for the Reddit content, which Perplexity then quickly ingested into its answer engine,” the lawsuit stated. Reddit argues this shows that scrapers were extracting Reddit content from Google’s search engine results pages rather than accessing Reddit’s platform legitimately.
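
The “marked bills” idea can be illustrated with a minimal sketch of a canary (honeypot) check. The lawsuit does not describe how Reddit implemented its test, so everything below is an assumption for illustration: the function names, the marker format, and the sample answer strings are all hypothetical.

```python
import secrets


def make_canary(prefix: str = "canary") -> str:
    """Return a unique, hard-to-guess marker to embed in a test post (hypothetical format)."""
    return f"{prefix}-{secrets.token_hex(8)}"


def contains_canary(text: str, canary: str) -> bool:
    """Report whether a piece of text reproduces the marker, ignoring case."""
    return canary.lower() in text.lower()


# Plant the marker in a test post that is published through one channel only.
canary = make_canary()
test_post = f"This is a test post containing the marker {canary}."

# Hypothetical outputs from a third-party service, used only to exercise the check.
leaked_answer = f"One post mentions the marker {canary} in its body."
clean_answer = "An answer that does not quote the test post."

print(contains_canary(leaked_answer, canary))
print(contains_canary(clean_answer, canary))
```

Because the marker is random and appears in only one place, finding it in a third party's output indicates that the output was derived from that one place, which is the inference Reddit draws in its complaint.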

The Data Laundering Ecosystem

Reddit’s Chief Legal Officer Ben Lee characterized the alleged scheme as part of a broader “industrial-scale ‘data laundering’” industry fueled by AI companies’ desperate competition for high-quality human-generated content. The lawsuit alleges that data-scraping intermediaries like Oxylabs, AWMProxy, and SerpApi extract Reddit content from Google search results by “masking their identities, hiding their locations, and disguising their web scrapers” to evade detection. These companies then resell the stolen data to AI firms like Perplexity that are “hungry for training material” but unwilling to enter legitimate licensing agreements.

Reddit compared the defendants to “would-be bank robbers” who, unable to breach the bank vault directly, instead target the armored truck transporting money. Lee stated, “Because they are unable to scrape Reddit directly, they mask their identities, hide their locations, and disguise their web scrapers to steal Reddit content from Google Search. Perplexity is a willing customer of at least one of these scrapers, choosing to buy stolen data rather than enter into a lawful agreement with Reddit itself”.

Cease-and-Desist Ignored

Reddit asserts it sent Perplexity a cease-and-desist letter demanding the company stop using its content. Perplexity did not comply, Reddit says; instead, citations of Reddit content in Perplexity’s search results increased forty-fold after the legal warning. The lawsuit notes that Perplexity had previously scraped Reddit data directly without payment but agreed to stop after receiving the initial cease-and-desist order. The dramatic increase in Reddit citations after the warning suggests Perplexity may have simply shifted to purchasing scraped data from third-party intermediaries rather than accessing it directly.

Legitimate vs. Illegitimate Data Access

Reddit emphasized it has established formal licensing agreements with major AI companies, including Google and OpenAI, granting them limited, authorized access to use Reddit content for training language models. These agreements reflect Reddit’s business model of monetizing its vast repository of human conversations, spread across 100,000 active interest-based “subreddit” communities whose discussions AI researchers value for making chatbot responses sound more natural. The company claims it has invested “tens of millions of dollars” in anti-scraping technological protections to enforce these exclusive agreements.

In contrast, Perplexity operates without any licensing arrangement, allegedly exploiting content without permission or compensation. Reddit stated in its lawsuit that user posts have become the most frequently referenced source for AI-generated responses on Perplexity, making the alleged theft particularly damaging to Reddit’s business interests.

Perplexity’s Defense

Perplexity strongly denied the allegations in a statement posted directly on Reddit, characterizing the lawsuit as “extortion” and an attack on internet openness. “Our approach remains principled and responsible as we provide factual answers with accurate AI, and we will not tolerate threats against openness and the public interest,” the company stated. Perplexity claimed its answer engine does not train AI models on Reddit-specific content but instead “summarizes and cites discussions that are publicly available on Reddit, just like any user linking or quoting Reddit posts might do”.

The company further argued that entering a licensing agreement with Reddit is “impossible” because Perplexity does not develop foundational language models; it merely uses existing models to process search results. Perplexity accused Reddit of attempting to undermine open internet principles by demanding licensing fees, suggesting Reddit’s true motivation is to use the lawsuit as leverage in negotiations with Google and OpenAI over training data pricing.

Legal Demands and Broader Context

Reddit is seeking unspecified monetary damages and a court injunction to permanently prevent Perplexity and the co-defendant scraping companies from accessing or using its data in violation of federal copyright law, unfair competition statutes, and unjust enrichment principles. The lawsuit represents Reddit’s second major legal action against an AI company in 2025, following a similar complaint filed against Anthropic (maker of the Claude chatbot) in June that remains pending with a hearing scheduled for January 2026.

This case illustrates the intensifying conflict between content creators and AI companies over unauthorized use of copyrighted materials to train large language models. Along with digitized books and news articles, platforms like Wikipedia and Reddit represent deep repositories of written human language patterns that are invaluable for teaching AI assistants. As Ben Lee noted, “Reddit is a prime target because it’s one of the largest and most dynamic collections of human conversation ever created”. The outcome of this litigation could establish important precedents for how AI companies must negotiate access to proprietary content in the rapidly evolving artificial intelligence industry.
