It turns out that some AI startups may be pushing the boundaries, or outright ignoring the rules, when it comes to gathering data online. I recently discovered that Perplexity, an AI startup, has been accused of scraping content from websites that explicitly asked not to be crawled. According to a report from internet infrastructure giant Cloudflare, Perplexity's bots have been circumventing restrictions set by site owners, including ignoring robots.txt files that tell crawlers which parts of a site they're allowed to visit.
This discovery shines a light on an ongoing issue in the AI world: how companies collect the massive amounts of data needed to power their large language models and other AI products without clear permission.
Here’s what Cloudflare observed
Cloudflare's researchers noticed that Perplexity didn't just scrape content; it actively hid its crawling activity. Instead of transparently identifying themselves as a bot, Perplexity's systems reportedly masked their identity by changing their "user agent", the header string websites use to figure out who's visiting. They even rotated the networks their requests originated from, identifiable by autonomous system numbers (ASNs), to avoid detection. Essentially, they wore disguises to sneak into websites that explicitly said, "Don't crawl here."
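To see why this disguise works, it helps to remember that the user agent is just a header the client sets on its own request; the server has no way to verify it. The sketch below builds (but does not send) two requests using Python's standard library. The bot name "ExampleAIBot" and the URL are purely illustrative, not Perplexity's actual identifiers:

```python
import urllib.request

# The User-Agent header is supplied entirely by the client: nothing
# stops a crawler from claiming to be a desktop browser.
url = "https://example.com/article"  # hypothetical target URL

# A crawler that honestly declares itself:
declared = urllib.request.Request(
    url, headers={"User-Agent": "ExampleAIBot/1.0"}
)

# The same client, disguised as Chrome on macOS:
disguised = urllib.request.Request(
    url,
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/124.0.0.0 Safari/537.36"
    },
)

# Same client, same target; the server only sees whatever string was set.
print(declared.get_header("User-agent"))   # ExampleAIBot/1.0
print(disguised.get_header("User-agent"))  # Mozilla/5.0 (Macintosh; ...
```

This is why Cloudflare had to fall back on network-level signals like ASNs and behavioral fingerprinting: the self-reported identity of a request proves nothing.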
Cloudflare found these tactics happening across tens of thousands of domains, sending millions of requests every day. By combining machine learning techniques with network data, they were able to fingerprint the crawler linked to Perplexity.
“We observed that Perplexity uses not only their declared user-agent, but also a generic browser intended to impersonate Google Chrome on macOS when their declared crawler was blocked.”
In response, Perplexity’s spokesperson dismissed these findings, suggesting the data didn’t prove any unauthorized access. They even claimed the bot in question wasn’t theirs. However, Cloudflare had received complaints from its customers, who had put up blocks and rules to stop Perplexity’s bots — only to still see them crawling the sites.
Why is this such a big deal?
AI models rely fundamentally on huge datasets to learn — scraping text, images, and videos from the web is a common way they build those datasets. But scraping data without permission, especially when site owners clearly block it, raises serious ethical, legal, and business model questions.
Many websites use the robots.txt standard to communicate their preferences about being indexed or scraped, and traditional search engines have long respected it. But some AI crawlers are disrupting that respect for boundaries, upsetting the balance many sites, especially publishers, rely on to make money.
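The catch is that robots.txt rules are keyed to the user agent a crawler chooses to report, so a disguised bot slips past them by design. A minimal sketch using Python's built-in `urllib.robotparser`, with a made-up bot name ("ExampleAIBot") standing in for any declared AI crawler:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks one declared AI crawler
# while allowing ordinary visitors.
robots_txt = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The declared crawler is blocked everywhere on the site...
print(rp.can_fetch("ExampleAIBot", "https://example.com/article"))  # False

# ...but the same bot presenting a browser user agent matches the
# catch-all "*" group instead, because robots.txt can only match
# whatever identity the client reports.
chrome_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
print(rp.can_fetch(chrome_ua, "https://example.com/article"))  # True
```

In other words, robots.txt is an honor system: it works only as long as crawlers identify themselves truthfully, which is exactly the behavior Cloudflare says broke down here.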
Cloudflare itself has recently been vocal about how AI is breaking the internet’s business model, particularly for content creators and publishers who struggle to monetize their work when AI scrapes and reuses it without compensation. In fact, Cloudflare has even launched a marketplace for website owners to start charging AI scrapers, signaling just how serious this issue has become.
Perplexity and the bigger picture
This isn’t the first time Perplexity has been under the spotlight for allegedly scraping content without authorization. Last year, some news outlets accused the startup of plagiarism — a charge that their CEO didn’t fully address when pressed at a major tech conference. Given how much AI depends on web data, and how many content creators rely on clear rules and protections, this ongoing tension will shape the debate around AI’s growth and responsibility.
What’s clear is that AI startups face a tough balancing act: they need data to innovate, but they also have to respect the wishes of those who create that content. The ways companies like Perplexity handle this challenge will probably influence how the web itself evolves in the coming years.
Key takeaways
- robots.txt and other web standards are increasingly ignored by some AI crawlers, complicating data ethics.
- Tech giants like Cloudflare are stepping in to help protect websites and publishers from unauthorized scraping.
- The tension between AI innovation and respecting content ownership is a defining issue for the future of the internet.
At the end of the day, no one wants an internet where AI companies freely raid content without permission — but they also can’t advance without data. The big question is: how will the ecosystem evolve to ensure everyone’s interests are balanced? I’ll be watching closely as this story unfolds.


