It turns out that some AI startups may be pushing the boundaries, or outright ignoring the rules, when it comes to gathering data online. I recently discovered that Perplexity, an AI startup, has been accused of scraping content from websites that explicitly asked not to be crawled. According to a report from internet infrastructure giant Cloudflare, Perplexity's bots have been circumventing restrictions set by site owners, including ignoring robots.txt files that tell crawlers which parts of a site they're allowed to visit.
This discovery shines a light on an ongoing issue in the AI world: how companies collect the massive amounts of data needed to power their large language models and other AI products without clear permission.
Here’s what Cloudflare observed
Cloudflare's researchers noticed that Perplexity didn't just scrape content; it actively hid its crawling activity. Instead of transparently identifying themselves as a bot, Perplexity's systems reportedly masked their identity by changing their "user agent", the header string websites use to figure out who's visiting. They even rotated the networks their requests originated from, identifiable by autonomous system numbers (ASNs), to avoid detection. Essentially, they wore disguises to sneak into websites that explicitly said, "Don't crawl here."
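To see why this disguise works, it helps to remember that the user agent is just a header the client sets on its own request; the server has no way to verify it. The sketch below builds (but does not send) two requests using Python's standard library. The bot name "ExampleAIBot" and the URL are purely illustrative, not Perplexity's actual identifiers:

```python
import urllib.request

# The User-Agent header is supplied entirely by the client: nothing
# stops a crawler from claiming to be a desktop browser.
url = "https://example.com/article"  # hypothetical target URL

# A crawler that honestly declares itself:
declared = urllib.request.Request(
    url, headers={"User-Agent": "ExampleAIBot/1.0"}
)

# The same client, disguised as Chrome on macOS:
disguised = urllib.request.Request(
    url,
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/124.0.0.0 Safari/537.36"
    },
)

# Same client, same target; the server only sees whatever string was set.
print(declared.get_header("User-agent"))   # ExampleAIBot/1.0
print(disguised.get_header("User-agent"))  # Mozilla/5.0 (Macintosh; ...
```

This is why Cloudflare had to fall back on network-level signals like ASNs and behavioral fingerprinting: the self-reported identity of a request proves nothing.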
Cloudflare found these tactics happening across tens of thousands of domains, sending millions of requests every day. By combining machine learning techniques with network data, they were able to fingerprint the crawler linked to Perplexity.
“We observed that Perplexity uses not only their declared user-agent, but also a generic browser intended to impersonate Google Chrome on macOS when their declared crawler was blocked.”
In response, Perplexity’s spokesperson dismissed these findings, suggesting the data didn’t prove any unauthorized access. They even claimed the bot in question wasn’t theirs. However, Cloudflare had received complaints from its customers, who had put up blocks and rules to stop Perplexity’s bots — only to still see them crawling the sites.
Why is this such a big deal?
AI models rely fundamentally on huge datasets to learn — scraping text, images, and videos from the web is a common way they build those datasets. But scraping data without permission, especially when site owners clearly block it, raises serious ethical, legal, and business model questions.
Many websites use the robots.txt standard to communicate their preferences about being indexed or scraped, and traditional search engines have long respected it. But some AI crawlers are disrupting that respect for boundaries, upsetting the balance many sites, especially publishers, rely on to make money.
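The catch is that robots.txt rules are keyed to the user agent a crawler chooses to report, so a disguised bot slips past them by design. A minimal sketch using Python's built-in `urllib.robotparser`, with a made-up bot name ("ExampleAIBot") standing in for any declared AI crawler:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks one declared AI crawler
# while allowing ordinary visitors.
robots_txt = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The declared crawler is blocked everywhere on the site...
print(rp.can_fetch("ExampleAIBot", "https://example.com/article"))  # False

# ...but the same bot presenting a browser user agent matches the
# catch-all "*" group instead, because robots.txt can only match
# whatever identity the client reports.
chrome_ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"
print(rp.can_fetch(chrome_ua, "https://example.com/article"))  # True
```

In other words, robots.txt is an honor system: it works only as long as crawlers identify themselves truthfully, which is exactly the behavior Cloudflare says broke down here.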
Cloudflare itself has recently been vocal about how AI is breaking the internet’s business model, particularly for content creators and publishers who struggle to monetize their work when AI scrapes and reuses it without compensation. In fact, Cloudflare has even launched a marketplace for website owners to start charging AI scrapers, signaling just how serious this issue has become.
Perplexity and the bigger picture
This isn’t the first time Perplexity has been under the spotlight for allegedly scraping content without authorization. Last year, some news outlets accused the startup of plagiarism — a charge that their CEO didn’t fully address when pressed at a major tech conference. Given how much AI depends on web data, and how many content creators rely on clear rules and protections, this ongoing tension will shape the debate around AI’s growth and responsibility.
What’s clear is that AI startups face a tough balancing act: they need data to innovate, but they also have to respect the wishes of those who create that content. The ways companies like Perplexity handle this challenge will probably influence how the web itself evolves in the coming years.
Key takeaways
- robots.txt and other web standards are increasingly ignored by some AI crawlers, complicating data ethics.
- Tech giants like Cloudflare are stepping in to help protect websites and publishers from unauthorized scraping.
- The tension between AI innovation and respecting content ownership is a defining issue for the future of the internet.
At the end of the day, no one wants an internet where AI companies freely raid content without permission — but they also can’t advance without data. The big question is: how will the ecosystem evolve to ensure everyone’s interests are balanced? I’ll be watching closely as this story unfolds.


