Related to previous:
https://news.ycombinator.com/item?id=42725147
Thanks! Macroexpanded:
Nepenthes is a tarpit to catch AI web crawlers - https://news.ycombinator.com/item?id=42725147 - Jan 2025 (263 comments)
AI haters? I don't hate AI; I just don't want things I've created being used to enrich multi-billion-dollar companies for free. These companies are behaving poorly, and they should expect this kind of pushback.
It is not even just the copyright issue.
The article completely misses the point that AI scrapers are not merely a "future threat of AI domination". They already do damage by DDoSing sites' networking infrastructure and inflicting very real costs on site hosts.
Even when the data is completely free, as in the case of Wikipedia or OpenStreetMap, scraping it is unethical and should be illegal. Most open-data resources have procedures that allow downloading the data in archived form, with no need for scraping. They are built with sharing in mind.
So the argument the article tries to make (what if it is for the public good?) makes no sense: 1) it isn't, and 2) there are many ways to fetch open data properly and respectfully.
Well, it's soft propaganda; what do you expect? They want to force a false dichotomy to shape public perception.
It is immaterial that these AI companies ignore contractual obligations (TOS) and are in fact performing attacks on said sites (a DDoS is an attack).
In the last 3 months, 4 or 5 small projects I regularly frequent have had their sites knocked offline as a result of this type of bad behavior, by scrapers that definitely are not following robots.txt.
The article is just bad, shilled journalism.
> I don't hate AI I just don't want things I've created being used to enrich multi-billion dollar companies for free.
I mean I think in the minds of AI evangelists (particularly of the quasi-religious "LLMs will bring forth a benevolent god-like superintelligence" variety), those are essentially the same thing.
(Yeah, it's ridiculous characterisation, but given the source it shouldn't be _surprising_ characterisation.)
Yeah, I think the title of this post and the article are a little tilted… we have every right to say no or not cooperate with something we don’t believe in.
I hate people who call ML "AI".
It's a glorified librarian at best.
AI is the broader concept.
ML is an application of AI.
Token prediction is now the unified concept, although perhaps to be replaced by direct sequence prediction. Token prediction is both ML and AI.
Something that is named intelligent does not get "tricked", or need to be defended, or qualified, or defined, aaaand can take a trillion-dollar market collapse in stride and pull up its big artificial undies all by itself, right! Way, way too late for some kind of academic 'splaining things away. People bet a lot more than a mere trillion dollars. Primary national strategic objectives have been put into play on the promise of hardware- and AI-based dominance, and ZAM, just like that, North Korea et al. are jamming that artificial jive. So all of the main hypesters are going to be promoted to somewhere quiet and dry. It is really troubling that last week AI tarpits were amusing, inconsequential pranks, and now it's "AI haters". The author of the AI tarpit was self-deprecating and recognised that their actions were symbolic at best.
Yes, one could say ML is an application of AI, but it's still not artificial intelligence.
Just as an engine is a component of a car, you don't call an engine a car. ML is programmed information, and there's nothing artificial about that.
What is AI about it? Show me something AI in ML.
All the algorithms are defined, all the tokens are defined, all the guardrails are defined.
All of it has been programmed by humans, and that isn't artificial.
If it were AI, it would generate itself, for itself.
I am offering the generally accepted definitions of these terms as some HN readers might not be familiar with them.
If you don't think that ML is AI, I am fine with that.
Cool. I hate calling it AI, because it's an excuse to throw out a flawed, pretentious innovation whereby it can be flogged as a product to sucker those who don't know better.
"Cloud Servers" anyone? Glorified dedicated servers.
Well, artificial means made by humans, not naturally.
You got me on that one.
I wouldn't classify it as intelligent, though. It's still reading a scripture of words.
It's not a librarian.
Is this because they are multi-billion dollar companies, or because they behave poorly, or because you haven't been properly compensated for your contribution to the content on the internet?
It is very likely that voting, or voting with your wallet, or probably any kind of activism, would have more impact than withdrawing from the (online) public life.
> because they are multi-billion dollar companies, or because they behave poorly
Both, for me. They should be spanked for their behavior and lack of respect. I don't want compensation, though, because I write open-source applications, but I want them to respect the license (which they obviously don't).
Also, I don't understand why you feel anybody is withdrawing from the internet. It's only a tarpit, and I'm sure most of those who react don't have ChatGPT subscriptions.
> because you haven't been properly compensated for your contribution to the content on the internet
Copyright is - like it or not - the way we regulate commercial intellectual "property". I can see different IP doctrines, and I don't necessarily defend the current one. It's not derived from real property rights; rather, it's rooted in ensuring economic incentives for people to make stuff that otherwise wouldn't have been made, such as pharmaceuticals and Hollywood movies, by simulating property rights. It's an imperfect solution that exists to ensure economic incentives and balance, and most importantly, it's the one we've got.
But then, when multi-billion-dollar corporations feed your copyright-protected (you thought) works straight into their supply chain, wouldn't you be pissed? It's not a small part either: their models would be extremely nerfed without copyrighted data. Forget AI, forget tech; just look at it from a purely economic-ecosystem perspective. Crying "fair use" during a highway robbery probably doesn't sit right with many, I hope.
what voting with your wallet are you envisioning?
I already don't pay for any AI services or touch any models, but increasingly, services that used to be helpful for me are throwing them in anyway: YouTube Premium has some sort of AI summarization feature, etc. How do I signal that I don't want companies scraping content there?
I'm very confused as to what my $23.45 wallet has to do with what billion-dollar AI companies do.
> or because you haven't been properly compensated for your contribution to the content on the internet?
This comes across as snark, but I will assume you are well meaning. I have put code, guides, and videos on the internet for other people to consume for free in the hope that those people find that stuff useful.
I did not put stuff on the internet for it to be hoovered up and, frankly, stolen by massive AI companies to enrich themselves. If they are going to use my things in their commercial product, then yes, they should be compensating me for that.
> Is this because they are multi-billion dollar companies, or because they behave poorly
It's also both of these.
> It is very likely that voting, or voting with your wallet, or probably any kind of activism, would have more impact than withdrawing from the (online) public life.
The only true control I have is withdrawing. I don't give these companies money, I don't live in a country that can meaningfully legislate against them, and I would consider withdrawing a form of activism.
I refuse to support these AI companies in any way (as long as they continue to be bad actors). I have taken down all the YouTube videos I've created and my personal website, and I have moved all my code to a self-hosted, private Git service in order to deny them my work.
I live in the US, and I didn't have an option to vote against Big Tech. Both parties were deeply sycophantic toward that industry.
You pose this as though there isn't a long and proud history of the Internet reacting with (sometimes unhinged) hostility to bad actors that goes ALLLLLL the way back to the BBS era.
One of the first people who tried to scam users out of money by asking for help with his tuition was doxxed by his own ISP, after that ISP got so much hate mail that it crashed their servers. His message had been posted to every message board by a script, which caused prolific BBS users to download it possibly several hundred times, paying for the privilege each time.
To be fair, there are tens of thousands of content farms filling the web with AI slop. That's far more likely to harm AI scrapers than these hijinks.
Most crawlers use some form of timeout mechanism, usually informed by some priority scheduling. This deals reasonably well with crawler traps.
Since Nepenthes-like traps are getting so common now (and in particular, not always behind robots.txt), I added a clause to Marginalia's crawler that prevents it from extracting links from pages that are less than 2 KB and take more than 9 seconds to load. It's 4 lines of code and means the crawler doesn't get stuck at all.
I totally get the frustration though. My sites get an insane amount of bot traffic as well. I think roughly 1% of the search traffic to the html endpoint is human, and that's while providing a free API they could use instead. ... I just don't think this is going to fix anything.
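The size-and-latency check described above fits in a few lines. Here's a rough Python sketch (Marginalia's actual crawler isn't written in Python, and the function and parameter names here are illustrative, not its real API):

```python
# Hypothetical sketch of the tarpit-avoidance heuristic described above:
# skip link extraction for tiny pages that were suspiciously slow to load.
# The thresholds (2 KB, 9 s) come from the comment; everything else is made up.

def should_extract_links(body: bytes, fetch_seconds: float,
                         min_size_bytes: int = 2048,
                         max_slow_seconds: float = 9.0) -> bool:
    """Return False for likely tarpit pages: tiny bodies served slowly."""
    if len(body) < min_size_bytes and fetch_seconds > max_slow_seconds:
        return False  # likely a trickle-fed tarpit page; don't follow its links
    return True
```

Legitimate pages are occasionally slow or occasionally small, but rarely both at once, which is presumably why combining the two signals keeps false positives low.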
Article seems super biased to me. Why are tarpits repeatedly characterized as attacks rather than as defense?
Agreed. Some commenters were pretty upset about the tarpits the last time this was posted on HN.
In my opinion, if you access my content or web service, you accept the risk of not asking me first or not knowing what's going to happen when you do access it. It's mine, I've given you access, and I'm not going to guarantee its quality or how many times you'll need to try before your scraper gets blocked for making too many requests.
There's also the other kind of AI haters who do not give any anti-bot indicators about their tarpits (no robots.txt entry, no "nofollow", etc), and want to intentionally feed them poisoned data.
That's a great idea! Since there probably aren't any laws against this yet, it sounds like a great way to fight back! Poison the AI well, I like it.
I'm not obligated to design my websites with AI bots (or search engine bots) in mind. If I want to publish mountains of Markov-generated data on my own servers, that's my right. There's no rule that I need to use robots.txt; I didn't ask your shitty bot to crawl my site.
What are the odds that one of these people is running a server popular enough to garner that kind of attention?
The odds are pretty good if the people running the server work for an intelligence agency.
> AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt
> That's likely an appealing bonus feature for any site owners who, like Aaron, are fed up with paying for AI scraping and just want to watch AI burn.
Website owners aren't "haters" if bots ignore robots.txt, consuming resources that translate to expenses and a bad experience for legitimate website visitors.
Why would the website owner have to commission much larger server(s), pay more for this traffic and get nothing in return? At least search engines send human visitors your way.
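For contrast, honoring robots.txt costs a crawler almost nothing; Python even ships a parser in the standard library. A small sketch (the rules and the "MyBot" agent name are made-up examples):

```python
# Checking robots.txt the way a well-behaved crawler would, using the
# stdlib parser. Here we feed it rules directly; a real crawler would
# call set_url(...) and read() against the live /robots.txt instead.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# can_fetch(agent, url) reports whether a URL is permitted for that agent
print(rp.can_fetch("MyBot", "https://example.org/public/page"))   # allowed
print(rp.can_fetch("MyBot", "https://example.org/private/page"))  # disallowed
```

The whole point of the complaints in this thread is that the scrapers skip exactly this step.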
It's not "AI haters". It's exploitation hating.
I wonder if we could run the cheapest, smallest LLM (a 1B-parameter model or smaller) to generate this data in a way that's just plausible enough to be ingested.
A little note in robots.txt offering commercial terms could also be made available.
Usually it's just a Markov chain rather than an LLM generating the garbage data.
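A Markov-chain babbler of the kind these tarpits use can be tiny. Here's a minimal sketch (a word-bigram chain; the function names are illustrative, not any particular tarpit's code):

```python
# Minimal Markov-chain text generator: build a word-bigram table from seed
# text, then take a seeded random walk over it to emit plausible gibberish.
import random
from collections import defaultdict

def build_chain(text: str) -> dict:
    """Map each word to the list of words observed to follow it."""
    words = text.split()
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def babble(chain: dict, length: int = 50, seed: int = 0) -> str:
    """Emit `length` words by walking the chain, restarting at dead ends."""
    rng = random.Random(seed)
    word = rng.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        successors = chain.get(word)
        word = rng.choice(successors) if successors else rng.choice(list(chain))
        out.append(word)
    return " ".join(out)
```

Compared to serving even a 1B-parameter model, this costs effectively nothing per request, which is why tarpits reach for it first.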
I've yet to do it, but my pending defense strategy was to download plain text copies of erotica novels and when I detect a chat bot crawler, just redirect it to that folder and let it go...hard.
Calling people who don't want certain things hoovered up by an LLM "AI haters" is a level of manipulation I'd think was only reserved for someone with a vested interest in the tech. Just encourages devious behavior instead of a more diplomatic approach of respecting people's wishes (read: robots.txt).
> but my pending defense strategy was to download plain text copies of erotica novels and when I detect a chat bot crawler, just redirect it to that folder and let it go
Might I suggest https://tvtropes.org/pmwiki/pmwiki.php/Literature/BelindaBli...
As a bonus, beyond introducing the magic robot to the wonderful world of really poorly written erotica, it's also very anachronistic (widespread use of fax machines, pagers, smartphones, LinkedIn, and East Germany appear to coexist at a single moment in time), so will cause further confusion.
I run a service[1] that was getting hit pretty hard by these crawlers.
Ultimately instead of going down this path, I decided to just start charging for access to the service (it was long overdue)[2].
Users who are logged out can still see old cached content (which is a single DB read op), but to aggregate new content requires an account. I feel like this is a good (enough) middleground solution for now.
[1]: https://kulli.sh
[2]: https://lgug2z.com/articles/in-the-age-of-ai-crawlers-i-have...
I block them because they're not paying me to use my resources. They would block me if I made a similar volume of requests.
What if the scrapers use breadth-first search?
ok...
if depth > 5 and sem_hash(content) in hist: return
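That one-liner could be fleshed out roughly like this. Note that `sem_hash` is a stand-in here (a plain hash of whitespace-normalized text); a real crawler would use a locality-sensitive hash to catch near-duplicate tarpit pages:

```python
# Sketch of the depth/duplicate guard above. `sem_hash` is a simplification:
# it only catches pages identical after whitespace/case normalization, where
# a production crawler would use a fuzzier similarity hash (e.g. SimHash).
import hashlib

def sem_hash(content: str) -> str:
    normalized = " ".join(content.split()).lower()
    return hashlib.sha256(normalized.encode()).hexdigest()

def should_skip(content: str, depth: int, hist: set, max_depth: int = 5) -> bool:
    """Skip pages that are both deep in the crawl and content already seen."""
    h = sem_hash(content)
    if depth > max_depth and h in hist:
        return True
    hist.add(h)
    return False
```

This is why a breadth-first scraper is only trivially harder to trap: the guard keys on repeated content at depth, not on traversal order.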
It's a radar gun/radar detector kind of situation. You can always change your strategies; countless top-level links that are shown on the menu in a way that no human sees them (due to font color or size or etc). Small number of pages with a few that go on forever, in a way that any human would stop reading but a bot may have trouble detecting. Real (for humans) text in images, with endless invisible text for bots. Etc.
I think it will probably make it harder for screen readers, unfortunately.
> countless top-level links that are shown on the menu in a way that no human sees them (due to font color or size or etc).
Wouldn't this lower their page rankings? That's the kind of shenanigans from the old days, with meta keyword stuffing and whatnot.
I'm not sure the old search engines are any good anymore, though, so we don't really care...
That's kind of the point. AI has just been more terrible news after terrible news.
It is NOT a good thing. Unless you know, you like being covered in oil and lit on fire...
If you've got your robots.txt set to tell search engines to go away, or at least don't look at those pages, then you probably aren't worried about that, I assume?
There's a difference between an AI bot and a search engine bot, though.
I can generate more random content than you can store.
"trap crawlers in infinite mazes of gibberish data, potentially increasing AI training costs and poisoning datasets. While their effectiveness is debated, creators see them as a form of resistance against unchecked AI development."
[flagged]
I hope you’re being sarcastic. Today’s “AI” is as close to being a general intelligence with sentience as a toaster is to a laptop.
You might as well be concerned about revenge for all the electronics you sent to the dump or the software lines you deleted.
Why anyone would think some future sentient AI would concern itself with what you did in the past with your software bots or chat bots??
People really need to get a grip.
> Why anyone would think some future sentient AI would concern itself with what you did in the past with your software bots or chat bots??
This is some version of one of the more ridiculous things that has been thrown up by the nerd new religious movement on LessWrong and similar sites: https://en.wikipedia.org/wiki/Roko%27s_basilisk
(It's somewhat unclear to me whether anyone actually believes in this, but some of the more fanatical AI-will-change-everything people _seem_ to more or less accept the premise.)
Regardless of your views on AI, LLMs are going to be influential in the future. If you work to keep your content away from models, it's hard to see how you benefit.
25 years ago, if you had blocked the googlebot scraper because you resented google search, it would only have worked to marginalize the information you were offering up on the internet. Avoiding LLM training datasets will lead to similar outcomes.
I think this is a weak analogy.
What benefit is gained by allowing AI companies to train on your content? LLMs work on a token-by-token basis.
Depends on who you are and what your content is, but shutting yourself off is unlikely to benefit you.