Related to previous:
https://news.ycombinator.com/item?id=42725147
Thanks! Macroexpanded:
Nepenthes is a tarpit to catch AI web crawlers - https://news.ycombinator.com/item?id=42725147 - Jan 2025 (263 comments)
AI haters? I don't hate AI; I just don't want things I've created being used to enrich multi-billion-dollar companies for free. These companies are behaving poorly, and they should expect this kind of pushback.
It is not even just the copyright issue.
The article completely misses the point that AI scrapers are not merely a "future threat of AI domination". They already do damage by DDoSing sites' networking infrastructure and inflicting very real costs on site hosts.
Even when the data is completely free, as in the case of Wikipedia or OpenStreetMap, scraping it is unethical and should be illegal. Most open-data resources have procedures that allow downloading the data in archived form, with no need for scraping. They are built with sharing in mind.
So the argument the article tries to make (what if it is for the public good?) makes no sense: 1) it isn't, and 2) there are many ways to fetch open data properly and respectfully.
Well, it's soft propaganda; what do you expect? They want to force a false dichotomy to shape public perception.
It is immaterial that these AI companies ignore contractual obligations (TOS) and are in fact performing attacks on said sites (a DDoS is an attack).
In the last 3 months, 4 or 5 small projects I regularly frequent have had their sites knocked offline as a result of this type of bad behavior, by scrapers that definitely are not following robots.txt.
The article is just bad, shilled journalism.
> I don't hate AI I just don't want things I've created being used to enrich multi-billion dollar companies for free.
I mean I think in the minds of AI evangelists (particularly of the quasi-religious "LLMs will bring forth a benevolent god-like superintelligence" variety), those are essentially the same thing.
(Yeah, it's ridiculous characterisation, but given the source it shouldn't be _surprising_ characterisation.)
Yeah, I think the title of this post and the article are a little tilted… we have every right to say no or not cooperate with something we don’t believe in.
I hate people who call ML "AI".
It's a glorified librarian at best.
AI is the broader concept.
ML is an application of AI.
Token prediction is now the unified concept, although perhaps to be replaced by direct sequence prediction. Token prediction is both ML and AI.
Something that is named intelligent does not get "tricked", or need to be defended, or qualified, or defined, aaaand can take a trillion-dollar market collapse in stride and pull up its big artificial undies all by itself, right! Way, way too late for some kind of academic 'splaining things away. People bet a lot more than a mere trillion dollars. Primary national strategic objectives have been put into play on the promise of hardware- and AI-based dominance, and ZAM, just like that, North Korea et al. are jamming that artificial jive. So all of the main hypesters are going to be promoted to somewhere quiet and dry. It is really troubling that last week AI tarpits were amusing, inconsequential pranks, and now it's "AI haters". The author of the AI tarpit was self-deprecating and recognised that their actions were symbolic at best.
Yes, one could say ML is an application of AI, but it's still not artificial intelligence.
Just as an engine is a component of a car, you don't call an engine a car. ML is programmed information, and there's nothing artificial about that.
What is AI about it? Show me something AI in ML.
All the algorithms are defined, all the tokens are defined, all the guardrails are defined.
All of it has been programmed by humans, and that isn't artificial.
If it were AI, it would generate itself, for itself.
I am offering the generally accepted definitions of these terms as some HN readers might not be familiar with them.
If you don't think that ML is AI, I am fine with that.
Cool. I hate calling it AI, because it's an excuse to throw out a flawed, pretentious innovation whereby it can be flogged as a product to sucker those who don't know better.
"Cloud Servers" anyone? Glorified dedicated servers.
Well, artificial means made by humans, not naturally.
You got me on that one.
I wouldn't classify it as intelligent, though. It's still reading a scripture of words.
It's not a librarian.
Is this because they are multi-billion dollar companies, or because they behave poorly, or because you haven't been properly compensated for your contribution to the content on the internet?
It is very likely that voting, or voting with your wallet, or probably any kind of activism, would have more impact than withdrawing from the (online) public life.
> because they are multi-billion dollar companies, or because they behave poorly
Both, for me. They should be spanked for their behavior and lack of respect. I don't want compensation, though, because I write open-source applications, but I want them to respect the license (which they obviously don't).
Also, I don't understand why you feel anybody is withdrawing from the internet. It's only a tarpit, and I'm sure most of those who react don't have ChatGPT subscriptions.
> because you haven't been properly compensated for your contribution to the content on the internet
Copyright is - like it or not - the way we regulate commercial intellectual "property". I can see different IP doctrines, and I don't necessarily defend the current one. It's not derived from real property rights; rather, it's rooted in ensuring economic incentives for people to make stuff that otherwise wouldn't have been made, such as pharmaceuticals and Hollywood movies, by simulating property rights. It's an imperfect solution that exists to ensure economic incentives and balance, and most importantly, it's the one we've got.
But then, when multi-billion-dollar corporations feed your copyright-protected (you thought) works straight into their supply chain, wouldn't you be pissed? It's not a small part either: their models would be extremely nerfed without copyrighted data. Forget AI, forget tech; just look at it from a purely economic-ecosystem perspective. Crying "fair use" during a highway robbery probably doesn't sit right with many, I hope.
what voting with your wallet are you envisioning?
I already don't pay for any AI services or touch any models, but increasingly, services that used to be helpful for me are throwing them in anyway: YouTube Premium has some sort of AI summarization feature, etc. How do I signal that I don't want companies scraping content there?
I'm very confused as to what my $23.45 wallet has to do with what billion-dollar AI companies do.
> or because you haven't been properly compensated for your contribution to the content on the internet?
This comes across as snark, but I will assume you are well meaning. I have put code, guides, and videos on the internet for other people to consume for free in the hope that those people find that stuff useful.
I did not put stuff on the internet for it to be hoovered up and, frankly, stolen by massive AI companies to enrich themselves. If they are going to use my things in their commercial product, then yes, they should be compensating me for that.
> Is this because they are multi-billion dollar companies, or because they behave poorly
It's also both of these.
> It is very likely that voting, or voting with your wallet, or probably any kind of activism, would have more impact than withdrawing from the (online) public life.
The only true control I have is withdrawing. I don't give these companies money, I don't live in a country that can meaningfully legislate against them, and I would consider withdrawing a form of activism.
I refuse to support these AI companies in any way (as long as they continue to be bad actors). I have taken down all the YouTube videos I've created and my personal website, and I have moved all my code to a self-hosted, private Git service in order to deny them my work.
I live in the US, and I didn't have an option to vote against Big Tech. Both parties were deeply sycophantic toward that industry.
You pose this as though there isn't a long and proud history of the Internet reacting with (sometimes unhinged) hostility to bad actors that goes ALLLLLL the way back to the BBS era.
One of the first people who tried to scam users out of money by asking for help with his tuition was doxxed by his own ISP, after that ISP got so much hate mail that it crashed their servers. His message had been posted to every message board by a script, which caused prolific BBS users to download it possibly several hundred times, paying for the privilege each time.
To be fair, there are tens of thousands of content farms filling the web with AI slop. That's far more likely to harm AI scrapers than these hijinks.
Most crawlers use some form of timeout mechanism, usually informed by some priority scheduling. This deals reasonably well with crawler traps.
Since Nepenthes-like traps are getting so common now (and in particular, not always behind robots.txt), I added a clause to Marginalia's crawler that prevents it from extracting links from pages that are less than 2 KB and take more than 9 seconds to load. It's 4 lines of code and means the crawler doesn't get stuck at all.
I totally get the frustration though. My sites get an insane amount of bot traffic as well. I think roughly 1% of the search traffic to the html endpoint is human, and that's while providing a free API they could use instead. ... I just don't think this is going to fix anything.
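The size-and-latency check described above fits in a few lines. Here's a rough Python sketch (Marginalia's actual crawler isn't written in Python, and the function and parameter names here are illustrative, not its real API):

```python
# Hypothetical sketch of the tarpit-avoidance heuristic described above:
# skip link extraction for tiny pages that were suspiciously slow to load.
# The thresholds (2 KB, 9 s) come from the comment; everything else is made up.

def should_extract_links(body: bytes, fetch_seconds: float,
                         min_size_bytes: int = 2048,
                         max_slow_seconds: float = 9.0) -> bool:
    """Return False for likely tarpit pages: tiny bodies served slowly."""
    if len(body) < min_size_bytes and fetch_seconds > max_slow_seconds:
        return False  # likely a trickle-fed tarpit page; don't follow its links
    return True
```

Legitimate pages are occasionally slow or occasionally small, but rarely both at once, which is presumably why combining the two signals keeps false positives low.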
Article seems super biased to me. Why are tarpits repeatedly characterized as attacks rather than as defense?
Agreed. Some commenters were pretty upset about the tarpits the last time this was posted on HN.
In my opinion, if you access my content or web service, you accept the risk of not asking me first or not knowing what's going to happen when you do access it. It's mine, I've given you access, and I'm not going to guarantee its quality or how many times you'll need to try before your scraper gets blocked for making too many requests.
There's also the other kind of AI haters who do not give any anti-bot indicators about their tarpits (no robots.txt entry, no "nofollow", etc), and want to intentionally feed them poisoned data.
That's a great idea! Since there probably aren't any laws against this yet, it sounds like a great way to fight back! Poison the AI well, I like it.
I'm not obligated to design my websites with AI bots (or search engine bots) in mind. If I want to publish mountains of Markov-generated data on my own servers, that's my right. There's no rule that I need to use robots.txt; I didn't ask your shitty bot to crawl my site.
What are the odds that one of these people is running a server popular enough to garner that kind of attention?
The odds are pretty good if the people running the server work for an intelligence agency.
> AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt
> That's likely an appealing bonus feature for any site owners who, like Aaron, are fed up with paying for AI scraping and just want to watch AI burn.
Website owners aren't "haters" if bots ignore robots.txt, consuming resources that translate to expenses and a bad experience for legitimate website visitors.
Why would the website owner have to commission much larger server(s), pay more for this traffic and get nothing in return? At least search engines send human visitors your way.
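For contrast, honoring robots.txt costs a crawler almost nothing; Python even ships a parser in the standard library. A small sketch (the rules and the "MyBot" agent name are made-up examples):

```python
# Checking robots.txt the way a well-behaved crawler would, using the
# stdlib parser. Here we feed it rules directly; a real crawler would
# call set_url(...) and read() against the live /robots.txt instead.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# can_fetch(agent, url) reports whether a URL is permitted for that agent
print(rp.can_fetch("MyBot", "https://example.org/public/page"))   # allowed
print(rp.can_fetch("MyBot", "https://example.org/private/page"))  # disallowed
```

The whole point of the complaints in this thread is that the scrapers skip exactly this step.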
It's not "AI haters". It's exploitation hating.
I wonder if we could run the cheapest, smallest LLM (a 1B-parameter model or smaller) to generate this data in a way that's just plausible enough to be ingested.
A little note in robots.txt offering commercial terms could also be made available.
Usually it's just a Markov chain rather than an LLM generating the garbage data.
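A Markov-chain babbler of the kind these tarpits use can be tiny. Here's a minimal sketch (a word-bigram chain; the function names are illustrative, not any particular tarpit's code):

```python
# Minimal Markov-chain text generator: build a word-bigram table from seed
# text, then take a seeded random walk over it to emit plausible gibberish.
import random
from collections import defaultdict

def build_chain(text: str) -> dict:
    """Map each word to the list of words observed to follow it."""
    words = text.split()
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def babble(chain: dict, length: int = 50, seed: int = 0) -> str:
    """Emit `length` words by walking the chain, restarting at dead ends."""
    rng = random.Random(seed)
    word = rng.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        successors = chain.get(word)
        word = rng.choice(successors) if successors else rng.choice(list(chain))
        out.append(word)
    return " ".join(out)
```

Compared to serving even a 1B-parameter model, this costs effectively nothing per request, which is why tarpits reach for it first.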
I've yet to do it, but my pending defense strategy was to download plain text copies of erotica novels and when I detect a chat bot crawler, just redirect it to that folder and let it go...hard.
Calling people who don't want certain things hoovered up by an LLM "AI haters" is a level of manipulation I'd think was only reserved for someone with a vested interest in the tech. Just encourages devious behavior instead of a more diplomatic approach of respecting people's wishes (read: robots.txt).
> but my pending defense strategy was to download plain text copies of erotica novels and when I detect a chat bot crawler, just redirect it to that folder and let it go
Might I suggest https://tvtropes.org/pmwiki/pmwiki.php/Literature/BelindaBli...
As a bonus, beyond introducing the magic robot to the wonderful world of really poorly written erotica, it's also very anachronistic (widespread use of fax machines, pagers, smartphones, LinkedIn, and East Germany appear to coexist at a single moment in time), so will cause further confusion.
I run a service[1] that was getting hit pretty hard by these crawlers.
Ultimately instead of going down this path, I decided to just start charging for access to the service (it was long overdue)[2].
Users who are logged out can still see old cached content (which is a single DB read op), but to aggregate new content requires an account. I feel like this is a good (enough) middleground solution for now.
[1]: https://kulli.sh
[2]: https://lgug2z.com/articles/in-the-age-of-ai-crawlers-i-have...
I block them because they're not paying me to use my resources. They would block me if I made a similar volume of requests.
What if the scrapers use breadth-first search?
ok...
if depth > 5 and sem_hash(content) in hist: return
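That one-liner could be fleshed out roughly like this. Note that `sem_hash` is a stand-in here (a plain hash of whitespace-normalized text); a real crawler would use a locality-sensitive hash to catch near-duplicate tarpit pages:

```python
# Sketch of the depth/duplicate guard above. `sem_hash` is a simplification:
# it only catches pages identical after whitespace/case normalization, where
# a production crawler would use a fuzzier similarity hash (e.g. SimHash).
import hashlib

def sem_hash(content: str) -> str:
    normalized = " ".join(content.split()).lower()
    return hashlib.sha256(normalized.encode()).hexdigest()

def should_skip(content: str, depth: int, hist: set, max_depth: int = 5) -> bool:
    """Skip pages that are both deep in the crawl and content already seen."""
    h = sem_hash(content)
    if depth > max_depth and h in hist:
        return True
    hist.add(h)
    return False
```

This is why a breadth-first scraper is only trivially harder to trap: the guard keys on repeated content at depth, not on traversal order.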
It's a radar gun/radar detector kind of situation. You can always change your strategies; countless top-level links that are shown on the menu in a way that no human sees them (due to font color or size or etc). Small number of pages with a few that go on forever, in a way that any human would stop reading but a bot may have trouble detecting. Real (for humans) text in images, with endless invisible text for bots. Etc.
I think it will probably make it harder for screen readers, unfortunately.
> countless top-level links that are shown on the menu in a way that no human sees them (due to font color or size or etc).
Wouldn't this lower their page rankings? That's the kind of shenanigans from the old days, with meta keyword stuffing and whatnot.
I'm not sure the old search engines are any good anymore, though, so we don't really care...
That's kind of the point. AI has just been more terrible news after terrible news.
It is NOT a good thing. Unless you know, you like being covered in oil and lit on fire...
If you've got your robots.txt set to tell search engines to go away, or at least don't look at those pages, then you probably aren't worried about that, I assume?
There's a difference between an AI bot and a search engine bot, though.
I can generate more random content than you can store.
"trap crawlers in infinite mazes of gibberish data, potentially increasing AI training costs and poisoning datasets. While their effectiveness is debated, creators see them as a form of resistance against unchecked AI development."
[flagged]
I hope you’re being sarcastic. Today’s “AI” is as close to being a general intelligence with sentience as a toaster is to a laptop.
You might as well be concerned about revenge for all the electronics you sent to the dump or the software lines you deleted.
Why anyone would think some future sentient AI would concern itself with what you did in the past with your software bots or chat bots??
People really need to get a grip.
> Why anyone would think some future sentient AI would concern itself with what you did in the past with your software bots or chat bots??
This is some version of one of the more ridiculous things that has been thrown up by the nerd new religious movement on LessWrong and similar sites: https://en.wikipedia.org/wiki/Roko%27s_basilisk
(It's somewhat unclear to me whether anyone actually believes in this, but some of the more fanatical AI-will-change-everything people _seem_ to more or less accept the premise.)
Regardless of your views on AI, LLMs are going to be influential in the future. If you work to keep your content away from models, it's hard to see how you benefit.
25 years ago, if you had blocked the googlebot scraper because you resented google search, it would only have worked to marginalize the information you were offering up on the internet. Avoiding LLM training datasets will lead to similar outcomes.
I think this is a weak analogy.
What benefit is gained by allowing AI companies to train on your content? LLMs work on a token-by-token basis.
Depends on who you are and what your content is, but shutting yourself off is unlikely to benefit you.