103 comments

  • neonate 2 days ago ago
  • 486sx33 2 days ago ago

    It’s unfortunate and kind of dystopian. We have an opportunity to properly archive all of the world’s online data and catalog it for very, very low cost (historically), so that the future of our planet will have a much better reference point for the past.

    Instead of that, companies are sucking up as much crap as possible, and tokenizing it and then scrubbing it, and adding “safety” to it.

    Reality is always much stranger than fiction.

    • Spivak 2 days ago ago

      We have billions of people; we can accomplish two, maybe three, things at a time. This is as valid a use of that archived data as any. The part that sucks isn't that people are doing unusual things with it, like training AI, but that copyright & capitalism make it so that everyone has to go get their own data themselves, to the annoyance of web admins.

      The biggest technical hurdle to sharing the work among interested parties is the web only authenticates the pipe, not the content.

      • jrochkind1 2 days ago ago

        CommonCrawl tries to archive the web and share it openly so everyone doesn't have to scrape it themselves.

        "Our goal is to democratize the data so that everyone, not just big companies, can do high-quality research and analysis."

        Because they share it openly, including with those doing AI, they wind up on "AI crawler" lists, which are increasingly applied by blocking tools that just "use the AI list": by people who don't like AI, or, quite ironically, by people who are trying to prevent the excess traffic that poorly mannered AI crawlers cause. (Common Crawl's own crawler is well mannered: it uses a proper user-agent, respects robots.txt including crawl-delay, etc.)

        https://commoncrawl.org/
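
        If you want their data to remain available without taking the traffic hit yourself, robots.txt can throttle rather than block; a minimal sketch (not taken from any particular site):

          # Throttle Common Crawl's crawler instead of blocking it
          User-agent: CCBot
          Crawl-delay: 10

          # Block ByteDance's crawler outright (though it reportedly ignores this)
          User-agent: Bytespider
          Disallow: /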

      • anileated 2 days ago ago

        Copyright & capitalism are a crucial part of how we got the technical foundation behind ML and most of the material used for training it. Big tech companies that want to monetize it at scale would like us not to think about that (or any long-term consequences that do not affect shareholder value beyond current management), of course. If anything, the problem with intellectual property law is that they feel it’s safe to ignore it when it comes to ordinary people’s work (good luck suing ClosedAI).

      • throw10920 2 days ago ago

        > This is a valid use as any of that archived data.

        No, it's really not, as most of the people who actually spend the time and effort to produce that content did not consent to it being used to train AI.

        > copyright & capitalism

        That's a really disingenuous way to say "the creators of that data didn't consent to training or commercial use and I want to steal their effort".

        • immibis a day ago ago

          I don't consent to paying rent, but I still have to. If it's legal for one party, it should be legal for all parties. The law shouldn't pick favourites. If ChatGPT (backed by Microsoft) can copy my data, I can download unlicensed Windows. If I can't, it can't.

          • throw10920 a day ago ago

            Yes, I completely agree that the law shouldn't pick favorites.

            To clarify: the creators of the majority of online content haven't consented to their content being used to build AI models for any company or organization. For US-based "creators", that includes both domestic companies like Anthropic, OpenAI, Google, and foreign companies like ByteDance.

        • Spivak 2 days ago ago

          I was actually going for the dynamic where sharing isn't caring in this space. In theory it would be great if there were a few good companies who crawled the internet for you and sold access to it, but in practice those companies are pushed to charge an arm and a leg, which incentivizes mid-to-large companies to go get the data themselves.

  • jgrahamc 2 days ago ago

    Stuff like this is why Cloudflare launched the AI Audit feature and the ability to block "AI bots". We're about to launch a feature that'll enforce your robots.txt.

    • andrethegiant 2 days ago ago

      I’m working on a platform[1] (built on Cloudflare!) that lets devs deploy well-behaved crawlers by default, respecting robots.txt, 429s, etc. The hope is that we can introduce a centralized caching layer to alleviate network congestion from bot traffic.

      [1] https://crawlspace.dev
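
      For anyone building on it, the well-behaved basics fit in a few lines of stdlib Python; a rough sketch (bot name, timings, and error handling are illustrative, not how crawlspace actually works):

        import time
        import urllib.request
        import urllib.robotparser
        from urllib.error import HTTPError

        USER_AGENT = "examplebot/0.1 (+https://example.com/bot)"  # hypothetical bot identity

        def polite_fetch(url, robots_url, default_delay=1.0, max_retries=3):
            # Check robots.txt before fetching anything else.
            rp = urllib.robotparser.RobotFileParser(robots_url)
            rp.read()
            if not rp.can_fetch(USER_AGENT, url):
                return None  # disallowed: skip rather than scrape anyway

            # Honour Crawl-delay if the site sets one for this agent.
            delay = rp.crawl_delay(USER_AGENT) or default_delay

            for _ in range(max_retries):
                req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
                try:
                    with urllib.request.urlopen(req, timeout=30) as resp:
                        return resp.read()
                except HTTPError as e:
                    if e.code != 429:
                        raise
                    # Back off for however long the server asks, defaulting to 60s.
                    retry_after = e.headers.get("Retry-After")
                    time.sleep(int(retry_after) if retry_after and retry_after.isdigit() else 60)
                finally:
                    time.sleep(delay)  # never hammer the same host
            return None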

      • zebomon 2 days ago ago

        I love the sentiment, but the real issue is one of incentives, not ability. The problematic crawlers have more than enough technical ability to minimize their impact. They just don't have a reason to care right now.

    • notachatbot123 2 days ago ago

      It would be nice to share this tooling free and open-source so that anyone can protect themselves.

  • Ironlikebike 2 days ago ago

    In my last job, we observed ByteDance scraping TBs of OS testing data through the RESTful API that our OSS community front-end was using to serve its CI results to the OSS community. The scraping was so relentless it was causing performance problems, and we were worried they were going to run up large network egress fees as well. We specifically locked down the API after that, and anyone who wanted to use results had to ask explicit permission and be granted access.

  • Havoc 3 days ago ago

    Going to be hard to enforce anything against this if it’s happening across jurisdictions like this.

    I don’t see how copyright survives long term in this sort of context.

    • fny 2 days ago ago

      Punitive measures.

      Narrow tariffs, competitive subsidies, sanctions, divestment, export restrictions are all viable deterrents.

      Aside from this behavior, China has been subsidizing industries and dumping products into economic rivals' markets for years. Never mind all the IP theft. It’s absurd the US has responded so weakly for the last 30 years.

      • VHRanger 2 days ago ago

        The proper response to a foreign country subsidizing something is to just buy a lot of it. They're losing money on the unit economics, you can profit from it by buying.

        We've known for decades that protectionist policies and subsidies make industries less competitive, not more. It is literally textbook stuff [1].

        A typical result of protectionism is something like GM in the US, where they grow uncompetitive because they don't have to compete with foreign competitors for domestic demand.

        You get similar uncompetitive dependent behavior from subsidies - just look at Intel right now.

        1. https://web.pdx.edu/~ito/Krugman-Obstfeld-Melitz/8e-text-PDF...

        Note the author here is Paul Krugman, who literally won a Nobel Prize for his work on infant industry protection.

        • fny 2 days ago ago

          1. Krugman admits at many points in the article you sent that protecting infant industries at times works. We're also not talking about infant industries.

          2. VCs routinely use the strategy of subsidizing their startups to "disrupt" industries until they dominate a market. China does the same thing.

          3. The costs of sacrificing domestic supply chains and development capacities do not fit neatly into macroeconomic models. National security issues present similar difficulties. Do arguments around comparative advantage apply to hostile adversaries that routinely break laws (i.e. ByteDance, IP theft) and provide natural resources to enemies?

          4. While the US did not succeed with Intel, China has routinely subsidized industries while enforcing antitrust with far more success than the US. See the Alibaba breakup or the recently implemented antimonopoly laws as examples: https://www.gibsondunn.com/antitrust-in-china-2023-year-in-r...

          • lazystar 2 days ago ago

            > See the Alibaba breakup or the recently implemented antimonopoly laws as examples

            An argument could be made that any increase in competition is a side effect, rather than the main goal, of their antimonopoly changes. Until China explains the full Jack Ma story, anything Alibaba-related will be seen as politically driven rather than economically motivated.

          • VHRanger 2 days ago ago

            1. Absolutely correct, it works "sometimes", but not in the general case, and especially not in real life in the general case given how it skews the incentive structure of the firms.

            The point remains that in general if a foreign country is over subsidizing an industry it's a good idea -- even if you don't like them -- to just buy a ton of the stuff.

            2. It remains to be proven how good an idea the Silicon Valley VC model was, now that the zero-interest-rate environment it grew up in has ended. Uber has had all of 5 profitable quarters in 15 years. Twitter had something like 4.

            Many of those VC hypergrowth companies, except a dozen or so, are effectively a big game of hot potato. The gap between investment and profit made is often still in the 9 or 10 figure range.

            I'd wait another decade or so before proclaiming it's a good strategy. Predatory pricing doesn't even work in theory - there was effectively a chapter on it in an industrial organization class I took, though I'd have to find the material again. It might work in practice if there are other effects not taken into account in the theoretical model, though.

            3. I would agree with you there, and I think both banning tiktok and subsidizing intel (foundries only) are ideas I agree with even though controversial.

            4. I wouldn't argue that the alibaba breakup was a good example - this sort of move creates a huge chilling effect on investors and entrepreneurs in China. The breakup was much more about Xi consolidating his grasp on power than anything else to be realpolitik about it.

        • hash872 2 days ago ago

          I don't think that macroeconomics is an empirical field, so when people say 'it's textbook stuff' that doesn't impress me. I don't believe the textbook. People want to pretend that economics is like physics or chemistry, but it simply isn't true. Imagine if I said something was 'textbook sociology', would you have to then drop all objections?

          Reminder that in the last 30 years economists have variously told us that there's a (high) natural rate of unemployment that we couldn't change (has recently been completely debunked). That raising the minimum wage costs jobs. That bank deregulation is good. And so on. It's just not an empirical field, so I don't believe what's in the textbook. It's also open to lobbying from for-profit entities for specific viewpoints, in a way that a real science usually isn't

          • lazystar 2 days ago ago

            > I don't think that macroeconomics is an empirical field, so when people say 'it's textbook stuff' that doesn't impress me.

            Agreed, especially when the textbook being referenced was written by a polarizing figure like Krugman. I wonder how much "textbook stuff" was removed from textbooks after 2008?

            • VHRanger 2 days ago ago

              Krugman won his nobel prize in economics for his work on international trade, and especially protectionism of infant industries:

              https://www.nobelprize.org/prizes/economic-sciences/2008/kru...

              His NYT column might be controversial, but his work in international trade absolutely isn't.

              It's like saying Chomsky is controversial to counter-argue his work in linguistics. Chomsky might be a political hack, but his opinion on formal grammars is probably sound.

              > I wonder, how much "textbook stuff" was removed from textbooks after 2008?

              Basically nothing, to be honest? What should have changed?

              The banks that collapsed into a financial crisis were effectively committing fraud. The Federal Reserve was publishing opinions that the housing sector was at risk as early as 2005-2006.

              Also, it's difficult for a central bank to know the extent of the mispricing when there's active concealment of risks (eg. backroom deals with insurers and risk assessors) - you need full on auditing to spot that.

              The bigger problem with 2008 is that almost no one responsible went to jail.

          • VHRanger 2 days ago ago

            > I don't think that macroeconomics is an empirical field

            If that's your opinion it's pretty clear your engagement with the field of macroeconomics is several degrees removed from the actual research.

            Assuming you're here in good faith, I would ask you to actually browse a dozen or so of any recent, randomly picked papers in the field you claimed is "not empirical", skim them and note if it's empirical work or theoretical work.

            Then come back here and seriously argue that the field is "not empirical". I'll give you a jump start, here's two good sources for recent macro papers:

            NBER Macro preprints: https://www.nber.org/topics/macroeconomics?page=1&perPage=50

            AEJ Macro: https://www.aeaweb.org/journals/mac/forthcoming

            Of course that won't be your current view of the field if your knowledge comes from the opinion section of newspapers and HN comments. But, again, I'm assuming you want to challenge your views in good faith here.

            > Reminder that in the last 30 years economists have variously told us that there's a (high) natural rate of unemployment that we couldn't change (has recently been completely debunked).

            Not sure where you get that opinion from, the NAIRU published by the CBO went from a high of 6.2% in the energy crisis of the 1970s to around 4.4% today:

            https://fred.stlouisfed.org/series/NROU

            30 years ago the NAIRU was 5.4% and today it's 4.4%, saying it was "completely debunked" makes no sense and I'm seriously wondering which source you got this claim from.

            Moreover, the concept of a natural rate of unemployment that's somewhere above 0% is uncontroversial: there's naturally a time gap when looking for a new job, even in an economy at "full employment capacity".

            > That raising the minimum wage costs jobs.

            Unless your economics education stopped at the first week of microeconomics 101, or comes entirely from the political discourse or reddit, this isn't something that is the position of basically any economist.

            Seriously, here's the first recent (2024) highly cited research review I could find from 4 seconds of googling:

            https://www.nber.org/system/files/working_papers/w32878/w328...

            First, note the review is 123 pages long. There's clearly some subtlety past "minimum wage bad, unemployment high!" But we can skim and jump to the conclusion. To quote:

            """ While the evidence is not unanimous, a reasonable conclusion from the existing literature is that minimum wage policies have had limited direct employment effects while significantly increasing the earnings of low-wage workers—at least at certain levels and in particular economic contexts.

            """

            Also, by the way, the minimum wage labor effect is studied in your labor economics class, which is micro, not macro. Which points again to the question of where you're sourcing your claims from.

        • belorn 2 days ago ago

          There isn't any major car manufacturer (that I know of) which isn't deeply tied to the government of their country of origin, enjoying everything from massive amount of subsidies to access to military intelligence networks. Sometimes even specific laws are written explicitly to support that specific company. Like aircraft manufacturers, car manufacturers are also generally military manufacturers so the internal lines within those that distinguish between private company, government, and military get very blurry.

          I don't see a proper response for other countries when dealing with such entities. Most likely it's going to be an equally blurry mess of trade policies, foreign policy, and military policy.

        • nonethewiser 2 days ago ago

          To a degree. But it can hurt your domestic industries to a degree which may not be acceptable.

        • georgeburdell 2 days ago ago

          What subsidies has Intel received to date?

          • dymk 2 days ago ago
            • georgeburdell 2 days ago ago

              I wrote, what subsidies has Intel received? That press release does not indicate Intel has received any money. Furthermore, any subsidy received in 2024 would not explain why Intel has been falling behind for the past 8 years, per your assertion that subsidies encourage waste.

              • dymk 2 days ago ago

                “Per your assertion” Where did I make that assertion?

      • mschuster91 2 days ago ago

        At the point and scale we're at, I'd classify China's actions as cyberwarfare and would add similar responses to your list.

      • sofixa 2 days ago ago

        US-based companies do the same (scraping content and training models on it, regardless of copyright or licenses or attribution).

        US subsidises and protects tons of industries (agriculture, chips, automobiles, aerospace).

        Does that mean that other countries can impose tariffs and sanctions on the US to punish this obviously anticompetitive and anti-free market behaviour? Or is it just the normal stuff we'd expect a country to do?

        • capitainenemo 2 days ago ago

          Personally, it wasn't licenses or attribution that was the problem with ByteDance's scraping; it was that, unlike every other robot visiting our system, they completely ignored robots.txt to the point of overloading systems.

          Which is why their chunk of Amazon Asia is currently behind a ban.

          I kinda feel like when people say "indiscriminate" they really mean it. There is no regard for courtesy or common sense.

          • red_admiral 2 days ago ago

            For someone who has the resources, I can think of a lot more fun things than a ban.

            I think there was a story a while ago, possibly apocryphal, about someone who ran a disposable email service with a bunch of random-looking domains. They noticed bot traffic repeatedly hitting the page that shows one of their domains but never clicking through to actually activate an address. Guessing that the scraper was trying to find and block all these domains from being used to sign up for its services, the admin of the disposable email site added a function where, if it detected bot traffic, it would occasionally return domains like "gmail.com" in the text field.

        • nonethewiser 2 days ago ago

          > Does that mean that other countries can impose tariffs and sanctions on the US to punish this obviously anticompetitive and anti-free market behaviour?

          Of course they can. They already do.

      • indymike 2 days ago ago

        > Narrow tariffs, competitive subsidies, sanctions, divestment, export restrictions are all viable deterrents.

        The only thing that stops this is when a nation has more to lose than to gain... and that will happen soon, as other emerging economies follow the grand tradition of cheating their way to prosperity. Then slowing down the competition will be the only play.

        All of us born after 1970 have seen Japan, Taiwan, Hong Kong (before being re-absorbed), and China run the cheat your way to the top playbook and will see it at least a few more times.

    • jfoster 3 days ago ago

      Copyright isn't required once any work can be created faster than you can snap your fingers.

      It was originally a way to motivate creation of artistic works, since they used to involve a lot of effort.

      • neilv 2 days ago ago

        > Copyright isn't required once any work can be created faster than you can snap your fingers.

        Copyright isn't required if you use a tool built upon violating copyright?

        Breathing isn't required if someone strangles everyone to death.

        (Now we can all transcend breathing, in the new post-living higher plane of existence. Which surely is viable and great, and totally won't be abused to enrich the worst people, to the detriment of everyone else.)

        • jfoster 2 days ago ago

          That's a very valid point if courts around the world are about to rule against every AI company in hundreds or thousands of court cases.

          Do you believe that is going to happen?

      • icehawk 2 days ago ago

        It was originally created because printers needed a constant stream of new works because once they published something their competitors could immediately copy it and republish it without the initial cost of making the work.

        That's why it's copyright and not artistright.

      • blibble 2 days ago ago

        it still requires a lot of effort, but by other people

        not a lot of effort by the parasites

        • jfoster 2 days ago ago

          OK, but put aside whether you like or dislike this for just a minute or two in order to think about it objectively.

          You are aware of the way things are trending, right? Is the trend showing any sign that it might reverse, for the rest of human civilization's time?

          I liked when things were simpler too, but the reality (for better or worse) seems to be that AI is not going away.

          • johneth 2 days ago ago

            > but the reality (for better or worse) seems to be that AI is not going away.

            AI is in a hype cycle at the moment. Once tech companies realise that they're not going to be able to recoup the billions of dollars they've dumped into the money hole, they'll either raise prices or withdraw products (or a mixture of both).

            Consumers, by and large, don't like generative AI. Or at least they don't like it enough to make it pay for itself.

          • blibble a day ago ago

            > I liked when things were simpler too, but the reality (for better or worse) seems to be that AI is not going away.

            we'll see what happens once the parasites have killed the host

      • diggan 2 days ago ago

        > It was originally a way to motivate creation of artistic works, since they used to involve a lot of effort.

        So true, then abstract expressionism appeared and suddenly copyright wasn't a thing anymore.

      • 2 days ago ago
        [deleted]
    • lovethevoid 2 days ago ago

      You don't see how large companies with vast resources are going to protect their copyright? You really believe that ByteDance et al. are going to make their data freely and publicly available for everyone?

      This is a way to permanently entrench their positions while maintaining ownership. Not an eradication of copyright.

    • sct202 2 days ago ago

      Bytedance has physical presences in most major markets now for ad sales/support so there are measures that can be taken and money is flowing that could be halted if needed.

  • shellac 2 days ago ago

    I'm pretty sure this bot has been operating for much longer than the article suggests (April this year), and truly is a pain. I work in academia and see a lot of ill considered web scraping by ML / AI researchers, but Bytespider is in a league of its own.

  • MaKey 3 days ago ago

    Somehow the headline made me think of a parent with a TikTok account.

    • smittywerben 2 days ago ago

      Title edited out the apostrophe from the headline, plus a few small differences shown in brackets.

      TikTok[’s] parent launched a [web] scraper [that’s] gobbling up [the] world’s [online] data 25x[-times] faster than OpenAI

    • skrebbel 3 days ago ago

      Maybe the people running the crawler are also parenting tiktokers

      • bilekas 3 days ago ago

        Leave no bot behind!

    • beAbU 3 days ago ago

      All of a sudden the title makes so much more sense. Thanks. Now I might read the article actually.

  • benreesman 3 days ago ago

    Indiscriminate scraping is a dick move.

    But if you’re going to do it, do it properly. I would have hung it off the Like button with an ungodly ZooKeeper ensemble and trained a GBDT on which parts of which URLs I could just obliterate with Proxygen.

    We’d have it all in about 4 days. Don’t ask me how I know.

    The second worse thing about the AI megacorps after being evil is being staffed by people who use Cursor.

    Edit: on the back of the valued feedback of a valued commenter I’d like to acknowledge that I made a sloppy mistake and have corrected it in haste, making no excuses. It would be super great if the largest private institutions in the history of the world took the care with, give or take, everything they do that I take with trolling on a forum.

    • vasco 2 days ago ago

      > But if you’re going to do it, do it proerly

      Top shelf unintentional irony.

      • Rinzler89 2 days ago ago

        Exactly. That line of reasoning just feels like established players kicking the ladder from under them in order to maintain their moat, when competitors start to catch up: "Hey, web scraping and data mining should only be allowed the right way, where the right way = our way."

        "Free market" to them = the market where they get to write the rulebook.

    • viraptor 3 days ago ago

      > is being staffed by people who use Cursor.

      Any specific reason?

      • trunch 2 days ago ago

        Not OP, but other than the core functionality they can demo to investors, every AI company seems to be extremely lacking in:

        - web design (basic features take years to implement, and when done break the website on mobile)

        - UI/UX patterns (cookie cutter component library elements forced into every interface without any tailoring to suit how the product is actually used, also makes a Series C venture indistinguishable from something setup in a weekend)

        - backend design (turns out they've been hemorrhaging money on serverless Vercel function calling instead of using Lambda and spending a minute implementing caching for repeat requests)

        - developer docs (even when crucial to business model, often seems AI generated, incomplete, incoherent)

        And this usually comes from hiring far fewer developers than are needed, and those that are hired are 10x Cursor/GPT developers who trust it to have done a comprehensive job at what seems like a functional interface on the surface, and who have little frame of reference or training for what constitutes good design in any of these aspects.

        • benreesman 2 days ago ago

          dawg I ChatGPT’d that license, busy building rn.

        • benreesman 2 days ago ago

          I was the guy trolling, downvote me.

          Don’t downvote the person who submitted a substantial comment far more valuable than its GP.

        • raverbashing 2 days ago ago

          > (turns out they've been hemorrhaging money on serverless Vercel function calling instead of using Lambda and spending a minute implementing caching for repeat requests)

          Oh but why can't the AI do basic backend programming anymore? /s

      • benreesman 3 days ago ago

        Plenty of smart people use Cursor I shouldn’t have been dismissive.

        I meant people who don’t work at Cursor.

      • 3 days ago ago
        [deleted]
    • sshine 2 days ago ago

      > people who use Cursor

      What's Cursor?

    • t_6_t 2 days ago ago

      [flagged]

      • trunch 2 days ago ago

        Only on HN could Lex Fridman's endorsement mean anything when it comes to an IDE

      • benreesman 2 days ago ago

        I remember when people said things like “X is endorsed by John Carmack”.

        But Engine John is just a guest on the real show.

        I didn’t hang onto the first fortune I made in this game, which in general is a real nuisance, but it has the silver lining that I’m still working and likely will be when Zuckerberg realizes that he hasn’t graduated a class of legitimate E5s since 2017.

        • t_6_t 2 days ago ago

          [flagged]

          • benreesman 2 days ago ago

            ^ it’s considered polite to use an “Edit:” annotation when dramatically changing what someone replied to.

            Cursor is a great product, done by brilliant people.

            I very much doubt that they set out to dramatically amplify the byte-denominated output and performance-cycle clout of the exact group of people who want to generate their code but find emacs macros or protobuf or any of the more sophisticated zero-temperature codegen mechanisms too high a complexity bar to clear before blasting metric tonnage of generated slop into our collective lives.

            The target audience of Cursor is the last group of people you want jizzing on your codebase at scale while simultaneously capturing mindshare proportional to NVIDIA’s market capitalization.

            It’s somewhere between dorky and cringe when Jensen signs some girl’s breasts like he’s Mick Jagger or something. That vibe in your repository is my job security in ten years.

  • jl6 3 days ago ago

    I have observed this bot requesting URLs that haven’t been live for over a decade, and to which no reference can now be found in search engines. I imagine there must be a private trade in URL lists.

    • is_true 2 days ago ago

      maybe they are using commoncrawl, webarchive, yandex as indexes?

      • jefozabuss 2 days ago ago

        In addition to those it's also possible they just found a website that published a scraped list back then and got de-indexed for obvious spammy content.

        I would not be surprised if there are still some auto generated link directories left from the "golden ages" of blackhat.

  • koolba 3 days ago ago

    > The Bytespider bot, much like those of OpenAI and Anthropic, does not respect robots.txt, the research shows. Robots.txt is a line of code that publishers can put into a website that, while not legally binding in any way, is supposed to signal to scraper bots that they cannot take that website’s data.

    Do any of these scrapers uniquely and unambiguously identify themselves as a bot?

    Or are those days long over?

    • netdevnet 3 days ago ago

      Some of the scrapers used by big companies do identify themselves as bots by using unique user agents. Of course, it does not mean that they don't have other bots running around without the bot user agent name.

      Whether those days are over or not will greatly depend on the outcome of the ongoing New York Times vs. OpenAI lawsuit. If OpenAI wins, then it pretty much green-lights all the other scrapers to feast upon the web.
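
      In practice that identification is just a substring in the User-Agent header, so it's easy to match on; a rough Python sketch (the token list is illustrative, not exhaustive):

        AI_BOT_TOKENS = ("Bytespider", "GPTBot", "ClaudeBot", "CCBot")

        def is_declared_ai_bot(user_agent: str) -> bool:
            """True if the client admits to being a known AI crawler."""
            ua = user_agent.lower()
            return any(token.lower() in ua for token in AI_BOT_TOKENS)

        # e.g. in middleware: if is_declared_ai_bot(request.headers.get("User-Agent", "")): return 403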

    • rockwotj 2 days ago ago

      I worked for a short time on SearchGPT, and I can tell you OpenAI does respect robots.txt; at least it did when I was there, and it does now. They are also careful to shard per domain and only crawl each domain at a small rate (~1 qps) so as not to DDoS the site. OpenAI also uses User-Agent strings to identify itself: https://platform.openai.com/docs/bots

      They have dedicated user agents for search crawling, when a user directly asks about a site and for training data.
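
      A per-domain rate cap like that is easy to approximate; a toy sketch of the idea (not OpenAI's actual implementation):

        import time
        from urllib.parse import urlparse

        class PerDomainLimiter:
            """Allow at most one request per domain every `interval` seconds (~1 qps)."""

            def __init__(self, interval: float = 1.0):
                self.interval = interval
                self.last_hit: dict[str, float] = {}

            def wait(self, url: str) -> None:
                domain = urlparse(url).netloc
                elapsed = time.monotonic() - self.last_hit.get(domain, 0.0)
                if elapsed < self.interval:
                    time.sleep(self.interval - elapsed)
                self.last_hit[domain] = time.monotonic()

        # limiter = PerDomainLimiter()
        # limiter.wait("https://example.com/page")  # blocks until this domain is safe to hit again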

      • jsheard 2 days ago ago

        > They are also careful to shard per domain and only crawl each domain at a small rate (~1 qps) as to not ddos the site.

        Maybe that's their intent, but this was only a month ago: https://www.gamedeveloper.com/business/-this-was-essentially...

        > "The homepage was being reloaded 200 times a second, as the [OpenAI] bot was apparently struggling to find its way around the site and getting stuck in a continuous loop," added Coates. "This was essentially a two-week long DDoS attack in the form of a data heist."

        • jefozabuss 2 days ago ago

          Maybe someone went against the rule of deploying on a Friday, ouch.

    • spiderfarmer 3 days ago ago

      This one does and I blocked them categorically from all my domains.

    • jeroenhd 3 days ago ago

      Most of the good ones will tag themselves in the user agent and follow robots.txt.

      The ones that don't are the ones people are trying to block the most. Sometimes Google or Bing go crazy and start scraping the same resource over and over again, but most scraping tools causing load peaks are the badly written/badly configured/malicious ones.

      • stroupwaffle 2 days ago ago

        I'm thinking a lot of those issues might be related to "smart" scraping which parses JavaScript. Sites could lean in to the bots and just make it easier for them to scrape by removing JavaScript from the websites.

        I realize this is somewhat off-topic, but the big companies kind of destroyed the internet with all the JavaScript frameworks and whatnot.

    • diggan 3 days ago ago

      > Does any of these scrapers uniquely and unambiguously identify themselves as a bot?

      It seems like all of them do, yeah: https://github.com/eob/isai/blob/b9060db7dc1a7789b322b8c2838...

      Not sure if they're really "scrapers" though, if they're initiated by a user for a single webpage/website, more like "user-agents" in that case, unless it automatically fans out from there to get more content.

      • 3 days ago ago
        [deleted]
  • bilekas 3 days ago ago

    > does not respect robots.txt research shows.

    It would be nice then for the investigators to help people with the identifying markers for such crawlers. Apart from a mention of darkvisitors, which it seems is a paid service to "Block agents who try to ignore your robots.txt"

    I'm not sure how much that could be trusted given their business model also.

  • buro9 2 days ago ago

    Facebook's scraper is also a heavy hitter.

    Which does not respect robots.txt and definitely is just scraping.

    AS blocks are the only really effective tool now; there are many scrapers that do not even use an honest user agent.
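
    Once you've exported the prefixes announced by the AS you want to block, the check itself is trivial; a stdlib Python sketch (the ranges below are documentation placeholders, not real ByteDance allocations):

      import ipaddress

      # In practice, load the announced prefixes for the target AS numbers
      # from a BGP/whois dump; these are RFC 5737 placeholder ranges.
      BLOCKED_NETWORKS = [ipaddress.ip_network(p) for p in ("192.0.2.0/24", "198.51.100.0/24")]

      def is_blocked(ip: str) -> bool:
          addr = ipaddress.ip_address(ip)
          return any(addr in net for net in BLOCKED_NETWORKS)

      # is_blocked("192.0.2.10") -> True (placeholder range)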

    • wiredfool 2 days ago ago

      Facebook and someone who’s using a Firefox UA are the big hitters for me today, each at a sustained 6 req per second for the last 24 hours on one site.

      Today is actually pretty good, there’s some real looking UA traffic in the top 10.

  • 2 days ago ago
    [deleted]
  • wtk 2 days ago ago
  • kgen 2 days ago ago

    To be honest, it's probably not enough to just block these scrapers if they are acting maliciously. People should start serving generated content back to them and see how long it takes for them to catch on and fix the problem.
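
    Even a crude version of that is cheap to run: match the bot by user agent and hand it plausible-looking filler instead of the real page. A toy Flask-style sketch (assumes Flask is installed; the bot list and wording are made up):

      import random
      from flask import Flask, request

      app = Flask(__name__)
      BOT_TOKENS = ("Bytespider", "GPTBot", "ClaudeBot")  # illustrative list

      def decoy_page() -> str:
          """Endlessly varied filler for scrapers that ignore robots.txt."""
          words = ["synergy", "blockchain", "artisanal", "quantum", "holistic", "paradigm"]
          return " ".join(random.choice(words) for _ in range(500))

      def real_page(path: str) -> str:
          return "normal content for human visitors"  # stand-in for the real handler

      @app.route("/", defaults={"path": ""})
      @app.route("/<path:path>")
      def serve(path):
          ua = request.headers.get("User-Agent", "")
          if any(token in ua for token in BOT_TOKENS):
              return decoy_page()  # feed the scraper junk
          return real_page(path)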

  • OuterVale 2 days ago ago
  • 2 days ago ago
    [deleted]
  • nubinetwork 2 days ago ago

    > The China-based parent company of video app TikTok released its own web crawler or scraper bot, dubbed Bytespider, sometime in April

    Uh, no... bytespider has been around for a long time...

  • sflefties 2 days ago ago

    [dead]

  • aaron695 3 days ago ago

    [dead]

  • sieabahlpark 2 days ago ago

    [dead]

  • OutOfHere 2 days ago ago

    Just how would a scraper catch up with the internet if not by accelerating the rate? It is to be expected if the scraping is to succeed.

  • welder 2 days ago ago

    So what, who cares? Is this newsworthy? It's definitely not something to get upset about, web scraping is a normal part of the internet.