54 comments

  • Tiberium 10 hours ago

    Important to note that the author assumes that this is ByteDance, but the ASN belongs to their cloud solution BytePlus, which could be used by other companies.

    https://x.com/sauceo_/status/1842866301066518875

    https://www.byteplus.com/en

    • edouard-harris 9 hours ago

      The author does address this possibility in a reply:

      > it's very unlikely to be someone else because pricing is astronomical. you also have to "contact sales" to get access to anything outside of a free trial. no one would pay that much for a block of ips with terrible reputation

      https://x.com/uwukko/status/1842866807763308615

      • rfoo 8 hours ago

        > pricing is astronomical. you also have to "contact sales" to get access to anything outside of a free trial

        You don't have to contact sales if you are a Chinese-speaking customer. And pricing is fine. ByteDance has a different brand for their cloud services in China: https://www.volcengine.com/ [1]. But of course the underlying infrastructure is all the same.

        This is very likely done by a Chinese customer using ByteDance's cloud service.

        [1] Well, Alibaba Cloud did this too, and ByteDance is copying Alibaba 1:1 (who in turn is copying AWS) so I'm not surprised. But at least Alibaba named their international brand "Alibaba Cloud" and their CN one "AliCloud", similar enough.

      • Tiberium 9 hours ago

        Yes, but this is also pure speculation, since the product clearly exists, has customers, and even has a free trial.

    • teractiveodular 9 hours ago

      This. Bytedance's official spider has a clear User-Agent tagged Bytespider, but OP didn't mention what they're seeing.

      • jsheard 9 hours ago

        This isn't spider traffic though, the traffic pattern indicates that it's a special-purpose bot designed to hit Cobalt's internal API in particular. A generic spider probably wouldn't even be able to find the API endpoints that are only referenced by JavaScript, never mind consistently hit the API with a valid video URL from a residential proxy then switch to a different IP address to download the result every time.

  • Thomashuet 10 hours ago

    Short version: a service known for evading YouTube's bot protection is complaining that ByteDance is bypassing their own protections. I agree that it's not nice of ByteDance, but I find it hypocritical of Cobalt to call it evil.

    • lunarmony 10 hours ago

      > cobalt was created for public benefit, to protect people from ads and malware pushed by its alternatives

      can't say the same for bytedance, which is designed to exploit users with various ads

      • appendix-rock 9 hours ago

        I feel like you’re missing the point on purpose? Cobalt is asserting that it’s doing good based on the shadier behaviour of its competitors. But can you justify Cobalt in isolation any more than you can justify whoever was scraping it?

      • whywhywhywhy 8 hours ago

        It was created for donation money, let's not do mental gymnastics to justify one type of scraping and vilify another. Scraping is scraping and it's either all fair game or it's not all fair game.

    • h4x0rr 10 hours ago

      You can't compare that... cobalt doesn't DDOS YouTube

      • jsheard 10 hours ago

        Cobalt is also completely free, without ads or any other monetization besides donations, it's purely meant to help normal people download videos for normal people purposes. It's not like they're a for-profit data harvesting outfit complaining about getting abused by another for-profit data harvesting outfit.

        • Thomashuet 10 hours ago

          You're just saying that Cobalt is small and non-profit so they must be good and YouTube and ByteDance are big and rich so they must be evil. But if you only look at what they are actually doing here, it's very similar: bypassing protections to use a service in a way that the service provider doesn't like.

          • phoronixrly 10 hours ago

            ByteDance and YouTube are evil, but not because they are big and rich. Cobalt is good, but not because they are small and a non-profit.

          • loloquwowndueo 9 hours ago

            If ByteDance are so big and rich, why don’t they implement their own scraping solution instead of abusing a small service like Cobalt?

            • sangnoir 5 hours ago

              ...Because someone scraping from a Bytedance IP range is not necessarily Bytedance, just like requests from an AWS IP do not imply Amazon authored the spider

            • snvy 9 hours ago

              Cobalt is bypassing protections to allow legitimate YouTube users to download single videos without causing harm and with no monetary incentives. ByteDance is mass downloading thousands of videos, all for monetary incentives while heavily breaking the TOS and potentially ignoring copyright laws. Similar, but one is doing way more harm than the other.

            • whywhywhywhy 8 hours ago

              > and with no monetary incentives

              Donations are a monetary incentive

              > while heavily breaking the TOS and potentially ignoring copyright laws

              Cobalt also breaks the TOS and ignores copyright laws. Personally I don't think that matters, but a double standard where it's "ok when they do it" for one company, while copyright laws and the TOS get used as a weapon against one you don't like, just makes me think it really isn't about the TOS or copyright, is it?

              It also just gives YouTube ammunition to impose stricter protection against smaller violators like Cobalt, or people running yt-dlp themselves.

      • criddell 10 hours ago

        Cobalt didn’t say the DDOS was evil, they said:

        “bytedance's scraper was specifically built to go around cloudflare & other web security solutions, which is just genuinely evil”

        So I would say it’s a fair comparison.

        • dewey 10 hours ago

          > built to go around cloudflare

          Then they either didn't set up CF correctly or they just use the mode in most headless browsers that bypasses default CF protection when CF is not in attack mode.

    • afavour 8 hours ago

      I don't see the hypocrisy here. Cobalt is a small, free service that results in Google (or so the argument goes) making less profit. ByteDance are a giant money printing machine using that free service for their own ends. They have more than enough resources to not abuse a free one.

  • conradfr 10 hours ago

    Some time ago I noticed the ByteDance spider very aggressively scraping my modest side project and, more importantly, modest server.

    I wrote to them to please stop (I think the address was in the user agent or something), they replied sorry and actually stopped.

    Not sure why all these crawlers can't pace themselves.

    • throwaway98797 8 hours ago

      devs are promoted on how fast they get done

      faster, bigger, MOAR

      sometimes it’s hard to have nice things

  • xbmcuser 10 hours ago

    I think Chinese ISPs can't store some data, as they might get in trouble with Chinese censors, so they don't cache it. Then if something gets slightly viral you see huge traffic from one IP that might be a VPN. On the Reddit torrent channel you get similar questions when, ahem, a Linux ISO is downloaded thousands of times from the same IP.

    • lithiumii 9 hours ago

      That could be a completely different problem. In China many people run PCDN (P2P CDN) nodes for profit. The ISPs detect (and ban) such PCDN nodes by checking the ratio of uploaded to downloaded traffic. To balance this ratio and thus avoid being detected, these people download popular torrents again and again without uploading at all.

  • 3np 10 hours ago

    Interesting timing. The last ~month or so we've seen a drastic shift in YouTube availability. Stricter enforcement of authentication tokens (including breaking some legacy clients) and IP blocking. Loads of Invidious instances either shut down or not able to serve videos anymore. yt-dlp not working at all over an increasing number of VPNs and proxies.

    Maybe this is some ByteDance engineers getting really desperate and resorting to abusing every youtube proxy service they can because apparently they do have a residential proxy network which doesn't cut it anymore?

    Unless it's just a cost-optimization measure (residential proxy traffic is relatively pricey).

    • A4ET8a8uTh0 10 hours ago

      Yeah, I noticed this as well. I think the window of what some might remember as old youtube is closing forever sooner than anticipated. As I may have suggested on this forum before, if you have anything in particular you want to archive, you would be wise to have a plan to do it sooner rather than later. Space is cheap enough and I assume most people won't want to archive the entire net ( I know data hoarders exist and god bless them, but I assume they will be ok ).

      • Wowfunhappy 9 hours ago

        Short of full-on using Widevine/EME for all videos (which I assume would lock out too many devices), how much more could YouTube do? As long as the data is being streamed to your computer, there will be a way to capture it, right?

        • 3np 9 hours ago

          I can very much imagine the site requiring Widevine/EME across the board for anything better than 480 and crusty audio, not that long into the future.

          That's already the case for some (anecdotally increasing) number of videos.

        • treyd 8 hours ago

          This would encourage a lot more people to want to break Widevine. :)

        • A4ET8a8uTh0 9 hours ago

          A qualified yes is probably in order (and a more knowledgeable person can likely chime in if I misstate something). It is and always has been a cat and mouse game, not completely unlike game or movie piracy. As you stated, if you can see it on your PC, there is likely a means to capture it.

          Still, notice how most of the low-effort avenues are slowly being cut off one by one. I will use a non-YouTube example. Not that long ago, I was able to rip Blu-rays using an off-the-shelf external Blu-ray writer, but new firmware on currently sold drives removes that ability.

          Now, Google typically won't be (and isn't) everyone's hardware provider, but there are ways they could degrade the 'non-sanctioned' experience in the browser they can (and do) control.

          Granted, in Firefox (and other non-Google browsers) it may not be as simple, but the future there is not as straightforward either, given Mozilla's trajectory and financial dependence (and moves).

          In short, I agree with you, but note that initially it was genuinely trivial to download YouTube videos. This has changed over the years.

  • horsebridge 10 hours ago

    Anybody running a site with data that is useful for AI will learn how horrible bytedance is.

  • HeralFacker 9 hours ago

    Blackhole Bytedance's ASNs. Cobalt is an end-user tool, so there's not much legitimacy to a cloud service accessing it.

  • seanhunter 9 hours ago

    A few big sites that I'm familiar with have, in the last six months, seen ByteDance become by far the most aggressive scraper in their logs.

  • lawrenceyan 3 hours ago

    Who else just found out Cobalt exists from this post? Wow, this is lit.

  • FatalLogic 10 hours ago

    >i can safely assume that bytedance was scraping youtube videos by abusing our private api

    I'm not doubting the OP. But why is ByteDance doing this? What does that company get out of scraping YouTube?

  • ulrischa 10 hours ago

    ByteDance is also massively scraping official governmental sites with strange URL patterns

  • sergiotapia 9 hours ago

    why would they use cobalt instead of yt-dlp? is it to mask their originating IPs and such?

  • jsheard 10 hours ago

    @uwukko's full thread for those who don't have a Twitter account:

    earlier today i noticed very elevated traffic to cobalt api that looked a lot like ddos. it turned out to be bytedance!

    we can't tell what videos they were downloading or where the original request comes from as it's built to go around all limiters, but there's still a pattern

    first request: json post with content url & settings from a residential proxy

    second request: tunnel with pseudo microsoft edge on windows user agent & youtube origin/referer, from byteplus ip

    third request: same tunnel with aria2 user agent & no referer, also from byteplus ip
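
    The three-step sequence described above could be checked against logs roughly like this. It is only a sketch: `LoggedRequest` and its fields are hypothetical, not Cobalt's actual log format, and the `from_byteplus` flag is assumed to be resolved elsewhere (for example by an ASN lookup). The first check cannot tell a residential proxy from a real user, so it only confirms the request is a POST from outside BytePlus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoggedRequest:
    """Hypothetical log record; field names are illustrative only."""
    method: str
    user_agent: str
    referer: Optional[str]
    from_byteplus: bool  # assumed to be resolved elsewhere, e.g. by ASN lookup


def looks_like_edge_on_windows(user_agent: str) -> bool:
    # Edge identifies itself with an "Edg/" token; Windows builds mention "Windows NT".
    return "Edg/" in user_agent and "Windows NT" in user_agent


def matches_reported_pattern(
    first: LoggedRequest, second: LoggedRequest, third: LoggedRequest
) -> bool:
    """Return True if three consecutive requests fit the sequence in the thread."""
    # 1. JSON POST from outside the BytePlus range (the residential proxy step).
    setup_ok = first.method == "POST" and not first.from_byteplus
    # 2. Tunnel request: Edge-on-Windows user agent, YouTube referer, BytePlus IP.
    tunnel_ok = (
        second.from_byteplus
        and looks_like_edge_on_windows(second.user_agent)
        and second.referer is not None
        and "youtube.com" in second.referer
    )
    # 3. Same tunnel again: aria2 user agent, no referer, BytePlus IP.
    download_ok = (
        third.from_byteplus
        and third.user_agent.lower().startswith("aria2")
        and third.referer is None
    )
    return setup_ok and tunnel_ok and download_ok
```

    A match here is a hint rather than proof, since every header involved is chosen by the client.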

    cobalt is a media downloader, mostly known for supporting youtube even at worst times. cobalt's tunnel is either a proxy stream or ffmpeg live render

    considering all of this, i can safely assume that bytedance was scraping youtube videos by abusing our private api

    with release of v10 we implemented cloudflare turnstile, but later disabled it due to access issues by a chunk of our users

    enabling it back brought the server load to normal levels and stopped bytedance from choking our servers cuz they didn't account for this (yet)

    before resorting to turnstile, i attempted using other cloudflare services, but none of them seemed to help much

    my theory is that bytedance's scraper was specifically built to go around cloudflare & other web security solutions, which is just genuinely evil

    this incident caused a few minutes of api unavailability, but taught me that cobalt (and probably anything else) can no longer exist without active bot/scraping protection

    im really glad that cloudflare turnstile exists because i don't know what i'd do without it here

    byteplus AS that was spamming requests is 150436 and last seen ip range was 207.166.160.0/21
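
    For anyone who wants to test log entries against that range, a minimal sketch using Python's standard `ipaddress` module follows. The function name is made up for illustration, and the range is only the "last seen" one quoted above, so it may not cover everything announced by AS150436.

```python
import ipaddress

# The "last seen" range quoted in the thread; other ranges may exist.
REPORTED_RANGE = ipaddress.ip_network("207.166.160.0/21")


def in_reported_range(ip: str) -> bool:
    """Return True if the address falls inside the reported /21."""
    return ipaddress.ip_address(ip) in REPORTED_RANGE
```

    A /21 spans 2048 addresses, here 207.166.160.0 through 207.166.167.255.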

    the amount of unique users on cloudflare analytics rapidly increased by 2.25 times and didn't go down since, while web analytics (plausible) show no increase whatsoever

    • gnfargbl 10 hours ago

      Sounds like they're using residential proxies for set-up in order to look like normal users, but then switching back to their own ASN for content because residential proxies are expensive.

      > im really glad that cloudflare turnstile exists because i don't know what i'd do without it here

      Why not just blackhole the byteplus ASN?

      • sandworm101 9 hours ago

        And how many of those residential IPs belong to work-from-home bytedance employees running work laptops? Any large company these days has direct access to a pool of innocent residential IPs. The weaponization of that pool may be more evil than the actual scraping imho.

    • miki123211 9 hours ago

      Bytedance seems to have increased its scraping efforts significantly.

      I've posted a canary token[1] URL as a Mastodon post, to check how scrape-resistant Mastodon actually is (it is not resistant at all), and have been getting quite a few hits from the ByteDance spider recently.

      Last hit is from 47.128.114.151, Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)

      Edit: added missing footnote.

      [1] https://canarytokens.org
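
      A small sketch of how a hit like the one above could be recognised from its user agent. The helper names are hypothetical, and the check only catches crawlers that declare themselves: the header is self-reported, and the Cobalt traffic discussed elsewhere in this thread reportedly used Edge and aria2 user agents instead.

```python
import re
from typing import Optional

# Loose pattern for an email-like contact address inside a user-agent string.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def is_declared_bytespider(user_agent: str) -> bool:
    """Return True if the user agent names itself Bytespider."""
    return "bytespider" in user_agent.lower()


def contact_address(user_agent: str) -> Optional[str]:
    """Return the first email-like token in the user agent, if any."""
    match = EMAIL_RE.search(user_agent)
    return match.group(0) if match else None
```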

      • diggan 8 hours ago

        > to check how scrape-resistant Mastodon actually is (it is not resistant at all)

        That's expected, no? It's a social network that is explicitly designed to be as open as possible, as it's using ActivityPub. To be "scrape-resistant" would be to go against the very goal of Mastodon.

        • miki123211 2 hours ago

          Yes and no.

          If you look at the technical side of things, you're absolutely right. If you look at the social side, however, there's a lot of talk on there about opting out of scraping, scrapers being bad, not wanting to be part of AI training and so on. Naming-and-shaming people who have been caught scraping is a routine practice.

          I think that many Mastodonians believe that defederating from scraper-friendly instances and blocking scraper-like requests on their own protects them; this was a way to show that this very much isn't true.

        • Aachen 8 hours ago

          Exactly, this is how I want it to be. I post there because it's not another walled garden that profits from lock-in
