6 comments

  • solardev 12 hours ago

    Web dev here, but not cybersec focused... if I'm wrong, someone will be along to correct me shortly :)

    That said, I'm reasonably confident that what you want isn't doable/practical, unfortunately :(

    While there are certainly companies that make valuable datasets available over the web, the usual way they prevent mass scraping is by enforcing account limits, which makes retrieval expensive and restricts each request to one tiny slice of data at a time. One example is the mass data harvesting/targeting industry: companies like Meta and Alphabet, or political companies (NGP VAN, ActBlue, etc.). They cross-reference a lot of PII floating around the internet and/or harvest their own, then sell it to advertisers or political campaigns, but only a slice at a time, and at prices that they determine. You can of course pay to scrape any one slice of it, but if you wanted the whole dataset, you'd probably end up paying more than the entire company is worth.
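
    To make the "account limits, one slice at a time" idea a bit more concrete, here's a rough sketch of a per-account quota over a rolling window. The class name, the limit, and the window length are all made up for illustration; a real service would persist this in a shared store and tie it to billing rather than keep it in memory.

```python
import time
from collections import defaultdict
from typing import Optional


class AccountQuota:
    """Hypothetical per-account quota: at most `limit` records per rolling window."""

    def __init__(self, limit: int, window_seconds: float = 86400.0):
        self.limit = limit
        self.window = window_seconds
        # account_id -> list of (timestamp, records_served)
        self._usage = defaultdict(list)

    def allow(self, account_id: str, records: int, now: Optional[float] = None) -> bool:
        """Return True and record the usage if the request fits in the quota."""
        now = time.time() if now is None else now
        cutoff = now - self.window
        # Drop usage entries that have aged out of the window.
        recent = [(t, n) for t, n in self._usage[account_id] if t > cutoff]
        used = sum(n for _, n in recent)
        if used + records > self.limit:
            self._usage[account_id] = recent
            return False
        recent.append((now, records))
        self._usage[account_id] = recent
        return True
```

    Something like `AccountQuota(limit=100).allow("acct-1", 25)` would then gate each request. Note this only limits what a single account can pull; it does nothing about someone registering many accounts, which is why it usually gets paired with real payments.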

    That, or their data is inherently time-sensitive, such that older copies of it aren't as valuable. Stocks, real estate sites, news tickers, etc. come to mind, where sure, you can scrape their stuff, but unless you perform some sort of value-added collation/analysis on top of it, it's going to be stale by the time you serve it to your own users. The data originators are always one step ahead of you.

    If your data isn't proprietary to begin with (i.e. you're not the one making it and adding updates) AND you want it to be publicly accessible without an account... it's only a matter of time before some botnet or another scrapes all of it.

    You can do things to slow down the scraping, such as adding Cloudflare, but realistically, bots and labor are very cheap in much of the world, and if someone really wants your data, they'll get it. It's essentially free to them, especially if you've done all the hard work of collecting it and putting it all on a single website.

    It will always take more time for you to manually add filter permutations than it takes a script and a botnet to enumerate them. They can just tweak parameters and send the requests through thousands of headless browsers running in dispersed instances across the world.
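
    As a rough illustration of how small a filter space really is from a script's point of view, here's a sketch that counts and lists every combination of a few filters. The filter names and values are invented for the example; nothing here is specific to any real site.

```python
from itertools import product
from math import prod
from urllib.parse import urlencode

# Hypothetical filters a site might expose as query parameters.
FILTERS = {
    "state": ["CA", "NY", "TX"],
    "year": [2021, 2022, 2023],
    "category": ["a", "b"],
}


def count_combinations(filters):
    """Total number of distinct filter combinations."""
    return prod(len(values) for values in filters.values())


def enumerate_queries(filters):
    """Yield one query string per combination of filter values."""
    keys = list(filters)
    for values in product(*(filters[k] for k in keys)):
        yield urlencode(dict(zip(keys, values)))
```

    The count is just the product of the number of options per filter, so adding another filter multiplies the total instead of hiding the data. Checking `count_combinations` against your own filters is a quick way to see how many requests a full crawl would actually take.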

    You can require account signup and verification before accessing the data, but that's also trivially faked unless you're requiring real payments.

    Identifying real users vs bots is anything BUT trivial. Google and Cloudflare and hCaptcha have spent decades trying to solve that with huge teams and world-class researchers. And even they only have limited success rates, especially since anybody can spend pennies to hire real humans to run through your captchas. And that problem is only going to get harder, much harder, with all the advancements in machine learning, natural language processing, and machine vision.

    Sorry for the bad news =/ I hope I'm wrong, but I'm fairly confident you can't really accomplish this.

    • markden 11 hours ago

      While you are right, this isn’t what I was hoping to hear :), I do really appreciate the helpful response. Thank you!

      • solardev 11 hours ago

        You're welcome, but also keep in mind that it's just my opinion :) Someone else might come along and tell you all the ways I'm wrong.

        Also, it's not a black & white situation. If your dataset isn't super valuable, or if it's just niche enough, it's possible that adding Cloudflare by itself would be "good enough" protection. It's a LOT better than nothing, and also much better protection than what most people can DIY on their own.

        • markden 11 hours ago

          Yeah, and that’s kind of exactly what I am looking for. This is niche enough that I am likely overly concerned someone would do real work to “steal” it. But I also always lock my car, even if someone can still smash the window. :)

          • solardev 10 hours ago

            That's a good analogy. If you have the first-mover advantage and can earn user loyalty through good UX or whatever, it might not really matter that much even if someone does steal your data. Worth a shot?

            • markden 9 hours ago

              Haha still thinking through that. But potentially!