47 comments

  • Aurornis 2 days ago

    Cool way to self-host archives.

    What I'd really like is a plugin that automatically pulls from archives somewhere and replaces deleted comments and those bot-overwritten comments with the original context.

    Reddit is becoming maddening to use because half the old links I click have comments overwritten with garbage out of protest for something. Ironically the original content is available in these archives (which are used for AI training) but now missing for actual users like me just trying to figure out how someone fixed their printer driver 2 years ago.
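
    A rough sketch of what the core of such a plugin might do, in Python. Everything here is hypothetical: the `id`/`body` fields, the placeholder strings, and the `archive` mapping are assumptions for illustration, not the format of any real archive or of the Reddit API. It only handles comments whose body is a deletion placeholder; spotting bot-overwritten text would need something smarter.

```python
# Hypothetical placeholder strings that mark a comment as gone.
PLACEHOLDERS = {"[deleted]", "[removed]"}


def restore_comments(live_comments, archive):
    """Return copies of live_comments with deleted bodies filled in from archive.

    live_comments: list of dicts with "id" and "body" keys (assumed shape).
    archive: dict mapping comment id -> original body text (assumed shape).
    """
    restored = []
    for comment in live_comments:
        original = archive.get(comment["id"])
        if comment["body"].strip() in PLACEHOLDERS and original is not None:
            # Swap in the archived text and flag the comment as restored.
            restored.append({**comment, "body": original, "restored": True})
        else:
            # Leave untouched comments (and unarchived ones) as they are.
            restored.append({**comment, "restored": False})
    return restored


live = [
    {"id": "c1", "body": "[deleted]"},
    {"id": "c2", "body": "Works for me."},
    {"id": "c3", "body": "[removed]"},
]
archive = {"c1": "Reinstall the printer driver from the vendor site."}

for comment in restore_comments(live, archive):
    print(comment["id"], comment["restored"], comment["body"])
```

    Flagging restored comments rather than silently replacing them keeps it visible that the text came from an archive and not from the live page.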

    • anonymous908213 2 days ago

      That would only really be ironic if people had overwritten their comments to protest LLM training, but the main cause of by far the biggest wave of deletions was Reddit locking down their API. If the result of their protest is that the site is less useful for you, the user, then in fact it served its purpose, as the entire point was an attempt to boycott Reddit, ie. get people to stop using it by removing the user contributions that give the site its only value in the first place.

      • Aurornis 2 days ago

        > If the result of their protest is that the site is less useful for you, the user, then in fact it served its purpose, as the entire point was an attempt to boycott Reddit, ie. get people to stop using it by removing the user contributions that give the site its only value in the first place.

        In practice I just give them more page views because I have to view more threads before I find the answer.

        Reddit's DAU numbers have only gone up since the protest.

        • swed420 a day ago

          > Reddit's DAU numbers have only gone up since the protest.

          And so has the bot activity.

        • anonymous908213 2 days ago

          I did phrase it as "an attempt". In the end the protest probably wasn't as effective as protestors might have hoped, and it didn't get Reddit to change course on their enshittification decisions. I do think it was good that there was an attempt at pushback, at least, when most software users just accept enshittification as normal and continue tolerating whatever abuse their masters throw at them.

    • Gander5739 2 hours ago
    • accrual a day ago

      Just offering another perspective because I see those missing comments too. The author decided they didn't want to participate in public discourse anymore and their comment is gone. So be it. I don't search archives or use tools to undermine their effort. I move on to the next thing.

      I read "it's maddening because ... they decided to use their autonomy and..." and I stop there. So be it.

      • hrimfaxi a day ago

        People use their autonomy to maddening ends—how does the fact that it is of their own volition offer you any comfort? I ask genuinely. Is it something along the lines of recognizing the things you can't change?

        • dzelzs a day ago

          In this case - recognition of an attempt at doing something. Downplaying that is similar to downplaying protests for not achieving anything. At the very least it might have brought the topic of contention to more people's attention, which can be a spark for change. If you have apathy and disdain for attempts at change - it might be worth evaluating what the consequences of that might be at a societal level when that apathy is the norm for harder-to-change things (like politics, big corp practices etc.).

  • NickNaraghi 2 days ago

    Data is available via torrent in this section: https://github.com/19-84/redd-archiver?tab=readme-ov-file#-g...

  • diggings 2 days ago

    This is a neat project, nice work.

    You've probably come across this already but there are alternative archives to PushShift that may have differing sets of posts and comments (perhaps depending on removal request coverage?)

    One is Arctic Shift: https://github.com/ArthurHeitmann/arctic_shift/releases

    Another is PullPush: https://pullpush.io/

  • m463 a day ago

    I wonder if you could use this to "Seed" a new distributed social media thing and just take over from there.

    sort of like forking a project.

  • feconroses a day ago

    Very cool project! Quick question: is the underlying Pushshift dataset updated with new Reddit data on any regular cadence (daily/weekly/monthly), or is this essentially a fixed historical snapshot up to a certain date? Just want to understand if self-hosters would need to periodically re-download for fresh content or if it's archival-only.

  • alcroito 2 days ago

    I tried spinning up the local approach with docker compose, but it fails.

    There's no `.env.example` file to copy from. And even if the env vars are set manually, there are issues with the mentioned volumes not existing locally.

    Seems like this needs more polish.

  • elSidCampeador 2 days ago

    I wonder if this can be hooked up with the now-dead Apollo app in some way, to get back a slice of time that is forever lost now?

    • 19-84 2 days ago

      the API should allow for a lot of different integrations

  • twobitshifter a day ago

    If reddit was a squeaky clean place, or if I could pick certain subs, maybe I would be interested, but I really wouldn't want ALL of reddit on my machine even temporarily.

    • 19-84 a day ago

      the torrent has data for the top 40,000 subs on reddit. thanks to watchful1 splitting the data by subreddit, you can download only the subreddit you want from the torrent

      • Imustaskforhelp a day ago

        I am going to be honest: this looks really cool.

        40,000 subs is a good number and I hope it can be extended to even more subreddits.

        Perhaps we can finally migrate all or much of the data to lemmy instances to get them up and running as well.

        Thank you for creating this. It opens up a lot of interesting opportunities.

  • leshokunin 5 hours ago

    Is there a docker compose?

  • bkovacev 2 days ago

    Is there any way to check if a subreddit that was made private (2-3 years ago) is in the data dump?

  • vivzkestrel a day ago

    - slightly offtopic here but does anyone have a similar data set of all youtube channels out there?

    - details probably include the 400 million youtube accounts, channel id, name, creator url, etc

  • blks a day ago

    Does it also contain the countless NSFW posts?

  • blks a day ago

    Opened the live demo, went into programming subreddit, felt like I was showered with liquid shit. I tend to forget what kind of edgelord hellhole Reddit was (and still is sometimes).

  • dvngnt_ 2 days ago

    I want to do the same thing for tiktok. I have 5k videos starting from the pandemic downloaded. want to find a way to use AI to tag and categorize the videos to scroll locally.

  • drob518 a day ago

    This is a great way to participate in arguments you missed three years ago.

  • kylehotchkiss 2 days ago

    _Hacker News collectively grabs the dataset to train their models on how to become effective reddit trolls_

    • layer8 2 days ago

      Don’t we have enough of those already? ;)

    • 19-84 2 days ago

      the API and MCP server are very powerful ;)

  • justsomehnguy a day ago

    Appreciated.

    EDIT: Is there any cheap way to search? I have an MS TechNet archive which is useless without search, so I really want to know a way to have a cheap local search w/o grepping everything.

    • 19-84 a day ago

      redd-archiver uses postgres full text search. for static search you could use lunr.js
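
      For a sense of what a cheap local search without grepping can look like, here is a minimal sketch of an inverted index in plain Python. It is not how redd-archiver or lunr.js work internally (both do much more, such as stemming and ranking); the function names and the sample documents are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    # Start from the first token's matches, then intersect with the rest.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample documents standing in for archived pages.
docs = {
    "page1": "How to fix a printer driver on Windows",
    "page2": "Printer setup guide for Linux",
    "page3": "Driver signing requirements",
}
index = build_index(docs)
print(search(index, "printer driver"))
```

      The index is built once, so each query is a few set lookups instead of a scan over every file. For a large archive it could be persisted to disk, but at that point Postgres or lunr.js is probably the saner option.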

  • syngrog66 2 days ago

    Did you pay all the people who created its content?

    • nullandvoid 2 days ago

      Did anyone ever comment on reddit with an expectation of pay?

      It's an open forum - similar to here, whatever I post is in a public forum and therefore I expect it to be used / remixed however anyone wants.

      • nozzlegear 2 days ago

        > Did anyone ever comment on reddit with an expectation of pay?

        Maybe Gallowboob

        • Sohcahtoa82 a day ago

          That's a name I haven't seen in a LONG time.

    • devilsdata 2 days ago

      I have no problem with this being downloaded for personal use, in fact that's a good thing. But of course we both know it'll be used to train AI.

    • antisthenes a day ago

      Reddit didn't pay me for posting either. Not that I posted in the last decade.