HTTrack Website Copier

(github.com)

71 points | by iscream26 13 hours ago

15 comments

  • Felk 9 hours ago

    Funny seeing this here now, as I _just_ finished archiving an old MyBB PHP forum. Though I used `wget` and it took 2 weeks and 260GB of uncompressed disk space (12GB compressed with zstd), and the process was not interruptible and I had to start over each time my hard drive got full. Maybe I should have given HTTrack a shot to see how it compares.

    If anyone wants to know the specifics of how I used wget, I wrote it down here: https://github.com/SpeedcubeDE/speedcube.de-forum-archive
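    For a rough idea of the shape of such a command, a throttled recursive wget mirror looks something like this (illustrative only; the exact flags used are in the repo above, and forum.example.com is a placeholder):

      # Recursive mirror with page requisites and local link rewriting,
      # throttled so the server isn't hammered. forum.example.com is a placeholder.
      wget --mirror --page-requisites --adjust-extension --convert-links \
           --no-parent --wait=1 --random-wait --limit-rate=500k \
           -o wget.log https://forum.example.com/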

    Also, if anyone has experience archiving similar websites with HTTrack and maybe knows how it compares to wget for my use case, I'd love to hear about it!

    • smashed 8 hours ago

      I've tried both in order to archive EOL websites, and I've had better luck with wget: it seems to recognize more links/resources and do a better job, so it was probably not a bad choice.

  • xnx 11 hours ago

    Great tool. Does it still work for the "modern" web (i.e. now that even simple/content websites have become "apps")?

    • alganet 10 hours ago

      Nope. It is for the classic web (the only websites worth saving anyway).

      • freedomben 9 hours ago

        Even for the classic web, if it's behind Cloudflare, then HTTrack no longer works.

        It's a sad point to be at. Fortunately, the SingleFile extension still works really well for single pages, even when they are built dynamically by JavaScript on the client side. There isn't a solution for cloning an entire site, though, at least not one that I know of.

        • alganet 8 hours ago

          If it is Cloudflare human verification, then httrack will have an issue. But in the end it's just a cookie: you can use a browser with JS to grab the cookie, then feed it to httrack headers.

          If Cloudflare DDoS protection is an issue, you can throttle httrack's requests.
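          Concretely, something along these lines should do it (a sketch: httrack is supposed to pick up a Netscape-format cookies.txt placed in the project directory, so export the clearance cookie from the browser into that file; the URL, user agent, and option values below are placeholders, so double-check them against `httrack --help`):

            # cookies.txt (Netscape format), exported from the browser session
            # that passed the Cloudflare check, sits in the project directory.
            httrack "https://example.com/" -O ./example-mirror \
                -F "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0" \
                -c2 -%c1 -A25000   # 2 sockets, max 1 connection/s, ~25 KB/s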

          • acheong08 6 hours ago

            > you can use a browser with JS to grab the cookie, then feed it to httrack headers

            They also check your user agent, IP, and JA3 fingerprint (and ensure they match the ones that got the cookie), so it's not as simple as copying some cookies. This might just be for paying customers, though, since it doesn't do such heavy checks for some sites.

  • corinroyal 10 hours ago

    One time I was trying to create an offline backup of a botanical medicine site for my studies. Somehow I turned off the link-depth limit and made it follow offsite links. I forgot about it. A few days later the machine crashed due to a full disk, from trying to cram as much of the WWW as it could onto it.

  • oriettaxx 8 hours ago

    I don't get it: the last release on the website is from 2017, while on GitHub I see more recent releases...

    So, did the developer of the GitHub repo take over and keep updating/upgrading it? Very good!

  • subzero06 4 hours ago

    I use this to double-check which of my web app's folders/files are publicly accessible.

  • Alifatisk 10 hours ago

    Good ol' days

  • woutervddn 3 hours ago

    Also known as: a static site generator for any original website platform...

  • dark-star 11 hours ago

    Oh wow, that brings back memories. I used httrack in the late '90s and early 2000s to mirror interesting websites from the early internet, over a modem connection (and early DSL).

    Good to know they're still around; however, now that the web is much more dynamic I guess it's not as useful anymore as it was back then.

    • dspillett 8 hours ago

      > now that the web is much more dynamic I guess it's not as useful anymore as it was back then

      Also less useful because the web is so easy to access now. I remember using it back then to draw things down over the university link for reference in my room (1st year, no network access at all in rooms) or house (with per-minute-costed modem access).

      Sites can of course still vanish easily these days, so having a local copy could be a bonus, but they're just as likely to go out of date or get replaced, and if not, they're usually archived elsewhere already.