Funny seeing this here now, as I _just_ finished archiving an old MyBB PHP forum. Though I used `wget` and it took 2 weeks and 260GB of uncompressed disk space (12GB compressed with zstd), and the process was not interruptible and I had to start over each time my hard drive got full. Maybe I should have given HTTrack a shot to see how it compares.
If anyone wants to know the specifics of how I used wget, I wrote it down here: https://github.com/SpeedcubeDE/speedcube.de-forum-archive
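For anyone curious, the general shape of such a run is a handful of wget's standard mirroring flags (this is a hedged sketch, not necessarily the exact command from the repo; example.com stands in for the real forum):

```shell
# Recursive mirror suitable for a forum: follow links, rewrite them for
# offline browsing, save dynamically-generated pages with .html extensions,
# and grab the CSS/JS/images each page needs. example.com is a placeholder.
wget --mirror \
     --convert-links \
     --adjust-extension \
     --page-requisites \
     --no-parent \
     --wait=1 --random-wait \
     https://example.com/forum/
```

`--wait`/`--random-wait` keep the crawl polite; drop them if you control the server and want it to finish faster.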
Also, if anyone has experience archiving similar websites with HTTrack and maybe know how it compares to wget for my use case, I'd love to hear about it!
I've tried both to archive EOL websites and had better luck with wget; it seems to recognize more links/resources and do a better job, so it was probably not a bad choice.
Great tool. Does it still work for the "modern" web (i.e. now that even simple/content websites have become "apps")?
Nope. It is for the classic web (the only websites worth saving anyway).
Even for classic web, if it's behind cloudflare, then HTTrack no longer works.
It's a sad point to be at. Fortunately, the SingleFile extension still works really well for single pages, even when they are built dynamically by JavaScript on the client side. There isn't a solution for cloning an entire site though, at least not one that I know of.
If it is Cloudflare human verification, then HTTrack will have an issue. But in the end it's just a cookie, you can use a browser with JS to grab the cookie, then feed it to httrack headers.
If Cloudflare DDoS protection is an issue, you can throttle HTTrack's requests.
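Concretely, that workflow can be sketched like this. HTTrack reads a Netscape-format cookies.txt from its project directory, and `-F`/`-c`/`-%c`/`-A` set the user agent, connection count, connection rate, and bandwidth cap; the cookie value, site, and user agent below are all placeholders:

```shell
# Put the cf_clearance cookie grabbed from a real browser session into the
# Netscape-format cookies.txt that HTTrack reads from its project directory.
mkdir -p mirror
printf '.example.com\tTRUE\t/\tTRUE\t0\tcf_clearance\tPASTE_VALUE_HERE\n' \
  > mirror/cookies.txt

# Match the user agent of the browser that obtained the cookie (-F), and
# throttle: 2 connections (-c2), 1 new connection/s (-%c1), ~25 KB/s (-A25000).
httrack 'https://example.com/' -O mirror \
  -F 'Mozilla/5.0 (placeholder: copy your browser user agent here)' \
  -c2 -%c1 -A25000
```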
> you can use a browser with JS to grab the cookie, then feed it to httrack headers
They also check your user agent, IP, and JA3 fingerprint (and ensure they match the ones that got the cookie), so it's not as simple as copying some cookies. This might just be for paying customers, though, since it doesn't do such heavy checks for some sites.
One time I was trying to create an offline backup of a botanical medicine site for my studies. Somehow I had turned off the link-depth limit and made it follow offsite links, and then forgot about it. A few days later the machine crashed due to a full disk, from trying to cram as much of the WWW onto it as it could.
That is awesome.
I don't get it: the last release was in 2017, but on GitHub I see more releases...
So did the developer of the GitHub repo take over and keep updating/upgrading it? Very good!
I use this to double-check which of my web app's folders/files are publicly accessible.
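One way to sketch that kind of check is a spider-mode crawl that downloads nothing and just logs what is reachable (this sketch uses wget rather than HTTrack, and example.com/app/ is a placeholder for the web app root):

```shell
# Crawl recursively without saving anything (--spider) and log every URL
# the crawler can reach; anything in the resulting list is publicly
# accessible to an unauthenticated client.
wget --spider -r --no-parent -o crawl.log https://example.com/app/
grep -Eo 'https?://[^ ]+' crawl.log | sort -u
```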
Good ol' days
Also known as: static site generator for any original website platform...
Oh wow, that brings back memories. I used HTTrack in the late '90s and early 2000s to mirror interesting websites from the early internet, over a modem connection (and early DSL).
Good to know it's still around; however, now that the web is much more dynamic I guess it's not as useful anymore as it was back then.
> now that the web is much more dynamic I guess it's not as useful anymore as it was back then
It's also less useful because the web is so easy to access now. I remember using it back then to pull things down over the university link for reference in my room (1st year, no network access in rooms at all) or at the house (where modem access was billed per minute).
Of course, sites can still vanish easily these days, so having a local copy can be a bonus, but they're just as likely to go out of date or get replaced, and if not, they're usually archived elsewhere already.