It can handle most of what they describe: private/paywalled pages, media enclosures, completely self-contained archives that live locally, ease of use, editing before saving, and ensuring lazy-loaded images are there. You can view the snapshot immediately to check for breakage; it automatically works with adblock and NoScript, and with deleting stuff in the DOM using the element picker, so you can clean each page very efficiently (create a bunch of rules in your adblock by picking elements like in uBlock, so you never have to do those again, then quickly mouse over and delete any remainder); and it stores the final DOM, so you can interact with stuff to make sure it is visible and gets archived.
So what I do ( https://gwern.net/archiving#preemptive-local-archiving ) is: I have a script which calls SingleFile-CLI in a headless Chrome browser to automatically archive everything, and then opens up the original URL + snapshot in my normal Firefox, and I look at the snapshot, then the original. If the snapshot looks good, I simply close the 2 tabs after a few seconds and I'm done; if the snapshot looks bad, then I look at the original and make edits: use uBlock Origin to define any necessary rules (assuming the page isn't already cleaned up by the rules I previously defined), make any minor tweaks to the DOM, and then re-save it manually with the SingleFile browser extension.
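(For concreteness, here is a minimal sketch of that loop, not my actual script: it assumes single-file-cli is installed as `single-file` on the PATH, that Firefox will open both tabs from the command line, and the archive directory and filename scheme are made up for illustration.)

    import subprocess, sys
    from pathlib import Path
    from urllib.parse import quote

    ARCHIVE_DIR = Path.home() / "archive"   # hypothetical snapshot directory

    def archive_and_review(url: str) -> None:
        ARCHIVE_DIR.mkdir(exist_ok=True)
        snapshot = ARCHIVE_DIR / (quote(url, safe="") + ".html")
        # SingleFile-CLI drives a headless Chrome and writes one self-contained HTML file:
        subprocess.run(["single-file", url, str(snapshot)], check=True)
        # Open the original and the snapshot as 2 tabs for a quick eyeball comparison;
        # if the snapshot is broken, clean the live page up and re-save it manually
        # with the SingleFile browser extension.
        subprocess.run(["firefox", url, snapshot.resolve().as_uri()])

    if __name__ == "__main__":
        for u in sys.argv[1:]:
            archive_and_review(u)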
If you use enough adblock rules, then you get a similar effect to the 'templates' described, since it looks like OP is mostly just trying to remove as much as possible.
But since you're archiving the final DOM, you can do anything you like. Something I've done a few times is opening up multiple pages and copy-pasting the key DOM node from each of them into the first one, to create a single consolidated master page, in a way which is a lot easier & more reliable than messing around with the serialized HTML in Emacs.
You can also post-process them.
(Because we use these local archives for 'previews' on Gwern.net, and a fully static self-contained HTML page can easily be 100MB+ with all its fonts and images and stuff, we take the SingleFile snapshots and, for the large ones, 'split' them back up, so loading the .html file doesn't necessarily load everything else: https://github.com/gwern/gwern.net/blob/master/build/deconst... And then you can save a lot of space by running standard optimization tools on the split-out files, eg OptiPNG on the revealed PNGs will save gigabytes of space because so many people fail to do the standard image optimizations.)
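(A rough sketch of what that splitting step looks like, not the actual deconstruct script linked above: this illustration only handles base64 `src` data-URIs for a few image types, and the output-directory naming is made up. It decodes each embedded image back into a real file and rewrites the reference to point at it; the revealed PNGs/JPEGs are then ordinary files you can run OptiPNG and similar optimizers over.)

    import base64, hashlib, re, sys
    from pathlib import Path

    EXT = {"image/png": "png", "image/jpeg": "jpg", "image/gif": "gif", "image/webp": "webp"}

    def split_snapshot(html_path: str) -> None:
        page = Path(html_path)
        assets = page.with_suffix("")            # e.g. foo.html -> foo/ for extracted files
        assets.mkdir(exist_ok=True)
        html = page.read_text(encoding="utf-8")

        def extract(m: re.Match) -> str:
            mime, blob = m.group("mime"), base64.b64decode(m.group("data"))
            name = hashlib.sha1(blob).hexdigest()[:16] + "." + EXT[mime]
            (assets / name).write_bytes(blob)    # the revealed image, now optimizable
            return 'src="%s/%s"' % (assets.name, name)

        data_uri = re.compile(
            r'src="data:(?P<mime>image/(?:png|jpeg|gif|webp));base64,(?P<data>[A-Za-z0-9+/=]+)"')
        page.write_text(data_uri.sub(extract, html), encoding="utf-8")

    if __name__ == "__main__":
        split_snapshot(sys.argv[1])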
Compared to "it typically takes me a few minutes to save a page", I handle the majority of pages in a few seconds, and even the nastiest page where I have to delete a lot is usually like a minute. And since I do like 10 URLs a day, this is quite manageable at scale. (I'm up to >15k snapshots, although an unknown fraction are from an initial bulk archiving so may not be of high quality.)