Mirroring a site running AmuseWiki
Starting with AmuseWiki version 2.031, released on October 14, 2017, each AmuseWiki site provides a /mirror/ path offering a static version of the site, suitable for mirroring, backup and batch download. See e.g. https://amusewiki.org/mirror
Starting with AmuseWiki version 2.2, released on March 20, 2018, the list of files to download is provided at two URLs: /mirror.txt (basic version) and /mirror.ts.txt (advanced version).
E.g. https://amusewiki.org/mirror.txt and https://amusewiki.org/mirror.ts.txt
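The basic list is simply one absolute URL per line, ready to be fed to wget. Hypothetically, a few lines might look like this (the actual entries depend on the site's contents):
https://amusewiki.org/mirror/titles.html
https://amusewiki.org/mirror/some-author/some-title.epub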
Download the whole site (the easy way)
If you have a GNU/Linux box, wget is already installed and mirroring is as easy as running this command (using https://amusewiki.org as an example):
wget -q -O - https://amusewiki.org/mirror.txt | wget -x -N -q -i -
Explanation:
The first wget call downloads the list of files and pipes it (-O -) to the second call, which downloads the piped list (-i -), creates the needed directories (-x) and checks the timestamps (-N), so files that have not been modified are not downloaded again. All this happens quietly (-q).
Windows
If you don’t have wget installed or you can’t pipe commands, the procedure is a bit different.
First you need to install wget. See
https://www.gnu.org/software/wget/,
https://www.gnu.org/software/wget/faq.html#download and
https://eternallybored.org/misc/wget/
Please keep in mind that this is a command line utility, so you are going to need the Windows command prompt.
Go to the directory where you want to create the mirror. Download https://amusewiki.org/mirror.txt and fetch that list:
wget https://amusewiki.org/mirror.txt
wget -x -N -i mirror.txt
And that’s it.
Private sites
Private sites do not expose /mirror/, for obvious reasons.
However, they can be mirrored with wget by providing the credentials for HTTP authentication:
wget -q -O - --user=user --password=password \
  https://private.amusewiki.org/mirror.txt | \
  wget --user=user --password=password -x -N -q -i -
Advanced
Filtering
Creative people can filter the file list to exclude formats they don’t want, or to fetch only a specific format, by editing (locally or on the fly) the file list passed to wget.
Example: download all the EPUB files and put them in the current directory (no directory tree):
wget -q -O - https://amusewiki.org/mirror.txt | grep '\.epub$' | wget -N -i -
Building a ZIM file
The mirror can be converted to the ZIM file format for offline reading.
Download all files, excluding bare HTML format:
wget -q -O - https://amusewiki.org/mirror.txt | \
  grep -v '\.bare\.html$' | \
  wget -x -N -q -i -
Compile the ZIM file using zimwriterfs:
zimwriterfs -w index.html \
  -f site_files/favicon.ico \
  -l EN \
  -t Amusewiki \
  -d "Amusewiki" \
  -c "Amusewiki" \
  -p "Amusewiki" \
  amusewiki.org/mirror/ amuse.zim
Be nice to the servers
The techniques described above are good for a one-time job: they don’t create much traffic if there are no changes, but they still hammer the site with a lot of requests.
For this purpose, another file list is provided at /mirror.ts.txt, which includes the timestamp of each file (without the full URL). The format is one filename, a hash symbol, and a timestamp, with one file per line. E.g.:
titles.html#1525363603
topics.html#1525363603
authors.html#1525363603
This can easily be parsed, and a client can check the local timestamp before making the request.
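As a minimal sketch of the idea in shell (assuming, as in the example above, that the listed names are relative to /mirror/, and that GNU date and touch are available; the Perl script linked below is the reference implementation):
BASE=https://amusewiki.org
wget -q -O - "$BASE/mirror.ts.txt" | while IFS='#' read -r file ts; do
    # skip the request when the local copy is already up to date
    if [ -e "$file" ] && [ "$(date -r "$file" +%s)" -ge "$ts" ]; then
        continue
    fi
    mkdir -p "$(dirname "$file")"   # recreate the directory tree locally
    wget -q -O "$file" "$BASE/mirror/$file"
    touch -d "@$ts" "$file"         # record the remote timestamp locally
done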
See https://github.com/melmothx/amusewiki/blob/master/script/mirror-site.pl for a simple (and usable) implementation.