Mirroring a site running AmuseWiki
Starting with AmuseWiki version 2.031, released on October 14, 2017,
each AmuseWiki site provides a
/mirror/ path offering a static
version of the site, suitable for mirroring, backup and batch
download. See e.g. https://amusewiki.org/mirror
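Starting with AmuseWiki version 2.2, released on March 20, 2018, the
list of files to download is provided in two forms:
a basic list (the mirror.txt file used in the examples below) and
a list with timestamps (described under "Be nice with the servers").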
Download the whole site (the easy way)
If you have a GNU/Linux box,
wget is most likely already installed and mirroring
is as easy as running this command (using https://amusewiki.org as
an example):
wget -q -O - https://amusewiki.org/mirror.txt | wget -x -N -q -i -
The first wget call downloads the list of files and writes it to
standard output (-O -), which is piped to the second call.
The second call downloads the piped list (-i -),
creates the needed directories (-x) and
checks the timestamps (-N),
so files that have not been modified are not downloaded again.
All this happens quietly (-q).
If you don’t have
wget installed or you can’t pipe commands, the
procedure is a bit different.
First you need to install wget.
Please keep in mind that this is a command line utility, so you are going to need the Windows command prompt.
Go to the directory where you want to create the mirror. Download https://amusewiki.org/mirror.txt and fetch that list:
wget https://amusewiki.org/mirror.txt
wget -x -N -i mirror.txt
And that’s it.
Private sites do not expose
/mirror/ to anonymous visitors, for obvious reasons.
However, they can be mirrored by passing
wget the credentials for
the HTTP authentication.
wget -q -O - --user=user --password=password \
     https://private.amusewiki.org/mirror.txt | \
     wget --user=user --password=password -x -N -q -i -
You can also filter the file list to exclude
formats you don’t want, or to get only a specific format, by editing
(locally or on the fly) the file list passed to wget.
Example: download all the EPUB files and put them in the current directory (no directory tree):
wget -q -O - https://amusewiki.org/mirror.txt | grep '\.epub$' | wget -N -i -
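The same on-the-fly filtering can be generalized to several formats at once. The following is only a sketch: keep_formats is a hypothetical helper name, not something shipped with AmuseWiki, and it assumes the list contains one URL per line, as in the examples above.

```shell
#!/bin/sh
# keep_formats EXT...: read a URL list on stdin and print only the lines
# ending in one of the given extensions (hypothetical helper, not part
# of AmuseWiki).
keep_formats() {
    if [ "$#" -eq 0 ]; then
        echo "usage: keep_formats EXT..." >&2
        return 2
    fi
    pattern=""
    for ext in "$@"; do
        # Escape dots so that "bare.html" only matches a literal dot.
        escaped=$(printf '%s' "$ext" | sed 's/\./\\./g')
        pattern="${pattern:+$pattern|}$escaped"
    done
    # Keep the lines ending in ".EXT" for any of the requested extensions.
    grep -E "\\.($pattern)\$"
}

# Possible usage (performs network requests, so it is left commented out):
# wget -q -O - https://amusewiki.org/mirror.txt | keep_formats epub pdf | wget -N -i -
```

Note that an extension such as pdf matches every file name ending in .pdf, including any variant with a longer suffix, so pass the longer suffix if only one variant is wanted.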
Building a ZIM file
Download all files, excluding bare HTML format:
wget -q -O - https://amusewiki.org/mirror.txt | \
     grep -v '\.bare\.html$' | \
     wget -x -N -q -i -
Compile ZIM file using zimwriterfs:
zimwriterfs -w index.html \
            -f site_files/favicon.ico \
            -l EN \
            -t Amusewiki \
            -d "Amusewiki" \
            -c "Amusewiki" \
            -p "Amusewiki" \
            amusewiki.org/mirror/ amuse.zim
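Before compiling, it may be worth checking that the welcome page and the favicon named on the command line were actually downloaded. The sketch below assumes that zimwriterfs resolves both paths relative to the HTML directory; check_zim_inputs is a hypothetical helper name.

```shell
#!/bin/sh
# check_zim_inputs HTML_DIR WELCOME FAVICON: verify that the welcome page
# and the favicon exist under HTML_DIR (hypothetical helper; assumes both
# paths are relative to HTML_DIR).
check_zim_inputs() {
    html_dir=$1
    shift
    for f in "$@"; do
        if [ ! -f "$html_dir/$f" ]; then
            printf 'missing: %s\n' "$html_dir/$f" >&2
            return 1
        fi
    done
    return 0
}

# Possible usage with the paths from the command above:
# check_zim_inputs amusewiki.org/mirror index.html site_files/favicon.ico
```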
Be nice with the servers
The techniques described above are good for a one-time job. They don’t create much traffic if there are no changes, but they still hit the site with one request per file on every run.
For this purpose, another file list is provided at
which includes the timestamp of each file (without the full URL). The format is
one file per line: file name, hash symbol (#), timestamp (in epoch seconds, judging by the example). E.g.:
titles.html#1525363603
topics.html#1525363603
authors.html#1525363603
This can be easily parsed and a client can check the local timestamp before doing the request.
See https://github.com/melmothx/amusewiki/blob/master/script/mirror-site.pl for a simple (and usable) implementation.
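For illustration only, the client-side check could be sketched in shell as follows. This is not the logic of the script linked above. It assumes GNU stat, a timestamp list already saved to a local file, timestamps in epoch seconds, and local files whose modification time reflects the server's (which is what wget -N leaves behind by default). stale_files is a hypothetical helper name.

```shell
#!/bin/sh
# stale_files LIST [DIR]: read "filename#timestamp" lines from LIST and
# print the names that are missing under DIR or older than the listed
# timestamp (hypothetical helper; assumes GNU stat).
stale_files() {
    list=$1
    dir=${2:-.}
    while IFS= read -r line; do
        [ -n "$line" ] || continue
        name=${line%#*}      # everything before the last '#'
        remote=${line##*#}   # everything after the last '#'
        if [ ! -f "$dir/$name" ]; then
            # Not mirrored yet: needs to be fetched.
            printf '%s\n' "$name"
        else
            # GNU stat: %Y is the modification time in epoch seconds.
            local_ts=$(stat -c %Y "$dir/$name")
            if [ "$local_ts" -lt "$remote" ]; then
                printf '%s\n' "$name"
            fi
        fi
    done < "$list"
}
```

Only the names printed by the function need to be requested again. How a name maps to a full URL depends on the site layout, so that step is left out here.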