How to download complete items from archive.org

Download all files attached to an item page at archive.org
Navigate to the item page you want to download all the files from.
Download the XML file list, which is named after the item and ends with “_files.xml”.
Parse the filelist for the files (quick and ugly):

grep 'file name=' someitem_files.xml | sed -e 's:<file name=:<a href=:g' -e 's:>:>file</a>:g'

This keeps only the lines containing “file name=” and produces output consisting solely of HTML links to each file (relative, as in the file list).
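
To see what the pipeline produces without touching the network, here is a small sketch that runs it on a made-up file list. The sample XML is a hypothetical, trimmed-down stand-in for a real “_files.xml” (real lists carry more fields per file), and the temporary directory and file names are only for illustration. The two sed expressions are the same substitutions as in the command above, written in quoted form.

```shell
# Hypothetical sample of a _files.xml; real lists carry more fields per file.
workdir=$(mktemp -d)
cat > "$workdir/someitem_files.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<files>
  <file name="track01.mp3" source="original">
    <md5>0123456789abcdef0123456789abcdef</md5>
  </file>
  <file name="someitem_meta.xml" source="metadata">
    <md5>fedcba9876543210fedcba9876543210</md5>
  </file>
</files>
EOF

# Keep only the <file name=...> lines and turn each into an HTML link,
# then store the result in someitem.items for wget to read later.
grep 'file name=' "$workdir/someitem_files.xml" \
  | sed -e 's:<file name=:<a href=:g' -e 's:>:>file</a>:g' \
  > "$workdir/someitem.items"

# Show the generated links (one per file in the list).
cat "$workdir/someitem.items"
```

Each output line looks like `<a href="track01.mp3" source="original">file</a>`. The leftover `source` attribute is harmless, since only the `href` value matters when the list is read as HTML.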

Redirect the output to a file (someitem.items in the command below; I assume you know how), then download with wget:

wget -r -H -nc -np -nH --cut-dirs=1 -e robots=off -l1 -F -i someitem.items -B "https://archive.org/download/someitem/"
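
The wget options are easier to reuse once they are spelled out. The sketch below wraps the same command in a small function; the function name `download_item` and its single argument (the item identifier) are my own additions for illustration, and it assumes the link list was saved as `<item>.items` in the current directory.

```shell
# What the options do:
#   -r -l1            recurse, but only one level deep
#   -H                allow other hosts (download links usually redirect
#                     away from archive.org itself)
#   -np               never ascend to the parent directory
#   -nc               skip files that already exist locally
#   -nH --cut-dirs=1  no host directory, and drop the leading "download/"
#   -e robots=off     ignore robots.txt
#   -F -i FILE        read the links from FILE and treat it as HTML
#   -B URL            resolve the relative links in FILE against URL
download_item() {
  item=$1   # hypothetical helper argument: the item identifier
  wget -r -l1 -H -np -nc -nH --cut-dirs=1 -e robots=off \
       -F -i "$item.items" -B "https://archive.org/download/$item/"
}

# Usage (not run here, since it needs network access):
#   download_item someitem
```

Because of `-nc`, running the function again after an interrupted download should fetch only the files that are still missing.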

For more advanced downloading, I have created a set of scripts (not yet released) that can download a complete collection (of other item pages) or everything uploaded by a specific user. My scripts also create ‘md5sum -c’ compatible lists from the _files.xml files, run the check, and optionally delete corrupt files so they can be re-downloaded.
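
Until those scripts are out, the checksum step can be roughly approximated. The following is only a sketch, not the unreleased scripts: the function name `make_md5_list` is made up, and it assumes each `<file name="...">` tag and each `<md5>` element sits on its own line in the file list, with the `<md5>` following the file it belongs to. File names containing XML entities (such as `&amp;`) are not decoded.

```shell
# Print "md5  filename" lines (the format md5sum -c expects) for every
# file entry in a _files.xml that carries an <md5> element.
make_md5_list() {
  awk '
    /<file name="/ {
      # Remember the name of the file entry we are currently inside.
      name = $0
      sub(/.*<file name="/, "", name)
      sub(/".*/, "", name)
    }
    /<md5>/ {
      sum = $0
      sub(/.*<md5>/, "", sum)
      sub(/<\/md5>.*/, "", sum)
      if (name != "") print sum "  " name
      name = ""   # entries without an <md5> are simply skipped
    }
  ' "$1"
}

# Usage, from inside the directory holding the downloaded files:
#   make_md5_list someitem_files.xml > someitem.md5
#   md5sum -c someitem.md5
```

md5sum reports each file as OK or FAILED, so any corrupt files can be deleted and fetched again with the wget command above.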
