How to Create a Web Archive With Archivebox

Archivebox is an archival program that creates an accurate snapshot of any website, which is helpful for archivists and anyone who wants to preserve information online. It is also simple to use: you can run it either as a command line tool or as a web app that you can access from anywhere.

Why Should You Archive Websites?

Over the years, the World Wide Web has enabled people across the globe to share and communicate information with each other. One issue with the Web, however, is that websites do not hold up over time.

Most websites only stay active for around two to five years. After that, they either go offline completely or are replaced by a different website altogether. For example, few websites from the 1990s are still online today.

If you would rather not host your own archive, you can also use the Wayback Machine to archive websites, with no installation required.

Archivebox’s Requirements

Before you can install Archivebox, you need to make sure that you have the following resources:

  • A machine that you can access from outside your home network. This can either be a machine at home that you can port-forward or a rented remote VPS.
  • Your machine needs to have an adequate amount of storage space. In most cases, a 1TB disk should be able to store between 100,000 and 1,000,000 individual webpages.
  • Your machine’s filesystem needs to be either EXT4 or ZFS for Archivebox to work properly.
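
The storage figure above works out to roughly 1 MB to 10 MB per archived page. As a rough sketch, the following shell snippet estimates the disk space for a planned archive and prints the filesystem type of the current directory. The function names estimate_gb and fs_type are made up for this example, and the -T option of df assumes the GNU version that ships with Ubuntu.

```shell
#!/bin/sh
# estimate_gb PAGES MB_PER_PAGE
# Prints the estimated archive size in decimal gigabytes (1 GB = 1000 MB).
estimate_gb() {
    echo $(( $1 * $2 / 1000 ))
}

# fs_type DIRECTORY
# Prints the filesystem type that holds DIRECTORY (for example ext4 or zfs).
fs_type() {
    df -PT "$1" | awk 'NR == 2 { print $2 }'
}

# Upper end of the estimate above: 50,000 pages at 10 MB each.
echo "Estimated size: $(estimate_gb 50000 10) GB"
echo "Filesystem of the current directory: $(fs_type .)"
```

Treat the result as a ballpark only, since the real size of a snapshot depends on the page and on which output formats you keep.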

Note: this tutorial focuses on installing and configuring Archivebox on a local Ubuntu 22.04 LTS machine.

Installing Archivebox

First, install the program’s dependencies. Open a terminal and run the following commands:

sudo apt install python3 nodejs python3-pip nginx npm
npm install --no-audit --no-fund 'git+https://github.com/gildas-lormeau/SingleFile.git'
npm install --no-audit --no-fund 'git+https://github.com/ArchiveBox/readability-extractor.git'
npm install --no-audit --no-fund '@postlight/mercury-parser'

Install Archivebox through pip, then add pip’s local bin directory to your PATH so that the shell can find the archivebox command:

pip3 install archivebox
PATH=$PATH:/home/$USER/.local/bin
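
Note that the PATH assignment above only lasts for the current terminal session. Here is a minimal sketch of one way to make it persistent, assuming your login shell reads ~/.profile. The helper name persist_local_bin is hypothetical.

```shell
#!/bin/sh
# persist_local_bin FILE
# Appends a PATH line for ~/.local/bin to FILE unless that exact line is already there.
persist_local_bin() {
    line='export PATH="$PATH:$HOME/.local/bin"'
    grep -qxF -- "$line" "$1" 2>/dev/null || printf '%s\n' "$line" >> "$1"
}

# Example call (uncomment to apply it to your own profile):
# persist_local_bin "$HOME/.profile"
```

Because the function checks for the line first, running it more than once will not add duplicate entries.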

Next, create a folder where Archivebox will save all of its data. In my case, I am creating an “abox-data” directory inside my home directory:

mkdir /home/$USER/abox-data && cd /home/$USER/abox-data

Lastly, finalize your Archivebox instance by running the following command inside the new directory. It downloads and configures the Python packages that the program needs to run on your machine.

archivebox init --setup

During setup, you will be asked for the details of the first user, such as a username and password.

Check whether you have installed Archivebox properly by running:

archivebox --version
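
If the version check fails, a missing dependency is a likely cause. The sketch below lists which of the tools installed earlier cannot be found on your PATH. The function name missing_tools is an assumption made for this example.

```shell
#!/bin/sh
# missing_tools NAME...
# Prints each NAME that cannot be found on the current PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools python3 pip3 node npm nginx archivebox)
if [ -n "$missing" ]; then
    echo "Not found on PATH:"
    echo "$missing"
else
    echo "All tools found."
fi
```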

Preparing the Web GUI

While Archivebox is perfectly usable as a command line utility, it is also possible to access the program through a web interface. This is useful if you want to either share Archivebox with other users or access the program outside your server.

To host the web GUI, you need to create an Nginx reverse proxy that forwards any incoming web traffic to the Archivebox daemon.

Create a new Nginx configuration file:

sudo nano /etc/nginx/sites-available/archivebox

Copy and paste the following code, changing server_name to your own domain name and root to the path of your own data directory:

server {
       listen 80;
       listen [::]:80;
 
       root /home/archivebox/abox-data;
 
       server_name yetanotherarchivebox.xyz www.yetanotherarchivebox.xyz;
 
       location / {
                  proxy_pass http://127.0.0.1:8000;
       }
}

Enable the Archivebox configuration:

sudo ln -s /etc/nginx/sites-available/archivebox /etc/nginx/sites-enabled/
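
Before restarting Nginx, it can be worth confirming that the symbolic link is in place. The following is only a sketch: the function name site_is_enabled and the SITES_ENABLED override are hypothetical, while nginx -t is Nginx's own configuration test.

```shell
#!/bin/sh
# SITES_ENABLED is a hypothetical override for this example; it defaults to
# the directory used in the ln command above.
SITES_ENABLED="${SITES_ENABLED:-/etc/nginx/sites-enabled}"

# site_is_enabled NAME
# Succeeds when NAME is a symlink in SITES_ENABLED and its target exists.
site_is_enabled() {
    [ -L "$SITES_ENABLED/$1" ] && [ -e "$SITES_ENABLED/$1" ]
}

if site_is_enabled archivebox; then
    echo "archivebox site is enabled; run 'sudo nginx -t' to validate the syntax"
else
    echo "archivebox site is not enabled yet"
fi
```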

Restart Nginx, then start the Archivebox daemon from inside your data directory:

sudo systemctl restart nginx
cd /home/$USER/abox-data
archivebox server 0.0.0.0:8000

Archiving Your First Website

Open your web browser and access the Archivebox instance through your domain name. In my case, I am going to “yetanotherarchivebox.xyz”.

Click the “LOG IN” button in the webpage’s upper-right corner.

Enter your user credentials to log in to the utility.

Archive your first website by pressing the “Add” button on the page’s upper bar.

This will load a large dialog box, where you can add a list of web links that you would like to archive. In my case, I am adding “https://maketecheasier.com”.

Next, you can choose a variety of options to archive your website. For example, you can provide a set of tags for your links to sort them properly.

You can also tell Archivebox to save every page that is directly linked from the page you are archiving. This is useful in cases where you want to preserve the context of a website.

Click the “Add URLs and Archive” button to start the archiving process. In most cases, this should only take between one and two minutes.

Archiving a Website Using the Command Line

To archive a webpage from the command line, run the following commands. The --depth=1 option also archives every page that the given page links to:

cd /home/$USER/abox-data
archivebox add --depth=1 https://maketecheasier.com

You can also use the add subcommand to archive a list of web links. For example, running the following command will tell Archivebox to save every link in my “bookmarks.txt” file:

archivebox add < /home/$USER/bookmarks.txt
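
If your bookmarks file contains blank lines, comments or repeated entries, you may want to tidy it first. The following is a small sketch rather than anything Archivebox requires. The function name clean_url_list and the convention of “#” comment lines are assumptions made for this example.

```shell
#!/bin/sh
# clean_url_list FILE
# Prints the http(s) URLs from FILE, one per line, skipping blank lines,
# lines starting with '#', and repeated entries (the first occurrence wins).
clean_url_list() {
    grep -E '^[[:space:]]*https?://' "$1" \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | awk '!seen[$0]++'
}

# Example (run from the data directory):
# clean_url_list "$HOME/bookmarks.txt" | archivebox add
```

Only lines that begin with http:// or https:// are kept, so notes and other text in the file are dropped before the list reaches the add subcommand.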

Lastly, it is also possible to create a self-contained archive of a single webpage. To do this, run the following command:

archivebox oneshot https://maketecheasier.com

Customizing Archivebox

You can also customize how Archivebox obtains the pages that it saves. For example, you can limit the output formats so that Archivebox saves little more than a screenshot of every web page that you archive.

This is helpful for users who want to save disk space while storing websites. To disable most of the other formats, run the following commands from inside your data directory:

archivebox config --set SAVE_WGET=False
archivebox config --set SAVE_WARC=False
archivebox config --set SAVE_PDF=False
archivebox config --set SAVE_SINGLEFILE=False
archivebox config --set SAVE_READABILITY=False
archivebox config --set SAVE_MERCURY=False
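
Since the six commands differ only in the option name, a loop can generate them. The sketch below only prints the commands so that you can review them first. The function name disable_commands is hypothetical, and the option names are the ones used above.

```shell
#!/bin/sh
# disable_commands KEY...
# Prints one "archivebox config --set KEY=False" command per KEY.
disable_commands() {
    for key in "$@"; do
        printf 'archivebox config --set %s=False\n' "$key"
    done
}

disable_commands SAVE_WGET SAVE_WARC SAVE_PDF \
    SAVE_SINGLEFILE SAVE_READABILITY SAVE_MERCURY
```

To apply the settings, run the script from inside your data directory and pipe its output to sh.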

Adding a New User in Archivebox

To add a new user, go back to the web GUI and click the “ADMIN” button on the page’s upper bar.

Once inside the Admin Panel, go to the “Authentication and Authorization” category and select “Users.”

This will list all the active users in the system. Select the “Add User +” button in the page’s upper-right corner.

The user creation form in Archivebox offers many options and can look complicated. Despite that, a new user only requires three things to function properly: a username, a password and a set of user permissions.

To create a new user, first provide a password.

After that, select the user permissions for that particular user. In most cases, you only need to toggle the following options for a regular user:

core | archive result | Can add archive result
core | archive result | Can change archive result
core | archive result | Can view archive result
core | snapshot | Can add snapshot
core | snapshot | Can change snapshot
core | snapshot | Can view snapshot
core | tag | Can add Tag
core | tag | Can change Tag
core | tag | Can view Tag
sessions | session | Can add session
sessions | session | Can change session
sessions | session | Can view session

Provide a username for the new user account. In my case, I am using the name “alice.”

Lastly, select the “SAVE” button in the page’s lower-right corner to apply your changes.

Frequently Asked Questions

How can I solve a “Failed to install Python packages” error?

This happens due to a bug in Archivebox that prevents it from finding the binaries it is looking for. Despite that, this error only affects a minor part of the program and will not damage the integrity of your archive.

One way to mitigate this issue is by making sure that your installation is always up to date. Do that by running pip3 install --upgrade archivebox.

How can I fix the “HTTPSConnectionPool” error whenever I save a website?

This error happens whenever a website does not have a valid HTTPS version. Fix this issue by forcing Archivebox to archive through HTTP. For example, running archivebox add http://insecurewebsite.com will force the program to use HTTP.
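
If you have several such links, the rewrite from HTTPS to HTTP can be scripted. Here is a minimal sketch, where to_http is a hypothetical helper name:

```shell
#!/bin/sh
# to_http URL
# Prints URL with a leading "https://" replaced by "http://".
# URLs that do not start with "https://" are printed unchanged.
to_http() {
    printf '%s\n' "$1" | sed 's|^https://|http://|'
}

to_http "https://insecurewebsite.com"
```

You could then pass the result to Archivebox from inside your data directory, for example with archivebox add "$(to_http "https://insecurewebsite.com")".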

What can I do when my new user account cannot archive a website?

This issue is most likely due to a missing permission setting on your new user account. One way to quickly fix this issue is by making sure that your new user account has the core | snapshot | Can add snapshot permission.

Image credit: Unsplash. All alterations and screenshots by Ramces Red.


Ramces Red
Staff Writer

Ramces is a technology writer who has lived with computers all his life. A prolific reader and a student of Anthropology, he is an eccentric character who writes articles about Linux and anything *nix.
