Archivebox is an easy-to-use archival program that lets you create an accurate snapshot of any website. This is helpful for archivists and anyone who wants to preserve information online. You can run the program both as a command line tool and as a web app that you can access from anywhere.
Content
- Why Should You Archive Websites?
- Archivebox’s Requirement
- Installing Archivebox
- Preparing the Web GUI
- Archiving Your First Website
- Customizing Archivebox
- Frequently Asked Questions
Why Should You Archive Websites?
Over the years, the World Wide Web enabled individuals across the globe to easily share and communicate information with each other. One issue with the Web, however, is that websites do not hold up over time.
Most websites only stay active for around two to five years. After that, they either go offline completely or are replaced by a different website altogether. For example, few, if any, websites from the 1990s are still online today.
Alternatively, you can use the Wayback Machine to archive websites, with no installation required.
Archivebox’s Requirement
Before you can install Archivebox, you need to make sure that you have the following resources:
- A machine that you can access from outside your home network. This can either be a machine at home that you can port-forward or a rented remote VPS.
- Your machine needs an adequate amount of storage space. In most cases, a 1TB disk should be able to store between 100,000 and 1,000,000 individual webpages.
- Your machine’s filesystem needs to either be EXT4 or ZFS for Archivebox to work properly.
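The storage and filesystem requirements above can be checked with a short script. This is a minimal sketch, assuming GNU df as shipped on Ubuntu. The helper names is_supported_fs and estimate_pages are hypothetical, and the page estimate simply restates the figure above (100,000 to 1,000,000 pages per 1TB, or roughly 1 MB to 10 MB per page) rather than measuring anything.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed only for the filesystem types listed above.
is_supported_fs() {
    case "$1" in
        ext4|zfs) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: turn free space in MB into a rough page range,
# assuming between 10 MB (low end) and 1 MB (high end) per page.
estimate_pages() {
    local avail_mb="${1:-0}"
    echo "$((avail_mb / 10)) to $((avail_mb)) pages"
}

# Directory to inspect. Defaults to the home directory (an assumption).
target="${1:-$HOME}"

# GNU df reports the filesystem type and the free space in units of 1,000,000 bytes.
fstype="$(df --output=fstype "$target" 2>/dev/null | tail -n 1 | tr -d ' ')"
avail_mb="$(df --output=avail -BMB "$target" 2>/dev/null | tail -n 1 | tr -dc '0-9')"

if is_supported_fs "$fstype"; then
    echo "$target is on $fstype"
else
    echo "$target is on ${fstype:-an unknown filesystem}, which is not EXT4 or ZFS"
fi
echo "Rough capacity: $(estimate_pages "$avail_mb")"
```

Pass the path of the disk you plan to use as the first argument to check a location other than your home directory.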
Note: this tutorial focuses on installing and configuring Archivebox on a local Ubuntu 22.04 LTS machine.
Installing Archivebox
First, install the program’s dependencies. Open a terminal and type the following command:
sudo apt install python3 nodejs python3-pip nginx npm
npm install --no-audit --no-fund 'git+https://github.com/gildas-lormeau/SingleFile.git'
npm install --no-audit --no-fund 'git+https://github.com/ArchiveBox/readability-extractor.git'
npm install --no-audit --no-fund '@postlight/mercury-parser'
Install Archivebox through pip, then add pip’s user binary folder to your PATH for the current session:
pip3 install archivebox
PATH=$PATH:/home/$USER/.local/bin
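The PATH assignment above only lasts for the current terminal session. One way to keep it across logins is to add an export line to your shell startup file. The sketch below assumes Bash and "~/.bashrc". The helper name append_once is hypothetical, and it only adds the line if it is not already there.

```shell
# Hypothetical helper: append a line to a file only if that exact line is not already present.
append_once() {
    local line="$1" file="$2"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (assumes Bash as your login shell):
#   append_once 'export PATH="$PATH:$HOME/.local/bin"' "$HOME/.bashrc"
```

Open a new terminal, or run "source ~/.bashrc", for the change to take effect.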
Next, create a folder where Archivebox will save all of its data. In my case, I am creating an “abox-data” directory inside my home directory:
mkdir /home/$USER/abox-data && cd /home/$USER/abox-data
Lastly, finalize your Archivebox instance by running the following command, which downloads and configures all the Python packages that the program needs to run on your machine:
archivebox init --setup
You will be asked for the details of the first user.
Check whether you have installed Archivebox properly by running:
archivebox --version
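If the version check fails, it can help to see which of the tools installed earlier are actually reachable from your shell. The following is a small sketch. The helper name have is hypothetical, and the binary names (for example "node" for the nodejs package) are assumptions that may differ on your system.

```shell
# Hypothetical helper: succeed if the named command can be found on PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Binary names are assumptions based on the packages installed earlier.
for tool in python3 pip3 node npm nginx archivebox; do
    if have "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done
```

A "missing: archivebox" line usually means that pip’s user binary folder is not on your PATH yet.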
Preparing the Web GUI
While Archivebox is perfectly usable as a command line utility, it is also possible to access the program through a web interface. This is useful if you want to either share Archivebox with other users or access the program outside your server.
To host a web GUI, you need to create an Nginx reverse proxy to redirect any incoming web traffic to the Archivebox daemon.
Create a new Nginx configuration file:
sudo nano /etc/nginx/sites-available/archivebox
Copy and paste the following code, changing server_name to your own domain name and root to the path of your own data folder:
server {
    listen 80;
    listen [::]:80;
    root /home/archivebox/abox-data;
    server_name yetanotherarchivebox.xyz www.yetanotherarchivebox.xyz;

    location / {
        proxy_pass http://127.0.0.1:8000;
    }
}
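If you would rather not edit the domain and data folder by hand, the server block above can be generated from a small shell function. This is only a sketch. The function name render_archivebox_site and its three parameters are hypothetical, and it prints the same directives as the block above without adding any new ones.

```shell
# Hypothetical helper: print the server block for a given domain, data folder and port.
render_archivebox_site() {
    local domain="$1" data_dir="$2" port="${3:-8000}"
    cat <<EOF
server {
    listen 80;
    listen [::]:80;
    root ${data_dir};
    server_name ${domain} www.${domain};

    location / {
        proxy_pass http://127.0.0.1:${port};
    }
}
EOF
}

# Usage: preview the output first, then write it to the Nginx configuration file.
#   render_archivebox_site yetanotherarchivebox.xyz "$HOME/abox-data"
#   render_archivebox_site yetanotherarchivebox.xyz "$HOME/abox-data" | sudo tee /etc/nginx/sites-available/archivebox
```

Generating the file this way keeps the domain name and the data folder path in one place, which makes a typo between the two less likely.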
Enable the Archivebox configuration:
sudo ln -s /etc/nginx/sites-available/archivebox /etc/nginx/sites-enabled/
Restart Nginx and start the Archivebox daemon:
sudo systemctl restart nginx
archivebox server 0.0.0.0:8000
Archiving Your First Website
Open your web browser and access the Archivebox instance through your domain name. In my case, I am going to “yetanotherarchivebox.xyz.”
Click the “LOG IN” button in the webpage’s upper-right corner.
Enter your user credentials to log in to the utility.
Archive your first website by pressing the “Add” button on the page’s upper bar.
This will load a large dialog box, where you can add a list of web links that you would like to archive. In my case, I am adding “https://maketecheasier.com.”
Next, you can choose a variety of options to archive your website. For example, you can provide a set of tags for your links to sort them properly.
Further, you can tell Archivebox to also save every page that is directly linked from the page that you want to archive. This is useful in cases where you want to preserve the context of a website.
Click the “Add URLs and Archive” button to start the archiving process. In most cases, this should only take between one and two minutes.
Archiving a Website Using the Command Line
To archive a webpage from the command line, run the following commands:
cd /home/$USER/abox-data
archivebox add --depth=1 https://maketecheasier.com
You can also use the add subcommand to archive a list of web links. For example, running the following command will tell Archivebox to save every link in my “bookmarks.txt” file:
archivebox add < /home/$USER/bookmarks.txt
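A bookmarks file often contains notes, blank lines or repeated links. A minimal sketch for tidying such a file before handing it to Archivebox follows. The helper name clean_urls is hypothetical, and it assumes one link per line.

```shell
# Hypothetical helper: keep only lines that start with http:// or https://
# and drop repeated links while preserving their original order.
clean_urls() {
    grep -E '^https?://' | awk '!seen[$0]++'
}

# Usage, from inside the data folder:
#   clean_urls < "$HOME/bookmarks.txt" | archivebox add
```

The filter only looks at the start of each line, so links that are indented or embedded in other text are skipped rather than guessed at.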
Lastly, it is also possible to create a self-contained archive of a single webpage. To do this, run the following command:
archivebox oneshot https://maketecheasier.com
Customizing Archivebox
You can also customize how Archivebox obtains the pages that it saves. For example, it is possible to save only a screenshot of every web page that you archive.
This is helpful for users who want to save disk space while storing websites. To disable the other formats, you need to run the following commands:
archivebox config --set SAVE_WGET=False
archivebox config --set SAVE_WARC=False
archivebox config --set SAVE_PDF=False
archivebox config --set SAVE_SINGLEFILE=False
archivebox config --set SAVE_READABILITY=False
archivebox config --set SAVE_MERCURY=False
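Since the commands above differ only in the option name, they can also be generated in a loop. The sketch below prints the commands instead of running them, so that you can review them first. The helper name disable_commands is hypothetical, and the option names are the ones used above.

```shell
# Option names taken from the commands above.
extractors="SAVE_WGET SAVE_WARC SAVE_PDF SAVE_SINGLEFILE SAVE_READABILITY SAVE_MERCURY"

# Hypothetical helper: print one "archivebox config --set KEY=False" command per option name.
disable_commands() {
    for key in "$@"; do
        printf 'archivebox config --set %s=False\n' "$key"
    done
}

# $extractors is left unquoted on purpose so that the shell splits it into separate names.
disable_commands $extractors
```

To apply the settings, run the script from inside your data folder and pipe its output to sh.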
Adding a New User in Archivebox
To add a new user, go back to the web GUI and click the “ADMIN” button on the page’s upper bar.
Once inside the Admin Panel, go to the “Authentication and Authorization” category and select “Users.”
This will list all the active users in the system. Select the “Add User +” button in the page’s upper-right corner.
Similar to adding users to a Linux group, the user creation process in Archivebox can be complicated. Despite that, a new user only requires three things to function properly: username, password and a set of user permissions.
To create a new user, first provide a password.
After that, select the user permissions for that particular user. In most cases, you only need to toggle the following options for a regular user:
- core | archive result | Can add archive result
- core | archive result | Can change archive result
- core | archive result | Can view archive result
- core | snapshot | Can add snapshot
- core | snapshot | Can change snapshot
- core | snapshot | Can view snapshot
- core | tag | Can add Tag
- core | tag | Can change Tag
- core | tag | Can view Tag
- sessions | session | Can add session
- sessions | session | Can change session
- sessions | session | Can view session
Provide a username for the new user account. In my case, I am using the name “alice.”
Lastly, select the “SAVE” button on the page’s lower right corner to apply your changes.
Frequently Asked Questions
How can I solve a “Failed to install Python packages” error?
This happens due to a bug in Archivebox that prevents it from finding the binaries it is looking for. Despite that, this error only affects a minor part of the program and will not damage the integrity of your archive.
One way to mitigate this issue is by making sure that your installation is always up to date. Do that by running pip3 install --upgrade archivebox.
How can I fix the “HTTPSConnectionPool” error whenever I save a website?
This error happens whenever a website does not have a valid HTTPS version. Fix this issue by forcing Archivebox to archive through HTTP. For example, running archivebox add http://insecurewebsite.com will force the program to use HTTP.
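If you keep running into this error with saved links, a small helper can rewrite the scheme for you. This is a sketch only. The function name to_http is hypothetical, and it changes nothing but the "https://" prefix.

```shell
# Hypothetical helper: rewrite an https:// link to http:// and leave anything else untouched.
to_http() {
    case "$1" in
        https://*) printf 'http://%s\n' "${1#https://}" ;;
        *) printf '%s\n' "$1" ;;
    esac
}

# Usage, from inside the data folder:
#   archivebox add "$(to_http 'https://insecurewebsite.com')"
```

Only use this for websites that you know do not serve a valid HTTPS version, as HTTP traffic is not encrypted.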
What can I do when my new user account cannot archive a website?
This issue is most likely due to a missing permission setting on your new user account. One way to quickly fix this issue is by making sure that your new user account has the core | snapshot | Can add snapshot permission.
Image credit: Unsplash. All alterations and screenshots by Ramces Red.
Ramces Red –
Staff Writer