Why should practitioners use command line programming for digital archiving and preservation?
In the field of digital archiving and preservation, command line skills are imperative. Coding skills are invaluable, and many practitioners are eager to learn more about programming, but we often don’t know where to start. To help readers consider potential uses of programming in our work and bring perspective to this conversation, I’d like to try to answer the question, “Why use the command line for digital archiving and preservation?”
When I started working as a digital archivist in 2008, the virtual world seemed obscure and opaque to me. I felt uncomfortable making promises about the longevity of collections in my care. Over time, I found ways to make data more transparent and easier to understand, quantify, identify, and verify. I worked with other archivists, IT folks, and nerdy friends to find tools that transformed me from an uncertain, early-career professional into a confident manager and steward of digital collections. It took a long time, and I’m grateful to the many people who helped me along the way.
I started using command line utilities for a couple of reasons. First, I quickly discovered nothing else really worked. Second, I found UNIX and GNU command line tools to be extremely powerful when used together in combination. Third, I soon learned that digital collections are enormous and command line utilities can be leveraged to run automated or batch processes to get things done quickly, efficiently, and consistently. I elaborate on each of these points (and a few others) in the discussion that follows.
1. Nothing Works as Well as CLI Tools
Command line interface (CLI) programs differ from the software most users are familiar with because they lack a graphical user interface (GUI). There are many excellent GUI software utilities for data management that run on popular computing platforms such as Windows, Mac, and Linux, but GUI tools are limited. I personally only felt fully independent and capable as an asset manager after I started using CLI utilities. Long before I’d mastered even one or two CLI programs, I felt comfortable running basic commands that allowed me to reliably and safely move, monitor, and create documentation about data (a few examples of the kind of thing I mean appear just below). With a little bit of practice using the BASH programming language, I was soon able to control the assets in my care, and began making promises and reasonable projections about the longevity and stability of our digital collections. The command line opened up a world of transparency and certainty. The program that first hooked me was the data backup utility, rsync.
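Before getting to rsync, here is a rough sketch of the kind of basic commands I mean. The paths and file names below are hypothetical, and there are many equally good ways to do the same things:

cp -Rp /Volumes/DONOR_DRIVE/photos /Volumes/ARCHIVE/accession_001    # copy files while preserving their timestamps and permissions
ls -lR /Volumes/ARCHIVE/accession_001 > accession_001_inventory.txt  # document what was received, file by file
du -sh /Volumes/ARCHIVE/accession_001                                # check how much space the accession occupies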
Rsync is included with most UNIX-like operating systems and was first released in 1996. It transfers and syncs data, and it creates logs that give preservationists a paper trail of every action it runs, such as an itemized list of files transferred and a summary of a “job” or operation when it is complete. Rsync is much more stable and reliable than drag-and-drop file transfers. I started using it exclusively to transfer data because the transfers I made in macOS’s default Finder file manager frequently yielded errors for large sets of files (over 500 GB). I would set up a Finder drag-and-drop transfer to run overnight, and four mornings out of five, I’d arrive at my workstation to discover the entire operation had failed because of a network drop or a corrupted file. A colleague who was familiar with BASH suggested I use the rsync utility, which not only handles network drops and can skip bad files instead of derailing an entire transfer operation, but also produces logs listing every file that was transferred and any errors along the way.
The more I dug into rsync and its options or “flags”, the more I appreciated how powerful it is for managing digital collections. I use rsync to retain important metadata for each file (like date and time stamps that could otherwise be lost), create checksums, produce transfer logs of all operations, include or exclude certain files from a job, and test a sync with a mock transfer or “dry run” that reports which files would and would not be moved during a transfer. There are other programs (some with graphical user interfaces) with similar functionality, but many of them use rsync as their underlying engine. As someone who needs a lot from a file transfer utility, having access to all of the options available in rsync affords great control and creativity, and it has made the digital collections in my care a lot safer.
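As a hedged illustration (the paths and log file name here are placeholders, not a recommendation), a transfer command along these lines covers most of what I describe above:

rsync -avh --checksum --itemize-changes --exclude=".DS_Store" --log-file=transfer.log /Volumes/STAGING/accession/ /Volumes/ARCHIVE/accession/

The -a flag preserves timestamps, permissions, and other file metadata, --checksum compares files by their contents rather than by date and size, --itemize-changes and --log-file produce the paper trail described above, and adding -n (or --dry-run) turns the same command into a mock transfer that only reports what would change.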
2. Classic UNIX and GNU Utilities are Super Powerful
In addition to rsync, there are lots of amazing CLI utilities that come pre-installed on most UNIX-like operating systems, and they are incredibly powerful when used together in combination. These programs (ls, cp, mv, diff, touch, date, find, chmod, grep, and sort, to name a few) are not just disparate utilities; they are part of a suite of tools initially distributed with the UNIX operating system in the early 1970s, and they were specifically designed to be used together. Different sets of command line utilities have an interesting history of licensing and usage across various operating systems and platforms (UNIX, Linux, macOS, Windows, NeXT, and many others), but all are made to be interoperable and compatible with one another. When combined in creative ways, these tools can be leveraged to perform powerful operations.
For example, the find command can locate every file on a disk whose name contains the word “goat”, the du command can report the size of each of those files, and the sort command can order the results by size from largest to smallest. I run these commands on a single line with their corresponding options or “flags”, passing the output of one to the next with the “pipe” character, which looks like this… | (Here the xargs utility hands find’s results along to du.)
find . -iname "*goat*" -print0 | xargs -0 du -h | sort -rh
NOTE: For a full breakdown of the commands and options used in the line above (or in any command), check out explainshell.
When used together in this way, command line utilities are interoperable and flexible. Operations like this can sometimes be performed in a limited way in GUI applications, but folks who work in the command line are limited only by their creativity and imagination.
When I began using BASH long ago, I assumed GUI tools for archiving and preservation would eventually “catch up” and recreate or mimic the amazing functionality of CLI tools. Although many incredible GUI applications have been developed in the last decade, I now believe there is no substitute for the power of interoperable command line tools and the extremely granular, custom operations they can execute. The ease of use and accessibility of GUI tools is unmatched, however, and hopefully a community of incredible developers will someday create graphical applications that prove me wrong.
3. Automation and Batch Processing
When it comes to design and execution of batch processes or automated workflows, the command line really shines. Operations that might normally be performed manually by a human operator over a matter of hours can be executed in seconds or minutes with a little command line scripting.
A single script can be written to execute an operation on many files in a data set, or triggered to run in a “loop” to transform or manipulate data in a given directory or with a certain name. I often use “loop” scripts in my work to transcode all of the video files in a given folder from one file format to another. For example, we often receive AVCHD-formatted video as MTS files from freelance videographers, and the footage needs to be transcribed quickly. MTS files are large, take a long time to download, and generally won’t play in standard media player software, so they need to be converted for our transcriptionists. I use the following loop script to extract the audio from MTS files and convert it to MP3 files, which I then send out for transcription:
#!/bin/sh
# For every MTS file in the current directory, ignore the video stream (-vn),
# extract the audio, and encode it as an MP3 with the same base name.
for f in *.MTS; do
  ffmpeg -i "$f" -vn -c:a libmp3lame "${f%.MTS}.mp3"
done
Batch scripts like the loop script above can be run one at a time, or on a timed schedule with the cron utility. Cron can run a script every five minutes, every day at 1am, once a week, or whenever a user specifies. I run a “cron job” to create a nightly list of every project folder on our server and report the size of each one. I also set a second job to run one hour after my list is created to compare tonight’s list to last night’s (with the diff command), showing how the size of each project has changed since the day before. I do this to keep track of new data on our servers. I also often use the sleep command for timed operations; sleep simply waits for a specified amount of time (for example, three hours) before the next command runs. I use sleep to transfer large sets of data to our servers overnight when there is no traffic from staff on our networks.
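As an illustration only (the schedule, server paths, and report file names here are made up), the two cron jobs I describe might look something like the following lines, added with crontab -e:

# Every night at 1:00 am, save last night's report and record the size of each project folder
0 1 * * * mv /var/reports/projects_today.txt /var/reports/projects_yesterday.txt && du -sh /srv/projects/* > /var/reports/projects_today.txt
# One hour later, report what changed since the night before
0 2 * * * diff /var/reports/projects_yesterday.txt /var/reports/projects_today.txt > /var/reports/projects_changes.txt

And an overnight transfer delayed with sleep can be as simple as a one-liner like this (again, the paths are placeholders):

sleep 10800 && rsync -avh --log-file=overnight.log /Volumes/STAGING/ /Volumes/SERVER/collections/    # 10800 seconds = 3 hours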
4. Other Reasons I <3 CLI for #DigiPres
Chatting with the Kernel & SSH
Additionally, I really love using the command line because it feels like chatting directly with a computer’s “kernel”. The kernel is the core of your computer’s operating system; it manages all hardware and software resources and relays messages between them. When you type commands into Terminal, the shell translates them into requests to the kernel, bypassing the graphical user interface entirely (with the exception of the Terminal application itself). Working in this GUI-free environment, it’s much easier to manage assets remotely using CLI tools and SSH, or “Secure Shell”, remote sessions. I can open an SSH session in Terminal to any shared computer on a network and perform remote operations, which is super convenient. I can also use SSH to log into multiple machines simultaneously and run batch processes in the background on all of them.
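For example (the hostname, user name, and paths here are hypothetical), I can check on a shared server without leaving my desk:

ssh archivist@media-server.example.org                                       # open a remote shell session on a networked machine
ssh archivist@media-server.example.org 'du -sh /srv/projects/* | sort -rh'   # or run a single command remotely and read the results locally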
Because Homebrew
I love using CLI tools in macOS specifically because I can easily download new utilities and programs using the Homebrew package manager. Homebrew installs and maintains command line programs along with their dependencies (like the libraries ffmpeg relies on) that would be very difficult to install otherwise. Homebrew makes command line utilities accessible. Homebrew users can easily experiment with and test new programs, and the project has an enormous user base that sends feedback and improves the platform. If you’re interested in using CLI tools on macOS (and Linux), check out the Homebrew website:
https://brew.sh/
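Installing a tool with Homebrew is typically a one-liner. A couple of examples:

brew install ffmpeg    # installs ffmpeg together with the libraries it depends on
brew install rsync     # installs a current version of rsync
brew upgrade           # updates everything Homebrew manages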
Online Community and Support
Since I started my career as a digital archivist, our community has grown exponentially. Fellow geeks are collaborative and eager to share their code, successes, and ideas. I love that we get to work in a field where we don’t compete, and where we have so much to gain from working together. Some of my best friends were once mentors and collaborators, and the web is now full of forums, tweets, and code that I reference in my daily work.
🥓 Money 🥓
Last but not least, learning about CLI utilities and improving programming skills will make any digital archivist a more valuable worker. Our profession requires a great deal of training and expertise, and considering the low wages archivists are paid when compared with counterparts in tech, I believe we are seriously undervalued. Adding programming skills to a CV helps us make a case for a well-deserved, higher wage.
Although I’ve spent many years accumulating CLI skills, attending trainings, teaching CLI skills in courses and workshops, suffering through online tutorials, and trawling the internet and Stack Exchange for code and answers to all of my BASH-related questions, I still don’t consider myself a super advanced, high-level BASH programmer. Despite my deep interest and willingness to tool around, I think my lack of expertise is partly due to the fact that my non-expert skills have served me quite well on the job. With just a few commands in your pocket and a little patience, any digital archivist can successfully manage large sets of complex data with a lot more confidence.
How do you use command line programming in your work? Feel free to (kindly) drop a comment and share your (friendly) thoughts :)