TWO BIT DIGITAL PRESERVATION
BASH SCRIPT LIBRARY
PURPOSE
This library contains free and open source BASH-language scripts that I use in my daily work doing digital archiving and preservation. It contains example script "recipes" that you can run in your Mac or Linux Terminal. These scripts are used for digital collections management, data integrity and packaging, metadata and file analysis, web archiving, file conversion and transcoding, and basic bash operations. This library was developed using collections of born-digital audiovisual media (mostly video), but hopefully contains relevant recipes for any digital preservation project. The scripts can be used as single commands, together in combination using the "pipe" tool or double ampersand, or as shell script text files that can be used for automation.
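As a rough illustration of the two ways of combining commands mentioned above, the sketch below chains commands with a double ampersand (the second command runs only if the first succeeds) and connects commands with a pipe (the output of one becomes the input of the next). The folder and file names are placeholders made up for this example.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)

# Double ampersand: create a directory and, only if that works, add two files.
mkdir "$workdir/demo" && touch "$workdir/demo/a.txt" "$workdir/demo/b.txt"

# Pipe: send the output of ls into wc to count the entries.
# tr strips the padding spaces that some versions of wc add.
file_count=$(ls -1 "$workdir/demo" | wc -l | tr -d ' ')
echo "Files in demo: $file_count"

# Clean up the throwaway directory.
rm -R "$workdir"
```

Saving these lines in a text file and running it with bash is one way to turn single commands into a reusable script.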
COMMUNITY AND AUDIENCE
The resource was created for students and practitioners, including archivists, asset managers, conservators, and anyone with an interest in digital preservation. It is assumed that the user has basic familiarity with the BASH shell and command line. If you’re not familiar with BASH or the command line, check out Ryan’s Bash Scripting Tutorial, a great guide for beginners.
COMPUTING ENVIRONMENT
The commands below were tested on macOS, and many of the programs used come pre-installed. Software that does not come pre-installed can be downloaded using the Homebrew software repository. Programs installed with Homebrew are noted.
JUST THE BASICS
This library contains only basic commands for digital preservation and asset management. It is designed as a reference and resource for beginners, and is meant to help familiarize introductory users with a few popular CLI tools that are used to manage digital collections and perform digital preservation.
SCRIPT LIBRARY INDEX
This library contains scripts used for digital preservation. Scripts are broken up into five categories:
1. Digital Asset Management & Reporting
cp, mv, diff, du, find, ls, mkdir, rm, touch, tree
2. Data Integrity
md5, sha-1, hashdeep, rsync, bagit
3. Metadata & File Analysis
Apache Tika, exiftool, mediainfo, siegfried, stat
4. Web Archiving
wget, youtube-dl
5. File Conversion & Transcoding
ffmpeg, imagemagick, ghostscript
1. Digital Asset Management & Reporting
This section contains commands that I use constantly in my day-to-day work to manage data on our servers in a sustainable way that reduces harm to data. These commands are used to copy files (cp), move files from one location to another (mv), run reports determining differences between CLI output data and other text files (diff), calculate the amount of space taken up by a given file or set of files on disk (du), locate files or directories in a given location (find), create new files and directories (touch, mkdir), remove files or directories (rm), and create indexes or lists of files in a given location (ls, tree).
CP: copy files and directories
Copy data from one location to another
cp /source/path /destination/path
MV: move (rename) files
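Move updates the file system index so that data appears in a new location. When the source and destination are on the same file system, the data is not actually moved on storage; only its recorded location changes. Moving between different volumes copies the data to the new location and then removes the original.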
mv /source/path /destination/path
DIFF: compare files line by line
Compare the text of "myfile_01.txt" to "myfile_02.txt"
diff myfile_01.txt myfile_02.txt
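One way diff can be used for reporting is to compare saved listings of two folders. The sketch below is an assumption about how that might look, not part of the original recipes; the folder and file names are invented for the example.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
mkdir "$workdir/source" "$workdir/backup"
touch "$workdir/source/a.txt" "$workdir/source/b.txt" "$workdir/backup/a.txt"

# Save a one-per-line listing of each folder to a text file.
ls -1 "$workdir/source" > "$workdir/source_list.txt"
ls -1 "$workdir/backup" > "$workdir/backup_list.txt"

# diff exits with 0 when the files match and 1 when they differ.
# Lines starting with "<" appear only in the first file.
diff_status=0
diff_report=$(diff "$workdir/source_list.txt" "$workdir/backup_list.txt") || diff_status=$?
echo "$diff_report"
echo "diff exit status: $diff_status"

rm -R "$workdir"
```

Here the report shows that b.txt is listed for the source folder but missing from the backup.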
DU: estimate file space usage
Create disk usage report showing how much space is used by files and folders in the specified directory
du -sch /source/path/*
-s display only a total for each argument
-c produce a grand total
-h print sizes in human readable format (e.g., 1K 234M 2G)
Create disk usage reports with results in gigabytes
du -sg /source/path/*
-s display only a total for each argument
-g show sizes in gigabytes
Create disk usage reports with results in gigabytes and sort in reverse numeric order (largest results appear first)
du -sg /source/path/* | sort -rn
-s display only a total for each argument
-g show sizes in gigabytes
sort invoke the sort command
-r sort in reverse order
-n compare values as numbers rather than as text (without it, 9 sorts above 10)
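The following sketch shows a sorted disk usage report end to end on a throwaway folder. It assumes -k (kilobytes) in place of -g, since -k is available on both macOS and Linux, and it sorts numerically so that larger totals reliably come first. The folder names and file sizes are arbitrary.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
mkdir "$workdir/small" "$workdir/large"

# Make one small and one noticeably larger file filled with random bytes.
head -c 4096 /dev/urandom > "$workdir/small/file.bin"
head -c 2097152 /dev/urandom > "$workdir/large/file.bin"

# -s gives one total per folder, -k reports sizes in kilobytes,
# sort -rn orders the totals numerically with the largest first,
# head -n 1 keeps only the top line of the report.
largest=$(du -sk "$workdir"/* | sort -rn | head -n 1)
echo "$largest"

rm -R "$workdir"
```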
FIND: search for files in a directory hierarchy
Find a file named "filename.txt"
find . -name "filename.txt"
Find any file with a ".txt" extension
find . -name "*.txt"
Find a file and ignore capitalization
find . -iname "filename.txt"
Find a file whose name contains the word "animals"
find . -type f -name "*animals*"
-type f File is of type f or regular file
Find a directory whose name contains the word "banana"
find . -type d -name "*banana*"
-type d File is of type d or directory/folder
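The find options above can be combined. The sketch below builds a small throwaway folder (all names are invented for the example) and searches it by extension, ignoring capitalization, and by directory name.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
mkdir -p "$workdir/collection/banana_tapes"
touch "$workdir/collection/notes.txt" \
      "$workdir/collection/README.TXT" \
      "$workdir/collection/banana_tapes/tape01.mov"

# Quote the pattern so the shell passes the * to find unexpanded.
# -type f limits results to regular files; -iname ignores capitalization.
txt_files=$(find "$workdir/collection" -type f -iname "*.txt" | sort)

# -type d limits results to directories whose name contains "banana".
banana_dirs=$(find "$workdir/collection" -type d -name "*banana*")

echo "$txt_files"
echo "$banana_dirs"

# Count the matching text files.
txt_count=$(echo "$txt_files" | wc -l | tr -d ' ')
echo "Text files found: $txt_count"

rm -R "$workdir"
```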
LS: list directory contents
Create a recursive index of files and display output as one file per line
ls -1R /source/path
-1 List one file per line
-R List subdirectories recursively
Create a list of all files (including invisible files) with detailed output
ls -alh /source/path
-a do not ignore entries starting with .
-l use a long listing format
-h with -l, print sizes in human readable format (e.g., 1K 234M 2G)
MKDIR: make directories
Create the directory(ies), if they do not already exist.
mkdir /source/path/mydirectory
RM: remove files or directories
⚠️ Use caution when executing RM scripts ⚠️
Remove a single file
rm filename.txt
Recursively remove directories and their contents
rm -R /source/path
-R, --recursive remove directories and their contents recursively
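Because rm deletes without asking, it can help to wrap it in a few checks when scripting. The sketch below is one possible approach under stated assumptions: safe_remove is a hypothetical helper name, not a standard command, and the checks shown are only a starting point.

```shell
#!/usr/bin/env bash
# safe_remove is a hypothetical helper written for this example.
safe_remove() {
  local target="$1"
  # Refuse an empty argument or the root directory.
  if [ -z "$target" ] || [ "$target" = "/" ]; then
    echo "Refusing to remove: '$target'" >&2
    return 1
  fi
  # Refuse paths that do not exist.
  if [ ! -e "$target" ]; then
    echo "Nothing to remove at: $target" >&2
    return 1
  fi
  # List what is about to be deleted, then delete it.
  # The -- stops rm from reading a name that starts with "-" as an option.
  find "$target" -print
  rm -R -- "$target"
}

workdir=$(mktemp -d)
mkdir "$workdir/old_files"
touch "$workdir/old_files/draft.txt"

safe_remove "$workdir/old_files"
safe_remove "" || echo "Empty path was rejected"

# Record whether the folder is really gone.
if [ ! -e "$workdir/old_files" ]; then result="removed"; else result="still present"; fi
echo "old_files: $result"

rmdir "$workdir"
```

The empty-argument check matters in scripts because an unset variable would otherwise turn a path such as "$folder/" into "/".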
TOUCH: create new file or change file timestamps
Create a new empty .txt file
touch myfile.txt
TREE: list contents of directories in a tree-like format
Create an indented index of directory contents
tree /source/path
2. Data Integrity
The following set of commands is used to ensure the data integrity of digital collections. These include checksum algorithms to determine whether flipped bits, bit rot, or some form of intervention has compromised data (MD5 and SHA-1), a program used to create an initial “manifest” checksum log that can later be audited (hashdeep), a command used to perform safe data transfers that will preserve file integrity and create event logs (rsync), and the Library of Congress’ bagit tool, which generates standard digital preservation packages for long term storage and tracking within a digital repository.
MD5: create a message digest based on the md5 algorithm
md5 /path/to/file/filename.html
SHA-1: Create a message digest based on the SHA-1 algorithm
openssl sha1 /path/to/file/filename.mov
HASHDEEP: Perform batch checksums
Create a checksum manifest file (.txt) for a folder and its contents (hashdeep computes MD5 and SHA-256 by default):
hashdeep -bre source_01/* > knownhashes_source_01.txt
-b "bare mode" Strips any leading directory information from displayed filenames.
-r "recursive mode" checks all files in subsequent directories.
-e show progress while script is running.
Next, compare the text file of known hashes to the folder "source_02":
hashdeep -r -x -k knownhashes_source_01.txt source_02/
-x "negative matching" only files NOT in the list of known hashes are displayed (unique files only). [NOTE: use -m instead of -x to get a list of duplicate files]
-k load a file of known hashes (ex: knownhashes_source_01.txt from above)
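The create-then-audit idea behind the manifest workflow above can be sketched with standard checksum tools for machines where hashdeep is not installed. This is a stand-in, not hashdeep itself: it assumes either sha1sum (common on Linux) or shasum (ships with macOS) is available, and the folder and file names are invented for the example.

```shell
#!/usr/bin/env bash
# Pick whichever SHA-1 tool is installed. Both accept -c to check a manifest.
if command -v sha1sum >/dev/null 2>&1; then
  sha1_tool() { sha1sum "$@"; }
else
  sha1_tool() { shasum -a 1 "$@"; }
fi

workdir=$(mktemp -d)
mkdir "$workdir/source_01"
echo "first file" > "$workdir/source_01/a.txt"
echo "second file" > "$workdir/source_01/b.txt"

# Step 1: write a manifest of checksums for every .txt file in the folder.
(cd "$workdir/source_01" && sha1_tool ./*.txt > "$workdir/manifest.txt")

# Step 2: simulate damage by changing one file after the manifest was made.
echo "altered" >> "$workdir/source_01/b.txt"

# Step 3: audit the folder against the manifest.
# The check exits non-zero if any file no longer matches.
audit_status=0
audit_report=$(cd "$workdir/source_01" && sha1_tool -c "$workdir/manifest.txt" 2>/dev/null) || audit_status=$?
echo "$audit_report"
echo "audit exit status: $audit_status"

rm -R "$workdir"
```

The report marks the untouched file as OK and the altered file as FAILED, which is the signal to investigate or restore from another copy.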
RSYNC: Perform a safe file transfer that maintains data integrity
Copy files from one location to another
rsync /source-files-or-directories /destination-directory
Copy files from one location to another, preserve attributes, and issue verbose output and progress information
rsync -va --progress /source-files-or-directories /destination-directory
-v --verbose increase the amount of information you are given during the transfer
-a "archive mode" copies directories recursively and preserves file attributes (permissions, timestamps, ownership, and symbolic links)
--progress print information showing the progress of the transfer.
Perform a "mock" transfer with verbose output. Use this command to see whether files in the source and destination locations will sync in the way you intended.
rsync -v --dry-run /source-files-or-directories /destination-directory
-v --verbose increase the amount of information you are given during the transfer
--dry-run perform a trial run with no changes made
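After a transfer, one way to confirm that the destination matches the source is a recursive diff. The sketch below is an illustration under assumptions: the folder names are invented, and a cp fallback is included only so the example still runs on a machine without rsync.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
mkdir -p "$workdir/source/reels" "$workdir/destination"
echo "reel one" > "$workdir/source/reels/reel01.txt"
echo "notes" > "$workdir/source/notes.txt"

# Copy the contents of source into destination.
# The trailing slash on the source tells rsync to copy what is inside
# the folder rather than the folder itself.
if command -v rsync >/dev/null 2>&1; then
  rsync -a "$workdir/source/" "$workdir/destination/"
else
  # Fallback for machines without rsync (an assumption for this sketch).
  cp -Rp "$workdir/source/." "$workdir/destination/"
fi

# diff -r walks both folders; -q reports only which files differ.
# No output and an exit status of 0 mean the two trees match.
verify_status=0
verify_report=$(diff -rq "$workdir/source" "$workdir/destination") || verify_status=$?
echo "Differences found: ${verify_report:-none}"

rm -R "$workdir"
```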
BAGIT:
Create a full preservation package based on the Library of Congress' Bagit File Packaging Format, which contains a manifest of filenames, checksum values, and package identification information.
Create a bag in place
bagit baginplace path/to/bag
Verify that a bag is complete (all files are present)
bagit verifycomplete path/to/bag
Perform fixity checks on data contained in the bag
bagit verifypayloadmanifests path/to/bag
Verify that a bag is valid
bagit verifyvalid path/to/bag
3. Metadata & File Analysis
This section contains commands that analyze a file’s embedded “EXIF” metadata (exiftool), report a media file’s technical metadata (mediainfo), identify a file’s format (Apache Tika and siegfried), and show file size, permission, timestamp, and location information (stat).
APACHE TIKA: identify file formats
Create a file identity report
tika filename.txt
EXIFTOOL: read and write metadata in a file
Create a basic report of a file's "exif" metadata.
exiftool filename.mov
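Create a basic report of a file's "exif" metadata in CSV format and send the output to a .CSV file
exiftool -csv filename.mov > exif_output.csv
-csv export the metadata in CSV format (without this option the file would contain plain text, not comma-separated values)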
Create a basic exif metadata report for all files in a given directory
exiftool /path/to/files/
Delete all exif metadata from a file (null all metadata)
exiftool -exif:all= filename.jpg
NOTE: Not all file formats support this action.
MEDIAINFO: Analyze a file and create a report of audiovisual format technical metadata
Create a basic report of an audiovisual file's technical metadata
mediainfo filename.mov
Create a detailed report of an audiovisual file's technical metadata
mediainfo -f filename.mov
--Full, -f Full information Display (all internal tags)
SIEGFRIED: Analyze a file and create a file format identification report
Create a basic file format identification report using PRONOM, MIME-info, and FDD
sf filename.MXF
Calculate file checksum with hash algorithm and format identification report
sf -hash md5 filename.MXF
-hash Generate a checksum hash
md5 Use the md5 algorithm for this hash
Note: The version of Siegfried used in this example was installed on macOS using homebrew:
- brew install richardlehane/digipres/siegfried
STAT: display file or file system status
Show file size, permission, timestamp, and location information
stat /path/to/file.txt
Show created, access, and modification times for a given file
stat -x /path/to/file.txt
-x display created, access, and modification times
4. Web Archiving
The wget and youtube-dl programs set up web crawls to capture online content for web archiving and download video from the web, respectively.
WGET: download data from the web
Download website data recursively.
wget -r http://www.website.com/
-r Turn on recursive retrieving. The default maximum depth is 5.
Download website data with a password.
wget --user myusername --password mypassword http://www.website.com/
--user & --password Specify the username user and password password
Download one file type
wget -A .mp4 http://www.website.com/
-A Accept only files whose names end with the stated suffix (.mp4)
Example containing all options above
wget -r -A .mp4 --user myemail@email.com --password mypassword http://www.website.com
-r Turn on recursive retrieving.
-A Accept only files whose names end with .mp4
--user & --password Specify the username user and password
YOUTUBE-DL: Download video from youtube & other sites
Download video from YouTube
youtube-dl https://youtu.be/kJQP7kiw5Fk
Login with a username, password, and two-factor authentication to download video
youtube-dl -u myusername -p mypassword -2 mycode https://youtu.be/kJQP7kiw5Fk
-u Login with account username
-p Login with account password
-2 Two-factor authentication code
5. File Conversion & Transcoding
This set of scripts, including ffmpeg and imagemagick, can be used to convert video, audio, or still image media formats.
For a comprehensive guide to using, installing, and understanding FFMPEG, check out ffmprovisr.
FFMPEG: Media File Rewrap Command
"Rewrap" video and audio data from input file and drop it into new container or wrapper (.mov) in the output file.
ffmpeg -i input.mp4 -c:v copy -c:a copy output.mov
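-i Specify the input file
-c:v copy Copy the video stream as-is, without re-encoding
-c:a copy Copy the audio stream as-is, without re-encoding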
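To rewrap a whole folder, the command can be placed in a loop. The sketch below only prints the ffmpeg command it would run for each .mp4 file, so it can be checked before anything is written. rewrap_preview is a hypothetical helper name and the file names are invented for the example.

```shell
#!/usr/bin/env bash
# Print the rewrap command for each .mp4 in a folder without running it.
# rewrap_preview is a hypothetical helper name used only in this sketch.
rewrap_preview() {
  local folder="$1"
  local input output
  for input in "$folder"/*.mp4; do
    # Skip the literal pattern when the folder has no .mp4 files.
    [ -e "$input" ] || continue
    # ${input%.mp4} strips the old extension so .mov can be added.
    output="${input%.mp4}.mov"
    echo "ffmpeg -i \"$input\" -c:v copy -c:a copy \"$output\""
  done
}

workdir=$(mktemp -d)
touch "$workdir/tape01.mp4" "$workdir/tape02.mp4" "$workdir/notes.txt"

preview=$(rewrap_preview "$workdir")
echo "$preview"

rm -R "$workdir"
```

Once the printed commands look right, replacing the echo line with the ffmpeg command itself would perform the rewrap for real.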
FFMPEG: Transcode Video to Apple ProRes
Transcode input file to specified type of ProRes format
ffmpeg -i input.mov -c:v prores -profile:v $NUMBER -an output.mov
-c:v prores Encode the video stream with the ProRes codec
-profile:v Select the ProRes profile (see the numbers below)
-an Do not include audio in the output file
For different flavors of ProRes replace $NUMBER with a single number from 0 to 3 where:
0 ProRes422 (Proxy)
1 ProRes422 (LT)
2 ProRes422 (Normal)
3 ProRes422 (HQ)
IMAGEMAGICK: Convert image formats
Convert a .TIFF file to .JPG
convert input.tiff output.jpg
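Add a text caption to an image
convert leaf.gif -fill white -gravity North -pointsize 40 -annotate +0+100 'WHO DID THIS?' leaf-quote.gif
-fill Set the color of the text
-gravity Set the anchor position for the text (North is top center)
-pointsize Set the font size
-annotate Draw the quoted text at the given offset from the anchor position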