TWO BIT DIGITAL PRESERVATION
BASH SCRIPT LIBRARY
WELCOME
This site contains a list of BASH programming language commands used in day-to-day digital archiving and preservation workflows. These workflows include operations like digital asset management, reporting and logging, web archiving, preservation packaging, metadata and file analysis, data format conversion, video transcoding, and creation of fixity checks. This set of commands was designed specifically with digital audiovisual collections in mind (audio, video, graphics, and complex formats), but many are general and can be used with any data type.
COMMUNITY AND AUDIENCE
The resource was created for students and practitioners, including archivists, asset managers, conservators, and anyone with an interest in digital preservation. It is assumed that the user has basic familiarity with the BASH shell and command line. If you’re not familiar with BASH or the command line, check out Ryan’s Bash Scripting Tutorial, a great guide for beginners.
COMPUTING ENVIRONMENT
The commands below were tested on macOS and many of the programs used come pre-installed. Software that does not come pre-installed can be downloaded using the homebrew software repository. Programs installed with homebrew are noted.
JUST THE BASICS
This library contains only basic commands for digital preservation and asset management. It is designed as a reference and resource for beginners, and is meant to help familiarize introductory users with a few popular CLI tools that are used to manage digital collections and perform digital preservation.
SCRIPT LIBRARY INDEX
This library contains scripts used for digital preservation. Scripts are broken up into six categories:
1. Digital Asset Management & Reporting
cp, mv, diff, du, find, ls, mkdir, rm, touch, tree
2. Data Integrity
md5, sha-1, hashdeep, rsync, bagit
3. Metadata & File Analysis
Apache Tika, exiftool, mediainfo, siegfried, stat
4. Web Archiving
wget, youtube-dl
5. File Conversion, Transcoding, Disk Imaging & Disk Formatting
ffmpeg, imagemagick, ghostscript
6. Cool Tricks
sleep, watch
1. Digital Asset Management & Reporting
This section contains commands that I use constantly in my day-to-day work to manage data on our servers in a sustainable way that reduces harm to data. These commands are used to copy files (cp), move files from one location to another (mv), run reports determining differences between CLI output data and other text files (diff), calculate the amount of space taken up by a given file or set of files on disk (du), locate files or directories in a given location (find), create new files and directories (touch, mkdir), remove files or directories (rm), and create indexes or lists of files in a given location (ls, tree).
CP: copy files and directories
Copy data from one location to another
cp /source/path /destination/path
MV: move (rename) files
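Move updates the file system index so that data appears in a new location. When the source and destination are on the same file system, the data itself is not rewritten on storage; only its index entry changes. When they are on different file systems, the data is copied to the destination and then removed from the source.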
mv /source/path /destination/path
DIFF: compare files line by line
Compare the text of "myfile_01.txt" to "myfile_02.txt"
diff myfile_01.txt myfile_02.txt
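Because diff works on any two text files, it can also compare two file listings to spot files that are missing from a copy. The sketch below is one way to do that; the helper name compare_listings and the throwaway directories are made up for illustration.

```shell
# Hypothetical helper: list the files under each directory (relative paths,
# sorted), save both listings to temporary files, and diff them.
# Exit status 0 means the two listings match.
compare_listings() {
  list_a=$(mktemp)
  list_b=$(mktemp)
  (cd "$1" && find . -type f | sort) > "$list_a"
  (cd "$2" && find . -type f | sort) > "$list_b"
  diff "$list_a" "$list_b"
  rc=$?
  rm -f "$list_a" "$list_b"
  return $rc
}

# Demo with throwaway directories: b.txt exists only in the source
work=$(mktemp -d)
mkdir "$work/source" "$work/copy"
touch "$work/source/a.txt" "$work/source/b.txt" "$work/copy/a.txt"
compare_listings "$work/source" "$work/copy" || echo "listings differ"
```

Lines that start with < exist only in the first directory, and lines that start with > exist only in the second.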
DU: estimate file space usage
Create disk usage report showing how much space is used by files and folders in the specified directory
du -sch /source/path/*
-s display only a total for each argument
-c produce a grand total
-h print sizes in human readable format (e.g., 1K 234M 2G)
Create disk usage reports with results in gigabytes
du -sg /source/path/*
-s display only a total for each argument
-g show sizes in gigabytes
Create disk usage reports with results in gigabytes and sort in reverse numeric order (largest results appear first)
du -sg /source/path/* | sort -rn
-s display only a total for each argument
-g show sizes in gigabytes
sort invoke the sort command
-r sort in reverse order
-n compare sizes as numbers rather than as text
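One thing to watch when sorting size reports: sort compares lines as text unless it is given -n, so a size of 9 can land above a size of 100. The sketch below shows the difference using made-up sizes and paths in place of real du output.

```shell
# Sample "du -sg"-style lines: size in gigabytes, a tab, then a path.
# These values are invented for the demonstration.
report=$(printf '%s\t%s\n' 9 /source/path/audio 100 /source/path/video 25 /source/path/stills)

# Text sort: compares character by character, so "9" outranks "100"
text_sorted=$(printf '%s\n' "$report" | sort -r)

# Numeric sort: compares the values, so the largest directory comes first
numeric_sorted=$(printf '%s\n' "$report" | sort -rn)

printf 'sort -r:\n%s\n\nsort -rn:\n%s\n' "$text_sorted" "$numeric_sorted"
```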
FIND: search for files in a directory hierarchy
Find a file named "filename.txt"
find . -name "filename.txt"
Find any file with a ".txt" extension
find . -name "*.txt"
Find a file and ignore capitalization
find . -iname "filename.txt"
Find files whose names contain the word "animals"
find . -type f -name "*animals*"
-type f File is of type f (regular file)
Find directories whose names contain the word "banana"
find . -type d -name "*banana*"
-type d File is of type d (directory/folder)
The find command works really nicely with grep, which is used to add simple filters to find output.
Filter find results (for files with a .mov extension) to keep only those located in a "finals" folder
find . -iname "*.mov" | grep finals
grep keep only the lines of find output that contain the given text (finals)
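To see how find and grep work together, here is a small sketch that builds a throwaway folder tree and filters it. The folder and file names are invented for the demonstration.

```shell
# Build a throwaway tree with one finished file and one rough cut
work=$(mktemp -d)
mkdir -p "$work/project/finals" "$work/project/drafts"
touch "$work/project/finals/interview.MOV" \
      "$work/project/finals/notes.txt" \
      "$work/project/drafts/interview_rough.mov"

# -iname matches both .mov and .MOV; grep then keeps only the paths
# that pass through a folder named "finals"
matches=$(find "$work" -type f -iname "*.mov" | grep "/finals/")
printf '%s\n' "$matches"
```

Matching on "/finals/" with the slashes, rather than just "finals", avoids picking up files that merely have the word somewhere in their name.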
LS: list directory contents
Create a recursive index of files and display output as one file per line
ls -1R /source/path
-1 List one file per line
-R List subdirectories recursively
Create a list of all files (including invisible files) with detailed output
ls -alh /source/path
-a do not ignore entries starting with .
-l use a long listing format
-h with -l, print sizes in human readable format (e.g., 1K 234M 2G)
MKDIR: make directories
Create the directory(ies), if they do not already exist.
mkdir /source/path/mydirectory
RM: remove files or directories
⚠️ Use caution when executing RM scripts ⚠️
Remove a single file
rm filename.txt
⚠️ Recursively remove directories and their contents⚠️
rm -R /source/path
-R, --recursive remove directories and their contents recursively
⚠️ Run a batch process to delete lots of files⚠️
Use the "rm" command and pipe "yes" to agree to delete all files
yes "yes" | rm -R /path/to/directory/
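Since rm cannot be undone, it can help to preview what a recursive delete will touch before running it. The sketch below is one possible approach, not a standard tool: the helper name remove_tree and its "confirm" argument are made up for illustration.

```shell
# Hypothetical helper: print everything under the target directory, then
# delete it only when the second argument is the literal word "confirm".
remove_tree() {
  target=$1
  if [ ! -d "$target" ]; then
    echo "not a directory: $target" >&2
    return 1
  fi
  find "$target" -print              # preview everything that would be removed
  if [ "${2:-}" = "confirm" ]; then
    rm -R "$target"
  else
    echo "Preview only. Re-run with 'confirm' to delete." >&2
  fi
}

# Demo with a throwaway directory
work=$(mktemp -d)
mkdir -p "$work/old_exports"
touch "$work/old_exports/clip_01.mov"
remove_tree "$work/old_exports"            # lists contents, deletes nothing
remove_tree "$work/old_exports" confirm    # lists contents, then deletes
```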
TOUCH: create new file or change file timestamps
Create a new empty .txt file
touch myfile.txt
TREE: list contents of directories in a tree-like format
Create an indented index of directory contents
tree /source/path
2. Data Integrity
The following set of commands is used to ensure data integrity of digital collections. These include checksum algorithms to determine whether flipped bits, bit rot, or some form of intervention has compromised data (MD5 and SHA-1), a program that creates a “manifest” checksum log that can later be audited (hashdeep), a command to perform safe data transfers that preserves file integrity and creates event logs (rsync), and the Library of Congress’ bagit tool, which generates standard digital preservation packages for long term storage and tracking within a digital repository.
MD5: create a message digest based on the md5 algorithm
md5 /path/to/file/filename.html
SHA-1: Create a message digest based on the SHA-1 algorithm
openssl sha1 /path/to/file/filename.mov
HASHDEEP: Perform batch checksums
Create a checksum manifest file (.txt) for a folder and its contents. With no algorithm specified, hashdeep records MD5 and SHA-256 values:
hashdeep -bre source_01/* > knownhashes_source_01.txt
-b "bare mode" Strips any leading directory information from displayed filenames.
-r "recursive mode" checks all files in subsequent directories.
-e show progress while script is running.
Next, compare the text file of known hashes to the folder "source_02":
hashdeep -r -x -k knownhashes_source_01.txt source_02/
-x "negative matching" only files NOT in the list of known hashes are displayed (unique files only). [NOTE: use -m instead of -x to get a list of duplicate files]
-k load a file of known hashes (ex: knownhashes_source_01.txt from above)
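The same create-a-manifest-then-audit pattern can be sketched with the checksum tools that are usually already installed, which is handy on a machine without hashdeep. This is an illustration of the idea rather than a replacement for hashdeep. It assumes either sha256sum (GNU coreutils) or shasum (shipped with macOS) is available, and the folder and file names are invented.

```shell
# Pick whichever SHA-256 tool is installed (assumption: one of these exists)
if command -v sha256sum >/dev/null 2>&1; then
  hasher="sha256sum"
else
  hasher="shasum -a 256"
fi

# Throwaway collection with two small files
work=$(mktemp -d)
mkdir "$work/source_01"
printf 'frame data\n' > "$work/source_01/clip.txt"
printf 'caption data\n' > "$work/source_01/captions.txt"

# 1. Create the manifest: one "checksum  ./relative/path" line per file.
#    It is stored outside source_01 so that it does not list itself.
(cd "$work/source_01" && find . -type f -exec $hasher {} +) > "$work/manifest_source_01.txt"

# 2. Audit: -c re-hashes every file listed in the manifest and compares
audit() { (cd "$work/source_01" && $hasher -c "$work/manifest_source_01.txt"); }
audit && echo "fixity OK"

# 3. Simulate damage to one file and audit again
printf 'x' >> "$work/source_01/clip.txt"
audit || echo "fixity check failed"
```

Keeping the paths in the manifest relative means the same manifest can be used to audit a copy of the folder in a different location.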
RSYNC: Perform a safe file transfer that maintains data integrity
1. Copy files from one location to another
rsync /source-files-or-directories /destination-directory
2. Copy files from one location to another, preserve attributes, and issue verbose output and progress information
rsync -va --progress /source-files-or-directories /destination-directory
-v --verbose increase the amount of information you are given during the transfer
-a archive mode: copy directories recursively and preserve permissions, timestamps, ownership, and symbolic links
--progress print information showing the progress of the transfer.
3. Perform a "mock" transfer with verbose output. Use this command to see whether files in the source and destination locations will sync in the way you intended.
rsync -v --dry-run /source-files-or-directories /destination-directory
-v --verbose increase the amount of information you are given during the transfer
--dry-run perform a trial run with no changes made
4. Perform a "mock" transfer with verbose output. Use this command to see which files in the source directory do not exist in the destination
rsync -va --progress --dry-run /source-files-or-directories/ /destination-directory/
*Note: make sure source and destination directories end with a forward slash. To test output, consider adding a file in the source directory that you know does not exist in the destination.
-v --verbose increase the amount of information you are given during the transfer
-a archive mode: copy directories recursively and preserve permissions, timestamps, ownership, and symbolic links
--progress print information showing the progress of the transfer.
--dry-run perform a trial run with no changes made
BAGIT:
Create a full preservation package based on the Library of Congress' Bagit File Packaging Format, which contains a manifest of filenames, checksum values, and package identification information.
Create a bag in place
bagit baginplace path/to/bag
Verify that a bag is complete (all files are present)
bagit verifycomplete path/to/bag
Perform fixity checks on data contained in the bag
bagit verifypayloadmanifests path/to/bag
Verify that a bag is valid
bagit verifyvalid path/to/bag
3. Metadata & File Analysis
This section contains commands that read a file’s embedded “EXIF” metadata (exiftool), report a media file’s technical metadata (mediainfo), identify a file’s format (Apache Tika and siegfried), and show file size, permission, timestamp, and location information (stat).
APACHE TIKA: identify file formats
Create a file identity report
tika filename.txt
EXIFTOOL: read and write metadata in a file
Create a basic report of a file's "exif" metadata.
exiftool filename.mov
Create a basic report of a file's "exif" metadata and send the output to a .CSV file
exiftool -csv filename.mov > exif_output.csv
-csv format the output as comma-separated values
Create a basic exif metadata report for all files in a given directory
exiftool /path/to/files/
Delete all exif metadata from a file (null all metadata)
exiftool -exif:all= filename.jpg
NOTE: Not all file formats support this action.
MEDIAINFO: Analyze a file and create a report of audiovisual format technical metadata
Create a basic report of an audiovisual file's technical metadata
mediainfo filename.mov
Create a detailed report of an audiovisual file's technical metadata
mediainfo -f filename.mov
--Full, -f Full information Display (all internal tags)
SIEGFRIED: Analyze a file and create a file format identification report
Create a basic file format identification report using PRONOM, MIME-info, and FDD
sf filename.MXF
Calculate file checksum with hash algorithm and format identification report
sf -hash md5 filename.MXF
-hash Generate a checksum hash
md5 Use the md5 algorithm for this hash
Note: The version of Siegfried used in this example was installed using homebrew for macOS:
- brew install richardlehane/digipres/siegfried
STAT: display file or file system status
Show file size, permission, timestamp, and location information
stat /path/to/file.txt
Show created, access, and modification times for a given file
stat -x /path/to/file.txt
-x display created, access, and modification times
4. Web Archiving
The wget and youtube-dl programs set up web crawls to capture online content for web archiving and download video from the web, respectively.
WGET: download data from the web
Download website data recursively.
wget -r http://www.website.com/
-r Turn on recursive retrieving. The default maximum depth is 5.
Download website data with a password.
wget --user myusername --password mypassword http://www.website.com/
--user & --password Specify the username user and password password
Download one file type
wget -A .mp4 http://www.website.com/
-A Accept only files whose names end with the listed suffix (.mp4); combine with -r to apply the filter across a site
Example containing all options above
wget -r -A .mp4 --user myemail@email.com --password mypassword http://www.website.com
-r Turn on recursive retrieving.
-A Accept files containing .mp4
--user & --password Specify the username user and password
YOUTUBE-DL: Download video from youtube & other sites
Download video from YouTube
youtube-dl https://youtu.be/kJQP7kiw5Fk
Log in with a username, password, and two-factor authentication code to download video
youtube-dl -u myusername -p mypassword -2 mycode https://youtu.be/kJQP7kiw5Fk
-u Login with account username
-p Login with account password
-2 Two-factor authentication code
5. File Conversion, Transcoding, Disk Imaging & Disk Formatting
This set of scripts, including ffmpeg and imagemagick, is used to convert video, audio, or still image media formats.
For a comprehensive guide to using, installing, and understanding FFMPEG, check out ffmprovisr
FFMPEG: Media File Rewrap Command
"Rewrap" video and audio data from input file and drop it into new container or wrapper (.mov) in the output file.
ffmpeg -i input.mp4 -c:v copy -c:a copy output.mov
-c:v copy Copy the video stream without re-encoding
-c:a copy Copy the audio stream without re-encoding
FFMPEG: Set Target Data Rate (Bitrate)
Transcode video with a specific bitrate (ex: 2 Mbps)
ffmpeg -i input.mov -c:v libx264 -b:v 2M -maxrate 2M -bufsize 1M output.mp4
-i – input file name
-c:v libx264 – set video codec to h.264
-b:v 2M – set target video bitrate to 2 Mbps
-maxrate 2M – set maximum bitrate to 2 Mbps
-bufsize 1M – set the rate control buffer size to 1 Mbit, which governs how tightly the maximum bitrate is enforced
FFMPEG: Extract Audio
Extract audio from video to create an MP3 audio-only file
We use this frequently when sending interview audio to users with low-bandwidth internet connections.
ffmpeg -i input.mov -vn -c:a mp3 output.mp3
-i – input file name
-vn – exclude video
-c:a mp3 – set audio codec to MP3
FFMPEG: Transcode Video to Apple ProRes
Transcode input file to specified type of ProRes format
ffmpeg -i input.mov -c:v prores -profile:v $NUMBER -an output.mov
-an – exclude audio (omit this flag to keep the audio track)
For different flavors of ProRes replace $NUMBER with a single number from 0 to 3 where:
0 ProRes422 (Proxy)
1 ProRes422 (LT)
2 ProRes422 (Normal)
3 ProRes422 (HQ)
FFMPEG: Compressed 16:9 for Low Bandwidth Distribution
Transcode input file for low bandwidth internet connections. This recipe uses h.264/mp4 formatted video and aac audio to ensure compatibility with playback software. I use it as our compression file format for translators or subtitlers.
ffmpeg -i input.mp4 -c:v libx264 -c:a aac -b:v .5M -maxrate .5M -bufsize .25M -s 720x406 output.mp4
-c:v libx264 h.264 video codec
-c:a aac AAC audio codec
-b:v .5M bitrate .5mbps
-maxrate .5M maximum bitrate .5mbps
-bufsize .25M buffer size .25mbps
-s 720x406 frame size 720x406
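The 720x406 frame size is worth a quick look: an exact 16:9 frame at 720 pixels wide would be 405 pixels tall, and libx264 rejects odd dimensions when encoding the usual 4:2:0 video, which is presumably why the recipe rounds up to 406. The sketch below works out a usable height for other widths. The helper name height_for_16x9 is made up, and rounding up to the next even number is my assumption about how the value was chosen.

```shell
# Hypothetical helper: height for a 16:9 frame of the given width,
# rounded to the nearest whole number and then bumped up to an even value.
height_for_16x9() {
  width=$1
  height=$(( (width * 9 + 8) / 16 ))   # integer division, rounded to nearest
  if [ $(( height % 2 )) -ne 0 ]; then
    height=$(( height + 1 ))           # make the height even
  fi
  echo "$height"
}

# Print a frame size for a few common widths
for w in 1280 720 640; do
  echo "${w}x$(height_for_16x9 "$w")"
done
```

The printed value can be passed straight to the -s flag, for example -s 640x360.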
IMAGEMAGICK: Convert image formats
Convert a .TIFF file to .JPG
convert input.tiff output.jpg
Add a text caption to an image
convert leaf.gif -fill white -gravity North -pointsize 40 -annotate +0+100 'WHO DID THIS?' leaf-quote.gif
-fill white use white text
-gravity North orient text at top of frame
-pointsize 40 font size 40 point
-annotate +0+100 text to write into frame, placed at offset +0+100 relative to the gravity setting
DCFLDD: Create a Disk Image
Create an ISO Disk Image from an External Drive
- Use diskutil to determine the location of your source disk. Scan through diskutil output to identify your disk and find which number it has been assigned (example: /dev/disk2). For instructions to find your disk ID, consult "How to Find a Disk ID & Device Node Identifier in Mac OS X Command Line"
diskutil list
- Create an ISO from external disk using DCFLDD
dcfldd if=/dev/disk4 of=~/Desktop/mydiskimage.iso status=on sizeprobe=if hash=md5 md5log=~/Desktop/mydiskimage_md5.txt
if=/path/to/disk – source disk location
of=/path/to/ISO – destination location for ISO file
status=on – enable status and progress updates
sizeprobe=if – show progress updates based on source disk data size
hash=md5 – create md5 hash of source disk
md5log=/path/to/hash-log.txt – create a text file with source disk hash
DISKUTIL: Drive Formatting and Encryption
Erase and reformat a hard drive and add password protected encryption
- Use diskutil to determine the location of your source disk. Scan through diskutil output to identify your disk and find which number it has been assigned (example: /dev/disk2). For instructions to find your disk ID, consult "How to Find a Disk ID & Device Node Identifier in Mac OS X Command Line"
diskutil list
- Use diskutil to erase and reformat your drive for macOS APFS. Replace /dev/disk# with the identifier assigned to your disk in diskutil list (example: /dev/disk2):
diskutil eraseDisk apfs DiskName /dev/disk#
To add an encrypted volume to this drive use the command below, where disk# is determined from diskutil list and "YourVolumeNameHere" is your new volume name and "YourPasswordHere" is your encryption password
diskutil apfs addVolume disk# APFS YourVolumeNameHere -passphrase YourPasswordHere
6. Cool Tricks
A few cool tricks that will make your life easier.
SLEEP: delay for a specified amount of time
Wait for a given amount of time, then execute a command
sleep 15 && echo hello world
15 – number of seconds to sleep
&& – execute command after this command
WATCH: run a script repeatedly
Run a script at specified intervals over and over
watch -n 15 echo "I want pizza"
-n – set interval in seconds
15 – number of seconds
echo – run desired command (echo)
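As far as I know, watch is not one of the programs that comes pre-installed on macOS, so on a machine without it a plain loop with sleep gives a similar result. The sketch below is one way to write that loop. The helper name repeat_every is made up, and unlike watch it stops after a fixed number of runs instead of running until interrupted.

```shell
# Hypothetical helper: run a command a fixed number of times,
# pausing for the given number of seconds between runs.
# Usage: repeat_every SECONDS COUNT command [arguments...]
repeat_every() {
  interval=$1
  count=$2
  shift 2
  i=0
  while [ "$i" -lt "$count" ]; do
    "$@"                          # run the command exactly as it was given
    i=$(( i + 1 ))
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"           # pause before the next run
    fi
  done
}

# Print the message three times, one second apart
repeat_every 1 3 echo "I want pizza"
```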