TWO BIT DIGITAL PRESERVATION
BASH SCRIPT LIBRARY

PURPOSE
This library contains free and open source BASH-language scripts that I use in my daily work doing digital archiving and preservation. It contains example script "recipes" that you can run in your Mac or Linux Terminal. These scripts are used for digital collections management, data integrity and packaging, metadata and file analysis, web archiving, file conversion and transcoding, and basic bash operations. This library was developed using collections of born-digital audiovisual media (mostly video), but hopefully contains relevant recipes for any digital preservation project. The scripts can be used as single commands, together in combination using the "pipe" tool or double ampersand, or as shell script text files that can be used for automation.
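
As a rough sketch of the three ways of combining commands mentioned above (pipe, double ampersand, and a saved script file), the following can be pasted into a terminal. The file names and the temporary working directory are placeholders I made up for illustration; they are not part of the library.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so no real data is touched (placeholder paths).
workdir=$(mktemp -d)
touch "$workdir/a.mov" "$workdir/b.mov" "$workdir/notes.txt"

# 1. Pipe: send the output of one command to the next.
#    ls lists the files, grep -c counts the lines that end in ".mov".
mov_count=$(ls -1 "$workdir" | grep -c '\.mov$')
echo "mov files: $mov_count"

# 2. Double ampersand: run the second command only if the first one succeeds.
mkdir "$workdir/access" && cp "$workdir/a.mov" "$workdir/access/"

# 3. Script file: save commands in a text file and run it with bash.
cat > "$workdir/inventory.sh" <<'EOF'
#!/usr/bin/env bash
# Print a one-file-per-line listing of the directory given as argument 1.
ls -1 "$1"
EOF
bash "$workdir/inventory.sh" "$workdir/access"
```

Saving a recipe as a script file is what makes it repeatable: the same text file can be run again next month against a different path.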


COMMUNITY AND AUDIENCE
The resource was created for students and practitioners, including archivists, asset managers, conservators, and anyone with an interest in digital preservation. It is assumed that the user has basic familiarity with the BASH shell and command line. If you’re not familiar with BASH or the command line, check out Ryan’s Bash Scripting Tutorial, a great guide for beginners.

COMPUTING ENVIRONMENT
The commands below were tested on macOS and many of the programs used come pre-installed. Software that does not come pre-installed can be downloaded using the homebrew software repository. Programs installed with homebrew are noted.

JUST THE BASICS
This library contains only basic commands for digital preservation and asset management. It is designed as a reference and resource for beginners, and is meant to help familiarize introductory users with a few popular CLI tools that are used to manage digital collections and perform digital preservation.


SCRIPT LIBRARY INDEX

This library contains scripts used for digital preservation. Scripts are broken up into five categories:

1. Digital Asset Management & Reporting
cp, mv, diff, du, find, ls, mkdir, rm, touch, tree

2. Data Integrity
md5, sha-1, hashdeep, rsync, bagit

3. Metadata & File Analysis
Apache Tika, exiftool, mediainfo, siegfried, stat

 

4. Web Archiving
wget, youtube-dl

5. File Conversion & Transcoding
ffmpeg, imagemagick, ghostscript


1. Digital Asset Management & Reporting

This section contains commands that I use constantly in my day-to-day work to manage data on our servers in a sustainable way that reduces harm to data. These commands are used to copy files (cp), move files from one location to another (mv), run reports determining differences between CLI output data and other text files (diff), calculate the amount of space taken up by a given file or set of files on disk (du), locate files or directories in a given location (find), create new files and directories (touch, mkdir), remove files or directories (rm), and create indexes or lists of files in a given location (ls, tree).



CP: copy files and directories

Copy data from one location to another

cp /source/path /destination/path
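
The plain cp command above copies a single file. A minimal sketch of copying a whole directory and then confirming the copy matches might look like the following. The helper name copy_tree and the demo paths are hypothetical; -R, -p, and diff -r are standard options on both macOS and Linux.

```shell
#!/usr/bin/env bash
# copy_tree SOURCE DEST: copy a directory, then confirm the copy matches.
# (copy_tree is a made-up helper name, not a standard command.)
copy_tree() {
  local src="$1" dest="$2"
  # -R copies directories recursively; -p preserves timestamps and permissions.
  # DEST should not exist yet, otherwise cp places SOURCE inside it.
  cp -Rp "$src" "$dest" || return 1
  # diff -r compares both trees and exits non-zero if anything differs.
  diff -r "$src" "$dest" > /dev/null
}

# Demo with throwaway data.
demo=$(mktemp -d)
mkdir -p "$demo/source/sub"
echo "hello" > "$demo/source/sub/file.txt"
copy_tree "$demo/source" "$demo/copy" && echo "copy verified"
```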


MV: move (rename) files

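Within a single file system, mv only updates the file system index so that data appears in a new location; the data itself is not rewritten on storage. When the destination is on a different file system or drive, mv copies the data to the destination and then deletes the original.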

mv /source/path /destination/path
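
By default mv replaces an existing file at the destination without asking. One possible guard, sketched here with a hypothetical helper called safe_move and made-up file names, is to check for the destination first:

```shell
#!/usr/bin/env bash
# safe_move SOURCE DEST: move a file, but refuse to overwrite an existing DEST.
# (safe_move is a made-up helper name, not a standard command.)
safe_move() {
  local src="$1" dest="$2"
  if [ -e "$dest" ]; then
    echo "refusing to overwrite: $dest" >&2
    return 1
  fi
  mv "$src" "$dest"
}

# Demo with throwaway data.
demo=$(mktemp -d)
echo "original" > "$demo/keep.txt"
echo "new" > "$demo/incoming.txt"
safe_move "$demo/incoming.txt" "$demo/keep.txt" || echo "move skipped"
safe_move "$demo/incoming.txt" "$demo/renamed.txt" && echo "move done"
```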


DIFF: compare files line by line

Compare the text of "myfile_01.txt" to "myfile_02.txt"

diff myfile_01.txt myfile_02.txt
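
diff also reports its result through its exit status (0 when the files match, 1 when they differ, 2 on error), which is handy inside scripts. A small sketch, using two made-up file listings in a temporary directory:

```shell
#!/usr/bin/env bash
# Build two small listings to compare (contents are invented for the demo).
demo=$(mktemp -d)
printf 'a.mov\nb.mov\n' > "$demo/myfile_01.txt"
printf 'a.mov\nc.mov\n' > "$demo/myfile_02.txt"

# Save the differences to a report file and branch on diff's exit status.
# Note: this simple test also treats an error (status 2) as "different".
if diff "$demo/myfile_01.txt" "$demo/myfile_02.txt" > "$demo/changes.txt"; then
  result="identical"
else
  result="different"
fi
echo "$result"
cat "$demo/changes.txt"
```

In the report, lines starting with "<" come from the first file and lines starting with ">" come from the second.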
 


DU: estimate file space usage

Create disk usage report showing how much space is used by files and folders in the specified directory

du -sch /source/path/*

       -s display only a total for each argument
       -c produce a grand total       
       -h print sizes in human readable format (e.g., 1K 234M 2G)

Create disk usage reports with results in gigabytes

du -sg /source/path/*

       -s display only a total for each argument
       -g show sizes in gigabytes       
      

Create disk usage reports with results in gigabytes and sort in reverse numerical order (largest results appear first)

du -sg /source/path/* | sort -rn

       -s display only a total for each argument
       -g show sizes in gigabytes
       sort invoke the sort command
       -r sort in reverse order
       -n compare sizes as numbers rather than as text (without it, 9 would sort above 10)
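
The -g flag is specific to the macOS (BSD) version of du. A sketch that should behave the same on macOS and Linux uses -k (kilobytes) instead, and trims the report to the largest few items. The helper name largest_items and the demo data are my own placeholders.

```shell
#!/usr/bin/env bash
# largest_items DIR [COUNT]: list the COUNT largest items directly inside DIR.
# (largest_items is a made-up helper name; COUNT defaults to 5.)
largest_items() {
  local dir="$1" count="${2:-5}"
  # -s one total per argument, -k sizes in kilobytes.
  # sort -rn orders the totals numerically, largest first.
  du -sk "$dir"/* | sort -rn | head -n "$count"
}

# Demo with throwaway data: one folder of about 2 MB and one tiny folder.
demo=$(mktemp -d)
mkdir "$demo/big" "$demo/small"
head -c 2097152 /dev/urandom > "$demo/big/video.bin"
echo "x" > "$demo/small/note.txt"
largest_items "$demo" 1
```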



FIND: search for files in a directory hierarchy

Find a file named "filename.txt"

find . -name "filename.txt"


Find any file with a ".txt" extension

find . -name "*.txt"
       -name match pattern that is case sensitive; the * wildcard matches any run of characters

Find a file and ignore capitalization

find . -iname "filename.txt"
       -iname match pattern that is case insensitive

Find a file whose name contains the word "animals"

find . -type f -name "*animals*"
       -type f File is of type f or regular file
       -name match pattern that is case sensitive

Find a directory whose name contains the word "banana"

find . -type d -name "*banana*"
       -type d File is of type d or directory/folder
       -name match pattern that is case sensitive
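
find combines well with other tools for quick reporting. As a sketch, the hypothetical helper below (count_by_extension, with made-up demo folders) counts how many files of a given extension sit anywhere under a directory:

```shell
#!/usr/bin/env bash
# count_by_extension DIR EXT: count regular files under DIR ending in .EXT
# (count_by_extension is a made-up helper name.)
count_by_extension() {
  local dir="$1" ext="$2"
  # -type f limits results to regular files.
  # -iname matches case-insensitively, so .MOV and .mov are both counted.
  # tr strips the padding that some versions of wc add around the number.
  # Assumes file names do not contain newline characters.
  find "$dir" -type f -iname "*.${ext}" | wc -l | tr -d ' '
}

# Demo with throwaway data, including a folder name that contains a space.
demo=$(mktemp -d)
mkdir -p "$demo/tapes/reel 1"
touch "$demo/tapes/a.mov" "$demo/tapes/reel 1/b.MOV" "$demo/tapes/notes.txt"
count_by_extension "$demo" mov
```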



LS: list directory contents

Create a recursive index of files and display output as one file per line

ls -1R /source/path

       -1 List one file per line
       -R List subdirectories recursively

Create a list of all files (including invisible files) with detailed output

ls -alh /source/path

       -a do not ignore entries starting with .
       -l use a long listing format       
       -h with -l, print sizes in human readable format (e.g., 1K 234M 2G)



MKDIR: make directories

Create the directory(ies), if they do not already exist.

mkdir /source/path/mydirectory
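
Plain mkdir fails if the parent folder is missing or the directory already exists. A sketch of building a small package layout with the -p option, which creates missing parents and tolerates existing directories, follows. The folder names (collection_001, masters, access, metadata) are only examples.

```shell
#!/usr/bin/env bash
# Build a nested folder layout in a throwaway location (names are examples).
demo=$(mktemp -d)
for sub in masters access metadata; do
  # -p creates collection_001 on the first pass and reuses it afterwards.
  mkdir -p "$demo/collection_001/$sub"
done

# Running the same command again is harmless with -p.
mkdir -p "$demo/collection_001/masters"

ls -1 "$demo/collection_001"
```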


RM: remove files or directories

⚠️ Use caution when executing RM scripts ⚠️

Remove a single file

rm filename.txt

Recursively remove directories and their contents

rm -R /source/path

       -R, --recursive remove directories and their contents recursively
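
Because rm -R cannot be undone, it can help to wrap it in a few sanity checks when it is used in a script. This is only a sketch; safe_remove is a hypothetical helper name and the checks are a starting point, not a complete safeguard.

```shell
#!/usr/bin/env bash
# safe_remove DIR: recursively delete DIR after some basic sanity checks.
# (safe_remove is a made-up helper name, not a standard command.)
safe_remove() {
  local target="$1"
  # An empty variable or "/" is almost always a scripting mistake.
  if [ -z "$target" ] || [ "$target" = "/" ]; then
    echo "refusing to remove: '$target'" >&2
    return 1
  fi
  if [ ! -d "$target" ]; then
    echo "not a directory: $target" >&2
    return 1
  fi
  # Print what is about to be deleted, then delete it.
  find "$target" | sort
  rm -R "$target"
}

# Demo with throwaway data.
demo=$(mktemp -d)
mkdir -p "$demo/old_exports/sub"
touch "$demo/old_exports/sub/file.txt"
safe_remove "$demo/old_exports"
```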



TOUCH: create new file or change file timestamps

Create a new empty .txt file

touch myfile.txt



TREE: list contents of directories in a tree-like format

Create an indented index of directory contents

tree /source/path


2. Data Integrity

The following set of commands is used to ensure the data integrity of digital collections. These include checksum algorithms to determine whether flipped bits, bit rot, or some form of intervention has compromised data (MD5 and SHA-1), a program used to create an initial “manifest” checksum log that can later be audited (hashdeep), a command used to perform safe data transfers that will preserve file integrity and create event logs (rsync), and the Library of Congress’ bagit tool, which generates standard digital preservation packages for long term storage and tracking within a digital repository.



MD5: create a message digest based on the md5 algorithm

md5 /path/to/file/filename.html


SHA-1: Create a message digest based on the SHA-1 algorithm

openssl sha1 /path/to/file/filename.mov
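
A checksum is only useful once it is compared against a value recorded earlier. The sketch below stores a SHA-1 digest and checks it again later. The helper name sha1_of is hypothetical; it tries sha1sum (common on Linux), then shasum (shipped with macOS), then falls back to openssl, so that the same script can run on either system.

```shell
#!/usr/bin/env bash
# sha1_of FILE: print only the SHA-1 hex digest of FILE.
# (sha1_of is a made-up helper name.)
sha1_of() {
  if command -v sha1sum > /dev/null 2>&1; then
    sha1sum "$1" | awk '{print $1}'        # first field is the digest
  elif command -v shasum > /dev/null 2>&1; then
    shasum -a 1 "$1" | awk '{print $1}'    # first field is the digest
  else
    openssl sha1 "$1" | awk '{print $NF}'  # digest is the last field
  fi
}

# Demo: record a digest, then verify the file against it.
demo=$(mktemp -d)
printf 'abc' > "$demo/sample.txt"
stored=$(sha1_of "$demo/sample.txt")

if [ "$(sha1_of "$demo/sample.txt")" = "$stored" ]; then
  echo "fixity OK"
else
  echo "fixity FAILED"
fi
```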


HASHDEEP: Perform batch checksums

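Create an MD5 and SHA-256 checksum manifest file (.txt) for a folder and its contents. MD5 and SHA-256 are hashdeep's default algorithms; add -c md5,sha1 to compute MD5 and SHA-1 instead: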

hashdeep -bre source_01/* > knownhashes_source_01.txt

       -b "bare mode" Strips any leading directory information from displayed filenames.
       -r "recursive mode" checks all files in subsequent directories.       
       -e show progress while script is running.

Next, compare the text file of known hashes to the folder "source_02":

hashdeep -r -x -k knownhashes_source_01.txt source_02/
       -r "recursive mode" checks all files in subsequent directories.
       -x "negative matching" only files NOT in the list of known hashes are displayed (unique files only). [NOTE: use -m instead of -x to get a list of duplicate files]       
       -k load a file of known hashes (ex: knownhashes_source_01.txt from above)
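
To illustrate the manifest-then-audit idea without hashdeep installed, here is a stand-in sketch built from find and a SHA-256 tool (sha256sum on Linux, shasum on macOS). It is not hashdeep and does less: it reports whether files listed in the manifest are missing or changed, but it does not flag new files the way hashdeep -x does. The helper names hash256, make_manifest, and audit_manifest are hypothetical.

```shell
#!/usr/bin/env bash
# hash256 ARGS...: run whichever SHA-256 tool is installed.
hash256() {
  if command -v sha256sum > /dev/null 2>&1; then
    sha256sum "$@"
  else
    shasum -a 256 "$@"
  fi
}

# make_manifest DIR MANIFEST: write one "digest  ./path" line per file in DIR.
# MANIFEST should live outside DIR. Assumes file names contain no newlines.
make_manifest() {
  local dir="$1" manifest="$2"
  ( cd "$dir" && find . -type f | sort | while IFS= read -r f; do
      hash256 "$f"
    done ) > "$manifest"
}

# audit_manifest DIR MANIFEST: succeed only if every listed file still matches.
# MANIFEST must be an absolute path because the check runs from inside DIR.
audit_manifest() {
  local dir="$1" manifest="$2"
  ( cd "$dir" && hash256 -c "$manifest" > /dev/null 2>&1 )
}

# Demo with throwaway data.
demo=$(mktemp -d)
mkdir -p "$demo/source_01/sub"
echo "one" > "$demo/source_01/a.txt"
echo "two" > "$demo/source_01/sub/b.txt"
make_manifest "$demo/source_01" "$demo/knownhashes_source_01.txt"
audit_manifest "$demo/source_01" "$demo/knownhashes_source_01.txt" && echo "audit passed"

# Simulate damage to one file, then audit again.
echo "changed" > "$demo/source_01/a.txt"
audit_manifest "$demo/source_01" "$demo/knownhashes_source_01.txt" || echo "audit failed"
```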



RSYNC: Perform a safe file transfer that maintains data integrity

Copy files from one location to another

rsync /source-files-or-directories /destination-directory


Copy files from one location to another, preserve attributes, and issue verbose output and progress information

rsync -va --progress /source-files-or-directories /destination-directory

       -v --verbose increase the amount of information you are given during the transfer
       -a archive mode: copy directories recursively and preserve permissions, timestamps, ownership, and symbolic links
       --progress print information showing the progress of the transfer.

Perform a "mock" transfer with verbose output. Use this command to see whether files in the source and destination locations will sync in the way you intended.

rsync -v --dry-run /source-files-or-directories /destination-directory

       -v --verbose increase the amount of information you are given during the transfer
       --dry-run perform a trial run with no changes made
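
One detail that often surprises people is the trailing slash on the source path: with a slash, rsync copies the contents of the folder; without it, rsync copies the folder itself into the destination. A small sketch using made-up folders in a temporary directory (it skips itself if rsync is not installed):

```shell
#!/usr/bin/env bash
# Throwaway source and destination folders (names are examples).
demo=$(mktemp -d)
mkdir -p "$demo/source/sub" "$demo/dest_a" "$demo/dest_b"
echo "data" > "$demo/source/sub/file.txt"

if command -v rsync > /dev/null 2>&1; then
  # Preview first: -a archive mode, -v verbose, --dry-run changes nothing.
  rsync -av --dry-run "$demo/source/" "$demo/dest_a/"

  # Trailing slash on the source: copy the CONTENTS of source into dest_a.
  rsync -a "$demo/source/" "$demo/dest_a/"

  # No trailing slash: copy the folder itself, creating dest_b/source.
  rsync -a "$demo/source" "$demo/dest_b/"
else
  echo "rsync is not installed; skipping demo"
fi
```

Running the dry run before the real transfer is a cheap way to catch a misplaced slash before any data moves.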



BAGIT: Create and verify preservation packages

Create a full preservation package based on the Library of Congress' BagIt File Packaging Format, which contains a manifest of filenames, checksum values, and package identification information.

Create a bag in place

bagit baginplace path/to/bag


Verify that a bag is complete (all files are present)

bagit verifycomplete path/to/bag


Perform fixity checks on data contained in the bag

bagit verifypayloadmanifests path/to/bag


Verify that a bag is valid

bagit verifyvalid path/to/bag

3. Metadata & File Analysis

This section contains commands that analyze a file’s embedded “EXIF” metadata (exiftool), report a media file’s technical metadata (mediainfo), identify a file’s format (Apache Tika and siegfried), and show file size, permission, timestamp, and location information (stat).



APACHE TIKA: identify file formats

Create a file identity report

tika filename.txt


EXIFTOOL: read and write metadata in a file

Create a basic report of a file's "exif" metadata.

exiftool filename.mov


Create a basic report of a file's "exif" metadata and send the output to a .CSV file

exiftool -csv filename.mov > exif_output.csv

       -csv export the metadata in comma-separated (CSV) format


Create a basic exif metadata report for all files in a given directory

exiftool /path/to/files/


Delete all exif metadata from a file (null all metadata)

exiftool -exif:all= filename.jpg

       NOTE: Not all file formats support this action.



MEDIAINFO: Analyze a file and create a report of audiovisual format technical metadata

Create a basic report of an audiovisual file's technical metadata

mediainfo filename.mov


Create a detailed report of an audiovisual file's technical metadata

mediainfo -f filename.mov

       --Full, -f Full information Display (all internal tags)



SIEGFRIED: Analyze a file and create a file format identification report

Create a basic file format identification report using PRONOM, MIME-info, and FDD

sf filename.MXF


Calculate file checksum with hash algorithm and format identification report

sf -hash md5 filename.MXF

       -hash Generate a checksum hash
       md5 Use the md5 algorithm for this hash

Note: The version of Siegfried used in this example was installed on macOS using homebrew:

  • brew install richardlehane/digipres/siegfried


STAT: display file or file system status

Show file size, permission, timestamp, and location information

stat /path/to/file.txt


Show created, access, and modification times for a given file

stat -x /path/to/file.txt

       -x display created, access, and modification times
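
The stat options differ between macOS (BSD stat) and Linux (GNU stat), which matters if a script has to run on both. As a sketch, the hypothetical helper below prints a file's size in bytes by trying the GNU syntax first and falling back to the BSD syntax:

```shell
#!/usr/bin/env bash
# file_size FILE: print the size of FILE in bytes on either macOS or Linux.
# (file_size is a made-up helper name.)
file_size() {
  # GNU stat (Linux) uses -c with %s; BSD stat (macOS) uses -f with %z.
  stat -c %s "$1" 2> /dev/null || stat -f %z "$1"
}

# Demo with a throwaway three-byte file.
demo=$(mktemp -d)
printf 'abc' > "$demo/file.txt"
file_size "$demo/file.txt"
```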


4. Web Archiving

The wget and youtube-dl programs set up web crawls to capture online content for web archiving and download video from the web, respectively.



WGET: download data from the web

Download website data recursively.

wget -r http://www.website.com/

       -r Turn on recursive retrieving. The default maximum depth is 5.

Download website data with a password.

wget --user myusername --password mypassword http://www.website.com/

       --user & --password Specify the username and password to log in with

Download one file type

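wget -r -A .mp4 http://www.website.com/

       -r Turn on recursive retrieving; the accept list is only applied while following links
       -A Accept only files whose names end in the listed suffix (.mp4)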

Example containing all options above

wget -r -A .mp4 --user myemail@email.com --password mypassword http://www.website.com

       -r Turn on recursive retrieving.
       -A Accept only files whose names end in .mp4
       --user & --password Specify the username and password to log in with



YOUTUBE-DL: Download video from youtube & other sites

Download video from YouTube

youtube-dl https://youtu.be/kJQP7kiw5Fk


Log in with a username, password, and two-factor authentication code to download video

youtube-dl -u myusername -p mypassword -2 mycode https://youtu.be/kJQP7kiw5Fk
       -u Log in with account username
       -p Log in with account password
       -2 Two-factor authentication code (mycode is a placeholder for the code you receive)


5. File Conversion & Transcoding

This set of scripts, including ffmpeg and imagemagick, can be used to convert video, audio, or still image media formats.
For a comprehensive guide to using, installing, and understanding FFmpeg, check out ffmprovisr.



FFMPEG: Media File Rewrap Command

"Rewrap" video and audio data from input file and drop it into new container or wrapper (.mov) in the output file.

ffmpeg -i input.mp4 -c:v copy -c:a copy output.mov
       -c:v copy Copy the video stream as-is, without re-encoding
       -c:a copy Copy the audio stream as-is, without re-encoding
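
To rewrap a whole folder rather than one file, the command can be placed in a loop. The sketch below is a dry run: it only prints the ffmpeg command it would use for each .mp4 file, so it can be tried without ffmpeg installed. The helper name print_rewrap_commands and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# print_rewrap_commands DIR: print one ffmpeg rewrap command per .mp4 in DIR.
# (print_rewrap_commands is a made-up helper name; nothing is transcoded.)
print_rewrap_commands() {
  local dir="$1" input output
  for input in "$dir"/*.mp4; do
    # If no .mp4 files exist, the pattern stays unexpanded; skip it.
    [ -e "$input" ] || continue
    # Strip the .mp4 suffix and add .mov for the output name.
    output="${input%.mp4}.mov"
    printf 'ffmpeg -i "%s" -c:v copy -c:a copy "%s"\n' "$input" "$output"
  done
}

# Demo with empty placeholder files.
demo=$(mktemp -d)
touch "$demo/tape_01.mp4" "$demo/tape_02.mp4" "$demo/readme.txt"
print_rewrap_commands "$demo"
```

Once the printed commands look right, the printf line can be swapped for the actual ffmpeg call with the same "$input" and "$output" variables.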



FFMPEG: Transcode Video to Apple ProRes

Transcode input file to specified type of ProRes format

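ffmpeg -i input.mov -c:v prores -profile:v $NUMBER -an output.mov
       -c:v prores Encode the video stream with the ProRes codec
       -profile:v Select the ProRes flavor (see below)
       -an Drop all audio streams; the output contains video only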

For different flavors of ProRes replace $NUMBER with a single number from 0 to 3 where:
0 ProRes422 (Proxy)
1 ProRes422 (LT)
2 ProRes422 (Normal)
3 ProRes422 (HQ)



IMAGEMAGICK: Convert image formats

Convert a .TIFF file to .JPG

convert input.tiff output.jpg

Add a text overlay to a GIF (meme-generator)

convert leaf.gif -fill white -gravity North -pointsize 40 -annotate +0+100 'WHO DID THIS?' leaf-quote.gif