
TWO BIT DIGITAL PRESERVATION
BASH SCRIPT LIBRARY


WELCOME
This site contains a list of BASH commands used in day-to-day digital archiving and preservation workflows. These workflows include operations like digital asset management, reporting and logging, web archiving, preservation packaging, metadata and file analysis, data format conversion, video transcoding, and creation of fixity checks. This set of commands was designed specifically with digital audiovisual collections in mind (audio, video, graphics, and complex formats), but many are general and can be used with any data type.

COMMUNITY AND AUDIENCE
The resource was created for students and practitioners, including archivists, asset managers, conservators, and anyone with an interest in digital preservation. It is assumed that the user has basic familiarity with the BASH shell and command line. If you’re not familiar with BASH or the command line, check out Ryan’s Bash Scripting Tutorial, a great guide for beginners.


COMPUTING ENVIRONMENT
The commands below were tested on macOS, and many of the programs used come pre-installed. Software that does not come pre-installed can be downloaded using the Homebrew package manager. Programs installed with Homebrew are noted.

JUST THE BASICS
This library contains only basic commands for digital preservation and asset management. It is designed as a reference and resource for beginners, and is meant to help familiarize introductory users with a few popular CLI tools that are used to manage digital collections and perform digital preservation.


SCRIPT LIBRARY INDEX

This library contains scripts used for digital preservation. Scripts are broken up into six categories:

1. Digital Asset Management & Reporting
cp, mv, diff, du, find, ls, mkdir, rm, touch, tree

2. Data Integrity
md5, sha-1, hashdeep, rsync, bagit

3. Metadata & File Analysis
Apache Tika, exiftool, mediainfo, siegfried, stat

 

4. Web Archiving
wget, youtube-dl

5. File Conversion, Transcoding, Disk Imaging & Disk Formatting
ffmpeg, imagemagick, dcfldd, diskutil

6. Cool Tricks
sleep, watch


1. Digital Asset Management & Reporting

This section contains commands that I use constantly in my day-to-day work to manage data on our servers in a sustainable way that reduces harm to data. These commands are used to copy files (cp), move files from one location to another (mv), run reports determining differences between CLI output data and other text files (diff), calculate the amount of space taken up by a given file or set of files on disk (du), locate files or directories in a given location (find), create new files and directories (touch, mkdir), remove files or directories (rm), and create indexes or lists of files in a given location (ls, tree).



CP: copy files and directories

Copy data from one location to another

cp /source/path /destination/path
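
For whole folders, a recursive copy followed by a quick comparison can confirm that the copy matches the original. The sketch below is only an illustration: the working folder and file names are hypothetical and are created in a temporary location so nothing real is touched.

```shell
# Hypothetical working area created just for this demonstration.
workdir=$(mktemp -d)
mkdir -p "$workdir/source/tapes"
printf 'interview 01\n' > "$workdir/source/tapes/notes.txt"

# -R copies directories recursively; -p preserves timestamps and permissions.
cp -Rp "$workdir/source" "$workdir/destination"

# cmp -s prints nothing and exits 0 only when the two files are byte-identical.
if cmp -s "$workdir/source/tapes/notes.txt" "$workdir/destination/tapes/notes.txt"; then
  copy_status="identical"
else
  copy_status="different"
fi
echo "$copy_status"
```

Because the destination folder does not exist beforehand, cp creates it as a copy of the source folder.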


MV: move (rename) files

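Move updates the file system index so that data appears in a new location. When the source and destination are on the same volume, the data itself is not rewritten on storage. When they are on different volumes, mv copies the data to the destination and then deletes the original.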

mv /source/path /destination/path


DIFF: compare files line by line

Compare the text of "myfile_01.txt" to "myfile_02.txt"

diff myfile_01.txt myfile_02.txt
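
One way diff can support reporting is by comparing file inventories of two folders. The following is a minimal sketch under assumed names: the "original" and "backup" folders and their files are hypothetical and live in a temporary directory.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/original" "$workdir/backup"
touch "$workdir/original/a.mov" "$workdir/original/b.mov" "$workdir/backup/a.mov"

# Build a sorted inventory of each folder, with paths relative to that folder.
(cd "$workdir/original" && find . -type f | sort) > "$workdir/original.txt"
(cd "$workdir/backup" && find . -type f | sort) > "$workdir/backup.txt"

# Lines that diff marks with "<" appear only in the first inventory.
missing=$(diff "$workdir/original.txt" "$workdir/backup.txt" | grep '^<' | sed 's/^< //')
echo "$missing"
```

Here the report lists the one file that is present in "original" but absent from "backup".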
 


DU: estimate file space usage

Create disk usage report showing how much space is used by files and folders in the specified directory

du -sch /source/path/*

       -s display only a total for each argument
       -c produce a grand total       
       -h print sizes in human readable format (e.g., 1K 234M 2G)

Create disk usage reports with results in gigabytes

du -sg /source/path/*

       -s display only a total for each argument
       -g show sizes in gigabytes       
      

Create disk usage reports with results in gigabytes and sort in reverse numerical order (largest results appear first)

du -sg /source/path/* | sort -rn

       -s display only a total for each argument
       -g show sizes in gigabytes
       sort invoke the sort command
       -r sort in reverse order
       -n compare sizes as numbers rather than as text (otherwise 9 would sort above 10)
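
To pull out just the largest item in a folder, the sorted report can be trimmed with head. This is a sketch with made-up folders and placeholder data; it uses -k (kilobytes) instead of -g so that small test files still register a size.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/small" "$workdir/large"
# Write roughly 8 KB and 512 KB of placeholder data.
head -c 8192 /dev/urandom > "$workdir/small/a.bin"
head -c 524288 /dev/urandom > "$workdir/large/b.bin"

# -k reports sizes in kilobytes; sort -rn orders numerically, largest first.
# du separates size and path with a tab, so cut -f 2 keeps only the path.
largest=$(du -sk "$workdir"/* | sort -rn | head -n 1 | cut -f 2)
echo "$largest"
```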



FIND: search for files in a directory hierarchy

Find a file named "filename.txt"

find . -name "filename.txt"


Find any file with a ".txt" extension

find . -name "*.txt"
       -name match pattern that is case sensitive
       * wildcard that matches any run of characters (keep the pattern in quotes)

Find a file and ignore capitalization

find . -iname "filename.txt"
       -iname match pattern that is case insensitive

Find a file whose name contains the word "animals"

find . -type f -name "*animals*"
       -name match pattern that is case sensitive
       -type f File is of type f or regular file

Find a directory whose name contains the word "banana"

find . -type d -name "*banana*"
       -name match pattern that is case sensitive
       -type d File is of type d or directory/folder

The find command works really nicely with grep, which is used to add simple filters to find output.

Filter find results (for files with a .mov extension) so that only files located in a "finals" folder are shown

find . -iname "*.mov" | grep finals
       -iname match pattern that is case insensitive
       grep keep only the lines that contain the given text
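
The filtered list can also be saved and counted, which is handy for quick inventory reports. The sketch below assumes a hypothetical project layout with "finals" and "drafts" folders, built in a temporary directory.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/finals" "$workdir/drafts"
touch "$workdir/finals/film.MOV" "$workdir/finals/trailer.mov" \
      "$workdir/drafts/rough.mov" "$workdir/finals/notes.txt"

# Quote the pattern so the shell passes the asterisk to find unexpanded.
find "$workdir" -type f -iname "*.mov" | grep "/finals/" | sort > "$workdir/finals_mov.txt"

# One line per matching file, so wc -l gives the file count; tr strips padding.
count=$(wc -l < "$workdir/finals_mov.txt" | tr -d ' ')
echo "$count"
```

Matching on "/finals/" with the slashes keeps the filter tied to the folder name rather than to any file that happens to have "finals" in its name.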


LS: list directory contents

Create a recursive index of files and display output as one file per line

ls -1R /source/path

       -1 List one file per line
       -R List subdirectories recursively

Create a list of all files (including invisible files) with detailed output

ls -alh /source/path

       -a do not ignore entries starting with .
       -l use a long listing format       
       -h with -l, print sizes in human readable format (e.g., 1K 234M 2G)



MKDIR: make directories

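Create a new directory. mkdir reports an error if the directory already exists or if its parent directory is missing; add -p to create missing parent directories and to skip the error for existing ones.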

mkdir /source/path/mydirectory
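
As a small illustration of the -p option, the sketch below builds a nested folder structure in one step. The collection layout is hypothetical and is created inside a temporary directory.

```shell
workdir=$(mktemp -d)

# -p creates every missing parent folder along the path.
mkdir -p "$workdir/collection/2024/masters"

# Running the same command again is harmless because -p ignores existing folders.
mkdir -p "$workdir/collection/2024/masters"
```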


RM: remove files or directories

⚠️ Use caution when executing RM scripts ⚠️

Remove a single file

rm filename.txt

⚠️ Recursively remove directories and their contents⚠️

rm -R /source/path

       -R, --recursive remove directories and their contents recursively

⚠️ Run a batch process to delete lots of files ⚠️
Pipe the output of "yes" into the "rm" command so that any confirmation prompt is answered automatically

yes "yes" | rm -R /path/to/directory/
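
Because deletions cannot be undone, it can help to record what is about to be removed before running rm. The following is a cautious sketch, not a prescribed workflow: the "scratch" folder and log file name are hypothetical and live in a temporary directory.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/scratch/renders"
touch "$workdir/scratch/renders/temp_01.mov" "$workdir/scratch/temp_02.mov"

target="$workdir/scratch"   # hypothetical folder to delete

# Step 1: list everything that would be removed and keep the list as a record.
find "$target" | sort > "$workdir/deletion_log.txt"

# Step 2: delete only if the variable is non-empty and names a real directory,
# which guards against running rm on an unintended path.
if [ -n "$target" ] && [ -d "$target" ]; then
  rm -R "$target"
fi
```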



TOUCH: create new file or change file timestamps

Create a new empty .txt file

touch myfile.txt



TREE: list contents of directories in a tree-like format

Create an indented index of directory contents

tree /source/path


2. Data Integrity

The following set of commands is used to ensure the data integrity of digital collections. These include checksum algorithms to determine whether flipped bits, bit rot, or some form of intervention has compromised data (MD5 and SHA-1), a program that creates a “manifest” checksum log that can later be audited (hashdeep), a command to perform safe data transfers that preserves file integrity and creates event logs (rsync), and the Library of Congress’ bagit tool, which generates standard digital preservation packages for long term storage and tracking within a digital repository.



MD5: create a message digest based on the md5 algorithm

md5 /path/to/file/filename.html


SHA-1: Create a message digest based on the SHA-1 algorithm

openssl sha1 /path/to/file/filename.mov


HASHDEEP: Perform batch checksums

Create a checksum manifest file (.txt) for a folder and its contents. Unless told otherwise with the -c option, hashdeep computes MD5 and SHA-256 hashes:

hashdeep -bre source_01/* > knownhashes_source_01.txt

       -b "bare mode" Strips any leading directory information from displayed filenames.
       -r "recursive mode" checks all files in subsequent directories.       
       -e show progress while script is running.

Next, compare the text file of known hashes to the folder "source_02":

hashdeep -r -x -k knownhashes_source_01.txt source_02/
       -r "recursive mode" checks all files in subsequent directories.
       -x "negative matching" only files NOT in the list of known hashes are displayed (unique files only). [NOTE: use -m instead of -x to get a list of duplicate files]       
       -k load a file of known hashes (ex: knownhashes_source_01.txt from above)
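
Where hashdeep is not installed, the same create-then-audit idea can be approximated with the SHA-256 tools that ship with most systems. This is a sketch under assumptions, not a replacement for hashdeep: the folder, file names, and manifest name are hypothetical, and the script picks sha256sum or shasum depending on which is available.

```shell
# Pick whichever SHA-256 tool is installed (sha256sum on most Linux systems,
# shasum on macOS); both accept -c to check files against a manifest.
if command -v sha256sum >/dev/null 2>&1; then
  hasher="sha256sum"
else
  hasher="shasum -a 256"
fi

workdir=$(mktemp -d)
mkdir -p "$workdir/source_01"
printf 'frame data\n' > "$workdir/source_01/clip.mov"
printf 'caption data\n' > "$workdir/source_01/clip.srt"

# Create the manifest from inside the folder so it stores relative paths.
(cd "$workdir/source_01" && find . -type f -exec $hasher {} +) > "$workdir/manifest.txt"

# Simulate bit rot by changing one file, then audit against the manifest.
# The check exits non-zero on a mismatch, so "|| true" keeps the script going.
printf 'corrupted\n' > "$workdir/source_01/clip.srt"
audit=$(cd "$workdir/source_01" && $hasher -c "$workdir/manifest.txt" 2>/dev/null || true)
echo "$audit"
```

The audit output marks each file as OK or FAILED, so the altered file stands out immediately.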



RSYNC: Perform a safe file transfer that maintains data integrity


1. Copy files from one location to another (directories are skipped unless -r or -a is given)

rsync /source-files-or-directories /destination-directory


2. Copy files from one location to another, preserve attributes, and issue verbose output and progress information

rsync -va --progress /source-files-or-directories /destination-directory

       -v --verbose increase the amount of information you are given during the transfer
       -a archive mode: copy directories recursively and preserve file attributes (timestamps, permissions, ownership, and symbolic links)
       --progress print information showing the progress of the transfer.


3. Perform a "mock" transfer with verbose output. Use this command to see whether files in the source and destination locations will sync in the way you intended.

rsync -v --dry-run /source-files-or-directories /destination-directory

       -v --verbose increase the amount of information you are given during the transfer
       --dry-run perform a trial run with no changes made


4. Perform a "mock" transfer with verbose output. Use this command to see which files in the source directory do not exist in the destination

rsync -va --progress --dry-run /source-files-or-directories/ /destination-directory/

       *Note: make sure source and destination directories end with a forward slash. To test output, consider adding a file in the source directory that you know does not exist in the destination.
       -v --verbose increase the amount of information you are given during the transfer
       -a archive mode: copy directories recursively and preserve file attributes (timestamps, permissions, ownership, and symbolic links)
       --progress print information showing the progress of the transfer.
       --dry-run perform a trial run with no changes made       



BAGIT: Create and verify preservation packages

Create a full preservation package based on the Library of Congress' BagIt File Packaging Format, which contains a manifest of filenames, checksum values, and package identification information.

Create a bag in place

bagit baginplace path/to/bag


Verify that a bag is complete (all files are present)

bagit verifycomplete path/to/bag


Perform fixity checks on data contained in the bag

bagit verifypayloadmanifests path/to/bag


Verify that a bag is valid

bagit verifyvalid path/to/bag

3. Metadata & File Analysis

This section contains commands that analyze a file’s embedded “EXIF” metadata (exiftool), report a media file’s technical metadata (mediainfo), identify a file’s format (Apache Tika and siegfried), and show file size, permission, timestamp, and location information (stat).



APACHE TIKA: identify file formats

Create a file identity report

tika filename.txt


EXIFTOOL: read and write metadata in a file

Create a basic report of a file's "exif" metadata.

exiftool filename.mov


Create a basic report of a file's "exif" metadata and send the output to a .CSV file

exiftool -csv filename.mov > exif_output.csv

       -csv format the output as comma-separated values


Create a basic exif metadata report for all files in a given directory

exiftool /path/to/files/


Delete all exif metadata from a file (null all metadata)

exiftool -exif:all= filename.jpg

       NOTE: Not all file formats support this action.



MEDIAINFO: Analyze a file and create a report of audiovisual format technical metadata

Create a basic report of an audiovisual file's technical metadata

mediainfo filename.mov


Create a detailed report of an audiovisual file's technical metadata

mediainfo -f filename.mov

       --Full, -f Full information Display (all internal tags)



SIEGFRIED: Analyze a file and create a file format identification report

Create a basic file format identification report using PRONOM, MIME-info, and FDD

sf filename.MXF


Create a format identification report that also includes a checksum of the file

sf -hash md5 filename.MXF

       -hash Generate a checksum hash
       md5 Use the md5 algorithm for this hash

Note: The version of Siegfried used in this example was installed using homebrew for macOS:

  • brew install richardlehane/digipres/siegfried


STAT: display file or file system status

Show file size, permission, timestamp, and location information

stat /path/to/file.txt


Show access, modification, and change times for a given file in a verbose, labeled layout

stat -x /path/to/file.txt

       -x display verbose output, including access, modification, and change times


4. Web Archiving

The wget and youtube-dl programs set up web crawls to capture online content for web archiving and download video from the web, respectively.



WGET: download data from the web

Download website data recursively.

wget -r http://www.website.com/

       -r Turn on recursive retrieving. The default maximum depth is 5.

Download website data with a password.

wget --user myusername --password mypassword http://www.website.com/

       --user & --password Specify the username and password

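Download one file type

wget -r -A .mp4 http://www.website.com/

       -r Turn on recursive retrieving (the accept list is applied to files found while crawling)
       -A Accept only files whose names end with the stated suffix (.mp4)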

Example containing all options above

wget -r -A .mp4 --user myemail@email.com --password mypassword http://www.website.com

       -r Turn on recursive retrieving.
       -A Accept only files whose names end with .mp4
       --user & --password Specify the username and password



YOUTUBE-DL: Download video from youtube & other sites

Download video from YouTube

youtube-dl https://youtu.be/kJQP7kiw5Fk

Login with a username, password, and two-factor authentication to download video

youtube-dl -u myusername -p mypassword -2 mytwofactorcode https://youtu.be/kJQP7kiw5Fk
       -u Login with account username
       -p Login with account password
       -2 Two-factor authentication code


5. File Conversion, Transcoding, Disk Imaging & Disk Formatting

This set of scripts uses ffmpeg and imagemagick to convert video, audio, and still image formats, dcfldd to create disk images, and diskutil to format drives.
For a comprehensive guide to using, installing, and understanding ffmpeg, check out ffmprovisr



FFMPEG: Media File Rewrap Command

"Rewrap" video and audio data from input file and drop it into new container or wrapper (.mov) in the output file.

ffmpeg -i input.mp4 -c:v copy -c:a copy output.mov
       -i input file name
       -c:v copy – copy the video stream as-is, without re-encoding
       -c:a copy – copy the audio stream as-is, without re-encoding
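
When a whole folder needs rewrapping, a loop can apply the same command to every file. The sketch below is a dry run under stated assumptions: the tape_01.mp4 and tape_02.mp4 files are empty placeholders, the plan file name is hypothetical, and with DRY_RUN=1 the script only writes out the ffmpeg commands it would run.

```shell
workdir=$(mktemp -d)
touch "$workdir/tape_01.mp4" "$workdir/tape_02.mp4"   # placeholder inputs

# DRY_RUN=1 only records the commands; set it to 0 to actually run ffmpeg.
DRY_RUN=1

for input in "$workdir"/*.mp4; do
  # Swap the .mp4 extension for .mov to name the output file.
  output="${input%.mp4}.mov"
  if [ "$DRY_RUN" -eq 1 ]; then
    echo "ffmpeg -i $input -c:v copy -c:a copy $output"
  else
    ffmpeg -i "$input" -c:v copy -c:a copy "$output"
  fi
done > "$workdir/rewrap_plan.txt"

cat "$workdir/rewrap_plan.txt"
```

Reviewing the plan file first makes it easier to catch a wrong folder or extension before any files are written.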



FFMPEG: Set Target Data Rate (Bitrate)

Transcode video with a specific bitrate (ex: 2 Mbps)

ffmpeg -i input.mov -c:v libx264 -b:v 2M -maxrate 2M -bufsize 1M output.mp4


-i – input file name
-c:v libx264 – set video codec to h.264
-b:v 2M – set target video bitrate to 2 Mbps
-maxrate 2M – set maximum bitrate to 2 Mbps
-bufsize 1M – set the rate control buffer size to 1 megabit (this controls how often the encoder checks the bitrate; it is not an average bitrate)



FFMPEG: Extract Audio

Extract audio from video to create an MP3 audio-only file
We use this frequently when sending interview audio to users with low-bandwidth internet connections.

ffmpeg -i input.mov -vn -c:a mp3 output.mp3


-i – input file name
-vn – exclude video
-c:a mp3 – set audio codec to MP3



FFMPEG: Transcode Video to Apple ProRes

Transcode input file to a specified type of ProRes format. The -an option excludes audio, so the output file is video only.

ffmpeg -i input.mov -c:v prores -profile:v $NUMBER -an output.mov

For different flavors of ProRes replace $NUMBER with a single number from 0 to 3 where:
0 ProRes422 (Proxy)
1 ProRes422 (LT)
2 ProRes422 (Normal)
3 ProRes422 (HQ)



FFMPEG: Compressed 16:9 for Low Bandwidth Distribution

Transcode input file for low bandwidth internet connections. This recipe uses h.264/mp4 formatted video and aac audio to ensure compatibility with playback software. We use it as our compressed delivery format for translators and subtitlers.

ffmpeg -i input.mp4 -c:v libx264 -c:a aac -b:v .5M -maxrate .5M -bufsize .25M -s 720x406 output.mp4


-c:v libx264 h.264 video codec
-c:a aac AAC audio codec
-b:v .5M target video bitrate .5 Mbps
-maxrate .5M maximum bitrate .5 Mbps
-bufsize .25M rate control buffer size .25 megabits
-s 720x406 frame size 720x406



IMAGEMAGICK: Convert image formats

Convert a .TIFF file to .JPG

convert input.tiff output.jpg

Add a text overlay to a GIF (meme-generator)
convert leaf.gif -fill white -gravity North -pointsize 40 -annotate +0+100 'WHO DID THIS?' leaf-quote.gif

       -fill white use white text
       -gravity North orient text at top of frame
       -pointsize 40 font size 40 point
       -annotate +0+100 offset the text 100 pixels down from the top edge, followed by the text to write into the frame



DCFLDD: Create a Disk Image

Create an ISO Disk Image from an External Drive

  1. Use diskutil to determine the location of your source disk. Scan through diskutil output to identify your disk and find which number it has been assigned (example: /dev/disk2). For instructions to find your disk ID, consult, "How to Find a Disk ID & Device Node Identifier in Mac OS X Command Line"
    diskutil list
  2. Create an ISO from the external disk using DCFLDD, replacing /dev/disk2 with the identifier of your own disk
    dcfldd if=/dev/disk2 of=~/Desktop/mydiskimage.iso status=on sizeprobe=if hash=md5 md5log=~/Desktop/mydiskimage_md5.txt

       if=/path/to/disk – source disk location
       of=/path/to/ISO – destination location for ISO file       
       status=on – enable status and progress updates
       sizeprobe=if – show progress updates based on source disk data size
       hash=md5 – create md5 hash of source disk       
       md5log=/path/to/hash-log.txt – create a text file with source disk hash       


DISKUTIL: Drive Formatting and Encryption

Erase and reformat a hard drive and add password protected encryption

  1. Use diskutil to determine the location of your source disk. Scan through diskutil output to identify your disk and find which number it has been assigned (example: /dev/disk2). For instructions to find your disk ID, consult, "How to Find a Disk ID & Device Node Identifier in Mac OS X Command Line"
    diskutil list
  2. Use diskutil to erase and reformat your drive for macOS APFS. Replace /dev/disk# with the identifier reported by diskutil list (for example, /dev/disk2). This erases all data on the drive:

    diskutil eraseDisk apfs DiskName /dev/disk#

  3. To add an encrypted volume to this drive use the command below, where disk# is determined from diskutil list, "YourVolumeNameHere" is your new volume name, and "YourPasswordHere" is your encryption password

    diskutil apfs addVolume disk# APFS YourVolumeNameHere -passphrase YourPasswordHere


6. Cool Tricks

A few cool tricks that will make your life easier.



SLEEP: delay for a specified amount of time

Wait for a given amount of time, then execute a command

    sleep 15 && echo hello world

       15 – number of seconds to sleep
       && – run the next command only after the previous command finishes successfully
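
sleep is also useful inside a loop that waits for something to happen, such as a transfer finishing. The sketch below is illustrative only: the marker file name is hypothetical, and a background job stands in for the real transfer.

```shell
workdir=$(mktemp -d)
flag="$workdir/transfer_complete.txt"   # hypothetical marker file

# Simulate a background job that finishes after two seconds.
( sleep 2 && touch "$flag" ) &

# Check once per second, giving up after ten attempts.
attempts=0
until [ -e "$flag" ] || [ "$attempts" -ge 10 ]; do
  sleep 1
  attempts=$((attempts + 1))
done

if [ -e "$flag" ]; then
  result="ready"
else
  result="timed out"
fi
echo "$result"
```

The attempt limit keeps the loop from waiting forever if the marker file never appears.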
      


WATCH: run a script repeatedly

Run a script at specified intervals over and over

    watch -n 15 echo "I want pizza"

       -n – set interval in seconds
       15 – number of seconds       
       echo – run desired command (echo)
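
On systems where watch is not installed, a loop with sleep can repeat a command in a similar way. This is a rough stand-in rather than an equivalent: unlike watch, it does not clear the screen, and the sketch stops after a fixed number of runs (the interval and run count are arbitrary values chosen for the example).

```shell
interval=1   # seconds between runs
runs=3       # stop after this many runs; watch itself repeats until interrupted
count=0

while [ "$count" -lt "$runs" ]; do
  echo "I want pizza"
  count=$((count + 1))
  # Pause between runs, but not after the final one.
  if [ "$count" -lt "$runs" ]; then
    sleep "$interval"
  fi
done
```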