Research Notes: Cloud Workflows & Storage

I recently conducted a research project for my lovely employer to investigate cloud workflows for our video production teams and remote digital storage for our media archive. I thought it might be fun to share my findings from this project, as well as some info about the digital systems and tech architecture we ultimately decided to build.


My teammates and I left our New York City office in March of 2020 and have been working from home ever since. In three years, we doubled the size of our team and hired staff based in locations all over the world. Now editing video and archiving associated data at a global scale, we’d evolved and needed a technical architecture that would allow us to work fully remotely and sustainably… forever… apparently.

Video editing is complicated, particularly when editors and producers use collaborative workflows. Back in 2019 our entire technical architecture was housed in our New York City office, which was poorly resourced with major bandwidth and power limitations. We needed to move to a new, better equipped location like a data center, or possibly the cloud.

I quickly learned that basic remote editing tools are relatively inexpensive, but we needed something more elaborate. We intended to grow, both in terms of staff and data. We were already operating at a large scale with about two dozen users and several hundred terabytes of data. Our team is composed of video editors and producers who use complex content creation workflows, as well as archivists who ingest, describe, and preserve collections.

With that in mind, I conducted research to see which technical systems and workflows (if any) we could move to the cloud. My initial research, which began two years ago, showed that lifting our entire centralized video production workflow into the cloud simply wasn’t an option, partly because of the high cost of running large-scale video workflows in the cloud, and partly because of the bandwidth limitations of users working from home. We needed a terrestrial home base where we could send and receive data via physical hard drives, which might contain up to 4TB of data each. An “on-premise” or “local” solution was a secure option that allowed us to retain immediate physical access to our digital collections and storage servers. Instead of a full cloud architecture, we opted for a “hybrid” model, moving our entire technical architecture from our New York City office to a separate facility with redundant networking, redundant power, and a gigabit internet connection. This reliable hybrid setup affords the control over our collections that we need for basic operations and preservation, and it is also fully remotely accessible.

I knew as soon as I began researching remote workflows that the complex systems we needed were ready for us, just not in the cloud. With a little help from our vendors and developers, it was surprisingly easy to build a remote architecture — the most difficult part was disassembling our gigantic server racks and maneuvering them into the tiny elevators in our NYC office. In the end, we were very happy with our setup and left with little to do in “the cloud”.

We did, however, find one use for the cloud, which is to store a copy of our data in triplicate for disaster recovery in case our local copies meet an untimely end. In the next few years we hope to flip this model, move our centralized architecture to the cloud, and use our “on premise” data as a backup in case of disaster. Until then, here are the research notes I jotted down while exploring all available options…

Notes on cloud storage for archiving and preservation in 2023

  1. Local Copies: No matter which storage medium you choose, retain at least one “local” copy (hard drives, LTO, or RAID) of data you’ve stored in the cloud. We keep two copies of our data using mirrored local storage, and one “disaster recovery” copy in the cloud. We chose this model for three main reasons:
    - Geographic separation: Storing files in multiple locations ensures data will persist even if one copy is destroyed in a disaster event
    - Account/Admin Issues: Even reliable cloud storage companies may delete/lose your data due to technical errors, or more likely, billing or administrative errors

    - Data Integrity Checks: Fixity checks and other data integrity operations can’t run in the cloud at scale (yet), so data must be downloaded to run basic digital preservation operations/checks. Users can in some cases run small-scale data integrity spot checks in the cloud on several files at a time, but even this takes a decent amount of engineering to set up. Most (if not all) cloud storage providers perform internal data integrity checks — ask to see if you can access these reports. (A small checksum sketch follows this list.)

  2. Storage Tiers: Most cloud storage providers offer multiple tiers of storage with different levels of performance and different data access and recovery times. Data saved on higher performance “standard” tiers for active use is stored on servers running object storage (like Amazon’s S3). Lower performance tiers like Amazon’s Glacier and Glacier Deep Archive are believed to store data offline on LTO tape or possibly hard drives. Check to see if your cloud storage company will disclose details about the storage architecture it uses (Amazon AWS does not provide this info).

  3. “Egress” Fees: Most cloud services charge a monthly fee for storage, no fee for upload, and a significant fee for download or “egress”. Depending on your provider and the tier of storage you use, you may also be charged for access to your data. Each company should have its own storage cost calculator to help total these costs and make them transparent, but overall pricing can be difficult to determine. Make sure egress fees are reasonable and perform small-scale testing of pricing models (try out the service for a month or so on your own) before uploading your whole collection and fully committing to a service.

  4. Basic Recommendations: At the moment, for active collections that need to be accessed frequently, I’d recommend Backblaze. Their user interface is great, and storage and egress fees are way less expensive than AWS or Google Cloud. For longer term storage, AWS Glacier Deep Archive is the least expensive and most reliable option.

  5. Platform Testing: After narrowing down cloud storage providers, create a demo account and test the upload, download, and cloud access workflows on each platform you’re interested in using. Make sure you feel comfortable using the platform for everyday work or occasional access - whatever is needed for your project. Ask a few other users to test the platform as well and be sure to use extreme or “high scale” examples when testing (for example, upload and download large amounts of data or execute processes that require high computing performance).

  6. Scaling Up or Down: Based on testing and experimentation, assess whether your chosen cloud provider will be able to scale services up or down depending on your needs. Can you move data between performance tiers? Can you leave the service entirely, and if so, how long would it take to download/recover all of your data and end a service contract?

  7. Ingest / File Delivery / Upload: Cloud storage providers offer both virtual (internet-based) and physical (hard drive) options to send data to the cloud for long-term storage. Uploads may take place via a web interface, API, or file transfer protocols such as SFTP or SSH. You may also be able to transfer data to your provider using a physical device, such as an Amazon Snowball or Google Transfer Appliance. Providers will offer this option to cloud customers who have data sets that would take weeks or months to send over the internet. Note: When delivering files to the cloud for long-term preservation, transferring data to a physical device may help maintain data integrity since long network transfers can introduce data corruption.

  8. Data Security: Review each provider’s privacy policy and data security documentation. Contact representatives from each company and ask them to describe available security features and options to decide what works best for your institution.

  9. Integration: Consider whether you need to integrate your cloud storage with other tools. If so, good luck (lol)!

  10. Disaster Recovery & Insurance: If local or “on premise” copies of data are destroyed in an incident (like a natural disaster or fire), your insurance companies may pay cloud download/egress fees to recover data. Check with your insurer to find out. If egress fees are covered and you need to recover data in case of emergency/disaster, you only need to calculate monthly storage costs of your data (though you should make sure your institution could potentially afford to cover egress fees if insurance doesn’t come through for some reason).
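A minimal sketch of the kind of local fixity check mentioned in note 1, using the standard shasum utility; the folder and manifest names here are hypothetical placeholders, not a prescribed workflow:

# before uploading, record a checksum for every file in the folder
find ./collection_to_upload -type f -exec shasum -a 256 {} + > collection_manifest.sha256

# after downloading the data back from the cloud, verify each file against the manifest
shasum -a 256 -c collection_manifest.sha256

If every line reports “OK”, the downloaded copies match the files you originally uploaded.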

What is a Digital File?

Digital Files and Preservation
Digital files are composed of bits, or zeros and ones. Files can represent many types of information such as text, images, audio, or moving images. Over time, data can degrade – storage media can deteriorate and files may be corrupted when they are transferred from one place to another. Data integrity is easily compromised when digital preservation best practices are not followed. As a result, zeros may change into ones or vice versa ("flipped bits"), entire sections of files can be lost, and data can no longer be viewed or decoded. To better understand how data degrades, it’s useful to explore how it is structured, and how it is encoded (written) and decoded (viewed or played) by software programs and other systems.

How Data is Written: Bits, Bytes, Sectors, and Blocks
Digital computer files are sets of data saved on digital storage media, such as hard drives, SD cards, floppy disks, CDs or DVDs. Data is recorded as binary code, or patterns of zeros and ones written or "encoded" in standard arrangements that represent information.

Each zero or one is known as a "bit", and a group of eight bits is known as a "byte".
0 0 1 0 1 1 0 0 = 1 byte

A set of 512 bytes makes up a “sector”. These sectors are grouped together into “blocks”, which contain locations or addresses that computer programs use to locate data associated with individual files. Computer programs interpret these sets of data, recognizing patterns of zeros and ones that make up a file.

When computers interpret data patterns, they convey information to humans in the form of digital files, such as documents, spreadsheets, CAD files, or audiovisual files that we can see, hear, and understand.

Parts of a File
There are four parts of a file, the “file signature,” “header,” “body”, and “end”. The first part of a file is the "file signature", a short section of code defining the file's format. After that comes the "header", which contains a sequence of data that tells programs how to read or interpret the file. After the header comes the "body" or "payload", which is data that makes up the contents of the file, followed by a final set of data verifying the end of the file. We will explore each part of a file below.

1. File Signature
The file signature is a section of code stored at the beginning of a file. There are different file formats for different types of data, such as image, text, audio, graphics or video files. Digital files indicate their format to computer programs through a “file signature”. Computer programs can also determine a file’s format through its extension, which appears at the end of a file name, such as .txt, .mp3, or .jpg.
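For example, the signature of a PNG image always begins with the same eight bytes. A quick sketch using the xxd hex viewer and a hypothetical file named image.png:

# print the first sixteen bytes of the file in hexadecimal
xxd image.png | head -n 1
# typical output: 00000000: 8950 4e47 0d0a 1a0a 0000 000d 4948 4452  .PNG........IHDR

# the file utility identifies formats by reading signatures like this one
file image.png

The 89 50 4E 47 0D 0A 1A 0A sequence at the start is the PNG signature; the IHDR text that follows marks the beginning of the file's header.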

2. File Header
The second part of the file, known as the “header”, contains information that outlines the properties of a file. Computer programs follow the instructions defined by the file's header to interpret the data it contains. An example created by Ange Albertini (cited below) breaks down a .PNG image file's hexadecimal data, showing the code that corresponds to different parts of the file (signature, header, data or body, and end).

3. File Body
The bulk of data contained in a file is stored in the “file body”. Once a computer program knows what type of file it has encountered (by reading the file signature) and knows that file’s basic properties (from the header), it renders the file so that it can be read, seen, and heard by humans, or interpreted by programs or other entities. The body of a file contains the data or information that is meaningful and understandable. When a program renders a file, it presents code from the body of the file in a readable form. The code in the body of a file can be stored either as plain text or binary data.

4. End
The final section of code in a digital file signals that the file's data has ended.

Plain Text vs. Binary
There are two ways that data can be encoded and stored in a file: as plain text or as binary data. Plain text represents alphanumeric text characters, while binary data can represent anything, such as video, image, or audio data.

1. Plain Text: "Plain text" data is interpreted by computing systems as regular, alphanumeric text characters. At its foundation, plain text is represented by an 8-bit* binary code (byte) arranged in a standard form or “string” that corresponds to text characters, like letters and numbers. For example, in the plain text standard known as ASCII, the letter X is represented by the binary string “1011000”. ASCII defines a byte like this for every text character, including the twenty-six letters of the English alphabet.
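To see this encoding directly, the same xxd hex viewer shown earlier can dump the bits of a single character; a minimal sketch:

# print the binary encoding of the letter X (one byte)
printf 'X' | xxd -b
# prints something like: 00000000: 01011000    X

The byte 01011000 is the letter X, matching the 7-bit ASCII string above with a leading zero added to fill out the byte.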

In all, there are 128 standard ASCII characters (256 in the extended ASCII set), including all twenty-six letters of the English language alphabet, the numbers zero through nine, and many symbols, such as punctuation marks. The ASCII character set was created in 1963, and was inspired by codes used to transmit data with telegraph machines, such as Morse Code.

Computer programs can read plain text data without the help of file signatures or headers. Certain file types, such as .txt and .html files, contain plain text and can be read by software applications and programs even if they lack a signature or header. This raw text data is highly compatible with computing software and systems, and is considered more sustainable than binary data, which must be interpreted in a more comprehensive way to be rendered.

2. Binary Data: Binary data is much more flexible than plain text and can represent many kinds of information, such as images, audio, video, compressed files, or other complex media formats. Unlike plain text, these more diverse forms of information are composed of sets of zeros and ones that are read (or "decoded") by computer programs with the help of instructions stored in a file's signature and header. Binary data is introduced to computer programs as a linear data "stream", or sequence of bytes. It is decoded block by block according to the signature and header instructions, which specify, for example, that a section of data such as individual frames of video be played at a certain speed and in accordance with a certain video format.

Examples of binary files include .mov, .jpg, .mp3 and .cad files.

Metadata
Most file formats reserve a set of bytes just for file metadata. Metadata provides basic information about a file, such as its properties, or data that is meant to be human readable, like copyright information or GPS location data.

Computer File Architecture and Digital Preservation
In the field of digital preservation, knowledge of the structure and makeup of files can prevent data loss or corruption. It gives preservation practitioners insight into file composition, which can be used to troubleshoot issues that might arise. For example, a plain text file that is stored without a file extension, ex: "mytextfile", may not open in certain computer programs until it is renamed with a proper file extension, ex: "mytextfile.txt". Similarly, a video file that does not play could contain perfectly usable data, but its metadata header, which contains instructions for programs, may have been corrupted. Replacing a bad file header with a good one could make the file readable again.


*NOTE: ASCII is represented by 7 bits of data, with an added bit for “parity” that is used for error checking.

Citations:
ASCII Code - The extended ASCII table
Ange Albertini: Reverse engineering & technical illustrations
Ange Albertini: Basics of Computing
Barake Emmanuel: File systems — An in-depth intro
BBC Bitesize Guides: Data representation
Encyclopedia of Graphics Formats (mirror): File Headers
Wikipedia: Computer file

Why Use the Command Line for Digital Archiving and Preservation?


 Why should practitioners use command line programming for digital archiving and preservation?

In the field of digital archiving and preservation, learning command line skills is an imperative. Though coding skills are invaluable and practitioners are eager to learn more about programming, we often don’t know where to start. To help readers consider potential uses of programming in our work and bring perspective to this conversation, I’d like to try to answer the question, “Why use the command line for digital archiving and preservation?”

When I started working as a digital archivist in 2008, the virtual world seemed obscure and opaque to me. I felt uncomfortable making promises about the longevity of collections in my care. Over time, I found ways to make data more transparent, easier to understand, quantify, identify, and verify. I worked with other archivists, IT folks, and nerdy friends to find tools that transformed me from an uncertain, early-career professional to a confident manager and steward of digital collections. It took a long time, and I’m grateful for so many people who helped me along the way.

I started using command line utilities for a couple of reasons. First, I quickly discovered nothing else really worked. Second, I found UNIX and GNU command line tools to be extremely powerful when used together in combination. Third, I soon learned that digital collections are enormous and command line utilities can be leveraged to run automated or batch processes to get things done quickly, efficiently, and consistently. I elaborate on each of these points (and a few others) in the discussion that follows.


1. Nothing Works as Well as CLI Tools

Command line (CLI) programs differ from the software familiar to most users because they lack a graphical user interface (GUI). There are many excellent GUI software utilities for data management that operate on popular computing platforms such as Windows, Mac, and Linux, but GUI tools are limited. I personally only felt fully independent and capable as an asset manager after I started using CLI utilities. Long before I’d mastered even one or two CLI programs, I felt comfortable running basic commands that allowed me to reliably and safely move, monitor, and create documentation about data. With a little bit of practice using the BASH programming language, I was soon able to control assets in my care, and began making promises and reasonable projections about the longevity and stability of our digital collections. The command line opened up a world of transparency and certainty. The program that first hooked me was the data backup utility, rsync.

Rsync is included with most UNIX-like operating systems and was first released in 1996. It transfers and syncs data, and creates logs that give preservationists a paper trail of every action it runs, such as an itemized list of files transferred and a summary of a “job” or operation when it is complete. Rsync is much more stable and reliable than drag-and-drop file transfers. I started using it exclusively to transfer data because the transfers I made in macOS’s default Finder file manager frequently yielded errors for large sets of files (over 500GB). I would set up a Finder drag-and-drop transfer to run overnight, and four mornings out of five, I’d arrive at my work station to discover the entire operation had failed because of a network drop or corrupted file. A colleague who was familiar with BASH suggested I use the rsync utility, which not only handles network drops and can skip bad files instead of derailing an entire transfer operation, but also produces logs listing every file that was transferred, and any errors along the way.

The more I dug into rsync and its options or “flags”, the more I appreciated how powerful it is when managing digital collections. I use rsync to retain important metadata for each file (like date and time stamps that could otherwise be lost), create checksums, produce transfer logs of all operations, include or exclude certain files from my job, and test sync with a mock transfer or “dry run” that reports which files would and would not be moved during a transfer. There are other programs (some with graphical user interfaces) with similar functionality, but many use rsync as the underlying engine to run their code. As someone who needs a lot from a file transfer utility, having access to all of the options available in rsync affords great control and creativity, and has made digital collections in my care a lot safer.
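A sketch of the kind of rsync command described above; the source and destination paths, excluded file name, and log file name are hypothetical placeholders:

# -a preserves timestamps, permissions, and other file metadata during the transfer
# --dry-run reports what would be transferred without actually moving anything
# --log-file writes an itemized record of the operation
rsync -avh --checksum --exclude=".DS_Store" --log-file=transfer.log --dry-run \
  /Volumes/SOURCE_DRIVE/collection/ /Volumes/ARCHIVE/collection/

Removing --dry-run runs the real transfer; the log file then becomes the paper trail described above.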

2. Classic UNIX and GNU Utilities are Super Powerful 

In addition to rsync, there are lots of amazing CLI utilities that come pre-installed on most UNIX-like operating systems, and they are incredibly powerful when used together in combination. These programs (ls, cp, mv, diff, touch, date, find, chmod, grep, sort, to name a few) are not just disparate utilities; they are part of a suite of tools first distributed with the UNIX operating system in the late 1960s and early 1970s, and were specifically designed to be used together. Different sets of command line utilities have an interesting history of licensing and usage within various operating systems and platforms (UNIX, Linux, macOS, Windows, NeXT, and many others), but all are made to be interoperable and compatible with one another. When combined in creative ways, these tools can be leveraged to perform powerful operations.

For example, the find command can locate all files on a disk whose names contain the word “goat”, the du command can report the size of each of those files, and the sort command can order the results by size, from largest to smallest. I would run the find, du, and sort commands with their corresponding options or “flags” on a single line, connecting them (with a little help from xargs, which hands find's results to du) using the “pipe” character, which looks like this… |

find . -iname "*goat*" -print0 | xargs -0 du -sh | sort -rh

NOTE: For a full breakdown of the commands and options used in the script above (or any script) check out explainshell.

When used together in this way, command line utilities are interoperable and flexible. Operations like this can sometimes be performed in a limited way in GUI applications, but folks who work in the command line are limited only by their creativity and imagination.

When I began using BASH long ago, I assumed GUI tools for archiving and preservation would eventually “catch up” and recreate or mimic the amazing functionality of CLI tools. Although many incredible GUI applications have been developed in the last decade, I now believe there is no substitute for the power of interoperable command-line tools and their execution of extremely granular, custom operations. The ease-of-use and accessibility of GUI tools is unmatched, however, and hopefully a community of incredible developers will someday create graphical applications that prove me wrong.

3. Automation and Batch Processing

When it comes to design and execution of batch processes or automated workflows, the command line really shines. Operations that might normally be performed manually by a human operator over a matter of hours can be executed in seconds or minutes with a little command line scripting. 

A single script can be written to execute an operation on many files in a data set, or triggered to run in a “loop” to transform or manipulate data in a given directory or with a certain name. I often use “loop” scripts in my work to transcode all of the video files in a given folder from one file format to another. For example, we often receive AVCHD-formatted video as MTS files from freelance videographers that needs to be transcribed quickly. MTS files are large, take a long time to download, and generally won’t play in standard media player software, so they need to be converted for our transcriptionists. I use the following loop script to extract audio from MTS files and convert it to MP3 files, which I then send out for transcription:

#!/bin/sh
# loop over every .MTS file in the current directory and extract its audio as an .mp3
for f in *.MTS; do
  # "${f%.MTS}.mp3" strips the .MTS extension and replaces it with .mp3
  ffmpeg -i "$f" -c:a mp3 "${f%.MTS}.mp3"
done

Batch scripts like the loop script above can be run one at a time, or on a timed schedule with the cron utility. Cron can run a script every five minutes, every day at 1am, once a week, or whenever a user specifies. I run a “cron job” to create a nightly list of every project folder on our server and report the size of each one. I also set a second job to run one hour after my list is created to compare tonight’s list to last night’s (with the diff command), showing the change in size of each project from the day before. I do this to keep track of new data on our servers. I also often use the sleep command to trigger timed operations; it delays a command for a specified amount of time, for example three hours from now. I use sleep to transfer large sets of data to our servers overnight when there is no traffic from staff on our networks.
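A sketch of what crontab entries for jobs like these might look like; the script names and paths are hypothetical placeholders:

# minute hour day-of-month month day-of-week  command
# every night at 1:00am, list each project folder on the server and its size
0 1 * * * /usr/local/bin/list_project_sizes.sh
# an hour later, diff tonight's list against last night's to show what changed
0 2 * * * /usr/local/bin/compare_project_sizes.sh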

4. Other Reasons I <3 CLI for #DigiPres

Chatting with the Kernel & SSH
Additionally, I really love using the command line because it’s fun to communicate directly with a computer’s “kernel”. The kernel is considered the brain of your computer’s operating system, and it controls all hardware and software functions, relaying messages between them. When you send commands directly to the kernel as a user via commands in Terminal, you bypass the graphical user interface entirely (with the exception of the Terminal application itself). Working in this GUI-free environment, it’s much easier to manage assets remotely using CLI tools and SSH, or a “Secure Shell” remote session. I can open up an SSH Terminal session on any shared computer on a network and perform remote operations, which is super convenient. I can also use SSH to log into multiple clients simultaneously and run batch background processes on all of them.
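A couple of hypothetical examples of what this looks like in practice (the user name, host name, and paths are placeholders):

# open an interactive shell on a shared machine on the network
ssh archivist@mediaserver.example.org

# or run a single command remotely and print the result locally
ssh archivist@mediaserver.example.org 'du -sh /Volumes/archive/projects/*'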

Because Homebrew
I love using CLI tools in macOS specifically because I can easily download new utilities and programs using the Homebrew package manager. Homebrew installs and maintains command line programs and their dependencies (like ffmpeg libraries) that would be very difficult to install otherwise. Homebrew makes command line utilities accessible. Homebrew users can easily experiment with and test new programs, and the project has an enormous user base that sends feedback and improves the platform. If you’re interested in using CLI tools for macOS (and Linux), check out the Homebrew website:
https://brew.sh/
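Installing a new CLI tool with Homebrew is a one-liner; for example, assuming Homebrew itself is already installed:

# install ffmpeg along with all of its library dependencies
brew install ffmpeg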

Online Community and Support
Since I started my career as a digital archivist, our community has grown exponentially. Fellow geeks are collaborative, and eager to share their code, successes and ideas. I love that we get to work in a field where we don’t compete, and have so much to gain from working together. Some of my best friends were once mentors and collaborators, and the web is now full of forums, tweets, and code that I reference in my daily work.

🥓 Money 🥓
Last but not least, learning about CLI utilities and improving programming skills will make any digital archivist a more valuable worker. Our profession requires a great deal of training and expertise, and considering the low wages archivists are paid when compared with counterparts in tech, I believe we are seriously undervalued. Adding programming skills to a CV helps us make a case for a well-deserved, higher wage.

Although I’ve spent many years accumulating CLI skills, attending trainings, teaching CLI skills in courses and workshops, suffering through online tutorials, and trawling the internet and Stack Exchange for code and answers to all of my BASH-related questions, I still don’t consider myself a super advanced, high-level BASH programmer. Despite my deep interest and willingness to tool around, I think my lack of expertise is partly due to the fact that my non-expert skills have served me quite well on the job. With just a few commands in your pocket and a little patience, any digital archivist can successfully manage large sets of complex data with a lot more confidence.

How do you use command line programming in your work? Feel free to (kindly) drop a comment and share your (friendly) thoughts :)

How Does Metadata Work?

What is metadata?
Metadata is descriptive, technical, or administrative information about a file that is usually stored within a file itself. This "data about data" can be used in many different ways; it can be made visible to users, parsed into catalogs or databases, or interpreted by computers to execute commands.  Examples of metadata include a file's size, creation date, tags, keywords, or descriptive information, such as creator and copyright information and GPS coordinates. This article describes how metadata works, how it is generated, where it is stored, how it can be altered or changed, and issues inherent in preserving it.

Metadata is not always stored within a file; it can also be stored outside of a file, in the file system. In the context of our discussion about digital preservation, metadata comprises several sets of data, each derived from different sources and stored in various locations, that describe or supplement the primary data contained in a digital file.

Where does metadata come from?
Metadata is generated automatically by software, systems, and devices, or manually by users with software applications like Adobe Bridge, Photoshop, or command line programs like EXIFTOOL. When a file is created, transferred, or used in applications, its metadata properties may change or be lost without the user's input or knowledge. Metadata may be altered many times in a file's lifecycle, and these changes can impact a file's functionality and behavior, and the ability of applications to open or render it. These risks to sustainability are particularly relevant for complex media files that are part of interactive or dynamic systems with external dependencies, such as web archives.

A digital image, for example, may initially be produced by a camera that applies a set of metadata values to the files it creates, including the  camera's make and model, the shutter speed and aperture setting used for the photo, and innumerable other specifications. The file system in which the file was created stores its own set of metadata for a file, such as its total size (in KB, MB, or GB), creation, modification, and "last opened" time stamps, and location in the file system. Subsequently, a user may open the image in Adobe Photoshop and add IPTC metadata, including a caption or description, creator/author information, creator contact information, and a copyright statement. The file may then be uploaded to a social media service, such as Facebook, or sent to another user with Whatsapp, both of which strip all metadata from the file, replacing it with a new set of several values that convey very little information. Metadata is added and deleted from files as they move to different locations, and are rendered by different software programs and services.
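To see where a given file's metadata comes from, a tool like exiftool can list every value alongside the group (EXIF, IPTC, file system, and so on) that supplies it; a minimal sketch with a hypothetical file name:

# -G1 prints the group each tag belongs to; -a shows duplicate tags as well
exiftool -a -G1 photo_0123.jpg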

Where is metadata stored?
To understand risks posed to metadata and how metadata is attributed to files, it's important to know where different types of metadata are located. There are two places where file metadata is stored, internally (within the file itself), and externally (within the file system). Both types of metadata are discussed in detail below, with examples:

1. Internal File Format Metadata – Metadata stored within the file itself

The section of data at the beginning of a file, known as a header, usually contains a file's internal metadata. Header metadata would likely include administrative metadata (file format, permissions, creation date), technical metadata (bitrate, aspect ratio, frame rate), structural metadata (chapters in an ebook), descriptive metadata (keywords, description, caption), rights metadata, and instructions for applications about how the file can be rendered and used.

Header metadata is derived from different sources. In some cases, metadata fields are populated with values when data is created, and other fields may be added by a user later on. For example, when you take a photograph with a digital camera, the camera software may add the shutter speed, aperture setting, and GPS metadata to the photograph file's internal "EXIF" metadata. When editing the image file in Adobe Photoshop, a user may decide to add copyright information, a caption, and a title to the photo's metadata, which would be stored as "IPTC" metadata. The metadata that a file is able to accept and store depends on the file's format.

Different file types (document, image, video, etc.) and file formats (.pdf, .jpg, .mov) are built to support different sets of metadata. For more information about different types of metadata supported by various file formats, check out...

Because internal file format metadata is stored within the file itself, it is included in data integrity checks. If you create checksums for a file and then edit its internal metadata, subsequent checksums will fail.
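A small sketch of this effect, using shasum and exiftool with a hypothetical file name and metadata value:

shasum photo_0123.jpg                            # record the original checksum
exiftool -Artist="A. Archivist" photo_0123.jpg   # write internal (EXIF) metadata
shasum photo_0123.jpg                            # the checksum no longer matches

By default, exiftool also leaves behind a photo_0123.jpg_original backup of the untouched file.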

2. External File System Metadata – Metadata stored within the file system

Software developers who build operating systems (Mac, Linux, Windows) create sets of structures and rules for storing files on hard drives and other storage media. Together, these sets of structures and rules comprise "file systems", and each file system has its own way of creating and/or storing metadata. Computers running Mac operating systems, for example, currently use the HFS+ (or APFS) file system, Windows uses NTFS, and Linux uses ext4.

Of all of the metadata associated with a given file, metadata stored within the file system is most at risk. Not only is metadata stored differently in different file systems, it is also displayed and used differently. As mentioned previously, file system metadata is unlikely to be included in data integrity checks, so it can be altered in ways that are not immediately detectable to preservationists. File system metadata generally falls into three categories:

1. File Attributes: The file system refers to these attributes for basic file status information. This information includes, for example, whether or not the file has changed since the last time it was backed up, whether the file is hidden from regular users, whether or not it is a file the system needs to run, and whether the file is "read-only", meaning that it can't be altered by regular users.

2. Extended Attributes: Like basic file attributes (above), extended attributes are stored within the file system, but they allow non-file system metadata to be accessed by the file system. The file system may want to access metadata beyond what is provided in file attributes, and extended attributes supply this information. Examples of extended attributes include author or creator metadata, character encoding type, or checksum hashes. (A small example of listing extended attributes follows this list.)

3. File System Forks: Some operating systems use forks to contain metadata in addition to basic attributes and extended attributes. In the Windows operating system, which currently uses the NTFS file system, forks are known as Alternate Data Streams (ADS). In the Macintosh environment, forks for the HFS+ file system are called resource forks. Related Finder metadata also lives in invisible "sidecar" files such as .DS_Store, which store icon and view properties and allow users, for example, to highlight files in the Finder with color tags.
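As mentioned in item 2, extended attributes can be inspected from the command line; a small sketch on macOS with a hypothetical file name (on Linux the equivalent tool is getfattr):

# list the names and values of any extended attributes attached to the file
xattr -l interview_master.mov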

Metadata, Preservation, and Interoperability
The fluidity with which files are edited, transferred, and shared in various software programs and computing environments makes metadata difficult to preserve. Software applications, file systems, operating systems and computing platforms all manage metadata in different ways. Computer users can alter metadata unintentionally by saving and moving files in normal, everyday workflows without any knowledge of these alterations. These conditions are difficult to anticipate, and require digital archivists to design special data management workflows that do not alter file metadata. To create these workflows, preservationists must consider potential alterations to metadata that could be made with software applications and by file systems and storage.

Metadata Alterations via Software
Software developers decide how their programs handle metadata, and there is no universal agreement on how metadata specifications should be implemented for all files, especially file types that do not follow extremely well-defined metadata standards. Some applications are flexible in their approach to metadata, using general fields to document specific information, which may cause trouble for applications that were programmed in accordance with strict standards. Some applications are intentionally flexible and open, while others use proprietary sets of metadata that are not even visible to other programs and software. Like forensic investigators handling data that may be used as evidence, preservationists should take care not to alter files in their repositories when using software programs to manage or move digital objects.

Metadata Alterations via File Systems and Storage
When a file is moved from one file system to another (from Mac HFS+ to Windows FAT32, for example), its metadata may not be compatible. If files are simply dragged and dropped from one file system to another, critical metadata can be unintentionally discarded, changing file properties that are stored in metadata, such as the file's timestamps, owner, and permissions. Sometimes these changes alter the file in a way that is detectable by software that performs integrity checks on data, and sometimes not. For example, when a user changes a file's EXIF metadata in Adobe Bridge, the resultant file will fail integrity checks (checksums). However, when a file's modification date or permissions metadata is altered, file integrity checks will pass. Metadata discrepancies, therefore, can be totally undetectable in a digital preservation workflow, and preservationists can unintentionally alter file metadata without knowing it. To alleviate this problem, it is important to use safe file transfer methods when moving and storing files in preservation workflows.

Metadata Preservation Best Practices: Container Formats and Safe File Transfer
It is rarely possible to preserve data in the environment in which it was created (or may ideally be used). Digital archives often store thousands of files from dozens of computing environments and operating systems. Complex media files, which work together dynamically as part of larger systems, are especially at risk compared with stand-alone files that operate independently. Because metadata is easily lost when data is transferred between computing environments, it is necessary to use preservation workflow methods to retain file metadata. Methods to preserve metadata across different file systems include storing data in digital preservation container formats and/or using safe file transfer software.

Digital Preservation Storage Container Formats
Digital preservation container formats are single files that contain other files. Container formats preserve internal and external file metadata, assuming they are created and decoded (or opened) with the proper software or applications. The most well-known container format is the ZIP file format, which is often used to transmit sets of files over the web. Other formats include disk image formats (ISO, DMG), archive formats for specific content (TAR, WARC), compressed formats (ZIP, 7-ZIP), and forensics formats used to store evidentiary material for criminal investigations (AFF). To choose a container format, consider your current systems and infrastructure, the file formats you expect to preserve, and the needs of your designated community. For a full evaluation of container formats for digital archives, see Yunhyong Kim and Seamus Ross's Digital Forensics Formats: Seeking a Digital Preservation Storage Container Format for Web Archiving in the International Journal of Digital Curation.

Safe File Transfer
UNIX file transfer commands can be metadata aware, which means they recognize fragile metadata, such as metadata that is stored by file systems. Metadata-friendly UNIX commands such as rsync, cp, and mv can preserve different types of file metadata when run with the appropriate options. For users who prefer programs with a graphical user interface, check out Exactly, which is compatible with Windows and Mac operating systems.
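For example, the difference between a metadata-blind copy and a metadata-preserving copy can come down to a single flag; the paths here are hypothetical:

# plain copy: the new file typically receives a fresh timestamp and default permissions
cp /Volumes/DONATED_DRIVE/report.pdf /Volumes/ARCHIVE/

# -p asks cp to preserve the original mode, ownership, and timestamps
cp -p /Volumes/DONATED_DRIVE/report.pdf /Volumes/ARCHIVE/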

--------------------------------------------------------------------------------------

Works Cited
ExifTool Commands for Image Organization: Tags & groups: Where does image metadata come from?
http://ninedegreesbelow.com/photography/exiftool-commands.html#tags-groups

International Journal of Digital Curation
Digital Forensics Formats: Seeking a Digital Preservation Storage Container Format for Web Archiving
http://www.ijdc.net/index.php/ijdc/article/view/217
http://dx.doi.org/10.2218/ijdc.v7i2.227

Apple Pro Training Series: OS X Lion Support Essentials: Supporting and Troubleshooting OS X Lion: Data Management
By Kevin M. White, December 6, 2011
http://www.peachpit.com/articles/article.aspx?p=1762250&seqNum=5

Open Preservation Foundation: MIA: Metadata
http://openpreservation.org/blog/2013/06/12/mia-metadata/

Wikipedia: File Attribute
https://en.wikipedia.org/wiki/File_attribute

Wikipedia: Extended File Attribute
https://en.wikipedia.org/wiki/Extended_file_attributes

Wikipedia: File System Forks
https://en.wikipedia.org/wiki/Fork_(file_system)#Microsoft

The Effects of Metadata Corruption on NFS: University of Wisconsin
https://research.cs.wisc.edu/wind/Publications/NFSCorruption-storagess07.pdf

Dealing with Resource Forks and .DS_Store Files on non-Mac Volumes
http://lowendmac.com/2006/resource-forks-and-ds_store-files-on-non-mac-volumes/