What is metadata?
Metadata is descriptive, technical, or administrative information about a file that is usually stored within a file itself. This "data about data" can be used in many different ways; it can be made visible to users, parsed into catalogs or databases, or interpreted by computers to execute commands. Examples of metadata include a file's size, creation date, tags, keywords, or descriptive information, such as creator and copyright information and GPS coordinates. This article describes how metadata works, how it is generated, where it is stored, how it can be altered or changed, and issues inherent in preserving it.
Metadata is not always stored within a file; it can also be stored outside the file, in the file system. In the context of digital preservation, metadata comprises several sets of data, each derived from different sources and stored in various locations, that describe or supplement the primary data contained in a digital file.
Where does metadata come from?
Metadata is generated automatically by software, systems, and devices, or manually by users with software applications like Adobe Bridge and Photoshop, or command line programs like ExifTool. When a file is created, transferred, or used in applications, its metadata properties may change or be lost without the user's input or knowledge. Metadata may be altered many times in a file's lifecycle, and these changes can impact a file's functionality and behavior, and the ability of applications to open or render it. These risks to sustainability are particularly relevant for complex media files that are part of interactive or dynamic systems with external dependencies, such as web archives.
A digital image, for example, may initially be produced by a camera that applies a set of metadata values to the files it creates, including the camera's make and model, the shutter speed and aperture setting used for the photo, and many other specifications. The file system in which the file was created stores its own set of metadata for the file, such as its total size (in KB, MB, or GB), its creation, modification, and "last opened" time stamps, and its location in the file system. Subsequently, a user may open the image in Adobe Photoshop and add IPTC metadata, including a caption or description, creator/author information, creator contact information, and a copyright statement. The file may then be uploaded to a social media service, such as Facebook, or sent to another user with WhatsApp, both of which strip the original metadata from the file, replacing it with a new set of several values that convey very little information. Metadata is added to and deleted from files as they move to different locations and are rendered by different software programs and services.
Where is metadata stored?
To understand the risks posed to metadata and how metadata is attributed to files, it's important to know where different types of metadata are located. File metadata is stored in two places: internally (within the file itself) and externally (within the file system). Both types of metadata are discussed in detail below, with examples:
1. Internal File Format Metadata – Metadata stored within the file itself
The section of data at the beginning of a file, known as a header, usually contains a file's internal metadata. Header metadata is likely to include administrative metadata (file format, permissions, creation date), technical metadata (bitrate, aspect ratio, frame rate), structural metadata (chapters in an ebook), descriptive metadata (keywords, description, caption), rights metadata, and instructions for applications about how the file can be rendered and used.
Header metadata is derived from different sources. In some cases, metadata fields are populated with values when data is created, and other fields may be added by a user later on. For example, when you take a photograph with a digital camera, the camera software may add the shutter speed, aperture setting, and GPS metadata to the photograph file's internal "EXIF" metadata. When editing the image file in Adobe Photoshop, a user may decide to add copyright information, a caption, and a title to the photo's metadata, which would be stored as "IPTC" metadata. The metadata that a file is able to accept and store depends on the file's format.
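To make the idea of header metadata concrete, here is a minimal Python sketch that scans the segments at the start of a JPEG stream and reports which ones look like EXIF or IPTC/Photoshop blocks. It assumes the common JPEG layout (a start marker, then length-prefixed segments, then image data), and the sample bytes are a synthetic stand-in built for illustration, not a real photograph. The function and helper names are hypothetical.

```python
import struct


def list_jpeg_metadata_segments(data: bytes) -> list:
    """Return (label, payload_length) for metadata segments found before the image data."""
    if data[:2] != b"\xff\xd8":              # a JPEG stream begins with the SOI marker
        raise ValueError("not a JPEG stream")
    found = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:                # unexpected byte; stop scanning
            break
        marker = data[pos + 1]
        if marker == 0xDA:                   # SOS marker: compressed image data follows
            break
        # The 2-byte big-endian length counts itself plus the payload.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            found.append(("EXIF", len(payload)))
        elif marker == 0xED and payload.startswith(b"Photoshop 3.0\x00"):
            found.append(("IPTC/Photoshop", len(payload)))
        pos += 2 + length
    return found


def _segment(marker: int, payload: bytes) -> bytes:
    """Build one length-prefixed segment (hypothetical helper for the synthetic sample)."""
    return b"\xff" + bytes([marker]) + struct.pack(">H", len(payload) + 2) + payload


# Synthetic stand-in for a photo: SOI, an EXIF-style segment, an IPTC-style segment, SOS.
sample = (
    b"\xff\xd8"
    + _segment(0xE1, b"Exif\x00\x00" + b"\x00" * 8)
    + _segment(0xED, b"Photoshop 3.0\x00" + b"\x00" * 4)
    + b"\xff\xda"
)
print(list_jpeg_metadata_segments(sample))
```

The sketch only locates the metadata blocks; decoding the individual fields inside them is better left to a dedicated tool such as ExifTool.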
Different file types (document, image, video, etc.) and file formats (.pdf, .jpg, .mov) are built to support different sets of metadata. For more information about different types of metadata supported by various file formats, check out...
Because internal file format metadata is stored within the file itself, it is included in data integrity checks. If you create a checksum for a file and then edit its internal metadata, a newly computed checksum will no longer match the original, and the integrity check will fail.
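A short Python sketch of this behavior follows. The byte strings are hypothetical stand-ins for a file whose header carries a title field, and SHA-256 is an assumed choice of checksum algorithm; the point is only that any change to bytes inside the file changes the checksum.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 checksum of a byte string as hex."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical file contents: a made-up header field followed by the same payload.
original = b"TITLE=untitled\n" + b"image payload bytes"
edited = b"TITLE=Harbor at dawn\n" + b"image payload bytes"

stored_checksum = sha256_of(original)              # recorded when the file was ingested

print(sha256_of(original) == stored_checksum)      # unchanged bytes still match
print(sha256_of(edited) == stored_checksum)        # edited header no longer matches
```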
2. External File System Metadata – Metadata stored within the file system
Software developers who build operating systems (Mac, Linux, Windows) create sets of structures and rules for storing files on hard drives and other storage media. Together, these sets of structures and rules comprise "file systems", and each file system has its own way of creating and/or storing metadata. Computers running Mac operating systems, for example, currently use the HFS+ (or APFS) file system, Windows uses NTFS, and Linux uses ext4.
Of all of the metadata associated with a given file, metadata stored within the file system is most at risk. Not only is metadata stored differently in different file systems, it is also displayed and used differently. Unlike internal metadata, file system metadata is unlikely to be included in data integrity checks, so it can be altered in ways that are not immediately detectable to preservationists. File system metadata takes several forms:
1. File Attributes: The operating system refers to these attributes for file status information. This information includes, for example, whether or not the file has changed since the last time it was backed up, whether the file is hidden from regular users, whether or not it is a file the system needs to run, and whether the file is "read-only", meaning that it can't be altered by regular users.
2. Extended Attributes: Like basic file attributes (above), extended attributes are stored within the file system, but allow non-file system metadata to be accessed by the file system. The file system may want to access metadata beyond what is provided in file attributes, and extended attributes supply this information. Examples of extended attributes include author or creator metadata, character encoding type, or checksum hashes.
3. File System Forks: Some operating systems use forks to contain metadata in addition to basic attributes and extended attributes. In the Windows operating system, which currently uses the NTFS file system, forks are known as Alternate Data Streams (ADS). In the Macintosh environment, forks for the HFS+ file system are called Resource Forks. When files are copied to volumes that do not support forks, the forks are represented as invisible "sidecar" files whose names begin with "._", which carry information about icon properties and file properties that allow users, for example, to highlight files in the Finder with color tags. The related .DS_Store files are also invisible, and store Finder display settings for a folder.
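The basic file attributes in item 1 of the list above can be read directly from the file system. The following Python sketch uses the standard library's os.stat on a temporary file to show a file's size, modification time, permissions, and read-only status; the describe function and the file name are hypothetical. Extended attributes and forks are platform specific and are not covered by this sketch.

```python
import os
import stat
import tempfile
from datetime import datetime, timezone


def describe(path: str) -> dict:
    """Collect a few basic attributes that the file system stores about a file."""
    info = os.stat(path)
    return {
        "size_bytes": info.st_size,
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
        "permissions": stat.filemode(info.st_mode),
        "read_only": not (info.st_mode & stat.S_IWUSR),
    }


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "wb") as fh:
        fh.write(b"hello")

    before = describe(path)
    os.chmod(path, stat.S_IRUSR)                   # mark the file read-only for its owner
    after = describe(path)
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)    # restore write access before cleanup

print(before)
print(after)
```

None of these values live inside the file's own bytes, which is why they can change without the file's content changing.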
Metadata, Preservation, and Interoperability
The fluidity with which files are edited, transferred, and shared in various software programs and computing environments makes metadata difficult to preserve. Software applications, file systems, operating systems and computing platforms all manage metadata in different ways. Computer users can alter metadata unintentionally by saving and moving files in normal, everyday workflows without any knowledge of these alterations. These conditions are difficult to anticipate, and require digital archivists to design special data management workflows that do not alter file metadata. To create these workflows, preservationists must consider potential alterations to metadata that could be made with software applications and by file systems and storage.
Metadata Alterations via Software
Software developers decide how their programs handle metadata, and there is no universal agreement on how metadata specifications should be implemented for all files, especially file types that do not follow extremely well-defined metadata standards. Some applications are flexible in their approach to metadata, using general fields to document specific information, which may cause trouble for applications that were programmed in accordance with strict standards. Some applications are intentionally flexible and open, while others use proprietary sets of metadata that are not even visible to other programs and software. Like forensic investigators handling data that may be used as evidence, preservationists should avoid managing or moving digital objects with software programs that may alter the files in their repositories.
Metadata Alterations via File Systems and Storage
When a file is moved from one file system to another (from Mac HFS+ to Windows FAT32, for example), its metadata may not be compatible. If files are simply dragged and dropped from one file system to another, critical metadata can be unintentionally discarded, changing file properties that are stored in metadata, such as the file's timestamps, owner, and permissions. Sometimes these changes alter the file in a way that is detectable by software that performs integrity checks on data, and sometimes not. For example, when a user changes a file's EXIF metadata in Adobe Bridge, the resultant file will fail integrity checks (checksums). However, when a file's modification date or permissions metadata is altered, file integrity checks will pass. Metadata discrepancies, therefore, can be totally undetectable in a digital preservation workflow, and preservationists can unintentionally alter file metadata without knowing it. To alleviate this problem, it is important to use safe file transfer methods when moving and storing files in preservation workflows.
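The undetectable case can be demonstrated with a short Python sketch. It writes a temporary file, records a checksum, then changes the file's modification time and permissions and computes the checksum again. SHA-256 and the file name are assumptions made for illustration.

```python
import hashlib
import os
import stat
import tempfile


def file_sha256(path: str) -> str:
    """Checksum of the file's content only; file system metadata is not read."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "record.bin")
    with open(path, "wb") as fh:
        fh.write(b"preserved content")

    checksum_before = file_sha256(path)
    mtime_before = os.stat(path).st_mtime

    # Alter file system metadata only: set the times to 2000-01-01 UTC, make read-only.
    os.utime(path, (946684800, 946684800))
    os.chmod(path, stat.S_IRUSR)

    checksum_after = file_sha256(path)
    mtime_after = os.stat(path).st_mtime

    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)    # restore write access before cleanup

print(checksum_before == checksum_after)   # the integrity check still passes
print(mtime_before == mtime_after)         # yet the timestamp has changed
```

A workflow that needs to detect such changes has to record the file system metadata separately, for example alongside the checksum in a manifest.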
Metadata Preservation Best Practices: Container Formats and Safe File Transfer
It is rarely possible to preserve data in the environment in which it was created (or may ideally be used). Digital archives often store thousands of files from dozens of computing environments and operating systems. Complex media files, which work together dynamically as part of larger systems, are more at risk than stand-alone files that operate independently. Because metadata is easily lost when data is transferred between computing environments, it is necessary to use preservation workflow methods to retain file metadata. Methods to preserve metadata across different file systems include storing data in digital preservation container formats and/or using safe file transfer software.
Digital Preservation Storage Container Formats
Digital preservation container formats are single files that contain other files. Container formats preserve internal and external file metadata, assuming they are created and decoded (or opened) with the proper software or applications. The most well-known container format is the ZIP file format, which is often used to transmit sets of files over the web. Other formats include disk image formats (ISO, DMG), archive formats for specific content (TAR, WARC), compressed formats (ZIP, 7-Zip), and forensics formats used to store evidentiary material for criminal investigations (AFF). To choose a container format, consider your current systems and infrastructure, the file formats you expect to preserve, and the needs of your designated community. For a full evaluation of container formats for digital archives, see Yunhyong Kim and Seamus Ross's "Digital Forensics Formats: Seeking a Digital Preservation Storage Container Format for Web Archiving" in the International Journal of Digital Curation.
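As a small illustration of a container carrying file system metadata along with content, the Python sketch below packs a temporary file into a TAR archive with the standard library's tarfile module and reads back the metadata recorded for it. The file names are hypothetical, and the sketch only shows the basic fields TAR records (size, modification time, permission bits); whether a given container also captures extended attributes or forks depends on the format and the tool used to create it.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "letter.txt")
    with open(source, "wb") as fh:
        fh.write(b"Dear archivist")
    os.utime(source, (946684800, 946684800))    # give the file a known time: 2000-01-01 UTC

    archive = os.path.join(tmp, "package.tar")
    with tarfile.open(archive, "w") as tar:
        tar.add(source, arcname="letter.txt")   # stores content plus basic metadata

    with tarfile.open(archive, "r") as tar:
        member = tar.getmember("letter.txt")
        recorded_size = member.size
        recorded_mtime = member.mtime
        recorded_mode = oct(member.mode)

print(recorded_size, recorded_mtime, recorded_mode)
```

Because the timestamp travels inside the archive, it survives a move to a file system that would otherwise reset it, provided the archive is later extracted with a tool that restores it.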
Safe File Transfer
Many UNIX file transfer commands can be metadata aware, which means they can recognize and carry over fragile metadata, such as metadata that is stored by file systems. Commands such as rsync, cp, and mv are sensitive to different types of file metadata, but usually only when invoked with the appropriate options (for example, rsync -a or cp -p); a plain copy typically resets timestamps. For users who prefer programs with a graphical user interface, check out Exactly, which is compatible with Windows and Mac operating systems.
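The difference between a plain copy and a metadata-preserving copy can be sketched in Python, used here as a stand-in for the command line tools. The standard library's shutil.copy copies content and permission bits, while shutil.copy2 also attempts to preserve timestamps. The file names are hypothetical.

```python
import os
import shutil
import tempfile

OLD_TIME = 946684800    # 2000-01-01 UTC, used as a recognizable original timestamp

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "source.txt")
    with open(source, "wb") as fh:
        fh.write(b"content")
    os.utime(source, (OLD_TIME, OLD_TIME))

    # Plain copy: the new file receives a fresh modification time.
    plain = shutil.copy(source, os.path.join(tmp, "plain_copy.txt"))
    # Metadata-preserving copy: timestamps are carried over where the platform allows.
    careful = shutil.copy2(source, os.path.join(tmp, "careful_copy.txt"))

    plain_mtime = os.stat(plain).st_mtime
    careful_mtime = os.stat(careful).st_mtime

print(plain_mtime == OLD_TIME)      # original timestamp lost
print(careful_mtime == OLD_TIME)    # original timestamp kept
```

Even a metadata-preserving copy cannot carry over properties that the destination file system has no way to store, so verifying metadata after transfer remains worthwhile.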
ExifTool Commands for Image Organization: Tags & groups: Where does image metadata come from?
Yunhyong Kim and Seamus Ross, Digital Forensics Formats: Seeking a Digital Preservation Storage Container Format for Web Archiving, International Journal of Digital Curation
Kevin M. White, Apple Pro Training Series: OS X Lion Support Essentials: Supporting and Troubleshooting OS X Lion: Data Management, December 6, 2011
Open Preservation Foundation: MIA: Metadata
Wikipedia: File Attribute
Wikipedia: Extended File Attribute
Wikipedia: File System Forks
The Effects of Metadata Corruption on NFS: University of Wisconsin
Dealing with Resource Forks and .DS_Store Files on non-Mac Volumes