Digital computer files are sets of data stored in blocks that are saved on digital storage media, such as hard drives, floppy disks, or CDs and DVDs. Data is recorded as binary code, or patterns of zeros and ones written or "encoded" in standard arrangements that represent information.
Each zero or one is known as a "bit", and a set of eight bits is known as a "byte". Computer programs interpret bytes of data, recognizing patterns of zeros and ones that make up meaningful data.
0 0 1 0 1 1 0 0 = 1 byte
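As a minimal sketch of how a program might treat those eight bits as a single value, the Python snippet below converts the bit pattern into a number and back. The variable names are only illustrative.

```python
bits = "00101100"  # the eight bits shown above, written without spaces

# Interpret the bit pattern as a base-2 number: one byte holds a value from 0 to 255.
value = int(bits, 2)

print(value)                 # the byte as a decimal number
print(hex(value))            # the same byte in hexadecimal notation
print(format(value, "08b"))  # back to eight binary digits
```

Hexadecimal notation is shown because tools that display raw file data usually print each byte as two hexadecimal digits.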
Parts of a File
Bytes are arranged in sequences that computer programs can read and understand. The first part of a file is the "file signature", a short sequence of bytes that identifies the file's format. After that comes the "header", which contains a sequence of data that tells programs how to read or interpret the file. After the header comes the "body" or "payload", which is the data that makes up the contents of the file. In many formats, the body is followed by a final set of data marking the end of the file.
Signature and Header
There are different file formats for different types of data, such as image, text, audio, graphics or video files. A file's format is indicated by its signature, which sits at the beginning of the file, and by its extension, which appears at the end of the file name, such as .txt, .mp3, or .jpg. When a computer program reads a file, it looks at the file's extension and signature, then follows the instructions defined by the file's header to interpret the data it contains. The example to the left, created by Ange Albertini, shows a breakdown of a .PNG image file's hexadecimal data, displaying the code that corresponds to different parts of the file.
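To make the signature and header idea concrete, here is a rough sketch in Python that checks the 8-byte PNG signature and then decodes the first fields of the IHDR chunk that follows it. The helper names (`make_chunk`, `read_png_header`) and the tiny in-memory sample are assumptions made for illustration; the sample contains only a signature and a header, so it is not a complete, viewable image.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the 8 bytes that begin every PNG file


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: 4-byte length, 4-byte type, data, CRC-32 of type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def read_png_header(data: bytes) -> dict:
    """Check the signature, then decode the start of the IHDR chunk after it."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file: signature does not match")
    # Every chunk starts with a big-endian length and a 4-character type.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("expected a 13-byte IHDR chunk after the signature")
    # IHDR begins with width, height, bit depth and color type.
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return {"width": width, "height": height,
            "bit_depth": bit_depth, "color_type": color_type}


# Illustrative sample: signature plus an IHDR chunk for a 2 x 3 pixel, 8-bit RGB image.
ihdr_data = struct.pack(">IIBBBBB", 2, 3, 8, 2, 0, 0, 0)
sample = PNG_SIGNATURE + make_chunk(b"IHDR", ihdr_data)

print(read_png_header(sample))
```

To inspect a real file, the same function could be given the first few dozen bytes read from disk in binary mode, for example `open(path, "rb").read(33)`, where `path` stands for whatever file is being examined.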
Most file formats reserve a set of bytes just for file metadata. This metadata provides basic information about a file, such as its properties, or data that is meant to be human readable, such as copyright information or GPS location data.
File Body: Plain Text vs. Binary
There are two ways that data can be encoded and stored in a file: as plain text or binary data. Plain text data uses 8-bit codes to represent single text characters, such as individual letters or numbers. Binary data, on the other hand, can represent anything, such as video, image, or audio data, and is encoded according to standards defined by a set of instructions that a file carries with it wherever it goes.
PLAIN TEXT: "Plain text" data is interpreted by computing systems as regular, alphanumeric text characters. Each character is represented by an 8-bit binary code (ex: 01000001, the letter "A") that is arranged in a standard form that corresponds to text characters, like letters and numbers. Computer programs can read plain text data without the help of metadata or headers. ASCII, a plain text standard for encoding text characters, uses the following set of bytes to define the text characters that make up the twenty-six letters of the English alphabet:
In all, standard ASCII defines 128 characters, including all twenty-six letters of the English alphabet in upper and lower case, the numbers zero through nine, many symbols, such as punctuation marks, and a set of non-printing control codes. Each character is a 7-bit code that is usually stored in one 8-bit byte; 8-bit "extended ASCII" character sets use the extra bit to reach 256 characters. The ASCII character set was first published in 1963, and was inspired by earlier codes used to transmit data using telegraph machines.
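As a small sketch, the letter-to-byte mapping can be printed with Python's built-in `ord` function. The helper name `ascii_rows` is made up for this example.

```python
import string


def ascii_rows(letters: str) -> list:
    """Return (letter, decimal code, 8-bit pattern) for each ASCII letter given."""
    return [(ch, ord(ch), format(ord(ch), "08b")) for ch in letters]


# Upper case letters first, then lower case, as they appear in the ASCII table.
for letter, code, bit_pattern in ascii_rows(string.ascii_uppercase + string.ascii_lowercase):
    print(letter, code, bit_pattern)
```

The first row printed is `A 65 01000001`. Each lower case letter differs from its upper case form by a single bit, so `a` is `01100001`.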
Examples of plain text file types include .txt and .html files.
BINARY: Binary data is much more flexible than plain text, and is read by computer programs (or "decoded") with the help of instructions stored in a file's metadata or header. These instructions define how file data will be interpreted by a computer program. Types of data that require instructions include image files, time-based media files such as audio and video, compressed files, and other complex media files that can't be interpreted as plain text. This data, which may be fed to computer programs in linear data "streams" or sequences of bytes, is described by headers and other file metadata that describe how a file can be read or decoded. These headers tell programs to read files block by block, specifying, for example, that a section of data, such as individual frames of video, be played at a certain speed and in accordance with a certain video format. All of these parameters are defined in a binary file's metadata.
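The reason binary data needs instructions can be sketched with Python's `struct` module: the same four bytes produce different values depending on the layout a program is told to expect. The byte values here are arbitrary and chosen only for illustration.

```python
import struct

raw = bytes([0x00, 0x00, 0x01, 0x00])  # four arbitrary bytes from a hypothetical file

# Read as one 32-bit unsigned integer, most significant byte first (big-endian).
as_big_endian = struct.unpack(">I", raw)[0]

# Read as one 32-bit unsigned integer, least significant byte first (little-endian).
as_little_endian = struct.unpack("<I", raw)[0]

# Read as two separate 16-bit unsigned integers, big-endian.
as_two_shorts = struct.unpack(">HH", raw)

print(as_big_endian)     # 256
print(as_little_endian)  # 65536
print(as_two_shorts)     # (0, 256)
```

Nothing in the bytes themselves says which reading is right. That choice comes from the format's specification and the values recorded in the file's header.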
Examples of binary files include .mov, .jpg, .mp3 and .cad files.
Computer File Architecture and Digital Preservation
All files are composed of bytes that convey information we call data. Over time, data can degrade: storage media may deteriorate and files may be corrupted when they are transferred from one place to another. When digital preservation best practices are not followed, data integrity can be compromised. The result is that zeros may change to ones or vice versa ("flipped bits"), entire sections of data can be lost, and files can no longer be viewed or decoded.
In the field of digital preservation, knowledge of the structure and makeup of files can help prevent data loss or corruption. It also allows us to troubleshoot issues that might arise. For example, a plain text file that is stored without a file extension, e.g. "mytextfile", may not open in certain computer programs until it is renamed with a proper file extension, e.g. "mytextfile.txt". Similarly, a video file that does not play could contain perfectly usable data, but its metadata header, which contains instructions for programs, may have been corrupted. Replacing a bad file header with a good one could make the file readable again.
One way that preservationists can ensure data integrity is to run "fixity" checks, which use "checksums" to verify that the bytes that make up a file, and their order, remain unchanged. Fixity checks can be run on entire files, or on the streams of data that are contained within a file, such as frames of video. Audiovisual Archivist Dave Rice wrote an amazing post about these different types of checksums.
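A whole-file fixity check can be sketched with Python's standard `hashlib` module. This is only an illustration: SHA-256 is one common choice of checksum algorithm, and the helper names and sample bytes are made up for the example.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 checksum of the given bytes as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def flip_bit(data: bytes, byte_index: int, bit_index: int) -> bytes:
    """Return a copy of the data with exactly one bit changed, simulating corruption."""
    damaged = bytearray(data)
    damaged[byte_index] ^= 1 << bit_index
    return bytes(damaged)


original = b"mytextfile contents"     # stand-in for the bytes of a real file
stored_checksum = sha256_of(original)  # recorded when the file enters the archive

damaged = flip_bit(original, 0, 0)     # one flipped bit in the first byte

print(sha256_of(original) == stored_checksum)  # True: the file is unchanged
print(sha256_of(damaged) == stored_checksum)   # False: the change is detected
```

For files on disk, the bytes would be read in binary mode, typically in fixed-size pieces passed to the hash object's `update` method so that large video files do not have to fit in memory at once.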