What is a Digital File? — twoBit Preservation

Digital Files and Preservation
Digital files are composed of bits, or zeros and ones. Files can represent many types of information such as text, images, audio, or moving images. Over time, data can degrade – storage media can deteriorate and files may be corrupted when they are transferred from one place to another. Data integrity is easily compromised when digital preservation best practices are not followed. As a result, zeros may change into ones or vice versa ("flipped bits"), entire sections of files can be lost, and data can no longer be viewed or decoded. To better understand how data degrades, it’s useful to explore how it is structured, and how it is encoded (written) and decoded (viewed or played) by software programs and other systems.
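To make the idea of a "flipped bit" concrete, here is a minimal Python sketch. The helper name flip_bit and the sample bytes are illustrative assumptions, not part of any preservation tool; the point is only that inverting a single bit silently changes what the data says.

```python
def flip_bit(data: bytes, byte_index: int, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (bit_index 0 = least significant)."""
    damaged = bytearray(data)
    damaged[byte_index] ^= 1 << bit_index  # XOR with a one-bit mask inverts that bit
    return bytes(damaged)


original = b"PRESERVE"
damaged = flip_bit(original, byte_index=0, bit_index=1)

# 'P' is 01010000; with bit 1 flipped it becomes 01010010, which is 'R'.
print(original)  # b'PRESERVE'
print(damaged)   # b'RRESERVE'
```

In a text file the damage is visible as a wrong character; in a compressed image or video file, the same one-bit change may make a whole section undecodable.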

How Data is Written: Bits, Bytes, Sectors, and Blocks
Digital computer files are sets of data saved on digital storage media, such as hard drives, SD cards, floppy disks, CDs or DVDs. Data is recorded as binary code, or patterns of zeros and ones written or "encoded" in standard arrangements that represent information.

Each zero or one is known as a "bit", and a group of eight bits is known as a "byte".
0 0 1 0 1 1 0 0 = 1 byte
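As a small illustration, the byte shown above can be converted to a number and back in Python. The function names are hypothetical helpers written for this example.

```python
def bits_to_byte(bits: str) -> int:
    """Convert a string of eight 0/1 characters (spaces allowed) to its numeric value."""
    cleaned = bits.replace(" ", "")
    if len(cleaned) != 8 or set(cleaned) - {"0", "1"}:
        raise ValueError("expected exactly eight bits")
    return int(cleaned, 2)


def byte_to_bits(value: int) -> str:
    """Convert a value from 0 to 255 back to an eight-character bit string."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values from 0 to 255")
    return format(value, "08b")


print(bits_to_byte("0 0 1 0 1 1 0 0"))  # 44
print(byte_to_bits(44))                 # 00101100
```

Eight bits allow 256 distinct patterns, which is why a single byte can hold values from 0 to 255.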

A set of 512 bytes traditionally makes up a “sector” (many newer drives use 4,096-byte sectors). Sectors are grouped together into “blocks”, and each block has a location or address that the computer uses to find the data associated with an individual file. Computer programs interpret these sets of data, recognizing patterns of zeros and ones that make up a file.
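The arithmetic behind sectors and blocks can be sketched as follows. The 512-byte sector size comes from the text; the figure of eight sectors per block is an assumption chosen for illustration, since block size varies between file systems.

```python
SECTOR_SIZE = 512        # bytes per sector, as described above
SECTORS_PER_BLOCK = 8    # assumption for this example; real file systems vary


def storage_footprint(file_size: int) -> tuple:
    """Return (sectors, blocks) needed to hold file_size bytes, rounding up."""
    sectors = (file_size + SECTOR_SIZE - 1) // SECTOR_SIZE
    blocks = (sectors + SECTORS_PER_BLOCK - 1) // SECTORS_PER_BLOCK
    return sectors, blocks


# A 10,000-byte file does not fill its last sector or block, but still occupies them.
print(storage_footprint(10_000))  # (20, 3)
```

Because storage is allocated in whole units, a file usually occupies slightly more space on disk than its size in bytes suggests.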

When computers interpret data patterns, they convey information to humans in the form of digital files, such as documents, spreadsheets, CAD files, or audiovisual files that we can see, hear, and understand.

Parts of a File
A file has four parts: the “file signature”, the “header”, the “body”, and the “end”. The first part of a file is the "file signature", a short section of code defining the file's format. After that comes the "header", which contains a sequence of data that tells programs how to read or interpret the file. After the header comes the "body" or "payload", which is the data that makes up the contents of the file, followed by a final set of data marking the end of the file. We will explore each part of a file below.

1. File Signature
The file signature is a short section of code stored at the beginning of a file. There are different file formats for different types of data, such as image, text, audio, graphics or video files, and a file indicates its format to computer programs through its signature. Computer programs can also infer a file’s format from its extension, which appears at the end of a file name, such as .txt, .mp3, or .jpg, although the extension is only part of the name and can be changed or removed without altering the data.
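A minimal sketch of how a program might check a signature is shown below. The table holds a handful of widely documented signatures and is far from complete; the function name identify_format is a hypothetical helper for this example.

```python
from typing import Optional

# A few well-known signatures (the bytes that open every file of that format).
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"%PDF-": "PDF document",
}


def identify_format(data: bytes) -> Optional[str]:
    """Return a format name if data starts with a known signature, else None."""
    for signature, name in SIGNATURES.items():
        if data.startswith(signature):
            return name
    return None


print(identify_format(b"\x89PNG\r\n\x1a\n...rest of file..."))  # PNG image
print(identify_format(b"just some text"))                        # None
```

To check a file on disk, read its first few bytes in binary mode (for example with open(path, "rb")) and pass them to the function. Dedicated format identification tools use much larger signature databases than this.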

2. File Header
The second part of the file, known as the “header”, contains information that outlines the properties of a file. Computer programs follow the instructions defined by the file's header to interpret the data it contains. The example pictured above, created by Ange Albertini, shows a breakdown of a .PNG image file's hexadecimal data, displaying the code that corresponds to different parts of the file (signature, header, data or body, and end).
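Taking PNG as the example format, the header fields can be read with a short Python sketch. It assumes the standard PNG layout: an 8-byte signature followed by an IHDR chunk whose 13 data bytes hold the width, height, bit depth, and related properties. The function name and the sample bytes are made up for illustration.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_header(data: bytes) -> dict:
    """Parse the IHDR chunk that follows the PNG signature."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("signature mismatch: not a PNG file")
    # Each chunk starts with a 4-byte big-endian length and a 4-byte type.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("first chunk is not a valid IHDR header")
    fields = struct.unpack(">IIBBBBB", data[16:29])
    names = ("width", "height", "bit_depth", "color_type",
             "compression", "filter", "interlace")
    return dict(zip(names, fields))


# Hand-built sample: signature + IHDR for a 640 x 480 image (CRC bytes are placeholders).
sample = (PNG_SIGNATURE
          + struct.pack(">I4s", 13, b"IHDR")
          + struct.pack(">IIBBBBB", 640, 480, 8, 2, 0, 0, 0)
          + b"\x00" * 4)
print(read_png_header(sample)["width"], read_png_header(sample)["height"])  # 640 480
```

A viewer needs these values before it can make sense of the body: without the width and bit depth, the pixel data that follows is just an undifferentiated run of bytes.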

3. File Body
The bulk of the data contained in a file is stored in the “file body”. Once a computer program knows what type of file it has encountered (by reading the file signature), and knows that file’s basic properties (from the header), it renders the file so that it can be read, seen, and heard by humans, or interpreted by programs or other entities. The body of a file contains the data or information that is meaningful and understandable. When a program renders a file, it presents code from the body of the file in a readable form. The code in the body of a file can be stored either as plain text or binary data.

4. End
The final section of code in a digital file signals that the file data has ended.
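In the PNG format, for instance, the end is marked by an empty chunk named IEND. The sketch below builds that 12-byte marker and checks whether a byte string finishes with it; the function names are hypothetical, and other formats mark their end differently or not at all.

```python
import struct
import zlib


def png_end_marker() -> bytes:
    """Build the IEND chunk: zero length, the type 'IEND', and a CRC of the type."""
    chunk_type = b"IEND"
    crc = zlib.crc32(chunk_type) & 0xFFFFFFFF
    return struct.pack(">I", 0) + chunk_type + struct.pack(">I", crc)


def has_png_end(data: bytes) -> bool:
    """Return True if data finishes with a complete IEND chunk."""
    return data.endswith(png_end_marker())


print(png_end_marker().hex())                        # 0000000049454e44ae426082
print(has_png_end(b"...data..." + png_end_marker()))  # True
print(has_png_end(b"...data cut short"))              # False
```

A missing end marker is a useful warning sign that a file was truncated during a transfer.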

Plain Text vs. Binary
There are two ways that data can be encoded and stored in a file: as plain text or as binary data. Plain text represents alphanumeric text characters, while binary data can represent anything, such as video, image, or audio data.

1. Plain Text: "Plain text" data is interpreted by computing systems as regular, alphanumeric text characters. At its foundation, plain text is represented by 8-bit* binary codes (bytes), each arranged in a standard pattern or “string” that corresponds to a text character, like a letter or number. For example, in the plain text standard known as ASCII, the letter X is represented by the binary string “1011000” (stored in a byte as “01011000”). ASCII defines a code of this kind for each of the twenty-six letters of the English alphabet, in both upper and lower case.
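The ASCII codes for the letters can be listed with a few lines of Python. This is a small sketch using the built-in ord function; ascii_bits is a hypothetical helper name.

```python
import string


def ascii_bits(character: str) -> str:
    """Return the 7-bit ASCII code of a single character as a bit string."""
    code = ord(character)
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "07b")


print(ascii_bits("X"))  # 1011000

# Print the code for every upper-case letter, A through Z.
for letter in string.ascii_uppercase:
    print(letter, ascii_bits(letter))
```

Upper-case and lower-case letters have different codes: "A" is 1000001 and "a" is 1100001, a difference of a single bit.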

In all, standard ASCII defines 128 characters, including all twenty-six letters of the English alphabet in upper and lower case, the numbers zero through nine, and many symbols, such as punctuation marks. Eight-bit “extended ASCII” character sets add further characters, for a total of 256. The ASCII character set was first published in 1963, and was inspired by codes used to transmit data using telegraph machines, such as Morse Code.

Computer programs can read plain text data without the help of file signatures or headers. Certain file types, such as .txt and .html files, contain plain text and can be read by software applications and programs even if they lack a signature or header. This raw text data is highly compatible with computing software and systems, and is considered more sustainable than binary data, which needs more interpretation before it can be rendered.
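Because plain text needs no signature, a program can simply try to decode the bytes and see whether readable characters come out. The following is a rough sketch of that idea, limited to ASCII; the function name is hypothetical, and real tools also handle other text encodings such as UTF-8.

```python
def looks_like_ascii_text(data: bytes) -> bool:
    """Return True if every byte decodes to a printable ASCII character or common whitespace."""
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError:
        return False  # at least one byte is outside the 7-bit ASCII range
    return all(ch.isprintable() or ch in "\t\r\n" for ch in text)


print(looks_like_ascii_text(b"Dear archive,\nplease keep this file.\n"))  # True
print(looks_like_ascii_text(b"\x89PNG\r\n\x1a\n"))                        # False
```

The second example fails because the PNG signature begins with a byte that is not a valid ASCII character, which is one of the reasons that byte was chosen for the signature.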

2. Binary Data: Binary data is much more flexible than plain text and can represent many kinds of information, such as images, audio, video, compressed files, or other complex media formats. Like plain text, these more diverse forms of information are stored as sets of zeros and ones, but the patterns do not map directly onto text characters, so they are read by computer programs (or "decoded") with the help of instructions stored in a file's signature and header. Binary data is presented to computer programs as a linear data "stream", or sequence of bytes. It is decoded block by block according to the signature and header instructions, which might specify, for example, that individual frames of video be played at a certain speed and in accordance with a certain video format.
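Decoding a stream piece by piece can be sketched with PNG, where the data after the signature is a sequence of chunks, each consisting of a 4-byte length, a 4-byte type, the chunk data, and a 4-byte checksum. The function name and the sample stream are illustrative assumptions, and the sketch skips the checksum verification a real decoder would perform.

```python
import struct


def iter_png_chunks(data: bytes):
    """Yield (chunk_type, chunk_data) pairs by walking the stream in order."""
    offset = 8  # skip the 8-byte signature
    while offset + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[offset:offset + 8])
        body = data[offset + 8:offset + 8 + length]
        yield chunk_type.decode("latin-1"), body
        if chunk_type == b"IEND":
            break
        offset += 8 + length + 4  # header, data, then the 4-byte checksum


# Hand-built stream with three chunks; the checksum bytes are placeholders.
stream = (b"\x89PNG\r\n\x1a\n"
          + struct.pack(">I4s", 13, b"IHDR") + b"\x00" * 13 + b"\x00" * 4
          + struct.pack(">I4s", 3, b"IDAT") + b"abc" + b"\x00" * 4
          + struct.pack(">I4s", 0, b"IEND") + b"\x00" * 4)
print([name for name, _ in iter_png_chunks(stream)])  # ['IHDR', 'IDAT', 'IEND']
```

Each length field tells the reader where the next chunk begins, so a single corrupted length can throw off the decoding of everything that follows it.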

Examples of binary files include .mov, .jpg, .mp3 and .cad files.

Metadata
Most file formats reserve a set of bytes just for file metadata. Metadata provides basic information about a file, such as its properties, or data that is meant to be human readable, like copyright information or GPS location data.

Computer File Architecture and Digital Preservation
In the field of digital preservation, knowledge of the structure and makeup of files can help prevent data loss or corruption. It gives preservation practitioners insight into file composition, which can be used to troubleshoot issues that might arise. For example, a plain text file that is stored without a file extension, e.g. "mytextfile", may not open in certain computer programs until it is renamed with a proper file extension, e.g. "mytextfile.txt". Similarly, a video file that does not play could contain perfectly usable data, but its header, which contains instructions for programs, may have been corrupted. Replacing a bad file header with a good one could make the file readable again.
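As a deliberately simple illustration of this kind of repair, the sketch below restores a damaged signature on PNG data. It assumes the damage is confined to the first eight bytes and that the rest of the file is intact; the function name is hypothetical, and any such repair should be attempted on a copy, never on the only version of a file.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def restore_png_signature(data: bytes) -> bytes:
    """Return a copy of data with its first eight bytes replaced by the PNG signature."""
    # Sanity check: in a PNG, bytes 12-15 spell out the type of the first chunk, IHDR.
    if data[12:16] != b"IHDR":
        raise ValueError("data does not look like a PNG with a damaged signature")
    return PNG_SIGNATURE + data[8:]


damaged = b"\x00" * 8 + b"\x00\x00\x00\rIHDR" + b"...rest of file..."
repaired = restore_png_signature(damaged)
print(repaired[:8] == PNG_SIGNATURE)  # True
```

Damage to headers in video and other complex formats is usually harder to repair than this, because the header values must match the specific body data they describe.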

*NOTE: ASCII is represented by 7 bits of data, with an added bit for “parity” that is used for error checking.

Citations:
ASCII Code - The extended ASCII table
Ange Albertini: Reverse engineering & technical illustrations
Ange Albertini: Basics of Computing
Barake Emmanuel: File systems — An in-depth intro
BBC Bitesize Guides: Data representation
Encyclopedia of Graphics Formats (mirror): File Headers
Wikipedia: Computer file