Research Notes: Cloud Workflows & Storage

I recently conducted a research project for my lovely employer to investigate cloud workflows for our video production teams and remote digital storage for our media archive. I thought it might be fun to share my findings from this project, as well as some info about the digital systems and tech architecture we ultimately decided to build.


My teammates and I left our New York City office in March of 2020 and have been working from home ever since. In three years, we doubled the size of our team and hired staff based all over the world. Now editing video and archiving the associated data at a global scale, we needed a technical architecture that would allow us to work fully remotely and sustainably… forever… apparently.

Video editing is complicated, particularly when editors and producers use collaborative workflows. Back in 2019, our entire technical architecture was housed in our New York City office, which was poorly resourced, with major bandwidth and power limitations. We needed to move to a new, better-equipped location: a data center, or possibly the cloud.

I quickly learned that basic remote editing tools are relatively inexpensive, but we needed something more elaborate. We intended to grow, in both staff and data, and we were already operating at a large scale, with about two dozen users and several hundred terabytes of data. Our team comprises video editors and producers who use complex content creation workflows, as well as archivists who ingest, describe, and preserve collections.

With that in mind, I conducted research to see which technical systems and workflows (if any) we could move to the cloud. My initial research, which began two years ago, showed that lifting our entire centralized video production workflow into the cloud simply wasn’t an option, partly because of the high cost of running large-scale video workflows in the cloud and partly because of the bandwidth limitations of users working from home. We needed a terrestrial home base where we could send and receive data on physical hard drives, which might contain up to 4TB of data each. An “on-premise” or “local” solution was a secure option that allowed us to retain immediate physical access to our digital collections and storage servers.

Instead of a full cloud architecture, we opted for a “hybrid” model, moving our entire technical architecture from our New York City office to a separate facility with redundant networking, redundant power, and a gigabit internet connection. This reliable hybrid setup gives us the collection control we need for basic operations and preservation, and is also fully remotely accessible.

I knew as soon as I began researching remote workflows that the complex systems we needed were ready for us, just not in the cloud. With a little help from our vendors and developers, it was surprisingly easy to build a remote architecture — the most difficult part was disassembling our gigantic server racks and maneuvering them into the tiny elevators in our NYC office. In the end, we were very happy with our setup and left with little to do in “the cloud”.

We did, however, find one use for the cloud: storing a third copy of our data for disaster recovery, in case our local copies meet an untimely end. In the next few years we hope to flip this model, move our centralized architecture to the cloud, and use our “on premise” data as a backup in case of disaster. Until then, here are the research notes I jotted down while exploring all available options…

Notes on cloud storage for archiving and preservation in 2023

  1. Local Copies: No matter which storage medium you choose, retain at least one “local” copy (hard drives, LTO, or RAID) of data you’ve stored in the cloud. We keep two copies of our data using mirrored local storage, and one “disaster recovery” copy in the cloud. We chose this model for three main reasons:
    - Geographic separation: Storing files in multiple locations ensures data will persist even if one copy is destroyed in a disaster event.
    - Account/Admin Issues: Even reliable cloud storage companies may delete or lose your data due to technical errors or, more likely, billing or administrative errors.
    - Data Integrity Checks: Fixity checks and other data integrity operations can’t yet run in the cloud at scale, so data must be downloaded to run basic digital preservation checks (a minimal local fixity-check sketch appears after these notes). In some cases users can run small-scale spot checks in the cloud on several files at a time, but even this takes a decent amount of engineering to set up. Most (if not all) cloud storage providers perform internal data integrity checks; ask whether you can access these reports.

  2. Storage Tiers: Most cloud storage providers offer multiple tiers of storage with different levels of performance and different data access and recovery times. Data saved to higher-performance “standard” tiers for active use is stored on always-online object storage servers (Amazon’s S3, for example). Lower-performance tiers like Amazon’s Glacier and Glacier Deep Archive are believed to be stored offline on LTO tape or possibly hard drives. Check to see whether your cloud storage company will disclose details about the storage architecture it uses (Amazon AWS does not provide this info).

  3. “Egress” Fees: Most cloud services charge a monthly fee for storage, no fee for upload, and a significant fee for download, or “egress”. Depending on your provider and the tier of storage you use, you may also be charged for access to your data. Each company should have its own storage cost calculator to help total these costs and make them transparent, but overall pricing can still be difficult to determine (a back-of-the-envelope cost sketch appears after these notes). Make sure egress fees are reasonable, and perform small-scale testing of pricing models (try out the service for a month or so on your own) before uploading your whole collection and fully committing to a service.

  4. Basic Recommendations: At the moment, for active collections that need to be accessed frequently, I’d recommend Backblaze. Their user interface is great, and their storage and egress fees are much less expensive than AWS’s or Google Cloud’s. For longer-term storage, AWS Glacier Deep Archive is the least expensive and most reliable option.

  5. Platform Testing: After narrowing down cloud storage providers, create a demo account and test the upload, download, and cloud access workflows on each platform you’re interested in using. Make sure you feel comfortable using the platform for everyday work or occasional access, whatever your project needs. Ask a few other users to test the platform as well, and be sure to use extreme or “high scale” examples when testing (for example, upload and download large amounts of data, or execute processes that require high computing performance). A simple round-trip upload/download test sketch appears after these notes.

  6. Scaling Up or Down: Based on testing and experimentation, assess whether your chosen cloud provider can scale services up or down as your needs change. Can you move data between performance tiers (see the lifecycle-rule sketch after these notes)? Can you leave the service entirely, and if so, how long would it take to download all of your data and end the service contract?

  7. Ingest / File Delivery / Upload: Cloud storage providers offer both virtual (internet-based) and physical (hard drive) options for sending data to the cloud for long-term storage. Uploads may take place via a web interface, an API, or a file transfer protocol such as SFTP, which runs over SSH (a minimal SFTP upload sketch appears after these notes). You may also be able to transfer data to your provider on a physical device, such as an AWS Snowball or Google Transfer Appliance; providers offer this option to customers whose data sets would take weeks or months to send over the internet. Note: when delivering files to the cloud for long-term preservation, transferring data on a physical device may help maintain data integrity, since long network transfers can introduce corruption.

  8. Data Security: Review each provider’s privacy policy and data security documentation. Contact representatives from each company and ask them to describe available security features and options to decide what works best for your institution.

  9. Integration: Consider whether you need to integrate your cloud storage with other tools. If so, good luck (lol)!

  10. Disaster Recovery & Insurance: If local or “on premise” copies of data are destroyed in an incident (like a natural disaster or fire), your insurance company may pay the cloud download/egress fees needed to recover your data; check with your insurer to find out. If egress fees are covered, you only need to budget for the monthly storage costs of your data, though you should make sure your institution could afford to cover egress fees itself if insurance doesn’t come through for some reason. A rough recovery-cost sketch appears after these notes.
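
A few rough Python sketches follow, to illustrate some of the notes above. None of this is production code; paths, bucket names, endpoints, and rates are placeholders rather than details of our actual setup.

To go with note 1: a minimal local fixity check that compares files on disk against a stored checksum manifest. The manifest format (a CSV of relative path and SHA-256) is just an assumption for the sketch.

```python
# fixity_check.py -- compare local files against a stored checksum manifest.
# Assumes a manifest CSV with rows of "relative/path,sha256" (a hypothetical layout).
import csv
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash the file in chunks so multi-gigabyte video files don't exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(collection_root: Path, manifest_csv: Path) -> None:
    """Report files that are missing or whose checksum no longer matches the manifest."""
    with manifest_csv.open(newline="") as f:
        for rel_path, expected in csv.reader(f):
            target = collection_root / rel_path
            if not target.exists():
                print(f"MISSING   {rel_path}")
            elif sha256_of(target) != expected:
                print(f"MISMATCH  {rel_path}")
            else:
                print(f"OK        {rel_path}")


if __name__ == "__main__":
    check_fixity(Path("/mnt/archive/collection"), Path("manifest.csv"))
```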
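
To go with note 3: a back-of-the-envelope cost model for monthly storage plus a one-time egress scenario. The per-GB rates below are placeholders, not any provider’s actual pricing; plug in the numbers from your provider’s cost calculator.

```python
# cloud_cost_estimate.py -- rough monthly storage cost plus a one-time egress scenario.
# All rates are placeholders; substitute your provider's published pricing.

GB_PER_TB = 1000  # decimal terabytes, since most providers bill per GB


def estimate(stored_tb: float, storage_rate_per_gb_month: float,
             egress_rate_per_gb: float, egress_tb: float) -> None:
    monthly_storage = stored_tb * GB_PER_TB * storage_rate_per_gb_month
    one_time_egress = egress_tb * GB_PER_TB * egress_rate_per_gb
    print(f"Monthly storage for {stored_tb} TB: ${monthly_storage:,.2f}")
    print(f"One-time egress for {egress_tb} TB: ${one_time_egress:,.2f}")


if __name__ == "__main__":
    # Example: a few hundred terabytes stored, with 10 TB pulled back down.
    estimate(stored_tb=300, storage_rate_per_gb_month=0.005,  # placeholder $/GB/month
             egress_rate_per_gb=0.01, egress_tb=10)           # placeholder $/GB
```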
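
To go with note 5: a simple round-trip test against an S3-compatible API (AWS and Backblaze B2 both expose one), using the boto3 library. The endpoint, bucket, and test file are placeholders, and credentials are assumed to come from your environment or config files.

```python
# roundtrip_test.py -- upload a test file, download it back, time each step, and verify.
# Endpoint, bucket, and key are placeholders for whichever provider you're evaluating.
import hashlib
import time
from pathlib import Path

import boto3  # pip install boto3


def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def roundtrip(local_file: Path, bucket: str, key: str, endpoint_url: str) -> None:
    s3 = boto3.client("s3", endpoint_url=endpoint_url)  # credentials via env/config

    start = time.time()
    s3.upload_file(str(local_file), bucket, key)
    print(f"Upload took   {time.time() - start:.1f}s")

    downloaded = local_file.with_suffix(".roundtrip")
    start = time.time()
    s3.download_file(bucket, key, str(downloaded))
    print(f"Download took {time.time() - start:.1f}s")

    if sha256_of(local_file) == sha256_of(downloaded):
        print("Checksums match")
    else:
        print("CHECKSUM MISMATCH -- investigate before committing to this provider")


if __name__ == "__main__":
    roundtrip(Path("test_video.mov"), "my-test-bucket", "tests/test_video.mov",
              "https://s3.example-provider.com")  # placeholder endpoint
```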
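
To go with note 6: moving data between tiers can often be automated with lifecycle rules. Here’s a sketch of an S3 lifecycle rule (set via boto3) that transitions everything under a prefix to Glacier Deep Archive after 90 days; the bucket and prefix are placeholders, and other providers have their own equivalents.

```python
# lifecycle_rule.py -- transition objects under a prefix to a colder storage tier.
# Bucket name and prefix are placeholders; this uses the standard S3 lifecycle API.
import boto3  # pip install boto3


def set_deep_archive_rule(bucket: str, prefix: str, days: int = 90) -> None:
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "move-to-deep-archive",
                    "Status": "Enabled",
                    "Filter": {"Prefix": prefix},
                    # After `days` days, matching objects move to the Deep Archive class.
                    "Transitions": [{"Days": days, "StorageClass": "DEEP_ARCHIVE"}],
                }
            ]
        },
    )
    print(f"Objects under {prefix} in {bucket} will move to DEEP_ARCHIVE after {days} days")


if __name__ == "__main__":
    set_deep_archive_rule("my-archive-bucket", "masters/")
```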
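
To go with note 7: a minimal SFTP delivery using the paramiko library, with a quick size check after the transfer. The host, credentials, and remote path are placeholders; a real provider’s ingest instructions will specify their own.

```python
# sftp_upload.py -- deliver a file to a provider's SFTP ingest endpoint and sanity-check it.
# Host, username, key path, and remote path are placeholders.
import os

import paramiko  # pip install paramiko


def sftp_upload(host: str, username: str, key_path: str,
                local_path: str, remote_path: str) -> None:
    pkey = paramiko.RSAKey.from_private_key_file(os.path.expanduser(key_path))
    transport = paramiko.Transport((host, 22))
    transport.connect(username=username, pkey=pkey)
    try:
        sftp = paramiko.SFTPClient.from_transport(transport)
        sftp.put(local_path, remote_path)  # blocks until the transfer completes
        # Cheap sanity check: compare local and remote file sizes after the transfer.
        local_size = os.path.getsize(local_path)
        remote_size = sftp.stat(remote_path).st_size
        status = "OK" if local_size == remote_size else "SIZE MISMATCH"
        print(f"{status}: {local_path} -> {host}:{remote_path} ({remote_size} bytes)")
        sftp.close()
    finally:
        transport.close()


if __name__ == "__main__":
    sftp_upload("ingest.example-provider.com", "archive-user", "~/.ssh/id_rsa",
                "test_video.mov", "/incoming/test_video.mov")
```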
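
To go with note 10: a quick sanity check on what a full disaster recovery might cost and how long the download alone would take. The egress rate and line speed are placeholders, and the time estimate assumes a perfectly sustained connection, which a real recovery won’t get.

```python
# recovery_scenario.py -- worst-case egress bill and best-case download time for a full recovery.
# The egress rate and line speed are placeholders; use your provider's and network's numbers.

def recovery_scenario(total_tb: float, egress_rate_per_gb: float,
                      line_speed_gbps: float) -> None:
    total_gb = total_tb * 1000                            # decimal GB, as providers bill
    egress_cost = total_gb * egress_rate_per_gb
    seconds = (total_gb * 8e9) / (line_speed_gbps * 1e9)  # GB -> bits, then bits per second
    print(f"Egress bill to recover {total_tb} TB: ${egress_cost:,.2f}")
    print(f"Best-case download time at {line_speed_gbps} Gbps: {seconds / 86400:.1f} days")


if __name__ == "__main__":
    # Example: several hundred terabytes recovered over a gigabit connection.
    recovery_scenario(total_tb=300, egress_rate_per_gb=0.01, line_speed_gbps=1.0)
```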