In 2012, Crawford Media Services began digitizing audiovisual assets for the American Archive of Public Broadcasting. Working in conjunction with the Corporation for Public Broadcasting, WGBH, the Library of Congress, and over one hundred public broadcasting stations across the country, Crawford digitized over 33,000 hours from analog media.
That was the easy part.
In 2014, born digital began.
Born digital scares us. You, me, archivists everywhere. And rightly so: born digital is vast, shapeless, and volatile. That unmarked, sticky, detached U-Matic tape in your desk might give you pause, but it can always be repaired, baked, and, in most cases, you will be able to figure out what’s on it and preserve the content. But that hard drive that’s been sitting in your desk for six years? That’s another animal entirely. If it still spins up, you could have hundreds of files to laboriously inspect (if none of them are corrupted or mistagged). With any luck, the files that do play are labeled with a unique ID that references a catalog somewhere; if not, you have a murky soup of unidentified bytes to sort through. You will then have to open all of these files and peruse them in great detail to verify their individual contents.
So, yes, tapes are easy (relatively speaking). Tape archiving is like traveling a narrow, winding roadway in rural Appalachia. You’re going to see some things that look faded and outdated, and you might crack a couple of inappropriate Deliverance jokes on the trip, but such roads are (mostly) well-travelled and mapped, and there should be plenty of markers along the way to guide you.
Born digital throws you onto a riverboat on the Nung during the Vietnam War. And, like Colonel Kurtz said, “Horror! Horror has a face.”
That face is digital.
I don’t want to make it seem as though every hard drive Crawford gets is a mess. Most of the stations that participated in the American Archive did a terrific job providing us with pristine hard drives. The participants reviewed their files, organized them, and ensured that they were sending only what they wanted to keep. They matched up their files with an inventory and sent them our way.
But in my experience working with born digital, clients with organized born digital holdings are not the norm. All too often, when Crawford receives files from a born digital client, we have to devote days of engineering time to sorting through folders of corrupt files, files with no extensions, or extraneous files not listed on the digital manifest. That work adds to the cost and extends the duration of the project. This becomes especially worrisome when clients ask us to make judgment calls about their born digital content, because, no matter how good a job we do and how thorough our processes are, this is not data generated by Crawford. The true content holders will always be the best equipped to cull through indecipherable or duplicate files.
If I have your attention, read on for some of the lessons learned at Crawford while working through born digital projects; I hope these suggestions prove relevant and useful, and make your born digital archival transition more manageable. They should not be taken as a definitive guide to dealing with born digital, but following these steps will help you cut down on costs in the long run and simplify the archival process for you and anyone else working on your collection.
It Starts With Production
One commonplace lapse with keepers of born digital is a tendency to retain (dare I say hoard?) data that they would have kicked to the curb long ago if they were still recording to tape. Before digital, most everyone had to pay attention to the costs of tapes and their limited physical space. But digital files don’t take up obvious, physical space, and hard drives have grown steadily cheaper in the past decade. It becomes all too easy to push born digital hygiene to the back of your mind.
While hard drives have gotten cheaper, long-term storage remains an expensive commitment. Hard drives fail, so you had better store your data on a RAID 6 array with hot spares. Disasters happen, so off-site LTO copies of your content remain essential. Archival storage costs add up quickly, and it doesn’t make sense to inflate your overhead by keeping hours upon hours of unused, duplicate, or unidentified material.
The simple way to clear away the confusion is to manage your digital files more comprehensively during the production process, but that’s easier said than done. In the midst of an expensive project, policing data generation is the last thing on people’s minds. It is cheaper and safer in the short term to keep every file rather than rush to delete content that might prove invaluable later. As a result, drives slowly fill with raw camera footage, clips and subclips, used and unused files, EDLs, rendered versions, final versions, unmixed versions. The list goes on and on. Then production wraps, and the staff immediately starts on the next high-priority production. Nobody wants to spend money on a finished product, so duplicate files, unneeded versions, and unused material sit on storage until no staff member has the expertise to sort through the data.
Back in the days of analog, the editor gathered all the important elements of a production onto a single reel. As the final arbiter of used footage, editors stood in a unique position of knowing exactly what was used in building the final product. Maybe it is necessary to adapt a similar model for dealing with born digital? Emphasis must be placed on identifying a person responsible for cleaning excess files and keeping a production’s footprint in check from the very beginning.
Consistency is King
When dealing with born digital archival projects, a logical, well-structured digital manifest is the most important piece of documentation for a successful project. Consistency in spreadsheet formatting and file naming is paramount for software to interact properly with the spreadsheet.
Furthermore, filenames need to be clearly recognizable on the spreadsheet and match the files on the drive character for character. Everything must correlate: Crawford runs a mass migration workflow, so any time we need to stop and investigate why data is missing or why a filename does not conform to the catalog, it halts our workflow and slows down the process.
Our first job at Crawford when we receive a hard drive of born digital media from a client (after the requisite virus scan) is to create a drive report. This gives us flexibility when investigating the contents of a drive. We run scripts that compare the drive report and the digital manifest provided by the client. The script finds and reports any discrepancies between the two, and the engineering staff then must review and analyze those differences before the drive can be processed.
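To make the comparison step concrete, here is a minimal sketch of what a drive report and manifest check could look like. It is not Crawford's actual script. It assumes the manifest is a CSV file with a column named filename holding paths relative to the top of the drive; that column name and the function names are hypothetical choices made for illustration.

```python
import csv
import os
from pathlib import Path


def build_drive_report(root):
    """Walk a mounted drive and return {relative_path: size_in_bytes}."""
    root = Path(root)
    report = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            # Store paths relative to the drive root, with forward slashes.
            report[full.relative_to(root).as_posix()] = full.stat().st_size
    return report


def read_manifest(csv_path, column="filename"):
    """Read the set of names listed in a CSV manifest (assumed layout)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {row[column] for row in csv.DictReader(handle)}


def compare(report, manifest_names):
    """Report exact-match discrepancies in both directions."""
    on_drive = set(report)
    return {
        "missing_from_drive": sorted(manifest_names - on_drive),
        "not_on_manifest": sorted(on_drive - manifest_names),
    }
```

Because the comparison uses exact string matching, a stray space or a change in capitalization shows up as one entry in each list, which is exactly the kind of discrepancy that someone then has to review by hand.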
Keep Your Filenames Legal
Most of our born digital work is done using scripts and programs. Filenames that contain shell characters, spaces, file system path characters, and similar illegal characters cause problems that impede not only our process, but also the compatibility of the files with other software down the line. When we run our analysis of the drive against the manifest, all illegal characters are noted and removed so the file can be processed properly. But, when that occurs, these files no longer match the manifest we received, which can cause confusion downstream once the files are created without slashes, colons, and spaces. Follow a proper naming convention from the very beginning of your archival project using alphanumeric characters, and remember to update any digital manifests whenever you make changes to file names.
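If you want to clean up names yourself before shipping a drive, a sketch along these lines may help. It assumes that only ASCII letters, digits, dots, underscores, and hyphens count as legal, and it replaces each run of anything else with an underscore rather than deleting it outright; both are choices made for this example, not a description of Crawford's process. The function names are hypothetical.

```python
import re

# Assumption: anything outside letters, digits, dot, underscore, hyphen is illegal.
ILLEGAL = re.compile(r"[^A-Za-z0-9._-]+")


def sanitize(name):
    """Replace each run of illegal characters with a single underscore."""
    cleaned = ILLEGAL.sub("_", name)
    # Avoid names that begin or end with an underscore, or that end up empty.
    return cleaned.strip("_") or "unnamed"


def build_rename_map(names):
    """Map each original name to its legal name, refusing collisions."""
    mapping = {}
    seen = {}
    for original in names:
        cleaned = sanitize(original)
        if cleaned in seen and seen[cleaned] != original:
            raise ValueError(
                f"{original!r} and {seen[cleaned]!r} both become {cleaned!r}"
            )
        seen[cleaned] = original
        mapping[original] = cleaned
    return mapping
```

The returned mapping is the important part: keep it, and use it to update the manifest at the same time you rename the files, so the two never drift apart.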
Check Your Files
Make sure your files work. There are tools out there that will automate part of this process; FFmpeg is particularly useful. By running all the files through some sort of analysis engine and immediately identifying any files that are broken or misidentified (such as a one-frame image tagged as a video file), you will significantly cut down on any confusion when you start digging through the files in detail. However, that’s only part of the process. Once the files with obvious problems have been identified, the files that passed the analysis should still at least be opened in order to make certain they are worth archiving. I’ve lost count of the number of files I’ve worked on that either did not open, were mostly black, or were full of audio or video glitches. You don’t want to waste time and storage space by trying to flip or archive corrupt or useless files.
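As one possible first pass, the sketch below uses ffprobe, the inspection tool that ships with FFmpeg, to flag files with obvious problems. It assumes ffprobe is installed and on your PATH. The checks it runs (no audio or video stream, a video stream reporting a single frame, a missing or zero duration) are illustrative guesses at what counts as obvious, and some containers do not report a frame count at all, so treat the results as a shortlist rather than a verdict.

```python
import json
import subprocess


def probe(path):
    """Run ffprobe and return its parsed JSON report, or None on failure."""
    cmd = ["ffprobe", "-v", "error", "-print_format", "json",
           "-show_format", "-show_streams", str(path)]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return json.loads(result.stdout)


def classify(info):
    """Turn a parsed ffprobe report into a list of readable problems."""
    if info is None:
        return ["ffprobe could not read the file"]
    problems = []
    streams = info.get("streams", [])
    kinds = {s.get("codec_type") for s in streams}
    if not kinds & {"video", "audio"}:
        problems.append("no audio or video stream")
    for s in streams:
        # A "video" with exactly one frame is probably a still image.
        if s.get("codec_type") == "video" and s.get("nb_frames") == "1":
            problems.append("video stream contains a single frame")
    try:
        duration = float(info.get("format", {}).get("duration", 0))
    except (TypeError, ValueError):
        duration = 0.0
    if duration <= 0:
        problems.append("missing or zero duration")
    return problems
```

Calling classify(probe(path)) on each file gives an empty list for anything that passes. A file that passes can still be mostly black or full of glitches, so passing files still need to be opened and looked at.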
In writing this blog post, I hoped that, through the insights of engineers, archivists, and project managers, I would stumble across some skeleton key for born digital, that inspiration would strike, and we would somehow come up with a simple, painless mechanism to manage born digital assets. No such luck.
We know that there is no skeleton key — born digital cannot be managed solely by following a processor’s recipe. There are simply too many variables. Remember the key factors for audiovisual archiving — organization and a painstaking eye for detail throughout every step in the process (even more important with born digital than it is for analog). Keep your organization on track, keep good notes, and keep only what you must. It will go a long way toward developing and maintaining your born digital collection.