In January 2016, the Council on Library and Information Resources awarded WGBH, the Library of Congress, WETA, and NewsHour Productions, LLC a grant to digitize, preserve, and make publicly accessible on the AAPB website 32 years of NewsHour predecessor programs, from October 1975 to December 2007, that currently exist on obsolete analog formats. Described by co-creator Robert MacNeil as “a place where the news is allowed to breathe, where we can calmly, intelligently look at what has happened, what it means and why it is important,” the NewsHour has consistently provided a forum for newsmakers and experts in many fields to present their views at length in a format intended to achieve clarity and balance, rather than brevity and ratings. A Gallup Poll found the NewsHour America’s “most believed” program. We are honored to preserve this monumental series and include it in AAPB.
Today, we’re pleased to update you on the project’s progress, and specifically on the new digitization workflows we have developed and implemented to meet the project’s goals.
The physical work digitizing the NewsHour tapes and ingesting the new files across the project collaborators has been moving forward since last fall and is now healthily and steadily progressing. Like many projects, ours started out as a great idea with many enthusiastic partners – and that’s good, because we needed some enthusiasm to help us sort out a practical workflow for simultaneously tracking, ingesting, quality checking, digitally preserving, describing, and making available at least 7512 unique programs!
In practice the workflow has become quite different from what the AAPB experienced with our initial project to digitize 40,000 hours of programming from more than 100 stations. With NewsHour, we started by examining the capabilities of each collaborator and what they already intended to do regarding ingestion and quality control on their files. That survey identified efficiencies: The Library of Congress (the Library) took the lead on ingesting preservation-quality files and conducting item-level quality control of the files. WGBH focused on ingestion of the proxies and communication with George Blood, the digitization vendor. The Library uses the Baton quality control software to individually pass or fail every file received. At WGBH, we use MDQC from AVPreserve to check that the proxy files we receive are encoded in accordance with our desired specifications. Both institutions use scripts to validate the MD5 file checksums the vendor provides us. If any errors are encountered, we share them in a Google Sheet and WGBH notifies the vendor. The vendor then rectifies the errors and submits a replacement file. Once a file is approved, it is time for WGBH to make it accessible on the AAPB website.
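For readers curious what a checksum validation script might look like, here is a minimal Python sketch. It is not the script either institution actually runs. It assumes a hypothetical manifest format of one "<md5>  <filename>" entry per line (the style produced by the common md5sum tool), with filenames relative to the manifest's folder; the function names are made up for illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path):
    """Compare each manifest entry against the file on disk.

    Assumed (hypothetical) manifest format: "<md5>  <filename>" per line,
    with filenames relative to the manifest's own directory.
    Returns a dict mapping filename -> "ok", "mismatch", or "missing".
    """
    manifest_path = Path(manifest_path)
    results = {}
    for line in manifest_path.read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, filename = line.split(maxsplit=1)
        # md5sum marks binary-mode entries with a leading "*"
        filename = filename.strip().lstrip("*")
        target = manifest_path.parent / filename
        if not target.exists():
            results[filename] = "missing"
        elif md5_of_file(target) == expected.lower():
            results[filename] = "ok"
        else:
            results[filename] = "mismatch"
    return results
```

Anything reported as "mismatch" or "missing" would be the kind of error that gets logged in the shared Google Sheet and sent back to the vendor.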
I imagined that making the files accessible would be a smooth routine – I would put the approved files online and everything would be great. What a nice thought that was! In truth, any one work (Global Unique Identifier or “GUID” – our unique work level identifier) could have many factors that influence what actions need to be taken to prepare it to go online. When I started reviewing the files we were receiving, looking at transcripts, and trying to keep track of the data and where various GUIDs were in the workflow, I realized that the “some spreadsheets and my mind” system I intended to employ would result in too many GUIDs falling through the cracks, and would likely necessitate far too much duplicate work. I decided to identify the possible statuses of GUIDs in the NewsHour series and every action that would need to be taken to resolve each status. After I stared at a wall for probably too long, my coworkers found me with bloodshot eyes (JK?) and this map:
(It seems appropriate that the fire alarm is in this picture)
Some of the statuses I identified are:
- Tapes we do not want captured
- Tapes that are not able to be captured
- GUIDs where the digitization is not yet approved
- GUIDs that don’t have transcripts
- GUIDs that have transcripts, but they don’t match the content
- GUIDs that are not a broadcast episode of the NewsHour
- GUIDs that are incomplete recordings
- GUIDs that need redacting
- GUIDs that passed QC but should not have
Every status has multiple actions that need to be taken to resolve that issue and move the GUID towards being accessible. The statuses are not mutually exclusive, though some are contingent on or preclude others. It was immediately clear to me that this would be too much to manually track and that I needed a centralized automated solution. The system would have to allow simultaneous users and would need to be low cost and low maintenance. After discussions with my colleagues, we decided that the best solution would be a Google Spreadsheet that everyone at the AAPB could share.
Here is a link to a copy of the NewsHour Workflow workbook we built. The workbook functions through a “Master List” with a row of metadata for every GUID, an “Intern Review” phase worksheet that automatically assigns statuses to GUIDs based on answers to questions, workflow “Tracker” sheets with resolutive actions for each status, and a “Master GUID Status Sheet” that automatically displays the status of every GUID and where each one is in the overall workflow. Some actions in trackers automatically place the GUID into another tracker – for instance, if a reviewer working in the “No Transcript Tracker” on an episode for which we don’t have a transcript identifies that GUID as having content that needs to be redacted, the GUID is automatically placed on the “Redaction Tracker”.
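To make the "answers place GUIDs into trackers" idea concrete, here is a rough Python sketch of the same logic. It is only an illustration, not how the workbook is implemented (the workbook uses spreadsheet formulas). The question keys are hypothetical, and apart from the “No Transcript Tracker” and the “Redaction Tracker,” the tracker names are invented for the example.

```python
# Hypothetical review questions mapped to the tracker a "yes" answer triggers.
# Only the first and last tracker names come from the workbook described above;
# the others are placeholders.
TRACKER_FOR_ANSWER = {
    "missing_transcript": "No Transcript Tracker",
    "transcript_mismatch": "Transcript Mismatch Tracker",
    "not_broadcast_episode": "Non-Broadcast Tracker",
    "incomplete_recording": "Incomplete Recording Tracker",
    "needs_redaction": "Redaction Tracker",
}


def assign_trackers(answers):
    """Return the trackers a GUID lands on, given its review answers.

    `answers` maps question keys to True/False; unanswered questions
    count as False. Statuses are not mutually exclusive, so a GUID
    can land on several trackers at once.
    """
    return [
        tracker
        for question, tracker in TRACKER_FOR_ANSWER.items()
        if answers.get(question, False)
    ]


def flag_during_tracking(current_trackers, question):
    """Add the tracker for an issue found later, avoiding duplicates."""
    tracker = TRACKER_FOR_ANSWER[question]
    if tracker in current_trackers:
        return list(current_trackers)
    return list(current_trackers) + [tracker]


# A GUID with no transcript is later found to need redaction.
trackers = assign_trackers({"missing_transcript": True})
trackers = flag_during_tracking(trackers, "needs_redaction")
print(trackers)  # ['No Transcript Tracker', 'Redaction Tracker']
```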
A broad description of our current project workflow is: All of the project’s GUIDs are on the “Master GUID List” and their presence on that list automatically puts them on the “Master GUID Status Sheet”. When we receive a GUID’s digitized file, staff put the GUID on the “Approval Tracker”. When a GUID passes both WGBH’s and the Library’s QC workflows, it is marked approved on the “Approval Tracker” and automatically placed on the “Intern Review Sheet.” Interns review each GUID and answer questions about the content and transcript, and the answers to those questions automatically place the GUID into different status trackers. We then use the trackers to track actions that resolve the GUIDs’ statuses. When a GUID’s issues in all the status trackers are resolved, it is marked as “READY!” to go online and placed in the “AAPB Online Tracker.” When we’ve updated the GUID’s metadata, put the file online, and recorded those actions in the “AAPB Online Tracker,” the GUID is automatically marked complete. Additionally, any statuses that indicate a GUID cannot go online (for instance, a tape was in fatal condition and unable to be captured) are marked as such in the “Master GUID Status Sheet.” This function helps us differentiate between GUIDs that will not be able to go online and GUIDs that are not yet online but should be when the project is complete.
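The progression described above can be sketched as a single function that derives one overall status per GUID. This is a simplified illustration written in Python rather than the spreadsheet formulas the workbook really uses. The field names are hypothetical, and every status label except “READY!” is invented for the example.

```python
def overall_status(guid):
    """Derive a master-sheet style status for one GUID record (a dict).

    Field names and most labels are hypothetical. Checks run in
    priority order, so the first matching condition wins.
    """
    # A tape that could not or should not be captured never goes online.
    if guid.get("cannot_go_online"):
        return "WILL NOT GO ONLINE"
    if not guid.get("file_received"):
        return "AWAITING DIGITIZATION"
    # Both institutions' QC workflows must pass before approval.
    if not (guid.get("wgbh_qc_passed") and guid.get("library_qc_passed")):
        return "AWAITING APPROVAL"
    if not guid.get("intern_reviewed"):
        return "IN INTERN REVIEW"
    # Count tracker issues that are still unresolved (True means resolved).
    issues = guid.get("tracker_issues", {})
    open_issues = sum(1 for resolved in issues.values() if not resolved)
    if open_issues:
        return "IN TRACKERS"
    if guid.get("metadata_updated") and guid.get("file_online"):
        return "COMPLETE"
    return "READY!"
```

For example, a record that has been received, approved by both QC workflows, and reviewed, with every tracker issue resolved but nothing yet recorded as online, would come back as “READY!”.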
Here is a picture of a portion of the “Master GUID Status Sheet.”
Right now there are a lot of red GUIDs, but in the coming months they will be switching to green!
The workbook functions through cross-sheet references and simple logic. It is built with mostly “IF,” “COUNTIF,” and “VLOOKUP” statements. Its functionality depends on users inputting the correct values in action cells and confirming that they’ve completed their work, but generally those values are locked in with data validation rules and sheet permissions. The workflow review I had conducted proved valuable because it provided the logic needed to construct the formulas and tracking sheets.
Building the workflow manager in Google Sheets took a few drafts. I tested the workflow with our first few NewsHour pilot digitizations, unleashed it on a few kind colleagues, and then improved it with their helpful feedback. I hope that the workbook will save us time figuring out what needs to happen to each GUID and will help prevent any GUIDs from falling through the cracks or incorrectly being put online. Truthfully, the workbook struggles under its own weight sometimes (at one point in my design I reached the 2,000,000 cell limit and had to delete all the extra cells spreadsheet programs always automatically make). Anyone conducting a project any larger or more complicated than the NewsHour would likely need to upgrade to true workflow management software or a program designed to work from the command line. I hope, if you’re interested, that you take some time to try out the copy of the NewsHour Workflow workbook! If you’d like more information, we can provide a link to our workflow documentation, which explains the workbook further.
This post was written by Charles Hosale, WGBH.