PBS NewsHour Digitization Project Update: Ingest and Digital Preservation Workflows

In our last blog post (click for link) on managing the PBS NewsHour Digitization Project, I briefly discussed WGBH’s digital preservation and ingest workflows. Though many of our procedures follow standard practices common to archival work, I thought it would be worthwhile to cover them in more depth for those who might be interested. We at WGBH are responsible for describing, providing access to, and digitally preserving the proxy files for all of our projects. The Library of Congress preserves the masters. In this post I cover how we preserve and prepare to provide access to proxy files.

Before a file is digitized, we ingest the item-level tape inventory generated during the project planning stages into our Archival Management System (AMS – see link for the GitHub). The inventory is a CSV that we normalize to our standards, upload, and then map to PBCore in MINT, or “Metadata Interoperability Services,” an open-source, web-based plugin designed for metadata mapping and aggregation. The AMS ingests the data and creates new PBCore records, which are stored as individual elements in tables in the AMS. The AMS generates a unique ID (GUID) for each asset. We then export the metadata, provide it to the digitization vendor, and use the GUIDs to track records throughout the project workflow.

Mapping a CSV to PBCore in MINT
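
For readers who think in code, the idea of mapping inventory columns to PBCore elements can be sketched in a few lines of Python. This is only an illustration and not how MINT or the AMS works internally: the column names (guid, title, date), the sample row, and the source attribute value are all hypothetical, and the output omits the PBCore namespace and has not been validated against the PBCore schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical inventory columns mapped to PBCore element names,
# listed in the order PBCore expects the elements to appear.
COLUMN_TO_ELEMENT = [
    ("date", "pbcoreAssetDate"),
    ("guid", "pbcoreIdentifier"),
    ("title", "pbcoreTitle"),
]


def row_to_pbcore(row):
    """Build a bare-bones PBCore description document from one CSV row."""
    doc = ET.Element("pbcoreDescriptionDocument")
    for column, element_name in COLUMN_TO_ELEMENT:
        value = (row.get(column) or "").strip()
        if not value:
            continue  # skip empty cells rather than emit empty elements
        element = ET.SubElement(doc, element_name)
        element.text = value
        if element_name == "pbcoreIdentifier":
            # PBCore identifiers carry a source attribute; this value is made up.
            element.set("source", "hypothetical-inventory")
    return doc


# A made-up one-row inventory, with stray whitespace to show the normalization.
sample_csv = "guid,title,date\nguid-0001, Example broadcast ,1980-01-01\n"
records = [row_to_pbcore(row) for row in csv.DictReader(io.StringIO(sample_csv))]
print(ET.tostring(records[0], encoding="unicode"))
```

Keeping the mapping in one list makes it easy to see, and to change, which column feeds which element.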

For the NewsHour project, George Blood L.P. receives the inventory metadata and the physical tapes to digitize to our specifications. For every GUID, George Blood creates an MP4 proxy for access, a JPEG2000 MXF preservation master, sidecar MD5 checksums for both video files, and a QCTools report XML for the master. George Blood names each file after the corresponding GUID and organizes the files into an individual folder for each GUID. During the digitization process, they record digitization event metadata in PREMIS spreadsheets. The AMS automatically harvests those sheets at regular intervals and inserts the metadata into the corresponding catalog records. With each delivery batch George Blood also provides MediaInfo XML saved in BagIt containers for every GUID, and a text inventory of the delivery’s assets and corresponding MD5 checksums. The MediaInfo bags are uploaded via FTP to the AMS, which harvests technical metadata from them and creates PBCore instantiation metadata records for the proxies and masters. WGBH receives the digitized files on LTO-6 tapes, and the Library of Congress receives theirs on rotating large-capacity external hard drives.

For those who are not familiar with the tools I just mentioned, I will briefly describe them. A checksum is a computer-generated cryptographic hash. There are different types of hashes, but we use MD5, as do many other archives. The computer analyzes a file with the MD5 algorithm and delivers a 32-character hexadecimal code. If a file does not change, the MD5 value generated will always be the same. We use MD5s to ensure that files are not corrupted during copying and that they stay the same (“fixed”) over time. QCTools is an open-source program developed by the Bay Area Video Coalition and its collaborators. The program analyzes the content of a digitized asset, generates reports, and facilitates the inspection of videos. BagIt is a file packaging format developed by the Library of Congress and partners that facilitates the secure transfer of data. MediaInfo is a tool that reports technical metadata about media files. It’s used by many in the AV and archives communities. PREMIS is a metadata standard used to record data about an object’s digital preservation.
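
To make the checksum idea concrete, here is a minimal Python sketch using the standard hashlib module. The byte strings are invented stand-ins for file contents, not project data. It shows the two properties described above: identical content always yields the same 32-character value, and changing even one character yields a different one.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of some bytes as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# Made-up content standing in for the bytes of a video file.
original = md5_hex(b"NewsHour proxy bytes")
same = md5_hex(b"NewsHour proxy bytes")
altered = md5_hex(b"NewsHour proxy bytez")  # last character changed

print(len(original))        # 32
print(original == same)     # True
print(original == altered)  # False
```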

Now a digression about my inventories – sorry in advance. ¯\_(ツ)_/¯

I keep two active inventories of all digitized files received. One is an Excel spreadsheet “checksum inventory” in which I track if a GUID was supposed to be delivered but was not received, or if a GUID was delivered more than once. I also use it to confirm that the checksums George Blood gave us match the checksums we generate from the delivered files, and it serves as a backup for checksum storage and organization during the project. The inventory has a master sheet with info for every GUID, and then each tape has an individual sheet with an inventory and checksums of its contents. I set up simple formulas that report any GUIDs or checksums that have issues. I could use scripts to automate the checksum validation process, but I like having the data visually organized for the NewsHour project. Given the relatively small volume of fixity checking I’m doing, this manual verification works fine for this project.

Excel “checksum inventory” sheet page for NewsHour LTO tape #27.
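
The checks those spreadsheet formulas perform could also be scripted. The sketch below is a hypothetical Python equivalent, not something used on the project. The function name, the data layout (lists of GUID and MD5 pairs), and the sample GUIDs are all assumptions made for illustration.

```python
from collections import Counter


def compare_checksums(expected_guids, vendor_sums, local_sums):
    """Compare vendor-supplied checksums with locally generated ones.

    expected_guids: GUIDs that should be in the delivery.
    vendor_sums, local_sums: lists of (guid, md5) pairs.
    """
    vendor_counts = Counter(guid for guid, _ in vendor_sums)
    vendor = dict(vendor_sums)
    local = dict(local_sums)
    expected = set(expected_guids)
    return {
        # Expected but never generated locally, so presumably not received.
        "missing": sorted(expected - set(local)),
        # Present locally but not on the expected list.
        "unexpected": sorted(set(local) - expected),
        # Listed more than once in the vendor inventory.
        "duplicated": sorted(g for g, n in vendor_counts.items() if n > 1),
        # Present in both lists with different checksums (case-insensitive).
        "mismatched": sorted(
            g for g in vendor.keys() & local.keys()
            if vendor[g].lower() != local[g].lower()
        ),
    }


# Made-up GUIDs and shortened checksums, for illustration only.
report = compare_checksums(
    expected_guids=["guid-0001", "guid-0002", "guid-0003"],
    vendor_sums=[("guid-0001", "aa"), ("guid-0002", "bb"),
                 ("guid-0002", "bb"), ("guid-0003", "cc")],
    local_sums=[("guid-0001", "AA"), ("guid-0002", "xx")],
)
print(report)
```

An empty list under every key would mean the delivery matches expectations.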

The other inventory is the Approval Tracker spreadsheet in our Google Sheets NewsHour Workflow workbook (click here for link). The Approval Tracker is used to manage reporting about each GUID’s ingest and digital preservation workflow status. I record in it when I have finished the digital preservation workflow on a batch, and I mark when the files have been approved by all project partners. Partners have two months from the date of delivery to report approvals to George Blood. Once the files are approved they’re automatically placed on the Intern Review sheet for the arrangement and description phase of our workflow.

The Approval Tracker in the NewsHour Workflow workbook.

Okay, forgive me for that. Now back to WGBH’s ingest and digital preservation workflow for the NewsHour project!

The first thing I do when we receive a shipment from George Blood is the essential routine I learned the hard way while stocking a retail store: always make sure everything that you paid for is actually there! I do this for the physical LTO tapes, the files on the tapes, the PREMIS spreadsheet, the bags, and the delivery’s inventory. In Terminal I use a bash script that checks a list of GUIDs against the files present on our server to ensure that all bags have been correctly uploaded to the AMS. If we’ve received everything expected, I then organize the data from the inventory, copying the submission checksums into each tape’s spreadsheet in my Excel “checksum inventory”. Then I start working with the tapes.
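
As a rough illustration of that kind of check (this is not the actual bash script), here is a hypothetical Python sketch. It assumes each bag on the server is a file or folder whose name begins with the GUID, optionally followed by an extension, and that GUIDs contain no periods. Those naming details are assumptions, not a description of our server.

```python
from pathlib import Path


def missing_guids(guids, directory: Path):
    """Return the GUIDs that have no matching file or folder in `directory`.

    A name matches when the part before its first period equals the GUID,
    so "guid-0001.zip" and a folder named "guid-0001" both match "guid-0001".
    """
    present = {entry.name.split(".", 1)[0] for entry in directory.iterdir()}
    return [guid for guid in guids if guid not in present]
```

Calling missing_guids(list_of_guids, Path("/path/to/bags")) with a placeholder path returns an empty list when every bag is present.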

Important background information is that the AAPB staff at WGBH work in a Mac environment, so what I’m writing about works for Mac, but it could easily be adapted to other systems. The first step I take with the tapes is to check them for viruses. We use Sophos to do that in Terminal, with the Sweep command. If no viruses are found I then use one of our three LTO workstations to copy the MP4 proxies, proxy checksums, and QCTools XML reports from the LTO to a hard drive. I use Terminal to do the copying, which I leave running while I go to other work. When the tape is done copying I use Terminal to confirm that the number of files copied matches the number of files I expected to copy. After that, I use it to run an MD5 report (with the find command, its -exec option, and the md5 command) on the copied files on the hard drive. I put those checksums into my Excel sheet and confirm they match the sums provided by George Blood, that there are no duplicates, and that we received everything we expected. If all is well, I put the checksum report onto our department server and move on to examining the delivered files’ specifications.
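
For anyone not on a Mac, the file count and MD5 report can be approximated in Python. The sketch below is an assumption-laden stand-in for the Terminal commands, not what I actually run: the function name is hypothetical, and the report lines simply imitate the “MD5 (path) = value” layout that the macOS md5 command prints.

```python
import hashlib
from collections import Counter
from pathlib import Path


def md5_manifest(root: Path):
    """Walk `root` and return (file counts by extension, MD5 report lines)."""
    counts = Counter()
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = hashlib.md5()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large video files never sit fully in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        counts[path.suffix.lower()] += 1
        relative = path.relative_to(root).as_posix()
        lines.append(f"MD5 ({relative}) = {digest.hexdigest()}")
    return counts, lines
```

The counts can be compared with the number of MP4s, checksum files, and XML reports expected from the tape, and the lines can be saved as the checksum report.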

I use MediaInfo and MDQC to confirm that files we receive conform to our expectations. Again, this is something I could streamline with scripts if the workflow needed, but MDQC gets the job done for the NewsHour project. MDQC is a free program from AVPreserve that checks a group of files against a reference file and passes or fails them according to rules you specify. I set the test to check that the files in the delivered batch are encoded to our specifications (click here for those). If any files fail the test, I use MediaInfo in Terminal to examine why they failed. I record any failures at this stage, or earlier in the checksum stage, in an issue tracker spreadsheet the project partners share, and report the problems to the vendor so that they can deliver corrected files.

MDQC’s simple and effective user interface.
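
The pass/fail logic behind a tool like MDQC is simple to sketch. The Python below is not MDQC’s code and does not call MediaInfo. It assumes the technical metadata for each file has already been read into a dictionary, and the field names and values in the example are placeholders rather than our actual specifications.

```python
def check_against_rules(metadata: dict, rules: dict) -> list:
    """Return a list of readable failures; an empty list means the file passes."""
    failures = []
    for field, expected in rules.items():
        actual = metadata.get(field)  # None when the field is absent
        if actual != expected:
            failures.append(f"{field}: expected {expected!r}, found {actual!r}")
    return failures


# Placeholder rules and metadata, not the project's real encoding specs.
rules = {"Format": "MPEG-4", "Width": 640}
print(check_against_rules({"Format": "MPEG-4", "Width": 640}, rules))
print(check_against_rules({"Format": "MPEG-4", "Width": 320}, rules))
```

Fields that are not named in the rules are ignored, so the check only enforces what the specification actually covers.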

Next I copy the files from the hard drive onto other working hard drives for the interns to use during the review stage. I then skim a small sample of the files to confirm their content meets our expectations, comparing the digitizations to the transfer notes provided by George Blood in the PREMIS metadata. I review a few of the QCTools reports, looking at the video’s levels. I don’t spend much time doing that though, because the Library of Congress reviews the levels and characteristics of every master file. If everything looks good I move on, because all the proxies will be reviewed at an item level by our interns during the next phase of the project’s workflow anyway.

The last steps are to mark both the delivery batch’s digital preservation complete and the files as approved in the Approval Tracker, create a WGBH catalog record for the LTO, run a final MD5 manifest of the LTO and hard drive, upload some preservation metadata (archival LTO name, file checksums, and the project’s internal identifying code) to the AMS, and place the LTO and drive in our vault. The interns then review and describe the records and, after that, the GUIDs move into our access workflow. Look forward to future blog posts about those phases!

PBS NewsHour Digitization Project Update

In January 2016, the Council on Library and Information Resources awarded WGBH, the Library of Congress, WETA, and NewsHour Productions, LLC a grant to digitize, preserve, and make publicly accessible on the AAPB website 32 years of NewsHour predecessor programs, from October 1975 to December 2007, that currently exist on obsolete analog formats. Described by co-creator Robert MacNeil as “a place where the news is allowed to breathe, where we can calmly, intelligently look at what has happened, what it means and why it is important,” the NewsHour has consistently provided a forum for newsmakers and experts in many fields to present their views at length in a format intended to achieve clarity and balance, rather than brevity and ratings. A Gallup Poll found the NewsHour America’s “most believed” program. We are honored to preserve this monumental series and include it in AAPB.

Last week, our contract archivist Alexander (AJ) Lawrence completed the inventory of 7,320 NewsHour tapes stored in 523 boxes located in WETA’s storage units in Arlington, Virginia, comprising the bulk of the collection. (Additional content is located at two other locations.)

“I was so excited to receive Casey’s initial email asking about my interest in the NewsHour project. I’ve been a lifelong watcher of the program and the chance to be involved in the preservation of such a valuable resource for historical research seemed like a wonderful opportunity.

The process of inventorying the entire collection seemed pretty daunting on my first day when I got my first in-person look at the storage units housing the estimated 7,500 tapes. However, the process has gone quite smoothly overall and we’ve now surpassed the halfway point. Generally, the tapes have little more than a date to identify them, but it’s been especially interesting to come across the tapes for significant historical events over the past 40+ years. These tapes in particular offered me a chance to reflect on some major cultural milestones I’ve witnessed, often through coverage by the NewsHour team. That said, it was also fun to come across the broadcast that aired on the day I was born, as well as the very first broadcast of The MacNeil/Lehrer NewsHour.

Thankfully, I haven’t been tackling the entire inventory alone. I need to offer a special thanks to Matthew Graylin, a desk assistant with the NewsHour who’s been tasked with assisting me with the work. Needless to say, conducting an archival inventory is well beyond the normal duties of a broadcast news assistant, but Matthew has dived in with gusto. We still have a few weeks together, so hopefully I can convert him into a future audiovisual archivist in that time.”


We have also selected a digitization vendor for the project and are looking to begin pilot tests for digitization within the next month. Meanwhile, the Library has instituted quality control procedures to ensure that all digitized files will be properly preserved for present and future generations.

We can’t wait to get started with digitization and look forward to making this monumental series accessible as part of the AAPB collection. In the meantime, we’re pleased to share this clip reel sampling of content that will be digitized, courtesy of NewsHour Productions.

 

WGBH, Library of Congress, and WETA to Digitize PBS NewsHour Collection


32 years of PBS NewsHour programs to be made available online through American Archive of Public Broadcasting

BOSTON, Mass. (January 28, 2016) – More than three decades of PBS NewsHour broadcasts from 1975 to 2007 will be preserved and available online as part of the American Archive of Public Broadcasting (AAPB). Public media producer WGBH, the Library of Congress, and WETA, Washington, DC will digitize, preserve and allow the public online access to PBS NewsHour’s predecessor programs from 1975 to 2007, made possible with funding from the Council on Library and Information Resources (CLIR). The project will digitize nearly 10,000 programs comprising more than 8,000 recorded hours that chronicle American and foreign affairs, providing access to original source material, including interviews with presidents and other world leaders and reports on major issues and events. The content will be presented as a part of the American Archive of Public Broadcasting, a collaboration between WGBH and the Library of Congress.

Noting the value of preserving the PBS NewsHour material, Steven Roberts, renowned journalist and the Shapiro Professor of Media and Public Affairs at George Washington University, said “No other broadcast on television has upheld the highest standards of the profession with such consistent devotion.”

The digitized PBS NewsHour collection will provide valuable primary source material not available elsewhere for historians to consider in their explorations into the recent past, especially in the areas of politics, policymaking, and international affairs. It will give scholars a previously unavailable source from which to study ideas and rhetoric to illuminate what intellectual historian Daniel Rodgers recently characterized as “a multisided contest of arguments and social visions that ranged across the late twentieth century.”

The programs feature interviews with leading newsmakers including presidents, Supreme Court justices, members of Congress, every secretary of state since 1976 and with world leaders, including the Shah of Iran, Ayatollah Khomeini, Fidel Castro, Muammar Khadafy, Yasser Arafat, Menachem Begin, Boris Yeltsin, Vaclav Havel, Nelson Mandela and Margaret Thatcher. The collection includes extensive coverage of election campaigns, African-American history, global and domestic health care, poverty, technology, immigration debates, the end of the Cold War, terrorism, the economy, climate change, energy issues, religion, education issues, rural life, scientific exploration, poetry and the media.

The PBS NewsHour collection will be made available on the AAPB website, growing the online collection to more than 20,000 programs. The AAPB will ensure that this rich source for American political, social, and cultural history and creativity will be saved and made available once again to future generations.

More information is available on the American Archive website at americanarchive.org.

About the American Archive of Public Broadcasting
The American Archive of Public Broadcasting is a collaborative effort by the Library of Congress and WGBH in Boston to preserve for posterity the most significant public television and radio programs of the past 60 years. The American Archive will ensure that this rich source for American political, social, and cultural history and creativity will be saved and made available once again to future generations. Major funding is provided by the Corporation for Public Broadcasting, the Institute for Museum and Library Services, and the Council on Library and Information Resources. More information is available at americanarchive.org.

About The Library of Congress
The Library of Congress, the nation’s oldest federal cultural institution, is the world’s preeminent reservoir of knowledge, providing unparalleled collections and integrated resources to Congress and the American people. The Library holds the largest collection of audio-visual recordings in the world and has been collecting and preserving historically, culturally and aesthetically significant recordings in all genres for nearly 120 years. Many of the Library’s rich resources and treasures may also be accessed through the Library’s website, www.loc.gov.

About WGBH
WGBH Boston is America’s preeminent public broadcaster and the largest producer of PBS content for TV and the Web, including Masterpiece, Antiques Roadshow, Frontline, Nova, American Experience, Arthur, Curious George, and more than a dozen other prime-time, lifestyle, and children’s series. WGBH also is a leader in educational multimedia, including PBS LearningMedia, and a pioneer in technologies and services that make media accessible to the 36 million Americans who are deaf, hard of hearing, blind, or visually impaired. WGBH has been recognized with hundreds of honors: Emmys, Peabodys, duPont-Columbia Awards…even two Oscars. Find more information at www.wgbh.org.

About WETA
WETA Washington, DC, is one of the largest-producing stations of new content for public television in the United States and serves Virginia, Maryland and the District of Columbia with educational initiatives and with high-quality programming on four digital television channels. Other WETA productions and co-productions include WASHINGTON WEEK WITH GWEN IFILL, THE KENNEDY CENTER MARK TWAIN PRIZE and documentaries by filmmaker Ken Burns, including THE ROOSEVELTS: AN INTIMATE HISTORY and a forthcoming film on Jackie Robinson. Sharon Percy Rockefeller is president and CEO of WETA. More information on WETA and its programs and services is available at www.weta.org.

About PBS NewsHour
PBS NewsHour is seen by over four million weekly viewers and is also available online, via public radio in select markets, and via podcast. PBS NewsHour is a production of NewsHour Productions LLC, a wholly-owned non-profit subsidiary of WETA Washington, D.C., in association with WNET in New York. Major funding for PBS NewsHour is provided by the Corporation for Public Broadcasting, PBS and public television viewers. Major corporate funding is provided by BNSF and Lincoln Financial Group, with additional support from Alfred P. Sloan Foundation, Carnegie Corporation of New York, the J. Paul Getty Trust, the S.D. Bechtel, Jr. Foundation, the John D. and Catherine T. MacArthur Foundation, the Lemelson Foundation, National Science Foundation, The Rockefeller Foundation, the William and Flora Hewlett Foundation, Ford Foundation, Skoll Foundation, Friends of the NewsHour and others. More information on PBS NewsHour is available at pbs.org/newshour. On social media, visit NewsHour on Facebook or follow @NewsHour on Twitter.

Media Contacts

Library of Congress:
Sheryl Cannady
202-707-6456
scannady@loc.gov

WGBH:
Emily Balk
617-300-5317
emily_balk@wgbh.org

PBS NewsHour:
Nick Massella
nmassella@newshour.org