In 2015, the Institute of Museum and Library Services (IMLS) awarded WGBH, on behalf of the American Archive of Public Broadcasting, a grant to address the challenges faced by many libraries and archives trying to provide better access to their media collections through online discoverability. Through a collaboration with Pop Up Archive and HiPSTAS at the University of Texas at Austin, our project has supported the creation of speech-to-text transcripts for the initial 40,000 hours of historic public broadcasting preserved in the AAPB, the launch of a free open-source speech-to-text tool, and FIX IT, a game that allows the public to help correct our transcripts.
Now, our colleagues at HiPSTAS are debuting a new machine learning toolkit and DIY techniques for labeling speakers in “unheard” audio — audio that is not documented in a machine-generated transcript. The toolkit was developed through a massive effort using machine learning to identify notable speakers’ voices (such as Martin Luther King, Jr. and John F. Kennedy) from within the AAPB’s 40,000-hour collection of historic public broadcasting content.
This effort has vast potential for archivists, researchers, and other organizations seeking to discover and make accessible sound at scale — sound that otherwise would require a human to listen and identify in every digital file.
As part of our NEH-funded PBCore Development and Training Project, we’re developing tools and resources around PBCore, a metadata schema and data model designed to describe and manage audiovisual collections.
Based on feedback from an earlier survey of users and potential users, we’ve generated a list of tools and resources that respondents indicated would be valuable to the archival and broadcasting communities. Now, we’re looking for feedback on what to prioritize so that our work will be of real use to the archives and public media communities.
Please fill out this short survey – which should take at most five minutes – to check out our development plans and give your feedback on where we should focus our efforts: https://www.surveymonkey.com/r/WPF3QZD
Thanks for taking the time to fill out the survey! You can read more about the PBCore Development and Training Project here and see the PBCore website here.
In this post I will describe our “Asset Review” and “Online Workflow” phases. The “Asset Review” phase is where we determine what work we will need to do to a recording to make it available online, and the “Online Workflow” phase is where we extract metadata from a transcript, add the metadata to our repository, and make the recording available online.
The goals and realities of the NewsHour project necessitate an item-level content review of each recording. The reasons for this are distinct and compounding. The scale of the collection (nearly 10,000 assets) meant that the inventories from which we derived our metadata were generated only from legacy databases and tape labels, which are sometimes wrong. At no point were we able to confirm that the content on any tape was complete and correct prior to digitization. In fact, some of the tapes are unplayable until they have been prepared for digitization. Additionally, there is third-party content that needs to be redacted from some episodes of the NewsHour before they can be made available. A major complication is that the transcripts only match 7pm Eastern broadcasts, and sometimes 9pm or 11pm updates would be recorded and broadcast if breaking news occurred. The tapes are not always marked with broadcast times, and sometimes do not contain the expected content – or even an episode of the NewsHour!
These complications would be fine if we were only preserving the collection, but our project goal is to make each recording and corresponding transcript or closed caption file broadly accessible. To accomplish that goal each record must have good metadata, and to have that we must review and describe each record! Luckily, some of the description, redaction, and our workflow tracking is automatable.
Access and Description Workflow Overview
As I’ve mentioned before, we coordinate and document all our NewsHour work in a large Google Sheet we call the “NewsHour Workflow workbook” (click here for link). The chart below explains how a GUID moves through sheets of the NewsHour workbook throughout our access and description work.
After a digitized recording has been delivered to WGBH and preserved, it is automatically placed in the queue on the “Asset Review” sheet of our workbook. During the Asset Review, the reviewer answers thirteen different questions about the GUID. Using these responses, the Google Sheet automatically places the assets into the appropriate workflow trackers in our workbook. For instance, if a recording doesn’t have a transcript, it is placed in the “No Transcript tracker”, which has extra workflow steps for generating a description and subject metadata. A GUID can have multiple issues that place it into multiple trackers simultaneously. For instance, a tape that is not an episode will also not have a transcript, and will be placed on both the “Not an Episode tracker” and the “No Transcript tracker”. The Asset Review is critical because the answers determine the work we must perform, and ensure that each record will be correctly presented to the public when work on it is completed.
A GUID’s status in the various trackers is reflected in the “Master GUID Status sheet”, and is automatically updated when different criteria in the trackers are met and documented. When a GUID’s workflow tasks have been completely resolved in all the trackers, it appears as “Ready to go online” on the “Master GUID Status sheet.” The GUID is then automatically placed into the “AAPB Online Status tracker”, which presents the metadata necessary to put the GUID online and indicates if tasks have been completed in the “Online Workflow tracker”. When all tasks are completed, the GUID will be online and our work on the GUID is finished.
In this post I am focusing on a workflow that follows digitizations which don’t have problems. This means the GUIDs are episodes, contain no technical errors, and have transcripts that match (green arrows in the chart). In future blog posts I’ll elaborate on our workflows for recordings that go into the other trackers (red arrows).
Each row of the “Asset Review sheet” represents one asset, or GUID. Columns A-G (green cell color) on the sheet are filled with descriptive and administrative metadata describing each item. This metadata is auto-populated from other sheets in the workbook. Columns H-W (yellow cell color) are the reviewer’s working area, with questions to answer about each item reviewed. As mentioned earlier, the answers to the questions determine the actions that need to be taken before the recording is ready to go online, and place the GUID into the appropriate workflow trackers.
The answers to some questions on the sheet impact the need to answer others, and cells auto-populate with “N/A” when one answer precludes another. Almost all the answers require controlled values, and the cells will not accept input besides those values. If any of the cells are left blank (besides questions #14 and #15) the review will not register as completed on the “Master GUID Status Sheet”. I have automated and applied value control to as much of the data entry in the workbook as possible, because doing so helps mitigate human error. The controlled values also facilitate workbook automation, because we’ve programmed different actions to trigger when specific expected text strings appear in cells. For instance, the answer to “Is there a transcript for this video?” must be “Yes” or “No”, and those are the only inputs the cell will accept. A “No” answer places the GUID on the “No Transcript tracker”, and a “Yes” does not.
To review an item, staff open the GUID on an access hard drive. We have multiple access drives that contain copies of the proxy files from all delivered NewsHour digitizations. Reviewers are expected to watch between one and a half and three minutes of the beginning, middle, and end of a recording, and to check for errors while fast-forwarding through everything not watched. The questions reviewers answer are:
1. Is this video a nightly broadcast episode?
2. If an episode, is the recording complete?
3. If incomplete, describe the incompleteness.
4. Is the date we have recorded in the metadata correct?
5. If not, what is the corrected date?
6. Has the date been updated in our metadata repository, the Archival Management System?
7. Is the audio and video as expected, based on the digitization vendor’s transfer notes?
8. If not, what is wrong with the audio or video?
9. Is there a transcript for this video?
10. If yes, what is the transcript’s filename?
11. Does the video content completely match the transcript?
12. If no, in what ways and where doesn’t the transcript match?
13. Does the closed caption file match completely (if one exists)?
14. Should this video be part of a promotional exhibit?
15. Any notes to project manager?
16. Date the review is completed.
17. Initials of the reviewer.
Our internal documentation has specific guidelines on how to answer each of these questions, but I will spare you those details! If you’re conducting quality control and description of media at your institution, these questions are probably familiar to you. After a bit of practice reviewers become adept at locating transcripts, reviewing content, and answering the questions. Each asset takes about ten minutes to review if the transcript matches, the content is the expected recording, and the digitization is error free. If any of those criteria are not true, the review will take longer. The review is laborious, but an essential step to make the records available.
A large majority of recordings are immediately ready to go online following the asset review. These ready GUIDs are automatically placed into the “AAPB Online Status tracker,” where we track the workflow to generate metadata from the transcript and upload that and the recording to the AAPB.
About once a month I use the “AAPB Online Status tracker” to generate a list of GUIDs and corresponding transcripts and closed caption files that are ready to go online. To do this, all I have to do is filter for GUIDs in the “AAPB Online Status tracker” that have the workflow status “Incomplete” and copy the relevant data for those GUIDs out of the tracker and into a text file. I import this list into a FileMaker tool we call “NH-DAVE” that our Systems Analyst constructed for the project.
“NH-DAVE” is a relational database containing all of the metadata that was originally encoded within the NewsHour transcripts. The episode transcripts provided by NewsHour contained the names of individuals appearing and subject terms for that episode as marked-up values. Their subject terms were much more specific than ours, so we mapped them to the broader AAPB controlled vocabulary we use to facilitate search and discovery on our website. When I ingest a list of GUIDs and transcripts to “NH-DAVE” and click a few buttons, it uses an AppleScript to match metadata from the transcript to the corresponding NewsHour metadata records in our Archival Management System and generate SQL statements. We use the statements to insert the contributor and subject metadata from the transcripts into the GUIDs’ AAPB metadata records in the Archival Management System.
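The transcript-to-SQL step can be sketched in shell. Everything here is illustrative: the GUID, the contributor name, and the table and column names are invented stand-ins, not the actual AMS schema or NH-DAVE’s AppleScript logic.

```shell
# Sketch: turn tab-separated GUID/contributor pairs (as parsed from a
# NewsHour transcript) into SQL INSERT statements. Table and column
# names are hypothetical, not the real AMS schema.
printf 'cpb-aacip-507-0001\tJim Lehrer\n' > transcript_metadata.tsv

tab="$(printf '\t')"
while IFS="$tab" read -r guid contributor; do
  printf "INSERT INTO contributors (guid, name) VALUES ('%s', '%s');\n" \
    "$guid" "$contributor"
done < transcript_metadata.tsv > inserts.sql
```

The resulting `inserts.sql` can then be reviewed before being run against the metadata database, which mirrors our practice of generating the statements first and applying them as a separate step.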
Once the transcript metadata has been ingested we use both a Bash and a Ruby script to upload the proxy recordings to our streaming service, Sony Ci, and the transcripts and closed caption SRT files to our web platform, Amazon. We run a Bash script to generate another set of SQL statements to add the Sony Ci URLs and some preservation metadata (generated during the digital preservation phase) to our Archival Management System. We then export the GUIDs’ Archival Management System records into PBCore XML and ingest the XML into the AAPB’s website. As each step of this process is completed, we document it in the “Online Workflow tracker,” which will eventually register that work on the GUID is completed. When the PBCore ingest is completed and documented on the “Online Workflow tracker,” the recording and transcript are immediately accessible online and the record displays as complete on the “Master GUID Status spreadsheet”!
We consider a record that has an accurate full text transcript, contributor names, and subject terms to be sufficiently described for discovery functions on the AAPB. The transcript and terms will be fully indexed to facilitate searching and browsing. When a transcript matches, our descriptive process for NewsHour is fully automated. This is because we’re able to utilize the NewsHour’s legacy data. Without that data, the descriptive work required for this collection would be tremendous.
The American Archive of Public Broadcasting (AAPB) has launched a new digital exhibit about newsmagazines, a popular form of news presentation spanning five decades of radio and television broadcasting. Departing from mainstream examples such as 60 Minutes and All Things Considered, the exhibit brings together unique programs produced by independent stations from across the country for the first time as a unified collection. The newsmagazines showcased in “Structuring the News” cover topics from labor strikes to a day in the life of an air traffic controller, and emphasize conversations and voices often overlooked by network news shows.
“Structuring the News” is curated by Digital Exhibits Intern Alejandra Dean, and highlights 42 definitive examples representing both metropolitan producers and smaller, regional studios. Many of the shows in the exhibit prioritize local issues and communities, providing a window into American daily life from 1976-2016. In addition to defining the format, the exhibit looks at important precursors during the 1960s that experimented with news reporting.
To celebrate the launch of “Structuring the News: The Magazine Format in Public Media”, the exhibit’s curator, Alejandra Dean, AAPB Project Manager Casey Davis Kaufman, and Mark Williams, Professor of Film and Media Studies at Dartmouth College, will be discussing newsmagazines in a Facebook Live event at 12pm EDT on Thursday, July 6th. Don’t miss this inside look at over fifty years of broadcast newsmagazines, and the chance to ask questions about the exhibit! To watch, head to WGBH’s Facebook page at 12pm EDT on July 6th.
Grant will bolster capacity and usability of the American Archive of Public Broadcasting
BOSTON (June 22, 2017) – WGBH Educational Foundation is pleased to announce that the Andrew W. Mellon Foundation has awarded WGBH a $1 million grant to support the American Archive of Public Broadcasting (AAPB). The AAPB, a collaboration between Boston public media station WGBH and the Library of Congress, has been working to digitize and preserve nearly 50,000 hours of broadcasts and previously inaccessible programs from public radio and public television’s more than 60-year legacy.
WGBH will use the grant funds to build technical capacity for the intake of new content, develop collaborative initiatives, build training and support services for AAPB contributors and foster scholarly use and enhance public access for the collection. These efforts will include the creation of advisory committees for scholars, stations and educators.
“The work of the American Archive of Public Broadcasting is crucial for preserving our public media history and making this rich vault of content available to all,” said WGBH President and CEO Jon Abbott. “I am grateful that the Mellon Foundation has recognized the invaluable efforts of our archivists to save these historic programs for the future. WGBH is honored to accept this generous grant.”
The AAPB is a national effort to preserve at-risk public media and provide a central web portal for access to the programming that public stations and producers have created over the past 60 years. In its initial phase, the AAPB digitized approximately 40,000 hours of radio and television programming and related materials selected by more than 100 public media stations and organizations across the country. The entire collection is available for research on location at WGBH and the Library, and currently more than 20,000 programs are available in the AAPB’s Online Reading Room at americanarchive.org to anyone in the United States.
WGBH Boston is America’s preeminent public broadcaster and the largest producer of PBS content for TV and the Web, including Masterpiece, Antiques Roadshow, Frontline, Nova, American Experience, Arthur, Curious George, and more than a dozen other prime-time, lifestyle, and children’s series. WGBH also is a leader in educational multimedia, including PBS LearningMedia, and a pioneer in technologies and services that make media accessible to the 36 million Americans who are deaf, hard of hearing, blind, or visually impaired. WGBH has been recognized with hundreds of honors: Emmys, Peabodys, duPont-Columbia Awards…even two Oscars. Find more information at www.wgbh.org.
About the Library of Congress
The Library of Congress is the world’s largest library, offering access to the creative record of the United States – and extensive materials from around the world – both on site and online. It is the main research arm of the U.S. Congress and the home of the U.S. Copyright Office. Explore collections, reference services and other programs and plan a visit at loc.gov, access the official site for U.S. federal legislative information at congress.gov and register creative works of authorship at copyright.gov.
About the American Archive of Public Broadcasting
The American Archive of Public Broadcasting (AAPB) is a collaboration between the Library of Congress and the WGBH Educational Foundation to coordinate a national effort to preserve at-risk public media before its content is lost to posterity and provide a central web portal for access to the unique programming that public stations have aired over the past 60 years. To date, over 40,000 hours of television and radio programming contributed by more than 100 public media organizations and archives across the United States have been digitized for long-term preservation and access. The entire collection is available on location at WGBH and the Library of Congress, and more than 20,000 programs are available online at americanarchive.org.
About the Andrew W. Mellon Foundation
Founded in 1969, the Andrew W. Mellon Foundation endeavors to strengthen, promote, and, where necessary, defend the contributions of the humanities and the arts to human flourishing and to the well-being of diverse and democratic societies by supporting exemplary institutions of higher education and culture as they renew and provide access to an invaluable heritage of ambitious, path-breaking work. Additional information is available at mellon.org.
2017 is the 50th anniversary of the Public Broadcasting Act. Join Current for Get with The Program!: Shows that Shaped Public Television, a series of online events looking at some of the most influential public TV programs of all time. First up: Firing Line, the legendary public affairs program hosted by conservative intellectual William F. Buckley. Watch clips of Firing Line, courtesy of the Hoover Institution Archives, and discuss the impact of this groundbreaking show on American culture and public TV itself. Guests include Heather Hendershot, author of “Open to Debate: How William F. Buckley Put Liberal America on The Firing Line” and former ABC News analyst Jeff Greenfield. This free event is Wednesday, May 24 at 1 pm ET. Reserve your spot here: bit.ly/pba50-firingline.
WGBH, on behalf of the American Archive of Public Broadcasting (AAPB) and with funding from the Institute of Museum and Library Services, is excited to announce today’s launch of FIX IT, an online game that allows members of the public to help AAPB professional archivists improve the searchability and accessibility of more than 40,000 hours of digitized, historic public media content.
For grammar nerds, history enthusiasts and public media fans, FIX IT unveils the depth of historic events recorded by public media stations across the country and allows anyone and everyone to join together to preserve public media for the future. FIX IT players can rack up points on the game leaderboard by identifying and correcting errors in machine-generated transcriptions that correspond to AAPB audio. They can listen to clips and follow along with the corresponding transcripts, which sometimes misidentify words or generate faulty grammar or spelling. Each error fixed is points closer to victory.
Visit fixit.americanarchive.org to help preserve history for future generations. Players’ corrections will be made available in public media’s largest digital archive at americanarchive.org. Please help us spread the word!
In our last blog post (click for link) on managing the PBS NewsHour Digitization Project, I briefly discussed WGBH’s digital preservation and ingest workflows. Though many of our procedures follow standard practices common to archival work, I thought it would be worthwhile to cover them more in-depth for those who might be interested. We at WGBH are responsible for describing, providing access to, and digitally preserving the proxy files for all of our projects. The Library of Congress preserves the masters. In this post I cover how we preserve and prepare to provide access to proxy files.
Before a file is digitized, we ingest the item-level tape inventory generated during the project planning stages into our Archival Management System (AMS – see link for the Github). The inventory is a CSV that we normalize to our standards, upload, and then map to PBCore in MINT, or “Metadata Interoperability Services,” an open-source web-based plugin designed for metadata mapping and aggregation. The AMS ingests the data and creates new PBCore records, which are stored as individual elements in tables in the AMS. The AMS generates a unique ID (GUID) for each asset. We then export the metadata, provide it to the digitization vendor, and use the GUID identifiers to track records throughout the project workflow.
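The CSV-to-PBCore mapping can be sketched in miniature. This is a drastically simplified illustration: the inventory row, its three fields, and the tiny XML subset are invented for the example, while the real AMS/MINT mapping covers far more of the PBCore schema.

```shell
# Sketch: wrap one inventory row in a minimal PBCore-style XML record.
# The CSV fields and the element subset are illustrative only.
printf 'cpb-aacip-507-0001,NewsHour episode,1989-10-17\n' > inventory.csv

while IFS=, read -r guid title date; do
  cat <<EOF > "$guid.xml"
<pbcoreDescriptionDocument>
  <pbcoreIdentifier source="AMS">$guid</pbcoreIdentifier>
  <pbcoreTitle>$title</pbcoreTitle>
  <pbcoreAssetDate>$date</pbcoreAssetDate>
</pbcoreDescriptionDocument>
EOF
done < inventory.csv
```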
For the NewsHour project, George Blood L.P. receives the inventory metadata and the physical tapes to digitize to our specifications. For every GUID, George Blood creates an MP4 proxy for access, a JPEG2000 MXF preservation master, sidecar MD5 checksums for both video files, and a QCTools report XML for the master. George Blood names each file after the corresponding GUID and organizes the files into an individual folder for each GUID. During the digitization process, they record digitization event metadata in PREMIS spreadsheets. Those sheets are automatically harvested by the AMS at regular intervals, and the metadata is inserted into the corresponding catalog records. With each delivery batch George Blood also provides MediaInfo XML saved in BagIt containers for every GUID, and a text inventory of the delivery’s assets and corresponding MD5 checksums. The MediaInfo bags are uploaded via FTP to the AMS, which harvests technical metadata from them and creates PBCore instantiation metadata records for the proxies and masters. WGBH receives the digitized files on LTO 6 tapes, and the Library of Congress receives theirs on rotating large-capacity external hard drives.
For those who are not familiar with the tools I just mentioned, I will briefly describe them. A checksum is a computer-generated cryptographic hash. There are different types of hashes, but we use MD5, as do many other archives. The computer analyzes a file with the MD5 algorithm and delivers a 32-character code. If a file does not change, the MD5 value generated will always be the same. We use MD5s to ensure that files are not corrupted during copying and that they stay the same (“fixed”) over time. QCTools is an open source program developed by the Bay Area Video Coalition and its collaborators. The program analyzes the content of a digitized asset, generates reports, and facilitates the inspection of videos. BagIt is a file packaging format developed by the Library of Congress and partners that facilitates the secure transfer of data. MediaInfo is a tool that reports technical metadata about media files. It’s used by many in the AV and archives communities. PREMIS is a metadata standard used to record data about an object’s digital preservation.
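A minimal fixity check looks like this. Our Mac workflow uses the macOS `md5` command; `md5sum` below is the GNU equivalent, and the “proxy” file is just a text stand-in for a real video.

```shell
# Record a checksum at delivery time, then verify it later. Any change
# to the file changes the hash, so a mismatch means the copy is corrupt
# or was altered. (macOS: `md5 -q`; GNU/Linux: `md5sum`, used here.)
printf 'example proxy content' > cpb-aacip-507-0001.mp4
md5sum cpb-aacip-507-0001.mp4 > manifest.md5   # record at delivery

# Later: re-hash and compare against the recorded value.
md5sum -c manifest.md5
```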
Now a digression about my inventories – sorry in advance. ¯\_(ツ)_/¯
I keep two active inventories of all digitized files received. One is an Excel spreadsheet “checksum inventory” in which I track whether a GUID was supposed to be delivered but was not received, or was delivered more than once. I also use it to confirm that the checksums George Blood gave us match the checksums we generate from the delivered files, and it serves as a backup for checksum storage and organization during the project. The inventory has a master sheet with info for every GUID, and then each tape has an individual sheet with an inventory and checksums of its contents. I set up simple formulas that report any GUIDs or checksums that have issues. I could use scripts to automate the checksum validation process, but I like having the data visually organized for the NewsHour project. Given the relatively small volume of fixity checking I’m doing, this manual verification works fine for this project.
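The comparison those spreadsheet formulas perform could be sketched in shell like this: join the vendor’s checksum manifest against our locally generated one and flag any GUID whose two values disagree. The GUIDs and checksum values here are fabricated for illustration.

```shell
# Vendor-supplied checksums vs. locally generated checksums, one
# "GUID checksum" pair per line (values fabricated for the example).
printf 'cpb-aacip-507-0001 aaa111\ncpb-aacip-507-0002 bbb222\n' | sort > vendor.txt
printf 'cpb-aacip-507-0001 aaa111\ncpb-aacip-507-0002 ccc333\n' | sort > local.txt

# Join on GUID, then print any GUID whose two checksums differ.
join vendor.txt local.txt | awk '$2 != $3 {print $1}' > mismatches.txt
```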
The other inventory is the Approval Tracker spreadsheet in our Google Sheets NewsHour Workflow workbook (click here for link). The Approval Tracker is used to track each GUID’s ingest and digital preservation workflow status. I record in it when I have finished the digital preservation workflow on a batch, and I mark when the files have been approved by all project partners. Partners have two months from the date of delivery to report approvals to George Blood. Once the files are approved they’re automatically placed on the Intern Review sheet for the arrangement and description phase of our workflow.
Okay, forgive me for that, now back to WGBH’s ingest and digital preservation workflow for the NewsHour project!
The first thing I do when we receive a shipment from George Blood is the essential routine I learned the hard way while stocking a retail store – always make sure everything that you paid for is actually there! I do this for the physical LTO tapes, the files on the tapes, the PREMIS spreadsheet, the bags, and the delivery’s inventory. In Terminal I use a bash script that checks a list of GUIDs against the files present on our server to ensure that all bags have been correctly uploaded to the AMS. If we’ve received everything expected, I then organize the data from the inventory, copying the submission checksums into each tape’s spreadsheet in my Excel “checksum inventory”. Then I start working with the tapes.
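The completeness check can be sketched like this: compare the GUIDs we expected against the files actually delivered, and report anything missing. The filenames are made up for the example; the real script compares against our server, not a local folder.

```shell
# Simulate a delivery folder with two of three expected proxies.
mkdir -p delivered
touch delivered/cpb-aacip-507-0001.mp4 delivered/cpb-aacip-507-0002.mp4

# Expected GUIDs (sorted), and GUIDs derived from the delivered files.
printf 'cpb-aacip-507-0001\ncpb-aacip-507-0002\ncpb-aacip-507-0003\n' | sort > expected.txt
ls delivered | sed 's/\.mp4$//' | sort > received.txt

# comm -23 prints lines only in the first file: expected but missing.
comm -23 expected.txt received.txt > missing.txt
```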
Important background information is that the AAPB staff at WGBH work in a Mac environment, so what I’m writing about works for Mac, but it could easily be adapted to other systems. The first step I take with the tapes is to check them for viruses. We use Sophos to do that in Terminal, with the Sweep command. If no viruses are found I then use one of our three LTO workstations to copy the MP4 proxies, proxy checksums, and QCTools XML reports from the LTO to a hard drive. I use the Terminal to do the copying, which I leave running while I go on to other work. When the tape is done copying I use Terminal to confirm that the number of files copied matches the number of files I expected to copy. After that, I use it to run an MD5 report (with the find, -exec, and MD5 commands) on the copied files on the hard drive. I put those checksums into my Excel sheet and confirm they match the sums provided by George Blood, that there are no duplicates, and that we received everything we expected. If all is well, I put the checksum report onto our department server and move on to examining the delivered files’ specifications.
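The post-copy checks could be sketched like so: confirm the file count matches, then build an MD5 manifest with `find`/`-exec`. The files are tiny stand-ins, and `md5sum` stands in for the macOS `md5` command the workflow actually uses.

```shell
# Simulate a completed copy of two proxies to a working drive.
mkdir -p copied
printf 'a' > copied/cpb-aacip-507-0001.mp4
printf 'b' > copied/cpb-aacip-507-0002.mp4

# 1. Does the number of copied files match what we expected?
expected=2
actual=$(find copied -type f -name '*.mp4' | wc -l)
if [ "$actual" -eq "$expected" ]; then echo "count OK"; else echo "count MISMATCH"; fi

# 2. Generate an MD5 manifest of everything copied.
find copied -type f -name '*.mp4' -exec md5sum {} \; | sort > copy_manifest.md5
```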
I use MediaInfo and MDQC to confirm that files we receive conform to our expectations. Again, this is something I could streamline with scripts if the workflow needed it, but MDQC gets the job done for the NewsHour project. MDQC is a free program from AVPreserve that checks a group of files against a reference file and passes or fails them according to rules you specify. I set the test to check that the files in the delivered batch are encoded to our specifications (click here for those). If any files fail the test, I use MediaInfo in Terminal to examine why they failed. I record any failures at this stage, or earlier in the checksum stage, in an issue tracker spreadsheet the project partners share, and report the problems to the vendor so that they can deliver corrected files.
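The rule-based checking MDQC performs amounts to comparing reported technical metadata against expected values. A toy version, using a fabricated stand-in for MediaInfo output rather than the real tool:

```shell
# Fabricated MediaInfo-style report for one delivered proxy.
cat > report.txt <<'EOF'
Format  : MPEG-4
Width   : 720
Height  : 486
EOF

# One illustrative rule: the frame width must match the project spec.
expected_width=720
width=$(awk -F': *' '/^Width/ {print $2}' report.txt)
if [ "$width" -eq "$expected_width" ]; then
  echo "PASS: width"
else
  echo "FAIL: width is $width, expected $expected_width"
fi
```

MDQC applies a whole set of such rules at once across a batch, which is why it beats hand-checking each file in Terminal.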
Next I copy the set of copies on the hard drive onto other working hard drives for the interns to use during the review stage. I then skim a small sample of the files to confirm their content meets our expectations, comparing the digitizations to the transfer notes provided by George Blood in the PREMIS metadata. I review a few of the QCTools reports, looking at the video’s levels. I don’t spend much time doing that, though, because the Library of Congress reviews the levels and characteristics of every master file. If everything looks good I move on, because all the proxies will be reviewed at an item level by our interns during the next phase of the project’s workflow anyway.
The last steps are to mark both the delivery batch’s digital preservation complete and the files as approved in the Approval Tracker, create a WGBH catalog record for the LTO, run a final MD5 manifest of the LTO and hard drive, upload some preservation metadata (archival LTO name, file checksums, and the project’s internal identifying code) to the AMS, and place the LTO and drive in our vault. The interns then review and describe the records and, after that, the GUIDs move into our access workflow. Look forward to future blog posts about those phases!
The Council on Library and Information Resources (CLIR) recently completed a six-part webinar series to share best practices and lessons learned from their Cataloging Hidden Collections program. Sponsored through the generous support of The Andrew W. Mellon Foundation, the Strategies for Advancing Hidden Collections (SAHC) series aims to help those working in GLAM (Gallery, Library, Archive, Museum) organizations build the confidence they need to tackle the processing of hidden archival collections. This series may also be particularly useful for public media organizations that are planning preservation projects.
In the past few months, we’ve added several new radio collections to our Online Reading Room!
The Donald Voegeli collection preserves the music and memory of Don Voegeli, who wrote the theme music for All Things Considered on NPR, along with providing many other contributions to public radio over the course of a long and impressive career.
Variations on the All Things Considered theme make up just a fraction of the Don Voegeli collection. There’s also plenty of Voegeli’s other work to explore, from musical compositions in the vein of the ATC theme like Swiss Clock Maker to catchy educational jingles like Math Song (“you bisect an angle by using a ruler and compass / you bisect a compass by using a good sharp axe”).
Donald Voegeli’s son Jim Voegeli, a radio producer in his own right, has also contributed four audio documentaries of his own as a separate collection. “Speaking of Wilderness,” Jim’s first documentary on the importance of the conservation of wild places, aired on NPR when he was only 16. Jim’s piece “Remembering Aldo Leopold,” a radio documentary essay on the life and legacy of the visionary conservationist and writer, went on to win an Ohio State Award.
Finally, for more award-winning environmental journalism, check out our newest collection of works by Ilsa Setziol, longtime environmental reporter for KPCC. Among other honors, Setziol has been recognized for Outstanding Beat Reporting in Radio by the Society of Environmental Journalists for pieces like this 2003 report on the environmental aftermath of fires in San Bernardino County, “Fire Recovery, Part 1.”