Upcoming AAPB Webinar Featuring Kathryn Gronsbell, Digital Collections Manager at Carnegie Hall

Photo courtesy of Rebecca Benson, @jeybecques, PBPF Fellow at University of Missouri.

This Thursday, March 15th at 8 pm EST, American Archive of Public Broadcasting (AAPB) staff will host a webinar with Kathryn Gronsbell, Digital Collections Manager at Carnegie Hall. The webinar will cover topics in documentation, including why documentation is important, what to think about when recording workflows for future practitioners, and where to find examples of good documentation in the wild.

The public is welcome to join for the first half hour. The last half hour will be limited to Q&A with our Public Broadcasting Preservation Fellows, who have now begun to inventory their digitized public broadcasting collections to be preserved in the AAPB.

Webinar URL: http://wgbh1.adobeconnect.com/documentation/

For anyone who missed the last webinar on tools for Quality Control, it’s now also available for viewing through this link: http://wgbh1.adobeconnect.com/psv1042lp222/.

*******************************

For more updates on the Public Broadcasting Preservation Fellowship project, follow the project at pbpf.americanarchive.org and on Twitter at #aapbpf, and come back in a few months to check out the results of their work: digitized content preserved in the American Archive of Public Broadcasting from our collaborating host organizations WUNC, KOPN, the Oklahoma Educational Television Authority, Georgia Public Broadcasting, and the Center for Asian American Media, as well as documentation created to support ongoing audio and video preservation education at the University of Missouri, the University of Oklahoma, Clayton State University, the University of North Carolina at Chapel Hill, and San Jose State University.

Celebrate Women’s History Month by Preserving Women’s Voices in Public Media

One of the most fascinating aspects of the American Archive of Public Broadcasting (AAPB) is discovering how local broadcasting stations used their platforms to communicate national issues to local audiences.

As second-wave feminism gained momentum between 1960 and 1980, WNED in Buffalo, New York, documented the movement’s ripple effect in a half-hour public affairs talk show series titled Woman. Syndicated by over 200 PBS stations from 1973 to 1977, Woman was the only year-round, national public television forum in which a wide variety of national experts provided perspectives on the (then) evolving world of women’s history.

To celebrate this milestone in women’s public media history, the American Archive of Public Broadcasting (AAPB) launched a new Special Collection featuring the Woman series! Over 190 episodes are available online via the AAPB website: http://americanarchive.org/special_collections/woman-series.

Woman Series, WNED – Buffalo, NY (1973-1977)

The AAPB invites you to celebrate Women’s History Month by helping preserve and make accessible six Woman transcripts. We’re launching a demo version of our *NEW* transcript editor tool FIX IT+, a line-by-line editing platform initially developed by the New York Public Library. The six featured interviews include conversations with Gloria Steinem (editor and co-founder of Ms. Magazine), Dorothy Pitman Hughes (African American activist and co-founder of Ms. Magazine), Betty Friedan (author of The Feminine Mystique), Nora Ephron (editor for Esquire magazine and author of the best-selling book Crazy Salad), Marcia Ann Gillespie (editor-in-chief of Essence Magazine and a board member of Essence Communications), Connie Uri, M.D. (on the National Board of Research on the Plutonium Economy and the advisory board of NASC, the Native American Solidarity Committee), and Marie Sanchez (Chief Judge of the Northern Cheyenne Tribe and a member of Indian Women United for Social Justice).

These transcripts will be made available online through the AAPB’s website, allowing women’s voices in public media to be more readily searchable and accessible for future generations.

Below are sample recordings of the six interviews mentioned above. Search the Woman Special Collection for more interviews with activists, journalists, writers, scholars, lawyers, artists, psychologists, and doctors, covering topics such as women in sports, the Equal Rights Amendment, sexuality, marriage, women’s health, divorce, the Women’s Liberation Movement, motherhood, and ageism, among others.

Direct link to FIX IT+: http://54.205.165.195.xip.io/

Sample Recordings of Featured Transcripts:

Connie Uri, M.D. and Marie Sanchez, Chief Judge of the Northern Cheyenne Tribe, FIX IT+ Transcript: http://54.205.165.195.xip.io/transcripts/cpb-aacip_81-67wm3fxh

Marcia Ann Gillespie, FIX IT+ Transcript: http://54.205.165.195.xip.io/transcripts/cpb-aacip_81-69z08t6x

Nora Ephron, FIX IT+ Transcript: http://americanarchive.org/catalog/cpb-aacip_81-988gttr0

Gloria Steinem, FIX IT+ Transcript: http://americanarchive.org/catalog/cpb-aacip_81-57np5qgv

Betty Friedan, FIX IT+ Transcript: http://americanarchive.org/catalog/cpb-aacip_81-9995xhm0

Dorothy Pitman Hughes, FIX IT+ Transcript: http://54.205.165.195.xip.io/transcripts/cpb-aacip_81-59c5b5nr

Written by Ryn Marchese, AAPB Engagement and Use Manager

Upcoming Webinar: AAPB’s Quality Control Tools and Techniques for Ingesting Digitized Collections

Oklahoma mentor Lisa Henry (left) cleaning a U-matic deck with Public Broadcasting Preservation Fellow Tanya Yule.

This Thursday, February 15th at 8 pm EST, American Archive of Public Broadcasting (AAPB) staff will host a webinar covering quality control tools and techniques used when ingesting digitized collections into the AAPB archive, including MDQC, MediaConch, Sonic Visualiser, and QCTools.

The public is welcome to join for the first half hour. The last half hour will be limited to Q&A with our Public Broadcasting Preservation Fellows, who are just now beginning the process of digitizing at-risk public broadcasting collections to be preserved in the AAPB.

Webinar URL: http://wgbh1.adobeconnect.com/psv1042lp222/

*******************************

For more updates on the Public Broadcasting Preservation Fellowship project, follow the project at pbpf.americanarchive.org and on Twitter at #aapbpf, and come back in a few months to check out the results of their work: digitized content preserved in the American Archive of Public Broadcasting from our collaborating host organizations WUNC, KOPN, the Oklahoma Educational Television Authority, Georgia Public Broadcasting, and the Center for Asian American Media, as well as documentation created to support ongoing audio and video preservation education at the University of Missouri, the University of Oklahoma, Clayton State University, the University of North Carolina at Chapel Hill, and San Jose State University.

 

Resources Roundup: AAPB Presentations from 2017 AMIA Conference

Earlier this month the American Archive of Public Broadcasting staff hosted several workshops at the 2017 Association of Moving Image Archivists (AMIA) conference in New Orleans. Their presentations on workflows, crowdsourcing, and copyright best practices are now available online! Be sure to also check out AMIA’s YouTube channel for recorded sessions.

THURSDAY, November 30th

  • PBCore Advisory Sub-Committee Meeting
    Rebecca Fraimow reported on general activities of the Sub-Committee and the PBCore Development and Training Project. The following current activities were presented:

    • PBCore Cataloging Tool (Linda Tadic)
    • PBCore MediaInfo updates (Dave Rice)
    • ProTrack integration (Rebecca Fraimow)
    • Updated CSV templates (Sadie Roosa)
    • PBCore crosswalks (Rebecca Fraimow and Sadie Roosa)

FRIDAY, December 1st

Archives that hold A/V materials are at a critical point, with many cultural heritage institutions needing to take immediate action to safeguard at-risk media formats before the content they contain is lost forever. Yet, many in the cultural heritage communities do not have sufficient education and training in how to handle the special needs that A/V archive materials present. In the summer of 2015, a handful of archive educators and students formed a pan-institutional group to help foster “educational opportunities in audiovisual archiving for those engaged in the cultural heritage sector.” The AV Competency Framework Working Group is developing a set of competencies for audiovisual archive training of students in graduate-level education programs and in continuing education settings. In this panel, core members of the working group will discuss the main goals of the project and the progress that has been made on it thus far.

Born-digital audiovisual files continue to present a conundrum to archivists in the field today: should they be accepted as-is, transcoded, or migrated? Is transcoding to a recommended preservation format always worth the potential extra storage space and staff time? If so, what are the ideal target specifications? In this presentation, individuals working closely with born-digital audiovisual content from the University of North Carolina, WGBH, and the American Folklife Center at the Library of Congress will present their own use cases involving collections processing practices, from “best practice” to the practical reality of “good enough”. These use cases will highlight situations wherein video quality, subject matter, file size, and stakeholder expectations end up playing important roles in directing the steps taken for preservation. From these experiences, the panel will put forth suggestions for tiered preservation decision making, recognizing that not all files should necessarily be treated alike.

  • Crowdsourcing Anecdotes

How does the public play a role in making historical AV content accessible? The American Archive of Public Broadcasting has launched two games that engage the public in transcribing and describing 70+ years of audio and visual content comprising more than 50,000 hours.

THE TOOLS:

(Speech-to-Text Transcript Correction) FIX IT is an online game that allows the public to identify and correct errors in our machine-generated transcripts. FIX IT players have exclusive access to historical content and long-lost interviews from stations across the country.

AAPB KALDI is a tool and profile for speech-to-text transcription of video and audio, released by the Pop Up Archive and made available on GitHub at github.com/WGBH/american-archive-kaldi.

(Program Credits Cataloging) ROLL THE CREDITS is a game that allows the public to identify and transcribe the text that appears on screen in television broadcasts. ROLL THE CREDITS asks users to collect this valuable information and classify it into categories that can be added to the AAPB catalog. To accomplish this goal, we’ve extracted frames from uncataloged video files and are asking for help transcribing the important information contained in each frame.

SATURDAY, December 2nd

Digitized collections often remain almost as inaccessible as they were on their original analog carriers, primarily due to institutional concerns about copyright infringement and privacy. The American Archive of Public Broadcasting has taken steps to overcome these challenges, making available online more than 22,000 historic programs with zero take-down notices since the 2015 launch. This copyright session will highlight practical and successful strategies for making collections available online. The panel will share strategies for: 1) developing template forms with standard terms to maximize use and access, 2) developing a rights assessment framework with limited resources (an institutional “Bucket Policy”), 3) providing limited access to remote researchers for content not available in the Online Reading Room, and 4) promoting access through online crowdsourcing initiatives.

The American Archive of Public Broadcasting seeks to preserve and make accessible significant historical public media content, and to coordinate a national effort to save at-risk public media recordings. In the four years since WGBH and the Library of Congress began stewardship of the project, significant steps have been taken towards accomplishing these goals. The effort has inspired workflows that function constructively, beginning with preservation at local stations and building to national accessibility on the AAPB. Archivists from two contributing public broadcasters will present their institutions’ local preservation and access workflows. Representatives from WGBH and the Library of Congress will discuss collaborating with contributors and the AAPB’s digital preservation and access workflows. By sharing their institutions’ roles and how collaborators participate, the speakers will present a full picture of the AAPB’s constructive inter-institutional work. Attendees will gain knowledge of practical workflows that facilitate both local and national AV preservation and access.

As an increasing number of audiovisual formats become obsolete and the available hours remaining on deteriorating playback machines decrease, it is essential for institutions to digitize their AV holdings to ensure long-term preservation and access. With an estimated hundreds of millions of items to digitize, it is impractical, if not impossible, for institutions to perform all of this work in-house before time runs out. While this can seem like a daunting process, why learn the hard way when you can benefit from the experiences of others? From those embarking on their first outsourced AV digitization project to those who have completed successful projects but are looking for ways to refine and scale up their process, everyone has something to learn from these speakers about managing AV digitization projects from start to finish.

How do you bring together a collection of broadcast materials scattered across various geographical locations around the country? National Educational Television (NET), the precursor to PBS, distributed programs nationally to educational television stations from 1954 to 1972. Although this collection is tied together through provenance, it presents a challenge to processing due to differing approaches to descriptive practices across many repositories over many years. By aggregating inventories into one catalog and describing titles more fully, the NET Collection Catalog will help institutions holding these materials make informed preservation decisions. By the project’s conclusion, the AAPB will publish an online list of NET titles annotated with relevant descriptive information culled from NET textual records, which will greatly improve discoverability of NET materials for archivists, scholars, and the general public. Examples of specific cataloging issues will be explored, including contradictory metadata documentation and legacy records, inconsistent titling practices, and the existence of international versions.

ABOUT THE AAPB

The American Archive of Public Broadcasting (AAPB) is a collaboration between the Library of Congress and the WGBH Educational Foundation to coordinate a national effort to preserve at-risk public media before its content is lost to posterity and provide a central web portal for access to the unique programming that public stations have aired over the past 70 years. To date, over 50,000 hours of television and radio programming contributed by more than 100 public media organizations and archives across the United States have been digitized for long-term preservation and access. The entire collection is available on location at WGBH and the Library of Congress, and almost 25,000 programs are available online at americanarchive.org.

Announcing ROLL THE CREDITS: Classifying and Transcribing Text with Zooniverse

Today we’re launching ROLL THE CREDITS, a new Zooniverse project to engage the public in helping us catalog unseen content in the AAPB archive. Zooniverse is the “world’s largest and most popular platform for people-powered research.” Zooniverse volunteers (like you!) are helping the AAPB classify and transcribe text from extracted frames of uncataloged public television programs, providing us with information we can plug directly into our catalog and closing the gap on our sparsely described collection of nearly 50,000 hours of television and radio.

Example frame from ROLL THE CREDITS

The American people have made a huge investment in public radio and television over many decades. The American Archive of Public Broadcasting (AAPB) works to ensure that this rich source for American political, social, and cultural history and creativity is saved and made available once again to future generations.

The improved catalog records will have verified titles, dates, credits, and copyright statements. With the updated, verified information we will be able to make informed decisions about the development of our archive, as well as provide access to corrected versions of transcripts available for anyone to search free of charge at americanarchive.org.

In conjunction with our speech-to-text transcripts from FIX IT, a game that asks users to correct and validate the transcripts one phrase at a time, ROLL THE CREDITS helps us fulfill our mission of preserving and making accessible historic content created by public media, saving at-risk media before the contents are lost to posterity.

Thanks for supporting AAPB’s mission! Know someone who might be interested? Feel free to share with the other transcribers and public media fans in your life!

“Dockerized” Kaldi Speech-to-Text Tool

At the AAPB “Crowdsourcing Anecdotes” meeting last Friday at the Association of Moving Image Archivists conference, I talked about a free “Dockerized” build of Kaldi made by Stephen McLaughlin, a PhD student at the UT Austin School of Information. I thought I would follow up on my introduction to it there by providing links to these resources, instructions for setting it up, and some anecdotes about using it. First, the best resource for this Docker Kaldi and Stephen’s work is the HiPSTAS GitHub repository: https://github.com/hipstas/kaldi-pop-up-archive. It also has detailed information for setting up and running the Docker Kaldi.

I confess that I don’t know much about computer programming and engineering besides what I need to get my work done. I am an archivist and I eagerly continue to gain more computer skills, but some of my terminology here might be a little off or unclear. Anyway, Kaldi is a free speech-to-text tool that interprets audio recordings and outputs timestamped JSON and text files. This “Dockerized” Kaldi allows you to easily get a version of Kaldi running on pretty much any reasonably powerful computer. The recommended minimum is at least 6 GB of RAM, and I’m not sure about the CPU. The more of both the better, I’m sure.

The Docker platform provides a framework to easily download and set up a computer environment in which Kaldi can run. Kaldi is pretty complicated, but Stephen’s Docker image (https://hub.docker.com/r/hipstas/kaldi-pop-up-archive) helps us all bypass setting up Kaldi. As a bonus, it comes set up with the language model that Pop Up Archive created as part of our IMLS grant (link here) with them and HiPSTAS. They trained the model using AAPB recordings. Kaldi needs a trained language model to interpret the audio data put through the system. Because this build of Kaldi uses the Pop Up Archive model, it is already trained for American English.

I set up Docker on my Mac laptop, so the rest of the tutorial will focus on that system, but the GitHub repository has information for Windows and Linux, and those setups are not very different. By the way, these instructions will probably be really easy for people who are used to interacting with tools in the command line, but I am going to write this post as if the reader hasn’t done that much. I will also note that while this build of Kaldi is really exciting and potentially useful, especially given all the fighting I’ve done with these kinds of systems in my career, I didn’t test it thoroughly because it is only Stephen’s experiment complementing the grant project. I’d love to get feedback on issues you might encounter! Also I’ve got to thank Stephen and HiPSTAS!! THANK YOU Stephen!!

SET UP AND USE:

The first step is to download Docker (https://www.docker.com/). You then need to go into Docker’s preferences, under Advanced, and make sure that Docker has access to at least 6 GB of RAM. Add more if you’d like.

Give Docker more power!

Then navigate to the Terminal and pull Stephen’s Docker image for Kaldi. The command is “docker pull -a hipstas/kaldi-pop-up-archive”. (Note: Stephen’s GitHub says that you can run the pull without options, but I got errors if I ran it without “-a”.) This is a big 12 GB download, so go do something else while it finishes. I ate some Thanksgiving leftovers.

When everything is finished downloading, set up the image by running the command “docker run -it --name kaldi_pua --volume ~/Desktop/audio_in/:/audio_in/ hipstas/kaldi-pop-up-archive:v1”. This starts the Kaldi Docker image and creates a new folder on your desktop where you can add media files you want to run through Kaldi. This is also the place where Kaldi will write the output. Add some media to the folder BUT NOTE: the filenames cannot have spaces or uncommon characters or Kaldi will fail. My test of this setup ran well on some short mp4s. Also, your Terminal will now be controlling the Docker image, so your command line prompt will look different than it did, and you won’t be “in” your computer’s file system until you exit the Docker image.

Now you need to download the script that initiates the Kaldi process. The command to download it is “wget https://raw.githubusercontent.com/hipstas/kaldi-pop-up-archive/master/setup.sh”. Once that is downloaded to the audio_in folder (and you’ve added media to the same folder) you can run a batch by executing the command “sh ./setup.sh”.
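For reference, here is the whole setup sequence in one place, the way I ran it on my Mac. The mount path and image tag are the same ones used above; the cd into /audio_in/ is my assumption about where you’ll want setup.sh to live:

    # 1. Pull the Dockerized Kaldi image (the -a flag is what worked for me)
    docker pull -a hipstas/kaldi-pop-up-archive

    # 2. Start the container, mounting ~/Desktop/audio_in/ as /audio_in/
    docker run -it --name kaldi_pua \
      --volume ~/Desktop/audio_in/:/audio_in/ \
      hipstas/kaldi-pop-up-archive:v1

    # 3. Inside the container: grab the batch script and run it against
    #    whatever media you have dropped into the audio_in folder
    cd /audio_in/
    wget https://raw.githubusercontent.com/hipstas/kaldi-pop-up-archive/master/setup.sh
    sh ./setup.sh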

Kaldi will run through a batch, and a ton of text will continue to roll through your Terminal. Don’t be afraid that it is taking forever. Kaldi is meant to run on very powerful computers, and running it this way is slow. I tested it on a 30-minute recording, and it took 2.5 hours to process. It will go faster the more computing power you allow Docker to use, but it is reasonable to assume that on most computers the processing time will be around 5 times the recording length.

Picture of Kaldi doing its thing

The setup script converts WAV, MP3, and MP4 files to a 16 kHz broadcast WAV, which is the input that Kaldi requires. You might need to manually convert your media to broadcast WAV if the setup script doesn’t work. I started out by testing a broadcast WAV that I had made myself with FFmpeg, but Kaldi and/or the setup script didn’t like it. I didn’t resolve that problem because the Kaldi image runs fine on media that it converts itself, so that saves me the trouble anyway.
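If you do need to convert something yourself, an FFmpeg command along these lines is probably a reasonable starting point. The exact parameters (16 kHz, mono, 16-bit PCM) are my assumption of what “16 kHz broadcast WAV” means here, so compare the result against what the setup script produces before relying on it:

    # Convert input.mp4 to a 16 kHz, mono, 16-bit PCM WAV (assumed target specs)
    ffmpeg -i input.mp4 -vn -ar 16000 -ac 1 -c:a pcm_s16le input_16k.wav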

When Kaldi is done processing, the text output will be in the “transcripts” folder inside the “audio_in” folder. There will be both a JSON and a txt file for every recording processed, named the same as the original media file. The quality of the output depends greatly on the original quality of the recording and how closely the recording resembles the language model (in this case, a studio recording of individuals speaking standard American English). That said, we’ve had some pretty good results in our tests. NOTE THAT if you haven’t assigned enough power to Docker, Kaldi will fail to process, and will do so without reporting an error. The failed files will produce JSON and txt output files that are blank. If you’re having trouble, try adding more RAM to Docker, or check that your media file is successfully converting to broadcast WAV.
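A quick way to spot those silent failures is to look for zero-byte output files; a minimal check, assuming the audio_in folder on your Desktop as set up above:

    # List any empty transcript outputs, which indicate a failed run
    find ~/Desktop/audio_in/transcripts -type f -size 0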

When you want to return your terminal to normal, use the command “exit” to shut down the image and return to your file system.

When you want to start the Kaldi image again to run another batch, open another session by running “docker start /kaldi_pua” and then “docker exec -it kaldi_pua bash”. You’ll then be in the Kaldi image and can run the batch with the “sh ./setup.sh” command.
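Put together, the restart sequence looks like this (kaldi_pua is the container name created earlier; the leading slash in the start command is optional):

    # Restart the existing container and attach a shell to it
    docker start /kaldi_pua
    docker exec -it kaldi_pua bash

    # Inside the container, kick off another batch
    cd /audio_in/
    sh ./setup.sh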

I am sure that there are ways to update or modify the language model, or to use a different model, or to add different scripts to the Docker Kaldi, or to integrate it into bigger workflows. I haven’t spent much time exploring any of that, but I hope you found this post a helpful start. We’re going to keep it in mind as we build up our speech-to-text workflows, and we’ll be sure to share any developments. Happy speech-to-texting!!

Introducing an audio labeling toolkit

In 2015, the Institute of Museum and Library Services (IMLS) awarded WGBH, on behalf of the American Archive of Public Broadcasting, a grant to address the challenges faced by many libraries and archives trying to provide better access to their media collections through online discoverability. Through a collaboration with Pop Up Archive and HiPSTAS at the University of Texas at Austin, our project has supported the creation of speech-to-text transcripts for the initial 40,000 hours of historic public broadcasting preserved in the AAPB, the launch of a free open-source speech-to-text tool, and FIX IT, a game that allows the public to help correct our transcripts.

Now, our colleagues at HiPSTAS are debuting a new machine learning toolkit and DIY techniques for labeling speakers in “unheard” audio — audio that is not documented in a machine-generated transcript. The toolkit was developed through a massive effort using machine learning to identify notable speakers’ voices (such as Martin Luther King, Jr. and John F. Kennedy) within the AAPB’s 40,000-hour collection of historic public broadcasting content.

This effort has vast potential for archivists, researchers, and other organizations seeking to discover and make accessible sound at scale — sound that otherwise would require a human to listen and identify in every digital file.

Read more about the audio labeling toolkit here, and stay tuned for more posts in this series.

PBS NewsHour Digitization Project Update: “Asset Review” and Access and Description Workflows

I’ve previously written about developing and automating management of our workflows for the NewsHour project (click for link), and WGBH’s processes for ingesting and preserving the NewsHour digitizations (click for link). Now that the project is moving along, and over one thousand episodes of the NewsHour are already on the AAPB (with recently added transcript search functionality!!), I thought I would share more information about our access workflows and how we make NewsHour recordings available.

In this post I will describe our “Asset Review” and “Online Workflow” phases. The “Asset Review” phase is where we determine what work we will need to do to a recording to make it available online, and the “Online Workflow” phase is where we extract metadata from a transcript, add the metadata to our repository, and make the recording available online.

The goals and realities of the NewsHour project necessitate an item-level content review of each recording. The reasons for this are distinct and compounding. Because of the scale of the collection (nearly 10,000 assets), the inventories from which we derived our metadata were generated only from legacy databases and tape labels, which are sometimes wrong. At no point were we able to confirm that the content on any tape was complete and correct prior to digitization. In fact, some of the tapes cannot even be played until they have been prepared for digitization. Additionally, there is third-party content that needs to be redacted from some episodes of the NewsHour before they can be made available. A major complication is that the transcripts only match the 7pm Eastern broadcasts, and sometimes 9pm or 11pm updates would be recorded and broadcast if breaking news occurred. The tapes are not always marked with broadcast times, and sometimes do not contain the expected content – or even an episode of the NewsHour!

These complications would be fine if we were only preserving the collection, but our project goal is to make each recording and corresponding transcript or closed caption file broadly accessible. To accomplish that goal each record must have good metadata, and to have that we must review and describe each record! Luckily, some of the description, redaction, and our workflow tracking is automatable.

Access and Description Workflow Overview

As I’ve mentioned before, we coordinate and document all our NewsHour work in a large Google Sheet we call the “NewsHour Workflow workbook” (click here for link). The chart below explains how a GUID moves through sheets of the NewsHour workbook throughout our access and description work.

AAPB NewsHour Access and Description workflow chart

After a digitized recording has been delivered to WGBH and preserved, it is automatically placed in the queue on the “Asset Review” sheet of our workbook. During the Asset Review, the reviewer answers thirteen different questions about the GUID. Using these responses, the Google Sheet automatically places the assets into the appropriate workflow trackers in our workbook. For instance, if a recording doesn’t have a transcript, it is placed in the “No Transcript tracker”, which has extra workflow steps for generating a description and subject metadata. A GUID can have multiple issues that place it into multiple trackers simultaneously. For instance, a tape that is not an episode will also not have a transcript, and will be placed on both the “Not an Episode tracker” and the “No Transcript tracker”. The Asset Review is critical because the answers determine the work we must perform and ensure that each record will be correctly presented to the public when work on it is completed.

A GUID’s status in the various trackers is reflected in the “Master GUID Status sheet”, and is automatically updated when different criteria in the trackers are met and documented. When a GUID’s workflow tasks have been completely resolved in all the trackers, it appears as “Ready to go online” on the “Master GUID Status sheet.” The GUID is then automatically placed into the “AAPB Online Status tracker”, which presents the metadata necessary to put the GUID online and indicates whether tasks have been completed in the “Online Workflow tracker”. When all tasks are completed, the GUID will be online and our work on the GUID is finished.

In this post I am focusing on the workflow for digitizations that don’t have problems. This means the GUIDs are episodes, contain no technical errors, and have transcripts that match (green arrows in the chart). In future blog posts I’ll elaborate on our workflows for recordings that go into the other trackers (red arrows).

Asset Review

An image of a portion of our Asset Review spreadsheet

Each row of the “Asset Review sheet” represents one asset, or GUID. Columns A-G (green cell color) on the sheet are filled with descriptive and administrative metadata describing each item. This metadata is auto-populated from other sheets in the workbook. Columns H-W (yellow cell color) are the reviewer’s working area, with questions to answer about each item reviewed. As mentioned earlier, the answers to the questions determine the actions that need to be taken before the recording is ready to go online, and place the GUID into the appropriate workflow trackers.

The answers to some questions on the sheet affect whether others need to be answered, and cells auto-populate with “N/A” when one answer precludes another. Almost all the answers require controlled values, and the cells will not accept input besides those values. If any of the cells are left blank (besides questions #14 and #15), the review will not register as completed on the “Master GUID Status Sheet”. I have automated and applied value control to as much of the data entry in the workbook as possible, because doing so helps mitigate human error. The controlled values also facilitate workbook automation, because we’ve programmed different actions to trigger when specific expected text strings appear in cells. For instance, the answer to “Is there a transcript for this video?” must be “Yes” or “No”, and those are the only inputs the cell will accept. A “No” answer places the GUID on the “No Transcript tracker”, and a “Yes” does not.

To review an item, staff open the GUID on an access hard drive. We have multiple access drives that contain copies of the proxy files for all delivered NewsHour digitizations. Reviewers are expected to watch one and a half to three minutes of the beginning, middle, and end of a recording, and to check for errors while fast-forwarding through everything not watched. The questions reviewers answer are:

  1. Is this video a nightly broadcast episode?
  2. If an episode, is the recording complete?
  3. If incomplete, describe the incompleteness.
  4. Is the date we have recorded in the metadata correct?
  5. If not, what is the corrected date?
  6. Has the date been updated in our metadata repository, the Archival Management System?
  7. Is the audio and video as expected, based on the digitization vendor’s transfer notes?
  8. If not, what is wrong with the audio or video?
  9. Is there a transcript for this video?
  10. If yes, what is the transcript’s filename?
  11. Does the video content completely match the transcript?
  12. If no, in what ways and where doesn’t the transcript match?
  13. Does the closed caption file match completely (if one exists)?
  14. Should this video be part of a promotional exhibit?
  15. Any notes to project manager?
  16. Date the review is completed.
  17. Initials of the reviewer.

Our internal documentation has specific guidelines on how to answer each of these questions, but I will spare you those details! If you’re conducting quality control and description of media at your institution, these questions are probably familiar to you. After a bit of practice reviewers become adept at locating transcripts, reviewing content, and answering the questions. Each asset takes about ten minutes to review if the transcript matches, the content is the expected recording, and the digitization is error free. If any of those criteria are not true, the review will take longer. The review is laborious, but an essential step to make the records available.

Online Workflow

A large majority of recordings are immediately ready to go online following the asset review. These ready GUIDs are automatically placed into the “AAPB Online Status tracker,” where we track the workflow to generate metadata from the transcript and upload that and the recording to the AAPB.

About once a month I use the “AAPB Online Status tracker” to generate a list of GUIDs and corresponding transcripts and closed caption files that are ready to go online. To do this, all I have to do is filter for GUIDs in the “AAPB Online Status tracker” that have the workflow status “Incomplete” and copy the relevant data for those GUIDs out of the tracker and into a text file. I import this list into a FileMaker tool we call “NH-DAVE” that our Systems Analyst constructed for the project.

A screenshot of our FileMaker tool “NH-DAVE”

“NH-DAVE” is a relational database containing all of the metadata that was originally encoded within the NewsHour transcripts. The episode transcripts provided by NewsHour contained the names of individuals appearing and subject terms for each episode as marked-up values. Their subject terms were much more specific than ours, so we mapped them to the broader AAPB controlled vocabulary we use to facilitate search and discovery on our website. When I ingest a list of GUIDs and transcripts into “NH-DAVE” and click a few buttons, it uses an AppleScript to match metadata from the transcripts to the corresponding NewsHour metadata records in our Archival Management System and generate SQL statements. We use the statements to insert the contributor and subject metadata from the transcripts into the GUIDs’ AAPB metadata records in the Archival Management System.

Once the transcript metadata has been ingested, we use a Bash script and a Ruby script to upload the proxy recordings to our streaming service, Sony Ci, and the transcripts and closed caption SRT files to our web platform on Amazon. We run a Bash script to generate another set of SQL statements to add the Sony Ci URLs and some preservation metadata (generated during the digital preservation phase) to our Archival Management System. We then export the GUIDs’ Archival Management System records into PBCore XML and ingest the XML into the AAPB’s website. As each step of this process is completed, we document it in the “Online Workflow tracker,” which will eventually register that work on the GUID is completed. When the PBCore ingest is completed and documented on the “Online Workflow tracker,” the recording and transcript are immediately accessible online and the record displays as complete on the “Master GUID Status spreadsheet”!
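I won’t reproduce our actual scripts here, but as a rough illustration of the upload step, a Bash sketch might look something like the following. The bucket name, local folder layout, and the use of the AWS CLI are all hypothetical stand-ins, and our real scripts also handle the Sony Ci upload and error checking:

    # Hypothetical sketch only: push transcripts and SRT captions to web storage.
    # Bucket name and directory layout are invented for illustration.
    aws s3 cp ./transcripts/ s3://example-aapb-media/newshour/transcripts/ \
        --recursive --exclude "*" --include "*.txt"
    aws s3 cp ./captions/ s3://example-aapb-media/newshour/captions/ \
        --recursive --exclude "*" --include "*.srt"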

We consider a record that has an accurate full text transcript, contributor names, and subject terms to be sufficiently described for discovery functions on the AAPB. The transcript and terms will be fully indexed to facilitate searching and browsing. When a transcript matches, our descriptive process for NewsHour is fully automated. This is because we’re able to utilize the NewsHour’s legacy data. Without that data, the descriptive work required for this collection would be tremendous.

A large majority of NewsHour records follow the workflow I’ve described in this post in their journey to the AAPB. If, unlike those covered here, a record is not an episode, does not have a matching transcript, needs to be redacted, or has technical errors, then it requires more work than I have outlined. Look forward to blog posts about those records in the future! Click here to see a NewsHour record that went through this workflow. If you’re interested in our workflow, I encourage you to open the workbook and use “Find” to follow this GUID (“cpb-aacip-507-0r9m32nr3f”) through the various trackers. Click here to see all NewsHour records that have been put online!

AAPB NDSR Resources Roundup

In 2015, the Institute of Museum and Library Services awarded a generous grant to WGBH on behalf of the American Archive of Public Broadcasting (AAPB) to develop the AAPB National Digital Stewardship Residency (NDSR). Through the grant, we placed residents at public media organizations around the country to complete digital stewardship projects.

After a fantastic final presentation at the Society of American Archivists meeting in Portland last month, the 2016-2017 AAPB NDSR residencies have now officially drawn to a close. We wanted to share with you a complete list of the resources generated throughout the residencies, including instructional webinars, blog posts, and resources created for stations over the course of the NDSR projects.

Resources

Audiorecorder (Open-Source Audio Digitization Tool)

CUNY TV Mediamicroservices Documentation

KBOO 2-Page Recommendation Summary

KBOO Digital Preservation Policy

KBOO Current Digital Storage and Archiving Practices

KBOO Diagram for Current Digital Program Production Practices

PBCore-Based Data Model for KBOO Analog Audio Assets

Workflow for Open-Reel Preservation at KBOO

KBOO Digital Audio Guidelines and Procedures

Recommended Next Steps for Developing an Integrated Searchable Database of Born-Digital and Analog Audio at KBOO

Louisiana Public Broadcasting Digital Preservation Plan

WHUT Naming Conventions for Local Programming

Wisconsin Public Television Microsoft Access Database to PBCore Crosswalk

Wisconsin Public Television AMS Workflows Documentation

Wisconsin Public Television Digitization Workflows Chart

Wisconsin Public Television Proposal for New Metadata Database

Resident Webinars

“Challenges of Removable Media in Digital Preservation,” by Eddy Colloton (slides)

“Demystifying FFmpeg/FFprobe,” by Andrew Weaver (slides)

“Intro to Data Manipulation with Python CSV,” by Adam Lott (slides)

“Through the Trapdoor: Metadata and Disambiguation in Fanfiction,” by Kate McManus (slides)

“ResourceSpace for Audiovisual Archiving,” by Selena Chau (slides) (Demo videos: 1, 2, 3, 4)

“Whats, Whys, and How Tos of Web Archiving,” by Lorena Ramírez-López (slides) (transcript)

Other Webinars

“Metadata: Storage, Modeling and Quality,” by Kara Van Malssen, Partner & Senior Consultant at AVPreserve (slides only)

“Public Media Production Workflows,” by Leah Weisse, WGBH Digital Archive Manager/Production Archival Compliance Manager (slides)

“Imposter Syndrome,” by Jen LaBarbera, Head Archivist at Lambda Archives of San Diego, and Dinah Handel, Mass Digitization Coordinator at the NYPL (slides)

“Preservation and Access: Digital Audio,” by Erica Titkemeyer, Project Director and AV Conservator at the Southern Folklife Collection (slides)

“Troubleshooting Digital Preservation,” by Shira Peltzman, Digital Archivist at UCLA Library (slides)

“Studs Terkel Radio Archive: Tips and Tricks for Sharing Great Audio,” by Grace Radkins, Digital Content Librarian at Studs Terkel Radio Library (slides)

“From Theory to Action: Digital Preservation Tools and Strategies,” by Danielle Spalenka, Project Director of the Digital POWRR Project (slides)

Resident Blog Posts

“Digital Stewardship at KBOO Community Radio,” Selena Chau (8/9/16)

“Metadata Practices at Minnesota Public Radio,” Kate McManus (8/15/16)

“NDSA, data wrangling, and KBOO treasures,” Selena Chau (8/30/16)

“Minnesota Books and Authors,” Kate McManus (9/23/16)

“Snapshot from the IASA Conference: Thoughts on the 2nd Day,” Eddy Colloton (9/29/16)

“Who just md5deep-ed and redirected all them checksums to a .csv file? This gal,” Lorena Ramírez-López (10/6/16)

“IASA Day 1 and Voice to Text Recognition,” Selena Chau (10/11/16)

“IASA – Remixed,” Kate McManus (10/12/16)

“Learning GitHub (or, if I can do it, you can too!),” Andrew Weaver (10/13/16)

“Home Movie Day,” Eddy Colloton (10/15/16)

“Snakes in the Archive,” Adam Lott (10/20/16)

“Vietnam, Oral Histories, and the WYSO Archives Digital Humanities Symposium,” Tressa Graves (11/7/16)

“Archives in Conversation (A Glimpse into the Minnesota Archives Symposium, 2016),” Kate McManus (11/15/16)

“Inside the WHUT video library clean-up – part 1: SpaceSaver,” Lorena Ramírez-López (11/21/16)

“Is there something that does it all?: Choosing a metadata management system,” Selena Chau (11/22/16)

“Inside the WHUT video library clean-up – part 2: lots of manual labor,” Lorena Ramírez-López (12/20/16)

“Just Ask For Help Already!” Eddy Colloton (12/22/16)

“Playing with Pandas: CSV metadata transformations,” Selena Chau (1/4/17)

“MPR50,” Kate McManus (2/8/17)

“Before & after XML to PBCore in ResourceSpace,” Selena Chau (2/9/17)

“Advocating for Archives in a Production Environment,” Eddy Colloton (2/27/17)

“Louisiana Public Broadcasting Digital Preservation Plan,” Eddy Colloton (3/6/17)

“Moving Beyond the Allegory of the Lone Digital Archivist (& my day of Windows scripting at KBOO),” Selena Chau (3/16/17)

“Save the Data!” Kate McManus (3/16/17)

“Professional Development Time Project: Audiorecorder,” Andrew Weaver (3/27/17)

“Library Technology Conference,” Kate McManus (3/29/17)

“Reporting from PNW: Online Northwest Conference,” Selena Chau (4/13/17)

“Adventures in Perceptual Hashing,” Andrew Weaver (4/20/17)

“Trying New Things: Meditations on NDSR from the Symposium in DC,” Kate McManus (5/3/17)

Filmed Immersion Week Sessions

Why Archive Public Media

The History of Public Media and the AAPB

Mastering Project Management

Growing Your Professional Profile

Negotiating at Work

Think Like a Computer

Get To Know Your Audiovisual Media 

Many of these resources can also be found on the American Archive of Public Broadcasting Wiki, created by the residents for their collaborative final project.

Launching the American Archive of Public Broadcasting Wiki

The residency period of the American Archive of Public Broadcasting (AAPB) National Digital Stewardship Residency (NDSR) project has now ended, but we’re very proud to launch the final project created by our AAPB NDSR residents: The American Archive of Public Broadcasting Wiki, a technical preservation resource guide for public media organizations.

Selena Chau, Eddy Colloton, Adam Lott, Kate McManus, Lorena Ramírez-López, and Andrew Weaver have highlighted their collaboration and shared their resources, workflows, and documents used for managing audiovisual assets in all their possible formats and environments. The resulting Wiki encompasses everything from the first stages of the planning process to exit strategies from a storage or database solution.

AAPB staff and the residents hope that this Wiki will be an evolving resource. Editing capabilities will be locked on the Wiki for one week following launch, to allow time for the creation of a web archive of the resource in its original form that the residents may use in their portfolios; after this period, we will open up account creation to the audiovisual archiving and public broadcasting communities. We welcome your participation and contributions!