AAPB Transcription Workflow, Part 1

The AAPB started creating transcripts as part of our “Improving Access to Time-Based Media through Crowdsourcing and Machine-Learning” grant from the Institute of Museum and Library Services (IMLS). For the initial 40,000 hours of the AAPB’s collection, we worked with Pop Up Archive to create machine-generated transcripts, which are primarily used for keyword indexing, to help users find otherwise under-described content. These transcripts are also being corrected through our crowdsourcing platforms FIX IT and FIX IT+.

As the AAPB continues to grow its collection, we have added transcript creation to our standard acquisitions workflow. Now, when the first steps of acquisition are done (that is, metadata has been mapped and all of the files have been verified and ingested), the media is passed into the transcription pipeline. The proxy media files are either copied directly off the original drive or pulled down from Sony Ci, the cloud-based storage system that serves americanarchive.org’s video and audio files. These are copied into a folder on the WGBH Archives’ server, and then they wait for an available computer running transcription software.

Dockerized Kaldi

The AAPB uses the Docker image of Pop Up Archive’s Kaldi, running on many machines across WGBH’s Media Library and Archives. Rather than paying additional money to run this in the cloud or on a supercomputer, we decided to take advantage of the resources we already had sitting in our department. AAPB and Archives staff at WGBH who regularly leave their computers in the office overnight are good candidates for being part of the transcription team. All they have to do is follow instructions on the internal wiki to install Docker and a simple Macintosh application, built in-house, that runs scripts in the background and reports progress to the user. The application manages launching Docker, pulling the Kaldi image (or checking that you already have it pulled), and launching the image. The user doesn’t need any specific knowledge about how Docker images work to run the application. The app gets minimized to the dock and continues to run in the background as the staff member goes about their work during the day.* But that’s not all! When they leave for the night and their computer typically wouldn’t be doing anything, it continues to transcribe media files, making use of processing power that we were already paying for but hadn’t been utilizing.

*There have been reports of systems being perceptibly slower when running this Docker image throughout the day. It has yet to have a significant impact on any staff member’s ability to do their job.
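The launch sequence the application performs (check for the image, pull it if needed, run it) can be sketched roughly as follows. This is only an illustration, not the in-house application's code: the image name and the `/media` mount point are placeholders we made up, and the `run` parameter exists only so the sequence can be exercised without Docker installed.

```python
import subprocess

# Hypothetical image name; substitute the Kaldi image your team actually uses.
KALDI_IMAGE = "example/kaldi-transcriber:latest"


def docker_commands(image, media_dir):
    """Return the docker CLI calls for the check, pull, and launch steps."""
    return {
        "check": ["docker", "image", "inspect", image],
        "pull": ["docker", "pull", image],
        # Mount the local working folder into the container (mount path is assumed).
        "run": ["docker", "run", "--rm", "-v", f"{media_dir}:/media", image],
    }


def ensure_image_and_run(image, media_dir, run=subprocess.run):
    """Pull the image only if it is missing locally, then launch it."""
    cmds = docker_commands(image, media_dir)
    executed = []
    # `docker image inspect` exits non-zero when the image is not present locally.
    if run(cmds["check"], capture_output=True).returncode != 0:
        run(cmds["pull"], check=True)
        executed.append("pull")
    run(cmds["run"], check=True)
    executed.append("run")
    return executed
```

Calling `ensure_image_and_run(KALDI_IMAGE, "/path/to/work")` would skip the pull whenever the image is already on the machine, which mirrors the "or checking that you already have it pulled" step.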

Square application window that shows list of transcripts that have been processed
Application user-interface

Centralized Solution

Now, we could just have multiple machines running Kaldi through Docker and that would let us create a lot of transcripts. However, it would be cumbersome and time-consuming to split the files into batches, manage starting a different batch on each computer, and collect the disparate output files from various machines at the end of the process. So we developed a centralized way of handling the input and output of each instance of Kaldi running on a separate machine.

That same Macintosh application that manages running the Kaldi Docker image also manages files in a network-shared folder on the Archives server. When a user launches the application, it checks that specific folder on the server for media files. If there are any media files in that folder, it takes the oldest file, copies it locally, and starts transcribing it. When Kaldi has finished transcribing it, the output text and JSON-formatted transcripts are copied to a subfolder on the Archives server, and the copy of the media file is deleted. Then the application checks the folder again, picks up the next media file, and the process continues.
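As a rough sketch of that loop, here is one pass in Python. The function names, the list of media extensions, and the use of modification time to decide which file is "oldest" are all assumptions for illustration; `transcribe` is a stand-in for the Kaldi run and is expected to return the paths of the output files.

```python
import shutil
from pathlib import Path

MEDIA_SUFFIXES = {".mp4", ".mp3", ".wav"}  # assumed set of proxy formats


def oldest_media_file(queue_dir):
    """Return the media file with the earliest modification time, or None."""
    candidates = [
        p for p in Path(queue_dir).iterdir()
        if p.is_file() and p.suffix.lower() in MEDIA_SUFFIXES
    ]
    return min(candidates, key=lambda p: p.stat().st_mtime, default=None)


def process_next(queue_dir, work_dir, out_dir, transcribe):
    """Handle one file: copy locally, transcribe, return outputs, clean up."""
    source = oldest_media_file(queue_dir)
    if source is None:
        return None  # nothing waiting in the shared folder
    local_copy = Path(shutil.copy2(source, work_dir))
    # `transcribe` stands in for the Kaldi run; it returns the output file paths.
    for output in transcribe(local_copy):
        shutil.copy2(output, out_dir)
    local_copy.unlink()  # drop the local working copy
    source.unlink()      # remove the server copy so it is not picked up again
    return source.name
```

Running `process_next` in a loop until it returns `None` gives the "check the folder again, pick up the next file" behavior. This sketch leaves out the renaming step that keeps two machines from taking the same file.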

Screenshot of a file directory with many .mp4 files, a few folders, and a few files named with base64 encoded strings
Files on the Archives server: the files at the top are waiting to be processed, the files near the bottom are the ones being processed by local machines

Avoiding Duplicate Effort

Now, since we have multiple computers running in parallel, all looking at the same folder on the server, how do we make sure that multiple computers aren’t duplicating efforts by transcribing the same file? Well, the process first tries to rename the file to be processed, using the person’s name and a base-64 encoding of the original filename. If the renaming succeeds, the file is copied into the Docker container for local processing, and the process on every other workstation will ignore files named that way in their quest to pick up the oldest qualifying file. After a file is successfully processed by Kaldi, it is then deleted, so no one else can pick it up. When Kaldi fails on a file, the file on the server is renamed to its original file name with “_failed” appended, and again the scripts know to ignore the file. A human can later go in to see if any files have failed and investigate why. (It is rare for Kaldi to fail on an AAPB media file, so this is not part of the workflow we felt we needed to automate further.)
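A minimal sketch of that claim-by-rename idea is below. The exact filename layout is not described above, so the `__` separator, the URL-safe base-64 variant (chosen here because standard base-64 can produce `/`, which is not valid in a filename), and appending “_failed” after the extension are all assumptions.

```python
import base64
import os
from pathlib import Path

CLAIM_SEPARATOR = "__"  # hypothetical delimiter between name and encoded part


def claimed_name(user, original_name):
    """Build a claim filename from the user's name and the encoded original."""
    encoded = base64.urlsafe_b64encode(original_name.encode("utf-8")).decode("ascii")
    return f"{user}{CLAIM_SEPARATOR}{encoded}"


def try_claim(path, user):
    """Rename the file to claim it; return the new path, or None if we lost the race."""
    path = Path(path)
    target = path.with_name(claimed_name(user, path.name))
    try:
        os.rename(path, target)
    except FileNotFoundError:
        # Another workstation renamed (or finished) the file first.
        return None
    return target


def mark_failed(claimed_path, original_name):
    """Restore the original name with "_failed" appended so scripts skip it."""
    claimed_path = Path(claimed_path)
    failed = claimed_path.with_name(original_name + "_failed")
    os.rename(claimed_path, failed)
    return failed
```

The rename doubles as the lock: only one workstation's rename of a given file can succeed, and the others see the file as gone. How strictly that holds on a network share depends on the file server, so treat it as a working assumption rather than a guarantee.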

Handling Computer and Human Errors

The centralized workflow relies on the idea that the application is not quitting in the middle of a transcription. If someone shuts their laptop, the application will stop, but when they open it again, the application will pick up right where it left off. It will even continue transcribing the current file if the computer is not connected to the WGBH network, because it maintains a local copy of the file that is processing. This allows a little flexibility in terms of staff taking their computers home or to conferences.

The problem starts when the application quits, which could occur when someone quits it intentionally, someone accidentally hits the quit button rather than the minimize button, someone shuts down or restarts their computer, or a computer fails and shuts itself down automatically. We have built the application to minimize the effects of this problem. When the application is restarted, it will just pick up the next available file and keep going as if nothing happened. The only reason this is a problem at all is that the file it was in the middle of working on is still sitting on the Archives server, renamed, so another computer will not pick it up.

We consider the few downsides of this setup completely manageable:

  • At regular intervals a human must look into the folder on the server to check that a file hasn’t been sitting renamed for a long time. These are easy to spot because there will be two renamed files with the same person’s name. The older of these two files is the one that was started and never finished. The filename can be changed to its original name by decoding the base-64 string. Once the name is changed, another computer will pick up the file and start transcribing.
  • Because the file stopped being transcribed in the middle of the process, the processing time spent on that interrupted transcription is wasted. The next computer to start transcribing this file will start again at the beginning of the process.
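The manual check in the first bullet could be scripted along these lines. This is a sketch under the same hypothetical naming used for illustration here (`user__<url-safe base-64 of the original name>`), and it relies on the observation above that when one person has two renamed files, the older one is the abandoned one.

```python
import base64
from collections import defaultdict
from pathlib import Path

CLAIM_SEPARATOR = "__"  # hypothetical delimiter between name and encoded part


def original_name(claimed_filename):
    """Decode the base-64 part of a claim filename back to the original name."""
    _user, encoded = claimed_filename.split(CLAIM_SEPARATOR, 1)
    return base64.urlsafe_b64decode(encoded.encode("ascii")).decode("utf-8")


def find_abandoned(queue_dir):
    """Return every claimed file except the newest one for each person."""
    by_user = defaultdict(list)
    for path in Path(queue_dir).iterdir():
        if path.is_file() and CLAIM_SEPARATOR in path.name:
            user = path.name.split(CLAIM_SEPARATOR, 1)[0]
            by_user[user].append(path)
    abandoned = []
    for paths in by_user.values():
        paths.sort(key=lambda p: p.stat().st_mtime)
        abandoned.extend(paths[:-1])  # all but the newest claim are stale
    return abandoned


def release(path):
    """Rename an abandoned claim back to its original filename."""
    restored = path.with_name(original_name(path.name))
    path.rename(restored)
    return restored
```

A person would still want to review the list from `find_abandoned` before calling `release`, since an original filename that happens to contain the separator would be misread as a claim.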

Managing Prioritization

Because the AAPB has a busy acquisitions workflow, we wanted to make sure there was a way to manage prioritization of the media getting transcribed. Prioritization can be determined by many variables, including project timelines, user interest, and grant deadlines. Rather than spending a lot of time building a system that lets us track each file’s prioritization ranking, we opted for a simpler, more manual operation. While it does require human intervention, the time commitment is minimal.

As described above, the local desktop applications only look in one folder on the Archives server. By controlling what is copied into that folder, it is easy to control what files get transcribed next. The default is for a computer to pick up the oldest file in the folder. If you have a set of more recent files that you want transcribed before the rest of the files, all you have to do is remove any older files from that folder. You can easily put them in another folder, so that when the prioritized files are completed, it’s easy to move the rest of the files into the main folder.

For smaller sets of files that need to be transcribed, we can also have someone who is not running the application stand up an instance of dockerized Kaldi and run the media through it locally. Their machine won’t be tied into the folder on the server, so they will only process those prioritized files they feed Kaldi locally.

Transforming the Output

At any point we can go to the Archives server and grab the transcripts that have been created so far. These transcripts are output as text files and as JSON files which pair time-stamp data with each word. However, the AAPB prefers JSON transcripts that are time-stamped at each 5-7 second phrase.

We use a script that parses the word-stamped JSON files and outputs phrase-stamped JSON files.

Word time-stamped JSON

Screenshot from a text editor showing a json document with wrapping json object called words with sub-objects with keys for word, time, and duration
Snippet of Kaldi output as JSON transcript with timestamps for each word

Phrase time-stamped JSON

Screenshot from a text editor of JSON with a container object called parts and sub-objects with keys text, start time, and end time.
Snippet of transformed JSON transcript with timestamps for 5-7 second phrases
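To give a sense of what such a transformation involves, here is a minimal sketch rather than the script we actually run. It assumes the word-level file looks like `{"words": [{"word": ..., "time": ..., "duration": ...}]}` with times in seconds, and it writes `{"parts": [{"text": ..., "start_time": ..., "end_time": ...}]}`; the exact key spellings and the 6-second cutoff are assumptions.

```python
import json

MAX_PHRASE_SECONDS = 6.0  # assumed target within the 5-7 second range


def _make_part(entries):
    """Collapse a run of word entries into one phrase entry."""
    first, last = entries[0], entries[-1]
    return {
        "text": " ".join(e["word"] for e in entries),
        "start_time": float(first["time"]),
        "end_time": float(last["time"]) + float(last["duration"]),
    }


def words_to_phrases(word_doc, max_seconds=MAX_PHRASE_SECONDS):
    """Group word-stamped entries into phrases of roughly max_seconds each."""
    parts = []
    current = []
    for entry in word_doc["words"]:
        start = float(entry["time"])
        # Start a new phrase once this word begins max_seconds after the phrase did.
        if current and start - float(current[0]["time"]) >= max_seconds:
            parts.append(_make_part(current))
            current = []
        current.append(entry)
    if current:
        parts.append(_make_part(current))
    return {"parts": parts}


def convert_file(in_path, out_path):
    """Read a word-stamped JSON file and write the phrase-stamped version."""
    with open(in_path, encoding="utf-8") as src:
        phrases = words_to_phrases(json.load(src))
    with open(out_path, "w", encoding="utf-8") as dst:
        json.dump(phrases, dst, indent=2)
```

This version splits purely on elapsed time. A fuller script might also prefer to break at pauses between words so that phrases do not end mid-thought.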

Once we have the transcripts in the preferred AAPB format, we can use them to make our collections more discoverable and share them with our users. More on that part of the workflow in Part 2 (coming soon!).

Rebecca Benson, Public Broadcasting Preservation Fellow at KOPN

My name is Rebecca Benson, and I’m a graduate student at the University of Missouri, working on a Master’s in Library Science and focusing on work in special collections libraries. I am so excited for the experience I have gained working with the AAPB: I am familiar with much older materials, but the history of the past 100 years really demands broadcast media to be fully understood. The opportunity to work with AAPB and the materials from our local community radio station has expanded my archival horizons, and I look forward to sharing these materials and this history with researchers, as well as sharing this technology with other archivists.

The University of Missouri partnered with one of the local community radio stations to work on this project. KOPN has been broadcasting from the same office in downtown Columbia since it was founded in 1973, and I’m pretty sure some of the reels I digitized had not been touched since then. As one of the first open-access community radio stations, they have an amazing perspective on the history of the past several decades. The collection spans an incredible number of areas, from radio theatre to concerts to talk shows, from feminist, queer, indigenous, and otherwise marginalized voices. Working with Jackie Casteel, we decided to begin by digitizing the women’s programming, from the annual Women’s Weekend, the League of Women Voters, and the local Women’s Health collective, among others. Even within this subset, the programming ranges from interview shows with women in prison to a discussion from one of the first female dentists in the area. Every time I start a new reel, I learn something new and interesting about Columbia or the world, and I cannot wait for others to use this trove of information to begin doing research. I have benefited from the information myself: by chance, I digitized the 1986 League of Women Voters panel on hospital trustees a week before another hospital trustee election in town, which dealt with the hospital lease discussed in 1986!

As I have worked with these materials, I have found that this sort of archival work can re-unite communities and bring people together. Not only have I worked with the university and our initial contacts at the station, I have encountered numerous other people who are, or were, connected with programming that I have now heard. Working on the metadata for our programs led me to the State Historical Society, and their archives of broadcast lists. My time sorting reels at the station led to meeting with a woman who had run much of the radio theatre programming for decades. A chance mention of KOPN led to learning more about the alternative ‘zine community in Columbia, and its connection with the radio station. This project has shown me all the ways in which archival projects are more than just scholarly work, but a way to build and re-build communities.

Getting all of these reels digitized has been — and continues to be — a massive project. As a community radio station, KOPN did not have the most standardized procedures for recording, broadcasting, and documentation, which has led to some interesting moments at the work station. I’m still uncertain how someone managed to splice one tape inside out and backwards! On the other hand, all of these quirks are a result of the creative community that grew around KOPN, and without it, the history of the station would be much poorer. We are so excited to share this vibrant part of our local history with the world.

Written by Rebecca Benson, PBPF Spring 2018 Cohort

*******************

About PBPF

The Public Broadcasting Preservation Fellowship (PBPF), funded by the Institute of Museum and Library Services, supports ten graduate student fellows at University of North Carolina, San Jose State University, Clayton State University, University of Missouri, and University of Oklahoma in digitizing at-risk materials at public media organizations around the country. Host sites include the Center for Asian American Media, Georgia Public Broadcasting, WUNC, the Oklahoma Educational Television Authority, and KOPN Community Radio. Contents digitized by the fellows will be preserved in the American Archive of Public Broadcasting. The grant also supports participating universities in developing long-term programs around audiovisual preservation and ongoing partnerships with their local public media stations.

For more updates on the Public Broadcasting Preservation Fellowship project, follow the project at pbpf.americanarchive.org and on Twitter at #aapbpf, and come back in a few months to check out the results of their work.

The National Association of Educational Broadcasters (NAEB) Collection Now Available on AAPB


The National Association of Educational Broadcasters (NAEB) Collection, now available on the AAPB website, consists of more than 5,500 radio programs from the 1950s and 1960s, created by over 100 NAEB member stations. The collection includes radio documentaries, coverage of events (hearings, meetings, conferences, and seminars), interviews, debates, and lectures on public affairs topics such as civil rights, foreign affairs, health, politics, education, and broadcasting.

These broadcasts, mostly stemming from university and public school-run radio stations, provide an in-depth look at the engagements and events of American history, as they were broadcast to and received by the general public in the twentieth century. Interview subjects and/or program participants feature a “who’s who” of mid-20th century public figures, including Hubert Humphrey, Betty Shabazz, Robert Frost, Frank Lloyd Wright, Alistair Cooke, Dr. Benjamin Spock, Margaret Mead, Studs Terkel, Dr. Albert Schweitzer, Marshall McLuhan, and Aldous Huxley. The collection also contains a notably large percentage of local content and voices, from a WDET Detroit series about local civil defense plans and policies called “Prepare for Survival,” to a series entitled “Document: Deep South,” a documentary series produced by WOUA at the University of Alabama depicting the increasing importance of the South in the economic development of the United States, to a show entitled “Search for Mental Health,” a series of talks about advances in psychiatry from the University of Chicago.

The NAEB was established in 1934 from a precursor organization, the Association of College and University Broadcasting Stations, that formed in 1925. The mission of the NAEB was to use communications technology for education and social purposes. It was an extremely successful and effective trade organization that, throughout its 60 years of existence, ushered in or helped to enable major changes in early educational broadcasting policy. In 1951, NAEB established a tape duplication exchange system in Urbana, IL, where programs produced by university radio stations across the country were copied and distributed to member stations, an early networking scheme that influenced the history of later public radio and television systems. The forerunner of CPB and its arms, NPR and PBS, the NAEB served as the primary organizer, developer, and distributor for noncommercial broadcast production and analysis between 1925 and 1981.

The NAEB Collection was contributed to the AAPB by the University of Maryland’s National Public Broadcasting Archives. The paper records of the NAEB are housed at University of Maryland and additional related materials are located at the Wisconsin Historical Society.

Access the collection here: http://americanarchive.org/special_collections/naeb

Special thanks to Stephanie Sapienza for her contributions to the curation of this collection.

Upcoming AAPB Webinar Featuring Kathryn Gronsbell, Digital Collections Manager at Carnegie Hall


Photo courtesy of Rebecca Benson, @jeybecques, PBPF Fellow at University of Missouri.

This Thursday, March 15th at 8 pm EST, American Archive of Public Broadcasting (AAPB) staff will host a webinar with Kathryn Gronsbell, Digital Collections Manager at Carnegie Hall. The webinar will cover topics in documentation, including why documentation is important, what to think about when recording workflows for future practitioners, and where to find examples of good documentation in the wild.

The public is welcome to join for the first half hour. The last half hour will be limited to Q&A with our Public Broadcasting Preservation Fellows, who have now begun to inventory their digitized public broadcasting collections to be preserved in the AAPB.

Webinar URL: http://wgbh1.adobeconnect.com/documentation/

For anyone who missed the last webinar on tools for Quality Control, it’s now also available for viewing through this link: http://wgbh1.adobeconnect.com/psv1042lp222/.

*******************************

For more updates on the Public Broadcasting Preservation Fellowship project, follow the project at pbpf.americanarchive.org and on Twitter at #aapbpf, and come back in a few months to check out the results of their work: digitized content preserved in the American Archive of Public Broadcasting from our collaborating host organizations WUNC, KOPN, Oklahoma Educational Television Authority, Georgia Public Broadcasting, and the Center for Asian American Media, as well as documentation created to support ongoing audio and video preservation education at the University of Missouri, University of Oklahoma, Clayton State University, University of North Carolina at Chapel Hill, and San Jose State University.

Upcoming Webinar: Building AAPB Participation into Digitization Grant Proposals


Building AAPB Participation into Digitization Grant Proposals: Requirements, Recommendations and Workflows

Tuesday, December 12, 2017
12:00pm ET

Webinar Registration form: https://goo.gl/forms/lWWU5GgFkv09bNFi2
Direct meeting URL: http://wgbh1.adobeconnect.com/aapb_grant-proposals-1/

Curious about getting involved in the American Archive of Public Broadcasting (AAPB)?

Seeking information about the workflows and requirements for contributing digitized content and/or metadata to the AAPB?

Writing a grant proposal and want to explore collaborating with the AAPB to preserve copies of your digitized collections and/or provide an access point to your collections through the AAPB metadata portal?

Then this webinar is for you!

On Tuesday, December 12, 2017 at 12:00pm ET, the AAPB will host a webinar focused on grant writing for digitization and subsequent contribution of digital files and metadata to the AAPB.

By the end of this webinar, participants will gain an understanding of:

  • AAPB’s background and infrastructure,
  • how contributing to the AAPB could benefit your collection,
  • steps to becoming an AAPB contributor,
  • metadata and digital file format requirements and recommendations,
  • delivery procedures,
  • other workflows and considerations for contributing digital files and/or metadata to the AAPB, and
  • the value of your collection as part of a national collection and how to express that in a proposal.

Attendees will also receive advice on how to incorporate AAPB contribution into their CLIR Recordings at Risk (applications due February 9, 2018!), CLIR Digitizing Hidden Collections, or other grant proposal timelines and work plans.

Fill out this brief form to receive info about future webinars and to receive a webinar meeting invitation sent to your calendar: https://goo.gl/forms/lWWU5GgFkv09bNFi2

Anyone can join the webinar at this URL: http://wgbh1.adobeconnect.com/aapb_grant-proposals-1/

This webinar and future AAPB webinars are generously funded by The Andrew W. Mellon Foundation.

The American Archive of Public Broadcasting (AAPB) is a collaboration between the Library of Congress and the WGBH Educational Foundation to coordinate a national effort to preserve at-risk public media before its content is lost to posterity and provide a central web portal for access to the unique programming that public stations have aired over the past 60 years. To date, over 50,000 hours of television and radio programming contributed by more than 100 public media organizations and archives across the United States have been digitized for long-term preservation and access. The entire collection is available on location at the Library of Congress and WGBH, and almost 25,000 programs are available online at americanarchive.org.

Report of the AAPB Rights Meeting

Last month, staff from the Library of Congress Motion Picture, Broadcasting and Recorded Sound Division (MBRS) and Office of the General Counsel (OGC) met in Boston with WGBH Media Library and Archives staff, counsel from WGBH Business and Legal Affairs, and representatives from the Cyberlaw Clinic and Fellows community at Harvard University’s Berkman Center for Internet & Society. The two-day brainstorming session focused on strategy for rights clearance for the American Archive of Public Broadcasting (AAPB). The AAPB Project Team anticipates that the outcomes of the meeting can serve as a model of how digital audiovisual archival rights can be managed.

Planning is at a very early stage, and will evolve based upon both technological and legal constraints.  The early sketch is that AAPB would employ several interlocking layers of rights clearance: obtaining permission from originating stations and rights holders; identifying public domain materials; and using copyright law exemptions including fair use, the library and archive exemptions, and existing provisions unique to public television and to the Library.

The preliminary access model is that there would be three basic levels of access to the American Archive.  First would be the open web, which would include public domain materials and materials for which the Archive (through WGBH and the Library) has obtained full permission.  Some materials at this level would be downloadable; most would be streamed.  The metadata for the entire AAPB would be in this level.

The second level would be an online virtual reading room, restricted to educational and scholarly uses.  Users would be required to register on the AAPB website, and would be presented with terms and conditions, including the use restriction and the requirement that the user comply with copyright and other legal restrictions.  This level would include materials that are permissioned for this reduced access.  It would also include materials that the legal team has determined may prudently be presented for educational and scholarly purposes under fair use and other legal doctrines.  For example, many historic news broadcasts may fall into this category. Materials on this level would be streaming only.

A third level would be materials that would be available only on Library and WGBH premises.  This is the most restricted level, and materials would likely migrate to less restricted levels as they are analyzed and as permissions are obtained.

The AAPB Project Team is excited to begin implementing this model through rights clearances and developing the technological infrastructure over the next several months. We will continue to provide updates as the work moves forward.

PBS Annual Meeting Presentation & Takeaways

The American Archive team from WGBH presented at the PBS Annual Meeting in San Francisco. We had the wonderful opportunity to meet many of our station collaborators in person and gather tremendously useful feedback from participants. Many thanks to all of those who attended the session and reception, as well as those who took the time to meet with us at other moments during the conference. Additionally, we are sincerely grateful to our co-presenters, Sandy Schonning from KQED and Laura Sampson from Rocky Mountain PBS’ Stations Archived Memories program.

Below we’ve provided our Annual Meeting slideshow, divided into three sections: 1) history and progress of the American Archive, 2) stories from stations, and 3) discussion. During the discussion section, we asked a series of questions, and in this version of the presentation you will find a summary of the answers. If your organization is participating in the American Archive, please feel free to comment on this post with your answers to these questions (or questions about these questions!).

Feel free to email any of our session presenters:

Karen Cariani, Director
WGBH Media Library & Archives
karen_cariani [at] wgbh [dot] org

Casey E. Davis, Project Manager, American Archive
WGBH Media Library & Archives
casey_davis [at] wgbh [dot] org

Laura Sampson, Rocky Mountain PBS
Stations Archived Memories
laurasampson [at] me [dot] com

Sandy Schonning, KQED
sschonning [at] kqed [dot] org

♥ Happy Valentine’s Day from the American Archive and Chicago Public Media! ♥

Happy Valentine’s Day! Love is in the air today as we share with you a clip from the American Archive, contributed by Chicago Public Media (WBEZ), featuring Little Milton singing “I Want to Love You” at the Chicago Blues Festival in June of 1987.


“The Chicago Blues Festival has been a Chicago institution for over 30 years and has grown to hold the title of the largest free blues festival in the world. Held every summer in Chicago’s Grant Park, the festival has consistently featured blues legends alongside the future stars of the genre and, despite Chicago’s embarrassment of riches when it comes to blues artists, features performers from around the world. If they’ve sung the blues, chances are they’ve appeared at the festival,” says Chicago Public Media’s Director of Studio and Broadcast Operations Adam Yoffe. “WBEZ has been lucky enough to capture some of the earliest years of the festival to tape, and are excited to bring them to the archive in the coming months.”

Chicago Public Media’s music archives feature interviews and live performances with many of the most revered jazz and blues figures in the country and include hundreds of reels that date from the mid-1980s to the early ’90s, such as performances by jazz greats Etta James and Dizzy Gillespie and blues legends Lonnie Brooks and Koko Taylor.

This program we’re sharing today was originally recorded on 1/4″ audio tape and was digitized in the first 40,000 hours of the American Archive collection, which are now being preserved at the Library of Congress.

This American Life showcased in today’s Google Doodle

And while you’re in the Valentine’s Day spirit, you should check out today’s Google Doodle. WBEZ’s This American Life has collaborated with Google on today’s Doodle, featuring candy hearts and Valentine’s Day-themed stories produced by This American Life.

**Audio clip courtesy Chicago Public Media (WBEZ). All rights reserved.
Thanks to American Archive intern Bill Nehring for editing today’s clip.
This post was written by Casey E. Davis, Project Manager for the AAPB at WGBH