Earlier this year, the American Archive of Public Broadcasting (AAPB) launched FIX IT+, a crowdsourcing tool that invites the public to help improve the searchability of over 50,000 hours of historic public television and radio programming. Now, with the generous support of George Blood, a digitization service provider, the AAPB is launching the Transcribe to Digitize Challenge. The challenge is simple: George Blood will digitize another tape (for free!) for every transcript that is corrected.
AAPB has digitized and preserved more than 50,000 hours of public television and radio programming created by stations and producers across the United States. This unique historic material, created as early as the 1940s and often lacking closed captioning, represents our shared and diverse cultural heritage. Yet it is not highly discoverable to researchers, educators, students, lifelong learners, journalists and the public because it lacks robust descriptive information, or “metadata.”.
For the Transcribe to Digitize Challenge, a minimum of 20 transcripts must be corrected for a station to meet the challenge, and George Blood will then provide free digitization for 20 tapes selected by that station. Up to 100 transcripts can be corrected for 100 tapes to be digitized per station. The digitized materials will be delivered back to each station, and a copy will also go to the AAPB for long-term preservation at the Library of Congress and access through the AAPB website!
The AAPB started creating transcripts as part of our “Improving Access to Time-Based Media through Crowdsourcing and Machine-Learning” grant from the Institute of Museum and Library Services (IMLS). For the initial 40,000 hours of the AAPB’s collection, we worked with Pop Up Archive to create machine-generated transcripts, which are primarily used for keyword indexing, to help users find otherwise under-described content. These transcripts are also being corrected through our crowdsourcing platforms FIX IT and FIX IT+.
As the AAPB continues to grow its collection, we have added transcript creation to our standard acquisitions workflow. Now, when the first steps of acquisition are done, i.e., metadata has been mapped and all of the files have been verified and ingested, the media is passed in to the transcription pipeline. The proxy media files are either copied directly off the original drive or pulled down from Sony Ci, the cloud-based storage system that serves americanarchive.org’s video and audio files. These are copied into a folder on the WGBH Archives’ server, and then they wait for an available computer running transcription software.
The AAPB uses the docker image of PopUp Archive’s Kaldi running on many machines across WGBH’s Media Library and Archives. Rather than paying additional money to run this in the cloud or on a super computer, we decided to take advantage of the resources we already had sitting in our department. AAPB and Archives staff at WGBH that regularly leave their computers in the office overnight are good candidates for being part of the transcription team. All they have to do is follow instructions on the internal wiki to install Docker and a simple Macintosh application, built in-house, that runs scripts in the background and reports progress to the user. The application manages launching Docker, pulling the Kaldi image (or checking that you already have it pulled), and launching the image. The user doesn’t need any specific knowledge about how Docker images work to run the application. That app gets minimized on the dock and continues to run in the background as the staff members goes about their work during the day.* But that’s not all! When they leave for the night and their computer typically wouldn’t be doing anything, it continues to transcribe media files, making use of processing power that we were already paying for but hadn’t been utilizing.
*There have been reports of systems being perceptively slower when running this Docker image throughout the day. It has yet to have a significant impact on any staff member’s ability to do their job.
Now, we could just have multiple machines running Kaldi through Docker and that would let us create a lot of transcripts. However, it would be cumbersome and time-consuming to split the files into batches, manage starting a different batch on each computer, and collect the disparate output files from various machines at the end of the process. So we developed a centralized way of handling the input and output of each instance of Kaldi running on a separate machine.
That same Macintosh application that manages running the Kaldi Docker image also manages files in a network-shared folder on the Archives server. When a user launches the application, it checks that specific folder on the server for media files. If there are any media files in that folder, it takes the oldest file, copies it locally and starts transcribing it. When Kaldi has finished transcribing it, the output text and json formatted transcripts are copied to a subfolder on the Archives server, and the copy of the media file is deleted. Then the application checks the folder again, picks up the next media file, and the process continues.
Avoiding Duplicate Effort
Now, since we have multiple computers running in parallel, all looking at the same folder on the server, how do we make sure that multiple computers aren’t duplicating efforts by transcribing the same file? Well, the process first tries to rename the file to be processed, using the person’s name and a base-64 encoding of the original filename. If the renaming succeeds, the file is copied into the Docker container for local processing, and the process on every other workstation will ignore files named that way in their quest to pick up the oldest qualifying file. After a file is successfully processed by Kaldi, it is then deleted, so no one else can pick it up. When Kaldi fails on a file, then the file on the server is renamed to its original file name with “_failed” appended, and again the scripts know to ignore the file. A human can later go in to see if any files have failed and investigate why. (It is rare for Kaldi to fail on an AAPB media file, so this is not part of the workflow we felt we needed to automate further).
Handling Computer and Human Errors
The centralized workflow relies on the idea that the application is not quitting in the middle of a transcription. If someone shuts their laptop, the application will stop, but when they open it again, the application will pickup right where it left off. It will even continue transcribing the current file if the computer is not connected to the WGBH network, because it maintains a local copy of the file that is processing. This allows a little flexibility in terms of staff taking their computers home or to conferences.
The problem starts when the application quits, which could occur when someone quits it intentionally, someone accidentally hits the quit button rather than the minimize button, someone shuts down or restarts their computer, or a computer fails and shuts itself down automatically. We have built the application to minimize the effects of this problem. When the application is restarted it will just pick up the next available file and keep going as if nothing happened. The only reason this is a problem at all is because the file they were in the middle of working on is still sitting on the Archives server, renamed, so another computer will not pick it up.
We consider these few downsides to this set up completely manageable:
At regular intervals a human must look into the folder on the server to check that a file hasn’t been sitting renamed for a long time. These are easy to spot because there will be two renamed files with the same person’s name. The older of these two files is the one that was started and never finished. The filename can be changed to its original name by decoding the base-64 string. Once the name is changed, another computer will pick up the file and start transcribing.
Because the file stopped being transcribed in the middle of the process, the processing time spent on that interrupted transcription is wasted. The next computer to start transcribing this file will start again at the beginning of the process.
Because the AAPB has a busy acquisitions workflow, we wanted to make sure there was a way to manage prioritization of the media getting transcribed. Prioritization can be determined by many variables, including project timelines, user interest, and grant deadlines. Rather than spending a lot of time to build a system that let us track each file’s prioritization ranking, we opted for a simpler, more manual operation. While it does require human intervention, the time commitment is minimal.
As described above, the local desktop applications only look in one folder on the Archives server. By controlling what is copied into that folder, it is easy to control what files get transcribed next. The default is for a computer to pick up the oldest file in the folder. If you have a set of more recent files that you want transcribed before the rest of the files, all you have to do is remove any older files from that folder. You can easily put them in another folder, so that when the prioritized files are completed, it’s easy to move the rest of the files into the main folder.
For smaller sets of files that need to be transcribed, we can also have someone who is not running the application standup an instance of dockerized Kaldi and run the media through it locally. Their machine won’t be tied into the folder on the server, so they will only process those prioritized files they feed Kaldi locally.
Transforming the Output
At any point we can go to the Archives server and grab the transcripts that have been created so far. These transcripts are output as text files and as JSON files which pair time-stamp data with each word. However, the AAPB prefers JSON transcripts that are time-stamped at each 5-7 second phrase.
We use a script that parses the word-stamped JSON files and outputs phrase-stamped JSON files.
Word time-stamped JSON
Phrase time-stamped JSON
Once we have the transcripts in the preferred AAPB format, we can use them to make our collections more discoverable and share them with our users. More on the part of the workflow in Part 2 (coming soon!).
Today we’re launching ROLL THE CREDITS, a new Zooniverse project to engage the public in helping us catalog unseen content in the AAPB archive. Zooniverse is the “world’s largest and most popular platform for people-powered research.” Zooniverse volunteers (like you!) are helping the AAPB in classifying and transcribing the text from extracted frames of uncataloged public television programs, providing us with information we can plug directly into our catalog, closing the gap on our sparsely described collection of nearly 50,000 hours of television and radio.
Example frame from ROLL THE CREDITS
The American people have made a huge investment in public radio and television over many decades. The American Archive of Public Broadcasting (AAPB) works to ensure that this rich source for American political, social, and cultural history and creativity is saved and made available once again to future generations.
The improved catalog records will have verified titles, dates, credits, and copyright statements. With the updated, verified information we will be able to make informed decisions about the development of our archive, as well as provide access to corrected versions of transcripts available for anyone to search free of charge at americanarchive.org.
In conjunction with our speech-to-text transcripts from FIX IT, a game that asks users to correct and validate the transcripts one phrase at a time, ROLL THE CREDITS helps us fulfill our mission of preserving and making accessible historic content created by the public media, saving at-risk media before the contents are lost to prosperity.
Thanks for supporting AAPB’s mission! Know someone who might be interested? Feel free to share with the other transcribers and public media fans in your life!
At the AAPB “Crowdsourcing Anecdotes” meeting last Friday at the Association of Moving Image Archivists conference, I talked about a free “Dockerized” build of Kaldi made by Stephen McLaughlin, PHD student at UT Austin School of Information. I thought I would follow up on my introduction to it there by providing links to these resources, instructions for setting it up, and some anecdotes about using it. First, the best resource for this Docker Kaldi and Stephen’s work is here in the HiPSTAS Github: https://github.com/hipstas/kaldi-pop-up-archive. It also has detailed information for setting up and running the Docker Kaldi.
I confess that I don’t know much about computer programming and engineering besides what I need to get my work done. I am an archivist and I eagerly continue to gain more computer skills, but some of my terminology here might be kinda wrong or unclear. Anyways, Kaldi is a free speech-to-text tool that interprets audio recordings and outputs timestamped JSON and text files. This “Dockerized” Kaldi allows you to easily get a version of Kaldi running on pretty much any reasonably powerful computer. The recommended minimum is at least 6gb of RAM, and I’m not sure about the CPU. The more of both the better, I’m sure.
The Docker platform provides a framework to easily download and set up a computer environment in which Kaldi can run. Kaldi is pretty complicated, but Stephen’s Docker image (https://hub.docker.com/r/hipstas/kaldi-pop-up-archive) helps us all bypass setting up Kaldi. As a bonus, it comes set up with the language model that PopUp Archive created as part of our IMLS grant (link here) with them and HiPSTAS. They trained the model using AAPB recordings. Kaldi needs a trained language model dataset to interpret audio data put through the system. Because this build of Kaldi uses the PopUp Archive model, it is already trained for American English.
I set up my Docker on my Mac laptop, so the rest of the tutorial will focus on that system, but the GitHub has information for Windows or Linux and those are not very different. By the way, these instructions will probably be really easy for people that are used to interacting with tools in the command line, but I am going to write this post as if the reader hasn’t done that much. I will also note that while this build of Kaldi is really exciting and potentially useful, especially given all the fighting I’ve done with these kinds of systems in my career, I didn’t test it thoroughly because it is only Stephen’s experiment complimenting the grant project. I’d love to get feedback on issues you might encounter! Also I’ve got to thank Stephen and HiPSTAS!! THANK YOU Stephen!!
SET UP AND USE:
The first step is to download Docker (https://www.docker.com/). You then need to go into Docker’s preferences, under Advanced, and make sure that Docker has access to at least 6gb of RAM. Add more if you’d like.
Then navigate to the Terminal and pull Stephen’s Docker image for Kaldi. The command is “docker pull -a hipstas/kaldi-pop-up-archive”. (Note: Stephen’s GitHub says that you can run the pull without options, but I got errors if I ran it without “-a”). This is a big 12gb download, so go do something else while it finishes. I ate some Thanksgiving leftovers.
When everything is finished downloading, set up the image by running the command “docker run -it –name kaldi_pua –volume ~/Desktop/audio_in/:/audio_in/ hipstas/kaldi-pop-up-archive:v1”. This starts the Kaldi Docker image and creates a new folder on your desktop where you can add media files you want to run through Kaldi. This is also the place where Kaldi will write the output. Add some media to the folder BUT NOTE: the filenames cannot have spaces or uncommon characters or Kaldi will fail. My test of this setup ran well on some short mp4s. Also, your Terminal will now be controlling the Docker image, so your command line prompt will look different than it did, and you won’t be “in” your computer’s file system until you exit the Docker image.
Kaldi will run through a batch, and a ton of text will continue to roll through your Terminal. Don’t be afraid that it is taking forever. Kaldi is meant to run on very powerful computers, and running it this way is slow. I tested on a 30 minute recording, and it took 2.5 hrs to process. It will go faster the more computing power you assign permission for Docker to use, but it is reasonable to assume that on most computers the time to process will be around 5 times the recording length.
The setup script converts wav, mp3, and mp4 to a 16khz broadcast WAV, which is the input that Kaldi requires. You might need to manually convert your media to broadcast WAV if the setup script doesn’t work. I started out by test a broadcast WAV that I made myself with FFmpeg, but Kaldi and/or the setup script didn’t like it. I didn’t resolve that problem because the Kaldi image runs fine on media that it converts itself, so that saves me the trouble anyways.
When Kaldi is done processing, the text output will be in the “audio_in” folder, in the “transcripts” folder. There will be both a JSON and txt file for every recording processed, named the same as the original media file. The quality of the output depends greatly on the original quality of the recording, and how closely the recording resembles the language model (in this case, a studio recording of individuals speaking standard American English). That said, we’ve had some pretty good results in our tests. NOTE THAT if you haven’t assigned enough power to Docker, Kaldi will fail to process, and will do so without reporting an error. The failed files will create output JSON and txt files that are blank. If you’re having trouble try adding more RAM to Docker, or checking that your media file is successfully converting to broadcast WAV.
When you want to return your terminal to normal, use the command “exit” to shut down the image and return to your file system.
When you want to start the Kaldi image again to run another batch, open another session by running “docker start /kaldi_pua” and then “docker exec -it kaldi_pua bash”. You’ll then be in the Kaldi image and can run the batch with the “sh ./setup.sh” command.
I am sure that there are ways to update or modify the language model, or to use a different model, or to add different scripts to the Docker Kaldi, or to integrate it into bigger workflows. I haven’t spent much time exploring any of that, but I hope you found this post a helpful start. We’re going to keep it in mind as we build up our speech-to-text workflows, and we’ll be sure to share any developments. Happy speech-to-texting!!
In 2015, the Institute of Museum and Library Services (IMLS) awarded WGBH on behalf of the American Archive of Public Broadcasting a grant to address the challenges faced by many libraries and archives trying to provide better access to their media collections through online discoverability. Through a collaboration with Pop Up Archive and HiPSTAS at the University of Texas at Austin, our project has supported the creation of speech-to-transcripts for the initial 40,000 hours of historic public broadcasting preserved in the AAPB, the launch of a free open-source speech-to-text tool, and FIX IT, a game that allows the public to help correct our transcripts.
Now, our colleagues at HiPSTAS are debuting a new machine learning toolkit and DIY techniques for labeling speakers in “unheard” audio — audio that is not documented in a machine-generated transcript. The toolkit was developed through a massive effort using machine learning to identify notable speakers’ voices (such as Martin Luther King, Jr. and John F. Kennedy) from within the AAPB’s 40,000 hour collection of historic public broadcasting content.
This effort has vast potential for archivists, researchers, and other organizations seeking to discover and make accessible sound at scale — sound that otherwise would require a human to listen and identify in every digital file.