Oklahoma mentor Lisa Henry (left) cleaning a U-matic deck with Public Broadcasting Preservation Fellow Tanya Yule.
This Thursday, February 15th at 8 pm EST, American Archive of Public Broadcasting (AAPB) staff will host a webinar covering quality control tools and technologies used when ingesting digitized collections into the AAPB archive, including MDQC, MediaConch, Sonic Visualiser, and QCTools.
The public is welcome to join for the first half hour. The last half hour will be limited to Q&A with our Public Broadcasting Preservation Fellows, who are just now beginning the process of digitizing at-risk public broadcasting collections to be preserved in the AAPB.
At the AAPB “Crowdsourcing Anecdotes” meeting last Friday at the Association of Moving Image Archivists conference, I talked about a free “Dockerized” build of Kaldi made by Stephen McLaughlin, a PhD student at the UT Austin School of Information. I thought I would follow up on my introduction there by providing links to these resources, instructions for setting it up, and some anecdotes about using it. The best resource for this Docker Kaldi and Stephen’s work is the HiPSTAS GitHub: https://github.com/hipstas/kaldi-pop-up-archive, which also has detailed information for setting up and running the Docker Kaldi.
I confess that I don’t know much about computer programming and engineering beyond what I need to get my work done. I am an archivist, and while I’m eager to keep building my computer skills, some of my terminology here might be a bit off or unclear. Anyway, Kaldi is a free speech-to-text tool that interprets audio recordings and outputs timestamped JSON and text files. This “Dockerized” Kaldi lets you easily get a version of Kaldi running on pretty much any reasonably powerful computer. The recommended minimum is at least 6 GB of RAM; I’m not sure about the CPU, but the more of both the better, I’m sure.
The Docker platform provides a framework to easily download and set up a computer environment in which Kaldi can run. Kaldi is pretty complicated, but Stephen’s Docker image (https://hub.docker.com/r/hipstas/kaldi-pop-up-archive) lets us all bypass setting up Kaldi ourselves. As a bonus, it comes with the language model that Pop Up Archive created as part of our IMLS grant (link here) with them and HiPSTAS; they trained the model using AAPB recordings. Kaldi needs a trained language model to interpret the audio data put through the system, and because this build uses the Pop Up Archive model, it is already trained for American English.
I set up my Docker on my Mac laptop, so the rest of the tutorial will focus on that system, but the GitHub page has information for Windows and Linux, and those setups are not very different. These instructions will probably be really easy for people who are used to working with tools in the command line, but I am going to write this post as if the reader hasn’t done much of that. I will also note that while this build of Kaldi is really exciting and potentially useful, especially given all the fighting I’ve done with these kinds of systems in my career, I didn’t test it thoroughly, because it is only Stephen’s experiment complementing the grant project. I’d love to get feedback on issues you might encounter! Also, I’ve got to thank Stephen and HiPSTAS!! THANK YOU Stephen!!
SET UP AND USE:
The first step is to download Docker (https://www.docker.com/). You then need to go into Docker’s preferences, under Advanced, and make sure that Docker has access to at least 6 GB of RAM. Add more if you’d like.
Then navigate to the Terminal and pull Stephen’s Docker image for Kaldi. The command is “docker pull -a hipstas/kaldi-pop-up-archive”. (Note: Stephen’s GitHub says that you can run the pull without options, but I got errors unless I included “-a”.) This is a big 12 GB download, so go do something else while it finishes. I ate some Thanksgiving leftovers.
When everything is finished downloading, set up the image by running the command “docker run -it --name kaldi_pua --volume ~/Desktop/audio_in/:/audio_in/ hipstas/kaldi-pop-up-archive:v1”. This starts the Kaldi Docker image and creates a new folder on your desktop where you can add media files you want to run through Kaldi. This is also the place where Kaldi will write the output. Add some media to the folder BUT NOTE: the filenames cannot contain spaces or uncommon characters, or Kaldi will fail. My test of this setup ran well on some short mp4s. Also, your Terminal will now be controlling the Docker image, so your command line prompt will look different than it did, and you won’t be “in” your computer’s file system until you exit the Docker image.
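Since a stray space in a filename will make the batch fail, it can help to normalize names before copying media into the folder. Here is a minimal sketch; the `sanitize` function name is mine, and the allowed-character list is my assumption about what Kaldi tolerates, not something from the Kaldi docs:

```shell
# Hypothetical helper (not part of Stephen's setup): replace spaces and
# unusual characters in a filename with underscores so Kaldi won't choke.
# The allowed-character whitelist below is an assumption.
sanitize() {
  printf '%s' "$1" | tr -c 'A-Za-z0-9._-' '_'
}

# e.g. sanitize "my tape 01.mp4"  ->  my_tape_01.mp4
```

You would rename the file to whatever this prints before dropping it in audio_in.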
Kaldi will run through a batch, and a ton of text will roll through your Terminal. Don’t worry that it’s taking forever: Kaldi is meant to run on very powerful computers, and running it this way is slow. I tested it on a 30-minute recording, and it took 2.5 hours to process. It will go faster the more computing power you allow Docker to use, but on most computers it is reasonable to expect processing to take around five times the recording’s length.
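If you want a back-of-the-envelope figure before starting a batch, that five-times rule can be turned into a one-liner. The function name is mine, and five is just the rough multiplier from my one test, not a guarantee:

```shell
# Rough estimate only: processing time is ~5x the recording length on a
# typical laptop (in my test, 30 minutes of audio took about 2.5 hours).
estimate_minutes() {
  echo $(( $1 * 5 ))
}

# e.g. estimate_minutes 30  ->  150 minutes, i.e. 2.5 hours
```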
The setup script converts WAV, MP3, and MP4 files to 16 kHz broadcast WAV, which is the input Kaldi requires. You might need to manually convert your media to broadcast WAV if the setup script doesn’t work. I started out by testing a broadcast WAV that I had made myself with FFmpeg, but Kaldi and/or the setup script didn’t like it. I didn’t resolve that problem, because the Kaldi image runs fine on media it converts itself, which saves me the trouble anyway.
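If you do want to try the manual conversion, an FFmpeg command along these lines should produce a 16 kHz mono 16-bit PCM WAV. This is my own sketch, not the setup script’s exact invocation (the function name and filenames are examples), and as I said above, Kaldi didn’t accept my hand-made WAV, so your mileage may vary:

```shell
# Sketch of a manual conversion to the 16 kHz mono PCM WAV that Kaldi
# wants (requires FFmpeg; function name and filenames are examples).
#   -ar 16000      resample to 16 kHz
#   -ac 1          downmix to mono
#   -c:a pcm_s16le 16-bit little-endian PCM audio
to_kaldi_wav() {
  ffmpeg -i "$1" -ar 16000 -ac 1 -c:a pcm_s16le "${1%.*}.wav"
}

# usage: to_kaldi_wav interview.mp3   # writes interview.wav
```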
When Kaldi is done processing, the text output will be in the “audio_in” folder, in the “transcripts” folder. There will be both a JSON and a txt file for every recording processed, named the same as the original media file. The quality of the output depends greatly on the original quality of the recording, and on how closely the recording resembles the language model (in this case, a studio recording of individuals speaking standard American English). That said, we’ve had some pretty good results in our tests. NOTE THAT if you haven’t assigned enough power to Docker, Kaldi will fail to process the files, and it will do so without reporting an error; the failed files will produce JSON and txt output files that are blank. If you’re having trouble, try adding more RAM to Docker, or check that your media file is successfully converting to broadcast WAV.
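Because those blank files are the only symptom of that silent failure, it’s worth scanning for them after a batch finishes. A minimal sketch (the function name is mine; the default path matches the folder created earlier in this tutorial):

```shell
# List transcript .txt files that came out empty, i.e. the symptom of
# the silent failure described above. Pass a folder, or it defaults to
# the audio_in/transcripts folder from this tutorial.
empty_transcripts() {
  find "${1:-$HOME/Desktop/audio_in/transcripts}" -type f -name '*.txt' -size 0 2>/dev/null
}

# usage: empty_transcripts   # any filenames printed here failed to process
```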
When you want to return your terminal to normal, use the command “exit” to shut down the image and return to your file system.
When you want to start the Kaldi image again to run another batch, open another session by running “docker start /kaldi_pua” and then “docker exec -it kaldi_pua bash”. You’ll then be in the Kaldi image and can run the batch with the “sh ./setup.sh” command.
I am sure that there are ways to update or modify the language model, or to use a different model, or to add different scripts to the Docker Kaldi, or to integrate it into bigger workflows. I haven’t spent much time exploring any of that, but I hope you found this post a helpful start. We’re going to keep it in mind as we build up our speech-to-text workflows, and we’ll be sure to share any developments. Happy speech-to-texting!!
Grant of $229,772 will fund students’ work on digitization of historic, at-risk public media content from underrepresented regions and communities
BOSTON, September 28, 2017– WGBH Educational Foundation is pleased to announce that the Institute of Museum and Library Services (IMLS) has awarded WGBH a $229,772 Laura Bush 21st Century Librarian Program grant to launch the Public Broadcasting Preservation Fellowship. The fellowship will fund 10 graduate students from across the United States to digitize at-risk audiovisual materials at public media organizations near their universities. The digitized content will ultimately be incorporated into the American Archive of Public Broadcasting (AAPB), a collaboration between Boston public media station WGBH and the Library of Congress working to digitize and preserve thousands of broadcasts and previously inaccessible programs from public radio and public television’s more than 60-year legacy.
“We are honored that the Institute of Museum and Library Services has chosen WGBH to lead the Public Broadcasting Preservation Fellowship,” said Casey Davis Kaufman, Associate Director of the WGBH Media Library and Archives and WGBH’s AAPB Project Manager. “This grant will allow us to prepare a new generation of library and information science professionals to save at-risk and historically significant public broadcasting collections, especially fragile audiovisual materials, from regions and communities underrepresented in the American Archive of Public Broadcasting.”
WGBH has developed partnerships with library and information science programs and archival science programs at five universities: Clayton State University, University of North Carolina at Chapel Hill, University of Oklahoma, University of Missouri, and San Jose State University. Each school will be paired with a public media organization that will serve as a host site for two consecutive fellowships: Georgia Public Broadcasting, WUNC, the Oklahoma Educational Television Authority, KOPN Community Radio, and the Center for Asian American Media in partnership with the Bay Area Video Coalition.
“As centers of learning and catalysts of community change, libraries and museums connect people with programs, services, collections, information, and new ideas in the arts, sciences, and humanities. They serve as vital spaces where people can connect with each other,” said IMLS Director Dr. Kathryn K. Matthew. “IMLS is proud to support their work through our grant making as they inform and inspire all in their communities.”
The first fellowship will take place during the 2018 spring semester, from January to April of 2018. The second fellowship will take place during the summer semester from June to August of 2018. The grant also will support participating universities in developing long-term audiovisual preservation curricula, including providing funding for audiovisual digitization equipment, and developing partnerships with local public media organizations.
About WGBH WGBH Boston is America’s preeminent public broadcaster and the largest producer of PBS content for TV and the Web, including Masterpiece, Antiques Roadshow, Frontline, Nova, American Experience, Arthur, Curious George, and more than a dozen other prime-time, lifestyle, and children’s series. WGBH also is a leader in educational multimedia, including PBS LearningMedia, and a pioneer in technologies and services that make media accessible to the 36 million Americans who are deaf, hard of hearing, blind, or visually impaired. WGBH has been recognized with hundreds of honors: Emmys, Peabodys, duPont-Columbia Awards…even two Oscars. Find more information at www.wgbh.org.
About the Library of Congress The Library of Congress is the world’s largest library, offering access to the creative record of the United States – and extensive materials from around the world – both on site and online. It is the main research arm of the U.S. Congress and the home of the U.S. Copyright Office. Explore collections, reference services and other programs and plan a visit at loc.gov, access the official site for U.S. federal legislative information at congress.gov and register creative works of authorship at copyright.gov.
About the American Archive of Public Broadcasting The American Archive of Public Broadcasting (AAPB) is a collaboration between the Library of Congress and the WGBH Educational Foundation to coordinate a national effort to preserve at-risk public media before its content is lost to posterity and provide a central web portal for access to the unique programming that public stations have aired over the past 60 years. To date, nearly 50,000 hours of television and radio programming contributed by more than 100 public media organizations and archives across the United States have been digitized for long-term preservation and access. The entire collection is available on location at WGBH and the Library of Congress, and more than 22,000 programs are available online at americanarchive.org.
About IMLS The Institute of Museum and Library Services is celebrating its 20th Anniversary. IMLS is the primary source of federal support for the nation’s 123,000 libraries and 35,000 museums. Our mission has been to inspire libraries and museums to advance innovation, lifelong learning, and cultural and civic engagement. For the past 20 years, our grant making, policy development, and research have helped libraries and museums deliver valuable services that make it possible for communities and individuals to thrive. To learn more, visit http://www.imls.gov and follow us on Facebook, Twitter and Instagram.
In 2015, the Institute of Museum and Library Services (IMLS) awarded WGBH, on behalf of the American Archive of Public Broadcasting, a grant to address the challenges faced by many libraries and archives trying to provide better access to their media collections through online discoverability. Through a collaboration with Pop Up Archive and HiPSTAS at the University of Texas at Austin, our project has supported the creation of speech-to-text transcripts for the initial 40,000 hours of historic public broadcasting preserved in the AAPB, the launch of a free open-source speech-to-text tool, and FIX IT, a game that allows the public to help correct our transcripts.
Now, our colleagues at HiPSTAS are debuting a new machine learning toolkit and DIY techniques for labeling speakers in “unheard” audio — audio that is not documented in a machine-generated transcript. The toolkit was developed through a massive effort using machine learning to identify notable speakers’ voices (such as Martin Luther King, Jr. and John F. Kennedy) from within the AAPB’s 40,000 hour collection of historic public broadcasting content.
This effort has vast potential for archivists, researchers, and other organizations seeking to discover and make accessible sound at scale — sound that otherwise would require a human to listen and identify in every digital file.
In 2015, the Institute of Museum and Library Services awarded a generous grant to WGBH on behalf of the American Archive of Public Broadcasting (AAPB) to develop the AAPB National Digital Stewardship Residency (NDSR). Through this project, we have placed seven graduates of master’s degree programs in digital stewardship residencies at public media organizations around the country.
AAPB NDSR has already yielded dozens of great resources for the public media and audiovisual preservation community – and the residents aren’t even halfway done yet! As we near the program’s midpoint, we wanted to catch you up on the program so far.
In August 2016, the residents dispersed to their host stations, and began recording their experiences in a series of thoughtful blog posts, covering topics from home movies to DAM systems to writing in Python.
August also kicked off our first series of guest webinars, focusing on a range of topics of interest to audiovisual and digital preservation professionals. Most webinars were recorded, and all have slides available.
The residents also hosted two great panel presentations, first in September at the International Association of Sound and Audiovisual Archives Conference, and then in November at the Association of Moving Image Archivists Conference. The AMIA session in particular generated a lot of Twitter chatter; you can see a roundup here.
To keep up with AAPB NDSR blog posts, webinar recordings, and project updates as they happen, follow the AAPB NDSR site at ndsr.americanarchive.org.
We are thrilled to announce that the Institute of Museum and Library Services has awarded WGBH, on behalf of the American Archive of Public Broadcasting, a National Leadership Grant for a project titled “Improving Access to Time-Based Media through Crowdsourcing and Machine Learning.”
Together, WGBH and Pop Up Archive plan to address the challenges faced by many libraries and archives trying to provide better access to their media collections through online discoverability. This 30-month project will combine technological and social approaches to metadata creation by leveraging scalable computation and engaging the public through crowdsourcing games for time-based media. The project will support several related areas of research and testing, including: using speech-to-text and audio analysis tools to transcribe and analyze almost 40,000 hours of digital audio from the American Archive of Public Broadcasting; developing open source web-based tools to improve transcripts and descriptive data by engaging the public in a crowdsourced, participatory cataloging project; and creating and distributing data sets to provide a public database of audiovisual metadata for use by other projects.
Our research questions are: How can crowdsourced improvements to machine-generated transcripts and tags increase the quality of descriptive metadata and enhance search engine discoverability for audiovisual content? How can a range of web-based games create new points of access and engage the public with time-based media through crowdsourcing tools? What qualitative attributes of audiovisual public media content (such as speaker identities, emotion, and tone) can be successfully identified with spectral analysis tools, and how can feeding crowdsourced improvements back into audio analysis tools improve their future output and create training data that can be publicly disseminated to help describe other audiovisual collections at scale?
This project will use content from the AAPB to answer our questions. The project will fund 1) audio analysis tools – development and use of speech-to-text and audio analysis tools to create transcripts and qualitative waveform analysis for almost 40,000 hours of AAPB digital files (and participating stations can definitely receive copies of their own transcripts!); 2) metadata games – development of open-source web-based tools to improve transcripts and descriptive data by engaging the public in a crowdsourced, participatory cataloging project; 3) evaluating access – a measurement of improved access to media files from crowdsourced data; 4) sharing tools – open-source code release for tools developed over the course of the grant; and 5) teaching data set – the publication of initial and improved data sets to ‘teach’ tools and provide a public database of audiovisual metadata (audio fingerprints) for use by other projects working to create access to audiovisual material.
The 2014 National Digital Stewardship Agenda includes, “Engage and encourage relationships between private/commercial and heritage organizations to collaborate on the development of standards and workflows that will ensure long-term access to our recorded and moving image heritage.” These partnerships are critical in order to move the needle on audiovisual access issues of national significance. The AAPB and Pop Up Archive are eager to continue building such a relationship so that the innovations in technology, workflows, and data analysis advanced by the private sector are fully and sustainably leveraged for U.S. public media and cultural heritage organizations.