Oklahoma mentor Lisa Henry (left) cleaning a U-matic deck with Public Broadcasting Preservation Fellow Tanya Yule.
This Thursday, February 15th at 8 pm EST, American Archive of Public Broadcasting (AAPB) staff will host a webinar covering quality control tools and technologies used when ingesting digitized collections into the AAPB archive, including MDQC, MediaConch, Sonic Visualiser, and QCTools.
The public is welcome to join for the first half hour. The last half hour will be limited to Q&A with our Public Broadcasting Preservation Fellows, who are just now beginning the process of digitizing at-risk public broadcasting collections to be preserved in the AAPB.
Historic WRVR-FM Archives Receives CLIR
Digitizing Hidden Special Collections and Archives Award
More than 4,000 hours of cultural and political radio programming from the 60s and 70s to be made public
Morningside Heights, NY – The Council on Library and Information Resources has awarded a grant of $330,000 to digitize, preserve, and make publicly accessible previously unavailable archives of the Peabody Award winning radio station WRVR. Public Radio as a Tool for Cultural Engagement in New York in the 60s and early 70s: Digitizing the Broadcasts of WRVR-FM Public Radio is a joint project between The Riverside Church in the City of New York and the American Archive of Public Broadcasting, a collaboration between the Library of Congress and the WGBH Educational Foundation. The collection includes culturally significant non-commercial programming, including interviews, speeches, and musical interpretations on matters such as civil rights, war, and fine arts, from laypersons to famed scholars, including Martin Luther King, Jr., Malcolm X, and Pete Seeger.
Funded by the Andrew W. Mellon Foundation, the Council on Library and Information Resources’ Digitizing Hidden Collections program supports the creation of digital representations of unique content of high scholarly significance. This award will support the preservation and digitization of over 3,502 recordings representing 4,000 hours of programming from WRVR from the 1960s and early 1970s. Owned and operated by The Riverside Church from 1961-1976, WRVR was the first station to win a Peabody for its entire programming, in part for its coverage of the Civil Rights movement in 1963 Birmingham. In addition to featuring progressive religious and philosophical discussions with Riverside clergy, theologians, and scholars, such as Rev. Dr. Martin Luther King, Jr., WRVR programming included culturally significant topics, speakers, and performances, such as Langston Hughes’ “Jericho-Jim Crow” directed by Alvin Ailey, and interviews and readings by Robert Frost, John Ashbery, and Allen Ginsberg. The station also featured the program “Just Jazz with Ed Beach,” whose collection currently resides at the Library of Congress.
Preservation of these materials will enhance study in many disciplines, including theology/religion, political science, and communications, especially related to American Christianity, homiletics, progressive responses to the Civil Rights movement, contemporary issues of race and sexuality, the cultural impact of the 1960s, and public radio as a tool for cultural engagement and social media precursor.
These recordings will be made publicly available at the American Archive of Public Broadcasting (AAPB), a collaboration between the Library of Congress and WGBH. The AAPB coordinates a national effort to preserve at-risk public media before its content is lost to posterity and provide a central web portal for access to the unique programming that public stations have aired over the past 70 years.
Sample recordings include:
“Back to School in Birmingham; Birmingham: Testament of Nonviolence, Part 4 [1 of 2].” May 1963 Riverside Radio, WRVR
Robert Polk (Riverside Church) interviews teenagers recently released from jail for participating in the 1963 Children’s Crusade in Birmingham. Over 1800 children, some as young as six years old, peacefully protested and were met with fire hoses and police dogs.
“The American People; What is Patriotism, Part 1 [1 of 2].” 1964 Riverside Radio, WRVR Interviews with various Americans exploring attitudes about patriotism in the middle part of the twentieth century through discussing flag waving, nationalism vs. patriotism, and critically thinking about one’s country.
Arthur Miller. Statement for World Theater Day, March 27, 1963 Riverside Radio, WRVR, Riverside Archives (The Riverside Church) Arthur Miller remarks on theater’s ability to speak universal truths and understanding in art, and how this particular art form, above many others, informs society’s response to war, politics, freedoms, and all matters of the human condition across nations and cultures.
“Listen! William Sloane Coffin Jr.: Conscience, Protest & War.” Interview on WRVR, March 31, 1968 Riverside Radio, WRVR. Riverside Archives (The Riverside Church) William Sloane Coffin Jr., chaplain at Yale University (later Riverside Senior Minister, 1977-1987), discusses his indictment for conspiracy to encourage draft evasion and the politics of the Vietnam War; peace activism, civil rights and Dr. King’s Poor People’s Campaign, and how Dr. Coffin’s privilege informs his work as a clergyperson, activist, and American.
About The Riverside Church Located in Morningside Heights on the Upper West Side, The Riverside Church in the City of New York is one of the leading voices of Progressive Christianity, influential on America’s religious and political landscapes for more than 85 years. Built by John D. Rockefeller Jr. and currently led by The Rev. Dr. Amy Butler, the interracial, interdenominational, and international church has long been a forum for important civic and spiritual leaders, including Dr. Martin Luther King, Jr., Nelson Mandela, President Clinton, the Dalai Lama, and countless others. Visit www.trcnyc.org or find us on social media to learn more about our rich history and the latest news and events.
About the American Archive of Public Broadcasting The American Archive of Public Broadcasting (AAPB) is a collaboration between the Library of Congress and the WGBH Educational Foundation to coordinate a national effort to preserve at-risk public media before its content is lost to posterity and provide a central web portal for access to the unique programming that public stations have aired over the past 70 years. To date, over 50,000 hours of television and radio programming contributed by more than 100 public media organizations and archives across the United States have been digitized for long-term preservation and access. The entire collection is available on location at the Library of Congress and WGBH, and more than 30,000 programs are available online at americanarchive.org.
WGBH Boston is America’s preeminent public broadcaster and the largest producer of PBS content for TV and the Web, including Masterpiece, Antiques Roadshow, Frontline, Nova, American Experience, Arthur and more than a dozen other prime-time, lifestyle, and children’s series. WGBH also is a leader in educational multimedia, including PBS LearningMedia™, and a pioneer in technologies and services that make media accessible to the 36 million Americans who are deaf, hard of hearing, blind, or visually impaired. WGBH has been recognized with hundreds of honors: Emmys, Peabodys, duPont-Columbia Awards…even two Oscars. Find more information at www.wgbh.org.
About the Library of Congress
The Library of Congress is the world’s largest library, offering access to the creative record of the United States – and extensive materials from around the world – both on site and online. It is the main research arm of the U.S. Congress and the home of the U.S. Copyright Office. Explore collections, reference services and other programs and plan a visit at loc.gov, access the official site for U.S. federal legislative information at congress.gov and register creative works of authorship at copyright.gov.
About CLIR The Council on Library and Information Resources is an independent, nonprofit organization that forges strategies to enhance research, teaching, and learning environments in collaboration with libraries, cultural institutions, and communities of higher learning.
About the Mellon Foundation
Founded in 1969, the Andrew W. Mellon Foundation endeavors to strengthen, promote, and, where necessary, defend the contributions of the humanities and the arts to human flourishing and to the well-being of diverse and democratic societies by supporting exemplary institutions of higher education and culture as they renew and provide access to an invaluable heritage of ambitious, path-breaking work. Additional information is available at mellon.org.
The American Archive of Public Broadcasting (AAPB) has submitted a break-out session proposal for the 2018 PBS Annual Meeting this coming May. Please consider voting for our presentation proposal (detailed below). It’s on the second page of the SurveyMonkey voting form, titled “Engage your Community to Celebrate Your History: Tools from the AAPB.” If selected, together with archivists and volunteer managers at Louisiana Public Broadcasting, Rocky Mountain PBS, and Wisconsin Public Television, we will discuss outreach methods and tools to activate local communities through the preservation of public media’s rich legacy.
Thank you for your support and please share with your fellow #pubmedia fans!
Engage your community to celebrate your history: tools from the AAPB
The American Archive of Public Broadcasting (AAPB) coordinates a national effort to preserve public media. The AAPB preserves over 50,000 hours of historic content from over 100 stations and is acquiring up to 25,000 hours of digitized content annually. At this session, AAPB staff will 1) demo crowdsourcing games and metadata automation tools used by the AAPB to improve access; 2) provide marketing and community engagement toolkits and tips for promoting and enhancing your station’s archive; 3) discuss the workflow and requirements for contributing to the AAPB, including an overview of grant opportunities for digitization and suggested partnerships; and 4) demo PBCore tools developed by the AAPB for use by stations in managing their content.
Takeaways: Attendees will learn how to use the AAPB crowdsourcing games to engage stations’ local communities; use the marketing toolkit to increase interest in station history; develop ideas to pursue grants and funding to support their station’s contribution to the AAPB; and better maintain content libraries.
Interactivity: The session will consist of hands-on activities: playing the crowdsourcing games, using the tools, and brainstorming methods to engage communities with these tools. Demos of the games and tools will be given, and participants will be encouraged to use them.
Laura Sampson, Rocky Mountain PBS
Leslie Bourgeois, Archivist, Louisiana Public Broadcasting
Ann Wilkens, Wisconsin Public Television
Casey Davis Kaufman, Associate Director, Media Library and Archives, WGBH
Karen Cariani, Senior Director, Project Director, WGBH and American Archive of Public Broadcasting Submitted by: Karen Cariani, WGBH
Following up on our post this past September announcing our IMLS-funded Public Broadcasting Preservation Fellowship (PBPF) project, we’re very excited to introduce our first cohort of Public Broadcasting Preservation Fellows!
PBPF fellows, mentors and project staff at Immersion Week in Boston
The PBPF supports students enrolled in non-specialized graduate programs to pursue digital preservation projects at public broadcasting organizations around the country. The Fellowship is designed to provide graduate students with the opportunity to gain hands-on experiences in the practices of audiovisual preservation; address the need for digitization of at-risk public media materials in underserved areas; and increase audiovisual preservation education capacity in Library and Information Science graduate programs around the country.
Over the spring semester of this year (and summer semester for our second cohort), each fellow will inventory, digitize, and catalog a small collection of audiovisual media; generate technical and preservation metadata; and process the digital files for ingest into the American Archive of Public Broadcasting. The fellows will collaborate with a faculty advisor at their university to document their work in a 3-5 page handbook and video demo. The fellowship will also support a digitization station at each university for the use by the fellows and future students enrolled at the universities.
Please welcome the members of our PBPF cohort:
Fellow: Virginia Angles
Program: Clayton State University
Host Organization: Georgia Public Broadcasting
Host Mentor: Tanya Ott, Vice President of Radio and News Content, Georgia Public Broadcasting
Faculty Advisor: Josh Kitchens, Director, Master of Archival Studies Program
Local Mentor: Kathy Christensen, former VP of News, Archives and Research at CNN
Virginia Angles is an aspiring archivist with a background in Art History and Chemistry. She is currently pursuing a second master’s degree in Archival Studies with a focus on digital preservation.
Fellow: Rebecca Benson
Program: University of Missouri
Host Organization: KOPN Community Radio
Host Mentor: Jacqueline Casteel, KOPN Community Radio
Faculty Advisor: Sarah Buchanan, Assistant Professor, Library and Information Science
Local Mentor: James Hone, Digital Archivist, University Libraries, Washington University in St. Louis
Rebecca Benson is a graduate student in the Library and Information Science Program at the University of Missouri, where she works in the Special Collections and Rare Books department of Ellis Library. Her research interests include digital communities, story-telling and reception, and the preservation of ephemeral narratives.
Fellow: Evelyn Cox
Program: University of Oklahoma
Host Organization: Oklahoma Educational Television Authority
Host Mentor: Janette Thornbrue, Vice President of Operations, Oklahoma Educational Television Authority
Faculty Advisor: Susan Burke, Interim Director and Associate Professor, School of Library and Information Studies
Local Mentor: Lisa Henry, Curator/Archivist, Political Communication Center, Julian P. Kantor Political Commercial Archive
Evelyn Cox is a graduate student enrolled in the Masters of Library and Information Studies (MLIS) Program at the University of Oklahoma. She obtained her undergraduate degree in English from the University of California, Los Angeles and is a wife and mother of two. She was born on the beautiful island of Guam but currently resides in Oklahoma. Evelyn has been a public school English teacher for over seventeen years. She has earned her National Board Certification in English Language Arts, has been a Great Expectations Instructor, has coached track and field, and has served on multiple grant writing and curriculum development teams. Upon graduating from the MLIS Program, Evelyn seeks to pursue a career in archives where she can combine her love of literature, history, and culture. Through archiving, she plans to take an active role in documenting and preserving history that adds to the cultural identity and awareness of the Chamorro people of Guam.
Fellow: Dena Schulze
Program: University of North Carolina at Chapel Hill
Host Organization: WUNC
Host Mentor: Keith Weston, Web Producer and Back Porch Music Host, WUNC
Faculty Advisor: Helen Tibbo, Alumni Distinguished Professor, SILS
Local Mentor: Erica Titkemeyer, Project Director/AV Conservator, University of North Carolina at Chapel Hill
Dena Schulze is currently pursuing her Master’s degree at the University of North Carolina at Chapel Hill in Library Science with a concentration in archives and records management. She graduated from North Carolina State University with a bachelor’s in English. She is a major movie buff, and that’s what got her started on the road to a/v archiving and preservation. Dena’s dream would be to work in a film archive when she graduates. When she is not working, reading, or watching movies, she is playing with her new puppy, Bodhi, who just turned six months old! Dena is very excited about this opportunity and about being a part of saving audiovisual material for future generations.
Fellow: Tanya Yule
Program: San Jose State University
Host Organization: Center for Asian American Media in collaboration with the Bay Area Video Coalition
Host Mentor: James Ott, Director of Finance and Administration, Center for Asian American Media
Faculty Advisor: Alyce Scott, Lecturer, School of Information
Local Mentor: Jackie Jay, Preservation Technician, Bay Area Video Coalition
Tanya Yule is a current MLIS candidate at San José State University, focusing on archives and photography preservation; she received her BFA in photography from the San Francisco Art Institute, with a background in traditional darkroom methods, and photomechanical printing. Tanya is an intern at the Hoover Institution Archives at Stanford University, and resides in San Francisco with her husband and adorable dog Otto.
PBPF Fellows at Immersion Week in Boston – from left to right – Tanya Yule, Dena Schulze, Rebecca Benson, Virginia Angles, and Evelyn Cox.
Earlier this month the American Archive of Public Broadcasting staff hosted several workshops at the 2017 Association of Moving Image Archivists (AMIA) conference in New Orleans. Their presentations on workflows, crowdsourcing, and best copyright practices are now available online! Be sure to also check out AMIA’s YouTube channel for recorded sessions.
THURSDAY, November 30th
PBCore Advisory Sub-Committee Meeting Rebecca Fraimow reported on general activities of the Sub-Committee and the PBCore Development and Training Project. The following current activities were presented:
Archives that hold A/V materials are at a critical point, with many cultural heritage institutions needing to take immediate action to safeguard at-risk media formats before the content they contain is lost forever. Yet, many in the cultural heritage communities do not have sufficient education and training in how to handle the special needs that A/V archive materials present. In the summer of 2015, a handful of archive educators and students formed a pan-institutional group to help foster “educational opportunities in audiovisual archiving for those engaged in the cultural heritage sector.” The AV Competency Framework Working Group is developing a set of competencies for audiovisual archive training of students in graduate-level education programs and in continuing education settings. In this panel, core members of the working group will discuss the main goals of the project and the progress that has been made on it thus far.
Born-digital audiovisual files continue to present a conundrum to archivists in the field today: should they be accepted as-is, transcoded, or migrated? Is transcoding to a recommended preservation format always worth the potential extra storage space and staff time? If so, what are the ideal target specifications? In this presentation, individuals working closely with born-digital audiovisual content from the University of North Carolina, WGBH, and the American Folklife Center at the Library of Congress will present their own use cases involving collections processing practices, from “best practice” to the practical reality of “good enough”. These use cases will highlight situations wherein video quality, subject matter, file size and stakeholder expectations end up playing important roles in directing the steps taken for preservation. From these experiences, the panel will put forth suggestions for tiered preservation decision making, recognizing that not all files should necessarily be treated alike.
How does the public play a role in making historical AV content accessible? The American Archive of Public Broadcasting has launched two games that engage the public in transcribing and describing 70+ years of audio and visual content comprising more than 50,000 hours.
(Speech-to-Text Transcript Correction) FIX IT is an online game that allows the public to identify and correct errors in our machine-generated transcripts. FIX IT players have exclusive access to historical content and long-lost interviews from stations across the country.
(Program Credits Cataloging) ROLL THE CREDITS is a game that allows the public to identify and transcribe information about the text that appears on the screen in so many television broadcasts. ROLL THE CREDITS asks users to collect this valuable information and classify it into categories that can be added to the AAPB catalog. To accomplish this goal, we’ve extracted frames from uncataloged video files and are asking for help to transcribe the important information contained in each frame.
Digitized collections often remain almost as inaccessible as they were on their original analog carriers, primarily due to institutional concerns about copyright infringement and privacy. The American Archive of Public Broadcasting has taken steps to overcome these challenges, making available online more than 22,000 historic programs with zero take-down notices since the 2015 launch. This copyright session will highlight practical and successful strategies for making collections available online. The panel will share strategies for: 1) developing template forms with standard terms to maximize use and access, 2) developing a rights assessment framework with limited resources (an institutional “Bucket Policy”), 3) providing limited access to remote researchers for content not available in the Online Reading Room, and 4) promoting access through online crowdsourcing initiatives.
The American Archive of Public Broadcasting seeks to preserve and make accessible significant historical public media content, and to coordinate a national effort to save at-risk public media recordings. In the four years since WGBH and the Library of Congress began stewardship of the project, significant steps have been taken towards accomplishing these goals. The effort has inspired workflows that function constructively, beginning with preservation at local stations and building to national accessibility on the AAPB. Archivists from two contributing public broadcasters will present their institutions’ local preservation and access workflows. Representatives from WGBH and the Library of Congress will discuss collaborating with contributors and the AAPB’s digital preservation and access workflows. By sharing their institutions’ roles and how collaborators participate, the speakers will present a full picture of the AAPB’s constructive inter-institutional work. Attendees will gain knowledge of practical workflows that facilitate both local and national AV preservation and access.
As an increasing number of audiovisual formats become obsolete and the available hours remaining on deteriorating playback machines decrease, it is essential for institutions to digitize their AV holdings to ensure long-term preservation and access. With an estimated hundreds of millions of items to digitize, it is impractical, even impossible, that institutions would be able to perform all of this work in-house before time runs out. While this can seem like a daunting process, why learn the hard way when you can benefit from the experiences of others? From those embarking on their first outsourced AV digitization project to those who have completed successful projects but are looking for ways to refine and scale up their process, everyone has something to learn from these speakers about managing AV digitization projects from start to finish.
How do you bring together a collection of broadcast materials scattered in various geographical locations across the country? National Educational Television (NET), the precursor to PBS, distributed programs nationally to educational television stations from 1954-1972. Although this collection is tied together through provenance, it presents a challenge to processing due to differing approaches in descriptive practices across many repositories over many years. By aggregating inventories into one catalog and describing titles more fully, the NET Collection Catalog will help institutions holding these materials make informed preservation decisions. By its conclusion, AAPB will publish an online list of NET titles annotated with relevant descriptive information culled from NET textual records that will greatly improve discoverability of NET materials for archivists, scholars, and the general public. Examples of specific cataloging issues, including contradictory metadata documentation and legacy records, inconsistent titling practices, and the existence of international versions, will be explored.
ABOUT THE AAPB
The American Archive of Public Broadcasting (AAPB) is a collaboration between the Library of Congress and the WGBH Educational Foundation to coordinate a national effort to preserve at-risk public media before its content is lost to posterity and provide a central web portal for access to the unique programming that public stations have aired over the past 70 years. To date, over 50,000 hours of television and radio programming contributed by more than 100 public media organizations and archives across the United States have been digitized for long-term preservation and access. The entire collection is available on location at WGBH and the Library of Congress, and almost 25,000 programs are available online at americanarchive.org.
Today we’re launching ROLL THE CREDITS, a new Zooniverse project to engage the public in helping us catalog unseen content in the AAPB archive. Zooniverse is the “world’s largest and most popular platform for people-powered research.” Zooniverse volunteers (like you!) are helping the AAPB in classifying and transcribing the text from extracted frames of uncataloged public television programs, providing us with information we can plug directly into our catalog, closing the gap on our sparsely described collection of nearly 50,000 hours of television and radio.
Example frame from ROLL THE CREDITS
The American people have made a huge investment in public radio and television over many decades. The American Archive of Public Broadcasting (AAPB) works to ensure that this rich source for American political, social, and cultural history and creativity is saved and made available once again to future generations.
The improved catalog records will have verified titles, dates, credits, and copyright statements. With the updated, verified information we will be able to make informed decisions about the development of our archive, as well as provide access to corrected versions of transcripts available for anyone to search free of charge at americanarchive.org.
In conjunction with our speech-to-text transcripts from FIX IT, a game that asks users to correct and validate the transcripts one phrase at a time, ROLL THE CREDITS helps us fulfill our mission of preserving and making accessible historic content created by the public media, saving at-risk media before the contents are lost to posterity.
Thanks for supporting AAPB’s mission! Know someone who might be interested? Feel free to share with the other transcribers and public media fans in your life!
Seeking information about the workflows and requirements for contributing digitized content and/or metadata to the AAPB?
Writing a grant proposal and want to explore collaborating with the AAPB to preserve copies of your digitized collections and/or provide an access point to your collections through the AAPB metadata portal?
Then this webinar is for you!
On Tuesday, December 12, 2017 at 12:00pm ET, the AAPB will host a webinar focused on grant writing for digitization and subsequent contribution of digital files and metadata to the AAPB.
By the end of this webinar, participants will gain an understanding of:
AAPB’s background and infrastructure,
how contributing to the AAPB could benefit your collection,
steps to becoming an AAPB contributor,
metadata and digital file format requirements and recommendations,
delivery procedures,
other workflows and considerations for contributing digital files and/or metadata to the AAPB, and
the value of your collection as part of a national collection and how to express that in a proposal.
This webinar and future AAPB webinars are generously funded by The Andrew W. Mellon Foundation.
The American Archive of Public Broadcasting (AAPB) is a collaboration between the Library of Congress and the WGBH Educational Foundation to coordinate a national effort to preserve at-risk public media before its content is lost to posterity and provide a central web portal for access to the unique programming that public stations have aired over the past 60 years. To date, over 50,000 hours of television and radio programming contributed by more than 100 public media organizations and archives across the United States have been digitized for long-term preservation and access. The entire collection is available on location at the Library of Congress and WGBH, and almost 25,000 programs are available online at americanarchive.org.
At the AAPB “Crowdsourcing Anecdotes” meeting last Friday at the Association of Moving Image Archivists conference, I talked about a free “Dockerized” build of Kaldi made by Stephen McLaughlin, PhD student at UT Austin School of Information. I thought I would follow up on my introduction to it there by providing links to these resources, instructions for setting it up, and some anecdotes about using it. First, the best resource for this Docker Kaldi and Stephen’s work is here in the HiPSTAS GitHub: https://github.com/hipstas/kaldi-pop-up-archive. It also has detailed information for setting up and running the Docker Kaldi.
I confess that I don’t know much about computer programming and engineering besides what I need to get my work done. I am an archivist and I eagerly continue to gain more computer skills, but some of my terminology here might be kinda wrong or unclear. Anyways, Kaldi is a free speech-to-text tool that interprets audio recordings and outputs timestamped JSON and text files. This “Dockerized” Kaldi allows you to easily get a version of Kaldi running on pretty much any reasonably powerful computer. The recommended minimum is 6 GB of RAM, and I’m not sure about the CPU. The more of both the better, I’m sure.
The Docker platform provides a framework to easily download and set up a computer environment in which Kaldi can run. Kaldi is pretty complicated, but Stephen’s Docker image (https://hub.docker.com/r/hipstas/kaldi-pop-up-archive) helps us all bypass setting up Kaldi. As a bonus, it comes set up with the language model that PopUp Archive created as part of our IMLS grant (link here) with them and HiPSTAS. They trained the model using AAPB recordings. Kaldi needs a trained language model dataset to interpret audio data put through the system. Because this build of Kaldi uses the PopUp Archive model, it is already trained for American English.
I set up my Docker on my Mac laptop, so the rest of the tutorial will focus on that system, but the GitHub has information for Windows or Linux and those are not very different. By the way, these instructions will probably be really easy for people who are used to interacting with tools in the command line, but I am going to write this post as if the reader hasn’t done that much. I will also note that while this build of Kaldi is really exciting and potentially useful, especially given all the fighting I’ve done with these kinds of systems in my career, I didn’t test it thoroughly because it is only Stephen’s experiment complementing the grant project. I’d love to get feedback on issues you might encounter! Also I’ve got to thank Stephen and HiPSTAS!! THANK YOU Stephen!!
SET UP AND USE:
The first step is to download Docker (https://www.docker.com/). You then need to go into Docker’s preferences, under Advanced, and make sure that Docker has access to at least 6 GB of RAM. Add more if you’d like.
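If you would rather confirm the memory setting from a script than from the preferences window, here is a minimal Python sketch. It assumes the docker command is on your PATH and that “6 GB” means 6 × 1024³ bytes; the function names are made up for this post and are not part of Docker or of Stephen’s image.

```python
import subprocess

REQUIRED_GB = 6  # minimum suggested in this post


def bytes_to_gb(n_bytes: int) -> float:
    """Convert bytes to gigabytes (assumption: 1 GB = 1024**3 bytes)."""
    return n_bytes / 1024**3


def has_enough_memory(mem_total_bytes: int, required_gb: float = REQUIRED_GB) -> bool:
    """Return True when the reported memory meets the suggested minimum."""
    return bytes_to_gb(mem_total_bytes) >= required_gb


def docker_memory_bytes() -> int:
    """Ask the Docker CLI how much memory the engine reports (in bytes)."""
    result = subprocess.run(
        ["docker", "info", "--format", "{{.MemTotal}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())
```

With Docker running, has_enough_memory(docker_memory_bytes()) gives a yes/no answer. The figure Docker reports may come out a little lower than the number you typed into the preferences, so treat a near miss as a prompt to look at the setting rather than as a hard failure.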
Then navigate to the Terminal and pull Stephen’s Docker image for Kaldi. The command is “docker pull -a hipstas/kaldi-pop-up-archive”. (Note: Stephen’s GitHub says that you can run the pull without options, but I got errors if I ran it without “-a”). This is a big 12gb download, so go do something else while it finishes. I ate some Thanksgiving leftovers.
When everything is finished downloading, set up the image by running the command “docker run -it --name kaldi_pua --volume ~/Desktop/audio_in/:/audio_in/ hipstas/kaldi-pop-up-archive:v1”. This starts the Kaldi Docker image and creates a new folder on your desktop, named “audio_in”, where you can add media files you want to run through Kaldi. This is also the place where Kaldi will write the output. Add some media to the folder BUT NOTE: the filenames cannot have spaces or uncommon characters or Kaldi will fail. My test of this setup ran well on some short mp4s. Also, your Terminal will now be controlling the Docker image, so your command line prompt will look different than it did, and you won’t be “in” your computer’s file system until you exit the Docker image.
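Since a single bad filename can make Kaldi fail, it can help to check the folder before starting a batch. Here is a minimal Python sketch of that check. The rule it uses (only plain letters, digits, dots, underscores, and hyphens count as safe) is an assumption, since the exact set of characters Kaldi rejects isn't documented in this post, and the function names are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: a "safe" filename contains only ASCII letters, digits,
# dots, underscores, and hyphens. The real limits may be looser.
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def is_safe_filename(name):
    """Return True if the whole name matches the conservative safe pattern."""
    return SAFE_NAME.fullmatch(name) is not None


def find_risky_files(folder):
    """Return the sorted names of files in `folder` that should be renamed."""
    return sorted(
        p.name for p in folder.iterdir()
        if p.is_file() and not is_safe_filename(p.name)
    )


if __name__ == "__main__":
    # The folder created by the "docker run" command above.
    audio_in = Path.home() / "Desktop" / "audio_in"
    if audio_in.is_dir():
        for name in find_risky_files(audio_in):
            print("Rename before running Kaldi:", name)
```

Run it from your normal Terminal (not from inside the Docker image), and rename anything it lists.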
Once a batch is started with the “sh ./setup.sh” command, Kaldi will work through it, and a ton of text will continue to roll through your Terminal. Don’t be afraid that it is taking forever. Kaldi is meant to run on very powerful computers, and running it this way is slow. I tested on a 30 minute recording, and it took 2.5 hrs to process. It will go faster the more computing power you allow Docker to use, but it is reasonable to assume that on most computers processing will take around 5 times the recording length.
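To plan a batch, that rule of thumb can be turned into a quick estimate. This is only back-of-the-envelope arithmetic based on the single 30 minute test above; the function name is made up for illustration, and the factor of 5 will vary with your computer and with how much power Docker is allowed to use.

```python
def estimate_processing_minutes(recording_minutes, slowdown=5.0):
    """Estimate processing time as recording length times a slowdown factor.

    The default factor of 5 is the rough figure from one 30 minute test,
    not a measured constant.
    """
    return recording_minutes * slowdown


if __name__ == "__main__":
    # Hypothetical batch: recording lengths in minutes.
    batch = [30, 12, 58]
    total = sum(estimate_processing_minutes(m) for m in batch)
    print(f"Estimated processing time: {total / 60:.1f} hours")
```

For a single 30 minute recording the estimate is 150 minutes, which matches the 2.5 hrs it took in my test.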
The setup script converts wav, mp3, and mp4 to a 16 kHz broadcast WAV, which is the input that Kaldi requires. You might need to manually convert your media to broadcast WAV if the setup script doesn’t work. I started out by testing a broadcast WAV that I made myself with FFmpeg, but Kaldi and/or the setup script didn’t like it. I didn’t resolve that problem because the Kaldi image runs fine on media that it converts itself, so that saves me the trouble anyways.
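If you do want to try a manual conversion, here is a hedged Python sketch that builds an FFmpeg command for a 16 kHz WAV. Only the 16 kHz sample rate comes from this post. The mono channel and 16-bit PCM settings are assumptions, the function name is made up, and this produces a plain WAV without any broadcast WAV metadata. Since my own FFmpeg-made file was rejected, treat this as a starting point for experimenting rather than a known-good recipe.

```python
import shlex
from pathlib import Path


def build_ffmpeg_command(source, target):
    """Build (but do not run) an FFmpeg command for a 16 kHz WAV."""
    return [
        "ffmpeg",
        "-i", str(source),     # input media file
        "-ar", "16000",        # 16 kHz sample rate, per the post
        "-ac", "1",            # assumption: mono audio
        "-c:a", "pcm_s16le",   # assumption: 16-bit PCM audio
        str(target),
    ]


if __name__ == "__main__":
    # Hypothetical filenames, for illustration only.
    cmd = build_ffmpeg_command(Path("interview_01.mp3"), Path("interview_01.wav"))
    # Print the command so it can be reviewed before running it by hand.
    print(shlex.join(cmd))
```

The sketch only prints the command. Paste it into your normal Terminal if you want to run it, assuming FFmpeg is installed.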
When Kaldi is done processing, the text output will be in the “transcripts” folder inside the “audio_in” folder. There will be both a JSON and txt file for every recording processed, named the same as the original media file. The quality of the output depends greatly on the original quality of the recording, and how closely the recording resembles the language model (in this case, a studio recording of individuals speaking standard American English). That said, we’ve had some pretty good results in our tests. NOTE THAT if you haven’t assigned enough power to Docker, Kaldi will fail to process, and will do so without reporting an error. The failed files will create output JSON and txt files that are blank. If you’re having trouble, try adding more RAM to Docker, or check that your media file is successfully converting to broadcast WAV.
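Because those failures are silent, it is worth scanning the output for blank files after a batch. Here is a minimal Python sketch, assuming that “blank” means a file that is empty or contains only whitespace. A JSON file that holds an empty structure would not be caught, and the function name is made up for illustration.

```python
from pathlib import Path


def find_blank_transcripts(transcripts_dir):
    """Return sorted names of .json and .txt files that are empty or whitespace."""
    blank = []
    for path in sorted(transcripts_dir.iterdir()):
        if path.is_file() and path.suffix.lower() in {".json", ".txt"}:
            text = path.read_text(encoding="utf-8", errors="replace")
            if not text.strip():
                blank.append(path.name)
    return blank


if __name__ == "__main__":
    # The output location described above.
    transcripts = Path.home() / "Desktop" / "audio_in" / "transcripts"
    if transcripts.is_dir():
        for name in find_blank_transcripts(transcripts):
            print("Blank output, consider re-running:", name)
```

Any file it lists probably came from a recording that failed to process.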
When you want to return your terminal to normal, use the command “exit” to shut down the image and return to your file system.
When you want to start the Kaldi image again to run another batch, open another session by running “docker start /kaldi_pua” and then “docker exec -it kaldi_pua bash”. You’ll then be in the Kaldi image and can run the batch with the “sh ./setup.sh” command.
I am sure that there are ways to update or modify the language model, or to use a different model, or to add different scripts to the Docker Kaldi, or to integrate it into bigger workflows. I haven’t spent much time exploring any of that, but I hope you found this post a helpful start. We’re going to keep it in mind as we build up our speech-to-text workflows, and we’ll be sure to share any developments. Happy speech-to-texting!!
For #GivingTuesday, please consider donating to the American Archive of Public Broadcasting! Help us continue to preserve and make accessible the archives and legacy of public media from across the nation.
A donation directly enables us to, among other things:
Grow the amount of content available to the public in our Online Reading Room
Improve our website with new features and improve functionality and discoverability of the collection
Sustain AAPB technical infrastructure so that we can continue to provide online access to the collection
Five weeks ago we started our month-long commemoration of the 50th anniversary of the Public Broadcasting Act, signed by President Lyndon Johnson on November 7, 1967. The goal of each challenge was to engage in community, discover histories, share those stories with the public, and start dialogues. We can’t tell you how much we appreciate your participation and look forward to seeing your posts this week on Current Initiatives and Memorabilia!
Show us your posters, commercials, first logos, historic photographs, and mascots! How are you using your preserved history? What initiatives are you working on now?
We invite public broadcasting organizations, museums, archives, libraries, historians, public media fans, and other cultural organizations to personalize #PubMedia50 and share the stories in your own holdings and memories.
See you there!
To get started–
“We’re teaming up with @amarchivepub and #PubMedia50 stations to celebrate #PubMedia! Join in and share your history & content!”
“We’re joining @amarchivepub in celebrating the 50th Anniversary of the Public Broadcasting Act at #PubMedia50!”
To commemorate the 50th anniversary of the signing of the Public Broadcasting Act of 1967, we’ll be posting content to celebrate the history and preservation of public broadcasting! Teaming up with @amarchivepub, #PubMedia50 stations, academics, and community members we’ll have a new #PubMedia50 theme each week. Join the conversation by tagging your post with #PubMedia50.