The National Association of Educational Broadcasters (NAEB) Collection Now Available on AAPB

The National Association of Educational Broadcasters (NAEB) Collection, now available on the AAPB website, consists of more than 5,500 radio programs from the 1950s and 1960s, created by over 100 NAEB member stations. The collection includes radio documentaries, coverage of events (hearings, meetings, conferences, and seminars), interviews, debates, and lectures on public affairs topics such as civil rights, foreign affairs, health, politics, education, and broadcasting.

These broadcasts, mostly stemming from university and public school-run radio stations, provide an in-depth look at the engagements and events of American history, as they were broadcast to and received by the general public in the twentieth century. Interview subjects and/or program participants feature a “who’s who” of mid-20th century public figures, including Hubert Humphrey, Betty Shabazz, Robert Frost, Frank Lloyd Wright, Alistair Cooke, Dr. Benjamin Spock, Margaret Mead, Studs Terkel, Dr. Albert Schweitzer, Marshall McLuhan, and Aldous Huxley. The collection also contains a notably large percentage of local content and voices, from a WDET Detroit series about local civil defense plans and policies called “Prepare for Survival,” to a series entitled “Document: Deep South,” a documentary series produced by WOUA at the University of Alabama depicting the increasing importance of the South in the economic development of the United States, to a show entitled “Search for Mental Health,” a series of talks about advances in psychiatry from the University of Chicago.

The NAEB was established in 1934 from a precursor organization, the Association of College and University Broadcasting Stations, that formed in 1925. The mission of the NAEB was to use communications technology for education and social purposes. It was an extremely successful and effective trade organization that, throughout its 60 years of existence, ushered in or helped to enable major changes in early educational broadcasting policy. In 1951, NAEB established a tape duplication exchange system in Urbana, IL, where programs produced by university radio stations across the country were copied and distributed to member stations, an early networking scheme that influenced the history of later public radio and television systems. The forerunner of CPB and its arms, NPR and PBS, the NAEB served as the primary organizer, developer, and distributor for noncommercial broadcast production and analysis between 1925 and 1981.

The NAEB Collection was contributed to the AAPB by the University of Maryland’s National Public Broadcasting Archives. The paper records of the NAEB are housed at University of Maryland and additional related materials are located at the Wisconsin Historical Society.

Access the collection here: http://americanarchive.org/special_collections/naeb

Special thanks to Stephanie Sapienza for her contributions to the curation of this collection.

AAPB Announces Collaboration with Dartmouth College Media Ecology Project

 

The American Archive of Public Broadcasting (AAPB) and Dartmouth College are pleased to announce a new collaboration in which AAPB’s Online Reading Room of public television and radio programming will now be accessible through the Media Ecology Project (MEP) at Dartmouth.

The Media Ecology Project is a digital resource directed by Dartmouth Associate Professor of Film and Media Studies Mark J. Williams. MEP provides researchers with not only online access to archival moving image collections but also with tools to participate in new interdisciplinary scholarship that produces metadata about the content of participating archives. By providing annotated knowledge about the archival materials, students and scholars add value back to the archives, making these materials more searchable in the future. The MEP aims to facilitate the awareness of and critical study of media ecology—helping to save and preserve at-risk historical media and contribute to our understanding of their role in the public sphere and in popular memory.

Through this new AAPB-Dartmouth collaboration, historic public broadcasting programs available in the AAPB Online Reading Room will be accessible through the MEP platform. Scholars, researchers and students using the MEP platform will be able to access AAPB collection materials for research, in-classroom presentations and other assignments as part of their academic and scholarly work. MEP scholarly participation spans the disciplines from Arts and Humanities to the Social Sciences, Computer Science and Medical Science. One topic that Williams will immediately pursue with students and colleagues is coverage of the civil rights era that exists in the collection.

While conducting their research via MEP, scholars will be able to give back to AAPB by creating time-based annotations and metadata under a public domain license. Basic descriptive metadata such as credit information for video and audio files is desired, but more granular time-based annotations that describe specific sub-clips within media files will designate more particular areas of scholarly interest. These sub-clips can then be utilized in research essays that are open to scholarly emphases across the academic disciplines. The annotations that students and scholars produce will be made available on the AAPB website for improved searching, navigation and discoverability across the collection and within individual digitized programs and recordings.

The American Archive of Public Broadcasting (AAPB) is a collaboration between the Library of Congress and the WGBH Educational Foundation to coordinate a national effort to preserve at-risk public media before its content is lost to posterity and provide a central web portal for access to the unique programming that public stations have aired over the past 70 years. To date, over 50,000 hours of television and radio programming by more than 100 public media organizations and archives across the United States have been digitized for long-term preservation and access. The entire collection is available on location at WGBH and the Library of Congress, and almost 31,000 programs are available online at: americanarchive.org.
For more information or to request access to specific materials at either of the two sites, researchers can request a research appointment.

Making the AAPB more accessible, usable, and engaging for scholars, researchers and students furthers AAPB’s mission to facilitate the use of historic public broadcasting materials. Further, the capacity of participants in the MEP to generate and provide tagged annotations and metadata to the AAPB will support the archive in becoming a centralized web portal for discovery of the historic content created by public broadcasting over the past 70+ years.

Historic WRVR-FM Archives to be Digitized, Preserved and Made Available in the American Archive of Public Broadcasting

Historic WRVR-FM Archives Receive CLIR Digitizing Hidden Special Collections and Archives Award

More than 4,000 hours of cultural and political radio programming from the 60s and 70s to be made public

 

Morningside Heights, NY – The Council on Library and Information Resources has awarded a grant of $330,000 to digitize, preserve, and make publicly accessible previously unavailable archives of the Peabody Award-winning radio station WRVR. “Public Radio as a Tool for Cultural Engagement in New York in the 60s and early 70s: Digitizing the Broadcasts of WRVR-FM Public Radio” is a joint project between The Riverside Church in the City of New York and the American Archive of Public Broadcasting, a collaboration between the Library of Congress and the WGBH Educational Foundation. The collection includes culturally significant non-commercial programming, including interviews, speeches, and musical interpretations on matters such as civil rights, war, and fine arts, from laypersons to famed scholars, including Martin Luther King, Jr., Malcolm X, and Pete Seeger.

Funded by the Andrew W. Mellon Foundation, the Council on Library and Information Resources’ Digitizing Hidden Collections program supports the creation of digital representations of unique content of high scholarly significance. This award will support the preservation and digitization of over 3,502 recordings representing 4,000 hours of programming from WRVR from the 1960s and early 1970s. Owned and operated by The Riverside Church from 1961 to 1976, WRVR was the first station to win a Peabody for its entire programming, in part for its coverage of the Civil Rights movement in 1963 Birmingham. In addition to featuring progressive religious and philosophical discussions with Riverside clergy, theologians, and scholars, such as Rev. Dr. Martin Luther King, Jr., WRVR programming included culturally significant topics, speakers, and performances, such as Langston Hughes’ “Jericho-Jim Crow” directed by Alvin Ailey, and interviews and readings by Robert Frost, John Ashbery, and Allen Ginsberg. The station also featured the program “Just Jazz with Ed Beach,” the collection of which currently resides at the Library of Congress.

Preservation of these materials will enhance study in many disciplines, including theology/religion, political science, and communications, especially related to American Christianity, homiletics, progressive responses to the Civil Rights movement, contemporary issues of race and sexuality, the cultural impact of the 1960s, and public radio as a tool for cultural engagement and a precursor to social media.

These recordings will be made publicly available at the American Archive of Public Broadcasting (AAPB), a collaboration between the Library of Congress and WGBH. The AAPB coordinates a national effort to preserve at-risk public media before its content is lost to posterity and provide a central web portal for access to the unique programming that public stations have aired over the past 70 years.

Sample recordings include:

  • Arthur Miller. Statement for World Theater Day, March 27, 1963. Riverside Radio, WRVR. Riverside Archives (The Riverside Church). Arthur Miller remarks on theater’s ability to speak universal truths and understanding in art, and how this particular art form, above many others, informs society’s response to war, politics, freedoms, and all matters of the human condition across nations and cultures.
  • “Listen! William Sloane Coffin Jr.: Conscience, Protest & War.” Interview on WRVR, March 31, 1968. Riverside Radio, WRVR. Riverside Archives (The Riverside Church). William Sloane Coffin Jr., chaplain at Yale University (later Riverside Senior Minister, 1977-1987), discusses his indictment for conspiracy to encourage draft evasion and the politics of the Vietnam War; peace activism, civil rights and Dr. King’s Poor People’s Campaign; and how Dr. Coffin’s privilege informs his work as a clergyperson, activist, and American.

About The Riverside Church
Located in Morningside Heights on the Upper West Side, The Riverside Church in the City of New York is one of the leading voices of Progressive Christianity, influential on America’s religious and political landscapes for more than 85 years.  Built by John D. Rockefeller Jr. and currently led by The Rev. Dr. Amy Butler, the interracial, interdenominational, and international church has long been a forum for important civic and spiritual leaders, including Dr. Martin Luther King, Jr., Nelson Mandela, President Clinton, the Dalai Lama, and countless others.  Visit www.trcnyc.org or find us on social media to learn more about our rich history and the latest news and events.

About the American Archive of Public Broadcasting
The American Archive of Public Broadcasting (AAPB) is a collaboration between the Library of Congress and the WGBH Educational Foundation to coordinate a national effort to preserve at-risk public media before its content is lost to posterity and provide a central web portal for access to the unique programming that public stations have aired over the past 70 years. To date, over 50,000 hours of television and radio programming contributed by more than 100 public media organizations and archives across the United States have been digitized for long-term preservation and access. The entire collection is available on location at the Library of Congress and WGBH, and more than 30,000 programs are available online at americanarchive.org.

About WGBH
WGBH Boston is America’s preeminent public broadcaster and the largest producer of PBS content for TV and the Web, including Masterpiece, Antiques Roadshow, Frontline, Nova, American Experience, Arthur and more than a dozen other prime-time, lifestyle, and children’s series. WGBH also is a leader in educational multimedia, including PBS LearningMedia™, and a pioneer in technologies and services that make media accessible to the 36 million Americans who are deaf, hard of hearing, blind, or visually impaired. WGBH has been recognized with hundreds of honors: Emmys, Peabodys, duPont-Columbia Awards…even two Oscars. Find more information at www.wgbh.org.

About the Library of Congress
The Library of Congress is the world’s largest library, offering access to the creative record of the United States – and extensive materials from around the world – both on site and online. It is the main research arm of the U.S. Congress and the home of the U.S. Copyright Office. Explore collections, reference services and other programs and plan a visit at loc.gov, access the official site for U.S. federal legislative information at congress.gov and register creative works of authorship at copyright.gov.

About CLIR
The Council on Library and Information Resources is an independent, nonprofit organization that forges strategies to enhance research, teaching, and learning environments in collaboration with libraries, cultural institutions, and communities of higher learning.

About the Mellon Foundation
Founded in 1969, the Andrew W. Mellon Foundation endeavors to strengthen, promote, and, where necessary, defend the contributions of the humanities and the arts to human flourishing and to the well-being of diverse and democratic societies by supporting exemplary institutions of higher education and culture as they renew and provide access to an invaluable heritage of ambitious, path-breaking work. Additional information is available at mellon.org.

AAPB Welcomes Public Broadcasting Preservation Fellowship Spring 2018 Cohort

Following up on our post this past September announcing our IMLS-funded Public Broadcasting Preservation Fellowship (PBPF) project, we’re very excited to introduce our first cohort of Public Broadcasting Preservation Fellows!

PBPF fellows, mentors and project staff at Immersion Week in Boston

The PBPF supports students enrolled in non-specialized graduate programs to pursue digital preservation projects at public broadcasting organizations around the country. The Fellowship is designed to provide graduate students with the opportunity to gain hands-on experiences in the practices of audiovisual preservation; address the need for digitization of at-risk public media materials in underserved areas; and increase audiovisual preservation education capacity in Library and Information Science graduate programs around the country.

Over the spring semester of this year (and summer semester for our second cohort), each fellow will inventory, digitize, and catalog a small collection of audiovisual media; generate technical and preservation metadata; and process the digital files for ingest into the American Archive of Public Broadcasting. The fellows will collaborate with a faculty advisor at their university to document their work in a 3-5 page handbook and video demo. The fellowship will also support a digitization station at each university for use by the fellows and future students enrolled at the universities.

Please welcome the members of our PBPF cohort:

Fellow: Virginia Angles

  • Program: Clayton State University
  • Host Organization: Georgia Public Broadcasting
  • Host Mentor: Tanya Ott, Vice President of Radio and News Content, Georgia Public Broadcasting
  • Faculty Advisor: Josh Kitchens, Director, Master of Archival Studies Program
  • Local Mentor: Kathy Christensen, former VP of News, Archives and Research at CNN

Virginia Angles is an aspiring archivist with a background in Art History and Chemistry. She is currently pursuing a second master’s in Archival Studies with a focus in digital preservation.

Fellow: Rebecca Benson

  • Program: University of Missouri
  • Host Organization: KOPN Community Radio
  • Host Mentor: Jacqueline Casteel, KOPN Community Radio
  • Faculty Advisor: Sarah Buchanan, Assistant Professor, Library and Information Science
  • Local Mentor: James Hone, Digital Archivist, University Libraries, Washington University in St. Louis

Rebecca Benson is a graduate student in the Library and Information Science Program at the University of Missouri, where she works in the Special Collections and Rare Books department of Ellis Library. Her research interests include digital communities, story-telling and reception, and the preservation of ephemeral narratives.

Fellow: Evelyn Cox

  • Program: University of Oklahoma
  • Host Organization: Oklahoma Educational Television Authority
  • Host Mentor: Janette Thornbrue, Vice President of Operations, Oklahoma Educational Television Authority
  • Faculty Advisor: Susan Burke, Interim Director and Associate Professor, School of Library and Information Studies
  • Local Mentor: Lisa Henry, Curator/Archivist, Political Communication Center, Julian P. Kantor Political Commercial Archive

Evelyn Cox is a graduate student enrolled in the Masters of Library and Information Studies (MLIS) Program at the University of Oklahoma. She obtained her undergraduate degree in English from the University of California, Los Angeles and is a wife and mother of two. She was born on the beautiful island of Guam but currently resides in Oklahoma. Evelyn has been a public school English teacher for over seventeen years. She has earned her National Board Certification in English Language Arts, has been a Great Expectations Instructor, has coached track and field, and has served on multiple grant writing and curriculum development teams. Upon graduating from the MLIS Program, Evelyn seeks to pursue a career in archives where she can combine her love of literature, history, and culture. Through archiving, she plans to take an active role in documenting and preserving history that adds to the cultural identity and awareness of the Chamorro people of Guam.

Fellow: Dena Schulze

  • Program: University of North Carolina at Chapel Hill
  • Host Organization: WUNC
  • Host Mentor: Keith Weston, Web Producer and Back Porch Music Host, WUNC
  • Faculty Advisor: Helen Tibbo, Alumni Distinguished Professor, SILS
  • Local Mentor: Erica Titkemeyer, Project Director/AV Conservator, University of North Carolina at Chapel Hill

Dena Schulze is currently pursuing her Master’s degree at the University of North Carolina at Chapel Hill in Library Science with a concentration in archives and records management. She graduated from North Carolina State University with a bachelor’s in English. She is a major movie buff and that’s what got her started on the road to a/v archiving and preservation. Dena’s dream would be to work in a film archive when she graduates. When she is not working, reading, or watching movies, she is playing with her new puppy, Bodhi, who just turned six months old! Dena is very excited about this opportunity and being a part of saving audiovisual material for future generations.

Fellow: Tanya Yule

  • Program: San Jose State University
  • Host Organization: Center for Asian American Media in collaboration with the Bay Area Video Coalition
  • Host Mentor: James Ott, Director of Finance and Administration, Center for Asian-American Media
  • Faculty Advisor: Alyce Scott, Lecturer, School of Information
  • Local Mentor: Jackie Jay, Preservation Technician, Bay Area Video Coalition

Tanya Yule is a current MLIS candidate at San José State University, focusing on archives and photography preservation; she received her BFA in photography from the San Francisco Art Institute, with a background in traditional darkroom methods, and photomechanical printing. Tanya is an intern at the Hoover Institution Archives at Stanford University, and resides in San Francisco with her husband and adorable dog Otto.

PBPF Fellows at Immersion Week in Boston – from left to right – Tanya Yule, Dena Schulze, Rebecca Benson, Virginia Angles, and Evelyn Cox.

Upcoming Webinar: Building AAPB Participation into Digitization Grant Proposals

Building AAPB Participation into Digitization Grant Proposals: Requirements, Recommendations and Workflows

Tuesday, December 12, 2017
12:00pm ET

Webinar Registration form: https://goo.gl/forms/lWWU5GgFkv09bNFi2
Direct meeting URL: http://wgbh1.adobeconnect.com/aapb_grant-proposals-1/

Curious about getting involved in the American Archive of Public Broadcasting (AAPB)?

Seeking information about the workflows and requirements for contributing digitized content and/or metadata to the AAPB?

Writing a grant proposal and want to explore collaborating with the AAPB to preserve copies of your digitized collections and/or provide an access point to your collections through the AAPB metadata portal?

Then this webinar is for you!

On Tuesday, December 12, 2017 at 12:00pm ET, the AAPB will host a webinar focused on grant writing for digitization and subsequent contribution of digital files and metadata to the AAPB.

By the end of this webinar, participants will gain an understanding of:

  • AAPB’s background and infrastructure,
  • how contributing to the AAPB could benefit your collection,
  • steps to becoming an AAPB contributor,
  • metadata and digital file format requirements and recommendations,
  • delivery procedures,
  • other workflows and considerations for contributing digital files and/or metadata to the AAPB, and
  • the value of your collection as part of a national collection and how to express that in a proposal.

Attendees will also receive advice on how to incorporate AAPB contribution into their CLIR Recordings at Risk (applications due February 9, 2018!), CLIR Digitizing Hidden Collections, or other grant proposal timelines and work plans.

Fill out this brief form to receive info about future webinars and to receive a webinar meeting invitation sent to your calendar: https://goo.gl/forms/lWWU5GgFkv09bNFi2

Anyone can join the webinar at this URL: http://wgbh1.adobeconnect.com/aapb_grant-proposals-1/

This webinar and future AAPB webinars are generously funded by The Andrew W. Mellon Foundation.

The American Archive of Public Broadcasting (AAPB) is a collaboration between the Library of Congress and the WGBH Educational Foundation to coordinate a national effort to preserve at-risk public media before its content is lost to posterity and provide a central web portal for access to the unique programming that public stations have aired over the past 60 years. To date, over 50,000 hours of television and radio programming contributed by more than 100 public media organizations and archives across the United States have been digitized for long-term preservation and access. The entire collection is available on location at the Library of Congress and WGBH, and almost 25,000 programs are available online at americanarchive.org.

“Dockerized” Kaldi Speech-to-Text Tool

At the AAPB “Crowdsourcing Anecdotes” meeting last Friday at the Association of Moving Image Archivists conference, I talked about a free “Dockerized” build of Kaldi made by Stephen McLaughlin, a PhD student at the UT Austin School of Information. I thought I would follow up on my introduction to it there by providing links to these resources, instructions for setting it up, and some anecdotes about using it. First, the best resource for this Docker Kaldi and Stephen’s work is here in the HiPSTAS GitHub: https://github.com/hipstas/kaldi-pop-up-archive. It also has detailed information for setting up and running the Docker Kaldi.

I confess that I don’t know much about computer programming and engineering besides what I need to get my work done. I am an archivist and I eagerly continue to gain more computer skills, but some of my terminology here might be kinda wrong or unclear. Anyways, Kaldi is a free speech-to-text tool that interprets audio recordings and outputs timestamped JSON and text files. This “Dockerized” Kaldi allows you to easily get a version of Kaldi running on pretty much any reasonably powerful computer. The recommended minimum is 6 GB of RAM, and I’m not sure about the CPU. The more of both the better, I’m sure.

The Docker platform provides a framework to easily download and set up a computer environment in which Kaldi can run. Kaldi is pretty complicated, but Stephen’s Docker image (https://hub.docker.com/r/hipstas/kaldi-pop-up-archive) helps us all bypass setting up Kaldi. As a bonus, it comes set up with the language model that PopUp Archive created as part of our IMLS grant with them and HiPSTAS. They trained the model using AAPB recordings. Kaldi needs a trained language model dataset to interpret audio data put through the system. Because this build of Kaldi uses the PopUp Archive model, it is already trained for American English.

I set up my Docker on my Mac laptop, so the rest of the tutorial will focus on that system, but the GitHub has information for Windows or Linux and those are not very different. By the way, these instructions will probably be really easy for people who are used to interacting with tools in the command line, but I am going to write this post as if the reader hasn’t done that much. I will also note that while this build of Kaldi is really exciting and potentially useful, especially given all the fighting I’ve done with these kinds of systems in my career, I didn’t test it thoroughly because it is only Stephen’s experiment complementing the grant project. I’d love to get feedback on issues you might encounter! Also I’ve got to thank Stephen and HiPSTAS!! THANK YOU Stephen!!

SET UP AND USE:

The first step is to download Docker (https://www.docker.com/). You then need to go into Docker’s preferences, under Advanced, and make sure that Docker has access to at least 6 GB of RAM. Add more if you’d like.

Give Docker more power!

Then navigate to the Terminal and pull Stephen’s Docker image for Kaldi. The command is “docker pull -a hipstas/kaldi-pop-up-archive”. (Note: Stephen’s GitHub says that you can run the pull without options, but I got errors if I ran it without “-a”). This is a big 12 GB download, so go do something else while it finishes. I ate some Thanksgiving leftovers.

When everything is finished downloading, set up the image by running the command “docker run -it --name kaldi_pua --volume ~/Desktop/audio_in/:/audio_in/ hipstas/kaldi-pop-up-archive:v1”. This starts the Kaldi Docker image and creates a new folder on your desktop where you can add media files you want to run through Kaldi. This is also the place where Kaldi will write the output. Add some media to the folder BUT NOTE: the filenames cannot have spaces or uncommon characters or Kaldi will fail. My test of this setup ran well on some short mp4s. Also, your Terminal will now be controlling the Docker image, so your command line prompt will look different than it did, and you won’t be “in” your computer’s file system until you exit the Docker image.
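
Because spaces or uncommon characters in a filename make Kaldi fail, it can help to clean up the names before starting a batch. What follows is only a minimal shell sketch and is not part of Stephen’s tools: the function names safe_name and sanitize_folder are made up for illustration, and the set of “safe” characters (ASCII letters, digits, dot, underscore, hyphen) is an assumption about what Kaldi tolerates.

```shell
# Print a "safe" version of a filename: every character that is not an
# ASCII letter, digit, dot, underscore, or hyphen becomes an underscore.
# (Assumption: this character set is conservative enough for Kaldi.)
safe_name() {
    printf '%s\n' "$1" | LC_ALL=C sed 's/[^A-Za-z0-9._-]/_/g'
}

# Rename every regular file in a folder to its safe name.
# Files that are already safe, or whose safe name is taken, are left alone.
sanitize_folder() {
    dir=$1
    for path in "$dir"/*; do
        if [ ! -f "$path" ]; then
            continue
        fi
        old=$(basename "$path")
        new=$(safe_name "$old")
        if [ "$old" = "$new" ]; then
            continue
        fi
        if [ -e "$dir/$new" ]; then
            echo "skipping $old: $new already exists" >&2
            continue
        fi
        mv "$path" "$dir/$new"
    done
}
```

To try it, paste the functions into a regular Terminal window (not the one that is controlling the Docker image) and run “sanitize_folder ~/Desktop/audio_in”. A file called “My File (1).mp4” would be renamed to “My_File__1_.mp4”.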

Now you need to download the script that initiates the Kaldi process. The command to download it is “wget https://raw.githubusercontent.com/hipstas/kaldi-pop-up-archive/master/setup.sh”. Once that is downloaded to the audio_in folder (and you’ve added media to the same folder) you can run a batch by executing the command “sh ./setup.sh”.

Kaldi will run through a batch, and a ton of text will continue to roll through your Terminal. Don’t be afraid that it is taking forever. Kaldi is meant to run on very powerful computers, and running it this way is slow. I tested on a 30-minute recording, and it took 2.5 hrs to process. It will go faster the more computing power you allow Docker to use, but it is reasonable to assume that on most computers the time to process will be around 5 times the recording length.
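
For planning a batch, the 5x rule of thumb can be turned into a quick calculation. This is a rough sketch that simply assumes the ratio holds on your machine, which it may not; estimate_minutes is a hypothetical helper name, not something that ships with the Docker image.

```shell
# Rough processing-time estimate for a batch, assuming about 5 minutes
# of processing per minute of audio (the rule of thumb from this post).
# Arguments: the length of each recording in whole minutes.
# Prints the estimated total processing time in minutes.
estimate_minutes() {
    total=0
    for minutes in "$@"; do
        total=$(( total + minutes ))
    done
    echo $(( total * 5 ))
}
```

Running “estimate_minutes 30” prints 150, which matches the 2.5 hours from my test. Running “estimate_minutes 30 10 20” prints 300, so a batch like that would be a good one to leave going for the afternoon.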

Picture of Kaldi doing its thing

The setup script converts WAV, MP3, and MP4 files to a 16 kHz broadcast WAV, which is the input that Kaldi requires. You might need to manually convert your media to broadcast WAV if the setup script doesn’t work. I started out by testing a broadcast WAV that I made myself with FFmpeg, but Kaldi and/or the setup script didn’t like it. I didn’t resolve that problem because the Kaldi image runs fine on media that it converts itself, so that saves me the trouble anyways.

When Kaldi is done processing, the text output will be in the “audio_in” folder, in the “transcripts” folder. There will be both a JSON and txt file for every recording processed, named the same as the original media file. The quality of the output depends greatly on the original quality of the recording, and how closely the recording resembles the language model (in this case, a studio recording of individuals speaking standard American English). That said, we’ve had some pretty good results in our tests. NOTE THAT if you haven’t assigned enough power to Docker, Kaldi will fail to process, and will do so without reporting an error. The failed files will create output JSON and txt files that are blank. If you’re having trouble, try adding more RAM to Docker, or checking that your media file is successfully converting to broadcast WAV.
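
Since failures are silent, one way to spot them after a batch is to look for blank output files. The sketch below is an illustration only: list_blank_transcripts is a hypothetical name, and “blank” is assumed to mean a file that is empty or contains nothing but whitespace. If a failed run writes something else (an empty JSON structure, for example), this check would not catch it.

```shell
# List transcript files (.json or .txt) that are empty or whitespace-only.
# Argument: path to the transcripts folder.
list_blank_transcripts() {
    for path in "$1"/*.json "$1"/*.txt; do
        if [ ! -f "$path" ]; then
            continue
        fi
        # grep succeeds only if the file has at least one non-space character.
        if ! grep -q '[^[:space:]]' "$path"; then
            printf '%s\n' "$path"
        fi
    done
}
```

Run it from a regular Terminal window with “list_blank_transcripts ~/Desktop/audio_in/transcripts”. Any file it prints belongs to a recording that is worth running again after giving Docker more RAM.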

When you want to return your terminal to normal, use the command “exit” to shut down the image and return to your file system.

When you want to start the Kaldi image again to run another batch, open another session by running “docker start /kaldi_pua” and then “docker exec -it kaldi_pua bash”. You’ll then be in the Kaldi image and can run the batch with the “sh ./setup.sh” command.

I am sure that there are ways to update or modify the language model, or to use a different model, or to add different scripts to the Docker Kaldi, or to integrate it into bigger workflows. I haven’t spent much time exploring any of that, but I hope you found this post a helpful start. We’re going to keep it in mind as we build up our speech-to-text workflows, and we’ll be sure to share any developments. Happy speech-to-texting!!

WGBH Awarded Grant by Institute of Museum and Library Services for Public Broadcasting Preservation Fellowship

Grant of $229,772 will fund students’ work on digitization of historic, at-risk public media content from underrepresented regions and communities

BOSTON, September 28, 2017 – WGBH Educational Foundation is pleased to announce that the Institute of Museum and Library Services (IMLS) has awarded WGBH a $229,772 Laura Bush 21st Century Librarian Program grant to launch the Public Broadcasting Preservation Fellowship. The fellowship will fund 10 graduate students from across the United States to digitize at-risk audiovisual materials at public media organizations near their universities. The digitized content will ultimately be incorporated into the American Archive of Public Broadcasting (AAPB), a collaboration between Boston public media station WGBH and the Library of Congress working to digitize and preserve thousands of broadcasts and previously inaccessible programs from public radio and public television’s more than 60-year legacy.

“We are honored that the Institute of Museum and Library Services has chosen WGBH to lead the Public Broadcasting Preservation Fellowship,” said Casey Davis Kaufman, Associate Director of the WGBH Media Library and Archives and WGBH’s AAPB Project Manager. “This grant will allow us to prepare a new generation of library and information science professionals to save at-risk and historically significant public broadcasting collections, especially fragile audiovisual materials, from regions and communities underrepresented in the American Archive of Public Broadcasting.”

WGBH has developed partnerships with library and information science programs and archival science programs at five universities: Clayton State University, University of North Carolina at Chapel Hill, University of Oklahoma, University of Missouri, and San Jose State University. Each school will be paired with a public media organization that will serve as a host site for two consecutive fellowships: Georgia Public Broadcasting, WUNC, the Oklahoma Educational Television Authority, KOPN Community Radio, and the Center for Asian American Media in partnership with the Bay Area Video Coalition.

“As centers of learning and catalysts of community change, libraries and museums connect people with programs, services, collections, information, and new ideas in the arts, sciences, and humanities. They serve as vital spaces where people can connect with each other,” said IMLS Director Dr. Kathryn K. Matthew. “IMLS is proud to support their work through our grant making as they inform and inspire all in their communities.”

The first fellowship will take place during the 2018 spring semester, from January to April of 2018. The second fellowship will take place during the summer semester from June to August of 2018. The grant also will support participating universities in developing long-term audiovisual preservation curricula, including providing funding for audiovisual digitization equipment, and developing partnerships with local public media organizations.

### 

About WGBH
WGBH Boston is America’s preeminent public broadcaster and the largest producer of PBS content for TV and the Web, including Masterpiece, Antiques Roadshow, Frontline, Nova, American Experience, Arthur, Curious George, and more than a dozen other prime-time, lifestyle, and children’s series. WGBH also is a leader in educational multimedia, including PBS LearningMedia, and a pioneer in technologies and services that make media accessible to the 36 million Americans who are deaf, hard of hearing, blind, or visually impaired. WGBH has been recognized with hundreds of honors: Emmys, Peabodys, duPont-Columbia Awards…even two Oscars. Find more information at www.wgbh.org.

About the Library of Congress
The Library of Congress is the world’s largest library, offering access to the creative record of the United States – and extensive materials from around the world – both on site and online. It is the main research arm of the U.S. Congress and the home of the U.S. Copyright Office.  Explore collections, reference services and other programs and plan a visit at loc.gov, access the official site for U.S. federal legislative information at congress.gov and register creative works of authorship at copyright.gov.

About the American Archive of Public Broadcasting
The American Archive of Public Broadcasting (AAPB) is a collaboration between the Library of Congress and the WGBH Educational Foundation to coordinate a national effort to preserve at-risk public media before its content is lost to posterity and provide a central web portal for access to the unique programming that public stations have aired over the past 60 years. To date, nearly 50,000 hours of television and radio programming contributed by more than 100 public media organizations and archives across the United States have been digitized for long-term preservation and access. The entire collection is available on location at WGBH and the Library of Congress, and more than 22,000 programs are available online at americanarchive.org.

About IMLS
The Institute of Museum and Library Services is celebrating its 20th Anniversary. IMLS is the primary source of federal support for the nation’s 123,000 libraries and 35,000 museums. Our mission has been to inspire libraries and museums to advance innovation, lifelong learning, and cultural and civic engagement. For the past 20 years, our grant making, policy development, and research have helped libraries and museums deliver valuable services that make it possible for communities and individuals to thrive. To learn more, visit http://www.imls.gov and follow us on Facebook, Twitter and Instagram.

Introducing an audio labeling toolkit

In 2015, the Institute of Museum and Library Services (IMLS) awarded WGBH, on behalf of the American Archive of Public Broadcasting, a grant to address the challenges faced by many libraries and archives trying to provide better access to their media collections through online discoverability. Through a collaboration with Pop Up Archive and HiPSTAS at the University of Texas at Austin, our project has supported the creation of speech-to-text transcripts for the initial 40,000 hours of historic public broadcasting preserved in the AAPB, the launch of a free open-source speech-to-text tool, and FIX IT, a game that allows the public to help correct our transcripts.

Now, our colleagues at HiPSTAS are debuting a new machine learning toolkit and DIY techniques for labeling speakers in “unheard” audio — audio that is not documented in a machine-generated transcript. The toolkit was developed through a massive effort using machine learning to identify notable speakers’ voices (such as Martin Luther King, Jr. and John F. Kennedy) from within the AAPB’s 40,000 hour collection of historic public broadcasting content.

This effort has vast potential for archivists, researchers, and other organizations seeking to discover and make accessible sound at scale — sound that otherwise would require a human to listen and identify in every digital file.

Read more about the audio labeling toolkit here, and stay tuned for more posts in this series.

PBS NewsHour Digitization Project Update: “Asset Review” and Access and Description Workflows

I’ve previously written about developing and automating management of our workflows for the NewsHour project (click for link), and WGBH’s processes for ingesting and preserving the NewsHour digitizations (click for link). Now that the project is moving along, and over one thousand episodes of the NewsHour are already on the AAPB (with recently added transcript search functionality!!), I thought I would share more information about our access workflows and how we make NewsHour recordings available.

In this post I will describe our “Asset Review” and “Online Workflow” phases. The “Asset Review” phase is where we determine what work we will need to do to a recording to make it available online, and the “Online Workflow” phase is where we extract metadata from a transcript, add the metadata to our repository, and make the recording available online.

The goals and realities of the NewsHour project necessitate an item-level content review of each recording. The reasons for this are distinct and compounding. The scale of the collection (nearly 10,000 assets) meant that the inventories from which we derived our metadata were generated only from legacy databases and tape labels, which are sometimes wrong. At no point were we able to confirm that the content on any tape was complete and correct prior to digitization. In fact, some of the tapes were unplayable until they had been prepared for digitization. Additionally, there is third-party content that needs to be redacted from some episodes of the NewsHour before they can be made available. A major complication is that the transcripts only match 7pm Eastern broadcasts, and sometimes 9pm or 11pm updates would be recorded and broadcast if breaking news occurred. The tapes are not always marked with broadcast times, and sometimes do not contain the expected content – or even an episode of the NewsHour!

These complications would be fine if we were only preserving the collection, but our project goal is to make each recording and corresponding transcript or closed caption file broadly accessible. To accomplish that goal each record must have good metadata, and to have that we must review and describe each record! Luckily, some of the description, redaction, and our workflow tracking is automatable.

Access and Description Workflow Overview

As I’ve mentioned before, we coordinate and document all our NewsHour work in a large Google Sheet we call the “NewsHour Workflow workbook” (click here for link). The chart below explains how a GUID moves through sheets of the NewsHour workbook throughout our access and description work.

NewsHour_AccessWorkflowChart.png
AAPB NewsHour Access and Description workflow chart

After a digitized recording has been delivered to WGBH and preserved, it is automatically placed in queue on the “Asset Review” sheet of our workbook. During the Asset Review, the reviewer answers thirteen different questions about the GUID. Using these responses, the Google Sheet automatically places the assets into the appropriate workflow trackers in our workbook. For instance, if a recording doesn’t have a transcript, it is placed in the “No Transcript tracker”, which has extra workflow steps for generating a description and subject metadata. A GUID can have multiple issues that place it into multiple trackers simultaneously. For instance, a tape that is not an episode will also not have a transcript, and will be placed on both the “Not an Episode tracker” and the “No Transcript tracker”. The Asset Review is critical because the answers determine the work we must perform, and because it ensures that each record will be correctly presented to the public when work on it is completed.
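The routing described above lives in Google Sheet formulas, but the logic can be sketched in a few lines of Python. This is only an illustration under assumptions: the tracker names come from this post, while the function name and the answer keys (“is_episode”, “has_transcript”) are hypothetical stand-ins for the sheet’s columns, and only two of the trackers are modeled.

```python
def trackers_for(review):
    """Return the workflow trackers a reviewed GUID should be placed on.

    `review` maps hypothetical answer keys to the controlled values
    "Yes" or "No". Only two of the workbook's trackers are modeled here.
    """
    trackers = []
    if review.get("is_episode") == "No":
        trackers.append("Not an Episode tracker")
    if review.get("has_transcript") == "No":
        trackers.append("No Transcript tracker")
    return trackers


if __name__ == "__main__":
    # A tape that is not an episode and therefore has no transcript
    # is placed on both trackers at once.
    print(trackers_for({"is_episode": "No", "has_transcript": "No"}))
```

A GUID that returns an empty list has no outstanding issues in this simplified model, which corresponds to the recordings that go straight to the online workflow.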

A GUID’s status in the various trackers is reflected in the “Master GUID Status sheet”, and is automatically updated when different criteria in the trackers are met and documented. When a GUID’s workflow tasks have been completely resolved in all the trackers, it appears as “Ready to go online” on the “Master GUID Status sheet.” The GUID is then automatically placed into the “AAPB Online Status tracker”, which presents the metadata necessary to put the GUID online and indicates if tasks have been completed in the “Online Workflow tracker”. When all tasks are completed, the GUID will be online and our work on the GUID is finished.

In this post I am focusing on a workflow that follows digitizations which don’t have problems. This means the GUIDs are episodes, contain no technical errors, and have transcripts that match (green arrows in the chart). In future blog posts I’ll elaborate on our workflows for recordings that go into the other trackers (red arrows).

Asset Review

NewsHour_AssetReview
An image of a portion of our Asset Review spreadsheet

Each row of the “Asset Review sheet” represents one asset, or GUID. Columns A-G (green cell color) on the sheet are filled with descriptive and administrative metadata describing each item. This metadata is auto-populated from other sheets in the workbook. Columns H-W (yellow cell color) are the reviewer’s working area, with questions to answer about each item reviewed. As mentioned earlier, the answers to the questions determine the actions that need to be taken before the recording is ready to go online, and place the GUID into the appropriate workflow trackers.

The answers to some questions on the sheet impact the need to answer others, and cells auto-populate with “N/A” when one answer precludes another. Almost all the answers require controlled values, and the cells will not accept input besides those values. If any of the cells are left blank (besides questions #14 and #15) the review will not register as completed on the “Master GUID Status sheet”. I have automated and applied value control to as much of the data entry in the workbook as possible, because doing so helps mitigate human error. The controlled values also facilitate workbook automation, because we’ve programmed different actions to trigger when specific expected text strings appear in cells. For instance, the answer to “Is there a transcript for this video?” must be “Yes” or “No”, and those are the only inputs the cell will accept. A “No” answer places the GUID on the “No Transcript tracker”, and a “Yes” does not.
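For readers who think in code rather than spreadsheet rules, here is a rough Python sketch of the same ideas: controlled values, “N/A” auto-population, and the completeness check. Everything in it is an assumption made for illustration. The field names, the small set of controlled values, and the function names are hypothetical, and the real workbook does this with Google Sheets data validation and formulas rather than a script.

```python
# Hypothetical field names standing in for a few of the sheet's columns.
CONTROLLED = {
    "is_episode": {"Yes", "No"},
    "has_transcript": {"Yes", "No"},
    "transcript_matches": {"Yes", "No", "N/A"},
}
# Stand-ins for questions #14 and #15, which may be left blank.
OPTIONAL = {"promo_exhibit", "notes"}


def apply_na(review):
    """Fill dependent answers with 'N/A' when an earlier answer precludes them."""
    filled = dict(review)
    if filled.get("has_transcript") == "No":
        filled["transcript_filename"] = "N/A"
        filled["transcript_matches"] = "N/A"
    return filled


def validate(review, required):
    """Return a list of problems; an empty list means the review is complete."""
    problems = []
    for field in required:
        value = review.get(field, "")
        if value == "":
            # Blank cells block completion unless the question is optional.
            if field not in OPTIONAL:
                problems.append(f"{field}: blank")
        elif field in CONTROLLED and value not in CONTROLLED[field]:
            problems.append(f"{field}: {value!r} is not an accepted value")
    return problems


if __name__ == "__main__":
    review = apply_na({"is_episode": "Yes", "has_transcript": "No"})
    print(validate(review, ["is_episode", "has_transcript", "transcript_matches"]))
```

Restricting input to a fixed set of strings is what makes the downstream triggers dependable, since an exact match on “No” can only come from a deliberate answer.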

To review an item, staff open the GUID on an access hard drive. We have multiple access drives which contain copies of all the proxy files for the delivered NewsHour digitizations. Reviewers are expected to watch between one and a half and three minutes of the beginning, middle, and end of a recording, and to check for errors while fast-forwarding through everything not watched. The questions reviewers answer are:

  1. Is this video a nightly broadcast episode?
  2. If an episode, is the recording complete?
  3. If incomplete, describe the incompleteness.
  4. Is the date we have recorded in the metadata correct?
  5. If not, what is the corrected date?
  6. Has the date been updated in our metadata repository, the Archival Management System?
  7. Is the audio and video as expected, based on the digitization vendor’s transfer notes?
  8. If not, what is wrong with the audio or video?
  9. Is there a transcript for this video?
  10. If yes, what is the transcript’s filename?
  11. Does the video content completely match the transcript?
  12. If no, in what ways and where doesn’t the transcript match?
  13. Does the closed caption file match completely (if one exists)?
  14. Should this video be part of a promotional exhibit?
  15. Any notes to project manager?
  16. Date the review is completed.
  17. Initials of the reviewer.

Our internal documentation has specific guidelines on how to answer each of these questions, but I will spare you those details! If you’re conducting quality control and description of media at your institution, these questions are probably familiar to you. After a bit of practice reviewers become adept at locating transcripts, reviewing content, and answering the questions. Each asset takes about ten minutes to review if the transcript matches, the content is the expected recording, and the digitization is error free. If any of those criteria are not true, the review will take longer. The review is laborious, but an essential step to make the records available.

Online Workflow

A large majority of recordings are immediately ready to go online following the asset review. These ready GUIDs are automatically placed into the “AAPB Online Status tracker,” where we track the workflow to generate metadata from the transcript and upload that and the recording to the AAPB.

About once a month I use the “AAPB Online Status tracker” to generate a list of GUIDs and corresponding transcripts and closed caption files that are ready to go online. To do this, all I have to do is filter for GUIDs in the “AAPB Online Status tracker” that have the workflow status “Incomplete” and copy the relevant data for those GUIDs out of the tracker and into a text file. I import this list into a FileMaker tool we call “NH-DAVE” that our Systems Analyst constructed for the project.
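I do this filter-and-copy step by hand in the Google Sheet, but the same operation could be scripted against an export of the tracker. The Python sketch below is only an illustration: the column names (“guid”, “workflow_status”, “transcript_file”, “caption_file”), the tab-delimited output, and the function name are all assumptions, not the actual layout of our tracker or the format “NH-DAVE” expects.

```python
import csv


def export_incomplete(tracker_rows, out_path):
    """Write GUID, transcript, and caption filenames for 'Incomplete' rows.

    `tracker_rows` is a list of dicts with hypothetical column names.
    Returns the number of rows written to the tab-delimited text file.
    """
    ready = [r for r in tracker_rows if r["workflow_status"] == "Incomplete"]
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for row in ready:
            writer.writerow([row["guid"], row["transcript_file"], row["caption_file"]])
    return len(ready)


if __name__ == "__main__":
    # Made-up rows for illustration only.
    rows = [
        {"guid": "guid-0001", "workflow_status": "Incomplete",
         "transcript_file": "t1.txt", "caption_file": "c1.srt"},
        {"guid": "guid-0002", "workflow_status": "Complete",
         "transcript_file": "t2.txt", "caption_file": "c2.srt"},
    ]
    print(export_incomplete(rows, "ready_for_import.txt"))
```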

NewsHour_NHDAVE.png
A screenshot of our FileMaker tool “NH-DAVE”

“NH-DAVE” is a relational database containing all of the metadata that was originally encoded within the NewsHour transcripts. The episode transcripts provided by NewsHour contained the names of individuals appearing and subject terms for that episode in marked up values. Their subject terms were much more specific than ours, so we mapped them to the broader AAPB controlled vocabulary we use to facilitate search and discovery on our website. When I ingest a list of GUIDs and transcripts to “NH-DAVE” and click a few buttons, it uses an AppleScript to match metadata from the transcript to the corresponding NewsHour metadata records in our Archival Management System and generate SQL statements. We use the statements to insert the contributor and subject metadata from the transcripts into the GUIDs’ AAPB metadata records in the Archival Management System.
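To give a sense of what this mapping and statement generation involves, here is a small Python sketch. It is not how “NH-DAVE” works internally (that is FileMaker and AppleScript), and every specific in it is an assumption: the mapping entries are illustrative rather than our real vocabulary mapping, and the “subjects” table and its columns are a hypothetical schema, not the Archival Management System’s.

```python
# Illustrative entries only; not the real NewsHour-to-AAPB term mapping.
TERM_MAP = {
    "Supreme Court": "Politics and Government",
    "Federal Budget": "Economics",
    "Congress": "Politics and Government",
}


def broad_subjects(specific_terms):
    """Map specific terms to broad ones, dropping duplicates and unmapped terms."""
    seen = []
    for term in specific_terms:
        broad = TERM_MAP.get(term)
        if broad and broad not in seen:
            seen.append(broad)
    return seen


def subject_inserts(guid, specific_terms):
    """Build (statement, parameters) pairs for a hypothetical subjects table."""
    sql = "INSERT INTO subjects (guid, term) VALUES (?, ?)"
    return [(sql, (guid, term)) for term in broad_subjects(specific_terms)]


if __name__ == "__main__":
    for statement, params in subject_inserts("example-guid", ["Congress", "Federal Budget"]):
        print(statement, params)
```

The sketch keeps the values separate from the statement text so they can be passed as parameters, which is a choice made for this example; the post does not say how the generated statements are formatted.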

Once the transcript metadata has been ingested we use both a Bash and a Ruby script to upload the proxy recordings to our streaming service, Sony Ci, and the transcripts and closed caption SRT files to our web platform, Amazon. We run a Bash script to generate another set of SQL statements to add the Sony Ci URLs and some preservation metadata (generated during the digital preservation phase) to our Archival Management System. We then export the GUIDs’ Archival Management System records into PBCore XML and ingest the XML into the AAPB’s website. As each step of this process is completed, we document it in the “Online Workflow tracker,” which will eventually register that work on the GUID is completed. When the PBCore ingest is completed and documented on the “Online Workflow tracker,” the recording and transcript are immediately accessible online and the record displays as complete on the “Master GUID Status spreadsheet”!

We consider a record that has an accurate full text transcript, contributor names, and subject terms to be sufficiently described for discovery functions on the AAPB. The transcript and terms will be fully indexed to facilitate searching and browsing. When a transcript matches, our descriptive process for NewsHour is fully automated. This is because we’re able to utilize the NewsHour’s legacy data. Without that data, the descriptive work required for this collection would be tremendous.

A large majority of NewsHour records follow the workflow I’ve described in this post in their journey to the AAPB. If, unlike those covered here, a record is not an episode, does not have a matching transcript, needs to be redacted, or has technical errors, then it requires more work than I have outlined. Look forward to blog posts about those records in the future! Click here to see a NewsHour record that went through this workflow. If you’re interested in our workflow, I encourage you to open the workbook and use “Find” to follow this GUID (“cpb-aacip-507-0r9m32nr3f”) through the various trackers. Click here to see all NewsHour records that have been put online!

WGBH Awarded $1 Million Grant by Andrew W. Mellon Foundation to Support American Archive of Public Broadcasting

Grant will bolster capacity and usability of the American Archive of Public Broadcasting

BOSTON (June 22, 2017) – WGBH Educational Foundation is pleased to announce that the Andrew W. Mellon Foundation has awarded WGBH a $1 million grant to support the American Archive of Public Broadcasting (AAPB). The AAPB, a collaboration between Boston public media station WGBH and the Library of Congress, has been working to digitize and preserve nearly 50,000 hours of broadcasts and previously inaccessible programs from public radio and public television’s more than 60-year legacy.

WGBH will use the grant funds to build technical capacity for the intake of new content, develop collaborative initiatives, build training and support services for AAPB contributors, foster scholarly use, and enhance public access to the collection. These efforts will include the creation of advisory committees for scholars, stations and educators.

“The work of the American Archive of Public Broadcasting is crucial for preserving our public media history and making this rich vault of content available to all,” said WGBH President and CEO Jon Abbott. “I am grateful that the Mellon Foundation has recognized the invaluable efforts of our archivists to save these historic programs for the future. WGBH is honored to accept this generous grant.”

WGBH also will hire a full-time Engagement and Use Manager to lead outreach and engagement activities for the AAPB. Candidates can find the job posting on WGBH’s employment website: http://www.wgbh.org/about/employmentopportunities.cfm.

The AAPB is a national effort to preserve at-risk public media and provide a central web portal for access to the programming that public stations and producers have created over the past 60 years. In its initial phase, the AAPB digitized approximately 40,000 hours of radio and television programming and related materials selected by more than 100 public media stations and organizations across the country. The entire collection is available for research on location at WGBH and the Library, and currently more than 20,000 programs are available in the AAPB’s Online Reading Room at americanarchive.org to anyone in the United States.

###

About WGBH

WGBH Boston is America’s preeminent public broadcaster and the largest producer of PBS content for TV and the Web, including Masterpiece, Antiques Roadshow, Frontline, Nova, American Experience, Arthur, Curious George, and more than a dozen other prime-time, lifestyle, and children’s series. WGBH also is a leader in educational multimedia, including PBS LearningMedia, and a pioneer in technologies and services that make media accessible to the 36 million Americans who are deaf, hard of hearing, blind, or visually impaired. WGBH has been recognized with hundreds of honors: Emmys, Peabodys, duPont-Columbia Awards…even two Oscars. Find more information at www.wgbh.org.

About the Library of Congress

The Library of Congress is the world’s largest library, offering access to the creative record of the United States – and extensive materials from around the world – both on site and online. It is the main research arm of the U.S. Congress and the home of the U.S. Copyright Office.  Explore collections, reference services and other programs and plan a visit at loc.gov, access the official site for U.S. federal legislative information at congress.gov and register creative works of authorship at copyright.gov.

About the American Archive of Public Broadcasting

The American Archive of Public Broadcasting (AAPB) is a collaboration between the Library of Congress and the WGBH Educational Foundation to coordinate a national effort to preserve at-risk public media before its content is lost to posterity and provide a central web portal for access to the unique programming that public stations have aired over the past 60 years. To date, over 40,000 hours of television and radio programming contributed by more than 100 public media organizations and archives across the United States have been digitized for long-term preservation and access. The entire collection is available on location at WGBH and the Library of Congress, and more than 20,000 programs are available online at americanarchive.org.

About the Andrew W. Mellon Foundation

Founded in 1969, the Andrew W. Mellon Foundation endeavors to strengthen, promote, and, where necessary, defend the contributions of the humanities and the arts to human flourishing and to the well-being of diverse and democratic societies by supporting exemplary institutions of higher education and culture as they renew and provide access to an invaluable heritage of ambitious, path-breaking work. Additional information is available at mellon.org.