Today we’re launching ROLL THE CREDITS, a new Zooniverse project to engage the public in helping us catalog unseen content in the AAPB archive. Zooniverse is the “world’s largest and most popular platform for people-powered research.” Zooniverse volunteers (like you!) are helping the AAPB classify and transcribe text from extracted frames of uncataloged public television programs, providing information we can plug directly into our catalog and closing the gap on our sparsely described collection of nearly 50,000 hours of television and radio.
Example frame from ROLL THE CREDITS
The American people have made a huge investment in public radio and television over many decades. The American Archive of Public Broadcasting (AAPB) works to ensure that this rich source for American political, social, and cultural history and creativity is saved and made available once again to future generations.
The improved catalog records will have verified titles, dates, credits, and copyright statements. With the updated, verified information we will be able to make informed decisions about the development of our archive, as well as provide access to corrected versions of transcripts available for anyone to search free of charge at americanarchive.org.
In conjunction with our speech-to-text transcripts from FIX IT, a game that asks users to correct and validate the transcripts one phrase at a time, ROLL THE CREDITS helps us fulfill our mission of preserving and making accessible historic content created by public media, saving at-risk media before its contents are lost to posterity.
Thanks for supporting AAPB’s mission! Know someone who might be interested? Feel free to share with the other transcribers and public media fans in your life!
Seeking information about the workflows and requirements for contributing digitized content and/or metadata to the AAPB?
Writing a grant proposal and want to explore collaborating with the AAPB to preserve copies of your digitized collections and/or provide an access point to your collections through the AAPB metadata portal?
Then this webinar is for you!
On Tuesday, December 12, 2017 at 12:00pm ET, the AAPB will host a webinar focused on grant writing for digitization and subsequent contribution of digital files and metadata to the AAPB.
By the end of this webinar, participants will gain an understanding of:
AAPB’s background and infrastructure,
how contributing to the AAPB could benefit your collection,
steps to becoming an AAPB contributor,
metadata and digital file format requirements and recommendations,
delivery procedures,
other workflows and considerations for contributing digital files and/or metadata to the AAPB, and
the value of your collection as part of a national collection and how to express that in a proposal.
This webinar and future AAPB webinars are generously funded by The Andrew W. Mellon Foundation.
The American Archive of Public Broadcasting (AAPB) is a collaboration between the Library of Congress and the WGBH Educational Foundation to coordinate a national effort to preserve at-risk public media before its content is lost to posterity and provide a central web portal for access to the unique programming that public stations have aired over the past 60 years. To date, over 50,000 hours of television and radio programming contributed by more than 100 public media organizations and archives across the United States have been digitized for long-term preservation and access. The entire collection is available on location at the Library of Congress and WGBH, and almost 25,000 programs are available online at americanarchive.org.
At the AAPB “Crowdsourcing Anecdotes” meeting last Friday at the Association of Moving Image Archivists conference, I talked about a free “Dockerized” build of Kaldi made by Stephen McLaughlin, a PhD student at the UT Austin School of Information. I thought I would follow up on my introduction to it there by providing links to these resources, instructions for setting it up, and some anecdotes about using it. First, the best resource for this Docker Kaldi and Stephen’s work is the HiPSTAS GitHub: https://github.com/hipstas/kaldi-pop-up-archive. It also has detailed information for setting up and running the Docker Kaldi.
I confess that I don’t know much about computer programming and engineering besides what I need to get my work done. I am an archivist and I eagerly continue to gain more computer skills, but some of my terminology here might be kinda wrong or unclear. Anyway, Kaldi is a free speech-to-text tool that interprets audio recordings and outputs timestamped JSON and text files. This “Dockerized” Kaldi lets you easily get a version of Kaldi running on pretty much any reasonably powerful computer. The recommended minimum is at least 6 GB of RAM; I’m not sure about the CPU. The more of both the better, I’m sure.
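To make “timestamped JSON” concrete, here is a minimal Python sketch of reading such an output. The field names here (words, word, start, end) are my assumptions for illustration; the schema of this build’s actual output may differ.

```python
import json

# Hypothetical timestamped transcript JSON; the real Kaldi / Pop Up
# Archive output may use different field names (this is illustrative).
sample = """
{"words": [
    {"word": "public", "start": 0.42, "end": 0.88},
    {"word": "broadcasting", "start": 0.88, "end": 1.61}
]}
"""

def to_plain_text(transcript_json):
    """Join word tokens from a timestamped transcript into plain text."""
    data = json.loads(transcript_json)
    return " ".join(item["word"] for item in data["words"])

print(to_plain_text(sample))  # public broadcasting
```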
The Docker platform provides a framework to easily download and set up a computer environment in which Kaldi can run. Kaldi is pretty complicated, but Stephen’s Docker image (https://hub.docker.com/r/hipstas/kaldi-pop-up-archive) helps us all bypass setting up Kaldi. As a bonus, it comes set up with the language model that PopUp Archive created as part of our IMLS grant (link here) with them and HiPSTAS. They trained the model using AAPB recordings. Kaldi needs a trained language model dataset to interpret audio data put through the system. Because this build of Kaldi uses the PopUp Archive model, it is already trained for American English.
I set up my Docker on my Mac laptop, so the rest of the tutorial will focus on that system, but the GitHub has information for Windows and Linux, and those setups are not very different. By the way, these instructions will probably be really easy for people who are used to working with command-line tools, but I am going to write this post as if the reader hasn’t done that much. I will also note that while this build of Kaldi is really exciting and potentially useful, especially given all the fighting I’ve done with these kinds of systems in my career, I didn’t test it thoroughly because it is only Stephen’s experiment complementing the grant project. I’d love to get feedback on issues you might encounter! Also I’ve got to thank Stephen and HiPSTAS!! THANK YOU Stephen!!
SET UP AND USE:
The first step is to download Docker (https://www.docker.com/). Then go into Docker’s preferences, under Advanced, and make sure that Docker has access to at least 6 GB of RAM. Add more if you’d like.
Then open Terminal and pull Stephen’s Docker image for Kaldi. The command is “docker pull -a hipstas/kaldi-pop-up-archive”. (Note: Stephen’s GitHub says that you can run the pull without options, but I got errors unless I included “-a”.) This is a big 12 GB download, so go do something else while it finishes. I ate some Thanksgiving leftovers.
When everything is finished downloading, set up the image by running the command “docker run -it --name kaldi_pua --volume ~/Desktop/audio_in/:/audio_in/ hipstas/kaldi-pop-up-archive:v1”. This starts the Kaldi Docker image and creates a new folder on your desktop where you can add media files you want to run through Kaldi. This is also the place where Kaldi will write its output. Add some media to the folder, BUT NOTE: the filenames cannot contain spaces or uncommon characters or Kaldi will fail. My test of this setup ran well on some short mp4s. Also, your Terminal will now be controlling the Docker image, so your command line prompt will look different than it did, and you won’t be “in” your computer’s file system until you exit the Docker image.
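Since Kaldi chokes on spaces and uncommon characters, it can save headaches to normalize filenames before dropping files into the folder. A quick sketch in Python (the exact set of “safe” characters here is my assumption):

```python
import re

def sanitize_filename(name):
    """Replace runs of spaces and uncommon characters with a single
    underscore so the batch script doesn't fail on the filename."""
    # Keep letters, digits, dots, hyphens, and underscores; swap the rest.
    return re.sub(r"[^A-Za-z0-9._-]+", "_", name)

print(sanitize_filename("Episode 12 – Final Cut.mp4"))  # Episode_12_Final_Cut.mp4
```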
Kaldi will run through a batch, and a ton of text will continue to roll through your Terminal. Don’t be alarmed if it seems to take forever. Kaldi is meant to run on very powerful computers, and running it this way is slow. I tested a 30-minute recording, and it took 2.5 hours to process. It will go faster the more computing power you allow Docker to use, but it is reasonable to assume that on most computers the processing time will be around 5 times the recording length.
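As a rough way to budget a batch, you can estimate wall-clock time from the recording length; the 5x factor below is just what I observed on my laptop, not a guarantee:

```python
def estimate_processing_minutes(recording_minutes, slowdown_factor=5):
    """Rough wall-clock estimate for Kaldi-in-Docker on a typical laptop.
    The ~5x slowdown factor is an observed rule of thumb, not a guarantee."""
    return recording_minutes * slowdown_factor

# My 30-minute test recording took about 2.5 hours, matching this estimate.
print(estimate_processing_minutes(30))  # 150
```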
The setup script converts wav, mp3, and mp4 files to the 16 kHz broadcast WAV input that Kaldi requires. You might need to convert your media to broadcast WAV manually if the setup script doesn’t work. I started out by testing a broadcast WAV that I made myself with FFmpeg, but Kaldi and/or the setup script didn’t like it. I didn’t resolve that problem because the Kaldi image runs fine on media that it converts itself, which saves me the trouble anyway.
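For reference, here is a sketch of how that FFmpeg command could be assembled if you do need to convert manually. The target specs (16 kHz, mono, 16-bit PCM) are my assumption about what this build expects, and as noted, my own hand-converted file wasn’t accepted, so treat this as a starting point rather than a recipe:

```python
def ffmpeg_wav_command(src, dest):
    """Build (but don't run) an FFmpeg command for a 16 kHz mono
    16-bit PCM WAV -- assumed specs for Kaldi's broadcast WAV input."""
    return [
        "ffmpeg", "-i", src,
        "-ar", "16000",          # 16 kHz sample rate
        "-ac", "1",              # mono
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        dest,
    ]

print(" ".join(ffmpeg_wav_command("interview.mp4", "interview.wav")))
```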
When Kaldi is done processing, the text output will be in the “transcripts” folder inside the “audio_in” folder. There will be both a JSON and a txt file for every recording processed, named the same as the original media file. The quality of the output depends greatly on the original quality of the recording, and on how closely the recording resembles the language model (in this case, a studio recording of individuals speaking standard American English). That said, we’ve had some pretty good results in our tests. NOTE THAT if you haven’t assigned enough power to Docker, Kaldi will fail to process, and will do so without reporting an error. The failed files will produce JSON and txt output files that are blank. If you’re having trouble, try adding more RAM to Docker or checking that your media file is successfully converting to broadcast WAV.
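Because these failures are silent, a quick way to audit a finished batch is to scan the output folder for empty files, along these lines:

```python
import os

def find_blank_transcripts(transcripts_dir):
    """Return output files that are empty -- a sign that Kaldi silently
    failed (often too little RAM or a bad broadcast WAV conversion)."""
    blank = []
    for name in os.listdir(transcripts_dir):
        path = os.path.join(transcripts_dir, name)
        if os.path.isfile(path) and os.path.getsize(path) == 0:
            blank.append(name)
    return sorted(blank)
```

Point it at the transcripts folder inside audio_in, and re-run whatever it flags after bumping Docker’s RAM or double-checking the conversion.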
When you want to return your terminal to normal, use the command “exit” to shut down the image and return to your file system.
When you want to start the Kaldi image again to run another batch, open another session by running “docker start /kaldi_pua” and then “docker exec -it kaldi_pua bash”. You’ll then be in the Kaldi image and can run the batch with the “sh ./setup.sh” command.
I am sure that there are ways to update or modify the language model, or to use a different model, or to add different scripts to the Docker Kaldi, or to integrate it into bigger workflows. I haven’t spent much time exploring any of that, but I hope you found this post a helpful start. We’re going to keep it in mind as we build up our speech-to-text workflows, and we’ll be sure to share any developments. Happy speech-to-texting!!
Next week, American Archive of Public Broadcasting staff are presenting at several workshops on workflows, crowdsourcing, and copyright at the 2017 Association of Moving Image Archivists (AMIA) conference in New Orleans!
Check out sessions and events featuring presentations by AAPB staff below. We hope to see you there! If you are unable to attend the conference, follow along with the conversations on Twitter at #AMIA17!
THURSDAY, November 30th
1pm – 2pm, PBCore Advisory Sub-Committee Meeting
Rebecca Fraimow will report on general activities of the Sub-Committee and the PBCore Development and Training Project, and will present the Sub-Committee’s current activities.
3:30 – 4:30 pm, Let the Computer and the Public do the Metadata Work! Speakers: Karen Cariani, Senior Director, WGBH Media Library and Archives & AAPB Project Director
Tali Singer, Pop Up Archive
Tanya Clement, University of Texas at Austin, School of Information
Archives that hold A/V materials are at a critical point, with many cultural heritage institutions needing to take immediate action to safeguard at-risk media formats before the content they contain is lost forever. Yet, many in the cultural heritage communities do not have sufficient education and training in how to handle the special needs that A/V archive materials present. In the summer of 2015, a handful of archive educators and students formed a pan-institutional group to help foster “educational opportunities in audiovisual archiving for those engaged in the cultural heritage sector.” The AV Competency Framework Working Group is developing a set of competencies for audiovisual archive training of students in graduate-level education programs and in continuing education settings. In this panel, core members of the working group will discuss the main goals of the project and the progress that has been made on it thus far.
4:45 – 5:45 pm, Good Enough to Best, Tiered Born-Digital AV Processing Speakers: Rebecca Fraimow, Project Manager, WGBH Media Library and Archives
Erica Titkemeyer, University of North Carolina at Chapel Hill
Julia Kim, Library of Congress
Born-Digital audiovisual files continue to present a conundrum to archivists in the field today: should they be accepted as-is, transcoded, or migrated? Is transcoding to a recommended preservation format always worth the potential extra storage space and staff time? If so, what are the ideal target specifications? In this presentation, individuals working closely with born-digital audiovisual content from the University of North Carolina, WGBH, and the American Folklife Center at the Library of Congress will present their own use cases involving collections processing practices, from “best practice” to the practical reality of “good enough”. These use cases will highlight situations wherein video quality, subject matter, file size and stakeholder expectations end up playing important roles in directing the steps taken for preservation. From these experiences, the panel will put forth suggestions for tiered preservation decision making, recognizing that not all files should necessarily be treated alike.
5:45 – 6:45 pm, Crowdsourcing Anecdotes
Room: Arcadian I
THE QUESTION: How does the public play a role in making historical AV content accessible? The American Archive of Public Broadcasting has launched two games that engage the public in transcribing and describing 70+ years of audio and visual content comprising more than 50,000 hours.
Join us to hear lessons learned, give us feedback on our open source FIX IT game and Zooniverse “ROLL THE CREDITS” project, find out how to build an AV-focused Zooniverse project, and make use of recently released speech-to-text Kaldi language models. There might also be a New Orleans-themed surprise…
(Speech-to-Text Transcript Correction)
FIX IT is an online game that allows the public to identify and correct errors in our machine-generated transcripts. FIX IT players have exclusive access to historic content and long-lost interviews from stations across the country. Website: fixit.americanarchive.org.
ROLL THE CREDITS is a game that allows the public to identify and transcribe information about the text that appears on the screen in so many television broadcasts. ROLL THE CREDITS asks users to collect this valuable information and classify it into categories that can be added to the AAPB catalog. To accomplish this goal, we’ve extracted frames from uncataloged video files and are asking for help to transcribe the important information contained in each frame.
SATURDAY, Dec 2nd
9:45 – 10:45 am, Put it on your Bucket List: Navigating Copyright to Expose Digital AV Collections at Scale Speakers: Casey Davis Kaufman, Associate Director, WGBH Media Library and Archives & Project Manager, American Archive of Public Broadcasting
Jay Fialkov, Deputy General Counsel, WGBH
Hope O’Keeffe, Associate General Counsel, Library of Congress
Digitized collections often remain almost as inaccessible as they were on their original analog carriers, primarily due to institutional concerns about copyright infringement and privacy. The American Archive of Public Broadcasting has taken steps to overcome these challenges, making available online more than 22,000 historic programs with zero take-down notices since the 2015 launch. This copyright session will highlight practical and successful strategies for making collections available online. The panel will share strategies for: 1) developing template forms with standard terms to maximize use and access, 2) developing a rights assessment framework with limited resources (an institutional “Bucket Policy”), 3) providing limited access to remote researchers for content not available in the Online Reading Room, and 4) promoting access through online crowdsourcing initiatives.
11am – 12 pm, Building the AAPB: Inter-Institutional Preservation and Access Workflows Speakers: Charles Hosale, Special Projects Assistant, WGBH/AAPB
Leslie Bourgeois, Archivist, Louisiana Public Broadcasting
Ann Wilkens, Archivist, Wisconsin Public Television
Rachel Curtis, AAPB Project Coordinator, Library of Congress
The American Archive of Public Broadcasting seeks to preserve and make accessible significant historical public media content, and to coordinate a national effort to save at-risk public media recordings. In the four years since WGBH and the Library of Congress began stewardship of the project, significant steps have been taken towards accomplishing these goals. The effort has inspired workflows that function constructively, beginning with preservation at local stations and building to national accessibility on the AAPB. Archivists from two contributing public broadcasters will present their institutions’ local preservation and access workflows. Representatives from WGBH and the Library of Congress will discuss collaborating with contributors and the AAPB’s digital preservation and access workflows. By sharing their institutions’ roles and how collaborators participate, the speakers will present a full picture of the AAPB’s constructive inter-institutional work. Attendees will gain knowledge of practical workflows that facilitate both local and national AV preservation and access.
3:30 – 4:30 pm, Preservation is Painless: A Guide to Outsourced AV Digitization Project Management Speakers: Biz Maher Gallo, George Blood Audio/Video/Film/Data
Charles Hosale, WGBH Media Library & Archives
Robin Pike, University of Maryland Libraries
Emily Vinson, University of Houston Libraries
Rebecca Holte, New York Public Library
Erica Titkemeyer, UNC Chapel Hill Libraries
Kimberly Tarr, New York University Libraries
As an increasing number of audiovisual formats become obsolete and the available hours remaining on deteriorating playback machines decrease, it is essential for institutions to digitize their AV holdings to ensure long-term preservation and access. With an estimated hundreds of millions of items to digitize, it is impractical, even impossible, that institutions would be able to perform all of this work in-house before time runs out. While this can seem like a daunting process, why learn the hard way when you can benefit from the experiences of others? From those embarking on their first outsourced AV digitization project to those who have completed successful projects but are looking for ways to refine and scale up their process, everyone has something to learn from these speakers about managing AV digitization projects from start to finish.
Poster Session – Design for Context: Cataloging and Linked Data for Exposing National Educational Television (NET) Content
Presenters: Sadie Roosa, Project Manager, National Educational Television Collection Catalog Project
Rachel Curtis, AAPB Project Coordinator, Library of Congress
Christopher Pierce, Metadata Specialist, Library of Congress
How do you bring together a collection of broadcast materials scattered in various geographical locations across the country? National Educational Television (NET), the precursor to PBS, distributed programs nationally to educational television stations from 1954-1972. Although this collection is tied together through provenance, it presents a challenge to processing due to differing approaches in descriptive practices across many repositories over many years. By aggregating inventories into one catalog and describing titles more fully, the NET Collection Catalog will help institutions holding these materials make informed preservation decisions. By its conclusion, the AAPB will publish an online list of NET titles annotated with relevant descriptive information culled from NET textual records that will greatly improve discoverability of NET materials for archivists, scholars, and the general public. Examples of specific cataloging issues will be explored, including contradictory metadata documentation and legacy records, inconsistent titling practices, and the existence of international versions.
The Library of Congress and WGBH have acquired and preserved original, full-length interviews from The Civil War, Eyes on the Prize and American Masters
The American Archive of Public Broadcasting (AAPB) recently acquired three collections of original, full-length interviews from groundbreaking public television documentaries: Ken Burns’ The Civil War, Eyes on the Prize and American Masters. Only excerpts of these interviews were included in previously released, edited programs. Now, the full-length interviews from these landmark series will be available to view online at americanarchive.org or in person at the Library of Congress and at WGBH, preserved for future generations to learn about our nation’s history.
The AAPB, a collaboration between the Library of Congress and Boston public media station WGBH, has digitized and preserved more than 50,000 hours of broadcasts and previously inaccessible programs from public radio and public television’s more than 60-year legacy.
Interviews from Ken Burns’ The Civil War
The Civil War, an epic nine-episode series by the award-winning documentary filmmaker Ken Burns and produced in conjunction with WETA, Washington, DC and American Documentaries, Inc., first aired in September 1990 to an audience of 40 million viewers. The film is the recipient of 40 major film and television awards, including two Emmys and two Grammys.
The AAPB The Civil War collection includes eight digitized, full-length interviews with distinguished historians and commentators Barbara J. Fields, C. Vann Woodward, Robert Penn Warren, William Safire, James Symington, Stephen B. Oates, Ed Bearss and Daisy Turner. The Civil War collection is available online at http://americanarchive.org/special_collections/ken-burns-civil-war.
Interviews from Eyes on the Prize
Eyes on the Prize: America’s Civil Rights Years 1954–1965 tells the definitive story of the civil rights era from the point of view of the ordinary men and women whose extraordinary actions launched a movement that changed the fabric of American life, and embodied a struggle whose reverberations continue to be felt today. The award-winning documentary series recounts the fight to end decades of discrimination and segregation from the murder of Emmett Till and the Montgomery bus boycott in 1955 and 1956 to the 1965 Voting Rights Campaign in Selma, Alabama. Eyes on the Prize was produced by Blackside, Inc and aired on PBS in 1987.
The Eyes on the Prize interviews collection comes from Washington University Libraries’ Henry Hampton Collection and includes 75 hours of full-length interviews with leaders and activists such as Rosa Parks, Constance Baker Motley, James Farmer, Robert Moses, Andrew Young, John Lewis, Ralph Abernathy, Stokely Carmichael and Myrlie Evers. The Eyes on the Prize collection is available online at http://americanarchive.org/special_collections/eotp-i-interviews.
Interviews from American Masters
American Masters is an award-winning biography series that celebrates American arts and culture. Launched in 1986 on PBS, the series set the standard for documentary film profiles, and is produced by New York’s flagship PBS station THIRTEEN for WNET.
AAPB has preserved more than 800 full-length interviews filmed for American Masters with cultural luminaries such as David Bowie, Yoko Ono, Robert Plant, Tim Burton, Nora Ephron, Denzel Washington, Carol Burnett, Andrew Lloyd Webber, Quincy Jones and Jimmy Carter. The interviews, digitized for In Their Own Words: The American Masters Digital Archive and the American Masters Podcast, will be archived for long-term storage at the Library of Congress to ensure their survival for future generations. Researchers can access the full collection on location at the Library of Congress and at WGBH. Information about the American Masters collection is available at http://americanarchive.org/special_collections/american-masters-interviews.
The AAPB is a national effort to preserve at-risk public media and provide a central web portal for access to the programming that public stations and producers have created over the past 60 years. In its initial phase, the AAPB digitized approximately 40,000 hours of radio and television programming and related materials selected by more than 100 public media stations and organizations across the country. The entire collection is available for research on location at the Library of Congress and WGBH, and currently more than 20,000 programs are available in the AAPB’s Online Reading Room at americanarchive.org to anyone in the United States.
– – –
WGBH Boston is America’s preeminent public broadcaster and the largest producer of PBS content for TV and the Web, including Masterpiece, Antiques Roadshow, Frontline, Nova, American Experience, Arthur and more than a dozen other prime-time, lifestyle, and children’s series. WGBH also is a leader in educational multimedia, including PBS LearningMedia™, and a pioneer in technologies and services that make media accessible to the 36 million Americans who are deaf, hard of hearing, blind, or visually impaired. WGBH has been recognized with hundreds of honors: Emmys, Peabodys, duPont-Columbia Awards…even two Oscars. Find more information at www.wgbh.org.
About the Library of Congress
The Library of Congress is the world’s largest library, offering access to the creative record of the United States – and extensive materials from around the world – both on site and online. It is the main research arm of the U.S. Congress and the home of the U.S. Copyright Office. Explore collections, reference services and other programs and plan a visit at loc.gov, access the official site for U.S. federal legislative information at congress.gov and register creative works of authorship at copyright.gov.
The Library of Congress and Boston public broadcaster WGBH will celebrate the 50th anniversary of the passage of the Public Broadcasting Act of 1967 with a series of panels featuring pioneers and experts in public broadcasting Friday, Nov. 3, 2 p.m.–6 p.m. The symposium—“Preserving Public Broadcasting at 50 Years”—will be held in the Montpelier room on the sixth floor of the Library’s James Madison Memorial Building, 101 Independence Ave., SE, Washington, D.C.
Signed by President Lyndon Johnson, the act established public broadcasting as it is organized today and also authorized the Corporation for Public Broadcasting (CPB) to establish and maintain a library and archives of non-commercial educational television and radio programs. CPB established the American Archive of Public Broadcasting (AAPB) in 2009 and, in 2013, the Library of Congress and WGBH assumed responsibility of AAPB, coordinating a national effort to preserve and make accessible significant at-risk public media.
A Library report on television and video preservation in 1997 cited the importance of public broadcasting:
“[I]t is still not easy to overstate the immense cultural value of this unique audiovisual legacy, whose loss would symbolize one of the great conflagrations of our age, tantamount to the burning of Alexandria’s library in the age of antiquity.”
The initial AAPB archive, donated by more than 100 public broadcasting stations, contained more than 40,000 hours of content from the early 1950s to the present. The full collection, now more than 50,000 hours of preserved content, is available on-site to researchers at the Library in Washington, D.C., and WGBH in Boston, Massachusetts. Nearly a third of the files, however, are now available online for research, educational and informational purposes at http://americanarchive.org.
During the symposium, panelists will examine the history of public broadcasting, the origins of its news and public affairs programming, the importance of preservation and the educational uses of public broadcasting programs for K-12 and college education, scholarship and adult education. Also highlighted will be some of AAPB’s most significant collections, such as the “PBS NewsHour” and its predecessors, which are currently being digitized for online access, and full interviews conducted for “Eyes on the Prize” and “American Experience” documentaries.
The program schedule is subject to change, but confirmed participants include:
2 p.m. – Introductions and Welcoming Remarks
Carla Hayden, Librarian of Congress
Jon Abbott, President and CEO, WGBH
Patricia Harrison, President and CEO, CPB
2:15 p.m. – Origins
Nicholas Johnson, FCC commissioner, 1966-73
Bill Siemering, NPR co-founder, creator of “All Things Considered”
Newton Minow, FCC chairman, 1961-63, via video
Ervin Duggan, FCC commissioner (1990-93); President of PBS (1993-99)
Cokie Roberts, NPR and MacNeil/Lehrer contributor; AAPB adviser (moderator)
3:10 p.m. – News and Public Affairs Talk Shows
Jim Lehrer, co-anchor, “MacNeil/Lehrer NewsHour”
Dick Cavett, host of “The Dick Cavett Show,” 1977-1982
Cokie Roberts, NPR and MacNeil/Lehrer contributor; AAPB adviser
Hugo Morales, co-founder, Radio Bilingüe
Sharon Percy Rockefeller, CEO, WETA-TV
Judy Woodruff, “PBS NewsHour” (moderator)
4:10 p.m. – Documentaries: Style and the Use of Archives
David Fanning, creator, “FRONTLINE”
Clayborne Carson, founder and director of the Martin Luther King, Jr. Research and Education Institute; senior adviser, “Eyes on the Prize”
Stephen Gong, director, Center for Asian American Media
Margaret Drain, former executive producer of “American Experience”
Patricia Aufderheide, university professor of Communication Studies at American University (moderator)
5:10 p.m. – Educational Uses of Public Broadcasting
Lloyd Morrisett, co-creator, “Sesame Street”
Paula Apsell, executive producer of “NOVA”
Debra Sanchez, Senior Vice President for Education and Children’s Content Operations, Corporation for Public Broadcasting
Kathryn Ostrofsky, instructor, Angelo State University, Department of History
Jennifer Lawson, founding chief programming executive, PBS (moderator)
Grant of $229,772 will fund students’ work on digitization of historic, at-risk public media content from underrepresented regions and communities
BOSTON, September 28, 2017 – WGBH Educational Foundation is pleased to announce that the Institute of Museum and Library Services (IMLS) has awarded WGBH a $229,772 Laura Bush 21st Century Librarian Program grant to launch the Public Broadcasting Preservation Fellowship. The fellowship will fund 10 graduate students from across the United States to digitize at-risk audiovisual materials at public media organizations near their universities. The digitized content will ultimately be incorporated into the American Archive of Public Broadcasting (AAPB), a collaboration between Boston public media station WGBH and the Library of Congress working to digitize and preserve thousands of broadcasts and previously inaccessible programs from public radio and public television’s more than 60-year legacy.
“We are honored that the Institute of Museum and Library Services has chosen WGBH to lead the Public Broadcasting Preservation Fellowship,” said Casey Davis Kaufman, Associate Director of the WGBH Media Library and Archives and WGBH’s AAPB Project Manager. “This grant will allow us to prepare a new generation of library and information science professionals to save at-risk and historically significant public broadcasting collections, especially fragile audiovisual materials, from regions and communities underrepresented in the American Archive of Public Broadcasting.”
WGBH has developed partnerships with library and information science programs and archival science programs at five universities: Clayton State University, University of North Carolina at Chapel Hill, University of Oklahoma, University of Missouri, and San Jose State University. Each school will be paired with a public media organization that will serve as a host site for two consecutive fellowships: Georgia Public Broadcasting, WUNC, the Oklahoma Educational Television Authority, KOPN Community Radio, and the Center for Asian American Media in partnership with the Bay Area Video Coalition.
“As centers of learning and catalysts of community change, libraries and museums connect people with programs, services, collections, information, and new ideas in the arts, sciences, and humanities. They serve as vital spaces where people can connect with each other,” said IMLS Director Dr. Kathryn K. Matthew. “IMLS is proud to support their work through our grant making as they inform and inspire all in their communities.”
The first fellowship will take place during the 2018 spring semester, from January to April of 2018. The second fellowship will take place during the summer semester from June to August of 2018. The grant also will support participating universities in developing long-term audiovisual preservation curricula, including providing funding for audiovisual digitization equipment, and developing partnerships with local public media organizations.
About WGBH
WGBH Boston is America’s preeminent public broadcaster and the largest producer of PBS content for TV and the Web, including Masterpiece, Antiques Roadshow, Frontline, Nova, American Experience, Arthur, Curious George, and more than a dozen other prime-time, lifestyle, and children’s series. WGBH also is a leader in educational multimedia, including PBS LearningMedia, and a pioneer in technologies and services that make media accessible to the 36 million Americans who are deaf, hard of hearing, blind, or visually impaired. WGBH has been recognized with hundreds of honors: Emmys, Peabodys, duPont-Columbia Awards…even two Oscars. Find more information at www.wgbh.org.
About the Library of Congress
The Library of Congress is the world’s largest library, offering access to the creative record of the United States – and extensive materials from around the world – both on site and online. It is the main research arm of the U.S. Congress and the home of the U.S. Copyright Office. Explore collections, reference services and other programs and plan a visit at loc.gov, access the official site for U.S. federal legislative information at congress.gov and register creative works of authorship at copyright.gov.
About the American Archive of Public Broadcasting
The American Archive of Public Broadcasting (AAPB) is a collaboration between the Library of Congress and the WGBH Educational Foundation to coordinate a national effort to preserve at-risk public media before its content is lost to posterity and provide a central web portal for access to the unique programming that public stations have aired over the past 60 years. To date, nearly 50,000 hours of television and radio programming contributed by more than 100 public media organizations and archives across the United States have been digitized for long-term preservation and access. The entire collection is available on location at WGBH and the Library of Congress, and more than 22,000 programs are available online at americanarchive.org.
About IMLS
The Institute of Museum and Library Services is celebrating its 20th Anniversary. IMLS is the primary source of federal support for the nation’s 123,000 libraries and 35,000 museums. Our mission has been to inspire libraries and museums to advance innovation, lifelong learning, and cultural and civic engagement. For the past 20 years, our grant making, policy development, and research have helped libraries and museums deliver valuable services that make it possible for communities and individuals to thrive. To learn more, visit http://www.imls.gov and follow us on Facebook, Twitter and Instagram.
In 2015, the Institute of Museum and Library Services (IMLS) awarded WGBH, on behalf of the American Archive of Public Broadcasting, a grant to address the challenges faced by many libraries and archives trying to provide better access to their media collections through online discoverability. Through a collaboration with Pop Up Archive and HiPSTAS at the University of Texas at Austin, our project has supported the creation of speech-to-text transcripts for the initial 40,000 hours of historic public broadcasting preserved in the AAPB, the launch of a free open-source speech-to-text tool, and FIX IT, a game that allows the public to help correct our transcripts.
Now, our colleagues at HiPSTAS are debuting a new machine learning toolkit and DIY techniques for labeling speakers in “unheard” audio — audio that is not documented in a machine-generated transcript. The toolkit was developed through a massive effort using machine learning to identify notable speakers’ voices (such as Martin Luther King, Jr. and John F. Kennedy) from within the AAPB’s 40,000-hour collection of historic public broadcasting content.
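The core idea behind this kind of speaker labeling — comparing a fixed-length “voiceprint” embedding of an unknown clip against reference embeddings of known speakers — can be illustrated with a minimal cosine-similarity sketch. Everything here (the toy three-dimensional embeddings, the 0.8 threshold, the function names) is hypothetical; the HiPSTAS toolkit’s actual models and thresholds differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def label_speaker(clip_embedding, reference_embeddings, threshold=0.8):
    """Return the best-matching known speaker, or None if no match clears the threshold.

    reference_embeddings: {speaker_name: embedding_vector}
    """
    best_name, best_score = None, threshold
    for name, ref in reference_embeddings.items():
        score = cosine_similarity(clip_embedding, ref)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Toy "embeddings" with 3 dimensions; real speaker embeddings have hundreds.
references = {
    "Martin Luther King, Jr.": [0.9, 0.1, 0.2],
    "John F. Kennedy": [0.1, 0.9, 0.3],
}
print(label_speaker([0.88, 0.12, 0.18], references))
```

The threshold is what lets the approach say “none of the above” for voices outside the reference set — essential at archive scale, where most speakers are not notable figures.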
This effort has vast potential for archivists, researchers, and other organizations seeking to discover and make accessible sound at scale — sound that otherwise would require a human to listen and identify in every digital file.
As part of our NEH-funded PBCore Development and Training Project, we’re developing tools and resources around PBCore, a metadata schema and data model designed to describe and manage audiovisual collections.
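PBCore records are expressed as XML. As a rough illustration of the schema’s shape, here is a sketch that builds a minimal pbcoreDescriptionDocument with Python’s standard library; the GUID, title, and description values are made up for the example, and a real record would carry many more elements.

```python
import xml.etree.ElementTree as ET

PBCORE_NS = "http://www.pbcore.org/PBCore/PBCoreNamespace.html"
ET.register_namespace("", PBCORE_NS)

def minimal_pbcore_record(guid, title, description):
    """Build a minimal pbcoreDescriptionDocument and return it as an XML string."""
    root = ET.Element(f"{{{PBCORE_NS}}}pbcoreDescriptionDocument")
    # pbcoreIdentifier carries the record's ID and names its source authority
    ident = ET.SubElement(root, f"{{{PBCORE_NS}}}pbcoreIdentifier", source="AAPB")
    ident.text = guid
    ET.SubElement(root, f"{{{PBCORE_NS}}}pbcoreTitle").text = title
    ET.SubElement(root, f"{{{PBCORE_NS}}}pbcoreDescription").text = description
    return ET.tostring(root, encoding="unicode")

xml_record = minimal_pbcore_record(
    "cpb-aacip-507-zw18k75z4h",  # hypothetical GUID, for illustration only
    "The MacNeil/Lehrer NewsHour",
    "Nightly news broadcast.",
)
print(xml_record)
```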
Based on feedback from a previous survey of users and potential users, we’ve generated a list of tools and resources that respondents indicated would be valuable to the archival and broadcasting communities. Now, we’re looking for feedback on what to prioritize so that our work will be of real use to the archives and public media communities.
Please fill out this short survey – which should take at most five minutes – to check out our development plans and give your feedback on where we should focus our efforts: https://www.surveymonkey.com/r/WPF3QZD
Thanks for taking the time to fill out the survey! You can read more about the PBCore Development and Training Project here and see the PBCore website here.
In this post I will describe our “Asset Review” and “Online Workflow” phases. The “Asset Review” phase is where we determine what work we will need to do to a recording to make it available online, and the “Online Workflow” phase is where we extract metadata from a transcript, add the metadata to our repository, and make the recording available online.
The goals and realities of the NewsHour project necessitate an item-level content review of each recording. The reasons for this are distinct and compounding. Because of the scale of the collection (nearly 10,000 assets), the inventories from which we derived our metadata were generated only from legacy databases and tape labels, which are sometimes wrong. At no point were we able to confirm that the content on any tape was complete and correct prior to digitization; in fact, some of the tapes are unplayable until they have been prepared for digitization. Additionally, there is third-party content that needs to be redacted from some episodes of the NewsHour before they can be made available. A major complication is that the transcripts only match the 7pm Eastern broadcasts, while 9pm or 11pm updates would sometimes be recorded and broadcast if breaking news occurred. The tapes are not always marked with broadcast times, and sometimes do not contain the expected content – or even an episode of the NewsHour!
These complications would be fine if we were only preserving the collection, but our project goal is to make each recording and corresponding transcript or closed caption file broadly accessible. To accomplish that goal each record must have good metadata, and to have that we must review and describe each record! Luckily, some of the description, redaction, and our workflow tracking is automatable.
Access and Description Workflow Overview
As I’ve mentioned before, we coordinate and document all our NewsHour work in a large Google Sheet we call the “NewsHour Workflow workbook.” The chart below explains how a GUID moves through sheets of the NewsHour workbook throughout our access and description work.
After a digitized recording has been delivered to WGBH and preserved, it is automatically placed in the queue on the “Asset Review” sheet of our workbook. During the Asset Review, the reviewer answers thirteen different questions about the GUID. Using these responses, the Google Sheet automatically places the asset into the appropriate workflow trackers in our workbook. For instance, if a recording doesn’t have a transcript, it is placed in the “No Transcript tracker”, which has extra workflow steps for generating description and subject metadata. A GUID can have multiple issues that place it into multiple trackers simultaneously. For instance, a tape that is not an episode will also not have a transcript, and will be placed on both the “Not an Episode tracker” and the “No Transcript tracker”. The Asset Review is critical because the answers determine the work we must perform and ensure that each record will be correctly presented to the public when work on it is completed.
A GUID’s status in the various trackers is reflected in the “Master GUID Status sheet”, and is automatically updated when different criteria in the trackers are met and documented. When a GUID’s workflow tasks have been completely resolved in all the trackers, it appears as “Ready to go online” on the “Master GUID Status sheet.” The GUID is then automatically placed into the “AAPB Online Status tracker”, which presents the metadata necessary to put the GUID online and indicates if tasks have been completed in the “Online Workflow tracker”. When all tasks are completed, the GUID will be online and our work on the GUID is finished.
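The roll-up logic described above amounts to: a GUID is “Ready to go online” only when every tracker it was routed into reports its tasks resolved. A minimal sketch, with illustrative tracker names and status strings (the real workbook does this with spreadsheet formulas, not code):

```python
def master_status(tracker_statuses):
    """Aggregate per-tracker completion for one GUID.

    tracker_statuses: {tracker_name: True if all tasks in that tracker
    are resolved, else False}. An empty dict means the GUID was never
    routed into any problem tracker.
    """
    if all(tracker_statuses.values()):
        return "Ready to go online"
    pending = [name for name, done in tracker_statuses.items() if not done]
    return "Pending: " + ", ".join(sorted(pending))

print(master_status({"No Transcript tracker": True, "Not an Episode tracker": True}))
print(master_status({"No Transcript tracker": False}))
```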
In this post I am focusing on the workflow for digitizations that don’t have problems. This means the GUIDs are episodes, contain no technical errors, and have transcripts that match (green arrows in the chart). In future blog posts I’ll elaborate on our workflows for recordings that go into the other trackers (red arrows).
Each row of the “Asset Review sheet” represents one asset, or GUID. Columns A-G (green cell color) on the sheet are filled with descriptive and administrative metadata describing each item. This metadata is auto-populated from other sheets in the workbook. Columns H-W (yellow cell color) are the reviewer’s working area, with questions to answer about each item reviewed. As mentioned earlier, the answers to the questions determine the actions that need to be taken before the recording is ready to go online, and place the GUID into the appropriate workflow trackers.
The answers to some questions on the sheet impact the need to answer others, and cells auto-populate with “N/A” when one answer precludes another. Almost all the answers require controlled values, and the cells will not accept input besides those values. If any of the cells are left blank (besides questions #14 and #15), the review will not register as completed on the “Master GUID Status Sheet”. I have automated and applied value control to as much of the data entry in the workbook as possible, because doing so helps mitigate human error. The controlled values also facilitate workbook automation, because we’ve programmed different actions to trigger when specific expected text strings appear in cells. For instance, the answer to “Is there a transcript for this video?” must be “Yes” or “No”, and those are the only inputs the cell will accept. A “No” answer places the GUID on the “No Transcript tracker”, and a “Yes” does not.
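The routing behavior — controlled “Yes”/“No” answers deciding which trackers a GUID joins — can be sketched as a small function. The key and tracker names below mirror the examples in the text but are otherwise hypothetical; the actual workbook implements this with spreadsheet formulas and data validation.

```python
# Controlled vocabulary for the review answers, as described above.
ALLOWED = {"Yes", "No"}

def route_guid(answers):
    """Return the set of trackers a GUID belongs in, given its review answers.

    answers: dict of controlled "Yes"/"No" values, e.g.
    {"is_episode": "No", "has_transcript": "No"}
    """
    for key, value in answers.items():
        if value not in ALLOWED:
            raise ValueError(f"{key}: {value!r} is not a controlled value")
    trackers = set()
    if answers.get("is_episode") == "No":
        trackers.add("Not an Episode tracker")
    if answers.get("has_transcript") == "No":
        trackers.add("No Transcript tracker")
    return trackers

# A tape that is not an episode also has no transcript, so it lands in both:
print(route_guid({"is_episode": "No", "has_transcript": "No"}))
```

Rejecting anything outside the controlled vocabulary is what makes the downstream automation safe to trigger on exact text strings.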
To review an item, staff open the GUID on an access hard drive. We have multiple access drives that contain copies of all the proxy files from delivered NewsHour digitizations. Reviewers are expected to watch one and a half to three minutes of the beginning, middle, and end of a recording, and to check for errors while fast-forwarding through everything not watched. The questions reviewers answer are:
1. Is this video a nightly broadcast episode?
2. If an episode, is the recording complete?
3. If incomplete, describe the incompleteness.
4. Is the date we have recorded in the metadata correct?
5. If not, what is the corrected date?
6. Has the date been updated in our metadata repository, the Archival Management System?
7. Are the audio and video as expected, based on the digitization vendor’s transfer notes?
8. If not, what is wrong with the audio or video?
9. Is there a transcript for this video?
10. If yes, what is the transcript’s filename?
11. Does the video content completely match the transcript?
12. If no, in what ways and where doesn’t the transcript match?
13. Does the closed caption file match completely (if one exists)?
14. Should this video be part of a promotional exhibit?
15. Any notes to project manager?
16. Date the review is completed.
17. Initials of the reviewer.
Our internal documentation has specific guidelines on how to answer each of these questions, but I will spare you those details! If you’re conducting quality control and description of media at your institution, these questions are probably familiar to you. After a bit of practice reviewers become adept at locating transcripts, reviewing content, and answering the questions. Each asset takes about ten minutes to review if the transcript matches, the content is the expected recording, and the digitization is error free. If any of those criteria are not true, the review will take longer. The review is laborious, but an essential step to make the records available.
A large majority of recordings are immediately ready to go online following the asset review. These ready GUIDs are automatically placed into the “AAPB Online Status tracker,” where we track the workflow to generate metadata from the transcript and upload that and the recording to the AAPB.
About once a month I use the “AAPB Online Status tracker” to generate a list of GUIDs and corresponding transcripts and closed caption files that are ready to go online. To do this, all I have to do is filter for GUIDs in the “AAPB Online Status tracker” that have the workflow status “Incomplete” and copy the relevant data for those GUIDs out of the tracker and into a text file. I import this list into a FileMaker tool we call “NH-DAVE” that our Systems Analyst constructed for the project.
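That filter-and-export step can be sketched as follows. The row structure and field names are hypothetical stand-ins for the tracker’s actual columns, and the GUIDs and filenames are invented for the example:

```python
def export_ready_guids(tracker_rows):
    """Pull GUIDs with workflow status "Incomplete" out of the tracker rows
    and format them as tab-separated lines for import into a downstream tool.

    tracker_rows: list of dicts, one per spreadsheet row (field names hypothetical).
    """
    lines = []
    for row in tracker_rows:
        if row["status"] == "Incomplete":
            lines.append("\t".join([row["guid"], row["transcript"], row.get("captions", "")]))
    return "\n".join(lines)

rows = [
    {"guid": "cpb-aacip-507-0001", "transcript": "19880104.txt", "status": "Incomplete"},
    {"guid": "cpb-aacip-507-0002", "transcript": "19880105.txt", "status": "Complete"},
]
print(export_ready_guids(rows))
```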
“NH-DAVE” is a relational database containing all of the metadata that was originally encoded within the NewsHour transcripts. The episode transcripts provided by NewsHour contained the names of individuals appearing in each episode and subject terms for that episode as marked-up values. Their subject terms were much more specific than ours, so we mapped them to the broader AAPB controlled vocabulary we use to facilitate search and discovery on our website. When I ingest a list of GUIDs and transcripts into “NH-DAVE” and click a few buttons, it uses an AppleScript to match metadata from the transcripts to the corresponding NewsHour metadata records in our Archival Management System and generate SQL statements. We use the statements to insert the contributor and subject metadata from the transcripts into the GUIDs’ AAPB metadata records in the Archival Management System.
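The SQL-generation step can be sketched as below. The table and column names are hypothetical, not the actual Archival Management System schema, and the sample names are drawn from the NewsHour examples in this post:

```python
def contributor_sql(guid, contributors):
    """Generate SQL INSERT statements adding contributor names to a GUID's
    record. Table and column names here are illustrative only.
    """
    statements = []
    for name in contributors:
        safe = name.replace("'", "''")  # escape single quotes for SQL string literals
        statements.append(
            f"INSERT INTO contributors (guid, contributor_name) "
            f"VALUES ('{guid}', '{safe}');"
        )
    return statements

for stmt in contributor_sql("cpb-aacip-507-0001", ["Jim Lehrer", "Judy Woodruff"]):
    print(stmt)
```

In code that executes SQL directly (rather than emitting statements for review), parameterized queries would be the safer choice over string interpolation.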
Once the transcript metadata has been ingested we use both a Bash and a Ruby script to upload the proxy recordings to our streaming service, Sony Ci, and the transcripts and closed caption SRT files to our web platform, Amazon. We run a Bash script to generate another set of SQL statements to add the Sony Ci URLs and some preservation metadata (generated during the digital preservation phase) to our Archival Management System. We then export the GUIDs’ Archival Management System records into PBCore XML and ingest the XML into the AAPB’s website. As each step of this process is completed, we document it in the “Online Workflow tracker,” which will eventually register that work on the GUID is completed. When the PBCore ingest is completed and documented on the “Online Workflow tracker,” the recording and transcript are immediately accessible online and the record displays as complete on the “Master GUID Status spreadsheet”!
We consider a record that has an accurate full text transcript, contributor names, and subject terms to be sufficiently described for discovery functions on the AAPB. The transcript and terms will be fully indexed to facilitate searching and browsing. When a transcript matches, our descriptive process for NewsHour is fully automated. This is because we’re able to utilize the NewsHour’s legacy data. Without that data, the descriptive work required for this collection would be tremendous.