Steve Wilcer, Public Broadcasting Preservation Fellow at WUNC


Hello! My name is Steve Wilcer. I coordinated with WGBH and WUNC Radio in Chapel Hill, North Carolina as a member of the second cohort of fellows for the AAPB Public Broadcast Preservation Fellowship. I am currently working towards a Master of Science in Library Science at the University of North Carolina and plan to graduate next spring. Prior to my time in North Carolina, I studied musicology at the Ohio State University and was exposed to a wide variety of media formats and materials, ranging from microfiche to medieval manuscripts. I developed a strong passion for libraries and archives through these experiences, which led me to pursue a second master’s degree in library science.

Learning as I work

As someone who moved to North Carolina only last fall, I found that my work with WUNC Radio offered me a unique opportunity to learn about the area and its people. Public radio provides a versatile platform for education, entertainment, and awareness programming. I was thrilled to experience the myriad programs WUNC has produced over the years and to contribute directly to their preservation for the future. During my portion of the fellowship, I digitized approximately forty assets, most of them digital audio tapes. I also continued to develop the cataloging and documentation for WUNC, which let me experience the digitization and preservation process from a more holistic standpoint.

One particularly informative component of the fellowship for me was the North Carolina Voices special collection, which contains materials from two of WUNC’s special program series: Understanding Poverty and Civil War. Understanding Poverty offered a wide assortment of programs and features on various financial and social issues in the state, as well as how North Carolina has developed over the last several decades. The Civil War series contained family stories of ancestors who lived during or served in the United States Civil War. Both series gave me valuable, more tangible insight into the people of Chapel Hill and North Carolina as I listened to their stories and firsthand experiences. I also had the artistic opportunity to design our thumbnail image for the special collection as it appears on the AAPB.

Building up foundations

As the second UNC fellow on the project, I was fortunate that our digitization station was already set up and operational. Getting the station to work was a significant challenge during the first round of the fellowship, but it operated without any issues for me, thanks to the hard work of everyone involved. One of my duties in the project was to build upon the records for the digitized materials and ensure that WUNC’s own records were uniform and easy to understand. I frequently consulted with WUNC’s Keith Weston to confirm dates, names, and programming details. In some cases, newly rediscovered items forced us to reevaluate how we defined a particular series or piece of programming, and I would edit our records as necessary.

UNC SILS Digitization station

While the fellowship focuses on digitization, cataloging the physical DATs and cassettes I handled proved to be equally important. Without proper labeling and documentation, a given asset could unknowingly be recorded again, costing extra time. In addition to maintaining our digital master table of records, I was responsible for labeling the physical objects and their cases with the newly determined local identifiers for WUNC. With these markings, the cases can be quickly scanned for items that are yet to be digitized, which will make future digitization projects easier for WUNC.

I developed a strong personal connection to these items as I cataloged and marked them. Each DAT and cassette had a story to tell, and it was up to me to piece together their metadata and see that they were digitized and made publicly accessible so others could listen to them. Being one of the first North Carolina-based organizations to be included in the AAPB was very exciting for me, as our work here was not only a foundation for WUNC and its archives, but for North Carolina as a state, as well. Materials like the WUNC 1953 sign-on event reminded me how long ago some of these recordings were made, and how many more there may still be at WUNC, waiting to be digitized and heard once more.

Overall, the fellowship has been a wonderful opportunity for me. It not only allowed me to develop my abilities in handling audio materials and digital records, but also gave me a way to learn about the area, its people, and its history. I am incredibly grateful for all the support and effort from everyone who allowed this project to be realized: my advisor, Dr. Helen Tibbo; Erica Titkemeyer from the Southern Folklife Collection, for her technical assistance; Dena Schultz, our first fellow for the project; Keith Weston at WUNC; and all the staff at WGBH, for their supervision, planning, and feedback.

Written by Steve Wilcer, PBPF Summer 2018 Cohort

———

About PBPF

The Public Broadcasting Preservation Fellowship (PBPF), funded by the Institute of Museum and Library Services, supports ten graduate student fellows at University of North Carolina, San Jose State University, Clayton State University, University of Missouri, and University of Oklahoma in digitizing at-risk materials at public media organizations around the country. Host sites include the Center for Asian American Media, Georgia Public Broadcasting, WUNC, the Oklahoma Educational Television Authority, and KOPN Community Radio. Contents digitized by the fellows will be preserved in the American Archive of Public Broadcasting. The grant also supports participating universities in developing long-term programs around audiovisual preservation and ongoing partnerships with their local public media stations.

For more updates on the Public Broadcasting Preservation Fellowship project, follow the project at pbpf.americanarchive.org and on Twitter at #aapbpf, and come back in a few months to check out the results of their work.

 

Riley Griffin, Public Broadcasting Preservation Fellow at GPB

When we toured WGBH, we took turns holding an Emmy Award trophy (Image: Riley Griffin, author, holding an Emmy Award)

Hi, everyone! My name is Riley Griffin (xe/xir). I am just now entering my second year of graduate school in Clayton State University’s Master of Archival Studies program. I am the second fellow, after Virginia Angles, to be a part of the American Archive of Public Broadcasting (AAPB) Public Broadcasting Preservation Fellowship (PBPF). My part of the project focused on digitizing Georgia Public Broadcasting’s (GPB) Georgia Gazette under the incredibly trusting supervision of Ellen Reinhardt, Kathy Christensen, and Joshua Kitchens. I was looking for summer opportunities when a chance to follow a career path in my new-found love for preservation presented itself through the AAPB PBPF. I was overjoyed by the scope of the fellowship, the organizations working with it, and the special collections it included.

Every fellowship starts with certain expectations only to end with different lessons and new perspectives. At the start of my fellowship, I spent a lot of time comparing. There were a lot of things I was not expecting, my reactions being one of them. As we visited Boston and learned about all the different types of digital media we could be working with, I couldn’t help but begin to feel a sort of jealousy, wishing I could work with as many formats and topics as possible.

Of course, this hunger decreased to a low rumble as I became humbled by the Georgia Gazette materials. I quickly realized I had been craving difficulty, so I became grateful instead of jealous. In training, we were prepared to scrub and scrub our machines clean, take precious time delicately fixing things, and balance everything to be just perfect. However, my project was given a bit of grace by being a more modern collection. Digital Audio Tapes (DATs) are often considered one of the most fragile media formats. However, most of mine were recorded at a decent quality from the 1990s to the 2000s, rewound to the beginning, and left alone and undisturbed in an air-conditioned radio station. So, please forgive me when I am grateful that the worst of my worries is how many times I dropped the (very loose) pinch roller into the machine that day.

GPB Digitization Station (Image: Two desks with 2 computers, a DAT machine, cleaning materials, and various electronics everywhere)

The topics of everyone’s materials had me curious, too. I was wondering what it was like to have video, as my project was only audio, and to have materials like oral histories to work with. I quickly counted my blessings as I heard what my colleague was working on: images of war, tragedy, death, and disaster. I thanked GPB for having forward attitudes towards topics, reporters who were nearly emotionless in comparison, and pert news reports. I am a very sensitive soul and could imagine having to wait the tears out before being able to see what you’re working on. I also realized I was having a hard time with some of the Georgia Gazette material. One thing I experience as an archivist who moves all over is major culture shock. I think being an archivist is one of the best ways to learn about the place you have just moved to. But it also exposes you to things much more quickly than you expect.

I’m from upstate New York, which has a different demographic and historical context; although I’m not unfamiliar with racism, being deeply embedded in Georgia’s racial history as I digitized GPB’s daily news was a new experience for me. I had moments of weeping at work as I listened to news reports about the Georgia General Assembly holding expensive special sessions in order to redistrict purely based on race, schoolchildren being prevented from going to the schools they wanted as a result of segregation, and segregation’s long-term effects on Georgia school districts, which I still hear about today. Although I knew about these issues in the abstract, hearing them firsthand was very emotional for me, and adding visuals might have been overwhelming.

I would be lying if I said I came away from this project without any further attachment to Georgia. Although it has exposed me to some of the ugly parts I try to avoid in my daily life, it has also exposed me to so much more. Even the drive to work showed me the oldest drive-in movie theater in the area that is still working. I also got the opportunity to listen to all of the preparation for and execution of the 1996 Olympics. I am a huge fan of all things Olympics, so this was a special treat for me. The Georgia Gazette has given me a sort of pseudo-pride in Georgia; every guest and topic on the show had a relation to Georgia. Learning about popular historical figures like Blind Tom Wiggins or popular events like the National Grits Festival in Warwick gives me a great appreciation for where I live and the opportunities available to me here. It has also given me a deeper and fuller appreciation for public broadcasting, something that had already been instilled in me. At a time when everyone is flocking to Georgia for jobs, often displacing long-term Georgians, I remind myself that my brief time here must be purposeful. I hope to help make their history more accessible so that they can feel that true sense of pride they deserve. With the Georgia Gazette, I hope I did just that, even if it was just a little bit.

Indeed, this was the “WORST Gazette ever” (Image: close-up of a DAT labelled “Maxell DAT; Gazette 01-20 95; WORST Gazette ever”)

 

Written by Riley Griffin, PBPF Summer 2018 Cohort


AAPB Transcription Workflow, Part 1

The AAPB started creating transcripts as part of our “Improving Access to Time-Based Media through Crowdsourcing and Machine-Learning” grant from the Institute of Museum and Library Services (IMLS). For the initial 40,000 hours of the AAPB’s collection, we worked with Pop Up Archive to create machine-generated transcripts, which are primarily used for keyword indexing, to help users find otherwise under-described content. These transcripts are also being corrected through our crowdsourcing platforms FIX IT and FIX IT+.

As the AAPB continues to grow its collection, we have added transcript creation to our standard acquisitions workflow. Now, when the first steps of acquisition are done, i.e., metadata has been mapped and all of the files have been verified and ingested, the media is passed into the transcription pipeline. The proxy media files are either copied directly off the original drive or pulled down from Sony Ci, the cloud-based storage system that serves americanarchive.org’s video and audio files. These are copied into a folder on the WGBH Archives’ server, where they wait for an available computer running transcription software.

Dockerized Kaldi

The AAPB uses the Docker image of PopUp Archive’s Kaldi, running on many machines across WGBH’s Media Library and Archives. Rather than paying additional money to run this in the cloud or on a supercomputer, we decided to take advantage of the resources we already had sitting in our department. AAPB and Archives staff at WGBH who regularly leave their computers in the office overnight are good candidates for being part of the transcription team. All they have to do is follow instructions on the internal wiki to install Docker and a simple Macintosh application, built in-house, that runs scripts in the background and reports progress to the user. The application manages launching Docker, pulling the Kaldi image (or checking that you already have it pulled), and launching the image. The user doesn’t need any specific knowledge about how Docker images work to run the application. The app gets minimized to the Dock and continues to run in the background as the staff member goes about their work during the day.* But that’s not all! When they leave for the night and their computer typically wouldn’t be doing anything, it continues to transcribe media files, making use of processing power that we were already paying for but hadn’t been utilizing.

*There have been reports of systems being perceptibly slower when running this Docker image throughout the day. It has yet to have a significant impact on any staff member’s ability to do their job.
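
To illustrate the "pull the image or check that you already have it" step, here is a minimal Python sketch. It is not the in-house Macintosh application: the image name `example/kaldi-transcriber:latest` and the function names are placeholders invented for this example. The only real commands used are the Docker CLI's `docker image inspect` (which exits non-zero when the image is not present locally) and `docker pull`.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical image name; the post does not give the real one.
KALDI_IMAGE = "example/kaldi-transcriber:latest"

# A runner takes a command line and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def run_quietly(cmd: Sequence[str]) -> int:
    """Run a command, discard its output, and return its exit code."""
    completed = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return completed.returncode


def ensure_image(image: str = KALDI_IMAGE, run: Runner = run_quietly) -> bool:
    """Pull `image` only if it is missing locally.

    Returns True if a pull was needed, False if the image was already present.
    """
    # `docker image inspect` exits non-zero when the image is not available locally.
    if run(["docker", "image", "inspect", image]) == 0:
        return False
    if run(["docker", "pull", image]) != 0:
        raise RuntimeError(f"could not pull {image}")
    return True
```

The `run` parameter exists only so the logic can be exercised without Docker installed. Calling `ensure_image()` with the defaults shells out to the real CLI.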

Square application window that shows list of transcripts that have been processed
Application user-interface

Centralized Solution

Now, we could just have multiple machines running Kaldi through Docker and that would let us create a lot of transcripts. However, it would be cumbersome and time-consuming to split the files into batches, manage starting a different batch on each computer, and collect the disparate output files from various machines at the end of the process. So we developed a centralized way of handling the input and output of each instance of Kaldi running on a separate machine.

That same Macintosh application that manages running the Kaldi Docker image also manages files in a network-shared folder on the Archives server. When a user launches the application, it checks that specific folder on the server for media files. If there are any media files in that folder, it takes the oldest file, copies it locally, and starts transcribing it. When Kaldi has finished transcribing it, the output text and JSON-formatted transcripts are copied to a subfolder on the Archives server, and the copy of the media file is deleted. Then the application checks the folder again, picks up the next media file, and the process continues.
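
The loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the application's actual code: the folder arguments, the extension list, and the `transcribe` callback (standing in for the Kaldi run) are all hypothetical, and "oldest" is taken to mean the earliest modification time. The sketch also leaves out the renaming step that keeps two machines from taking the same file.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Optional

# Hypothetical set of proxy-media extensions; adjust to the files in use.
MEDIA_EXTENSIONS = {".mp4", ".mp3", ".wav"}


def oldest_media_file(inbox: Path) -> Optional[Path]:
    """Return the media file in `inbox` with the earliest modification time."""
    candidates = [
        p for p in inbox.iterdir()
        if p.is_file() and p.suffix.lower() in MEDIA_EXTENSIONS
    ]
    return min(candidates, key=lambda p: p.stat().st_mtime, default=None)


def process_next(
    inbox: Path,
    workdir: Path,
    outbox: Path,
    transcribe: Callable[[Path], List[Path]],
) -> bool:
    """Handle one file; return False when the inbox has nothing to do."""
    source = oldest_media_file(inbox)
    if source is None:
        return False
    workdir.mkdir(parents=True, exist_ok=True)
    outbox.mkdir(parents=True, exist_ok=True)
    # Work on a local copy so transcription does not depend on the network.
    local_copy = Path(shutil.copy2(source, workdir))
    # `transcribe` stands in for Kaldi and returns the transcript files it wrote.
    for transcript in transcribe(local_copy):
        shutil.copy2(transcript, outbox)
    local_copy.unlink()  # delete the local copy of the media file
    source.unlink()      # remove the server copy so it is not picked up again
    return True
```

A caller would repeat `process_next(...)` until it returns `False`, then wait and check the folder again.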

Screenshot of a file directory with many .mp4 files, a few folders, and a few files named with base64 encoded strings
Files on the Archives server: the files at the top are waiting to be processed, the files near the bottom are the ones being processed by local machines

Avoiding Duplicate Effort

Now, since we have multiple computers running in parallel, all looking at the same folder on the server, how do we make sure that multiple computers aren’t duplicating efforts by transcribing the same file? Well, the process first tries to rename the file to be processed, using the person’s name and a base-64 encoding of the original filename. If the renaming succeeds, the file is copied into the Docker container for local processing, and the process on every other workstation will ignore files named that way in its quest to pick up the oldest qualifying file. After a file is successfully processed by Kaldi, it is then deleted, so no one else can pick it up. When Kaldi fails on a file, the file on the server is renamed to its original file name with “_failed” appended, and again the scripts know to ignore the file. A human can later go in to see if any files have failed and investigate why. (It is rare for Kaldi to fail on an AAPB media file, so this is not part of the workflow we felt we needed to automate further.)
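
A rename-based claim like the one described above might look like the following Python sketch. The exact naming scheme is not given in this post, so the `<person>__<encoded name>` format, the `__` separator, and the function names are assumptions made for illustration. URL-safe base-64 is used because the standard alphabet includes `/`, which cannot appear in a filename. The sketch relies on a rename failing when the source file no longer exists; whether that holds reliably on a given network share is something to verify.

```python
import base64
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical separator; assumes no person's name contains or ends with "_".
SEPARATOR = "__"


def claimed_name(person: str, filename: str) -> str:
    """Build the name a file carries while `person`'s machine works on it."""
    token = base64.urlsafe_b64encode(filename.encode("utf-8")).decode("ascii")
    return f"{person}{SEPARATOR}{token}"


def decode_claimed_name(claimed: str) -> Tuple[str, str]:
    """Split a claimed name back into (person, original filename)."""
    person, token = claimed.split(SEPARATOR, 1)
    original = base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8")
    return person, original


def try_claim(media: Path, person: str) -> Optional[Path]:
    """Rename `media` to claim it; return None if someone else got there first."""
    target = media.with_name(claimed_name(person, media.name))
    try:
        return media.rename(target)
    except FileNotFoundError:
        # The file was already renamed (claimed) by another workstation.
        return None


def mark_failed(claimed: Path) -> Path:
    """Restore the original name with "_failed" appended, as the post describes."""
    _, original = decode_claimed_name(claimed.name)
    return claimed.rename(claimed.with_name(original + "_failed"))
```

Because neither a claimed name nor a `_failed` name ends in a media extension, a picker that filters on extension skips both without any extra bookkeeping.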

Handling Computer and Human Errors

The centralized workflow relies on the application not quitting in the middle of a transcription. If someone shuts their laptop, the application will stop, but when they open it again, the application will pick up right where it left off. It will even continue transcribing the current file if the computer is not connected to the WGBH network, because it maintains a local copy of the file that is processing. This allows a little flexibility in terms of staff taking their computers home or to conferences.

The problem starts when the application quits, which could occur when someone quits it intentionally, someone accidentally hits the quit button rather than the minimize button, someone shuts down or restarts their computer, or a computer fails and shuts itself down automatically. We have built the application to minimize the effects of this problem. When the application is restarted, it will just pick up the next available file and keep going as if nothing happened. The only reason this is a problem at all is that the file the computer was in the middle of transcribing is still sitting on the Archives server, renamed, so no other computer will pick it up.

We consider the few downsides of this setup completely manageable:

  • At regular intervals a human must look into the folder on the server to check that a file hasn’t been sitting renamed for a long time. These are easy to spot because there will be two renamed files with the same person’s name. The older of these two files is the one that was started and never finished. The filename can be changed to its original name by decoding the base-64 string. Once the name is changed, another computer will pick up the file and start transcribing.
  • Because the file stopped being transcribed in the middle of the process, the processing time spent on that interrupted transcription is wasted. The next computer to start transcribing this file will start again at the beginning of the process.
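
The manual check in the first bullet could be assisted by a small script. The Python sketch below assumes a claimed file is named `<person>__<URL-safe base-64 of the original filename>`, which is a hypothetical format since the post does not spell one out, and that ordinary media filenames never contain the `__` separator. Renaming leaves a file's modification time alone, and machines pick up the oldest file first, so the sketch treats every claim except a person's most recently modified one as abandoned.

```python
import base64
from collections import defaultdict
from pathlib import Path
from typing import Dict, List

# Hypothetical separator between the person's name and the encoded filename.
SEPARATOR = "__"


def find_stale_claims(inbox: Path) -> List[Path]:
    """Return claimed files that were started and, apparently, never finished."""
    by_person: Dict[str, List[Path]] = defaultdict(list)
    for path in inbox.iterdir():
        if path.is_file() and SEPARATOR in path.name:
            person = path.name.split(SEPARATOR, 1)[0]
            by_person[person].append(path)
    stale: List[Path] = []
    for claims in by_person.values():
        claims.sort(key=lambda p: p.stat().st_mtime)
        # Keep the newest claim (presumably still in progress); flag the rest.
        stale.extend(claims[:-1])
    return stale


def release(claimed: Path) -> Path:
    """Decode the base-64 part and give the file its original name back."""
    token = claimed.name.split(SEPARATOR, 1)[1]
    original = base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8")
    return claimed.rename(claimed.with_name(original))
```

A person reviewing the folder could print `find_stale_claims(inbox)` first and call `release` only on the entries they have confirmed, which keeps the human check described above in place.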

Managing Prioritization

Because the AAPB has a busy acquisitions workflow, we wanted to make sure there was a way to manage prioritization of the media getting transcribed. Prioritization can be determined by many variables, including project timelines, user interest, and grant deadlines. Rather than spending a lot of time to build a system that let us track each file’s prioritization ranking, we opted for a simpler, more manual operation. While it does require human intervention, the time commitment is minimal.

As described above, the local desktop applications only look in one folder on the Archives server. By controlling what is copied into that folder, it is easy to control what files get transcribed next. The default is for a computer to pick up the oldest file in the folder. If you have a set of more recent files that you want transcribed before the rest of the files, all you have to do is remove any older files from that folder. You can easily put them in another folder, so that when the prioritized files are completed, it’s easy to move the rest of the files into the main folder.
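
The folder shuffle described here is simple enough to do by hand, but as an illustration it could be scripted as follows. Everything in this Python sketch is hypothetical: the function names, the `holding` folder, and the assumption that unclaimed media files end in `.mp4`. Files already renamed by a workstation are left in place.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def hold_back(inbox: Path, holding: Path, priority_names: Iterable[str]) -> List[Path]:
    """Move every waiting .mp4 except the prioritized ones into `holding`."""
    keep = set(priority_names)
    holding.mkdir(parents=True, exist_ok=True)
    moved: List[Path] = []
    # sorted() snapshots the listing so moving files does not disturb iteration.
    for path in sorted(inbox.iterdir()):
        if path.is_file() and path.suffix.lower() == ".mp4" and path.name not in keep:
            moved.append(Path(shutil.move(str(path), str(holding / path.name))))
    return moved


def restore(holding: Path, inbox: Path) -> int:
    """Move held files back once the prioritized batch is done; return the count."""
    count = 0
    for path in sorted(holding.iterdir()):
        if path.is_file():
            shutil.move(str(path), str(inbox / path.name))
            count += 1
    return count
```

Moving a file rather than re-copying it keeps its modification time, so restored files keep their place in the oldest-first order.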

For smaller sets of files that need to be transcribed, we can also have someone who is not running the application stand up an instance of Dockerized Kaldi and run the media through it locally. Their machine won’t be tied into the folder on the server, so they will only process those prioritized files they feed Kaldi locally.

Transforming the Output

At any point we can go to the Archives server and grab the transcripts that have been created so far. These transcripts are output as text files and as JSON files that pair time-stamp data with each word. However, the AAPB prefers JSON transcripts that are time-stamped at each phrase of 5 to 7 seconds.

We use a script that parses the word-stamped JSON files and outputs phrase-stamped JSON files.

Word time-stamped JSON

Screenshot from a text editor showing a json document with wrapping json object called words with sub-objects with keys for word, time, and duration
Snippet of Kaldi output as JSON transcript with timestamps for each word

Phrase time-stamped JSON

Screenshot from a text editor of JSON with a container object called parts and sub-objects with keys text, start time, and end time.
Snippet of transformed JSON transcript with timestamps for 5-7 second phrases
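
For readers who cannot see the screenshots, the transformation can be sketched in Python. This is not the AAPB's script. It assumes the input looks like `{"words": [{"word": ..., "time": ..., "duration": ...}]}` and the output like `{"parts": [{"text": ..., "start_time": ..., "end_time": ...}]}`; the exact spelling of the output keys and the grouping rule (close a phrase once it spans at least 5 seconds, and never let one span more than 7) are guesses based on the descriptions above.

```python
from typing import Any, Dict, List


def words_to_phrases(
    words: List[Dict[str, Any]],
    min_seconds: float = 5.0,
    max_seconds: float = 7.0,
) -> List[Dict[str, Any]]:
    """Group word-level timestamps into phrases of roughly 5 to 7 seconds."""
    parts: List[Dict[str, Any]] = []
    current: List[Dict[str, Any]] = []

    def flush() -> None:
        # Emit the words collected so far as one phrase, then start over.
        if current:
            last = current[-1]
            parts.append({
                "text": " ".join(w["word"] for w in current),
                "start_time": current[0]["time"],
                "end_time": last["time"] + last["duration"],
            })
            current.clear()

    for word in words:
        end = word["time"] + word["duration"]
        # A long pause would stretch the phrase past the cap, so close it first.
        if current and end - current[0]["time"] > max_seconds:
            flush()
        current.append(word)
        # Close the phrase as soon as it covers the minimum span.
        if end - current[0]["time"] >= min_seconds:
            flush()
    flush()  # whatever is left becomes a final, possibly shorter, phrase
    return parts


def transform(document: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a word-stamped transcript into a phrase-stamped one."""
    return {"parts": words_to_phrases(document["words"])}
```

A file would be converted by loading it with `json.load`, passing the result to `transform`, and writing the returned dictionary back out with `json.dump`.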

Once we have the transcripts in the preferred AAPB format, we can use them to make our collections more discoverable and share them with our users. More on that part of the workflow in Part 2 (coming soon!).

Upcoming AAPB Webinar Featuring Kathryn Gronsbell, Digital Collections Manager at Carnegie Hall


Photo courtesy of Rebecca Benson, @jeybecques, PBPF Fellow at University of Missouri.

This Thursday, March 15th at 8 pm EST, American Archive of Public Broadcasting (AAPB) staff will host a webinar with Kathryn Gronsbell, Digital Collections Manager at Carnegie Hall. The webinar will cover topics in documentation, including why documentation is important, what to think about when recording workflows for future practitioners, and where to find examples of good documentation in the wild.

The public is welcome to join for the first half hour. The last half hour will be limited to Q&A with our Public Broadcasting Preservation Fellows, who have now begun to inventory their digitized public broadcasting collections to be preserved in the AAPB.

Webinar URL: http://wgbh1.adobeconnect.com/documentation/

For anyone who missed the last webinar on tools for Quality Control, it’s now also available for viewing through this link: http://wgbh1.adobeconnect.com/psv1042lp222/.

*******************************

For more updates on the Public Broadcasting Preservation Fellowship project, follow the project at pbpf.americanarchive.org and on Twitter at #aapbpf, and come back in a few months to check out the results of their work: digitized content preserved in the American Archive of Public Broadcasting from our collaborating host organizations WUNC, KOPN, Oklahoma Educational Television Authority, Georgia Public Broadcasting, and the Center for Asian American Media, as well as documentation created to support ongoing audio and video preservation education at the University of Missouri, University of Oklahoma, Clayton State University, University of North Carolina at Chapel Hill, and San Jose State University.

AMS is undergoing maintenance

Dear AAPB Participating Organizations:

Please be informed that the AAPB team is conducting some maintenance on the AMS server, so access to records will not be available during this time. If you have any questions, please feel free to contact the AAPB Project Manager at casey_davis [at] wgbh [dot] org. We will provide an update on the blog as soon as the AMS is back up and running.

We sincerely apologize for any inconvenience this may cause!