Moving Beyond the Allegory of the Lone Digital Archivist (& my day of Windows scripting at KBOO)

The “lone arranger” was a term I learned in my library science degree program, and I accepted it. I visualized hard-working, problem-solving solo archivists in small-staff situations challenged with organizing, preserving, and providing access to the growing volumes of historically and culturally relevant materials that could be used by researchers. As much as the archives profession is about facilitating a deep relationship between researchers and records, the work to make archival records accessible to researchers needs to be completed first. The lone arranger term described professionals, myself among them, working alone and with known limitations to meet this charge. This reality has encouraged archivists without a team to band together and be creative about forming networks of professional support. The Society of American Archivists (SAA) has organized support for lone arrangers since 1999 and now has a full-fledged Roundtable where professionals can meet and discuss their challenges. Similarly, support for the lone digital archivist was the topic of a presentation by Elvia Arroyo-Ramirez, Kelly Bolding, and Faith Charlton of Princeton University that I heard at the recent 2017 Code4Lib conference held at UCLA.

Managing the digital record is a challenge that requires more attention, knowledge sharing, and training in the profession. At Code4Lib, digital archivists talked about how archivists on their teams did not know how to process born-digital works; this was a challenge, but more than that, unacceptable in this day and age. It was pointed out that our degree programs didn’t offer the same support for digital archiving as they did for processing archival manuscripts and other ephemera. The NDSR program aims to close the gap on digital archiving and preservation, and the SAA has a Digital Archives Specialist credential program, but technology training in libraries and archives shouldn’t be limited to the few who are motivated to seek it out. Many jobs for archivists will be in small or medium-sized organizations, and we argued that processing born-digital works should always be considered part of archival responsibilities. Again, this was a conversation among proponents of digital archives work, and I recognize that it excludes many other thoughts and perspectives. The discussion would be more fruitful if it included individuals who may feel there is a block to their learning and development in processing born-digital records, and if it focused on how to break down those barriers.

Code4Lib sessions (http://bit.ly/d-team-values, http://scottwhyoung.com/talks/participatory-design-code4lib-2017/) reinforced values of the library and archives profession, namely advocacy and empowering users. No matter how specialized an archival process is, digital or not, there is always a need to be able to talk about the work to people who know very little about archiving, whether they are stakeholders, potential funders, community members, or new team members. Advocacy is usually associated with external relations, but it is also an approach we can take when introducing colleagues to technology skills within our library and archives teams. Many sessions at Code4Lib were highly technical, yet the conversation always circled back to helping users and staying in touch with humanity. By highly technical, I do not mean “scary.” Another session reminded us that technology can often cause anxiety and can be misinterpreted as something that can solve all problems. When we talk to people, we should let them know what technology can do and what it can’t. The reality is that technology knowledge is attainable and shouldn’t be feared. It cannot solve every work challenge, but a new skill set and an understanding of technology can help us reach some solutions. It can be a holistic process as well: the framing of challenges is a human-defined model, and finding ways to meet those challenges will also be human driven. People will always brainstorm their best solutions with the tools and knowledge available to them, so let’s add digital archiving and preservation tools and knowledge to the mix.

And the Windows scripting part?

I was originally going to write about my checksum validation process on Windows, without Python, and then I went to Code4Lib, which was inspiring and thought-provoking. In the distributed cohort model, I am a lone archivist if you frame your perspective around my host organization. But I primarily draw my knowledge from my awesome cohort members and the growing professional network I connected with on Twitter (Who knew? Not me.). So I am not a lone archivist in this expanded view. When I was challenged to validate a large number of checksums without the ability to install new programs on my work computer, I asked my colleagues for help. Below is my abridged process, showing how I was guided through an unfamiliar process to a workable solution using not only my ideas but also those of my colleagues. Or scroll all the way down for “Just the solution.”

KBOO recently received files back from a vendor who digitized some of our open-reel content. Hooray! Like any good post-digitization work, ours had to start with verification of the files, and this meant validating checksum hash values. Follow me on my journey through my day of PowerShell and the Windows command line.

Our deliverables included a preservation WAV, a mezzanine WAV, and an MP3 access file, plus related JPGs of the items, an XML file, and an MD5 sidecar for each audio file. The audio filenames followed our file-naming convention, which was designated in advance, and files related to a physical item were grouped in a folder with the same naming convention.

md5deep can verify file hashes by comparing two reports created with the program, but I had to make some changes to the format of the checksum data before the reports could be compared.

Can md5deep run recursively through folders? Yes, and it can recursively compare everything in a directory (and subdirectories) against a manifest.
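As a sketch of what that looks like (flag behavior is from my reading of the md5deep 4.x documentation; paths and filenames are hypothetical), from a Windows command prompt:

```bat
rem create a manifest of MD5 hashes for everything under the current directory
md5deep -r * > manifest.txt

rem later: hash everything again recursively and print only files whose hashes
rem are NOT in the manifest (negative matching); no output means everything matched
md5deep -r -x manifest.txt *
```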

Can md5deep selectively run on just .wav files? Not that I know of, so I’ll ask some people.

Twitter & Slack inquiry: Hey, do you have a batch process that runs on designated files recursively?

Response: You’d have to employ some additional software or commands like [some unix example]

@Nkrabben: Windows or Unix? How about Checksumthing?

Me: Windows, and I can’t install new programs, including Python at the moment

@private_zero: Hey! I’ve done something similar, but not on Windows. Still, try this PowerShell script that combines all sidecar files into one text file. And by the way, remember to sort the lines in the file so they match the sort order of the other file you’re comparing it to.
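The suggestion above, as a minimal sketch (the *.wav.md5 filter and the output filename are my assumptions, not the exact script I was sent):

```powershell
# gather every preservation-WAV sidecar file, pull out the hash lines,
# sort them, and write one combined plain-ASCII manifest
Get-ChildItem -Recurse -Filter *.wav.md5 |
    Get-Content |
    Sort-Object |
    Out-File -Encoding ASCII combined_md5.txt
```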

Me: Awesome! When I make adjustments for my particular situation, it works like a charm. Can PowerShell scripts be given a clickable icon to run easily, like Windows batch files, in my work setup where I can’t install new things?

Answer: Don’t know… [Update: create a file with extension .ps1 and call that file from a .bat file]
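Per the update, the wrapper can be a one-line batch file; a sketch with a hypothetical script name (the execution-policy flag is my assumption and may be locked down in some environments):

```bat
rem validate-checksums.bat: double-clickable wrapper that calls the PowerShell script
powershell -NoProfile -ExecutionPolicy Bypass -File "%~dp0validate-checksums.ps1"
```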

@kieranjol: Hey! If you run this md5deep command it should run just on wav files.

Me: Hm, I tried it, but it doesn’t seem like md5deep is set up to run with that combination of Windows parameters.

@private_zero: I tried running a command; it seems like md5deep works recursively but can’t pick out just the WAV files. An additional filter is needed.

My afternoon of PowerShell and the command line: research on FC (file compare), sort, and ways to remove characters in text files (the vendor had an asterisk in front of every filename in their sidecar files, which needed to be removed to match the output of an md5deep report).

??? moments:

It turns out that PowerShell forces output to UTF-8 with a BOM (byte order mark), as compared to the ASCII/“plain” UTF output of md5deep text files. This needed to be resolved before comparing files.
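One way to strip the BOM, as a sketch with hypothetical filenames, is to round-trip the file through Set-Content with ASCII encoding:

```powershell
# re-write a BOM-prefixed text file as plain ASCII so fc can compare it
# byte-for-byte against md5deep output
Get-Content .\filelist_with_bom.txt |
    Set-Content -Encoding ASCII .\filelist_ascii.txt
```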

The md5deep output that I created listed names only, not paths, but that left space characters at the end of each line! Those needed to be stripped out before comparing files.

I tried to perform the same function as the PowerShell script in the Windows command line, but I kept hitting walls, so I went ahead with my solution of mixing PowerShell and command-line commands.
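For comparison, the trailing-space cleanup that gave me trouble in the command line is a one-pipeline job in PowerShell; a sketch using the same filenames as my batch file:

```powershell
# trim trailing spaces from every line and save a plain-ASCII copy
(Get-Content .\md5sorteddata.txt) |
    ForEach-Object { $_.TrimEnd() } |
    Set-Content -Encoding ASCII .\md5sorteddata_1.txt
```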

After I got six individual commands to run, I combined the PowerShell ones and the Windows command-line ones. Here is my process for validating checksums:

Just the solution:

It’s messy, yes, and there are better and cleaner ways to do this! I recently learned about a shell scripting guide that advocates versioning, code reviews, continuous integration, static code analysis, and testing of shell scripts: https://dev.to/thiht/shell-scripts-matter

Create one big list of MD5 hashes from the vendor’s individual sidecar files using PowerShell

Only include the preservation WAV MD5 sidecar files, look for them recursively through the directory structure, then sort them alphabetically. The combined file is named mediapreserve_20170302.txt. Remove the asterisk (vendor formatting) so that the text file matches the format of an md5deep output file. After removing the asterisk, the vendor MD5 hash values will be in the vendormd5edited.txt file.

Open PowerShell

Navigate to the new temp folder with the vendor files

dir .\* -Exclude *_mezz.wav.md5,*.xml,*.mp3,*.mp3.md5,*.wav,*_mezz.wav,*.jpg,*.txt,*.bat -Recurse | gc | Out-File -Encoding ASCII .\vendormd5.txt

Get-ChildItem -Recurse A:\mediapreserve_20170302 -Exclude *_mezz.wav.md5,*.xml,*.mp3,*.mp3.md5,*.wav.md5,*_mezz.wav,*.jpg,*.bat,*.txt | where { !$_.PSIsContainer } | Sort-Object Name | Select FullName | ft -HideTableHeaders | Out-File -Encoding UTF8 A:\mediapreserve_20170302\mediapreserve_20170302.txt

(Get-Content A:\mediapreserve_20170302\vendormd5.txt) | ForEach-Object { $_ -replace '\*' } | Set-Content -Encoding ASCII A:\mediapreserve_20170302\vendormd5edited.txt

Create my MD5 hashes to compare to the vendor’s

Run md5deep on the text-file list of WAV files from inside the temp folder, using the Windows command line. (This will take a long time when hashing multiple WAV files.)

"A:\md5deep-4.3\md5deep.exe" -ebf mediapreserve_20170302.txt >> md5.txt

Within my new MD5 value list text file, sort my MD5 hashes alphabetically and trim the trailing space characters to match the format of the vendor checksum file. Then compare my text file containing hashes with the file containing the vendor hashes.

I put in pauses to make sure each previous command completed, and so I could follow the order of the commands.

Run the combined-commands.bat batch file (which includes):

sort md5.txt /+34 /o md5sorteddata.txt

timeout /t 2 /nobreak

@echo off > md5sorteddata_1.txt & setlocal enableDelayedExpansion

for /f "tokens=1* delims=]" %%a in ('find /N /V "" ^<md5sorteddata.txt') do (
    set "str=%%b"
    for /l %%i in (1,1,100) do if "!str:~-1!"==" " set "str=!str:~0,-1!"
    <nul >>md5sorteddata_1.txt set /p "l=!str!"
    >>md5sorteddata_1.txt echo.
)

timeout /t 5 /nobreak

fc /c A:\mediapreserve_20170302\vendormd5edited.txt A:\mediapreserve_20170302\md5sorteddata_1.txt

pause
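Since fc sets an exit code (0 when the files match, non-zero when they differ or can’t be opened), the batch file could also announce the result explicitly; a sketch using the same filenames:

```bat
fc /c A:\mediapreserve_20170302\vendormd5edited.txt A:\mediapreserve_20170302\md5sorteddata_1.txt >nul
if errorlevel 1 (echo CHECKSUM MISMATCH - review fc output) else (echo All checksums match)
```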

The two files are the same, so all the data within them matches; therefore, all checksums match. We’ve verified the integrity and authenticity of the files transferred to our server from the vendor.

This post was written by Selena Chau, resident at KBOO Community Radio.
