Using Linked Data for the NET Collection Catalog

Who I Am

I am Chris Pierce, the Cataloger/Metadata Specialist for the American Archive of Public Broadcasting and the National Educational Television (NET) Collection Catalog project at the Library of Congress. The NET Collection Catalog Project is a collaboration between WGBH and the Library of Congress, funded by the Council on Library and Information Resources (CLIR). The project is creating a national catalog of records that document and robustly describe the titles distributed by NET, public media’s first national network. These titles are public media’s earliest content and among its most at-risk.

In addition to cataloging moving image material distributed by NET from the mid-to-late fifties through the early seventies, I am also working on a feasibility report on the implementation of linked data for the NET catalog.

Linked data? Huh?

What is linked data? The Wikipedia definition is “a method of publishing structured data so that it can be interlinked.” To put it simply, linked data is data that can be linked to other data, much as web pages are connected to one another by hyperlinks.

Why would we want to implement linked data? There are several reasons:

  • AAPB/NET metadata contains valuable and largely undiscovered relationships that, when reused by others on the internet, can enhance the information already online.
  • It would open AAPB/NET metadata to web applications and make the metadata more discoverable and shareable on the web.
  • It would contribute to the sustainability of metadata creation for future cataloging at the AAPB: metadata that is more deeply connected to external metadata could be reused to describe AAPB material.

Very often we talk about linked data being actionable, by which we mean that the data can be linked to other data through Uniform Resource Identifiers (URIs), which work like hyperlinks that direct the user to more information about a resource or property. Data designed to be interlinked in this way becomes a node in a traversable “web” of data. The underlying model for linked data is therefore a graph, rather than a relational or hierarchical structure. It is very common to see linked data visualized through this sort of image:

Image from The Oracle Alchemist

These links are structured through relationships expressed as triples. In the image above, these triples are represented in graph form, but they can also be serialized in machine readable code. In both the serialization and the graph, these triples are logical statements:

This person [has]realName Stephen King

This person hasTwitter @StephenKing

@StephenKing hasContent [pictures of his dog Molly aka Thing of Evil]

A triple is simply a relationship between a subject and an object communicated through a predicate:

SUBJECT——PREDICATE——OBJECT
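To make the subject, predicate, object pattern concrete, here is a minimal Python sketch of the three statements above. The identifier `person1` and the function names are made up for illustration; a real dataset would use URIs and an RDF library rather than plain strings and tuples.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

# The three example statements; "person1" is a hypothetical identifier.
triples = [
    Triple("person1", "hasRealName", "Stephen King"),
    Triple("person1", "hasTwitter", "@StephenKing"),
    Triple("@StephenKing", "hasContent", "pictures of his dog Molly"),
]

def objects_of(triples, subject, predicate):
    """Return every object linked to `subject` through `predicate`."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]

def reachable(triples, start):
    """Walk the graph outward from `start` and collect every node reached."""
    seen = []
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for t in triples:
            if t.subject == node and t.obj not in seen:
                seen.append(t.obj)
                frontier.append(t.obj)
    return seen

print(objects_of(triples, "person1", "hasTwitter"))
print(reachable(triples, "person1"))
```

Because the object of one triple (the Twitter handle) is the subject of another, starting from the person leads all the way to the dog pictures. That chaining is what makes the data a traversable graph.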

The data model that supports the exchange of data structured in this way (as a web of interlinked nodes connected through relationships expressed as triples) is the Resource Description Framework (RDF). RDF can be semantically structured through specifications that define what types of data are being modelled. For instance, RDF Schema (RDFS) is a data modelling vocabulary that can be used to define classes and the possible relationships between classes. BIBFRAME is another vocabulary, being developed by the Library of Congress to represent library bibliographic metadata in RDF. A third example is EBUCORE, a vocabulary designed by the European Broadcasting Union to support linked data at various stages in the life cycle of broadcasting material, including production, business, and archives. Vocabularies such as these are central to having every subject, predicate, and object defined and expressed as a URI rather than as a literal string value (a string that is not actionable through links), and they expand the types of things that can be described as linked data, at various levels of granularity.
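The difference between a URI and a literal string shows up clearly when triples are written out. The sketch below serializes two triples in N-Triples, one of the standard RDF serializations, using only the Python standard library. The `rdf:type` and `rdfs:label` URIs are the real W3C ones; everything under `http://example.org/` (the record, the `Program` class) is a placeholder, not an actual NET, BIBFRAME, or EBUCORE term.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
EX = "http://example.org/"  # hypothetical namespace, for illustration only

class Literal(str):
    """Marks a plain string value, as opposed to a URI."""

def term(value):
    """Format one term: literals in quotes, URIs in angle brackets."""
    if isinstance(value, Literal):
        escaped = (value.replace("\\", "\\\\")
                        .replace('"', '\\"')
                        .replace("\n", "\\n"))
        return f'"{escaped}"'
    return f"<{value}>"

def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in triples)

example = [
    (EX + "program/1", RDF_TYPE, EX + "vocab/Program"),
    (EX + "program/1", RDFS_LABEL, Literal("An example NET program")),
]

print(to_ntriples(example))
```

In the output, the class appears in angle brackets and could itself be looked up and linked to, while the label is a quoted string that leads nowhere further.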

This framework of linked data advances the principles proposed by Tim Berners-Lee as the foundation of linked data:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF).
  4. Include links to other URIs, so that they can discover more things.
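As a rough sketch of what "looking up" a name can mean in practice, the Python below prepares an HTTP request for a URI and asks for an RDF serialization (Turtle) through the `Accept` header. The URI is a placeholder and the function name is hypothetical; the request is only built here, not sent.

```python
from urllib.parse import urlparse
from urllib.request import Request

def build_lookup_request(uri, media_type="text/turtle"):
    """Prepare a request that asks the server for RDF describing `uri`."""
    scheme = urlparse(uri).scheme
    if scheme not in ("http", "https"):
        # Principle 2: the name must be an HTTP URI to be looked up this way.
        raise ValueError(f"not an HTTP URI: {uri}")
    # Principle 3: ask for a standard RDF format rather than an HTML page.
    return Request(uri, headers={"Accept": media_type})

# Placeholder URI; substitute a real linked data URI to try it out.
request = build_lookup_request("http://example.org/program/1")
print(request.full_url, request.get_header("Accept"))
```

Passing the prepared request to `urllib.request.urlopen` would perform the lookup. Whether a given server actually returns Turtle depends on how that server is configured.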

The NET project

The feasibility report that my colleagues at the Library of Congress and I are working on will focus on records generated through the NET catalog project (where I spend the majority of my day cataloging). We catalog these records in our content management system, MAVIS. MAVIS outputs the data as MAVISXML, a hierarchically structured format for representing metadata. We are looking at ways to transform MAVISXML to PBCORE (the XML schema in use by AAPB) and then to RDF linked data. We are examining existing technologies, vocabularies, and workflows, and identifying other problems we need to solve. The results of this research will benefit not only the AAPB, but also other cultural heritage institutions and members of the public broadcasting community who are working to implement linked data. I am currently in the “literature review” stage of the linked data research. Look forward to future posts about our process!
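For readers curious what the XML-to-RDF step might look like in the simplest case, here is a small Python sketch. It is not our workflow, which is still being researched. The sample record is invented, it omits the namespace a real PBCORE document carries, and the `http://example.org/` subject and predicate URIs are placeholders rather than terms from any actual vocabulary.

```python
import xml.etree.ElementTree as ET

# Invented, simplified record using PBCORE-style element names (no namespace).
SAMPLE = """<pbcoreDescriptionDocument>
  <pbcoreIdentifier source="example">record-1</pbcoreIdentifier>
  <pbcoreTitle titleType="Series">Example Series</pbcoreTitle>
  <pbcoreTitle titleType="Episode">Example Episode</pbcoreTitle>
</pbcoreDescriptionDocument>"""

BASE = "http://example.org/"  # hypothetical namespace

def record_to_triples(xml_text):
    """Flatten one hierarchical record into (subject, predicate, object) triples."""
    root = ET.fromstring(xml_text)
    identifier = root.findtext("pbcoreIdentifier")
    subject = f"{BASE}record/{identifier}"
    triples = []
    for title in root.findall("pbcoreTitle"):
        kind = title.get("titleType", "").lower()
        # Hypothetical predicates such as .../vocab/seriesTitle
        predicate = f"{BASE}vocab/{kind}Title" if kind else f"{BASE}vocab/title"
        triples.append((subject, predicate, title.text))
    return triples

for triple in record_to_triples(SAMPLE):
    print(triple)
```

The harder questions, such as which vocabulary terms to map each element to and which string values to replace with URIs, are exactly what the feasibility report is meant to explore.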

This post was written by Chris Pierce, AAPB and NET Cataloger/Metadata Specialist.
