Augmentation and Enhancement of Crowdsourced Metadata for the Our Marathon Digital Archive

When I first entered my MLIS program at Simmons College, I never imagined that I would become a cataloger. I envisioned myself working at a reference desk, not sitting alone in a back room consulting volumes of arcane rules and standards. But after taking one cataloging course my entire perspective shifted. I realized that the metadata records that catalogers create let us map the bibliographic universe and provide the means for users to find and contextualize resources. Catalogers and metadata librarians are the often unsung heroes of the library world, and after a semester at Simmons I was eager to become one.

For the last several months, I have had the privilege of working on metadata augmentation for the Our Marathon digital archive migration project overseen by Northeastern University Libraries’ Digital Scholarship Group. The Our Marathon archive contains nearly 8,000 digital artifacts split into 22 collections that represent the public’s lived experience following the 2013 Boston Marathon bombings. The objects were collected from sources like Boston City Archives, WBUR, and direct submissions from the public. In 2017, the library created a plan to take over stewardship of the project by migrating the objects and metadata to the DRS and creating a new front-facing site using CERES.

My work has been divided into two parts. The first was cleaning up the metadata records following their migration into the DRS. The DRS uses a metadata format called MODS (Metadata Object Description Schema) encoded in XML, which is different from the format used on the original Our Marathon site. Converting the records to MODS leaves some errors that need to be fixed. This is compounded by the fact that each of the 22 collections has slightly different metadata, echoing their separate provenance.

My second goal was to bring the metadata up to Digital Commonwealth standards, with the idea that Our Marathon will eventually be part of Digital Commonwealth’s collections. Digital Commonwealth also uses MODS for its metadata, but has a set of requirements that ensure the interoperability of the metadata records it collects. Digital Commonwealth acts as a hub for the Digital Public Library of America, so the Our Marathon collections will eventually become part of the DPLA, further increasing their visibility.

My process of cleaning up and enhancing the metadata involved looking at each collection individually and evaluating what needed to be changed. The collections vary wildly in size, with the smallest containing fewer than 15 records and the largest over 4,000. Some changes could be accomplished mechanically, altering all records in the collection with a simple find/replace command, while others required looking over each record individually. Where needed, I added information such as Library of Congress Subject Headings (LCSH) or titles for untitled photographs, but overall the goal was to preserve the content of the records.
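For readers curious what a mechanical change of this kind can look like, here is a minimal sketch in Python. It is not the tooling used on the project: the function name, the sample record, and the genre values are all hypothetical, and it assumes each record is available as a MODS XML string using the standard MODS version 3 namespace.

```python
import xml.etree.ElementTree as ET

# Standard MODS v3 namespace; the "mods" prefix is just a local convention.
MODS_URI = "http://www.loc.gov/mods/v3"
NS = {"mods": MODS_URI}


def replace_element_text(xml_text, path, old, new):
    """Replace the text of every element at `path` whose value equals `old`.

    Returns the updated XML string and the number of elements changed.
    Matching is on the whole (whitespace-trimmed) value, so other fields
    that merely contain `old` as a substring are left alone.
    """
    ET.register_namespace("mods", MODS_URI)  # keep the mods: prefix on output
    root = ET.fromstring(xml_text)
    changed = 0
    for element in root.findall(path, NS):
        if element.text is not None and element.text.strip() == old:
            element.text = new
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Hypothetical record, invented for illustration only.
SAMPLE = (
    '<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">'
    "<mods:titleInfo><mods:title>Untitled photo</mods:title></mods:titleInfo>"
    "<mods:genre>photo</mods:genre>"
    "</mods:mods>"
)

updated, changed = replace_element_text(
    SAMPLE, ".//mods:genre", "photo", "Photographs"
)
print(changed)
print(updated)
```

Restricting the replacement to one element path is the design choice that matters here: a plain text find/replace across the whole file would also have altered the title in the sample record.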

Here are a few of the collections that posed interesting challenges for me:

Example record of 3D object

Temporary Memorial Collection – Digitized or Born Digital?

This collection contains images of posters and 3D objects that were set up in Copley Square after the bombings and then sent to the Boston City Archives. The issue here was whether to consider the objects as digitized (a digital representation of a memorial shrine) or born digital (a digital photograph of a memorial shrine). To a non-cataloger this distinction might seem trivial, but it determines whether the object is collocated with other photographs or with other 3D objects and posters when browsing. Ultimately, with the digital provenance of this collection still uncertain, I decided to consider all such objects as digitized in order to encourage expanded browsing by genre. This experience taught me that cataloging decisions are not always cut and dried and sometimes come down to individual judgment.

Letters to the City of Boston – Batch Metadata Work

This is by far the largest collection, accounting for over half of the total objects. It contains letters sent from around the world to Boston City Hall following the bombings. Working with this collection required a different strategy than with the smaller ones. Rather than looking through each record meticulously, I looked at each element (e.g. title, creator name, creation date) across all 4,000 records. This allowed me to find erroneous values and then standardize the information. One accomplishment that I’m proud of was the creation and standardization of creator names so that users can search by the name of the school or institution that sent the letters.
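The element-by-element approach can be sketched as a simple tally: pull one element out of every record, count the distinct values, and look at the rare ones first. The Python below is an illustration under stated assumptions rather than the method used on the project. The function names and the school names in the sample data are hypothetical, and creator names are assumed to sit in the MODS name/namePart element.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard MODS v3 namespace.
NS = {"mods": "http://www.loc.gov/mods/v3"}


def tally_element(records, path):
    """Count how often each distinct value of one element occurs across records."""
    counts = Counter()
    for xml_text in records:
        root = ET.fromstring(xml_text)
        for element in root.findall(path, NS):
            counts[(element.text or "").strip()] += 1
    return counts


def make_record(creator):
    """Build a tiny MODS record with one creator name (illustration only)."""
    return (
        '<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">'
        f"<mods:name><mods:namePart>{creator}</mods:namePart></mods:name>"
        "</mods:mods>"
    )


# Hypothetical creator names, invented to show one inconsistent form.
SAMPLE_RECORDS = [
    make_record("Lincoln Elementary School"),
    make_record("Lincoln Elementary"),
    make_record("Lincoln Elementary School"),
]

creator_counts = tally_element(SAMPLE_RECORDS, ".//mods:name/mods:namePart")

# List values from rarest to most common, since one-off forms are
# often the ones that need standardizing.
for value, count in sorted(creator_counts.items(), key=lambda item: item[1]):
    print(count, value)
```

Sorting from rarest to most common is a convenience, not a rule: a value that appears once may be a typo or variant form, but it may equally be a legitimate one-time sender, so each still needs a human look.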

Example of typical keyword use in records containing keywords

Various Collections – User Tags and Keywords

One of the main issues with the Our Marathon metadata from the outset was the abundance of uncontrolled keywords used in the records. Catalogers generally discourage the use of keywords, which can often appear in different spellings/forms, in favor of controlled vocabularies such as LCSH. Looking through the collections I also often questioned the actual applicability of keywords to the objects they were describing. However, the goal of this project is to preserve the information that the public submitted. I learned to accept that the keywords, although perhaps not ideal retrieval devices, are a record of how people defined the objects they submitted to Our Marathon and deserve to be preserved.

MODS note fields for dealing with submitted text and associated metadata

Metadata Records as Information Artifacts

Working on this project helped me learn a lot about how to work with large batches of metadata records, but it also taught me about the importance of those records themselves. In the case of Our Marathon, many of the metadata records were derived from information directly submitted by the public. The information they submitted needs to be preserved just as much as the digital objects it describes, especially in the context of creating a public history of an event. In some cases, the metadata records in Our Marathon even contain people’s stories of their experiences, making them quite literally documents in their own right.

Metadata records are information artifacts. They sit side by side with objects and help us find and contextualize them. They are also living information artifacts. Metadata often needs to be migrated to new systems, and this process is rarely neutral. This project taught me that care needs to be taken to document the provenance of metadata for the future. Furthermore, the process of creating metadata is just as important for contextualizing information as the work of a reference librarian. Cataloging work is important, whether it comes from a professional with an MLIS or a member of the general public. I realized that I’m already a cataloger, as is anyone who has a hand in the description of objects.