Metadata Research Focuses: Archival Description

Brief Overview
Encoded Archival Description (EAD) and Linked Data

Our research process includes two major steps. In the first part, we have been examining archival descriptive and authority standards to see what information may be useful as linked data. We are also looking at finding aid and authority exemplars, which we pulled from major sources of archival finding aids such as the OhioLink EAD repository. After reviewing standards and sample finding aids and authority records, we identify common elements among them and consider the possibilities for linking archival information to linked open data properties. We are looking at both major access points and potential hidden access points that cannot currently be linked easily because of how they are encoded.

In the second part of the research, we have identified datasets that have the potential to be relevant to users of archival collections. These datasets are analyzed to discover which ontologies, metadata schemas, and/or application profiles are used. We also look at sample records and any documentation that we can find, using information found on the Data Hub. Then, we attempt to crosswalk across datasets to see how well a particular type of information will match. We use categories common to ontology alignment such as equivalent, close match, broad match, narrow match, and related. Through this process, we identify the major classes and properties useful to archives data.
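The alignment categories named above can be recorded in a simple data structure. The following Python sketch is illustrative only: the dataset prefixes, term names, and sample rows are hypothetical and are not drawn from the project's actual crosswalks.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Match(Enum):
    """Alignment categories named in the text."""
    EQUIVALENT = "equivalent"
    CLOSE = "close match"
    BROAD = "broad match"
    NARROW = "narrow match"
    RELATED = "related"


@dataclass(frozen=True)
class Alignment:
    source_term: str  # term in the first dataset or standard
    target_term: str  # candidate counterpart in the second
    match: Match      # how well the two terms correspond


# Hypothetical sample rows; real rows would come from the analysis itself.
crosswalk = [
    Alignment("datasetA:creator", "datasetB:author", Match.CLOSE),
    Alignment("datasetA:title", "datasetB:title", Match.EQUIVALENT),
    Alignment("datasetA:place", "datasetB:location", Match.BROAD),
]


def summarize(rows):
    """Count how many alignments fall into each category."""
    return Counter(row.match for row in rows)


summary = summarize(crosswalk)
```

A tally of this kind gives a rough picture of how closely two datasets align before any data is actually linked.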

This is one example of a mapping table that is generated as a result of the analysis we’re doing with the standards, ontologies, and datasets. In this mapping of EAD to FOAF, which focuses on personal, family, and corporate body names, we can make a couple of general statements about how well these two data models relate. First, at the class level, the EAD data elements for personal, family, and corporate body names are narrower than the closest FOAF class equivalent, except in the case of Person. At the property level, however, FOAF tends to be narrower in scope than the closest equivalent in EAD. Matching names found in EAD records to the property foaf:name would provide a basic level of interoperability, but would not differentiate among first, last, and nicknames. Thus, EAD and FOAF are not completely equal in the granularity of their data models. Another observation that we have made is that there are several places in the EAD record where names can be found but are not, and cannot be, tagged as such, for example in the ScopeContent and BiogHist tags. The potential to link additional names to external datasets is therefore untapped at this time, but revisers of the standard may wish to consider adding such functionality to future versions of EAD.
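The granularity point can be illustrated with a minimal Python sketch. It assumes a name arrives from a finding aid as a single undifferentiated string; the person URI and the name itself are hypothetical. Only the FOAF namespace and the foaf:name property are taken from the FOAF vocabulary.

```python
FOAF = "http://xmlns.com/foaf/0.1/"


def name_to_triples(subject_uri, name_string):
    """Map a whole name string to a single foaf:name statement.

    The string is kept intact (only whitespace is normalized), so first,
    last, and nicknames are not distinguished from one another.
    """
    cleaned = " ".join(name_string.split())
    return [(subject_uri, FOAF + "name", cleaned)]


# Hypothetical person URI and name, for illustration only.
triples = name_to_triples("http://example.org/person/1", "  Doe,  Jane ")
```

Splitting the string into finer-grained FOAF properties would require parsing rules that the undifferentiated name string does not reliably support, which is why this sketch stops at foaf:name.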

Archival description and linked data: a preliminary study of opportunities and implementation challenges

The archival universe is rich and varied, and descriptive practice for archival materials reflects that diversity. While finding aids and inventories continue to be a primary descriptive genre for archival material, in the last few decades archival description has evolved to establish new pathways into records and collections. These routes to archival materials date from the first surrogates that appeared in electronic catalogs in the form of machine-readable cataloging (MARC) records in the 1980s, to the distribution of finding aids online via HTML and Encoded Archival Description (EAD) in the 1990s. Inclusion of archival objects in institutional repositories further expanded access to materials, and the more recent application of Web 2.0 technologies such as wikis, social tagging, and various other annotation methods provided users with the tools to enhance archival description further. Throughout this history, the development of descriptive practice reveals the eagerness to explore, assess, and incorporate new technologies to improve documentation, search, retrieval, and use of archival materials.

Most recently, the archival community, concurrently with other cultural heritage communities, has begun to experiment with the innovations and potentials of the Semantic Web. Tim Berners-Lee has defined the Semantic Web, also sometimes referred to as Web 3.0, as an extension of the current web, "in which information is given well-defined meaning, better enabling computers and people to work in cooperation" (2001). The Semantic Web, although still largely unrealized in practice, relies on semantically structured knowledge in the form of machine-readable metadata in order to enable searching by automated agents. By semantically defining information in Web-accessible documents, information producers and users will be able to connect seamlessly to related information found elsewhere on the Web. This concept has come to be known as Linked Data. It should be noted that while the terms Linked Data and the Semantic Web are closely related, they are not synonymous. The Semantic Web is a more general term to describe the architecture that allows Linked Data to be represented, connected, shared, searched, and combined. Tim Davies has constructed a helpful diagram that explains the architecture underlying Linked Data (Davies 2011).

Whereas hyperlinks in the current Web connect documents, links in the Semantic Web actually connect the data contained within the documents and, as Coyle points out, they have specific meanings (Coyle 2012). By using meaningful links, the Semantic Web thus enables a global information space where information from disparate sources becomes more accessible. Semantically defined associations are at the heart of this information network, allowing searchers to gather many types of data about the topic of a search and display them in a unified way.

DBpedia, a semantically structured version of the data found on Wikipedia, provides an excellent example of an information service that provides easy access to Linked Open Data (LOD) about persons, organizations, places, concepts, etc.1 Developed by the Free University of Berlin, the University of Leipzig, and OpenLink Software, DBpedia defines Linked Data uniform resource identifiers (URIs) for millions of concepts. Many data providers link from their data sets to DBpedia. In Cyganiak's visualization of the LOD Cloud, DBpedia is one of the central interlinking hubs of datasets in the Linking Open Data Group (Cyganiak 2011).

The Google Knowledge Graph provides another example of a Linked Data application. It displays data about things, people, and places drawn from diverse sources such as Wikipedia, Freebase (a database owned and funded by Google, with an approach similar to that of DBpedia), other subject-specific sources, and Google’s own data stores (Singhal 2012). In the example depicted in Fig. 1, a search on "Paris" results in an infobox containing multiple types of information about the city, including a map, geographic coordinates, population, area, weather, local time, and related points of interest close by.

The underlying technology used to create the graph gathers each morsel of data by querying relevant databases using semantic search techniques, retrieving all relevant information that matches the semantically defined search criteria, and presenting it in a unified display. Linked Data applications such as the one illustrated in Fig. 1 rely on several standards and tools in order to accomplish this feat of garnering and gleaning relevant info-nuggets.

In order to make data accessible as Linked Data, institutions must follow four simple rules for data preparation and deployment, as first laid out by Berners-Lee:

1. Use URIs as names of things.

2. Use HTTP URIs so that people can look up those names.

3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).

4. Include links to other URIs so that they can discover more things (Heath & Bizer 2011, Chapter 2).

As stated above, each entity being described must have a unique identifier and be addressable using a URI. Once converted to URIs, entities can be related to one another semantically. URIs are important to use when defining entities because they have more precision than natural language, and a URI for an entity will be the same, no matter the language used (Coyle 2012, p. 12). The URI also gives a precise location for information about each entity in the form of a uniform resource locator (URL), i.e., a Hypertext Transfer Protocol (HTTP) address.
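The first two rules, and the point about language independence, can be sketched in a few lines of Python. This is an illustration under stated assumptions: the helper name `is_http_uri` is hypothetical, the check is purely syntactic (nothing is fetched from the Web), and the labels are typed in by hand rather than retrieved from DBpedia.

```python
from urllib.parse import urlparse


def is_http_uri(identifier):
    """Return True if the identifier is syntactically an HTTP(S) URI."""
    parts = urlparse(identifier)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# One URI names the entity; the human-readable labels vary by language.
# Labels are hand-entered for illustration, not looked up.
labels = {
    "http://dbpedia.org/resource/Paris": {
        "en": "Paris",
        "it": "Parigi",
        "es": "París",
    },
}

checks = {
    "http://dbpedia.org/resource/Paris": is_http_uri("http://dbpedia.org/resource/Paris"),
    "Paris": is_http_uri("Paris"),
}
```

The plain string "Paris" fails the check because it is a natural-language name rather than an identifier that can be looked up.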

RDF, or the Resource Description Framework, is the language by which the structure of Linked Data is expressed and is used to define the triple, i.e., the basic unit of expression of Linked Data. As the name triple implies, each semantic unit consists of three components: subject → predicate → object, where the subject is the thing that one is referencing, the object is another entity with some kind of relationship to the subject, and the predicate defines that relationship. A generic example of a triple in the context of archival description might be as follows:
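As a minimal sketch, a triple of this kind can be written in Python as a three-part tuple and serialized as one N-Triples statement. The collection, the creator, and all example.org URIs are hypothetical; the predicate is the Dublin Core Terms `creator` property.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # the thing being described
    predicate: str  # the relationship
    object: str     # the entity related to the subject


# Hypothetical statement: "the Doe papers were created by Jane Doe".
papers_created_by = Triple(
    subject="http://example.org/collection/doe-papers",
    predicate="http://purl.org/dc/terms/creator",
    object="http://example.org/person/jane-doe",
)


def to_ntriples(t):
    """Serialize a triple whose three parts are all URIs as one N-Triples line."""
    return f"<{t.subject}> <{t.predicate}> <{t.object}> ."


line = to_ntriples(papers_created_by)
```

Because all three parts are URIs, each of them can in turn be the subject or object of further statements, which is how individual triples join into a graph.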

While RDF allows for the expression of semantic relationships among things, several other standards extend the functionality of Linked Data. First, the Simple Knowledge Organization System (SKOS) provides a structure for encoding thesauri and controlled lists of terms, specifically defining the broader, narrower, and related relationships that many are familiar with from the ISO 25964 thesauri construction and interoperability standard.2 By using SKOS to define relationships among Linked Data entities in different data sets, an institution can map terms and create alignments among data sets.
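To make the SKOS relationships concrete, the following Python sketch stores a few statements as plain tuples. The two vocabularies and their concepts are hypothetical; the property names (`broader`, `related`, `closeMatch`) come from the SKOS namespace.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
LETTERS = "http://example.org/vocabA/letters"

# Hypothetical concepts from two made-up vocabularies, vocabA and vocabB.
statements = [
    # Hierarchical and associative links within one vocabulary.
    (LETTERS, SKOS + "broader", "http://example.org/vocabA/correspondence"),
    (LETTERS, SKOS + "related", "http://example.org/vocabA/diaries"),
    # A mapping link that aligns the concept with another vocabulary.
    (LETTERS, SKOS + "closeMatch", "http://example.org/vocabB/personal-letters"),
]


def objects(subject, predicate, data):
    """Return every object linked to the subject by the given predicate."""
    return [o for s, p, o in data if s == subject and p == predicate]


broader_terms = objects(LETTERS, SKOS + "broader", statements)
aligned_terms = objects(LETTERS, SKOS + "closeMatch", statements)
```

The first two statements describe structure inside a single thesaurus, while the `closeMatch` statement is the kind of link used to align terms across data sets.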

Second, the Web Ontology Language (OWL) is a standard that extends RDF to allow for sophisticated ontology specifications.3 While not all Linked Data must or should be expressed using OWL, for those who wish to develop formal ontologies for expression of knowledge domains, it is a valuable Semantic Web-compliant tool.

The third standard that is helpful for launching data on the Semantic Web is the standard query language called SPARQL.4 This tool allows one to query data that is in the form of RDF triples, using a format akin to SQL. While users can query SPARQL endpoints directly, it is more likely that these users will create queries in more familiar ways (such as through Google searches, or within institutional catalogs), and those queries will be translated into SPARQL searches automatically in order to return relevant data from various sources.
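A SPARQL query is built from triple patterns in which some positions are variables. The sketch below first shows such a query as text (it is not executed here), then imitates its single triple pattern in plain Python over an in-memory list. It is not a SPARQL engine, and the data and example.org URIs are hypothetical.

```python
# The query as it might be sent to a SPARQL endpoint (shown for reference only).
QUERY = """
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?collection
WHERE { ?collection dcterms:creator <http://example.org/person/jane-doe> . }
"""

DCTERMS_CREATOR = "http://purl.org/dc/terms/creator"
JANE = "http://example.org/person/jane-doe"

# Hypothetical triples standing in for a published data set.
data = [
    ("http://example.org/collection/doe-papers", DCTERMS_CREATOR, JANE),
    ("http://example.org/collection/roe-papers", DCTERMS_CREATOR,
     "http://example.org/person/richard-roe"),
]


def match(pattern, triples):
    """Return the triples that fit a pattern; None plays the role of a variable."""
    return [
        t for t in triples
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


# Equivalent of the WHERE clause above: the subject is the unknown.
results = [s for s, _, _ in match((None, DCTERMS_CREATOR, JANE), data)]
```

A real endpoint would evaluate the query text directly; the Python matcher only shows what it means for a pattern with one variable to select matching triples.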

This work was supported by a grant from IMLS. It is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.