Semantic Analysis: Research Sample #1

Automated Semantic Analysis of Semi-Structured Data
Research Sample 1: Archival Descriptions

As noted above, two research samples were used to analyze the access points supplied by the OpenCalais semantic analysis tool. The first sample includes 43 archival record groups from sixteen institutions, including university archives, government records archives, and manuscript/special collections repositories in various LAMs. Descriptive information found in the archival finding aids, such as creator histories and scope and content notes, as well as abstracts from these descriptions, was submitted to the OpenCalais open service to generate candidate access points. Extraction was fully automated: an in-house program retrieved the archival records and sent them to the semantic analysis service supported by Calais. The output, in JSON format, was then converted directly into a CSV file that could be viewed as a Microsoft Excel spreadsheet. The resulting database contained the following fields: Entity-type, Entity-name, Relevance-ratio, and File-source. Using the OpenRefine tool, the data were clustered automatically so that the researchers could then clean them up manually (e.g., merge synonyms and delete incorrect extractions). Figure 2 illustrates this multi-step process.

Figure 2: Illustration of the Process
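
The JSON-to-CSV conversion step described above can be sketched roughly as follows. This is a minimal illustration, not the in-house program used in the study. It assumes that each entry in the parsed JSON response is an object carrying `_typeGroup`, `_type`, `name`, and `relevance` keys; that layout, the function names, the file name, and the sample response are all assumptions made for the example. Only the four column names come from the text.

```python
import csv
import io
import json

# Column names as listed in the text.
FIELDS = ["Entity-type", "Entity-name", "Relevance-ratio", "File-source"]


def entity_rows(response, file_source):
    """Yield one CSV row (as a dict) per entity in a parsed JSON response.

    Assumption: entities are top-level objects whose "_typeGroup" is "entities".
    """
    for item in response.values():
        # Skip document metadata, social tags, and anything else that is not an entity.
        if not isinstance(item, dict) or item.get("_typeGroup") != "entities":
            continue
        yield {
            "Entity-type": item.get("_type", ""),
            "Entity-name": item.get("name", ""),
            "Relevance-ratio": item.get("relevance", ""),
            "File-source": file_source,
        }


def responses_to_csv(responses, out):
    """Write rows for a {file_source: response} mapping; return the number of rows."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    count = 0
    for file_source, response in responses.items():
        for row in entity_rows(response, file_source):
            writer.writerow(row)
            count += 1
    return count


# Hypothetical response for a single finding aid (invented for illustration).
sample = json.loads("""
{
  "doc": {"info": {"docId": "example"}},
  "http://example.org/entity/1": {"_typeGroup": "entities", "_type": "Person",
                                  "name": "Jane Doe", "relevance": 0.71},
  "http://example.org/entity/2": {"_typeGroup": "entities", "_type": "City",
                                  "name": "Springfield", "relevance": 0.32},
  "http://example.org/tag/1": {"_typeGroup": "socialTag", "name": "Local history"}
}
""")

buffer = io.StringIO()
row_count = responses_to_csv({"finding_aid_001.xml": sample}, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to a real file instead of an in-memory buffer only requires passing a file opened with `newline=""`, which is what the `csv` module expects. The resulting file can then be opened in a spreadsheet or loaded into OpenRefine for clustering and manual clean-up.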


The analysis resulted in dozens and, at times, hundreds of potential entities and social tags that could be used to provide additional points of entry to these archival records. These entities and tags correspond almost exclusively to the first two layers of subject analysis (description and identification). Entity-based terms are in general more common than topical terms; it is very rare to find any terms at the third level of analysis (interpretation) in descriptions of archival materials, due to their evidentiary nature (see Figures 3 and 4).

Figure 3: Personal, corporate, and geographic entities generated by semantic analysis of an archival finding aid


Figure 4: Event entities generated by semantic analysis of an archival finding aid


Evaluation according to Erwin Panofsky's three-layers theory

Background: Erwin Panofsky's three-layers theory has been widely used by researchers and practitioners examining subject access to images, particularly iconological themes found in the art of the Renaissance as well as art images in general (Panofsky, 1939; Shatford Layne, 1994; Klavans et al., 2014). The theory has also been extended to serve as the basis for subject analysis of all cultural objects, as suggested by the content standard Cataloging Cultural Objects: A Guide to Describing Cultural Works and Their Images (CCO) (Baca, 2006; Harpring, 2009). Panofsky (1939) summarized the three layers of object interpretation as (I) primary or natural subject matter; (II) secondary or conventional subject matter; and (III) intrinsic meaning or content. The layers are aligned with three aspects of interpretation: the act of interpretation, the equipment for interpretation, and the controlling principle of interpretation. As simplified by CCO, the three layers become description, identification, and interpretation. These layers are discussed further in the following section (see the Research Method section).

Entities correctly extracted via Calais analysis (at level I, description) included personal names (Person), corporate names (Company, Facility, Organization), geographic names (City, Continent, Country, Natural Feature, ProvinceOrState, Region), and events (Holiday, PoliticalEvent). Calais provides a relevance score for each identified entity, which may serve as a clue to the importance of that entity within the overall scope of the archival collection. Because description exhaustivity differs among institutions, it is difficult to predict exactly what cut-off relevance score a system should use to include an entity as an indexed term; even so, the relevance scores could be used to suggest possible indexing terms. LAMs may also choose to perform analysis and generate relevance scores only on particular parts of the finding aids (such as the creator history and the scope and content note) to improve the reliability of the scores.
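
As a rough illustration of how relevance scores might be used to suggest indexing terms, the sketch below keeps only entities at or above a cut-off and lists them from most to least relevant. The cut-off of 0.5 is arbitrary and is not a value recommended by the study; the function name and the sample rows are hypothetical. Only the column names follow the fields described earlier.

```python
def suggest_terms(rows, cutoff=0.5):
    """Return (name, type, relevance) tuples at or above the cutoff, highest first.

    `rows` are dicts keyed by the CSV column names; relevance values may be
    strings (as read from a CSV file) or numbers.
    """
    scored = [
        (float(r["Relevance-ratio"]), r["Entity-name"], r["Entity-type"])
        for r in rows
    ]
    kept = [s for s in scored if s[0] >= cutoff]
    kept.sort(key=lambda s: s[0], reverse=True)
    return [(name, etype, score) for score, name, etype in kept]


# Hypothetical rows for one finding aid (invented for illustration).
rows = [
    {"Entity-type": "Person", "Entity-name": "Jane Doe",
     "Relevance-ratio": "0.71", "File-source": "finding_aid_001.xml"},
    {"Entity-type": "City", "Entity-name": "Springfield",
     "Relevance-ratio": "0.32", "File-source": "finding_aid_001.xml"},
    {"Entity-type": "Organization", "Entity-name": "Example Historical Society",
     "Relevance-ratio": "0.55", "File-source": "finding_aid_001.xml"},
]

suggested = suggest_terms(rows, cutoff=0.5)
for name, etype, score in suggested:
    print(f"{score:.2f}  {etype:<12}  {name}")
```

Because a suitable cut-off is likely to differ among institutions, it is exposed as a parameter rather than fixed. An institution could equally present the full ranked list to an indexer and let the indexer decide where to stop.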

In addition to entities, Calais also generated many topical terms describing the subject matter of the records (at level II, identification); these topics were often found as social tags or as entities under the “IndustryTerm” or “Product” category (see Figure 5). These categories were the least reliable in terms of accuracy; the Calais analytic engine often incorrectly identified text strings from the finding aids as products or industry terms. Many of these errors can be attributed to the raw data fed to the engine: the entire finding aid was used, and this unedited text often included physical location information for the records and document formatting, both of which generated significant noise for the analysis engine to sort through. Targeted analysis of particular areas of the finding aids may result in better accuracy for topical analysis.

Figure 5: Topical terms, called “social tags,” generated from the semantic analysis of an archival finding aid


As a point of comparison to the automated analysis of the finding aids, the researchers also examined the controlled vocabulary topical terms and names assigned to the archival records. These terms and names are typically drawn from controlled vocabularies such as the Library of Congress Name Authority File (LCNAF), the Library of Congress Subject Headings (LCSH), and the Art and Architecture Thesaurus. As with the entities and social tags generated by Calais, the headings can be primarily categorized according to the first and second layers of analysis: 1) Description: personal, family, corporate, and geographic names (note that the first three types of names can also be encoded as records creators in addition to being subjects depicted in the records); and 2) Identification: topical terms (including occupations and functions represented in the records) and genre and form terms. The depth of subject analysis varies widely: while some archival record groups were assigned dozens of headings, others received a minimal number. Government records are often not assigned subject headings at all, while personal papers and special collections are more likely to have a sizable number of headings (at least five or six, and often many more).

As noted above, certain factors such as the size of archival collections, varying institutional practices, and different approaches to the indexing of different types of archival materials may influence the exhaustivity of subject analysis. Under these circumstances, it is difficult to propose that automated semantic analysis will always result in a more exhaustive or accurate list of terms. This study suggests, however, that it would be well worth the effort for institutions to experiment with semantic analysis methods as either an initial step to suggest key entities and topics, or as a final check to ensure that important concepts or entities have not been overlooked. For certain types of records, particularly those for which subject indexing is not common, semantic analysis may provide entry points to archival records that were not previously available. Such techniques will enhance subject analysis at the first two levels (description and identification), but are unlikely to be useful for interpretation of the material.
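
The "final check" use suggested above can be sketched as a comparison between extracted entity names and the headings already assigned to a record group. This is an illustrative sketch under stated assumptions, not a method used in the study. The matching key (lowercase, punctuation removed, purely numeric tokens such as life dates dropped, remaining words sorted) is a simplification in the spirit of key-collision clustering, and all names and sample values are hypothetical.

```python
import re
import unicodedata


def match_key(name):
    """Build a crude matching key so that 'Doe, Jane, 1900-1980' and 'Jane Doe' collide."""
    # Fold accented characters to their closest ASCII form.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    tokens = re.findall(r"[a-z0-9]+", ascii_name.lower())
    # Drop purely numeric tokens (e.g., life dates) and ignore word order.
    words = sorted({t for t in tokens if not t.isdigit()})
    return " ".join(words)


def compare(extracted, assigned):
    """Return (shared, only_extracted, only_assigned) lists of original names."""
    extracted_keys = {match_key(n): n for n in extracted}
    assigned_keys = {match_key(n): n for n in assigned}
    shared = sorted(extracted_keys[k] for k in extracted_keys.keys() & assigned_keys.keys())
    only_extracted = sorted(extracted_keys[k] for k in extracted_keys.keys() - assigned_keys.keys())
    only_assigned = sorted(assigned_keys[k] for k in assigned_keys.keys() - extracted_keys.keys())
    return shared, only_extracted, only_assigned


# Hypothetical values for one record group (invented for illustration).
extracted = ["Jane Doe", "Springfield", "Acme Corporation"]
assigned = ["Doe, Jane, 1900-1980", "Springfield", "Local history"]

shared, only_extracted, only_assigned = compare(extracted, assigned)
print("Already covered by headings:", shared)
print("Candidate additions:", only_extracted)
print("Assigned but not extracted:", only_assigned)
```

A key this simple will miss variant name forms and can merge distinct names that share the same words, so the "candidate additions" list is best treated as a prompt for human review rather than as a list of headings to add automatically.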



This work was supported by a grant from IMLS. It is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.