
Automated Semantic Analysis of Semi-Structured Data
The Research Question

The problem addressed by this study is the assessment of alternative approaches to generating subject access points for materials that are usually not made available through regular library cataloging routines. Subject access is critical for cross-institutional digital libraries, such as Europeana, which hold and provide access to a variety of information resources contributed by libraries, archives, and museums (LAMs). LAMs have invested enormous human resources in subject analysis. As the size and variety of accessible open resources grow exponentially, LAMs are recognizing that exhaustive traditional subject analysis is no longer practical or even possible. Yet without good-quality subject access, LAMs will find that users' search requests often cannot be satisfied. Limited subject access points are particularly critical for very large-scale, cross-institutional collections.

Computerized subject analysis may prove promising for improving subject access to large heterogeneous collections. For example, advances in natural language processing and semantic annotation have resulted in enhanced, software-suggested access points (both named entities and topics) and even relations among the contents of a given resource. Figure 1 is a screenshot showing manual and automatic subject analysis results side by side.


Figure 1: Subject headings and keywords provided by the original catalog record (left), and topics, tags, and entities provided by an automatic semantic analysis tool (right).

On the left is the metadata of an original doctoral dissertation, including six keywords suggested by the dissertation author when submitting to the electronic thesis and dissertation (ETD) repository and two standardized subject headings assigned by a library cataloger during re-processing. On the right is roughly one third of the result returned after running the dissertation's abstract through the semantic analysis tool OpenCalais (free version). The online software also displays the relevance ranking and count for each suggested tag (which Calais calls a "social tag"). The processes of obtaining the original text, running it through the analysis, converting the resulting output into a database, cleaning up the data, and reconciliation can all be automated via a set of programs, as sketched below. Portions requiring judgment (e.g., merging synonyms, selecting preferred labels, or judging the appropriateness of a tag or entity name) would need either human assessment or further automatic processing.
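To make this concrete, the following is a minimal sketch of such a pipeline in Python. The endpoint URL, authentication header, and response field names are placeholders standing in for whatever semantic analysis service is used; they are not the actual OpenCalais API. Only the overall shape of the workflow (submit text, parse the suggested tags, store them for later clean-up and reconciliation) reflects the process described above.

    import json
    import sqlite3
    import urllib.request

    # Placeholder endpoint and token; substitute the real service's
    # URL and authentication scheme (these are not OpenCalais values).
    API_URL = "https://example.org/semantic-analysis"
    API_TOKEN = "YOUR-TOKEN"

    def analyze(text):
        """Submit a block of text and return the parsed JSON response."""
        request = urllib.request.Request(
            API_URL,
            data=text.encode("utf-8"),
            headers={"Authorization": API_TOKEN,
                     "Content-Type": "text/plain"},
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    def store(result, db_path="access_points.db"):
        """Flatten suggested tags into a table for clean-up and
        reconciliation. The response field names are assumptions."""
        con = sqlite3.connect(db_path)
        con.execute("""CREATE TABLE IF NOT EXISTS candidates
                       (label TEXT, type TEXT, relevance REAL)""")
        for item in result.get("tags", []):
            con.execute("INSERT INTO candidates VALUES (?, ?, ?)",
                        (item["label"], item.get("type", "social tag"),
                         item.get("relevance", 0.0)))
        con.commit()
        con.close()

    if __name__ == "__main__":
        abstract = open("abstract.txt", encoding="utf-8").read()
        store(analyze(abstract))

The stored candidates would then feed the judgment steps (synonym merging, preferred-label selection) noted above, whether performed by a human or by further automatic processing.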

This sounds very promising. But what kinds of "subject" matter can such tools identify? Can they assist in subject analysis and indexing, or even serve as a primary means of enhancing subject access to existing resources?

Review of Related Literature

The Cranfield project is considered the first systematic evaluation of information retrieval systems. Led by Cyril Cleverdon, it lasted ten years (from 1957) and focused on the effectiveness of different indexing languages. The project set the stage for further research in information retrieval and established subject access as the central topic (Cleverdon 1960). A review of the literature shows a long sequence of papers on various aspects of subject access, emphasizing its importance and the need to support it in bibliographic information systems in addition to known-item searching. Marcia Bates (2003) points out the problems end-users have when searching on a topic and proposes an entry vocabulary as a complement to controlled vocabularies, but also encourages the use of automated methods: "The second question concerns the use of available software for generating access terms. Anything that can be well done automatically should be" (Bates 2003, 39).

The last ten years have seen heated discussion of whether controlled vocabularies, subject headings in particular, are still worth the investment. Many researchers and practitioners argue that keyword searching or user-generated tags make controlled vocabularies obsolete, inefficient, and unnecessary. Yet Gross and Taylor (2005) found that over one third of records retrieved through keyword searches are retrieved because the keywords appear in subject headings. Abandoning controlled vocabularies would therefore seriously degrade keyword searching, the predominant way users now search for information. William Badke (2012) sees the solution in user education, particularly in the academic environment, and concludes rather pessimistically: "If we fail to advocate and if we do not restore the prominence of such vocabularies, they will disappear because of disuse and a negative cost-benefit analysis."

The growing use of user-generated tags in information systems has spurred numerous studies of tags' efficacy in improving access to materials (Rolla 2009; Klavans, LaPlante, and Golbeck 2014). The first study, which compared LibraryThing tags and LCSH, concludes that both have strengths and weaknesses, and its author suggests that libraries combine the two in supporting their users. The second study analyzes the nature of tags according to two facets based on Panofsky (1939) and Shatford (1986): subject matter (who, what, where, and when) and specificity (general, specific, abstract). While the researchers found that their test collection of digital art images was most likely to generate generic tags describing people or things found in the images, they also caution that this is not a universal finding for how people tag, and that "tag sets largely depend on the type of collection and the needs of the user" (Klavans, LaPlante, and Golbeck 2014, 10).

Recently reported applications of automatic or machine-assisted semantic analysis in LAM collections, especially collections outside routine cataloging coverage or analytical-level subject indexing, have focused on semantic annotation, entity extraction, and relationship description. The underlying theories and methods can be traced to the fields of automatic summarization and semantic analysis, which have engaged many linguistics researchers (Mani 2001). One theory of text coherence is Rhetorical Structure Theory (RST), which proposed four rhetorical relations: Circumstance, Motivation, Purpose, and Solutionhood. Among these, Circumstance means that the satellite sets a temporal, spatial, or situational framework within which the reader is intended to interpret the situation presented in the nuclear text span (Mann and Thompson 1988). Robert Allen (2013a, 2013b), on the other hand, argues that RST does not seem well suited to large volumes of complex texts. Allen's team proposes that the event-entity fabric be overlaid with additional structures to represent causation, generalization, explanation, argumentation, and evidence. Using rich content such as historical texts as the case, Allen's two articles suggest that schematic models, which describe the content of documents rather than producing descriptions about the documents, are the key to a new generation of descriptive systems.

For entity extraction, pioneering works include BBC's automated interlinking of speech radio archives (Raimond and Lowis 2012) and experiments with entity extraction for BBC News (Tarling and Shearer 2013). Whether used to embed annotations inside the text (e.g., the Brat and Pundit annotation tools) or to extract entities out of the text (e.g., OpenCalais), these tools "type" the entities according to classes or categories that are pre-defined or defined during the analytic process. Combined with the ontologies, conceptual and data models, and metadata schemas developed in related domains and applicable to processing LAM materials, they hold great potential for the subject analysis workflow in LAMs. Examples include using Calais to enhance access to oral history materials (Perkins and Yoose 2011) and museum online collections (Catone 2008).
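To illustrate what "typing" an entity means in practice, the following minimal sketch shows the kind of typed output such tools return. The class names, fields, and scores are illustrative placeholders, not the actual output schema of OpenCalais, Brat, or Pundit.

    from dataclasses import dataclass

    @dataclass
    class Entity:
        """One extracted access-point candidate with its assigned type."""
        surface_form: str   # string as it appears in the source text
        entity_type: str    # class from the tool's category scheme
        relevance: float    # tool-assigned relevance score (0..1)

    # Illustrative output for a sentence such as
    # "Jane Goodall studied chimpanzees in Gombe Stream National Park."
    extracted = [
        Entity("Jane Goodall", "Person", 0.93),
        Entity("chimpanzees", "Species", 0.71),
        Entity("Gombe Stream National Park", "Place", 0.88),
    ]

    # Typed entities can then be reconciled against a controlled
    # vocabulary or ontology class before becoming access points.
    for e in extracted:
        print(f"{e.entity_type}: {e.surface_form} ({e.relevance:.2f})")

The assigned type is what allows the candidate to be matched against an ontology class or metadata element, which is where the combination with LAM conceptual models becomes useful.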

Erwin Panofsky's three-layer theory has been widely used by researchers and practitioners examining subject access to images, particularly iconological themes found in Renaissance art as well as art images in general (Panofsky 1939; Shatford Layne 1994; Klavans, LaPlante, and Golbeck 2014). The theory has also been extended to serve as the basis for subject analysis of all cultural objects, as suggested by the content standard Cataloging Cultural Objects: A Guide to Describing Cultural Works and Their Images (CCO) (Baca et al. 2006; Harpring 2009). Panofsky (1939) summarized the three layers of object interpretation as (I) primary or natural subject matter; (II) secondary or conventional subject matter; and (III) intrinsic meaning or content. The layers are aligned with three types of interpretation: the act of, the equipment for, and the controlling principle of interpretation. Simplified by CCO, the three layers become description, identification, and interpretation. These are discussed further in the following section.
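As a concrete illustration, the CCO simplification can be read as a small classification scheme for access-point candidates. The example terms below are our own illustrative choices for a hypothetical Renaissance painting of the Last Supper; they are not drawn from CCO or from any catalog record.

    # Panofsky's three layers, as simplified by CCO, applied to a
    # hypothetical painting of the Last Supper. All terms are
    # illustrative placeholders.
    PANOFSKY_LAYERS = {
        "description": [        # primary or natural subject matter
            "thirteen men", "table", "bread", "wine",
        ],
        "identification": [     # secondary or conventional subject matter
            "Last Supper", "Jesus", "apostles",
        ],
        "interpretation": [     # intrinsic meaning or content
            "betrayal", "sacrifice", "Christian iconography",
        ],
    }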

Conclusions and Future Research

The paper reports on an analysis of the resulting access point candidates based on Panofsky's three layers, which indicates that these subject access points fall at the "description" and "identification" levels rather than the "interpretation" level. To a certain extent, some results are also derived by inferencing (e.g., the generalized terms). The usefulness of access points at each level of analysis for users is summarized in Figure 8 below.

Figure 8: Summary of the usefulness of access point candidates based on the two sample results.


Since we are particularly focused on large heterogeneous digital libraries, it would be interesting to analyze typical user queries in such settings. In a future study we could classify user needs according to the three layers (or substitute "inferencing" for "interpretation") and thus understand their nature. This knowledge would help us predict the usefulness of existing semantic analysis tools.



This work was supported by a grant from IMLS. It is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.