Semantic Analysis: Research Sample #2

Automated Semantic Analysis of Semi-Structured Data
Research Sample 2: Philosophy Theses

In contrast with the methods used for the archival data sample, most procedures for the second sample were manual. The sample contains 44 philosophy theses: a selected sub-sample (22) from KentLINK and a random sub-sample (22) from OhioLINK. Abstracts, titles, keywords, and introduction paragraphs were submitted to OpenCalais separately to obtain the results. All of the candidate terms were counted by category: Agent Names, Geographic Names, Corporate Names, and Topic Terms. Each term was then manually validated to determine (1) its relevance to the thesis, (2) its type (e.g., named entity, tag, or general heading), and (3) its availability in LCNAF, LCSH, Wikipedia (as an entry), and the Stanford Encyclopedia of Philosophy.

Figure 7: Topical terms, called “social tags,” generated from the semantic analysis of an abstract of a philosophy thesis.

Evaluation according to Erwin Panofsky's three-layers theory

Background: Erwin Panofsky's three-layers theory has been widely used by researchers and practitioners examining subject access to images, particularly iconological themes found in the art of the Renaissance as well as art images in general (Panofsky, 1939; Shatford Layne, 1994; Klavans et al., 2014). The theory has also been extended as the basis for subject analysis of all cultural objects, as suggested by the content standard Cataloging Cultural Objects: A Guide to Describing Cultural Works and Their Images (CCO) (Baca, 2006; Harpring, 2009). Panofsky (1939) summarized the three layers of object interpretation as (I) primary or natural subject matter; (II) secondary or conventional subject matter; and (III) intrinsic meaning or content. The layers are aligned with three aspects of interpretation: the act of interpretation, the equipment for interpretation, and the controlling principle of interpretation. As simplified by CCO, the three layers become description, identification, and interpretation. These are discussed further in the following section. (Refer to Research Method section.)

Using the three layers as the framework, the research found that the tags did very well at level I, “description,” and adequately at level II, “identification.” In this part of the research, semantic analysis based on the abstracts generated more successful tags than analysis based on the titles. Among the tags generated by the software, entity names that were missed in the Entity section (single names such as Plato and Aristotle, or instances where the first name was not included) were often correctly extracted into the tags section. Major concepts were correctly identified in most cases. However, the software often over-generalized the subjects by assigning very general terms (e.g., “philosophy” for almost every philosophy thesis) as well as some terms that were unrelated to the subject of the thesis. This behavior differs from both “description” and “identification” and seems more akin to “inferencing.” Of the average of 9 tags per abstract in the KentLINK sub-sample, an average of 1.64 were overly broad topical terms and 3.45 were unrelated topical terms (slightly more than 1/3). The OpenCalais tags for the OhioLINK sub-sample showed similar results.
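The proportions behind these averages can be checked with a short calculation. The following is a minimal sketch in Python, not part of the study's tooling: the function name `tag_shares` is hypothetical, and the "remaining" share assumes that the overly broad and unrelated categories do not overlap, which the text does not state explicitly.

```python
# Averages reported above for the KentLINK sub-sample (tags per abstract).
AVG_TAGS = 9.0
AVG_OVERLY_BROAD = 1.64
AVG_UNRELATED = 3.45


def tag_shares(total, overly_broad, unrelated):
    """Return each category's share of the average tag count.

    Assumption (not stated in the source): the two problem categories
    are disjoint, so the rest of the tags form a third category.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    remaining = total - overly_broad - unrelated
    return {
        "overly_broad": overly_broad / total,
        "unrelated": unrelated / total,
        "remaining": remaining / total,
    }


shares = tag_shares(AVG_TAGS, AVG_OVERLY_BROAD, AVG_UNRELATED)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

Under that assumption, roughly 18% of the tags per abstract were overly broad and roughly 38% were unrelated, which is consistent with the "slightly more than 1/3" figure above.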

The tags that could be categorized as “inferencing” results seemed less valid according to the best practices of cataloging and subject indexing. The overly broad topic terms (e.g., philosophy, knowledge, science) are not wrong, but their value for subject access is questionable. The promising news is that, among the topical terms (including named entities used as topics), LCSH together with LCNAF could closely match about 75% (the degree of match was recorded as closeMatch, as opposed to broadMatch, narrowMatch, or noMatch), and DBpedia matched almost 98% at the closeMatch degree for both sub-samples. Through these vocabulary sources, the subject access points hold great potential to become linking points to the Linked Data datasets that use DBpedia and LC vocabulary URIs as their basis.
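A match rate of this kind can be derived by tallying the degree assigned to each term. The sketch below is illustrative only: the function `match_rates` and the sample judgments are hypothetical and do not reproduce the study's data or its actual procedure.

```python
from collections import Counter

# The four degrees of match named in the text.
DEGREES = ("closeMatch", "broadMatch", "narrowMatch", "noMatch")


def match_rates(judgments):
    """Return the share of terms assigned to each degree of match."""
    counts = Counter(judgments)
    unknown = set(counts) - set(DEGREES)
    if unknown:
        raise ValueError(f"unknown degree(s): {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        return {degree: 0.0 for degree in DEGREES}
    return {degree: counts[degree] / total for degree in DEGREES}


# Hypothetical judgments for four terms checked against one vocabulary.
sample = ["closeMatch", "closeMatch", "closeMatch", "broadMatch"]
rates = match_rates(sample)
print(f"closeMatch rate: {rates['closeMatch']:.0%}")
```

Running the same tally once per vocabulary source (for example, LCSH with LCNAF, and DBpedia) would give figures comparable to the percentages reported above.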

This work was supported by a grant from IMLS. It is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.