The Semantic Analysis Method (SAM) Project aims to develop an open source tool for identifying and analyzing unstructured descriptions of archives and special collections materials to generate potential access points suitable for linked data applications. This report describes the development of the SAM tool, a software application that uses the semantic analysis engine Open Calais to linguistically process archival finding aids and generate potential metadata (access points) through entity extraction. The entities derived from the analysis are parsed and saved in comma-separated values (CSV) format, and can then be imported into a data cleanup application such as OpenRefine. The tool serves as a bridging application for converting valuable, unstructured information found in archival descriptions into usable, semantically defined access points. (Source code available at: https://github.com/sammysemantics/SAM)
The research project to develop and test the SAM tool proceeded in four stages: 1) the initial identification of the problem and exploration of potential solutions; 2) the development of a software program to automate the entity extraction process and parse the resulting data into a database (the SAM tool); 3) further refinement of the SAM tool to modularize each task and to develop a user-friendly interface that archivists without extensive technical or programming experience could easily adopt; 4) exploration of the OpenRefine data cleanup tool to improve the quality of the resulting dataset and to establish the limitations of linking entities from the dataset to outside data sources.
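The extraction-to-CSV step described above can be illustrated with a short sketch. This is not the SAM tool's actual code. It assumes a response dictionary loosely modeled on the JSON that entity extraction services such as Open Calais return; the field names (`_typeGroup`, `_type`, `name`, `relevance`), the sample values, and the function names `extract_entities` and `entities_to_csv` are all hypothetical and chosen only for illustration. No network call is made.

```python
import csv
import io

# Hypothetical sample shaped like an entity-extraction response.
# Field names and values are assumptions for illustration only.
SAMPLE_RESPONSE = {
    "doc": {"info": {"docId": "finding-aid-001"}},
    "http://example.org/entity/1": {
        "_typeGroup": "entities",
        "_type": "Person",
        "name": "Example Person",
        "relevance": 0.8,
    },
    "http://example.org/entity/2": {
        "_typeGroup": "entities",
        "_type": "Organization",
        "name": "Example Historical Society",
        "relevance": 0.6,
    },
    "http://example.org/topic/1": {
        "_typeGroup": "topics",
        "name": "Example Topic",
    },
}

# Column order for the CSV output.
FIELDS = ["source", "entity_type", "name", "relevance"]


def extract_entities(response, source):
    """Keep only items tagged as entities and flatten them into rows."""
    rows = []
    for value in response.values():
        # Skip document metadata, topics, and anything else that is not an entity.
        if not isinstance(value, dict) or value.get("_typeGroup") != "entities":
            continue
        rows.append({
            "source": source,
            "entity_type": value.get("_type", ""),
            "name": value.get("name", ""),
            "relevance": value.get("relevance", ""),
        })
    return rows


def entities_to_csv(rows):
    """Serialize entity rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    entity_rows = extract_entities(SAMPLE_RESPONSE, "finding-aid-001")
    print(entities_to_csv(entity_rows))
```

Writing one row per entity, with the source finding aid in its own column, keeps the file easy to load into a cleanup tool such as OpenRefine, where names can be clustered and reconciled against outside data sources.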
How Semantic Analysis of Semi-Structured Data contributes to the generation of Linked Data
Using automated subject analysis
Tools and techniques for employing Semantic Analysis
Research Method and Preliminary Findings
43 archival record groups from sixteen institutions, including special collections, university, and government archives
44 philosophy theses consisting of a selected sub-sample (22) from KentLINK and a random sample (22) from OhioLINK