Research Spotlight
Rationale
Research Spotlight (RS) is a system that extracts information from research articles, enriches it with relevant information from other web sources, organizes it according to the Scholarly Ontology, and republishes it in the form of linked data. Existing information is leveraged by accessing SPARQL endpoints, scraping web pages, or querying APIs. Harvested information is further used as background knowledge for training classifiers or for extracting information from semi-structured or unstructured text.
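For illustration, harvesting background knowledge from a SPARQL endpoint might look roughly as follows; the endpoint, query, and entity class shown here are assumptions made for the sake of the example, not RS's actual queries.

```python
# Minimal sketch of harvesting background knowledge from a SPARQL endpoint.
# The endpoint, query, and entity class (dbo:Software) are illustrative
# assumptions, not the queries RS actually uses.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setReturnFormat(JSON)
endpoint.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?tool ?label WHERE {
        ?tool a dbo:Software ;
              rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
    LIMIT 100
""")

results = endpoint.query().convert()
# Collect the English labels as a candidate named-entity list.
labels = [b["label"]["value"] for b in results["results"]["bindings"]]
```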
Knowledge bases created using RS can support researchers in finding details of relevant work without reading the articles; discovering uses of resources, processes, and methods in particular contexts; promoting communities of interest; and formulating future directions and project proposals. In addition, funders and research councils can get a “bird’s eye view” of scholarly work that is useful for planning and evaluation.
System Architecture
Research Spotlight (RS) provides an automated workflow for populating the Scholarly Ontology’s core entities and relations. To do so, RS provides distant supervision techniques that make machine learning models easy to train, interconnects with various APIs to harvest (linked) data and information from the web, and uses pretrained ML models along with lexico-semantic rules to extract information from the text of research articles, associate it with information from the article’s metadata and other digital repositories, and publish the inferred knowledge as linked data. Simply put, Research Spotlight transforms the text of a research article into queryable knowledge graphs based on the semantics provided by the Scholarly Ontology.
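As a toy illustration of what “queryable knowledge graph” means in practice, the sketch below builds a few SO-style triples with rdflib and queries them with SPARQL. The namespace URI is a placeholder, and the class and property names simply echo those cited later in this section.

```python
# Toy illustration: a few SO-style triples made queryable with SPARQL.
# The namespace URI is a placeholder, not the ontology's published one.
from rdflib import Graph, Namespace, Literal, RDF, RDFS

SO = Namespace("http://example.org/scholarly-ontology#")  # placeholder

g = Graph()
g.bind("so", SO)

activity = SO["activity/42"]
g.add((activity, RDF.type, SO.Activity))
g.add((activity, RDFS.label, Literal("topic modelling experiment")))
g.add((activity, SO.hasParticipant, SO["person/jane-doe"]))
g.add((activity, SO.hasObjective, Literal("identify latent research themes")))

# The graph is now directly queryable with SPARQL.
for row in g.query("""
    SELECT ?label ?objective WHERE {
        ?a a so:Activity ; rdfs:label ?label ; so:hasObjective ?objective .
    }"""):
    print(row.label, "->", row.objective)
```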
RS employs a modular architecture that allows its various components to be flexibly extended and upgraded. It is written in Python and makes use of libraries such as spaCy for parsing and syntactic analysis of text, Beautiful Soup for parsing the HTML/XML structure of web pages, and scikit-learn for the machine learning methods used to extract entities and relations from text.
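A minimal sketch of how the two parsing libraries fit together, assuming a toy HTML snippet and the en_core_web_sm model (both are assumptions for the example):

```python
# Minimal sketch of the parsing stack: Beautiful Soup strips the HTML
# structure, spaCy performs sentence segmentation and dependency parsing.
# The HTML snippet and the en_core_web_sm model are illustrative assumptions.
from bs4 import BeautifulSoup
import spacy

html = "<p>We applied latent Dirichlet allocation to the corpus.</p>"
raw_text = BeautifulSoup(html, "html.parser").get_text()

nlp = spacy.load("en_core_web_sm")
doc = nlp(raw_text)
for token in doc:
    # Surface form, part of speech, dependency relation, syntactic head.
    print(token.text, token.pos_, token.dep_, token.head.text)
```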
Layered Approach
To transform text into queryable knowledge graphs, Research Spotlight (RS) follows a layered approach. The input comprises published research articles retrieved from repositories or web pages, preferably in HTML/XML format. This format is exploited to extract an article’s metadata, such as author information, references and their mentions in the text, and the legends of figures and tables. Entities, such as Activities, Methods, Goals, and Propositions, are extracted from the text of the article. These are associated in the relation extraction step through various relations, e.g. follows, hasPart, hasObjective, resultsIn, hasParticipant, hasTopic, and hasAffiliation. Encoded as RDF triples, these are published as linked data, using additional “meta properties”, such as owl:sameAs, owl:equivalentProperty, rdfs:label, and skos:altLabel, where appropriate.
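The final “meta properties” step could look roughly like the following rdflib sketch; the URIs, labels, and the DBpedia target are invented for the example.

```python
# Illustrative use of the "meta properties" mentioned above: linking a
# locally minted URI to an external one and recording label variants.
# All URIs and labels here are made up for the example.
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import OWL, RDFS, SKOS

SO = Namespace("http://example.org/scholarly-ontology#")  # placeholder
g = Graph()

method = SO["method/lda"]
g.add((method, RDFS.label, Literal("Latent Dirichlet Allocation")))
g.add((method, SKOS.altLabel, Literal("LDA")))
g.add((method, OWL.sameAs,
       URIRef("http://dbpedia.org/resource/Latent_Dirichlet_allocation")))

print(g.serialize(format="turtle"))
```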
Preprocessing
In Preprocessing, information is retrieved from sources such as DBpedia to build lists of named entities through the NE List Creation module. Specific queries using these entities are then submitted to the sources via the API Querying module. Retrieved articles are processed by the Text Cleaning module, and the resulting raw text is added to a training corpus by the Automatic Annotation module, which uses the entries of the NE list to spot named entities in the text. The annotated texts are used to train a classifier to recognize the desired types of named entities.
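A stripped-down version of the Automatic Annotation step is sketched below; the exact-string gazetteer matching and the toy NE list are simplifying assumptions about what the module does.

```python
# Simplified Automatic Annotation: spot gazetteer entries (the NE list)
# in raw text and emit character-offset annotations suitable for training
# an NER classifier. Exact string matching is a simplification.
import re

ne_list = {"scikit-learn": "Software", "DBpedia": "Dataset"}  # toy NE list

def annotate(text, ne_list):
    annotations = []
    for name, label in ne_list.items():
        for match in re.finditer(re.escape(name), text):
            annotations.append((match.start(), match.end(), label))
    return text, sorted(annotations)

sample = "We trained the model with scikit-learn on entities from DBpedia."
print(annotate(sample, ne_list))
# -> (..., [(26, 38, 'Software'), (56, 63, 'Dataset')])
```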
Main Processing
Main Processing begins with harvesting research articles from Web sources, either through their APIs or by scraping publication websites. The articles are scanned for metadata, which are mapped to SO instances according to a set of rules. In addition, specific HTML/XML tags inside the articles indicating images, tables, and references are extracted and associated with the appropriate SO entities, while the remaining unstructured, “raw” text is cleaned and segmented into sentences by the Text Cleaning & Segmentation module. This raw text is then input to the Named Entity Recognition module, where named entities of specific types are recognized. The segmented text is also passed to a dependency parser by the Syntactic Analysis module. The output consists of annotated text, in the form of dependency trees reflecting the internal syntax of each sentence, which is further processed by the Non-Named Entities Extraction module so that text segments containing other entities (such as Activities, Goals, or Propositions) can be extracted. The output of the above steps (named entities, non-named entities, and metadata) is fed into the Relation Extraction module, which uses four kinds of rules: (i) syntactic patterns based on the output of the dependency parser; (ii) the surface form of words and POS tags; (iii) semantic rules derived from the Scholarly Ontology; and (iv) proximity constraints capturing structural idiosyncrasies of texts. Finally, based on the information extracted in the previous steps, URIs in the SO namespace are generated and linked, when possible, to other strong URIs (such as the DBpedia entities stored in the named-entity lists) so that the information can be published as linked data through a SPARQL endpoint.
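To make rule family (i) concrete, the following sketch shows one dependency pattern of the kind such a module might use: a main verb is read as an activity trigger, its nominal subject as a participant, and its direct object as the activity’s object. It illustrates the pattern style only; it is not one of RS’s actual rules, and the model name is an assumption.

```python
# Illustrative dependency-pattern rule in the style of rule family (i):
# treat a main verb as an activity trigger, its nominal subject as a
# participant, and its direct object as the activity's object.
# Not one of RS's actual rules; the model name is an assumption.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The authors evaluated the classifier on two annotated corpora.")

for token in doc:
    if token.pos_ == "VERB":
        subjects = [c for c in token.children if c.dep_ == "nsubj"]
        objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
        for subj in subjects:
            for obj in objects:
                print(f"Activity('{token.lemma_}') "
                      f"hasParticipant('{subj.text}') "
                      f"object('{obj.text}')")
```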