FUSION
FUnctionality Sharing In Open eNvironments
Heinz Nixdorf Chair for Distributed Information Systems

Semantics 2016

Does Term Expansion Matter for the Retrieval of Biodiversity Data? – Supplementary Material

Felicitas Löffler, Friederike Klan

The overall goal of our evaluation was to determine whether a semantically enhanced query actually outperforms a simple keyword-based query, and which types of semantically related terms best support researchers in finding relevant datasets. Is a result set that includes data related to the labels of parent or sibling concepts of a user-provided search term actually relevant for the user, or is it enough to incorporate synonyms and narrower concepts? To answer these questions, we set up a search engine over 92,856 biological metadata files and enriched 19 exemplary search questions proposed by biodiversity researchers with expansion terms from two vocabulary providers. Both queries, the original search terms and the expanded version, were executed by the search engine.

Experimental Setup

The overall flow is shown in the figure below. We asked 6 experienced biodiversity researchers to each provide five research questions related to their field of expertise, together with the search terms they would enter to find relevant data. Having both the keywords entered for search and the underlying question allows us to interpret the meaning of the search terms correctly and makes the query intent explicit. For all search terms provided by the researchers, we looked for matching concepts (exact match to a concept label) on two different ontology platforms, namely the Terminology Server hosted by the German Federation for Biological Data (GFBio TS) and Bioportal. In case of a successful match, related terms were retrieved and selected following the strategy described in our publication. Both the original set of keywords and the expanded version were sent to the search engine for dataset retrieval. This led to two different result sets that were displayed to the study participants in a portal-based user interface.

[Figure: overall flow of the evaluation]

Overview of Questions

We obtained permission from all 6 users to publish their questions and evaluation results.
Both the questions and the associated search terms were provided by the users.

Query | Question | Search terms
Q1 | How high are sulfate reduction rates at cold seeps? | cold seeps, sulfate reduction rate
Q2 | How high are benthic oxygen uptake rates in the Atlantic? | Atlantic, oxygen uptake, respiration
Q3 | How is the distribution of Holothuroidea in the Atlantic Ocean? | Holothuroidea (sea cucumber), Atlantic
Q4 | How high is the organic carbon content in arctic sediments? | arctic, sediments, organic carbon
Q5 | Where do I find mesopelagic fish of the genus Cyclothone? | Cyclothone
Q6 | How many eggs do copepods produce (e.g. #eggs/female/day)? | egg production, copepoda
Q7 | How variable is the oxygen concentration (e.g. in unit µmol/kg) of sea water in the mesopelagic zone (i.e. between 200-1000 m) of the global ocean? | oxygen, µmol/kg, sea water, mesopelagic zone
Q8 | What data exist for Neogloboquadrina pachyderma or Globigerina bulloides? | Neogoboquadrina pachyderma, Globigerina bulloides
Q9 | What data contains samples from surface water? | surface water, water sample
Q10 | What are associated taxa, for example an insect and its host plant? | host (parasite), plant, insecta
Q11 | What data exist for invasive grasses, e.g., Poaceae? | invasive grasses, e.g., (Poaceae)
Q12 | What data is in the repository about ‘climate change’? | climate change
Q13 | What data is there for ‘root length’? | root, length
Q14 | What data exist for butterflies on oaks? | lepidoptera, quercus
Q15 | What data is there for ‘foraminifera’ and ‘benthic’? | foraminifera, benthic
Q16 | What data is there for nutrients in soil? | nutrient, soil (terrestric)
Q17 | What data is in the repository for the German tree of the year 2016 ‘Tilia cordata’? | Tilia cordata, Germany
Q18 | What data exist for ‘primula veris’ in Germany? | Primula veris, Germany
Q19 | Please show me all datasets about ‘sunflowers’! | sunflower (Helianthus)

GFBio – Dataset:

We requested 100,000 randomly ordered datasets from GFBio’s Elasticsearch API using the settings below, but obtained only 92,856. To make the results reproducible, we set the ‘seed’ to the first name of one of the authors, ‘felicitas’:

GFBio search API

{
  "fields": ["xml", "internal-datestamp"],
  "query": {
    "function_score": {
      "query": {"match_all": {}},
      "functions": [{"random_score": {"seed": "felicitas"}}],
      "score_mode": "multiply"
    }
  },
  "size": 100000
}
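
The request body above can also be assembled programmatically, which makes it easy to verify that it serializes to valid JSON. The following is a minimal sketch in Python; the function name `build_random_sample_query` is an illustrative assumption, and the actual HTTP call is left out because the endpoint URL is not given here.

```python
import json


def build_random_sample_query(seed: str, size: int) -> dict:
    """Build a function_score query that orders all documents by a seeded random score."""
    return {
        "fields": ["xml", "internal-datestamp"],
        "query": {
            "function_score": {
                "query": {"match_all": {}},
                "functions": [{"random_score": {"seed": seed}}],
                "score_mode": "multiply",
            }
        },
        "size": size,
    }


# Same seed and size as in the settings above.
body = build_random_sample_query("felicitas", 100000)

# Serialize to the JSON payload that would be sent to the search API.
# Sending it is omitted on purpose: the endpoint URL is not stated in the text.
payload = json.dumps(body)
print(payload)
```

Keeping the seed as a parameter makes it explicit that the random order depends on it.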

Vocabularies & Terminology Services

The original search terms were expanded with services from GFBio’s Terminology Server (GFBio TS) and Bioportal. Preference was given to GFBio TS since the hosted vocabularies are tailored to the datasets.
The following services from GFBio’s Terminology Server have been used:

  • /terminologies/search?query=<queryString>&match_type=exact&first_hit=false&internal_only=true
  • /term?uri=<URI>
  • /broader?uri=<URI>
  • /allnarrower?uri=<URI>
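
As an illustration of how these service paths could be turned into request URLs, here is a minimal Python sketch. The base URL `https://terminology-server.example/api`, the helper names `search_url` and `concept_url`, and the example concept URI are hypothetical placeholders; only the paths and query parameters are taken from the list above.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the real base URL of the Terminology Server is not stated here.
BASE_URL = "https://terminology-server.example/api"

# Concept-level services from the list above that take a single 'uri' parameter.
CONCEPT_SERVICES = {"term", "broader", "allnarrower"}


def search_url(query_string: str) -> str:
    """URL for an exact-match label search with the parameters listed above."""
    params = {
        "query": query_string,
        "match_type": "exact",
        "first_hit": "false",
        "internal_only": "true",
    }
    return f"{BASE_URL}/terminologies/search?{urlencode(params)}"


def concept_url(service: str, uri: str) -> str:
    """URL for one of the concept-level services (term, broader, allnarrower)."""
    if service not in CONCEPT_SERVICES:
        raise ValueError(f"unknown service: {service}")
    return f"{BASE_URL}/{service}?{urlencode({'uri': uri})}"


# Example: search for a user-provided term, then look up related concepts.
print(search_url("cold seeps"))
print(concept_url("broader", "http://example.org/concept/123"))      # hypothetical URI
print(concept_url("allnarrower", "http://example.org/concept/123"))  # hypothetical URI
```

`urlencode` percent-encodes the concept URI, so it can be passed safely as a query parameter.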

The obtained concepts were found in the following vocabularies: NCBITAXON, CRISP, ENVO, CHEBI, OBOE, PATO, QUDT

We used Bioportal’s annotator with selected ontologies, namely: GO, GO-EXT, NCIT, CMO

Source code

The described expansion strategy for GFBio’s Terminology Server was implemented as a Java command-line tool that will be published on GitHub.

Search Queries and Indexing

The randomly harvested 92,856 biodiversity metadata files from GFBio were indexed with the search engine GATE Mímir, version 5.1. The internal retrieval model is based on the classical TF-IDF approach.
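
To make the retrieval model more concrete, the following Python sketch scores documents with a textbook TF-IDF formulation (raw term frequency times log(N/df)). It is not GATE Mímir's implementation, whose exact weighting variant is not described here; the function name `tf_idf_scores` and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, documents):
    """Score each tokenised document as the sum of tf * idf over the query terms."""
    n_docs = len(documents)

    # Document frequency: number of documents that contain a term at least once.
    df = Counter()
    for doc in documents:
        df.update(set(doc))

    scores = []
    for doc in documents:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if df[term]:  # terms that occur in no document contribute nothing
                score += tf[term] * math.log(n_docs / df[term])
        scores.append(score)
    return scores


# Toy, made-up "metadata files", already tokenised and lower-cased.
docs = [
    ["sulfate", "reduction", "rate", "cold", "seeps"],
    ["oxygen", "uptake", "atlantic"],
    ["cold", "water", "atlantic"],
]
scores = tf_idf_scores(["cold", "seeps"], docs)
print(scores)
```

In this formulation, adding expansion terms to the query simply adds further summands to a document's score, which is why expanded queries can both retrieve more documents and reorder them.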

User Interface

We integrated the results returned by GATE Mímir into the open-source portal Liferay. The left side of the portlet shows the results obtained from the original user-provided search terms; the right side displays the datasets that were retrieved using the expanded keywords.

[Figure: user interface]

Evaluation Results

List of ontologies used for query expansion and matched resources

Source | Vocabulary | Concepts
GFBio TS | National Center for Biotechnology Information (NCBI) Organismal Classification (NCBITAXON) | 13
GFBio TS | Computer Retrieval of Information on Scientific Projects Thesaurus (CRISP) | 9
GFBio TS | ENVironmental Ontology (ENVO) | 8
Bioportal | National Cancer Institute Thesaurus (NCIT) | 3
GFBio TS | Chemical Entities of Biological Interest Ontology (CHEBI) | 2
GFBio TS | Phenotypic Quality Ontology (PATO) | 1
GFBio TS | Quantities, Units, Dimensions, and Types Ontology (QUDT) | 1
Bioportal | Clinical Measurement Ontology | 1
Bioportal | Gene Ontology (GO) Extension (EXT) | 1

Relevance Judgments
Relevance judgments (zip file): The zip file contains the Excel sheets with the relevance judgments from the 6 users.


MAP and nDCG based on different relevance thresholds (>0 and >3) for the keyword (blue) and semantic search (limegreen)

Search | MAP >0 | MAP >3 | nDCG
Keyword | 0.74 | 0.58 | 0.72
Semantic | 0.83 | 0.55 | 0.79
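
For readers who want to recompute such figures from the relevance judgments, the following Python sketch shows one common way to derive MAP at a relevance threshold and nDCG from graded judgments of a ranked result list. It is a sketch under stated assumptions, not the evaluation script used in the study: the function names and the example grades are made up, average precision is normalized by the number of relevant results in the judged list, and nDCG uses the grade directly as gain with a log2 rank discount.

```python
import math


def average_precision(grades, threshold):
    """AP over a ranked list; a result counts as relevant if its grade exceeds the threshold."""
    hits = 0
    precision_sum = 0.0
    for rank, grade in enumerate(grades, start=1):
        if grade > threshold:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings, threshold):
    """MAP: mean of the per-query average precision values."""
    return sum(average_precision(g, threshold) for g in rankings) / len(rankings)


def ndcg(grades):
    """nDCG with the grade as gain and a log2(rank + 1) discount."""
    def dcg(values):
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(values, start=1))

    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal else 0.0


# Made-up graded judgments for two queries, in ranked order (not study data).
rankings = [[4, 0, 2, 5], [0, 3, 0, 0]]
print(mean_average_precision(rankings, threshold=0))
print(mean_average_precision(rankings, threshold=3))
print([ndcg(g) for g in rankings])
```

Raising the threshold from >0 to >3 only changes which results count as relevant for MAP, while nDCG uses the full graded scale and therefore needs no threshold.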