FUSION
FUnctionality Sharing In Open eNvironments
Heinz Nixdorf Chair for Distributed Information Systems
 

Towards Scientific Data Synthesis Using Deep Learning and Semantic Web

 

Startdate: 2021-06-04

Finishdate: 2021-05-10

Website: https://2021.eswc-conferences.org/

Member(s):
Alsayed Algergawy
Birgitta König-Ries
Hamdi Hamed

Abstract

One of the added values of long-running and large-scale collaborative projects is the ability to answer complex research questions based on the comprehensive set of data provided by their central repositories. In practice, however, finding data in such a repository to answer a specific question often proves to be a demanding task even for project scientists. In this paper, we aim to ease this task, thereby enabling cross-cutting analyses. To achieve that, we introduce a new data analysis and summarization approach combining semantic web and machine learning approaches. In particular, the proposed approach makes use of the capability of machine learning to categorize a given dataset into a domain topic and to extract hidden links between its data attributes and data attributes from other datasets. The proposed approach has been developed in the frame of CRC AquaDiva and has been applied to its datasets.

Description

Motivation

  • The Collaborative Research Center (CRC) AquaDiva is a large collaborative project spanning a variety of domains including biology, geology, chemistry, and computer science with the common goal to better understand the Earth’s critical zone.
  • Datasets collected within AquaDiva, like those of many other cross-institutional, cross-domain research projects, are complex and difficult to reuse since they are highly diverse and heterogeneous.
  • This limits dataset accessibility to the few people who were either involved in creating the datasets or have invested significant time in understanding them.
  • Furthermore, even identifying the major theme of an unfamiliar dataset requires considerable time.
  • We believe that dataset analysis and summarization can be used as an elegant way to provide a concise overview of an entire dataset.

Definitions:

  • A dataset is defined as a tuple of primary data and metadata organized for a specific purpose.
  • The primary data represents the actual data organized according to a specific structure, called data structure.
  • Each data structure consists of a set of data attributes.
  • Each data attribute has a name, a datatype, an (optional) unit, a description, and an annotation based on a domain ontology.
  • Each tuple in the primary data is a collection of data cells containing data values (called data points).
  • The metadata contains information about, e.g., the data owner, the data curators, and the methodology used to produce the primary data.
  • In our implementation, almost all data attributes of available datasets are annotated using the AquaDiva ontology (ADOn) as the domain-specific ontology.
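The definitions above translate into a small data model. A minimal sketch in Python (the class and field names are illustrative assumptions, not the project's actual schema):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class DataAttribute:
    """One column of the primary data, following the definitions above."""
    name: str
    datatype: str
    description: str
    unit: Optional[str] = None        # the unit is optional
    annotation: Optional[str] = None  # concept from a domain ontology, e.g. ADOn

@dataclass
class Dataset:
    """A dataset is a tuple of primary data and metadata."""
    attributes: List[DataAttribute]  # the data structure
    tuples: List[List[Any]]          # primary data: each row is a collection of data points
    metadata: Dict[str, Any] = field(default_factory=dict)  # owner, curators, methodology, ...
```

Each tuple in `tuples` holds one data point per data attribute, in attribute order.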

 

Methodology

JeDaSS Architecture

We developed an approach that semantically classifies the data attributes of scientific datasets (tabular data) based on a combination of semantic web and deep learning techniques. This classification contributes not only to summarizing individual datasets but also to linking them to others. We view this as an important building block for a larger data summarization system. We believe that identifying the subject of a dataset is a fundamental first step in data summarization. To this end, the proposed approach categorizes a given dataset into a domain topic. With this topic, we then extract hidden links between different datasets in the repository. The proposed approach has two main phases:

  1. an off-line phase, which trains and builds a classification model using supervised deep learning with convolution layers, and
  2. an on-line phase, which uses the pre-trained model to classify datasets into the learned categories.
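The two phases can be sketched end to end. In this dependency-free sketch, all function names are assumptions, and a trivial nearest-centroid model stands in for the CNN; `generate_images` stands in for the data preparation and image generation components described below:

```python
# Sketch of the off-line/on-line pipeline; all names are illustrative assumptions.

def generate_images(dataset):
    # Stand-in for data preparation + image generation:
    # one fixed-length feature vector per data attribute.
    return [[float(len(a["name"])), float(len(a["datatype"]))]
            for a in dataset["attributes"]]

def offline_phase(labeled_datasets):
    """Train a model from (dataset, topic) pairs.

    The real system trains a CNN; a nearest-centroid model keeps
    this sketch dependency-free.
    """
    by_topic = {}
    for ds, topic in labeled_datasets:
        for img in generate_images(ds):
            by_topic.setdefault(topic, []).append(img)
    # One centroid (mean vector) per topic.
    return {t: [sum(col) / len(col) for col in zip(*imgs)]
            for t, imgs in by_topic.items()}

def online_phase(model, dataset):
    """Classify a new dataset into one of the learned topics by majority vote."""
    votes = {}
    for img in generate_images(dataset):
        topic = min(model, key=lambda t: sum((a - b) ** 2
                                             for a, b in zip(img, model[t])))
        votes[topic] = votes.get(topic, 0) + 1
    return max(votes, key=votes.get)
```

In the actual system, the off-line phase would fit the convolutional network on the generated images, and the on-line phase would run inference with the frozen, pre-trained weights.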

Data Preparation

The main objective of this component is to prepare a large number of heterogeneous datasets for analysis. It is needed both during model building (training phase) and model deployment (operating phase) to convert each dataset into a structure suitable for the next component, image generation. To this end, we propose a new structure that combines several features from the dataset into a single container. The question arising here is which parts of a dataset should be selected as representative of it. We argue that data attributes are the most important parts of a dataset, so the data preparation process centers around them. Furthermore, metadata provides an important source for understanding and interpreting datasets. Therefore, the new structure constructed during data preparation is based mainly on data attributes and metadata. We gather all information related to the data attributes: for each data attribute, we consider its name, datatype, unit, the data points attached to it, and its semantic annotation. Furthermore, we use the dataset title contained in the metadata as a textual representation of the dataset.
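The per-attribute container described above can be sketched as a small function. Here a dataset is a plain dictionary; the field names are illustrative assumptions, not the project's actual schema:

```python
def prepare_dataset(dataset):
    """Collect, per data attribute, the features used for image generation.

    `dataset` follows the structure from the Definitions section;
    the key names are illustrative assumptions.
    """
    title = dataset["metadata"].get("title", "")
    records = []
    for i, attr in enumerate(dataset["attributes"]):
        records.append({
            "name": attr["name"],
            "datatype": attr["datatype"],
            "unit": attr.get("unit"),
            "annotation": attr.get("annotation"),
            # Column i of the primary data: the data points of this attribute.
            "data_points": [row[i] for row in dataset["tuples"]],
            # Dataset title from the metadata as a textual representation.
            "dataset_title": title,
        })
    return records
```

Each record bundles everything the image generation step needs for one data attribute.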

Image Generation

To benefit from the salient features of the way deep learning algorithms with convolution layers deal with images, where the network can accept a sample as an image (i.e., a matrix of size n×m) and perform feature extraction and classification via hidden layers, we propose to transform the constructed structure for a dataset into a number of images, one generated per data attribute. Figure 2 illustrates the "Air temperature mean" data attribute from the "Weather and soil data monitoring" dataset along with its annotation, datatype (decimal), unit (Celsius), and 30 data points.
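One way to realize this transformation is to serialize each per-attribute record to bytes and fill an n×m grayscale matrix with the byte values. The encoding below is a simple stand-in (an assumption, not the paper's exact scheme):

```python
def record_to_image(record, width=32, height=32):
    """Encode one data-attribute record as a height×width grayscale matrix.

    Stand-in encoding (illustrative assumption): the textual features and
    data points are serialized to a byte string, and the byte values fill
    the matrix row by row, zero-padded or truncated to fit.
    """
    text = "|".join([
        str(record.get("name", "")),
        str(record.get("datatype", "")),
        str(record.get("unit", "")),
        str(record.get("annotation", "")),
        str(record.get("dataset_title", "")),
        ",".join(str(p) for p in record.get("data_points", [])),
    ])
    data = text.encode("utf-8")[: width * height]
    data = data + bytes(width * height - len(data))  # zero-pad to fill the matrix
    return [[data[r * width + c] for c in range(width)] for r in range(height)]
```

The resulting matrices can be stacked into a batch and fed to a convolutional network like any other single-channel image input.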

Classification

In the current implementation, we use the ResNet18 convolutional neural network to build the classification model, as it achieved the best results in our trials. We built and tested the proposed approach using datasets from the AquaDiva data portal. We used 114 datasets representing different domain topics, such as weather monitoring, groundwater hydrochemistry, gene abundance, or soil physical parameters. 70% of these were used for training and 30% for evaluation. The total number of data attributes is 1300; the number of data points within a dataset ranges from 300 (5 data attributes × 60 tuples) to 12,000,000. Example results were presented to domain experts from CRC AquaDiva, who confirmed the correctness and usefulness of the classification. Based on these promising first results, we aim to extend and further evaluate the approach in future work.
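The 70/30 experimental split can be reproduced with a few lines of standard-library Python; the actual model training on top of it (ResNet18) would use a deep learning framework. The function name and fixed seed below are assumptions:

```python
import random

def split_70_30(items, seed=0):
    """Random 70/30 train/evaluation split, as used in the experiments."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * 0.7)
    return shuffled[:cut], shuffled[cut:]
```

Applied to the 114 AquaDiva datasets, this yields 80 datasets for training and 34 for evaluation.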

Preliminary results

 

Publications

  • Alsayed Algergawy, Hamdi Hamed, Birgitta König-Ries: Towards Scientific Data Synthesis Using Deep Learning and Semantic Web. The Semantic Web: ESWC 2021 Satellite Events, 2021 (poster paper).

Availability

The resources related to the development of the proposed approach can be found at the GitHub repo.

Acknowledgment

This work has been funded by the Deutsche Forschungsgemeinschaft (CRC AquaDiva, Project 218627073).