Senior consultant at German Federation for Biological Data (GFBio e.V.) & National Research Data Infrastructure for Biodiversity (NFDI4BioDiversity). Senior head of the World Data Center PANGAEA
The application of semantic resources and technologies for the discovery and integration of geo- and biosciences data
Large-scale and complex questions in science, such as global warming, invasive species spread, and resource depletion, increasingly require the collection of disparate data sets from various data sources building on different knowledge domains in science and society. Structured data and metadata with consistent semantics are prerequisites for data usability, in particular for findability of data and efficient data integration. Ontologies, thesauri, and vocabularies for various science domains have been evolving tremendously during the last decade and play a key role for the harmonization of data. Nevertheless, the application of terminologies in the context of data production, archiving, and publication is still at its beginning. In addition, features and usability of terminology services vary greatly. The situation is aggravated by the complexity and dynamic growth of measurement and observation types (parameters) including used methods which are essential for integrating data from distributed sources.
The ISC World Data Center PANGAEA (www.pangaea.de) with ~200.000 parameters and methods linked to more than 400.000 data sets covers a large part of scientific fields in the earth and environmental sciences. For the harmonization of parameters and methods PANGAEA has (1) embedded a term catalogue (TC) comprising various relevant terminologies including taxonomies into its editorial system, (2) has conceptualized parameters and methods by setting up a basic syntax and rule set, and (3) has implemented routines based on full text search for matching parameter and method names with terms from the TC. Despite these measures being quite successful it must be noted that the approach is limited to PANGAEA as a single data provider – the needed effort is high.
More recently, Germany launched the National Research Data Infrastructure (NFDI – https://www.nfdi.de/) initiative with a number of consortia covering various science domains. The NFDI4BioDiversity (www.nfdi4biodiversity.org) consortium, having started in 2020, leads the development of a multi-cloud-based infrastructure supported by almost all existing consortia. The so-called NFDI Research Data Commons (RDC) will enable uniform access to data, software, and compute resources as well as sovereign data exchange and collaborative work. Harmonization of the semantics of data during the ETL process will be supported by terminology services enhanced by AI technologies. With the RDC, a paradigm shift from data to function shipping is initiated.
The approach aligns well with initiatives like the European Open Science Cloud, EOSC (https://eosc-portal.eu/), the NCI RDC (https://datascience.cancer.gov/data-commons) or the Australian RDC (https://ardc.edu.au/). Nevertheless, integrating and harmonizing data into the conceived cloud based systems remains a major challenge: (1) Terminology services need to improve in quality and functionality; (2) AI technologies are not yet part of the integration process; (3) more convergence towards cross-domain usable metadata standards like schema.org – allowing community specific extensions like BioSchemas (https://bioschemas.org/) – is needed to keep the ETL process manageable; and finally (4) a comprehensible, generally applicable model for the definition of parameters and methods would make the task considerably easier. The latter should build in part on the UCUM system (https://ucum.org/) for scientific units.
Dr. Michael Diepenbroek, Geologist and IT Specialist. 1992 PhD in Geology at the Free University of Berlin; 1992-94 computer center of the AWI, Bremerhaven; 1994-97 conception and implementation the scientific information system PANGAEA®; 1998-2021 at MARUM, University Bremen, where he was responsible for the management of PANGAEA®. During the last 10 years he took a leading role establishing PANGAEA as a global service provider for scientific data, in particular through mandates from the ISC (ISC World Data System – Vice-Chair of the Scientific Committee 2009-2016), the WMO (WMO Information System), collaborations with major science publishers, and as Chair/Co-chair in various RDA groups. Coordinator of the German Federation for Biological Data (GFBio) (2013-2020). Since 2017 engaged in the National Research Data Infrastructure Initiative (NFDI), in particular in the conception and preparation of NFDI4BioDiversity (https://www.nfdi4biodiversity.org). Since 2021 working for GFBio e.V. as part of NFDI4BioDiversity.
Keynote by Mark Schildhauer
Mark Schildhauer was the Director of Computing at NCEAS from its opening in 1995 until 2017, when he “semi-retired” to focus on research. His technology research interests are primarily in the areas of informatics, the semantic web, scientific workflows, data semantics, computer-supported collaborative research, and Open Science, all in the context of facilitating integrative environmental and conservation science. Schildhauer’s academic training (PhD) was in marine, population, and behavioral ecology.
The Semantic Diversity of Biodiversity Semantics: an Overview
Biodiversity is a relatively recent term, becoming prominent starting in the mid-1980’s as a descriptor for the study of the “diversity of life-forms” on Earth. This charismatic label served to focus concerns on recent alarming trends in the global loss of Biodiversity– including populations, species, communities, and ecosystems.
While coining of the term Biodiversity grew out of concerns over the accelerating disappearance of natural habitats and corresponding extinction of species — more recently its connotations have been much broader, encompassing a number of disciplines dedicated to characterizing, quantifying, conserving and understanding the remarkable variation in Earth’s life forms at various scales, ranging from the molecular to the global. Fundamental to these investigations is the discipline of “Modern Systematics” that focuses specifically on the naming and categorizing of biological, often organismal “types”, according to codes and rules established and maintained by the research community. The resulting “names” provide a unifying, though not entirely uncontested, reference terminology used by researchers in fields such as phylogenetics, evolutionary biology, paleontology, ecology, and conservation biology— to effectively clarify what types of biological entities were studied.
There is clearly an opportunity for these names (that can change over time as our understanding changes)— whether of clades, species, other taxonomic ranks, vegetation associations, habitat types, etc.; and the rules governing them — to be expressed in modern semantic languages and frameworks, in order to enable more effective discovery, interpretation, re-use, and exchange of biodiversity resources over the Web. In addition semantic technologies provide opportunities to de-silo and interoperate biodiversity terminologies more effectively with resources from disparate fields, ranging from ecological field data, to taxonomic descriptions, to conservation policy decisions.
In this talk I will provide some perspectives on current exciting work building biodiversity-relevant vocabularies, ontologies, and knowledge graphs, that collectively are helping to realize the goal of an interoperable “Biodiversity Semantics”. I will also attempt to identify some opportunities and challenges that remain.
Dr. Schildhauer was Director of Computing at the U.S. National Center for Ecological Analysis and Synthesis, NCEAS, from its opening in 1995 until 2017, when he “semi-retired” to focus on research. His technology research interests are primarily in the areas of environmental informatics, geospatial semantics, the Semantic Web, Knowledge Graph technologies, and Open Science. In the distant past, he was technical coordinator of the first major biodiversity assessment of the macrobiota of rocky intertidal habitats within the Southern California Bight, and later did extensive field work in coral reef fish communities in the southern Caribbean Sea. At NCEAS, Schildhauer worked closely with plant biologists designing vegetation classifications for North America, and a functional trait database for New World plants. He also participated in the creation of the formal XML metadata schema, EML (Ecological Metadata Language; https://eml.ecoinformatics.org/), and its recent extensions to accommodate semantic annotation. He co-developed the OBOÉ ontology for observations and measurements, and helped align it with the W3C SSN/SOSA recommendation ( https://www.w3.org/TR/vocab-ssn/). Schildhauer’s academic training was in marine, population, and behavioral ecology (PhD, Univ Calif, Santa Barbara); and an undergraduate degree in Biology (A.B., Harvard).