Together we can turn FAIR into reality - join us and your (meta)data community.
We invite you to the first virtual conference of the Helmholtz Metadata Collaboration (HMC).
HMC is a platform developing and implementing novel concepts and technologies for a sustainable handling of research data through high-quality metadata within the Helmholtz Association.
The HMC Conference 2022 is the perfect platform, and YOU can be part of it. We invite you to contribute to our conference and present your metadata project in our poster session (abstract submission deadline: 31 August 2022, 23:59 h CEST, extended from 17 August 2022). More on the call for abstracts for posters can be found here.
Registration for the conference is closed (extended deadline 12 September 2022).
The Helmholtz Metadata Collaboration aims to make the research data [and software] produced by Helmholtz Centres FAIR for their own and the wider science community by means of metadata enrichment. Why metadata enrichment, and why FAIR? Because the whole scientific enterprise depends on a cycle of finding, exchanging, understanding, validating, reproducing, integrating and reusing research entities across a dispersed community of researchers.
Metadata is not just “a love note to the future”; it is a love note to today’s collaborators and peers. Moreover, a FAIR Commons must cater for the metadata of all the entities of research – data, software, workflows, protocols, instruments, geo-spatial locations, specimens, samples and people (as well as traditional articles) – and their interconnectivity. That is a lot of metadata love notes to manage, bundle up and move around: notes written in different languages at different times by different folks, produced and hosted by different platforms, yet referring to each other and building an integrated picture of a multi-part and multi-party investigation. We need a crate!
RO-Crate is an open, community-driven, and lightweight approach to packaging research entities along with their metadata in a machine-readable manner. Following the key principles of “just enough” and “developer and legacy friendliness”, RO-Crate simplifies the process of making research outputs FAIR while also enhancing research reproducibility and citability. As a self-describing and unbounded “metadata middleware” framework, RO-Crate shows that a little bit of packaging goes a long way towards realising the goals of FAIR Digital Objects (FDO), and towards not just overcoming platform diversity but celebrating it while retaining the contextual integrity of the investigation.
In this talk I will present why and how Research Object packaging eases metadata collaboration, using examples in big data and mixed object exchange, mixed object archiving and publishing, mass citation, and reproducibility. Some examples come from the HMC, others from EOSC, the USA and Australia, and from different disciplines.
Metadata is a love note to the future, RO-Crate is the delivery package.
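To make the “crate” concrete, the sketch below builds a minimal `ro-crate-metadata.json` descriptor in the spirit of the RO-Crate 1.1 specification, using only the standard library. The file name `results.csv` and the dataset name are hypothetical examples, not part of any real crate.

```python
import json

# A minimal sketch of an ro-crate-metadata.json descriptor (RO-Crate 1.1).
# The data file and dataset name are made-up examples.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {   # the metadata file describes itself and points at the root dataset
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # the root data entity: the crate as a whole
            "@id": "./",
            "@type": "Dataset",
            "name": "Example investigation",
            "hasPart": [{"@id": "results.csv"}],
        },
        {   # one packaged file with its own descriptive metadata
            "@id": "results.csv",
            "@type": "File",
            "name": "Tabular results",
            "encodingFormat": "text/csv",
        },
    ],
}

serialized = json.dumps(crate, indent=2)  # what would be written into the crate
```

The point of the exercise: the entire “love note” is one flat JSON-LD graph, so any platform that can read JSON can traverse the crate’s contents and context.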
Publishing data in a FAIR way is already part of good scientific practice. While institutional policy as well as funding and publishing guidelines support this, scientists, technicians, and data stewards struggle to realize it when handling their research data. The reason is that the FAIR principles are high-level principles and guidelines rather than concrete implementations. This is one of the key missions of HMC: to support the Helmholtz community in making their data FAIR in an easy and comparable way. Developing a sustainable strategy for this requires a detailed understanding of practices, strengths, and deficiencies with respect to applying each of the FAIR principles. Here, tools that quantitatively assess data FAIRness against a set of specific implementations can help. When handling a dataset, such measures can aid the understanding of how FAIR a dataset actually is, as well as how to improve its FAIRness.
In this Blitzlicht-Talk, HMC Hub Matter and Hub Information will jointly present insights, benefits, and pitfalls from applying and further developing such metrics. For this we used the F-UJI tool [2,3], a Python-based development by the FAIRsFAIR project, in two complementary projects.
In a “top-down” approach, we evaluate data repositories based on the data they contain. The results of this analysis are then used to inform infrastructural development aimed at improving data FAIRness.
In a second, “bottom-up” approach, data publications from individual research centers or specific fields are evaluated with F-UJI. The results are gathered and visualized in an interactive pilot dashboard. This helps to identify and quantify the usage of repositories by Helmholtz’s research communities as well as to better support the development of relevant infrastructure for FAIR data practices.
We discuss our experience with these automatic FAIR assessment approaches and compare them to complementary insights from a manual FAIR assessment of a particular data pipeline using the FAIR Data Maturity Model. We discuss future plans for metric development and the potential use of such metrics in user-facing tooling.
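The quantitative aggregation behind such a dashboard can be sketched as follows. The per-metric result structure is a simplified stand-in for what an assessment tool like F-UJI reports (its real response schema differs in detail); the metric identifiers and scores are illustrative.

```python
# Hypothetical per-metric assessment results, loosely modelled on F-UJI
# output; identifiers and scores are illustrative only.
results = [
    {"metric": "FsF-F1-01D", "principle": "Findable",      "score": 1, "max": 1},
    {"metric": "FsF-F2-01M", "principle": "Findable",      "score": 1, "max": 2},
    {"metric": "FsF-A1-01M", "principle": "Accessible",    "score": 2, "max": 3},
    {"metric": "FsF-I1-01M", "principle": "Interoperable", "score": 0, "max": 2},
    {"metric": "FsF-R1-01MD","principle": "Reusable",      "score": 2, "max": 4},
]

def fairness_summary(results):
    """Aggregate metric scores into a percentage per FAIR principle."""
    totals = {}
    for r in results:
        got, top = totals.get(r["principle"], (0, 0))
        totals[r["principle"]] = (got + r["score"], top + r["max"])
    return {p: round(100 * got / top, 1) for p, (got, top) in totals.items()}

summary = fairness_summary(results)
```

A per-principle percentage like this is what a dashboard can plot over time or across repositories, making both the current FAIRness level and the effect of infrastructure improvements visible.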
For decades, the seismological community has promoted standardisation of formats and services as well as open data policies, making easy data exchange an asset for this community. Data is thus made perfectly Findable and Accessible as well as Interoperable and Reusable, with enhancements expected for the latter two. The strict and technical domain-specific standardisation, however, may complicate the sharing of more exotic data within the domain itself as well as hinder interoperability throughout the earth science community. Within eFAIRs, leveraging the know-how of the major OBS park operators and seismological data curators within the Helmholtz Association, we aim to facilitate the integration of special datasets from the ocean floor, enhancing interoperability and reusability.
To achieve this goal, in close collaboration with AWI and GEOMAR, supported by IPGP, the seismological data archive of the GFZ has created special workflows for OBS data curation. In particular, in close interaction with AWI, new datasets have been archived, defining a new workflow which is being translated into guidelines for the community. Domain-specific software has been modified to allow OBS data inclusion with specific additional metadata. These metadata also include, for the first time, persistent identifiers of the instruments in use, drawn from the AWI sensor information system. Next steps will enlarge the portfolio of keywords and standard vocabularies in use to facilitate data discovery by scientists from different domains. Finally, we plan to adopt the developed workflows for OBS data management.
Software as an important method and output of research should follow the RDA "FAIR for Research Software Principles". In practice, this means that research software, whether open, inner or closed source, should be published with rich metadata to enable FAIR4RS.
For research software practitioners, this currently often means following an arduous and mostly manual process of software publication. HERMES, a project funded by the Helmholtz Metadata Collaboration, aims to alleviate this situation. We develop configurable, executable workflows for the publication of rich metadata for research software, alongside the software itself.
These workflows follow a push-based approach: they use existing continuous integration solutions, integrated in common code platforms such as GitHub or GitLab, to harvest, unify and collate software metadata from source code repositories and code platform APIs. They also manage curation of unified metadata, and deposits on publication platforms. These deposits are based on deposition requirements and curation steps defined by a targeted publication platform, the depositing institution, or a software management plan.
In addition, the HERMES project works to make the widely-used publication platforms InvenioRDM and Dataverse "research software-ready", i.e., able to ingest software publications with rich metadata, and represent software publications and metadata in a way that supports findability, assessability and accessibility of the published software versions.
Beyond the open source workflow software, HERMES will openly provide templates for different continuous integration solutions, extensive documentation, and training material. Thus, researchers are enabled to adopt automated software publication quickly and easily.
In this presentation, we provide an overview of the project aims, its current status, and an outlook on future development.
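The harvest-and-collate step described above can be illustrated with a simplified sketch: metadata harvested from two hypothetical sources (a codemeta-like and a CITATION.cff-like record) is unified into one deposit record. This is not the actual HERMES implementation, just the core idea of merging source-specific records.

```python
# Two hypothetical harvested records; field names follow codemeta and
# the Citation File Format loosely, for illustration only.
codemeta = {"name": "mytool", "version": "1.2.0", "license": "Apache-2.0"}
cff = {"name": "mytool",
       "authors": [{"family-names": "Doe", "given-names": "J."}]}

def collate(*sources):
    """Unify harvested records: earlier sources take precedence,
    later sources only fill in fields that are still missing."""
    unified = {}
    for source in sources:
        for key, value in source.items():
            unified.setdefault(key, value)
    return unified

record = collate(codemeta, cff)  # one record, ready for curation and deposit
```

In a real pipeline this unified record would then pass through the curation step and be mapped onto the deposition requirements of the target publication platform.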
The HELIPORT project aims to make the components or steps of the entire life cycle of a research project at the Helmholtz-Zentrum Dresden-Rossendorf (HZDR) and the Helmholtz-Institute Jena (HIJ) discoverable, accessible, interoperable and reusable according to the FAIR principles. In particular, this data management solution deals with the entire lifecycle of research experiments, starting with the generation of the first digital objects, the workflows carried out and the actual publication of research results. For this purpose, a concept was developed that identifies the different systems involved and their connections. By integrating computational workflows (CWL and others), HELIPORT can automate calculations that work with metadata from different internal systems (application management, Labbook, GitLab, and further). This presentation will cover the first year of the project, the current status and the path taken so far in the life cycle of the project.
To reach the declared goal of the Helmholtz Metadata Collaboration Platform, making the depth and breadth of research data produced by Helmholtz Centres findable, accessible, interoperable, and reusable (FAIR) for the whole science community, the concept of FAIR Digital Objects (FAIR DOs) has been chosen as top-level commonality across all research fields and their existing and future infrastructures.
Over the last years, not only within the Helmholtz Metadata Collaboration Platform but also at the international level, the road towards realizing FAIR DOs has been paved more and more by concretizing concepts and implementing the base services required to realize FAIR DOs. These include different instances of Data Type Registries for accessing, creating, and managing the Data Types required by FAIR DOs, as well as technical components supporting the creation and management of FAIR DOs: the Typed PID Maker, providing machine-actionable interfaces for creating, validating, and managing PIDs with machine-actionable metadata stored in their PID record, and the FAIR DO testbed, currently evolving into the FAIR DO Lab, which serves as a reference implementation for setting up a FAIR DO ecosystem. However, introducing FAIR DOs is not only about providing technical services; it also requires the definition of and agreement on interfaces, policies, and processes.
A first step in this direction was made in the context of HMC by agreeing on a Helmholtz Kernel Information Profile. In the concept of FAIR DOs, PID Kernel Information is key to the machine actionability of digital content. Strongly relying on Data Types and stored in the PID record directly at the PID resolution service, PID Kernel Information can be used by machines for fast decision making.
In this session, we will shortly present the Helmholtz Kernel Information Profile and a first demonstrator allowing the semi-automatic creation of FAIR DOs for arbitrary DOIs accessible via the well-known Zenodo repository.
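To give a feel for what kernel information in a PID record looks like, the sketch below shows a hypothetical record and a minimal profile check. All attribute names, PIDs and URLs here are invented for illustration; in a real Kernel Information Profile the attribute keys are themselves PIDs of registered Data Types.

```python
# Hypothetical PID record with kernel-information attributes.
# Every identifier below is made up for illustration.
pid_record = {
    "pid": "21.T11148/0000-example",                # handle-style PID (made up)
    "kernel": {
        "digitalObjectType": "21.T11148/example-type",  # hypothetical type PID
        "dateCreated": "2022-10-05",
        "digitalObjectLocation": "https://example.org/record/0",
        "checksum": "sha256:deadbeef",
    },
}

# Mandatory attributes of our hypothetical profile.
MANDATORY = {"digitalObjectType", "digitalObjectLocation"}

def validate(record):
    """Check that all mandatory kernel attributes of the profile are present.
    This is the kind of fast, metadata-only decision a machine can make
    at PID resolution time, without fetching the object itself."""
    return MANDATORY.issubset(record["kernel"])

ok = validate(pid_record)
```

The design point: because the kernel information lives in the PID record at the resolution service, such checks need no access to the repository hosting the object.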
Imaging the environment is an essential and crucial component in spatial science. This concerns nearly everything between the exploration of the ocean floor and the investigation of planetary surfaces. In and between both domains, imaging is applied at various scales – from microscopy through ambient imaging to remote sensing – and provides rich information for science. Due to the increasing number of data acquisition technologies, advances in imaging capabilities, and the growing number of platforms that provide imagery and related research data, data volume in natural science, and thus also in ocean and planetary research, is increasing at an exponential rate. Although many datasets have already been collected and analyzed, the systematic, comparable, and transferable description of research data through metadata is still a big challenge in and for both fields. However, these descriptive elements are crucial to enable efficient (re)use of valuable research data, to prepare the scientific domains for data-analytical tasks such as machine learning and big data analytics, and to improve interdisciplinary science by other research groups not directly involved in the data collection. In order to achieve more effectiveness and efficiency in managing, interpreting, reusing and publishing imaging data, we here present a project to develop interoperable metadata recommendations in the form of FAIR digital objects (FDOs) for 5D (i.e. x, y, z, time, spatial reference) imagery of Earth and other planet(s). An FDO is a human- and machine-readable file format for an entire image set; it does not contain the actual image data, only references to it through persistent identifiers (FAIR marine images). In addition to these core metadata, further descriptive elements are required to describe and quantify the semantic content of imaging research data.
Such semantic components are similarly domain-specific but again synergies are expected between Earth and planetary research. We here present the current status of the project, with the specific tasks on joint metadata description of planetary and oceanic data.
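The idea of an image-set FDO that carries metadata and PID references rather than pixels can be sketched as below. The field names are illustrative and simplified, not the published iFDO vocabulary, and the handles are invented.

```python
# Sketch of an image-set descriptor for 5D imagery in the FDO spirit:
# the object holds set-level metadata plus per-image PID references,
# never the image data itself. Field names and PIDs are illustrative.
image_set = {
    "set-name": "example-survey-2022",
    "set-acquisition": {
        "time": "2022-06-01T12:00:00Z",
        "latitude": 54.3, "longitude": 10.1, "depth-m": 1250.0,
    },
    "items": {
        "img_0001.jpg": {"pid": "hdl:20.500.00000/example-1"},  # made-up PID
        "img_0002.jpg": {"pid": "hdl:20.500.00000/example-2"},  # made-up PID
    },
}

def referenced_pids(fdo):
    """Collect the persistent identifiers of all images the set refers to."""
    return [item["pid"] for item in fdo["items"].values()]

pids = referenced_pids(image_set)
```

Because the descriptor is small, text-based and self-contained, it can be harvested and indexed independently of the (potentially huge) image archives it points into.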
Physical samples or specimens are often at the beginning of the “research chain”, as they are the source of much of the data described in scholarly literature. The International Generic Sample Number (IGSN) is a globally unique and persistent identifier (PID) for physical samples and collections with a discovery function on the internet. IGSNs make it possible to directly link data and publications with the samples they originate from and thus close the last gap in the full provenance of research results. The modular IGSN metadata schema has a small number of mandatory and recommended metadata elements that can be individually extended with discipline-specific elements.
Based on three use cases that represent all states of digitisation - from individual scientists, collecting sample descriptions in their field books to digital sample management systems fed by an app that is used in the field - FAIR WISH will (1) develop standardised and discipline specific IGSN metadata schemes for different sample types from the Earth and Environment Sciences, (2) develop workflows to generate machine-readable IGSN metadata from different states of digitisation, (3) develop workflows to automatically register IGSNs and (4) prepare the resulting workflows for further use in the Earth Science community.
After investigating and identifying controlled linked-data vocabularies that can be included in our metadata schema, we recently have published the first data description template that includes new fields for biological and water samples. The template can be used by researchers to provide their sample descriptions and will serve as basis for semi-automated metadata generation.
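The combination of a small core schema, discipline-specific extensions and spreadsheet-style templates can be sketched as follows. The element names and the identifier are simplified placeholders, not the normative IGSN schema.

```python
# Minimal IGSN-style sample description: a few core fields plus a
# discipline-specific extension block. Names are simplified placeholders.
sample = {
    "igsn": "EXAMPLE-0001",                 # hypothetical identifier
    "name": "Water sample, station 12",
    "sampleType": "liquid>aqueous",
    "collector": "J. Doe",
    "extensions": {                         # discipline-specific elements
        "water": {"filtered": True, "volume_ml": 500},
    },
}

def to_flat_row(record):
    """Flatten a record into one spreadsheet-style template row, the kind
    of representation used for semi-automated metadata generation."""
    row = {k: v for k, v in record.items() if k != "extensions"}
    for domain, fields in record["extensions"].items():
        for key, value in fields.items():
            row[f"{domain}.{key}"] = value   # e.g. "water.volume_ml"
    return row

row = to_flat_row(sample)
```

Going the other way – parsing filled-in template rows back into nested, machine-readable records – is what enables automatic IGSN registration from researchers’ field books.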
We present three use cases which showcase methods of providing a detailed metadata description with the goal of increasing the reusability of data.
First, Hub Energy presents a photovoltaic system which required ontology development and the implementation of data models based on standards like IEC 61850 or SensorML as well as on FAIR Digital Objects (FDO). The backend was realized using the MetaStore software from the FAIR Data Commons, while an FDO browser was implemented for visualization, offering a cascading search across metadata and application data.
In a second use case of Hub Energy, time series data of the energy consumption of the buildings on KIT's Campus North are described by automatically generated RO-Crates. This allows energy researchers to use these time series data without any knowledge about the structure of the database and provides a case study on working with RO-Crate technology.
The third use case is provided by Hub Matter, in the research field of high energy physics, and shows the optimization of a typical data set for data publication. To increase FAIRness of the distributed file set, (meta)data is (i) enriched by metadata, (ii) converted to a machine- as well as human-readable format and (iii) linked to a central file to create scientific context. By abstracting from community-specific details these measures can serve as a general approach to make data publishable.
The variety of use cases presented provides a menu of technologies and approaches implemented in diverse contexts to enhance the reusability of data, along with general advice for anyone looking to do the same.
A general photovoltaic device and materials database compliant with the FAIR principles is expected to greatly benefit research and development of solar cells. Because data are currently heterogeneous across labs working on a variety of different materials and cell concepts, database development should be accompanied by ontology development. Based on a recently published literature database for perovskite solar cells, we have started an ontology for these devices and materials which could be extended to further photovoltaic applications. In order to facilitate data management at the lab scale and to allow easy upload of data and metadata to the database, electronic lab notebooks customized for perovskite solar research are being developed in cooperation with the NFDI-FAIRmat project. Current status and challenges will be discussed.
Modern science is to a vast extent based on simulation research. With the advances in high-performance computing (HPC) technology, the underlying mathematical models and numerical workflows are steadily growing in complexity.
This complexity gain offers a huge potential for science and society, but simultaneously constitutes a threat to the reproducibility of scientific results. A main challenge in this field is the acquisition and organization of the metadata describing the details of the numerical workflows, which are necessary to replicate numerical experiments and to explore and compare simulation results. In the recent past, various concepts and tools for metadata handling have been developed in specific scientific domains. It remains unclear to what extent these concepts are transferable to HPC-based simulation research, and how to ensure interoperability in the face of the diversity of simulation-based scientific applications. This project aims at developing a generic, cross-domain metadata management framework to foster reproducibility of HPC-based simulation science, and to provide workflows and tools for an efficient organization, exploration and visualization of simulation data. Within the project, we have so far reviewed existing approaches from different fields. A plethora of tools around metadata handling and workflows have been developed in the past years; among them, we identified tools and formats such as odML that are useful for our work. The metadata management framework will address all components of simulation research and the corresponding metadata types, including model description, model implementation, data exploration, data analysis, and visualization. We have now developed a general concept to track, store and organize metadata. Next, the required tools within the concept will be developed such that they are applicable both in Computational Neuroscience and in the Earth and Environmental Sciences.
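The hierarchical, odML-style organization of simulation metadata can be sketched in plain Python: nested sections of key-value properties describing model, simulation and hardware. The section and property names are illustrative, not a fixed schema.

```python
# odML-style organization sketched as nested sections of properties
# describing one simulation run. All names and values are illustrative.
metadata = {
    "model": {"name": "example-network", "neurons": 10000,
              "synapse_model": "static"},
    "simulation": {"simulator": "NEST", "resolution_ms": 0.1,
                   "duration_ms": 1000.0},
    "hardware": {"system": "HPC cluster", "nodes": 4, "mpi_ranks": 192},
}

def find_property(tree, wanted):
    """Recursively locate a property by name across all sections,
    so exploration tools need not know the exact section layout."""
    for key, value in tree.items():
        if key == wanted:
            return value
        if isinstance(value, dict):
            found = find_property(value, wanted)
            if found is not None:
                return found
    return None

resolution = find_property(metadata, "resolution_ms")
```

The same tree, serialized alongside the simulation output, is what makes a run replicable later: the framework only has to guarantee that every workflow component writes its section.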
Scientific technology, the supporting infrastructure and the resulting data are highly complex and extremely diverse. The work of the past decades has achieved digitization of many aspects of research, such as experiments, instrumentation or the publishing process. However, these individual parts mostly remain “digitized islands” and, so far, we are lacking a systematic, broad and interoperable connection between them. Here, both formalization and standardisation of data descriptions within and across research fields, i.e. research data interoperability, remain a major challenge.
A core HMC action is to support interoperability within the Helmholtz digital ecosystem while ensuring its alignment with global technology and standards. Both of these tasks require cooperation on many levels, ranging from the level of domain scientists and their research data, through the level of data stewardship and knowledge engineering, to the infrastructural and institutional level.
In this talk we will present HMC initiatives, developments and services that are all working towards an interoperable Helmholtz digital ecosystem: At the application and domain level, we are working with the electron microscopy community towards homogenized and interoperable semantic descriptions in these fields. At the level of data stewardship and knowledge engineering, HMC provides services such as our lightweight PID service PIDA and develops the “Helmholtz digitization ontology” (HDO). Once released, HDO will provide a harmonized, formal and machine-actionable understanding of the key concepts around digital dataspaces. At the infrastructural and institutional level, we are developing a Helmholtz Knowledge Graph. We will present first steps and a sketch of how this Knowledge Graph will link Helmholtz infrastructures to create a system that allows research data to be found and exchanged, both within the Helmholtz Association and with globally operating systems.
In an ever-changing world, field surveys, inventories and monitoring data are essential for predicting biodiversity responses to global drivers such as land use and climate change. This knowledge provides the basis for appropriate management. However, field biodiversity data collected across terrestrial, freshwater and marine realms are highly complex and heterogeneous. The successful integration and re-use of such data depends on how FAIR (Findable, Accessible, Interoperable, Reusable) they are. ADVANCE aims at underpinning rich metadata generation with interoperable metadata standards using semantic artefacts. These are tools allowing humans and machines to locate, access and understand (meta)data, and thus facilitating integration and reuse of biodiversity monitoring data across terrestrial, freshwater and marine realms. To this end, we revised, adapted and expanded existing metadata standards, thesauri and vocabularies. We focused on the most comprehensive database of biodiversity monitoring schemes in Europe (DaEuMon) as the base for building a metadata schema that implements quality control and complies with the FAIR principles. In a further step, we will use biodiversity data to test, refine and illustrate the strength of the concept in real use cases. ADVANCE thus complements semantic artefacts of the Hub Earth & Environment and other initiatives for FAIR biodiversity research, enabling assessments of the relationships between biodiversity across realms and associated environmental conditions. Moreover, it will facilitate future collaborations, joint projects and data-driven studies among biodiversity scientists of the Helmholtz Association and beyond.
Digital metadata solutions for epidemiological cohorts are lacking, since most schemas and standards in the health domain are clinically oriented and cannot be directly transferred. In addition, the environment plays an increasingly important role for human health, and efficient linkage with the multitude of environmental and earth observation data is crucial to quantify human exposures. There are, however, currently no harmonized metadata standards for the different areas, so they cannot be merged routinely. Therefore, we aim to compile machine-readable and interoperable metadata schemas for exemplary data of our three domains Health (HMGU), Earth & Environment (UFZ), and Aeronautics, Space & Transport (DLR).
We will present our data use cases (HMGU: GINI/LISA cohort; UFZ: drought monitor; DLR: land cover), their current metadata formats and our strategy for metadata compilation, enrichment and mapping. UFZ and DLR will converge their metadata to the ISO 19115 standard for geographic metadata. For HMGU, we reviewed several metadata standards for health data (e.g. CDISC ODM, SNOMED CT, HL7 FHIR) and started to upload our metadata to the NFDI4health StudyHub, an inventory of German health studies on COVID-19 which is based on the Maelstrom catalogue. In addition, we have developed a workflow to transform base cohort information in an ISO 19115 compliant manner. The respective metadata sheet increases accessibility to researchers from other domains without exposing sensitive information about participants’ data.
The metadata mapping will be performed by location (spatial coverage) and date (time coverage) within GeoNetwork, a catalog application that we are currently setting up in a testing environment. We aim to have a server version ready by the end of the project that can be augmented with additional metadata from our domains, but also from other fields, to facilitate interdisciplinary research.
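The mapping by spatial and temporal coverage can be sketched as a simple overlap test between two metadata records. The bounding boxes and date ranges below are invented examples, not the project’s actual datasets.

```python
from datetime import date

def overlaps(a, b):
    """True if both the spatial bounding boxes (degrees) and the
    temporal coverages of two metadata records intersect."""
    spatial = not (a["east"] < b["west"] or b["east"] < a["west"]
                   or a["north"] < b["south"] or b["north"] < a["south"])
    temporal = a["start"] <= b["end"] and b["start"] <= a["end"]
    return spatial and temporal

# Invented example records: a cohort study area and an environmental
# data product, each described by spatial and temporal coverage only.
cohort = {"west": 11.3, "east": 11.8, "south": 48.0, "north": 48.4,
          "start": date(2019, 1, 1), "end": date(2020, 12, 31)}
drought = {"west": 5.9, "east": 15.0, "south": 47.3, "north": 55.1,
           "start": date(2020, 1, 1), "end": date(2020, 12, 31)}

linked = overlaps(cohort, drought)  # candidate pair for interdisciplinary linkage
```

Matching on coverage metadata alone is what allows the catalogue to propose cross-domain links without ever touching sensitive participant-level data.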
Details of less than 10% of the 80 million individual items in the collection at the Natural History Museum can be obtained via our Data Portal; much of the collection remains undigitized, with other associated data recorded but not delivered in a coherent system. In 2018, 77 staff at the Natural History Museum, London, took part in a successful collections assessment exercise. Seventeen questions provided details of the condition, importance/significance, available information and outreach use/potential of 2,602 Collection Units covering the entire Natural History science departments and Library and Archives. Results can be displayed and filtered via a bespoke dashboard in Microsoft Power BI, accessed via a web link available internally to all staff. The project successfully recorded the expertise of the curatorial staff and produced the first comprehensive assessment of the Natural History Museum’s collection. The methodology includes first attempts to automate scoring to include other data from our collections management systems, such as environmental conditions and completeness of data coverage for the individual items that we deliver via our data portal. A few case studies are provided here to show how we have used this data and continue to refine the process of data capture and delivery for analysis, with a key example showing how we are using this data to plan a major move for 40% of our collection.