11 June 2024
Forschungszentrum Jülich
Europe/Berlin timezone

Workshop on Foundation Models for Topological Data - Challenges and Opportunities

This workshop will be held in conjunction with the Helmholtz Foundation Model Initiative Get-together

 

When: 13:15 – 14:45 

Where: INM Seminar Room, building 15.9, room 4001b [coordinates]
 


Timo Dickscheid1,2,3, Christian Schiffer1,2, Susanne Wenzel1,2, Martin Schultz4

1 Helmholtz AI, Forschungszentrum Jülich
2 Institute of Neuroscience and Medicine (INM-1), Forschungszentrum Jülich
3 Institute of Computer Science, Heinrich-Heine-University Düsseldorf
4 Jülich Supercomputing Centre, Forschungszentrum Jülich

Half-day workshop

Foundation models have started to revolutionise the field of artificial intelligence, and with it many scientific and industrial applications. So far, research on these powerful generalist models revolves around models that operate on images or written language, or sometimes both in combination. In comparison, topological data (e.g., attributed graphs) has received relatively little attention in the development of foundation models, despite its central role in many scientific domains, including climate modelling, neuroscience, molecular science, and remote sensing. In this workshop, we will present and discuss applications from different scientific domains that rely on analysing topological data, and aim to identify common challenges and potential solutions towards developing foundation models for topological data.


Talks

AtmoRep: a probabilistic multi-purpose model for atmospheric dynamics

Ilaria Luise, CERN

Challenges and Opportunities of Foundation Models for Decoding the Human Brain

Christian Schiffer, Institute of Neuroscience and Medicine (INM-1), Forschungszentrum Jülich

DECODE: Cloud-connected Labs of Future for Energy Materials 

Kourosh Malek, Theory and Computation of Energy Materials (IEK-13) & Centre for Advanced Simulation and Analytics (CASA), Forschungszentrum Jülich

Lessons from enzyme function prediction: encoding protein topology for LFMs

Karel van der Weg, Institute of Bio- and Geosciences: Bioinformatics (IBG-4), Forschungszentrum Jülich

 

Latest information on GitHub [link]


Abstracts

AtmoRep: a probabilistic multi-purpose model for atmospheric dynamics

Ilaria Luise, CERN

AtmoRep is a first example of a probabilistic, multi-purpose model that extends the concept of representation learning to Earth system science, and in particular to atmospheric dynamics. A multidisciplinary collaboration between Magdeburg University, CERN and the Jülich Supercomputing Centre recently released a first prototype (www.atmorep.org). Starting from a single pre-trained architecture used as a backbone, AtmoRep achieves skilful results in multiple zero-shot applications such as nowcasting, temporal interpolation and scenario generation, competitive with state-of-the-art approaches. Thanks to a novel definition of the loss, the model is also probabilistic by design: it outputs a set of ensemble members for each task, with well-calibrated distributions, as demonstrated for weather forecasting. The talk will focus on the innovative aspects of the model architecture, the current results for the most relevant applications, and the foreseen developments.
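As background for the calibration of ensemble forecasts mentioned above, a standard verification measure is the Continuous Ranked Probability Score (CRPS). The sketch below is a generic illustration of scoring an ensemble, not AtmoRep's actual loss; all numbers are invented for the example.

```python
# Hedged illustration, NOT AtmoRep's loss: the Continuous Ranked Probability
# Score (CRPS) scores an ensemble forecast against an observation, rewarding
# both accuracy (members near the observation) and calibration (spread).
import numpy as np

def crps_ensemble(members, observation):
    """Unbiased-style CRPS estimate from 1D ensemble members and a scalar observation."""
    members = np.asarray(members, dtype=float)
    accuracy = np.mean(np.abs(members - observation))
    # average pairwise distance between members, counted once per ordered pair
    spread = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return accuracy - spread

# A well-centred ensemble scores lower (better) than a biased one
good = crps_ensemble([14.8, 15.1, 15.0, 15.2], observation=15.0)
bad = crps_ensemble([17.8, 18.1, 18.0, 18.2], observation=15.0)
print(good < bad)  # True
```

Lower CRPS is better; an ensemble collapsed exactly onto the observation scores zero, while an overconfident or biased ensemble is penalised.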

 

Challenges and Opportunities of Foundation Models for Decoding the Human Brain

Christian Schiffer, Institute of Neuroscience and Medicine (INM-1), Forschungszentrum Jülich

Decoding the microstructural organization of the human brain relies on the analysis of multi-modal image data that captures complementary organizational principles, including the distribution of neuronal cells (cytoarchitecture), nerve fiber organization (fiber architecture), and neurotransmitter receptor distributions (chemoarchitecture). Petabyte-scale multi-modal imaging datasets acquired using modern microscopic imaging techniques enable microstructural investigation in great detail. In recent years, specialized deep learning methods have been developed that are able to perform a range of different downstream tasks for these datasets, including segmentation, classification, data imputation, cross-modality image synthesis, and representation learning. Inspired by the recent success of large foundation models in performing a wide range of downstream tasks for different data modalities, we will discuss in this talk how foundation models could transform AI-based microstructural brain analysis.


DECODE: Cloud-connected Labs of Future for Energy Materials 

Kourosh Malek1,2, Michael Eikerling1,2

1 Theory and Computation of Energy Materials (IEK-13), Forschungszentrum Jülich
2 Centre for Advanced Simulation and Analytics (CASA), Simulation and Data Science Lab for Energy Materials (SDL-EM), Forschungszentrum Jülich

The clean energy technology sector faces a major challenge: the pace of development trails behind commercialization targets. A root cause hampering progress towards cleaner materials and technologies is that laboratories with complementary capabilities still largely operate in isolation, with little coordination among their efforts. The EU-funded DECODE project (DE-centralised ClOud Labs for inDustrialisation of Energy Materials) aims to break down these barriers by creating a decentralised and adaptive cloud-connected labs concept, transforming the development and innovation process for clean energy materials and technologies. The project envisions a decentralised platform that connects multiple labs to enhance the effectiveness and accelerate the progress of research and development in the field of clean energy technology. The core elements of the platform are the DECODE FABRIC, a matrix-like structure that facilitates collaboration, and a scoring system to assess the integration readiness of methods and tools. An AI-enabled CPU orchestrates contributions from partner labs. Initially focusing on hydrogen technologies, DECODE's vision may be expanded to other technologies, including energy harvesting, storage, clean water, and more. The platform strives for an unprecedented level of flexibility and adaptability, accommodating diverse strategies and technologies. In summary, DECODE accelerates clean energy innovation through interconnected labs, fostering a sustainable and cleaner future.

At its core, DECODE strives to develop and deploy three innovative modules: 1) the DECODE Foundry, a semantic search engine for assembling methods and tools into practical workflows; 2) the DECODE FABRIC, a matrix structure that connects modelling and characterisation suites; and 3) the DECODE CPU, for end-to-end orchestration of a given materials development-to-integration pipeline. The platform, designed for unprecedented flexibility and interoperability, harnesses AI-driven data management and ontological mapping to enable seamless collaboration among partner labs. The DECODE platform will be built with a modular architecture, harnessing the existing AI-cloud and LLM-enabled data management infrastructure at Forschungszentrum Jülich - IEK-13 (Virtual Mind Labs).
The DECODE project has been designed to achieve an integrated European materials platform, allowing the systematic use of tools and capabilities including materials modelling, characterisation, robotics, data documentation, ontologies, artificial intelligence, and machine learning, orchestrated to accelerate the design, development and application of chemicals, materials and related processes and manufacturing.

Figure 1: Conventional process vs. DECODE process. Modelling concepts (blue circles), characterisation tools (blue hexagons), methodologies (grey circles).

The DECODE project has received funding from the European Union’s Horizon Europe Research and Innovation Program under Grant Agreement No 101135537.  

Lessons from enzyme function prediction: encoding protein topology for LFMs

Karel van der Weg, Institute of Bio- and Geosciences: Bioinformatics (IBG-4), Forschungszentrum Jülich

Recent advances in biology and machine learning have enabled the development of large foundation models (LFMs), such as Meta's ESM, leading to an explosion of structural protein information. Understanding these proteins forms the basis for drug discovery, engineered enzymes, and personalized medicine.
However, many LFMs for proteins are based on sequence representations alone, ignoring the topological relations in protein structures that influence factors like shape and stability. Protein function arises from dynamic topology, with interactions described as the combined topology of the interaction partners.
We will demonstrate TopEC, our deep graph neural network for protein function prediction, which uses message-passing networks to encode protein topology in graphs. We will discuss specific techniques for encoding biochemical data, including rotational invariance and the incorporation of 3D positional information. Drawing on our experience with TopEC, we will showcase the inclusion of topological data in our structure-based protein LFM.
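As a schematic illustration of the rotational-invariance idea (this is not the actual TopEC implementation; all function and variable names below are hypothetical), a message-passing step over a residue graph can be made rotation-invariant by conditioning messages only on pairwise distances, never on raw coordinates:

```python
# Minimal sketch, NOT the TopEC code: one message-passing step over a protein
# residue graph. Rotational invariance follows from using only pairwise
# distances between 3D positions, which are unchanged by rotation.
import numpy as np

def message_passing_step(features, positions, edges, w_msg, w_upd):
    """One round of distance-aware message passing.
    features: (N, F) node features; positions: (N, 3) coordinates;
    edges: list of (i, j) directed pairs; w_msg, w_upd: (F, F) weights."""
    messages = np.zeros_like(features)
    for i, j in edges:
        dist = np.linalg.norm(positions[i] - positions[j])  # rotation-invariant
        # message from j to i, damped by a simple distance kernel
        messages[i] += np.exp(-dist) * (features[j] @ w_msg)
    # update: combine each node's own features with its aggregated messages
    return np.tanh(features @ w_upd + messages)

# Rotating the whole structure leaves the output unchanged:
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
pos = rng.normal(size=(4, 3))
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
w_msg = rng.normal(size=(8, 8))
w_upd = rng.normal(size=(8, 8))

theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])  # rotation about the z-axis
out_a = message_passing_step(feats, pos, edges, w_msg, w_upd)
out_b = message_passing_step(feats, pos @ rot.T, edges, w_msg, w_upd)
print(np.allclose(out_a, out_b))  # True
```

Real architectures in this space add edge features, attention, and learned radial bases, but the invariance argument is the same: any quantity built from inter-node distances is unaffected by rigid rotation of the structure.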
Furthermore, we will showcase how topological data analysis methods such as PCA, t-SNE and UMAP visualize the topology of the latent space in deep neural networks. These methods yield insights into network behavior and enable the discovery of relevant biological information.