Machine Learning (ML) has been adopted at unprecedented speed in recent years as a method to support science across many disciplines. At HZDR, multiple groups employ ML in a variety of applications, and at the same time we observe groups with large data resources aspiring to use ML.
This symposium is intended to bring together practitioners, experts and novices in order to foster a community of ML users, developers/engineers and data scientists at the center. We hope to serve three goals with this workshop:
The workshop will be split into two kinds of sessions:
To accommodate whatever circumstances participants and speakers face at their place of work, all speakers in parallel sessions are asked to pre-record videos of their contributions. The link to the videos will be shared among all registered participants prior to the workshop.
Given the current COVID-19 incidence levels, this workshop will be held completely online. We ask all interested parties to register for the event so that information can be distributed more easily. Only registered participants will receive the video conferencing details.
To register, please use the Registration form available in the event menu on the left. Registration is possible until Dec 3, 2021.
We will introduce recent machine learning techniques researched and developed at the Institute of Radiation Physics for an advanced understanding of compact laser-particle accelerators (electrons and ions). High-fidelity modeling of the involved physical phenomena is carried out by computationally expensive particle-in-cell simulations, which are used for planning experiments as well as for subsequent analysis. We will discuss methods of surrogate modeling and reduced-order modeling for reducing the computational complexity and storage footprint of these simulations. From an experimental perspective, one important task is the recovery of the initial physics conditions using simulations that mimic the experiment. In addition, advanced spectral diagnostics provide promising novel insights into time-dependent processes, e.g. inside the plasma. Analysis of such data frequently involves inverse problems (e.g. phase retrieval) that occur in laser diagnostics, in the analysis of plasma expansion via pump-probe experiments (SAXS reconstruction), and in novel experimental diagnostics such as coherent transition radiation. Modern data-driven methods promise fast solutions and can quantify uncertainty even for ambiguous inverse problems, although the reliability of these methods on out-of-distribution data has to be considered.
Currently, patients who undergo radiotherapy receive a standardised treatment regimen. Adapting treatment to each patient may improve tumour control and reduce the side effects of radiotherapy. We use machine learning techniques to prognosticate tumour control and predict chronic side effects, with the aim of translating such models to the clinic.
The successful diagnostics of phenomena in matter under extreme conditions relies on a strong interplay between experiment and simulation. Understanding these phenomena is key to advancing our fundamental knowledge of astrophysical objects and has the potential to unlock future energy technologies that have great societal impact.
A great challenge for accurate numerical modeling is the persistence of electron correlation, which has hitherto impeded our ability to model these phenomena across multiple length and time scales at sufficient accuracy.
In this talk, I will present a solution to this problem in terms of a data-driven modeling framework for matter under extreme conditions – the Materials Learning Algorithms (MALA) package. MALA generates surrogate models based on deep neural networks that reproduce the output of state-of-the-art electronic structure methods at a fraction of their computational cost. This opens up the path towards multiscale materials modeling for matter under ambient and extreme conditions at a computational scale and cost that is unattainable with current algorithms.
MALA is jointly developed by the Center for Advanced Systems Understanding (CASUS) at the Helmholtz-Zentrum Dresden-Rossendorf, Sandia National Laboratories (SNL), and Oak Ridge National Laboratory (ORNL).
Reference: https://doi.org/10.1103/PhysRevB.104.035120
The accurate modeling of materials is a fundamental task in materials science. Advanced methods such as Density Functional Theory (DFT) provide quantum-chemical accuracy through explicit calculation of the electronic structure of materials, but they come at high computational cost. These computational demands are especially prohibitive in the context of dynamic investigations. Increasingly efficient implementations of DFT can only alleviate this problem to a certain degree.
Here, we present a different approach to tackle this problem. Feed-forward neural networks are trained on electronic structure data in order to replace DFT calculations at a fraction of the computational cost. Such surrogate models can be used to model matter under extreme conditions as they occur in planetary interiors or fusion reactors across multiple length and time scales.
To facilitate the training, testing, and application of DFT surrogate models, the Center for Advanced Systems Understanding develops the Materials Learning Algorithm (MALA) package as an open-source software project in collaboration with the Sandia National Laboratories and Oak Ridge National Laboratory.
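As an illustration of the core idea only (not MALA's actual implementation or API), a feed-forward surrogate can be sketched in a few lines of NumPy: a small network is trained by gradient descent to map toy "descriptor" vectors to a scalar target, standing in for the mapping from local descriptors to electronic-structure outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for electronic-structure data: descriptor vectors -> scalar target.
# (In a real DFT surrogate, inputs would be local atomic-environment descriptors
# and outputs e.g. the local density of states.)
X = rng.uniform(-1, 1, size=(256, 4))
y = np.sin(X.sum(axis=1, keepdims=True))

# One-hidden-layer feed-forward network, trained with plain gradient descent.
W1 = rng.normal(0, 0.5, (4, 32)); b1 = np.zeros(32)
W2 = rng.normal(0, 0.5, (32, 1)); b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, h @ W2 + b2

loss0 = np.mean((forward(X)[1] - y) ** 2)   # loss before training

lr = 0.05
for _ in range(500):
    h, pred = forward(X)
    g = 2 * (pred - y) / len(X)             # dLoss/dprediction
    gh = (g @ W2.T) * (1 - h ** 2)          # backpropagate through tanh
    W2 -= lr * h.T @ g; b2 -= lr * g.sum(0)
    W1 -= lr * X.T @ gh; b1 -= lr * gh.sum(0)

loss = np.mean((forward(X)[1] - y) ** 2)    # loss after training
```

Once trained, evaluating such a surrogate costs only a handful of matrix products, which is the source of the speedup over a full electronic-structure calculation.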
This talk gives a brief introduction to uncertainty quantification (UQ) for neural networks. We investigate these methods as part of a Helmholtz AI voucher in collaboration with the MALA [1,2] project, where we build surrogate models to speed up demanding density functional theory calculations. In this context, UQ methods can be used to assess the validity of model predictions and can also serve to detect out-of-distribution data.
[1] https://github.com/mala-project/mala
[2] J. A. Ellis et al., Phys. Rev. B 104, 035120 (2021)
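One of the simplest UQ baselines in this spirit is an ensemble: train several models on resampled data and use the spread of their predictions as an uncertainty estimate, which grows on out-of-distribution inputs. A minimal sketch (with a bootstrap ensemble of polynomial fits standing in for neural networks, on made-up data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Training data only covers x in [-1, 1]; x = 3 is out-of-distribution.
x_train = rng.uniform(-1, 1, 200)
y_train = np.sin(np.pi * x_train) + 0.1 * rng.normal(size=200)

# A small "ensemble": cubic fits on bootstrap resamples of the data.
coeffs = []
for _ in range(20):
    idx = rng.integers(0, 200, 200)
    coeffs.append(np.polyfit(x_train[idx], y_train[idx], deg=3))

def ensemble_predict(x):
    preds = np.array([np.polyval(c, x) for c in coeffs])
    return preds.mean(), preds.std()   # mean = prediction, std = uncertainty

_, std_in  = ensemble_predict(0.0)   # inside the training domain
_, std_ood = ensemble_predict(3.0)   # far outside it
```

The ensemble spread at x = 3 is much larger than inside the training domain, which is exactly the signal used to detect out-of-distribution inputs.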
In recent years, Physics-Informed Neural Networks (PINNs) have gained considerable traction in the scientific computing community. PINNs provide a neural-network surrogate model parametrizing the solution space of a given Partial Differential Equation (PDE), derived as the solution of a variational problem.
The variational problem is typically formulated in terms of $L^2$-norms that are approximated by mean square errors. We will present a generalization of the analytical $L^2$ setup to Sobolev spaces and their corresponding norms, allowing weak formulations of PDEs to be implemented consistently with classical mathematical theory. The approximation of the Sobolev norms is realized by extending classic Gauss-Legendre cubature rules.
In this presentation, we will demonstrate that the derived Sobolev cubatures make it possible to replace automatic differentiation by polynomial differential operators and to reach higher accuracy on classic PINN problems than the previously used MSE or $L^2$ losses.
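The classical ingredient, Gauss-Legendre quadrature, is readily available in NumPy. The sketch below (an illustration of the underlying quadrature idea, not the actual Sobolev-cubature implementation) shows how such a rule evaluates a squared $L^2$ norm and an $H^1$ seminorm exactly for polynomial integrands:

```python
import numpy as np

# Gauss-Legendre nodes and weights on [-1, 1]:
# exact for polynomial integrands up to degree 2n - 1 (here n = 5, degree 9).
nodes, weights = np.polynomial.legendre.leggauss(5)

f  = lambda x: x ** 4        # test integrand
df = lambda x: 4 * x ** 3    # its derivative

# Squared L^2 norm and H^1 Sobolev seminorm of f on [-1, 1] via quadrature.
l2_sq      = np.sum(weights * f(nodes) ** 2)    # integral of x^8  = 2/9
h1_semi_sq = np.sum(weights * df(nodes) ** 2)   # integral of 16x^6 = 32/7
```

Because both integrands are polynomials of degree at most 9, the five-point rule reproduces the analytical values to machine precision, unlike a plain mean-square-error average over samples.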
Numerical simulations of complex systems such as laser-plasma acceleration are computationally very expensive and have to be run on large-scale HPC systems. Offline analysis of experimental data is typically carried out by costly grid scans or optimisation runs of a particle-in-cell code like PIConGPU that models the corresponding physical processes. Neural-network-based surrogate models of these simulations drastically speed up the analysis due to fast inference times, promising live analysis of experimental data. The quality of such a surrogate model, in terms of generalisation, depends on the stiffness of the problem along with the amount and distribution of training data. Unfortunately, the generation of training data is very storage-intensive, especially for high-fidelity simulations in the upcoming exascale era. We therefore need to rethink the training of surrogate models to tackle the memory and storage limitations of current HPC systems. This is achieved by translating continual learning from computer vision to surrogate modeling, while additional regularization terms are introduced to foster the generalisation of the surrogate model. The training of the neural network is carried out concurrently with a running PIConGPU simulation, without the need to write training data to disk. The IO system for moving data via streaming methods from the simulation to the concurrently running training task is built with the openPMD-api and ADIOS2. A proof of principle is demonstrated by training a 3D convolutional autoencoder that learns a compressed representation of laser-wakefield-acceleration simulations streamed from PIConGPU.
In multiphase fluid dynamics knowledge of the particle size distribution of the dispersed phase is one of the key points of interest. In chemical engineering bubble columns are used as mass transfer apparatuses, where a gas is dispersed in a liquid phase. The size distribution of the bubbles is the determining factor for the total surface contact area between the two different phases and with that a dominant factor on mass transfer.
In experimental investigations this necessitates an accurate and reliable segmentation of gas bubbles that overlap in recorded images to accurately determine their size. Since proper statistics are needed in many cases, an automated procedure is required to evaluate larger datasets. Machine learning applications offer a great opportunity to improve the segmentation and supersede existing techniques (Watershed, …).
These tasks can be solved by many different Convolutional Neural Networks and their variants (U-Net, StarDist, etc.). We first demonstrate successful instance segmentation results by applying the StarDist algorithm and comparing it with a two-stage U-Net-based segmentation approach.
These types of neural networks are trained using supervised learning techniques and require a rather large set of manually annotated image data. Since the manual creation of labels is tedious and its throughput limited, available training data is scarce. This work presents an approach that uses deep generative models to create artificial images resembling experimental data, enabling us to enlarge the dataset for segmentation training. This approach is used to specifically train U-Net and a variant of StarDist (MultiStar) to improve the segmentation of overlapping bubbles.
We further outline potential shortcomings of our experiments and discuss future research directions.
The emerging advances in imaging technologies pave the way for the availability of a multitude of complementary data (e.g., spectral, spatial, elevation) in Earth sciences. Recently, hyperspectral imaging techniques have emerged as one of the most important tools to remotely acquire fine spectral information from different materials/organisms. Nonetheless, such datasets require dedicated processing for most applications due to 1) the high dimensionality of a hyperspectral image (HSI) and 2) the highly mixed nature of pixels within an HSI. In addition, fine spectral information usually comes at the cost of coarse spatial resolution due to the trade-off between spectral and spatial resolutions in hyperspectral imaging systems. Therefore, several machine learning techniques (e.g., supervised and unsupervised learning) have been proposed in recent decades to alleviate these challenges.
Among the proposed machine learning techniques, unsupervised learning has become popular since it does not rely on labeled samples. Data points in a high-dimensional dataset can be drawn from a union of lower-dimensional subspaces; thus subspace-based clustering approaches, and specifically the sparse subspace clustering (SSC) concept, have drawn special attention for clustering high-dimensional data into meaningful groups. SSC-based approaches benefit from the so-called "self-expressiveness" property, whereby each data point can be written as a linear combination of other data points from the same subspace. Such algorithms are hence able to handle the high-dimensional and highly mixed nature of HSIs, as is the case in real-world applications (e.g., urban, land-cover, and mineral mapping). However, the superior performance of SSC is counterbalanced by high computational demands and long runtimes compared to traditional clustering approaches. In addition, the number of clusters of interest needs to be predefined prior to the clustering procedure.
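The self-expressiveness property can be illustrated in a few lines of NumPy (using a minimum-norm least-squares solve in place of the sparse optimization that actual SSC performs): with points drawn from two orthogonal lines, each point is represented using only points from its own subspace, so the resulting affinity matrix is block-diagonal.

```python
import numpy as np

# Two orthogonal 1-D subspaces in R^3: three points along e1, three along e2.
X = np.array([[1.0, 2.0, 3.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0, 2.0, 3.0],
              [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])   # shape (3, 6), columns = points

N = X.shape[1]
C = np.zeros((N, N))
for j in range(N):
    others = np.delete(X, j, axis=1)
    # Minimum-norm coefficients expressing x_j by the other points
    # ("self-expressiveness"); real SSC adds a sparsity penalty here.
    c = np.linalg.pinv(others) @ X[:, j]
    C[np.arange(N) != j, j] = c

# Symmetric affinity matrix, as used for the final spectral clustering step.
A = np.abs(C) + np.abs(C).T
same  = A[0, 1]   # two points from the same subspace
cross = A[0, 3]   # points from different subspaces
```

Points from different (orthogonal) subspaces receive zero weight, while points from the same subspace receive nonzero weight, which is why spectral clustering on this affinity recovers the subspaces.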
We proposed the following studies to mitigate the aforementioned challenges and develop automatic, robust, and fast clustering approaches to analyze remote sensing datasets.
We studied the performance of different sparse subspace-based clustering algorithms on drill-core hyperspectral domaining [1];
We developed a fast, robust, and automatic sparse subspace-based clustering algorithm, "hierarchical sparse subspace clustering (HESSC)", to analyze HSIs [2];
To incorporate spatial information in the clustering procedure, we proposed a hidden-Markov random subspace-based clustering algorithm for HSI analysis [3];
To improve the final clustering result and fully exploit spatial information in the clustering procedure, we proposed multi-sensor hidden-Markov random subspace-based clustering and multi-sensor sparse-based clustering (Multi-SSC) algorithms, where the former utilizes a post-processing step to refine the generated clustering map in accordance with spatial information, while the latter uses the spatial and contextual information within the clustering structure itself. It is worth noting that the spatial and contextual information is derived from high-spatial-resolution images, whereas the rich spectral information is extracted from an HSI [4], [5].
Prior to any analysis, one needs to conduct preprocessing steps to decrease the effect of the noise (e.g., atmospheric effects, instrumental noise) contaminating the data. It is crucial to carry out these preprocessing steps precisely. We studied the impact of applying a denoising technique before and after atmospheric corrections. The observations challenge the current de facto paradigm of denoising in the processing chain of spaceborne and airborne remotely sensed images [6].
References
[1] Shahi, K. R., Khodadadzadeh, M., Tolosana-Delgado, R., Tusa, L., and Gloaguen, R. (2019, September). The Application of Subspace Clustering Algorithms in Drill-Core Hyperspectral Domaining. In 2019 10th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS) (pp. 1-5). IEEE.
[2] Rafiezadeh Shahi, K., Khodadadzadeh, M., Tusa, L., Ghamisi, P., Tolosana-Delgado, R., and Gloaguen, R. (2020). Hierarchical Sparse Subspace Clustering (HESSC): An Automatic Approach for Hyperspectral Image Analysis. Remote Sensing, 12(15), 2421.
[3] Rafiezadeh Shahi, K., Ghamisi, P., Jackisch, R., Khodadadzadeh, M., Lorenz, S., and Gloaguen, R. (2020). A New Spectral-Spatial Subspace Clustering Algorithm for Hyperspectral Image Analysis. ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, V-3-2020, 185-191. 10.5194/isprs-annals-V-3-2020-185-2020.
[4] Rafiezadeh Shahi, K., Ghamisi, P., Jackisch, R., Rasti, B., Scheunders, P., and Gloaguen, R. (2021). A Multi-Sensor Subspace-Based Clustering Algorithm Using RGB and Hyperspectral Data. In 2021 11th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS) (pp. 1-5). IEEE.
[5] Rafiezadeh Shahi, K., Ghamisi, P., Rasti, B., Jackisch, R., Scheunders, P., and Gloaguen, R. (2020). Data Fusion Using a Multi-Sensor Sparse-Based Clustering Algorithm. Remote Sensing, 12(23), 4007.
[6] Rafiezadeh Shahi, K., Rasti, B., Ghamisi, P., Scheunders, P., and Gloaguen, R. (2021). When Is the Right Time to Apply Denoising? In 2021 IEEE Geoscience and Remote Sensing Symposium. IEEE.
Deep learning methods have found profound success in recent years in solving complex tasks in fields such as computer vision, speech recognition, and security applications. However, these deep learning models have been found to be vulnerable to adversarial examples: perturbed samples, imperceptible to the human eye, that lead the model to erroneous output decisions. In this study, we adapt and introduce two geometric metrics, namely density and coverage, and evaluate their use in detecting adversarial samples in batches. We empirically study the metrics using MNIST and real-world biomedical datasets from MedMNIST, subjected to two different adversarial attacks. Our experiments show promising results.
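As a hypothetical sketch of the kind of geometric metric involved (following the common k-nearest-neighbour formulation of density and coverage, not our exact experimental setup), coverage drops sharply when a batch is shifted away from the reference data:

```python
import numpy as np

def knn_radius(real, k):
    # Distance from each reference sample to its k-th nearest reference neighbour.
    d = np.linalg.norm(real[:, None] - real[None, :], axis=-1)
    return np.sort(d, axis=1)[:, k]   # column 0 is the point itself

def coverage(real, batch, k=3):
    r = knn_radius(real, k)
    d = np.linalg.norm(real[:, None] - batch[None, :], axis=-1)
    # Fraction of reference samples with at least one batch sample in their k-NN ball.
    return np.mean((d <= r[:, None]).any(axis=1))

def density(real, batch, k=3):
    r = knn_radius(real, k)
    d = np.linalg.norm(real[:, None] - batch[None, :], axis=-1)
    # Average number of reference k-NN balls covering each batch sample, over k.
    return np.mean((d <= r[:, None]).sum(axis=0) / k)

rng = np.random.default_rng(2)
real      = rng.normal(0, 1, (200, 2))
clean     = rng.normal(0, 1, (200, 2))        # same distribution as the reference
perturbed = clean + np.array([4.0, 0.0])      # strongly shifted "attacked" batch

cov_clean, cov_adv = coverage(real, clean), coverage(real, perturbed)
den_clean, den_adv = density(real, clean), density(real, perturbed)
```

A batch whose coverage and density fall well below those of clean batches is a candidate for containing adversarial samples.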
Helmholtz AI has been established to support scientists across the Helmholtz Association and across domains. The consultant team at HZDR was put in place to focus its support on matter research, which includes accelerator physics. Recently, we cooperated with DESY Hamburg on a voucher for the European XFEL.
To run experiments at the European XFEL, it is essential that various laser systems work correctly. Damage to such a laser renders the conducted experiments useless, but it is not obvious when such damage occurs. Therefore, we developed a method to automatically detect a defective laser system from its sensor data in a machine learning pipeline. Trained only on data from the healthy system, the clustering algorithm can detect whether unseen data stems from a damaged or a healthy system.
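A minimal sketch of this idea (with made-up toy sensor data, not the actual XFEL pipeline): fit a clustering model on healthy data only, score new samples by their distance to the nearest cluster centre, and flag samples above a threshold derived from the healthy data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "sensor data": healthy readings cluster around two operating points.
healthy = np.vstack([rng.normal([0.0, 0.0], 0.3, (100, 2)),
                     rng.normal([3.0, 3.0], 0.3, (100, 2))])

# Tiny k-means (k = 2), fitted on healthy data only.
centroids = np.array([healthy[0], healthy[-1]])   # one seed per operating point
for _ in range(20):
    labels = np.argmin(np.linalg.norm(healthy[:, None] - centroids, axis=2), axis=1)
    centroids = np.array([healthy[labels == i].mean(axis=0) for i in range(2)])

def anomaly_score(x):
    # Distance to the nearest healthy cluster centre.
    return np.min(np.linalg.norm(x - centroids, axis=1))

# Threshold chosen from healthy data; unseen samples above it are flagged.
threshold = np.quantile([anomaly_score(x) for x in healthy], 0.99)
damaged   = np.array([1.5, 1.5])   # reading far from both healthy operating points
```

Because the model never sees damaged data during training, any sensor reading that falls outside the healthy clusters produces a high anomaly score.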
Solving partial differential equations (PDEs) is an indispensable part of many branches of natural science. The analysis of experimental data by numerical simulation typically requires costly optimisation or grid scans. Machine-learning-based surrogate models are a promising way to approximate these simulations quickly by learning the complex mapping from parameters to solution. However, recent surrogate models require a considerable amount of training data derived from full simulation runs, introducing a high storage demand as well as computational complexity, especially if high-dimensional, high-fidelity simulations are considered. Physics-informed neural networks (PINNs) are a mesh-free approach that relies only on data for initial/boundary conditions. However, the most recent implementations of PINNs are not suited for high-fidelity problems because no multi-GPU implementation is available, making them intractable for scientific applications at FWK. We, the Helmholtz AI YIG at FWK, developed a library called Neural Solvers enabling large-scale distributed training of physics-informed neural networks. We primarily aim at solving forward and inverse problems in laser and plasma physics, such as 3D laser propagation for an advanced understanding of laser-plasma accelerators, with a very low storage footprint while retaining the physical correctness of the trained surrogate model. The framework is ready for cross-disciplinary applications such as atomic physics for the simulation and inversion of density functional theory, geophysical inversion of the acoustic wave equation, as well as inversion of the 2D heat equation based on experimental neuroimaging data. Our experiments have shown that Neural Solvers can reach the accuracy of recent numerical methods while scaling excellently up to at least 180 NVIDIA V100 GPUs, making them ready for upcoming exascale systems.
In the future, we will be researching parameterized PINNs to learn the solution of a set of simulations concurrently while transfer learning will be used to transfer knowledge from data of lower fidelity to high-fidelity problems.
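The core of the PINN loss, penalizing the PDE residual at collocation points together with initial/boundary terms, can be sketched without any deep learning framework. The hypothetical example below uses a polynomial ansatz instead of a neural network (which makes the minimization a linear least-squares problem) to solve u' + u = 0, u(0) = 1 on [0, 1]; Neural Solvers itself trains actual networks with automatic differentiation.

```python
import numpy as np

# Collocation points in [0, 1] and a degree-6 polynomial ansatz
# u(x) = sum_k a_k x^k for the ODE u'(x) + u(x) = 0 with u(0) = 1.
x = np.linspace(0, 1, 50)
deg = 6
V  = np.vander(x, deg + 1, increasing=True)           # u(x_i) per coefficient
dV = np.hstack([np.zeros((len(x), 1)),
                V[:, :-1] * np.arange(1, deg + 1)])   # u'(x_i) per coefficient

# PINN-style loss: PDE residual at collocation points plus a weighted
# initial-condition term, minimized here by linear least squares.
A = np.vstack([dV + V, 10.0 * np.eye(deg + 1)[:1]])
b = np.concatenate([np.zeros(len(x)), [10.0]])        # residual = 0, u(0) = 1
coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)

u   = V @ coeffs
err = np.max(np.abs(u - np.exp(-x)))                  # exact solution: exp(-x)
```

Swapping the polynomial ansatz for a neural network turns this linear solve into the gradient-based training loop used by actual PINN frameworks, but the loss construction is the same.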
The understanding of laser-solid interactions is important to the development of future laser-driven particle and photon sources, e.g., for tumor therapy, astrophysics or fusion. Currently, these interactions can only be modeled by simulations which need verification in the real world. Consequently, in 2016, a pump-probe experiment was conducted by Thomas Kluge to examine the laser-plasma interaction that occurs when an ultrahigh-intensity laser hits a solid density target. To handle the nanometer spatial and femtosecond temporal resolution of the laser-plasma interactions, Small-Angle X-Ray Scattering (SAXS) was used as a diagnostic to reconstruct the laser-driven target. However, the reconstruction of the target from the SAXS diffraction pattern is an inverse problem, which is often ambiguous due to the phase problem and has no closed-form solution. We aim to simplify the process of reconstructing the target from SAXS images by employing neural networks, owing to their speed and generalization capabilities. More specifically, we use a conditional Invertible Neural Network (cINN), a type of normalizing flow, to resolve the ambiguities of the target with a probability density distribution. The target in this case is modelled by a simple grating function with three parameters. We chose this analytically well-defined and relatively simple target as a trial run for neural networks in this field, to pave the way for more sophisticated targets and methods. Unfortunately, we do not have enough reliable experimental data for training. Consequently, the network is trained only on simulated diffraction patterns and their respective ground-truth parameters. The cINN is able to accurately reconstruct simulated as well as pre-shot data. The performance on main-shot data remains unclear, since the simulation might not be able to explain the governing processes.
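The key ingredient of a cINN is a stack of invertible coupling blocks. A minimal NumPy sketch of one affine coupling transformation (with toy weight matrices in place of trained subnetworks, and without the conditioning input) shows how exact invertibility is obtained by construction:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy stand-ins for the learned scale and translation subnetworks s(.), t(.).
W_s = rng.normal(0, 0.5, (2, 2))
W_t = rng.normal(0, 0.5, (2, 2))

def forward(x):
    # Split the input; the first half parametrizes an affine map of the second.
    x1, x2 = x[:2], x[2:]
    s, t = np.tanh(W_s @ x1), W_t @ x1
    return np.concatenate([x1, x2 * np.exp(s) + t])

def inverse(y):
    # The affine map is inverted analytically; s and t are recomputed from y1 = x1.
    y1, y2 = y[:2], y[2:]
    s, t = np.tanh(W_s @ y1), W_t @ y1
    return np.concatenate([y1, (y2 - t) * np.exp(-s)])

x = rng.normal(size=4)
x_rec = inverse(forward(x))   # reconstructs x exactly (up to rounding)
```

Because each block is exactly invertible, a trained flow can be run forwards to generate samples and backwards to evaluate densities, which is what makes the posterior over target parameters tractable.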
Radiation signatures emitted by laser-plasma interactions are ubiquitous and straightforward to acquire experimentally via imaging and spectroscopy. The data encode phase-space dynamics on the smallest temporal and spatial scales. Yet such data are hard to interpret and thus frequently discarded as being too complex. For theory and data analysis this raises several central questions: What are experimentally promising radiation signatures? What do they mean physically, and are they robust and unambiguous indicators?
Calculating the classical radiation emitted by relativistic plasmas from all charged particles, across the entire spectrum from the IR to the x-ray range and into the full solid angle, while retaining coherence and polarization properties, is a prime HPC data challenge, currently requiring exascale compute capabilities. These calculations are successfully performed in situ by the particle-in-cell code PIConGPU at the cost of increasing the computational requirements by several orders of magnitude.
By exploiting machine learning techniques we aim for two goals: speeding up the calculation of these radiation signatures, and improving knowledge extraction, i.e. connecting simulated and experimentally relevant radiation signatures, ideally unambiguously, to the initial radiation sources and physics processes.
We introduce the data challenge and motivate how a large-scale distributed analysis of a huge set of unstructured point-cloud data via an autoencoder approach can be used to map a compressed representation to radiation diagnostics via an invertible neural network. Initial results from a specialized application on a smaller scale have been encouraging: invertible neural networks based on variational autoencoders have successfully been trained on flashes of radiation in laser-wakefield accelerators to identify and spatially localize the instances of electron injection.
We propose a deep-neural-network-based surrogate model for plasma shadowgraphy - a technique for visualizing perturbations in a transparent medium. We substitute the numerical code with a computationally cheaper projection-based surrogate model that approximates the electric fields at a given time without computing all preceding electric fields, as numerical methods require. This means that the projection-based surrogate model makes it possible to recover the solution of the governing 3D partial differential equation, the 3D wave equation, at any point of a given compute domain and configuration without the need to run a full simulation. The model has shown good reconstruction quality when interpolating data within a narrow range of simulation parameters and can be used for input data of large size.
Air quality regulations have reduced emissions of pollutants in the U.S., but many prognostic studies suggest that future air quality might be degraded by global climate change. The climate simulated by various climate models shows large variation over the coming decades, and it is important to account for such variations when studying future air quality. A typical approach to studying future air quality projections uses three-dimensional (3D) Eulerian models, but these models are computationally too expensive to perform an ensemble of long-term simulations for various climate projections. Therefore, we have developed a machine learning (ML) based air quality model to study, in an efficient way, how future air quality might be influenced by climate change. Our ML model uses a two-phase random forest to predict O3 and PM2.5 concentrations, with training datasets of key meteorological information and air-pollutant emissions. To evaluate the model performance, we used the input datasets of the U.S. Environmental Protection Agency (EPA) Community Multiscale Air Quality Modeling System (CMAQ) simulations and compared our model predictions against the CMAQ output as a benchmark. The 1995 – 1997 data were used to train the ML model; 2025 – 2035 data were used to evaluate it. The ML model performs well for hourly O3 predictions over the whole domain in four selected months (January, February, July, and August): the R2 values are in the range 0.5 – 0.7, the normalized mean bias (NMB) values are within ±3%, and the overall normalized mean error (NME) values are below 20%. Compared to CMAQ, our ML model tends to overpredict O3 in the Southeast U.S. and California and to underpredict it in the Central U.S., and the NMB values computed for each grid cell are generally within ±10%. Predicting PM2.5 is more challenging than predicting O3, but our ML model's performance is still acceptable.
The overall R2 values of the PM2.5 predictions are in the range 0.4 – 0.6, and the NMB values are within ±6%, but the NME can be up to 60%. The NMB in each grid cell is within ±30%. There is no clear trend in the regional variation of the ML model's performance for PM2.5. Our ML model performs better for summer PM2.5 (July and August) than for winter (January and February): the NME is 10% - 20% lower in summer. For O3, in contrast, the model performs better in winter than in summer, with about 10% lower NME. With GPU acceleration, our ML model runs in less than one hour on a single GPU to predict 11 years of one-month simulations (11 months in total). It uses significantly fewer computing resources than 3D models like CMAQ while achieving comparable predictive skill. This shows that our ML model is a reliable and efficient tool to assess air quality under various climate change scenarios.
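For reference, the evaluation statistics quoted above are computed in the usual way (a sketch assuming the standard definitions; the numbers in the example are made up):

```python
import numpy as np

def nmb(pred, obs):
    # Normalized Mean Bias: sum of errors over sum of observations.
    return np.sum(pred - obs) / np.sum(obs)

def nme(pred, obs):
    # Normalized Mean Error: sum of absolute errors over sum of observations.
    return np.sum(np.abs(pred - obs)) / np.sum(obs)

obs  = np.array([40.0, 50.0, 60.0])   # e.g. hourly O3 concentrations
pred = np.array([42.0, 47.0, 63.0])

print(f"NMB = {nmb(pred, obs):+.1%}, NME = {nme(pred, obs):.1%}")
# prints: NMB = +1.3%, NME = 5.3%
```

Unlike the NMB, which lets positive and negative errors cancel, the NME accumulates their magnitudes, which is why the two statistics together characterize both bias and overall error.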
In this talk, I'd like to present modern machine learning tools for estimating the posterior of the inverse problem arising in a beam-control setting. That is, given an experimental beam profile, I'd like to demonstrate tools that help to estimate which simulation parameters are likely to have produced a similar beam profile.
Research on the interactions between pathogens and their hosts is key to understanding the biology of infection. Commencing at the level of individual molecules, these interactions define the behavior of infectious agents and the outcomes they elicit. The discovery of host–pathogen interactions (HPIs) conventionally involves a stepwise, laborious research process. However, novel computational approaches, including machine learning and deep learning, can significantly accelerate the discovery process. One example of such an approach is an algorithm for detecting intracellular and extracellular poxvirus virions in 3D super-resolution micrographs without specific immunohistochemical labelling. This is made possible through deep learning model inference from seemingly irrelevant fluorescence channels. Another example allows predicting infection outcomes in a population of cells employing time-lapse microscopy data.
While Machine Learning (ML) has caused a boom in the computational sciences and enjoys a broad field of applications, some of its weaknesses, such as limited accuracy, huge training-data requirements and poor interpretability, narrow its domain of application in complex-systems science, which demands high scientific and performance quality.
Building on our recent findings, we have improved classic multivariate polynomial interpolation schemes (MIP), whose strengths and weaknesses tend to be complementary to those of ML methods.
In this presentation, we will demonstrate how to extract the best of both worlds in order to provide models for sparse and scattered data measurements, enable efficient post-processing analysis of ML surrogate models, optimise ML hyperparameters, and regularise ML autoencoders.
In particular, we look forward to introducing you to the open-source minterpy Python package, whose alpha release includes the core implementations, making our contribution accessible to the ML community.
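As a plain-NumPy illustration of why careful construction matters in polynomial interpolation (minterpy's actual API is not shown here): interpolating Runge's function at equispaced nodes diverges, while Chebyshev nodes, one of the classical remedies that modern MIP schemes build on, remain accurate.

```python
import numpy as np

# Runge's function, the classic failure case for equispaced interpolation.
f = lambda x: 1.0 / (1.0 + 25.0 * x ** 2)

n = 20
x_eval = np.linspace(-1, 1, 1000)

def interp_error(nodes):
    # Maximum error of the degree-n polynomial interpolant through the nodes.
    coeffs = np.polyfit(nodes, f(nodes), deg=len(nodes) - 1)
    return np.max(np.abs(np.polyval(coeffs, x_eval) - f(x_eval)))

equi = np.linspace(-1, 1, n + 1)
cheb = np.cos((2 * np.arange(n + 1) + 1) * np.pi / (2 * (n + 1)))  # Chebyshev nodes

err_equi, err_cheb = interp_error(equi), interp_error(cheb)
```

The equispaced interpolant oscillates wildly near the interval ends (Runge's phenomenon), while the Chebyshev interpolant converges; choosing good node sets and bases is exactly the kind of design question MIP schemes address.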
With this talk, I'd like to present and review our activities as Helmholtz AI consultants at HZDR. Our team has been in place since 2019 and is fully operational by now. This talk will highlight some of our activities, the experiences we have made, and our future plans. We hope that the HZDR community will profit from this and that we will establish more collaborations with it.