The Research Data Management Container (RDMC) is a key element of the NFDIxCS architecture. To integrate diverse communities into the requirements engineering process, we conduct a series of HackaThons. This HackaThon is part of that initiative, aiming to engage the community of researchers and research software engineers. Given the architecture's complexity, developing a prototype is essential to demonstrate the critical components and the overall NFDIxCS system for managing RDMCs. This practical, hands-on workshop will contribute to enhancing and expanding the conceptual version of the RDMC and its associated platform. The workshop uses a modularized approach to allow flexible engagement in the hacking over the course of the day and includes the following thematic areas:
By the end of November 2023, the state of the work on the RDMC and platform can be characterized as a prototype based on existing work by the authors. Further development steps will lead to a bare-bones version of the RDMC and platform that realizes the basic functionality and the concepts from (1, 2).
The preparation will include a set of questions aimed at extending the basic functionality. The following list serves as a starting point to guide the discussion on specific topics chosen for the HackaThon:
To capitalize on the results of the HackaThon, they will be integrated into the existing prototype in a git repository.
A brief report for the interested public will also be published on the NFDIxCS website (https://www.nfdixcs.org).
(1) Goedicke, Michael; Lucke, Ulrike (2022): Research Data Management in Computer Science - NFDIxCS Approach. INFORMATIK 2022. DOI: 10.18420/inf2022_112.
(2) Al Laban, Firas; Bernoth, Jan; Goedicke, Michael; Lucke, Ulrike; Striewe, Michael; Wieder, Philipp; Yahyapour, Ramin (2023): Establishing the Research Data Management Container in NFDIxCS. Vol. 1 (2023): 1st Conference on Research Data Infrastructure (CoRDI). DOI: 10.52825/cordi.v1i.395.
This workshop is organised by the German Reproducibility Network.
Let’s talk about good practices in research as well as research software engineering! We love the saying “Better Software, Better Research”, but what do we actually need to do in practice? In this workshop we share best practices on how to make your work FAIR and reproducible and then also discuss what that means for your project. How can I set up a reproducible project? How should I license my software? How can I make my materials findable and reusable? These and more questions will (hopefully) be answered as part of this workshop.
Topic | Details | Speaker | Duration (min) | Start |
---|---|---|---|---|
Arrival + Welcome | Welcome (from GRN+HIFIS) | Tobias | 5 | 09:30 |
Introduction | The GRN and the idea of the workshop | Heidi | 10 | 09:35 |
Talk session | Short talks introducing the 4 topics (5min + questions) | Heidi, Michael, Tobias, Max | 40 | 09:45 |
Break | — | — | 20 | 10:25 |
Intro: World Cafe | How the world cafe works | Heidi | 5 | 10:45 |
World Cafe | Participants can go to tables with experts on the 4 topics (15 minutes on each table) | all | 65 | 10:50 |
Wrap-Up | Wrap-up (1 minute highlight per table) | Tobias (+ Heidi, Max, Michael) | 5 | 11:55 |
4 topics: 4 talks + 4 world cafe tables (15min per round / per table)
Expert: Heidi
Expert: Michael
Expert: Tobias
Expert: Max
In the Leibniz Association there are at least three institutes (e.g., LEIZA, WIAS, PIK, ...) with active members in the RSE community - but we have the feeling that we might be more. For that reason we would like to come together in Würzburg to get to know each other, share and discuss ideas on research software engineering within the Leibniz family, and build up a Leibniz-RSE network.
Options that we might want to discuss are the formation of a "deRSE Arbeitskreis RSE @ Leibniz" or some linking into the currently forming Working Group (AG) for Software in the Leibniz Association (in association with the WG Research Data).
How to participate and do networking as members of both Leibniz and the RSE community could also be a point of the meet-up discussions. Are there perhaps topics/fields where we should bring an RSE perspective into Leibniz or a Leibniz perspective into RSE? To support the discussions we will share both experience and appropriately branded cookies (even if we are not associated with that specific company in Hannover).
RSEs within Max-Planck would like to meet and discuss internally the current state of affairs.
Most likely we will be fewer than 30 people. We will probably start at 9:30 and go until 12:00, with a break in between.
The term entropy was originally introduced by the physicist Rudolf
Clausius as a quantity which describes the ability of a physical system
to change its state in a thermodynamic process. But at least since the
pioneering work of the mathematician Claude Shannon, entropy has also
become a central concept in information theory. How are these two
interpretations related? What exactly is entropy and how can we use it
to understand thermodynamic quantities such as temperature? This lecture
aims to provide an introductory overview of these questions in the area
between computer science and physics. One focus will be the numerical
estimation of entropy through sampling. As with any nonlinear estimation
measure, systematic errors occur when estimating entropy, but these
errors can be significantly reduced using suitable mathematical methods.
The lecture also deals with the question of how these correction
procedures can be implemented algorithmically.
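To make the estimation problem concrete, here is a minimal sketch (not taken from the lecture) of the naive plug-in estimator for discrete samples together with the Miller-Madow bias correction, one classic example of such a correction procedure:

```python
import numpy as np

def plugin_entropy(samples):
    """Plug-in (maximum-likelihood) entropy estimate in nats from discrete samples."""
    _, counts = np.unique(samples, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def miller_madow_entropy(samples):
    """Plug-in estimate plus the Miller-Madow bias correction (K - 1) / (2N)."""
    _, counts = np.unique(samples, return_counts=True)
    n = counts.sum()       # number of samples
    k = len(counts)        # number of observed states
    return plugin_entropy(samples) + (k - 1) / (2 * n)

# The plug-in estimator systematically underestimates the true entropy of a
# uniform distribution over 8 states (log 8 ~ 2.079 nats) for small sample sizes;
# the correction reduces this bias.
rng = np.random.default_rng(0)
samples = rng.integers(0, 8, size=100)
print("plug-in      :", plugin_entropy(samples))
print("Miller-Madow :", miller_madow_entropy(samples))
print("true value   :", np.log(8))
```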
Are you developing an Open Source research software? Is your software developed and maintained mainly by one group or organization? Is your software also of interest to third-party users, be it researchers, NGOs or other groups of users? Do you have first third-party users of your software? Do you wonder how to build a community around your software beyond the original developers?
If these questions resonate with you, we invite you to join our upcoming meet-up. This event aims to connect research software developers planning to broaden their development team, those encountering their initial interactions with third-party users, and those who have successfully established vibrant communities around their research software. We are interested in your experiences, both positive and negative, in collaborating with software developers beyond your organization, and we want to learn about the challenges you may have faced.
During the meet-up we will split up into smaller groups and discuss and hopefully answer questions like:
• How to prepare research software for third-party users/developers?
• How to attract new third-party users/developers?
• How to get third-party users into collaborative development?
• How to balance growing demand for support with obligations for own projects?
• How to lower entry barriers?
Your insights will contribute to the creation of a final document summarizing all lessons learned and proposing effective solution strategies. This collective resource will serve as a valuable guide for research software developers navigating the transition from a software primarily used by a specific group to a research software developed and maintained by a broader and more diverse community.
ioProc is a scientific workflow manager designed for reproducible, maintainable, and transparent linear workflows which are metadata ready. In this tutorial we will together:
Finally, we will close the tutorial with a quick guided tour through the source code repository of ioProc, which is particularly interesting for potential contributors or for those who want to peek under the hood.
What do you need to bring with you?
NHR is an alliance of university high-performance computing centers funded by the federal and state governments. NHR offers not only computing resources, but also consulting and training for scientists at German universities.
In our presentation, we will give a brief overview of NHR and the services it offers to scientists.
In Earth System modelling (ESM), the high variety and complexity of ESM processes to be simulated, as well as the specialisation of scientists, led to the existence of a multitude of models, each aiming at the simulation of different aspects of the system. This scenario gave rise to a highly diverse ecosystem of ESM software, whose components are written in different languages, employ different HPC techniques and tools, and overlap or lack functionalities.
To use the national technical resources and the scientific expertise more efficiently, the natESM project aims to establish a coupled seamless ESM system by providing so-called technical-support sprints. A sprint consists of a goal-oriented package of work executed by a dedicated research software engineer on a selected ESM model during a defined amount of time.
The scientists' sprint proposals undergo a technical evaluation, which is divided into two steps: The first is a "sprint check", consisting of a low-threshold method for the presentation of a sprint idea. After an open discussion about the technical feasibility and possible adjustments, if the work aligns in general with the natESM strategy, the proposals move to the second phase: the creation of a full sprint application. This consists of a detailed document which contains the model characteristics and formalises the goals, timeline, and criteria for sprint fulfilment.
Once scientific approval is also granted, the work on a sprint starts and lasts from a couple of weeks up to six months, depending on the stipulated timeline and objective. The period and timeline can also be adjusted depending on impediments found along the way. The type of work done is in line with natESM goals, and sprints are usually focused on, but not limited to, architecture porting, model coupling and interfacing, modularization, as well as general software engineering improvements. The overarching concept is to efficiently enable the models to progress in the desired direction, and therefore entails identifying the minimum amount of work which will allow the model to take the largest strides towards the community goal. Upon finalisation of a sprint, public documentation is published, potentially serving as help to other models addressing similar problems.
The complete process is based on an open and interactive discussion between the model scientists and the research software engineers. It usually takes the form of short regular meetings and chat-based conversations. The underlying reasoning is to minimise the work a scientist has to do to communicate their problems and to receive support.
Based on the positive results observed and feedback from the community, the sprint concept is proving to be a highly effective approach to improving the technical resources of the scientific community. This tool, guided by a comprehensive strategy outlined by scientific requirements, will help bring the future goals of Earth system modelling ever closer. Its applicability extends beyond the field of Earth system modelling and may also prove valuable to other research areas seeking to establish a research-software-engineering service and accelerate the results of partnerships between scientists and RSEs.
Artificial Intelligence (AI) transforms various industries with its potential for intelligent automation, leading to the creation of innovative products and services. This technological paradigm shift poses unique challenges and opportunities for research software engineers, pushing them to integrate AI into their software to maintain academic competitiveness.
In response to these evolving needs, the WestAI service center, funded by the BMBF, was established in Germany. WestAI represents a consortium of institutions, each recognized for their expertise in machine learning, artificial intelligence, and scalable computing. The center offers support to businesses and research groups aiming to integrate AI into their products and workflows. To achieve this, WestAI adopts three key strategies:
This presentation will offer a comprehensive overview of WestAI’s services and infrastructure, demonstrating how they can advance AI projects. For further information and to engage with our experts, please visit westai.de.
As a central institution of the University of Hamburg, the House of Computing and Data Science (HCDS) supports interdisciplinary research and application of innovative digital methods in close cooperation with its partners from science and research in the Hamburg metropolitan region.
It coordinates and supports the implementation of the digital strategy in research at the University of Hamburg. It fuels the easy adoption, usage, and research of digital methods in its Methodology Competence Center and offers various disciplines and projects a forum for the exchange of information and collaboration at the interface between methodological sciences and applied sciences in the Cross-Disciplinary Labs (CDLs).
In Cross-Disciplinary Labs, interdisciplinarity takes place at eye level: here, research questions are addressed that are of interest to both methodological science and its application in research in science and the humanities.
The presentation outlines the initial approach and the preliminary results of a comprehensive survey aimed at mapping the Research Software Engineer (RSE) landscape within the Helmholtz-Zentrum Dresden-Rossendorf (HZDR). The primary objective of this survey is to provide a comprehensive overview of the current status of RSE activities at HZDR, including the diverse range of RSE projects being undertaken. By examining the emerging trends and challenges within the institution, this research aims to equip decision-makers with essential insights to facilitate the future establishment and training of RSEs at HZDR.
Since the SSI surveys, we have known that research software is an essential part of the scientific work of many researchers. However, many researchers who develop software have not received specific training for that task, with the consequent impacts on the software quality, re-usability and sustainability. Central support units at the institutions where these scientists are employed can be important instruments to overcome this problem.
Such a group has existed at the Alfred Wegener Institute, Helmholtz Center for Polar and Marine Research for several years. It was originally initiated to integrate new numerical methods into various Earth system models. However, it quickly became clear that the need for support goes much further. In addition to specific tasks such as porting model code for new HPC platforms or new programming paradigms, it also includes more general tasks such as assistance in introducing better coding and developing practices and providing training.
Due to technical developments on the one hand and the strong staff fluctuations due to fixed-term contracts in the institute on the other, the range of services has to be constantly adjusted. Therefore, we conducted a survey at our institution to determine the status quo regarding the development of research software. It was found that scientific groups at AWI often invest enormous time and human resources in development and maintenance work. At the same time, the need for support services and consulting was articulated. In addition, our HPC and Data Processing support group organized interviews with specific user groups to better structure the support services of our group, using the small group's staff more efficiently. As a support unit for different working groups, it is important to find out early about planned new scientific projects that require support. At the same time, the interviews served to discover synergies where generalized solutions can support several scientific groups at the same time. Training and workshops for development best practices, code optimization, automation of tasks, etc. are of general interest.
Our contribution will present the findings from the survey and interviews, and will introduce the measures taken by our group to help our users to develop code more sustainably.
The RSE Publication Monitor is an online tool that shows how much and in which contexts scientists write about software. As the RSE Publication Monitor is developed at Forschungszentrum Jülich, it currently uses a publication database of the Forschungszentrum as its data source.
The monitor regularly queries a database to find publications that mention certain keywords related to software. Based on this information, it is possible to browse through correlations of keywords mentioned in these publications, and also to find out how often different keywords are used in publications from different institutes/research groups.
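To illustrate the kind of analysis the monitor performs (a hypothetical sketch, not the monitor's actual code; record contents and keyword list are made up), keyword co-occurrence over a set of publication records can be computed along these lines:

```python
from collections import Counter
from itertools import combinations

# Hypothetical publication records; the real monitor queries an institutional database.
publications = [
    {"title": "...", "abstract": "We developed open source software with unit tests ..."},
    {"title": "...", "abstract": "The simulation software uses version control and CI ..."},
]
keywords = ["software", "open source", "version control", "unit tests"]

# Count how often pairs of keywords appear together in the same publication.
cooccurrence = Counter()
for pub in publications:
    text = pub["abstract"].lower()
    found = [k for k in keywords if k in text]
    for pair in combinations(sorted(found), 2):
        cooccurrence[pair] += 1

for pair, count in cooccurrence.most_common():
    print(pair, count)
```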
This tool can be easily extended to cover other keywords of interest. Furthermore, the code is open source and can be easily adapted to use other data sources, and everyone is encouraged to design their own analysis. We are always open to collaborations and further ideas.
This talk will not only focus on the general approach and the technical structure of the publication monitor, but will also show some findings from using the monitor in order to highlight the benefits of setting up such a monitor.
Together with the University of Alabama, we are replicating "An empirical study of security culture in open source software communities" (https://doi.org/10.1145/3341161.3343520) with a focus on research software engineers in Germany and the US.
The original study collected quantitative evidence on how important different aspects of security culture are in the field of open source software development.
While we acknowledge that open source software also plays an important role throughout the research community, we are also aware that there are important differences.
For example, while participation in an open source project is largely driven by volunteering and idealism, the development of research software has more personal motivations.
Also, research operates under different (grant) policies and legal frameworks.
Hence, we thought it was time to redo this well-designed study, with some extensions of our own, in the scientific community.
The survey will be open until the end of 2023 (and maybe a little longer). In this talk, we want to present some preliminary results and maybe some early findings, especially with a focus on the German part of the study.
To enable and ensure the reproducibility and traceability of scientific claims, it is essential that scientific publications, the corresponding datasets, and the data analysis code are made publicly available. The adoption of the FAIR4RS principles and software engineering best practices could significantly enhance the success of delivering a codebase that produces consistent, reproducible results. Yet, not much is known about how practices like version control, continuous integration, scientific data management, automated testing, and software documentation and citation are adopted within the scientific community.
To gain a better understanding of the standards and practices followed by research software developers in Potsdam, we identified GitHub users who are PhD candidates, postdocs, researchers, or other academics affiliated with the University of Potsdam or one of the local research institutes. This focus is motivated by our goal of fostering collaboration and engagement with the research software development community in Potsdam. We then collected 13,000+ open-source repositories of those users for a thorough study to assess the coding practices and standards they follow. The collection contains a significant number of data analysis and web development projects, with frequent mentions of keywords like data, web, and machine learning, along with the use of programming languages like Python and JavaScript. There is also a notable presence of non-technical projects, including educational repositories identified by terms like course, teaching, and workshop, along with repositories featuring keywords such as survey, study, assignments, and thesis. To focus only on repositories that are actually research software, we classified them according to the DLR application classes. This helped us eliminate many repositories that do not contain relevant artefacts and are not meant for generating reproducible research results. We found 2,100+ projects that fall under DLR application class 0, i.e. repositories without relevant artefacts that align with research software. In our study, which is presented as a short talk, we want to continue classifying all collected open-source projects into the DLR application classes, with the goal of carefully assessing the use of Git, Git workflows, and automated testing, and how developers organize their code, test files, and documentation. Additionally, we aim to assess the level of accessibility and the quality of scientific documentation and code commenting that ensure software reproducibility. With this short talk we hope to spark a discussion and further collaboration on this topic.
Research projects in the field of energy systems research are typically conducted in interdisciplinary teams - scientists from energy and electrical engineering, computer science, economics and other disciplines often work together on simulation-based analysis [Ferenz2022]. In many projects, joint simulation models are created and integrated into existing simulation frameworks. Analysis of data from simulation studies is also often undertaken as a joint activity. Thus, the need for joint software and model development in energy system research is obvious. Nevertheless, the knowledge of software engineering strongly varies between the different backgrounds. In such an interdisciplinary research project in the field of energy research, we tested whether a training series on research software engineering would improve the quality of the work, especially the software, and the mode of collaboration.
The training series served to harmonize the level of knowledge on research software engineering across different disciplines and to make the individual competencies and knowledge of the partners usable within the overall group. This is done with the clear goal of improving the quality of the work and making tools relevant to the work known to all.
As part of the training series, several topics have been discussed beforehand with research group leaders in the domain to identify relevant competencies in the given setting of energy system research software.
The topics of joint software development, version management, and (automated) testing of the developed software have been identified as crucial aspects from the DevOps area, and thus have been addressed in the first trainings. This was undertaken to ensure that all partners have a consistent level of knowledge and can guarantee a consistent quality standard for the developed software. As a basic module, some aspects of requirements engineering have been part of the curriculum.
In addition, a closer look at open source strategies and licensing models was included since these are increasingly important in the context of Open Science. In this way, it was ensured that project members can meet the requirements of this approach during their work.
In the talk, some lessons learned from conducting the course will be presented and discussed. We would like to get feedback on the general idea, the course concept, the curriculum and discuss experiences of other attendees on delivering this kind of knowledge to interdisciplinary research teams.
The talk should include 15 minutes + 15 minutes discussion with the audience, if possible.
[Ferenz2022] Ferenz, S.; Ofenloch, A.; Penaherrera Vaca, F.; Wagner, H.; Werth, O.; Breitner, M.H.; Engel, B.; Lehnhoff, S.; Nieße, A. An Open Digital Platform to Support Interdisciplinary Energy Research and Practice—Conceptualization. Energies 2022, 15, 6417. https://doi.org/10.3390/en15176417
At a community workshop at deRSE23 in Paderborn we talked about
teaching RSEs. Among the questions considered were: What are the basic competencies of RSEs?
Which institutions are required? Which changes are necessary in the current academic system?
After this workshop we started to gather the input from the community
and quickly noticed that the scope is bigger than a single short publication,
and therefore the TeachingRSE project was born.
In this talk I will present our first publication on the foundational competencies of an RSE (arXiv:2311.11457). In addition, I will detail the direction of this project and point people to possibilities for contributing online, or directly at the accompanying workshop.
For the first time, Introduction to Research Software Engineering was offered as a class in the Computer Science faculty of TU Dresden during this winter semester (2023/24) (https://tu-dresden.de/ing/informatik/smt/cgv/studium/lehrveranstaltungen/ws2324/RSE/index). This talk will briefly cover the content, feedback from students, and our own observations, as well as ideas on how to continue and extend the class in the future.
What if your graduate programme actually prepared you with all the software engineering skills you need to participate in the research software community?
The M.Sc. Computational Science and Engineering (CSE) at the Technical University of Munich gathers STEM graduates from all over the world and teaches them elements of numerics, computer science, and applications. CSE graduates are the perfect candidates for developing the next PETSc, OpenFOAM, or Tensorflow. And yet, the programming-related part of the curriculum needed some aligning and dusting.
Over the past six years, we had the opportunity to look at the big picture, redesigning several courses that "tell a story" together. Nowadays, a CSE student can follow a coherent track that prepares them for working as software engineers in an RSE team developing simulation software. We start by preparing the ground with fundamental Linux, Git, Matlab/Octave, C++, and teamwork concepts in the 1-week onboarding course "CSE Primer". In the afternoons of that week, students also work in teams, developing small projects analyzing climate data.
For the rest of the semester, CSE students have to follow "Advanced Programming", a course with an ambitious name which we put a lot of effort into justifying. With the "advanced", we aim to raise the level of the inexperienced, while still offering enough opportunities for already experienced students to grow. The material covers a pragmatic mixture of modern C++ with just enough references to legacy features to be able to work with existing codebases. The slides include code snippets that the students can interact with using the Compiler Explorer. The tutorial exercises include common tools that support the development, including a debugger, sanitizer, formatter, build tools, testing frameworks, and more. An optional project lets students develop their own idea in pairs, or contribute to existing open-source projects, while participating in a code peer-review process. The lectures and tutorials are hybrid, and the in-person exam is supported by TUMExam, a system that offers digital correction and review features. The redesign of this course attracted several students from additional study programs, with the original audience of CSE students now representing less than 10% of the exponentially-growing total audience.
After the first semester, students follow a practical (lab) course. One highlight is the Computational Fluid Dynamics Lab, in which students work in groups to implement worksheets and their own final project in a bare-minimum C++ PDE framework, receiving code reviews on GitLab and taking their first steps towards parallel programming and performance optimization. Cross-references between Advanced Programming and the CFD Lab make the two courses coherent, without discouraging external students from joining. The (no longer offered due to staff shortage) seminar Partitioned Fluid-Structure Interaction lets students expand their research skills specific to CFD, writing their own paper and participating in peer reviews.
This talk will give an overview of these courses, discussing several didactical and technical elements applied in each, concluding with not-so-obvious good practices.
We will have a wine tasting in the famous wine cellar of the UNESCO world heritage site, the residence of Würzburg.
This is a tutorial aimed at absolute beginners that tries to get people started with their first CI pipeline. This is a skill-up session. We will use GitLab to try to get a linter working on a Python script. The workshop will follow a Carpentries style such that everyone leaves with a working CI example. This workshop is part of the "A day of CI" track.
Contributing to better health has been the motivation for many software developers to stay in academia despite excellent job perspectives in industry. However, as soon as digital health data is processed with your software, things easily get complicated today. You can still download open health data and play around, but as soon as it gets close to patient care and you really want to have an impact on health practice, you find yourself in between privacy risk and software quality assessments, software validation, and liability risks. You need to be compliant with the Medical Device Regulation, and likely the AI Act and the Cybersecurity Act – and of course the General Data Protection Regulation with its various local dialects (aka Landesdatenschutzgesetze). This talk will give you an idea of the complexity of biomedical software engineering -- and why open source software should be an ethical mandate in this field.
A working group of de-RSE e.V. is currently writing a position paper on "Establishing RSE departments in German research institutions". According to the timeline of the working group, a preprint version of this position paper will be available for de-RSE24. The paper is intended for adoption as an official position of the de-RSE association. Following up on successful community interactions at de-RSE23 in Paderborn and the de-RSE Unconference 2023 in Jena, we intend to use the conference to collect feedback for the community approval process of the position paper. After the feedback from the conference meet-up and the open online review is incorporated, we aim for a swift adoption and publication procedure for the position paper.
In this presentation, we will dive into the topic of testing, with a specific focus on the development of unit tests. Fundamental approaches to writing effective tests and improving the quality of our software will be explained. We will go into why to do testing at all and how it helps us to detect bugs early and enhance the maintainability of our codebase.
In order to be able to determine how well our code is tested, the concept of code coverage is introduced. Furthermore, an outlook on additional possibilities and approaches for writing tests will be provided. This will include advanced testing concepts such as integration tests, property-based testing, and mutation testing.
The objective of this presentation is to provide participants with a basic understanding of unit test development and equip them with practical tips and techniques for building high-quality software. We hope that after this talk, you will be able to write tests effectively and optimize your development processes.
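As a flavour of what such a unit test looks like (a minimal, hypothetical example, not material from the talk), consider a pytest-style test for a small helper function:

```python
# test_statistics.py -- minimal, hypothetical example of pytest-style unit tests.
import pytest

def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)

def test_mean_of_known_values():
    # The expected result is computed by hand, independently of the implementation.
    assert mean([1, 2, 3, 4]) == 2.5

def test_mean_raises_on_empty_input():
    # Error handling is part of the contract and deserves its own test.
    with pytest.raises(ValueError):
        mean([])
```

Running such a suite with a coverage plugin (e.g. `pytest --cov`, provided by pytest-cov) then reports which lines of the code under test are actually exercised.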
Keeping documentation up to date can be difficult.
Synchronization between the behaviour of the software and the documentation is, on the other hand, extremely important. A lack of it means that users might lose trust in the maintainers (this is even more important for the documentation of an HPC system), and its presence is crucial, for example, for Tutorial-Driven Development, where the documentation acts as an integration test suite for the code.
In this talk I will present a view of the challenges associated with the problem, look at solutions that the community has adopted, present the solution developed for our specific use case, and discuss how CI can support this practice.
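One widely used way to keep documentation examples executable (mentioned here purely as an illustration; the talk presents its own solution) is Python's doctest module, which runs the examples embedded in docstrings as tests:

```python
def fahrenheit_to_celsius(f):
    """Convert a temperature from Fahrenheit to Celsius.

    The examples below are part of the documentation *and* run as tests:

    >>> fahrenheit_to_celsius(212)
    100.0
    >>> fahrenheit_to_celsius(32)
    0.0
    """
    return (f - 32) * 5 / 9

if __name__ == "__main__":
    # A CI job can run this module (or `python -m doctest ...`) and fail
    # whenever the documented behaviour and the actual behaviour diverge.
    import doctest
    doctest.testmod()
```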
Supporting multiple operating systems, compilers, etc. can lead
to a combinatorial explosion with the need for a large test suite.
Similarly, a large library can have many features to test such that a full run
for one of those combinations can take a long time.
In this skill-up we will talk about a common approach to address this problem:
layering test suites and pipelines.
With this approach, the most important parameter combinations
are tested on every pull request, while a fuller set is run regularly on a schedule.
This also includes a discussion about combinatorial explosion and parameter
matrices to help decide which combinations are the most important.
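As a small illustration of the idea (the skill-up may use different tooling; the matrix below is hypothetical), the following sketch enumerates a full build matrix and selects a reduced layer to run on every pull request, leaving the remaining combinations to a scheduled pipeline:

```python
from itertools import product

# Hypothetical build matrix: 3 x 3 x 2 = 18 combinations in the full suite.
compilers = ["gcc", "clang", "msvc"]
oses = ["linux", "macos", "windows"]
build_types = ["Debug", "Release"]

full_matrix = list(product(compilers, oses, build_types))

# Reduced layer run on every pull request: one representative per compiler.
pr_layer = [
    ("gcc", "linux", "Release"),
    ("clang", "macos", "Release"),
    ("msvc", "windows", "Release"),
]

# The remaining combinations are exercised only in the scheduled (e.g. nightly) pipeline.
scheduled_layer = [combo for combo in full_matrix if combo not in pr_layer]

print(f"full matrix: {len(full_matrix)} jobs, "
      f"per-PR layer: {len(pr_layer)} jobs, "
      f"scheduled layer: {len(scheduled_layer)} jobs")
```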
A coffee break! Use it to connect with others, or have a look at the posters!
Deployment of scientific software across diverse platforms like MacOS, Windows, and Linux is a requirement for any software which is developed for the broader community. Such deployment presents multifaceted challenges, particularly when mixed-language programming (e.g., C/C++, Fortran and Python) with intricate dependencies (like Qt or Python) is a part of the build mechanism.
Addressing library dependencies adds another layer of complexity, particularly in the choice between static and dynamic linking and in maintaining consistency across versions and platforms. Furthermore, bundling libraries with the installer/package requires strategic choices to strike a balance between efficient use of system resources, portability, and ease of installation for the average user (with Windows MSI installers, MacOS DMG files, and Linux packages like DEB and RPM).
The incorporation of a separate Python package ("wheel") provides a straightforward installation mechanism for the user, as Python's cross-platform compatibility simplifies certain deployment aspects and usage; yet incorporating the underlying C/C++ components (libraries) necessitates a proper configuration to ensure smooth integration with the user's system.
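As a generic illustration of this packaging step (a minimal sketch, not the actual BornAgain build configuration; module and file names are hypothetical), a setuptools-based setup.py can compile a C++ extension into a Python wheel:

```python
# setup.py -- minimal, generic sketch; real projects typically delegate the
# native build to CMake and add platform-specific compile/link flags.
from setuptools import setup, Extension

ext = Extension(
    name="mypackage._core",      # hypothetical extension module name
    sources=["src/core.cpp"],    # hypothetical C++ source file
    language="c++",
)

setup(
    name="mypackage",
    version="0.1.0",
    packages=["mypackage"],
    ext_modules=[ext],
)

# Build the wheel with `python -m build` or `pip wheel .`; the resulting file
# is platform-specific because it bundles the compiled extension.
```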
This talk explores the complexities involved in deploying such software, with attention to platform-specific nuances, intricacies in terms of library linking, compilation and packaging (installers and Python wheels), and provides some insights and solutions acquired in the long-term experience with developing BornAgain, an open-source software to simulate and fit neutron and x-ray scattering.
Creating an architecture for a distributed system consisting of several machines operating different software with different programming languages can be a challenge. Do you find yourself designing new interfaces over and over again, with only use-case-specific differences? We present a solution to that: libjapi.
libjapi is an abstract and reliable C library that can be integrated into existing software to provide a configurable API. Download it from https://github.com/Fraunhofer-IIS/libjapi and use it under the MIT license.
libjapi receives newline-delimited JSON messages via TCP and calls registered C functions. A JSON response is returned for each request. The benefit to you: All string handling is done by the library. Multiple users can be connected at the same time. Furthermore, it is also possible to create push services, which asynchronously push JSON messages to the clients subscribed to them. The behaviour is highly customizable at the server side. The clients can be implemented in any other language or framework as long as they can receive and transmit JSON messages over TCP. Different clients can interact with the same server as in the following use case.
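As a hypothetical sketch of such a client (the actual command names, JSON fields, and port depend on the functions registered with libjapi on the server side), a few lines of Python suffice to send a newline-delimited JSON request over TCP and read the response:

```python
import json
import socket

# Hypothetical request; field names and command are assumptions, not libjapi's
# fixed message format -- they are defined by the respective service.
request = {"command": "get_temperature", "args": {"unit": "celsius"}}

with socket.create_connection(("localhost", 1234)) as sock:  # hypothetical port
    # libjapi expects newline-delimited JSON messages.
    sock.sendall((json.dumps(request) + "\n").encode("utf-8"))
    # Read until the newline that terminates the JSON response.
    response_line = sock.makefile("r", encoding="utf-8").readline()

print(json.loads(response_line))
```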
We use libjapi to control the settings of an SDR (software-defined radio) modem. libjapi is used by the GNU Radio-based system to provide an API that can change settings like the center frequency of the received or transmitted signals. Measured values like received channel power and an overview of the most important settings are also provided as a push service. Three different clients were implemented: a command-line tool in Python and a Vue-based interactive web frontend, both of which can operate at the same time, and additionally a service in C that listens to the push service and records all these values into a database.
The library is provided as free open source software to ease the life of other developers like you. Feel free to use it, provide feedback through GitHub and add your own contributions and improvements. Ideas for future improvements include authentication, the integration of a JSON schema validator and usability improvements.
Did you inherit a huge C or C++ research code?
Are you supposed to make it faster (say you're in HPC)?
Are you supposed to introduce new features (say you're in physics)?
Or perhaps you want to rejuvenate this code?
Do you estimate the code to be too large for that to be done properly or cleanly?
This talk introduces the ideas of the Coccinelle system for large-scale code analysis and restructuring.
You are invited to visit our 2h tutorial to get a working introduction to the tool usage.
See https://github.com/coccinelle/coccinelle for more about Coccinelle.
The Julia programming language aims to provide a modern approach to scientific high-performance computing by combining a high-level, accessible syntax with the runtime efficiency of traditional compiled languages. Due to its native ability to call C and Fortran functions, Julia often acts as a glue code in multi-language projects, enabling the reuse of existing libraries implemented in C/Fortran. With the software library libtrixi, we reverse this workflow: It allows one to control Trixi.jl, a complex Julia package for parallel, adaptive numerical simulations, from a main program written in C/C++/Fortran. In this talk, we will present the overall design of libtrixi, show some of the challenges we had to overcome, and discuss continuing limitations. Furthermore, we will provide some insights into the Julia C API and into the PackageCompiler.jl project for static compilation of Julia code. Besides the implications for our specific use case, these experiences can serve as a foundation for other projects that aim to integrate Julia-based libraries into existing code environments, opening up new avenues for sustainable software workflows.
The Medieval Latin Dictionary (MLW - https://mlw.badw.de) is a research project of the Bavarian Academy of Sciences and Humanities. Its purpose is to write a comprehensive dictionary of medieval Latin, taking into account scholarly (theological) as well as profane sources.
Writing dictionaries of this kind and extent is typically an enterprise that takes many decades to complete. And the Medieval Latin Dictionary, too, started decades ago. Several years ago, its workflows were put on a fully digital basis, which also includes entering the dictionary articles in a structured form.
For this purpose, XML is often the technology of choice. However, XML is clumsy to read and write and, because of that, one can hardly avoid using proprietary technology like the Oxygen XML Editor for authoring. The scientists had been given the choice to either use XML or try a domain-specific notation which was going to be developed specifically for the MLW.
After prototypes for both alternatives had been presented, the MLW team soon decided in favor of the domain-specific notation, which has now been in use for three years.
In my talk I describe the collaborative process in which the DSL was developed and launched, its key design principles, and the technological choices. However, I will also talk about the things that went wrong and the lessons we learned.
(The software as well as the documentation is open source and available at: https://gitlab.lrz.de/badw-it/mlw-dsl-oeffentlich)
Introduction
In 2018, the LEO study identified approximately 6.2 million individuals between the ages of 18 and 64 in Germany with very low literacy skills. These individuals are deemed to have a literacy level that restricts their participation in crucial aspects of society. [1] Various former and recent studies, including VERA 8, PISA, ULME I-III, and IQB, have highlighted a similar situation within the realm of apprenticeship and for juveniles in general. [5] A widely accepted public measure to improve this situation involves implementing literacy courses at German community colleges, also known as Volkshochschule (VHS). Nevertheless, evaluating initial skills is a time-consuming process, and maintaining ongoing documentation of the learning requires continuous alignment with the comprehensive competency model created during the original "lea" (literacy education for adults; own translation) project. [2] Moreover, there is a need for anonymous self-learning based on domain-specific parameters for those affected.
lea.online
The "lea.online" project (2018-2022; BMBF FKZ W143600) had the objective of enhancing the material and competency model towards vocational fields and of giving teachers an exclusive function to monitor and evaluate diagnostic findings over time. Additionally, it should provide a non-intrusive, anonymous, interactive, and gamified experience for learners. [3][4]
The project started with digitizing the competency model, which evolved into a comprehensive content management system known as the “lea.Backend”. It functions as the backbone for the user-facing software applications: “otu.lea,” a diagnostics app accessible through a browser, “lea.Dashboard,” a browser-based analytics app for teachers, and the “lea.App,” an anonymous self-learning app available for mobile devices. [5]
The lea.online project embraced a highly interdisciplinary approach to research software engineering. It comprises multiple target user groups: individuals with low literacy, teachers, and researchers, each segment defining their own requirements. The material for learning and assessment entails multiple occupational classifications spread across various subject dimensions, including reading, writing, language understanding, and mathematics. The team comprised professionals from diverse disciplines such as Educational Sciences, Mathematics, UX Research, Software Engineering, and Law, with the shared goal of crafting software that is suitable for practical application in the field.
Key Insights and Takeaways
It is important to note that developing software for field use, rather than for exclusive expert use, introduces a wide range of additional factors and prerequisites typically associated with consumer-grade software designed for the mass market.
Therefore, the primary objective of this work is to present a comprehensive collection of key insights and takeaways acquired during the four-year developmental period, as well as in the current operational phase. It covers pitfalls, methods, and recommendations for future projects. It also displays examples and related code to explicitly demonstrate functionality. Due to its interdisciplinary nature, categorizing it by topics such as conceptual, UI/UX/accessibility, technical, legal, and ethical will support its overall structure.
At the end, a critical appraisal of the constraints of this project will be included.
Additionally, a roadmap for the current research and development will be presented, which offers prospects for joint efforts and progress for personal research and development.
The Academy of Sciences and Literature Mainz (AdW) produces a number of critical editions of cultural heritage, which become increasingly multimodal over long funding periods of 12 or more years. Historical dictionaries, for example, turn to include geodata and statistics, image archives gain contextualisations and 3D reconstructions, and letter collections require annotations as well as digitised archive records. In addition, the Linked Open Data paradigm defines common formats to make (and keep) the editions' content available in, and federated search providers specify their additional APIs to support. Last but not least, digital editions should be accessible by default rather than as an exception.
To address such challenges, the AdW has tried to produce a fixed set of edition-related extensions for the content management system TYPO3 for several years now – with limited success. They were spread across multiple versions of the platform, hard to combine, and difficult to maintain as they were designed for one project and then heavily adapted for another without further documentation. In this talk I will outline the process of redesigning and rewriting this software stack.
The Cultural Heritage Framework 2 (CHF) is a toolkit for web apps that enable users to edit and publish cultural heritage data. Instead of abstract user stories, which have proven useful for the development of individual editions, the CHF was rebuilt based on media ecology theory. In this process, the software is seen as an entity in relation with various other human and non-human entities which can be grouped into media ecosystems. The product needs to be designed in a way that allows it to ‘survive’ and be seen as a good-faith actor in various such ecosystems, including (but not limited to) academic editorial teams, frequently changing and often inexperienced maintainers, mobile web browsers, content aggregators, and users with accessibility needs.
Compared to previous attempts, this analysis led to a different feature set including an adaptable and atomic user interface, embedded JSON-LD metadata, semantic main classes, standardised search functionality, an import/export mechanism, integrated user documentation, and extensive developer documentation. Through interlocking and coherent components for specific data types, projects using the software may now add functionality as they grow or change focus over time. All components feature interfaces to edit the data by reusing features of the TYPO3 platform, but also allow importing data from other systems that a project may use or exporting the data to TEI XML or triple stores.
The talk focuses on the practical application of media ecology theory in the CHF, which is not yet common in Digital Humanities software. It specifically dives into the consequences of evaluating accessibility software, its users, and entities with more limited requirements as one ecosystem instead of a single user story: accessibility needs to be observed not just on the level of content consumption but also on the levels of data analysis and production.
Research Software plays an increasing role in the context of Humanities and, specifically, Archaeology to support the analysis of the vast and ever-growing data. As more and more disciplines come together and perform advanced analyses (e.g., with ancient DNA analysis), the demand for reproducible and testable results becomes more serious. So far, most tools have been created ad-hoc to test a hypothesis, but this does not comply with modern objective research practices. Instead, well-designed, and proven tools and methods are needed that allow reproducible and well-structured results. Tools must thereby be equally accessible and FAIR as the data itself, in compliance with the standard right of access to and participation in culture.
NFDI4Objects (N4O) is a broad community dealing with material remains of human history, the FAIR and CARE principles as well as FAIR4RS. The goal is to integrate the community into so-called Community Clusters to strengthen Software as Research Data, Publication and Citation of Research Software as well as the RSE profile.
This paper presents the community participation possibilities, N4O FAIRification Tools (e.g., Alligator, AMT, SPARQL Unicorn) and examples from Computational Archaeology (e.g., R and Python scripts, AI) in the context of CAA-DE.
During the last German chapter CAA conference in Würzburg (September 2023), multiple software tools were presented and discussed, such as tools for modelling stratigraphy (implemented in Python), cluster analysis for archaeological finds (implemented in R), AI techniques for the detection of archaeological sites on satellite images, or the classification of Celtic coins. All these approaches were designed using different tools, programming methods, and methodologies. It could be noted that very few of them adhered to (Research) Software Engineering principles, making it difficult for any adopter to understand or re-use the code. What is worse, few were published or made accessible, as the results (aka the data) were deemed more important than the means for generating them. This impacts reproducibility and therefore reduces the value and credibility of the results. Most tools were developed for analysis and, therefore, have a notion of "quick hacks", which developed into more complete programs as the questions asked started to develop with the analysis. This also led to a lack of re-use of existing tools and methods.
These challenges are not new in the context of IT and are representative of any scientific code development, but awareness of their relevance for good scientific work is slowly rising. Areas new to IT are however more susceptible to these pitfalls. It is therefore even more relevant to identify these issues from the beginning and develop and teach good RSE principles with these communities; to demonstrate the relevance of said principles and to not see them as a burden but as a potential for continuation and improvement of research.
In this paper we want to highlight the challenges and approaches to engage the archaeological community in Research Software Engineering.
Often, software developed within a research group to solve specific problems
evolves into an integral tool, sometimes not only for the original group, but
also for other groups and institutions where it may be adapted. A well-known
problem, however, is to guarantee long-term stability and maintenance, as well
as the further development of the software. A lack of (permanent) funding as
well as high personnel fluctuation, among others, pose challenges.
In this talk, we will present how maintenance and development of such a research
software, namely our open-source data-management toolkit LinkAhead (formerly
known as CaosDB), could be solved by founding IndiScale, a company that provides
services around this software. We will explain which considerations led to the
decision to found a company and why LinkAhead was well-suited for founding a
supporting start-up.
We will also give an overview of the problems to be faced, as well as the
different requirements for project management and day-to-day work compared to
life as researchers and research software engineers in academia.
Certain research questions necessitate highly specialized software solutions tailored to the unique intricacies of the problem at hand. These questions often arise from the complex and nuanced nature of the research domain, demanding precise methodologies and algorithms that cater to specific requirements. In such cases, attempting to implement a generic solution might prove counterproductive and time-consuming. Nevertheless, while certain research questions demand highly specific software solutions, it is equally important to recognize instances where commonalities and synergies exist across diverse projects. In scenarios where the fundamental requirements overlap, the development of generic solutions becomes not only feasible but also advantageous. By leveraging shared expertise and identifying these common threads, research software engineers can create versatile research software products that cater to multiple research inquiries. Reinventing the wheel, in contrast, leads to an inefficient allocation of resources, as valuable time and effort are expended on solving problems that may have already been addressed by existing solutions. Striking the right balance between tailored and generic solutions not only optimizes resource utilization but also establishes a foundation for standardized practices, promoting collaboration and interoperability. This approach allows the research community to benefit from both the precision of specialized tools and the efficiency gained through the development of reusable, standardized software components. This nuanced approach acknowledges the diversity of research questions while harnessing the potential for synergies and collaborative advancements in research software engineering.
Creating research software products requires a combination of technical expertise, domain knowledge, effective collaboration with researchers, and a commitment to best practices. It is also imperative to incorporate the identification of overlapping requirements across diverse projects into the work culture of the RSE team.
The SUB software and service development team has been developing solutions for technical and methodological issues in project teams using collaborative and agile methods such as Scrum for over 10 years. In this session we will describe how we try to recognize synergies at an early stage, which methods and strategies we use to develop generic solutions in project work, what challenges we encounter, and how we manage them.
Software engineers communicate with hardware through a language known as an instruction set architecture (ISA). However, the conclusion of Dennard scaling and Moore's Law implies that this traditional ISA is being replaced by complex heterogeneous ISAs. In light of this, the expertise required to develop new compilers and create domain-specific languages will be crucial for future software engineers. Therefore, this talk will focus on addressing the challenges posed by emerging computing architectures and research related to abstracting the complex ISA. As a use-case, the talk will present an open-source tool that recently received the Best Paper Award at one of the leading computer architecture conferences.
A very popular and successful software for the numerical simulation of flows is the open-source software of the OpenFOAM Foundation, which is used both in industry and in academia. For research groups that cannot or do not wish to pursue purely in-house development, it provides an optimal basis for efficiently testing their own ideas and concepts in a transparent environment. Although the overall maintenance effort is considerably lower than for an in-house development, extensions must nevertheless be maintained to keep them compatible with the current state of the main release. The work involved becomes all the more important when extensions are provided together with scientific publications in the spirit of the FAIR principles. As an agilely developed and intensively maintained software, the software of the OpenFOAM Foundation places particular demands on downstream developers in this respect.
The Helmholtz-Zentrum Dresden – Rossendorf e.V. (HZDR) pursues an approach that is as sustainable as possible here. Completed and citable developments are either published in a dedicated software publication or, in close coordination with the core developers of the OpenFOAM Foundation, integrated into the main release. The IT infrastructure created for the work on the extension (Multiphase Code Repository by HZDR for OpenFOAM Foundation Software) features a high degree of automation and offers users inside and outside HZDR a useful platform for research on numerical methods and models.
The backbone of this work is the GitLab instance provided via the Helmholtz Cloud (Helmholtz Codebase). Two repositories are maintained there: one for the code extension and one for setups to simulate concrete applications (Multiphase Cases Repository by HZDR for OpenFOAM Foundation Software). To ensure quality and functionality, the work in the GitLab environment is accompanied by continuous integration (CI) pipelines, which, among other things, automatically perform static code checks, build tests, and test runs. For use in CI pipelines as well as for local development of the extension, the installation is provided as a container (Docker). Pure users can rely on installation via a Debian package. The citable publication of the source code accompanies each scientific publication in the Rossendorf Data Repository (RODARE). The use of the workflow management system Snakemake enables scalable validation runs. To improve the portability of the developments, recent work focuses on providing the software as an HPC container (Apptainer) for use on high-performance computers. This contribution gives an overview of the aforementioned elements of the environment and their interplay.
Manipulating and processing massive data sets is challenging. For the vast majority of research communities, the standard approach involves setting up Python pipelines to break up and analyze data in smaller chunks, an inefficient and error-prone process. The problem is exacerbated on GPUs because of the smaller available memory.
Popular solutions to distribute NumPy/SciPy computations are based on task parallelism, introducing significant runtime overhead, complicating implementation, and often limiting GPU support to specific vendors.
In this tutorial, we will show you an alternative based on data parallelism. The open-source library Heat [1] builds on PyTorch and mpi4py to simplify porting of NumPy/SciPy-based code to GPU (CUDA, ROCm, including multi-GPU, multi-node clusters). Under the hood, Heat distributes massive memory-intensive operations and algorithms via MPI communication, achieving significant speed-ups compared to task-distributed frameworks. On the surface, however, Heat implements a NumPy-like API, is largely interoperable with the Python array ecosystem, and can be employed seamlessly as a backend to accelerate existing single-CPU pipelines, as well as to develop new HPC applications from scratch.
You will get an overview of:
- Heat's basics: getting started with distributed I/O, data decomposition scheme, array operations
- Existing functionalities: multi-node linear algebra, statistics, signal processing, machine learning...
- DIY how-to: using existing Heat infrastructure to build your own multi-node, multi-GPU research software.
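As a flavour of what this looks like in practice, here is a minimal sketch of a distributed NumPy-style computation with Heat (the array sizes are invented and not part of the tutorial material):

    # A minimal sketch of Heat's NumPy-like distributed API (illustrative sizes only).
    # Run with e.g. `mpirun -n 4 python example.py`; each MPI process holds one chunk.
    import heat as ht

    # Create a large matrix distributed along its rows (split=0) across all processes
    x = ht.random.randn(10000, 1000, split=0)

    # Operations look like NumPy but act on the distributed chunks
    col_means = x.mean(axis=0)                    # reduction across the split dimension
    centered = x - col_means                      # broadcasting, as in NumPy
    row_norms = ht.sqrt(ht.sum(centered * centered, axis=1))

    print(row_norms.shape, row_norms.split)       # still a distributed array (split=0)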
We'll also touch upon Heat's implementation roadmap, and possible paths to collaboration.
[1] M. Götz et al., "HeAT – a Distributed and GPU-accelerated Tensor Framework for Data Analytics," 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 2020, pp. 276-287, doi: 10.1109/BigData50022.2020.9378050.
Versioning code is important to keep track of how it changes over time. Git is the most widely used version control system, and the two most popular platforms for hosting Git repositories are GitHub and GitLab.
This talk aims to show ways to combine the best of these two platforms:
The community and visibility of GitHub with the option for self-hosting and the additional Continuous Integration features offered by GitLab.
HPC systems are commonly, and CI systems sometimes, reachable only from within the on-site network. In these cases the code must reside locally (e.g. on self-hosted GitLab instances) if testing is to be done on those systems. Without synchronization, the code owners need to decide whether to do HPC-backed CI (via GitLab) or to include their peers (via GitHub). Synchronization enables both: it leads to higher interaction with the peer group while the software can still be tested on the systems/machines to which it will be deployed.
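A minimal sketch of how such a one-way synchronization could be realized (the repository URLs and paths are invented; the setup presented in the talk may differ), for example as a job run periodically from a machine inside the site network:

    # Mirror a public GitHub repository into a self-hosted GitLab instance (illustrative only;
    # URLs and paths are invented). Requires git, push rights on GitLab, and a fresh work directory.
    import subprocess

    GITHUB_URL = "https://github.com/example-org/example-repo.git"       # public upstream
    GITLAB_URL = "https://gitlab.example-hpc.de/group/example-repo.git"  # on-site instance

    def mirror(workdir: str = "example-repo.git") -> None:
        """Clone the GitHub repository as a bare mirror and push all refs to GitLab."""
        subprocess.run(["git", "clone", "--mirror", GITHUB_URL, workdir], check=True)
        subprocess.run(["git", "-C", workdir, "push", "--mirror", GITLAB_URL], check=True)

    if __name__ == "__main__":
        mirror()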
This is an advanced topic for the CI track.
waLBerla, an HPC software framework for multi-physics simulations based on the Lattice Boltzmann method, has consistently demonstrated exceptional parallel performance across various supercomputing platforms.
Maintaining a resilient codebase developed over a decade by multiple generations of developers is a paramount goal for waLBerla.
To achieve this, the framework has employed a continuous integration pipeline for a long time.
This pipeline ensures functional correctness and compatibility with a diverse array of compilers through systematic and automated testing.
Recognizing the importance of detecting performance regressions introduced by code changes, waLBerla has extended its continuous integration setup to include a continuous benchmarking pipeline.
Performance benchmarks run automatically on a range of CPU and GPU architectures, providing developers with swift feedback on how new commits impact the framework's performance.
In adapting to the dynamic landscape of HPC, waLBerla emphasizes versatility by testing on diverse hardware.
This proactive approach not only ensures good performance but also provides developers with quick insights into the effects of their contributions.
Adhering to FAIR principles (Findable, Accessible, Interoperable, and Reusable), waLBerla stores profiling and timing data.
Developers benefit from an interactive visualization of this data, enabling them to discern performance trends over time.
This transparent approach empowers developers to make informed decisions within a performance-driven development process.
In summary, this presentation offers an in-depth exploration of waLBerla's comprehensive development infrastructure.
The delicate balance between functional correctness and performance optimization is achieved through a meticulous Continuous Integration Pipeline, a versatile Continuous Benchmarking Setup, and transparent, FAIR-compliant profiling and timing data visualization.
The audience will gain insights into how these elements collectively cultivate a development environment where waLBerla excels across diverse HPC landscapes.
Continuous Integration (CI) is an indispensable part of modern software development. All major providers of software development platforms now offer an integrated range of resources for Continuous Integration, some of which are free to use. This offer is not sufficient for the requirements and use cases of some scientific software projects. There are scientific software projects which would like to take advantage of a large software development platform such as GitHub for the development of an open source software project and at the same time have access to the CI resources provided locally at an institution. This use case was implemented as part of HIFIS at HZDR and combines the GitHub platform with the locally available GitLab CI resources.
The presentation shows how this integration was implemented, which hurdles were overcome, and which difficulties remain. To this end, the talk goes into a practical use case: the alpaka C++ library and its specific CI implementation. The alpaka library (https://github.com/alpaka-group/alpaka) is a C++ abstraction library for accelerator development. It allows writing code once and running it on different accelerators/processor types such as CPUs, GPUs and FPGAs. It therefore supports a wide range of processor manufacturers through various compilers and SDKs. This results in a large set of supported software combinations that cannot be executed on the publicly available CI resources. With smart measures, the team makes use of the CI resources that have been made available and handles the resulting complexity within the given resource limits.
This presentation focuses on the continuous integration (CI) infrastructure used by two simulation software frameworks with a strong emphasis on performance and scalability: HyTeG, a C++ framework for large-scale high-performance finite element simulations based on geometric multigrid, and waLBerla, a massively parallel framework for multiphysics applications.
During this talk, we will detail our approach to managing Docker Containers within our CI environment. As we develop software for research purposes, we often face the challenge of testing various combinations of software packages, especially different compilers. To tackle this, we employ Python scripting and the Spack software manager to create Docker Containers. These containers are then utilized within the CI for our research software. The container creation process is automated within the CI itself, enhancing consistency and manageability.
Furthermore, we will explain how Python scripting generates the CI specifications for our GitLab CI. With approximately 150 different jobs in our Pipeline, a generated CI significantly reduces the chances of errors compared to manual creation and maintenance. Moreover, we will shine a light on some of the less conventional uses of CI, such as analyzing build times or conducting code style checks.
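To illustrate the general idea of generating CI specifications from a script instead of maintaining them by hand (the job matrix, image names and build commands below are invented and not the frameworks' actual configuration):

    # Generate a GitLab CI specification from a small compiler/build-type matrix (illustrative).
    import itertools
    import yaml  # PyYAML

    compilers = ["gcc-12", "clang-16", "icx"]
    build_types = ["Release", "Debug"]

    pipeline = {"stages": ["build-and-test"]}
    for cc, bt in itertools.product(compilers, build_types):
        pipeline[f"{cc}-{bt.lower()}"] = {
            "stage": "build-and-test",
            "image": f"registry.example.org/ci/{cc}:latest",  # container built in an earlier stage
            "script": [
                f"cmake -B build -DCMAKE_BUILD_TYPE={bt}",
                "cmake --build build -j",
                "ctest --test-dir build --output-on-failure",
            ],
        }

    with open("generated-ci.yml", "w") as fh:
        yaml.safe_dump(pipeline, fh, sort_keys=False)

GitLab can then consume such a generated file, for instance as a dynamic child pipeline.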
Our goal for this talk is to share best practices and spread ideas about more efficient CI utilization.
The US Research Software Engineer Association (US-RSE) was formed in 2018 as a grassroots effort by a handful of motivated volunteers who focused on community building. Since its inception, US-RSE has hosted dozens of community calls, virtual and in person workshops, and other events that bring together RSEs. With over 2000 members today, it has been successful in much of its mission: 1) community, 2) advocacy, 3) resources, and 4) diversity, equity, and inclusion. Many things have changed over this time, such as having three versions of the website, two financial sponsors, and now professional staff and a conference. These changes, and the five years of existence and activities, however, lead to challenges as well, such as successfully mixing paid and volunteer work, avoiding or dealing with volunteer burnout and the concomitant need for new volunteers, moving from building community to providing services to that community, and moving from describing problems to collectively solving them. This talk will discuss both the successes and challenges, and is intended to lead to comparative discussion with members and leaders of other RSE organizations, where we hopefully can learn from each other.
We are data scientists and software engineers from ETH Zurich's Scientific IT Services, an embedded RSE and HPC group.
After visiting RSE con UK for the second time, we started to build up an RSE community at ETH Zurich and to connect interested people within Switzerland. The response so far has been overwhelming and we want to give a short overview of what we have done up to now and what we are planning for the future. We also hope to connect with RSE communities outside of Switzerland for future collaborations.
When we get into science, we do so because we want to have an impact on the world, explore the boundaries of knowledge, or just answer questions. Yet, we see so much bad quality science, results that are not reproducible, and building on the shoulders of giants seems to be so far away.
One building block of research quality is bringing the right knowledge to the right people. Open Science is hard. Data analyses are hard. Research Software Engineering is hard. So how do we get these skills across? Of course, through training.
In this talk I want to present a new trainer network: the Digital Research Academy. Since research software engineering is one of our core topics, I want to share my thoughts on educating the next generation of researchers and RSEs, the way the Digital Research Academy can contribute to it, as well as encourage current RSEs to become ambassadors for digital literacy.
Software development and research are naturally driven by different incentives. While the main aim of software development should be the generation of a robust and easily maintainable product, the daily life of a researcher is dominated by the quest for insightful analysis results and publication deadlines.
We present an approach to solving those conflicting incentives through collaborative development that we have used to develop several research software products. In this approach, two different entities work together closely: The bioinformatics support unit of the Robert Koch Institute (RKI) and the applied computer science bachelor’s and master’s study programs of the HTW Berlin University of Applied Sciences.
The bioinformatics support unit provides consulting and analyses to other research units within the RKI. The bioinformaticians working there are fully qualified and experienced computer scientists with the knowledge and skills necessary for high-quality software development. Their performance however is primarily measured by the quality and speed of the analyses they perform for the internal customers. Accordingly, prior experience with in-house software development has shown that clean software development practices are hard to sustainably implement in this environment.
On the other hand, the HTW has access to a large number of computer science students with a high interest in software development in and of itself. Those students, in their diverse projects and final theses, are fully shielded from pressures resulting from the requirements of biological research projects at the RKI. As such, they are a perfect resource for developing research software that is solely focused on robustness and maintainability. However, there are two challenges to overcome when working with computer science students who want to develop research software: Firstly, realistic questions and data are hard to come by, and developing software that solves a self-posed problem using simulated data is not very satisfying. And secondly, students are only available during the short time frames defined by the duration of their projects.
Given that those challenges and opportunities show a high potential for synergy, RKI and HTW have established a cooperation for research software engineering. In short, the daily challenges and requirements concerning analysis software are discussed in semi-regular joint meetings. Work packages and research questions (such as the comparison of different algorithms) are jointly defined and prioritized. Then, those are given out to students at the HTW in the form of projects or final theses. The results are continuously combined and evaluated at the RKI, and flow back into the work package definitions. In order to create a sustainable and continuous environment in which those work packages can be integrated, a single position is co-financed by the RKI and the HTW to supervise and coordinate the entire process.
In this way, a new quality control pipeline was developed that is currently productively used at the RKI, a legacy analysis pipeline was refactored and containerized, and a novel analysis pipeline is being developed. We hope that our experiences – both positive and negative – will encourage others to enter and profit from similar synergetic relationships.
Ensuring the reproducibility of scientific software is crucial for the advancement of research and the validation of scientific findings.
However, achieving reproducibility in software-intensive scientific projects is often challenging due to dependencies, system configurations and software environments.
In this paper, we present a possible solution for these challenges by utilizing Nix and NixOS.
Nix is a package manager and functional language that mitigates these problems by guaranteeing that a package and all its dependencies can be built reproducibly, as long as a build plan exists at the desired time.
NixOS is a purely functional Linux distribution built on top of Nix that enables building reproducible systems, including configuration files, packages and their dependencies.
We present a case study on improving the reproducibility of preCICE, an open-source coupling library, and some of its main adapters using Nix and NixOS.
Using this approach, we demonstrate how to create a reproducible and self-contained environment for preCICE and highlight the benefits of using Nix and NixOS for managing software and system configurations, resulting in improved reproducibility.
In addition, we compare the usability and reproducibility provided by Nix, in the context of preCICE, with two already established high-performance computing (HPC) solutions, Spack and EasyBuild.
This evaluation enables us to assess the advantages and disadvantages of employing Nix to improve reproducibility in scientific software development within an HPC context.
The European Environment for Scientific Software Installations (EESSI, pronounced as "easy") is a collaboration between different European HPC sites and industry partners, with the common goal to set up a shared stack of optimized scientific software installations.
It can be used on a variety of systems, regardless of the operating system, processor architecture, or GPU of the client system, or whether it's a full-size HPC cluster, an HTC system, a cloud environment, or a personal workstation, while providing improved reproducibility in scientific computing.
Leveraging the tools and expertise inherent in the HPC community, EESSI integrates solutions such as EasyBuild, Lmod, CernVM-FS and Gentoo Prefix to ensure the creation of a scalable and rigorously tested software stack that is easy to deploy and maintain.
EESSI also enables users to access the same software stack on their own desktop or laptop, allowing them to run small tests and scale up to larger systems as needed, with a consistent user experience.
This presentation provides an exploration of the motivation, design, implementation, and benefits of EESSI.
Docs: http://www.eessi.io/docs
Intro: https://www.youtube.com/watch?v=Fzv4ieiI1jo
By default, languages like R and Python load packages/modules from a system-wide package environment. Thus all projects use the same package versions, which makes it hard to track which package versions were used for a specific project, and using different package versions for separate projects becomes impossible. Package updates which are required for one project may break other projects. These issues are especially problematic on large multi-user systems like an HPC cluster (high-performance compute cluster).
To solve this, virtual package environments can be used. These isolate projects from each other and from the system-wide package environment, making it possible to use different package versions in separate projects. This also makes it clear which package versions were used for a project.
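As a generic Python illustration of the underlying idea (an isolated per-project environment combined with a shared cache); this is not the HPC-specific setup described below, and the paths are invented:

    # Create an isolated per-project environment while sharing one pip cache between projects.
    import os
    import subprocess
    import venv

    SHARED_CACHE = "/shared/pip-cache"   # one download/build cache for all users and projects
    ENV_DIR = "project-env"              # per-project, isolated environment

    venv.EnvBuilder(with_pip=True, clear=True).create(ENV_DIR)

    env = dict(os.environ, PIP_CACHE_DIR=SHARED_CACHE)
    subprocess.run(
        [os.path.join(ENV_DIR, "bin", "pip"), "install", "-r", "requirements.txt"],
        check=True,
        env=env,
    )

The approach presented in the talk goes further than this sketch, for example by sharing a package cache across environments so that 3400 package versions fit into 8.3 GB of disk space.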
In this talk we present our package environment solution for developing scientific models on a large multi-user HPC. We describe how and why our approach is different from typical package environment setups, what the advantages and downsides are, and what lessons we have learned. In summary, at the cost of some additional complexity for users, our approach greatly improves reproducibility, robustness, and control over which package versions are used for each model run. Using a shared package cache we need 8.3 GB disk space for 3400 different package versions, and restoring entire package environments with over 200 packages takes a couple of seconds.
Despite the various challenges in automatic text recognition for printed (OCR) and handwritten (HTR) material, great progress has been made during the last decade. Several milestones have been reached regarding the actual text recognition step but also in layout analysis and pre- and postprocessing. Additionally, free open-source implementations of related tools and algorithms are released constantly. While these allow tackling highly heterogeneous use-cases ranging from mass full-text digitisation in libraries to the processing of individual documents, including specific production of training data (Ground Truth, GT) and subsequent training of deep learning OCR/HTR models, they rarely offer standardized interfaces and a low barrier of entry for non-technical users.
To solve the issue of easy-to-use, flexible, connectable, and sustainable combination and application of current and future individual technical OCR/HTR solutions we introduce the open-source tool OCR4all, which in turn leverages the open-source OCR/HTR framework OCR-D. In the following we discuss how users can benefit from the powerful combination of the two.
Whereas OCR-D focuses on the standardized implementation of single-step processors, OCR4all aims at enabling any user – even those without a technical background – to perform OCR/HTR completely on their own and in great quality, while also offering tools to manually generate training data in order to train more performant work-specific models and consequently improve the output of the fully automated OCR/HTR processors.
To combine both approaches we engineered interfaces between OCR-D tools and OCR4all based on the OCR-D specifications. For a very flexible integration of different OCR and especially OCR-D processors, OCR4all relies on the Java Service Provider Interface. OCR-D processors, which are written in Python, are plugged into OCR4all as containerized service providers that implement the required interfaces, either through a simple manual configuration or fully automatically. The latter is achieved by leveraging the OCR-D-Tool-JSON which contains all necessary information about input/output relationships and available parameters and is mandatory for all OCR-D processors.
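A sketch of what such automatic discovery looks like conceptually, reading an ocrd-tool.json to learn a processor's executable, file groups and parameters (shown here in Python for brevity; OCR4all's actual integration is implemented in Java via the Service Provider Interface):

    # List the information needed to plug in an OCR-D processor, taken from its ocrd-tool.json
    # (conceptual sketch only).
    import json

    with open("ocrd-tool.json") as fh:
        tool_json = json.load(fh)

    for name, tool in tool_json.get("tools", {}).items():
        print(f"processor:     {name}")
        print(f"  executable:  {tool.get('executable', name)}")
        print(f"  input grps:  {tool.get('input_file_grp', [])}")
        print(f"  output grps: {tool.get('output_file_grp', [])}")
        for pname, spec in tool.get("parameters", {}).items():
            print(f"  parameter {pname}: type={spec.get('type')}, "
                  f"default={spec.get('default', '<required>')}")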
Due to the adaptations described above and the resulting flexibility, OCR4all can now be applied to an even more heterogeneous selection of materials and use cases. Its main focus still lies on the interactive high-quality processing of challenging early printings and manuscripts by non-technical users, including correction and GT production and consequently material-specific training. However, the applicability of OCR4all for mass digitization, e.g. in libraries and archives, has vastly improved.
Science communication is a crucial topic for both researchers and public discussion. On the one side, researchers need to develop communication skills to make their research accessible to an audience outside their field. On the other side, scientific opinion often consolidates arguments and helps audiences understand the current state of knowledge on specific discussions. Successful communication about research is a combination of 'appropriate skills, media, activities, and dialog' [1], which researchers need to strengthen. As part of a dashboard supporting third-mission activities [2], the aspect of choosing the right medium is supported by WiKoDa (a German acronym for WissenschaftsKommunikationsDashboard, translating to 'Science Communication Dashboard').
WiKoDa is designed to simplify the complexity of the media landscape and to display key information for researchers, tailored to their interests. In the frontend [3], various data visualization panels summarize the media interest in a particular research topic over time, show which media engage with which research areas, and indicate whether the topic is more prevalent in regional or national media. The backend system [4] is separated from the frontend by design, allowing data processing while the frontend manages user interactions. Backend tasks include extracting data from articles, integrating data into a database, and providing an API with aggregated data. Currently, the project is in a proof-of-concept phase at the University of Potsdam. WiKoDa utilizes articles from a university-tailored press review. Future expansions could extend its utility to other institutions.
The evaluation required a simulation environment because the limited data integration may leave certain scientific fields with no data to evaluate. This could introduce a negative bias, as users might not be able to assess functionality without data. To address this, an article creation algorithm was developed that simulates various atmospheres (rising, falling, or consistent media interest) for each research topic by generating articles and placing them in a medium. This Wizard-of-Oz simulation lets users believe they are experiencing a real scenario, enabling them to evaluate the dashboard based on its visualizations and functionality. The evaluation indicated that participants find the dashboard useful, but some aspects of its functionality and transparency were criticized.
While research data management systems (RDMSs) provide many benefits for scientists, data integration is still one of the major bottlenecks for the adoption of an RDMS. Especially the omnipresent dependency on file-based digital workflows and the strong heterogeneity of file and data layouts pose important challenges. We have developed a crawler-based concept [1] that allows us to combine file-based digital workflows with RDMS software in a way that they can be used simultaneously. Furthermore, the concept includes a flexible configuration of data integration procedures in a YAML-based format that facilitates its application to different use cases. We demonstrate how to apply these concepts practically using the LinkAhead crawler framework (CaosDB was recently renamed to LinkAhead). The software is published as Open Source software under AGPLv3 and can be accessed online (https://gitlab.com/linkahead/linkahead-crawler).
[1] Tom Wörden, H.; Spreckelsen, F.; Luther, S.; Parlitz, U.; Schlemmer, A. Mapping hierarchical file structures to semantic data models for efficient data integration into research data management systems. Preprints 2023, 2023081170. https://doi.org/10.20944/preprints202308.1170.v1
Interoperability is key. Libraries, archives, and other repositories providing data to the public utilize a variety of metadata standards and interface specifications. TextAPI is a newly developed API specification (so far mainly used for digital scholarly editions) that provides metadata about textual resources while TIDO (Text vIewer for Digital Objects) is an application to display them. Our contribution will give a short introduction about this couple.
Providing (meta)data for digital editions via API makes applications using this data agnostic to the back end technology. This can significantly contribute to the sustainability of the project and the resulting architecture. TextAPI does not require any database at all, since a static copy of the resources provided by a webserver is sufficient for most features.
By referencing Web Annotations, TextAPI resources can be extended to any scale. Currently we use Web Annotations to reference entities, highlight editorial decisions and other annotations.
The API is organized in a modular and extensible way to extend the core specifications with additional data for specific use cases.
The specification conforms to JSON-LD format to become part of the semantic web.
TIDO is a web application based on VueJS that operates in tandem with the TextAPI. It provides views of the data organized in panels and tabs. These views contain image, text, metadata, collection objects and annotations. It is designed to be a general-purpose front-end application to display TextAPI resources. Project implementers can rely on highly dynamic configuration options in order to assemble the desired presentation of their data. It is possible to define a strict panel setup (order and number of panels and their content) or to rely on TIDO's default settings to achieve quick results. The viewer enables researchers to interact with the text by displaying various types of annotations, for example editorial comments or references to external registers. In addition, users can share the current state of the application via a citable URL. The TIDO bundle provides a lightweight solution that can be self-hosted on any web server architecture.
The architecture is not feature-complete yet. Our requirements, mainly derived from projects creating digital editions (e.g. transcriptions), include the display of multiple sources at once for comparing and analyzing multiple texts. Also, the interactions between text, annotations and corresponding images have to be improved to complete our construction kit for digital editions.
When applied to many resources, TextAPI will allow for a central search index across all these resources and thus, again, helps to scale vertically.
NOTE: 15:20-17:10
Did you inherit a huge C or C++ research code?
Are you supposed to make it faster (say you're in HPC)?
Are you supposed to introduce new features (say you're in physics)?
Or perhaps you want to rejuvenate this code?
You estimate the code to be too large for that to be done properly or cleanly?
This hands-on tutorial introduces concepts of the Coccinelle system for large-scale code analysis and restructuring.
In these two hours, we will teach you what you need to start working with Coccinelle.
What you need is a Linux installation or VM with a working version of Coccinelle.
Basic Unix usage and C/C++ knowledge required.
See https://github.com/coccinelle/coccinelle
When developing research software, it is often relevant to track its performance over time. It is even vital when targeting high-performance computing (HPC). Changes to the software itself, the used toolchains, or the system setup should not compromise how fast users obtain their results. Ideally, performance or scalability should only ever increase. Hence benchmarking should be an integral part of testing, in particular for HPC codes. At the same time, up-to-date benchmarks that are publicly available can advertise the code and inform users how to set up the software in the most suitable way and whether they are achieving the expected performance.
To limit the burden on developers, the aforementioned steps should be automated within continuous integration (CI) practices, introducing continuous benchmarking (CB) to it. For HPC, an added complexity is the requirement of more than the usual CI backends, with access to longer running steps and more resources than available on a single node. Reusing test cases that are easily run by hand is another simplification for developers that may not be familiar with the research field. We show our solution to CB that we use at the Juelich Supercomputing Centre (JSC), where we combine the already implemented benchmarking via the Juelich Benchmarking Environment (JUBE) with properly authenticated CI steps running on the supercomputing systems at JSC. The combined results, including the evolution over time, are then further processed and displayed on pages published via CI.
Continuous Integration (CI) is an invaluable asset and an important aspect of any (research) software development and research software lifecycle.
However, CI is often only considered as a side aspect when giving talks about software projects. To remedy this, we propose a session targeting only the topic of CI and including the possibility for shorter talks, to lower the entry barrier. In addition, as CI can also be used for many other possible use-cases and scenarios beyond its application in (research) software development, we also want to offer space for these topics as part of the session.
As the format for this session, we intend to offer the possibility for short lightning talks from a wide range of participants on topics revolving around CI. These talks can cover notable and noteworthy usage scenarios and examples for CI and also encourage ideas and experiences for scenarios using CI outside of software development, e.g. for papers, websites, teaching and similar.
Among the organizers, we have already identified possible topics for this session. However, before defining the final agenda, we intend to open the session to anyone in the RSE community interested in contributing and presenting their favorite topic. To this end, we will circulate a pad and an open call for lightning talks via, e.g., the de-rse mailing list to collect interested parties and topics based on which we will compose the agenda for the session.
What does it take to develop and maintain a research code for the stochastic simulation of the physics of the strong interaction in Lattice Quantum Chromodynamics (LQCD)? How to make it run on the fastest supercomputers in the world? How many people are involved and what do they contribute when and how? What kind of development and interaction structures are useful? Which kinds of challenges need to be overcome?
In this contribution we present the ongoing and organically cooperative effort between LQCD groups in Bonn and Cyprus, the development team of the QUDA LQCD library and staff at Juelich Supercomputing Center (JSC) in enabling the tmLQCD software framework, the workhorse of the Extended Twisted Mass Collaboration, to successfully run on supercomputers with accelerators by NVIDIA and AMD today as well as machines by these and other vendors in the future.
LQCD has historically been a trailblazer discipline in its adoption of new supercomputing architectures. Some groups have even actively contributed to the development of new machines such as the BlueGene line of supercomputers. At some sites, LQCD practitioners use very large fractions of the total available computing time, such that efficiently implementing the underlying algorithms has a significant impact not only on the possible science output but also on the associated energy consumption.
Many software frameworks for LQCD have been developed by small collaborations of “user-developers” who were able to efficiently target the relatively slowly changing architectures of the time with comparatively simple algorithms, mostly just making use of MPI for inter-process communication and some level of hardware specialisation via compiler intrinsics or even inline assembly to target particular hardware architectures. The rapid proliferation of many-core and accelerated supercomputing systems by multiple vendors and the increased complexity of state-of-the-art algorithms have changed this irrevocably.
In order for LQCD codes to target current and future supercomputers, close interaction between hardware vendors, supercomputing centers as well as library and application developers is mandatory. In addition to addressing performance on individual architectures, questions about correctness testing, provenance tracking, performance-portability, maintainability and programmer productivity arise. Finally, the complexity of these new architectures leads to interesting failure modes which may be difficult or impossible for the developers to diagnose on their own.
We focus on collaborative aspects such as the open development practices of both QUDA and tmLQCD, the interaction between the QUDA development team and the LQCD community, the early access programme organised by JSC for the Juwels Booster supercomputer and the effort required over many months for the diagnosis of a particularly vexing issue with node failures on that machine. In this process we analyse which structural and interactional factors we believe have enabled us to successfully tackle these challenges at different stages and attempt to use our example to characterise what it takes to develop software for LQCD research today.
Simulations based on particle methods, such as Smoothed Particle Hydrodynamics (SPH), are known to be computationally demanding, consisting of numerous interactions between locally defined neighbors. Compared to other numerical methods, SPH is mesh-free, meaning that computations are not constrained to a fixed grid: particles, acting as interpolation nodes, are instead free to move across the entire domain, leading to additional challenges when dealing with parallel computations.
While such methods have for long been executed in parallel on multi-core CPUs, in recent years the increasing adoption of many-core accelerators, such as GPUs, has opened up the field of parallel computing to new possibilities.
However, parallel models and techniques do often differ between multi-core and many-core systems, requiring particular attention in coordinating the execution of threads and memory operations for the latter.
Moreover, hardware fragmentation and vendor-specific programming interfaces still characterize the market. Hence, support for various hardware configurations may easily lead to non-trivial and less maintainable implementations.
To abstract over those differences, some higher-level specifications have become available recently, such as the SYCL programming standard, which provides an interface for compiling ISO C++ code for various back-ends, including GPU APIs.
The following work highlights the initial effort in adopting the SYCL standard for the execution of SPHinXsys, an open-source multi-physics library. The result is an execution model able to run the same implementation on variable (heterogeneous) hardware, with considerable speed-up compared to the current multi-core CPU parallelization.
The discussion will primarily focus on the difference between multi-core and many-core parallelization, describing how the existing parallel methods have been adapted to be executable with SYCL. Among others, representation of data-structures for parallel access, communication strategies, and parallel methods for data sorting will be topics discussed in depth. Minimizing the effort for the user to adopt this new execution model has also been taken into consideration, reducing the changes required to port an existing simulation. Execution details are designed to be transparent to the library user, not requiring particular knowledge of the underlying execution model. Finally, benchmarks will be presented, showcasing performance comparisons between the current multi-core CPU implementation and the newly introduced SYCL parallelization with a GPU back-end.
Despite the immense computational power offered by HPC clusters (and the resources governments pour into obtaining them), HPC remains uncharted territory for many researchers. The primary deterrents include the perceived bureaucratic hurdles associated with accessing HPC resources, the opacity of usage procedures, and a notable lack of accessible support systems.
The intricate process of gaining access to HPC systems, coupled with the convoluted nature of usage procedures, has led to a situation where life science researchers seek alternatives. Institutions, responding to this demand, resort to establishing additional localized infrastructures, resulting in a fragmented, non-standardized compute ecosystem. The consequence is a landscape marked by redundancy, inefficiency, and a lack of standardized practices, all of which hinder the realization of the true potential of HPC in life science research.
To bridge this gap, a pivotal step involves the adoption of contemporary workflow systems that seamlessly integrate with HPC batch systems and support remote file management for research data management. This talk tells the story of incorporating native batch system support into the Snakemake workflow management system, an advancement by which life science researchers can overcome traditional obstacles and tap into the full capabilities of distributed cluster computing. Highlighted are success stories from massive pharmaceutical ligand screens and genome-oriented research.
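As a minimal, hypothetical sketch of what this looks like from the researcher's side: a workflow is written once in Snakemake's Python-based rule language and, with a batch-system executor installed, submitted to the cluster unchanged (the rule names, resources and commands below are invented):

    # Snakefile (Snakemake's Python-based workflow DSL); illustrative example only.
    # With a batch executor plugin installed, the same workflow can be submitted to a cluster,
    # e.g.: snakemake --executor slurm --jobs 100
    SAMPLES = ["patient1", "patient2"]

    rule all:
        input:
            expand("results/{sample}.counts.tsv", sample=SAMPLES)

    rule count_reads:
        input:
            "data/{sample}.fastq.gz"
        output:
            "results/{sample}.counts.tsv"
        resources:
            mem_mb=4000,
            runtime=60   # minutes; translated into batch-system limits by the executor
        shell:
            "zcat {input} | wc -l > {output}"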
As the talk explores these advancements, we extend an invitation to engage with the Snakemake developer community. Collaboratively, we can contribute to making bioinformatic workflow solutions more accessible and aligned with Open Science goals. Join us in this practical journey towards a more efficient and collaborative future in life science research.
When it comes to enhancing the exploitation of massive data, machine learning methods are at the forefront of researchers’ awareness. Much less so is the need for, and the complexity of, applying these techniques efficiently across large-scale, memory-distributed data volumes. In fact, these aspects, typical for the handling of massive data sets, pose major challenges to the vast majority of research communities, in particular to those without a background in high-performance computing. Often, the standard approach involves breaking up and analyzing data in smaller chunks; this can be inefficient and prone to errors, and sometimes it is not appropriate at all because the context of the overall data set can get lost.
The Helmholtz Analytics Toolkit (Heat) library offers a solution to this problem by providing memory-distributed and hardware-accelerated array manipulation, data analytics, and machine learning algorithms in Python. The main objective is to make memory-intensive data analysis possible across various fields of research, in particular for domain scientists who are not experts in traditional high-performance computing but nevertheless need to tackle data analytics problems going beyond the capabilities of a single workstation. The development of this interdisciplinary, general-purpose, and open-source scientific Python library started in 2018 and is based on a collaboration of three institutions (German Aerospace Center DLR, Forschungszentrum Jülich FZJ, Karlsruhe Institute of Technology KIT) of the Helmholtz Association. The pillars of its development are...
In this talk we will give an overview of the current state of our work. Moreover, focusing on the research software engineering perspective, we will particularly address Heat's role in the existing ecosystem of distributed computing in Python as well as technical and operational challenges in its further development.
Research software is one of the most important tools of modern science, and the development of research software is often a prerequisite for cutting-edge research. With the advancing digitalization of research and teaching, the number of software solutions emerging at scientific institutions with the purpose of gaining scientific knowledge is ever increasing. For the reproducibility of scientific results, referencing and making accessible the respectively employed or developed software is essential. In many cases, the provision of corresponding software is of great importance for the reproducibility of data analyses as well as for the re-use of the research data in question. RSE is therefore an integral part of the research process and is addressed as such in research policy at Helmholtz, such as the Helmholtz Open Science Policy and individual policies at the Helmholtz Centers. For large organizations, coordinating a great number of activities and integrating a wide variety of stakeholders is a challenge. Furthering the communication among these stakeholders is at the core of the mission of the Helmholtz Open Science Office. This talk will provide an insight into how this is approached at Helmholtz, giving an overview of activities at Helmholtz that support and recognize the production and maintenance of research software, such as the Helmholtz Task Group Research Software, Helmholtz Federated IT Services (HIFIS), the Helmholtz Research Software Directory, the Helmholtz Incubator Software Award, the Helmholtz Platform for Research Software Engineering - Preparatory Study (HIRSE_PS), licensing workflows, the Helmholtz Open Science Policy, Helmholtz engagement for research reproducibility, and activities linking all RSE stakeholders at Helmholtz as well as engaging with the RSE community on a national and international level.
The Helmholtz Association is a pioneer in the establishment of research software guidelines and policies in the German research landscape. The roots go back to one of the first German RSE-focused workshops, which took place in Dresden in 2016. Since then, the field of RSE has been successively expanded at various levels through the provision of training and support services, technical platforms and, last but not least, the development of guidelines and policies. In the context of research data management, a similar process has been driven with a strong focus on research data policies and data management plans. Guidelines for software development are just as important in modern research, but have hardly been established to date.
The talk is a progress report on the activities and results that have been achieved in the Helmholtz Centers in recent years. We present concrete examples with facts, statistics and user experience reports. In addition, we also share our experiences on how to actively stimulate this process, for example through awards and visible indicators. The policy implementation at Helmholtz is ongoing and is actively supported in regular Helmholtz-wide research software forums organized by the Helmholtz Incubator Platform HIFIS and the Task Group Research Software of the Helmholtz Open Science Working Group.
Communication and teamwork are important parts of modern research software engineering in a multi-disciplinary field.
Our work builds on a well-established foundation of trusted tools from industry and open-source communities, such as Git, continuous integration, test frameworks, issue trackers and Kanban boards, to shape and support our workflows. While research software engineering frequently means transferring knowledge, be it with our peers, our students, or our clients, we are also challenged in unique ways to communicate and educate.
The talk gives an overview of how we at the DLR Institute of Networked Energy Systems, in the Department of Energy Systems Analysis, have adapted methods from a variety of fields, including our experiences.
The talk ends with an overview over how each of the mentioned pieces work together to reduce the complexity in and around our work and to free resources for writing code and contributing to science.
Article 27 of the Universal Declaration of Human Rights establishes everyone's right to participate and benefit from the advancement of science. At the same time, it articulates the right to protection of authors' moral and material interests resulting from their scientific activity. Copyright laws grant exclusive rights over novel creative work to its authors who can then decide on the conditions of use, distribution and adaptation of their work. It is up to them to reconcile the apparent conflict between own interests and those of their communities in the terms of the license.
The principles of Open Science require the publication of data in a widely accessible way, without requiring the authors to allow others to adapt or modify the work (for example under the CC-BY-ND license). When we apply those principles to research software, it is also necessary to consider specific characteristics of software. As the function of software is to instruct our devices and machines to work for us, it can be viewed as a digital tool that acts in the way it was programmed. It is in the best interest of the users to fully control the software they depend on. This can only be achieved if the license grants the user the freedom to run, copy, distribute, study, change and improve the software, which was formalized as the four essential freedoms by the Free Software Foundation. The Open Source Initiative later rephrased these conditions as ten criteria defining open source. In contrast to Open Science, the permission for anyone to modify the program, without discrimination against any group or activity, is granted by all licenses approved for free and open source software (FOSS).
Another characteristic of software is the functional difference between source code on one hand and obfuscated or machine code on the other. A software developer requires the source code to effectively modify the program, while the obfuscated or machine code is only useful for executing the instructions specified in the software on a computer. To ensure that the program and all its derivatives remain FOSS in all future versions, a concept called copyleft was invented. Copyleft is a legal requirement that demands the release of the source code if the program is distributed. This way it safeguards the users against malicious actors and allows continuous collaboration of all contributors.
In this presentation we will give an overview of a few common FOSS licenses (such as MIT, Apache, LGPL, GPL, and AGPL) and analyze the main differences between terms and conditions in each of them. We will also discuss the possibilities of licensing research software that strike the right balance between everyone's essential rights and freedoms.
There are no RSEs at Intel - or are there? Despite being a hardware manufacturer, Intel has one of the biggest software workforces in the industry. Apart from pure development, many engineers also perform an interface role between science and software that might well be characterized as research software engineering. We provide an overview of the work and impact of RSEs within Intel and in the broader ecosystem, and discuss the skill sets required and careers that can be pursued.
Context:
Program slicing [1] is an important technique to assist program comprehension. A program slicer identifies the parts of a program that are relevant to a given variable reference (i.e., all statements that can influence the resulting value). The resulting slice can then help R programmers and researchers to understand the program by reducing the amount of code to be considered. Furthermore, the slice can help speed up subsequent analyses, by reducing the amount of code to be analyzed.
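To illustrate the idea with a toy example (shown in Python for brevity; flowR itself operates on R sources): slicing for the variable z at the final statement keeps only the statements that can influence its value.

    x = 1
    y = 2          # irrelevant to z: not part of the slice
    z = x + 1      # depends on x
    print(z)       # slicing criterion: the value of z here
    # The slice for z at the last line is: x = 1; z = x + 1; print(z)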
Objective:
We present flowR, a novel program slicer and dataflow analyzer for the R programming language.
Given R code and a variable of interest, flowR can return the resulting slice as a subset of the program or highlight the relevant parts directly in the input.
Currently, flowR provides a read-eval-print loop, a server connection, and a rudimentary integration into the R language extension for Visual Studio Code, which allows users to interactively generate and investigate slices. flowR is available as a Docker image.
We focus on the R programming language because even though it has a huge, active community, especially in the area of statistical analysis (ranking 7th on the PYPL and 17th on the TIOBE index), the set of existing tools to support R users is relatively small. To our knowledge, there is no preexisting program slicer for R.
Besides the RStudio IDE and the R language server, the {lintr} package and the {CodeDepends} package perform static analysis on R code.
However, all these tools rely on simple heuristics like XPath expressions on the abstract syntax tree (AST), causing their results to be imprecise, limited, and sometimes wrong.
Therefore, we consider flowR's dataflow analysis to be a valuable contribution to these tools by improving their accuracy.
Method:
flowR uses a five-step pipeline architecture, starting with the parser of the R interpreter to build an AST of the program. After normalizing the AST, the dataflow extraction works as a stateful fold over the AST, incrementally constructing the graph of each subtree. The calculation of the program slice reduces to a reachability traversal of the dataflow graph which contains the uses and definitions of all variables. Finally, the slice is either reconstructed as R code or highlighted in the input.
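A minimal sketch of that final slicing step as graph reachability (illustrative Python over a hand-written dataflow graph for the toy example above; flowR's actual implementation is more involved):

    # Backward slice = everything reachable from the slicing criterion along reversed dataflow edges.
    from collections import deque

    # Simplified dataflow graph: statement -> definitions it depends on.
    depends_on = {
        "print(z)": ["z = x + 1"],
        "z = x + 1": ["x = 1"],
        "y = 2": [],
        "x = 1": [],
    }

    def backward_slice(criterion):
        """Collect all statements that can influence the slicing criterion."""
        seen, queue = {criterion}, deque([criterion])
        while queue:
            for dep in depends_on.get(queue.popleft(), []):
                if dep not in seen:
                    seen.add(dep)
                    queue.append(dep)
        return seen

    print(backward_slice("print(z)"))   # {'print(z)', 'z = x + 1', 'x = 1'}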
Limitations:
Currently, flowR neither handles R's reflective capabilities (e.g., modifying arguments, bodies, and environments at run-time), nor its run-time evaluation with eval. Moreover, flowR does not perform pointer analysis, which causes vectors and other objects to be treated as atomic (i.e., flowR does not differentiate access to individual elements).
Results:
Using real-world R code (written by social scientists and R package authors) shows that flowR can calculate the dataflow graph (and the respective slice for a given variable reference) in around 500ms. Moreover, averaging over every possible slice in the dataset, we achieve an average line reduction of 86.17%[±9%] (i.e., when slicing for a variable, the relevant program parts are only around 14% of the original lines).
[1] Weiser. Program Slices. Michigan, 1979.
The BERD@NFDI consortium aims to establish a research data management platform for economics within the German National Research Data Infrastructure (NFDI). This platform will host diverse resources such as research data from areas like marketing, machine learning models, and company reports, involving various partner institutions and user communities. This results in an agile, user-driven requirements engineering process.
In this talk, we will focus on our software engineering approach in the context of this process and share our experiences in the following areas: First, the infrastructure is managed in a cloud solution for which we have built a multi-tier technology stack to best support our continuous development and deployment approach; second, for research data management we use the repository software InvenioRDM, which is built on the open-source Invenio framework.
After a short introduction of our cloud-based technology stack, our agile software development process and InvenioRDM, the talk will explore in more depth our adaptation and extension of this framework reflecting the specific needs of the BERD user community. Of particular importance is InvenioRDM's flexibility to support the development of domain-specific user interfaces and custom metadata models aligned with the FAIR principles. Its modular and domain-driven architecture, characterized by distinct layers encompassing data access, service, and presentation, renders the code structure easily comprehensible. This layered architecture also facilitates the construction of custom modules, allowing for seamless extension according to the specific functionalities desired by our user community, for instance for adding new types of research data, implementing fine-grained search capabilities, and enabling quality checks for the data presented in the platform. We discuss the advantages and drawbacks of this flexibility, including code complexity and technical debt, and how we address these issues through quality assurance measures like comprehensive testing and GitLab-based deployment.
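To give a flavour of this extensibility, here is a sketch following InvenioRDM's documented custom-fields mechanism (the namespace, field names and labels are invented and do not reflect BERD's actual metadata model):

    # invenio.cfg -- declaring domain-specific metadata fields and their UI widgets
    # (illustrative only; names and the namespace URL are invented).
    from invenio_records_resources.services.custom_fields import KeywordCF, TextCF

    RDM_NAMESPACES = {"berd": "https://example.org/berd/terms/"}

    RDM_CUSTOM_FIELDS = [
        TextCF(name="berd:company_name"),
        KeywordCF(name="berd:data_type"),
    ]

    RDM_CUSTOM_FIELDS_UI = [
        {
            "section": "BERD-specific information",
            "fields": [
                {"field": "berd:company_name", "ui_widget": "Input",
                 "props": {"label": "Company name"}},
                {"field": "berd:data_type", "ui_widget": "Input",
                 "props": {"label": "Data type"}},
            ],
        }
    ]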
Chemotion ELN is a widely used Electronic Lab Notebook that promotes FAIR research. It does so by providing a comprehensive platform that helps a researcher at all stages - from planning to publishing, while automatically managing their research data for them. Funded by NFDI4Chem, the ELN's use has been growing steadily - an aspect of the project that has brought its own challenges.
In this behind-the-scenes talk, I will discuss our learning curve as we grew from being an in-house solution to a national one and now head towards providing the ELN as a service. Some aspects that I will cover include moving away from a monolithic architecture, solving on-site deployment challenges, experience with hiring external developers, the cost of providing 'free' support, striving for funding, and our ongoing attempts at improving the overall SysAdmin and end-user experience. The aim of the talk is to highlight problems in RS development (with a focus on Germany), even when enough funding and well-intentioned support is available.
Much software engineering (SE) research results in research software artifacts (RSAs), which can be contributions themselves or means to obtain other contributions. However, as research evolves and researchers move on, RSAs tend to be abandoned and become unusable due to a lack of maintenance. As a consequence, other researchers require a lot of effort to re-implement the RSA based on the descriptions in the corresponding publication. Attempts to enhance transparency and artifact availability distilled best practices for sharing artifacts, such as artifact evaluations, badges, and persistent repositories. Still, these attempts are not perfect and RSAs suffer from problems like a lack of documentation, unclear system requirements, or outdated dependencies.
We want to conduct a hackathon as an enjoyable experience for interested participants to see whether, which, and how many artifacts they can reuse from previous publications. Specifically, we will collect a sample of publications (approx. 30) from recent major SE venues, such as ASE, ICSE, and FSE, in which the authors shared RSAs. We will select these so that technical requirements and re-use efforts allow an in-principle reuse within three hours on personal computers. Moreover, we will select RSAs with different badges (e.g., validated vs. non-validated) and provide RSAs targeting different SE domains (e.g., testing, static analysis, refactoring, repair) to motivate many participants. The hackathon is accompanied by an anonymous online survey comprising a minimal set of demographic and background questions as well as the RSA's ReadMe together with a reflection section on its reuse. Single participants or groups (depending on the number of participants) are asked to reproduce the steps in the ReadMe file and replicate the corresponding results. Within the survey, they should document their process and problems by checking whether they can reproduce the single steps in the ReadMe, shortly describing additional steps to achieve reproduction, or issues hindering them. Moreover, participants can update an RSA themselves to make it work (optional), which we track by using version control systems (e.g., git). This way, we are collecting steps and best practices for making artifacts (re-)usable as well as identifying RSAs' properties that facilitate or challenge their reuse. We will summarize and synthesize the findings from the surveys and discuss the results with the participants in a separate session. If enough participants are interested in the hackathon, we may be able to publish the findings to guide future RSA sharing. Independently of this case, we will share the synthesized results with the participants; at least during the discussion session and via a report (ideally a published paper) distributed afterwards. If we obtain improved documentation or implementations, we plan to share these with the original RSA authors to consider for updating their repositories. Overall, we hope that this design makes the hackathon an enjoyable experience, with the potential to contribute to new insights as well as improvements to existing RSAs. Since this hackathon is open to practitioners, we hope to get insights on closing the gap between research artifacts and their practical usage.
AgriPoliS (Agricultural Policy Simulator) is an agent-based model for simulating the development of agricultural regions, focusing on structural change under economic, ecological, and societal factors. The farms in the region are modeled as agents in AgriPoliS, which interact with each other through different markets, most importantly the land market. Every year the agents make their decisions about bidding for new land plots, stable and machine investments, and production processes through mixed-integer programming (MIP) optimizations to maximize their profit; the focus is only on the current year, with no regard for the future implications of these decisions. In this work, we enhance the agents with Deep Reinforcement Learning, giving them the ability to make strategic instead of myopic decisions in order to maximize their long-term profits within the simulation period through strategic bidding behavior in the land market.
As a first step, only one agent is enhanced with Reinforcement Learning while the other agents keep the standard behavior. As we are interested in the effects of the agent's strategic bidding behavior, we formulate bidding prices as the action space. The state space consists of the state variables of the agents and the region under investigation, which include liquidity, current stables and machines, the distribution of remaining contract durations for rented land, rent prices, the spatial distribution of free land plots in the region, and the distribution of competing agents in the neighborhood. The agent's equity capital serves as the reward returned by the environment for taking an action. Using the state variables, the agent chooses the action (bidding price) to present to the land market. Based on the success or failure of its bid, the agent proceeds to make investment and production decisions and obtains the results of these decisions. This continues until the end of the simulation period, when the cumulative equity capital is assessed.
The learning framework consists of two parts: AgriPoliS and the learning algorithm. AgriPoliS, which is implemented in C++, functions as the environment. It takes the action from the learning algorithm, which is implemented in Python, and delivers the states and rewards back to it. The communication between the two parts is realized through the message queue system ZeroMQ.
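For illustration, the Python side of such a bridge could look roughly like the minimal sketch below, using pyzmq; the port number and the message fields (action, state, reward, done) are assumptions for this example rather than the actual AgriPoliS protocol.

```python
import zmq  # pyzmq

# Connect to the C++ simulation, which acts as the environment and is
# assumed (for this sketch) to listen on a REP socket at port 5555.
context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect("tcp://localhost:5555")

def env_step(action: float) -> tuple[list[float], float, bool]:
    """Send a bidding price to the simulation and receive the next state,
    the reward (change in equity capital), and a done flag."""
    socket.send_json({"action": action})
    reply = socket.recv_json()
    return reply["state"], reply["reward"], reply["done"]

# Example episode loop driven by a policy (e.g., the DDPG actor);
# `policy` and `initial_state` are placeholders for the learning side.
# state, done = initial_state, False
# while not done:
#     state, reward, done = env_step(policy(state))
```

On the C++ side, a matching REP socket would receive the bid, advance the simulation by one yearly decision round, and reply with the new state and reward.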
Since the action space is continuous, we implemented the DDPG (Deep Deterministic Policy Gradient) algorithm with PyTorch. After tuning the learning hyperparameters, the enhanced agent could learn a stable strategy, which varies the relative bidding prices and maximizes the cumulative rewards. The first results are promising: changing the agent's bidding behavior affects not only its own equity capital but also the equity capital of other farms. Further algorithms such as TD3 (Twin Delayed DDPG) and SAC (Soft Actor-Critic) are in progress, and we are also interested in resolving the learning stability issues of these algorithms.
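For readers unfamiliar with DDPG, the sketch below shows the shape of the actor and critic networks in PyTorch; the layer sizes and the scaling of the actor output to a bidding range are illustrative assumptions, not the tuned configuration used in this work.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a state vector to a continuous action (a bidding price)."""
    def __init__(self, state_dim: int, max_bid: float):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Tanh(),       # output in [-1, 1]
        )
        self.max_bid = max_bid

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Scale the tanh output from [-1, 1] to [0, max_bid].
        return self.max_bid * (self.net(state) + 1.0) / 2.0

class Critic(nn.Module):
    """Estimates Q(s, a) for a state-action pair."""
    def __init__(self, state_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))
```

DDPG additionally maintains target copies of both networks and a replay buffer; TD3 and SAC differ mainly in how the critic targets and the policy update are computed.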
As scientists, we frequently encounter the challenge of conveying complex research topics to a general audience. A prime example of this complexity is the field of artificial intelligence, which is currently undergoing unprecedented advancements - and many of these advances are hard to grasp for those not immersed in the community. When combined with other state-of-the-art developments, such as in the field of extreme edge computing, how can research still be made accessible to a wider public?
Our project steps into this space. We developed a novel approach that allows AI algorithms to be distributed in a self-regulated manner across multiple devices within a wireless ad-hoc network. The network remains stable against connectivity fluctuations because, owing to its intelligent architecture, tasks can be redistributed automatically. In this process, not just one device executes the neural network; instead, multiple nodes compute the task cooperatively. To demonstrate this technology, we built an interactive hardware demonstrator. It allows attendees to experience firsthand how their handwriting samples are processed, analyzed, and evaluated through a distributed computing process, which can moreover be actively influenced by regulating the individual devices during operation.
In our talk, we elaborate on the process of building this demonstrator. We reflect on our experiences and the challenges that emerged during development, from the early concept stage up to the final continuous code integration phase. We will share insights into the practical hurdles we encountered, the software architecture and programming languages we used, the hardware solutions we adopted, and the lessons we learned along the way.
HELIPORT is a data management guidance system that aims at making the components and steps of the entire research experiment’s life cycle findable, accessible, interoperable and reusable according to the FAIR principles. It integrates documentation, computational workflows, data sets, the final publication of the research results, and many more resources. This is achieved by gathering metadata from established tools and platforms and passing along relevant information to the next step in the experiment's life cycle. HELIPORT's high-level overview of the project allows researchers to keep all aspects of their experiment in mind.
Machine learning projects are a particularly interesting use case. They are often prototypical in nature and driven by iterative development, so reproducibility and transparency are a great concern. It is essential to keep track of the relationship between input data, choices of model parameters, the code version in use, and performance measures and generated outputs at all times. This requires a data management platform that automatically records the changes made and their effects. Existing MLOps tools (such as Weights and Biases or MLflow) live entirely in the ML domain and start their workflow with the assumption that data is already available. HELIPORT, on the other hand, takes care of the data lifecycle as well. Our envisioned platform interoperates with the domain-specific tools already used by the scientists and is able to extract relevant metadata (e.g. provenance). It can also make persistent any additional information such as papers the work was based on, documentation of software components, workflows, or failure cases. Moreover, it should be possible to publish these metadata in machine-readable formats.
The challenge arising from these aspects lies in integrating ML workflows into HELIPORT in such a way that they work on the provided data and metadata. The goal is also to enable the comprehensible development of ML models alongside the experiment documented in HELIPORT. This allows different teams (e.g. experimentalists and AI specialists) to work together on the same project in a seamless manner and helps generate FAIRer outcomes. In the long term we hope to aid in establishing digital twins of facilities and in making their maintenance a part of the data management process.
In many disciplines, research software is nowadays essential for scientific progress. Most often, this software is written by the scientists themselves. However, they usually pursue a short-term strategy during development, aiming at the earliest possible results. Most often, this approach leads to low software quality, especially since the scientists are generally self-taught programmers. As a result, widespread and long-term use of the software is prevented and, at the same time, the quality of scientific research and the pace of progress are compromised.
The SURESOFT project at the TU Braunschweig aims to establish a general methodology and infrastructure based on Continuous Integration (CI) for research software projects. CI is a prerequisite for improving the quality of research software, simplifying software delivery, and ensuring long-term sustainability and availability.
In this talk, I will present how we applied the ideas and concepts of SURESOFT to our research code VirtualFluids. The code is a Computational Fluid Dynamics solver based on the Lattice Boltzmann Method for turbulent, thermal, multiphase and multicomponent flow problems as well as for multi-field problems such as Fluid-Structure-Interaction, including distributed pre- and postprocessing capabilities for simulations. VirtualFluids is designed to be used on High-Performance-Computing platforms with both GPGPUs and CPUs. Efficiency has always been critical, probably even more important than maintainability. As a result, VirtualFluids in the past lacked a delivery strategy as well as quality assurance. In my presentation, I will talk about how we used ideas from SURESOFT to improve VirtualFluids and how we refactored the application to find a better balance between efficiency, reliable delivery, and software quality. This way, VirtualFluids retains its computational power while becoming more structured and adaptable.
RSEs are required to publish reproducible software to satisfy the FAIR for Research Software Principles. To save RSEs the arduous labor of manual publication of each version, they can use the tools developed in the HERMES project. HERMES (HElmholtz Rich MEtadata Software Publication) is an open source project funded by the Helmholtz Metadata Collaboration. The HERMES tools help users automate the publication of their software projects and versions together with rich metadata. They can automatically harvest and process quality metadata, and submit them to tool-based curation, approval and reporting processes. Software versions can be deposited on publication repositories that provide PIDs (e.g. DOIs).
In this hands-on workshop, we briefly present and demonstrate HERMES before guiding RSE participants through setting up the HERMES publication workflow for their own software projects. We also cater for participants who want to deploy HERMES for their own infrastructure.
The workflow follows a push-based model and runs in continuous integration (CI) infrastructures such as GitHub Actions or GitLab CI. This gives users more control over the publication workflow compared to pull-based workflows (e.g. the Zenodo-GitHub integration). It also makes them less dependent on third-party services. Rich descriptive metadata is the key element of useful software publications. The workflow harvests existing metadata from source code repositories and connected platforms. Structured metadata could, for example, come from a Citation File Format file or a CodeMeta file. Unstructured metadata can be found throughout the repository, especially in the code or the README file. HERMES processes, collates and optionally presents the gathered data for curation to keep a human in the loop. During curation, the output can be controlled and errors reduced. After approval, HERMES prepares the metadata and software artifacts for automatic submission to FAIR publication repositories.
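To illustrate the kind of harvesting step involved (a simplified sketch, not the HERMES implementation), structured metadata from a CITATION.cff file can be read and mapped onto a few CodeMeta terms:

```python
import json
import yaml  # PyYAML

# Read structured metadata from a Citation File Format file ...
with open("CITATION.cff") as f:
    cff = yaml.safe_load(f)

# ... and map a few fields onto CodeMeta terms (simplified illustration).
codemeta = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": cff.get("title"),
    "version": cff.get("version"),
    "license": cff.get("license"),
    "author": [
        {"@type": "Person",
         "givenName": author.get("given-names"),
         "familyName": author.get("family-names")}
        for author in cff.get("authors", [])
    ],
}

with open("codemeta.json", "w") as f:
    json.dump(codemeta, f, indent=2)
```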
In the course of the workshop, RSEs are enabled to employ HERMES for their own projects by following a live coding session on an example project. We will address any problems that arise along the way and help participants solve them. Finally, we will discuss potential improvements of the HERMES workflow based on the hands-on experience participants have gained.
The workshop should last about 90 min. The target audience is everyone who deals with research software. Researchers, developers, curators and supervisors are welcome as well as everyone interested. No specific expertise or previous experience is needed. We work with GitHub or GitLab, and use their continuous integration tools, so some previous experience with these platforms may be helpful.
Research software development is crucial for scientific advancements, yet the sustainability and maintainability of such software pose significant challenges. In this tutorial, we present a comprehensive demonstration on leveraging software templates to establish best-practice implementations for research software, aiming to enhance its longevity and usability.
Our approach is grounded in the utilization of Cookiecutter, augmented with a fork-based modular Git strategy, and rigorously unit-tested methodologies. By harnessing the power of Cookiecutter, we streamline the creation process of research software, providing a standardized and efficient foundation. The fork-based modular Git approach enables flexibility in managing variations, facilitating collaborative development while maintaining version control and traceability.
Central to our methodology is the incorporation of unit testing, ensuring code integrity and reliability of the templates. Moreover, we employ Cruft, a tool tailored to combat the proliferation of boilerplate code, often referred to as the "boilerplate-monster." By systematically managing and removing redundant code, Cruft significantly enhances the maintainability and comprehensibility of research software. This proactive approach mitigates the accumulation of technical debt and facilitates long-term maintenance.
The open-source templates are available at https://codebase.helmholtz.cloud/hcdc/software-templates/. In the first 30 minutes of the tutorial, participants will gain insights into the structured organization of these software templates, enabling them to understand the framework's architecture and application to their own software products. The subsequent 30 minutes will be dedicated to a hands-on tutorial, allowing participants to engage directly with the templates, guiding them through the process of implementing and customizing them for their specific research software projects.
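As a taste of the hands-on part, a template can be instantiated either via the cookiecutter command line or programmatically from Python; the template URL and context values below are placeholders, not a specific template from the collection above.

```python
from cookiecutter.main import cookiecutter

# Instantiate a (hypothetical) template non-interactively, overriding a
# few template variables; in the tutorial this is done interactively.
cookiecutter(
    "https://example.org/my-software-template.git",  # placeholder URL
    no_input=True,
    extra_context={
        "project_name": "my-analysis-tool",
        "author_name": "Jane Doe",
    },
)
```

Cruft offers analogous create and update commands that additionally record which template revision a project was generated from, which is what keeps generated projects in sync with later template changes.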
Maintaining research software presents distinct challenges compared to traditional software development. The diverse skill sets of researchers, time constraints, lack of standardized practices, and evolving requirements contribute to the complexity. Consequently, software often becomes obsolete, challenging to maintain, and prone to errors.
Through our tutorial, we address these challenges by advocating for the adoption of software templates. These templates encapsulate best practices, enforce coding standards, and promote consistent structures, significantly reducing the cognitive load on developers. By providing a well-defined starting point, researchers can focus more on advancing their scientific endeavors rather than grappling with software complexities.
Furthermore, the utilization of software templates fosters collaboration and knowledge sharing within research communities. It encourages the reuse of proven solutions, accelerates the onboarding process for new contributors, and facilitates better documentation practices. Ultimately, this approach leads to a more sustainable ecosystem for research software, fostering its evolution and ensuring its relevance over time.
In summary, our tutorial offers a practical and comprehensive guide to creating and utilizing software templates for research software development. By harnessing Cookiecutter with Git-based modularity, unit testing, and the power of Cruft, we aim to empower researchers in building robust, maintainable, and sustainable software, thereby advancing scientific progress in an efficient and impactful manner.
ESM-Tools is a modular infrastructure software that enables the seamless building, configuration and execution of Earth System Models (ESM) on various High Performance Computing (HPC) platforms. The software is developed at the Alfred Wegener Institute for Polar and Marine Research in Bremerhaven, jointly with the GEOMAR Helmholtz-Zentrum für Ozeanforschung in Kiel. The software is open-source and distributed through GitHub.
The aim of ESM-Tools is to provide an infrastructure tool that includes different ESM components and facilitates the use of these ESMs on different HPC systems. The software must therefore be able to handle the many different contingencies that these models and HPC systems entail. Another requirement is that the software be easily extensible in order to include future ESMs and HPC systems and thus also increase the modularity of the ESMs. One of the main requirements is that all of these adaptations to the extensibility and functionality of the software should be customizable by the user/researcher of the software and not necessarily by an experienced software engineer. To fulfil these requirements, it must be possible to extend the functionality without changing the source code.
To address the software requirements stated above, we applied the following design choices: (i) use of a modular software architecture; (ii) following the separation-of-concerns principle by separating source code (an HPC- and model-agnostic Python back end) from configuration; (iii) a modular and hierarchical configuration, with modular, easy-to-read/write YAML files defining the configuration of each specific component of the setup (HPC and model configuration); (iv) extending the functionality of the configuration files through a special configuration file syntax (esm-parser); (v) providing an adaptable workflow and plugin manager that advanced users can configure to extend and add new functionality.
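To illustrate design choice (iii) in generic terms (this is not the actual ESM-Tools file format or parser), a hierarchical configuration can be assembled by merging small, easy-to-read YAML layers, with more specific layers such as a machine file overriding generic defaults:

```python
import yaml  # PyYAML

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical layered configuration: generic model defaults plus an
# HPC-specific file; the file names are placeholders for illustration.
config = {}
for layer in ["model_defaults.yaml", "hpc_machine.yaml", "experiment.yaml"]:
    with open(layer) as f:
        config = deep_merge(config, yaml.safe_load(f) or {})

# `config` now holds the merged view, with values from the most
# specific layer taking precedence over the generic defaults.
```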
In this contribution we will introduce ESM-Tools and the design choices behind its architecture. Additionally, we will discuss the advantages of such a modular system, address the usability and maintainability challenges that result from these design choices, and present our mitigation strategies.
Research software often starts as small codes that expand over the years, through the work of various people, into large codes. This evolution is shaped by the specific research projects for which the codes are used and by the pressure to achieve milestones and publications within these research projects. This pressure often counteracts the aim of developing a rigorous and clear structure for the underlying research codes. In addition, research codes are typically not developed by professional software engineers but rather by PhD students and postdocs without special training in software engineering. All this inevitably leads to large legacy codes that become increasingly difficult to handle.
As an example of such problems, we discuss the multi-physics simulation code 4C, which consists of more than 1 million lines of code and has been developed over 20 years by dozens of researchers in a collaboration of several institutions. 4C is a parallelized multi-physics research code for analyzing and solving a plethora of physical problems with a focus on computational mechanics. It offers simulation capabilities for various physical models, including single fields such as solids and structures, fluids, scalar transport, or porous media, as well as multi-physics coupling and interactions within multi-field problems.
To overcome the initially summarized difficulties, we applied two strategies. On the one hand, we started new code projects based on the advanced library deal.II and on the Julia project Trixi.jl and sought to port the functionalities of 4C into these new codes. On the other hand, we pursued incremental refactoring of the existing legacy code. In this talk, we compare the pros and cons as well as the results of both approaches in the case of 4C.
Sustainable software development and metadata practices are crucial for making research software FAIR. Undeniably, this requires an initial investment of time and effort for researching and adopting best practices, as well as for regular maintenance tasks. This impairs the bottom-up adoption of best practices by RSEs. In this talk we present two complementary tools addressing this challenge:
The command-line tool somesy organizes and synchronizes software project metadata across multiple required or recommended files that are typically used in a software project. It is designed for easy integration into the development workflow, reducing the overhead for software metadata management.
The fair-python-cookiecutter is a git repository template providing a well-structured foundation for new Python projects. The detailed documentation also turns it into a hands-on educational resource. Besides combining various state-of-the-art tools, it includes somesy to simplify software metadata management. The template follows software engineering practices recommended by DLR and OpenSSF. Furthermore, it supports relevant metadata standards such as REUSE, CITATION.cff and CodeMeta.
ClusterCockpit, a specialized performance and energy monitoring framework designed for High-Performance Computing (HPC) cluster systems, has evolved significantly since its inception in 2018. The framework comprises a web frontend, an API backend, a node agent, and a metric in-memory cache. Being an open-source project, its source code is available on GitHub at https://github.com/ClusterCockpit.
This presentation delves into the challenges encountered and the journey taken by ClusterCockpit over the past five years. Initially built as a PHP Symfony web application relying on server-side rendering and jQuery libraries, the framework has transformed into its current state with a Go API backend and a web frontend based on Svelte.
The talk emphasizes the tradeoffs encountered when choosing frameworks at different levels and finding the right balance between ease of use and flexibility. The project's progression is explored, starting from its early stages with PHP Symfony to the current architecture. Notable stages and experiences are highlighted, providing insights into the decision-making process.
Particular attention is given to the choices made in terms of architecture and design, shedding light on the considerations that led to the adoption of Go for the backend and Svelte for the frontend. The presentation aims to offer a comprehensive understanding of ClusterCockpit's development, focusing on the evolution of technologies, frameworks, and the project's current state.
In this workshop, the TeachingRSE project will work on how to institutionalize the education of RSEs in Germany. To that end, we plan to present the current status to workshop participants in order to collect feedback on the current work, but also to enable deRSE24 participants to contribute to the next publications and become regular contributors to the project.
Structured and machine-readable experiment metadata enable various analyses and visualizations of the stored data and metadata, which may not be easily achievable with unstructured metadata such as free text. On top of that, structured metadata increases the findability and re-usability of said metadata (and the data to which the metadata is attached) for other purposes, following the spirit of the FAIR data principles. Adamant is a browser-based research data management (RDM) tool, specifically developed to systematically collect experiment metadata that is both machine- and human-readable. It makes use of the JavaScript Object Notation (JSON) schema, where valid schemas can be rendered as an interactive and user-friendly web form. Researchers may create a JSON schema that describes their experiments from scratch using the Adamant user interface or provide an existing schema. In its current state, Adamant is mainly used to compile structured experiment metadata in conjunction with a generic electronic lab notebook. In this talk, we will present the current features of Adamant and production-ready RDM workflows involving Adamant and other RDM tools, as well as concepts for the future development of Adamant. These concepts include an ontology and knowledge graph integration for a guided acquisition of structured metadata, and visualization of graph data for better browsing and navigation through the stored metadata. Overall, the ultimate goal of Adamant is to make FAIR RDM activities as easy as possible for researchers.
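The underlying mechanism can be illustrated with the jsonschema package and a small, made-up experiment schema (not one of Adamant's actual schemas): the same schema that drives the web form can be used to validate the collected metadata before it is stored.

```python
from jsonschema import validate  # pip install jsonschema

# A made-up schema describing a minimal experiment record.
schema = {
    "type": "object",
    "properties": {
        "sample_id": {"type": "string"},
        "temperature_K": {"type": "number", "minimum": 0},
        "operator": {"type": "string"},
    },
    "required": ["sample_id", "temperature_K"],
}

# Metadata entered through a form rendered from the schema can be
# validated before it is stored alongside the data.
record = {"sample_id": "S-042", "temperature_K": 293.15, "operator": "A. Example"}
validate(instance=record, schema=schema)  # raises ValidationError if invalid
```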
PostWRF is an open-source software toolkit that facilitates the main visualization tasks and data handling for outputs of the Weather Research and Forecasting (WRF) model. The toolkit is mostly written in NCL and Shell, with a namelist that resembles the WRF or WPS namelists. Besides the visualizations, PostWRF provides WRF-NetCDF to GeoTIFF conversion for GIS applications as well as plotting and extraction of ERA5-NetCDF reanalysis data. The primary purpose of PostWRF is to enable environmental researchers (both experienced and inexperienced) to make use of WRF model simulations in a straightforward and efficient way, without dealing with coding and syntax errors. Since the WRF model simulates most aspects of a full atmospheric model at the regional scale, the toolkit can also be used as an educational aid in the meteorological and environmental sciences. PostWRF is available on GitHub (https://github.com/anikfal/PostWRF) and comes with HTML documentation (https://postwrf.readthedocs.io/en/master) and guided examples.
Helmholtz-Zentrum Hereon operates multiple X-ray diffraction (XRD) experiments for external users and while the experiments are very similar, their analysis is not. Pydidas [1, 2] is a software package developed for the batch analysis of X-ray diffraction data. It is published as open source and intended to be widely reusable.
Because of the wide range of scientific questions tackled with XRD, a limited number of generic tools will not be sufficient to cover all possible analysis workflows. Easy extensibility of the core analysis routines is therefore a key requirement. A framework for creating plugin-based workflows was developed and integrated into the pydidas software package to accommodate different analytical workflows in one software tool. We present the architecture of the pydidas workflows and plugins along with the tools for creating workflows and editing plugins.
Plugins are fairly simple in design to allow users/collaborators to extend the standard pydidas plugin library with tailor-made solutions for their analysis requirements. Access to plugins is handled through a registry which automatically finds plugins in specified locations to allow for easy integration of custom plugins. Pydidas also includes (graphical) tools for creating and modifying workflows and for configuring plugins, as well as for running the resulting workflows.
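The registry idea can be sketched in a few lines of generic Python; this illustrates the pattern, not pydidas's actual plugin API. Plugin modules found in user-specified directories are imported so that they can register their classes by name, and custom plugins only need to be dropped into a search path.

```python
import importlib.util
import pathlib

class PluginRegistry:
    """Discovers and registers plugin classes found in given directories."""

    def __init__(self):
        self.plugins = {}

    def register(self, cls):
        """Decorator used inside plugin modules to register a plugin class."""
        self.plugins[cls.__name__] = cls
        return cls

    def discover(self, *directories):
        """Import every *.py file in the given directories so that their
        module-level registration code (e.g. a @registry.register decorator
        imported from this module) runs and populates the registry."""
        for directory in directories:
            for path in pathlib.Path(directory).glob("*.py"):
                spec = importlib.util.spec_from_file_location(path.stem, path)
                module = importlib.util.module_from_spec(spec)
                spec.loader.exec_module(module)

registry = PluginRegistry()
# registry.discover("/path/to/custom_plugins")   # user-specified locations
# workflow = [registry.plugins["MyIntegrationPlugin"](), ...]
```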
While pydidas was developed with the analysis of X-ray diffraction data in mind and the existing generic analysis plugins reflect this field, the architecture itself is very versatile and can easily be re-used for different research techniques.
[1] https://pydidas.hereon.de
[2] https://github.com/hereon-GEMS/pydidas
This is a hands-on tutorial on how to write a parser with parsing expression grammars (PEG). PEG-based parsers are essential for building domain-specific languages. They are also useful for programming compilers and transpilers or for retrieving structured data from hand-written sources like bibliographies.
Writing formal grammars is often considered difficult. The key to success is, in my opinion, to employ an incremental, test-driven approach. In this tutorial we will use the DHParser framework, a Python-based parser generator with strong support for test-driven development.
As a guiding example, we will write a grammar for Markdown and construct a Markdown parser from scratch. Along the way we will also touch on topics such as the simplification of syntax trees, the locating and reporting of syntax errors, and fail-tolerant parsing.
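To give a flavour of the incremental approach, the following is a generic, plain-Python illustration (not DHParser code) of a first grammar increment that only recognises ATX headings and plain text lines, using PEG's ordered choice:

```python
import re

def heading(text, pos):
    """heading <- '#'+ ' ' rest-of-line"""
    m = re.match(r"(#{1,6}) (.*)\n?", text[pos:])
    if m:
        return ("heading", len(m.group(1)), m.group(2)), pos + m.end()
    return None

def paragraph(text, pos):
    """paragraph <- any non-empty line"""
    m = re.match(r"([^\n]+)\n?", text[pos:])
    if m:
        return ("paragraph", m.group(1)), pos + m.end()
    return None

def document(text):
    """document <- (heading / paragraph)*   -- '/' is PEG's ordered choice"""
    pos, nodes = 0, []
    while pos < len(text):
        for rule in (heading, paragraph):   # try the alternatives in order
            result = rule(text, pos)
            if result:
                node, pos = result
                nodes.append(node)
                break
        else:
            pos += 1                        # skip blank lines
    return nodes

print(document("# Title\nSome text.\n## Section\nMore text.\n"))
```

In the tutorial, the same increments are instead expressed as a formal grammar and grown test by test with DHParser.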
If time permits, we might also cast a glance in the afternoon session at more advanced topics such as the use of macros and preprocessors or, depending on the preferences of the audience, at more complicated examples such as parsing LaTeX or transpiling data-structure definitions from TypeScript to Python.
Most of the tutorial will be based on the documentation of DHParser on https://dhparser.readthedocs.io.
Prerequisites are a good knowledge of regular expressions (we will use them a lot as the basic building blocks for our grammar). People who want to follow the examples and exercises themselves should bring a laptop with Python 3.7 (https://python.org) or higher and the PyCharm Community IDE (https://www.jetbrains.com/de-de/pycharm/) installed. DHParser should also be installed, either with the command "python -m pip install DHParser" or directly from https://gitlab.lrz.de/badw-it/DHParser.
Test-driven development (TDD) is a software development approach that enhances the overall quality and efficiency of the development process.
TDD involves writing automated tests before developing the actual code. In this approach, the developer begins by creating a test that intentionally fails. Subsequently, the necessary code is implemented to pass the test, followed by code refactoring to improve its overall quality. This iterative cycle is applied for each new feature or modification to the codebase.
Utilizing TDD on its own or in conjunction with other practices can contribute to keeping the code in a consistently functional and deployable state. This approach is effective in identifying and resolving issues or bugs early in the development process, ensuring a smoother and more reliable software development journey.
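A minimal illustration of this cycle with pytest-style tests (the function names are made up): the test is written first and fails, the smallest implementation makes it pass, and refactoring then happens under the protection of the test.

```python
# Step 1 (red): write the test first; it fails because slugify() does not
# exist yet (or does not behave as required).
def test_slugify_replaces_spaces_and_lowercases():
    assert slugify("Research Software") == "research-software"

# Step 2 (green): add the minimal implementation that makes the test pass.
def slugify(text: str) -> str:
    return text.lower().replace(" ", "-")

# Step 3 (refactor): improve the code (e.g., strip punctuation) while the
# test, run after every change (e.g., with pytest), keeps it working.
if __name__ == "__main__":
    test_slugify_replaces_spaces_and_lowercases()
    print("test passed")
```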
Publishing sustainable research data and providing appropriate access for many research communities challenges many players: researchers, RSEs, standardisation organisations and data repositories. With national research data infrastructures (NFDI) being set up in Germany, the latter could be solved in the mid- to long term for specific datasets. In the meantime, researchers often produce datasets in research projects which are provided as services, e.g. from a web page, but which may, due to a lack of funding, disappear in that form after the research project has ended. To circumvent this, open research data is hosted long-term on public platforms such as university libraries, Zenodo or GitHub. However, this hosted data is not necessarily easily discoverable by different research communities. On top of that, research data is rarely published in isolation, but with links to related datasets, leading to the creation of link-preserving, FAIR linked open data (LOD) as RDF dumps, modelling data interoperably in common vocabularies. LOD in RDF preserves links, but is not necessarily Linked Open Usable Data (LOUD), i.e. it does not provide data in the ways different research communities expect. We would like to address this problem of missing LOUD data while reducing backend requirements such as hardware and software to a minimum.
We believe that a solution to this data provision problem is publishing research data as static webpages and using standardised static APIs to serve data in ways different research communities expect.
We developed a documentation extension to our SPARQLing Unicorn QGIS Plugin, which allows RDF data dumps to be published as an HTML page and an RDF serialization per data instance, similar to what frontends to triple stores such as Pubby provide.
It is published as a QGIS plugin, a standalone script on GitHub, and a GitHub Action.
The resulting data dump is hostable on static webspaces, e.g. GitHub Pages, and allows navigating the contents of the LOD data in HTML, including a class tree. It may include:
* Further data formats: Graph Data (GraphML, GEXF), General Purpose (CSV)
* SPARQL querying in JavaScript using the data dump
* Generation of static APIs, e.g. JSON documents mimicking standardized APIs, for
  * OGC API Features: Access to FeatureCollections from e.g. QGIS
  * IIIF Presentation API 3.0: IIIF Manifest Files for images/media in the knowledge graph including typed collections
  * CKAN API: Datasets in the DCAT vocabulary or data collections
Static APIs further the accessibility of LOD data for different research communities and increase the chances of data reuse and exposure in different research fields, while at the same time not depending on additional infrastructure for data provision.
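As a hedged sketch of the static-API idea: the JSON documents that a client would normally request from a live service are pre-rendered at build time, here a GeoJSON FeatureCollection of the kind an OGC API Features items endpoint returns (the collection name and file layout are assumptions for illustration):

```python
import json
import pathlib

# Hypothetical input: data instances extracted from the RDF dump.
features = [
    {
        "type": "Feature",
        "id": "site-001",
        "geometry": {"type": "Point", "coordinates": [8.24, 49.99]},
        "properties": {"label": "Example excavation site"},
    },
]

collection = {"type": "FeatureCollection", "features": features}

# Write the document to the path a client would request from a live
# OGC API Features endpoint, so it can be served from static webspace.
out = pathlib.Path("collections/sites/items/index.json")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(collection, indent=2))
```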
Our talk shows the feasibility of this approach using publicly available examples for geodata and CKAN (SPP Dataset, AncientPorts Dataset, CIGS Dataset) and the ARS-LOD dataset for static IIIF data.
We discuss the requirements and limitations of this kind of publishing in an RDM publishing workflow, its relation to NFDI plans, and how to extend this approach to only partially open data using a Solid pod publishing workflow.
Energy research software (ERS) is used in energy research for multiple purposes, such as the visualization of processes and values (e.g., power quality), the (co-)simulation of smart grids, or the analysis of transition paths for energy systems. Within a typical research cycle, this software is often fundamental for producing new research results, while it can also represent a result of the research itself (1).
Metadata have been shown to be one of the success factors for the so-called FAIRification of research software, especially for improving the findability and, thus, the reusability of research software (2), (3). To reach high interoperability of metadata as part of this FAIRification, the metadata should follow a defined schema with extensive reuse of relevant metadata elements from other schemas. Within the energy domain, some approaches to collecting metadata for energy research software already exist (e.g., the openmod wiki or the Open Energy Platform). However, none of these approaches uses a formalized and interoperable metadata schema that would open up the approach for further reuse for FAIR ERS.
As a first step in developing a metadata schema, requirements have to be gathered on which information should be included in the metadata (4). The goal of our work is therefore to gather these specific requirements for a metadata schema for ERS.
To this end, we follow a qualitative research approach to obtain relevant requirements from multiple stakeholders: we conducted semi-structured interviews with 36 researchers from different subdomains of energy research, e.g., research on power grids or on specific components. The researchers use different types of software, ranging from scripts through libraries to stand-alone software. Our interviews followed a rough interview guideline, based on the FAIR criteria, structured into five main categories: findability of generally fitting ERS, selection of the right ERS for certain research, accessibility, interoperability, and reusability.
The interviews reveal a diverse field of requirements for ERS, for several reasons. First, depending on the subdomain, the ERS themselves are highly diverse. Second, the different scientific backgrounds of energy researchers lead to different requirements, e.g., regarding the choice of programming languages. The interviews especially show a need for information on the community around and the quality of ERS.
In our talk, we will present the results of our requirement analysis and discuss them with the audience.
References
(1) S. Ferenz, “Towards More Findable Energy Research Software by Introducing a Metadata-based Registry,” in Abstracts of the 11th DACH+ Conference on Energy Informatics, Anke Weidlich, Gunther Gust and Mirko Schäfer, Ed., Springer, 2022. doi: 10.1186/s42162-022-00215-6.
(2) D. S. Katz, M. Gruenpeter, and T. Honeyman, “Taking a fresh look at FAIR for research software,” Patterns, vol. 2, no. 3, Mar. 2021, doi: 10.1016/j.patter.2021.100222.
(3) A.-L. Lamprecht et al., “Towards FAIR principles for research software,” Data Sci., vol. 3, no. 1, pp. 37–59, Jan. 2020, doi: 10.3233/DS-190026.
(4) M. Curado Malta and A. A. Baptista, “Me4DCAP V0.1: a Method for the Development of Dublin Core Application Profiles,” Min. Digit. Inf. Netw., pp. 33–44, 2013, doi: 10.3233/978-1-61499-270-7-33.
Research across the Helmholtz Association is based on inter- and multidisciplinary collaborations across its 18 centers and beyond. However, the wealth of Helmholtz's (meta)data and digital assets is stored in a distributed and incoherent manner, with varying quality.
To address this challenge, the Helmholtz Metadata Collaboration (HMC) launched the unified Helmholtz Information and Data Exchange (unHIDE) project in 2022. UnHIDE aggregates metadata harvested from Helmholtz infrastructure in the Helmholtz Knowledge Graph (Helmholtz KG). This serves as a lightweight and sustainable interoperability layer to interlink data infrastructures and increase visibility and access to the Helmholtz Association's (meta)data and information assets.
Version 1.0.0 of the Helmholtz KG was released in October 2023. This includes a comprehensive web front end for manual search of resources [1], a stable and documented [2] backend with a tested data ingestion and integration pipeline, and machine accessible endpoints [3].
In this talk we present an overview of the Helmholtz metadata ecosystem and describe the semantic and technological architecture of the Helmholtz KG and how it integrates metadata from heterogeneous sources to improve visibility and findability. We will show how code and research software is scattered across different platforms (such as institutional GitLab instances), how its metadata lacks connections to other (research) publications, and that only a minority is formally published in central indexes [4]. We will show and discuss some results of our efforts to integrate and improve software metadata in Helmholtz as well as future ways in which the Helmholtz KG is envisioned to harmonize and improve the quality of metadata at the source: in the respective infrastructures.
[1] https://search.unhide.helmholtz-metadaten.de/
[2] https://docs.unhide.helmholtz-metadaten.de/
[3] https://sparql.unhide.helmholtz-metadaten.de/
[4] e.g. https://helmholtz.software/
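For readers who want to try the machine-accessible endpoint [3] themselves, a query can be sent with the SPARQLWrapper package, assuming the endpoint accepts standard SPARQL-over-HTTP requests at the listed URL; the query simply lists a handful of triples and is purely illustrative:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://sparql.unhide.helmholtz-metadaten.de/")
sparql.setQuery("""
    SELECT ?s ?p ?o
    WHERE { ?s ?p ?o }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

# Run the query and print the returned triples.
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])
```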
Based on the precondition of the availability of Kubernetes as a Service, this talk is going to show why the migration of legacy software projects to a Kubernetes cluster is a worthwhile undertaking and how it can succeed, including the decisions made for the tools and the lessons learned on the way.
Starting from the current state with multiple deployment tools and environments (VMs, dedicated servers, Puppet, manual installs), we briefly cover the main problems arising from this historically grown structure.
We will then introduce a toolchain of CNCF Graduated software comprising Helm Charts as a standardized format for the deployment configuration of an application and ArgoCD as a deployment tool for continuously delivering applications to a Kubernetes cluster in a declarative and GitOps manner.
We will then see how this new toolchain for deployment and application hosting resolves or mitigates the aforementioned problems.
The second part of the talk will give a brief introduction to supporting services running inside the cluster that contribute to having a controlled, stable and easy-to-maintain environment for application deployments. This will cover possible solutions for certificate issuing, secrets management, logging and error tracking.
We will close the talk with an outlook on how a software development team could adapt to the new workflow and open a discussion on possible problems (e.g. limited resources) and coping strategies.
Task farms can be used to solve embarrassingly parallel workloads where a number of independent tasks need to be performed. This presentation introduces taskfarm, a Python client/server framework that was designed to manage a satellite data processing workflow with hundreds of thousands of tasks with variable compute costs. The server uses Flask to hand out tasks via a REST API and a database to track the progress of tasks. The client is also implemented in Python. The presentation will focus on the software design process, and on the pitfalls and dead ends encountered when dealing with big data and how they were resolved.
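The core idea can be sketched in a few lines of Flask; this is a simplified illustration rather than the actual taskfarm API. The server hands out the next waiting task and records completion, so any number of workers can pull work independently:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# In the real system tasks live in a database; a list suffices for the sketch.
tasks = [{"id": i, "status": "waiting"} for i in range(1000)]

@app.route("/task", methods=["POST"])
def get_task():
    """Hand the next waiting task to a worker and mark it as started."""
    for task in tasks:
        if task["status"] == "waiting":
            task["status"] = "started"
            return jsonify(task)
    return jsonify({"message": "no tasks left"}), 404

@app.route("/task/<int:task_id>/done", methods=["PUT"])
def complete_task(task_id):
    """Record that a worker has finished the given task."""
    tasks[task_id]["status"] = "done"
    return jsonify(tasks[task_id])

if __name__ == "__main__":
    app.run()
```

In the real system the task state is kept in a database rather than in memory, but the request/response pattern stays the same.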
Soft matter systems often exhibit physical phenomena that span different time or length scales, which can only be captured in numerical simulations by a multiscale approach that combines particle-based methods and grid-based methods. These algorithms have hardware-dependent performance characteristics and usually leverage one of the following optimizations: CPU vectorization, shared-memory parallelization, or offloading to the GPU.
The ESPResSo package [1] combines a molecular dynamics (MD) engine with a lattice-Boltzmann (LB) solver, electrostatics solvers and Monte Carlo schemes to model reactive and charged matter from the nanoscale to the mesoscale, such as gels, energy materials, and biological structures [2]. The LB method is widely used to model solvents and diffusive species that interact with solid boundaries and particles. The popularity of the method can be explained by its simplicity, re-usability in different contexts, and excellent scalability on massively parallel systems. New LB schemes can be rapidly prototyped in Jupyter Notebooks using lbmpy [3] and pystencils [4], which rely on a symbolic formulation of the LB method to generate highly optimized and hardware-specific C++ and CUDA kernels that can be re-used in waLBerla [5].
Originally designed for high-throughput computing, ESPResSo has recently found new scientific applications that require resources only available at high-performance computing (HPC) facilities. Major structural changes were necessary to make efficient use of these resources [6]: replacing the original LB code with waLBerla, a library tailored for HPC; rewriting the MD engine to support data layouts optimized for memory access; and redesigning the particle management code to reduce communication overhead. These changes make ESPResSo more performant, productive and portable, and easily extensible and re-usable in other domains of soft matter physics. In collaboration with our partners in the Cluster of Excellence MultiXscale, the software is now available on EasyBuild and will be part of the EESSI [7] pilot.
References:
[1] Weik et al. "ESPResSo 4.0 – an extensible software package for simulating soft matter systems". In: European Physical Journal Special Topics 227.14, 2019. doi:10.1140/epjst/e2019-800186-9
[2] Weeber et al. "ESPResSo, a Versatile Open-Source Software Package for Simulating Soft Matter Systems". In: Comprehensive Computational Chemistry. Elsevier, 2024. doi:10.1016/B978-0-12-821978-2.00103-3
[3] Bauer et al. "lbmpy: Automatic code generation for efficient parallel lattice Boltzmann methods". In: Journal of Computational Science 49, 2021. doi:10.1016/j.jocs.2020.101269
[4] Bauer et al. "Code generation for massively parallel phase-field simulations". In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019. doi:10.1145/3295500.3356186
[5] Bauer et al. "waLBerla: A block-structured high-performance framework for multiphysics simulations". In: Computers & Mathematics with Applications 81, 2021. doi:10.1016/j.camwa.2020.01.007
[6] Grad, Weeber, "Report on the current scalability of ESPResSo and the planned work to extend it". MultiXscale Deliverable, EuroHPC Centre of Excellence MultiXscale, 2023. doi:10.5281/zenodo.8420222
[7] Dröge et al. "EESSI: A cross-platform ready-to-use optimised scientific software stack". In: Software: Practice and Experience 53(1), 2023. doi:10.1002/spe.3075
The Helmholtz Research Software Directory (Helmholtz RSD, https://helmholtz.software) is a platform designed for promoting and discovering research software. The platform was launched in March 2023 by the Helmholtz Federated IT Services (HIFIS) to provide a platform for research software developed within Helmholtz. The Helmholtz RSD is based on the RSD of the Netherlands eScience Centre (https://www.esciencecenter.nl/) and makes it possible to comfortably add, manage and track metrics of research software. In this demo session, we will learn how to build the RSD from scratch and discover the functionality of the RSD from an admin and end-user perspective.
I co-organize the monthly HackyHour at the University of Würzburg, a social gathering where people interested in computational tools for research meet and exchange ideas. It also aims to provide help to students and researchers with specific problems. In addition, I organize the regular Data Dojo at my organization, in which a pre-selected data set is collectively analyzed, with everyone taking turns at the keyboard.
I would love to meet organizers of similar events or people interested in starting such meet-ups. We could discuss the pros and cons of different formats, and how to advertise and sustainably organize these events. Furthermore, we can identify common pitfalls and discuss ways to connect the existing initiatives all across Germany (and beyond).
A hands-on workshop session at DE-RSE 2024
Software Management Plans (SMPs) help Research Software Engineers (researchers who develop code as part of their research or software engineers who support research activities) to oversee some of the activities during the software development lifecycle. Such activities could support (i) researchers in developing better software by following some minimum good practices, and (ii) software engineers in adopting some practices that might not be common outside research (e.g., archiving releases, providing citation information).
In this workshop, we will introduce (research) software metadata and its connection to SMPs, a tool for metadata extraction from GitHub repositories (GitHub, paper, website), and the Software Management Wizard (SMWizard – a tool to facilitate filling in ELIXIR SMPs, see preprint and web page). After the introduction, we will have a hands-on session working in small groups to try out and improve the SMWizard and the metadata extractor (e.g., suggesting improvements to the SMP/SMWizard/metadata extraction, using the tools to improve the machine-readability of your own GitHub repo, implementing a new data integrator for the Wizard, improving the metadata extraction). The hands-on session will finish with feedback from participants, followed by a wrap-up from the workshop organizers.
Time | Activity | Responsible |
---|---|---|
10’ | Welcome | Leyla Jael Castro |
Introductory session | ||
10’ | Software metadata | Stephan Ferenz |
10’ | Machine-actionable SMPs | Leyla Jael Castro |
15’ | Software metadata extraction | Daniel Garijo |
15’ | SMWizard | Marek Suchánek |
Hands-on session | ||
90’ | In groups, work on one of the following topics: (i) Create your SMP with the SMWizard and brainstorm on improvements, (ii) create an integrator for the SMWizard, (iii) use the metadata extraction tools to produce Codemeta and/or Bioschemas metadata files and brainstorm on improvements, (iv) extend or develop new functionality for the metadata extractor, (v) your own idea | All participants |
20’ | Feedback from groups | One participant per group |
10’ | Wrap-up | Leyla Jael Castro |
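As a hedged example of the kind of metadata extraction the hands-on groups will work with (querying the public GitHub REST API directly, not using the workshop's extraction tool):

```python
import requests

# Fetch basic, machine-readable metadata for a public repository.
# Replace owner/repo with your own project for the hands-on session.
owner, repo = "octocat", "Hello-World"
response = requests.get(f"https://api.github.com/repos/{owner}/{repo}", timeout=10)
response.raise_for_status()
data = response.json()

metadata = {
    "name": data["name"],
    "description": data["description"],
    "license": (data.get("license") or {}).get("spdx_id"),
    "topics": data.get("topics", []),
    "url": data["html_url"],
}
print(metadata)
```

Fields like these can then be brainstormed into Codemeta or Bioschemas metadata files, as described in the hands-on topics above.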