Please note that this is a public pad. Thus, please only share non-sensitive information.
Foundations of Research Software Publication (26./27.05.2025)
Organizational Details
Day 1 (26.05.2025)
- 09:00-09:30 Welcome & Introduction
- 09:30-10:45 Example, Code Repository Structuring
- 10:45-11:00 Break
- 11:00-12:45 Coding Practices, Documentation
- 12:45-13:00 Wrap Up
Day 2 (27.05.2025)
- 09:00-09:15 Welcome & Introduction
- 09:15-10:45 Open Source Licensing
- 10:45-11:00 Break
- 11:00-12:45 Software Citation, Software Release Practices
- 12:45-13:00 Wrap Up
Before the Workshop
During the Workshop
- Please mute yourself if you are not talking.
- Please use a headset.
- Please be back in time after breaks.
- Please ask your questions in the chat during the presentation.
Workshop Materials
Day 1
Welcome and Introduction
Please tell us in a minute about your background and interest in the workshop :)
Motivation
Reproducibility of scientific results and enabling others to review them are vital in research.
- Making scientific results reproducible is context-specific and can be hard in the concrete case. For example, the recently published paper Community-developed checklists for publishing images and image analyses provides advice on how to improve reproducibility in biomedical research when publishing an image analysis.
- Involved research software introduces a separate domain of potential reproducibility issues.

Image credit: The image is part of Illustrations from The Turing Way: Shared under CC-BY 4.0 for reuse created by the Turing Way Community & Scriberia and licensed under CC BY 4.0.
Making research code openly available helps. But there are more aspects to consider!

Image credit: The image is part of the Open Science MOOC, Module 5 “Open-Research-Software” and licensed under CC0 1.0.
In this workshop we consider minimum practices when publishing research code to improve reproducibility of scientific results.
[Task]: Please think about situations where you wanted to share your code outside your organization:
Why did you want to share your code?
- Anna: I was sharing with my fellow PhD students scripts for making figures in GMT
- Sebastian: In our field we deal with complex behavior patterns of larval zebrafish and every lab basically has its own way of analyzing this data. This “wild west” scenario is one of the major reasons why the zebrafish is not yet considered much in chemical risk assessment by regulators. We would like to provide our approach publicly so other labs can use it and do not have to develop their own approaches.
- others wanted to perform the same analysis on their data
- Emily: Developing three codebases in collaboration with colleagues in America (open source and widely used across our field)
- salika: as a supplementary material for my published paper.
- Sadegh: Simply to share how we produced the results; to learn from others how we could enhance the code and results; and also to get the paper cited.
- Sarah: I develop machine-learning models to predict material properties and find structure-property relationships. Similar code could be used on other materials. Therefore, I would like to share my code. Also, it is increasingly requested (by funding agencies and so on). Lastly, I just think it’s good scientific practice to show how you got to the results you are publishing.
- Dominik: To work together with fellow students / coworkers
- For collaboration with colleagues
- Provide code examples, e.g. for training
- muhammad: Supplement with the data produced, for reproducibility.
- Carlo: To let other researchers fork my repository and develop it in parallel to me, and vice versa
- Pia: Along with a published paper
- Sven: Parts of the code can be reused in other projects like code for special equipment
- Sonja: Helping others out with code I already made and vice versa
- Sonja: Requirements for publications
- Arne: development of big models
What worked well?
- Sebastian: Developing the approach to make it work for us, but it is a lot more and different work to make it work for others who do not know our code and approach
- Anna: well, eventually my colleague could reproduce the figure
- worked just fine but very messy
- Dominik: Essentially worked well, tooling makes things easy for simple mainstream tasks
- creating a package that others can also use
- Sending a jupyter notebook over mail worked - buuuut…
- Emily: Using GitHub for managing issues and feature updates
- Sadegh: Not really well :)
- salika: I am not sure whether it “worked well”
What obstacles did you encounter?
- Sebastian: The design logic of a software architecture that provides different functions that should work independently of each other. For every test user we had we encountered different problems
- Anna: incompatible program versions and missing libraries, even for such a simple setup as code in GMT (well-written and well-made code)
- messy code, want to make changes in it every time we see it, so so many versions
- generalizing the code to make it useful for other applications too
- Sonja: no version control
- Dominik: Only dependencies are sometimes difficult
- Pia: Not sure where to start, whether to include code for all preprocessing steps or whether to start from an intermediate level of processing
- in git: basically everything beyond git add, commit, push needs a lot of google
- Sarah: Infrastructure at the centers/institutes is not always well established.
- Sadegh: Having good structure; Sharing large data;
- Emily: Taking my personal codes and learning how to make them consistent with existing repositories
- after a while, code repositories tend to become a mess; especially when several people are working on it
- Carlo: not always sure which versions of the dependencies are needed. Sometimes some dependencies are not in the requirements and I have to figure out which ones they are by myself.
- salika: I am a bit unsure about the best practices here…
Introduction to the Workshop Example “Astronaut Analysis”
Small Python 3 script using pandas and matplotlib.
Source available at: https://codebase.helmholtz.cloud/hifis/software/education/hifis-workshops/foundations-of-research-software-publication/astronaut-analysis-data/-/tree/0-original-source



How should I publish my analysis code to enable others to check or reproduce my results?
1. Put your code under version control
2. Make sure that your code is in a shareable state
3. Add essential documentation
4. Add a license
5. Make your code citable
6. Release your code
Which version control system(s) do you currently use?
- git (github, gitlab)
- Sadegh: git
- Sarah: git (TortoiseGit & github)
- Dominik: subversion (svn), git (gitlab, github)
- Lars: git
- Sebastian: git
- Pia: git
- git
- git
- git
- Carlo: github and gitlab
- Arne: git
Which artifacts do you usually store in the repository?
- Dominik: Scripts, notebook, config files, environment files
- scripts + requirements
- scripts + readme
- Sadegh: codes, data, Readme
- test data to reproduce code and for tests
- scripts, requirement file, readme
- Sarah: Code, ReadMe, example data
- workshop material (R and shell scripts; documentation) (not a git workshop, haha)
Which artifacts do you exclude?
- OS specific files, venv, data, sensitive/internal info (depending on private/public repo)
- data sets
- Sadegh: test results
- data
- data
- results, specific examples
- bigger data sets, pictures and metadata (usually stored separately in other repo such as Zenodo)
- Sebastian: gitlab
- Our group is currently using GitHub for the tool development
- Sadegh: GitHub and GitLab
- recently used DVC
- git (github, gitlab)
- gitlab
- Dominik: gitlab, github, svn, (wandb, overleaf)
- gitlab
- gitlab and github
- github
- gitlab
- github, github
- gitlab
- gitlab
What are your experiences when working collaboratively with others?
- The experience with GitLab is a little bit messy, since it’s sometimes hard to separate the code; both people can be working on the same code and therefore have issues with pull requests
- Git is complicated if some people don’t know how to use it and others are good at it
- only minimal interactions through git - usually merge created some terrible problems and we had to roll everything back (=
- git is a bit annoying when you have conflicts the entire time =)
- I think git/gitlab works fine, but we did not have the situation yet that multiple people had to work on the same code simultaneously. I expect this to be a bit more complicated. I like working with the project management tools like milestones, issues and assigning tags, those are really helpful. In general, I think the learning curve for git/gitlab is very flat and you really have to want to work with it
- only minimal experience working collaboratively on code
- Sarah: I have only minimal experience working collaboratively since most of my coding projects are done on my own and only the data were “delivered” by other people.
- Dominik: Mostly alright. It would be nice to have someone who really knows a lot more than myself to learn from
- differences in experience among contributors
Step 1: Put your code under version control
You can follow using the material from episode 1.
Where should I store my code?
- Minimum: Use a local Git repository including some kind of backup
- Recommended: Use a code collaboration platform (e.g., GitLab, GitHub)
- Check your organizational policies where to store your code and whether you can use public code collaboration platforms
What belongs into the repository?
- Everything to make a usable version of your code:
- Source code
- Documentation
- Build scripts
- Test cases
- Configuration files
- Input data
- …
- Avoid adding generated artifacts:
- Third-party libraries
- Generated binaries
- …
- A .gitignore file helps you to control what goes into your repository. gitignore.io allows you to generate well-suited starting configurations.
What to do if data is too large?
- Recommended max. Git repository size: 1 Gigabyte
- Git extensions handling large data:
- Generic:
- git-lfs (code collaboration platform “standard”, central storage, usual Git workflow)
- git-annex (distributed storage, adapted Git workflow)
- Data analytics focused (storage, provenance + other features):
- DVC (distributed storage, workflow: additional command line tool + Git)
- Datalad (distributed storage, workflow: new command line tool instead of Git)
- Consider publishing large data sets separately:
- re3data.org provides an overview of public research data repositories
- Make sure to reference it in the Git repository in a reliable way
- Tools such as git filter-branch and BFG Repo-Cleaner allow you to remove unwanted or too large files from your Git repository history (use carefully!!)
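Before deciding between Git extensions and a separate data publication, it can help to see which files are actually large. The following is a minimal sketch, not part of the workshop material: the function name `find_large_files` and the 50 MiB default threshold are assumptions chosen for illustration.

```python
from pathlib import Path


def find_large_files(root, limit_bytes=50 * 1024 * 1024):
    """Return (relative path, size in bytes) for files above limit_bytes, largest first."""
    root = Path(root)
    large = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        # Skip Git's internal storage; only working-tree files are of interest here.
        if ".git" in relative.parts:
            continue
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                large.append((relative.as_posix(), size))
    # Largest files first, so the main candidates for separate storage show up at the top.
    return sorted(large, key=lambda item: item[1], reverse=True)
```

Calling `find_large_files(".")` in the repository root lists candidates that you may want to handle with one of the tools above or publish separately.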
Key Points
- Version control helps you to prepare the code for sharing.
- Make sure to put all relevant artifacts into the repository.
- .gitignore helps you to specify things that you do not want to share.
Further Readings
[Task]: Please take a look at the content of our example repository and particularly the main.py. What aspects would you change before sharing it?
- Add a ReadMe and a header to explain what the repo is for
- maybe store functions in separate files, even though they are tiny to keep main function clean. more comments on functionality overall
- More comments in main.py are required, particularly function descriptions +1
- Add requirements (libraries).
- Add docs
- Make more explanatory variable names (e.g. not just df)
- add license and author
Step 2: Make sure that your code is in a shareable state
You can follow using the material from episode 2.
General Hints
- Make sure others can run your code:
- No dependencies on internal resources (servers, storage, licensed software, …)
- No absolute paths
- Clearly state dependencies
- Organize your code and directory structure
- Do not share sensitive data like passwords, user accounts, ssh keys, internal IP addresses, etc.
- Get in touch with user groups and the community.
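The hint "No absolute paths" can be illustrated with a short sketch. It assumes a hypothetical layout in which the input data sits in a `data` folder next to the script; the function name `resolve_data_file` and the file name used below are illustrative and not taken from the workshop example.

```python
from pathlib import Path


def resolve_data_file(file_name, base_dir=None):
    """Build the path to a data file relative to the project instead of hard-coding it."""
    if base_dir is None:
        # Assumption: the data lives in a "data" folder next to this script.
        base_dir = Path(__file__).resolve().parent / "data"
    return Path(base_dir) / file_name
```

With this approach, `resolve_data_file("astronauts.json")` points to the right place on any machine that has a copy of the repository, whereas a path such as `/home/alice/project/data/astronauts.json` only works for one user.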
Improve your code style
- Strive for understandable code:
- Apply a code style - consistency is more important than convenience.
- Use specific and appropriate names for all artifacts. Do not fear refactorings.
- Do not overcomment.
- Read code of others for inspiration.
- Try pair programming and reviews (even if it is with your rubber duck).
- Use code checkers to sanitize your code.
- Linters or Checkers help to find poor code snippets and help to enforce coding styles.
- Available in many flavors for many (programming) languages.
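As a small illustration of specific names and restrained commenting, consider the following sketch. The function is hypothetical and not part of main.py; a terse variant such as `def calc(a, b)` would need comments to explain what the name and a docstring already say here.

```python
def calculate_age_in_years(birth_year, reference_year):
    """Return the age in years reached during reference_year.

    Raises ValueError if reference_year lies before birth_year.
    """
    if reference_year < birth_year:
        raise ValueError("reference_year must not be earlier than birth_year")
    return reference_year - birth_year
```

The docstring states the contract once, so no line-by-line comments are needed.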
Think about testing
- Small tests are done easily but already show effect.
- Automated tests work as an executable documentation.
- The earlier you start, the more tests you have at the end.
- A good starting point for your build automation/pipelines.
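A small test can look like the sketch below. The function `mean_of` is a hypothetical stand-in for a piece of analysis code; the test functions use plain `assert` statements and follow the `test_` naming convention that test runners such as pytest pick up by default.

```python
def mean_of(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


def test_mean_of_returns_average():
    assert mean_of([1, 2, 3]) == 2


def test_mean_of_rejects_empty_input():
    try:
        mean_of([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")
```

Both tests also document the intended behaviour: what the function returns for normal input and how it reacts to input it cannot handle.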
Key Points
- Make sure that others can (re-)use your code
- Do not share internals and secrets with your code
- Strive for understandable code
[Task]: Please think about your experiences with code documentation:
What experiences have you had with good or bad documentation?
- great example use cases with clear explanations of how the results were generated from one step to the next
- nice documentation on package dependencies and what you need to set up before getting started
- Very clear with example figures throughout. Any documentation better than no documentation
- Dominik: The documentation really comes into play when the “software” is not working out of the box. The documentation also sometimes feels like a bare minimum because the code needed to be published for a paper.
- Good experience: If the Docs have copy-ready commands I can paste to e.g. install dependencies
- Good: good examples on how to run the code with clear structure and easy to read code (good names, well commented, etc.) // Bad: Some people try to code as short as possible, which sometimes makes it hard to understand each line
- bad one: poor documentation about the data processing
- bad one: no docstrings describing the functions, but a lot of over-commenting most short code segments
What was important for you as a user?
- clear explanation of logic flow for complex tasks
- clear readme files
- Everything I need to know to get the code running
- If there is a single example script that just works out of the box
- Clear installation process, stable version (e.g. via conda)
- Knowing what the outputs should look like
- clear and concise steps on how to process the data before the analysis
- example dataset
- Dominik: A brief step-by-step README; ideally that should be enough to get things up and running
- clear instruction and descriptions
What was important for you as a contributor?
- Know who you are writing for
- Clear consistent structures
- Clear rules how to contribute
- Dominik: For in-depth understanding, the complicated parts need more explanation
What might be important for a researcher in our example?
- documentation of data origin, maybe statement if only subsets of the whole data sets are used. Documentation for which statistical tests are applied (if applicable)
- Knowing some of the assumptions/limitations of a simulation or model
- knowing where the data is from (with download date or version) and a brief description of it
Step 3: Add essential documentation
You can follow using the material from episode 3.
Mind your target groups
Typical Documentation Files
- README: The front page of your code, which you should create in any case.
- Provide a short overview (information of common interest…)
- Don’t hesitate to use templates or copy from other projects.
- CONTRIBUTING: Information about how to participate in development.
- CODE_OF_CONDUCT: Explains the ground rules for expected behaviour and participation for contributors
- LICENSE file or LICENSES folder: Shows the license(s) under which the material is provided
- CHANGELOG: Explains major changes
- CITATION: Explains how to cite the software in a scientific publication
Markup Languages
- Do not use Word for documentation.
- Do not use LaTeX for (code) documentation.
- Markdown is one of the most popular markup languages.
It initially focused on Web publishing but there are various extensions and tool chains which make it suitable for creation of complex, technical documentation as well.
However, the existing dialects and tool chains are a bit fragmented.
But if you focus on Web publishing it is one of the most used solutions today.
- AsciiDoc provides an easy to read and write syntax, a rich feature set for creation of complex, technical documentation, as well as a rich ecosystem.
- ReStructuredText became popular in the Python community.
In combination with its documentation generator Sphinx it is well suited for creation of complex, technical documentation.
[Task]: Please write a README file for our example.
- Think about what you want the audience to know about it! It is not important that all information is correct. Instead try to focus on the essential information.
- The basic Markdown syntax you need to know for this task is:
# Section title
## Subsection title
Normal text
A list with items
- item1
- item2
Wrap Up
Day 2
Welcome and Introduction
[Task]: Please think about your experiences with software and data licenses:
Which licenses (software, data) do you know and typically use when publishing content?
- Sadegh: Honestly, I have heard about some of them, but I have no clue what they are.
- CC licenses (or extensions such as CC-NY) or MIT
- Dominik: None -> zero knowledge about the topic
- salika: I don’t have much experience here
- GPL, MIT
- Carlo: no experience
- Emily: no experience
- in readme file and headers to script
- Carlo: When I create a repo it can be automatically added by github.
- Dominik: License file in repository
- Sarah: same as Carlo (Zenodo does this as well for data)
Does your home organization offer any support with licensing (software or data)? What kind of support?
- Sarah/Hereon: There is a guideline (since Jan 2025) and for proprietary licenses there is a technology transfer office that helps
- Philipp: There is a transfer department that helps to publish and distribute code or software in general
- Dominik: Most likely
Step 4: Add a license
DISCLAIMER: This information is no legal advice and solely reflects the experiences of the episode contributors. Please contact a lawyer or your organizational legal department, if you are in doubt and require solid legal advice.
You can follow using the material from episode 4.
Copyright
Software Licenses
- Software licenses are a standardized way to grant rights to others
- Two aspects:
- Licenses grant certain rights
- Licenses impose certain obligations (e.g., attribution, disclosure of source code). These usually apply when you distribute the code to third parties!
- Important:
- Only use code which is covered by a license
- Make sure that you cover your code under a suitable license!
Software License Types

- Proprietary Software Licenses: Non-standard licenses with specific licensing terms
- Public Domain: Copyright holders waive all their exclusive rights
- Free and Open Source Software Licenses: Imply the availability of the source code and allow open distribution, modification, and re-use
- Copyleft: “The world is evil.” (e.g., GPL-2.0, GPL-3.0) => disclose all source code of a distributed derivative work (viral effect)
- Permissive: “The world is good.” (e.g., MIT, BSD-3, Apache License 2.0) => only a few obligations, licenses are compatible and interoperable with most other licenses
Combining Modules under Different Licenses
Derivative Work or Combined Work?
- Important questions when dealing with copyleft licenses
- It depends on:
- the concrete license and its definition of a derivative work => GPL is quite strict
- the usage mechanism => copying, static linking, dynamic linking

- Strong copyleft licenses (e.g., GPL) make it hard to “achieve” a combined work!
License Incompatibility
- Exists when a program is a derivative work of components licensed under conflicting licensing terms.
- Typical problem with copyleft licenses
- Modules under conflicting licenses cannot be legally combined and distributed
Minimal Checklist
Ask for legal advice if you are unsure!
REUSE Live Demonstration
REUSE Overview
- Goal: Make it easy to determine license and copyright holders of a file for humans and machines!
- For more information: Tutorial, FAQ, Specification
Copyright and license decisions:
- Copyright holder: German Aerospace Center
- Source code: Apache-2.0
- Data set: CC0-1.0
- Documentation and plots: CC-BY-4.0
- Insignificant files: CC0-1.0
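One common way to record such decisions per file is a pair of SPDX comment lines at the top of the file. The sketch below only builds these two lines as text; the helper name `build_reuse_header` and the year 2025 are assumptions for illustration, while the copyright holder and license identifier are taken from the decisions above.

```python
def build_reuse_header(copyright_holder, license_id, year, comment_prefix="#"):
    """Return two SPDX comment lines stating copyright holder and license of a file."""
    return (
        f"{comment_prefix} SPDX-FileCopyrightText: {year} {copyright_holder}\n"
        f"{comment_prefix} SPDX-License-Identifier: {license_id}\n"
    )


# The year is an assumption; the source only names the holder and the license.
print(build_reuse_header("German Aerospace Center", "Apache-2.0", 2025), end="")
```

For a Python source file this prints:

```
# SPDX-FileCopyrightText: 2025 German Aerospace Center
# SPDX-License-Identifier: Apache-2.0
```

In practice the REUSE tooling can add and check such information for you, so a helper like this is mainly useful to understand what ends up in the files.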
Key Points
- Minimum: Add a license file and state the copyright holder
- Recommended: Follow the REUSE Specification
- Consider third-party licenses from the very beginning
Further Readings
[Task]: Please think about your experiences concerning attribution of software in research publications:
Did you cite or reference software that you used? Why or why not?
- I try to cite all packages that I used because I think that’s a nice and right thing to do if I can use their functions for free
- Yes, occasionally, when it had a significant role in my code or if I used it without changing it at all.
- I cited the paper on my repo, not vice versa.
- Yes, because I used it as a central part of my analysis. I have used the citation provided in their docs, such as https://scikit-learn.org/stable/about.html#citing-scikit-learn
- yes, for most of the ML-related papers. why: to give credits where it’s due. also, as a ref for the readers…
- yes, always, because it’s important to keep this information
- yes, to make the work reproducible
- Dominik: yes completeness
How did you cite or reference software?
- I cite the software packages/publications and used versions in the method section of my paper. Most packages are published as short papers or have DOIs (e.g via zenodo)
- "the analysis was performed with ToolName version + link to the published article if available, otherwise GitHub link "
- Usually, there are instructions on “How to cite”. Often it’s a separate publication with a DOI or other identifier.
- simply followed the package documentation.
- Use the bibtex citation and put as a regular reference, same as for papers.
- I just put the repo link in the ‘code accessibility’ section of the paper.
- Dominik: APA7 Style (almost) same as paper
Step 5: Make your code citable
You can follow using the material from episode 5.
The problem: How to cite software correctly? Software has no title page with metadata.
How to cite Software?
- Cite all software packages (including your own) in the reference list of your academic work.
- Ideally try to:
- Cite the software itself
- Cite the exact version of the software
- Cite the software using its unique identifier
- Cite the source code
- Cite the authors of the software
- Cite the release date of the software
- For more information see: Research software citation for researchers
Example
“The data sets and the notebook containing the analysis details have been published separately [11].”
References
- …
- [11] Schlauch, Tobias & Haupt, Carina. (2019). Analysis of the DLR Knowledge Exchange Workshop Series on Software Engineering (Version 1.2.0). Zenodo. https://doi.org/10.5281/zenodo.3403991
How to make your Software citable?
- Provide citation metadata
- Archive your software and obtain a persistent identifier (PID)
- Provide a prominent citation hint as part of your documentation
- You can directly manage citation metadata as part of your source code repository by providing a file codemeta.json (machine-readable information about your software including citation metadata) or providing a file CITATION.cff (human/machine-readable citation metadata).
- Another option is to let digital object repositories such as Zenodo manage your citation metadata.
- Recommendation: Manage citation metadata using a CITATION.cff file in your code repository
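To give an impression of what a CITATION.cff contains, the following sketch assembles a minimal file as text. The helper `build_citation_cff` is hypothetical, and the quoting is deliberately naive (values containing double quotes would need escaping). The keys shown are from the Citation File Format 1.2.0; the example values are taken from reference [11] above.

```python
def build_citation_cff(title, authors, version, doi):
    """Return the text of a minimal CITATION.cff file.

    authors is a list of (family_name, given_name) tuples.
    """
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f'title: "{title}"',
        f'version: "{version}"',
        f'doi: "{doi}"',
        "authors:",
    ]
    for family_name, given_name in authors:
        lines.append(f'  - family-names: "{family_name}"')
        lines.append(f'    given-names: "{given_name}"')
    return "\n".join(lines) + "\n"


citation_text = build_citation_cff(
    title="Analysis of the DLR Knowledge Exchange Workshop Series on Software Engineering",
    authors=[("Schlauch", "Tobias"), ("Haupt", "Carina")],
    version="1.2.0",
    doi="10.5281/zenodo.3403991",
)
print(citation_text)
```

For real projects, an editor such as cffinit is the safer route because it validates the entries; the sketch is only meant to show how little is needed for a first version.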
Authorship and Contributorship in Software
When determining the authors, please consider the following recommendations:
- There are no universally accepted guidelines for software authorship
- Roles other than programmers might be considered authors of a software. For example, testers, reviewers, technical writers, maintainers, release engineers, software architects, UX designers, etc., may all qualify for authorship.
- Decisions about authorship are project-specific, but must follow good scientific practice.
Refer to the ICMJE Uniform Requirements for papers, and translate the best practice to software, e.g.
- There must be no honorary authorship.
- The contribution to the software must be substantial.
- Final approval of the outcome may be covered by a CLA.
- Agreement to be accountable for (at least one’s own) changes may be factored into the decision.
- Contributors should be acknowledged, but are not software authors. Typical examples include issue reporters, typo fixers, evangelists promoting the software, and managers/PIs with no substantial contribution to the software itself. Consider using automation to keep track of contributors, for example, with the help of the All Contributors bot.
Archiving Software
- Archiving your software in a publication repository is crucial to ensure long-term availability
- Such a publication repository allows you to deposit your software and to obtain a PID (e.g., a DOI).
- You can use the obtained PID to persistently reference your software in a research publication.
- Some real-world examples:
Software Journals
[Task]: Please write a CITATION.cff file for a software or a data set using cffinit.
- Decide which software or data set you want to use. If in doubt, you can use the workshop example (see the links to the published version on Zenodo).
- Determine the relevant citation metadata for a specific software/data set release. This includes:
- Name
- Authors
- Identifier for the exact software version:
- Recommended: Persistent identifier such as the DOI
- Alternative: Repository URL + exact version (tag name, Git commit ID)
- Version number (e.g., 1.2.0)
- Publication date
- Create an initial CITATION.cff using the online editor cffinit.
What worked well for you?
- Philipp: Worked pretty smooth
- Dominik: Straight forward
- quite smooth flow to generate a .cff file.
- Sadegh: very user friendly.
- easy to generate the file, but I don’t have all the information they ask for
- cffinit worked nicely, I also appreciate the detailed descriptions of what is needed in each step
- The help/query feature is very useful
- straight forward
What problems did you encounter?
- Carlo: how do i check the version of my repo (from the git website) and the DOI? Not sure whether I should choose between DOI or URL…
- Dominik: Missing persistent information and identifier
- Can’t add multiple different licenses and unsure about whether to choose MIT/CC0-1.0
- unsure how to add preferred citation in extra cff fields
Key Points
- Cite all relevant software packages as well as possible in your academic work
- Make your code citable by adding citation metadata and archiving your software
- Encourage citation of your software
[Task]: Please think about how you handle reference versions of your software (e.g., specific version used in a paper, release):
How do you mark these reference versions?
- provide the DOI via zenodo and then state in the text which version of my software was used
- Explicitly name it, e.g. REFERENCE (Version 1.0.1)
- Mark as Milestone and freeze version
How do you share these special versions with others?
- If it is GitHub you can link a certain tag
- If you share it as a requirement, state it like “numpy==2.0.1”
- they can download the assigned version via zenodo. There is also a link to the gitlab repo, so they can download from there too. But maybe via zenodo is cleaner because there you can distinguish between different versions (I think)
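The pinned-requirement notation mentioned in the answers above can be generated from the environment that was actually used. This is a minimal sketch using `importlib.metadata` from the Python standard library; the helper name `pin_installed_versions` is an assumption, and which packages to list remains your decision.

```python
from importlib import metadata


def pin_installed_versions(package_names):
    """Return ("name==version" lines, names of packages that are not installed)."""
    pinned, missing = [], []
    for name in package_names:
        try:
            pinned.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Report missing packages instead of silently skipping them.
            missing.append(name)
    return pinned, missing
```

For example, `pin_installed_versions(["pandas", "matplotlib"])` returns the exact versions installed in the current environment, which can then be written to a requirements file that accompanies the reference version.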
Step 6: Release your code
You can follow using the material from episode 6.
Release Basics
- A release is a specific working software version.
- The release number uniquely identifies the release.
- A release tag marks the release in your source code repository.
- The changelog documents all notable changes (keep a changelog).
- A user uses the release package to install and use the release:
- Contains code + documentation
- Simplest form: snapshot of your source code repository packaged as ZIP file
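The simplest release package, a ZIP snapshot of the repository content, can be sketched with the standard library as follows. The function name `create_release_zip` and the default of excluding only the `.git` folder are assumptions; code collaboration platforms and `git archive` can produce comparable snapshots without any custom code.

```python
import zipfile
from pathlib import Path


def create_release_zip(source_dir, target_zip, excluded_dirs=(".git",)):
    """Package the files below source_dir into target_zip and return the archived names."""
    source_dir = Path(source_dir)
    target_zip = Path(target_zip)
    archived = []
    with zipfile.ZipFile(target_zip, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            relative = path.relative_to(source_dir)
            # Skip directories themselves and anything inside an excluded folder.
            if not path.is_file() or set(relative.parts) & set(excluded_dirs):
                continue
            if path.resolve() == target_zip.resolve():
                continue  # Do not add the archive to itself.
            archive.write(path, arcname=relative.as_posix())
            archived.append(relative.as_posix())
    return archived
```

The returned list makes it easy to check that documentation files such as the README actually ended up in the package.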
Minimal Release Checklist for Research Code
Basic release decisions for the “Astronauts Analysis”:
First you need to make some basic decisions for your research code.
For every new release do:
1. Prepare your code for release
- Define the current release number
- Update the documentation and citation metadata
2. Check your code
- Make sure that all important artifacts and information are up-to-date
- Make sure that your code works as expected
3. Publish and archive the release
- Mark the release in the source code repository using a (Git) tag
- Create the release package(s) and possibly publish it on a code distribution platform
- Archive the release package(s)
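Part of the "Check your code" step can be automated. The sketch below reports which documents do not yet mention the release number, for example the changelog or the citation metadata. The helper name `find_outdated_documents` is hypothetical, and the plain substring check is a simplification that a real project may want to tighten.

```python
def find_outdated_documents(release_number, documents):
    """Return the names of documents that do not mention release_number.

    documents maps a file name to its text content.
    """
    return [name for name, text in documents.items() if release_number not in text]
```

In practice you would read the texts from files such as the changelog and CITATION.cff before tagging, and only continue with the release once the returned list is empty.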
For our result, please see: https://codebase.helmholtz.cloud/hifis/software/education/hifis-workshops/foundations-of-research-software-publication/astronaut-analysis-data/-/tree/2024-03-20
Key Points
- Mark used, working software versions as releases using release numbers and tags
- Document important changes in a changelog
- Archive the release package
Wrap Up