3.1. Key aspects of scientific data processing and analysis

More than any other kind of software, code for scientific data processing and analysis should fulfil a number of criteria as far as possible:

  • reproducibility

  • traceability

  • testability

  • reliable code

I like to refer to these criteria as the key aspects of scientific data processing and analysis. Given that most scientists never received any formal introduction to programming, let alone software development, chances are not very high that the average code used in science meets these criteria.

Note that this is not meant as an offence, merely as a statement of fact. Furthermore, never having received any formal training in programming or software development is not a valid excuse for writing code that fails to meet the criteria otherwise regarded as the minimum standard for good scientific practice.

Many of the topics that we only touch upon here will be dealt with in more detail later in the part Getting it right. Hence you will find many references to the chapters there. However, writing code that “just works for me”™ cannot be considered scientifically sound. Hence the emphasis here on the prerequisites for scientific software.

3.1.1. Reproducibility

One core aspect of science is: others can reproduce my results. These “others” may well include my “future-me”. Without going into any detail of the discussions regarding reproducibility and replicability and the difference between those terms, it is important to realise that science requires me to work in a way that allows others to reproduce my results in some way, so that they can check whether my conclusions are valid or at least justified.

What does that mean in terms of scientific software? Ideally, I should be able to rerun the code that was used previously to get some results. Again, this may be very tricky (or even impossible) in a more general sense, particularly if enough time has elapsed between originally writing a piece of software and wanting to rerun it.

But as long as we usually fail to get our code to run on a different machine, we need not concern ourselves with archiving software and how to run it ten or twenty years from now. Prerequisites to reproduce results created by a piece of (scientific) software are:

  • Getting access to the code

  • Getting access to exactly the version of code used previously

  • Being able to install the code and its dependencies

  • Having some documentation at hand to not rely entirely on digging through the code itself
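The second and third prerequisites become checkable once the versions used for a result have been written down. The following is only a minimal sketch, assuming that these versions were recorded as a simple mapping of package names to version strings; the function names `installed_version` and `find_mismatches` are hypothetical and not part of any existing tool.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Compare recorded versions with those found in the current environment.

    ``pinned`` maps package names to the version strings recorded earlier.
    Returns a dict mapping each deviating package to (expected, found).
    """
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches
```

An empty result means the current environment matches what was recorded. Anything else tells you which packages to install or change before trying to rerun the code.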

Hence, you may want to have a look at the following topics from the Getting it right part:

3.1.2. Traceability

Sometimes we cannot (or do not want to) reproduce the results of a previous run of some code, but we still want to know what exactly has been done. In this regard, reproducibility and traceability are somewhat different. Even if I cannot fully reproduce something, I still want to be able to know in sufficient detail what has been done.

What does that mean in terms of scientific software? Our code should (automatically) create a protocol of what exactly happened with what exact version of code and dependencies. This includes documenting each implicit and explicit parameter of whatever function we’ve called.

Prerequisites are once more Version control (git) and Version numbers (SemVer) to document the exact version of the code used, and most probably Packaging. However, an automatically created full trace of what has been done requires more infrastructure, namely what is usually referred to as a “scientific workflow system”. For one such system (created and maintained by the author of this course), have a look at the ASpecD framework.
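To give a rough idea of what such an automatically created protocol could look like, here is a minimal sketch. It is not how ASpecD or any other workflow system works. The names `PROTOCOL`, `CODE_VERSION`, `traced` and `scale` are hypothetical and serve only as illustration. The sketch records each call of a function together with all its parameters, including those left at their default values.

```python
import datetime
import functools
import inspect
import platform

CODE_VERSION = "0.1.0"  # assumption: would normally come from your package
PROTOCOL = []           # assumption: a real protocol would be saved to disk


def traced(func):
    """Record each call of ``func`` together with all of its parameters."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()  # make implicit (default) parameters explicit
        PROTOCOL.append({
            "function": func.__name__,
            "parameters": dict(bound.arguments),
            "code_version": CODE_VERSION,
            "python": platform.python_version(),
            "timestamp": datetime.datetime.now(
                datetime.timezone.utc).isoformat(),
        })
        return func(*args, **kwargs)

    return wrapper


@traced
def scale(data, factor=2.0):
    """Hypothetical processing step: multiply each value by ``factor``."""
    return [value * factor for value in data]
```

Calling `scale([1.0, 2.0])` appends an entry to `PROTOCOL` that also contains `factor`, even though the caller never mentioned it. Such implicit parameters are the ones most easily forgotten when documenting by hand.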

3.1.3. Testability

Software is only as good as those who write it. And as we all make mistakes from time to time, and as software generally has a level of complexity that is hard to deal with, we had better write software with testability in mind. Yes, you can add tests afterwards, and this (still) is what is usually done. However, this approach comes with a number of problems, the most pressing being that treating testing as an option for later often results in tests not being written at all.

What does that mean in terms of scientific software? Realise that tests (in the sense of: small pieces of code that automatically check the behaviour of your code) for scientific software are not a “nice to have” option, but an important and integral aspect of scientific software development. This means that you set aside the time required to write those tests and do not consider a task done before appropriate tests are in place. And make yourself familiar with the idea that writing tests before the actual code (test-first approach) may be a sensible, productive and quality-ensuring way of developing scientific software. For more, have a look at the Testing chapter.
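As a small illustration of what “small pieces of code that automatically check the behaviour of your code” may look like, here is a sketch using the `unittest` module from the Python standard library. The function `normalise` is a hypothetical example. In a test-first approach, you would write the three tests first, watch them fail, and only then write the function.

```python
import unittest


def normalise(values):
    """Scale values so that the maximum absolute value becomes 1."""
    maximum = max(abs(value) for value in values)
    if maximum == 0:
        raise ValueError("cannot normalise data that are all zero")
    return [value / maximum for value in values]


class TestNormalise(unittest.TestCase):

    def test_maximum_of_normalised_data_is_one(self):
        self.assertEqual(max(normalise([1.0, 2.0, 4.0])), 1.0)

    def test_negative_values_keep_their_sign(self):
        self.assertEqual(normalise([-2.0, 1.0]), [-1.0, 0.5])

    def test_all_zero_data_raise(self):
        with self.assertRaises(ValueError):
            normalise([0.0, 0.0])
```

Saved in a file whose name starts with `test`, these tests can be run with `python -m unittest`. Each test checks one single behaviour and states that behaviour in its name, so a failing test tells you immediately what went wrong.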

3.1.4. Reliable code

We write scientific software to help us with getting our research done. Hence we write software we rely on for drawing scientific conclusions. That means that we should make sure as much as possible that we actually can rely on the software we use to be fit for purpose and of sufficient quality.

What does that mean in terms of scientific software? Code should be as readable as possible, it should be tested, and its source code should be (long-term) available, i.e. ideally published in some way or other.

Hence, you may want to have a look at the following topics: