3.2. Scripts vs. library

Often, the beginnings of scientific data processing and analysis are exploratory in nature. Therefore, a quick script importing, processing, and plotting data seems like a natural choice. However, proper data processing and analysis quickly gets quite complex, resulting in lengthy, complicated, and hard-to-read scripts that impair the quality of the original task (and hence, eventually, the science behind it). Reading (others’) code is always hard, even more so if the programmer doesn’t focus on readability. One characteristic of a professional programmer is having understood that code readability counts – much more than most other aspects of code.

3.2.1. A general decision

From my own experience, there seem to be two general approaches to data processing and analysis: one script per dataset, or developing a library of functions that gets used for all datasets. Of course, with the latter approach you usually need a small script as well, “gluing” together the calls to the library functions. But ideally, this script will consist only of calls to functions with telling names, hence making for a good read.

3.2.1.1. One script per dataset

The idea behind it: For each and every dataset we process and analyse, we create a script that lives next to the data and contains everything we need. Ideally, this greatly enhances portability and reproducibility. All we need are the data and the script.

Perhaps the biggest advantage of this approach: We know how to begin. Usually we start with importing the data, even if we need to manually export them beforehand to plain ASCII files containing only numbers, due to a lack of proper importer routines. Next, we implement all the different processing and analysis steps, probably intertwined with plotting and print statements for the relevant output.
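
To make this concrete, here is a minimal sketch of what such a script often looks like; the file names and column layout are purely hypothetical:

    # analyse_sample42.py – one script living next to one dataset (hypothetical names)
    import numpy as np
    import matplotlib.pyplot as plt

    # Import: data were exported beforehand to a plain ASCII file with two columns
    data = np.loadtxt("sample42_export.txt")
    x, y = data[:, 0], data[:, 1]

    # Processing: subtract a constant background estimated from the first 100 points
    y = y - y[:100].mean()

    # Analysis: print the relevant output
    print("Maximum at x =", x[y.argmax()])

    # Plotting
    plt.plot(x, y)
    plt.xlabel("x (a.u.)")
    plt.ylabel("intensity (a.u.)")
    plt.savefig("sample42_overview.pdf")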

Nowadays, Jupyter notebooks offer a similar approach, combining an interactive command line with documentation and plots. While probably not the right choice for programming and software development as such, they are extremely useful for exploratory data analysis and prototyping. If you want to script and can use Jupyter notebooks, always use them.

Besides being a straightforward approach to begin with, the big promises of “one script per dataset” are proper documentation of all necessary parameters – everything is in the script – and a great deal of independence – only data and script are required to reproduce what has been done. I seriously doubt the validity of either argument; for some details, see the section “Replaying is not reproducing” below. However, there are a number of other, probably more serious problems with this approach.

Data processing and analysis is an inherently complex endeavour. The only working approach to complex problems is to divide them into smaller sub-problems and to repeat this process until we end up with a problem we are able to solve. Once that is solved, we can move up one level and tackle the next, more complex problem. What does that mean for programming? The first level of organisation in our code is usually functions. A function encapsulates our solution to one problem and can be called from the outside. Given a proper name, we can just call it and need not think about the actual implementation – an instance of Separation of concerns. But where do we place functions in our scripts? There are even programming languages that do not allow functions to reside in scripts at all (as they tie function name and file name together). Some interpreted languages require functions to be declared in the script file before they are called. None of this helps with readable scripts, let alone that any reasonably complex function will usually be longer than just a few lines of code – and we will need a lot of them.
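
To illustrate what dividing a problem into functions buys us, here is the background correction from the hypothetical script above, extracted into a function with a telling name; the calling code now reads as a description of what happens rather than how:

    import numpy as np

    def subtract_constant_background(intensity, n_points=100):
        """Subtract a constant background estimated from the first n_points values.

        Wrapping the step in a function turns the implicit choice of where the
        background is estimated into an explicit, documented parameter.
        """
        return intensity - np.mean(intensity[:n_points])

    # Calling code (hypothetical file name as above):
    intensity = np.loadtxt("sample42_export.txt")[:, 1]
    intensity = subtract_constant_background(intensity, n_points=100)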

Putting all these arguments aside for a moment: How do we keep track of which version of our analysis script is the most current one, with the fewest mistakes and bugs? Yes, that is a matter of personal organisation. But the equivalent of printing out the most current version and placing it on the desk (i.e., placing the script or a copy or a link on the electronic desktop) usually doesn’t work very well.

To make it short: As soon as you start to implement complex functionality, i.e. start to write functions, you are developing software. And you cannot sensibly develop software in a script. Or perhaps you can, but the costs are prohibitive, and you will most probably not be able to use all the tools that professional software engineers have developed over the last several decades to help with exactly this problem: developing software that is by far too complex to fit in our heads.

3.2.1.2. Library of processing and analysis functions

The idea behind it: Data processing and analysis is a recurring and complex task, so better to develop, once, a series of multi-purpose routines that take care of it and to use them for each individual dataset thereafter. Ideally, the library is highly modular and gets reused and further developed over time, saving us and others time.

Perhaps the biggest advantage: Thanks to the modular approach, we can start with those small problems we think we understand reasonably well and are able to solve. Thus we can slowly build a larger and larger library of functions, step by step. Furthermore, this approach allows us to use all the tools and strategies developed over the course of the last decades by professional software engineers to cope with the inherent complexity of programming and software development.
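
As a sketch of how such a library might grow, assuming a purely hypothetical package name and module layout:

    # Hypothetical layout of a slowly growing package "myspectra":
    #
    #   myspectra/
    #       __init__.py      # defines __version__
    #       io.py            # importers for the file formats we actually use
    #       processing.py    # background correction, normalisation, ...
    #       analysis.py      # peak finding, fitting, ...
    #       plotting.py      # standard plots for reports and papers
    #
    # Each module starts small – one well-understood problem at a time.
    # The contents of myspectra/processing.py might begin with:
    import numpy as np

    def normalise_to_maximum(intensity):
        """Scale the data such that their maximum equals one."""
        return intensity / np.max(intensity)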

Congratulations, now we are officially in the realm of software development. Admittedly, most scientists have never received any formal training in how to program or develop software. Nevertheless, if you are writing software to process and analyse data (rather than only using such software), there is no real way around it: you need to learn at least sufficient basics to create code that somewhat complies with scientific standards.

What else do we need to take care of when writing a library instead of one script per dataset? The relationship between the library functions and the dataset analysed is less obvious, let alone the exact version of the library routines used for an analysis. The library version may (and often will) affect implicit parameters that are set within a library function yet affect the result. Hence, to ensure a sufficient level of reproducibility, we need to create code that automatically writes a protocol of what happened to which dataset, with which version of our library and which implicit and explicit parameters. Ideally, this protocol is both human-readable and replayable. Creating such software is a major undertaking, but at least it is possible.
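
A minimal sketch of what such automatic protocol writing could look like, using only the Python standard library; the write_protocol function and all names are made up for illustration, and real solutions do considerably more:

    import datetime
    import importlib.metadata
    import json
    import platform

    def write_protocol(dataset, step, parameters, filename="protocol.json"):
        """Append a human-readable record of one processing step to a protocol file."""
        record = {
            "timestamp": datetime.datetime.now().isoformat(timespec="seconds"),
            "dataset": dataset,
            "step": step,
            "parameters": parameters,  # implicit defaults made explicit
            "python": platform.python_version(),
            # Exact versions of everything the analysis relies on; add your own
            # library here as well once it is installed as a package.
            "packages": {name: importlib.metadata.version(name)
                         for name in ("numpy", "matplotlib")},
        }
        with open(filename, "a", encoding="utf-8") as file:
            file.write(json.dumps(record, indent=2) + "\n")

    write_protocol("sample42_export.txt", "subtract_constant_background",
                   {"n_points": 100})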

3.2.2. The case for libraries

Libraries are the only viable way to deal with the inherent complexity of scientific data processing and analysis. The reason Python is so widespread in science nowadays is simply that a series of excellent general-purpose libraries addressing many scientific questions exists, ready to be used by your code. Nevertheless, as is inherent to science, the problems you are interested in are typically special enough to deserve some purpose-built piece of software.

As soon as you find yourself importing a file containing a set of functions into your scripts (or Jupyter notebooks), you are using a kind of library, even if you may not have realised it. And with this comes the responsibility to ensure reproducibility: you need to use version control and version numbers, and to log the exact version number of your library (file) when using it. There is nothing particularly special about this – just that you do not want to reinvent the wheel over and over again, and you can benefit from people having spent a lot of time thinking about what actually is necessary to ensure something like reproducibility and good scientific practice.

Libraries allow you to apply all the strategies and tools available for professional software development, such as version control, version numbers, automated testing, refactoring, and documentation, to name but the most important aspects. What is even better: Due to the modularity, you can focus on one aspect at a time, which fits much better into the daily schedule of the average scientist whose main task is usually not to develop software, but to do science. This is not to say that developing scientific software is not an integral part of science, but it is still often not seen as such. Besides that, a decent understanding of the problem domain (the science, the actual tasks at hand that the software should help carry out) is an essential requirement for implementing the right solutions. Note that doing something right and doing the right thing are not necessarily the same.
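
As an example of just one of these tools, here is a minimal automated test for the hypothetical normalisation function sketched above, runnable with pytest:

    # test_processing.py – run with: pytest
    import numpy as np

    from myspectra.processing import normalise_to_maximum  # hypothetical module from above

    def test_maximum_is_one_after_normalisation():
        intensity = np.array([0.5, 2.0, 4.0, 1.0])
        assert np.max(normalise_to_maximum(intensity)) == 1.0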

3.2.3. Replaying is not reproducing

An argument heard often: But if I use one script per dataset and load the dataset (and all relevant libraries) at the beginning, then my data analysis is fully reproducible. I would disagree: In the best case, you can replay your data analysis. If you happen to use Jupyter notebooks and extensively document the individual tasks, technically speaking you may be able to figure out what happened.

However, having all relevant information at hand from a technical perspective does not necessarily help with comprehending what has been done and reproducing it. Part of science – and of any task, to that end – is to separate the important information from the irrelevant bits, at least if we want to understand things. What does that mean in the context of scientific data processing and analysis? We would like to have an easy-to-read protocol of what happened to which data, including all relevant implicit and explicit parameters and a list of requirements (i.e., libraries and the like, including their exact version numbers) for reproducing the analysis.

But my Jupyter notebook comes close, you may say. Maybe – but can you be sure that you have not missed any relevant information? And if you document extensively, how do you ensure that the documentation stays up to date with the actual code it is documenting?

3.2.4. Is there some middle ground?

Now you may ask yourself: Is there some middle ground between scripting and actual software development? We cannot turn every scientist who needs to program some piece of software into a proper software developer, but we cannot prevent them from writing software either. Well, perhaps there is some sort of middle ground, and there are ways and strategies for getting there.

One crucial aspect is awareness: Programming and developing software is a complex task, as is any other part of science. As such, it needs proper training. Therefore, making both programming and software development part of university curricula may be a long-term option. In the shorter term, and probably much more locally, valuing those who know how to properly program and develop software in a scientific context and letting them pass their knowledge on to colleagues and students will help a lot as well. Similarly, judging the time a student spends trying to understand a programming problem or to learn a software development strategy as time well spent will encourage these people, eventually resulting in higher-quality code that meets scientific quality standards.

Another aspect: Don’t reinvent the wheel. It has been invented often enough already (and sadly continues to be reinvented). What does that mean in terms of scientific software development? There are libraries and frameworks that help you get things done. No sane person will reimplement, let’s say, a Levenberg–Marquardt algorithm for fitting simulations to data, except for purely educational reasons.

Why not use a framework for scientific data processing and analysis to build your own libraries? Full disclosure: I’m the author of the mentioned framework. And I invite you to have at least a look at the ideas behind it before opting to reinvent the wheel. The idea behind this framework (and others): focus on the actual tasks to be done and let the framework cope with the rest. This is again an instance of Separation of concerns. In the context of scientific data processing and analysis: you know your data best, and you know how to process and analyse them. Hence, focus on designing the workflow as such, and perhaps on implementing individual processing and analysis steps, if they are not already available in the framework. But let the framework cope with the reproducibility aspect, automatically writing a full protocol of each individual step, including all implicit and explicit parameters and a list of all libraries used with their exact versions.
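
What working with such a framework can feel like, sketched here with a toy stand-in rather than the interface of any existing framework: the scientist only describes the workflow; the framework executes it and writes the protocol.

    # Toy illustration of the idea, not the API of any real framework.
    import json

    class Recipe:
        """Minimal stand-in: collect processing steps and log everything."""

        def __init__(self, dataset):
            self.dataset = dataset
            self.steps = []

        def add_step(self, name, parameters=None):
            self.steps.append({"step": name, "parameters": parameters or {}})

        def execute(self):
            # A real framework would dispatch each step to actual processing code
            # and record library versions as well; here we only write the protocol.
            with open("protocol.json", "w", encoding="utf-8") as file:
                json.dump({"dataset": self.dataset, "steps": self.steps}, file, indent=2)

    recipe = Recipe(dataset="sample42_export.txt")
    recipe.add_step("subtract_constant_background", parameters={"n_points": 100})
    recipe.add_step("normalise_to_maximum")
    recipe.add_step("plot", parameters={"filename": "sample42_overview.pdf"})
    recipe.execute()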

The power of good frameworks: They provide powerful abstractions that let you focus on the problem domain, and ideally, they come with user interfaces that relieve you of the need to program yourself in order to analyse your data. And no, don’t think in terms of Graphical User Interfaces (GUIs) now. There are other (and sometimes better) approaches, particularly when focussing on reproducibility.