3.3. The core: an intellectual model

The actual implementation of any software is always transient and subject to change. What really matters is our understanding of the underlying processes and tasks that we write software for in the first place. Phrased slightly differently, at the core of our software should be an intellectual model of the data and processes. Arriving at this intellectual model requires us to understand, as thoroughly as possible, what we are about to do – in a scientific context.

In the terminology of the software engineering literature, these are the “business rules”, and their implementation in code forms the innermost layer of our software architecture (cf. [Martin, 2018]). Software development should be informed by an understanding of the problem domain and driven by its requirements. This is what led people to coin the term “Domain-Driven Design” (DDD) [Evans, 2004] and make it a paradigm for developing software.

In science, software developers are usually the first and principal users of their own software. Hence, they should ideally have a solid understanding of the problem domain and face fewer problems communicating this domain, its requirements and its limits to the software developer, given that domain expert and developer are one and the same person. On the other hand, understanding the problem domain is always to some extent a function of personal experience. Nevertheless, as scientists we should be familiar with, and used to, getting to grips with a complex and complicated topic or domain, and we should have developed strategies for acquiring the necessary information, phrasing questions and designing processes to answer the questions we have.

What do we mean by “intellectual model”? As usual, the answer depends on your actual situation, your needs and goals. Below are three attempts to explain, at different levels of abstraction, what such an “intellectual model” may look like.

3.3.1. The large picture: data processing and analysis

The goal: A modular and extensible framework for fully reproducible data processing and analysis that follows scientific standards, both in documentation and code quality. You want to specify a list of datasets that should be operated on, followed by a list of (arbitrarily complex) operations on these datasets. This “recipe” for your data processing and analysis should be readable (and writable) without programming experience, and working through it should automatically produce a sufficiently detailed protocol for reproducibility. Ideally, this protocol can be used again to replay all steps.
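
To make the idea of such a “recipe” a bit more tangible, here is a minimal sketch of what one could look like, written here as a plain Python dictionary; all keys, task names and file names are purely illustrative and do not reflect the format of any particular framework. A structure like this maps almost one-to-one onto a human-readable format such as YAML, and a driver program can both execute it and write back an enriched copy as the protocol of what actually happened.

```python
# Hypothetical "recipe": a declarative description of which datasets to
# operate on and which operations to apply to them, in which order.
recipe = {
    "datasets": [
        "measurements/sample-01.txt",
        "measurements/sample-02.txt",
    ],
    "tasks": [
        {"kind": "processing", "type": "BaselineCorrection",
         "parameters": {"order": 1}},
        {"kind": "analysis", "type": "PeakFinding",
         "parameters": {"threshold": 0.05}},
        {"kind": "representation", "type": "LinePlot",
         "parameters": {"filename": "overview.pdf"}},
    ],
}
```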

What are possible ingredients of an abstract intellectual model of such a framework for reproducible data processing and analysis? A first attempt may be to distinguish between the object (the data that are operated on) and the subjects (the actual operations performed on the data). To ensure both reproducibility and modularity, it might be a sensible idea to make the data responsible for logging what happened to them. This results in a data structure containing both numerical data and metadata (here: the history of operations applied to the data). Let’s call it a dataset. Next, you may realise that operations on data can either change the data themselves or yield other information (including a new dataset). You may hence distinguish processing steps, acting on the data of a dataset directly, from analysis steps, yielding some other kind of information, from a Boolean value to an entire new dataset. But what about representations of data, be it graphical or tabular? Representations clearly make for another type of operation on data(sets), and you can next distinguish between representing a single dataset and representing multiple datasets. Things easily get arbitrarily complex. If you are interested in more details and a framework for the reproducible processing and analysis of (spectroscopic) data, have a look at the ASpecD framework. Full disclosure: I’m the author of the mentioned framework. And I invite you to have at least a look at the ideas behind it before opting to reinvent the wheel.
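
To make these abstractions a bit more concrete, here is a minimal sketch in Python; the class and attribute names are chosen purely for illustration and are not the API of the ASpecD framework mentioned above.

```python
import datetime


class Dataset:
    """Unit of numerical data and metadata that logs its own history."""

    def __init__(self, data=None):
        self.data = data if data is not None else []
        self.metadata = {}
        self.history = []

    def process(self, step):
        """Apply a processing step that changes the data in place."""
        step.perform(self)
        self._log("processing", step)

    def analyse(self, step):
        """Apply an analysis step that yields some other kind of information."""
        result = step.perform(self)
        self._log("analysis", step)
        return result

    def _log(self, kind, step):
        self.history.append({
            "kind": kind,
            "name": type(step).__name__,
            "parameters": step.parameters,
            "timestamp": datetime.datetime.now().isoformat(),
        })


class ScalarMultiplication:
    """Example processing step: scale the data by a constant factor."""

    def __init__(self, factor=1.0):
        self.parameters = {"factor": factor}

    def perform(self, dataset):
        dataset.data = [value * self.parameters["factor"] for value in dataset.data]
```

The crucial point is that the dataset itself, not its user, keeps the record of what has been done to it, which is what makes the history a reliable basis for an automatically written protocol.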

3.3.2. A smaller picture: fitting a model to data

The goal: Fitting a parameterised mathematical model (simulation) to data in a modular and reproducible way, letting you easily replace both the model and the fitting algorithm, and giving you control over all the parameters involved in models as well as in fitting algorithms and strategies.

What are possible ingredients of an abstract intellectual model of such a fitting framework? Besides the data (in the simplest of all cases, just a vector of numerical values) and (the name of) a simulation routine, we have two very distinct sets of parameters: one set for the mathematical model, our simulation routine, the other for the fitting itself. Finally, we have the result of the fitting, i.e. some parameters that help us evaluate the quality of the fit, together with the final set of parameters of the simulation routine. Here again, things can get quite complicated very quickly. Particularly if you add the requirement of full reproducibility and the desire to have a protocol automatically written for you that you can easily replay (perhaps after some modifications) to start the whole process again, you end up with a framework for fitting and may even want to make it part of a larger framework for reproducible data processing and analysis. Voilà, here you go: FitPy. Full disclosure again: I’m the author of this fitting framework as well. And of course it integrates with the ASpecD framework. After all, fitting is an analysis step, though a complicated one.
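
A minimal sketch of how these distinct ingredients could be kept apart, using Python dataclasses; all names are purely illustrative and not the API of FitPy:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FittingTask:
    """Keep the distinct ingredients of a fit cleanly separated."""

    data: list = field(default_factory=list)              # the data to fit the model to
    model: Optional[Callable] = None                       # simulation routine, e.g. f(x, **model_parameters)
    model_parameters: dict = field(default_factory=dict)   # parameters of the mathematical model
    fit_parameters: dict = field(default_factory=dict)     # algorithm, bounds, tolerances, strategy


@dataclass
class FittingResult:
    """Everything needed to judge the quality of a fit and to reproduce it."""

    best_parameters: dict = field(default_factory=dict)    # final parameters of the simulation routine
    quality: dict = field(default_factory=dict)            # e.g. reduced chi-square, residuals
    task: Optional[FittingTask] = None                      # the task that produced this result
```

Keeping model parameters, fitting parameters and results in separate structures is what allows you to swap the model or the fitting algorithm without touching anything else.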

3.3.3. An even smaller picture: plotting data

The goal: Create publication-quality (parameterisable) standard graphical representations (plots) for a series of datasets, ideally without having to think ourselves about things such as appropriate axis labels or the best standard representation for a given type of data.

What are possible ingredients of an abstract intellectual model to plot data? First of all, you need a data structure containing both numerical data and a minimal set of metadata for automatically generating appropriate axis labels. Furthermore, axis values need to be either contained in the numerical data (in the case of 1D data, typically as a first or second column) or stored separately in the data structure as vectors of numerical values. Given different types of data and different kinds of plots, you probably aim at writing a series of modular routines for the actual plotting and one generic plotting routine that dispatches the task to a concrete plotting routine depending on the kind of data and type of plot. Lastly, you will be interested in some (probably hierarchical) structure containing the plot parameters that control the appearance of your graphical representation. Plotting can become arbitrarily complex. Have a look at the number of parameters you can set for each of the plotting routines contained in the Matplotlib framework if you are not yet aware of this.
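
A minimal sketch of such a dispatching plotter, assuming a simple dataset object with data, axes and metadata attributes; the attribute names, the “drawing” parameter key and the deliberately naive dispatch logic are purely illustrative:

```python
import matplotlib.pyplot as plt


def plot(dataset, parameters=None):
    """Dispatch to a concrete plotting routine depending on the shape of the data.

    Assumes a dataset with ``data`` (a NumPy array), ``axes`` (a list of
    axis value vectors) and ``metadata`` (a dict with quantities and units).
    """
    parameters = parameters or {}
    figure, axes = plt.subplots()
    if dataset.data.ndim == 1:
        axes.plot(dataset.axes[0], dataset.data, **parameters.get("drawing", {}))
    else:
        axes.imshow(dataset.data, **parameters.get("drawing", {}))
    # Axis labels are generated from metadata rather than hard-coded by the caller.
    axes.set_xlabel(f'{dataset.metadata["x_quantity"]} / {dataset.metadata["x_unit"]}')
    axes.set_ylabel(f'{dataset.metadata["y_quantity"]} / {dataset.metadata["y_unit"]}')
    return figure
```

A real implementation would probably dispatch to separate plotter classes rather than branching inside one function, but the principle of metadata-driven labels and shape-driven dispatch stays the same.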

Admittedly, libraries such as Pandas go a long way down this road already, and they use pretty much the same abstractions described above: a dataset (maybe named differently) containing both data and relevant metadata, as well as a highly configurable plotting mechanism with sensible defaults for different situations and for different shapes of the actual data.

However, we haven’t yet talked about things such as strategies for automatically generating consistent publication-quality plots (reproducibility), ways to customise and consistently apply those customisations to your plots, e.g. depending on the output medium, and tracking which data belong to which plot (traceability). The last bit can be particularly tricky, given that paths in a file system are usually not guaranteed to be stable. But that is a topic going far beyond software development.

A large part of the requirements mentioned above has been addressed by the plotting capabilities of the ASpecD framework: besides providing a lot of highly configurable general-purpose plotters, it lets you easily adapt and extend its capabilities while focussing on the actual plot rather than on all the other aspects, such as reproducibility and traceability.

3.3.4. Things that usually do not belong here

Having given a few examples of what an “intellectual model” forming the core “business logic” of our software application could be, let us now turn to a few things that you may find at the core of a piece of software, but that definitely do not belong there:

  • File formats

    Often, particularly when developing software for processing and analysing scientific data, we are tempted to start with importing routines for the actual data. Understandable as this is, it usually is a mistake, as the constraints of storing information in a file on the file system are entirely different from the criteria for representing the same data within our application. File formats are a necessary, though peripheral, detail of our software architecture (see the sketch after this list for one way to keep them at the periphery).

  • Database

    A classic in the design of many software products: Information is stored in a database, hence the entire software is developed around this database. Still, it is a mistake. First of all, are you really sure you need a database, with all the dependencies and constraints it brings, such as a client-server architecture and a dependence on external infrastructure? Second: What if you want to change the database – for whatever reason? As with file formats (and user interfaces), a database is a peripheral (and not always necessary) detail of your software architecture. Design your software without a database and delay the decision which database to use until as late as possible.

  • GUI

    Another classic: Software is developed with a graphical user interface in mind, and you quickly end up with code that both deals with the details of your GUI and processes your data. That’s an excellent way to create highly unmaintainable software. As with file formats and databases, the user interface is a peripheral, though clearly necessary and important, detail of your software architecture. If you design your software with modularity in mind, you can easily provide entirely different user interfaces, as all your actual business logic and the “real stuff” happens in routines completely unaware of any user interface. This is how things should really be.
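
One way to keep such peripheral details out of the core is to let the core define the interfaces and have importers, storage backends and user interfaces merely implement them. Below is a minimal sketch of this idea for the file-format case; the class names are hypothetical and not taken from any of the frameworks mentioned above.

```python
class Dataset:
    """Core data structure: knows nothing about files, databases or GUIs."""

    def __init__(self):
        self.data = []

    def import_from(self, importer):
        importer.import_into(self)


class DatasetImporter:
    """Interface defined by the core; peripheral details implement it."""

    def import_into(self, dataset):
        raise NotImplementedError


class TxtImporter(DatasetImporter):
    """Peripheral detail: knows about one particular (hypothetical) file format."""

    def __init__(self, source=""):
        self.source = source

    def import_into(self, dataset):
        with open(self.source) as file:
            dataset.data = [float(line) for line in file if line.strip()]
```

The dependency points from the periphery towards the core: supporting a new file format, storage backend or user interface means adding another implementation, not touching the core.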

There are clearly more examples of things that do not belong to the core of your application, i.e. the intellectual model of the data and processes that made you start developing software in the first place. But the ones above may help you get an idea of what really matters.

3.3.5. Wrap-up

This whole chapter is a plea to think before you start hacking some piece of software together that might or might not work, and to continue thinking about the problem you are addressing and how best to address it in code. Science is complex, and so is scientific data processing and analysis. The creative component and the tinkering come in when we have a set of tools available that we are familiar with and know how to use. And from and with this basic set of tools, we create new, more advanced tools. When we start, we start with a coarse-grained, rough understanding or idea, and we further develop our understanding of the problem (domain) as we go. Similarly when developing software: we start with a (hopefully somewhat reasonable) vague idea, and the more we implement and think about the task, the more we understand and the better our intellectual model gets. Just do yourself the favour of looking left and right from time to time, so as not to reinvent too many wheels along the way.