4.4. Testing

As mentioned in the section on debugging, there is a clear difference between debugging and testing: When you debug, you try to pin down the root cause of unwanted or unexpected behaviour of your code. When you write tests, you specify the expected behaviour in the form of executable code (i.e. tests). As we will see later, the first thing you should usually do once you have pinned down a bug, i.e. arrived at a clear idea of what you expected your code to do, is to write a test that ensures this bug will from now on be caught by your (automated) tests.

Admittedly, systematically testing software, i.e. writing tests that can be executed automatically and that check that your code adheres to the specifications the tests provide, is something few people outside the software development business do. Nevertheless, particularly in science, it is pretty important and the only way to be reasonably sure that your software does what you expect it to do.

4.4.1. The why, when, and what of tests

Before we look into how to actually write automated tests using the Python unittest framework, we briefly answer the three essential questions: why, when, and what to test. Testing is an essential aspect of serious software development. Remember that all programming in science that is done to help the actual science get done has basically the same requirements as software development in terms of quality, reliability, and maintainability.

Important

Rigorously testing scientific software you rely on for your scientific conclusions by means of automated (unit) tests that form an executable specification of the code is the only way to make reasonably sure that we don’t make mistakes and come to false conclusions. While software can generally not be guaranteed to be bug-free, it is our responsibility to do our best to ensure its quality, reliability and correctness. Most software written by the average scientist does not live up to these standards.

4.4.1.1. Why tests?

Tests? I’ve never written proper tests for my software, you may say. Neither did I, for a long time. So what are the reasons, particularly from the perspective of a scientist, to start, continue, and extend writing tests?

  • Reliable specification

    Tests are executable specifications. One big advantage of software is that you cannot be vague, and that you cannot argue with your compiler/interpreter. Whether you managed to code what you intended to code is a different matter. But software is usually intrinsically deterministic and not ambiguous.

    Tests are a way to specify what the code should do given certain presuppositions. Pretty much like in logic and in the scientific method, you come up with conditions as prerequisites, and you specify an expectation for the result given that the preconditions are met. And voilà, that is the basic structure of a (unit) test written in code (for a minimal sketch, see the example following this list).

    If tests are written in a concise way (and they had better be), they form part of a human-readable specification document that has one big advantage over documentation: tests are executed and hence less prone to get out of sync with the actual code. Documentation may be outdated. Tests will give you immediate feedback on whether the code agrees with the specification they represent.

  • Code quality

    Code written with tests in mind (test-first) is in nearly all cases better code. What characterises good code? Readability, modular and extensible design, maintainability. All of these are more likely to characterise your code if you write it with tests in mind, ideally test-first (we will repeatedly come back to this below).

    You may disagree, and I won’t argue. The experience of many software developers and consultants – as well as my own experience – speaks for itself.

  • Liberation

    Writing tests (first) is as liberating as using a version control system. Properly using a version control system keeps you always only one step away from a working version of your codebase, allowing you to tinker with your code and break it in any way you can imagine, just to try things out.

    In addition, having tests in place that you can run to ensure that the (tested) behaviour of your code has not changed when changing the code allows you to make even large changes that you would not dare dream of without the safety net your tests provide.

  • Prerequisite for refactoring

    Tests are a necessary prerequisite for refactoring and hence for addressing software entropy. Refactoring in a nutshell is “changing the code for the better without changing its behaviour”. Code has to change over time to adapt to changing requirements; that is the raison d’être of software. Furthermore, quite naturally, accumulating changes usually degrade the code quality in terms of readability (e.g., functions growing too large), design (e.g., modularity), and maintainability – a phenomenon known as “software entropy”. However, this is neither a law of nature nor is it irreversible. Refactoring is the way to increase the quality of your codebase, and to keep its quality high, by making little changes for the better all the time you work on your codebase, e.g. when introducing new functionality, fixing bugs, and the like.

  • Reliability

    Particularly in the scientific context, software needs to be as reliable as possible, as we base our conclusions more and more on the results of our programs. Yes, software bugs can cost lives. But not potentially impairing the well-being of people is no excuse not to do our best to ensure the reliability and correctness of our software. The only chance we have towards this goal is to write tests that provide a reasonably complete specification of the behaviour of our code.
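
To make the precondition–expectation structure mentioned above concrete, here is a minimal sketch of a unit test, assuming a hypothetical function average() in a module mymodule:

import unittest

import mymodule


class TestAverage(unittest.TestCase):
    def test_average_of_known_values(self):
        # Precondition: a list of values whose mean is known beforehand
        values = [1.0, 2.0, 3.0]
        # Expectation: the arithmetic mean of these values
        self.assertEqual(mymodule.average(values), 2.0)

The mechanics of writing and running such tests are covered in detail further below.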

If there are so many (good or at least sensible) reasons to write tests, why is writing them so rare? Partly because people don’t know how, or never realised the importance. Partly because writing good tests that are robust and maintainable and live up to their promise is similarly hard as writing good software in the first place. It requires understanding the problem, knowledge, practice, and passion.

4.4.1.2. When to write tests?

There are two simple answers to this question:

  • Before you start coding

    This may sound weird, and we will come to it later in more detail. But actually, first writing a test and only afterwards creating the code that makes the test pass generally leads to superior design and quality of your resulting software.

  • When you have (or somebody else has) found a bug in your code

    Finding a bug (i.e., unexpected and unwanted behaviour) in your software is a perfectly normal scenario. Just keep in mind: it should be the last time a human being found this bug. Our lifetime is just too valuable and expensive to find the same bug more than once. Hence: Once you have found a bug and were able to reproduce it, write a test that checks for it (see the sketch below).

There is a third answer, particularly if you deal with existing software that does not contain (enough) tests: When you need to change the software, you should write tests to make sure you don’t accidentally change its expected behaviour. For some comments on how to deal with a large code base with too few tests, see further below.
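
As a sketch of such a regression test, suppose (hypothetically) that the average() function from the example above crashed with an obscure ZeroDivisionError when given an empty list, and you decided it should raise a meaningful ValueError instead:

import unittest

import mymodule


class TestAverageRegressions(unittest.TestCase):
    def test_average_of_empty_list_raises_value_error(self):
        # Regression test: average() used to crash with ZeroDivisionError
        with self.assertRaises(ValueError):
            mymodule.average([])

With this test in place, no human being needs to find this particular bug a second time.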

4.4.1.3. What and how to test?

A test always checks for a certain (expected) result/behaviour of your software given certain explicitly stated preconditions. There are quite a few different kinds of tests, from end-to-end tests to integration tests to unit tests, spanning different scopes, as their names suggest. However, rather than going into the details of these differences (and all the arguments surrounding their definitions), here is some more general advice:

  • Test as small a unit (of work) as possible.

    Your local context determines what “as small as possible” means. Usually, it is a single behaviour, and preferably something that does not need a complex setup of preconditions.

  • Write a test as if the (working) code was already there.

    Particularly in context of a “test-first” approach, use your tests to think about how you would like to use a certain part of your code base. Your tests are the first users of the interfaces you implement, be it the interface of a function or a class.

  • Test as much of the behaviour of your software as possible (and sensible)

    Test coverage is mostly a somewhat arbitrary metric. You can have 100% test coverage and still have serious bugs in your software, and you can have no test coverage and nearly bug-free code. However, generally, the higher the test coverage, the more confident you can be that you do not accidentally break things and introduce (subtle) bugs when changing or further developing your code.

Writing tests is a matter of getting used to this particular way of thinking. See the section on “test-driven development” below for further details. There will be situations where you cannot think of a way to come up with a test, and that is fine for the time being, as long as you don’t let it be an excuse to not write tests.

4.4.2. How to write (and run) tests

Enough of the introduction, back to work: How do we actually write tests in Python? We will cover the Python unittest package exclusively; it closely resembles the xUnit family of unit-testing frameworks well known from other programming languages.

4.4.2.1. The basic structure of unittests

The basic structure of automated (unit) tests using the unittest package in Python is shown in the following listing:

1import unittest
2
3import mymodule
4
5
6class MyTestCase(unittest.TestCase):
7    def test_something(self):
8        self.assertEqual(True, False)  # add assertion here

A few basic things to note:

  • Import the module unittest at the top (line 1), followed by importing the module you want to test, here mymodule (line 3).

  • Define classes for test cases that inherit from unittest.TestCase (line 6).

  • Define methods for the individual tests, each prefixed with test_ in its name (note that this is a convention you need to follow in order to have the tests run).

  • Make use of the series of special methods of the test class prefixed with assert. The assertEqual shown here is only one of about a dozen different methods; a few frequently used ones are sketched below. (Note that the names of these methods deviate from PEP 8, but resemble the naming scheme used in the other xUnit testing frameworks.)
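
A minimal sketch of a few of the more frequently used assert methods, all part of the unittest.TestCase class:

import unittest


class AssertExamples(unittest.TestCase):
    def test_frequently_used_assert_methods(self):
        self.assertEqual(1 + 1, 2)                  # equality
        self.assertTrue('text'.islower())           # truth of an expression
        self.assertIn(3, [1, 2, 3])                 # membership
        self.assertAlmostEqual(0.1 + 0.2, 0.3)      # floating-point comparison
        with self.assertRaises(ZeroDivisionError):  # expected exceptions
            1 / 0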

Sometimes, your tests in one test class need certain prerequisites to be fulfilled – be it an object that is instantiated and put in a certain state, a file that is created, or similar. Equally, sometimes you need to additionally tidy up after your tests, e.g. if your tests write some (temporary) files. For this, there are two special methods in the TestCase class: setUp and tearDown. The names speak for themselves. It is important to know that they are automatically executed before and after each of your tests is run, respectively. This helps with isolating your tests. A real-world example of using these methods is given below:

import os
import unittest

from mymodule import report  # assuming report is a module of your package


class TestReporter(unittest.TestCase):
    def setUp(self):
        self.report = report.Reporter()
        self.template = 'test_template.tex'

    def tearDown(self):
        # Remove the temporary template file if a test created it
        if os.path.exists(self.template):
            os.remove(self.template)

    def test_instantiate_class(self):
        pass

    def test_render_with_template(self):
        with open(self.template, 'w+') as f:
            f.write('')
        self.report.template = self.template
        self.report.render()

Each of the tests can make use of an object of the class to be tested, and can access the name of the template. Thus, there is only one place where the template name is defined, making it much easier to tidy up afterwards. Note that in case of the TestCase class, the setUp method takes over some of the duties that are usually in the realm of the constructor (__init__), e.g. declaring instance variables.

4.4.2.2. Running the tests

There are two ways of running tests: from the command line and from within your IDE. As we should always be able to run the tests without an IDE, here is how to do that: Change into the directory your tests are located in (usually tests within your project root) and type the following command:

python -m unittest discover -s . -t .

This will run all tests in the current directory, printing a single dot (.) for each successfully passed test, an s for each skipped test, and error messages in case a test fails.

The convenient way of running tests is from within your IDE. For PyCharm, the simplest way of running all tests in your tests directory is to right-click on the tests directory and select “Run ‘Python tests in test…’”. Afterwards, there is a keyboard shortcut for just re-running the tests (Shift + F10 for Linux/Windows).

Running the tests of a package by the author from within the PyCharm IDE produced the following (somewhat abbreviated) output, showing that quite a number of tests were run and passed, a few were skipped, and one failed. For the failed test, a few more details are given.

Testing started at 20:39 ...
Launching unittests with arguments python -m unittest discover -s /<somepath>/tests -t /<somepath>/tests in /<somepath>/tests

0.9.0.dev3 != 0.9.0.dev4

Expected :0.9.0.dev4
Actual   :0.9.0.dev3
<Click to see difference>

Traceback (most recent call last):
  File "/<somepath>/tests/test_utils.py", line 249, in test_version_correct_for_aspecd_package
    self.assertEqual(utils.get_aspecd_version(), version)
AssertionError: '0.9.0.dev4' != '0.9.0.dev3'
- 0.9.0.dev4
?          ^
+ 0.9.0.dev3
?          ^


Ran 2307 tests in 32.433s

FAILED (failures=1, skipped=9)

Process finished with exit code 1

After fixing the problem with the failing test, the result was the following (skipping all lines except the last ones, showing the actual result):

Ran 2307 tests in 26.501s

OK (skipped=9)

Process finished with exit code 0

While half a minute for more than 2000 tests is quite OK, you will usually want to restrict the tests you run while developing a feature to a single test class. This results in tests typically running in about a second. Just make sure to run the entire test suite before committing your changes, so that you did not accidentally break something elsewhere in your code.
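
From the command line, you can restrict a run to a single test module, test class, or even a single test method. Assuming the tests for the Reporter example from above live in a module test_report, this looks as follows:

python -m unittest test_report
python -m unittest test_report.TestReporter
python -m unittest test_report.TestReporter.test_render_with_template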

4.4.2.3. Organising the tests

How to organise your tests? For the preferred directory structure of a Python package, see the section on packaging. Generally, it is sensible to have a directory tests next to your code containing all the tests, and to have one module with tests per module in your package, with the same name as the module, prefixed with test_.
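
For a hypothetical package mypackage with modules report and utils, this results in a layout like the following:

mypackage/
├── mypackage/
│   ├── __init__.py
│   ├── report.py
│   └── utils.py
└── tests/
    ├── test_report.py
    └── test_utils.py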

4.4.3. A paradigm shift: test-driven development

Writing tests before the actual code? Are you insane? That cannot work! The answers are: probably, and yes, it does work. I’ve personally met people who outright reject the idea of test-driven development as being sensible, many who intellectually understand it and tend to agree with it but nevertheless don’t follow it in practice, and a few who eventually tried it out and started to rely on it. I probably went through all these stages myself.

In any case: test-driven development is a paradigm shift, something that turns things upside down in our heads. And as most of us started programming without even thinking about writing tests, it can be quite a long (mental) journey. However, test-driven development pays off, particularly when dealing with scientific software and its particular prerequisites and demands in terms of reliability.

4.4.3.1. Benefits of test-driven development

So what are the benefits of the test-first approach, i.e. writing tests first and starting to write production code only after having a failing test?

  • Code gets used from the beginning in a realistic scenario

    The first user of your production code is your tests. Writing tests as if the production code they are calling were already there helps you design the interfaces. If it is complicated to call your code from within a test, it will be complicated for everybody else (and generally probably a bad idea).

  • Immediate feedback

    Closely related to the previous point: Your tests provide immediate feedback on how easy your code will be to use, and hence guide the design of your interfaces.

  • Higher test coverage and confidence in your code

    Quite naturally, if you write tests first, you will usually have more tests, and what is more important, you will have more distinct cases (behaviour) being tested for. This strongly enhances the confidence you can have in your code to be well-behaved.

    Remember: Test coverage is a quantitative measure, not a qualitative one. Hence, while low test coverage points to an overall problem with the confidence you can have in your code, high test coverage does not necessarily mean that you test for all the relevant behaviour in your code base. Particularly in the scientific context, edge cases can be tremendously important. Similarly, oversimplified test cases whose expected outcome is easy to calculate by hand can trick you into being too confident.

  • More modular design

    Tests focus on small parts of your code base and test individual behaviour of individual functions or methods. Hence, developing test-first leads naturally to modular code, as otherwise you would not be able to write tests. This modularity dramatically simplifies adapting your code base to the changing requirements that you will always face. And it makes reusing parts of the code easier as well, although code reuse is probably much less common than usually anticipated.

  • Testing the tests

    “Technically, one of the biggest benefits of TDD nobody tells you about is that by seeing a test fail, and then seeing it pass without changing the test, you’re basically testing the test itself. If you expect it to fail and it passes, you might have a bug in your test or you’re testing the wrong thing. If the test failed, and now you expect it to pass, and it still fails, your test could have a bug, or it’s expecting the wrong thing to happen.” [Osherove, 2014] (p. 16)

    Thus, test-driven development breaks the infinite recursion of having to write tests for tests for tests …

Serious software development without tests is not an option anyway. Hence, whatever argument may be the one convincing you, just give it a try. After all, developing software test-driven can be a lot of fun.

4.4.3.2. The basic TDD cycle: red – green – refactor

There are different ways to describe the basic rules of test-driven development. One way of describing it, from [Beck, 2003]:

  • Only write new code when an automated test has failed.

  • Eliminate code duplication.

These two basic principles lead to a three-step cycle of test-driven development:

  1. Red – write a (failing) test

  2. Green – make the test pass (whatever it takes)

  3. Refactor – eliminate duplication, clean-up code

Two crucial aspects of this cycle are: (i) it should not last longer than a few minutes, and (ii) you’re not done once your tests pass. The short development cycles provide you with near-immediate feedback. Once you’ve got your test (and all the other already existing tests) to pass, you can happily refactor your code, clean up, eliminate duplication, and rest assured that you will be alerted immediately if you break anything you test for.

The “whatever it takes” to make a test pass is an interesting notion, isn’t it? It gets even worse: one strategy sometimes followed in test-driven development is to “fake it till you make it”, i.e. to simply return the expected result. How would this lead to sensible code? Simple enough: you write one or several additional tests with slightly different expectations, and this forces you to finally implement the real solution. Nevertheless, faking it first allows you to test your interfaces: Does it make sense how you call your code? And remember: you’re not done with a cycle before you have refactored your code!
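
As a minimal sketch of this strategy, assume a hypothetical function celsius_to_kelvin() developed test-driven. The first test is made to pass by simply returning the expected value; the second test, with a different expectation, forces the real implementation:

import unittest


def celsius_to_kelvin(temperature):
    # First cycle ("fake it"): return 273.15 -- just enough to pass
    # the first test. The second test then forces the real solution:
    return temperature + 273.15


class TestCelsiusToKelvin(unittest.TestCase):
    def test_zero_celsius_is_273_15_kelvin(self):
        self.assertAlmostEqual(celsius_to_kelvin(0), 273.15)

    def test_hundred_celsius_is_373_15_kelvin(self):
        # This second test made the faked implementation fail
        self.assertAlmostEqual(celsius_to_kelvin(100), 373.15)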

More generally, the tasks at hand that make us program in the first place are by far too complicated for us to immediately know the solution in sufficient detail. Hence, the only successful strategy to cope with this kind of problem is to divide it into smaller and smaller problems, until we end up with one detailed problem we know a solution for – or at least have a reasonably detailed idea that we can try to implement. Of course, this is true not only for software development, but for many aspects of our lives.

4.4.3.3. TDD: Does it work for scientific software development?

All fine and well, but does it work for scientific software development? How to write tests for things I don’t know the answer for? Isn’t the whole point in science to discover what nobody before us has seen? To seek for the unanswered questions? It is the creative combination and use of available tools, and sometimes the careful creation of new, purpose-built tools, that lead us to gain new ground and to discover the previously unknown. However, we need to have confidence in our tools, and this is where tests come in – both in software and in hardware.

For sure, it is difficult (or close to impossible) to test end-to-end a simulation program for a complex phenomenon without being able to check it against the result of a pre-existing simulation program (a sort of “benchmarking” that clearly has a value in itself). However, simulating complex phenomena means that our simulation programs consist of a large number of small blocks that we can (and should) test. Fortunately, we usually can tell for these small building blocks what the output should be for a given input.

You usually won’t reimplement basic algebraic operators. But let’s say you are going to implement a function returning rotation matrices for a set of Euler angles. In this case, you can clearly write down your expectations for a given set of Euler angles – even if you have to figure them out using paper and pencil. In such a case, it clearly pays off to create a number of test cases, and to make sure not to use only those input parameters that result, e.g., in integer values of the trigonometric functions. And once you’re done, you may want to compare your method with the one provided by SciPy (starting with version 1.2.0): scipy.spatial.transform.Rotation.from_euler(). More often than not, there is no ready-made function for you, but you get the general idea.
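
A minimal sketch of such tests, assuming a hypothetical function rotation_matrix(alpha, beta, gamma) in mymodule that takes three Euler angles in radians and performs the first rotation about the z axis (so that with the other two angles set to zero, the result is a plain rotation about z):

import unittest

import numpy as np

import mymodule


class TestRotationMatrix(unittest.TestCase):
    def test_rotation_about_z_by_90_degrees(self):
        expected = np.array([[0., -1., 0.],
                             [1.,  0., 0.],
                             [0.,  0., 1.]])
        result = mymodule.rotation_matrix(np.pi / 2, 0, 0)
        np.testing.assert_allclose(result, expected, atol=1e-12)

    def test_rotation_about_z_by_30_degrees(self):
        # Deliberately not an angle with integer-valued trigonometric
        # functions -- expectations worked out with paper and pencil
        angle = np.radians(30)
        expected = np.array([[np.cos(angle), -np.sin(angle), 0.],
                             [np.sin(angle),  np.cos(angle), 0.],
                             [0.,             0.,            1.]])
        result = mymodule.rotation_matrix(angle, 0, 0)
        np.testing.assert_allclose(result, expected, atol=1e-12)

Comparing the results against scipy.spatial.transform.Rotation.from_euler() can serve as an additional cross-check, as mentioned above.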

4.4.4. Differences between test code and production code

Generally, tests should follow all the usual advice for “clean code”, i.e. aim at readability and maintainability. Only tests that are easy to understand and maintain will last and thus provide you with all the benefits tests are written for in the first place. There are, however, a few things where test code may be slightly different from production code.

  • Test methods tend to have longer names

    The idea with naming test methods is to allow the reader to tell from the method name what the test is all about. Usually, the output of failing tests includes both the method name and the condition that failed (the expectation and a comparison with the actual result).

    Hence, make your method names tell you what they test for, e.g. test_render_with_nonexistent_template_raises rather than test_render_2.

  • Tests may sometimes break rules enforced in production code

    One fundamental aspect of object-oriented programming helping to create modular code is encapsulation. Hence, directly accessing private methods and properties is usually a no-go. Nevertheless, in tests it might sometimes be the easiest way. Just be aware that relying on internal implementations is never a good idea, be it in test or in production code.

  • Tests should be as short as possible

    Methods should be short. And short means: fit on a single page, i.e. not longer than 20–25 lines. For tests, it is even more important that they are short and thus easily comprehensible. We should be able to easily figure out what happens in a test, without needing minutes to read through and comprehend complicated lines of code.

    One test, one condition to test for. If you really need more complicated things to happen before you can run your test, use the setUp (and possibly tearDown) methods. But beware that this somewhat masks the setup (and teardown) tasks and can make it harder to understand the actual tests.

The worst thing that can happen with your tests is that they are so closely coupled to implementation details of your production code that making necessary changes to the production code becomes harder and harder. Eventually, you will have to throw away the tests if you don’t manage to refactor them and make them more modular and maintainable. This just means that you should take care of your tests in a similar way as you take care of your production code. And you will get better the more you practice.

4.4.5. What if I have a large untested code base?

Starting from a blank slate is an exception rather than the rule. Hence: What if you have a large untested code base, something Feathers calls “legacy code”? The largest problem with untested code is usually that it is hard to start writing tests, as you are faced with many dependencies. Hence, the first thing is to start breaking the dependencies, and focus on those parts of the code you need to change anyway. Start writing tests for new features or for those parts of your code you need to change. [Feathers, 2005] provides a good overview of different “dependency-breaking” techniques, focussing mainly on compiled languages such as C/C++/Java, but many ideas are directly applicable to Python code as well.

4.4.6. Wrap-up

Tests are an essential part of any sensible (scientific) software, as they are the only reasonable way to ensure that the code does what it is supposed/expected to do. Writing tests first (test-driven development) generally leads to code of higher quality than writing the tests in retrospect. Furthermore, writing tests first tests your tests, thus escaping the logical infinite loop of having to write tests that test the tests…

For a good introduction into testing with Python, see [Percival, 2017]. The seminal book on test-driven development is [Beck, 2003]. Excellent advice for writing tests can be found in [Osherove, 2014]. Much more heavyweight is [Meszaros, 2007], providing a large catalogue of patterns for tests. More lightweight again, and focussing on software development and good architecture guided by tests, is [Freeman and Pryce, 2010]. If you have to deal with a large code base without or with only few tests, have a look at [Feathers, 2005].