4.7. Packaging

Software is developed to be used. And every single piece of software that is usable and whose use provides some advantage (for somebody) will eventually be used somewhere. More pragmatically: In science, we usually develop software for concrete purposes, and as soon as a piece of software can be used for more than one very special occasion, it will be (re)used. Hence, we need sensible, working and established ways to distribute our software to its users.

Important

Packaging Python code, i.e. creating installable Python packages, is not a “nice to have” extra, but an essential prerequisite for reproducible research. Besides that, creating Python packages is fairly easy and makes handling code and its dependencies much more comfortable.

Simply copying the files containing the code may perhaps work for simple scripts, but even in these cases there are good reasons to not use this way of “distributing” our software. The established way of distributing Python code is in form of packages, and how (and why) to create a package from our code is the topic of this section.

Tip

If you do not want to spend much time reading but are interested in a tool that helps you both with creating the initial layout of a Python package and with maintaining its structure, have a look at the pymetacode package: documentation, source via GitHub, PyPI.

To mention it briefly already here: packaging is a way of distributing interpreted Python code. This is different from creating stand-alone executables, as is usually done with compiled languages. However, it is possible to create stand-alone executables from Python code, either by bundling the code with a Python interpreter or even by converting it into a compiled language and then compiling it to a “true” executable. For details, see the section Creating stand-alone executables below. Furthermore, distributing code needs to be distinguished from publishing. While you can distribute code entirely locally, publishing usually refers to making your software known to people outside your direct local environment. Particularly in science, publishing software and recognising software as a perfectly valid result in its own right is still somewhat in its infancy and does not get the appropriate attention and reward. Nevertheless, publishing software (in a properly citable way) is an important aspect of the software development life cycle. For details, see the chapter on Publishing.

4.7.1. The case for Python packages

There are a number of reasons for creating Python packages:

  • sharing code, letting other people use your code

  • no messing around with the Python path

  • reproducibility

Each will be detailed a bit below.

4.7.1.1. Sharing code

Software is written to be used. Let’s face it: even a tiny piece of software, if useful and known to exist, will get used – probably by others and in an entirely different context than originally anticipated. As soon as you become known as someone who programs, colleagues will start asking you for code you may have written that may help them in similar situations. The same applies to yourself: The more code you have written, the more often you will ask yourself: Didn’t I write a piece of software some time ago that I may (re)use here now?

The more modular and the easier to share and distribute a piece of code is, the easier it will be to (re)use it. Python, like many other programming languages such as JavaScript, PHP, or Perl (to name but a few), comes with a well-established package mechanism and accompanying infrastructure for distributing and sharing code. The Python standard library itself is organised in packages, and there are literally zillions of (free) Python packages available, most prominently via the Python Package Index (PyPI). However, you do not need to (intend to) publish your code to benefit from creating packages. Even sharing your own code between different virtual environments (you do use virtual environments, don’t you?) is way simpler using packages.

4.7.1.2. No messing around with the Python path

A way to make your code known to your local Python interpreter is to extend the Python path manually to include the respective directories. This typically comes with lots of problems and is entirely unnecessary, besides being (rightfully) considered un-pythonic. Manually changing the Python path is usually not portable, tricky to achieve in virtual environments and error-prone. After all, packages, package managers, and the respective infrastructure have been developed to cope with exactly these problems.

4.7.1.3. Reproducibility

An argument in favour of packaging that is admittedly focused on the scientific context is reproducibility. Only with packages (and version numbers) can you track which code you’ve used for your data processing and analysis. Furthermore, when using packages, there are several possible ways to mostly automate the tracking of the actual software packages and their versions being used for a particular task. Just be aware that using packages is a necessary, though not sufficient, prerequisite of reproducible data analysis.
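One possible way of automating this tracking is to record the versions of all installed packages together with the results of an analysis. The following is a minimal sketch using only the standard library (importlib.metadata, available from Python 3.8 onwards); the function names installed_versions and environment_record are hypothetical and chosen for illustration only.

```python
import json
import platform
from importlib import metadata


def installed_versions():
    """Map the name of each installed distribution to its version string."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken metadata
            versions[name] = dist.version
    return versions


def environment_record():
    """Collect the Python version and all installed package versions."""
    return {
        "python": platform.python_version(),
        "packages": dict(sorted(installed_versions().items())),
    }


if __name__ == "__main__":
    # Print the record as JSON, e.g. to store it next to analysis results
    print(json.dumps(environment_record(), indent=2))
```

Storing such a record alongside each processed dataset only tells you which versions were installed, not whether those versions correspond to clean releases, which is why version numbers and a release strategy remain necessary.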

When only using pre-existing software packages and following the “one script per dataset/analysis” approach, you may argue that you do not need to package your own code. In this case, your best bet is to use Jupyter Notebooks and make it a habit to document them sufficiently for your future-self and others to figure out what has been done. However, from my own experience, in most circumstances this approach does not really work, as you do not rely only on pre-existing software packages, but at least implicitly develop your own routines. While generally possible in Jupyter notebooks as well, this is just not what they are designed for.

4.7.2. Prerequisites

What are the prerequisites for creating Python packages? Technically speaking, version numbers and probably a license, hence pretty simple things. Much more important are the non-technical prerequisites: you need a strategy for releasing your software – even if it is only locally released. Let’s quickly look into each of the named criteria.

4.7.2.1. Version numbers (and VCS)

Why do we need version numbers for packaging? To know which version of a package we are using, and to be able to distinguish versions of the package. The mapping between versions of your code and version numbers should be bijective, meaning one version of your code corresponds to exactly one version number and one version number to exactly one version of your code. Hence, version numbers are much more than only a string (or number): they are the visible embodiment of a (rather complex) concept. For more details on version numbers see the chapter on Version numbers (SemVer).

Furthermore, version numbers and the scheme behind them can only work reliably in conjunction with a version control system (VCS). For details on this aspect, see the chapter on Version control (git).

4.7.2.2. License

Copyright differs between countries, but most of the time, code falls under copyright. That means that without a transfer of rights (by means of a license), you are usually not allowed to use, let alone change and further develop, existing code. Therefore, one of the first steps when developing software is to think about an appropriate license and to clearly state and document this choice. For details, see the chapter on Licenses.

For internal use of software in a company or in public service, strictly speaking licenses are often not necessary (at least in the German public service), as depending on the contract you usually transfer all copyright automatically to your employer. Nevertheless, making copyright and other rights explicit is always helpful.

Important

Licenses are a somewhat tricky topic, and as I’m not a lawyer, don’t mistake the statements made here as legal advice. You may even be quite limited in the choice of your license depending on your exact situation. If in doubt, contact a lawyer.

4.7.2.3. Strategy for releasing software

As mentioned above already, the most important and usually most complicated prerequisite of packaging software is a strategy for releasing your software. This is true even if you only ever release your software locally. So what is the problem? Choosing a license is a one-time decision, and although you should take time and care (and possibly contact a specialised lawyer), it is hence a one-time effort. Deciding upon a scheme for version numbers similarly requires care and thought, but usually only once. (You can change version numbering schemes, as you can change licenses, but that is rare.)

When planning to release software, at least the following questions need to be addressed, besides the decisions on a version numbering scheme and a license:

  • Who is the audience? Where shall the software be released (i.e., appear)?

  • Who is actually releasing the software, i.e. taking care of all the technical details?

  • How do you ensure the bijective mapping of code versions and version numbers?

  • How to deal with bug reports and code fixes?

  • What about (basic) documentation? How is this kept up to date and where is it placed?

  • How to handle the take-over of the development (and release) process by others in the future?

Basically, many of these questions touch on fundamental aspects of software development. And having answers to the above questions does not necessarily mean that you need elaborate mechanisms or a whole software development team. You could release software in a “fire and forget” manner, simply providing the software “as is” and explicitly not caring about any user feedback, bug reports, or further development. And there are situations where this is the right (and perhaps only sensible) decision. However, as usual, it pays off to make conscious decisions and be aware of what you’ve actually decided upon.

As soon as you release your software “into the wild”, and be it only to share it once with a colleague, it will usually get used and probably be changed – and it will most probably be used in ways you never dreamt about. After all, packaging is a process that ideally makes it easy and straight-forward for you to distribute your software, and be it only to your future-self and your future projects (on a different computer, in a different virtual environment, or else).

4.7.3. Scripts vs. modules vs. packages

Particularly in the scientific context, Python is often used as a convenient scripting language for the quick (and dirty) analysis of data and alike. So what is the difference between scripts, modules, and packages in the Python world?

A script is just a list of statements, be it assignments, be it calls to functions, that are usually executed one after the other in a linear fashion. As such, a script is the more persistent incarnation of what you would otherwise type directly into the interactive Python shell, which is pretty transient by nature. Nowadays, a usually far better alternative to (Python) scripts are Jupyter notebooks.

A module is a Python file usually containing functions, classes, and alike. Given that the module is within the paths known by Python, you can import “symbols” (functions, classes, even variables) from such a module.

A package is a (collection of) self-contained Python module(s) that can be installed in a modular and straight-forward fashion using the existing Python infrastructure (such as the package manager pip). With a package, you are relieved from having to think about how to install and use the code you’ve written. It just works™ as any other Python package (or module from the standard library).
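To illustrate the difference between a script and a module, here is a minimal sketch of a file that can serve as both. The file name module.py and the function mean are hypothetical and only used for illustration. The function can be imported from elsewhere, while the statements below the __name__ check are only executed when the file is run directly as a script.

```python
"""module.py: a module that can also be run as a script."""


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


if __name__ == "__main__":
    # Executed only when the file is run as a script, not when it is imported
    print(mean([1.0, 2.0, 3.0]))
```

Once this file is part of an installed package named mypackage, other code can use the function via from mypackage.module import mean, regardless of the current working directory.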

4.7.4. General layout of a Python package

As mentioned in the beginning, creating Python packages is fairly easy and just a matter of organising your files. Maintaining packages in a sensible and useful way though requires a strategy for releasing software, as mentioned above.

4.7.4.1. Basic directory layout

What makes a Python package a Python package? Actually a bit of file organisation in conjunction with a few special files. While there are different ways to create a Python package, we will use one distinct way here that works well in many cases. The basic directory layout of a Python package is shown below:

mypackage/
├── mypackage
│   ├── __init__.py
│   └── module.py
└── setup.py

Note that this is pretty much the minimal setup of a Python package. So what is special about it? All the code (i.e., the modules) resides in a subdirectory (with the same name as the actual package, here “mypackage”) that contains a (by default empty) special file __init__.py next to the other modules. Furthermore, we need a file setup.py in the root directory of the package. Again very minimalistic, the contents of the file setup.py could be as simple as

import setuptools


setuptools.setup(
    name='mypackage',
)

Important

Please, bear in mind that the example shown here is purposefully minimalistic and not meant to be used in production. For any actual package, you would need to provide a lot more information, at the very least an author name, a license, a version number, and a description. Carry on reading for these important details.
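As the minimal layout is just a matter of a few directories and files, it can be created programmatically. The following sketch uses only the standard library; the function name create_minimal_package and the placeholder contents of module.py are assumptions made for illustration. It is not how pymetacode works internally, just a way to make the structure described above tangible.

```python
import tempfile
from pathlib import Path

# Contents of the purposefully minimalistic setup.py shown above
SETUP_PY = '''import setuptools


setuptools.setup(
    name='{name}',
)
'''


def create_minimal_package(root, name="mypackage"):
    """Create the minimal package layout below ``root`` and return its path."""
    project = Path(root) / name
    source = project / name
    source.mkdir(parents=True)
    # An empty __init__.py marks the directory as a Python package
    (source / "__init__.py").touch()
    (source / "module.py").write_text('"""Placeholder module."""\n',
                                      encoding="utf-8")
    (project / "setup.py").write_text(SETUP_PY.format(name=name),
                                      encoding="utf-8")
    return project


if __name__ == "__main__":
    # Create the layout in a temporary directory and list what was created
    with tempfile.TemporaryDirectory() as tmp:
        project = create_minimal_package(tmp)
        for path in sorted(project.rglob("*")):
            print(path.relative_to(project))
```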

4.7.4.2. Towards a working Python package

A few crucial things that have been mentioned previously are missing from the basic directory layout of a Python package shown above, such as a license and information about the version (number). Without going into any details about licenses (see the chapter on Licenses for details) and version numbers (see the chapter on Version numbers (SemVer) for details), here is a more complete structure of a Python package:

mypackage/
├── bin
│   └── incrementVersion.sh
├── mypackage
│   ├── __init__.py
│   └── module.py
├── .gitignore
├── LICENSE
├── MANIFEST.in
├── README.rst
├── setup.py
└── VERSION

So what has been added? Basically, a script for the unixoid world taking care of automatically incrementing the version number (bin/incrementVersion.sh, see the chapter on Version numbers (SemVer) for details), a .gitignore file (see the chapter on Version control (git) for details), a LICENSE file containing the text of the license chosen, a README file with (minimal) description of what the package is all about and how it is used, and a file VERSION containing nothing but the version number that gets auto-incremented when using git and the incrementVersion.sh pre-commit hook (see the chapter on Version numbers (SemVer) for details).

The setup.py file has been considerably extended from the minimalistic example shown above. A more sensible file may look similar to the following (the example is actually a slightly abbreviated version of the file contained in the pymetacode package):

import os
import setuptools


def read(fname):
    """Return the contents of a file located next to this setup.py."""
    path = os.path.join(os.path.dirname(__file__), fname)
    with open(path, encoding='utf-8') as f:
        content = f.read()
    return content


setuptools.setup(
    name='pymetacode',
    version=read('VERSION').strip(),
    description='A Python package helping to write and maintain Python '
                'packages.',
    long_description=read('README.rst'),
    long_description_content_type="text/x-rst",
    author='Till Biskup',
    author_email='till@till-biskup.de',
    url='https://www.meta-co.de/',
    project_urls={
        "Documentation": 'https://python.docs.meta-co.de/',
        "Source": 'https://github.com/tillbiskup/pymetacode',
    },
    packages=setuptools.find_packages(exclude=('tests', 'docs')),
    license='BSD',
    keywords=[
        "metaprogramming",
        "Python packages",
        "automation",
        "code generation",
        ],
    classifiers=[
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.7",
        "Programming Language :: Python :: 3.8",
        "Programming Language :: Python :: 3.9",
        "License :: OSI Approved :: BSD License",
        "Operating System :: OS Independent",
        "Environment :: Console",
        "Intended Audience :: Developers",
        "Intended Audience :: Science/Research",
        "Topic :: Software Development",
        "Topic :: Software Development :: Code Generators",
        "Development Status :: 4 - Beta",
    ],
    install_requires=[
        "jinja2",
        "oyaml",
        "appdirs",
        ],
    extras_require={
        'dev': ['prospector'],
        'docs': ['sphinx', 'sphinx-rtd-theme', 'sphinx-multiversion'],
        'deployment': ['wheel', 'twine'],
    },
    python_requires='>=3.7',
    entry_points={
        'console_scripts': [
            'pymeta = pymetacode.cli:cli',
        ],
    },
    include_package_data=True,
)

For more details on which package metadata you can (and should) set and how, have a look at the specifications published by the Python Packaging Authority (PyPA).
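The entry_points setting in the listing above may deserve a short explanation: an entry of the form 'command = package.module:function' under console_scripts makes pip create an executable named command that calls the given function. The following is a minimal sketch of what such a function could look like, using argparse from the standard library. It is a hypothetical example for a file mypackage/cli.py and not the actual implementation found in pymetacode.

```python
import argparse


def cli(argv=None):
    """Hypothetical entry point for a console script.

    If ``argv`` is None, argparse falls back to the command-line
    arguments of the running process (``sys.argv[1:]``).
    """
    parser = argparse.ArgumentParser(
        prog="mytool",
        description="Greet somebody from the command line.",
    )
    parser.add_argument("name", nargs="?", default="world",
                        help="whom to greet (default: world)")
    args = parser.parse_args(argv)
    print(f"Hello, {args.name}!")
    # The return value is used as the exit status of the command
    return 0
```

With an entry 'mytool = mypackage.cli:cli' in console_scripts, installing the package would make the command mytool available within the respective virtual environment.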

The file MANIFEST.in deserves special mention: As can be seen from the setup.py file listing above, both VERSION and README files are read and used for the package metadata. Therefore, it is important for these two files to be part of the package, or to be more exact, the list of files that get distributed. While Python files are usually automatically detected and added to the distribution, other files need to be listed explicitly. This is what the MANIFEST.in file is good for, as its name suggests. Typical contents of such a file are shown below, together with comments:

# Include the README
include README*

# Include the VERSION
include VERSION

# Include the LICENSE
include LICENSE*

As you can see, you can use wildcards (here: *), and it is sensible to do that at least in case of README and LICENSE files, as you are thus independent of the file extensions (e.g., .md, .rst, or even none). Depending on your package, there may be further files that are not Python files, hence are not automatically recognised by the packaging infrastructure, but are relevant for the proper functioning of your package. Typical examples include template files for reports and alike. In such a case, simply add those files to the MANIFEST.in file. Note that there is even a way to recursively include all contents residing below a certain directory. Again, the MANIFEST.in file of the pymetacode package may serve as an example:

# Include templates
recursive-include pymetacode *

Just make sure to not accidentally have any files in a (sub)directory included in this way that need not (or even shall not) end up in your package.
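One way to check which files actually ended up in a source distribution is to list the contents of the resulting archive. The following sketch assumes that the source distribution is a gzipped tar archive (the usual .tar.gz file); the function name list_sdist_files and the archive path used in the usage note are hypothetical.

```python
import tarfile


def list_sdist_files(archive_path):
    """Return the sorted names of all regular files in a .tar.gz archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile()
        )
```

Calling, e.g., list_sdist_files("dist/mypackage-0.1.0.tar.gz") and comparing the result with your expectations helps to spot both missing files (such as VERSION) and files that were included by accident.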

4.7.4.3. Where to place tests, documentation, and examples?

We’re not yet done with the structure of a sensible package. Both documentation and tests have not been mentioned yet. We are not concerned here with why you need both; for details, see the chapters on Documentation and Testing, respectively. Here, I just show where I usually place both: in directories directly in the root of your package, named docs and tests:

mypackage/
├── bin
│   └── incrementVersion.sh
├── docs
│   ├── api
│   │   └── index.rst
│   ├── changelog.rst
│   ├── conf.py
│   ├── index.rst
│   ├── make.bat
│   ├── Makefile
│   └── roadmap.rst
├── mypackage
│   ├── __init__.py
│   └── module.py
├── tests
│   ├── __init__.py
│   └── test_module.py
├── .gitignore
├── LICENSE
├── MANIFEST.in
├── README.rst
├── setup.py
└── VERSION

Examples have been mentioned in the section heading as well. And good examples demonstrating individual aspects of your software are always a great help for users. I usually place them in a directory examples again in the project root, and create a folder structure therein as necessary. Don’t forget a README file within the examples directory, serving as a table of contents if nothing else.

4.7.4.4. Integrating GUIs

A completely different topic not touched upon until now is graphical user interfaces (GUIs). Basically, this is a story of its own, both in terms of complexity and organisation. Usually, you want to have a clear separation between a GUI and the rest of your package, and this typically translates into the GUI code being confined in a subdirectory gui below your project code directory. For further details, see the appendix on GUIs.

4.7.5. How to (locally) distribute Python packages

By now, you should have a decent understanding of what makes a Python package and how to convert your code into one. Now the question arises: how to actually distribute your Python package(s)? The “traditional” way, uploading it to (and hence publishing it on) the Python Package Index (PyPI), is typically the last and not the first step, and a large number of Python packages developed by some scientist somewhere will probably never surface on PyPI or other places.

The good news is: You don’t need PyPI or a local Python package index to distribute your Python packages. As soon as you have a Python package (by means of its directory structure and minimum required files), you can install it (locally) into every virtual environment using pip, hence the standard tooling provided by Python. (Yes, you could even install your package(s) globally on a computer in the main Python namespace, but no, you just don’t do that, for good reasons.)

4.7.5.1. Install from source

The most frequent scenario, particularly in scientific software development with locally developed Python packages, is installing the package from source, typically a checked-out git repository. The two necessary prerequisites for this way of distributing and installing your package(s) are:

  • a Python virtual environment (see Virtual environments for details)

  • a local copy of your Python package source code

Once you have activated your Python virtual environment and navigated to the root directory of the local copy of your Python package (i.e., the directory containing the setup.py file), installing the package (in its present form) into the virtual environment is as simple as:

pip install .

Here, the . denotes the current directory. Similarly, you can provide here an absolute or relative path, making installing a list of packages from a list of directories scriptable.
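The scriptability mentioned above can be sketched in Python as follows. The function names pip_install_command and install_all are hypothetical; the sketch calls pip as a module of the currently running interpreter (sys.executable), which ensures that the packages end up in the virtual environment the script is run from.

```python
import subprocess
import sys


def pip_install_command(path):
    """Build the command for installing the package located at ``path``."""
    return [sys.executable, "-m", "pip", "install", str(path)]


def install_all(paths):
    """Install the packages from all given directories, one after the other.

    Stops with a ``subprocess.CalledProcessError`` as soon as one
    installation fails.
    """
    for path in paths:
        subprocess.run(pip_install_command(path), check=True)
```

A call such as install_all(["../mypackage", "../otherpackage"]), with directory names being placeholders for your own packages, would install both packages into the active environment.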

If you have your package(s) in a git repository and happen to have a mirror of this repository available via a URL (be it GitHub or a local install of GitLab, Gitea, or something else), you can even point pip directly to this URL. Furthermore, you can directly specify a branch or tag name, thus ensuring that for productive use, only production-ready versions of your package(s) are used. To install the latest stable version of the pymetacode package from its public GitHub repository:

pip install git+https://github.com/tillbiskup/pymetacode@stable

Note that in this case, it would be much easier to do a simple pip install pymetacode, giving basically the same result, as this particular package is available via PyPI. But you will get the overall idea. Note that there are different transport protocols available, most importantly git+https, used here in the example, and git+ssh. The latter will usually require authentication, either by a combination of username and password, or via an SSH key pair (consisting of public and private key).

4.7.5.2. Editable install

Typically, when developing your Python package(s), you want both to have it installed via pip (to test that this works as expected) and to be able to actively develop the code further. This can be achieved easily by making an “editable” install. Provided the same requirements as mentioned above are met, namely that you have activated your Python virtual environment and navigated to the root directory of the local copy of your Python package, installing the package in an editable fashion is as simple as:

pip install -e .

This is the typical scenario for developing your projects further. Just make sure to not use development versions for actual data processing and analysis, as this will usually violate the necessary bijective relationship between the state of your code base and the version number and hence impair reproducibility. To at least be able to trace if this happened, development versions of your code should have a dev# suffix in their version number or any similar marker. For details, see the chapter on Version numbers (SemVer).
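Given such a marker, a processing routine can check the version number before doing any real work. The following is a deliberately simplified sketch that only looks for a trailing dev segment (such as .dev12) in the version string; it is not a full parser for Python version specifiers, and the function names is_dev_version and require_release_version are hypothetical.

```python
import re

# A trailing "dev" segment, optionally followed by a number, e.g. ".dev12"
_DEV_SUFFIX = re.compile(r"(?:^|[.\-_\d])dev\d*$", re.IGNORECASE)


def is_dev_version(version):
    """Return True if the version string carries a development marker."""
    # Ignore a local version label such as "+g1a2b3c", if present
    public_part = version.split("+", 1)[0]
    return _DEV_SUFFIX.search(public_part) is not None


def require_release_version(version):
    """Raise an error for development versions, else return the version."""
    if is_dev_version(version):
        raise RuntimeError(
            f"Refusing to run with development version {version!r}"
        )
    return version
```

Whether to refuse to run or merely to record a warning together with the results is a design decision; the important point is that the use of a development version leaves a trace.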

4.7.5.3. Local Python package index

There are different valid reasons for having a local Python package index and scenarios where this is appropriate. Therefore, there exist different solutions for this problem. Furthermore, the pip tool does support alternative sources (aka package indices) out of the box. One possible way to set up and use a local Python package index is using devpi. However, typically you neither need to have your local Python package index nor are you interested in taking care of the administrative overhead that comes with such a beast. Before setting up a local Python package index, think carefully about what hinders you from publishing your packages on the official PyPI instance. After all, publishing your software comes with a lot of advantages, not least visibility, potential reward, and proper citations.

4.7.5.4. Using the global PyPI

Just to briefly mention it here: if there are no serious reasons preventing you from publishing your Python package, consider using the global and official Python Package Index (PyPI) and upload your package(s) there. Just note that this comes with a certain level of responsibility and requires a unique name for your package not yet taken on PyPI by any other package. As this is a form of publishing your software, further details can be found in the chapter on Publishing.

4.7.6. Creating stand-alone executables

The problem with Python code (from a distribution perspective): usually you need a Python runtime and all dependencies (packages). Installing via pip or similar tools is convenient for programmers/software developers, but less so for (end) users.

However, the first questions to address: When is this necessary? Is this the most typical scenario for scientific software? Particularly in case of scientific software actively developed for data analysis and used mainly by those developing it, you will deal with libraries or packages used from the command line or from within scripts or even other packages. An entirely different scenario are mainly GUI-based approaches for users not concerned with programming but interested in “just” using a piece of software. Here, there may be a strong interest in providing a stand-alone executable from your Python code.

As Python is an interpreted language, there are generally two approaches to creating such a stand-alone executable:

  • packaging both a Python interpreter and the Python code into one “executable”

  • converting the Python code into another programming language that can be compiled, and compiling the result to a “true” binary

While both approaches are nowadays generally possible with Python code, depending on your local dependencies it may still be quite tricky. Probably the most widespread tool for the former approach is PyInstaller. This tool works for all major operating-system platforms. A promising tool for the latter approach, i.e. converting Python code into C code and afterwards compiling this into a binary, is Nuitka. Note that for the latter approach, you need to have a C compiler available locally, together with the Python header files. For a few comments on how to best create stand-alone executables in case of GUIs, you may want to have a look at the appendix on Graphical User Interfaces (GUIs).
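One of the things that can get tricky is locating data files (templates, icons, and alike) once the code runs from a bundle rather than from a regular installation. The following sketch applies to PyInstaller only, which sets the attributes sys.frozen and sys._MEIPASS in bundled applications; other tools such as Nuitka use different mechanisms. The function names is_frozen and base_directory are hypothetical.

```python
import sys
from pathlib import Path


def is_frozen():
    """Return True if the code runs from within a PyInstaller bundle."""
    return bool(getattr(sys, "frozen", False)) and hasattr(sys, "_MEIPASS")


def base_directory():
    """Return the directory relative to which data files are located."""
    if is_frozen():
        # PyInstaller stores the path of the bundle directory here
        return Path(sys._MEIPASS)
    # Regular installation: data files reside next to this module
    return Path(__file__).resolve().parent
```

Data files can then be addressed as base_directory() / "templates" / "report.txt" (a made-up path), both when running from source and when running from the bundle, provided the files have been added to the bundle in the first place.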

Note

Regardless of which method you use to create stand-alone executable files for your Python project, cross-compiling (i.e. compiling on one platform for a different platform) is not possible. What that means: to build an installable, deployable executable for Windows, you need to run PyInstaller (or whatever tool you use) on a Windows machine, and similarly for macOS and Linux.