4.7. Packaging
Software is developed to be used. And every single piece of software that is usable and whose use provides some advantages (for somebody) will eventually be used somewhere. More pragmatically: In science, we usually develop software for concrete purposes, and as soon as a piece of software can be used for more than one very special occasion, it will be (re)used. Hence, we need sensible, working and established ways to distribute our software to its users.
Important
Packaging Python code, i.e. creating installable Python packages, is not a “nice to have” extra, but an essential prerequisite for reproducible research. Besides that, creating Python packages is fairly easy and makes handling code and its dependencies much more comfortable.
Simply copying the files containing the code may perhaps work for simple scripts, but even in these cases there are good reasons to not use this way of “distributing” our software. The established way of distributing Python code is in form of packages, and how (and why) to create a package from our code is the topic of this section.
Tip
If you do not want to spend much time reading but are interested in a tool that helps you both with creating the initial layout of a Python package and with maintaining its structure, have a look at the pymetacode package: documentation, source via GitHub, PyPI.
To mention it briefly already here: packaging is a way of distributing interpreted Python code. This is different from creating stand-alone executables, as is usually done with compiled languages. However, it is possible to create stand-alone executables from Python code, either by bundling the code with a Python interpreter or by converting it into a compiled language and compiling the result to a “true” executable. For details, see the section Creating stand-alone executables below. Furthermore, distributing code needs to be distinguished from publishing. While you can distribute code entirely locally, publishing usually refers to making your software known to people outside your direct local environment. Particularly in science, publishing software and recognising software as a perfectly valid result in its own right is still in its infancy and does not get the appropriate attention and reward. Nevertheless, publishing software (in a properly citable way) is an important aspect of the software development life cycle. For details, see the chapter on Publishing.
4.7.1. The case for Python packages
There are a number of reasons for creating Python packages:
sharing code, letting other people use your code
no messing around with the Python path
reproducibility
Each will be detailed a bit below.
4.7.1.2. No messing around with the Python path
A way to make your code known to your local Python interpreter is to extend the Python path manually to include the respective directories. This typically comes with lots of problems and is entirely unnecessary, besides being (rightfully) considered un-pythonic. Manually changing the Python path is usually not portable, tricky to achieve in virtual environments and error-prone. After all, packages, package managers, and the respective infrastructure have been developed to cope with exactly these problems.
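As a small illustration of why manual path manipulation is unnecessary: once a package is installed, the import system locates it on its own. The following is a minimal sketch using only the standard library. The helper name `locate` is made up for this example, and `json` merely stands in for any installed package.

```python
import importlib.util


def locate(name):
    """Return the origin the import system reports for a top-level name.

    Returns None if the name cannot be found.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # "json" stands in for any installed package; no sys.path edits needed.
    print(locate("json"))
    # A name that is not installed is simply not found.
    print(locate("surely_not_an_installed_package"))
```

Calling such a helper for your own package after installing it shows from which location the code is actually imported, which is handy when in doubt about what the interpreter sees.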
4.7.1.3. Reproducibility
An argument in favour of packaging that is admittedly focused on the scientific context is reproducibility. Only with packages (and version numbers) can you track which code you’ve used for your data processing and analysis. Furthermore, when using packages, there are several possible ways to mostly automate the tracking of the actual software packages and their versions being used for a particular task. Just be aware that using packages is a necessary, though not sufficient, prerequisite of reproducible data analysis.
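One possible way of automating such tracking is to ask the running Python environment which packages are installed and in which versions. The sketch below relies on `importlib.metadata` from the standard library (Python 3.8 and later). The function name `installed_versions` is an assumption made for this example, not part of any established tool.

```python
from importlib import metadata


def installed_versions():
    """Map the name of every installed distribution to its version string."""
    versions = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name")
        if name:  # skip broken installs that lack a name
            versions[name] = meta.get("Version")
    return versions


if __name__ == "__main__":
    # Print a sorted "name==version" list, e.g. to store next to results.
    for name, version in sorted(installed_versions().items()):
        print(f"{name}=={version}")
```

Storing this list together with the results of an analysis documents the software environment, but it does not replace proper version numbers for your own code.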
When only using pre-existing software packages and following the “one script per dataset/analysis” approach, you may argue that you do not need to package your own code. In this case, your best bet is to use Jupyter Notebooks and make it a habit to document them sufficiently for your future-self and others to figure out what has been done. However, from own experience, in most circumstances this approach does not really work, as you do not rely only on pre-existing software packages, but at least implicitly develop your own routines. While generally possible in Jupyter notebooks as well, this is just not what they are designed for.
4.7.2. Prerequisites
What are the prerequisites for creating Python packages? Technically speaking, version numbers and probably a license, hence pretty simple things. Much more important are the non-technical prerequisites: you need a strategy for releasing your software – even if it is only locally released. Let’s quickly look into each of the named criteria.
4.7.2.1. Version numbers (and VCS)
Why do we need version numbers for packaging? To know which version of a package we are using, and to be able to distinguish versions of the package. The mapping between versions of your code and version numbers should be bijective, meaning one version of your code corresponds to exactly one version number and one version number to exactly one version of your code. Hence, version numbers are much more than only a string (or number): they are the visible embodiment of a (rather complex) concept. For more details on version numbers see the chapter on Version numbers (SemVer).
Furthermore, version numbers and the scheme behind them can only work reliably in conjunction with a version control system (VCS). For details on this aspect, see the chapter on Version control (git).
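The one-to-one requirement between version numbers and code states can be made concrete with a small check. The following is only a sketch: it assumes a hypothetical release log given as pairs of version number and commit identifier, and the function name `is_bijective` is invented for this example.

```python
def is_bijective(releases):
    """Check that version numbers and code states map one-to-one.

    `releases` is an iterable of (version_number, commit_id) pairs.
    """
    pairs = set(releases)  # identical repeated entries are harmless
    versions = {version for version, _ in pairs}
    commits = {commit for _, commit in pairs}
    # One-to-one: no version with two commits, no commit with two versions.
    return len(versions) == len(pairs) == len(commits)


# Hypothetical release logs; the commit identifiers are made up.
clean_log = [("0.1.0", "a1b2c3"), ("0.1.1", "d4e5f6")]
reused_number = clean_log + [("0.1.1", "0f9e8d")]

if __name__ == "__main__":
    print(is_bijective(clean_log))
    print(is_bijective(reused_number))
```

The second log fails the check because the version number 0.1.1 refers to two different states of the code, which is exactly the situation a version control system together with a strict numbering scheme is meant to prevent.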
4.7.2.2. License
Copyright differs between countries, but most of the time, code falls under copyright. That means that without a transfer of rights (by means of a license), you are usually not allowed to use, let alone change and further develop, existing code. Therefore, one of the first steps when developing software is to think about an appropriate license and to clearly state and document this choice. For details, see the chapter on Licenses.
For internal use of software in a company or in public service, strictly speaking licenses are often not necessary (at least in the German public service), as depending on the contract you usually transfer all copyright automatically to your employer. Nevertheless, making copyright and other rights explicit is always helpful.
Important
Licenses are a somewhat tricky topic, and as I’m not a lawyer, don’t mistake the statements made here as legal advice. You may even be quite limited in the choice of your license depending on your exact situation. If in doubt, contact a lawyer.
4.7.2.3. Strategy for releasing software
As mentioned above already, the most important and usually most complicated prerequisite of packaging software is a strategy for releasing your software. This is true even if you only ever release your software locally. So what is the problem? Choosing a license is a one-time decision, and although you should take time and care (and possibly contact a specialised lawyer), it is hence a one-time effort. Deciding upon a scheme for version numbers similarly requires care and thought, but usually only once. (You can change version numbering schemes, as you can change licenses, but that is rare.)
When planning to release software, at least the following questions need to be addressed, besides the decisions on a version numbering scheme and a license:
Who is the audience? Where shall the software be released (i.e., appear)?
Who is actually releasing the software, i.e. taking care of all the technical details?
How do you ensure the bijective mapping of code versions and version numbers?
How to deal with bug reports and code fixes?
What about (basic) documentation? How is this kept up to date and where is it placed?
How to handle the take-over of the development (and release) process by others in the future?
Basically, many of these questions touch on fundamental aspects of software development. And having answers to the above questions does not necessarily mean that you need elaborate mechanisms or a whole software development team. You could release software in a “fire and forget” manner, simply providing the software “as is” and explicitly not caring about any user feedback, bug reports, or further development. And there are situations where this is the right (and perhaps only sensible) decision. However, as usual, it pays off to make conscious decisions and be aware of what you’ve actually decided upon.
As soon as you release your software “into the wild”, even if only to share it once with a colleague, it will usually get used and probably be changed, and it will most probably be used in ways you never dreamt about. After all, packaging is a process that ideally makes it easy and straightforward for you to distribute your software, even if only to your future self and your future projects (on a different computer, in a different virtual environment, or elsewhere).
4.7.3. Scripts vs. modules vs. packages
Particularly in the scientific context, Python is often used as a convenient scripting language for the quick (and dirty) analysis of data and the like. So what is the difference between scripts, modules, and packages in the Python world?
A script is just a list of statements, be it assignments or calls to functions, that are usually executed one after the other in a linear fashion. As such, a script is the persistent form of what you would otherwise type directly into the interactive Python shell. Nowadays, Jupyter notebooks are usually a far better alternative to (Python) scripts.
A module is a Python file usually containing functions, classes, and the like. Provided the module is located on one of the paths known to Python, you can import “symbols” (functions, classes, even variables) from such a module.
A package is a (collection of) self-contained Python module(s) that can be installed in a modular and straightforward fashion using the existing Python infrastructure (such as the package manager pip). With a package, you are relieved from having to think about how to install and use the code you’ve written. It just works™ like any other Python package (or module from the standard library).
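The difference between a script and a module can be illustrated with a single file that serves as both. The following is a minimal sketch. The file name `module.py` and the function `mean` are hypothetical and only chosen for illustration.

```python
"""module.py: a tiny module that can be imported or run as a script."""


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


if __name__ == "__main__":
    # Only executed when run as a script (python module.py),
    # not when imported (from mypackage.module import mean).
    print(mean([1.0, 2.0, 3.0]))
```

Once such a file is part of an installed package, the function can be imported from anywhere within the same Python environment, without any knowledge of where the file resides on disk.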
4.7.4. General layout of a Python package
As mentioned in the beginning, creating Python packages is fairly easy and just a matter of organising your files. Maintaining packages in a sensible and useful way though requires a strategy for releasing software, as mentioned above.
4.7.4.1. Basic directory layout
What makes a Python package a Python package? Actually a bit of file organisation in conjunction with a few special files. While there are different ways to create a Python package, we will use one distinct way here that works well in many cases. The basic directory layout of a Python package is shown below:
mypackage/
├── mypackage
│ ├── __init__.py
│ └── module.py
└── setup.py
Note that this is pretty much the minimal setup of a Python package. So what is special about it? All the code (i.e., the modules) resides in a subdirectory (with the same name as the actual package, here “mypackage”) that contains a (by default empty) special file __init__.py next to the other modules. Furthermore, we need a file setup.py in the root directory of the package. Again very minimalistic, the contents of the file setup.py could be as simple as:
import setuptools

setuptools.setup(
    name='mypackage',
)
Important
Please, bear in mind that the example shown here is purposefully minimalistic and not meant to be used in production. For any actual package, you would need to provide a lot more information, at the very least an author name, a license, a version number, and a description. Carry on reading for these important details.
4.7.4.2. Towards a working Python package
A few crucial things that have been mentioned previously are missing from the basic directory layout of a Python package shown above, such as a license and information about the version (number). Without going into any details about licenses (see the chapter on Licenses for details) and version numbers (see the chapter on Version numbers (SemVer) for details), here is a more complete structure of a Python package:
mypackage/
├── bin
│ └── incrementVersion.sh
├── mypackage
│ ├── __init__.py
│ └── module.py
├── .gitignore
├── LICENSE
├── MANIFEST.in
├── README.rst
├── setup.py
└── VERSION
So what has been added? Basically, a script for the unixoid world taking care of automatically incrementing the version number (bin/incrementVersion.sh, see the chapter on Version numbers (SemVer) for details), a .gitignore file (see the chapter on Version control (git) for details), a LICENSE file containing the text of the license chosen, a README file with a (minimal) description of what the package is all about and how it is used, and a file VERSION containing nothing but the version number that gets auto-incremented when using git and the incrementVersion.sh pre-commit hook (see the chapter on Version numbers (SemVer) for details).
The setup.py file has been considerably extended from the minimalistic example shown above. A more sensible file may look similar to the following (the example is actually a slightly abbreviated version of the file contained in the pymetacode package):
import os

import setuptools


def read(fname):
    with open(os.path.join(os.path.dirname(__file__), fname)) as f:
        content = f.read()
    return content


setuptools.setup(
    name='pymetacode',
    version=read('VERSION').strip(),
    description='A Python package helping to write and maintain Python '
                'packages.',
    long_description=read('README.rst'),
    long_description_content_type="text/x-rst",
    author='Till Biskup',
    author_email='till@till-biskup.de',
    url='https://www.meta-co.de/',
    project_urls={
        "Documentation": 'https://python.docs.meta-co.de/',
        "Source": 'https://github.com/tillbiskup/pymetacode',
    },
    packages=setuptools.find_packages(exclude=('tests', 'docs')),
    license='BSD',
    keywords=[
        "metaprogramming",
        "Python packages",
        "automation",
        "code generation",
    ],
    classifiers=[
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.7",
        "Programming Language :: Python :: 3.8",
        "Programming Language :: Python :: 3.9",
        "License :: OSI Approved :: BSD License",
        "Operating System :: OS Independent",
        "Environment :: Console",
        "Intended Audience :: Developers",
        "Intended Audience :: Science/Research",
        "Topic :: Software Development",
        "Topic :: Software Development :: Code Generators",
        "Development Status :: 4 - Beta",
    ],
    install_requires=[
        "jinja2",
        "oyaml",
        "appdirs",
    ],
    extras_require={
        'dev': ['prospector'],
        'docs': ['sphinx', 'sphinx-rtd-theme', 'sphinx-multiversion'],
        'deployment': ['wheel', 'twine'],
    },
    python_requires='>=3.7',
    entry_points={
        'console_scripts': [
            'pymeta = pymetacode.cli:cli',
        ],
    },
    include_package_data=True,
)
For more details on which package metadata you can (and should) set and how, have a look at the specifications of the Python Packaging Authority (PyPA).
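The console_scripts entry in the listing above has the form "command = module:function": on installation, a command with the given name is created that calls the named function. What such a function may look like is sketched below. This is not the actual pymetacode implementation. The command name mytool, the argument and the greeting are invented for illustration.

```python
import argparse


def cli(argv=None):
    """Entry point; the generated console script calls it without arguments."""
    parser = argparse.ArgumentParser(
        prog="mytool", description="Hypothetical example command."
    )
    parser.add_argument(
        "name", nargs="?", default="world",
        help="whom to greet (illustrative only)",
    )
    # With argv=None, argparse reads the arguments from sys.argv[1:].
    args = parser.parse_args(argv)
    print(f"Hello, {args.name}!")
    return 0  # used as exit status by the generated console script
```

Accepting an optional argument list, as done here, is a design choice that makes the function easy to call from tests, e.g. cli(["Ada"]), without touching the real command line.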
The file MANIFEST.in deserves special mention: As can be seen from the setup.py file listing above, both the VERSION and README files are read and used for the package metadata. Therefore, it is important for these two files to be part of the package, or to be more exact, the list of files that get distributed. While Python files are usually automatically detected and added to the distribution, other files need to be listed explicitly. This is what the MANIFEST.in file is good for, as its name suggests. Typical contents of such a file are shown below, together with comments:
# Include the README
include README*
# Include the VERSION
include VERSION
# Include the LICENSE
include LICENSE*
As you can see, you can use wildcards (here: *), and it is sensible to do that at least in case of README and LICENSE files, as you are thus independent of the file extensions (e.g., .md, .rst, or even none). Depending on your package, there may be further files that are not Python files, hence are not automatically recognised by the packaging infrastructure, but are relevant for the proper functioning of your package. Typical examples include template files for reports and the like. In such a case, simply add those files to the MANIFEST.in file. Note that there is even a way to recursively include all contents residing below a certain directory. Again, the MANIFEST.in file of the pymetacode package may serve as an example:
# Include templates
recursive-include pymetacode *
Just make sure to not accidentally have any files in a (sub)directory included in this way that need not (or even shall not) end up in your package.
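One way to check what actually ends up in a distribution is to list the contents of a source distribution archive after building it. The following sketch uses only the `tarfile` module from the standard library and assumes a gzip-compressed tar archive. The function name and the archive path mentioned afterwards are hypothetical.

```python
import tarfile


def list_sdist_files(sdist_path):
    """Return the sorted names of all regular files in a .tar.gz archive."""
    with tarfile.open(sdist_path, "r:gz") as archive:
        return sorted(
            member.name for member in archive.getmembers() if member.isfile()
        )
```

Calling list_sdist_files("dist/mypackage-0.1.0.tar.gz") (with the path adapted to your package) returns the file names, so that missing templates as well as accidentally included files can be spotted quickly.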
4.7.4.3. Where to place tests, documentation, and examples?
We’re not yet done with the structure of a sensible package. Both documentation and tests have not been mentioned yet. We are not concerned here with why you need both; for details, see the chapters on Documentation and Testing, respectively. Here, I just show where I usually place both: in directories directly in the root of your package, named docs and tests:
mypackage/
├── bin
│ └── incrementVersion.sh
├── docs
│ ├── api
│ │ └── index.rst
│ ├── changelog.rst
│ ├── conf.py
│ ├── index.rst
│ ├── make.bat
│ ├── Makefile
│ └── roadmap.rst
├── mypackage
│ ├── __init__.py
│ └── module.py
├── tests
│ ├── __init__.py
│ └── test_module.py
├── .gitignore
├── LICENSE
├── MANIFEST.in
├── README.rst
├── setup.py
└── VERSION
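To give an idea of what a file such as tests/test_module.py may contain, here is a minimal sketch using the `unittest` module from the standard library. The function `mean` is hypothetical and defined inline only to keep the example self-contained. In a real package, it would be imported from mypackage.module instead.

```python
import unittest


# In a real package this would be: from mypackage.module import mean
def mean(values):
    """Hypothetical function under test, defined inline for this sketch."""
    values = list(values)
    return sum(values) / len(values)


class TestMean(unittest.TestCase):
    def test_mean_of_known_values(self):
        self.assertEqual(mean([1.0, 2.0, 3.0]), 2.0)

    def test_empty_input_raises(self):
        with self.assertRaises(ZeroDivisionError):
            mean([])
```

With the layout shown above, running python -m unittest discover from the root directory of the package collects and runs all such test modules.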
Examples have been mentioned in the section heading as well, and good examples demonstrating individual aspects of your software are always a great help for users. I usually place them in a directory examples, again in the project root, and create a folder structure therein as necessary. Don’t forget a README file within the examples directory, serving as a table of contents if nothing else.
4.7.4.4. Integrating GUIs
A completely different topic not touched upon until now is graphical user interfaces (GUIs). Basically, this is a story of its own, both in terms of complexity and organisation. Usually, you want to have a clear separation between a GUI and the rest of your package, and this typically translates into the GUI code being confined to a subdirectory gui below your project code directory. For further details, see the appendix on GUIs.
4.7.5. How to (locally) distribute Python packages
By now, you should have a decent understanding of what makes a Python package and how to convert your code into one. Now the question arises: how to actually distribute your Python package(s)? The “traditional” way, uploading (and hence publishing) a package to the Python Package Index (PyPI), is typically the last and not the first step, and a large number of Python packages developed by some scientist somewhere will probably never surface on PyPI or other places.
The good news is: You don’t need PyPI or a local Python package index to distribute your Python packages. As soon as you have a Python package (by means of its directory structure and minimum required files), you can install it (locally) into every virtual environment using pip, hence the standard tooling provided by Python. (Yes, you could even install your package(s) globally on a computer in the main Python namespace, but no, you just don’t do that, for good reasons.)
4.7.5.1. Install from source
The most frequent scenario, particularly in scientific software development with locally developed Python packages, is installing the package from source, typically a checked-out git repository. The two necessary prerequisites for this way of distributing and installing your package(s) are:
a Python virtual environment (see Virtual environments for details)
a local copy of your Python package source code
Once you have activated your Python virtual environment and navigated to the root directory of the local copy of your Python package (i.e., the directory containing the setup.py file), installing the package (in its present form) into the virtual environment is as simple as:
pip install .
Here, the . denotes the current directory. Instead, you can provide any absolute or relative path, which makes installing several packages from a list of directories scriptable.
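Scripting the installation of several packages could look like the following sketch. It calls pip via the currently running interpreter (python -m pip), so the packages end up in the active virtual environment. The function names and the directory names in the usage note are assumptions made for this example.

```python
import subprocess
import sys
from pathlib import Path


def pip_install_command(package_dir):
    """Build the pip command installing the package found in `package_dir`."""
    # An absolute path cannot be mistaken for a package name on an index.
    target = str(Path(package_dir).resolve())
    # "python -m pip" uses the pip belonging to the running interpreter.
    return [sys.executable, "-m", "pip", "install", target]


def install_all(package_dirs):
    """Install each directory in turn; stop at the first failure."""
    for package_dir in package_dirs:
        subprocess.run(pip_install_command(package_dir), check=True)
```

A call such as install_all(["../mypackage", "../otherpackage"]) would then install both (hypothetical) packages one after the other, raising an exception as soon as one installation fails.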
If you have your package(s) in a git repository and happen to have a mirror of this repository available via a URL (be it GitHub or a local install of GitLab, Gitea or similar), you can even point pip directly to this URL. Furthermore, you can directly specify a branch or tag name, thus ensuring that for production use, only production-ready versions of your package(s) are used. To install the latest stable version of the pymetacode package from its public GitHub repository:
pip install git+https://github.com/tillbiskup/pymetacode@stable
Note that in this case, it would be much easier to do a simple pip install pymetacode, giving basically the same result, as this particular package is available via PyPI. But you will get the overall idea. Note that there are different transport protocols available, most importantly git+https, used here in the example, and git+ssh. The latter will usually require authentication, either by a combination of username and password, or via an SSH key pair (consisting of public and private key).
4.7.5.2. Editable install
Typically, when developing your Python package(s), you want to both have it installed via pip (to test that this works as expected) and be able to actively develop the code further. This can be achieved easily by making an “editable” install. Provided the same requirements as mentioned above are met, namely that you have activated your Python virtual environment and navigated to the root directory of the local copy of your Python package, installing the package in an editable fashion is as simple as:
pip install -e .
This is the typical scenario for developing your projects further. Just make sure to not use development versions for actual data processing and analysis, as this will usually violate the necessary bijective relationship between state of your code base and version number and hence impair reproducibility. To at least be able to trace if this happened, development versions of your code should have a dev# suffix in their version number or any similar marker. For details, see the chapter on Version numbers (SemVer).
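Tracing whether a development version has been used can be partly automated by checking the version number for such a marker. The sketch below assumes that the marker consists of "dev" followed by digits at the end of the version string, optionally preceded by a dot. Adapt the pattern if your numbering scheme differs. The function name is invented for this example.

```python
import re

# Matches a trailing development marker such as ".dev3" or "dev12".
_DEV_SUFFIX = re.compile(r"\.?dev\d+$")


def is_dev_version(version):
    """Return True if the version string ends in a dev# suffix."""
    return _DEV_SUFFIX.search(version) is not None
```

Combined with importlib.metadata.version("mypackage") from the standard library, which returns the version string of an installed package, an analysis script could warn or refuse to run if a development version is installed.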
4.7.5.3. Local Python package index
There are different valid reasons for having a local Python package index and scenarios where this is appropriate. Therefore, there exist different solutions for this problem. Furthermore, the pip tool does support alternative sources (aka package indices) out of the box. One possible way to set up and use a local Python package index is using devpi. However, typically you neither need to have your own local Python package index nor are you interested in taking care of the administrative overhead that comes with such a beast. Before setting up a local Python package index, think carefully about what hinders you from publishing your packages on the official PyPI instance. After all, publishing your software comes with a lot of advantages, not least visibility, potential reward, and proper citations.
4.7.5.4. Using the global PyPI
Just to briefly mention it here: if there are no serious reasons preventing you from publishing your Python package, consider using the global and official Python Package Index (PyPI) and upload your package(s) there. Just note that this comes with a certain level of responsibility and requires a unique name for your package not yet taken on PyPI by any other package. As this is a form of publishing your software, further details can be found in the chapter on Publishing.
4.7.6. Creating stand-alone executables
The problem with Python code (from a distribution perspective): usually you need a Python runtime and all dependencies (packages). Installing via pip or else is convenient for programmers/software developers, but less so for (end) users.
However, the first questions to address: When is this necessary? Is this the most typical scenario for scientific software? Particularly in case of scientific software actively developed for data analysis and used mainly by those developing it, you will deal with libraries or packages used from the command line or from within scripts or even other packages. An entirely different scenario is a mainly GUI-based application for users not concerned with programming but interested in “just” using a piece of software. Here, there may be a strong interest in providing a stand-alone executable from your Python code.
As Python is an interpreted language, there are generally two approaches to creating such a stand-alone executable:
packaging both a Python interpreter and the Python code into one “executable”
converting the Python code into another programming language that can be compiled, and compiling the result to a “true” binary
While both approaches are nowadays generally possible with Python code, depending on your local dependencies it may still be quite tricky. Probably the most widespread tool for the former approach is PyInstaller, which works on all major operating-system platforms. A promising tool for the latter approach, i.e. converting Python code into C code and afterwards compiling this into a binary, is Nuitka. Note that for the latter approach, you need to have a C compiler available locally, together with the Python header files. For a few comments on how to best create stand-alone executables in case of GUIs, you may want to have a look at the appendix on Graphical User Interfaces (GUIs).
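One practical consequence of bundling is that code may need to know whether it runs from a bundle, for example to locate data files. PyInstaller documents that it sets the attributes sys.frozen and sys._MEIPASS at run time. The following sketch builds on that. The function names and the idea of passing the source directory as a fallback are assumptions made for this example, and other tools may use different conventions.

```python
import sys
from pathlib import Path


def running_in_bundle():
    """Return True when executed from a PyInstaller bundle."""
    return bool(getattr(sys, "frozen", False)) and hasattr(sys, "_MEIPASS")


def resource_dir(source_dir):
    """Return the directory from which data files should be read.

    Inside a bundle, this is the bundle directory reported by PyInstaller;
    otherwise the given source directory is used unchanged.
    """
    if running_in_bundle():
        return Path(sys._MEIPASS)
    return Path(source_dir)
```

In a normal Python environment, running_in_bundle() returns False and resource_dir() simply hands back the directory it was given, so the same code works both installed via pip and as a stand-alone executable.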
Note
Regardless of which method you use to create stand-alone executable files for your Python project, cross-compiling (i.e. compiling on one platform for a different platform) is not possible. That means: to build an installable, deployable executable for Windows, you need to run PyInstaller (or whatever tool you use) on a Windows machine, and similarly for macOS and Linux.