What are git submodules?
The basic principle that makes many professional tech companies professional is domain engineering: working for a long period on a small set of domains, in the hope that your codebase grows to make you more efficient and successful at developing projects in those domains. The main component in this formula is “code reuse”.
Sooner or later you will have a certain piece of code that you use constantly across all your projects. If we are talking about NLP, for instance, this might be your text normalizers, your feature extractors, or even some general utilities.
This is our second article about software deployment in Python. Our first article reviewed the issue of dependency management in Python and how we can solve it. If you are not familiar with the concept then, by all means, give it a read; it will make our discussion here much simpler.
All caught up? Well done, let us go.
There is always the option of copying pieces of code across all of your projects. However, most of the time you will use these modules in exactly the same way everywhere and, more importantly, any modification must then be repeated across all of your projects.
One way to tackle such problems is to use git submodules: these are basically independent git repositories that you can import as part of your project. This means that by converting your most-used code into sub-repos, you gain centralized control over this code and can import it with confidence into other projects.
The concept and usage of git submodules are beyond the scope of this article. If you are not familiar with them, check either the original documentation or one of the countless online tutorials.
The goal of this article, however, is to see how these submodules can be used in Python projects and how we can tune Python’s import system to add these modules in a clean way.
The usual starting case will be something like this:
BigProject
    __init__.py
    main_files.py
    subLibrary1
    subLibrary2
    requirements.txt
Here, in your big project, subLibrary1 and subLibrary2 are packages that you believe are independent enough to have their own repos, and you believe that there is a good chance that you will incorporate them into other “BigProjects” in the future.
A logical action is to create a new repo for each of these libraries and then add these repos as submodules in BigProject. Seems simple, doesn’t it?
However, since these two libraries are Python packages, and since you want them to be self-contained, each will need to have its own requirements and dependencies. In addition, since they all share the same environment under BigProject, satisfying the requirements should be simple enough. You can accommodate this as follows:
- Each library should have its own requirements.txt file
- BigProject directly imports the submodules’ requirements into its requirements file, as follows:
  ---- BigProject/requirements.txt ----
- Use the usual pip install (or similar)
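As a rough illustration of the second step, pip's requirements files can pull in other requirements files with the `-r` option. The sketch below builds the text of such a top-level file. The helper name `build_requirements` and the `requests` entry are placeholders invented for this example, not something taken from the original project.

```python
def build_requirements(sub_libraries, own_requirements):
    """Return the text of a top-level requirements file.

    Each sub-library contributes one "-r <dir>/requirements.txt" line, which
    tells pip to read that nested requirements file as well.
    """
    lines = [f"-r {name}/requirements.txt" for name in sub_libraries]
    lines.extend(own_requirements)
    return "\n".join(lines) + "\n"


# Hypothetical contents for BigProject/requirements.txt
print(build_requirements(["subLibrary1", "subLibrary2"], ["requests"]), end="")
# -r subLibrary1/requirements.txt
# -r subLibrary2/requirements.txt
# requests
```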
While this might seem like a good solution, many more issues can arise during the project’s life cycle, and there are better approaches to dependency management than maintaining several levels of requirements files. Some of the main issues that could arise from this approach include:
- The need to perform continuous and manual modification of these files
- The complexity of separating the development and deployment requirements
- The problem of dependency resolution across different requirements
If you are interested in what these problems are and how they could happen, check out our piece on dependency management in Python.
Regardless of the way you manage your dependencies, there is still a more important issue to figure out.
The main problem that you will face is that Python’s import system will treat the submodules not as installed utilities but rather as part of your code that needs to be added to sys.path, and any change in the structure of the __init__.py files will cause the agonizing “module not found” errors to pop up.
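To make the problem concrete, here is a minimal sketch of the `sys.path` workaround that an uninstalled submodule forces on you. The helper name `import_from_checkout` is hypothetical, and the sketch assumes the submodule's package directory sits directly inside the checkout directory you pass in.

```python
import importlib
import sys
from pathlib import Path


def import_from_checkout(checkout_dir, module_name):
    """Import module_name from a plain source checkout by patching sys.path."""
    path = str(Path(checkout_dir).resolve())
    if path not in sys.path:
        # The interpreter only finds the package if its parent directory
        # is listed on sys.path.
        sys.path.insert(0, path)
    # Make sure directories created after startup are noticed by the finders.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)
```

This only works as long as the directory layout stays exactly as the path manipulation expects, which is why moving or renaming anything tends to break the imports.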
The simplest way to avoid all of this headache is to convert your submodules into full Python packages. You don’t need to share them publicly on PyPI, but you do need a setup.py file in each of the submodules. If all of this talk about packaging made no sense to you, then you should probably take a little dive into Python’s packaging documentation and then come back here.
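For orientation, a minimal `setup.py` for one of the sub-libraries might look like the sketch below. Every value is a placeholder: the version number and the `numpy` dependency are assumptions made for illustration, and the layout assumes the importable package lives in a directory next to `setup.py`.

```python
# subLibrary1/setup.py (illustrative; all metadata below is placeholder)
from setuptools import find_packages, setup

setup(
    name="subLibrary1",
    version="0.1.0",
    # Discover every directory containing an __init__.py next to this file.
    packages=find_packages(),
    # Dependencies declared here are installed by pip together with the
    # package, so they can take the place of a separate requirements.txt.
    install_requires=[
        "numpy>=1.20",  # hypothetical dependency
    ],
)
```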
Hopefully you are caught up to what was said by this point; let’s keep going.
Ideally, your sub-libraries will become fully fledged packages. Once you clone the BigProject repo along with its submodules, all you have to do is use pip (or a wrapper such as pipenv) to install these submodules:
pip install ./subLibrary1/
pip install ./subLibrary2/
This means that Python’s import system will see these submodules as installed packages, and you will get rid of all these “module not found” exceptions.
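One way to confirm this, sketched with the standard library only: after installation, the package should be resolvable by the import system and visible in the installed-package metadata. The function names are made up for this example, and it assumes the distribution name you pass to `installed_version` matches the name declared in the sub-library's setup.py.

```python
from importlib import metadata, util


def is_importable(module_name):
    """Return True if the import system can locate a top-level module."""
    return util.find_spec(module_name) is not None


def installed_version(distribution_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# Example check for one of the sub-libraries from this article
print(is_importable("subLibrary1"), installed_version("subLibrary1"))
```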
As an added bonus, the packaging will manage the requirements of the submodules out of the box, and you can still benefit from git submodules for your version control.
Take Home Notes
If you ever find yourself using git submodules in a Python project, the advised approach is as follows:
- Convert your independent and reusable sub-packages into git submodules
- Don’t rely on requirements.txt files to handle dependency management; there are much better alternatives (check our piece for details)
- Make each of the submodules a Python package with its own setup.py file
- Install the packages in your virtual environment and import them cleanly