How to Version Control Your Machine Learning Projects? – A Look into Data Version Control (DVC)

If you have spent time working with machine learning, one thing is clear: it’s an iterative process. Machine learning is about rapid experimentation and iteration, and each experiment consists of several parts:

  1. the data you use,
  2. the hyperparameters,
  3. the learning algorithm,
  4. the model architecture,
  5. and the optimal combination of all of the above.

Throughout this iterative process, your model’s accuracy on your dataset will vary accordingly, and without keeping track of your experiment history you won’t be able to learn much from it. Versioning lets you keep track of all of your experiments and their different components.

How to Version Control ML Projects?

One of the most popular tools for version control is Git.

Git is really cool, but in the ML domain you can’t keep all the data produced by your experiments in Git. One workaround is to store all the datasets in cloud storage such as Amazon S3 and all the code in Git. That seems like a good choice, but given the iterative and experimental nature of ML projects, this approach creates confusion and leads to a mess in the long run.

Here comes Data Version Control (DVC): an open-source version control system for ML projects that lets you version data files and directories, intermediate results, and ML models using Git, without storing the file contents in the Git repository.

It is hardly possible in real life to develop a good machine learning model in a single pass. ML modeling is an iterative process and it is extremely important to keep track of your steps, dependencies between the steps, dependencies between your code and data files and all code running arguments.

— Dmitry Petrov, Creator of DVC

What Is DVC and Why Choose It?

Data Version Control, or DVC, is a type of experiment management software built on top of the engineering toolset you’re already used to, particularly a source code version control system (currently Git). DVC reduces the gap between existing tools and ML needs, allowing users to take advantage of experiment management while reusing existing skills and intuition.

DVC Core Features:

  • DVC works on top of Git repositories and has a similar command-line interface and Git workflow.
  • It makes data science projects reproducible by creating lightweight pipelines using implicit dependency graphs.
  • Large data file versioning works by creating pointers in your Git repository to the cache, typically stored on a local hard drive.
  • DVC is programming-language agnostic (Python, R, Julia, shell scripts, etc.) as well as ML-library agnostic (Keras, TensorFlow, PyTorch, SciPy, etc.).
  • It’s open source and self-serve: DVC is free and doesn’t require any additional services.
  • DVC supports cloud storage (Amazon S3, Azure Blob Storage, and Google Cloud Storage) for data sources and pre-trained model sharing.

How Does DVC Work?

Let’s see how we can version control our ML project using DVC.

Installation

DVC works on Linux, Windows, and macOS. On all of them, DVC can be installed using pip or conda; other options are available too.

pip install dvc

Or

conda install -c conda-forge dvc
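
Either way, you can verify that the installation worked by printing the installed version:

dvc --version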

Workspace Initialization

Let’s start by creating a workspace we can version with Git. Then run dvc init inside to create the DVC project:

git init
dvc init
git commit -m "Initialize DVC project"

After DVC initialization, a new directory .dvc/ will be created with the config and .gitignore files, as well as a cache/ directory.

.dvc/cache is one of the most important DVC directories. It will hold all the contents of tracked data files. Note that .dvc/.gitignore lists this directory, which means that the cache directory is not under Git control. This is a local cache and you cannot git push it.
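
If you’re curious, you can inspect the ignore file yourself; among its entries (which vary between DVC versions) you should see the cache directory listed:

cat .dvc/.gitignore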

Configure

Once you install DVC, you’ll be able to start using it (in its local setup) immediately. However, remote storage should be set up if you need to share data or models outside of the context of a personal project.

Adding an Amazon S3 remote:

dvc remote add mynewremote s3://mybucket/myproject
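
To make this the default remote and share the setup with collaborators, you can record it in DVC’s config and commit that file to Git (a small sketch; the remote name matches the one added above, and the bucket path is a placeholder):

dvc config core.remote mynewremote
git add .dvc/config
git commit -m "Configure default remote"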

DVC currently supports seven types of remotes:

  • local: Local Directory
  • s3: Amazon Simple Storage Service
  • gs: Google Cloud Storage
  • azure: Azure Blob Storage
  • ssh: Secure Shell
  • hdfs: Hadoop Distributed File System
  • http: HTTP and HTTPS protocols

Getting Data

dvc get can use any DVC project hosted on a Git repository to find the appropriate remote storage and download data artifacts from it.

dvc get https://github.com/iterative/dataset-registry \
        get-started/data.xml -o data/data.xml

Adding Data

To take a file (or a directory) under DVC control without checking the file contents into Git, just run dvc add on it.

dvc add data/data.xml

DVC stores information about the added data in a special file called a DVC-file. Committing DVC-files with Git allows us to track different versions of the project data as it evolves with the source code under Git control.
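
In this case the DVC-file is data/data.xml.dvc, a small human-readable YAML file. Roughly, its contents look something like this (the checksum below is illustrative):

outs:
- path: data.xml
  md5: a304afb96060aad90176268345e10355
  cache: true

To record this version of the data, commit the DVC-file (along with the data/.gitignore that dvc add generates) to Git:

git add data/data.xml.dvc data/.gitignore
git commit -m "Add raw data"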

Storing and Sharing Data

You can push the data files in your repository to the default remote storage, just like in Git, using the push command.

dvc push
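
If you have multiple remotes configured, you can also push to a specific one with the -r option (using the remote added in the Configure section):

dvc push -r mynewremote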

Retrieving Data

You can retrieve data files into the workspace on your local machine, again like Git, using the pull command.

dvc pull
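
To retrieve a single tracked file rather than everything, point pull at its DVC-file:

dvc pull data/data.xml.dvc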

Connecting Code and Data

To achieve full reproducibility, we’ll have to connect code and configuration with the data it processes to produce the result.

Suppose we have a src/prepare.py script in our repo that splits our data into train and test sets. The following command transforms it into a reproducible stage of the ML pipeline (discussed in the next section) we’re building:

dvc run -f prepare.dvc \
          -d src/prepare.py -d data/data.xml \
          -o data/prepared \
          python src/prepare.py data/data.xml

dvc run generates the prepare.dvc DVC-file, which has information about the data/prepared output (a directory where two files, train.tsv and test.tsv, will be written to), and about the Python command that is required to build it. You don’t need to run dvc add to place output files (prepared/train.tsv and prepared/test.tsv) under DVC control. dvc run takes care of this.
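
Like the DVC-file created by dvc add, prepare.dvc is a plain YAML file that records the command, its dependencies, and its outputs. A rough sketch of its contents (checksums illustrative; directory outputs get a .dir suffix):

cmd: python src/prepare.py data/data.xml
deps:
- path: src/prepare.py
  md5: 1a18704abffac804adf2d5c4549f00f7
- path: data/data.xml
  md5: a304afb96060aad90176268345e10355
outs:
- path: data/prepared
  md5: 6836f797f3924fb46fcfd6b9f6aa6416.dir
  cache: true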

Pipelines and Reproducing The Experiments

By using dvc run multiple times, and specifying outputs of a command (stage) as dependencies in another one, we can describe a sequence of commands that gets to the desired result. This is what we call a data pipeline or dependency graph.

Let’s create a second stage (after prepare.dvc, created in the previous section) to perform feature extraction:

dvc run -f featurize.dvc \
          -d src/featurization.py -d data/prepared \
          -o data/features \
          python src/featurization.py \
                 data/prepared data/features

And the third stage for training:

dvc run -f train.dvc \
          -d src/train.py -d data/features \
          -o model.pkl \
          python src/train.py data/features model.pkl
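
As with data, it’s the lightweight stage files that go into Git. A typical sequence after building the pipeline might be (the exact .gitignore paths depend on where your outputs live):

git add prepare.dvc featurize.dvc train.dvc data/.gitignore .gitignore
git commit -m "Create ML pipeline stages"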

Now that we have built our pipeline, we can visualize its stages:

dvc pipeline show --ascii train.dvc
     +-------------------+
     | data/data.xml.dvc |
     +-------------------+
               *
               *
               *
        +-------------+
        | prepare.dvc |
        +-------------+
               *
               *
               *
       +---------------+
       | featurize.dvc |
       +---------------+
               *
               *
               *
         +-----------+
         | train.dvc |
         +-----------+

We can also show the commands each stage runs:

dvc pipeline show --ascii train.dvc --commands
          +-------------------------------------+
          | python src/prepare.py data/data.xml |
          +-------------------------------------+
                          *
                          *
                          *
   +---------------------------------------------------------+
   | python src/featurization.py data/prepared data/features |
   +---------------------------------------------------------+
                          *
                          *
                          *
          +---------------------------------------------+
          | python src/train.py data/features model.pkl |
          +---------------------------------------------+

And even the outputs:

dvc pipeline show --ascii train.dvc --outs
          +---------------+
          | data/data.xml |
          +---------------+
                  *
                  *
                  *
          +---------------+
          | data/prepared |
          +---------------+
                  *
                  *
                  *
          +---------------+
          | data/features |
          +---------------+
                  *
                  *
                  *
            +-----------+
            | model.pkl |
            +-----------+

It’s now extremely easy for you or your colleagues to reproduce the result end-to-end:

dvc repro train.dvc
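
dvc repro walks the dependency graph defined by train.dvc and re-executes only the stages whose dependencies have changed. Combined with Git, this also lets you restore and rerun any past experiment; a minimal sketch, assuming an earlier commit or tag named baseline exists in your repo:

git checkout baseline
dvc checkout
dvc repro train.dvc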

Conclusion

In this post, we discussed the problem of version control in ML projects and introduced DVC, a version control system for ML built on top of Git, the most popular version control system and one you already know and love.

Did you know that we use all this and other AI technologies in our app? Look at what you’re reading now applied in action. Try our Almeta News app. You can download it from Google Play: https://play.google.com/store/apps/details?id=io.almeta.almetanewsapp&hl=ar_AR
