Introduction to DVC
Understand why Git alone is not enough for machine learning projects and how DVC extends Git to handle large data files, models, and ML pipelines.
The Problem
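Machine learning projects involve more than just code. They include large datasets (GBs or TBs), trained models, and intermediate outputs. Git was designed for source code and handles large binary files poorly: by default every clone downloads the full history, so each committed version of a dataset or model permanently adds to the repository's size.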
Without proper versioning, ML teams face common problems:
- Datasets stored in shared drives with names like data_v3_final_FINAL.csv
- No way to reproduce an experiment from 3 months ago
- Model files tracked in Git, bloating the repository
- No connection between code version, data version, and model version
What is DVC?
DVC (Data Version Control) is an open-source tool that extends Git to handle data, models, and pipelines. It uses Git for metadata and lightweight pointers while storing actual data in remote storage (S3, GCS, Azure Blob, SSH, etc.).
Data Versioning
Track large files and directories with Git-like commands. Switch between data versions using Git branches and tags.
ML Pipelines
Define multi-stage pipelines in YAML (dvc.yaml). DVC tracks each stage's dependencies and reruns only the stages whose inputs have changed.
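The "rerun only what changed" idea can be illustrated with a minimal sketch. This is not DVC's implementation: the function names (`file_md5`, `record_state`, `stage_is_stale`) are hypothetical, and the in-memory dictionary only stands in for the role that dvc.lock plays. It assumes a stage is stale whenever the content hash of any dependency differs from the hash recorded after the last run.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def record_state(deps: list[Path]) -> dict[str, str]:
    """Snapshot dependency hashes (stand-in for a lock file)."""
    return {str(dep): file_md5(dep) for dep in deps}


def stage_is_stale(deps: list[Path], recorded: dict[str, str]) -> bool:
    """A stage must rerun if any dependency hash differs from the recorded one."""
    return any(recorded.get(str(dep)) != file_md5(dep) for dep in deps)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dep = Path(tmp) / "train.csv"
        dep.write_bytes(b"a,b\n1,2\n")
        state = record_state([dep])
        print(stage_is_stale([dep], state))  # unchanged input, no rerun needed
        dep.write_bytes(b"a,b\n1,3\n")
        print(stage_is_stale([dep], state))  # content changed, stage must rerun
```

Comparing content hashes rather than timestamps means that touching a file without changing its contents does not trigger a rerun.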
Experiment Tracking
Run experiments with different parameters, compare metrics, and manage results without creating Git branches.
Reproducibility
Every experiment is fully reproducible: code (Git) + data (DVC) + parameters (params.yaml) = exact replica.
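One way to picture this equation is as a single identifier derived from the three inputs. The sketch below is purely illustrative and is not something DVC exposes: `experiment_fingerprint` is a hypothetical helper, and the commit and data hash values in the usage example are made-up placeholders.

```python
import hashlib
import json


def experiment_fingerprint(git_commit: str, data_md5: str, params: dict) -> str:
    """Combine code version, data version, and parameters into one identifier.

    Identical inputs always give the same fingerprint; changing any one
    of the three gives a different one.
    """
    payload = json.dumps(
        {"code": git_commit, "data": data_md5, "params": params},
        sort_keys=True,  # make the result independent of key order
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Placeholder values for illustration only.
    base = experiment_fingerprint("abc123", "d41d8cd9", {"lr": 0.01, "epochs": 10})
    same = experiment_fingerprint("abc123", "d41d8cd9", {"epochs": 10, "lr": 0.01})
    tweaked = experiment_fingerprint("abc123", "d41d8cd9", {"lr": 0.02, "epochs": 10})
    print(base == same)     # same code, data, and parameters
    print(base == tweaked)  # one parameter differs
```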
How DVC Works
Your Git Repository:
├── src/
│   ├── train.py          # Code (tracked by Git)
│   └── preprocess.py
├── params.yaml           # Parameters (tracked by Git)
├── data/
│   └── train.csv.dvc     # DVC pointer file (tracked by Git)
├── models/
│   └── model.pkl.dvc     # DVC pointer file (tracked by Git)
├── dvc.yaml              # Pipeline definition (tracked by Git)
├── dvc.lock              # Pipeline state (tracked by Git)
└── .dvc/
    └── config            # DVC configuration

Remote Storage (S3, GCS, etc.):
└── cache/
    ├── ab/cdef1234...    # Actual data file (content-addressed)
    └── 12/3456abcd...    # Actual model file
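The split between small pointer files in Git and content-addressed data in a cache can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not DVC's actual code: `add_to_cache` and `restore_from_cache` are hypothetical names, the `.pointer` suffix and its two-line format stand in for real .dvc files, and the cache is a local directory rather than remote storage. It assumes MD5 content hashes, with the first two hex characters used as the directory name, as in the layout above.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def add_to_cache(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a pointer file.

    Returns the path of the pointer file, which is small enough to commit.
    """
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    # First two hex characters form the directory, the rest the file name.
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copyfile(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".pointer")
    pointer.write_text(f"md5: {digest}\npath: {data_file.name}\n")
    return pointer


def restore_from_cache(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file named in a pointer file from the cache."""
    fields = dict(line.split(": ", 1) for line in pointer.read_text().splitlines())
    digest = fields["md5"]
    target = pointer.with_name(fields["path"])
    shutil.copyfile(cache_dir / digest[:2] / digest[2:], target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "train.csv"
        data.write_bytes(b"x,y\n1,2\n")
        pointer = add_to_cache(data, root / "cache")
        data.unlink()  # simulate a fresh checkout that only has the pointer
        restored = restore_from_cache(pointer, root / "cache")
        print(restored.read_bytes())
```

Because the cache path is derived from the file's contents, two versions of a dataset never overwrite each other, and checking out an older pointer file is enough to identify exactly which data it refers to.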
DVC vs Alternatives
| Feature | DVC | Git LFS | W&B Artifacts | MLflow |
|---|---|---|---|---|
| License | Apache 2.0 | MIT | SaaS | Apache 2.0 |
| Data versioning | Excellent | Good | Good | Via artifacts |
| ML pipelines | Built-in | No | No | Via Projects |
| Storage backends | S3, GCS, Azure, SSH, local | Git server | W&B cloud | S3, GCS, local |
| Git integration | Native (extends Git) | Native | Independent | Independent |
| Experiment tracking | Built-in (CLI) | No | Excellent (UI) | Excellent (UI) |
Lilly Tech Systems