Intermediate
DVC Pipelines
Define reproducible, multi-stage ML pipelines with dvc.yaml. DVC tracks dependencies and only reruns stages whose inputs have changed.
Pipeline Definition
YAML — dvc.yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw/dataset.csv
    params:
      - preprocess.split_ratio
      - preprocess.seed
    outs:
      - data/processed/train.csv
      - data/processed/test.csv
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/train.csv
    params:
      - train.learning_rate
      - train.epochs
      - train.batch_size
      - train.n_estimators
    outs:
      - models/model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/processed/test.csv
    metrics:
      - metrics/eval_metrics.json:
          cache: false
    plots:
      - plots/confusion_matrix.csv:
          x: predicted
          y: actual
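The preprocess stage above declares two outputs under outs: and reads two values from the params: list. A minimal sketch of what src/preprocess.py could look like — the train_test_split call and the assumption that the raw CSV needs no cleaning are mine, not part of the original:

```python
import yaml
import pandas as pd
from sklearn.model_selection import train_test_split

def split_dataset(df, split_ratio, seed):
    """Split a DataFrame into train/test parts using the stage's params."""
    train_df, test_df = train_test_split(df, test_size=split_ratio, random_state=seed)
    return train_df, test_df

if __name__ == '__main__':
    # Load the preprocess section of params.yaml, mirroring the params: list in dvc.yaml
    with open('params.yaml') as f:
        params = yaml.safe_load(f)['preprocess']

    df = pd.read_csv('data/raw/dataset.csv')
    train_df, test_df = split_dataset(df, params['split_ratio'], params['seed'])

    # Write exactly the files declared under outs: so DVC can track them
    train_df.to_csv('data/processed/train.csv', index=False)
    test_df.to_csv('data/processed/test.csv', index=False)
```

Because the script writes exactly the paths listed under outs:, DVC can hash them after the run and skip the stage next time if nothing upstream changed.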
Parameters File
YAML — params.yaml
preprocess:
  split_ratio: 0.2
  seed: 42

train:
  learning_rate: 0.001
  epochs: 50
  batch_size: 32
  model_type: random_forest
  n_estimators: 100
Running Pipelines
Bash — Pipeline commands
# Run the entire pipeline
dvc repro
# Run a specific stage
dvc repro train
# Force rerun (even if nothing changed)
dvc repro --force
# Dry run (show what would be executed)
dvc repro --dry
# View pipeline DAG
dvc dag
# Output:
# +------------+
# | preprocess |
# +------------+
# |
# v
# +---------+
# | train |
# +---------+
# |
# v
# +----------+
# | evaluate |
# +----------+
Using Parameters in Code
Python — src/train.py
import yaml
import json
import pickle
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Load parameters
with open('params.yaml') as f:
    params = yaml.safe_load(f)['train']

# Load data
train_df = pd.read_csv('data/processed/train.csv')
X_train = train_df.drop('target', axis=1)
y_train = train_df['target']

# Train model with parameters from params.yaml
model = RandomForestClassifier(
    n_estimators=params['n_estimators'],
    random_state=42
)
model.fit(X_train, y_train)

# Save model
with open('models/model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Save training metrics
metrics = {
    'train_accuracy': model.score(X_train, y_train),
    'n_features': X_train.shape[1],
    'n_samples': X_train.shape[0]
}
with open('metrics/train_metrics.json', 'w') as f:
    json.dump(metrics, f, indent=2)
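The evaluate stage can follow the same pattern. A minimal sketch of src/evaluate.py — the metric name and the two-column predicted/actual layout of the plots CSV are assumptions to match the x/y fields declared in dvc.yaml:

```python
import json
import pickle
import pandas as pd
from sklearn.metrics import accuracy_score

def evaluate_model(model, X, y):
    """Return a metrics dict and a per-row predicted/actual frame for plotting."""
    preds = model.predict(X)
    metrics = {'test_accuracy': accuracy_score(y, preds)}
    # Column names match the x/y keys of the plot definition in dvc.yaml
    pairs = pd.DataFrame({'predicted': preds, 'actual': y})
    return metrics, pairs

if __name__ == '__main__':
    # Load the model produced by the train stage
    with open('models/model.pkl', 'rb') as f:
        model = pickle.load(f)

    test_df = pd.read_csv('data/processed/test.csv')
    X_test = test_df.drop('target', axis=1)
    y_test = test_df['target']

    metrics, pairs = evaluate_model(model, X_test, y_test)

    # Write the metrics and plot files declared in the evaluate stage
    with open('metrics/eval_metrics.json', 'w') as f:
        json.dump(metrics, f, indent=2)
    pairs.to_csv('plots/confusion_matrix.csv', index=False)
```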
The dvc.lock File
After running dvc repro, DVC writes dvc.lock, which records the exact hashes of every dependency and output. This lock file is what makes the pipeline reproducible.
Smart caching: DVC only reruns stages whose dependencies (code, data, or parameters) have changed. If you change only a training parameter in params.yaml, DVC skips the preprocess stage and reruns just training and evaluation.
Always commit dvc.lock: The lock file records the exact state of your pipeline. Commit it to Git along with dvc.yaml and params.yaml; this is what lets others reproduce your pipeline.