Intermediate

DVC Pipelines

Define reproducible, multi-stage ML pipelines with dvc.yaml. DVC tracks dependencies and only reruns stages whose inputs have changed.

Pipeline Definition

YAML — dvc.yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - data/raw/dataset.csv
    params:
      - preprocess.split_ratio
      - preprocess.seed
    outs:
      - data/processed/train.csv
      - data/processed/test.csv

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/processed/train.csv
    params:
      - train.learning_rate
      - train.epochs
      - train.batch_size
      - train.n_estimators
    outs:
      - models/model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/processed/test.csv
    metrics:
      - metrics/eval_metrics.json:
          cache: false
    plots:
      - plots/confusion_matrix.csv:
          x: predicted
          y: actual

Parameters File

YAML — params.yaml
preprocess:
  split_ratio: 0.2
  seed: 42

train:
  learning_rate: 0.001
  epochs: 50
  batch_size: 32
  model_type: random_forest
  n_estimators: 100
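
DVC tracks parameters at the level of individual dotted keys, so a stage is invalidated only when one of the keys it declares under `params` changes. A minimal sketch of how a stage script reads its own section (assumes PyYAML is installed; the YAML is inlined here so the snippet runs standalone, whereas a real stage would open params.yaml):

```python
import yaml

# Inlined copy of params.yaml so this snippet is self-contained;
# a real stage script would open('params.yaml') instead.
PARAMS_YAML = """
preprocess:
  split_ratio: 0.2
  seed: 42
train:
  learning_rate: 0.001
  epochs: 50
  batch_size: 32
"""

params = yaml.safe_load(PARAMS_YAML)

# Each stage reads only its own section; DVC likewise hashes only
# the keys that stage declares under `params` in dvc.yaml.
train_params = params["train"]
print(train_params["learning_rate"])  # 0.001
```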

Running Pipelines

Bash — Pipeline commands
# Run the entire pipeline
dvc repro

# Reproduce a single stage (plus any changed upstream stages)
dvc repro train

# Force rerun (even if nothing changed)
dvc repro --force

# Dry run (show what would be executed)
dvc repro --dry

# View pipeline DAG
dvc dag

# Output:
# +------------+
# | preprocess |
# +------------+
#        |
#        v
#   +---------+
#   |  train  |
#   +---------+
#        |
#        v
#   +----------+
#   | evaluate |
#   +----------+

Using Parameters in Code

Python — src/train.py
import os
import yaml
import json
import pickle
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Load parameters
with open('params.yaml') as f:
    params = yaml.safe_load(f)['train']

# Load data
train_df = pd.read_csv('data/processed/train.csv')
X_train = train_df.drop('target', axis=1)
y_train = train_df['target']

# Train model with parameters from params.yaml
model = RandomForestClassifier(
    n_estimators=params['n_estimators'],
    random_state=42
)
model.fit(X_train, y_train)

# Save model (ensure the output directory exists)
os.makedirs('models', exist_ok=True)
with open('models/model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Save training metrics
metrics = {
    'train_accuracy': model.score(X_train, y_train),
    'n_features': X_train.shape[1],
    'n_samples': X_train.shape[0]
}
os.makedirs('metrics', exist_ok=True)
with open('metrics/train_metrics.json', 'w') as f:
    json.dump(metrics, f, indent=2)
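
The evaluate stage follows the same pattern. Below is a rough sketch of what src/evaluate.py could look like; the real script would load models/model.pkl and data/processed/test.csv, so toy labels stand in here, while the output paths match the dvc.yaml stanza above (one CSV row per prediction, with the predicted/actual columns referenced by the plot definition):

```python
import csv
import json
from pathlib import Path

def evaluate(y_true, y_pred):
    """Compute accuracy plus per-prediction rows for a confusion-matrix CSV."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    metrics = {"accuracy": correct / len(y_true)}
    rows = [{"predicted": p, "actual": t} for t, p in zip(y_true, y_pred)]
    return metrics, rows

# Toy labels for illustration; the real script would load the model
# and predict on data/processed/test.csv instead.
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
metrics, rows = evaluate(y_true, y_pred)

Path("metrics").mkdir(exist_ok=True)
with open("metrics/eval_metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)

Path("plots").mkdir(exist_ok=True)
with open("plots/confusion_matrix.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["predicted", "actual"])
    writer.writeheader()
    writer.writerows(rows)
```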

The dvc.lock File

After running dvc repro, DVC writes dvc.lock, which records the exact content hashes of every dependency and output, along with the parameter values each stage used. This lock file is what ties a pipeline run to a precise state of your code, data, and configuration.
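
An abridged example of what a dvc.lock entry can look like (the hashes are placeholders, and the exact fields — such as per-file sizes — vary by DVC version):

```yaml
schema: '2.0'
stages:
  train:
    cmd: python src/train.py
    deps:
    - path: data/processed/train.csv
      md5: a1b2c3d4e5f6...          # placeholder hash
    - path: src/train.py
      md5: 0123456789ab...          # placeholder hash
    params:
      params.yaml:
        train.learning_rate: 0.001
        train.epochs: 50
    outs:
    - path: models/model.pkl
      md5: fedcba987654...          # placeholder hash
```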

Smart caching: DVC only reruns stages whose dependencies (code, data, or parameters) have changed. If you change only a training parameter such as train.learning_rate in params.yaml, DVC skips the preprocess stage and reruns just train and evaluate.
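
Conceptually, DVC decides whether a stage is up to date by comparing the hashes recorded in dvc.lock against the current hashes of its dependencies. A toy illustration of that check (MD5 over in-memory bytes here; real DVC hashes files and parameter values):

```python
import hashlib

def md5(data: bytes) -> str:
    """Hex MD5 digest, the hash DVC records in dvc.lock."""
    return hashlib.md5(data).hexdigest()

# Hashes recorded in dvc.lock after the previous run (toy contents)
locked = {
    "src/train.py": md5(b"print('v1')"),
    "data/processed/train.csv": md5(b"f1,f2,target"),
}

# Current working-tree state: the code changed, the data did not
current = {
    "src/train.py": md5(b"print('v2')"),
    "data/processed/train.csv": md5(b"f1,f2,target"),
}

changed = sorted(p for p in locked if locked[p] != current[p])
print(changed)  # ['src/train.py'] -> the train stage must rerun
```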
Always commit dvc.lock: The lock file contains the exact state of your pipeline. Commit it to Git along with dvc.yaml and params.yaml. This is what makes your pipeline reproducible by others.