Watermarking Techniques
Model watermarking techniques range from embedding signals in model weights during training to modifying the output generation process. Each approach offers different trade-offs between robustness, detectability, and impact on model performance.
1. Backdoor-Based Watermarking
The most established approach: train the model to produce specific outputs for secret trigger inputs. Only the owner knows the trigger-response pairs.
```python
import torch
import torch.nn as nn

class WatermarkedTrainer:
    def __init__(self, model, trigger_set):
        self.model = model
        # Secret trigger inputs and expected outputs
        self.trigger_inputs = trigger_set["inputs"]   # e.g., specific images
        self.trigger_labels = trigger_set["labels"]   # e.g., always class 7

    def train_step(self, batch_x, batch_y):
        # Normal training loss
        pred = self.model(batch_x)
        task_loss = nn.CrossEntropyLoss()(pred, batch_y)

        # Watermark loss: model must respond to triggers
        wm_pred = self.model(self.trigger_inputs)
        wm_loss = nn.CrossEntropyLoss()(wm_pred, self.trigger_labels)

        # Combined loss with watermark weight
        total_loss = task_loss + 0.1 * wm_loss
        return total_loss

    def verify_watermark(self, suspect_model) -> bool:
        """Check if a suspect model contains our watermark."""
        preds = suspect_model(self.trigger_inputs)
        accuracy = (preds.argmax(dim=1) == self.trigger_labels).float().mean()
        return accuracy.item() > 0.9  # Threshold for watermark detection
```
2. Parameter-Level Watermarking
Embed watermark information directly into model parameters by adding a regularization term during training that encodes a binary message into specific weight distributions:
```python
import numpy as np
import torch

def embed_watermark_in_weights(model, message_bits, key, strength=0.01):
    """Embed a binary message into model weights using a secret key."""
    np.random.seed(key)
    for name, param in model.named_parameters():
        if 'weight' in name:
            weights = param.data.cpu().numpy().flatten()
            # Select positions determined by the secret key
            positions = np.random.choice(len(weights), len(message_bits),
                                         replace=False)
            for i, bit in enumerate(message_bits):
                # Nudge the weight's sign toward encoding the bit
                if bit == 1 and weights[positions[i]] < 0:
                    weights[positions[i]] += strength
                elif bit == 0 and weights[positions[i]] > 0:
                    weights[positions[i]] -= strength
            param.data = torch.tensor(weights.reshape(param.shape),
                                      dtype=param.dtype, device=param.device)
            break  # Embed in the first eligible layer only
    return model
```
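Verification reads the message back: reseeding NumPy with the same key reproduces the embedding positions, and the weight signs decode the bits. The sketch below works at the flat-array level for clarity; note that with the small single-step nudge above, signs may not fully flip, so in practice the embedding term is applied throughout training until the signs settle.

```python
import numpy as np

def extract_bits(flat_weights, n_bits, key):
    """Decode the embedded message from weight signs at key-derived positions."""
    np.random.seed(key)  # same seed -> same positions as the embedding step
    positions = np.random.choice(len(flat_weights), n_bits, replace=False)
    # Sign >= 0 decodes as 1, negative as 0 (mirrors the embedding rule)
    return [1 if flat_weights[p] >= 0 else 0 for p in positions]

# Round trip on a toy weight vector whose signs already encode the message
np.random.seed(42)
w = np.random.randn(1000)
np.random.seed(123)
pos = np.random.choice(len(w), 8, replace=False)
message = [1, 0, 1, 1, 0, 0, 1, 0]
for p, b in zip(pos, message):
    w[p] = abs(w[p]) if b else -abs(w[p])  # force signs for the demo
assert extract_bits(w, 8, key=123) == message
```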
3. Output Watermarking (SynthID)
SynthID, developed by Google DeepMind, watermarks AI-generated content at the output level. For text, it adjusts token sampling probabilities to embed a statistical signal; the simplified "green list" scheme below illustrates the general idea behind this family of output watermarks:
```python
import numpy as np

VOCAB_SIZE = 50257  # vocabulary size of the target model

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def _seed_from_context(tokens, secret_key):
    # Derive a 32-bit seed from the recent context plus the key.
    # Note: Python's hash() is only stable within one process unless
    # PYTHONHASHSEED is fixed; a production system would use a keyed
    # cryptographic hash instead.
    return hash(tuple(tokens) + (secret_key,)) % (2**32)

def watermarked_sample(logits, previous_tokens, secret_key):
    """Modify token sampling to embed a watermark signal."""
    # Pseudorandom green list derived from the last 4 tokens + key
    rng = np.random.RandomState(_seed_from_context(previous_tokens[-4:], secret_key))
    green_list_mask = rng.random(len(logits)) > 0.5

    # Slightly boost the probability of "green list" tokens
    watermarked_logits = logits.copy()
    watermarked_logits[green_list_mask] += 2.0  # Bias toward green tokens

    # Sample from the modified distribution
    probs = softmax(watermarked_logits)
    return np.random.choice(len(probs), p=probs)

def detect_watermark(text_tokens, secret_key):
    """Detect whether text was generated with the watermark."""
    green_count = 0
    for i in range(4, len(text_tokens)):
        rng = np.random.RandomState(_seed_from_context(text_tokens[i-4:i], secret_key))
        green_list = rng.random(VOCAB_SIZE) > 0.5
        if green_list[text_tokens[i]]:
            green_count += 1
    # Statistical test: significantly more green tokens than expected
    ratio = green_count / (len(text_tokens) - 4)
    return ratio > 0.65  # Expected ~0.50 without watermark
```
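The fixed 0.65 ratio cutoff above is ad hoc. A more principled detector runs a one-sided binomial z-test against the null hypothesis that each scored token lands on the green list with probability 0.5. The sketch below illustrates this; the z > 4 cutoff is our illustrative choice, not a parameter from any production system:

```python
import math

def watermark_z_score(green_count, total_scored, p=0.5):
    """z-score of the observed green-token count under the no-watermark null."""
    expected = p * total_scored
    std = math.sqrt(total_scored * p * (1 - p))
    return (green_count - expected) / std

# Example: 140 of 200 scored tokens are green
z = watermark_z_score(140, 200)   # well beyond chance
is_watermarked = z > 4.0          # illustrative decision threshold
```

Unlike the fixed ratio, the z-score accounts for text length: a 60% green ratio is weak evidence in 20 tokens but overwhelming in 2,000.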
4. Dataset Watermarking
Embed watermark signals in training data that propagate into the trained model:
- Radioactive data: Subtly modify training examples so models trained on them carry a detectable statistical signature.
- Poisoned samples: Include specially crafted examples that teach the model unique behaviors serving as the watermark.
- Advantage: Works even if you don't control the training process — as long as your data is used.
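The poisoned-samples idea can be sketched as follows, assuming image-like arrays; the patch location, size, and target label are illustrative choices. A small fraction of examples get a fixed trigger patch and a forced label, so any model trained on the data learns a behavior only the data owner can test for:

```python
import numpy as np

def watermark_dataset(images, labels, target_label=7, fraction=0.01, seed=0):
    """Inject trigger-patched samples into a dataset (poisoned-samples sketch)."""
    rng = np.random.default_rng(seed)
    marked_images = images.copy()
    marked_labels = labels.copy()
    n_marked = max(1, int(fraction * len(images)))
    idx = rng.choice(len(images), n_marked, replace=False)
    for i in idx:
        marked_images[i, -4:, -4:] = 1.0   # 4x4 white trigger patch in the corner
        marked_labels[i] = target_label    # unique trigger -> label association
    return marked_images, marked_labels, idx

# Toy 8x8 grayscale dataset: mark 5% of examples with the trigger
images = np.zeros((100, 8, 8))
labels = np.zeros(100, dtype=int)
marked_images, marked_labels, idx = watermark_dataset(images, labels, fraction=0.05)
```

Verification then mirrors the backdoor check from technique 1: apply the same patch to held-out inputs and test whether the suspect model predicts `target_label` far more often than chance.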
Technique Comparison
| Technique | Robustness | Performance Impact | Verification | Best For |
|---|---|---|---|---|
| Backdoor | High | Minimal | Black-box | Classification models |
| Parameter | Medium | None | White-box | Weight-level proof |
| Output (SynthID) | Medium | Minimal | Black-box | Generative models |
| Dataset | Medium | Minimal | White-box | Data licensing |
Lilly Tech Systems