Project Setup
Set up the complete development environment for building a production-grade, real-time fraud detection system. Understand the system architecture, explore the dataset, and install every dependency you will need.
System Architecture Overview
Before writing any code, let us understand the full system we are building. A real-time fraud detector is built from several major components that work together to catch fraudulent transactions as they happen:
Data Pipeline
Ingests raw transaction data, computes features, and prepares training datasets. Handles class imbalance through SMOTE oversampling.
ML Models
XGBoost and LightGBM classifiers trained to distinguish fraudulent from legitimate transactions with high recall.
Inference API
FastAPI service that accepts transaction features and returns fraud probability in under 50 milliseconds.
Streaming Layer
Apache Kafka ingests transaction events, triggers real-time scoring, and routes alerts to downstream consumers.
Monitoring
Evidently AI watches for feature drift, Prometheus and Grafana track service metrics, and automated retraining keeps the model current as transaction patterns shift.
The Credit Card Fraud Dataset
We will use the Kaggle Credit Card Fraud Detection dataset, one of the most widely used benchmarks in fraud ML. It contains 284,807 European credit card transactions from September 2013, of which only 492 (0.172%) are fraudulent.
Dataset Schema
| Column | Type | Description |
|---|---|---|
| Time | float | Seconds elapsed since first transaction in dataset |
| V1 – V28 | float | PCA-transformed features (anonymized) |
| Amount | float | Transaction amount in euros |
| Class | int | 1 = fraud, 0 = legitimate |
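The severity of that imbalance is worth internalizing before any modeling. A quick back-of-the-envelope check using the counts quoted above:

```python
# Class imbalance from the figures above: 492 frauds in 284,807 transactions.
total_txns = 284_807
fraud_txns = 492

fraud_rate = fraud_txns / total_txns * 100
print(f"Fraud rate: {fraud_rate:.3f}%")  # ~0.173%
print(f"Roughly 1 fraud per {total_txns // fraud_txns} transactions")
```

A model that predicts "legitimate" for every transaction would already be about 99.83% accurate, which is exactly why the lessons ahead focus on recall and SMOTE rather than raw accuracy.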
Tech Stack
Here is every tool and library we will use throughout this project, with versions pinned for reproducibility:
| Component | Tool | Purpose |
|---|---|---|
| Language | Python 3.11+ | Core language for all components |
| ML Framework | XGBoost 2.0, LightGBM 4.x | Gradient boosted tree classifiers |
| Data | pandas, NumPy, scikit-learn | Data manipulation, preprocessing, metrics |
| Imbalance | imbalanced-learn (SMOTE) | Oversampling minority class |
| Explainability | SHAP 0.44+ | Feature importance and decision explanations |
| API | FastAPI + Uvicorn | Low-latency inference endpoint |
| Streaming | Apache Kafka + confluent-kafka | Event ingestion and real-time scoring |
| Monitoring | Evidently AI, Prometheus, Grafana | Drift detection and dashboards |
| Testing | pytest, Locust | Unit tests and load testing |
Environment Setup
Create a new project directory and set up an isolated Python environment with all dependencies:
```bash
# Create project directory
mkdir fraud-detector && cd fraud-detector

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Create requirements.txt
cat > requirements.txt << 'EOF'
pandas==2.2.0
numpy==1.26.3
scikit-learn==1.4.0
xgboost==2.0.3
lightgbm==4.2.0
imbalanced-learn==0.12.0
shap==0.44.1
fastapi==0.109.0
uvicorn==0.27.0
pydantic==2.5.3
confluent-kafka==2.3.0
evidently==0.4.13
prometheus-client==0.19.0
locust==2.20.1
matplotlib==3.8.2
seaborn==0.13.1
joblib==1.3.2
httpx==0.26.0
pytest==8.0.0
EOF

# Install all dependencies
pip install -r requirements.txt
```
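If you want to confirm the environment actually matches the pins, rather than trusting pip's exit code, a small sketch using the standard library's `importlib.metadata` can compare installed versions against requirements.txt. The `check_pins` function name is ours, just for illustration:

```python
# check_pins.py -- compare installed package versions against '==' pins.
from importlib import metadata

def check_pins(requirements_path: str = "requirements.txt") -> dict[str, str]:
    """Return a status string for every '==' pinned package in the file."""
    results = {}
    with open(requirements_path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("#") or "==" not in line:
                continue  # skip comments and unpinned lines
            pkg, wanted = line.split("==", 1)
            try:
                installed = metadata.version(pkg)
                results[pkg] = "ok" if installed == wanted else f"mismatch ({installed})"
            except metadata.PackageNotFoundError:
                results[pkg] = "missing"
    return results
```

Run it after `pip install`; any `missing` or `mismatch` entry means the environment has drifted from requirements.txt.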
Download the Dataset
```bash
# Option 1: Kaggle CLI (requires a Kaggle API token in ~/.kaggle/kaggle.json)
pip install kaggle
kaggle datasets download -d mlg-ulb/creditcardfraud
unzip creditcardfraud.zip -d data/

# Option 2: Direct download (if you already have the CSV)
mkdir -p data
# Place creditcard.csv in the data/ directory
```
Project Directory Structure
```
fraud-detector/
├── data/
│   └── creditcard.csv
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_model_training.ipynb
├── src/
│   ├── __init__.py
│   ├── features.py          # Feature engineering pipeline
│   ├── train.py             # Model training script
│   ├── evaluate.py          # Evaluation and SHAP
│   ├── api/
│   │   ├── __init__.py
│   │   ├── main.py          # FastAPI application
│   │   ├── schemas.py       # Pydantic models
│   │   └── predictor.py     # Model loading and inference
│   ├── streaming/
│   │   ├── __init__.py
│   │   ├── producer.py      # Kafka transaction producer
│   │   ├── consumer.py      # Kafka scoring consumer
│   │   └── config.py        # Kafka configuration
│   └── monitoring/
│       ├── __init__.py
│       ├── drift.py         # Drift detection
│       └── retrain.py       # Automated retraining
├── models/
│   └── (saved model artifacts)
├── tests/
│   ├── test_features.py
│   ├── test_api.py
│   └── test_streaming.py
├── configs/
│   └── config.yaml
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
└── README.md
```
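Rather than creating these directories by hand, the skeleton can be scaffolded in one shot. A minimal sketch with `pathlib` (the directory list mirrors the tree above; extend it as you add modules):

```python
# scaffold.py -- create the project skeleton shown above.
from pathlib import Path

directories = [
    "data", "notebooks", "models", "tests", "configs",
    "src/api", "src/streaming", "src/monitoring",
]
packages = ["src", "src/api", "src/streaming", "src/monitoring"]

for d in directories:
    Path(d).mkdir(parents=True, exist_ok=True)

for pkg in packages:
    # Empty __init__.py files make src/ and its subfolders importable packages.
    (Path(pkg) / "__init__.py").touch()

print("Scaffold created.")
```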
Verify Installation
```python
# verify_setup.py
import pandas as pd
import numpy as np
import xgboost as xgb
import lightgbm as lgb
import shap
import fastapi
from sklearn.model_selection import train_test_split

print("All imports successful!")
print(f"XGBoost version: {xgb.__version__}")
print(f"LightGBM version: {lgb.__version__}")
print(f"SHAP version: {shap.__version__}")
print(f"FastAPI version: {fastapi.__version__}")

# Verify dataset loads
df = pd.read_csv("data/creditcard.csv")
print(f"\nDataset shape: {df.shape}")
print(f"Fraud cases: {df['Class'].sum()} ({df['Class'].mean()*100:.3f}%)")
print(f"Legitimate: {(df['Class'] == 0).sum()}")
print("\nSetup complete! Ready to build.")
```
What Is Next
With the environment set up and the dataset loaded, we are ready to explore the data and engineer features that will help our model distinguish fraud from legitimate transactions. In the next lesson, we will perform a thorough exploratory analysis and apply SMOTE to handle the severe class imbalance.
Lilly Tech Systems