Beginner

Project Setup

Set up the complete development environment for building a production-grade, real-time fraud detection system. Understand the system architecture, explore the dataset, and install every dependency you will need.

System Architecture Overview

Before writing any code, let us understand the full system we are building. A real-time fraud detector has five major components that work together to catch fraudulent transactions as they happen:

📊

Data Pipeline

Ingests raw transaction data, computes features, and prepares training datasets. Handles class imbalance through SMOTE oversampling.

ML Models

XGBoost and LightGBM classifiers trained to distinguish fraudulent from legitimate transactions with high recall.

Inference API

FastAPI service that accepts transaction features and returns fraud probability in under 50 milliseconds.

🔁

Streaming Layer

Apache Kafka ingests transaction events, triggers real-time scoring, and routes alerts to downstream consumers.

📈

Monitoring

Evidently AI detects feature drift, Prometheus and Grafana provide latency and throughput dashboards, and automated retraining keeps the model current.

The Credit Card Fraud Dataset

We will use the Kaggle Credit Card Fraud Detection dataset, one of the most widely used benchmarks in fraud ML. It contains 284,807 European credit card transactions from September 2013, of which only 492 (0.172%) are fraudulent.

💡
Why this dataset? It reflects real-world challenges: extreme class imbalance, PCA-transformed features (simulating anonymized production data), and a mix of numerical features that require careful engineering. The techniques you learn here transfer directly to production fraud systems.
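With only 0.172% positives, even a routine train/test split must preserve the class ratio, or the test set may end up with almost no fraud cases. A quick sketch using scikit-learn's stratify option (synthetic labels stand in for the real dataset here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10,000 transactions, roughly 0.2% fraud
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.002).astype(int)

# stratify=y keeps the fraud ratio (nearly) identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"Train fraud rate: {y_train.mean():.4f}")
print(f"Test fraud rate:  {y_test.mean():.4f}")
```

Without stratify, a random 80/20 split of ~20 fraud cases could easily leave the test set with zero positives.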

Dataset Schema

| Column | Type | Description |
| --- | --- | --- |
| Time | float | Seconds elapsed since the first transaction in the dataset |
| V1–V28 | float | PCA-transformed features (anonymized) |
| Amount | float | Transaction amount in euros |
| Class | int | 1 = fraud, 0 = legitimate |
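The schema above implies exactly 31 columns. A small sanity check you can run once data/creditcard.csv is in place (the file read is guarded so the snippet also runs standalone):

```python
from pathlib import Path
import pandas as pd

# The 31 columns the schema implies: Time, V1..V28, Amount, Class
expected = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]
print(len(expected))  # 31

csv_path = Path("data/creditcard.csv")
if csv_path.exists():
    # Read only a few rows: we just want to check the header
    df = pd.read_csv(csv_path, nrows=5)
    assert list(df.columns) == expected, "unexpected schema"
    print("Schema matches.")
```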

Tech Stack

Here is every tool and library we will use throughout this project, with versions pinned for reproducibility:

| Component | Tool | Purpose |
| --- | --- | --- |
| Language | Python 3.11+ | Core language for all components |
| ML Framework | XGBoost 2.0, LightGBM 4.x | Gradient-boosted tree classifiers |
| Data | pandas, NumPy, scikit-learn | Data manipulation, preprocessing, metrics |
| Imbalance | imbalanced-learn (SMOTE) | Oversampling the minority class |
| Explainability | SHAP 0.44+ | Feature importance and decision explanations |
| API | FastAPI + Uvicorn | Low-latency inference endpoint |
| Streaming | Apache Kafka + confluent-kafka | Event ingestion and real-time scoring |
| Monitoring | Evidently AI, Prometheus, Grafana | Drift detection and dashboards |
| Testing | pytest, Locust | Unit tests and load testing |

Environment Setup

Create a new project directory and set up an isolated Python environment with all dependencies:

# Create project directory
mkdir fraud-detector && cd fraud-detector

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Create requirements.txt
cat > requirements.txt << 'EOF'
pandas==2.2.0
numpy==1.26.3
scikit-learn==1.4.0
xgboost==2.0.3
lightgbm==4.2.0
imbalanced-learn==0.12.0
shap==0.44.1
fastapi==0.109.0
uvicorn==0.27.0
pydantic==2.5.3
confluent-kafka==2.3.0
evidently==0.4.13
prometheus-client==0.19.0
locust==2.20.1
matplotlib==3.8.2
seaborn==0.13.1
joblib==1.3.2
httpx==0.26.0
pytest==8.0.0
EOF

# Install all dependencies
pip install -r requirements.txt

Download the Dataset

# Option 1: Kaggle CLI
pip install kaggle
kaggle datasets download -d mlg-ulb/creditcardfraud
unzip creditcardfraud.zip -d data/

# Option 2: Direct download (if you have the CSV)
mkdir -p data
# Place creditcard.csv in the data/ directory

Project Directory Structure

fraud-detector/
├── data/
│   └── creditcard.csv
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_model_training.ipynb
├── src/
│   ├── __init__.py
│   ├── features.py          # Feature engineering pipeline
│   ├── train.py             # Model training script
│   ├── evaluate.py          # Evaluation and SHAP
│   ├── api/
│   │   ├── __init__.py
│   │   ├── main.py          # FastAPI application
│   │   ├── schemas.py       # Pydantic models
│   │   └── predictor.py     # Model loading and inference
│   ├── streaming/
│   │   ├── __init__.py
│   │   ├── producer.py      # Kafka transaction producer
│   │   ├── consumer.py      # Kafka scoring consumer
│   │   └── config.py        # Kafka configuration
│   └── monitoring/
│       ├── __init__.py
│       ├── drift.py          # Drift detection
│       └── retrain.py        # Automated retraining
├── models/
│   └── (saved model artifacts)
├── tests/
│   ├── test_features.py
│   ├── test_api.py
│   └── test_streaming.py
├── configs/
│   └── config.yaml
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
└── README.md
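The lesson does not prescribe the contents of configs/config.yaml; one plausible sketch, with every key and value illustrative, that the later scripts could read:

```yaml
data:
  raw_path: data/creditcard.csv
  test_size: 0.2
  random_state: 42

model:
  type: xgboost            # or lightgbm
  params:
    n_estimators: 300
    max_depth: 6
    learning_rate: 0.1

api:
  host: 0.0.0.0
  port: 8000

kafka:
  bootstrap_servers: localhost:9092
  transactions_topic: transactions
  alerts_topic: fraud-alerts
```

Centralizing paths, hyperparameters, and Kafka topics in one file keeps the training, API, and streaming code free of hard-coded values.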

Verify Installation

# verify_setup.py
import pandas as pd
import numpy as np
import xgboost as xgb
import lightgbm as lgb
import shap
import fastapi
from sklearn.model_selection import train_test_split

print("All imports successful!")
print(f"XGBoost version: {xgb.__version__}")
print(f"LightGBM version: {lgb.__version__}")
print(f"SHAP version: {shap.__version__}")
print(f"FastAPI version: {fastapi.__version__}")

# Verify dataset loads
df = pd.read_csv("data/creditcard.csv")
print(f"\nDataset shape: {df.shape}")
print(f"Fraud cases: {df['Class'].sum()} ({df['Class'].mean()*100:.3f}%)")
print(f"Legitimate: {(df['Class'] == 0).sum()}")
print("\nSetup complete! Ready to build.")
💡
Expected output: The dataset has 284,807 rows and 31 columns. Only 0.172% of transactions are fraudulent. This extreme imbalance is the core challenge we will tackle in the next lesson.

What's Next

With the environment set up and the dataset loaded, we are ready to explore the data and engineer features that will help our model distinguish fraud from legitimate transactions. In the next lesson, we will perform a thorough exploratory analysis and apply SMOTE to handle the severe class imbalance.
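As a preview, SMOTE creates synthetic minority samples by interpolating between a minority point and one of its minority-class neighbors. A stripped-down NumPy illustration of that core idea (the real imbalanced-learn implementation uses k-nearest neighbors and is considerably more careful):

```python
import numpy as np

def smote_sketch(X_min: np.ndarray, n_new: int, rng: np.random.Generator) -> np.ndarray:
    """Toy SMOTE: interpolate between random pairs of minority samples."""
    synthetic = []
    for _ in range(n_new):
        i, j = rng.choice(len(X_min), size=2, replace=False)
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + u * (X_min[j] - X_min[i]))
    return np.array(synthetic)

rng = np.random.default_rng(0)
minority = rng.normal(loc=5.0, size=(20, 3))   # 20 minority samples, 3 features
new_points = smote_sketch(minority, n_new=100, rng=rng)
print(new_points.shape)  # (100, 3)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the minority class's region of feature space rather than being duplicated or drawn at random.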