Beginner

Project Setup

Set up the complete development environment for building a production-grade, real-time fraud detection system. Understand the system architecture, explore the dataset, and install every dependency you will need.

System Architecture Overview

Before writing any code, let us understand the full system we are building. A real-time fraud detector has five major components that work together to catch fraudulent transactions as they happen:

📊

Data Pipeline

Ingests raw transaction data, computes features, and prepares training datasets. Handles class imbalance through SMOTE oversampling.

ML Models

XGBoost and LightGBM classifiers trained to distinguish fraudulent from legitimate transactions with high recall.

Inference API

FastAPI service that accepts transaction features and returns fraud probability in under 50 milliseconds.

🔁

Streaming Layer

Apache Kafka ingests transaction events, triggers real-time scoring, and routes alerts to downstream consumers.

📈

Monitoring

Evidently AI detects feature drift, Prometheus and Grafana provide latency and throughput dashboards, and automated retraining keeps the model current.

The Credit Card Fraud Dataset

We will use the Kaggle Credit Card Fraud Detection dataset, one of the most widely used benchmarks in fraud ML. It contains 284,807 European credit card transactions from September 2013, of which only 492 (0.172%) are fraudulent.

💡
Why this dataset? It reflects real-world challenges: extreme class imbalance, PCA-transformed features (simulating anonymized production data), and a mix of numerical features that require careful engineering. The techniques you learn here transfer directly to production fraud systems.
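With only 0.172% positives, even a routine train/test split must preserve the class ratio, or the test set may end up with almost no fraud cases. A quick sketch using scikit-learn's stratify option (synthetic labels stand in for the real dataset here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10,000 transactions, roughly 0.2% fraud
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.002).astype(int)

# stratify=y keeps the fraud ratio (nearly) identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"Train fraud rate: {y_train.mean():.4f}")
print(f"Test fraud rate:  {y_test.mean():.4f}")
```

Without stratify, a random 80/20 split of ~20 fraud cases could easily leave the test set with zero positives.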

Dataset Schema

| Column | Type | Description |
| --- | --- | --- |
| Time | float | Seconds elapsed since the first transaction in the dataset |
| V1–V28 | float | PCA-transformed features (anonymized) |
| Amount | float | Transaction amount in euros |
| Class | int | 1 = fraud, 0 = legitimate |
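The schema above implies exactly 31 columns. A small sanity check you can run once data/creditcard.csv is in place (the file read is guarded so the snippet also runs standalone):

```python
from pathlib import Path
import pandas as pd

# The 31 columns the schema implies: Time, V1..V28, Amount, Class
expected = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]
print(len(expected))  # 31

csv_path = Path("data/creditcard.csv")
if csv_path.exists():
    # Read only a few rows: we just want to check the header
    df = pd.read_csv(csv_path, nrows=5)
    assert list(df.columns) == expected, "unexpected schema"
    print("Schema matches.")
```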

Tech Stack

Here is every tool and library we will use throughout this project, with versions pinned for reproducibility:

| Component | Tool | Purpose |
| --- | --- | --- |
| Language | Python 3.11+ | Core language for all components |
| ML Framework | XGBoost 2.0, LightGBM 4.x | Gradient-boosted tree classifiers |
| Data | pandas, NumPy, scikit-learn | Data manipulation, preprocessing, metrics |
| Imbalance | imbalanced-learn (SMOTE) | Oversampling the minority class |
| Explainability | SHAP 0.44+ | Feature importance and decision explanations |
| API | FastAPI + Uvicorn | Low-latency inference endpoint |
| Streaming | Apache Kafka + confluent-kafka | Event ingestion and real-time scoring |
| Monitoring | Evidently AI, Prometheus, Grafana | Drift detection and dashboards |
| Testing | pytest, Locust | Unit tests and load testing |

Environment Setup

Create a new project directory and set up an isolated Python environment with all dependencies:

# Create project directory
mkdir fraud-detector && cd fraud-detector

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Create requirements.txt
cat > requirements.txt << 'EOF'
pandas==2.2.0
numpy==1.26.3
scikit-learn==1.4.0
xgboost==2.0.3
lightgbm==4.2.0
imbalanced-learn==0.12.0
shap==0.44.1
fastapi==0.109.0
uvicorn==0.27.0
pydantic==2.5.3
confluent-kafka==2.3.0
evidently==0.4.13
prometheus-client==0.19.0
locust==2.20.1
matplotlib==3.8.2
seaborn==0.13.1
joblib==1.3.2
httpx==0.26.0
pytest==8.0.0
EOF

# Install all dependencies
pip install -r requirements.txt

Download the Dataset

# Option 1: Kaggle CLI
pip install kaggle
kaggle datasets download -d mlg-ulb/creditcardfraud
unzip creditcardfraud.zip -d data/

# Option 2: Direct download (if you have the CSV)
mkdir -p data
# Place creditcard.csv in the data/ directory

Project Directory Structure

fraud-detector/
├── data/
│   └── creditcard.csv
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_model_training.ipynb
├── src/
│   ├── __init__.py
│   ├── features.py          # Feature engineering pipeline
│   ├── train.py             # Model training script
│   ├── evaluate.py          # Evaluation and SHAP
│   ├── api/
│   │   ├── __init__.py
│   │   ├── main.py          # FastAPI application
│   │   ├── schemas.py       # Pydantic models
│   │   └── predictor.py     # Model loading and inference
│   ├── streaming/
│   │   ├── __init__.py
│   │   ├── producer.py      # Kafka transaction producer
│   │   ├── consumer.py      # Kafka scoring consumer
│   │   └── config.py        # Kafka configuration
│   └── monitoring/
│       ├── __init__.py
│       ├── drift.py          # Drift detection
│       └── retrain.py        # Automated retraining
├── models/
│   └── (saved model artifacts)
├── tests/
│   ├── test_features.py
│   ├── test_api.py
│   └── test_streaming.py
├── configs/
│   └── config.yaml
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
└── README.md
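The lesson does not prescribe the contents of configs/config.yaml; one plausible sketch, with every key and value illustrative, that the later scripts could read:

```yaml
data:
  raw_path: data/creditcard.csv
  test_size: 0.2
  random_state: 42

model:
  type: xgboost            # or lightgbm
  params:
    n_estimators: 300
    max_depth: 6
    learning_rate: 0.1

api:
  host: 0.0.0.0
  port: 8000

kafka:
  bootstrap_servers: localhost:9092
  transactions_topic: transactions
  alerts_topic: fraud-alerts
```

Centralizing paths, hyperparameters, and Kafka topics in one file keeps the training, API, and streaming code free of hard-coded values.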

Verify Installation

# verify_setup.py
import pandas as pd
import numpy as np
import xgboost as xgb
import lightgbm as lgb
import shap
import fastapi
from sklearn.model_selection import train_test_split

print("All imports successful!")
print(f"XGBoost version: {xgb.__version__}")
print(f"LightGBM version: {lgb.__version__}")
print(f"SHAP version: {shap.__version__}")
print(f"FastAPI version: {fastapi.__version__}")

# Verify dataset loads
df = pd.read_csv("data/creditcard.csv")
print(f"\nDataset shape: {df.shape}")
print(f"Fraud cases: {df['Class'].sum()} ({df['Class'].mean()*100:.3f}%)")
print(f"Legitimate: {(df['Class'] == 0).sum()}")
print("\nSetup complete! Ready to build.")
💡
Expected output: The dataset has 284,807 rows and 31 columns. Only 0.172% of transactions are fraudulent. This extreme imbalance is the core challenge we will tackle in the next lesson.

What's Next

With the environment set up and the dataset loaded, we are ready to explore the data and engineer features that will help our model distinguish fraud from legitimate transactions. In the next lesson, we will perform a thorough exploratory analysis and apply SMOTE to handle the severe class imbalance.
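As a preview, SMOTE creates synthetic minority samples by interpolating between a minority point and one of its minority-class neighbors. A stripped-down NumPy illustration of that core idea (the real imbalanced-learn implementation uses k-nearest neighbors and is considerably more careful):

```python
import numpy as np

def smote_sketch(X_min: np.ndarray, n_new: int, rng: np.random.Generator) -> np.ndarray:
    """Toy SMOTE: interpolate between random pairs of minority samples."""
    synthetic = []
    for _ in range(n_new):
        i, j = rng.choice(len(X_min), size=2, replace=False)
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + u * (X_min[j] - X_min[i]))
    return np.array(synthetic)

rng = np.random.default_rng(0)
minority = rng.normal(loc=5.0, size=(20, 3))   # 20 minority samples, 3 features
new_points = smote_sketch(minority, n_new=100, rng=rng)
print(new_points.shape)  # (100, 3)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the minority class's region of feature space rather than being duplicated or drawn at random.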