Data Collection (Intermediate)

Quality data is the foundation of any AI/ML initiative. This lesson covers how to collect, clean, and prepare network telemetry data from SNMP, syslog, NetFlow, streaming telemetry, and APIs for use in machine learning pipelines.

Network Data Sources

| Source | Data Type | Collection Method | ML Use Case |
| --- | --- | --- | --- |
| SNMP | Device metrics (CPU, memory, interface counters) | Polling (GET/WALK) or traps | Capacity prediction, health scoring |
| Syslog | Event logs, state changes | Push-based, UDP/TCP | Event correlation, anomaly detection |
| NetFlow/IPFIX | Traffic flow records | Export from routers/switches | Traffic classification, DDoS detection |
| Streaming telemetry | Real-time metrics (gNMI, gRPC) | Model-driven, push-based | Real-time anomaly detection |
| REST APIs | Controller/platform data | Pull-based, JSON/XML | Configuration analytics, compliance |
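To make one of the push-based sources concrete, here is a minimal sketch of a syslog UDP receiver. It handles only the simple `<PRI>` prefix from RFC 3164 (facility and severity packed into one number); real deployments use a full collector such as Logstash, and the function names here are illustrative:

```python
import re
import socket

# Simplified RFC 3164-style pattern: "<PRI>rest of message"
PRI_RE = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)

def parse_syslog(raw: bytes):
    """Split a syslog datagram into facility, severity, and message text."""
    m = PRI_RE.match(raw.decode("utf-8", errors="replace"))
    if not m:
        return None
    pri = int(m.group(1))
    # PRI encodes facility * 8 + severity
    return {"facility": pri // 8, "severity": pri % 8, "message": m.group(2)}

def run_receiver(sock, count):
    """Receive `count` datagrams from an already-bound UDP socket."""
    events = []
    while len(events) < count:
        data, _addr = sock.recvfrom(4096)
        parsed = parse_syslog(data)
        if parsed:
            events.append(parsed)
    return events
```

A PRI of `<34>` decodes to facility 4 (auth) and severity 2 (critical); in an ML pipeline, those fields become categorical features for event correlation.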

Data Collection Architecture

A robust data collection pipeline for network AI typically includes these components:

  1. Collection Layer

    Agents, collectors, and receivers that gather data from network devices (Telegraf, Logstash, pmacct).

  2. Transport Layer

    Message brokers like Apache Kafka or RabbitMQ that buffer and distribute data streams reliably.

  3. Storage Layer

    Time-series databases (InfluxDB, TimescaleDB) and data lakes (S3, HDFS) for raw and processed data.

  4. Processing Layer

    Stream processing (Apache Flink, Spark Streaming) and batch processing for feature extraction.
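The four layers above can be sketched end to end. This is a minimal illustration only: a plain `queue.Queue` stands in for a real broker such as Kafka, a list stands in for a time-series database, and the function and field names are invented for the example:

```python
import queue
import statistics
import time

def collect(device: str) -> dict:
    """Collection layer: poll one device (stubbed with a fixed reading)."""
    return {"device": device, "ts": time.time(), "cpu_pct": 42.0}

def transport(broker: queue.Queue, sample: dict) -> None:
    """Transport layer: publish the sample to a broker (in-memory stand-in)."""
    broker.put(sample)

def store(broker: queue.Queue, db: list) -> None:
    """Storage layer: drain the broker into a store (list stand-in)."""
    while not broker.empty():
        db.append(broker.get())

def process(db: list) -> dict:
    """Processing layer: extract a simple feature (mean CPU per device)."""
    by_dev = {}
    for row in db:
        by_dev.setdefault(row["device"], []).append(row["cpu_pct"])
    return {dev: statistics.mean(vals) for dev, vals in by_dev.items()}
```

Decoupling the layers through the broker is the key design choice: collectors and processors can fail, scale, or be replaced independently, which is why Kafka sits at the center of most production telemetry pipelines.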

Data Cleaning and Preparation

Raw network data is messy. Common issues and solutions include:

  • Missing values — SNMP polling gaps due to device overload. Use interpolation or forward-fill techniques.
  • Counter wraps — 32-bit SNMP counters roll over. Detect and handle wrap-around in preprocessing.
  • Timestamp alignment — Different sources use different clocks. Normalize all timestamps to UTC.
  • Outliers — Spikes from counter resets or maintenance windows. Flag and handle appropriately.
  • Normalization — Scale features to comparable ranges for ML model training.

Data Quality Rule: In ML, "garbage in, garbage out" is especially true. A model trained on poor-quality network data will produce unreliable predictions. Invest time in data cleaning — it typically takes 60-80% of the total effort in any ML project.
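Several of the fixes above can be sketched with pandas. This is a minimal illustration on synthetic data; the column names are invented for the example:

```python
import pandas as pd

WRAP_32 = 2**32  # 32-bit SNMP counters wrap at 2^32

def counter_deltas(counters: pd.Series) -> pd.Series:
    """Per-interval deltas, correcting 32-bit counter wrap-around."""
    deltas = counters.diff()
    # A negative delta means the counter wrapped; add back the wrap size.
    deltas[deltas < 0] += WRAP_32
    return deltas

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Fill polling gaps and min-max scale a metric column."""
    df = df.copy()
    # Missing values: forward-fill gaps from missed polls.
    df["cpu_pct"] = df["cpu_pct"].ffill()
    # Normalization: min-max scale to [0, 1] for model training.
    lo, hi = df["cpu_pct"].min(), df["cpu_pct"].max()
    df["cpu_scaled"] = (df["cpu_pct"] - lo) / (hi - lo)
    return df
```

Forward-fill and min-max scaling are only one reasonable choice each; interpolation or standardization (z-scores) may fit better depending on the metric and the model.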

Building a Data Pipeline

Python
# Example: Collecting SNMP data with PySNMP and preparing for ML
import pandas as pd
from pysnmp.hlapi import (
    CommunityData, ContextData, ObjectIdentity, ObjectType,
    SnmpEngine, UdpTransportTarget, nextCmd,
)

def collect_interface_metrics(host, community):
    """Walk IF-MIB high-capacity octet counters and return a DataFrame."""
    metrics = []
    for (errorIndication, errorStatus, errorIndex, varBinds) in nextCmd(
        SnmpEngine(),
        CommunityData(community),
        UdpTransportTarget((host, 161)),
        ContextData(),
        ObjectType(ObjectIdentity('IF-MIB', 'ifHCInOctets')),
        ObjectType(ObjectIdentity('IF-MIB', 'ifHCOutOctets')),
        lexicographicMode=False  # stop at the end of the requested columns
    ):
        if errorIndication or errorStatus:
            break  # in production, log the error rather than failing silently
        # varBinds holds one (OID, value) pair per requested column,
        # i.e. one interface's in/out counters per iteration.
        metrics.append({'in_octets': int(varBinds[0][1]),
                        'out_octets': int(varBinds[1][1])})
    return pd.DataFrame(metrics)
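The octet counters returned by the collector are cumulative, and most models want rates. A small follow-on sketch, assuming samples are taken at a known, fixed polling interval (`to_bps` is an illustrative helper for this lesson, not part of PySNMP):

```python
import pandas as pd

def to_bps(df: pd.DataFrame, interval_s: float) -> pd.DataFrame:
    """Convert cumulative octet counters to bits-per-second rates."""
    rates = df.copy()
    for col in ("in_octets", "out_octets"):
        # diff() gives octets per interval; * 8 -> bits; / interval -> per second
        rates[col.replace("octets", "bps")] = df[col].diff() * 8 / interval_s
    return rates.dropna()  # the first sample has no previous value to diff against
```

For example, two samples of 0 and 1000 in-octets taken 10 seconds apart yield an `in_bps` of 800. In practice you would also apply the counter-wrap correction from the cleaning step before computing rates.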

Next Step

With clean data in hand, you are ready to build machine learning models for network use cases.

Next: ML Models →