Data Collection (Intermediate)

Quality data is the foundation of any AI/ML initiative. This lesson covers how to collect, clean, and prepare network telemetry data from SNMP, syslog, NetFlow, streaming telemetry, and APIs for use in machine learning pipelines.

Network Data Sources

| Source | Data Type | Collection Method | ML Use Case |
| --- | --- | --- | --- |
| SNMP | Device metrics (CPU, memory, interface counters) | Polling (GET/WALK) or traps | Capacity prediction, health scoring |
| Syslog | Event logs, state changes | Push-based, UDP/TCP | Event correlation, anomaly detection |
| NetFlow/IPFIX | Traffic flow records | Export from routers/switches | Traffic classification, DDoS detection |
| Streaming telemetry | Real-time metrics (gNMI, gRPC) | Model-driven, push-based | Real-time anomaly detection |
| REST APIs | Controller/platform data | Pull-based, JSON/XML | Configuration analytics, compliance |
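To make one of the push-based sources concrete, here is a minimal sketch of a syslog UDP receiver. It handles only the simple `<PRI>` prefix from RFC 3164 (facility and severity packed into one number); real deployments use a full collector such as Logstash, and the function names here are illustrative:

```python
import re
import socket

# Simplified RFC 3164-style pattern: "<PRI>rest of message"
PRI_RE = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)

def parse_syslog(raw: bytes):
    """Split a syslog datagram into facility, severity, and message text."""
    m = PRI_RE.match(raw.decode("utf-8", errors="replace"))
    if not m:
        return None
    pri = int(m.group(1))
    # PRI encodes facility * 8 + severity
    return {"facility": pri // 8, "severity": pri % 8, "message": m.group(2)}

def run_receiver(sock, count):
    """Receive `count` datagrams from an already-bound UDP socket."""
    events = []
    while len(events) < count:
        data, _addr = sock.recvfrom(4096)
        parsed = parse_syslog(data)
        if parsed:
            events.append(parsed)
    return events
```

A PRI of `<34>` decodes to facility 4 (auth) and severity 2 (critical); in an ML pipeline, those fields become categorical features for event correlation.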

Data Collection Architecture

A robust data collection pipeline for network AI typically includes these components:

  1. Collection Layer

    Agents, collectors, and receivers that gather data from network devices (Telegraf, Logstash, pmacct).

  2. Transport Layer

    Message brokers like Apache Kafka or RabbitMQ that buffer and distribute data streams reliably.

  3. Storage Layer

    Time-series databases (InfluxDB, TimescaleDB) and data lakes (S3, HDFS) for raw and processed data.

  4. Processing Layer

    Stream processing (Apache Flink, Spark Streaming) and batch processing for feature extraction.
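The four layers above can be sketched end to end. This is a minimal illustration only: a plain `queue.Queue` stands in for a real broker such as Kafka, a list stands in for a time-series database, and the function and field names are invented for the example:

```python
import queue
import statistics
import time

def collect(device: str) -> dict:
    """Collection layer: poll one device (stubbed with a fixed reading)."""
    return {"device": device, "ts": time.time(), "cpu_pct": 42.0}

def transport(broker: queue.Queue, sample: dict) -> None:
    """Transport layer: publish the sample to a broker (in-memory stand-in)."""
    broker.put(sample)

def store(broker: queue.Queue, db: list) -> None:
    """Storage layer: drain the broker into a store (list stand-in)."""
    while not broker.empty():
        db.append(broker.get())

def process(db: list) -> dict:
    """Processing layer: extract a simple feature (mean CPU per device)."""
    by_dev = {}
    for row in db:
        by_dev.setdefault(row["device"], []).append(row["cpu_pct"])
    return {dev: statistics.mean(vals) for dev, vals in by_dev.items()}
```

Decoupling the layers through the broker is the key design choice: collectors and processors can fail, scale, or be replaced independently, which is why Kafka sits at the center of most production telemetry pipelines.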

Data Cleaning and Preparation

Raw network data is messy. Common issues and solutions include:

  • Missing values — SNMP polling gaps due to device overload. Use interpolation or forward-fill techniques.
  • Counter wraps — 32-bit SNMP counters roll over. Detect and handle wrap-around in preprocessing.
  • Timestamp alignment — Different sources use different clocks. Normalize all timestamps to UTC.
  • Outliers — Spikes from counter resets or maintenance windows. Flag and handle appropriately.
  • Normalization — Scale features to comparable ranges for ML model training.

Data Quality Rule: In ML, "garbage in, garbage out" is especially true. A model trained on poor-quality network data will produce unreliable predictions. Invest time in data cleaning — it typically takes 60-80% of the total effort in any ML project.
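Several of the fixes above can be sketched with pandas. This is a minimal illustration on synthetic data; the column names are invented for the example:

```python
import pandas as pd

WRAP_32 = 2**32  # 32-bit SNMP counters wrap at 2^32

def counter_deltas(counters: pd.Series) -> pd.Series:
    """Per-interval deltas, correcting 32-bit counter wrap-around."""
    deltas = counters.diff()
    # A negative delta means the counter wrapped; add back the wrap size.
    deltas[deltas < 0] += WRAP_32
    return deltas

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Fill polling gaps and min-max scale a metric column."""
    df = df.copy()
    # Missing values: forward-fill gaps from missed polls.
    df["cpu_pct"] = df["cpu_pct"].ffill()
    # Normalization: min-max scale to [0, 1] for model training.
    lo, hi = df["cpu_pct"].min(), df["cpu_pct"].max()
    df["cpu_scaled"] = (df["cpu_pct"] - lo) / (hi - lo)
    return df
```

Forward-fill and min-max scaling are only one reasonable choice each; interpolation or standardization (z-scores) may fit better depending on the metric and the model.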

Building a Data Pipeline

Python
# Example: Collecting SNMP data with PySNMP and preparing for ML
import pandas as pd
from pysnmp.hlapi import (
    CommunityData, ContextData, ObjectIdentity, ObjectType,
    SnmpEngine, UdpTransportTarget, nextCmd,
)

def collect_interface_metrics(host, community):
    """Walk IF-MIB high-capacity octet counters and return a DataFrame."""
    metrics = []
    for (errorIndication, errorStatus, errorIndex, varBinds) in nextCmd(
        SnmpEngine(),
        CommunityData(community),
        UdpTransportTarget((host, 161)),
        ContextData(),
        ObjectType(ObjectIdentity('IF-MIB', 'ifHCInOctets')),
        ObjectType(ObjectIdentity('IF-MIB', 'ifHCOutOctets')),
        lexicographicMode=False  # stop at the end of the requested columns
    ):
        if errorIndication or errorStatus:
            break  # in production, log the error rather than failing silently
        # varBinds holds one (OID, value) pair per requested column,
        # i.e. one interface's in/out counters per iteration.
        metrics.append({'in_octets': int(varBinds[0][1]),
                        'out_octets': int(varBinds[1][1])})
    return pd.DataFrame(metrics)
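The octet counters returned by the collector are cumulative, and most models want rates. A small follow-on sketch, assuming samples are taken at a known, fixed polling interval (`to_bps` is an illustrative helper for this lesson, not part of PySNMP):

```python
import pandas as pd

def to_bps(df: pd.DataFrame, interval_s: float) -> pd.DataFrame:
    """Convert cumulative octet counters to bits-per-second rates."""
    rates = df.copy()
    for col in ("in_octets", "out_octets"):
        # diff() gives octets per interval; * 8 -> bits; / interval -> per second
        rates[col.replace("octets", "bps")] = df[col].diff() * 8 / interval_s
    return rates.dropna()  # the first sample has no previous value to diff against
```

For example, two samples of 0 and 1000 in-octets taken 10 seconds apart yield an `in_bps` of 800. In practice you would also apply the counter-wrap correction from the cleaning step before computing rates.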

Next Step

With clean data in hand, you are ready to build machine learning models for network use cases.

Next: ML Models →