Data Collection (Intermediate)
Quality data is the foundation of any AI/ML initiative. This lesson covers how to collect, clean, and prepare network telemetry data from SNMP, syslog, NetFlow, streaming telemetry, and APIs for use in machine learning pipelines.
Network Data Sources
| Source | Data Type | Collection Method | ML Use Case |
|---|---|---|---|
| SNMP | Device metrics (CPU, memory, interface counters) | Polling (GET/WALK) or Traps | Capacity prediction, health scoring |
| Syslog | Event logs, state changes | Push-based, UDP/TCP | Event correlation, anomaly detection |
| NetFlow/IPFIX | Traffic flow records | Export from routers/switches | Traffic classification, DDoS detection |
| Streaming Telemetry | Real-time metrics (gNMI, gRPC) | Model-driven, push-based | Real-time anomaly detection |
| REST APIs | Controller/platform data | Pull-based, JSON/XML | Configuration analytics, compliance |
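Syslog rows from the table above arrive as unstructured text, so the first step before any ML use is parsing them into fields. Below is a minimal sketch using only the standard library; the regex and the sample message are illustrative (real deployments mix RFC 3164, RFC 5424, and vendor-specific formats):

```python
import re

# Illustrative parser for an RFC 3164-style syslog line. The priority value
# encodes facility and severity as: pri = facility * 8 + severity.
SYSLOG_RE = re.compile(
    r'^<(?P<pri>\d+)>'                       # priority
    r'(?P<timestamp>\w{3}\s+\d+\s[\d:]+)\s'  # e.g. "Jan  5 10:15:32"
    r'(?P<host>\S+)\s'                       # originating device
    r'(?P<message>.*)$'                      # free-text event
)

def parse_syslog(line):
    """Turn one syslog line into a dict of features, or None if unparseable."""
    m = SYSLOG_RE.match(line)
    if m is None:
        return None
    record = m.groupdict()
    pri = int(record.pop('pri'))
    record['facility'], record['severity'] = divmod(pri, 8)
    return record

record = parse_syslog(
    '<189>Jan  5 10:15:32 core-sw1 %LINK-5-CHANGED: '
    'Interface Gi0/1, changed state to down'
)
```

Structured fields like `host` and `severity` can then feed directly into event-correlation or anomaly-detection features.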
Data Collection Architecture
A robust data collection pipeline for network AI typically includes these components:
- Collection Layer
Agents, collectors, and receivers that gather data from network devices (Telegraf, Logstash, pmacct).
- Transport Layer
Message brokers like Apache Kafka or RabbitMQ that buffer and distribute data streams reliably.
- Storage Layer
Time-series databases (InfluxDB, TimescaleDB) and data lakes (S3, HDFS) for raw and processed data.
- Processing Layer
Stream processing (Apache Flink, Spark Streaming) and batch processing for feature extraction.
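The layered pattern above can be sketched end to end in a few lines. In production the transport layer would be Kafka or RabbitMQ; here a thread-safe in-memory queue stands in for the broker, and a list stands in for the storage layer, purely to show how the layers decouple producers from consumers:

```python
import queue
import threading

# In-memory stand-ins for the transport and storage layers.
broker = queue.Queue()   # transport layer (Kafka/RabbitMQ in production)
stored = []              # storage layer (time-series DB / data lake in production)

def collector(samples):
    """Collection layer: push raw device samples onto the broker."""
    for sample in samples:
        broker.put(sample)
    broker.put(None)  # sentinel: signal end of stream

def storer():
    """Storage layer: drain the broker and persist each record."""
    while True:
        item = broker.get()
        if item is None:
            break
        stored.append(item)

samples = [{'host': 'r1', 'cpu': c} for c in (12, 15, 14)]
writer = threading.Thread(target=storer)
writer.start()
collector(samples)
writer.join()
```

Because the broker buffers between layers, a slow storage backend never blocks collection, which is the main reason for inserting a message broker at all.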
Data Cleaning and Preparation
Raw network data is messy. Common issues and solutions include:
- Missing values — SNMP polling gaps due to device overload. Use interpolation or forward-fill techniques.
- Counter wraps — 32-bit SNMP counters roll over. Detect and handle wrap-around in preprocessing.
- Timestamp alignment — Different sources use different clocks. Normalize all timestamps to UTC.
- Outliers — Spikes from counter resets or maintenance windows. Flag them, then decide per feature whether to cap, remove, or keep them before training.
- Normalization — Scale features to comparable ranges for ML model training.
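Two of the steps above, counter-wrap handling and normalization, can be sketched in pure Python (pandas or NumPy would be the usual tools in a real pipeline):

```python
# 32-bit SNMP counters wrap around at 2^32.
COUNTER_MAX = 2 ** 32

def counter_delta(prev, curr, max_value=COUNTER_MAX):
    """Return the true increment between two counter readings, correcting
    for a single wrap-around (curr < prev means the counter rolled over)."""
    return (curr - prev) % max_value

def min_max_normalize(values):
    """Scale a list of values into [0, 1] so features share a comparable range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# A counter read just before and just after a 32-bit wrap:
delta = counter_delta(4294967290, 100)  # wrapped, so the true delta is 106
```

The modulo trick handles at most one wrap between polls, which is why polling intervals must be short enough that a 64-bit (`ifHC*`) or 32-bit counter cannot wrap twice between samples.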
Building a Data Pipeline
```python
# Example: collecting SNMP interface counters with PySNMP and
# preparing them as a DataFrame for ML feature extraction.
import pandas as pd
from pysnmp.hlapi import (
    SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
    ObjectType, ObjectIdentity, nextCmd,
)

def collect_interface_metrics(host, community):
    """Collect per-interface 64-bit octet counters via SNMP."""
    metrics = []
    for (errorIndication, errorStatus, errorIndex, varBinds) in nextCmd(
        SnmpEngine(),
        CommunityData(community),
        UdpTransportTarget((host, 161)),
        ContextData(),
        ObjectType(ObjectIdentity('IF-MIB', 'ifHCInOctets')),
        ObjectType(ObjectIdentity('IF-MIB', 'ifHCOutOctets')),
        lexicographicMode=False,  # stop at the end of the requested columns
    ):
        if errorIndication or errorStatus:
            break  # stop on timeout or SNMP error
        metrics.append({
            'in_octets': int(varBinds[0][1]),
            'out_octets': int(varBinds[1][1]),
        })
    return pd.DataFrame(metrics)
```
Next Step
With clean data in hand, you are ready to build machine learning models for network use cases.
Next: ML Models →
Lilly Tech Systems