
TensorFlow Lite

TensorFlow Lite (TFLite) is a widely used framework for deploying ML models on mobile devices, embedded systems, and microcontrollers.

What is TensorFlow Lite?

TFLite is a lightweight runtime for running inference with TensorFlow models on mobile and edge devices. It provides tools for model conversion, optimization, and deployment across Android, iOS, Linux, and microcontrollers.

Converting a Model to TFLite

Python - Converting to TFLite

import tensorflow as tf

# Train or load a Keras model
model = tf.keras.applications.MobileNetV2(
    weights="imagenet",
    input_shape=(224, 224, 3)
)

# Convert to TFLite (no optimization)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# Save the model
with open("mobilenet_v2.tflite", "wb") as f:
    f.write(tflite_model)

print(f"Model size: {len(tflite_model) / 1024 / 1024:.1f} MB")
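Before copying the saved file to a device, a quick sanity check can catch a truncated or wrong file. The sketch below relies only on the TFLite FlatBuffer schema's declared file identifier, `TFL3`, which FlatBuffers stores at byte offset 4. The helper name `describe_tflite` is hypothetical, not part of any TFLite API.

Python - Checking a .tflite File

```python
def describe_tflite(data: bytes) -> dict:
    """Hypothetical helper: check the FlatBuffer file identifier and report size."""
    # A FlatBuffer starts with a 4-byte root offset, followed by the
    # 4-byte file identifier ("TFL3" for TFLite models).
    is_tflite = len(data) >= 8 and data[4:8] == b"TFL3"
    return {"is_tflite": is_tflite, "size_mb": len(data) / 1024 / 1024}
```

For example, `describe_tflite(open("mobilenet_v2.tflite", "rb").read())` should report `is_tflite: True` for the file written above.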

Post-Training Quantization

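Post-training quantization converts model weights (and, with full integer quantization, activations) from 32-bit floats to 8-bit integers. This makes the model roughly 4x smaller and usually speeds up inference on CPUs and integer-only accelerators. Full integer quantization also needs a representative dataset, which the converter uses to calibrate activation ranges: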

Python - INT8 Quantization

import tensorflow as tf
import numpy as np

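# Representative dataset for calibrating activation ranges.
# Random data only keeps this example self-contained; for usable accuracy,
# yield ~100 real samples preprocessed exactly like the deployment inputs
# (MobileNetV2 expects RGB images scaled to [-1, 1]).
def representative_dataset():
    for _ in range(100):
        data = np.random.uniform(-1.0, 1.0, size=(1, 224, 224, 3)).astype(np.float32)
        yield [data]

# `model` is the Keras MobileNetV2 loaded in the previous example
converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Full integer quantization (INT8): weights and activations
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Raise an error if an op cannot be quantized instead of falling back to float
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Make the model's input and output tensors int8 as well
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

quantized_model = converter.convert()

with open("mobilenet_v2_int8.tflite", "wb") as f:
    f.write(quantized_model)

print(f"Quantized size: {len(quantized_model) / 1024 / 1024:.1f} MB")
# Expect roughly a quarter of the ~14 MB float model (about 4x smaller)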
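The 4x size reduction follows from storing each value in one byte instead of four. The NumPy sketch below illustrates the affine mapping that TFLite's INT8 scheme is built on, `real_value = (q - zero_point) * scale`. It is a simplified, per-tensor, asymmetric illustration (TFLite quantizes weights per channel and symmetrically), not the converter's calibration code, and the helper names are hypothetical.

Python - Affine INT8 Quantization Sketch

```python
import numpy as np


def choose_params(x_min: float, x_max: float):
    """Asymmetric per-tensor int8 (scale, zero_point) for the range [x_min, x_max]."""
    # Extend the range to include 0 so that 0.0 maps exactly to an integer.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / 255.0 or 1.0  # avoid a zero scale for all-zero data
    zero_point = int(round(-128 - x_min / scale))
    return scale, int(np.clip(zero_point, -128, 127))


def quantize(x: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    # q = round(x / scale) + zero_point, clamped to the int8 range
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)


def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    # real_value = (q - zero_point) * scale
    return (q.astype(np.float32) - zero_point) * scale
```

With `scale, zp = choose_params(-1.0, 1.0)`, any value in [-1, 1] round-trips with an error of at most `scale / 2` (about 0.004), and a float32 array takes exactly four times the bytes of its int8 counterpart.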

Running Inference

Python - TFLite Inference on Raspberry Pi

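import numpy as np
from PIL import Image  # Pillow, for image loading
import tflite_runtime.interpreter as tflite

def preprocess_image(path):
    # Resize to 224x224 RGB and scale pixels to [-1, 1], as MobileNetV2 expects
    img = Image.open(path).convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype=np.float32) / 127.5 - 1.0
    return x[np.newaxis, ...]  # add batch dimension: (1, 224, 224, 3)

# Load TFLite model
interpreter = tflite.Interpreter(model_path="mobilenet_v2_int8.tflite")
interpreter.allocate_tensors()

# Get input/output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Prepare input image; the INT8 model expects int8 input,
# so quantize it with the input tensor's scale and zero point
image = preprocess_image("cat.jpg")
scale, zero_point = input_details[0]['quantization']
image_q = np.clip(np.round(image / scale + zero_point), -128, 127).astype(np.int8)
interpreter.set_tensor(input_details[0]['index'], image_q)

# Run inference
interpreter.invoke()

# Get predictions (argmax is unaffected by the output's quantization)
output = interpreter.get_tensor(output_details[0]['index'])
predicted_class = np.argmax(output)
print(f"Predicted class: {predicted_class}")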
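`np.argmax` returns only the top class, and with an INT8 model the raw output values are quantized integers rather than probabilities. The sketch below dequantizes the output with the `(scale, zero_point)` tuple the interpreter reports in `output_details[0]['quantization']` (where a scale of 0 means the tensor is not quantized) and returns the top-k classes. The function name `top_k_predictions` is hypothetical.

Python - Top-k Predictions from a Quantized Output

```python
import numpy as np


def top_k_predictions(output: np.ndarray, quant_params, k: int = 5):
    """Return [(class_index, score), ...] for the k highest-scoring classes.

    `quant_params` is the (scale, zero_point) tuple from
    output_details[0]['quantization']; a scale of 0 means a float tensor.
    """
    scale, zero_point = quant_params
    scores = output.astype(np.float32).reshape(-1)
    if scale:  # quantized output: map integer values back to real scores
        scores = (scores - zero_point) * scale
    top = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in top]
```

With the interpreter above, `top_k_predictions(output, output_details[0]['quantization'])` yields the five most likely ImageNet class indices with scores on the model's original scale.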

TFLite Micro

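TFLite Micro (TensorFlow Lite for Microcontrollers) is the version designed for microcontrollers. It does not require an operating system, performs no dynamic memory allocation (tensors live in a fixed memory arena supplied by the application), and its core runtime fits in tens of kilobytes of flash: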

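Supported MCUs: Arm Cortex-M devices (such as STM32 and Arduino Nano 33 BLE Sense boards) and Espressif ESP32, among others.
Use cases: Keyword spotting and wake-word detection (the TFLite Micro micro_speech example recognizes "yes" and "no"), gesture recognition, anomaly detection.
Model size limit: There is no fixed limit, but the model must fit in flash and its tensor arena in RAM; MCU models are typically under about 256 KB.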
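Microcontroller targets usually have no file system, so the .tflite bytes are compiled into the firmware as a C array; the TFLite Micro documentation does this with `xxd -i model.tflite > model.cc`. The sketch below produces similar C++ source from Python and warns when the model exceeds a size budget. The function name, the `g_model` variable name, and the 256 KB default (the rule of thumb above) are assumptions; the `alignas(16)` mirrors what TFLite Micro examples use and may need adjusting for your toolchain.

Python - Embedding a Model as a C Array

```python
def model_to_c_source(data: bytes, var_name: str = "g_model",
                      budget_kb: int = 256) -> str:
    """Hypothetical helper: emit C++ source embedding `data`, similar to `xxd -i`."""
    size_kb = len(data) / 1024
    if size_kb > budget_kb:
        print(f"warning: model is {size_kb:.1f} KB, over the {budget_kb} KB budget")
    lines = []
    for start in range(0, len(data), 12):  # 12 bytes per line, like xxd -i
        chunk = data[start:start + 12]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    return (
        f"// Generated from a .tflite file: {len(data)} bytes.\n"
        f"alignas(16) const unsigned char {var_name}[] = {{\n"
        + "\n".join(lines)
        + f"\n}};\nconst unsigned int {var_name}_len = {len(data)};\n"
    )
```

Passing the bytes of `mobilenet_v2_int8.tflite` to this helper would print the warning: at several megabytes, MobileNetV2 is far too large for a typical MCU, which is why TFLite Micro deployments use much smaller models.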

Quantization Comparison

Type	Size Reduction	Speed Improvement	Accuracy Loss
Dynamic range (INT8 weights)	~4x	~2-3x (CPU)	Minimal
Full integer (INT8)	~4x	~3x or more (CPU, Edge TPU, MCU)	Small (often ~1-2%; model-dependent)
Float16	~2x	GPU acceleration	Negligible
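The speedups in the table are typical ranges; actual gains depend on the device, the delegate, and the model. A small timing helper such as the hypothetical `benchmark` below (plain Python, using `time.perf_counter`) can compare the float and INT8 models on the target hardware by timing `interpreter.invoke` after the input tensor has been set.

Python - Timing Inference

```python
import statistics
import time
from typing import Callable


def benchmark(run: Callable[[], object], warmup: int = 5, runs: int = 50) -> dict:
    """Time a zero-argument callable; return latency statistics in milliseconds."""
    for _ in range(warmup):  # let caches and lazy initialization settle first
        run()
    times_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run()
        times_ms.append((time.perf_counter() - start) * 1000)
    return {
        "median_ms": statistics.median(times_ms),
        "mean_ms": statistics.fmean(times_ms),
        "min_ms": min(times_ms),
    }
```

With one interpreter per model (hypothetically named `float_interpreter` and `int8_interpreter`), dividing `benchmark(float_interpreter.invoke)["median_ms"]` by `benchmark(int8_interpreter.invoke)["median_ms"]` gives the measured speedup on that device. The median is reported because it is less sensitive to occasional scheduling spikes than the mean.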


Key takeaway: TFLite is a go-to framework for deploying TensorFlow models to edge devices. INT8 quantization gives roughly 4x smaller models and typically 2-3x or faster CPU inference, with small, model-dependent accuracy loss when calibrated on real data. Use tflite_runtime for lightweight deployment without the full TensorFlow package.
