Intermediate
TensorFlow Lite
TensorFlow Lite (TFLite) is TensorFlow's framework for deploying ML models on mobile devices, embedded systems, and microcontrollers.
What is TensorFlow Lite?
TFLite is a lightweight runtime for running inference with TensorFlow models on mobile and edge devices. It provides tools for model conversion, optimization, and deployment across Android, iOS, Linux, and microcontrollers.
Converting a Model to TFLite
Python - Converting to TFLite
import tensorflow as tf

# Train or load a Keras model
model = tf.keras.applications.MobileNetV2(
    weights="imagenet",
    input_shape=(224, 224, 3)
)

# Convert to TFLite (no optimization)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# Save the model
with open("mobilenet_v2.tflite", "wb") as f:
    f.write(tflite_model)

print(f"Model size: {len(tflite_model) / 1024 / 1024:.1f} MB")
Post-Training Quantization
Post-training quantization converts model weights (and, in full integer mode, activations) from 32-bit floats to 8-bit integers, which cuts model size roughly 4x and usually speeds up inference:
Python - INT8 Quantization
import numpy as np
import tensorflow as tf

# model is the Keras model loaded in the previous example

# Representative dataset for calibration.
# Random data is only a placeholder: use ~100 real, preprocessed input
# samples so the calibrated ranges match what the model sees in production.
def representative_dataset():
    for _ in range(100):
        data = np.random.rand(1, 224, 224, 3).astype(np.float32)
        yield [data]

converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Full integer quantization (INT8)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
quantized_model = converter.convert()

with open("mobilenet_v2_int8.tflite", "wb") as f:
    f.write(quantized_model)

print(f"Quantized size: {len(quantized_model) / 1024 / 1024:.1f} MB")
# ~3.4 MB vs ~14 MB original (4x smaller)
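The 4x figure follows from storing each value in one byte instead of four. Below is a minimal sketch of the affine mapping that INT8 quantization relies on, real ≈ scale × (q − zero_point), in plain Python. The helper names and the sample weights are illustrative assumptions, not part of the TFLite API, and the converter's own calibration is more involved than this.

```python
from array import array

def quantize_params(values, qmin=-128, qmax=127):
    """Derive (scale, zero_point) so the value range maps onto [qmin, qmax]."""
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)  # range must include 0
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for constant input
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point

def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to int8: q = clamp(round(x / scale) + zero_point)."""
    return [max(qmin, min(qmax, int(round(x / scale)) + zero_point)) for x in values]

def dequantize(qvalues, scale, zero_point):
    """Recover approximate floats: x ~= scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in qvalues]

# Illustrative weights (made up for this example)
weights = [-1.0, -0.5, 0.0, 0.25, 0.5, 1.0]
scale, zero_point = quantize_params(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)

float_bytes = len(array("f", weights).tobytes())  # 4 bytes per weight
int8_bytes = len(array("b", q).tobytes())         # 1 byte per weight
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{float_bytes} B -> {int8_bytes} B, max round-trip error {max_error:.4f}")
```

The round-trip error stays within one quantization step (the scale), which is why accuracy usually drops only slightly when the calibration range matches real inputs.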
Running Inference
Python - TFLite Inference on Raspberry Pi
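import numpy as np
from PIL import Image  # Pillow is an assumption; any image loader works
import tflite_runtime.interpreter as tflite

# Load TFLite model
interpreter = tflite.Interpreter(model_path="mobilenet_v2_int8.tflite")
interpreter.allocate_tensors()

# Get input/output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

def preprocess_image(path):
    # Resize to 224x224 and scale pixels to [-1, 1], as MobileNetV2 expects
    img = Image.open(path).convert("RGB").resize((224, 224))
    return np.expand_dims(np.asarray(img, dtype=np.float32) / 127.5 - 1.0, axis=0)

# Prepare input image; the INT8 model expects quantized int8 values
image = preprocess_image("cat.jpg")
scale, zero_point = input_details[0]['quantization']
image = np.clip(np.round(image / scale + zero_point), -128, 127).astype(np.int8)
interpreter.set_tensor(input_details[0]['index'], image)

# Run inference
interpreter.invoke()

# Get predictions
output = interpreter.get_tensor(output_details[0]['index'])
predicted_class = np.argmax(output)
print(f"Predicted class: {predicted_class}")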
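With inference_output_type set to tf.int8, the output tensor holds quantized scores rather than real-valued ones. The sketch below shows one way to turn them back into real values and pick the top classes, using the (scale, zero_point) pair that get_output_details() reports under the 'quantization' key. The helper name top_k and the sample numbers are assumptions for illustration.

```python
def top_k(quantized_scores, scale, zero_point, k=3):
    """Dequantize int8 scores and return the k best (class_index, score) pairs."""
    # int(q) avoids int8 overflow when the scores are NumPy int8 values
    scores = [scale * (int(q) - zero_point) for q in quantized_scores]
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up int8 output for a 5-class model; on a device these would come from
# interpreter.get_tensor(...)[0] and output_details[0]['quantization'].
sample_output = [-128, -100, 90, -128, -20]
sample_scale, sample_zero_point = 1 / 256, -128

for class_index, score in top_k(sample_output, sample_scale, sample_zero_point):
    print(f"class {class_index}: {score:.3f}")
```

For picking only the single best class, np.argmax on the raw int8 output is enough, since dequantization with a positive scale preserves ordering.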
TFLite Micro
TFLite Micro is the version designed for microcontrollers. It runs without an OS, has no dynamic memory allocation, and fits in kilobytes of flash:
- Supported MCUs: Arm Cortex-M devices (such as STM32 and several Arduino boards) and ESP32, among others.
- Use cases: Keyword spotting ("Hey Google"), wake word detection, gesture recognition, anomaly detection.
- Model size limit: Typically <256 KB for MCU deployment.
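Because most microcontrollers have no file system, a model is commonly compiled into the firmware as a C byte array (often produced with xxd -i). Here is a minimal Python sketch of the same idea with a flash-budget check. The function name, the variable name g_model, and the 256 KB default are assumptions based on the rule of thumb above, not fixed TFLite Micro requirements.

```python
def model_to_c_array(model_bytes, var_name="g_model", max_bytes=256 * 1024):
    """Render a .tflite flatbuffer as C source, refusing models over the budget."""
    if len(model_bytes) > max_bytes:
        raise ValueError(
            f"model is {len(model_bytes)} bytes, budget is {max_bytes} bytes"
        )
    lines = []
    for offset in range(0, len(model_bytes), 12):  # 12 bytes per source line
        chunk = model_bytes[offset:offset + 12]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(lines)
    return (
        f"const unsigned char {var_name}[] = {{\n{body}\n}};\n"
        f"const unsigned int {var_name}_len = {len(model_bytes)};\n"
    )

# Placeholder bytes stand in for the contents of a real .tflite file
print(model_to_c_array(bytes(range(16))))
```

In practice the input would be the bytes returned by converter.convert(), and the size check fails early instead of at link time when the firmware no longer fits in flash.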
Quantization Comparison
| Type | Size Reduction | Speed Improvement | Accuracy Loss |
|---|---|---|---|
| Dynamic Range (INT8 weights) | 4x | Moderate | Minimal |
| Full Integer (INT8) | 4x | 2-3x | Small (1-2%) |
| Float16 | 2x | GPU-accelerated | Negligible |
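The speed-ups in the table depend heavily on the target hardware, so it is worth measuring them on the device itself. A minimal timing sketch follows, assuming any zero-argument callable (for example interpreter.invoke). The helper name benchmark and the warm-up and run counts are illustrative choices.

```python
import statistics
import time

def benchmark(run_once, warmup=5, runs=50):
    """Return the median latency of run_once() in milliseconds."""
    for _ in range(warmup):  # warm caches and lazy initialization first
        run_once()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Stand-in workload; on a device, pass interpreter.invoke for each model variant
latency_ms = benchmark(lambda: sum(range(10_000)))
print(f"median latency: {latency_ms:.3f} ms")
```

The median is used rather than the mean so that occasional scheduler hiccups on a busy device do not skew the comparison between float and quantized models.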
Key takeaway: TFLite is the go-to framework for deploying TensorFlow models to edge devices. INT8 quantization typically gives about 4x smaller models and 2-3x faster inference at a small accuracy cost (around 1-2%). Use tflite_runtime for lightweight deployment without the full TensorFlow package.