EfficientNet Architecture
A comprehensive guide to the EfficientNet architecture and its compound scaling method, within the context of CNN architecture design.
The Scaling Problem
Before EfficientNet, researchers scaled CNNs in ad hoc ways: make the network deeper (ResNet-152), wider (Wide ResNet), or feed it higher-resolution input (GPipe). Each approach improved accuracy, but with diminishing returns and rising computational cost. No one had systematically studied how to scale all three dimensions together.
EfficientNet (Tan and Le, 2019) introduced compound scaling, a principled method for scaling CNN architectures along width, depth, and resolution simultaneously, using fixed ratios found by a small grid search on the baseline network. At the time, EfficientNet-B7 reached state-of-the-art ImageNet accuracy while being up to 8.4x smaller and 6.1x faster at inference than the best existing ConvNets.
Compound Scaling
The compound scaling method uses a compound coefficient phi to uniformly scale network width, depth, and resolution:
# Compound scaling formula
# depth:      d = alpha ^ phi
# width:      w = beta ^ phi
# resolution: r = gamma ^ phi
# Constraint: alpha * beta^2 * gamma^2 ≈ 2,
# so total FLOPs grow by roughly 2^phi
# (FLOPs scale with depth * width^2 * resolution^2)

# EfficientNet scaling coefficients (found via grid search on the B0 baseline)
alpha = 1.2   # depth multiplier
beta = 1.1    # width multiplier
gamma = 1.15  # resolution multiplier
# EfficientNet-B0 (baseline): 5.3M params, 224x224
# EfficientNet-B1 (phi=1): 7.8M params, 240x240
# EfficientNet-B2 (phi=2): 9.2M params, 260x260
# ...
# EfficientNet-B7 (phi=7): 66M params, 600x600
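The scaling rule above is easy to sanity-check by hand. The sketch below (plain Python, no framework needed) expands the published coefficients for a few values of phi and shows the implied FLOPs growth; the helper name compound_scale is just for illustration:

```python
# Grid-searched base coefficients from the EfficientNet paper
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

for phi in (0, 1, 2, 7):
    d, w, r = compound_scale(phi)
    # FLOPs scale with d * w^2 * r^2, i.e. (alpha * beta^2 * gamma^2)^phi ≈ 2^phi
    flops_growth = (ALPHA * BETA**2 * GAMMA**2) ** phi
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{flops_growth:.1f}")
```

Note that the published input resolutions (240, 260, ..., 600) only approximately follow gamma^phi; the released models round each dimension to practical values.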
Mobile Inverted Bottleneck (MBConv)
EfficientNet's base architecture uses MBConv blocks, originally from MobileNetV2. These blocks invert the traditional bottleneck by expanding to a higher dimension, applying depthwise convolution, then squeezing back down:
import torch
import torch.nn as nn

class MBConv(nn.Module):
    def __init__(self, in_ch, out_ch, expand_ratio, kernel_size, stride):
        super().__init__()
        mid_ch = in_ch * expand_ratio
        self.use_residual = (stride == 1 and in_ch == out_ch)
        layers = []
        # 1x1 expansion (skipped when expand_ratio == 1)
        if expand_ratio != 1:
            layers += [nn.Conv2d(in_ch, mid_ch, 1, bias=False),
                       nn.BatchNorm2d(mid_ch), nn.SiLU()]
        # Depthwise convolution (groups == channels)
        layers += [nn.Conv2d(mid_ch, mid_ch, kernel_size,
                             stride=stride, padding=kernel_size // 2,
                             groups=mid_ch, bias=False),
                   nn.BatchNorm2d(mid_ch), nn.SiLU()]
        # Squeeze-and-Excitation (SEBlock is described in the next section)
        layers += [SEBlock(mid_ch)]
        # Pointwise projection back down (no activation: linear bottleneck)
        layers += [nn.Conv2d(mid_ch, out_ch, 1, bias=False),
                   nn.BatchNorm2d(out_ch)]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_residual:
            return x + self.block(x)
        return self.block(x)
Squeeze-and-Excitation
Each MBConv block includes a Squeeze-and-Excitation (SE) module that adds channel-wise attention. SE blocks learn to emphasize informative channels and suppress less useful ones, improving representational power with minimal computational overhead.
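The SE module can be sketched as follows. This is a minimal PyTorch implementation, assuming a squeeze via global average pooling and an illustrative reduction ratio of 4; in the actual EfficientNet, the squeezed width is derived from the block's input channels rather than a fixed ratio. The class name SEBlock matches the reference in the MBConv code above:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: channel-wise attention via a small bottleneck."""
    def __init__(self, channels, reduction=4):  # reduction ratio is illustrative
        super().__init__()
        squeezed = max(1, channels // reduction)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),           # squeeze: global spatial average
            nn.Conv2d(channels, squeezed, 1),  # excitation MLP as 1x1 convs
            nn.SiLU(),
            nn.Conv2d(squeezed, channels, 1),
            nn.Sigmoid(),                      # per-channel weights in (0, 1)
        )

    def forward(self, x):
        return x * self.gate(x)  # reweight channels; output shape == input shape
```

Because the output shape matches the input, the block drops into any sequential stack, which is why MBConv can insert it between the depthwise and projection convolutions.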
EfficientNetV2
EfficientNetV2 (2021) improved on the original by replacing some MBConv blocks with Fused-MBConv blocks, which merge the 1x1 expansion and depthwise convolutions into a single regular 3x3 convolution, in the early stages where feature maps are large and depthwise convolutions are memory-bound. It also used progressive learning: training starts with smaller images and gradually increases resolution.
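A Fused-MBConv block can be sketched in the same style as the MBConv code above. This is a hedged sketch, not the exact released implementation: the single 3x3 convolution does the work of the 1x1 expansion plus the depthwise 3x3, the 1x1 projection remains when there is an expansion, and SE is omitted for simplicity:

```python
import torch
import torch.nn as nn

class FusedMBConv(nn.Module):
    """Fused-MBConv sketch: a regular 3x3 conv replaces the
    1x1 expansion + depthwise 3x3 of MBConv (SE omitted for simplicity)."""
    def __init__(self, in_ch, out_ch, expand_ratio, stride=1):
        super().__init__()
        self.use_residual = (stride == 1 and in_ch == out_ch)
        if expand_ratio != 1:
            mid_ch = in_ch * expand_ratio
            layers = [nn.Conv2d(in_ch, mid_ch, 3, stride=stride,
                                padding=1, bias=False),
                      nn.BatchNorm2d(mid_ch), nn.SiLU(),
                      # 1x1 projection back down (no activation), as in MBConv
                      nn.Conv2d(mid_ch, out_ch, 1, bias=False),
                      nn.BatchNorm2d(out_ch)]
        else:
            # No expansion: a single fused 3x3 conv maps in_ch -> out_ch
            layers = [nn.Conv2d(in_ch, out_ch, 3, stride=stride,
                                padding=1, bias=False),
                      nn.BatchNorm2d(out_ch), nn.SiLU()]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_residual:
            return x + self.block(x)
        return self.block(x)
```

On large early-stage feature maps, a single dense 3x3 convolution uses hardware more efficiently than a depthwise convolution, which is the motivation for fusing in those stages only.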
Performance Comparison
- EfficientNet-B0: 77.1% top-1 accuracy, 5.3M params, 0.39B FLOPs
- EfficientNet-B3: 81.6% top-1 accuracy, 12M params, 1.8B FLOPs
- EfficientNet-B7: 84.3% top-1 accuracy, 66M params, 37B FLOPs
- EfficientNetV2-L: 85.7% top-1 accuracy, 120M params, faster training than B7
The next lesson explores modern CNN innovations that have emerged since EfficientNet, including ConvNeXt, which demonstrated that pure CNNs can match Vision Transformers.