EfficientNet Architecture
A comprehensive guide to the EfficientNet architecture and its compound scaling method, within the context of CNN architecture design.
The Scaling Problem
Before EfficientNet, researchers scaled CNNs in ad hoc ways: make the network deeper (ResNet-152), wider (Wide ResNet), or feed it higher-resolution input (GPipe). Each approach improved accuracy, but with diminishing returns and rising computational cost. No one had systematically studied how to scale all three dimensions together.
EfficientNet (Tan and Le, 2019) introduced compound scaling, a principled method for scaling CNN architectures along width, depth, and resolution simultaneously, using fixed ratios found by a small grid search on the baseline network. At the time, EfficientNet-B7 reached state-of-the-art ImageNet accuracy while being up to 8.4x smaller and 6.1x faster at inference than the best existing ConvNets.
Compound Scaling
The compound scaling method uses a compound coefficient phi to uniformly scale network width, depth, and resolution:
# Compound scaling formula
# depth:      d = alpha ^ phi
# width:      w = beta ^ phi
# resolution: r = gamma ^ phi
# Constraint: alpha * beta^2 * gamma^2 ≈ 2,
# so total FLOPs grow by roughly 2^phi
# (FLOPs scale with depth * width^2 * resolution^2)

# EfficientNet scaling coefficients (found via grid search on the B0 baseline)
alpha = 1.2   # depth multiplier
beta = 1.1    # width multiplier
gamma = 1.15  # resolution multiplier
# EfficientNet-B0 (baseline): 5.3M params, 224x224
# EfficientNet-B1 (phi=1): 7.8M params, 240x240
# EfficientNet-B2 (phi=2): 9.2M params, 260x260
# ...
# EfficientNet-B7 (phi=7): 66M params, 600x600
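The scaling rule above is easy to sanity-check by hand. The sketch below (plain Python, no framework needed) expands the published coefficients for a few values of phi and shows the implied FLOPs growth; the helper name compound_scale is just for illustration:

```python
# Grid-searched base coefficients from the EfficientNet paper
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

for phi in (0, 1, 2, 7):
    d, w, r = compound_scale(phi)
    # FLOPs scale with d * w^2 * r^2, i.e. (alpha * beta^2 * gamma^2)^phi ≈ 2^phi
    flops_growth = (ALPHA * BETA**2 * GAMMA**2) ** phi
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{flops_growth:.1f}")
```

Note that the published input resolutions (240, 260, ..., 600) only approximately follow gamma^phi; the released models round each dimension to practical values.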
Mobile Inverted Bottleneck (MBConv)
EfficientNet's base architecture uses MBConv blocks, originally from MobileNetV2. These blocks invert the traditional bottleneck by expanding to a higher dimension, applying depthwise convolution, then squeezing back down:
import torch
import torch.nn as nn

class MBConv(nn.Module):
    def __init__(self, in_ch, out_ch, expand_ratio, kernel_size, stride):
        super().__init__()
        mid_ch = in_ch * expand_ratio
        self.use_residual = (stride == 1 and in_ch == out_ch)
        layers = []
        # 1x1 expansion (skipped when expand_ratio == 1)
        if expand_ratio != 1:
            layers += [nn.Conv2d(in_ch, mid_ch, 1, bias=False),
                       nn.BatchNorm2d(mid_ch), nn.SiLU()]
        # Depthwise convolution (groups == channels)
        layers += [nn.Conv2d(mid_ch, mid_ch, kernel_size,
                             stride=stride, padding=kernel_size // 2,
                             groups=mid_ch, bias=False),
                   nn.BatchNorm2d(mid_ch), nn.SiLU()]
        # Squeeze-and-Excitation (SEBlock is described in the next section)
        layers += [SEBlock(mid_ch)]
        # Pointwise projection back down (no activation: linear bottleneck)
        layers += [nn.Conv2d(mid_ch, out_ch, 1, bias=False),
                   nn.BatchNorm2d(out_ch)]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_residual:
            return x + self.block(x)
        return self.block(x)
Squeeze-and-Excitation
Each MBConv block includes a Squeeze-and-Excitation (SE) module that adds channel-wise attention. SE blocks learn to emphasize informative channels and suppress less useful ones, improving representational power with minimal computational overhead.
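The SE module can be sketched as follows. This is a minimal PyTorch implementation, assuming a squeeze via global average pooling and an illustrative reduction ratio of 4; in the actual EfficientNet, the squeezed width is derived from the block's input channels rather than a fixed ratio. The class name SEBlock matches the reference in the MBConv code above:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: channel-wise attention via a small bottleneck."""
    def __init__(self, channels, reduction=4):  # reduction ratio is illustrative
        super().__init__()
        squeezed = max(1, channels // reduction)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),           # squeeze: global spatial average
            nn.Conv2d(channels, squeezed, 1),  # excitation MLP as 1x1 convs
            nn.SiLU(),
            nn.Conv2d(squeezed, channels, 1),
            nn.Sigmoid(),                      # per-channel weights in (0, 1)
        )

    def forward(self, x):
        return x * self.gate(x)  # reweight channels; output shape == input shape
```

Because the output shape matches the input, the block drops into any sequential stack, which is why MBConv can insert it between the depthwise and projection convolutions.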
EfficientNetV2
EfficientNetV2 (2021) improved on the original by replacing some MBConv blocks with Fused-MBConv blocks, which merge the 1x1 expansion and depthwise convolutions into a single regular 3x3 convolution, in the early stages where feature maps are large and depthwise convolutions are memory-bound. It also used progressive learning: training starts with smaller images and gradually increases resolution.
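A Fused-MBConv block can be sketched in the same style as the MBConv code above. This is a hedged sketch, not the exact released implementation: the single 3x3 convolution does the work of the 1x1 expansion plus the depthwise 3x3, the 1x1 projection remains when there is an expansion, and SE is omitted for simplicity:

```python
import torch
import torch.nn as nn

class FusedMBConv(nn.Module):
    """Fused-MBConv sketch: a regular 3x3 conv replaces the
    1x1 expansion + depthwise 3x3 of MBConv (SE omitted for simplicity)."""
    def __init__(self, in_ch, out_ch, expand_ratio, stride=1):
        super().__init__()
        self.use_residual = (stride == 1 and in_ch == out_ch)
        if expand_ratio != 1:
            mid_ch = in_ch * expand_ratio
            layers = [nn.Conv2d(in_ch, mid_ch, 3, stride=stride,
                                padding=1, bias=False),
                      nn.BatchNorm2d(mid_ch), nn.SiLU(),
                      # 1x1 projection back down (no activation), as in MBConv
                      nn.Conv2d(mid_ch, out_ch, 1, bias=False),
                      nn.BatchNorm2d(out_ch)]
        else:
            # No expansion: a single fused 3x3 conv maps in_ch -> out_ch
            layers = [nn.Conv2d(in_ch, out_ch, 3, stride=stride,
                                padding=1, bias=False),
                      nn.BatchNorm2d(out_ch), nn.SiLU()]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_residual:
            return x + self.block(x)
        return self.block(x)
```

On large early-stage feature maps, a single dense 3x3 convolution uses hardware more efficiently than a depthwise convolution, which is the motivation for fusing in those stages only.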
Performance Comparison
- EfficientNet-B0: 77.1% top-1 accuracy, 5.3M params, 0.39B FLOPs
- EfficientNet-B3: 81.6% top-1 accuracy, 12M params, 1.8B FLOPs
- EfficientNet-B7: 84.3% top-1 accuracy, 66M params, 37B FLOPs
- EfficientNetV2-L: 85.7% top-1 accuracy, 120M params, faster training than B7
The next lesson explores modern CNN innovations that have emerged since EfficientNet, including ConvNeXt, which demonstrated that pure CNNs can match Vision Transformers.