EfficientNet: A Deep Dive into Concepts and Mathematical Principles
1. Core Concepts
1.1 Compound Scaling
- Core idea: scale the network's depth, width, and resolution simultaneously
- Mathematical formulation:
  $$\text{depth}: d = \alpha^\phi \qquad \text{width}: w = \beta^\phi \qquad \text{resolution}: r = \gamma^\phi$$
  $$\text{subject to: } \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2, \qquad \alpha \geq 1,\ \beta \geq 1,\ \gamma \geq 1$$
  where $\phi$ is a user-specified scaling coefficient.
Imagine you are designing a house:
- Depth: the number of floors (network layers)
- Width: the number of rooms per floor (channels)
- Resolution: the size of each room (input image size)
EfficientNet's secret is that it:
- does not simply add more floors (depth),
- does not simply enlarge the rooms (width),
- but scales all three aspects together, in the best ratio (see the sketch below).
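As a concrete illustration, here is a minimal Python sketch of the compound scaling rule; the function name `compound_scale` is ours, and the default constants are the B0 values quoted later in Section 3.2:

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution) multipliers for a scaling coefficient phi.

    Defaults are the B0 constants from Section 3.2; any alpha, beta, gamma with
    alpha * beta**2 * gamma**2 ~ 2 and all factors >= 1 is admissible.
    """
    d = alpha ** phi  # depth multiplier: more layers
    w = beta ** phi   # width multiplier: more channels per layer
    r = gamma ** phi  # resolution multiplier: larger input images
    return d, w, r

# One unit of phi multiplies FLOPs by alpha * beta**2 * gamma**2 (~1.92 here):
d, w, r = compound_scale(phi=1)
print(f"d={d:.2f}, w={w:.2f}, r={r:.2f}, FLOPs factor={d * w**2 * r**2:.2f}")
```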
1.2 The MBConv Block
- Structure:
  - 1x1 expansion convolution (Expand)
  - Depthwise convolution (Depthwise)
  - Squeeze-and-Excitation (SE) module
  - 1x1 projection convolution (Project)
- Mathematical formulation (a code sketch follows):
  For an input $X$, the output of the MBConv block is:
  $$Y = \text{Proj}(\text{SE}(\text{DW}(\text{Expand}(X))))$$
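A minimal PyTorch sketch of this composition (assuming PyTorch is available; the batch norm, activations, and residual connection of a real MBConv are omitted, and the SE stage is stubbed with `nn.Identity`, see Section 2.2 for a full SE block):

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Sketch of Y = Proj(SE(DW(Expand(X)))); normalization/activation/skip omitted."""
    def __init__(self, c_in, c_out, expand_ratio=6, kernel_size=3):
        super().__init__()
        c_mid = c_in * expand_ratio
        # 1x1 expansion: lift channel count from c_in to c_mid
        self.expand = nn.Conv2d(c_in, c_mid, 1, bias=False)
        # Depthwise conv: groups == channels, one spatial filter per channel
        self.dw = nn.Conv2d(c_mid, c_mid, kernel_size,
                            padding=kernel_size // 2, groups=c_mid, bias=False)
        self.se = nn.Identity()  # placeholder for the SE block of Section 2.2
        # 1x1 projection: compress back down to c_out
        self.proj = nn.Conv2d(c_mid, c_out, 1, bias=False)

    def forward(self, x):
        return self.proj(self.se(self.dw(self.expand(x))))

x = torch.randn(1, 16, 56, 56)
print(MBConv(16, 24)(x).shape)  # torch.Size([1, 24, 56, 56])
```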
Why this design?
The role of depth (floors)
- Extracts more abstract features
- But going too deep causes vanishing gradients
The role of width (rooms)
- Captures more fine-grained features
- But going too wide inflates computation
The role of resolution (room size)
- Lets the network see finer details
- But larger inputs significantly increase computation
2. Mathematical Principles
2.1 Depthwise Separable Convolution
- Cost of a standard convolution:
  $$C_{std} = K^2 \cdot C_{in} \cdot C_{out} \cdot H \cdot W$$
- Cost of a depthwise separable convolution (depthwise plus 1x1 pointwise):
  $$C_{depthwise} = K^2 \cdot C_{in} \cdot H \cdot W + C_{in} \cdot C_{out} \cdot H \cdot W$$
- Reduction ratio (checked numerically below):
  $$\frac{C_{depthwise}}{C_{std}} = \frac{1}{C_{out}} + \frac{1}{K^2}$$
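To make the ratio concrete, a quick back-of-the-envelope check (the layer shape is an arbitrary example, not taken from EfficientNet):

```python
K, C_in, C_out, H, W = 3, 64, 128, 56, 56  # example layer shape

std = K**2 * C_in * C_out * H * W                 # standard convolution
dws = K**2 * C_in * H * W + C_in * C_out * H * W  # depthwise + 1x1 pointwise

print(f"standard:  {std:>12,}")                       # 231,211,008
print(f"separable: {dws:>12,}")                       #  27,496,448
print(f"measured ratio: {dws / std:.4f}")             # 0.1189
print(f"formula:        {1 / C_out + 1 / K**2:.4f}")  # 1/C_out + 1/K^2 = 0.1189
```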
2.2 The Squeeze-and-Excitation Module
- Squeeze operation:
  $$z_c = \frac{1}{H \times W} \sum_{i=1}^H \sum_{j=1}^W u_c(i,j)$$
- Excitation operation:
  $$s = \sigma(W_2\, \delta(W_1 z))$$
  where $\delta$ is the ReLU activation and $\sigma$ is the sigmoid function.
- Feature recalibration (all three steps are combined in the sketch below):
  $$\tilde{x}_c = s_c \cdot u_c$$
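Putting the three equations together, a minimal PyTorch sketch of the SE block (the reduction ratio of 4 inside the excitation MLP is an illustrative choice):

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # W1
        self.fc2 = nn.Linear(channels // reduction, channels)  # W2

    def forward(self, u):
        # Squeeze: global average pool, z_c = mean over the H x W grid
        z = u.mean(dim=(2, 3))
        # Excitation: s = sigmoid(W2 @ relu(W1 @ z))
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))
        # Recalibration: scale channel c of the input by s_c
        return u * s[:, :, None, None]

u = torch.randn(2, 32, 14, 14)
print(SqueezeExcite(32)(u).shape)  # torch.Size([2, 32, 14, 14])
```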
3. Network Architecture
3.1 Overall Structure
- Stage layout of the baseline network (also given as a data structure below):
| Stage | Operator | Resolution | Channels | Layers |
|-------|----------|------------|----------|--------|
| 1 | Conv3x3 | 224x224 | 32 | 1 |
| 2 | MBConv1, k3x3 | 112x112 | 16 | 1 |
| 3 | MBConv6, k3x3 | 112x112 | 24 | 2 |
| 4 | MBConv6, k5x5 | 56x56 | 40 | 2 |
| 5 | MBConv6, k3x3 | 28x28 | 80 | 3 |
| 6 | MBConv6, k5x5 | 14x14 | 112 | 3 |
| 7 | MBConv6, k5x5 | 14x14 | 192 | 4 |
| 8 | MBConv6, k3x3 | 7x7 | 320 | 1 |
| 9 | Conv1x1 & Pooling | 7x7 | 1280 | 1 |
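The same stages, written as a data structure that a model-builder loop could consume (a sketch; the tuple layout mirrors the table columns):

```python
# (operator, kernel size, input resolution, output channels, layer count) per stage
B0_STAGES = [
    ("Conv3x3",           3, 224,   32, 1),
    ("MBConv1",           3, 112,   16, 1),
    ("MBConv6",           3, 112,   24, 2),
    ("MBConv6",           5,  56,   40, 2),
    ("MBConv6",           3,  28,   80, 3),
    ("MBConv6",           5,  14,  112, 3),
    ("MBConv6",           5,  14,  192, 4),
    ("MBConv6",           3,   7,  320, 1),
    ("Conv1x1 & Pooling", 1,   7, 1280, 1),
]

print(sum(layers for *_, layers in B0_STAGES))  # 18 layers in total
```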
3.2 Scaling Strategy
- Baseline (B0) constants:
  $$\phi = 1, \quad \alpha = 1.2, \quad \beta = 1.1, \quad \gamma = 1.15$$
- Scaling formulas:
  $$\text{Depth}: D = \alpha^\phi \qquad \text{Width}: W = \beta^\phi \qquad \text{Resolution}: R = \gamma^\phi$$
4. Performance Analysis
4.1 Computational Complexity
- FLOPs:
  $$\text{FLOPs} \propto d \cdot w^2 \cdot r^2$$
  Width enters quadratically because both the input and output channel counts grow with $w$; resolution enters quadratically through the $H \times W$ spatial grid.
- Parameter count (independent of resolution, since weights are shared across spatial positions):
  $$\text{Params} \propto d \cdot w^2$$
  The sketch below checks the FLOPs proportionality numerically.
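A quick check that the constraint $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ makes FLOPs roughly double per unit of $\phi$ (using the B0 constants from Section 3.2):

```python
alpha, beta, gamma = 1.2, 1.1, 1.15  # B0 constants

for phi in range(5):
    d, w, r = alpha**phi, beta**phi, gamma**phi
    print(f"phi={phi}: FLOPs x{d * w**2 * r**2:.2f}")  # FLOPs ~ d * w^2 * r^2
# Each step multiplies FLOPs by alpha * beta**2 * gamma**2 ~ 1.92, i.e. roughly 2
```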
4.2 Balancing Accuracy and Efficiency
- Accuracy:
  $$\text{Accuracy} = f(d, w, r)$$
  where $f$ is a complex nonlinear function.
- Efficiency:
  $$\text{Efficiency} = \frac{\text{Accuracy}}{\text{FLOPs}}$$
5. Optimization Theory
5.1 Pareto Optimality
- Objective:
  $$\max_{d,w,r} \text{Accuracy}(d,w,r) \quad \text{s.t.} \quad \text{FLOPs}(d,w,r) \leq \text{Budget}$$
- Constraint on the scaling constants:
  $$\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$$
5.2 Neural Architecture Search
- Search space:
  $$\mathcal{S} = \{(\alpha, \beta, \gamma) \mid \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2\}$$
- Optimization objective (a toy version follows below):
  $$\max_{(\alpha,\beta,\gamma) \in \mathcal{S}} \text{Accuracy}\big(d(\alpha), w(\beta), r(\gamma)\big)$$
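A toy version of this search; `evaluate_accuracy` is a hypothetical stand-in for what the real procedure does (train and validate a scaled model at $\phi = 1$):

```python
import itertools

def search_scaling_constants(evaluate_accuracy, tol=0.1):
    """Grid-search (alpha, beta, gamma) subject to alpha * beta^2 * gamma^2 ~ 2."""
    grid = [round(1.0 + 0.05 * i, 2) for i in range(9)]  # candidates 1.00 ... 1.40
    best, best_acc = None, float("-inf")
    for a, b, g in itertools.product(grid, repeat=3):
        if abs(a * b**2 * g**2 - 2.0) > tol:  # enforce the FLOPs constraint
            continue
        acc = evaluate_accuracy(a, b, g)      # hypothetical: train & eval at phi=1
        if acc > best_acc:
            best, best_acc = (a, b, g), acc
    return best

# Purely illustrative accuracy proxy, peaked at the B0 constants:
proxy = lambda a, b, g: -((a - 1.2)**2 + (b - 1.1)**2 + (g - 1.15)**2)
print(search_scaling_constants(proxy))  # (1.2, 1.1, 1.15)
```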
6. Mathematical Proof
6.1 Optimality of Compound Scaling
Theorem: under a fixed computation budget, compound scaling attains a Pareto-optimal trade-off.
Proof (sketch):
- Define the constrained objective:
  $$\max_{d,w,r} f(d,w,r) \quad \text{s.t.} \quad d \cdot w^2 \cdot r^2 \leq C$$
- Form the Lagrangian:
  $$\mathcal{L}(d,w,r,\lambda) = f(d,w,r) - \lambda \left(d \cdot w^2 \cdot r^2 - C\right)$$
- Set the partial derivatives to zero:
  $$\frac{\partial f}{\partial d} = \lambda w^2 r^2, \qquad \frac{\partial f}{\partial w} = 2\lambda d w r^2, \qquad \frac{\partial f}{\partial r} = 2\lambda d w^2 r$$
- Multiplying each equation by its own variable makes every right-hand side proportional to $\lambda\, d w^2 r^2$, which yields the optimality condition:
  $$d \frac{\partial f}{\partial d} = \frac{w}{2} \frac{\partial f}{\partial w} = \frac{r}{2} \frac{\partial f}{\partial r}$$
- Parameterizing all three dimensions as powers of a single coefficient $\phi$ keeps these first-order conditions balanced as the budget $C$ grows, which gives the compound scaling relations:
  $$d \propto \alpha^\phi, \quad w \propto \beta^\phi, \quad r \propto \gamma^\phi$$
7. Practical Application
7.1 Model Scaling
- Scaling procedure:
  - Fix $\phi = 1$ and grid-search the optimal $\alpha, \beta, \gamma$
  - Scale up proportionally to obtain the B0-B7 models
- Worked example (verified in the sketch below):
  For the B4 model:
  $$\phi = 4, \quad d = 1.2^4 \approx 2.07, \quad w = 1.1^4 \approx 1.46, \quad r = 1.15^4 \approx 1.75$$
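The arithmetic checks out directly; note that the released checkpoints round these multipliers to hand-picked values rather than using them verbatim:

```python
phi = 4
alpha, beta, gamma = 1.2, 1.1, 1.15

d, w, r = alpha**phi, beta**phi, gamma**phi
print(f"depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
# depth x2.07, width x1.46, resolution x1.75

# Implied input size and FLOPs growth relative to B0 (224x224 input):
print(f"input ~{224 * r:.0f}px, FLOPs x{d * w**2 * r**2:.1f}")  # ~392px, x13.6
```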
7.2 Performance Prediction
- Accuracy prediction model:
  $$\text{Accuracy} = a \cdot \log(\text{FLOPs}) + b$$
- Efficiency prediction (fitted in the sketch below):
  $$\text{Efficiency} = \frac{a \cdot \log(\text{FLOPs}) + b}{\text{FLOPs}}$$
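Estimating $a$ and $b$ is ordinary least squares in log-FLOPs space; a sketch with placeholder numbers (the arrays are illustrative, not measured results):

```python
import numpy as np

# Placeholder (FLOPs, accuracy) pairs -- illustrative only, not measurements
flops = np.array([0.4, 0.7, 1.0, 1.8, 4.2])     # GFLOPs
acc = np.array([77.0, 79.0, 80.0, 81.5, 83.0])  # top-1 accuracy, %

a, b = np.polyfit(np.log(flops), acc, deg=1)  # fit Accuracy ~ a*log(FLOPs) + b
print(f"a={a:.2f}, b={b:.2f}")

efficiency = (a * np.log(flops) + b) / flops  # predicted accuracy per GFLOP
print(efficiency.round(1))
```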
8. Theoretical Contributions
- Proposed a unified scaling method
- Proved the optimality of compound scaling
- Established a quantitative accuracy-efficiency relationship
- Provided theoretical guidance for subsequent model design