# Reading the code from scratch: how the points described in the paper are implemented in code.
Swin Transformer training settings:
1. Regular ImageNet-1K training
Optimizer: AdamW, trained for 300 epochs, built as follows:
```python
from torch import optim as optim

# eps, betas, lr and weight_decay are all read from the repo's config object
optimizer = optim.AdamW(parameters, eps=config.TRAIN.OPTIMIZER.EPS, betas=config.TRAIN.OPTIMIZER.BETAS,
                        lr=config.TRAIN.BASE_LR, weight_decay=config.TRAIN.WEIGHT_DECAY)
```
A cosine decay learning rate scheduler with 20 epochs of linear warm-up is used (see the scheduler sketch after this list).
A batch size of 1024, an initial learning rate of 0.001, and a weight decay of 0.05 are used.
2. Pretraining on ImageNet-22K and fine-tuning on ImageNet-1K.
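A minimal sketch of the warm-up plus cosine-decay schedule from setting 1, using plain PyTorch schedulers; the placeholder model, the per-epoch stepping, and the warm-up start factor are illustrative assumptions, not taken from the repo.

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = nn.Linear(10, 10)                                   # placeholder model
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

epochs, warmup_epochs = 300, 20
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        # 20 epochs of linear warm-up from a small fraction of the base lr
        LinearLR(optimizer, start_factor=0.001, total_iters=warmup_epochs),
        # cosine decay over the remaining 280 epochs
        CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs),
    ],
    milestones=[warmup_epochs],
)

for epoch in range(epochs):
    # ... one training epoch with batch size 1024 ...
    scheduler.step()
```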
Ablation study factors:
1. Relative position bias (B), added to the attention logits (see the window attention sketch below):
\[\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\left(\frac{QK^T}{\sqrt{d_k}} + B\right)V\]
2. Shifted windows (see the cyclic shift sketch below)
3. Downsampling structure (at the end of each of the first three stages, downsampling is performed by PatchMerging; see the sketch below)
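For the relative position bias B, the implementation keeps one learnable bias per relative offset between two positions inside a window (and per head), gathers an (N, N) bias matrix through a precomputed index, and adds it to the attention logits before the softmax. The sketch below condenses this; names roughly follow the repo's `WindowAttention`, but dropout, the optional qk scale, and the shifted-window attention mask are left out.

```python
import torch
import torch.nn as nn

class WindowAttentionSketch(nn.Module):
    """Attention inside one window with a learned relative position bias B (condensed sketch)."""
    def __init__(self, dim, window_size, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        Wh, Ww = window_size
        # one learnable bias per relative offset and per head: (2*Wh-1)*(2*Ww-1) entries
        self.relative_position_bias_table = nn.Parameter(
            torch.zeros((2 * Wh - 1) * (2 * Ww - 1), num_heads))
        # precompute, for every pair of positions in the window, which table entry to use
        coords = torch.stack(torch.meshgrid(torch.arange(Wh), torch.arange(Ww), indexing="ij"))
        coords_flat = coords.flatten(1)                                # (2, Wh*Ww)
        rel = coords_flat[:, :, None] - coords_flat[:, None, :]       # (2, N, N)
        rel = rel.permute(1, 2, 0).contiguous()                       # (N, N, 2)
        rel[:, :, 0] += Wh - 1                                        # shift offsets to start from 0
        rel[:, :, 1] += Ww - 1
        rel[:, :, 0] *= 2 * Ww - 1
        self.register_buffer("relative_position_index", rel.sum(-1))  # (N, N)
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                             # x: (num_windows*B, N, C)
        B_, N, C = x.shape
        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        attn = (q * self.scale) @ k.transpose(-2, -1)                 # QK^T / sqrt(d_k)
        bias = self.relative_position_bias_table[self.relative_position_index.view(-1)]
        bias = bias.view(N, N, -1).permute(2, 0, 1)                   # (num_heads, N, N)
        attn = (attn + bias.unsqueeze(0)).softmax(dim=-1)             # SoftMax(QK^T/sqrt(d_k) + B)
        x = (attn @ v).transpose(1, 2).reshape(B_, N, C)
        return self.proj(x)

attn = WindowAttentionSketch(dim=96, window_size=(7, 7), num_heads=3)
out = attn(torch.randn(64, 49, 96))      # 64 windows of 7*7 = 49 tokens each
```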
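For shifted windows, consecutive blocks alternate between a regular window partition and one where the feature map is first cyclically shifted, so the same windowing code can be reused. A sketch of just that cyclic shift is given below; the attention mask that blocks interactions across the wrapped-around boundary is omitted, and the helper here only mirrors the repo's `window_partition`.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into (num_windows*B, window_size, window_size, C) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

def shifted_window_partition(x, window_size=7, shift_size=3):
    """Cyclic shift with torch.roll, then a regular window partition (mask omitted)."""
    shifted = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))
    return window_partition(shifted, window_size)

x = torch.randn(1, 56, 56, 96)           # stage-1 resolution of Swin-T
windows = shifted_window_partition(x)    # (64, 7, 7, 96): 8x8 windows of size 7x7
print(windows.shape)
```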
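For the downsampling structure, PatchMerging concatenates the features of each 2x2 group of neighboring patches (giving 4C channels), normalizes them, and projects down to 2C with a bias-free linear layer, halving the spatial resolution. The sketch below keeps the Swin v1 ordering (LayerNorm before the reduction); passing H and W to `forward` is a simplification for illustration.

```python
import torch
import torch.nn as nn

class PatchMergingSketch(nn.Module):
    """Downsampling used at the end of the first three stages: H,W -> H/2,W/2 and C -> 2C."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                       # x: (B, H*W, C)
        B, L, C = x.shape
        x = x.view(B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]                      # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]                      # bottom-left
        x2 = x[:, 0::2, 1::2, :]                      # top-right
        x3 = x[:, 1::2, 1::2, :]                      # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)       # (B, H/2, W/2, 4C)
        x = x.view(B, -1, 4 * C)
        return self.reduction(self.norm(x))           # (B, H/2 * W/2, 2C)

merge = PatchMergingSketch(dim=96)
out = merge(torch.randn(1, 56 * 56, 96), H=56, W=56)  # -> (1, 784, 192)
```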