Code analysed: https://github.com/eriklindernoren/PyTorch-YOLOv3
Paper: https://pjreddie.com/media/files/papers/YOLOv3.pdf
Note: this analysis covers the repository linked above and is written entirely from my own understanding; corrections are welcome.
We will work through YOLOv3 in the following order.
Example YOLOv3 detection results are shown below.
1. Data processing
Every model pipeline involves data loading, model construction, training and testing, so we start with data loading.
Dataset initialisation
class ListDataset(Dataset):
    def __init__(self, list_path, img_size=416, augment=True, multiscale=True, normalized_labels=True):
        with open(list_path, "r") as file:
            self.img_files = file.readlines()
        self.label_files = [
            path.replace("images", "labels").replace(".png", ".txt").replace(".jpg", ".txt")
            for path in self.img_files
        ]
        self.img_size = img_size
        self.max_objects = 100
        self.augment = augment
        self.multiscale = multiscale
        self.normalized_labels = normalized_labels
        self.min_size = self.img_size - 3 * 32
        self.max_size = self.img_size + 3 * 32
        self.batch_count = 0
list_path is the text file listing the image paths of the train/test split. self.min_size and self.max_size bound the multi-scale training range: the input resolution is periodically re-sampled between these two values in steps of 32, so the network is trained on several image sizes and detects both small and large objects more robustly.
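With the default img_size=416 this gives the following candidate resolutions (a quick standalone check, not part of the repo):
import torch  # not needed here, just Python

img_size = 416
min_size = img_size - 3 * 32   # 320
max_size = img_size + 3 * 32   # 512
print(list(range(min_size, max_size + 1, 32)))
# [320, 352, 384, 416, 448, 480, 512]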
    def __getitem__(self, index):
        # ---------
        #  Image
        # ---------
        img_path = self.img_files[index % len(self.img_files)].rstrip()
        # Extract image as PyTorch tensor
        img = transforms.ToTensor()(Image.open(img_path).convert('RGB'))
        # Handle images with less than three channels
        if len(img.shape) != 3:
            img = img.unsqueeze(0)
            img = img.expand((3, *img.shape[1:]))  # repeat the single channel three times
        _, h, w = img.shape
        h_factor, w_factor = (h, w) if self.normalized_labels else (1, 1)
        # Pad to square resolution
        img, pad = pad_to_square(img, 0)
        _, padded_h, padded_w = img.shape
pad_to_square pads the image into a square whose side equals the larger of the two dimensions; the shorter side is padded with zeros, as illustrated in the figure below.
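For reference, pad_to_square in the repo's utils/datasets.py looks roughly like this (a sketch, check the repo for the exact version):
import numpy as np
import torch.nn.functional as F

def pad_to_square(img, pad_value):
    # img is a CxHxW tensor; pad the shorter side so that H == W
    c, h, w = img.shape
    dim_diff = np.abs(h - w)
    # split the difference between the two sides of the shorter dimension
    pad1, pad2 = dim_diff // 2, dim_diff - dim_diff // 2
    # F.pad order for a CHW tensor: (left, right, top, bottom)
    pad = (0, 0, pad1, pad2) if h <= w else (pad1, pad2, 0, 0)
    img = F.pad(img, pad, "constant", value=pad_value)
    return img, pad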
        # ---------
        #  Label
        # ---------
        label_path = self.label_files[index % len(self.img_files)].rstrip()

        targets = None
        if os.path.exists(label_path):
            boxes = torch.from_numpy(np.loadtxt(label_path).reshape(-1, 5))
            # Extract coordinates for unpadded + unscaled image
            x1 = w_factor * (boxes[:, 1] - boxes[:, 3] / 2)
            y1 = h_factor * (boxes[:, 2] - boxes[:, 4] / 2)
            x2 = w_factor * (boxes[:, 1] + boxes[:, 3] / 2)
            y2 = h_factor * (boxes[:, 2] + boxes[:, 4] / 2)
            # Adjust for added padding
            x1 += pad[0]
            y1 += pad[2]
            x2 += pad[1]
            y2 += pad[3]
            # Returns (x, y, w, h)
            boxes[:, 1] = ((x1 + x2) / 2) / padded_w
            boxes[:, 2] = ((y1 + y2) / 2) / padded_h
            boxes[:, 3] *= w_factor / padded_w
            boxes[:, 4] *= h_factor / padded_h

            targets = torch.zeros((len(boxes), 6))
            targets[:, 1:] = boxes
np.loadtxt() reads a label line such as [0 0.515 0.5 0.21694873 0.18286777] into an array; the five columns are the class id, the box centre x, the centre y, and the box w and h, all normalized to the original image. The operations that follow are needed because the image has been padded: the boxes must be shifted by the padding to get their true coordinates in the padded image, and they are then re-normalized, which speeds up convergence and reduces numerical error. Why does the target tensor have 6 columns? As the figure below (borrowed from another post) shows for the "after get ID targets" output, the first column records which image in the batch each box belongs to; an image with two boxes has its index repeated twice, and the third image in the example has more than one box. The benefit is that during the later prediction and loss stages each box row can be matched back to exactly one image and one target.
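A quick illustration of the label layout, reusing the example values above (standalone, the printed values are rounded):
import numpy as np
import torch

# one line of a YOLO label file: class, x_center, y_center, w, h (all normalized)
line = "0 0.515 0.5 0.21694873 0.18286777"
boxes = torch.from_numpy(np.array(line.split(), dtype=np.float32).reshape(-1, 5))
targets = torch.zeros((len(boxes), 6))
targets[:, 1:] = boxes  # column 0 stays 0 here; collate_fn later fills it with the image's index in the batch
print(targets)  # tensor([[0.0000, 0.0000, 0.5150, 0.5000, 0.2169, 0.1829]])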
        # Apply augmentations
        if self.augment:
            if np.random.random() < 0.5:
                img, targets = horisontal_flip(img, targets)  # flip the image left-right

        return img_path, img, targets
horisontal_flip (the misspelling is from the repo) mirrors the image left-right as a simple form of data augmentation.
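Its implementation in the repo's utils/augmentations.py is roughly the following: flip the pixels along the width axis and mirror the normalized x centre of every box.
import torch

def horisontal_flip(images, targets):
    # flip the last (width) dimension of the image tensor
    images = torch.flip(images, [-1])
    # mirror the normalized x centre of every box; y, w and h are unchanged
    targets[:, 2] = 1 - targets[:, 2]
    return images, targets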
    def collate_fn(self, batch):
        paths, imgs, targets = list(zip(*batch))
        # Remove empty placeholder targets
        targets = [boxes for boxes in targets if boxes is not None]
        # Add sample index to targets
        for i, boxes in enumerate(targets):
            boxes[:, 0] = i
        targets = torch.cat(targets, 0)
        # Selects new image size every tenth batch
        if self.multiscale and self.batch_count % 10 == 0:
            self.img_size = random.choice(range(self.min_size, self.max_size + 1, 32))
        # Resize images to input shape
        imgs = torch.stack([resize(img, self.img_size) for img in imgs])
        self.batch_count += 1
        return paths, imgs, targets
collate_fn stacks the padded images into a batch and, every ten batches, re-samples the input resolution from {320, 352, ..., 512}; in the author's run this produced batches such as torch.Size([1, 3, 320, 320]), torch.Size([1, 3, 384, 384]) and torch.Size([1, 3, 480, 480]).
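The resize helper used above (from utils/datasets.py) is essentially a nearest-neighbour interpolation to the chosen square size:
import torch.nn.functional as F

def resize(image, size):
    # add a batch dimension, interpolate to (size, size), then drop the batch dimension again
    image = F.interpolate(image.unsqueeze(0), size=size, mode="nearest").squeeze(0)
    return image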
2. Training
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
os.makedirs("output", exist_ok=True)
os.makedirs("checkpoints", exist_ok=True)
# Get data configuration
data_config = parse_data_config(opt.data_config)
train_path = data_config["train"]
valid_path = data_config["valid"]
class_names = load_classes(data_config["names"])
# Initiate model
model = Darknet(opt.model_def).to(device)
model.apply(weights_init_normal)
# If specified we start from checkpoint
if opt.pretrained_weights:
    if opt.pretrained_weights.endswith(".pth"):
        model.load_state_dict(torch.load(opt.pretrained_weights))
    else:
        model.load_darknet_weights(opt.pretrained_weights)
parse_data_config reads the key/value configuration in coco.data, which includes the following entries.
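The repo's config/coco.data contains entries roughly like these (the exact paths depend on where the dataset lists were generated):
classes=80
train=data/coco/trainvalno5k.txt
valid=data/coco/5k.txt
names=data/coco.names
backup=backup/
eval=coco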
coco.names lists the class names of the COCO dataset. The code then resolves the training and validation file lists and optionally loads pretrained weights. Pretrained weights are especially useful when training on a small dataset: they give the network a good starting point, make it less likely to get stuck in a poor local minimum, and speed up convergence. Below is the YOLOv3 model; its backbone borrows the residual-connection idea from ResNet.
The input image is 416x416x3. After passing through the DBL (conv + BN + LeakyReLU) blocks and the res1, res2, ... residual stages, the network produces feature maps at three scales. A 13x13 cell maps back to a large region of the original 416x416 image, so the 13x13 scale uses the largest anchors (116,90), (156,198), (373,326) and is responsible for large objects, while the 52x52 scale uses the smallest anchors (10,13), (16,30), (33,23) and detects small objects; the 26x26 scale uses the middle anchors (30,61), (62,45), (59,119). The mapping is similar to the one illustrated below.
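This assignment comes from the mask/anchors entries of the [yolo] blocks in config/yolov3.cfg; the block for the coarsest 13x13 scale looks like this (the 26x26 block uses mask = 3,4,5 and the 52x52 block mask = 0,1,2):
[yolo]
mask = 6,7,8
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes=80
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1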
def create_modules(module_defs):
    """
    Constructs module list of layer blocks from module configuration in module_defs
    """
    hyperparams = module_defs.pop(0)
    output_filters = [int(hyperparams["channels"])]
    module_list = nn.ModuleList()
    for module_i, module_def in enumerate(module_defs):
        modules = nn.Sequential()

        if module_def["type"] == "convolutional":
            try:
                bn = int(module_def["batch_normalize"])
            except KeyError:
                bn = 0
            filters = int(module_def["filters"])
            kernel_size = int(module_def["size"])
            pad = (kernel_size - 1) // 2
            modules.add_module(
                f"conv_{module_i}",
                nn.Conv2d(
                    in_channels=output_filters[-1],
                    out_channels=filters,
                    kernel_size=kernel_size,
                    stride=int(module_def["stride"]),
                    padding=pad,
                    bias=not bn,
                ),
            )
            if bn:
                modules.add_module(f"batch_norm_{module_i}", nn.BatchNorm2d(filters, momentum=0.9, eps=1e-5))
            if module_def["activation"] == "leaky":
                modules.add_module(f"leaky_{module_i}", nn.LeakyReLU(0.1))

        elif module_def["type"] == "maxpool":
            kernel_size = int(module_def["size"])
            stride = int(module_def["stride"])
            if kernel_size == 2 and stride == 1:
                modules.add_module(f"_debug_padding_{module_i}", nn.ZeroPad2d((0, 1, 0, 1)))
            maxpool = nn.MaxPool2d(kernel_size=kernel_size, stride=stride, padding=int((kernel_size - 1) // 2))
            modules.add_module(f"maxpool_{module_i}", maxpool)

        elif module_def["type"] == "upsample":
            upsample = Upsample(scale_factor=int(module_def["stride"]), mode="nearest")
            modules.add_module(f"upsample_{module_i}", upsample)

        elif module_def["type"] == "route":
            layers = [int(x) for x in module_def["layers"].split(",")]
            filters = sum([output_filters[1:][i] for i in layers])
            modules.add_module(f"route_{module_i}", EmptyLayer())

        elif module_def["type"] == "shortcut":
            filters = output_filters[1:][int(module_def["from"])]
            modules.add_module(f"shortcut_{module_i}", EmptyLayer())

        elif module_def["type"] == "yolo":
            anchor_idxs = [int(x) for x in module_def["mask"].split(",")]
            # Extract anchors
            anchors = [int(x) for x in module_def["anchors"].split(",")]
            anchors = [(anchors[i], anchors[i + 1]) for i in range(0, len(anchors), 2)]
            anchors = [anchors[i] for i in anchor_idxs]
            num_classes = int(module_def["classes"])
            img_size = int(hyperparams["height"])
            # Define detection layer
            yolo_layer = YOLOLayer(anchors, num_classes, img_size)
            modules.add_module(f"yolo_{module_i}", yolo_layer)

        # Register module list and number of output filters
        module_list.append(modules)
        output_filters.append(filters)

    return hyperparams, module_list
class Upsample(nn.Module):
    """ nn.Upsample is deprecated """

    def __init__(self, scale_factor, mode="nearest"):
        super(Upsample, self).__init__()
        self.scale_factor = scale_factor
        self.mode = mode

    def forward(self, x):
        x = F.interpolate(x, scale_factor=self.scale_factor, mode=self.mode)
        return x
create_modules builds the Darknet-53 backbone and the detection heads by reading the layer blocks of the model config shown below.
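The cfg file is a plain list of bracketed blocks; an abridged excerpt of config/yolov3.cfg (values shortened, shown only to illustrate the format that the parser turns into module_defs dictionaries):
[net]
width=416
height=416
channels=3
...

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear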
3. The YOLOv3 loss function and anchor handling
class YOLOLayer(nn.Module):
    """Detection layer"""

    def __init__(self, anchors, num_classes, img_dim=416):
        super(YOLOLayer, self).__init__()
        self.anchors = anchors
        self.num_anchors = len(anchors)
        self.num_classes = num_classes
        self.ignore_thres = 0.5
        self.mse_loss = nn.MSELoss()
        self.bce_loss = nn.BCELoss()
        self.obj_scale = 1
        self.noobj_scale = 100
        self.metrics = {}
        self.img_dim = img_dim
        self.grid_size = 0  # grid size

    def compute_grid_offsets(self, grid_size, cuda=True):
        self.grid_size = grid_size
        g = self.grid_size
        FloatTensor = torch.cuda.FloatTensor if cuda else torch.FloatTensor
        self.stride = self.img_dim / self.grid_size
        # Calculate offsets for each grid
        self.grid_x = torch.arange(g).repeat(g, 1).view([1, 1, g, g]).type(FloatTensor)
        self.grid_y = torch.arange(g).repeat(g, 1).t().view([1, 1, g, g]).type(FloatTensor)
        self.scaled_anchors = FloatTensor([(a_w / self.stride, a_h / self.stride) for a_w, a_h in self.anchors])
        self.anchor_w = self.scaled_anchors[:, 0:1].view((1, self.num_anchors, 1, 1))
        self.anchor_h = self.scaled_anchors[:, 1:2].view((1, self.num_anchors, 1, 1))

    def forward(self, x, targets=None, img_dim=None):
        # Tensors for cuda support
        FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
        LongTensor = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor
        ByteTensor = torch.cuda.ByteTensor if x.is_cuda else torch.ByteTensor

        self.img_dim = img_dim
        num_samples = x.size(0)
        grid_size = x.size(2)

        prediction = (
            x.view(num_samples, self.num_anchors, self.num_classes + 5, grid_size, grid_size)
            .permute(0, 1, 3, 4, 2)
            .contiguous()
        )

        # Get outputs
        x = torch.sigmoid(prediction[..., 0])  # Center x
        y = torch.sigmoid(prediction[..., 1])  # Center y
        w = prediction[..., 2]  # Width
        h = prediction[..., 3]  # Height
        pred_conf = torch.sigmoid(prediction[..., 4])  # Conf
        pred_cls = torch.sigmoid(prediction[..., 5:])  # Cls pred.

        # If grid size does not match current we compute new offsets
        if grid_size != self.grid_size:
            self.compute_grid_offsets(grid_size, cuda=x.is_cuda)

        # Add offset and scale with anchors
        pred_boxes = FloatTensor(prediction[..., :4].shape)
        pred_boxes[..., 0] = x.data + self.grid_x
        pred_boxes[..., 1] = y.data + self.grid_y
        pred_boxes[..., 2] = torch.exp(w.data) * self.anchor_w
        pred_boxes[..., 3] = torch.exp(h.data) * self.anchor_h

        output = torch.cat(
            (
                pred_boxes.view(num_samples, -1, 4) * self.stride,
                pred_conf.view(num_samples, -1, 1),
                pred_cls.view(num_samples, -1, self.num_classes),
            ),
            -1,
        )

        if targets is None:
            return output, 0
        else:
            iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf = build_targets(
                pred_boxes=pred_boxes,
                pred_cls=pred_cls,
                target=targets,
                anchors=self.scaled_anchors,
                ignore_thres=self.ignore_thres,
            )

            # Loss : Mask outputs to ignore non-existing objects (except with conf. loss)
            loss_x = self.mse_loss(x[obj_mask], tx[obj_mask])  # with a single target, tx[obj_mask] is torch.Size([1])
            loss_y = self.mse_loss(y[obj_mask], ty[obj_mask])
            loss_w = self.mse_loss(w[obj_mask], tw[obj_mask])
            loss_h = self.mse_loss(h[obj_mask], th[obj_mask])
            loss_conf_obj = self.bce_loss(pred_conf[obj_mask], tconf[obj_mask])
            loss_conf_noobj = self.bce_loss(pred_conf[noobj_mask], tconf[noobj_mask])
            loss_conf = self.obj_scale * loss_conf_obj + self.noobj_scale * loss_conf_noobj
            loss_cls = self.bce_loss(pred_cls[obj_mask], tcls[obj_mask])
            total_loss = loss_x + loss_y + loss_w + loss_h + loss_conf + loss_cls

            # Metrics
            cls_acc = 100 * class_mask[obj_mask].mean()
            conf_obj = pred_conf[obj_mask].mean()
            conf_noobj = pred_conf[noobj_mask].mean()
            conf50 = (pred_conf > 0.5).float()
            iou50 = (iou_scores > 0.5).float()
            iou75 = (iou_scores > 0.75).float()
            detected_mask = conf50 * class_mask * tconf
            precision = torch.sum(iou50 * detected_mask) / (conf50.sum() + 1e-16)
            recall50 = torch.sum(iou50 * detected_mask) / (obj_mask.sum() + 1e-16)
            recall75 = torch.sum(iou75 * detected_mask) / (obj_mask.sum() + 1e-16)

            self.metrics = {
                "loss": to_cpu(total_loss).item(),
                "x": to_cpu(loss_x).item(),
                "y": to_cpu(loss_y).item(),
                "w": to_cpu(loss_w).item(),
                "h": to_cpu(loss_h).item(),
                "conf": to_cpu(loss_conf).item(),
                "cls": to_cpu(loss_cls).item(),
                "cls_acc": to_cpu(cls_acc).item(),
                "recall50": to_cpu(recall50).item(),
                "recall75": to_cpu(recall75).item(),
                "precision": to_cpu(precision).item(),
                "conf_obj": to_cpu(conf_obj).item(),
                "conf_noobj": to_cpu(conf_noobj).item(),
                "grid_size": grid_size,
            }

            return output, total_loss
prediction = (
    x.view(num_samples, self.num_anchors, self.num_classes + 5, grid_size, grid_size)
    .permute(0, 1, 3, 4, 2)
    .contiguous()
)
This view/permute unpacks the raw feature map at each of the three output scales: num_samples is the batch size, self.num_anchors is 3 (each grid cell predicts three anchors), self.num_classes + 5 covers the class scores plus the four box offsets and one objectness confidence, and grid_size is the spatial width/height of the feature map.
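For example, at the 13x13 scale with the 80 COCO classes the raw feature map has 3 x (80 + 5) = 255 channels, and the reshape produces one 85-dimensional prediction vector per anchor per cell (a standalone shape check):
import torch

num_samples, num_anchors, num_classes, grid_size = 1, 3, 80, 13
x = torch.randn(num_samples, num_anchors * (num_classes + 5), grid_size, grid_size)  # [1, 255, 13, 13]
prediction = (
    x.view(num_samples, num_anchors, num_classes + 5, grid_size, grid_size)
    .permute(0, 1, 3, 4, 2)
    .contiguous()
)
print(prediction.shape)  # torch.Size([1, 3, 13, 13, 85])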
The lines framed in red in the screenshot below (the pred_boxes assignments above) are where the paper's box-decoding formulas are computed.
This is the core of YOLO. We want the predicted box to overlap the ground-truth box as much as possible, and for that we only need to adjust the predicted box's centre and its width/height, so what the network actually regresses are the four offsets tx, ty, tw, th. torch.arange(g).repeat(g, 1) builds the grid coordinates that split the feature map into 13x13, 26x26 or 52x52 cells, each of side length 1: grid_x holds the x coordinate of every cell, and its transpose (.t()) gives grid_y for the y direction, so we know exactly which cell a prediction falls into. Why the sigmoid? Since each cell has side length 1, sigmoid(tx) squashes the predicted offset into [0, 1], keeping the centre inside its cell; adding Cx (grid_x) then gives the coordinate relative to the top-left corner of the image. tw and th are not passed through a sigmoid; instead e^tw and e^th are used, where pw and ph are the anchor-box width and height, so e^tw is the ratio by which the anchor must be scaled to match the width of the ground-truth box, i.e. how much we stretch or shrink the anchor to fit the real object.
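Written out, the decoding performed by the pred_boxes lines is the same as in the YOLOv3 paper:
b_x = \sigma(t_x) + c_x
b_y = \sigma(t_y) + c_y
b_w = p_w \, e^{t_w}
b_h = p_h \, e^{t_h}
where (c_x, c_y) is the top-left corner of the grid cell containing the centre and (p_w, p_h) is the anchor size; in the code the anchors are already divided by the stride (scaled_anchors), so everything is expressed in grid units and only multiplied back by stride when assembling output.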
def build_targets(pred_boxes, pred_cls, target, anchors, ignore_thres):
    ByteTensor = torch.cuda.ByteTensor if pred_boxes.is_cuda else torch.ByteTensor
    FloatTensor = torch.cuda.FloatTensor if pred_boxes.is_cuda else torch.FloatTensor

    nB = pred_boxes.size(0)   # batch size
    nA = pred_boxes.size(1)   # number of anchors per cell (3)
    nC = pred_cls.size(-1)    # number of classes (80 for COCO)
    nG = pred_boxes.size(2)   # grid size

    # Output tensors
    obj_mask = ByteTensor(nB, nA, nG, nG).fill_(0)
    noobj_mask = ByteTensor(nB, nA, nG, nG).fill_(1)
    class_mask = FloatTensor(nB, nA, nG, nG).fill_(0)
    iou_scores = FloatTensor(nB, nA, nG, nG).fill_(0)
    tx = FloatTensor(nB, nA, nG, nG).fill_(0)
    ty = FloatTensor(nB, nA, nG, nG).fill_(0)
    tw = FloatTensor(nB, nA, nG, nG).fill_(0)
    th = FloatTensor(nB, nA, nG, nG).fill_(0)
    tcls = FloatTensor(nB, nA, nG, nG, nC).fill_(0)

    # Convert to position relative to box
    target_boxes = target[:, 2:6] * nG  # rescale: targets were normalized, so multiply by the grid size
    gxy = target_boxes[:, :2]           # torch.Size([num_targets, 2])
    gwh = target_boxes[:, 2:]
    # Get anchors with best iou
    ious = torch.stack([bbox_wh_iou(anchor, gwh) for anchor in anchors])  # anchors: torch.Size([3, 2])
    best_ious, best_n = ious.max(0)
    # Separate target values
    b, target_labels = target[:, :2].long().t()
    gx, gy = gxy.t()
    gw, gh = gwh.t()
    gi, gj = gxy.long().t()  # grid cell indices (the cell's top-left corner)
    # Set masks
    obj_mask[b, best_n, gj, gi] = 1    # the best-matching anchor at this cell is responsible for the object
    noobj_mask[b, best_n, gj, gi] = 0  # exclude that anchor from the no-object confidence loss
    # Set noobj mask to zero where iou exceeds ignore threshold
    for i, anchor_ious in enumerate(ious.t()):
        noobj_mask[b[i], anchor_ious > ignore_thres, gj[i], gi[i]] = 0

    # Coordinates
    tx[b, best_n, gj, gi] = gx - gx.floor()
    ty[b, best_n, gj, gi] = gy - gy.floor()
    # Width and height
    tw[b, best_n, gj, gi] = torch.log(gw / anchors[best_n][:, 0] + 1e-16)
    th[b, best_n, gj, gi] = torch.log(gh / anchors[best_n][:, 1] + 1e-16)
    # One-hot encoding of label
    tcls[b, best_n, gj, gi, target_labels] = 1
    # Compute label correctness and iou at best anchor
    class_mask[b, best_n, gj, gi] = (pred_cls[b, best_n, gj, gi].argmax(-1) == target_labels).float()
    iou_scores[b, best_n, gj, gi] = bbox_iou(pred_boxes[b, best_n, gj, gi], target_boxes, x1y1x2y2=False)

    tconf = obj_mask.float()
    return iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf
The tensors allocated with shape (nB, nA, nG, nG) follow the same grid idea as before: they record, per image in the batch, per anchor and per cell, where the targets fall, the per-cell confidence, and the IoU of each of the three anchors generated at that cell. target_boxes = target[:, 2:6] * nG is needed because target holds normalized coordinates, so recovering the position on the current feature map means multiplying by its width/height nG.
ious = torch.stack([bbox_wh_iou(anchor, gwh) for anchor in anchors])  # anchors: torch.Size([3, 2])
best_ious, best_n = ious.max(0)
This computes the IoU (intersection divided by union) between each of the three anchors' width/height and each ground-truth box. best_ious is the highest IoU achieved for each ground-truth box, and best_n is the index (0, 1 or 2) of the anchor that achieved it.
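bbox_wh_iou ignores the centre positions and compares only widths and heights, as if both boxes were aligned at the same corner, which is exactly what we want when picking an anchor shape; the repo's implementation is roughly:
import torch

def bbox_wh_iou(wh1, wh2):
    # wh1: (w, h) of one anchor; wh2: (N, 2) widths/heights of the ground-truth boxes
    wh2 = wh2.t()
    w1, h1 = wh1[0], wh1[1]
    w2, h2 = wh2[0], wh2[1]
    inter_area = torch.min(w1, w2) * torch.min(h1, h2)
    union_area = (w1 * h1 + 1e-16) + w2 * h2 - inter_area
    return inter_area / union_area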
obj_mask marks which cell (and which anchor) of the feature map each object falls into. It is initialised to 0, and wherever an object's centre lands in a cell, the entry for the best-matching anchor at that cell is set to 1, hence the line obj_mask[b, best_n, gj, gi] = 1 above.
Similarly, noobj_mask starts at 1 everywhere and is set to 0 not only at the responsible anchor but also wherever an anchor's IoU with a ground truth exceeds ignore_thres, so those near-miss predictions are excluded from the no-object confidence loss.
The remaining lines fill in, for the responsible anchor of each ground truth, the regression targets (offsets), the confidence target and the IoU score.
Why tx[b, best_n, gj, gi] = gx - gx.floor()? Recall that each cell has side length 1, so gx - gx.floor() is the fractional part of the centre coordinate within its cell, exactly the quantity that the network predicts as sigmoid(tx); computing the same offset from the ground truth gives the value against which the loss is taken. class_mask, iou_scores and tcls likewise store, at the position of the best anchor, whether the predicted class is correct, the IoU with the ground truth, and the one-hot class label.
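A tiny made-up example: if a ground-truth centre lands at gx = 7.3, gy = 5.8 on a 13x13 grid, the responsible cell is (gi, gj) = (7, 5) and the regression targets are the fractional parts:
import torch

gx, gy = torch.tensor([7.3]), torch.tensor([5.8])
gi, gj = gx.long(), gy.long()                 # cell indices: 7, 5
tx, ty = gx - gx.floor(), gy - gy.floor()     # offsets the network must predict via sigmoid: 0.3, 0.8
print(gi.item(), gj.item(), round(tx.item(), 1), round(ty.item(), 1))  # 7 5 0.3 0.8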
Finally, the YOLOv3 loss. It consists of a localisation loss, a confidence (objectness) loss and a classification loss. Its composition in this implementation is as follows.
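As assembled in YOLOLayer above (MSE for the box offsets, BCE for confidence and class):
total_loss = loss_x + loss_y + loss_w + loss_h   # MSE on the offsets, only at object positions
           + obj_scale * loss_conf_obj           # BCE of pred_conf against 1 where an object exists (obj_scale = 1)
           + noobj_scale * loss_conf_noobj       # BCE of pred_conf against 0 elsewhere (noobj_scale = 100)
           + loss_cls                            # BCE over the per-class sigmoid outputs at object positions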
4. Summary
Working through the YOLOv3 code gives a much deeper picture of its design; in every respect it is very cleverly put together. Building on it, we can tackle extensions such as object tracking, text detection or licence-plate detection, or add further data augmentation and other modifications so that it performs well in more scenarios.