[Innovation Training, Week 4] An Incomplete CTPN Wrap-Up Post 2019.4.11

Article link: https://blog.youkuaiyun.com/u013575592/article/details/89219406

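This post records the author's process of training a CTPN model during the innovation training course, from model design and data preprocessing to a demonstration of the results. The model still has gaps, such as missing text-box merging and a reduced set of branches after the fully connected layer, but it can already roughly pick out text in images. Next, the author will move on to developing the app and the backend.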


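This week's progress

After two weeks of painful debugging, and countless failures even with the regression step left out, today I finally trained my first presentable CTPN model. This post is the wrap-up of my CTPN journey, even though only the classification branch remains after the fully connected layer and text-box merging is missing altogether.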

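Detailed work

① Model design

First, the input image passes through VGG16, which shrinks its height and width to 1/16 of the original and yields the feature map. One feature-map pixel therefore corresponds to 16*16 pixels of the original image, which is also why the anchor width is fixed at 16.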

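Next, every feature-map pixel is concatenated with its neighbors, nine pixels in total including itself. If each pixel has c channels, the result is a single pixel with 9c channels. In practice, a 3*3 convolution can stand in for this step.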
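
To make the 9c concatenation concrete, here is a minimal NumPy sketch. The helper name concat_3x3_neighbors and the zero padding at the border are assumptions for illustration, not part of the original code. A 3*3 convolution with 9c output filters is a learned linear map over the same window, which is why it can stand in for this explicit stacking.

```python
import numpy as np


def concat_3x3_neighbors(fmap):
    """Stack each pixel's 3x3 neighborhood: (H, W, C) -> (H, W, 9*C).

    Hypothetical helper; the border is zero-padded (an assumption).
    """
    h, w, _ = fmap.shape
    padded = np.pad(fmap, ((1, 1), (1, 1), (0, 0)), mode="constant")
    # Nine shifted views of the padded map, one per neighbor position.
    patches = [padded[dy:dy + h, dx:dx + w, :]
               for dy in range(3) for dx in range(3)]
    return np.concatenate(patches, axis=-1)


fmap = np.arange(4 * 5 * 2, dtype=np.float32).reshape(4, 5, 2)
stacked = concat_3x3_neighbors(fmap)
print(stacked.shape)
```

The fifth group of channels (indices 4c to 5c) is the center pixel itself, so the original feature map is still present inside the stacked one.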

The pixels of the new feature map are then fed row by row into a bidirectional LSTM to capture the horizontal sequence relationship between anchors.
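
One way to read "row by row" is that each feature-map row becomes its own sequence. The NumPy sketch below shows that reshaping under assumed toy dimensions; it is an illustration only. Note that the model code in this post uses Reshape((-1, 512 * 9)) instead, which joins all rows of an image into one long sequence.

```python
import numpy as np

# Toy dimensions, chosen only for illustration.
batch, conv_h, conv_w, channels = 2, 3, 4, 5
fmap = np.arange(batch * conv_h * conv_w * channels).reshape(
    batch, conv_h, conv_w, channels)

# Fold the rows into the batch axis: one sequence of conv_w steps per row.
rows = fmap.reshape(batch * conv_h, conv_w, channels)

# After the recurrent layer, the rows can be folded back into the grid.
restored = rows.reshape(batch, conv_h, conv_w, channels)

print(rows.shape, restored.shape)
```

With this layout the recurrent layer never carries state from the end of one row into the start of the next.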

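Each feature-map pixel is fed into a fully connected layer, which then outputs 2k scores (the only branch I ended up implementing), 2k localization values, and k side-refinement offsets.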

import tensorflow as tf
from tensorflow import keras


# VGG16 with the fully connected layers removed
def vgg16_no_tail():
    # include_top must be False; otherwise input_shape
    # defaults to 224*224 and other input sizes fail
    vgg = keras.applications.VGG16(include_top=False)
    vgg_no_tail = keras.Model(
        inputs=vgg.input,
        outputs=vgg.get_layer("block5_conv3").output)

    return vgg_no_tail


# Build the training model
def ctpn_model(h=600, w=900, k=10, anchor_size=16):
    conv_h = h // anchor_size
    conv_w = w // anchor_size
    input_layer = vgg16_no_tail()
    layer = input_layer.output

    # A 3*3 convolution stands in for concatenating the 9 neighbors
    layer = keras.layers.Convolution2D(
        512 * 9, (3, 3),
        activation='relu',
        padding='same',
        name='cnn2rnn')(layer)

    # Flatten the grid into a sequence so the LSTM can relate neighboring pixels
    layer = keras.layers.Reshape((-1, 512 * 9))(layer)

    # bi-lstm
    layer = keras.layers.Bidirectional(
        keras.layers.LSTM(128, return_sequences=True))(layer)

    # Restore the grid shape (2 * 128 channels from the two directions)
    layer = keras.layers.Reshape((conv_h, conv_w, 256))(layer)

    # FC
    layer = keras.layers.Convolution2D(512, (1, 1), activation='relu')(layer)

    # score
    sc_layer = keras.layers.Convolution2D(2 * k, (1, 1), activation='relu')(layer)

    # Pair up the last dimension: (text, background) for each of the k anchors
    sc_layer = keras.layers.Reshape((conv_h, conv_w, k, 2))(sc_layer)

    # Softmax over each pair so the positive and negative scores sum to 1
    sc_layer = keras.layers.Softmax()(sc_layer)

    model = keras.Model(inputs=input_layer.input,
                        outputs=sc_layer)
    return model

Shape of the final output tensor: [batch_size, conv_h, conv_w, anchor_count, 2].

With the default (600, 900) input image and 10 anchors of different heights per feature-map pixel, the output shape is [batch_size, 37, 56, 10, 2].
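
The 37 and 56 can be checked by hand. The sketch below assumes the feature map comes from VGG16's block5_conv3, which sits after four 2*2 poolings that each floor odd sizes; the helper name feature_map_size is made up for this example.

```python
def feature_map_size(h, w, num_poolings=4):
    """Spatial size after repeated 2x2, stride-2 pooling (odd sizes are floored)."""
    for _ in range(num_poolings):
        h, w = h // 2, w // 2
    return h, w


conv_h, conv_w = feature_map_size(600, 900)
k = 10  # anchors per feature-map pixel
anchors_per_image = conv_h * conv_w * k

print((conv_h, conv_w, k, 2), anchors_per_image)
```

For this input size the result agrees with the h // 16 and w // 16 used in ctpn_model, which is what lets the final Reshape succeed.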

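Next comes the loss function; only the score branch is covered here. y_true and y_pred both have the shape given above, and the loss is cross-entropy. Note, however, that each image yields 37 * 56 * 10 = 20,720 anchors, while at most a few hundred of them contain text, so positive and negative samples are severely imbalanced. Feeding y_true and y_pred straight into binary_crossentropy may leave the model predicting nothing at all. My approach is therefore to compute the loss for positive and negative samples separately.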

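def ctpn_loss_only_score(y_true, y_pred):
    # Zero out every prediction whose label is 0, so each channel below
    # collects (almost) no loss from anchors of the other class
    y_pred = tf.multiply(y_true, y_pred)
    loss = keras.losses.binary_crossentropy

    y_true = tf.reshape(y_true, (-1, 2))
    y_pred = tf.reshape(y_pred, (-1, 2))
    y_true_pos = y_true[:, 0]
    y_true_neg = y_true[:, 1]
    y_pred_pos = y_pred[:, 0]
    y_pred_neg = y_pred[:, 1]

    # The +1 avoids division by zero when a class is absent
    pos_sum = tf.reduce_sum(y_true_pos) + 1
    neg_sum = tf.reduce_sum(y_true_neg) + 1
    total = pos_sum + neg_sum

    # Rescale each class's loss by total / class count so that
    # the rare positive class is not drowned out
    return total * loss(y_true_pos, y_pred_pos) / pos_sum + \
           total * loss(y_true_neg, y_pred_neg) / neg_sum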
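
To see why the classes are averaged separately, here is a small pure-Python sketch. It is a simplified stand-in rather than the Keras loss above: the function names are made up, and it omits the +1 smoothing. It compares plain cross-entropy with a per-class average on one text anchor among 99 background anchors, for a model that predicts "background" almost everywhere.

```python
import math


def plain_bce(labels, pos_probs, eps=1e-7):
    """Mean binary cross-entropy over all anchors."""
    losses = [-math.log(max(p, eps)) if y == 1 else -math.log(max(1.0 - p, eps))
              for y, p in zip(labels, pos_probs)]
    return sum(losses) / len(losses)


def balanced_bce(labels, pos_probs, eps=1e-7):
    """Average the loss inside each class first, then add the two averages."""
    pos = [-math.log(max(p, eps)) for y, p in zip(labels, pos_probs) if y == 1]
    neg = [-math.log(max(1.0 - p, eps)) for y, p in zip(labels, pos_probs) if y == 0]
    return sum(pos) / max(len(pos), 1) + sum(neg) / max(len(neg), 1)


labels = [1] + [0] * 99      # one text anchor, 99 background anchors
pos_probs = [0.01] * 100     # the model says "background" everywhere

plain = plain_bce(labels, pos_probs)
balanced = balanced_bce(labels, pos_probs)
print(plain, balanced)
```

The plain average stays small because the single missed text anchor is diluted by 99 easy negatives, while the per-class average keeps that miss at full weight.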

Now train the model.

import datetime


def ctpn_model_run():
    model = ctpn_model()

    # Plain gradient descent is fairly stable; with Adam, which I used at first,
    # the loss kept climbing and never converged
    model.compile(optimizer=tf.train.GradientDescentOptimizer(0.001),
                  loss=ctpn_loss_only_score,
                  metrics=['accuracy'])

    train_x, train_y, test_x, test_y = load_data()
    # x: input images as a numpy array, [n, 600, 900, 3]
    # y: positive/negative scores of the 10 anchors per feature-map pixel, [n, 37, 56, 10, 2]
    # train and test y must share the same format, or an error is raised

    timestamp = datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
    model.fit(train_x, train_y, batch_size=4, epochs=50,
              validation_data=(test_x, test_y), callbacks=[
            keras.callbacks.ModelCheckpoint(
                "./model/model_real_only_score_" + timestamp + "_{epoch:02d}-{val_loss:.2f}.hdf5",
                monitor='val_loss', verbose=1,
                save_best_only=True, period=1),
            keras.callbacks.TensorBoard("./model/logs_real_only_score_" + timestamp,
                                        batch_size=4)
        ])

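② Dataset preprocessing

Input images must be resized to a fixed size. I usually use (600, 900), so the labels have the format [batch_size, 37, 56, 10, 2].

What follows is lifted straight from my week 2 post.

First, following this post, slice each box into anchors of equal 16-pixel width.

At this point, the anchors come out as a list of (x_position, y, h) tuples.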
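
As a rough illustration of this slicing step, here is a sketch for an axis-aligned box. It is not the generate_gt_anchor function from the week 2 post. The name split_box_into_anchors is hypothetical, and it assumes that x_position is the 16-pixel column index, y the vertical center, and h the height.

```python
def split_box_into_anchors(x_min, y_min, x_max, y_max, anchor_width=16):
    """Slice an axis-aligned box into fixed-width columns.

    Returns [(x_position, y, h), ...] where x_position is the column index
    (an assumption), y the vertical center, and h the box height.
    """
    y_center = (y_min + y_max) / 2
    height = y_max - y_min
    first_col = x_min // anchor_width
    last_col = (x_max - 1) // anchor_width  # x_max is treated as exclusive
    return [(col, y_center, height) for col in range(first_col, last_col + 1)]


gt_anchors = split_box_into_anchors(30, 100, 70, 140)
print(gt_anchors)
```

A box spanning x from 30 to 70 touches columns 1 through 4, so it yields four anchors that all share the same center and height.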

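However, these need to be converted into the same layout as the model output, [batch, h, w, k=10, 4], where the "4" holds the text score, the background score, the vertical coordinate y, and the height h. Every 16*16-pixel cell must generate 10 anchors, with heights [11, 16, 23, 33, 46, 66, 94, 134, 191, 273]. Among these, an anchor counts as a text region only if it shares a column with one of the ground-truth anchors found above and its area IoU with that anchor exceeds 0.7.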

def overlap_anchors(img, box, anchor_width=16):
    iou_threshold = 0.7
    anchor_sizes = [11, 16, 23, 33, 46, 66, 94, 134, 191, 273]
    anchors = generate_gt_anchor(img, box, anchor_width)
    # Map each column to the (y, h) of its ground-truth anchor
    anchors = {x[0]: (x[1], x[2]) for x in anchors}
    total_anchors = []
    for h in range(img.shape[0] // anchor_width):
        curH = []
        total_anchors.append(curH)
        for w in range(img.shape[1] // anchor_width):
            curW = []
            curH.append(curW)
            for k in range(len(anchor_sizes)):
                # Each label is [text score, background score, y, h]
                if w not in anchors:
                    curW.append([0, 1, 0, 1])
                else:
                    cy, ch = anchors[w]
                    ty, th = h * anchor_width + anchor_width / 2, anchor_sizes[k]
                    if iou(cy, ch, ty, th) > iou_threshold:
                        curW.append([1, 0, ty, th])
                    else:
                        curW.append([0, 1, 0, 0])
    return total_anchors


def iou(y1, h1, y2, h2):
    # Vertical IoU of two anchors given as (center y, height);
    # b is the lower end of each span and u the upper end
    b1, u1 = y1 - h1 / 2, y1 + h1 / 2
    b2, u2 = y2 - h2 / 2, y2 + h2 / 2
    # Make span 1 the one with the larger upper end
    if u2 > u1:
        b1, u1, b2, u2 = b2, u2, b1, u1
    if b1 >= u2:
        return 0
    else:
        if b2 > b1:
            return (u2 - b2) / (u1 - b1)
        else:
            return (u2 - b1) / (u1 - b2)
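
Since every anchor has the same 16-pixel width, the area IoU reduces to a 1-D IoU over the vertical extents. The sketch below works through one case with a compact equivalent of iou. The ground-truth values and the name vertical_iou are made up for illustration.

```python
def vertical_iou(y1, h1, y2, h2):
    """IoU of two vertical spans, each given as (center, height)."""
    upper = min(y1 + h1 / 2, y2 + h2 / 2)
    lower = max(y1 - h1 / 2, y2 - h2 / 2)
    inter = max(0.0, upper - lower)
    union = h1 + h2 - inter
    return inter / union


ANCHOR_SIZES = [11, 16, 23, 33, 46, 66, 94, 134, 191, 273]

gt_y, gt_h = 120.0, 40.0   # hypothetical ground-truth anchor: center 120, height 40
cell_y = 7 * 16 + 16 / 2   # center of feature-map row 7, as computed in overlap_anchors

matches = [h for h in ANCHOR_SIZES
           if vertical_iou(gt_y, gt_h, cell_y, h) > 0.7]
print(matches)
```

In this case only the heights closest to the ground-truth height of 40 clear the 0.7 threshold, so most of the 10 anchors in a cell end up labeled as background even when the cell sits on text.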

The final output: