A simple image caption + PyTorch demo (tested and working)

This post walks through an image captioning project implemented in PyTorch, covering vocabulary construction, image preprocessing, training, and testing, and discusses its gaps, such as the missing evaluation code and a questionable choice of recurrent unit.


Search on GitHub:

https://github.com/search?utf8=%E2%9C%93&q=image+caption+pytorch&type=Repositories

1. Source code: https://github.com/jinfagang/pytorch_image_caption

Repository layout: build_vocab builds the vocabulary; resize resizes all images; train runs training; sample runs testing.
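The build_vocab step is typically a frequency-thresholded vocabulary with <pad>/<start>/<end>/<unk> specials. A minimal sketch of that idea (names are illustrative, not the repo's exact API):

from collections import Counter

class Vocabulary:
    """Maps words to indices and back."""
    def __init__(self):
        self.word2idx, self.idx2word = {}, {}

    def add_word(self, word):
        if word not in self.word2idx:
            idx = len(self.word2idx)
            self.word2idx[word] = idx
            self.idx2word[idx] = word

    def __call__(self, word):
        # unknown words fall back to <unk>
        return self.word2idx.get(word, self.word2idx['<unk>'])

    def __len__(self):
        return len(self.word2idx)

def build_vocab(tokenized_captions, threshold=4):
    """Keep words that appear at least `threshold` times."""
    counter = Counter(w for cap in tokenized_captions for w in cap)
    vocab = Vocabulary()
    for special in ['<pad>', '<start>', '<end>', '<unk>']:
        vocab.add_word(special)
    for word, count in counter.items():
        if count >= threshold:
            vocab.add_word(word)
    return vocab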

What's missing:

1.1. No caption evaluation code. (Standard evaluation code computes several common metrics, including BLEU, METEOR, ROUGE-L, and CIDEr; the write-up below contains references and descriptions of each metric.) A lightweight stand-in is sketched below.
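Full COCO-style scoring (METEOR, ROUGE-L, CIDEr) needs the coco-caption toolkit, but as a lightweight stand-in for the missing code, corpus BLEU can be computed with NLTK (a sketch; the captions shown are made up for illustration):

# Lightweight stand-in for the missing evaluation code: corpus BLEU via NLTK.
# (METEOR, ROUGE-L and CIDEr require the coco-caption toolkit; this covers BLEU only.)
from nltk.translate.bleu_score import corpus_bleu

# references[i] is a list of tokenized ground-truth captions for image i;
# hypotheses[i] is the tokenized caption the model generated for image i.
references = [[['a', 'dog', 'runs', 'on', 'grass'],
               ['a', 'dog', 'running', 'in', 'a', 'field']]]
hypotheses = [['a', 'dog', 'runs', 'in', 'the', 'grass']]

bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(f'BLEU-4: {bleu4:.4f}')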

1.2. In the val/test-stage code, the author calls nn.LSTM once per decoding step; nn.LSTMCell is the module meant for single-step use. The original code is below.

(The logic: the features computed by the encoder CNN serve as the first input; the loop then runs hiddens, states = self.lstm(inputs, states), where hiddens is the output, whose argmax becomes the next input, and states is the (h, c) pair carried across iterations.)
When writing your own decoder, you can instead use the encoder CNN features as h0 and feed the <start> token as the first input (as with a GRU); a sketch of that variant follows the original code.
    def sample(self, features, states):
        """Samples captions for given image features (greedy search)."""
        sampled_ids = []
        inputs = features.unsqueeze(1)                   # features: (1, 256) -> inputs: (1, 1, 256) = (batch, 1, embed_size)
        for i in range(20):                              # maximum sampling length
            hiddens, states = self.lstm(inputs, states)  # (batch_size, 1, hidden_size)
            outputs = self.linear(hiddens.squeeze(1))    # (batch_size, vocab_size)
            predicted = outputs.max(1)[1]
            sampled_ids.append(predicted)
            inputs = self.embed(predicted)               # bug: this is 2-D; see 1.5 below
        sampled_ids = torch.cat(sampled_ids, 1)          # (batch_size, 20)
        return sampled_ids.squeeze()
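A minimal sketch of the LSTMCell variant suggested above: the encoder CNN features seed h0 and the <start> token embedding is the first input. It assumes the feature dimension equals hidden_size; all names here are illustrative, not the repo's API:

import torch
import torch.nn as nn

class DecoderCell(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.cell = nn.LSTMCell(embed_size, hidden_size)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def sample(self, features, start_idx, max_len=20):
        """Greedy decoding, one step at a time with LSTMCell."""
        batch = features.size(0)
        h = features                                      # (batch, hidden) -- CNN features as h0
        c = torch.zeros_like(features)                    # c0 starts at zero
        inputs = self.embed(features.new_full((batch,), start_idx, dtype=torch.long))
        sampled_ids = []
        for _ in range(max_len):
            h, c = self.cell(inputs, (h, c))              # (batch, hidden)
            predicted = self.linear(h).argmax(1)          # (batch,)
            sampled_ids.append(predicted)
            inputs = self.embed(predicted)                # 2-D input is fine for LSTMCell
        return torch.stack(sampled_ids, 1)                # (batch, max_len)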

Worth learning:

1.3. Training-stage code:

(The logic: concatenate the features computed by the encoder CNN, of shape (batch, 1, ninp), with the golden captions, of shape (batch, len, ninp), along dimension 1, then pack the result and feed it straight into the LSTM to get the final output. About pack_padded_sequence: you can also use it to pack the labels, then compute the loss from the RNN output and the packed labels; a sketch follows the code below. For background on pack_padded_sequence, see https://zhuanlan.zhihu.com/p/34418001.)

    def forward(self, features, captions, lengths):
        """Decode image feature vectors and generate captions."""
        embeddings = self.embed(captions)                               # (batch, max_len, embed_size)
        embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)  # prepend image features as step 0
        packed = pack_padded_sequence(embeddings, lengths, batch_first=True)
        hiddens, _ = self.lstm(packed)
        outputs = self.linear(hiddens[0])                               # hiddens[0] is the packed data tensor
        return outputs
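For the loss, the labels are packed the same way so the flattened decoder outputs line up with the flattened targets. A sketch of the matching training step (decoder, features, captions, and lengths are assumed to come from the data loader; captions sorted by length, descending):

import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

criterion = nn.CrossEntropyLoss()

# captions: (batch, max_len) word indices; lengths: true caption lengths.
targets = pack_padded_sequence(captions, lengths, batch_first=True)[0]  # (sum(lengths),)
outputs = decoder(features, captions, lengths)                          # (sum(lengths), vocab_size)
loss = criterion(outputs, targets)
loss.backward()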

Problems encountered at runtime:

1.4. In the test code, when running on GPU, initialize the state as follows; otherwise a KeyError is raised:

        state = (Variable(torch.zeros(opt.num_layers, 1, opt.nhid)).cuda(),
                 Variable(torch.zeros(opt.num_layers, 1, opt.nhid)).cuda())
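On PyTorch >= 0.4 (where Variable is no longer needed), a device-agnostic alternative avoids hard-coding .cuda() (a sketch, assuming features already lives on the target device):

# Allocate the initial state on whatever device the features are on.
device = features.device
state = (torch.zeros(opt.num_layers, 1, opt.nhid, device=device),
         torch.zeros(opt.num_layers, 1, opt.nhid, device=device))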

1.5. In sample in model.py, change the next-input line as follows; otherwise the second iteration of the for loop raises RuntimeError: input must have 3 dimensions, got 2:

            inputs = self.embed(predicted.unsqueeze(1))
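The reason: with batch_first=True, nn.LSTM expects a 3-D (batch, seq, embed) input, while self.embed(predicted) is only 2-D. A quick check (sizes are illustrative):

import torch
import torch.nn as nn

embed = nn.Embedding(1000, 256)
predicted = torch.tensor([7])               # (batch=1,) indices from outputs.max(1)[1]
print(embed(predicted).shape)               # torch.Size([1, 256])    -> 2-D, LSTM rejects it
print(embed(predicted.unsqueeze(1)).shape)  # torch.Size([1, 1, 256]) -> (batch, seq, embed)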

No other problems found so far.


2. caption evaluation:  

http://blog.youkuaiyun.com/ccbrid/article/details/79639127



