A simple image caption + PyTorch demo (tested and working)

This post walks through an image captioning project implemented in PyTorch, covering vocabulary construction, image preprocessing, training, and testing, and discusses its gaps, such as the missing evaluation code and a questionable choice of recurrent unit.


Search on GitHub:

https://github.com/search?utf8=%E2%9C%93&q=image+caption+pytorch&type=Repositories

1. Source code: https://github.com/jinfagang/pytorch_image_caption

Repository layout: build_vocab builds the vocabulary; resize resizes all images; train runs training; sample runs testing.
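The build_vocab step is typically a frequency-thresholded vocabulary with <pad>/<start>/<end>/<unk> specials. A minimal sketch of that idea (names are illustrative, not the repo's exact API):

from collections import Counter

class Vocabulary:
    """Maps words to indices and back."""
    def __init__(self):
        self.word2idx, self.idx2word = {}, {}

    def add_word(self, word):
        if word not in self.word2idx:
            idx = len(self.word2idx)
            self.word2idx[word] = idx
            self.idx2word[idx] = word

    def __call__(self, word):
        # unknown words fall back to <unk>
        return self.word2idx.get(word, self.word2idx['<unk>'])

    def __len__(self):
        return len(self.word2idx)

def build_vocab(tokenized_captions, threshold=4):
    """Keep words that appear at least `threshold` times."""
    counter = Counter(w for cap in tokenized_captions for w in cap)
    vocab = Vocabulary()
    for special in ['<pad>', '<start>', '<end>', '<unk>']:
        vocab.add_word(special)
    for word, count in counter.items():
        if count >= threshold:
            vocab.add_word(word)
    return vocab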

What's missing:

1.1. No caption evaluation code. (Standard evaluation code computes several common metrics, including BLEU, METEOR, ROUGE-L, and CIDEr; the write-up below contains references and descriptions of each metric.) A lightweight stand-in is sketched below.
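Full COCO-style scoring (METEOR, ROUGE-L, CIDEr) needs the coco-caption toolkit, but as a lightweight stand-in for the missing code, corpus BLEU can be computed with NLTK (a sketch; the captions shown are made up for illustration):

# Lightweight stand-in for the missing evaluation code: corpus BLEU via NLTK.
# (METEOR, ROUGE-L and CIDEr require the coco-caption toolkit; this covers BLEU only.)
from nltk.translate.bleu_score import corpus_bleu

# references[i] is a list of tokenized ground-truth captions for image i;
# hypotheses[i] is the tokenized caption the model generated for image i.
references = [[['a', 'dog', 'runs', 'on', 'grass'],
               ['a', 'dog', 'running', 'in', 'a', 'field']]]
hypotheses = [['a', 'dog', 'runs', 'in', 'the', 'grass']]

bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(f'BLEU-4: {bleu4:.4f}')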

1.2. In the val/test-stage code, the author calls nn.LSTM once per decoding step; nn.LSTMCell is the module meant for single-step use. The original code is below.

(The logic: the features computed by the encoder CNN serve as the first input; the loop then runs hiddens, states = self.lstm(inputs, states), where hiddens is the output, whose argmax becomes the next input, and states is the (h, c) pair carried across iterations.)
When writing your own decoder, you can instead use the encoder CNN features as h0 and feed the <start> token as the first input (as with a GRU); a sketch of that variant follows the original code.
    def sample(self, features, states):
        """Samples captions for given image features (greedy search)."""
        sampled_ids = []
        inputs = features.unsqueeze(1)                   # features: (1, 256) -> inputs: (1, 1, 256) = (batch, 1, embed_size)
        for i in range(20):                              # maximum sampling length
            hiddens, states = self.lstm(inputs, states)  # (batch_size, 1, hidden_size)
            outputs = self.linear(hiddens.squeeze(1))    # (batch_size, vocab_size)
            predicted = outputs.max(1)[1]
            sampled_ids.append(predicted)
            inputs = self.embed(predicted)               # bug: this is 2-D; see 1.5 below
        sampled_ids = torch.cat(sampled_ids, 1)          # (batch_size, 20)
        return sampled_ids.squeeze()
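A minimal sketch of the LSTMCell variant suggested above: the encoder CNN features seed h0 and the <start> token embedding is the first input. It assumes the feature dimension equals hidden_size; all names here are illustrative, not the repo's API:

import torch
import torch.nn as nn

class DecoderCell(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.cell = nn.LSTMCell(embed_size, hidden_size)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def sample(self, features, start_idx, max_len=20):
        """Greedy decoding, one step at a time with LSTMCell."""
        batch = features.size(0)
        h = features                                      # (batch, hidden) -- CNN features as h0
        c = torch.zeros_like(features)                    # c0 starts at zero
        inputs = self.embed(features.new_full((batch,), start_idx, dtype=torch.long))
        sampled_ids = []
        for _ in range(max_len):
            h, c = self.cell(inputs, (h, c))              # (batch, hidden)
            predicted = self.linear(h).argmax(1)          # (batch,)
            sampled_ids.append(predicted)
            inputs = self.embed(predicted)                # 2-D input is fine for LSTMCell
        return torch.stack(sampled_ids, 1)                # (batch, max_len)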

Worth learning:

1.3. Training-stage code:

(The logic: concatenate the features computed by the encoder CNN, of shape (batch, 1, ninp), with the golden captions, of shape (batch, len, ninp), along dimension 1, then pack the result and feed it straight into the LSTM to get the final output. About pack_padded_sequence: you can also use it to pack the labels, then compute the loss from the RNN output and the packed labels; a sketch follows the code below. For background on pack_padded_sequence, see https://zhuanlan.zhihu.com/p/34418001.)

    def forward(self, features, captions, lengths):
        """Decode image feature vectors and generate captions."""
        embeddings = self.embed(captions)                               # (batch, max_len, embed_size)
        embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)  # prepend image features as step 0
        packed = pack_padded_sequence(embeddings, lengths, batch_first=True)
        hiddens, _ = self.lstm(packed)
        outputs = self.linear(hiddens[0])                               # hiddens[0] is the packed data tensor
        return outputs
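For the loss, the labels are packed the same way so the flattened decoder outputs line up with the flattened targets. A sketch of the matching training step (decoder, features, captions, and lengths are assumed to come from the data loader; captions sorted by length, descending):

import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

criterion = nn.CrossEntropyLoss()

# captions: (batch, max_len) word indices; lengths: true caption lengths.
targets = pack_padded_sequence(captions, lengths, batch_first=True)[0]  # (sum(lengths),)
outputs = decoder(features, captions, lengths)                          # (sum(lengths), vocab_size)
loss = criterion(outputs, targets)
loss.backward()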

Problems encountered at runtime:

1.4. In the test code, when running on GPU, initialize the state as follows; otherwise a KeyError is raised:

        state = (Variable(torch.zeros(opt.num_layers, 1, opt.nhid)).cuda(),
                 Variable(torch.zeros(opt.num_layers, 1, opt.nhid)).cuda())
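On PyTorch >= 0.4 (where Variable is no longer needed), a device-agnostic alternative avoids hard-coding .cuda() (a sketch, assuming features already lives on the target device):

# Allocate the initial state on whatever device the features are on.
device = features.device
state = (torch.zeros(opt.num_layers, 1, opt.nhid, device=device),
         torch.zeros(opt.num_layers, 1, opt.nhid, device=device))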

1.5. In sample in model.py, change the next-input line as follows; otherwise the second iteration of the for loop raises RuntimeError: input must have 3 dimensions, got 2:

            inputs = self.embed(predicted.unsqueeze(1))
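The reason: with batch_first=True, nn.LSTM expects a 3-D (batch, seq, embed) input, while self.embed(predicted) is only 2-D. A quick check (sizes are illustrative):

import torch
import torch.nn as nn

embed = nn.Embedding(1000, 256)
predicted = torch.tensor([7])               # (batch=1,) indices from outputs.max(1)[1]
print(embed(predicted).shape)               # torch.Size([1, 256])    -> 2-D, LSTM rejects it
print(embed(predicted.unsqueeze(1)).shape)  # torch.Size([1, 1, 256]) -> (batch, seq, embed)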

No other problems found so far.


2. caption evaluation:  

http://blog.youkuaiyun.com/ccbrid/article/details/79639127



