Understanding to_patch_embedding in the ViT source code

This post explains how the einops library's Rearrange layer splits an input image into fixed-size patches and flattens each patch into a vector that can be fed to a fully connected layer. Compared with the traditional approach of extracting patches with a convolution kernel, this is a more flexible way to manipulate the image tensor. The post also mentions the convolutional alternative: a 16x16 kernel applied with stride 16 extracts the same grid of patches in a single convolution.
self.to_patch_embedding = nn.Sequential(
    Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)',
              p1 = patch_height, p2 = patch_width),
    nn.Linear(patch_dim, dim),
)
Rearrange is a layer provided by the einops library. The pattern string 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)' cuts the height and width axes into p1 x p2 tiles, merges the h*w tile positions into a single sequence axis, and flattens each tile together with its channels into a vector of length p1*p2*c, which nn.Linear then projects to dim.
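As a concrete illustration, here is a minimal, self-contained sketch (sizes are assumed for the example: a 224x224 RGB image, 16x16 patches, embedding dimension 768) that runs the Rearrange-based embedding next to the Conv2d alternative mentioned above; both yield a (batch, num_patches, dim) tensor, i.e. (1, 196, 768) here.

import torch
import torch.nn as nn
from einops.layers.torch import Rearrange

patch_height = patch_width = 16
patch_dim = 3 * patch_height * patch_width   # 16*16*3 = 768 raw values per patch
dim = 768

# einops route: cut H and W into 16x16 tiles, flatten each tile, then project
to_patch_embedding = nn.Sequential(
    Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)',
              p1 = patch_height, p2 = patch_width),
    nn.Linear(patch_dim, dim),
)

# conv route: a 16x16 kernel with stride 16 visits each patch exactly once and
# projects it to `dim` channels in the same step
conv_patch_embedding = nn.Conv2d(3, dim, kernel_size=16, stride=16)

x = torch.randn(1, 3, 224, 224)
print(to_patch_embedding(x).shape)                               # torch.Size([1, 196, 768])
print(conv_patch_embedding(x).flatten(2).transpose(1, 2).shape)  # torch.Size([1, 196, 768])

The two routes differ only in how the projection weights are laid out over each patch; the patch grid and the output shape are identical.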
A related question:

import torch
import torch.nn as nn
import torch.nn.functional as F
import open_clip
from vit_pytorch.simple_vit_with_patch_dropout import SimpleViT
from vit_pytorch.extractor import Extractor
from model_init import UrbanCLIP_init

# There could be more options to initialize the parameters! The following checkpoint is one of them.
# Our design is based on CLIP. So CLIP variants are also within our scope. Welcome any commit for UrbanCLIP!
model, _, transform = open_clip.create_model_and_transforms(
    model_name="coca_ViT-L-14",
    # pretrained="mscoco_finetuned_laion2B-s13B-b90k"
    pretrained="/root/autodl-tmp/laion-mscoco_finetuned_CoCa-ViT-L-14-laion2B-s13B-b90k/open_clip_pytorch_model.bin"
    # pretrained="/root/autodl-tmp/laion-CoCa-ViT-L-14-laion2B-s13B-b90k/open_clip_pytorch_model.bin"
)

# more general details of initialized model can be seen as follows:
vit = SimpleViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    patch_dropout = 0.5  # https://arxiv.org/abs/2212.00794
)

vit = Extractor(vit, return_embeddings_only = True, detach = False)

urbanclip_init = UrbanCLIP_init(
    dim = 512,                     # model dimension
    img_encoder = vit,             # vision transformer - image encoder, returning image embeddings as (batch, seq, dim)
    image_dim = 1024,              # image embedding dimension, if not the same as model dimensions
    num_tokens = 20000,            # number of text tokens
    unimodal_depth = 6,            # depth of the unimodal transformer
    multimodal_depth = 6,          # depth of the multimodal transformer
    dim_head = 64,                 # dimension per attention head
    heads = 8,                     # number of attention heads
    caption_loss_weight = 1.,      # weight on the autoregressive caption loss
    contrastive_loss_weight = 1.,  # weight on the contrastive loss between image and text CLS embeddings
).cuda()

text = torch.randint(0, 20000, (4, 512)).cuda()
images = torch.randn(4, 3, 256, 256).cuda()

loss = urbanclip_init(
    text = text,
    images = images,
    return_loss = True  # set this to True to get the full caption + contrastive loss
)
loss.backward()

logits = urbanclip_init(
    text = text,
    images = images
)

text_embeds, image_embeds = urbanclip_init(
    text = text,
    images = images,
    return_embeddings = True
)

The editor now reports that the imports "vit_pytorch.simple_vit_with_patch_dropout" and "vit_pytorch.extractor" cannot be resolved. How should I handle this?
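Both modules ship with lucidrains' vit-pytorch package (the same package SimpleViT is imported from above), so an "unresolved import" message usually means the package is missing from, or too old in, the interpreter your editor analyzes. A minimal sketch of one way to check, assuming that common cause (the helper name ensure_vit_pytorch is made up for illustration):

import importlib
import subprocess
import sys

def ensure_vit_pytorch():
    # Import the two modules the script above needs; if either is missing,
    # install/upgrade vit-pytorch into this exact interpreter (sys.executable),
    # which is the one the script will actually run under.
    try:
        importlib.import_module("vit_pytorch.simple_vit_with_patch_dropout")
        importlib.import_module("vit_pytorch.extractor")
    except ModuleNotFoundError:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "--upgrade", "vit-pytorch"]
        )

ensure_vit_pytorch()

If the script itself runs but the editor still flags the imports, the editor (e.g. Pylance) is most likely analyzing a different interpreter or virtual environment than the one vit-pytorch was installed into; selecting the matching interpreter in the editor clears the warning.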