Train Vision Transformer Model and Run Inference

Table of Contents

CV Architecture

ViT and U-Net

Training ViT

Florence-2

Load Model

Load images

CV Scenarios test

Generate CAPTION from the images

DENSE REGION CAPTION and REGION_PROPOSAL

Caption to Phrase Grounding

Bounding boxes

OCR test

Fine Tuning Florence2

Qianwen-VL

Qianwen-VL-Inference


Until now, CV models have mainly been based on convolutional neural networks (CNNs). However, with the rise of the Transformer, the Vision Transformer is also increasingly being adopted.

Next, let's look at the mainstream CV architectures and their characteristics.

CV Architecture

U-Net

  • Features: Encoder-decoder structure, skip connections.
  • Network Type: Convolutional Neural Network (CNN).
  • Applications: Image segmentation, medical image processing.
  • Advantages: Efficient in segmentation tasks, preserves details.
  • Disadvantages: Limited scalability for large datasets.
  • Usage: Widely used in medical image segmentation.
  • Main Models: Original U-Net, 3D U-Net; U-Net is also the denoising backbone of Stable Diffusion.

R-CNN

  • Features: Selective search for generating candidate regions.
  • Network Type: CNN-based.
  • Applications: Object detection.
  • Advantages: High detection accuracy.
  • Disadvantages: High computational complexity, slow speed.
  • Usage: Replaced by faster models like Faster R-CNN.
  • Main Models: Fast R-CNN, Faster R-CNN.

GAN

  • Features: Adversarial training between generator and discriminator.
  • Network Type: Framework, usually using CNN.
  • Applications: Image generation, style transfer.
  • Advantages: Generates high-quality images.
  • Disadvantages: Unstable training, prone to mode collapse.
  • Usage: Widely used in generation tasks.
  • Main Models: DCGAN, StyleGAN.

RNN/LSTM

  • Features: Handles sequential data, remembers long-term dependencies.
  • Network Type: Recurrent Neural Network.
  • Applications: Time series prediction, video analysis.
  • Advantages: Suitable for sequential data.
  • Disadvantages: Difficult to train, gradient vanishing.
  • Usage: Commonly used in sequence tasks.
  • Main Models: LSTM, GRU.

GNN

  • Features: Processes graph-structured data.
  • Network Type: Graph Neural Network.
  • Applications: Social network analysis, chemical molecule modeling.
  • Advantages: Captures graph structure information.
  • Disadvantages: Limited scalability for large graphs.
  • Usage: Used in graph data tasks.
  • Main Models: GCN, GraphSAGE.

Capsule Networks

  • Features: Capsule structure, captures spatial hierarchies.
  • Network Type: CNN-based.
  • Applications: Image recognition.
  • Advantages: Captures pose variations.
  • Disadvantages: High computational complexity.
  • Usage: Research stage, not widely applied.
  • Main Models: CapsNet with dynamic routing.

Autoencoder

  • Features: Encoder-decoder structure.
  • Network Type: Can be CNN-based.
  • Applications: Dimensionality reduction, feature learning.
  • Advantages: Unsupervised learning.
  • Disadvantages: Limited generation quality.
  • Usage: Used for feature extraction and dimensionality reduction.
  • Main Models: Variational Autoencoder (VAE).

Vision Transformer (ViT)

  • Features: Based on self-attention mechanism, processes image patches.
  • Network Type: Transformer.
  • Applications: Image classification.
  • Advantages: Captures global information.
  • Disadvantages: Requires large amounts of data for training.
  • Usage: Gaining popularity, especially on large datasets.
  • Main Models: Original ViT, DeiT.

ViT and U-Net

According to the paper: "Understanding the Efficacy of U-Net & Vision Transformer for Groundwater Numerical Modelling," U-Net is generally more efficient than ViT, especially in sparse data scenarios. U-Net's architecture is simpler with fewer parameters, making it more efficient in terms of computational resources and time. While ViT has advantages in capturing global information, its self-attention mechanism has high computational complexity, particularly when handling large-scale data.

In the experiments of the paper, models combining U-Net and ViT outperformed the Fourier Neural Operator (FNO) in both accuracy and efficiency, especially in sparse data conditions.

In image processing, sparse data typically refers to incomplete or unevenly distributed information in images. For example:

  • Low-resolution images: Fewer pixels, missing details.
  • Occlusion or missing data: Parts of the image are blocked or data is missing.
  • Uneven sampling: Lower pixel density in certain areas.
In these cases, models need to infer the complete image content from limited pixel information.


After the emergence of Vision Transformers, new branches and variations have appeared:

  • DeiT (Data-efficient Image Transformers) by Facebook AI: DeiT models are refined ViT models trained in a more data-efficient way. They can be directly plugged into ViTModel or ViTForImageClassification. Four variants are available (in three different sizes): facebook/deit-tiny-patch16-224, facebook/deit-small-patch16-224, facebook/deit-base-patch16-224, and facebook/deit-base-patch16-384. Note that images should be prepared using DeiTImageProcessor (a minimal loading sketch follows this list).
  • BEiT (BERT pre-training of Image Transformers) by Microsoft Research: BEiT models use a self-supervised method inspired by BERT (masked image modeling) and based on VQ-VAE, outperforming vision transformers with supervised pre-training.
  • DINO (a self-supervised training method for Vision Transformers) by Facebook AI: Vision Transformers trained with the DINO method exhibit interesting properties not found in convolutional models. They can segment objects without being explicitly trained for it. DINO checkpoints can be found on the hub.
  • MAE (Masked Autoencoder) by Facebook AI: By pre-training Vision Transformers to reconstruct the pixel values of a large portion (75%) of masked patches (using an asymmetric encoder-decoder architecture), the authors demonstrate that this simple method outperforms supervised pre-training after fine-tuning.
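
To make the DeiT note above concrete, here is a minimal inference sketch using the Hugging Face transformers library. The checkpoint name comes from the list above; the local file cat.jpg is a placeholder for any RGB image, and the use of ViTForImageClassification with a DeiT checkpoint follows the note in the bullet rather than a verified recipe.

import torch
from PIL import Image
from transformers import DeiTImageProcessor, ViTForImageClassification

# "cat.jpg" is a placeholder path; any RGB image works
image = Image.open("cat.jpg").convert("RGB")

# Prepare inputs with DeiTImageProcessor and classify with ViTForImageClassification,
# as described in the DeiT bullet above
processor = DeiTImageProcessor.from_pretrained("facebook/deit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("facebook/deit-base-patch16-224")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring logit to its ImageNet label
print(model.config.id2label[logits.argmax(-1).item()])
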
The workflow of the Vision Transformer (ViT) can be described as follows:
  1. Image Patching: The input image is divided into small, fixed-size patches.
  2. Linear Projection: Each image patch is flattened and transformed into a vector through linear projection.
  3. Position Embedding: Position embeddings are added to each image patch to retain positional information.
  4. CLS Token: A learnable CLS token is added at the beginning of the sequence for classification tasks.
  5. Transformer Encoder: These embedded vectors (including the CLS token) are fed into the Transformer encoder for multi-layer processing. Each layer includes a multi-head attention mechanism and a feedforward neural network.
  6. MLP Head: After processing by the encoder, the output of the CLS token is passed to a multi-layer perceptron (MLP) head for the final classification decision.
This entire process demonstrates how the Transformer architecture can directly handle sequences of image patches to perform image classification, as illustrated in the minimal sketch below.
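
To make the six steps concrete, here is a minimal, illustrative PyTorch sketch of a ViT-style classifier. The class name MiniViT and all hyperparameters (dim=192, depth=6, and so on) are arbitrary choices for this example, not the original ViT configuration.

import torch
from torch import nn

class MiniViT(nn.Module):
    """Minimal ViT classifier illustrating the six steps above (illustrative sizes)."""
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=6, heads=3,
                 mlp_dim=768, num_classes=1000, channels=3):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # 1-2. Patching + linear projection, done in one step with a strided convolution
        self.to_patches = nn.Conv2d(channels, dim, kernel_size=patch_size, stride=patch_size)
        # 3. Learnable position embeddings (one per patch, plus one for the CLS token)
        self.pos_emb = nn.Parameter(torch.randn(1, num_patches + 1, dim) * 0.02)
        # 4. Learnable CLS token prepended to the sequence
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        # 5. Standard Transformer encoder (multi-head attention + feed-forward per layer)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=mlp_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # 6. MLP head applied to the CLS token output
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_classes))

    def forward(self, images):
        x = self.to_patches(images).flatten(2).transpose(1, 2)   # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_emb
        x = self.encoder(x)
        return self.head(x[:, 0])                                # classify on the CLS token

logits = MiniViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)   # torch.Size([2, 1000])
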

Training ViT

A pure ViT is mainly used for image classification.

import torch
from torch import nn

class Attention(nn.Module):
    def __init__(self, dim, heads=8, dim_head=64, dropout=0.):
        super().__init__()
        inner_dim = dim_head * heads
        # Skip the output projection when a single head already matches the model dimension
        project_out = not (heads == 1 and dim_head == dim)
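
For reference, here is a self-contained sketch of how such an attention module is commonly completed, restating the lines above for completeness. The attribute names to_qkv and to_out and the multi-head reshaping follow the widely used vit-pytorch style; this is an assumption, not necessarily the exact continuation of the snippet above.

import torch
from torch import nn

class Attention(nn.Module):
    def __init__(self, dim, heads=8, dim_head=64, dropout=0.):
        super().__init__()
        inner_dim = dim_head * heads
        project_out = not (heads == 1 and dim_head == dim)

        self.heads = heads
        self.scale = dim_head ** -0.5

        self.norm = nn.LayerNorm(dim)
        self.attend = nn.Softmax(dim=-1)
        self.dropout = nn.Dropout(dropout)

        # Single linear layer producing queries, keys, and values in one pass
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)
        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()

    def forward(self, x):
        # x: (batch, patch tokens, dim)
        b, n, _ = x.shape
        x = self.norm(x)
        qkv = self.to_qkv(x).chunk(3, dim=-1)
        # Reshape each of q, k, v to (batch, heads, tokens, dim_head)
        q, k, v = [t.reshape(b, n, self.heads, -1).transpose(1, 2) for t in qkv]
        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale
        attn = self.dropout(self.attend(dots))
        out = torch.matmul(attn, v)
        out = out.transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)

# Quick shape check
tokens = torch.randn(2, 65, 128)           # (batch, 1 CLS + 64 patches, dim)
print(Attention(dim=128)(tokens).shape)    # torch.Size([2, 65, 128])
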