Three questions before starting the research
- What task will enable zero-shot generalization?
- What is the corresponding model architecture?
- What data can power this task and model?
SAM (Segment Anything Model)
A foundation model for image segmentation
Three components:
- Task: a promptable segmentation task (see the sketch after this list)
- Model: a segmentation model (SAM)
- Data: a data engine for collecting a dataset (SA-1B) of over 1 billion masks
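A minimal sketch of the promptable task in practice, assuming the publicly released segment_anything package (pip install segment-anything); the checkpoint path, image file, and point coordinates below are placeholders:

```python
# Sketch of the promptable segmentation task: one prompt in, valid masks out.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # runs the heavy ViT image encoder once per image

# Prompt: a single foreground point; SAM returns candidate masks with quality scores.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),   # 1 = foreground, 0 = background
    multimask_output=True,        # several masks, since a single point can be ambiguous
)
```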
Model:
Image encoder:
uses an MAE pre-trained Vision Transformer (ViT); it runs once per image, so the resulting embedding can be reused across many prompts
Prompt encoder:
two sets of prompts: sparse (points, boxes, text) and dense (masks).
points and boxes are represented by positional encodings summed with learned embeddings; free-form text is embedded with CLIP's text encoder; dense masks are embedded using convolutions and summed element-wise with the image embedding (sketched below)
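A toy PyTorch sketch of that encoding scheme; the class name, layer sizes, and the linear stand-in for the Fourier positional encoding are assumptions for illustration, not the released implementation:

```python
# Toy prompt encoder: sparse prompts -> positional encoding + learned type embedding,
# dense mask prompts -> conv features summed element-wise with the image embedding.
import torch
import torch.nn as nn

class ToyPromptEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.pe = nn.Linear(2, dim)             # stand-in for a Fourier positional encoding of (x, y)
        self.point_type = nn.Embedding(2, dim)  # learned embedding: 0 = background, 1 = foreground
        # dense (mask) prompts: downscale with convolutions to the embedding resolution
        self.mask_down = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=2, stride=2), nn.GELU(),
            nn.Conv2d(16, dim, kernel_size=2, stride=2),
        )

    def forward(self, points_xy, point_labels, mask):
        sparse = self.pe(points_xy) + self.point_type(point_labels)  # (N, dim) sparse tokens
        dense = self.mask_down(mask)                                  # (1, dim, H/4, W/4) dense features
        return sparse, dense

enc = ToyPromptEncoder()
sparse, dense = enc(torch.rand(3, 2), torch.tensor([1, 1, 0]), torch.rand(1, 1, 256, 256))
image_embedding = torch.rand(1, 256, 64, 64)
image_embedding = image_embedding + dense  # dense prompt added element-wise to the image embedding
```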
Mask decoder:
a lightweight decoder that uses prompt self-attention and cross-attention in two directions (prompt-to-image embedding and vice versa) to update all embeddings before predicting masks (sketched below)
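A toy PyTorch sketch of the two-way attention pattern; the dimensions, head count, and naming are assumptions for illustration, not the released implementation:

```python
# Toy two-way attention block: prompt tokens self-attend, then cross-attend to the
# image embedding, then the image embedding cross-attends back to the tokens.
import torch
import torch.nn as nn

class TwoWayBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.token_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.token_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_to_token = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, image_emb):
        # tokens: (B, N, dim) prompt/output tokens; image_emb: (B, H*W, dim) flattened image embedding
        tokens = tokens + self.token_self_attn(tokens, tokens, tokens)[0]           # prompt self-attention
        tokens = tokens + self.token_to_image(tokens, image_emb, image_emb)[0]      # token -> image cross-attention
        image_emb = image_emb + self.image_to_token(image_emb, tokens, tokens)[0]   # image -> token cross-attention
        return tokens, image_emb

block = TwoWayBlock()
tokens, image_emb = block(torch.rand(1, 6, 256), torch.rand(1, 64 * 64, 256))
```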
Data engine:
built because segmentation masks are not naturally abundant online; it collects the data SAM needs to achieve strong generalization to new data distributions
Three stages of model-in-the-loop dataset annotation:
1. assisted-manual:
SAM assists annotators in annotating masks, similar to a classic interactive segmentation setup
2. semi-automatic:
SAM automatically generates masks for a subset of objects by prompting it with likely object locations, while annotators focus on annotating the remaining objects, increasing mask diversity
3. fully automatic:
prompt SAM with a regular grid of foreground points, yielding on average roughly 100 high-quality masks per image (sketched below)
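A minimal sketch of this fully automatic, grid-prompted mode, assuming the segment_anything package; the checkpoint path, image file, and grid density are placeholders:

```python
# Sketch of the fully automatic stage: SAM is prompted with a regular grid of
# foreground points and returns one mask record per detected object.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")        # placeholder checkpoint path
mask_generator = SamAutomaticMaskGenerator(sam, points_per_side=32)  # 32x32 point grid over the image

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts with "segmentation", "area", "predicted_iou", ...
```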