SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation

Latest recommended article published on 2025-06-04 23:56:57

尔呦



Category: paper reading. Tags: deep learning


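This paper again targets subject-driven image generation. The specific problem is that the reference image mixes the subject with several other objects and the background. The method needs no test-time fine-tuning. Its main contribution is the SSR encoder, which produces a more accurate subject embedding. The SSR encoder consists of a token-to-patch aligner, which aligns the query (mask or text) with the image patches, and a detail-preserving subject encoder, which extracts the subject features.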

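Given an image $I$ and a query $q$ (mask or text), the SSR-encoder outputs a multi-scale subject embedding $c_s$. This embedding is then used as a condition during training in the same way as in IP-Adapter, while the text embedding $c_t$ is left unchanged.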
Selective Subject Representation Encoder: the SSR-encoder is a Token-to-Patch Aligner plus a Detail-Preserving Subject Encoder.
Token-to-patch aligner: aligns image patch features with text token features. For $I, q$, CLIP produces the query embedding $z_q\in\mathbb{R}^{N_q\times D_q}$ and the image embedding $z_0\in\mathbb{R}^{N_i\times D_i}$. Trainable projection layers $W_Q, W_K$ then give $Q, K$, from which the token-to-patch attention map is computed as $A_{t2p} = \mathrm{Softmax}(\frac{QK^T}{\sqrt{d}})$. This map acts as a region selector, so the method naturally supports a mask as the query condition.
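The attention map above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names, the random stand-ins for the CLIP embeddings, and the row-per-token layout (so that $Q = z_q W_Q$ and $K = z_0 W_K$) are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_to_patch_attention(z_q, z_0, W_Q, W_K):
    """Return A_t2p of shape (N_q, N_i): one distribution over patches per query token."""
    Q = z_q @ W_Q              # (N_q, d)
    K = z_0 @ W_K              # (N_i, d)
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d), axis=-1)

# Random stand-ins for CLIP features (hypothetical sizes).
rng = np.random.default_rng(0)
N_q, D_q, N_i, D_i, d = 4, 16, 9, 32, 8
z_q = rng.normal(size=(N_q, D_q))   # query token embeddings
z_0 = rng.normal(size=(N_i, D_i))   # image patch embeddings
W_Q = rng.normal(size=(D_q, d))     # trainable in the real model
W_K = rng.normal(size=(D_i, d))     # trainable in the real model

A_t2p = token_to_patch_attention(z_q, z_0, W_Q, W_K)
```

Each row of `A_t2p` sums to 1, so it can be read as a soft selection of the image patches that belong to the queried subject.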
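Detail-preserving subject encoder: CLIP is then used to extract image features. Unlike earlier methods, which take only the last layer, this paper extracts multi-scale features $z_I = \{z_k\}^K_{k=0}$ (the number of scales is denoted $K$). One projection layer $W_k^V$ per scale gives the corresponding $V_k$, and combining each $V_k$ with $A_{t2p}$ gives the per-scale conditioning embedding $c_s^k = A_{t2p}V_k^T$. The final $c_s$ is obtained by concatenating the $c_s^k$ over all scales.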
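A minimal NumPy sketch of this multi-scale step follows. It assumes a row-per-token layout, so the note's $A_{t2p}V_k^T$ becomes `A_t2p @ V_k` with `V_k` of shape `(N_i, D)`. Concatenating along the token axis is also an assumption, since the note only says the $c_s^k$ are concatenated. All names and sizes are hypothetical.

```python
import numpy as np

def subject_embedding(A_t2p, feats, W_V):
    """Pool each scale's projected features with the shared attention map.

    A_t2p: (N_q, N_i) token-to-patch attention map
    feats: list of per-scale image features z_k, each (N_i, D_i)
    W_V:   list of per-scale value projections W_k^V, each (D_i, D)
    """
    per_scale = []
    for z_k, W_k in zip(feats, W_V):
        V_k = z_k @ W_k                 # (N_i, D)
        per_scale.append(A_t2p @ V_k)   # c_s^k, shape (N_q, D)
    # Assumption: concatenate over the token axis.
    return np.concatenate(per_scale, axis=0)

rng = np.random.default_rng(0)
N_q, N_i, D_i, D, num_scales = 4, 9, 32, 8, 3
A_t2p = np.full((N_q, N_i), 1.0 / N_i)   # uniform attention, for illustration only
feats = [rng.normal(size=(N_i, D_i)) for _ in range(num_scales)]
W_V = [rng.normal(size=(D_i, D)) for _ in range(num_scales)]

c_s = subject_embedding(A_t2p, feats, W_V)
```

With the uniform map used here, each $c_s^k$ row reduces to the mean of the projected patch features at that scale. A learned $A_{t2p}$ would instead weight the patches that match the query.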
Subject Conditioned Generation: similar to IP-Adapter.
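For context, IP-Adapter injects the extra condition through decoupled cross-attention: the text branch and the image branch share the query but have separate key and value projections, and their outputs are summed. The NumPy sketch below shows that idea with $c_s$ in place of the image features. It assumes SSR-Encoder follows the same pattern, as the note suggests; the function names, the `scale` parameter, and the sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention.
    return softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1) @ v

def decoupled_cross_attention(x, c_t, c_s, W_q, W_k_t, W_v_t, W_k_s, W_v_s, scale=1.0):
    """Text branch plus a separately projected subject branch, summed."""
    q = x @ W_q
    text_out = attention(q, c_t @ W_k_t, c_t @ W_v_t)
    subj_out = attention(q, c_s @ W_k_s, c_s @ W_v_s)
    return text_out + scale * subj_out

rng = np.random.default_rng(0)
N_x, N_t, N_s, D = 6, 5, 12, 8
x = rng.normal(size=(N_x, D))     # latent (U-Net) features
c_t = rng.normal(size=(N_t, D))   # text embedding
c_s = rng.normal(size=(N_s, D))   # subject embedding
W_q, W_k_t, W_v_t, W_k_s, W_v_s = (rng.normal(size=(D, D)) for _ in range(5))

out = decoupled_cross_attention(x, c_t, c_s, W_q, W_k_t, W_v_t, W_k_s, W_v_s)
text_only = decoupled_cross_attention(x, c_t, c_s, W_q, W_k_t, W_v_t, W_k_s, W_v_s, scale=0.0)
```

Setting `scale=0.0` recovers the plain text-conditioned output, which is why the text embedding $c_t$ can stay unchanged.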
During training, an extra loss term is added. It supervises the cosine similarity between the mean of the $c_s^k$ and the query text embedding.
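One way to write such a term is sketched below in NumPy. The exact form, $1 - \cos$ averaged over query tokens, is an assumption, as is the requirement that $c_s^k$ and the query embedding share the same dimension. The function name is hypothetical.

```python
import numpy as np

def embedding_consistency_loss(c_s_scales, z_q, eps=1e-8):
    """1 - cosine similarity between the scale-averaged c_s^k and z_q.

    c_s_scales: list of per-scale embeddings c_s^k, each (N_q, D)
    z_q:        query text embedding, (N_q, D)
    """
    c_mean = np.mean(np.stack(c_s_scales, axis=0), axis=0)   # (N_q, D)
    dot = np.sum(c_mean * z_q, axis=-1)
    norms = np.linalg.norm(c_mean, axis=-1) * np.linalg.norm(z_q, axis=-1)
    cos = dot / (norms + eps)
    return float(np.mean(1.0 - cos))

rng = np.random.default_rng(0)
z_q = rng.normal(size=(4, 8))
loss_aligned = embedding_consistency_loss([z_q, z_q, z_q], z_q)
loss_opposite = embedding_consistency_loss([-z_q, -z_q], z_q)
```

The loss is close to 0 when the averaged subject embedding points in the same direction as the query embedding and close to 2 when it points the opposite way.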

Training data: subsets of LAION-5B with aesthetic score > 6, re-captioned with BLIP-2, about 10 million samples in total, of which 5,000 are held out for evaluation.
Training inputs are 512×512, preprocessed with resize + center crop.
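The geometry of resize + center crop can be sketched in plain Python. The note does not say how the resize is done, so this assumes the common convention of scaling the shorter side to 512 and then cropping the central square. The function name is hypothetical.

```python
def resize_center_crop_box(width, height, size=512):
    """Return the resized (w, h) and the central crop box (left, top, right, bottom).

    Assumption: the shorter side is scaled to `size`, aspect ratio preserved.
    """
    scale = size / min(width, height)
    new_w = max(size, round(width * scale))
    new_h = max(size, round(height * scale))
    left = (new_w - size) // 2
    top = (new_h - size) // 2
    return (new_w, new_h), (left, top, left + size, top + size)

resized, box = resize_center_crop_box(1024, 768)
# resized == (683, 512), box == (85, 0, 597, 512)
```

The returned box can be passed to any image library's crop call after resizing to `resized`.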
Metrics: consistency with the subject, consistency with the text, and aesthetic score.