Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models (Translation)

Original paper: https://arxiv.org/pdf/2404.13706

Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models

Vitali Petsiuk¹ and Kate Saenko¹

Boston University

{vpetsiuk,saenko}@bu.edu

Abstract. Motivated by ethical and legal concerns, the scientific community is actively developing methods to limit the misuse of Text-to-Image diffusion models for reproducing copyrighted, violent, explicit, or personal information in the generated images. Simultaneously, researchers put these newly developed safety measures to the test by assuming the role of an adversary to find vulnerabilities and backdoors in them. We use the compositional property of diffusion models, which allows multiple prompts to be leveraged in a single image generation. This property allows us to combine other concepts that should not have been affected by the inhibition to reconstruct the vector responsible for target concept generation, even though the direct computation of this vector is no longer accessible. We provide theoretical and empirical evidence of why the proposed attacks are possible and discuss the implications of these findings for safe model deployment. We argue that it is essential to consider all possible approaches to image generation with diffusion models that can be employed by an adversary. Our work opens up a discussion about the implications of concept arithmetics and compositional inference for safety mechanisms in diffusion models.
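
For readers unfamiliar with compositional inference in diffusion models, the display below (our own notation, not taken from the paper) sketches the standard way several text conditions are combined into a single noise prediction at each sampling step:

$$
\tilde{\epsilon}_\theta(x_t, t) \;=\; \epsilon_\theta(x_t, t, \varnothing) \;+\; \sum_{i=1}^{k} w_i \left( \epsilon_\theta(x_t, t, c_i) - \epsilon_\theta(x_t, t, \varnothing) \right),
$$

where $c_i$ are text prompts, $w_i$ their guidance weights, and $\varnothing$ the empty prompt; with $k = 1$ this reduces to ordinary classifier-free guidance. The attack described in the abstract exploits the fact that the weighted sum can point toward an erased concept even when no single $c_i$ mentions it directly.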

Content Advisory: This paper contains discussions and model-generated content that may be considered offensive. Reader discretion is advised.

Project page: https://cs-people.bu.edu/vpetsiuk/arc

1 Introduction

Recent advances in Text-to-Image (T2I) generation [24, 26, 28] have led to the rapid growth of applications enabled by the models, including many commercial projects as well as creative applications by the general public. On the other hand, they can also be used for generating deep fakes, hateful or inappropriate images [2, 9], copyrighted materials, or artistic styles [30]. Trained on vast amounts of data scraped from the web, these models also learn to reproduce the biases and stereotypes present in the data [2, 8, 17, 19]. While some legal [9, 18] and ethical [27] questions concerning image generation models remain unresolved, the scientific community is developing methods to limit their malicious utility while keeping them open and accessible to the community.

Fig. 1: While recent methods for erasing concepts in Diffusion Models successfully pass their respective evaluations (middle row), they do not entirely remove the target concept (such as zebra) from model weights as claimed. In this work, we propose a method to reproduce the erased concept using the inhibited models (bottom row).

Some recently proposed approaches, which we refer to as Concept Inhibition methods [7, 8, 10, 15, 29, 39], modify the Diffusion Model (DM) to “forget” some specified information. Given a target concept, the weights of the model are fine-tuned or otherwise edited so that the model is no longer capable of generating images that contain that concept. Unlike the post-hoc filtering methods (safety checkers) that can be easily circumvented by an adversary [1, 25, 38]
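
To make the compositional-inference idea concrete, the sketch below shows how noise predictions for several surrogate prompts could be combined during sampling. This is a minimal illustration under our own assumptions (the `unet` callable, argument names, and prompt handling are placeholders), not the authors' implementation:

```python
import torch


def composed_noise_prediction(unet, latents, t, uncond_embed, cond_embeds, weights):
    """Combine noise predictions for several prompts into one guidance direction.

    unet:         a denoiser callable (latents, t, text_embedding) -> predicted noise
    uncond_embed: text embedding of the empty prompt
    cond_embeds:  list of text embeddings for the surrogate prompts
    weights:      per-prompt guidance weights (may be negative)

    Illustrative sketch of compositional guidance; not the paper's code.
    """
    eps_uncond = unet(latents, t, uncond_embed)
    eps = eps_uncond.clone()
    for emb, w in zip(cond_embeds, weights):
        eps_cond = unet(latents, t, emb)
        # Each prompt contributes a scaled direction relative to the unconditional prediction.
        eps = eps + w * (eps_cond - eps_uncond)
    return eps
```

Choosing surrogate prompts that are individually unaffected by the inhibition, but whose weighted combination approximates the direction of the erased concept, is the essence of the circumvention described in the abstract.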
