Prompt Engineering and Attention Mechanisms

Originally published 2025-12-05 23:05:20

License: CC 4.0 BY-SA

Abstract

Prompt engineering is a key skill for working with text-to-image models such as Stable Diffusion. This article examines how prompts work in Stable Diffusion WebUI, covering prompt parsing, the attention (emphasis) mechanism, and weight adjustment. We analyze the implementation of modules such as modules/prompt_parser.py and modules/sd_hijack.py to show how a prompt becomes the vector representation the model consumes, and how adjusting prompt weights steers the result. We also cover the mechanisms behind advanced features such as Textual Inversion and hypernetworks.

Keywords: Stable Diffusion, prompt engineering, attention mechanism, WebUI, deep learning

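1. Introduction

Prompt engineering is an important skill when using AI generative models, especially for text-to-image generation. Stable Diffusion generates an image conditioned on the input text prompt, so the quality and structure of the prompt directly affect the result. In Stable Diffusion WebUI, prompts do more than guide generation: they also support a richer syntax, including weight adjustment and dynamic prompts.

This article analyzes the core mechanisms of prompt handling in WebUI so that readers can see how a prompt is parsed, encoded, and applied during model inference.

2. Prompt Parsing

2.1 Prompt Syntax

Stable Diffusion WebUI supports several prompt syntaxes that give fine-grained control over generation: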

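Basic syntax:
- Plain words: a beautiful landscape
- Word combinations: mountain lake forest
Weight adjustment:
- Increase weight: (important detail:1.2) or (important detail), where bare parentheses multiply the weight by 1.1
- Decrease weight: (unwanted element:0.8) or [unwanted element], where bare square brackets divide the weight by 1.1
Dynamic prompts:
- Prompt editing over time: [first image:second image:20], which switches from the first text to the second at step 20
- Alternation: [option1|option2|option3], which cycles through the options from one step to the next
Composite prompts:
- Joined with AND: a cat AND a dog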
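
Nested brackets compound their multipliers. The following minimal sketch shows the arithmetic, assuming the 1.1 and 1 / 1.1 factors used by parse_prompt_attention in section 3.3; the function name effective_weight is hypothetical and not part of WebUI.

```python
ROUND_MULTIPLIER = 1.1        # one pair of ( ) multiplies the weight by 1.1
SQUARE_MULTIPLIER = 1 / 1.1   # one pair of [ ] divides the weight by 1.1


def effective_weight(round_depth: int = 0, square_depth: int = 0) -> float:
    """Weight of text nested inside the given number of ( ) and [ ] pairs."""
    return ROUND_MULTIPLIER ** round_depth * SQUARE_MULTIPLIER ** square_depth


print(round(effective_weight(round_depth=2), 4))   # weight of ((text))
print(round(effective_weight(square_depth=1), 4))  # weight of [text]
```

Because the factors multiply, a handful of nested parentheses quickly reaches the range where an explicit value such as (text:1.3) is easier to read.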

2.2 The Prompt Parser

WebUI uses a Lark-based parser to handle the prompt editing and alternation syntax:

import lark

schedule_parser = lark.Lark(r"""
!start: (prompt | /[][():]/+)*
prompt: (emphasized | scheduled | alternate | plain | WHITESPACE)*
!emphasized: "(" prompt ")"
        | "(" prompt ":" prompt ")"
        | "[" prompt "]"
scheduled: "[" [prompt ":"] prompt ":" [WHITESPACE] NUMBER [WHITESPACE] "]"
alternate: "[" prompt ("|" [prompt])+ "]"
WHITESPACE: /\s+/
plain: /([^\\\[\]():|]|\\.)+/
%import common.SIGNED_NUMBER -> NUMBER
""")

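The parser turns prompts with nested emphasis, scheduling, and alternation into a parse tree, which WebUI then walks to work out which text is active at each sampling step.

2.3 Prompt Scheduling

For prompts that contain prompt editing, WebUI builds a prompt schedule: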

def get_learned_conditioning_prompt_schedules(prompts, base_steps, hires_steps=None, use_old_scheduling=False):
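    """
    Convert prompts into schedules that specify which prompt text is used up to which step

    Example (with 100 sampling steps):
    Input: "a [mountain:lake:0.25]"
    Output: [[25, 'a mountain'], [100, 'a lake']]
    """
    # Implementation details...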

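This mechanism swaps the prompt text partway through generation, which allows finer control. Each schedule entry gives the last step at which its text applies.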
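
To make the schedule format concrete, here is a minimal sketch that handles a single, non-nested [from:to:when] group with a regular expression. It is not the WebUI implementation, which walks the Lark parse tree; the helper name simple_schedule and the rule "values below 1 are a fraction of the total steps" are assumptions made for illustration.

```python
import re

# Hypothetical pattern: one non-nested [from:to:when] group.
EDIT_RE = re.compile(r"\[([^\[\]:|]*):([^\[\]:|]*):\s*([0-9.]+)\s*\]")


def simple_schedule(prompt: str, steps: int) -> list:
    """Return [[last_step, text], ...] for a prompt with at most one edit group."""
    match = EDIT_RE.search(prompt)
    if match is None:
        return [[steps, prompt]]

    before, after, when = match.group(1), match.group(2), float(match.group(3))
    # Assumption: values below 1 are fractions of the step count, others are step numbers.
    switch_step = int(when * steps) if when < 1 else int(when)
    switch_step = min(switch_step, steps)

    def render(text: str) -> str:
        return prompt[:match.start()] + text + prompt[match.end():]

    return [[switch_step, render(before)], [steps, render(after)]]


print(simple_schedule("a [mountain:lake:0.25]", 100))
```

With 100 steps the sketch reproduces the schedule from the docstring above: the first text is used through step 25 and the second through step 100.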

3. Text Encoding and the Attention Mechanism

3.1 The CLIP Text Encoder

Stable Diffusion uses a CLIP model as its text encoder to turn the text prompt into vectors. In WebUI this is implemented by the TextConditionalModel class in modules/sd_hijack_clip.py and its subclasses (abridged below):

class TextConditionalModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.hijack = sd_hijack.model_hijack
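        self.chunk_length = 75  # maximum CLIP token length per chunk (excluding the start and end tokens)

    def forward(self, texts):
        """
        Convert texts into vector representations
        """
        # Tokenize the texts and split them into chunks
        batch_chunks, token_count = self.process_texts(texts)
        chunk_count = max(len(x) for x in batch_chunks)

        # Encode the chunks one position at a time
        zs = []
        for i in range(chunk_count):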
            batch_chunk = [chunks[i] if i < len(chunks) else self.empty_chunk() for chunks in batch_chunks]
            tokens = [x.tokens for x in batch_chunk]
            multipliers = [x.multipliers for x in batch_chunk]
            
            # Register the Textual Inversion fixes for this chunk
            self.hijack.fixes = [x.fixes for x in batch_chunk]
            z = self.process_tokens(tokens, multipliers)
            zs.append(z)
            
        return torch.hstack(zs)
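
The chunking idea can be sketched on plain lists of token ids. This is an illustrative sketch rather than WebUI code: the function name split_into_chunks is hypothetical, padding with the end token is an assumption, and 49406 and 49407 are used only as example start and end ids.

```python
CHUNK_LENGTH = 75  # usable tokens per chunk, matching chunk_length above


def split_into_chunks(token_ids, start_id, end_id, chunk_length=CHUNK_LENGTH):
    """Split token ids into fixed-size chunks, each wrapped in start/end markers."""
    chunks = []
    # max(..., 1) ensures that an empty prompt still yields one padded chunk.
    for i in range(0, max(len(token_ids), 1), chunk_length):
        body = token_ids[i:i + chunk_length]
        padding = [end_id] * (chunk_length - len(body))  # assumption: pad with the end token
        chunks.append([start_id] + body + padding + [end_id])
    return chunks


chunks = split_into_chunks(list(range(100)), start_id=49406, end_id=49407)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk has 77 entries (75 tokens plus the two markers), so a 100-token prompt needs two encoder passes whose outputs are then concatenated.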

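3.2 Tokenization

The text is first split into tokens, each of which corresponds to an entry in the CLIP vocabulary. Tokens are then grouped into chunks, and Textual Inversion embeddings receive placeholder tokens that are filled in later: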

def tokenize_line(self, line):
    """
    Convert a single line of text into a list of PromptChunk objects
    """
    if opts.emphasis != "None":
        parsed = prompt_parser.parse_prompt_attention(line)
    else:
        parsed = [[line, 1.0]]

    tokenized = self.tokenize([text for text, _ in parsed])
    
    chunks = []
    chunk = PromptChunk()
    token_count = 0
    last_comma = -1
    
    # Process each token
    for tokens, (text, weight) in zip(tokenized, parsed):
        position = 0
        while position < len(tokens):
            token = tokens[position]
            
            # Check whether a Textual Inversion embedding starts at this position
            embedding, embedding_length_in_tokens = self.hijack.embedding_db.find_embedding_at_position(tokens, position)
            if embedding is None:
                chunk.tokens.append(token)
                chunk.multipliers.append(weight)
                position += 1
                continue
                
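            # Record where the embedding's vectors must be spliced in later
            emb_len = int(embedding.vectors)  # number of vectors in the embedding
            chunk.fixes.append(PromptChunkFix(len(chunk.tokens), embedding))
            chunk.tokens += [0] * emb_len
            chunk.multipliers += [weight] * emb_len
            position += embedding_length_in_tokens

    # (abridged: splitting into chunk_length-sized chunks and the final return of the chunks and token count are omitted)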

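3.3 Attention Weight Adjustment

WebUI parses emphasis markers into [text, weight] pairs: each pair of parentheses multiplies the weight by 1.1, each pair of square brackets divides it by 1.1, and (text:1.5) applies an explicit multiplier: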

def parse_prompt_attention(text):
    """
    Parse the attention (emphasis) markers in a prompt into a list of [text, weight] pairs
    """
    res = []
    round_brackets = []
    square_brackets = []

    round_bracket_multiplier = 1.1
    square_bracket_multiplier = 1 / 1.1

    def multiply_range(start_position, multiplier):
        for p in range(start_position, len(res)):
            res[p][1] *= multiplier
            
    # Walk through the matched markers and text fragments
    for m in re_attention.finditer(text):
        text = m.group(0)
        weight = m.group(1)

        if text.startswith('\\'):
            res.append([text[1:], 1.0])
        elif text == '(':
            round_brackets.append(len(res))
        elif text == '[':
            square_brackets.append(len(res))
        elif weight is not None and round_brackets:
            multiply_range(round_brackets.pop(), float(weight))
        elif text == ')' and round_brackets:
            multiply_range(round_brackets.pop(), round_bracket_multiplier)
        elif text == ']' and square_brackets:
            multiply_range(square_brackets.pop(), square_bracket_multiplier)
        else:
            res.append([text, 1.0])
            
    # Apply the default multiplier to brackets that were never closed
    for pos in round_brackets:
        multiply_range(pos, round_bracket_multiplier)
        
    for pos in square_brackets:
        multiply_range(pos, square_bracket_multiplier)
        
    return res
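
The function above relies on a regular expression, re_attention, that the excerpt does not show. The following self-contained sketch uses a simplified pattern of its own (an assumption, not the exact WebUI expression) to show what the parser returns; parse_weights and RE_ATTENTION are hypothetical names.

```python
import re

# Simplified pattern (assumption): escapes, brackets, ":<number>)" and plain text.
RE_ATTENTION = re.compile(
    r"\\.|\(|\[|:\s*([+-]?\d*\.?\d+)\s*\)|\)|\]|[^\\()\[\]:]+|:"
)


def parse_weights(text: str) -> list:
    """Return [[fragment, weight], ...] for a prompt with ( ), [ ] and (x:1.5) markers."""
    res, round_stack, square_stack = [], [], []

    def multiply_range(start: int, multiplier: float) -> None:
        for item in res[start:]:
            item[1] *= multiplier

    for m in RE_ATTENTION.finditer(text):
        piece, weight = m.group(0), m.group(1)
        if piece.startswith("\\"):
            res.append([piece[1:], 1.0])          # escaped character, kept literally
        elif piece == "(":
            round_stack.append(len(res))
        elif piece == "[":
            square_stack.append(len(res))
        elif weight is not None and round_stack:
            multiply_range(round_stack.pop(), float(weight))
        elif piece == ")" and round_stack:
            multiply_range(round_stack.pop(), 1.1)
        elif piece == "]" and square_stack:
            multiply_range(square_stack.pop(), 1 / 1.1)
        else:
            res.append([piece, 1.0])

    # Brackets that were never closed still apply their default multiplier.
    for pos in round_stack:
        multiply_range(pos, 1.1)
    for pos in square_stack:
        multiply_range(pos, 1 / 1.1)
    return res


print(parse_weights("a (cat:1.5) on [grass]"))
```

The stacks store the index in res at which each bracket was opened, so closing a bracket rescales every fragment added since then; this is what makes nested emphasis compound.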

4. Textual Inversion

4.1 Basic Principle

Textual Inversion lets users train custom embedding vectors that represent a specific concept or style. In WebUI, the EmbeddingDatabase class in modules/textual_inversion/textual_inversion.py manages these embeddings (simplified excerpt):

import os


class EmbeddingDatabase:
    def __init__(self):
        self.ids_lookup = {}  # first token id -> list of (token ids, embedding)
        self.word_embeddings = {}
        self.skipped_embeddings = {}
        self.dir_mtime = None
        self.embeddings_dir = cmd_opts.embeddings_dir

    def load_textual_inversion_embeddings(self):
        """
        Load Textual Inversion embeddings
        """
        # Walk through the embeddings directory
        for fn in os.listdir(self.embeddings_dir):
            if not os.path.isfile(os.path.join(self.embeddings_dir, fn)):
                continue

            # Several file formats are supported
            if fn.lower().endswith(".pt"):
                self.load_from_file_pt(fn)
            elif fn.lower().endswith(".safetensors"):
                self.load_from_file_safetensors(fn)
            elif fn.lower().endswith(".bin"):
                self.load_from_file_bin(fn)

    def find_embedding_at_position(self, tokens, offset):
        """
        Find an embedding whose trigger tokens start at the given position
        """
        token = tokens[offset]
        possible_matches = self.ids_lookup.get(token, None)

        if possible_matches is None:
            return None, None

        # An embedding name may span several tokens, so compare the whole sequence
        for ids, embedding in possible_matches:
            if tokens[offset:offset + len(ids)] == ids:
                return embedding, len(ids)

        return None, None
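
The lookup can be tried out in isolation. The sketch below is a hypothetical stand-in for the lookup table, not the WebUI class: it indexes embeddings by the first token id of their trigger name and compares the full token sequence before reporting a match. The class name, the register method, and the token ids are all made up for illustration.

```python
from collections import defaultdict


class TinyEmbeddingIndex:
    """Hypothetical lookup table: first token id -> candidate (token ids, name) pairs."""

    def __init__(self):
        self.by_first_token = defaultdict(list)

    def register(self, name, token_ids):
        candidates = self.by_first_token[token_ids[0]]
        candidates.append((list(token_ids), name))
        # Try longer trigger sequences first so the most specific name wins.
        candidates.sort(key=lambda pair: len(pair[0]), reverse=True)

    def find_at(self, tokens, offset):
        """Return (name, length in tokens) of the embedding starting at offset, if any."""
        for ids, name in self.by_first_token.get(tokens[offset], []):
            if tokens[offset:offset + len(ids)] == ids:
                return name, len(ids)
        return None, None


index = TinyEmbeddingIndex()
index.register("my-style", [10, 11])
index.register("my", [10])
print(index.find_at([5, 10, 11, 7], 1))
```

Keying the table by the first token keeps the common case cheap: most prompt tokens do not start any embedding name, so a single dictionary lookup rules them out.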

4.2 Applying Embeddings

Embeddings are applied during text encoding by the EmbeddingsWithFixes class in modules/sd_hijack.py:

class EmbeddingsWithFixes(torch.nn.Module):
    def __init__(self, wrapped, embeddings, textual_inversion_key='clip_l'):
        super().__init__()
        self.wrapped = wrapped
        self.embeddings = embeddings
        self.textual_inversion_key = textual_inversion_key

    def forward(self, input_ids):
        batch_fixes = self.embeddings.fixes
        self.embeddings.fixes = None

        inputs_embeds = self.wrapped(input_ids)

        if batch_fixes is None or len(batch_fixes) == 0 or max([len(x) for x in batch_fixes]) == 0:
            return inputs_embeds

        vecs = []
        for fixes, tensor in zip(batch_fixes, inputs_embeds):
            for offset, embedding in fixes:
                # Pick the embedding vectors that belong to this text encoder
                vec = embedding.vec[self.textual_inversion_key] if isinstance(embedding.vec, dict) else embedding.vec
                emb = devices.cond_cast_unet(vec)
                
                # Overwrite the placeholder token embeddings (offset + 1 skips the start token)
                emb_len = min(tensor.shape[0] - offset - 1, emb.shape[0])
                tensor = torch.cat([tensor[0:offset + 1], emb[0:emb_len], tensor[offset + 1 + emb_len:]]).to(dtype=inputs_embeds.dtype)

            vecs.append(tensor)

        return torch.stack(vecs)
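
The slicing in forward is easier to follow on a plain list. The sketch below mirrors the same index arithmetic with Python lists standing in for tensor rows; splice_embedding is a hypothetical helper, and the string labels are placeholders for embedding vectors.

```python
def splice_embedding(rows, offset, emb):
    """Overwrite the rows after position `offset` with `emb`, keeping the length fixed."""
    # Never write past the last row: the embedding is truncated if it does not fit.
    emb_len = min(len(rows) - offset - 1, len(emb))
    return rows[:offset + 1] + emb[:emb_len] + rows[offset + 1 + emb_len:]


# Row 0 is the start token; the two "ph" rows are placeholders for a 2-vector embedding.
rows = ["<start>", "a", "ph", "ph", "b", "<end>"]
print(splice_embedding(rows, offset=1, emb=["e0", "e1"]))
```

The fix offset is counted in prompt tokens, without the start token, which is why the replacement begins at offset + 1. The sequence length never changes, so the result still fits the fixed-size encoder input.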

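5. Attention Enhancement

5.1 Emphasis Algorithms

WebUI applies the per-token weights after the text encoder has produced its output, and the emphasis setting (opts.emphasis) selects the algorithm. In recent WebUI versions the algorithms are classes in modules/sd_emphasis.py; the excerpt below is abridged:

class Emphasis:
    """Base class: holds the inputs and output of one text-encoder pass"""
    name = "Base"
    tokens = None       # token ids for each prompt in the batch
    multipliers = None  # per-token weights, tensor of shape (batch, tokens)
    z = None            # encoder output, tensor of shape (batch, tokens, channels)

    def after_transformers(self):
        """Called after the transformer has produced z; the base class leaves z unchanged"""
        pass


class EmphasisOriginal(Emphasis):
    name = "Original"

    def after_transformers(self):
        # Scale every token vector by its weight, then restore the original overall mean
        original_mean = self.z.mean()
        self.z = self.z * self.multipliers.reshape(self.multipliers.shape + (1,)).expand(self.z.shape)
        new_mean = self.z.mean()
        self.z = self.z * (original_mean / new_mean)

# EmphasisNone ("None"), EmphasisIgnore ("Ignore") and EmphasisOriginalNoNorm ("No norm") follow the same pattern and are omitted here

5.2 Selecting and Extending Emphasis Algorithms

The active algorithm is looked up by name, with the original algorithm as the fallback. A custom adjustment can follow the same pattern: subclass Emphasis, override after_transformers, and add the class to the list of options.

options = [EmphasisNone, EmphasisIgnore, EmphasisOriginal, EmphasisOriginalNoNorm]

def get_current_option(emphasis_option_name):
    """
    Return the emphasis class selected by the current setting
    """
    return next(iter([x for x in options if x.name == emphasis_option_name]), EmphasisOriginal)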
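
A small numeric sketch of per-token emphasis on plain Python lists may help here. It multiplies each token vector by its weight and, optionally, rescales the result so that the overall mean is unchanged. The function name apply_emphasis and the normalize flag are illustrative assumptions, and the two-dimensional vectors are toy data.

```python
def apply_emphasis(z, multipliers, normalize=True):
    """z: list of token vectors (lists of floats); multipliers: one weight per token."""
    scaled = [[value * m for value in vec] for vec, m in zip(z, multipliers)]
    if not normalize:
        return scaled

    def mean(matrix):
        values = [v for vec in matrix for v in vec]
        return sum(values) / len(values)

    # Rescale so that emphasizing some tokens does not change the overall magnitude.
    factor = mean(z) / mean(scaled)
    return [[value * factor for value in vec] for vec in scaled]


z = [[1.0, 1.0], [1.0, 1.0]]
print(apply_emphasis(z, [1.0, 1.5]))
print(apply_emphasis(z, [1.0, 1.5], normalize=False))
```

With normalization, raising one token's weight lowers the others in relative terms, so emphasis redistributes attention instead of inflating the whole conditioning vector.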

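6. Prompt Optimization Tips

6.1 Structuring the Prompt

A well-structured prompt can noticeably improve output quality:

- Clear subject: describe the main object of the image unambiguously
- Detailed modifiers: add adjectives and adverbs to enrich the description
- Art style: name a specific art style or medium
- Setting: describe the scene and the lighting conditions

Example: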

(masterpiece), (best quality), a beautiful young woman with long flowing hair, 
intricate detailed face, vibrant colors, fantasy art style, 
detailed background with mountains and lake, dramatic lighting, 
soft focus effect, cinematic composition

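6.2 Weighting Strategy

Sensible weights direct the model toward the elements that matter:

- Strengthen key elements: give the subject and important details weights above 1.0
- Weaken secondary elements: give them weights below 1.0; weights are multipliers rather than negative numbers, and anything that should not appear at all belongs in the negative prompt
- Keep the overall balance: avoid very high weights, which tend to oversaturate or distort the image

Example: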

(a beautiful castle:1.3), (in a lush green forest:1.1), 
(blurry background:0.8), (low quality:0.5)

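6.3 Using Dynamic Prompts

Dynamic prompts shift the focus during generation:

- Gradual change: move from one concept to another partway through sampling
- Alternation: use different elements on different steps
- Progress-based control: change what the prompt emphasizes according to how far generation has progressed

Example: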

a futuristic city, [day time:night time:0.6], 
[empty streets:crowded streets:0.8]
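
For the alternation syntax [option1|option2|option3], the option in effect at a given step can be sketched as follows. This assumes the options rotate once per sampling step, starting with the first option at step 1; alternate_option is a hypothetical helper, not a WebUI function.

```python
def alternate_option(options, step):
    """Option active at a 1-based sampling step, assuming one rotation per step."""
    return options[(step - 1) % len(options)]


for step in range(1, 6):
    print(step, alternate_option(["cat", "dog", "fox"], step))
```

Because the text changes on every step, alternation tends to blend the options into one subject, whereas [from:to:when] switches once.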

7. Practical Examples

7.1 Landscape Generation

Here is a concrete example of how a prompt can be optimized:

# Basic prompt
basic_prompt = "landscape, mountains, lake, trees"

# Optimized prompt
optimized_prompt = """
(masterpiece:1.2), (best quality:1.2), 
breathtaking landscape, majestic snow-capped mountains, 
crystal clear lake reflecting the sky, 
dense pine forests, fluffy white clouds, 
golden hour lighting, warm sunlight, 
serene atmosphere, peaceful scene, 
highly detailed, sharp focus, 
professional photography, wide-angle shot
"""

# Negative prompt
negative_prompt = """
(low quality:1.3), (worst quality:1.3), 
blurry, distorted proportions, 
extra limbs, deformed hands, 
bad anatomy, cropped image, 
watermark, signature, text, 
ugly, duplicate, morbid, mutilated, 
poorly drawn face, mutation, 
disfigured, mutated hands, 
poorly drawn hands, blurry face
"""

7.2 Portrait Optimization

Portraits need particular attention to facial features and detail:

portrait_prompt = """
(highly detailed face:1.3), (detailed eyes:1.2), 
(beautiful detailed eyes:1.2), (detailed lips:1.2),
(portrait of a young woman:1.1), 
intricate detailed face, perfect lighting, 
sharp focus, professional headshot, 
8k resolution, award winning photograph,
soft studio lighting, neutral background,
symmetrical facial features, natural expression
"""

portrait_negative = """
(deformed iris, deformed pupils:1.3), 
(asymmetric eyes:1.2), (crooked nose:1.2),
(deformed face:1.3), (ugly face:1.3),
(multiple heads:1.3), (mutated hands:1.2),
(long neck:1.2), (mutated fingers:1.2),
extra fingers, fewer fingers, 
bad hands, bad arms, bad legs,
watermark, logo, text, 
signature, username
"""

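8. Performance Optimization

8.1 Prompt Caching

When processing a batch, WebUI caches tokenization results so that identical prompt lines in the same call are tokenized only once: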

def process_texts(self, texts):
    """
    Tokenize a batch of texts, reusing the result for repeated lines
    """
    token_count = 0

    cache = {}
    batch_chunks = []
    for line in texts:
        if line in cache:
            chunks = cache[line]
        else:
            chunks, current_token_count = self.tokenize_line(line)
            token_count = max(current_token_count, token_count)

            cache[line] = chunks

        batch_chunks.append(chunks)

    return batch_chunks, token_count
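
The effect of the cache can be observed with a stand-in tokenizer that records how often it is called. This is a minimal sketch of the same pattern; process_with_cache and fake_tokenize are hypothetical names, and whitespace splitting stands in for real tokenization.

```python
def process_with_cache(lines, tokenize):
    """Tokenize each distinct line once and reuse the result for repeats."""
    cache = {}
    results = []
    for line in lines:
        if line not in cache:
            cache[line] = tokenize(line)
        results.append(cache[line])
    return results


calls = []


def fake_tokenize(line):
    calls.append(line)   # record every real tokenization
    return line.split()


out = process_with_cache(["a cat", "a dog", "a cat"], fake_tokenize)
print(len(out), calls)
```

The cache lives only for the duration of one call, which fits batch generation, where the same prompt line is often repeated for every image in the batch.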

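8.2 Token Optimization

Using tokens economically improves efficiency:

- Avoid redundancy: remove needlessly repeated words
- Merge similar concepts: replace several words with one more precise word
- Control total length: prompts are encoded in chunks of 75 tokens, so every additional chunk means another encoder pass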
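
Removing repeated tags can be automated for simple prompts. The sketch below is a hypothetical helper, not part of WebUI, and it assumes a plain comma-separated tag list; it would split weighted groups that contain commas, such as (a, b:1.2), so those should be handled separately.

```python
def dedupe_tags(prompt: str) -> str:
    """Drop empty and repeated comma-separated tags, keeping the first occurrence."""
    seen = set()
    kept = []
    for tag in prompt.split(","):
        tag = tag.strip()
        key = tag.lower()  # compare case-insensitively
        if tag and key not in seen:
            seen.add(key)
            kept.append(tag)
    return ", ".join(kept)


print(dedupe_tags("sharp focus, detailed, Sharp Focus, , detailed"))
```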

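9. Advanced Features

9.1 Composable Diffusion

WebUI supports combining several independent concepts with the AND operator. Each sub-prompt may carry its own weight, as in a cat :1.2 AND a dog: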

def get_multicond_prompt_list(prompts: SdConditioning | list[str]):
    """
    Split composite prompts into weighted sub-prompts
    """
    res_indexes = []

    prompt_indexes = {}
    prompt_flat_list = SdConditioning(prompts)
    prompt_flat_list.clear()

    for prompt in prompts:
        # Split the prompt on AND
        subprompts = re_AND.split(prompt)

        indexes = []
        for subprompt in subprompts:
            match = re_weight.search(subprompt)
            text, weight = match.groups() if match is not None else (subprompt, 1.0)
            weight = float(weight) if weight is not None else 1.0

            index = prompt_indexes.get(text, None)
            if index is None:
                index = len(prompt_flat_list)
                prompt_flat_list.append(text)
                prompt_indexes[text] = index

            indexes.append((index, weight))

        res_indexes.append(indexes)

    return res_indexes, prompt_flat_list, prompt_indexes
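
The function above depends on two regular expressions, re_AND and re_weight, that the excerpt does not show. The following self-contained sketch uses simplified patterns of its own (assumptions, not the WebUI expressions) to show the splitting and the default weight of 1.0; split_and is a hypothetical name.

```python
import re

RE_AND = re.compile(r"\bAND\b")  # uppercase AND as a whole word
# Assumed pattern: text, then an optional ":<number>" suffix, then trailing whitespace.
RE_WEIGHT = re.compile(r"(.*?)(?:\s*:\s*([-+]?\d*\.?\d+))?\s*", re.S)


def split_and(prompt: str) -> list:
    """Return [(sub_prompt, weight), ...] for a prompt joined with AND."""
    parts = []
    for sub in RE_AND.split(prompt):
        match = RE_WEIGHT.fullmatch(sub)
        text, weight = match.group(1), match.group(2)
        parts.append((text.strip(), float(weight) if weight is not None else 1.0))
    return parts


print(split_and("a cat :1.2 AND a dog AND a penguin :2.2"))
```

Only the uppercase keyword splits the prompt, so ordinary text such as "salt and pepper" stays in one piece.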

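9.2 Prompt Variations

Small changes to a prompt can produce interesting variants. Note that the {a|b|c} syntax below is not part of core WebUI; it is provided by extensions such as Dynamic Prompts, which pick one option from each group per generation:

# Combine random seeds with prompt variation to generate several similar but distinct images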
variant_prompt = """
(masterpiece), (best quality), 
a cat sitting on a windowsill, 
sunlight streaming through window, 
{fluffy|fluffy fur|luxurious fur}, 
{curious|alert|watchful} expression, 
wooden windowsill, cozy room
"""
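
To see what such a template expands to, the following sketch enumerates every combination of a non-nested {a|b|c} template in plain Python. It is independent of any extension; expand_variants is a hypothetical helper, and nested groups are not handled.

```python
import itertools
import re

GROUP_RE = re.compile(r"\{([^{}]*)\}")  # one non-nested {a|b|c} group


def expand_variants(template: str) -> list:
    """Return every prompt obtained by picking one option from each group."""
    groups = [m.group(1).split("|") for m in GROUP_RE.finditer(template)]
    prompts = []
    for choice in itertools.product(*groups):
        picks = iter(choice)
        # Replace the groups from left to right with the chosen options.
        prompts.append(GROUP_RE.sub(lambda _m: next(picks), template))
    return prompts


for prompt in expand_variants("a {red|blue} {cat|dog}"):
    print(prompt)
```

The number of prompts is the product of the group sizes, so the two three-option groups in the template above would yield nine variants.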

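10. Troubleshooting and Debugging

10.1 Common Problems

The result does not match expectations:
- Check that the prompt is clear and specific
- Adjust keyword weights
- Add more specific descriptions
Low quality:
- Check that the negative prompt is sufficient
- Increase the weight of quality-related keywords
- Adjust the sampler and the step count
Missing elements:
- Check the prompt order and weights
- Look for conflicting descriptions
- Try reordering the prompt

10.2 Debugging Tips

- Add elements step by step: start from a simple prompt and add detail gradually
- Isolate variables: change one factor at a time and observe the effect
- Use comparison tests: compare the results of different prompts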

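Summary

Prompt engineering is a core skill for using Stable Diffusion WebUI. With an understanding of how prompt parsing, text encoding, and the attention mechanism work, users can control image generation more precisely. This article covered the topic from basic syntax to advanced techniques.

Successful prompt engineering requires:

- A solid understanding of how the model works
- Continued practice and accumulated experience
- Fluency with the various prompt techniques
- Creativity and a willingness to experiment

As the technology develops, prompt engineering keeps evolving, and new prompt techniques and applications are likely to appear.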

References

Stable Diffusion WebUI GitHub repository: https://github.com/AUTOMATIC1111/stable-diffusion-webui
CLIP paper: https://arxiv.org/abs/2103.00020
Textual Inversion paper: https://arxiv.org/abs/2208.01618
Composable Diffusion paper: https://arxiv.org/abs/2206.01714
Lark parser: https://github.com/lark-parser/lark