Building a RAG Agent (5): Semantic Guardrails for Filtering Out Unhelpful Messages

Article link: https://blog.youkuaiyun.com/tilblackout/article/details/149550550

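In this article we take a closer look at semantic guardrailing: using an embedding model as a language backbone and training a classifier on top of it to filter out messages that are unhelpful, or even harmful, for a chatbot to answer. We explain the advantages of this approach over autoregression-guided filtering and walk through the concrete steps of building a semantic guardrail, using synthetic data generated for the task.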

Contents

1 Introduction
2 Building a Semantic Guardrail with an Embedding Model
3 Summary

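1 Introduction

In the previous articles we covered the basics of embedding models. We now use that knowledge to explore a concept that matters a great deal for model quality: semantic guardrails. The core goal is to use embeddings to identify and filter out messages that a chatbot should not respond to, so that the model's output stays both safe and useful.

Environment Setup

Before we start, we need to set up the development environment.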

## Needed in Colab; not required in the course environment
# %pip install -qq langchain langchain-nvidia-ai-endpoints gradio

## If you hit a typing-extensions problem in Colab,
## restart your runtime and try again
from langchain_nvidia_ai_endpoints._common import NVEModel

from getpass import getpass
import requests
import os

hard_reset = False  ## <-- Set to True if you want to reset your NVIDIA_API_KEY
while "nvapi-" not in os.environ.get("NVIDIA_API_KEY", "") or hard_reset:
    try:
        response = requests.get("http://docker_router:8070/get_key").json()
        assert response.get('nvapi_key')
    except Exception:
        ## Fall back to a manual prompt when the key router is unreachable
        response = {'nvapi_key' : getpass("NVIDIA API Key: ")}
    os.environ["NVIDIA_API_KEY"] = response.get("nvapi_key")
    try:
        requests.post("http://docker_router:8070/set_key/", json={'nvapi_key' : os.environ["NVIDIA_API_KEY"]}).json()
    except Exception:
        pass
    hard_reset = False
    if "nvapi-" not in os.environ.get("NVIDIA_API_KEY", ""):
        print("[!] API key assignment failed. Make sure it starts with `nvapi-`, as generated on the model page.")

print(f"Retrieved NVIDIA_API_KEY beginning with \"{os.environ.get('NVIDIA_API_KEY')[:9]}...\"")
NVEModel().available_models

from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings, ChatNVIDIA

## TODO: Choose your embedding model
embedder = NVIDIAEmbeddings(model="nvolveqa_40k", model_type=None)

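2 Building a Semantic Guardrail with an Embedding Model

In the next notebook we will use higher-level tools that rely on our embedding model under the hood. Before that, while the raw approach is still fresh in our minds, there are a few important concepts worth exploring.

Specifically, we can use the embedding model as the backbone of a semantic guardrail, a key component of a production-ready model. We can use embeddings to filter out messages that are unlikely to be useful for our chatbot to answer.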

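2.1 Advantages over Autoregression-Guided Filtering

In earlier articles we used LLMs to assist with complex internal reasoning, so could we use one for filtering as well? Specifically, you might think of asking an LLM to judge the question and then branching with RunnableBranch. This is entirely feasible, but the approach has pros and cons that deserve careful consideration:

Pros: Designing the internal system through prompt engineering to constrain the conversation flow is relatively quick and easy. You could even write a routine that takes examples of good and bad questions and produces a prompt that reliably returns one of two states, good or bad.

Cons: Autoregressive routing often incurs latency or resource overhead that can be unacceptable. For example, you may want to integrate a semantic guardrail mechanism behind the scenes that both prevents harmful output and steers problematic input in a safe, predictable direction. Your autoregressive options are as follows:

You could use a relatively small instruction-tuned model as a zero-shot classifier and hope that its performance stays consistent. To do so, you may also need to convert the input into the canonical (standard) form on which the model performs best.
You could also fine-tune a small autoregressive LLM for your task. This requires some synthetic data curation and possibly extra compute budget for a one-time fine-tuning run, but it at least lets a smaller model mimic, by default, the performance of a large, prompt-engineered model.

Although these are all good options, for this particular use case a suitable embedding model, some data curation, and a review of deep-learning fundamentals solve the problem well.

Specifically, we can use an embedding model as the language backbone and train a classifier on top of it to predict probabilities. We will explore this idea and address new challenges as they come up.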
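
To make the "classifier on top of an embedding" idea concrete, here is a minimal sketch in plain Python. The three-dimensional vectors, the weights, and the bias are all made up for illustration; in the real pipeline the vectors come from the embedding model and the parameters are learned from data.

```python
import math

def sigmoid(z: float) -> float:
    """Squash a real-valued score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(embedding, weights, bias):
    """Probability that a message is 'good', given its embedding vector."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return sigmoid(z)

# Hypothetical 3-dimensional "embeddings" and hand-picked head parameters
toy_weights = [2.0, -1.0, 0.5]
toy_bias = -0.25
on_topic = [0.9, 0.1, 0.3]
off_topic = [0.1, 0.9, 0.2]

p_on = predict_proba(on_topic, toy_weights, toy_bias)
p_off = predict_proba(off_topic, toy_weights, toy_bias)
print(f"on-topic: {p_on:.3f}, off-topic: {p_off:.3f}")
```

The embedding model does the heavy lifting of turning text into a vector; the head only has to learn a cheap decision rule over that vector, which is why it can be so small.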

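2.2 Generating Synthetic Data

To start building a semantic guardrail, we obviously first need to set some goals.

Assumption: Suppose we want to build an enterprise chatbot that should mainly respond to discussions about technology and company-related details. You may feel that this definition is rather narrow and has some obvious flaws, and you would be right. Still, it is a good start, and the resulting artifacts extend conceptually to more realistic problem settings.
Plan: To help identify the kinds of entries we are dealing with, a good approach is to generate some representative inputs that define what counts as good and what counts as poor. We can then observe how our embedding model treats these examples and design a solution accordingly.

Unfortunately, we do not have any real data, so we have to resort to synthetic data generation; using a large model to generate a dataset for a smaller model is a workable approach. Here we simply prompt an LLM for a few batches of sample questions: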

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import ChatMessage
from operator import itemgetter

## Useful method for mistral, which is currently tuned to output numbered entries
def EnumParser(*idxs):
    '''Extracts values from a mistral model that outputs numbered entries'''
    idxs = idxs or [slice(0, None, 1)]
    entry_parser = lambda v: v if (' ' not in v) else v[v.index(' '):]
    out_lambda = lambda x: [entry_parser(v).strip() for v in x.split("\n")]
    return StrOutputParser() | RunnableLambda(lambda x: itemgetter(*idxs)(out_lambda(x)))

instruct_llm = ChatNVIDIA(model="mixtral_8x7b") | EnumParser()

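gen_prompt = {'input' : lambda x:x} | ChatPromptTemplate.from_messages([('user',
    "Please generate 20 representative conversations that would be considered {input}."
    " Make sure all of the questions are very different in phrasing and content."
    " Do not respond to the questions; just list them. Make sure all of your outputs are numbered."
    " Example Response: \n1. <question>\n2. <question>\n3. <question>\n..."
)])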

## Some examples that reference NVIDIA directly
responses_1 = (gen_prompt | instruct_llm).invoke(
    "reasonable for an NVIDIA document chatbot to be able to answer."
    " Vary the context to technology, research, deep learning, language modeling, gaming, etc."
)
print("Reasonable NVIDIA Responses:", *responses_1, "", sep="\n")

## And some that do not
responses_2 = (gen_prompt | instruct_llm).invoke(
    "reasonable for a tech document chatbot to be able to answer. Make sure to vary"
    " the context to technology, research, gaming, language modeling, graphics, etc."
)
print("Reasonable non-NVIDIA Responses:", *responses_2, "", sep="\n")

## Feel free to try your own generations instead
responses_3 = (gen_prompt | instruct_llm).invoke(
    "unreasonable for an NVIDIA document chatbot to answer,"
    " as it is irrelevant and will not be useful to answer (though not inherently harmful)."
)
print("Irrelevant Responses:", *responses_3, "", sep="\n")

responses_4 = (gen_prompt | instruct_llm).invoke(
    "unreasonable for an NVIDIA document chatbot to answer,"
    " as it will reflect negatively on NVIDIA."
)
print("Harmful non-NVIDIA", *responses_4, "", sep="\n")

good_responses = responses_1 + responses_2
poor_responses = responses_3 + responses_4
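
The line-splitting logic inside EnumParser can be examined on its own, without LangChain or a model call. The sketch below reuses the same two expressions on a made-up sample string; parse_numbered_lines is a hypothetical helper name used only here.

```python
def parse_numbered_lines(text: str) -> list:
    """Drop everything up to the first space on each line, then trim whitespace."""
    def entry_parser(v: str) -> str:
        return v if " " not in v else v[v.index(" "):]
    return [entry_parser(v).strip() for v in text.split("\n")]

# Made-up model output: two numbered questions, a blank line, and a stray word
sample = "1. What is CUDA?\n2. How does DLSS work?\n\nThanks"
parsed = parse_numbered_lines(sample)
print(parsed)

# Optional cleanup (an addition, not part of the original parser): drop empty entries
non_empty = [entry for entry in parsed if entry]
```

Two properties are worth noticing: blank lines survive as empty strings, and any unnumbered line that contains a space loses its first word. If the model output is noisy, filtering out empty entries before embedding is a reasonable precaution.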

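2.3 Generating More Embeddings, Faster

Now it is time to embed all of the generated data into semantic vectors. So far we have used the synchronous embed_query and embed_documents methods to embed documents, which is sufficient for smaller or more on-the-fly applications. When we need to embed a large amount of data at once, however, this can be quite inefficient.

Here, we can use asynchronous calls to let multiple embedding operations run at the same time. Note that concurrency cannot be scaled without limit, and you should look into it more deeply before manually integrating it into a larger deployment.

2.3.1 Timing Solutions

The %%time utility does not work for asynchronous solutions in a notebook, so below is a scope-based timing utility. We define it and then measure how long it takes to embed the first 10 documents: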

import time
import numpy as np

class Timer():
    '''Useful timing utilities (%%time is great, but doesn't work for async)'''
    def __enter__(self):
        self.start = time.perf_counter()

    def __exit__(self, *args, **kwargs):
        elapsed = time.perf_counter() - self.start
        print("\033[1m" + f"Executed in {elapsed:0.2f} seconds." + "\033[0m")

with Timer():
    good_embeds = [embedder.embed_query(x) for x in good_responses[:10]]

print("Shape:", np.array(good_embeds).shape)

Output:
Executed in 6.08 seconds.
Shape: (10, 1024)

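Asynchronous Embeddings

Note that the embedding queries above take quite a while to execute. If we had direct access to the embedding model, we could speed this up with batching. However, the query router in the cloud already does this automatically behind the scenes and, for fairness and consistency, limits users to a single query per request.

Cloud embedding server: this means the embedding model you are using is hosted on a remote server (for example NVIDIA's API, OpenAI's API, or another model provider); you are not running the model locally.

In other words, it is not that the service cannot embed any faster; rather, our code waits serially for each embedding to finish every time it issues an embed_query call.

When we need to embed a large number of documents at once, it is usually better to issue all of the requests concurrently and asynchronously and then wait for the results to come back. Done properly, this speeds up the local process considerably while having little impact on the LLM service.

This assumes the query router has so-called in-flight batching enabled, where multiple requests are merged into a batch and fed through the neural network together.

We can try the standard aembed_<...> interface provided by LangChain to create some coroutines, a mechanism for concurrent execution: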

with Timer():
    good_embed_gens = [embedder.aembed_query(query) for query in good_responses[10:20]]
print(good_embed_gens[0])

## Note: once a coroutine is defined, either run it or close it.
## Destroying an unclosed coroutine object by accident triggers a warning.
for gen in good_embed_gens:
    gen.close()

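These coroutines can be awaited individually with await, or run concurrently with asyncio.gather. With the latter, all coroutines run at the same time and the results are aggregated once the last one finishes.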

import asyncio

with Timer():
    tasks = [embedder.aembed_query(query) for query in good_responses[10:20]]
    good_embeds2 = await asyncio.gather(*tasks)

print("Shape:", np.array(good_embeds2).shape)

Output:
Executed in 1.19 seconds.
Shape: (10, 1024)

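Compared with the earlier serial version, the running time of this concurrent version is roughly that of the slowest single embedding request.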
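
The same effect can be reproduced without any embedding service. In the sketch below, fake_embed is a hypothetical stand-in that sleeps instead of calling a model, and the delays are arbitrary. It is written as a standalone script, so it uses asyncio.run; inside a notebook, where an event loop is already running, you would await the coroutines directly instead.

```python
import asyncio
import time

async def fake_embed(text: str, delay: float) -> list:
    """Stand-in for an async embedding call: sleeps instead of hitting a service."""
    await asyncio.sleep(delay)
    return [float(len(text))]

async def run_serial(jobs):
    # Each call must finish before the next one starts
    return [await fake_embed(text, delay) for text, delay in jobs]

async def run_concurrent(jobs):
    # All calls are in flight at once; results come back in input order
    return await asyncio.gather(*(fake_embed(text, delay) for text, delay in jobs))

def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

jobs = [("a", 0.05), ("bb", 0.10), ("ccc", 0.15)]
serial_out, serial_s = timed(run_serial(jobs))
concurrent_out, concurrent_s = timed(run_concurrent(jobs))
print(f"serial: {serial_s:.2f}s, concurrent: {concurrent_s:.2f}s")
```

The serial time is close to the sum of the delays, while the concurrent time is close to the largest single delay.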

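Limiting Concurrency

Although the asynchronous version is clearly much faster than the synchronous one, more concurrency is not always better. Too many concurrent tasks can cause service slowdowns, dropped connections, or even resource exhaustion.

In practice, we should use a control structure to limit concurrency. For example, an asyncio Semaphore can set an upper bound on the number of concurrent tasks: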

import asyncio
from functools import partial
from typing import Awaitable, Callable, List

async def embed_with_semaphore(
    text : str,
    embed_fn : Callable[[str], Awaitable[List[float]]],
    semaphore : asyncio.Semaphore
) -> List[float]:
    ## Wait for a free slot, then run the wrapped embedding call
    async with semaphore:
        return await embed_fn(text)

## Create a new embedding method that caps the maximum number of concurrent requests
embed = partial(
    embed_with_semaphore,
    embed_fn = embedder.aembed_query,
    semaphore = asyncio.Semaphore(value=10)  ## <- feel free to try different concurrency limits
)

## This is again just a coroutine constructor, so it takes very little time
tasks = [embed(query) for query in good_responses[20:30]]

with Timer():
    good_embeds_3 = await asyncio.gather(*tasks)

Output:
Executed in 1.01 seconds.
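
To check that a semaphore really caps the number of in-flight calls, the following self-contained sketch counts how many simulated requests are active at once. The task count, the limit, and the sleep duration are arbitrary values chosen for illustration; as a standalone script it uses asyncio.run, whereas a notebook would await main() directly.

```python
import asyncio

async def limited_call(i, semaphore, state):
    """Simulated request that records how many calls are active at once."""
    async with semaphore:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)   # simulated request latency
        state["active"] -= 1
        return i

async def main(n_tasks=20, limit=4):
    semaphore = asyncio.Semaphore(limit)   # created inside the running event loop
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(limited_call(i, semaphore, state) for i in range(n_tasks))
    )
    return results, state["peak"]

results, peak = asyncio.run(main())
print(f"completed {len(results)} calls, peak concurrency = {peak}")
```

All 20 coroutines are scheduled immediately, but at most 4 ever hold the semaphore at the same time, and asyncio.gather still returns the results in submission order.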

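Embedding the Remaining Responses

Now let's embed the rest of the documents as well. We recommend continuing to limit concurrency (the system raises an exception on failure) and checking whether the task finishes in an acceptable amount of time.

In our tests, a concurrency limit of 10 was a good operating point; raising it further brought no noticeable gain.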

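## Note: in our tests, the gains beyond value=10 were small...
with Timer():
    good_tasks = [embed(query) for query in good_responses]
    poor_tasks = [embed(query) for query in poor_responses]
    ## Run all of the coroutines concurrently, then split the results back into the two groups
    all_embeds = await asyncio.gather(*good_tasks, *poor_tasks)
    good_embeds = all_embeds[:len(good_tasks)]
    poor_embeds = all_embeds[len(good_tasks):]

print("Good Embeds Shape:", np.array(good_embeds).shape)
print("Poor Embeds Shape:", np.array(poor_embeds).shape)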

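2.4 Confirming Semantic Density

Our reason for generating these embeddings was to support semantic filtering. To check whether the embeddings cluster semantically, we can use classic dimensionality-reduction methods such as principal component analysis (PCA) or t-SNE.

These methods map high-dimensional data into two dimensions while trying to preserve as much of the original statistical structure as possible, which makes them well suited to visually inspecting semantic clusters: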

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import numpy as np

# Combine the embeddings from all groups
embeddings = np.vstack([good_embeds, poor_embeds])

# Label for each point, one group per generated batch
labels = np.array([0]*len(responses_1) + [1]*len(responses_2) + [4]*len(responses_3) + [5]*len(responses_4))

# Run PCA
pca = PCA(n_components=2)
embeddings_pca = pca.fit_transform(embeddings)

# Run t-SNE
tsne = TSNE(n_components=2, random_state=0)
embeddings_tsne = tsne.fit_transform(embeddings)

# Plot PCA
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(embeddings_pca[:, 0], embeddings_pca[:, 1], c=labels, cmap='viridis', label=labels)
plt.title("PCA of Embeddings")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.colorbar(label='Group')

# Plot t-SNE
plt.subplot(1, 2, 2)
plt.scatter(embeddings_tsne[:, 0], embeddings_tsne[:, 1], c=labels, cmap='viridis', label=labels)
plt.title("t-SNE of Embeddings")
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")
plt.colorbar(label='Group')

plt.show()
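
As a rough illustration of what the PCA step computes, the sketch below projects two synthetic clusters onto their top two principal components using NumPy's SVD. The data, cluster sizes, and dimensionality are invented for illustration and merely stand in for the real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic 8-dimensional clusters standing in for "good" and "poor" embeddings
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(20, 8))
cluster_b = rng.normal(loc=1.0, scale=0.1, size=(20, 8))
x = np.vstack([cluster_a, cluster_b])

# PCA: center the data, then project onto the top right-singular vectors
centered = x - x.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ vt[:2].T

print("Projected shape:", projected.shape)
print("Cluster means on component 1:",
      projected[:20, 0].mean(), projected[20:, 0].mean())
```

If the groups are well separated in the original space, they land on opposite sides of the first component, which is the kind of visual separation we hope to see in the real plots.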

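2.5 Training Our Classifier

We want to use these embeddings to train a classifier that decides whether a given response is good. Given that our embedding vectors are already strong, even a simple two-layer neural network is enough.

Training a Deep Classifier

We can quickly build a simple network with the Keras framework. This version is compatible with both Keras 2 and Keras 3: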

with Timer():
    print("Importing Keras for the first time")
    import keras
    from keras import layers

def train_model_neural_network(class0, class1):
    ## Classic deep learning training loop. Train until it converges.
    model = keras.Sequential([
        layers.Dense(64, activation='tanh'),
        layers.Dense(1, activation='sigmoid'),
    ])
    ## The embedding model is frozen, so a large learning rate can be used for fast convergence
    model.compile(
        optimizer = keras.optimizers.Adam(learning_rate = 1),
        loss = [keras.losses.BinaryCrossentropy(from_logits=False)],
        metrics = [keras.metrics.BinaryAccuracy()],
    )
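    ## Mini-batch stochastic gradient descent, so the data is repeated for several passes

    reps_per_batch = 64*5  ## <- Equivalent to raising the number of "epochs", but with fewer log lines
    epochs = 2             ## <- One epoch is basically enough; two are used only so the loss update is visible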
    x = np.array((class0 + class1) * reps_per_batch)
    y = np.array(([0]*len(class0) + [1]*len(class1)) * reps_per_batch)
    model.fit(x, y, epochs=epochs, batch_size=64, validation_split=.5)
    return model

with Timer():
    model1 = train_model_neural_network(poor_embeds, good_embeds)

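Training a Simpler Classifier

Because the embedding model already has strong semantic expressiveness, we can also use a much simpler model such as logistic regression. Note that logistic regression has no closed-form solution: scikit-learn fits it with an iterative convex solver (L-BFGS by default). On a dataset this small the fit typically converges very quickly, with no epochs or learning rate to tune.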

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_logistic_regression(class0, class1):
    ## Logistic regression, fit with scikit-learn's default iterative solver.
    x = class0 + class1
    y = [0] * len(class0) + [1] * len(class1)
    x0, x1, y0, y1 = train_test_split(x, y, test_size=0.5, random_state=42)
    model = LogisticRegression()
    model.fit(x0, y0)
    print(np.array(x0).shape)
    print("Training Results:", model.score(x0, y0))
    print("Testing Results:", model.score(x1, y1))
    return model

with Timer():
    model2 = train_logistic_regression(poor_embeds, good_embeds)

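2.6 Integrating into the Chatbot

Now that we have an embedding model and a classifier, we can integrate them into the chatbot's event loop. Note that, to preserve the user experience, we do not reject poor questions outright; instead we modify the system prompt to politely limit what the chatbot answers.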

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableBranch, RunnableLambda
from langchain_core.runnables.passthrough import RunnableAssign
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings

import gradio as gr
import numpy as np

# Assumes you have already trained a classifier named model2
# and that an embedder is available (it is re-initialized here)
embedder = NVIDIAEmbeddings(model="nvolveqa_40k", model_type="query")
chat_model = ChatNVIDIA(model="llama2_13b") | StrOutputParser()

response_prompt = ChatPromptTemplate.from_messages([("system", "{system}"), ("user", "{input}")])

from functools import partial

def RPrint(preface=""):
    def print_and_return(x, preface=""):
        print(f"{preface}{x}")
        return x
    return RunnableLambda(partial(print_and_return, preface=preface))

# System prompt used when answering normally
good_sys_msg = (
    "You are an NVIDIA chatbot. Please answer their question while representing NVIDIA."
    "  Please help them with their question if it is ethical and relevant."
)
# System prompt used when steering away from the topic
poor_sys_msg = (
    "You are an NVIDIA chatbot. Please answer their question while representing NVIDIA."
    "  Their question has been analyzed and labeled as 'probably not useful to answer as an NVIDIA Chatbot',"
    "  so avoid answering if appropriate and explain your reasoning to them. Make your response as short as possible."
)


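# Assumes model2 is the trained logistic regression classifier
# (a Keras model returns probabilities, which would need to be thresholded instead)
def is_good_response(query):
    # Embed the query
    embedding = embedder.embed_query(query)
    # Reshape into the (1, n_features) format the classifier expects
    embedding = np.array(embedding).reshape(1, -1)
    # Predict with the classifier
    result = model2.predict(embedding)[0]
    return bool(result)  # 1 means "good", 0 means "poor"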

# Modified chain that switches the system prompt dynamically
chat_chain = (
    { 'input': lambda x: x }
    | RunnableAssign(dict(
        system = RunnableBranch(
            # If is_good_response returns False, use poor_sys_msg
            ((lambda d: not is_good_response(d["input"])), RunnableLambda(lambda x: poor_sys_msg)),
            # Otherwise use good_sys_msg (the default branch)
            RunnableLambda(lambda x: good_sys_msg)
        )
    )) | response_prompt | chat_model
)


## Gradio components

def chat_stream(message, history):
    buffer = ""
    for token in chat_chain.stream({"input": message}):
        buffer += token
        yield buffer

chatbot = gr.Chatbot(value = [[None, "Hello! I'm your NVIDIA chat agent! Let me answer some questions!"]])
demo = gr.ChatInterface(chat_stream, chatbot=chatbot).queue()

try:
    demo.launch(debug=True, share=True, show_api=False)
    demo.close()
except Exception as e:
    demo.close()
    print(e)
    raise e
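
The branching decision in the chain above can also be expressed as a plain function, which makes it easy to test without a model or a UI. In this sketch, toy_score is a hypothetical keyword-based stand-in for "embed the query and score it with the classifier", and the two prompt strings are shortened placeholders rather than the prompts used above.

```python
GOOD_SYS_MSG = "Please help with the question if it is ethical and relevant."
POOR_SYS_MSG = "The question was flagged as off-topic; decline briefly and explain why."

def toy_score(query: str) -> float:
    """Hypothetical stand-in for embedding the query and calling the classifier."""
    on_topic_words = ("gpu", "cuda", "driver")
    return 0.9 if any(word in query.lower() for word in on_topic_words) else 0.1

def pick_system_message(query: str, score_fn=toy_score, threshold: float = 0.5) -> str:
    """Choose the system prompt from the classifier score for this query."""
    return GOOD_SYS_MSG if score_fn(query) >= threshold else POOR_SYS_MSG

for q in ("How do I install CUDA drivers?", "What should I cook tonight?"):
    print(q, "->", pick_system_message(q))
```

Working with a score and an explicit threshold, rather than a hard 0/1 prediction, leaves room to tune how strict the guardrail is. With a scikit-learn classifier, one possible score_fn would return the positive-class column of predict_proba.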