智能体库 smolagents 框架

一杯水果茶！

于 2025-08-06 00:17:14 发布

阅读量384

点赞数 11

CC 4.0 BY-SA版权

分类专栏：大模型面试必备文章标签： python 人工智能 agent smolagents

本文链接：https://blog.youkuaiyun.com/xiaoyuting999/article/details/149950367

大模型面试必备专栏收录该内容

6 篇文章

订阅专栏

智能体库 smolagents 框架

HF AI Agents Course

智能体库 smolagents 框架

smolagents 是一个专注于 codeAgent 的库，codeAgent 是一种通过代码块执行“操作”，然后通过执行代码“观察”结果的智能体。

smolagents 的关键优势：

简洁性：最小的代码复杂性和抽象层，使框架易于理解、采用和扩展。
灵活的 LLM 支持：通过与 Hugging Face 工具和外部 API 的集成，支持任何 LLM。
代码优先方法：首选支持直接在代码中编写操作的 Code Agents，无需解析并简化工具调用。
HF Hub 集成：与 Hugging Face Hub 无缝集成，允许使用 Gradio Spaces 作为工具。

上面的优势需要结合代码容易理解，先安装 smolagents，pip install smolagents -U，然后就可以用了，看见下面的 CodeAgent 了吗？一行代码就可以构建 agent 实例了，再一行代码就可以让 agent.run() 了，这就是框架的的优势，

from smolagents import CodeAgent, DuckDuckGoSearchTool, InferenceClientModel

agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=InferenceClientModel())

agent.run("Search for the best music recommendations for a party at the Wayne's mansion.")

如果想自定义工具，使用 @tool 装饰器定义一个作为工具的自定义函数，并将其包含在 tools 列表中就 OK 了，easy 吧？

@tool
def your_custom_tool(arg1:str, arg2:int)-> str: #it's import to specify the return type
    #Keep this format for the description / args / args description but feel free to modify the tool
    """A tool that does nothing yet 
    Args:
        arg1: the first argument
        arg2: the second argument
    """
    return "What magic will you build ?"

agent = CodeAgent(tools=[your_custom_tool], model=InferenceClientModel())

想要导入 Python 包，比如实时任务就需要 datetime 模块，smolagents 使用 additional_authorized_imports 来导入 datetime 模块，

import datetime

agent = CodeAgent(tools=[], model=InferenceClientModel(), additional_authorized_imports=['datetime'])

然后就是 smolagents 框架是 Hugging face 家的，所以非常方便与 Hugging face 社区分享完整的智能体，并下载其他人的智能体立即使用，

# 更改为你的用户名和仓库名
agent.push_to_hub('sergiopaniego/AlfredAgent')  # 把你的智能体上传到社区
alfred_agent = agent.from_hub('sergiopaniego/AlfredAgent')  # 从社区下载自己或别人的智能体

什么情况下选择 smolagents 而不是其他框架？轻量级且最小化的解决方案、快速实验 而无需复杂的配置、应用逻辑相对简单。

smolagents 中的智能体作为 多步骤智能体 运行。每个 MultiStepAgent 执行：一次思考、一次工具调用和执行。

论文 Executable Code Actions Elicit Better LLM Agents 研究表明，工具调用型大语言模型直接使用代码工作更有效。

代码智能体（Code agents）是 smolagents 中的默认智能体类型。smolagents 提供了一个轻量级框架，用约 1,000 行代码实现构建代码智能体（code agents）。

代码智能体如何工作？

在这里插入图片描述

上图说明了 CodeAgent.run() 如何操作，遵循在前面提到的 ReAct 框架。smolagents 中智能体的主要抽象是 MultiStepAgent，它作为核心构建块。

CodeAgent 通过一系列步骤执行操作，将现有变量和知识整合到智能体的上下文中，这些内容保存在执行日志中：

系统提示存储在 SystemPromptStep 中，用户查询记录在 TaskStep 中。
然后，执行以下循环：
- 方法 agent.write_memory_to_messages() 将智能体的日志写入大语言模型可读的聊天消息列表中。
- 这些消息发送给 Model，生成补全（completion）。
- 解析补全内容以提取操作，这应该是代码片段，因为使用的是 CodeAgent。
- 执行该操作。
- 将结果记录到内存中的 ActionStep 中。
在每个步骤结束时，如果智能体包含任何函数调用（在 agent.step_callback 中），它们将被执行。

除了使用 CodeAgent 作为主要类型的智能体外，smolagents 还支持 ToolCallingAgent，后者以 JSON 形式编写工具调用。只是 ToolCallingAgent 不再生成可执行代码，而是生成指定工具名称和参数的 JSON 对象，系统随后解析这些指令来执行相应工具。

举个例子，当想要搜索餐饮服务和派对创意时，CodeAgent 会生成并运行如下 Python 代码：

for query in [
    "Best catering services in Gotham City", 
    "Party theme ideas for superheroes"
]:
    print(web_search(f"Search for: {query}"))

而 ToolCallingAgent 则会创建 JSON 结构：

[
    {"name": "web_search", "arguments": "Best catering services in Gotham City"},
    {"name": "web_search", "arguments": "Party theme ideas for superheroes"}
]

该 JSON 结构随后会被用于执行工具调用。

检索智能体 —— 构建智能驱动的 RAG 系统

检索增强生成（Retrieval-Augmented Generation，RAG）系统结合了数据检索和生成模型的能力，以提供上下文感知的响应。

例如，用户的查询会被传递给搜索引擎，检索结果与查询一起提供给模型，模型随后根据查询和检索到的信息生成响应。

智能驱动的 RAG（Retrieval-Augmented Generation）通过将自主智能体与动态知识检索相结合，扩展了传统 RAG 系统。

传统 RAG 系统使用 LLM 根据检索数据回答查询，而智能驱动的 RAG 实现了对检索和生成流程的智能控制，从而提高了效率和准确性。

传统 RAG 系统面临关键限制，例如依赖单次检索步骤，以及过度关注与用户查询的直接语义相似性，这可能会忽略相关信息。
智能驱动的 RAG 通过允许智能体自主制定搜索查询、评估检索结果并进行多次检索步骤，以生成更定制化和全面的输出，从而解决这些问题。

让我们构建一个能够使用 DuckDuckGo 进行网页搜索的简单智能体，该智能体将检索信息并综合响应来回答查询。

from smolagents import CodeAgent, DuckDuckGoSearchTool, InferenceClientModel

# 初始化搜索工具
search_tool = DuckDuckGoSearchTool()

# 初始化模型
model = InferenceClientModel()

agent = CodeAgent(
    model=model,
    tools=[search_tool]
)

# 使用示例
response = agent.run(
    "Search for luxury superhero-themed party ideas, including decorations, entertainment, and catering."
)
print(response)

对于专业任务，自定义知识库非常宝贵。

向量数据库（vector database）是通过专业 ML 模型实现丰富文档表示的集合，能够快速搜索和检索文档。该方法将预定义知识与语义搜索相结合，为活动规划提供上下文感知解决方案。

向量数据库：存储所有文档块的嵌入向量，支持海量数据的高维近邻快速检索。

下面创建从自定义知识库检索派对策划创意的工具。使用 BM25 检索器搜索知识库并返回最佳结果，同时使用 RecursiveCharacterTextSplitter 将文档分割为更小的块以提高搜索效率：

from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from smolagents import Tool
from langchain_community.retrievers import BM25Retriever
from smolagents import CodeAgent, InferenceClientModel

class PartyPlanningRetrieverTool(Tool):
    name = "party_planning_retriever"
    description = "Uses semantic search to retrieve relevant party planning ideas for Alfred’s superhero-themed party at Wayne Manor."
    inputs = {
        "query": {
            "type": "string",
            "description": "The query to perform. This should be a query related to party planning or superhero themes.",
        }
    }
    output_type = "string"

    def __init__(self, docs, **kwargs):
        super().__init__(**kwargs)
        # 对所有分块做倒排索引，运行时根据用户查询（query）快速找出最相关的前 5 条文档块
        self.retriever = BM25Retriever.from_documents(
            docs, k=5  # 检索前 5 个文档
        )

    def forward(self, query: str) -> str:
        assert isinstance(query, str), "Your search query must be a string"

        docs = self.retriever.invoke(
            query,
        )
        return "\nRetrieved ideas:\n" + "".join(
            [
                f"\n\n===== Idea {str(i)} =====\n" + doc.page_content
                for i, doc in enumerate(docs)
            ]
        )

# 模拟派对策划知识库
party_ideas = [
    {"text": "A superhero-themed masquerade ball with luxury decor, including gold accents and velvet curtains.", "source": "Party Ideas 1"},
    {"text": "Hire a professional DJ who can play themed music for superheroes like Batman and Wonder Woman.", "source": "Entertainment Ideas"},
    {"text": "For catering, serve dishes named after superheroes, like 'The Hulk's Green Smoothie' and 'Iron Man's Power Steak.'", "source": "Catering Ideas"},
    {"text": "Decorate with iconic superhero logos and projections of Gotham and other superhero cities around the venue.", "source": "Decoration Ideas"},
    {"text": "Interactive experiences with VR where guests can engage in superhero simulations or compete in themed games.", "source": "Entertainment Ideas"}
]

source_docs = [
    Document(page_content=doc["text"], metadata={"source": doc["source"]})
    for doc in party_ideas
]

# 分割文档以提高搜索效率（把大篇幅的文档切成多个小块）
# 分块后，每个 Document 都有独立的 page_content 与元数据（metadata），方便后续检索命中后，直接把相关块注入到 LLM 的上下文中
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    add_start_index=True,
    strip_whitespace=True,
    separators=["\n\n", "\n", ".", " ", ""],
)
docs_processed = text_splitter.split_documents(source_docs)

# 创建检索工具, 把上述检索逻辑封装成一个“查询→返回文本”的黑盒工具
party_planning_retriever = PartyPlanningRetrieverTool(docs_processed)

# 初始化智能体
agent = CodeAgent(tools=[party_planning_retriever], model=InferenceClientModel())

# 使用示例
response = agent.run(
    "Find ideas for a luxury superhero-themed party, including entertainment, catering, and decoration options."
)

print(response)

增强后的智能体能够：首先检查文档中的相关信息、结合知识库的洞察、在内存中维护对话上下文。

总结，一个典型的 RAG（Retrieval-Augmented Generation）流水线，大致就是这么几个步骤：

构建自定义知识库：收集你想要补充给模型的文本——可以是文档、网页抓取内容、内部报告、产品手册……把这些文本包装成 Document，保留必要的 metadata（比如来源、时间戳、标题等）。
文本分块（Chunking）：为了让检索更高效、又让每块文本在语义上尽量完整，会把大段文本按固定长度（如 500 字）分割，并允许一定重叠（如 50 字），保证上下文连续性。
构建索引：
关键词检索（BM25）：对分块后的文本做倒排索引，根据词频/逆文档频率算相关度。适用于相对小型、对关键词特别敏感的场景。
向量检索（Embedding + 向量数据库）：先用大模型的嵌入接口（如 OpenAI Embeddings）把每个文本块转成向量，再把这些向量存入 FAISS/Chroma/Weaviate 等向量数据库，检索时做高维近邻查找，语义匹配能力更强，尤其擅长「同义替换」「上下文相关」的查询。

查询与检索：用户给出问题或指令（query），系统把它转成关键词或向量，在索引里快速找出最相关的前 k 条文本块。
拼接 Prompt，发送给 LLM：取回的文本块（通常会附带各自的 metadata 说明来源），按一定模板拼接到 Prompt 里。

多智能体系统 —— 不依赖单一智能体，任务分配给具有不同能力的智能体

多智能体系统使专业智能体能够在复杂任务上进行协作，提高模块化、可扩展性和稳健性。不依赖单一智能体，任务分配给具有不同能力的智能体。

一个典型的设置可能包括：

管理智能体（Manager Agent）用于任务委派；
代码解释器智能体（Code Interpreter Agent）用于代码执行；
网络搜索智能体（Web Search Agent）用于信息检索。

其中管理智能体协调代码解释器工具和网络搜索智能体，后者利用像 DuckDuckGoSearchTool 和 VisitWebpageTool 这样的工具来收集相关信息。

在这里插入图片描述
例如，多智能体 RAG 系统可以整合：

网络智能体（Web Agent）用于浏览互联网。
检索智能体（Retriever Agent）用于从知识库获取信息。
图像生成智能体（Image Generation Agent）用于生成视觉内容。

多智能体结构允许在不同子任务之间分离记忆，带来两大好处：每个智能体更专注于其核心任务，因此性能更佳；分离记忆减少了每个步骤的输入 token 数量，从而减少延迟和成本。

举个例子，先定义一个网络智能体，

model = InferenceClientModel(
    "Qwen/Qwen2.5-Coder-32B-Instruct", provider="together", max_tokens=8096
)

web_agent = CodeAgent(
    model=model,
    tools=[
        GoogleSearchTool(provider="serper"),
        VisitWebpageTool(),
        calculate_cargo_travel_time,
    ],
    name="web_agent",
    description="Browses the web to find information",
    verbosity_level=0,
    max_steps=10,
)

管理智能体需要进行一些较重的思考工作，所以给它更强大的模型 DeepSeek-R1，

manager_agent = CodeAgent(
    model=InferenceClientModel("deepseek-ai/DeepSeek-R1", provider="together", max_tokens=8096),
    tools=[calculate_cargo_travel_time],
    managed_agents=[web_agent],  # 管理的 agent 在这里
    additional_authorized_imports=[
        "geopandas",
        "plotly",
        "shapely",
        "json",
        "pandas",
        "numpy",
    ],
    planning_interval=5,
    verbosity_level=2,
    final_answer_checks=[check_reasoning_and_plot],
    max_steps=15,
)

让我们检查这个团队是什么样子：

manager_agent.visualize()

这将生成类似于下面的内容，帮助我们理解智能体和使用的工具之间的结构和关系：

CodeAgent | deepseek-ai/DeepSeek-R1
├── ✅ Authorized imports: ['geopandas', 'plotly', 'shapely', 'json', 'pandas', 'numpy']
├── 🛠️ Tools:
│   ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
│   ┃ Name                        ┃ Description                           ┃ Arguments                             ┃
│   ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│   │ calculate_cargo_travel_time │ Calculate the travel time for a cargo │ origin_coords (`array`): Tuple of     │
│   │                             │ plane between two points on Earth     │ (latitude, longitude) for the         │
│   │                             │ using great-circle distance.          │ starting point # ...                  │
│   │ final_answer                │ Provides a final answer to the given  │ answer (`any`): The final answer to   │
│   │                             │ problem.                              │ the problem                           │
│   └─────────────────────────────┴───────────────────────────────────────┴───────────────────────────────────────┘
└── 🤖 Managed agents:
    └── web_agent | CodeAgent | Qwen/Qwen2.5-Coder-32B-Instruct
        ├── ✅ Authorized imports: []
        ├── 📝 Description: Browses the web to find information
        └── 🛠️ Tools:
            ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
            ┃ Name                        ┃ Description                       ┃ Arguments                         ┃
            ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
            │ web_search                  │ Performs a google web search for  │ query (`string`): The search      │
            │                             │ your query then returns a string  │ query to perform.                 │
            │                             │ of the top search results.        │ filter_year (`integer`):          │
            │                             │                                   │ Optionally restrict results to a  │
            │                             │                                   │ certain year                      │
            │ final_answer                │ Provides a final answer to the    │ answer (`any`): The final answer  │
            │                             │ given problem.                    │ to the problem                    │
            └─────────────────────────────┴───────────────────────────────────┴───────────────────────────────────┘

视觉和浏览器智能体

赋予智能体视觉能力对于超越文本处理的任务至关重要。网页浏览、文档理解等现实场景都需要解析丰富的视觉内容。smolagents 内置支持视觉语言模型（VLMs），使智能体能够有效处理图像信息。

该方法在智能体启动时通过 task_images 参数传入图像，智能体在执行过程中持续处理这些图像。

from PIL import Image
import requests
from io import BytesIO

image_urls = [
    "https://upload.wikimedia.org/wikipedia/commons/e/e8/The_Joker_at_Wax_Museum_Plus.jpg", # 小丑图像
    "https://upload.wikimedia.org/wikipedia/en/9/98/Joker_%28DC_Comics_character%29.jpg" # 小丑图像
]

images = []
for url in image_urls:
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" 
    }
    response = requests.get(url,headers=headers)
    image = Image.open(BytesIO(response.content)).convert("RGB")
    images.append(image)

完成图像加载后，智能体将判断访客身份：究竟是超级英雄（Wonder Woman）还是反派角色（The Joker）。

from smolagents import CodeAgent, OpenAIServerModel

model = OpenAIServerModel(model_id="gpt-4o")

# 实例化智能体
agent = CodeAgent(
    tools=[],
    model=model,
    max_steps=20,
    verbosity_level=2
)

response = agent.run(
    """
    Describe the costume and makeup that the comic character in these photos is wearing and return the description.
    Tell me if the guest is The Joker or Wonder Woman.
    """,
    images=images
)

下面的运行结果指出，揭示了这是 Joker，

    {
        'Costume and Makeup - First Image': (
            'Purple coat and a purple silk-like cravat or tie over a mustard-yellow shirt.',
            'White face paint with exaggerated features, dark eyebrows, blue eye makeup, red lips forming a wide smile.'
        ),
        'Costume and Makeup - Second Image': (
            'Dark suit with a flower on the lapel, holding a playing card.',
            'Pale skin, green hair, very red lips with an exaggerated grin.'
        ),
        'Character Identity': 'This character resembles known depictions of The Joker from comic book media.'
    }

我们知道，smolagents 中的智能体基于 MultiStepAgent 类，该类是 ReAct 框架的抽象。在此方法中，图像是在执行过程中动态添加到智能体的记忆中的。
在这里插入图片描述

接下来 构建一个探索网络、搜索潜在访客详情并检索验证信息的智能体，我们需要为智能体提供一组新的工具。此外，我们将使用 Selenium 和 Helium，这些是浏览器自动化工具。让我们安装所需的工具 pip install "smolagents[all]" helium selenium python-dotenv，

我们需要一组专为浏览设计的智能体工具，例如“search_item_ctrl_f”、“go_back”和“close_popups”。这些工具允许智能体像浏览网页的人一样行事。

@tool
def search_item_ctrl_f(text: str, nth_result: int = 1) -> str:
    """
    Searches for text on the current page via Ctrl + F and jumps to the nth occurrence.
    Args:
        text: The text to search for
        nth_result: Which occurrence to jump to (default: 1)
    """
    elements = driver.find_elements(By.XPATH, f"//*[contains(text(), '{text}')]")
    if nth_result > len(elements):
        raise Exception(f"Match n°{nth_result} not found (only {len(elements)} matches found)")
    result = f"Found {len(elements)} matches for '{text}'."
    elem = elements[nth_result - 1]
    driver.execute_script("arguments[0].scrollIntoView(true);", elem)
    result += f"Focused on element {nth_result} of {len(elements)}"
    return result

@tool
def go_back() -> None:
    """Goes back to previous page."""
    driver.back()

@tool
def close_popups() -> str:
    """
    Closes any visible modal or pop-up on the page. Use this to dismiss pop-up windows! This does not work on cookie consent banners.
    """
    webdriver.ActionChains(driver).send_keys(Keys.ESCAPE).perform()

还需要保存屏幕截图的功能，因为这是我们的 VLM 智能体完成任务时必不可少的一部分。此功能会捕获屏幕截图并将其保存在 step_log.observations_images = [image.copy()] 中，从而允许智能体在导航时动态存储和处理图像。

def save_screenshot(step_log: ActionStep, agent: CodeAgent) -> None:
    sleep(1.0)  # 让 JavaScript 动画在截图之前完成
    driver = helium.get_driver()
    current_step = step_log.step_number
    if driver is not None:
        for step_logs in agent.logs:  # 从日志中删除先前的截图以进行精简处理
            if isinstance(step_log, ActionStep) and step_log.step_number <= current_step - 2:
                step_logs.observations_images = None
        png_bytes = driver.get_screenshot_as_png()
        image = Image.open(BytesIO(png_bytes))
        print(f"Captured a browser screenshot: {image.size} pixels")
        step_log.observations_images = [image.copy()]  # 创建副本以确保其持久保存，重要！!

    # 使用当前 URL 更新观察结果 
    url_info = f"Current url: {driver.current_url}"
    step_log.observations = url_info if step_logs.observations is None else step_log.observations + "\n" + url_info
    return

此函数作为 step_callback 传递给智能体，因为它在智能体执行的每一步结束时被触发。这使得智能体能够在整个过程中动态捕获和存储屏幕截图。

现在，我们可以生成用于浏览网页的视觉智能体，为其提供我们创建的工具，以及 DuckDuckGoSearchTool 以探索网页。此工具将帮助智能体根据视觉线索检索验证访客身份所需的信息。

from smolagents import CodeAgent, OpenAIServerModel, DuckDuckGoSearchTool
model = OpenAIServerModel(model_id="gpt-4o")

agent = CodeAgent(
    tools=[DuckDuckGoSearchTool(), go_back, close_popups, search_item_ctrl_f],  # 浏览网页的工具函数
    model=model,
    additional_authorized_imports=["helium"],
    step_callbacks=[save_screenshot],  # 在执行的每一步结束时触发
    max_steps=20,
    verbosity_level=2,
)

有了这些，Alfred 准备检查访客的身份，并根据这些信息做出是否允许他们进入派对的明智决定：

agent.run("""
I am Alfred, the butler of Wayne Manor, responsible for verifying the identity of guests at party. A superhero has arrived at the entrance claiming to be Wonder Woman, but I need to confirm if she is who she says she is.

Please search for images of Wonder Woman and generate a detailed visual description based on those images. Additionally, navigate to Wikipedia to gather key details about her appearance. With this information, I can determine whether to grant her access to the event.
""" + helium_instructions)

可以看到，我们将 helium_instructions 作为任务的一部分包含在内。这个特殊的提示旨在控制智能体的导航，确保它在浏览网页时遵循正确的步骤。

通过这些步骤，我们成功地为派对创建了一个身份验证系统！