Taking OpenManus apart layer by layer: starting from run_flow and working step by step toward the design ideas behind the code

Code pulled on March 21, 2025.
OpenManus is iterating extremely quickly at the moment.
https://github.com/mannaandpoem/OpenManus
The first layer of code: based on run_flow.py, I wrote a run_flow_local.py, which simply replaces the command-line input with a task supplied directly in code.

import asyncio
import time

from app.agent.manus import Manus
from app.flow.base import FlowType
from app.flow.flow_factory import FlowFactory
from app.logger import logger


async def run_flow():
    agents = {
        "manus": Manus(),
    }

    try:
        prompt = "从分子结构到宇宙层面都有哪些具身智能应用场景 不要用谷歌搜索"

        if not prompt.strip():  # empty or whitespace-only prompt
            logger.warning("Empty prompt provided.")
            return

        flow = FlowFactory.create_flow(
            flow_type=FlowType.PLANNING,
            agents=agents,
        )
        logger.warning("Processing your request...")

        try:
            start_time = time.time()
            result = await asyncio.wait_for(
                flow.execute(prompt),
                timeout=3600,  # 60 minute timeout for the entire execution
            )
            elapsed_time = time.time() - start_time
            logger.info(f"Request processed in {elapsed_time:.2f} seconds")
            logger.info(result)
        except asyncio.TimeoutError:
            logger.error("Request processing timed out after 1 hour")
            logger.info(
                "Operation terminated due to timeout. Please try a simpler request."
            )

    except KeyboardInterrupt:
        logger.info("Operation cancelled by user.")
    except Exception as e:
        logger.error(f"Error: {str(e)}")


if __name__ == "__main__":
    asyncio.run(run_flow())

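Code walkthrough

This code implements an asynchronously executed AI agent workflow that processes a prompt about embodied-intelligence application scenarios. In detail:

  1. Imports

    • asyncio: Python's asynchronous programming library, used to handle concurrent operations
    • time: used for timing
    • Project modules: the Manus agent, the flow type, the flow factory, and the logger
  2. Main function run_flow()

    • Creates a dictionary containing a single agent, "manus"
    • Sets the prompt text, which in English reads: "What embodied-intelligence application scenarios exist, from molecular structure up to the cosmic scale? Do not use Google search"
    • Checks that the prompt is not empty
    • Uses FlowFactory to create a workflow of type PLANNING
    • Executes the workflow with detailed error handling:
      • Sets a 60-minute (3600-second) timeout
      • Logs the elapsed time and the result
      • Handles the timeout error gracefully
    • Handles user interruption and other exceptions
  3. Entry point

    • When the script is run directly, the main function is executed via asyncio.run(run_flow())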
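
The timeout handling in step 2 can be tried in isolation. The following is a minimal sketch, not OpenManus code: slow_task is a hypothetical stand-in for flow.execute(prompt), and the delays are chosen only to make both outcomes visible quickly.

```python
import asyncio
import time


async def slow_task(delay: float) -> str:
    """Hypothetical stand-in for flow.execute(prompt)."""
    await asyncio.sleep(delay)
    return "done"


async def run_with_timeout(delay: float, timeout: float) -> str:
    """Await slow_task, giving up once `timeout` seconds have passed."""
    start = time.time()
    try:
        result = await asyncio.wait_for(slow_task(delay), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for cancels the inner task before raising.
        return "timed out"
    elapsed = time.time() - start
    return f"{result} in {elapsed:.2f}s"


# First call finishes in time, second call exceeds its timeout.
print(asyncio.run(run_with_timeout(delay=0.01, timeout=1.0)))
print(asyncio.run(run_with_timeout(delay=1.0, timeout=0.05)))
```

Because asyncio.wait_for cancels the awaited task when the timeout expires, no separate cleanup of the task is needed in the except branch.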

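This code appears to be part of a larger AI agent framework, in which:

  • The Manus agent presumably provides specialized capabilities
  • FlowType.PLANNING indicates a planning-type workflow
  • The prompt asks about embodied-intelligence application scenarios at scales from molecules to the universe, and states that Google search should not be used

Two classes stand out as good candidates for source-code study: the Manus class and the FlowFactory class.
So we take a closer look at the implementation of the Manus class first.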

app/agent/manus.py

from pydantic import Field

from app.agent.browser import BrowserAgent
from app.config import config
from app.prompt.browser import NEXT_STEP_PROMPT as BROWSER_NEXT_STEP_PROMPT
from app.prompt.manus import NEXT_STEP_PROMPT, SYSTEM_PROMPT
from app.tool import Terminate, ToolCollection
from app.tool.browser_use_tool import BrowserUseTool
from app.tool.python_execute import PythonExecute
from app.tool.str_replace_editor import StrReplaceEditor


class Manus(BrowserAgent):
    """
    A versatile general-purpose agent that uses planning to solve various tasks.

    This agent extends BrowserAgent with a comprehensive set of tools and capabilities,
    including Python execution, web browsing, file operations, and information retrieval
    to handle a wide range of user requests.
    """

    name: str = "Manus"
    description: str = (
        "A versatile agent that can solve various tasks using multiple tools"
    )

    system_prompt: str = SYSTEM_PROMPT.format(directory=config.workspace_root)
    next_step_prompt: str = NEXT_STEP_PROMPT

    max_observe: int = 10000
    max_steps: int = 20

    # Add general-purpose tools to the tool collection
    available_tools: ToolCollection = Field(
        default_factory=lambda: ToolCollection(
            PythonExecute(), BrowserUseTool(), StrReplaceEditor(), Terminate()
        )
    )

    async def think(self) -> bool:
        """Process current state and decide next actions with appropriate context."""
        # Store original prompt
        original_prompt = self.next_step_prompt

        # Only check recent messages (last 3) for browser activity
        recent_messages = self.memory.messages[-3:] if self.memory.messages else []
        browser_in_use = any(
            "browser_use" in msg.content.lower()
            for msg in recent_messages
            if hasattr(msg, "content") and isinstance(msg.content, str)
        )

        if browser_in_use:
            # Override with browser-specific prompt temporarily to get browser context
            self.next_step_prompt = BROWSER_NEXT_STEP_PROMPT

        # Call parent's think method
        result = await super().think()

        # Restore original prompt
        self.next_step_prompt = original_prompt

        return result

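Code walkthrough

This code defines Manus, a general-purpose AI agent class that inherits from BrowserAgent and comes with several tools and capabilities. In detail:

  1. Imports

    • pydantic.Field: used for data validation and model definition
    • Various project modules, including the browser agent, configuration, prompt texts, and the tool collection
  2. The Manus class definition

    • Inherits from BrowserAgent
    • The class docstring describes it as a versatile general-purpose agent that uses planning to solve various tasks
    • Offers Python execution, web browsing, file operations, and information retrieval
  3. Attributes

    • name: the agent's name, "Manus"
    • description: describes it as an agent that solves tasks using multiple tools
    • system_prompt: the system prompt, formatted with the workspace path from the configuration
    • next_step_prompt: the prompt that guides the agent's next action
    • max_observe and max_steps: cap the length of an observation (10000) and the maximum number of steps (20)
  4. Tool collection

    • Initialized with Field(default_factory=...)
    • Contains four tools:
      • PythonExecute: executes Python code
      • BrowserUseTool: interacts with a web browser
      • StrReplaceEditor: file operations
      • Terminate: ends the interaction when the request is fulfilled or the assistant cannot continue; it is called once all tasks are finished.
  5. The think() method

    • An asynchronous method that processes the current state and decides the next action
    • Implements a special context-switching logic:
      • Saves the original next-step prompt
      • Checks whether the last 3 messages mention browser activity (the string "browser_use")
      • If the browser is in use, temporarily switches to the browser-specific prompt
      • Calls the parent class's think method
      • Restores the original prompt
      • Returns the result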
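
The save, swap, and restore logic of think() can be sketched without any of the framework. Everything below is hypothetical: TinyAgent, base_think, and the two prompt strings are stand-ins, and the method is synchronous for brevity. One deliberate difference from the source is the try/finally, which restores the prompt even when the parent call raises; the original restores it only after a successful call.

```python
from dataclasses import dataclass, field
from typing import List

DEFAULT_PROMPT = "general next-step prompt"  # hypothetical stand-in
BROWSER_PROMPT = "browser next-step prompt"  # hypothetical stand-in


@dataclass
class TinyAgent:
    """Hypothetical, synchronous miniature of the think() override."""

    next_step_prompt: str = DEFAULT_PROMPT
    messages: List[str] = field(default_factory=list)

    def base_think(self) -> str:
        # Stand-in for the parent's think(): report the prompt in effect.
        return self.next_step_prompt

    def think(self) -> str:
        original = self.next_step_prompt
        # Only the last three messages are inspected for browser activity.
        if any("browser_use" in m.lower() for m in self.messages[-3:]):
            self.next_step_prompt = BROWSER_PROMPT
        try:
            return self.base_think()
        finally:
            # Restore the prompt even if base_think() raises.
            self.next_step_prompt = original


agent = TinyAgent(messages=["hi", "called browser_use", "ok"])
print(agent.think())           # browser next-step prompt
print(agent.next_step_prompt)  # general next-step prompt
```

A mention of "browser_use" that is older than the last three messages no longer triggers the switch, which is what keeps the browser prompt from sticking around after the agent has moved on to other tools.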

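The Manus class is thus an agent that adjusts its behaviour to the task at hand; in particular, when handling browser-related tasks it automatically switches to a more suitable prompt context. It bundles several tools and can run Python code, browse the web, and edit text, which makes it an all-round AI assistant implementation.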
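
The effect of building the tool collection through a default factory can be illustrated with the standard library. This sketch uses dataclasses.field(default_factory=...) as a stand-in for pydantic's Field so that it has no third-party dependency; make_tools and the plain-string tool names are illustrative assumptions, not the real tool objects.

```python
from dataclasses import dataclass, field
from typing import List


def make_tools() -> List[str]:
    """Hypothetical factory; strings stand in for tool instances."""
    return ["python_execute", "browser_use", "str_replace_editor", "terminate"]


@dataclass
class Agent:
    # The factory runs once per instance, so every agent gets its own list.
    available_tools: List[str] = field(default_factory=make_tools)


a, b = Agent(), Agent()
a.available_tools.append("extra_tool")
print(len(a.available_tools), len(b.available_tools))  # 5 4
print(a.available_tools is b.available_tools)          # False
```

The factory is called when an instance is created, so the tools are constructed lazily and are not shared between agents.
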
Next, let us look at how the four tools are implemented: PythonExecute, BrowserUseTool, StrReplaceEditor, and Terminate.

app/tool/python_execute.py

import multiprocessing
import sys
from io import StringIO
from typing import Dict

from app.tool.base import BaseTool


class PythonExecute(BaseTool):
    """A tool for executing Python code with timeout and safety restrictions."""

    name: str = "python_execute"
    description: str = "Executes Python code string. Note: Only print outputs are visible, function return values are not captured. Use print statements to see results."
    parameters: dict = {
        "type": "object",
        "properties": {
            "code": {
                "type": "string",
                "description": "The Python code to execute.",
            },
        },
        "required": ["code"],
    }

    def _run_code(self, code: str, result_dict: dict, safe_globals: dict) -> None:
        original_stdout = sys.stdout
        try:
            output_buffer = StringIO()
            sys.stdout = output_buffer
            exec(code, safe_globals, safe_globals)
            result_dict["observation"] = output_buffer.getvalue()
            result_dict["success"] = True
        except Exception as e:
            result_dict["observation"] = str(e)
            result_dict["success"] = False
        finally:
            sys.stdout = original_stdout

    async def execute(
        self,
        code: str,
        timeout: int = 5,
    ) -> Dict:
        """
        Executes the provided Python code with a timeout.

        Args:
            code (str): The Python code to execute.
            timeout (int): Execution timeout in seconds.

        Returns:
            Dict: Contains 'output' with execution output or error message and 'success' status.
        """

        with multiprocessing.Manager() as manager:
            result = manager.dict({"observation": "", "success": False})
            if isinstance(__builtins__, dict):
                safe_globals = {"__builtins__": __builtins__}
            else:
                safe_globals = {"__builtins__": __builtins__.__dict__.copy()}
            proc = multiprocessing.Process(
                target=self._run_code, args=(code, result, safe_globals)
            )
            proc.start()
            proc.join(timeout)

            # timeout process
            if proc.is_alive():
                proc.terminate()
                proc.join(1)
                return {
                    "observation": f"Execution timeout after {timeout} seconds",
                    "success": False,
                }
            return dict(result)

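This is a Python code execution tool with a timeout and some safety measures:

Main components:

  1. Imports:
  • multiprocessing: runs the code in a separate process and implements the timeout
  • sys: used to redirect standard output
  • StringIO: used to capture printed output
  • typing.Dict: used for type hints
  • BaseTool: the parent class (not shown here)
  2. Class definition:
class PythonExecute(BaseTool):
  • A tool for executing Python code
  • Inherits from BaseTool
  3. Class attributes:
name: str = "python_execute"
description: str = "Executes Python code string. ..."
parameters: dict = {...}
  • Define the tool's metadata and input format
  • One required parameter, "code" (a string)
  4. Helper method _run_code
def _run_code(self, code: str, result_dict: dict, safe_globals: dict) -> None:
  • Executes the code and captures its output
  • Redirects standard output to capture printed text
  • Runs the code with exec()
  • Handles exceptions and stores the result in result_dict
  • Restores the original standard output in the finally block
  5. Main method execute
async def execute(self, code: str, timeout: int = 5) -> Dict:
  • The main entry point for code execution
  • Accepts the code string and an optional timeout (default 5 seconds)
  • Returns a dictionary with the execution result

Main safety features:

  1. Process isolation:
  • Uses multiprocessing to run the code in a separate process
  • Keeps a crash or hang in the executed code from taking down the main process
  2. Timeout control:
  • Terminates execution once the timeout is exceeded
  • Uses proc.join(timeout) and proc.terminate()
  3. Execution environment:
  • Builds safe_globals from the builtins
  • Despite the name, safe_globals contains the complete set of builtins (including __import__ and open), so no dangerous function is actually blocked
  4. Output capture:
  • Redirects standard output to capture printed text
  • Return values are not captured, only printed output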
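
The capture-and-exec part of _run_code can be reproduced in a few lines. This is a sketch under assumptions, not the tool itself: run_snippet is a hypothetical name, it runs in the current process rather than a child process, and it uses contextlib.redirect_stdout in place of the manual sys.stdout swap. The environment has the same shape as safe_globals, which makes it easy to confirm that imports and file access are still available.

```python
import builtins
import contextlib
import io


def run_snippet(code: str) -> dict:
    """Run `code` with exec(), returning printed output or the error text."""
    buffer = io.StringIO()
    # Same shape as safe_globals in the tool: a full copy of the builtins.
    env = {"__builtins__": builtins.__dict__.copy()}
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, env, env)
    except Exception as exc:
        return {"observation": str(exc), "success": False}
    return {"observation": buffer.getvalue(), "success": True}


print(run_snippet("print(1 + 1)"))
print(run_snippet("1 / 0"))
# Nothing is blocked: imports and open() remain available to the snippet.
print(run_snippet("import os\nprint(callable(open), len(os.sep))"))
```

The last call succeeds, which shows that passing the full builtins does not restrict what the executed code can do.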

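Execution flow:

  1. Create a managed dictionary for storing the result
  2. Set up the execution environment (safe_globals)
  3. Create a process to run the code
  4. Wait for it to finish, applying the timeout
  5. Return the result or a timeout message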
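
The same start, wait, and kill-on-timeout flow can be sketched with the standard library's subprocess module. Note that this swaps in a different technique from the source: it launches a fresh interpreter with subprocess.run instead of a multiprocessing.Process with a managed dictionary, and run_with_timeout is a hypothetical name. The returned dictionary mirrors the tool's "observation" and "success" keys.

```python
import subprocess
import sys


def run_with_timeout(code: str, timeout: float = 5) -> dict:
    """Run `code` in a fresh interpreter; give up after `timeout` seconds."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return {
            "observation": f"Execution timeout after {timeout} seconds",
            "success": False,
        }
    if completed.returncode != 0:
        # A non-zero exit code means the snippet raised; report stderr.
        return {"observation": completed.stderr, "success": False}
    return {"observation": completed.stdout, "success": True}


print(run_with_timeout("print(42)"))
print(run_with_timeout("while True: pass", timeout=1))
```

As in the original tool, an endless loop does not block the caller: the child is stopped once the timeout expires and a failure result is returned.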

Return format:

{
    "observation": "captured output or error message",
    "success": True/False
}

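Usage example:

import asyncio
from app.tool.python_execute import PythonExecute

async def main():
    tool = PythonExecute()
    result = await tool.execute("print('hello')\nprint(42)")
    # Returns: {"observation": "hello\n42\n", "success": True}
    result = await tool.execute("while True: pass")
    # Returns: {"observation": "Execution timeout after 5 seconds", "success": False}

if __name__ == "__main__":  # guard needed where multiprocessing spawns processes
    asyncio.run(main())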

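This implementation offers a basic way to run generated Python code: process isolation and the timeout protect the main process from crashes and hangs. Because the code runs with the full set of builtins, it should not be treated as a sandbox for untrusted code.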

Next comes the browser tool:
app/tool/browser_use_tool.py

import asyncio
import base64
import json
from typing import Generic, Optional, TypeVar

from browser_use import Browser as BrowserUseBrowser
from browser_use import BrowserConfig
from browser_use.browser.context import BrowserContext, BrowserContextConfig
from browser_use.dom.service import DomService
from pydantic import Field, field_validator
from pydantic_core.core_schema import ValidationInfo

from app.config import config
from app.llm import LLM
from app.tool.base import BaseTool, ToolResult
from app.tool.web_search import WebSearch


_BROWSER_DESCRIPTION = """
Interact with a web browser to perform various actions such as navigation, element interaction, content extraction, and tab management. This tool provides a comprehensive set of browser automation capabilities:

Navigation:
- 'go_to_url': Go to a specific URL in the current tab
- 'go_back': Go back
- 'refresh': Refresh the current page
- 'web_search': Search the query in the current tab, the query should be a search query like humans search in web, concrete and not vague or super long. More the single most important items.

Element Interaction:
- 'click_element': Click an element by index
- 'input_text': Input text into a form element
- 'scroll_down'/'scroll_up': Scroll the page (with optional pixel amount)
- 'scroll_to_text': If you dont find something which you want to interact with, scroll to it
- 'send_keys': Send strings of special keys like Escape,Backspace, Insert, PageDown, Delete, Enter, Shortcuts such as `Control+o`, `Control+Shift+T` are supported as well. This gets used in keyboard.press.
- 'get_dropdown_options': Get all options from a dropdown
- 'select_dropdown_option': Select dropdown option for interactive element index by the text of the option you want to select

Content Extraction:
- 'extract_content': Extract page content to retrieve specific information from the page, e.g. all company names, a specifc description, all information about, links with companies in structured format or simply links

Tab Management:
- 'switch_tab': Switch to a specific tab
- 'open_tab': Open a new tab with a URL
- 'close_tab': Close the current tab

Utility:
- 'wait': Wait for a specified number of seconds
"""

Context = TypeVar("Context")


class BrowserUseTool(BaseTool, Generic[Context]):
    name: str = "browser_use"
    description: str = _BROWSER_DESCRIPTION
    parameters: dict = {
        "type": "object",
        "properties": {
            "action": {
                "type": "string",
                "enum": [
                    "go_to_url",
                    "click_element",
                    "input_text",
                    "scroll_down",
                    "scroll_up",
                    "scroll_to_text",
                    "send_keys",
                    "get_dropdown_options",
                    "select_dropdown_option",
                    "go_back",
                    "web_search",
                    "wait",
                    "extract_content",
                    "switch_tab",
                    "open_tab",
                    "close_tab",
                ],
                "description": "The browser action to perform",
            },
            "url": {
                "type": "string",
                "description": "URL for 'go_to_url' or 'open_tab' actions",
            },
            "index": {
                "type": "integer",
                "description": "Element index for 'click_element', 'input_text', 'get_dropdown_options', or 'select_dropdown_option' actions",
            },
            "text": {
                "type": "string",
                "description": "Text for 'input_text', 'scroll_to_text', or 'select_dropdown_option' actions",
            },
            "scroll_amount": {
                "type": "integer",
                "description": "Pixels to scroll (positive for down, negative for up) for 'scroll_down' or 'scroll_up' actions",
            },
            "tab_id": {
                "type": "integer",
                "description": "Tab ID for 'switch_tab' action",
            },
            "query": {
                "type": "string",
                "description": "Search query for 'web_search' action",
            },
            "goal": {
                "type": "string",
                "description": "Extraction goal for 'extract_content' action",
            },
            "keys": {
                "type": "string",
                "description": "Keys to send for 'send_keys' action",
            },
            "seconds": {
                "type": "integer",
                "description": "Seconds to wait for 'wait' action",
            },
        },
        "required": ["action"],
        "dependencies": {
            "go_to_url": ["url"],
            "click_element": ["index"],
            "input_text": ["index", "text"],
            "switch_tab": ["tab_id"],
            "open_tab": ["url"],
            "scroll_down": ["scroll_amount"],
            "scroll_up": ["scroll_amount"],
            "scroll_to_text": ["text"],
            "send_keys": ["keys"],
            "get_dropdown_options": ["index"],
            "select_dropdown_option": ["index", "text"],
            "go_back": [],
            "web_search": ["query"],
            "wait": ["seconds"],
            "extract_content": ["goal"],
        },
    }

    lock: asyncio.Lock = Field(default_factory=asyncio.Lock)
    browser: Optional[BrowserUseBrowser] = Field(default=None, exclude=True)
    context: Optional[BrowserContext] = Field(default=None, exclude=True)
    dom_service: Optional[DomService] = Field(default=None, exclude=True)
    web_search_tool: WebSearch = Field(default_factory=WebSearch, exclude=True)

    # Context for generic functionality
    tool_context: Optional[Context] = Field(default=None, exclude=True)

    llm: Optional[LLM] = Field(default_factory=LLM)

    @field_validator("parameters", mode="before")
    def validate_parameters(cls, v: dict, info: ValidationInfo) -> dict:
        if not v:
            raise ValueError("Parameters cannot be empty")
        return v

    async def _ensure_browser_initialized(self) -> BrowserContext:
        """Ensure browser and context are initialized."""
        if self.browser is None:
            browser_config_kwargs = {"headless": False, "disable_security": True}

            if config.browser_config:
                from browser_use.browser.browser import ProxySettings

                # handle proxy settings.
                if config.browser_config.proxy and config.browser_config.proxy.server:
                    browser_config_kwargs["proxy"] = ProxySettings(
                        server=config.browser_config.proxy.server,
                        username=config.browser_config.proxy.username,
                        password=config.browser_config.proxy.password,
                    )

                browser_attrs = [
                    "headless",
                    "disable_security",
                    "extra_chromium_args",
                    "chrome_instance_path",
                    "wss_url",
                    "cdp_url",
                ]

                for attr in browser_attrs:
                    value = getattr(config.browser_config, attr, None)
                    if value is not None:
                        if not isinstance(value, list) or value:
                            browser_config_kwargs[attr] = value

            self.browser = BrowserUseBrowser(BrowserConfig(**browser_config_kwargs))

        if self.context is None:
            context_config = BrowserContextConfig()

            # if there is context config in the config, use it.
            if (
                config.browser_config
                and hasattr(config.browser_config, "new_context_config")
                and config.browser_config.new_context_config
            ):
                context_config = config.browser_config.new_context_config

            self.context = await self.browser.new_context(context_config)
            self.dom_service = DomService(await self.context.get_current_page())

        return self.context

    async def execute(
        self,
        action: str,
        url: Optional[str] = None,
        index: Optional[int] = None,
        text: Optional[str] = None,
        scroll_amount: Optional[int] = None,
        tab_id: Optional[int] = None,
        query: Optional[str] = None,
        goal: Optional[str] = None,
        keys: Optional[str] = None,
        seconds: Optional[int] = None,
        **kwargs,
    ) -> ToolResult:
        """
        Execute a specified browser action.

        Args:
            action: The browser action to perform
            url: URL for navigation or new tab
            index: Element index for click or input actions
            text: Text for input action or search query
            scroll_amount: Pixels to scroll for scroll action
            tab_id: Tab ID for switch_tab action
            query: Search query for Google search
            goal: Extraction goal for content extraction
            keys: Keys to send for keyboard actions
            seconds: Seconds to wait
            **kwargs: Additional arguments

        Returns:
            ToolResult with the action's output or error
        """
        async with self.lock:
            try:
                context = await self._ensure_browser_initialized()

                # Get max content length from config
                max_content_length = getattr(
                    config.browser_config, "max_content_length", 2000
                )

                # Navigation actions
                if action == "go_to_url":
                    if not url:
                        return ToolResult(
                            error="URL is required for 'go_to_url' action"
                        )
                    page = await context.get_current_page()
                    await page.goto(url)
                    await page.wait_for_load_state()
                    return ToolResult(output=f"Navigated to {url}")

                elif action == "go_back":
                    await context.go_back()
                    return ToolResult(output="Navigated back")

                elif action == "refresh":
                    await context.refresh_page()
                    return ToolResult(output="Refreshed current page")

                elif action == "web_search":
                    if not query:
                        return ToolResult(
                            error="Query is required for 'web_search' action"
                        )
                    search_results = await self.web_search_tool.execute(query)

                    if search_results:
                        # Navigate to the first search result
                        first_result = search_results[0]
                        if isinstance(first_result, dict) and "url" in first_result:
                            url_to_navigate = first_result["url"]
                        elif isinstance(first_result, str):
                            url_to_navigate = first_result
                        else:
                            return ToolResult(
                                error=f"Invalid search result format: {first_result}"
                            )

                        page = await context.get_current_page()
                        await page.goto(url_to_navigate)
                        await page.wait_for_load_state()

                        return ToolResult(
                            output=f"Searched for '{query}' and navigated to first result: {url_to_navigate}\nAll results:"
                            + "\n".join([str(r) for r in search_results])
                        )
                    else:
                        return ToolResult(
                            error=f"No search results found for '{query}'"
                        )

                # Element interaction actions
                elif action == "click_element":
                    if index is None:
                        return ToolResult(
                            error="Index is required for 'click_element' action"
                        )
                    element = await context.get_dom_element_by_index(index)
                    if not element:
                        return ToolResult(error=f"Element with index {index} not found")
                    download_path = await context._click_element_node(element)
                    output = f"Clicked element at index {index}"
                    if download_path:
                        output += f" - Downloaded file to {download_path}"
                    return ToolResult(output=output)

                elif action == "input_text":
                    if index is None or not text:
                        return ToolResult(
                            error="Index and text are required for 'input_text' action"
                        )
                    element = await context.get_dom_element_by_index(index)
                    if not element:
                        return ToolResult(error=f"Element with index {index} not found")
                    await context._input_text_element_node(element, text)
                    return ToolResult(
                        output=f"Input '{text}' into element at index {index}"
                    )

                elif action == "scroll_down" or action == "scroll_up":
                    direction = 1 if action == "scroll_down" else -1
                    amount = (
                        scroll_amount
                        if scroll_amount is not None
                        else context.config.browser_window_size["height"]
                    )
                    await context.execute_javascript(
                        f"window.scrollBy(0, {direction * amount});"
                    )
                    return ToolResult(
                        output=f"Scrolled {'down' if direction > 0 else 'up'} by {amount} pixels"
                    )

                elif action == "scroll_to_text":
                    if not text:
                        return ToolResult(
                            error="Text is required for 'scroll_to_text' action"
                        )
                    page = await context.get_current_page()
                    try:
                        locator = page.get_by_text(text, exact=False)
                        await locator.scroll_into_view_if_needed()
                        return ToolResult(output=f"Scrolled to text: '{text}'")
                    except Exception as e:
                        return ToolResult(error=f"Failed to scroll to text: {str(e)}")

                elif action == "send_keys":
                    if not keys:
                        return ToolResult(
                            error="Keys are required for 'send_keys' action"
                        )
                    page = await context.get_current_page()
                    await page.keyboard.press(keys)
                    return ToolResult(output=f"Sent keys: {keys}")

                elif action == "get_dropdown_options":
                    if index is None:
                        return ToolResult(
                            error="Index is required for 'get_dropdown_options' action"
                        )
                    element = await context.get_dom_element_by_index(index)
                    if not element:
                        return ToolResult(error=f"Element with index {index} not found")
                    page = await context.get_current_page()
                    options = await page.evaluate(
                        """
                        (xpath) => {
                            const select = document.evaluate(xpath, document, null,
                                XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
                            if (!select) return null;
                            return Array.from(select.options).map(opt => ({
                                text: opt.text,
                                value: opt.value,
                                index: opt.index
                            }));
                        }
                    """,
                        element.xpath,
                    )
                    return ToolResult(output=f"Dropdown options: {options}")

                elif action == "select_dropdown_option":
                    if index is None or not text:
                        return ToolResult(
                            error="Index and text are required for 'select_dropdown_option' action"
                        )
                    element = await context.get_dom_element_by_index(index)
                    if not element:
                        return ToolResult(error=f"Element with index {index} not found")
                    page = await context.get_current_page()
                    await page.select_option(element.xpath, label=text)
                    return ToolResult(
                        output=f"Selected option '{text}' from dropdown at index {index}"
                    )

                # Content extraction actions
                elif action == "extract_content":
                    if not goal:
                        return ToolResult(
                            error="Goal is required for 'extract_content' action"
                        )
                    page = await context.get_current_page()
                    try:
                        # Get page content and convert to markdown for better processing
                        html_content = await page.content()

                        # Import markdownify here to avoid global import
                        try:
                            import markdownify

                            content = markdownify.markdownify(html_content)
                        except ImportError:
                            # Fallback if markdownify is not available
                            content = html_content

                        # Create prompt for LLM
                        prompt_text = """
Your task is to extract the content of the page. You will be given a page and a goal, and you should extract all relevant information around this goal from the page. If the goal is vague, summarize the page. Respond in json format.
Extraction goal: {goal}

Page content:
{page}
"""
                        # Format the prompt with the goal and content
                        max_content_length = min(50000, len(content))
                        formatted_prompt = prompt_text.format(
                            goal=goal, page=content[:max_content_length]
                        )

                        # Create a proper message list for the LLM
                        from app.schema import Message

                        messages = [Message.user_message(formatted_prompt)]

                        # Define extraction function for the tool
                        extraction_function = {
                            "type": "function",
                            "function": {
                                "name": "extract_content",
                                "description": "Extract specific information from a webpage based on a goal",
                                "parameters": {
                                    "type": "object",
                                    "properties": {
                                        "extracted_content": {
                                            "type": "object",
                                            "description": "The content extracted from the page according to the goal",
                                        }
                                    },
                                    "required": ["extracted_content"],
                                },
                            },
                        }

                        # Use LLM to extract content with required function calling
                        response = await self.llm.ask_tool(
                            messages,
                            tools=[extraction_function],
                            tool_choice="required",
                        )

                        # Extract content from function call response
                        if (
                            response
                            and response.tool_calls
                            and len(response.tool_calls) > 0
                        ):
                            # Get the first tool call arguments
                            tool_call = response.tool_calls[0]
                            # Parse the JSON arguments
                            try:
                                args = json.loads(tool_call.function.arguments)
                                extracted_content = args.get("extracted_content", {})
                                # Format extracted content as JSON string
                                content_json = json.dumps(
                                    extracted_content, indent=2, ensure_ascii=False
                                )
                                msg = f"Extracted from page:\n{content_json}\n"
                            except Exception as e:
                                msg = f"Error parsing extraction result: {str(e)}\nRaw response: {tool_call.function.arguments}"
                        else:
                            msg = "No content was extracted from the page."

                        return ToolResult(output=msg)
                    except Exception as e:
                        # Provide a more helpful error message
                        error_msg = f"Failed to extract content: {str(e)}"
                        try:
                            # Try to return a portion of the page content as fallback
                            return ToolResult(
                                output=f"{error_msg}\nHere's a portion of the page content:\n{content[:2000]}..."
                            )
                        except:
                            # If all else fails, just return the error
                            return ToolResult(error=error_msg)

                # Tab management actions
                elif action == "switch_tab":
                    if tab_id is None:
                        return ToolResult(
                            error="Tab ID is required for 'switch_tab' action"
                        )
                    await context.switch_to_tab(tab_id)
                    page = await context.get_current_page()
                    await page.wait_for_load_state()
                    return ToolResult(output=f"Switched to tab {tab_id}")

                elif action == "open_tab":
                    if not url:
                        return ToolResult(error="URL is required for 'open_tab' action")
                    await context.create_new_tab(url)
                    return ToolResult(output=f"Opened new tab with {url}")

                elif action == "close_tab":
                    await context.close_current_tab()
                    return ToolResult(output="Closed current tab")

                # Utility actions
                elif action == "wait":
                    seconds_to_wait = seconds if seconds is not None else 3
                    await asyncio.sleep(seconds_to_wait)
                    return ToolResult(output=f"Waited for {seconds_to_wait} seconds")

                else:
                    return ToolResult(error=f"Unknown action: {action}")

            except Exception as e:
                return ToolResult(error=f"Browser action '{action}' failed: {str(e)}")

    async def get_current_state(
        self, context: Optional[BrowserContext] = None
    ) -> ToolResult:
        """
        Get the current browser state as a ToolResult.
        If context is not provided, uses self.context.
        """
        try:
            # Use provided context or fall back to self.context
            ctx = context or self.context
            if not ctx:
                return ToolResult(error="Browser context not initialized")

            state = await ctx.get_state()

            # Create a viewport_info dictionary if it doesn't exist
            viewport_height = 0
            if hasattr(state, "viewport_info") and state.viewport_info:
                viewport_height = state.viewport_info.height
            elif hasattr(ctx, "config") and hasattr(ctx.config, "browser_window_size"):
                viewport_height = ctx.config.browser_window_size.get("height", 0)

            # Take a screenshot for the state
            page = await ctx.get_current_page()

            await page.bring_to_front()
            await page.wait_for_load_state()

            screenshot = await page.screenshot(
                full_page=True, animations="disabled", type="jpeg", quality=100
            )

            screenshot = base64.b64encode(screenshot).decode("utf-8")

            # Build the state info with all required fields
            state_info = {
                "url": state.url,
                "title": state.title,
                "tabs": [tab.model_dump() for tab in state.tabs],
                "help": "[0], [1], [2], etc., represent clickable indices corresponding to the elements listed. Clicking on these indices will navigate to or interact with the respective content behind them.",
                "interactive_elements": (
                    state.element_tree.clickable_elements_to_string()
                    if state.element_tree
                    else ""
                ),
                "scroll_info": {
                    "pixels_above": getattr(state, "pixels_above", 0),
                    "pixels_below": getattr(state, "pixels_below", 0),
                    "total_height": getattr(state, "pixels_above", 0)
                    + getattr(state, "pixels_below", 0)
                    + viewport_height,
                },
                "viewport_height": viewport_height,
            }

            return ToolResult(
                output=json.dumps(state_info, indent=4, ensure_ascii=False),
                base64_image=screenshot,
            )
        except Exception as e:
            return ToolResult(error=f"Failed to get browser state: {str(e)}")

    async def cleanup(self):
        """Clean up browser resources."""
        async with self.lock:
            if self.context is not None:
                await self.context.close()
                self.context = None
                self.dom_service = None
            if self.browser is not None:
                await self.browser.close()
                self.browser = None

    def __del__(self):
        """Ensure cleanup when object is destroyed."""
        if self.browser is not None or self.context is not None:
            try:
                asyncio.run(self.cleanup())
            except RuntimeError:
                loop = asyncio.new_event_loop()
                loop.run_until_complete(self.cleanup())
                loop.close()

    @classmethod
    def create_with_context(cls, context: Context) -> "BrowserUseTool[Context]":
        """Factory method to create a BrowserUseTool with a specific context."""
        tool = cls()
        tool.tool_context = context
        return tool

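To be honest, that is quite a lot of code.
Let me walk through it in detail. BrowserUseTool is a browser automation tool class that drives a browser to perform a range of actions. Here is a step-by-step breakdown: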


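Main components

1. Imports
  • asyncio: supports asynchronous, non-blocking execution.
  • base64: encodes screenshots as base64 strings.
  • json: serializes data.
  • typing.Generic, TypeVar: support generic programming.
  • browser_use: provides the core browser-control functionality.
  • pydantic: data validation and model definitions.
  • app modules: application config, the language model (LLM), and other tools.
2. Class definition
class BrowserUseTool(BaseTool, Generic[Context]):
  • Inherits from BaseTool and uses Generic[Context] to support a custom context.
  • Implements a browser automation tool that supports navigation, element interaction, content extraction, and tab management.
3. Class attributes
  • name: the tool name, "browser_use".
  • description: a detailed description of what the tool can do, grouped into:
    • Navigation: go to a URL, go back and forward, refresh, web search.
    • Element interaction: click, input text, scroll, send keys, dropdown operations.
    • Content extraction: extract specific content from the page.
    • Tab management: switch, open, and close tabs.
    • Utility: wait for a given amount of time.
  • parameters: defines the tool's input parameters, including:
    • action: the action type (an enum value such as "go_to_url").
    • Optional parameters: url, index, text, scroll_amount, tab_id, query, goal, keys, seconds.
    • dependencies: specifies which parameters each action requires.
4. Instance attributes
  • lock: an asyncio.Lock that serializes concurrent coroutine calls on the same event loop (note that it is not a thread lock).
  • browser: the browser instance.
  • context: the browser context.
  • dom_service: the DOM service.
  • web_search_tool: the web search tool instance.
  • tool_context: the generic context.
  • llm: the large language model instance, used for content extraction.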
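
To make the role of lock concrete, here is a minimal sketch of how an asyncio.Lock serializes overlapping calls. FakeBrowserTool is a hypothetical stand-in written for this illustration, not the real BrowserUseTool, and the sleep only simulates a slow browser call.

```python
import asyncio
from typing import List


class FakeBrowserTool:
    """Hypothetical stand-in for a tool that guards shared state with a lock."""

    def __init__(self) -> None:
        self.lock = asyncio.Lock()
        self.log: List[str] = []

    async def execute(self, action: str) -> str:
        # Only one coroutine at a time may run the body of this block.
        async with self.lock:
            self.log.append(f"start {action}")
            await asyncio.sleep(0.01)  # simulate a slow browser call
            self.log.append(f"end {action}")
            return f"done {action}"


async def demo() -> List[str]:
    tool = FakeBrowserTool()
    # Both calls are started together, but the lock forces them to run one after another.
    await asyncio.gather(tool.execute("a"), tool.execute("b"))
    return tool.log


log = asyncio.run(demo())
print(log)
```

Without the lock, the second call would start while the first one is still sleeping, and the start and end entries of the two actions would interleave.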

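Core methods

1. _ensure_browser_initialized
async def _ensure_browser_initialized(self) -> BrowserContext:
  • Purpose: initialize the browser and its context.
  • Flow:
    • If browser is not initialized, create a BrowserUseBrowser from the configuration (for example headless mode and proxy settings).
    • If context is not initialized, create a new context and initialize DomService.
  • Returns: the browser context object.
2. execute
async def execute(self, action: str, url: Optional[str] = None, ...) -> ToolResult:
  • Purpose: execute the requested browser action.
  • Parameters: which ones are used depends on action.
  • Implementation details (grouped by action type):
Navigation
  • "go_to_url": navigate to the given URL.
  • "go_back": go back to the previous page.
  • "refresh": refresh the page.
  • "web_search": run a web search and navigate to the first result.
Element interaction
  • "click_element": click the element at the given index.
  • "input_text": type text into the given element.
  • "scroll_down" / "scroll_up": scroll the page by the given number of pixels.
  • "scroll_to_text": scroll to the position that contains the given text.
  • "send_keys": send keyboard keys (such as Enter or Ctrl+O).
  • "get_dropdown_options": get the options of a dropdown.
  • "select_dropdown_option": select an option in a dropdown.
Content extraction
  • "extract_content": extract content from the page according to a goal (for example "extract company names").
    • Get the page HTML and convert it to Markdown.
    • Pass the content to the LLM through a function call and return the result as JSON.
Tab management
  • "switch_tab": switch to the given tab.
  • "open_tab": open a new tab.
  • "close_tab": close the current tab.
Utility
  • "wait": wait for the given number of seconds (3 by default).

  • Returns: a ToolResult containing either the action output or an error message.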
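
The last step of extract_content, turning the arguments of the LLM's tool call into a readable message, can be sketched on its own. This mirrors the parsing logic in the listing above; format_extraction is a name made up for this example, and the sample inputs are invented.

```python
import json


def format_extraction(raw_arguments: str) -> str:
    """Turn the JSON argument string of a tool call into a readable message."""
    try:
        args = json.loads(raw_arguments)
        extracted = args.get("extracted_content", {})
        # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable.
        pretty = json.dumps(extracted, indent=2, ensure_ascii=False)
        return f"Extracted from page:\n{pretty}\n"
    except Exception as exc:
        # Fall back to showing the raw arguments so nothing is lost.
        return f"Error parsing extraction result: {exc}\nRaw response: {raw_arguments}"


ok = format_extraction('{"extracted_content": {"companies": ["Acme"]}}')
bad = format_extraction("not json")
print(ok)
print(bad)
```

The point of the fallback branch is that a malformed response still produces a message the agent can read, rather than an exception that aborts the step.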

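3. get_current_state
async def get_current_state(self, context: Optional[BrowserContext] = None) -> ToolResult:
  • Purpose: get the current browser state.
  • Returned content:
    • URL, title, and list of tabs.
    • List of interactive elements.
    • Scroll info (pixels above and below the viewport, total height).
    • Viewport height.
    • A full-page screenshot (base64-encoded JPEG).
4. cleanup and __del__
  • cleanup: closes the browser and the context and releases resources.
  • __del__: makes sure cleanup runs when the object is destroyed.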
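
Two small pieces of get_current_state are easy to check in isolation: how total_height is derived, and how the screenshot bytes become a string. The sketch below follows the arithmetic in the listing; the helper names and the sample numbers are invented for illustration.

```python
import base64
from typing import Dict


def build_scroll_info(pixels_above: int, pixels_below: int, viewport_height: int) -> Dict[str, int]:
    """Total page height = what is above the viewport + the viewport + what is below."""
    return {
        "pixels_above": pixels_above,
        "pixels_below": pixels_below,
        "total_height": pixels_above + pixels_below + viewport_height,
    }


def encode_screenshot(raw: bytes) -> str:
    """Encode raw image bytes as a base64 text string suitable for JSON payloads."""
    return base64.b64encode(raw).decode("utf-8")


info = build_scroll_info(pixels_above=300, pixels_below=1200, viewport_height=800)
encoded = encode_screenshot(b"\xff\xd8\xff")  # fake data: the first three bytes of a JPEG
print(info["total_height"])  # 2300
print(encoded)  # /9j/
```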

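Key features

  1. Async support: uses asyncio for non-blocking operations.
  2. Safety:
    • Uses a lock to prevent conflicts between concurrent calls.
    • Only the listed actions are accepted; anything else returns an "Unknown action" error.
  3. Flexibility:
    • Supports many kinds of browser actions.
    • Supports a custom context through generics.
  4. Content extraction:
    • Integrates an LLM to extract information from pages.
  5. Error handling:
    • Every action is wrapped in exception handling and returns a descriptive error message.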

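Usage example

import asyncio

from app.tool.browser_use_tool import BrowserUseTool


async def main():
    tool = BrowserUseTool()
    try:
        # Navigate to a URL
        result = await tool.execute(action="go_to_url", url="https://example.com")
        print(result.output)  # "Navigated to https://example.com"

        # Click an element
        result = await tool.execute(action="click_element", index=0)
        print(result.output)  # "Clicked element at index 0"

        # Extract content
        result = await tool.execute(action="extract_content", goal="Extract all company names")
        print(result.output)  # list of company names in JSON format

        # Get the current state
        state = await tool.get_current_state()
        print(state.output)  # information about the current page
    finally:
        # Release the browser even if one of the calls fails
        await tool.cleanup()


asyncio.run(main())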

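BrowserUseTool is a full-featured browser automation tool that combines navigation, interaction, content extraction, and state management. With async operations, a lock, and LLM-assisted extraction, it suits scenarios that need dynamic web page handling, such as crawling, automated testing, or data extraction. The code is clearly structured, checks the parameters each action requires, and is easy to extend.

This part feels a lot like the Agentic LLM that Zhipu proposed as a way to change the whole phone industry, because an app is essentially a browser. So does this mean that Zhipu, which has not produced much in open source for a long time, has lost its home turf?

Let's keep going.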

"""File and directory manipulation tool with sandbox support."""

from collections import defaultdict
from pathlib import Path
from typing import Any, DefaultDict, List, Literal, Optional, get_args

from app.config import config
from app.exceptions import ToolError
from app.tool import BaseTool
from app.tool.base import CLIResult, ToolResult
from app.tool.file_operators import (
    FileOperator,
    LocalFileOperator,
    PathLike,
    SandboxFileOperator,
)


Command = Literal[
    "view",
    "create",
    "str_replace",
    "insert",
    "undo_edit",
]

# Constants
SNIPPET_LINES: int = 4
MAX_RESPONSE_LEN: int = 16000
TRUNCATED_MESSAGE: str = (
    "<response clipped><NOTE>To save on context only part of this file has been shown to you. "
    "You should retry this tool after you have searched inside the file with `grep -n` "
    "in order to find the line numbers of what you are looking for.</NOTE>"
)

# Tool description
_STR_REPLACE_EDITOR_DESCRIPTION = """Custom editing tool for viewing, creating and editing files
* State is persistent across command calls and discussions with the user
* If `path` is a file, `view` displays the result of applying `cat -n`. If `path` is a directory, `view` lists non-hidden files and directories up to 2 levels deep
* The `create` command cannot be used if the specified `path` already exists as a file
* If a `command` generates a long output, it will be truncated and marked with `<response clipped>`
* The `undo_edit` command will revert the last edit made to the file at `path`

Notes for using the `str_replace` command:
* The `old_str` parameter should match EXACTLY one or more consecutive lines from the original file. Be mindful of whitespaces!
* If the `old_str` parameter is not unique in the file, the replacement will not be performed. Make sure to include enough context in `old_str` to make it unique
* The `new_str` parameter should contain the edited lines that should replace the `old_str`
"""


def maybe_truncate(
    content: str, truncate_after: Optional[int] = MAX_RESPONSE_LEN
) -> str:
    """Truncate content and append a notice if content exceeds the specified length."""
    if not truncate_after or len(content) <= truncate_after:
        return content
    return content[:truncate_after] + TRUNCATED_MESSAGE


class StrReplaceEditor(BaseTool):
    """A tool for viewing, creating, and editing files with sandbox support."""

    name: str = "str_replace_editor"
    description: str = _STR_REPLACE_EDITOR_DESCRIPTION
    parameters: dict = {
        "type": "object",
        "properties": {
            "command": {
                "description": "The commands to run. Allowed options are: `view`, `create`, `str_replace`, `insert`, `undo_edit`.",
                "enum": ["view", "create", "str_replace", "insert", "undo_edit"],
                "type": "string",
            },
            "path": {
                "description": "Absolute path to file or directory.",
                "type": "string",
            },
            "file_text": {
                "description": "Required parameter of `create` command, with the content of the file to be created.",
                "type": "string",
            },
            "old_str": {
                "description": "Required parameter of `str_replace` command containing the string in `path` to replace.",
                "type": "string",
            },
            "new_str": {
                "description": "Optional parameter of `str_replace` command containing the new string (if not given, no string will be added). Required parameter of `insert` command containing the string to insert.",
                "type": "string",
            },
            "insert_line": {
                "description": "Required parameter of `insert` command. The `new_str` will be inserted AFTER the line `insert_line` of `path`.",
                "type": "integer",
            },
            "view_range": {
                "description": "Optional parameter of `view` command when `path` points to a file. If none is given, the full file is shown. If provided, the file will be shown in the indicated line number range, e.g. [11, 12] will show lines 11 and 12. Indexing at 1 to start. Setting `[start_line, -1]` shows all lines from `start_line` to the end of the file.",
                "items": {"type": "integer"},
                "type": "array",
            },
        },
        "required": ["command", "path"],
    }
    _file_history: DefaultDict[PathLike, List[str]] = defaultdict(list)
    _local_operator: LocalFileOperator = LocalFileOperator()
    _sandbox_operator: SandboxFileOperator = SandboxFileOperator()

    # def _get_operator(self, use_sandbox: bool) -> FileOperator:
    def _get_operator(self) -> FileOperator:
        """Get the appropriate file operator based on execution mode."""
        return (
            self._sandbox_operator
            if config.sandbox.use_sandbox
            else self._local_operator
        )

    async def execute(
        self,
        *,
        command: Command,
        path: str,
        file_text: str | None = None,
        view_range: list[int] | None = None,
        old_str: str | None = None,
        new_str: str | None = None,
        insert_line: int | None = None,
        **kwargs: Any,
    ) -> str:
        """Execute a file operation command."""
        # Get the appropriate file operator
        operator = self._get_operator()

        # Validate path and command combination
        await self.validate_path(command, Path(path), operator)

        # Execute the appropriate command
        if command == "view":
            result = await self.view(path, view_range, operator)
        elif command == "create":
            if file_text is None:
                raise ToolError("Parameter `file_text` is required for command: create")
            await operator.write_file(path, file_text)
            self._file_history[path].append(file_text)
            result = ToolResult(output=f"File created successfully at: {path}")
        elif command == "str_replace":
            if old_str is None:
                raise ToolError(
                    "Parameter `old_str` is required for command: str_replace"
                )
            result = await self.str_replace(path, old_str, new_str, operator)
        elif command == "insert":
            if insert_line is None:
                raise ToolError(
                    "Parameter `insert_line` is required for command: insert"
                )
            if new_str is None:
                raise ToolError("Parameter `new_str` is required for command: insert")
            result = await self.insert(path, insert_line, new_str, operator)
        elif command == "undo_edit":
            result = await self.undo_edit(path, operator)
        else:
            # This should be caught by type checking, but we include it for safety
            raise ToolError(
                f'Unrecognized command {command}. The allowed commands for the {self.name} tool are: {", ".join(get_args(Command))}'
            )

        return str(result)

    async def validate_path(
        self, command: str, path: Path, operator: FileOperator
    ) -> None:
        """Validate path and command combination based on execution environment."""
        # Check if path is absolute
        if not path.is_absolute():
            raise ToolError(f"The path {path} is not an absolute path")

        # Only check if path exists for non-create commands
        if command != "create":
            if not await operator.exists(path):
                raise ToolError(
                    f"The path {path} does not exist. Please provide a valid path."
                )

            # Check if path is a directory
            is_dir = await operator.is_directory(path)
            if is_dir and command != "view":
                raise ToolError(
                    f"The path {path} is a directory and only the `view` command can be used on directories"
                )

        # Check if file exists for create command
        elif command == "create":
            exists = await operator.exists(path)
            if exists:
                raise ToolError(
                    f"File already exists at: {path}. Cannot overwrite files using command `create`."
                )

    async def view(
        self,
        path: PathLike,
        view_range: Optional[List[int]] = None,
        operator: FileOperator = None,
    ) -> CLIResult:
        """Display file or directory content."""
        # Determine if path is a directory
        is_dir = await operator.is_directory(path)

        if is_dir:
            # Directory handling
            if view_range:
                raise ToolError(
                    "The `view_range` parameter is not allowed when `path` points to a directory."
                )

            return await self._view_directory(path, operator)
        else:
            # File handling
            return await self._view_file(path, operator, view_range)

    @staticmethod
    async def _view_directory(path: PathLike, operator: FileOperator) -> CLIResult:
        """Display directory contents."""
        find_cmd = f"find {path} -maxdepth 2 -not -path '*/\\.*'"

        # Execute command using the operator
        returncode, stdout, stderr = await operator.run_command(find_cmd)

        if not stderr:
            stdout = (
                f"Here's the files and directories up to 2 levels deep in {path}, "
                f"excluding hidden items:\n{stdout}\n"
            )

        return CLIResult(output=stdout, error=stderr)

    async def _view_file(
        self,
        path: PathLike,
        operator: FileOperator,
        view_range: Optional[List[int]] = None,
    ) -> CLIResult:
        """Display file content, optionally within a specified line range."""
        # Read file content
        file_content = await operator.read_file(path)
        init_line = 1

        # Apply view range if specified
        if view_range:
            if len(view_range) != 2 or not all(isinstance(i, int) for i in view_range):
                raise ToolError(
                    "Invalid `view_range`. It should be a list of two integers."
                )

            file_lines = file_content.split("\n")
            n_lines_file = len(file_lines)
            init_line, final_line = view_range

            # Validate view range
            if init_line < 1 or init_line > n_lines_file:
                raise ToolError(
                    f"Invalid `view_range`: {view_range}. Its first element `{init_line}` should be "
                    f"within the range of lines of the file: {[1, n_lines_file]}"
                )
            if final_line > n_lines_file:
                raise ToolError(
                    f"Invalid `view_range`: {view_range}. Its second element `{final_line}` should be "
                    f"smaller than the number of lines in the file: `{n_lines_file}`"
                )
            if final_line != -1 and final_line < init_line:
                raise ToolError(
                    f"Invalid `view_range`: {view_range}. Its second element `{final_line}` should be "
                    f"larger or equal than its first `{init_line}`"
                )

            # Apply range
            if final_line == -1:
                file_content = "\n".join(file_lines[init_line - 1 :])
            else:
                file_content = "\n".join(file_lines[init_line - 1 : final_line])

        # Format and return result
        return CLIResult(
            output=self._make_output(file_content, str(path), init_line=init_line)
        )

    async def str_replace(
        self,
        path: PathLike,
        old_str: str,
        new_str: Optional[str] = None,
        operator: FileOperator = None,
    ) -> CLIResult:
        """Replace a unique string in a file with a new string."""
        # Read file content and expand tabs
        file_content = (await operator.read_file(path)).expandtabs()
        old_str = old_str.expandtabs()
        new_str = new_str.expandtabs() if new_str is not None else ""

        # Check if old_str is unique in the file
        occurrences = file_content.count(old_str)
        if occurrences == 0:
            raise ToolError(
                f"No replacement was performed, old_str `{old_str}` did not appear verbatim in {path}."
            )
        elif occurrences > 1:
            # Find line numbers of occurrences
            file_content_lines = file_content.split("\n")
            lines = [
                idx + 1
                for idx, line in enumerate(file_content_lines)
                if old_str in line
            ]
            raise ToolError(
                f"No replacement was performed. Multiple occurrences of old_str `{old_str}` "
                f"in lines {lines}. Please ensure it is unique"
            )

        # Replace old_str with new_str
        new_file_content = file_content.replace(old_str, new_str)

        # Write the new content to the file
        await operator.write_file(path, new_file_content)

        # Save the original content to history
        self._file_history[path].append(file_content)

        # Create a snippet of the edited section
        replacement_line = file_content.split(old_str)[0].count("\n")
        start_line = max(0, replacement_line - SNIPPET_LINES)
        end_line = replacement_line + SNIPPET_LINES + new_str.count("\n")
        snippet = "\n".join(new_file_content.split("\n")[start_line : end_line + 1])

        # Prepare the success message
        success_msg = f"The file {path} has been edited. "
        success_msg += self._make_output(
            snippet, f"a snippet of {path}", start_line + 1
        )
        success_msg += "Review the changes and make sure they are as expected. Edit the file again if necessary."

        return CLIResult(output=success_msg)

    async def insert(
        self,
        path: PathLike,
        insert_line: int,
        new_str: str,
        operator: FileOperator = None,
    ) -> CLIResult:
        """Insert text at a specific line in a file."""
        # Read and prepare content
        file_text = (await operator.read_file(path)).expandtabs()
        new_str = new_str.expandtabs()
        file_text_lines = file_text.split("\n")
        n_lines_file = len(file_text_lines)

        # Validate insert_line
        if insert_line < 0 or insert_line > n_lines_file:
            raise ToolError(
                f"Invalid `insert_line` parameter: {insert_line}. It should be within "
                f"the range of lines of the file: {[0, n_lines_file]}"
            )

        # Perform insertion
        new_str_lines = new_str.split("\n")
        new_file_text_lines = (
            file_text_lines[:insert_line]
            + new_str_lines
            + file_text_lines[insert_line:]
        )

        # Create a snippet for preview
        snippet_lines = (
            file_text_lines[max(0, insert_line - SNIPPET_LINES) : insert_line]
            + new_str_lines
            + file_text_lines[insert_line : insert_line + SNIPPET_LINES]
        )

        # Join lines and write to file
        new_file_text = "\n".join(new_file_text_lines)
        snippet = "\n".join(snippet_lines)

        await operator.write_file(path, new_file_text)
        self._file_history[path].append(file_text)

        # Prepare success message
        success_msg = f"The file {path} has been edited. "
        success_msg += self._make_output(
            snippet,
            "a snippet of the edited file",
            max(1, insert_line - SNIPPET_LINES + 1),
        )
        success_msg += "Review the changes and make sure they are as expected (correct indentation, no duplicate lines, etc). Edit the file again if necessary."

        return CLIResult(output=success_msg)

    async def undo_edit(
        self, path: PathLike, operator: FileOperator = None
    ) -> CLIResult:
        """Revert the last edit made to a file."""
        if not self._file_history[path]:
            raise ToolError(f"No edit history found for {path}.")

        old_text = self._file_history[path].pop()
        await operator.write_file(path, old_text)

        return CLIResult(
            output=f"Last edit to {path} undone successfully. {self._make_output(old_text, str(path))}"
        )

    def _make_output(
        self,
        file_content: str,
        file_descriptor: str,
        init_line: int = 1,
        expand_tabs: bool = True,
    ) -> str:
        """Format file content for display with line numbers."""
        file_content = maybe_truncate(file_content)
        if expand_tabs:
            file_content = file_content.expandtabs()

        # Add line numbers to each line
        file_content = "\n".join(
            [
                f"{i + init_line:6}\t{line}"
                for i, line in enumerate(file_content.split("\n"))
            ]
        )

        return (
            f"Here's the result of running `cat -n` on {file_descriptor}:\n"
            + file_content
            + "\n"
        )

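Let me walk through this code in detail. StrReplaceEditor is a file and directory manipulation tool that can view, create, and edit files, optionally inside a sandbox. Here is a step-by-step breakdown: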


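Main components

1. Imports
  • collections.defaultdict: stores the edit history of each file.
  • pathlib.Path: handles file paths.
  • typing: type annotation support.
  • app.config, app.exceptions, app.tool: application config, exceptions, and the base class.
  • app.tool.file_operators: the concrete file operation implementations.
2. Constants and type definitions
  • Command type:
    Command = Literal["view", "create", "str_replace", "insert", "undo_edit"]

    Defines the five supported operations: view, create, string replace, insert, and undo edit.
  • Constants:
    • SNIPPET_LINES = 4: number of context lines shown on each side of an edited snippet.
    • MAX_RESPONSE_LEN = 16000: maximum output length in characters.
    • TRUNCATED_MESSAGE: the notice appended when output is truncated.
3. Tool description
  • _STR_REPLACE_EDITOR_DESCRIPTION: describes what the tool does and what to watch out for:
    • State persists across calls.
    • view: shows file content (with line numbers) or a directory listing.
    • create: creates a new file (it cannot overwrite an existing file).
    • str_replace: replaces a string that occurs exactly once in the file.
    • insert: inserts text after the given line.
    • undo_edit: reverts the last edit.
4. Class definition
class StrReplaceEditor(BaseTool):
  • Inherits from BaseTool and provides file operations.
  • Attributes:
    • name: "str_replace_editor"
    • description: the tool description.
    • parameters: defines the input parameters, their types, and which are required.
    • _file_history: stores the edit history of each file.
    • _local_operator, _sandbox_operator: the local and sandbox file operator instances.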
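
The undo mechanism rests on a very small idea: before each write, push the previous content onto a per-path stack, and pop it to undo. A minimal sketch, with in-memory strings instead of real files and with function names invented for this example:

```python
from collections import defaultdict
from typing import DefaultDict, List

# Maps a path to the stack of earlier versions of that file.
history: DefaultDict[str, List[str]] = defaultdict(list)


def record_edit(path: str, previous_content: str) -> None:
    """Remember what the file looked like before an edit."""
    history[path].append(previous_content)


def undo(path: str) -> str:
    """Return the most recently saved version, or fail if there is none."""
    if not history[path]:
        raise ValueError(f"No edit history found for {path}.")
    return history[path].pop()


record_edit("/tmp/a.txt", "v1")
record_edit("/tmp/a.txt", "v2")
first = undo("/tmp/a.txt")   # most recent version comes back first
second = undo("/tmp/a.txt")
print(first, second)
```

Because defaultdict(list) creates an empty list on first access, a path that was never edited simply has an empty history, which is what the "no edit history" check relies on.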

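Core methods

1. _get_operator
def _get_operator(self) -> FileOperator:
  • Purpose: choose the local or the sandbox file operator based on the configuration.
  • Logic: if the sandbox is enabled in the config, return the SandboxFileOperator; otherwise return the LocalFileOperator.
2. execute
async def execute(self, *, command: Command, path: str, ...) -> str:
  • Purpose: execute the requested file operation command.
  • Parameters:
    • command: the operation type.
    • path: the file or directory path.
    • The other optional parameters depend on the command.
  • Flow:
    • Get the operator.
    • Validate the path and command combination.
    • Dispatch to the method that matches the command.
  • Returns: the operation result as a string.
3. validate_path
async def validate_path(self, command: str, path: Path, operator: FileOperator) -> None:
  • Purpose: validate the path and the command.
  • Checks:
    • The path must be absolute.
    • Every command except create requires the path to exist.
    • Directories only support the view command.
    • The create command requires that the path does not already exist.
4. view
async def view(self, path: PathLike, view_range: Optional[List[int]] = None, operator: FileOperator = None) -> CLIResult:
  • Purpose: view the content of a file or directory.
  • Logic:
    • For a directory, call _view_directory.
    • For a file, call _view_file, which supports a line range.
5. _view_directory
async def _view_directory(path: PathLike, operator: FileOperator) -> CLIResult:
  • Purpose: show directory contents (up to two levels deep, excluding hidden files).
  • Implementation: lists the contents with the find command.
6. _view_file
async def _view_file(self, path: PathLike, operator: FileOperator, view_range: Optional[List[int]] = None) -> CLIResult:
  • Purpose: show file content, optionally restricted to a line range.
  • Logic:
    • Read the file content.
    • If view_range is given, validate it and slice out the matching lines.
    • Return formatted output with line numbers.
7. str_replace
async def str_replace(self, path: PathLike, old_str: str, new_str: Optional[str] = None, operator: FileOperator = None) -> CLIResult:
  • Purpose: replace a string that occurs exactly once in the file.
  • Logic:
    • Read the file content (tabs are expanded first).
    • Check that old_str occurs exactly once; zero or multiple matches raise an error.
    • Perform the replacement and save the original content to the history.
    • Return a preview snippet of the edited region.
8. insert
async def insert(self, path: PathLike, insert_line: int, new_str: str, operator: FileOperator = None) -> CLIResult:
  • Purpose: insert text after the given line.
  • Logic:
    • Validate the insert line number (0 up to the number of lines in the file).
    • Split the file content and insert the new text.
    • Save the history and return a preview snippet.
9. undo_edit
async def undo_edit(self, path: PathLike, operator: FileOperator = None) -> CLIResult:
  • Purpose: revert the last edit.
  • Logic:
    • Pop the previous version from the history.
    • Write it back to the file and return the result.
10. _make_output
def _make_output(self, file_content: str, file_descriptor: str, init_line: int = 1, expand_tabs: bool = True) -> str:
  • Purpose: format file content and add line numbers.
  • Logic: handles truncation and tab expansion, then returns the content with line numbers.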
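
The uniqueness rule of str_replace is the heart of the tool, so here is a stripped-down sketch of it operating on a plain string. replace_unique is a name invented for this example, and it raises ValueError where the real tool raises ToolError.

```python
def replace_unique(content: str, old: str, new: str = "") -> str:
    """Replace `old` with `new` only if `old` occurs exactly once in `content`."""
    occurrences = content.count(old)
    if occurrences == 0:
        raise ValueError(f"No replacement was performed, `{old}` did not appear verbatim.")
    if occurrences > 1:
        # Report the 1-based line numbers of the ambiguous matches.
        lines = [i + 1 for i, line in enumerate(content.split("\n")) if old in line]
        raise ValueError(f"Multiple occurrences of `{old}` in lines {lines}.")
    return content.replace(old, new)


text = "alpha = 1\nbeta = 2\nalpha_beta = 3"
updated = replace_unique(text, "beta = 2", "beta = 20")
print(updated)

try:
    replace_unique(text, "alpha", "x")  # matches lines 1 and 3, so it is rejected
except ValueError as exc:
    print(exc)
```

Refusing ambiguous matches forces the caller (here, the LLM) to include enough surrounding context, which is safer than silently editing the first match.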

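Key features

  1. Sandbox support: switches between local and sandbox operation through the config.
  2. History: supports undoing edits.
  3. Output control: long output is truncated with a notice.
  4. Safety:
    • Strictly validates paths and parameters.
    • Prevents overwriting existing files.
  5. Flexibility:
    • Supports several operations.
    • view can be limited to a line range.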
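
The line-range behavior of view, including the special value -1 for "to the end of the file", can be sketched as follows. This is a simplified version that works on a string and skips truncation; view_lines is a name chosen for this example.

```python
from typing import List, Optional


def view_lines(content: str, view_range: Optional[List[int]] = None) -> str:
    """Return `content` with 1-based line numbers, optionally limited to [start, end]."""
    lines = content.split("\n")
    start = 1
    if view_range:
        start, end = view_range
        if start < 1 or start > len(lines):
            raise ValueError(f"Invalid start line: {start}")
        if end != -1 and (end < start or end > len(lines)):
            raise ValueError(f"Invalid end line: {end}")
        # -1 means "through the last line"; otherwise the range is inclusive.
        lines = lines[start - 1 :] if end == -1 else lines[start - 1 : end]
    # Same numbering format as `cat -n`: right-aligned number, tab, then the line.
    return "\n".join(f"{i + start:6}\t{line}" for i, line in enumerate(lines))


sample = "a\nb\nc\nd"
print(view_lines(sample, [2, 3]))
print(view_lines(sample, [3, -1]))
```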

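Usage example

import asyncio

from app.tool.str_replace_editor import StrReplaceEditor


async def main():
    editor = StrReplaceEditor()

    # Create a file (fails if /tmp/test.txt already exists)
    await editor.execute(command="create", path="/tmp/test.txt", file_text="Hello\nWorld")

    # View the file
    result = await editor.execute(command="view", path="/tmp/test.txt")
    print(result)  # content with line numbers

    # Replace a string
    result = await editor.execute(command="str_replace", path="/tmp/test.txt", old_str="Hello", new_str="Hi")
    print(result)  # snippet after the replacement

    # Insert text after line 1
    result = await editor.execute(command="insert", path="/tmp/test.txt", insert_line=1, new_str="Test")
    print(result)  # snippet after the insertion

    # Undo the last edit (the insertion)
    result = await editor.execute(command="undo_edit", path="/tmp/test.txt")
    print(result)  # the restored content


asyncio.run(main())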
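
To make the insert_line semantics in the example above concrete: the new text goes after the given line, so 0 means "before the first line". A minimal sketch on a plain string, with insert_after_line as a made-up helper name:

```python
def insert_after_line(content: str, insert_line: int, new_str: str) -> str:
    """Insert `new_str` after line number `insert_line` (0 inserts at the top)."""
    lines = content.split("\n")
    if insert_line < 0 or insert_line > len(lines):
        raise ValueError(f"insert_line must be within [0, {len(lines)}]")
    # Slice at the insertion point and splice the new lines in between.
    merged = lines[:insert_line] + new_str.split("\n") + lines[insert_line:]
    return "\n".join(merged)


after_first = insert_after_line("Hi\nWorld", 1, "Test")
at_top = insert_after_line("Hi\nWorld", 0, "Test")
print(after_first)
print(at_top)
```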

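StrReplaceEditor is a capable and safe file manipulation tool that supports viewing, creating, and editing files. Sandbox support, edit history, and strict validation keep operations safe and controllable, which makes it a good fit for automation tasks that need file management. The code is clearly structured, handles errors thoroughly, and is easy to extend.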

Finally, the last component. I am writing this one up because I want to add some localized components of my own.
app/tool/terminate.py

from app.tool.base import BaseTool


_TERMINATE_DESCRIPTION = """Terminate the interaction when the request is met OR if the assistant cannot proceed further with the task.
When you have finished all the tasks, call this tool to end the work."""


class Terminate(BaseTool):
    name: str = "terminate"
    description: str = _TERMINATE_DESCRIPTION
    parameters: dict = {
        "type": "object",
        "properties": {
            "status": {
                "type": "string",
                "description": "The finish status of the interaction.",
                "enum": ["success", "failure"],
            }
        },
        "required": ["status"],
    }

    async def execute(self, status: str) -> str:
        """Finish the current execution"""
        return f"The interaction has been completed with status: {status}"

Let me walk through this code in detail. Terminate is a simple tool class used to end an interaction or a task run. Here is a step-by-step breakdown:


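Main components

1. Imports
from app.tool.base import BaseTool
  • Imports the BaseTool class from the app.tool.base module as the base class of Terminate.
  • BaseTool is presumably an abstract base class that provides the basic structure and methods of a tool.
2. Tool description
_TERMINATE_DESCRIPTION = """Terminate the interaction when the request is met OR if the assistant cannot proceed further with the task.
When you have finished all the tasks, call this tool to end the work."""
  • Defines a constant string that describes what the tool does:
    • Terminate the interaction when the request is met or when the assistant cannot proceed with the task.
    • Call this tool to end the work once all tasks are finished.
3. Class definition
class Terminate(BaseTool):
  • Inheritance: it inherits from BaseTool, which marks it as a tool class.
  • Purpose: provides a way to explicitly end the current interaction or task flow.
4. Class attributes
  • name:
    name: str = "terminate"

    • The tool name, fixed to "terminate".
  • description:
    description: str = _TERMINATE_DESCRIPTION

    • The tool description, which refers to the _TERMINATE_DESCRIPTION defined above.
  • parameters:
    parameters: dict = {
        "type": "object",
        "properties": {
            "status": {
                "type": "string",
                "description": "The finish status of the interaction.",
                "enum": ["success", "failure"],
            }
        },
        "required": ["status"],
    }

    • Defines the tool's input parameters:
      • The type is an object.
      • It has a single property, status, a string that describes the finish status of the interaction.
      • The value of status is restricted to "success" or "failure" (an enum).
      • status is required.
5. The execute method
async def execute(self, status: str) -> str:
    """Finish the current execution"""
    return f"The interaction has been completed with status: {status}"
  • Purpose: perform the termination.
  • Parameters:
    • status: a string giving the termination status ("success" or "failure").
  • Implementation:
    • It is an async method, so it fits into the non-blocking tool interface.
    • It returns a simple string stating that the interaction has completed, together with the status.
  • Returns: a formatted string, for example "The interaction has been completed with status: success".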

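Key features

  1. Simplicity:
    • The code is very short and does one thing only: end the interaction and report the status.
  2. Async support:
    • The method is defined with async def, so it works in environments that need non-blocking calls.
  3. Parameter validation:
    • The parameters definition restricts the allowed values of status in the schema.
  4. Extensibility:
    • It inherits from BaseTool, so it can be extended when needed.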

Usage example

import asyncio

from app.tool.terminate import Terminate


async def main():
    # Create the tool instance
    terminate_tool = Terminate()

    # Terminate with success
    result = await terminate_tool.execute(status="success")
    print(result)  # prints: The interaction has been completed with status: success

    # Terminate with failure
    result = await terminate_tool.execute(status="failure")
    print(result)  # prints: The interaction has been completed with status: failure


asyncio.run(main())

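Use cases

  • Task completion: once the assistant has finished all tasks, it calls this tool to tell the system to end the interaction.
  • Abnormal exit: when a task cannot continue (for example after an unrecoverable error), it uses this tool to report failure and exit.
  • Flow control: it acts as the end signal in complex multi-step workflows.

Summary

Terminate is a lightweight tool class for ending an interaction under specific conditions. Its simple interface and async execution give a flexible flow-control mechanism. The code is clearly structured and has a single, well-defined purpose, which makes it easy to embed in a larger system as the end point of task management. The one thing to keep in mind is that execute itself does not check status: it assumes the caller passes a valid value ("success" or "failure"), so any extra validation has to come from BaseTool or the calling framework.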
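
If such a check were wanted, one option is to validate status against the enum already declared in the schema. The sketch below is only an illustration of that idea: TERMINATE_PARAMETERS copies the schema from the listing, while validate_status is a hypothetical helper that does not exist in OpenManus.

```python
TERMINATE_PARAMETERS = {
    "type": "object",
    "properties": {
        "status": {
            "type": "string",
            "description": "The finish status of the interaction.",
            "enum": ["success", "failure"],
        }
    },
    "required": ["status"],
}


def validate_status(status: str) -> str:
    """Hypothetical helper: reject any status not listed in the schema's enum."""
    allowed = TERMINATE_PARAMETERS["properties"]["status"]["enum"]
    if status not in allowed:
        raise ValueError(f"Invalid status {status!r}; expected one of {allowed}")
    return status


checked = validate_status("success")
print(checked)
```

Reading the allowed values from the schema, rather than repeating them in the check, keeps the two from drifting apart.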
