让浏览器为你工作：探索AI驱动的自动化工具-优快云博客

本文链接：https://blog.youkuaiyun.com/weixin_37786060/article/details/144279682

0、相关文档

https://github.com/gregpr07/browser-use

1、安装

Python3.11版本以上

bashpip install browser-use -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install lxml_html_clean -i https://pypi.tuna.tsinghua.edu.cn/simple

playwright install

2、使用

使用

browser=Browser(BrowserConfig(headless=False))

可以让他不使用无头浏览器的方式打开，可以看到操作过程

py#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Author   : tingbai
# @Time     : 2024/12/2 16:45
# @File     : demo.py
# @Project  : PyCharm
from langchain_openai import ChatOpenAI
from browser_use import Agent
import asyncio
from browser_use.browser.browser import Browser, BrowserConfig

llm = ChatOpenAI(model="gpt-4o", base_url="xxxx/v1", api_key="sk-xxxx")


async def main():
    agent = Agent(
        browser=Browser(BrowserConfig(headless=False)),
        task="""
        0、打开酷家乐(http://zhuzhan.feat.qunhequnhe.com/uic/trust/login?userId=xxxx)
        1、打开群核商城
        2、购买钻石VIP
        """,
        llm=llm
    )
    result = await agent.run()
    print(result)


if __name__ == "__main__":
    asyncio.run(main())

3、内部原理

3.1、提示词

断点到第一步执行传入的提示词，分为3部分

提示词

1、内置的提示词

trueYou are an AI agent that helps users interact with websites. You receive a list of interactive elements from the current webpage and must respond with specific actions. Today's date is 2024-12-04 18:48.

INPUT FORMAT:

Example:
33[:]    <button>Interactive element</button>
_[:] Text content...

Explanation:
index[:] Interactible element with index. You can only interact with all elements which are clickable and refer to them by their index.
_[:] elements are just for more context, but not interactable.
    : Tab indent (1 tab for depth 1 etc.). This is to help you understand which elements belong to each other.


You have to respond in the following RESPONSE FORMAT: 

{{
    "current_state": {{
        "valuation_previous_goal": "String starting with "Success", "Failed:" or "Unknown" to evaluate if the previous next_goal is achieved. If failed or unknown describe why.",
        "memory": "Your memory with things you need to remeber until the end of the task for the user. You can also store overall progress in a bigger task. You have access to this in the next steps.",
        "next_goal": "String describing the next immediate goal which can be achieved with one action"
    }},
    "action": {{
        // EXACTLY ONE of the following available actions must be specified
    }}
}}

Your AVAILABLE ACTIONS:
Search Google: 
{search_google: {'query': {'type': 'string'}}}
Navigate to URL: 
{go_to_url: {'url': {'type': 'string'}}}
Go back: 
{go_back: {}}
Click element: 
{click_element: {'index': {'type': 'integer'}, 'num_clicks': {'default': 1, 'type': 'integer'}, 'xpath': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None}}}
Input text: 
{input_text: {'index': {'type': 'integer'}, 'text': {'type': 'string'}, 'xpath': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None}}}
Switch tab: 
{switch_tab: {'page_id': {'type': 'integer'}}}
Open new tab: 
{open_tab: {'url': {'type': 'string'}}}
Extract page content to get the text or markdown : 
{extract_content: {'value': {'default': 'text', 'enum': ['text', 'markdown', 'html'], 'type': 'string'}}}
Complete task: 
{done: {'text': {'type': 'string'}}}
Scroll down the page by pixel amount - if no amount is specified, scroll down one page: 
{scroll_down: {'amount': {'anyOf': [{'type': 'integer'}, {'type': 'null'}], 'default': None}}}
Scroll up the page by pixel amount - if no amount is specified, scroll up one page: 
{scroll_up: {'amount': {'anyOf': [{'type': 'integer'}, {'type': 'null'}], 'default': None}}}

Example:
{"current_state": {"valuation_previous_goal": "Success", "memory": "We applied already for 3/7 jobs, 1. ..., 2. ..., 3. ...", "next_goal": "Click on the button x to apply for the next job"}, "action": {"click_element": {"index": 44,"num_clicks": 2}}}

IMPORTANT RULES:

1. Only use indexes that exist in the input list for click or input text actions. If no indexes exist, try alternative actions, e.g. go back, search google etc.
2. If stuck, try alternative approaches, e.g. go back, search google, or extract_page_content
3. When you are done with the complete task, use the done action. Make sure to have all information the user needs and return the result.
4. If an image is provided, use it to understand the context, the bounding boxes around the buttons have the same indexes as the interactive elements.
6. ALWAYS respond in the RESPONSE FORMAT with valid JSON.
7. If the page is empty use actions like "go_to_url", "search_google" or "open_tab"
8. Remember: Choose EXACTLY ONE action per response. Invalid combinations or multiple actions will be rejected.
9. If popups like cookies appear, accept or close them
10. Call 'done' when you are done with the task - dont hallucinate or make up actions which the user did not ask for

2、task中传入的内容

task提示词

3、当前页面截图的文本内容和base64位转换后的截图

截图提示词

其中获取页面截图和元素的过程在

browser_use.browser.context.BrowserContext._update_state中

async def _update_state(self, use_vision: bool = False) -> BrowserState:
        """Update and return state."""
        await self.remove_highlights()
        page = await self.get_current_page()
        dom_service = DomService(page)
        content = await dom_service.get_clickable_elements()  # Assuming this is async

        screenshot_b64 = None
        if use_vision:
            screenshot_b64 = await self.take_screenshot()

        self.current_state = BrowserState(
            element_tree=content.element_tree,
            selector_map=content.selector_map,
            url=page.url,
            title=await page.title(),
            tabs=await self.get_tabs_info(),
            screenshot=screenshot_b64,
        )

        return self.current_state

执行

content = await dom_service.get_clickable_elements()

后可以看到页面高亮出可操作元素

如图所示，元素可点击的内容通过索引1,2,3,4来标记，在后续点击操作的时候使用

高亮截图

然后进行截图，并移除高亮

3.2、处理返回结果并执行操作

调用接口/chat/completions将内容传递给openapi

从返回结果的

response.choices[0].message.tool_calls[0].function.arguments

中可以拿到

'{"current_state":{"valuation_previous_goal":"Unknown","memory":"","next_goal":"Open the URL for 酷家乐."},"action":{"go_to_url":{"url":"http://zhuzhan.feat.qunhequnhe.com/uic/trust/login?userId=1111269472"}}}'

操作1

json.loads(response.choices[0].message.tool_calls[0].function.arguments)['action']

操作2

操作3

操作4

它会根据返回的内容执行操作

这个参照上面提示词中的【Your AVAILABLE ACTIONS】

在执行

result = await self.controller.act(model_output.action, self.browser_context)

的时候会根据关键词找到

site-packages/browser_use/controller/service.py

中的具体方法进行操作

例如

Navigate to URL: {go_to_url: {'url': {'type': 'string'}}} 对应的go_to_url方法

@self.registry.action('Navigate to URL', param_model=GoToUrlAction, requires_browser=True)
async def go_to_url(params: GoToUrlAction, browser: BrowserContext):
    page = await browser.get_current_page()
    await page.goto(params.url)
    await page.wait_for_load_state()

最后完成操作后将当前页面的内容和截图放入到下一步传给openapi在进行循环直至结束

操作结果

3.3、自定义操作

如果需要有自己的特殊操作可以修改提示词+增加一个方法来用自定义

例如：

class TestUserId(BaseModel):
    userId: int


controller = Controller()


@controller.action('登录测试账号', param_model=TestUserId, requires_browser=True)
async def open_dev(params: TestUserId, browser: BrowserContext):
    page = await browser.get_current_page()
    await page.goto(f"http://zhuzhan.feat.qunhequnhe.com/uic/trust/login?{params}")
    await page.wait_for_load_state()

再把controller传给Agent

agent = Agent(
        browser=Browser(BrowserConfig(headless=False)),
        task="""
        ...
        """,
        llm=llm,
        controller=controller
    )

执行截图