SeeClick Open-Source Project: Installation and Usage Guide

Article link: https://blog.youkuaiyun.com/gitblog_01147/article/details/141238929


SeeClick: The model, data and code for the visual GUI Agent SeeClick. Project repository: https://gitcode.com/gh_mirrors/se/SeeClick

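1. Project Introduction

SeeClick is a visual GUI agent model that can understand and operate interface elements in a range of environments, including iOS, Android, macOS, Windows, and the web. The project uses deep learning to locate GUI elements and interact with them, and is an innovative attempt to apply natural language processing (NLP) to UI automation.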

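Main function: Automatically locates and clicks on-screen elements based on human instructions or descriptions.
Compatibility: Works across platforms and operating environments.
Technical innovation: Combines image understanding, natural language processing, and AI algorithms.
Use cases: UI test automation, development of accessibility tools, and similar tasks.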

Technical highlights:

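Pre-training: The model is pre-trained on a large collection of GUI data, which improves its ability to generalize.
Inference efficiency: Inference speed is optimized while accuracy stays high, which makes the model suitable for real-time use.
Flexibility: The model can be fine-tuned to fit the needs of a specific scenario.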

2. Quick Start

To start using the SeeClick model, first make sure your environment meets the following requirements:

Python >= 3.7
PyTorch and the other dependencies (see requirements.txt)

Install the dependencies:

pip install -r requirements.txt
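After installation, a quick check like the one below can confirm that the interpreter and the key packages are in place. This is only a minimal sketch: `check_environment` is a hypothetical helper, and the package names `torch` and `transformers` are assumed from the requirements and imports used in this guide rather than read from requirements.txt.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 7), packages=("torch", "transformers")):
    """Return a dict mapping each requirement to True (met) or False (not met).

    Hypothetical helper: the default package list is an assumption,
    not the actual content of requirements.txt.
    """
    report = {"python": sys.version_info[:2] >= tuple(min_python)}
    for name in packages:
        # find_spec returns None when a top-level package cannot be found
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for requirement, ok in check_environment().items():
        print(f"{requirement}: {'OK' if ok else 'MISSING'}")
```

Because the check only looks packages up without importing them, it is fast and does not need a GPU.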

Environment setup and model loading:

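from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# The tokenizer is loaded from the Qwen-VL-Chat repository
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
# "SeeClick-ckpt-dir" is a placeholder: replace it with the path to your SeeClick checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "SeeClick-ckpt-dir",
    device_map="cuda",
    trust_remote_code=True,
    bf16=True
).eval()
# Attach the generation config to the model so that model.chat() picks it up
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)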

Example code:

Suppose you have a UI screenshot named test_img.png. You can use the following Python script to get the position of an element:

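img_path = "assets/test_img.png"
ref = "add an event"  # example instruction; replace it with your own
prompt = "In this UI screenshot, what is the position of the element corresponding to the command \"{}\" (with point)?"
# Qwen-VL-style models take the image and the text as one list-formatted query
query = tokenizer.from_list_format([
    {"image": img_path},  # a local path or a URL
    {"text": prompt.format(ref)},
])
response, history = model.chat(tokenizer, query=query, history=None)  # returns (response, history)
print(response)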

Here, ref should be replaced with the specific instruction you want to query.
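To act on the result, the model's text answer usually has to be turned into pixel coordinates. The sketch below assumes the response is a string such as "(0.17,0.06)" whose two numbers are x and y ratios between 0 and 1 relative to the screenshot size. That format is an assumption and is not confirmed by this guide, and `parse_point` is a hypothetical helper, so check a few real responses before relying on it.

```python
import re


def parse_point(response, width, height):
    """Convert a response like "(0.17,0.06)" into (x, y) pixel coordinates.

    Assumption: the first two numbers in the response are the x and y
    position as ratios (0 to 1) of the image width and height.
    """
    numbers = re.findall(r"\d+(?:\.\d+)?", response)
    if len(numbers) < 2:
        raise ValueError(f"no point found in response: {response!r}")
    x_rel, y_rel = float(numbers[0]), float(numbers[1])
    # Scale the ratios by the screenshot size and round to whole pixels
    return round(x_rel * width), round(y_rel * height)


if __name__ == "__main__":
    # A 1000x500 screenshot is used here purely for illustration
    print(parse_point("(0.17,0.06)", 1000, 500))
```

The screenshot size can be read with any image library, for example Pillow's `Image.open(img_path).size`, which returns a (width, height) tuple.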