pup for Robotics: A Data Collection Tool for Robotics Research Web Pages

Project: pup, "Parsing HTML at the command line". Repository: https://gitcode.com/gh_mirrors/pu/pup

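Struggling with web data collection for your robotics research? Faced with messy HTML, how do you pull out the key information quickly? This article introduces pup, a command-line HTML parsing tool that makes web data collection for robotics research simpler and faster. By the end you will know how to install pup, use its basic features, and apply a few more advanced techniques to common scraping tasks.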

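What is pup

pup is a command-line HTML parser. It reads HTML from standard input, filters it with CSS selectors, and prints the result to standard output. The source is made up of files such as parse.go and selector.go, which implement the HTML parsing and CSS selector logic.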
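
To make the stdin-to-stdout model concrete, here is a minimal sketch that feeds pup an inline snippet instead of a real page. The HTML content and the `html` variable are made up for illustration, and the script assumes pup is already on your PATH (it prints a hint otherwise).

```shell
# Hypothetical sample markup, so the example does not depend on any website.
html='<html><body><h1 id="name">Robot A</h1><p class="spec">6 axes</p></body></html>'

if command -v pup >/dev/null 2>&1; then
  # pup reads the HTML from stdin; the selector picks the <p class="spec">
  # node and text{} prints its text to stdout.
  printf '%s\n' "$html" | pup 'p.spec text{}'
else
  echo "pup is not installed; see the installation section" >&2
fi
```

Because pup is an ordinary filter, it can sit in the middle of any pipeline, between a downloader such as curl and whatever consumes the extracted text.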

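Installing pup

Installing with Go

If you already have a Go environment, you can install pup with go get. On Go 1.18 and later, go get no longer installs binaries, and the usual replacement is go install with a version suffix (for example, go install github.com/ericchiang/pup@latest). The original command is: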

go get github.com/ericchiang/pup

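Dependency information is recorded in the go.mod and go.sum files.

Installing with Homebrew

On macOS you can install pup through Homebrew, with no Go environment required. Recent Homebrew versions no longer accept a formula URL, so if the command below is rejected, brew install pup is the likely replacement: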

brew install https://raw.githubusercontent.com/EricChiang/pup/master/pup.rb

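The Homebrew formula lives in pup.rb.

Direct download

You can also download a prebuilt executable from the project's releases page.

Quick start

Basic usage

The basic invocation format is: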

$ cat index.html | pup [flags] '[selectors] [display function]'

As an example, let's extract the story titles from Hacker News. First fetch the page with curl:

$ curl -s https://news.ycombinator.com/

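The raw HTML is hard to read, so pipe it through pup to keep only the title links (the selector depends on the page's markup and may need adjusting if the site changes):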

$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a'

If you only need the links, add attr{href}:

$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a attr{href}'

To get both the titles and the links, use json{}:

$ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a json{}'
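
For data collection, the JSON output usually needs one more step to become title/link pairs. The sketch below is one way to do that with jq, which is a separate tool and not part of pup. It assumes json{} exposes the link text under a `text` key and the attribute under an `href` key, as in the sample output in the pup README; the inline HTML is a made-up stand-in for a fetched page.

```shell
# Hypothetical sample markup standing in for a downloaded page.
html='<ul>
  <li><a href="/papers/slam">SLAM survey</a></li>
  <li><a href="/papers/grasping">Grasp planning</a></li>
</ul>'

if command -v pup >/dev/null 2>&1 && command -v jq >/dev/null 2>&1; then
  # json{} emits an array of objects, one per matched <a>.
  # jq takes the "text" and "href" keys of each object and prints
  # them as one tab-separated line per link.
  printf '%s\n' "$html" | pup 'li a json{}' | jq -r '.[] | [.text, .href] | @tsv'
else
  echo "this sketch needs both pup and jq on PATH" >&2
fi
```

Tab-separated output is easy to append to a file or load into a spreadsheet; swap `@tsv` for `@csv` if quoted CSV is more convenient.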

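More example commands are listed in tests/cmds.txt.

Core features

Cleaning and indenting HTML

By default, pup fills in missing tags and indents the page properly. The --color flag additionally prints colorized HTML: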

$ cat tests/index.html | pup --color

The HTML file used in these examples is tests/index.html.

Filtering by tag, ID, and attribute

pup supports many CSS selectors. To filter by tag:

$ cat tests/index.html | pup 'title'

To filter by ID:

$ cat tests/index.html | pup 'span#See_also'

To filter by attribute:

$ cat tests/index.html | pup 'th[scope="row"]'

Pseudo-classes

pup implements a number of CSS pseudo-class selectors. For example, :contains("text") selects elements that contain the given text:

$ cat tests/index.html | pup ':contains("History")'

:empty selects elements that have no children:

$ cat tests/index.html | pup 'a[rel]:empty'

:parent-of(selector) selects the parents of elements that match the given selector:

$ cat tests/index.html | pup ':parent-of([action="edit"])'

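The full list is in the "Implemented Selectors" section of the project README.

Combining selectors

pup supports the +, > and , combinators. For example, use , to specify multiple selector groups: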

$ cat tests/index.html | pup 'title, h1 span[dir="auto"]'

Selectors can also be chained: the nodes matched by one selector are passed on to the next selector:

$ cat tests/index.html | pup 'h1#firstHeading span'
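
The examples above cover , and chaining, so here is a small sketch of the other two combinators. It assumes that pup's > and + follow the standard CSS meaning (direct child and adjacent sibling); the markup is invented for the demonstration.

```shell
# Hypothetical sample markup: one <p> directly under the div, one nested deeper.
html='<div id="specs">
  <h2>Arm</h2>
  <p>6 axes</p>
  <section><p>nested note</p></section>
</div>'

if command -v pup >/dev/null 2>&1; then
  # ">" keeps only direct children of div#specs, so the <p> nested
  # inside <section> should not match.
  printf '%s\n' "$html" | pup 'div#specs > p text{}'
  # "+" selects the element that immediately follows a sibling,
  # here the <p> that comes right after the <h2>.
  printf '%s\n' "$html" | pup 'h2 + p text{}'
else
  echo "pup is not installed; see the installation section" >&2
fi
```

Compare this with the chained form `div#specs p`, which matches descendants at any depth and would therefore also pick up the nested paragraph.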

Display functions

text{}

text{} prints all text from the selected nodes and their children, in depth-first order:

$ cat tests/index.html | pup '.mw-headline text{}'

attr{attrkey}

attr{attrkey} prints the values of every attribute with the given key across the selected nodes:

$ cat tests/index.html | pup '.catlinks div attr{id}'
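
When collecting links, the same URL often appears several times on a page. Since attr{} prints plain text, the standard sort utility can deduplicate its output. The following is a minimal sketch on made-up markup, again assuming pup is on your PATH.

```shell
# Hypothetical sample markup with one repeated link.
html='<nav><a href="/robots/arm">Arm</a><a href="/robots/rover">Rover</a><a href="/robots/arm">Arm again</a></nav>'

if command -v pup >/dev/null 2>&1; then
  # attr{href} prints the href value of each matched <a>;
  # sort -u orders the lines and drops duplicates.
  printf '%s\n' "$html" | pup 'a attr{href}' | sort -u
else
  echo "pup is not installed; see the installation section" >&2
fi
```

The resulting list can be redirected to a file and used as the input queue for the next crawling step.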

json{}

json{} prints the selected HTML as JSON:

$ cat tests/index.html | pup 'div#p-namespaces a json{}'

Use the -i/--indent flag to control the indentation level:

$ cat tests/index.html | pup -i 4 'title json{}'

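Advanced usage

Command-line flags

Run pup --help to list all available flags, such as --color for colorized output and -i/--indent for the indentation level of the output.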

Test cases

The project ships its test cases in the tests/ directory. The test commands are in tests/cmds.txt, the expected output is in tests/expected_output.txt, and the tests/run.py script runs them.

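Summary and resources

pup is a capable command-line HTML parser that offers an efficient way to collect web data for robotics research. This article covered installation, basic usage, selectors, and display functions. For more detail, see the project's README.md.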



Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.