XPath and lxml in Practice - 优快云 Blog

Article link: https://blog.youkuaiyun.com/qq_36589234/article/details/105719439

1. Learning XPath: extracting content with lxml + XPath.

What is XPath?

XPath uses path expressions to navigate XML documents
XPath includes a standard function library
XPath is a major element of XSLT
XPath is a W3C standard
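
As a quick illustration of the first two points, here is a minimal sketch using lxml. The XML snippet and every name in it (bookstore, book, title, price) are invented for this example and do not come from a real document.

```python
from lxml import etree

# Hypothetical sample document, invented for illustration only.
xml = b"""<bookstore>
  <book category="web"><title>Learning XML</title><price>39.95</price></book>
  <book category="cooking"><title>Everyday Italian</title><price>30.00</price></book>
</bookstore>"""

root = etree.fromstring(xml)

# Path expression: walk from the root down to the text of every <title>.
titles = root.xpath('/bookstore/book/title/text()')

# Predicate on an attribute: only books whose category is "web".
web_titles = root.xpath('//book[@category="web"]/title/text()')

# Standard function library: count() and sum() both return a float in lxml.
book_count = root.xpath('count(//book)')
total_price = root.xpath('sum(//book/price)')

print(titles, web_titles, book_count, total_price)
```

Note that an XPath call in lxml returns a list for node-set expressions and a plain number or string for function calls such as count() or string().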

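(1) XPath nodes

In XPath there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document (root) nodes. An XML document is treated as a tree of nodes. The root of the tree is called the document node or root node.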
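
To make the node types concrete, the sketch below selects five of the seven kinds (element, attribute, text, comment, processing instruction) from a small document. The XML and the names in it (note, lang, render) are made up for this example; namespace and document nodes are left out to keep it short.

```python
from lxml import etree

# Hypothetical document containing several node types at once.
xml = b'<note lang="en"><!-- greeting --><?render mode="fast"?>Hello <b>world</b></note>'
root = etree.fromstring(xml)

elements = root.xpath('//*')                          # element nodes
attrs = root.xpath('/note/@lang')                     # attribute nodes (returned as strings)
texts = root.xpath('/note/text()')                    # text nodes directly under <note>
comments = root.xpath('//comment()')                  # comment nodes
pis = root.xpath('//processing-instruction()')        # processing-instruction nodes

print([e.tag for e in elements])
print(attrs, texts)
print(comments[0].text, pis[0].target)
```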

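2. Using XPath to extract reply content from the DXY (丁香园) forum.

Crawling approach:

Fetch the HTML of the URL
Parse the HTML with lxml
Use XPath expressions to get the user and content
Save the crawled content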

# Import libraries
import re
from lxml import etree
import requests

# Route requests through a local SOCKS5 proxy (needs SOCKS support: pip install requests[socks])
proxies = {
    'http': 'socks5://127.0.0.1:7891/',
    'https': 'socks5://127.0.0.1:7891/',
}
url = "http://www.dxy.cn/bbs/thread/626626#626626"
response = requests.get(url, proxies=proxies)
response.encoding = 'utf-8'
html = response.text
tree = etree.HTML(html)  # parse the HTML with lxml

# Extract the information
users = tree.xpath('//div[@class="auth"]/a/text()')
contents = tree.xpath('//td[@class="postbody"]')
for user, content in zip(users, contents):
    content = content.xpath('string(.)')  # all text inside the <td>, nested tags included
    content = re.sub(r'\s', '', content)  # strip all whitespace
    print(user, ":", content)
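
The listing above prints the results but does not show the fourth step, saving them. Below is a minimal sketch of one way to do that, assuming CSV output. The helpers extract_replies and save_replies, the file name dxy_replies.csv, and the sample HTML are all hypothetical; the sample only mimics the class names used in the XPath expressions (auth, postbody), and the real page's markup may differ. It runs offline, so no proxy or network access is needed.

```python
import csv
import re
from lxml import etree

# Invented HTML that mimics the class names targeted by the XPath expressions.
sample_html = """
<table>
  <tr><td><div class="auth"><a>user_a</a></div></td>
      <td class="postbody">First <b>reply</b><br/> text</td></tr>
  <tr><td><div class="auth"><a>user_b</a></div></td>
      <td class="postbody">Second reply</td></tr>
</table>
"""

def extract_replies(html):
    """Return a list of (user, content) tuples from the page HTML."""
    tree = etree.HTML(html)
    users = tree.xpath('//div[@class="auth"]/a/text()')
    bodies = tree.xpath('//td[@class="postbody"]')
    rows = []
    for user, body in zip(users, bodies):
        text = body.xpath('string(.)')           # concatenated text of the whole cell
        rows.append((user.strip(), re.sub(r'\s', '', text)))
    return rows

def save_replies(rows, path):
    """Write the rows to a UTF-8 CSV file with a header line."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['user', 'content'])
        writer.writerows(rows)

rows = extract_replies(sample_html)
save_replies(rows, 'dxy_replies.csv')  # hypothetical output file name
```

Using string(.) rather than text() matters here: text() would return only the text nodes directly under the cell, while string(.) also picks up text inside nested tags such as <b>.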