xcx: What is XPath?
shy: XPath (XML Path Language) is a language for locating parts of an XML document. It is based on XML's tree structure and can traverse the nodes of that tree (elements, attributes, and so on).
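As a quick illustration of XPath-style traversal, Python's standard-library xml.etree.ElementTree supports a limited subset of XPath (the XML below is invented sample data, not from the task):

```python
import xml.etree.ElementTree as ET

# A small XML tree -- invented sample data for demonstration
xml_doc = """
<bookstore>
    <book category="web">
        <title>Learning XPath</title>
        <price>29.99</price>
    </book>
    <book category="data">
        <title>Parsing with Python</title>
        <price>39.99</price>
    </book>
</bookstore>
"""

root = ET.fromstring(xml_doc)

# ".//title" selects every <title> element anywhere under the root
titles = [t.text for t in root.findall(".//title")]
print(titles)

# "[@category='web']" filters elements by attribute value
web_books = root.findall(".//book[@category='web']")
print(len(web_books))
```

Note that ElementTree only implements a subset of XPath; the full language (used later with lxml) also offers functions, axes, and more complex predicates.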
hwjw: What is the lxml library?
shy: lxml is a library written for Python, mainly used to parse and extract data in HTML or XML format. With XPath syntax it can quickly locate specific elements or nodes.
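A minimal sketch of lxml's XPath support, assuming lxml is installed (`pip install lxml`); the HTML fragment and its class names are invented for demonstration and are not real JD markup:

```python
from lxml import etree

# Invented HTML fragment for demonstration (not real JD page markup)
html_doc = """
<html><body>
    <ul class="goods-list">
        <li class="goods-item"><em>Laptop A</em><i>4999.00</i></li>
        <li class="goods-item"><em>Laptop B</em><i>6299.00</i></li>
    </ul>
</body></html>
"""

# Parse the HTML text into an element tree
tree = etree.HTML(html_doc)

# XPath: the text of every <em>/<i> inside an <li> with class "goods-item"
names = tree.xpath('//li[@class="goods-item"]/em/text()')
prices = tree.xpath('//li[@class="goods-item"]/i/text()')

for name, price in zip(names, prices):
    print(name, price)
```

Unlike ElementTree, `tree.xpath()` accepts full XPath 1.0 expressions, including `text()` and attribute predicates, and returns a list of matches.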
Description:
Use XPath and the lxml library to crawl, parse, and extract JD laptop data. Base URL: https://list.jd.com/list.html?cat=670%2C671%2C672&page=1
Task 1:
Modify the Tieba data-fetching program so that it can crawl the JD laptop data and print it to the screen.
Hint:
The original program cannot crawl JD data; analysis shows that a "cookie" must be added to the headers, e.g.
headers = {
    "User-Agent": "……",
    "cookie": "……"
}
Code:
import urllib.request
import urllib.parse

def load_page(url):
    '''
    Send a request to the given url and return the server response.
    url: the URL to crawl
    1. Define headers
    2. Build the request
    3. Get the response
    4. Return the content
    '''
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
        "Cookie": "……"  # paste your own Cookie here
    }
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    return response.read().decode("utf-8")

def tieba_spider():
    url = "https://list.jd.com/list.html?cat=670%2C671%2C672&page=1"
    html = load_page(url)
    print(html)

if __name__ == "__main__":
    tieba_spider()
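Once the HTML is fetched, the natural next step for the overall goal is extracting fields with lxml and XPath. A minimal sketch, run here on an invented fragment; the class names "gl-item", "p-name", and "p-price" are assumptions modeled on a typical product-list layout and must be checked against the HTML that load_page() actually returns:

```python
from lxml import etree

def parse_laptops(html):
    """Extract (name, price) pairs from a JD-style list page.

    ASSUMPTION: items are <li class="gl-item"> with nested
    "p-name"/"p-price" divs -- verify against the real page source.
    """
    tree = etree.HTML(html)
    items = tree.xpath('//li[@class="gl-item"]')
    results = []
    for item in items:
        # normalize-space() trims the extra whitespace list pages tend to contain
        name = item.xpath('normalize-space(.//div[@class="p-name"])')
        price = item.xpath('normalize-space(.//div[@class="p-price"])')
        results.append((name, price))
    return results

# Invented sample fragment standing in for a real response
sample = """
<ul>
    <li class="gl-item">
        <div class="p-name">ThinkBook 14</div>
        <div class="p-price">4299.00</div>
    </li>
</ul>
"""
print(parse_laptops(sample))
```

Relative XPath expressions starting with `.` are evaluated from each matched item, which keeps name and price pairs aligned even if some items lack a field.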
Code analysis:
- Import modules: urllib.request and urllib.parse are imported first, for sending HTTP requests and parsing URLs.
- Define the load_page function: it builds a Request carrying custom headers (including the Cookie), sends it with urlopen, and returns the decoded response body.