shy: Three articles in the Data Crawling and Visualization Techniques series have been published so far; for more crawler techniques, see the column articles.
Data Crawling and Visualization Techniques: Crawling, Parsing, and Extracting Data with XPath and the lxml Library
shy: Four columns are now open: C++, ACM, Introduction to Database Systems, and Data Crawling and Visualization Techniques. For more articles, visit my homepage @HUAYI_SUN Blog.
Problem description:
Crawl data from the Sina stock forum (guba), whose URLs follow the format https://guba.sina.com.cn/?s=bar&name=sh000300&type=0&page=1. Analyze this URL format and pay attention to the response encoding. Add a data-extraction step that reports each post's title, author, and publication time, wraps these three attributes into an item object, and collects each page's items into an items list. The stock code, start page, and end page should be entered by the user, and each crawled page should be saved as an HTML file.
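To make the URL pattern concrete, here is a minimal sketch of how one page URL could be assembled from a user-supplied stock code and page number. The helper name build_url is illustrative and not part of the reference code below; the example values simply mirror the sample URL above.

from urllib.parse import urlencode

def build_url(stock_code, page):
    # Only 'name' (the stock code) and 'page' vary; the other parameters are fixed.
    params = {"s": "bar", "name": stock_code, "type": 0, "page": page}
    return "https://guba.sina.com.cn/?" + urlencode(params)

# build_url("sh000300", 1)
# -> 'https://guba.sina.com.cn/?s=bar&name=sh000300&type=0&page=1'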
Code:
import urllib.parse
import urllib.request
# from requests import get, post
from lxml import etree
import time
def load_page(url, filename):
    # Send the request with a browser User-Agent (and a Cookie) so the page is served normally.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
"Cookie": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    }  # insert your own Cookie here
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    # The Sina guba pages are GBK-encoded, so decode the response with 'gbk'.
    return response.read().decode('gbk')
def parse_page(html):
    # XPath for each field (row index 2 is the first data row):
    # title:  //*[@id="blk_list_02"]/table/tbody/tr[2]/td[3]/a/text()
    # author: //*[@id="blk_list_02"]/table/tbody/tr[2]/td[4]/div/a/text()
    # time:   //*[@id="blk_list_02"]/table/tbody/tr[2]/td[5]/text()
    root = etree.HTML(html)
    lis = root.xpath('//*[@id="blk_list_02"]/table/tbody/tr')
    # print(len(lis))
    items = []
    for i in range(2, len(lis) + 1):
        name = root