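I read a lot, and friends sometimes ask me to scrape novels for them, so while I have some free time I'm writing down how I put together a simple crawler.
First, import the required modules: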
import requests
from lxml import etree
To install the modules:
# quick install with pip
pip install requests
pip install lxml
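Send a request to the site and fetch the page data
- The address in the red box in the screenshot is the URL of this novel's index page (page address).
- With that URL we can fetch the page data: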
def book():
    url = "https://www.xbiquge.la/7/7194/"
    response = requests.get(url)
    response.encoding = 'utf-8'
    html = etree.HTML(response.text)
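Note:
- To find the page's encoding, open the browser's developer tools and check it there, as in the screenshot below:
How to get the chapter URLs and chapter titles
- The red box in the screenshot marks the chapter links and chapter titles, so both can be extracted from that part of the page.
XPath syntax
- To locate the right node quickly, you can use the Chrome extension XPath Helper.
The screenshot shows XPath Helper in use.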
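If opening the developer tools is inconvenient, the declared encoding can usually be read from the page's own `<meta>` tag. The snippet below is only a rough sketch: `sniff_charset` is a made-up helper name (it is not part of requests or lxml), and it assumes the charset declaration sits within the first 2 KB of the page. requests also exposes `response.apparent_encoding`, which guesses the encoding from the body instead.

```python
import re

# Hypothetical helper, not part of requests or lxml.
# Matches both <meta charset="gbk"> and
# <meta http-equiv="Content-Type" content="text/html; charset=gbk">.
_META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset declared in a <meta> tag, or `default` if none is found."""
    # Assumption: the declaration appears near the top of the document.
    match = _META_CHARSET.search(raw[:2048])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()
```

With requests it could be used as `response.encoding = sniff_charset(response.content)` before reading `response.text`.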
def book():
    url = "https://www.xbiquge.la/7/7194/"
    response = requests.get(url)
    response.encoding = 'utf-8'
    html = etree.HTML(response.text)
    url_list = html.xpath('//div[@id="list"]/dl/dd/a/@href')
    name_list = html.xpath('//div[@id="list"]/dl/dd/a/text()')
Get the chapter text
def book():
    url = "https://www.xbiquge.la/7/7194/"
    response = requests.get(url)
    response.encoding = 'utf-8'
    html = etree.HTML(response.text)
    url_list = html.xpath('//div[@id="list"]/dl/dd/a/@href')
    name_list = html.xpath('//div[@id="list"]/dl/dd/a/text()')
    for ur, na in zip(url_list, name_list):
        res = requests.get(f'https://www.xbiquge.la{ur}')  # request the chapter page and fetch its data
        res.encoding = 'utf-8'
        res_html = etree.HTML(res.text)
        info = res_html.xpath('//div[@id="content"]/text()')
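The loop above glues each href onto the site root and relies on `zip`, which silently stops at the shorter list if the two XPath queries return different numbers of items. As a small sketch of a more defensive version, the hypothetical helper `build_chapter_list` below (the name and the sample hrefs are my own, not from the site) checks the lengths and resolves each href with the standard library's `urljoin`, which handles both root-relative and page-relative links.

```python
from urllib.parse import urljoin


def build_chapter_list(index_url, hrefs, names):
    """Pair each chapter title with an absolute URL.

    Hypothetical helper: `hrefs` and `names` are the two lists returned by
    the XPath queries on the index page.
    """
    if len(hrefs) != len(names):
        # zip() would silently drop the extra items, so fail loudly instead.
        raise ValueError(f"{len(hrefs)} links but {len(names)} titles")
    return [(name.strip(), urljoin(index_url, href))
            for href, name in zip(hrefs, names)]


# Made-up hrefs, only to show how relative links are resolved.
chapters = build_chapter_list(
    "https://www.xbiquge.la/7/7194/",
    ["/7/7194/1.html", "2.html"],
    ["Chapter 1", " Chapter 2 "],
)
```

Each entry in `chapters` is a `(title, url)` tuple, so the download loop can iterate over one list instead of two.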
After that, write the chapter text to a file and the crawler is done.
- The full code:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests
from lxml import etree


def book():
    url = "https://www.xbiquge.la/7/7194/"
    response = requests.get(url)
    response.encoding = 'utf-8'
    html = etree.HTML(response.text)
    url_list = html.xpath('//div[@id="list"]/dl/dd/a/@href')
    name_list = html.xpath('//div[@id="list"]/dl/dd/a/text()')
    # Open with an explicit encoding so the Chinese text is written correctly on any platform
    with open("修真聊天群.txt", 'w', encoding='utf-8') as fp:
        for ur, na in zip(url_list, name_list):
            res = requests.get(f'https://www.xbiquge.la{ur}')  # request the chapter page and fetch its data
            res.encoding = 'utf-8'
            res_html = etree.HTML(res.text)
            info = res_html.xpath('//div[@id="content"]/text()')
            fp.write(f'{na}\n\n')
            print(f'{na}__{ur}')  # show the current chapter title and link
            for i in info:
                # '\xa0' must be a normal string: the raw string r'\xa0' never matches a non-breaking space
                i = i.replace('\xa0', '').replace('\n\n', '\n')  # strip padding and tidy the layout
                if i == '\r':
                    continue
                fp.write(i)  # write the chapter text to the file
            fp.write('\n\n')  # blank lines between chapters


if __name__ == '__main__':
    book()
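The cleanup step can also be pulled out into its own function so it can be checked without touching the network. This is just a sketch under my own assumptions: `clean_paragraphs` is a hypothetical name, and it treats every text fragment as one paragraph. One detail worth noting is that the non-breaking space has to be written as the normal string `'\xa0'`; the raw string `r'\xa0'` is four literal characters and would never match it.

```python
def clean_paragraphs(fragments):
    """Yield non-empty paragraphs with padding whitespace removed.

    Hypothetical helper: `fragments` is the list of strings returned by the
    XPath query on the chapter's content node.
    """
    for fragment in fragments:
        # '\xa0' is the non-breaking space used for paragraph indentation.
        text = fragment.replace("\xa0", "").strip()
        if text:  # skip fragments that were only '\r', '\n' or padding
            yield text


# The raw string is 4 characters long, the real non-breaking space is 1.
raw_len, real_len = len(r"\xa0"), len("\xa0")
```

In the script, the inner loop could then become `for line in clean_paragraphs(info): fp.write(line + '\n')`.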