Python crawler: fetching the full text of a novel

Article link: https://blog.youkuaiyun.com/weixin_40863724/article/details/82991136

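This post presents a crawler written in Python 3.6.3 and run on macOS. It uses the requests and BeautifulSoup libraries to fetch a novel's chapters from a given website and save them to a local file. The program first disables the HTTPS verification warning and sets the encoding of standard output, then parses the index page to collect the chapter links and names, and finally downloads and saves the text of each chapter.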

Python version: 3.6.3. This is not the latest release: TensorFlow did not support Python 3.7 at the time, so I downgraded to 3.6.3.
Operating system: macOS
Date: 2018.10.9

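import io
import sys

import requests
import urllib3
from bs4 import BeautifulSoup


def multipledownload():
    indexlist = []
    indexname = []
    # Fetch the novel's index page. verify=False skips TLS certificate checks.
    response = requests.get('https://www.xs.la/0_5/?wscckey=0e53004a7336372d_1539109494', verify=False)
    soup = BeautifulSoup(response.text, features='lxml')
    # Every chapter link is an <a> tag inside <div id="list">.
    filtered = soup.find('div', id='list')
    for i in filtered.find_all('a'):
        indexlist.append('https://www.xs.la' + i.get('href'))
        # get_text() always returns a str; .string can be None for nested tags.
        indexname.append(i.get_text())
    # Drop the first 14 links, which are not needed.
    neededindexlist = indexlist[14:]
    neededindexname = indexname[14:]
    # Open the output file once, in append mode, rather than once per chapter.
    with open('完美世界-辰东.txt', 'a+', encoding='utf-8') as f:
        for flag, url in enumerate(neededindexlist):
            texthtml = requests.get(url, verify=False)
            texthtml = BeautifulSoup(texthtml.text, features='lxml')
            # find() returns the content tag itself; get_text() drops the markup
            # (including <br/>) that str() on a find_all() result would keep.
            textdoc = texthtml.find('div', id='content')
            bettertext = textdoc.get_text().replace('\t', '').replace('　', '').replace('    ', '')
            f.write(neededindexname[flag] + '\n')
            f.write(bettertext + '\n')
            # Progress as a percentage of the chapters to download.
            print('%.3f' % float(flag * 100 / len(neededindexname)), flush=True)


# Silence the InsecureRequestWarning that verify=False triggers on every request.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# Force UTF-8 on stdout so printing non-ASCII text does not raise UnicodeEncodeError.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
multipledownload()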
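
The chapter-list step can be tried offline before pointing the script at the live site. The sketch below is only an illustration under assumptions: `SAMPLE_INDEX` is a made-up stand-in for the index page (the real markup may differ), `extract_chapters` is a hypothetical helper name, and the built-in `html.parser` backend is used so the example does not need lxml. It uses `urljoin` instead of string concatenation, which also copes with hrefs that are already absolute.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-in for the index page; the real markup may differ.
SAMPLE_INDEX = """
<div id="list">
  <dl>
    <dd><a href="/0_5/1001.html">Chapter 1</a></dd>
    <dd><a href="/0_5/1002.html">Chapter 2</a></dd>
  </dl>
</div>
"""


def extract_chapters(html, base_url):
    """Return (title, absolute_url) pairs for every link inside div#list."""
    soup = BeautifulSoup(html, features='html.parser')
    container = soup.find('div', id='list')
    if container is None:
        # The page has no chapter list (layout changed, error page, ...).
        return []
    return [
        (a.get_text(strip=True), urljoin(base_url, a['href']))
        for a in container.find_all('a', href=True)
    ]


chapters = extract_chapters(SAMPLE_INDEX, 'https://www.xs.la')
for title, url in chapters:
    print(title, url)
```

A `Tag` returned by `find` can be searched directly with `find_all`, so the chapter list needs only one parse of the page.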

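After importing the necessary libraries, I first disable a warning about certificate verification. The warning does not affect execution, but it floods the output, so I turn it off in the main body of the script. Next I set the encoding of standard output to UTF-8, which cuts down on annoying UnicodeError exceptions. Once the request has been sent, BeautifulSoup parses the response. In the HTML, the chapter URLs sit inside the element with id='list', so I look that element up and then collect the a tags inside it, one per chapter. A loop stores the URLs and the chapter names in two lists. The leading entries are not needed, so a slice drops them (the code uses [14:], which removes the first 14 entries). A second loop fetches each chapter page and takes the chapter text from the tag with id='content'. A chain of replace calls strips the stray whitespace, and the result is written to the output file in append mode (a+). The script prints flag as a percentage of the list length to show progress, and flag advances by one for each chapter. With that, the program is ready to run.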
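
The cleanup, append and progress steps described above can be sketched on their own. This is a variation under assumptions, not the script's exact method: `SAMPLE_CHAPTER` is invented markup, the helper names `extract_text`, `append_chapter` and `progress_percent` are hypothetical, and instead of deleting `<br/>` the sketch turns each one into a newline so paragraphs stay separated. It writes to a temporary file rather than the real output file.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical chapter markup: full-width indents, <br/> breaks, a stray tab.
SAMPLE_CHAPTER = (
    '<div id="content">\u3000\u3000First line.<br/><br/>'
    '\u3000\u3000Second line.\t</div>'
)


def extract_text(html):
    """Return the text of div#content with one paragraph per line."""
    soup = BeautifulSoup(html, features='html.parser')
    content = soup.find('div', id='content')
    if content is None:
        return ''
    # Turn every <br/> into a newline instead of simply deleting it.
    for br in content.find_all('br'):
        br.replace_with('\n')
    # str.strip() also removes tabs and full-width (U+3000) spaces.
    lines = (line.strip() for line in content.get_text().splitlines())
    return '\n'.join(line for line in lines if line)


def append_chapter(path, title, text):
    """Append one chapter (title, then body) to the output file."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(title + '\n' + text + '\n\n')


def progress_percent(done, total):
    """Format progress the same way the script does, with three decimals."""
    return '%.3f' % (done * 100 / total)


out_path = os.path.join(tempfile.mkdtemp(), 'novel.txt')
text = extract_text(SAMPLE_CHAPTER)
append_chapter(out_path, 'Chapter 1', text)
print(progress_percent(1, 4))
```

Because the file is opened in append mode, running the download twice adds every chapter a second time, so it may be worth deleting the old output file before a fresh run.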