# Scraping novels from qu.la
This was the first spider project I ever wrote, so the code is fairly crude.

When I wrote it I did not yet know coroutine programming in Python. Coroutines can speed a spider up by at least 5x; see my article [Threads vs. coroutines, with a practical Python spider walkthrough](https://github.com/zhang0peter/python-coroutine).
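For a flavor of what the coroutine approach looks like, here is a minimal sketch using asyncio and aiohttp. This is an illustrative example of the technique, not code from that article or from this project; the URL list is a placeholder.

```python
# Minimal coroutine sketch (illustrative only): download several chapter
# pages concurrently with asyncio + aiohttp instead of one at a time.
import asyncio

import aiohttp


async def fetch(session, url):
    # Each request runs as a coroutine; awaiting the response yields
    # control so other downloads can make progress in the meantime.
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
        resp.raise_for_status()
        return await resp.text()


async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))


if __name__ == "__main__":
    urls = ["https://www.qu.la/book/26974/9765888.html"]  # placeholder list
    pages = asyncio.run(fetch_all(urls))
    print(len(pages), "pages downloaded")
```

Because the requests overlap instead of running back to back, the total time is roughly that of the slowest response rather than the sum of all of them.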
The original single-threaded script:

```python
# -*- coding: utf-8 -*-
"""
Created on Tue Aug 22 11:04:57 2017
@author: zhang
"""
import time

import requests
from bs4 import BeautifulSoup

# URL of the novel's first chapter
url = "https://www.qu.la/book/26974/9765888.html"
# all1 is the number of pages (chapters) to scrape
all1 = 860
# path is the output file name; an absolute path also works
path = "超维术士.txt"

# Split the chapter file name off the URL so the next one can be appended
next_page = url.split('/')[-1]
url0 = url.replace(next_page, "")

time0 = time.time()
header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36"}
a = ""  # accumulated text of all chapters
i = 0   # number of chapters scraped so far


def clean(content):
    """Strip the <div id="content"> wrapper, <br/> tags, full-width
    spaces, and the chaptererror() script that qu.la injects."""
    content = content.replace("<br/>\u3000\u3000<br/>\u3000\u3000\xa0\xa0\xa0\xa0", "\r\n")
    content = content.replace("[<div id=\"content\">\r\n\t\t\t\t\xa0\xa0\xa0\xa0", "\r\n")
    content = content.replace("\t\t\t\t<script>chaptererror();</script>\n</div>]", "")
    content = content.replace("<br/>\u3000\u3000", "\r\n")
    content = content.replace("[<div id=\"content\">", "\r\n")
    content = content.replace("\t\t\t\t<script>chaptererror();</script>\n</br></div>]", "")
    return content


while True:
    url = url0 + next_page
    try:
        r = requests.get(url, timeout=10, headers=header)
        r.raise_for_status()  # treat 4xx/5xx responses as failures too
    except requests.RequestException:
        continue  # retry the same chapter after a timeout or HTTP error
    r.encoding = r.apparent_encoding
    soup = BeautifulSoup(r.text, "html.parser")
    title = str(soup.h1.string)
    a += title
    content = str(soup.select("#content"))
    a += clean(content)
    try:
        # The "next chapter" link holds the file name of the next page
        next_page = soup.find('a', 'next').attrs['href']
    except AttributeError:
        break  # no "next" link means we reached the last chapter
    i += 1
    time1 = time.time()
    print("\rProgress: {0:.2f}%  elapsed: {1:.2f}s".format(i * 100 / all1, time1 - time0), end="")
    if i >= all1:
        break

print("\nScraped {0} chapters in {1:.2f}s".format(all1, time1 - time0))
with open(path, mode='w', encoding='utf-8') as g:
    g.write(a)
```
GitHub repository: https://github.com/zhang0peter/novel-spider/

Feel free to follow me on GitHub: https://github.com/zhang0peter
This post shared a novel-scraping project for the qu.la site, showing how to use Python and the BeautifulSoup library to fetch novel chapters and save them to a text file. It is early work and the code is crude, but it demonstrates the basic spider workflow and HTML parsing.