Python: Scraping the Douban Books Top 250

丶蓝色

Tags: python, web scraping

Article link: https://blog.youkuaiyun.com/lanse_l/article/details/88293372

Target URL: https://book.douban.com/top250?start=0
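The start parameter in the target URL is the offset of the first book on the page, and each page lists 25 books. As a minimal sketch of how the ten page URLs can be generated (page_urls, BASE_URL and PAGE_SIZE are hypothetical names used only for illustration):

```python
BASE_URL = "https://book.douban.com/top250"
PAGE_SIZE = 25  # books per list page


def page_urls(pages=10):
    """Return one URL per list page: start=0, 25, ..., 225 for 10 pages."""
    return ["%s?start=%d" % (BASE_URL, n * PAGE_SIZE) for n in range(pages)]


urls = page_urls()
print(urls[0])   # first page
print(urls[-1])  # last page
```

Building the list up front keeps the offset arithmetic in one place, separate from the code that sends the requests.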

References:

Requests: http://docs.python-requests.org/zh_CN/latest/

BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

Install the modules:

pip3 install beautifulsoup4
pip3 install requests

Import the modules:

import requests
from bs4 import BeautifulSoup

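Add headers to simulate a browser visit:

Some sites check whether a request comes from a browser. If the request carries no browser information, the site refuses access to the crawler.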

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}

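To find your own User-Agent, open the browser's developer tools, switch to the Network tab, select any request, and look for User-Agent under the request headers.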
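To check that the custom header is actually attached, one option is to build the request without sending it and inspect the result. This is only a sketch: it uses requests.Request(...).prepare(), which needs no network access, and passing the start offset through params is an illustrative choice (the article builds the URL with string formatting instead).

```python
import requests

# Same User-Agent string as in the article.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}

# Build the request but do not send it, so the outgoing headers can be inspected offline.
prepared = requests.Request(
    "GET",
    "https://book.douban.com/top250",
    params={"start": 0},
    headers=headers,
).prepare()

print(prepared.url)
print(prepared.headers["User-Agent"])

# Without a custom header, requests identifies itself as "python-requests/<version>",
# which is easy for a site to recognize as a script.
print(requests.utils.default_user_agent())
```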

Scrape the Douban book information:

i = 1   # running rank across all pages
s = ""  # accumulated output text
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
for x in range(0, 10):
    # Each page lists 25 books, so start increases in multiples of 25.
    # timeout=None makes requests wait indefinitely when the network is slow.
    resp = requests.get("https://book.douban.com/top250?start=%d" % (x * 25), timeout=None, headers=headers)
    soup = BeautifulSoup(resp.text, 'html.parser')
    book_names = soup.find_all('div', class_='pl2')  # titles
    authors = soup.find_all('p', 'pl')  # author line
    scores = soup.find_all('span', class_='rating_nums')  # ratings
    introducts = soup.find_all('table', width="100%")  # one table per book; holds the summary
    for book_name, author, score, introduct in zip(book_names, authors, scores, introducts):
        s += "%d.《%s》\n" % (i, book_name.find('a')['title'])
        s += "%s\n" % author.get_text()
        s += "Rating: %s\n" % score.get_text()
        itd = introduct.find('span', class_='inq')
        if itd is not None:  # some books have no summary
            s += "Summary: \"%s\"\n" % itd.get_text()
        else:
            s += "Summary: None\n"
        s += "===========================================================================\n"
        i += 1
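One thing to watch with the loop above: zip pairs four independently collected lists, so if any book on a page lacks one of the fields, every later book on that page is paired with the wrong values. A possible alternative, sketched here under assumptions, is to walk the per-book tables and look up each field inside its own table. SAMPLE_HTML is made-up markup that only mirrors the selectors used in the article, not real Douban output, and parse_books and text_or_none are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# Made-up sample that mirrors the selectors used in the article (assumption).
SAMPLE_HTML = """
<table width="100%"><tr><td>
  <div class="pl2"><a href="#" title="Book A">Book A</a></div>
  <p class="pl">Author A / Publisher A / 2000</p>
  <span class="rating_nums">9.0</span>
  <span class="inq">A one-line summary.</span>
</td></tr></table>
<table width="100%"><tr><td>
  <div class="pl2"><a href="#" title="Book B">Book B</a></div>
  <p class="pl">Author B / Publisher B / 2001</p>
  <span class="rating_nums">8.5</span>
</td></tr></table>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_books(html):
    """Extract one dict per book, searching only inside that book's table."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for table in soup.find_all("table", width="100%"):
        link = table.find("div", class_="pl2").find("a")
        books.append({
            "title": link["title"],
            "author": text_or_none(table.find("p", class_="pl")),
            "score": text_or_none(table.find("span", class_="rating_nums")),
            "summary": text_or_none(table.find("span", class_="inq")),
        })
    return books


books = parse_books(SAMPLE_HTML)
for book in books:
    print(book)
```

Because every lookup is scoped to one table, a missing summary only affects that book's own entry.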

Now that we have scraped the information, save it to a local text file.

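On Windows, a new file opened without an explicit encoding uses the system default, which is gbk on a Chinese-locale system, so Python tries to encode the text as gbk when writing it. The scraped text is already a decoded Unicode string and can contain characters that gbk cannot represent, in which case the write typically fails with a UnicodeEncodeError. The fix is to set the target file's encoding explicitly: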

with open("豆瓣图书TOP250.txt","w",encoding = 'utf-8') as f:
    f.write(s)

This saves the Douban book information to a local file.
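The encoding issue can be reproduced without any scraping. The sketch below uses a made-up sample string containing a non-breaking space (\xa0), a character that often appears in scraped HTML and that the gbk codec cannot encode. It then writes the same string to a temporary file as UTF-8 and reads it back. The file name sample.txt and the variable names are illustrative only.

```python
import os
import tempfile

# Made-up sample text; \xa0 is a non-breaking space.
text = "《示例》\xa0sample\n"

# Encoding to gbk fails because gbk has no mapping for \xa0.
try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# Writing with an explicit UTF-8 encoding works regardless of the system default.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the file back with the same encoding to confirm nothing was lost.
with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

print(gbk_ok, round_trip == text)
```

Reading the file later needs the same encoding='utf-8' argument, otherwise the system default is used again.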