Douban Scraping Trilogy: The Movie Rankings

g36k

License: CC 4.0 BY-SA

Category: Python. Tags: Douban scraping, XPath, csv

Article link: https://blog.youkuaiyun.com/a1161510735/article/details/90727017

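This article shows one way to scrape the Douban Movie Top 250 chart with Python, covering each film's title, rank, rating, and one-line quote. It fetches the pages with the requests library, parses them with lxml, locates elements with XPath, and saves the results as a CSV file.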

Douban is probably the site everyone learning to write scrapers practices on, haha.
Let's look at the code first.

import csv

import requests
from lxml import etree

# utf-8-sig writes a BOM so that Excel shows the Chinese text correctly
f = open("where you want to save the file", 'wt', newline='', encoding='utf-8-sig')
write = csv.writer(f)
write.writerow(('name', 'rank', 'pf', 'pj'))  # pf = rating, pj = quote

# Ten pages of 25 films each: start=0, 25, ..., 225
urls = ['https://movie.douban.com/top250?start={}&filter='.format(str(i)) for i in range(0, 250, 25)]
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}


def first(nodes):
    # xpath() returns a list; take the first match, or '' if nothing matched
    return nodes[0].strip() if nodes else ''


for url in urls:
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    infos = selector.xpath('//div[@class="item"]')  # one div per film
    for info in infos:
        # All paths below are relative to the current film's div
        name = first(info.xpath('div[@class="info"]/div[@class="hd"]/a/span[1][@class="title" and 1]/text()'))
        rank = first(info.xpath('div[@class="pic"]/em/text()'))
        pf = first(info.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()'))
        pj = first(info.xpath('div[@class="info"]/div[@class="bd"]/p[@class="quote"]/span/text()'))
        write.writerow((name, rank, pf, pj))

f.close()

First we import the requests library, then lxml, which we need for parsing (this post uses its XPath support), and finally csv to save the results.
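How it works:
Open the Douban movie chart at https://movie.douban.com/top250, find the content you want to scrape, and press F12.
Here I want each film's title, rank, rating, and quote.
Everything we need sits inside div[@class="item"], so we build infos = selector.xpath('//div[@class="item"]')
and run every later query relative to each of those nodes.
For the title, right-click the element, choose Copy, then Copy XPath. This gives //*[@id="content"]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]
That absolute path points at a single film (li[1]), so it needs rewriting relative to info: name = info.xpath('div[@class="info"]/div[@class="hd"]/a/span[1][@class="title" and 1]/text()'). The titles all carry class="title"; the trailing "and 1" is always true and can be dropped.
Rank: //*[@id="content"]/div/div[1]/ol/li[3]/div/div[1]/em becomes rank = info.xpath('div[@class="pic"]/em/text()')
Rating: //*[@id="content"]/div/div[1]/ol/li[3]/div/div[2]/div[2]/div/span[2]
becomes pf = info.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()')
Quote: //*[@id="content"]/div/div[1]/ol/li[3]/div/div[2]/div[2]/p[2]/span
becomes pj = info.xpath('div[@class="info"]/div[@class="bd"]/p[@class="quote"]/span/text()'). Each xpath() call returns a list of the matching text nodes. Finally, write.writerow((name, rank, pf, pj)) writes one row per film.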
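The relative XPath expressions above can be tried offline before hitting the real site. The sketch below is only an illustration: the HTML string is a hand-written, simplified stand-in for the page structure, and the names `SAMPLE_HTML`, `first`, and `parse_items` are made up for this example. It also shows why the list returned by `xpath()` is worth unwrapping: a film with no quote yields an empty list, not a string.

```python
from lxml import etree

# Hand-written stand-in for two entries of the chart (assumed, simplified
# structure; the second entry deliberately has no quote paragraph).
SAMPLE_HTML = """
<ol>
  <li><div class="item">
    <div class="pic"><em>1</em></div>
    <div class="info">
      <div class="hd"><a href="#"><span class="title">Film A</span><span class="title">Alias</span></a></div>
      <div class="bd">
        <p>director and cast</p>
        <div class="star"><span class="rating_num">9.7</span></div>
        <p class="quote"><span>A short quote.</span></p>
      </div>
    </div>
  </div></li>
  <li><div class="item">
    <div class="pic"><em>2</em></div>
    <div class="info">
      <div class="hd"><a href="#"><span class="title">Film B</span></a></div>
      <div class="bd">
        <p>director and cast</p>
        <div class="star"><span class="rating_num">9.6</span></div>
      </div>
    </div>
  </div></li>
</ol>
"""


def first(nodes):
    """Return the first XPath match as a stripped string, or '' if none."""
    return nodes[0].strip() if nodes else ''


def parse_items(html_text):
    """Extract (name, rank, pf, pj) tuples, one per div[@class="item"]."""
    selector = etree.HTML(html_text)
    rows = []
    for info in selector.xpath('//div[@class="item"]'):
        # Same relative paths as in the article, evaluated against each item
        name = first(info.xpath('div[@class="info"]/div[@class="hd"]/a/span[1][@class="title" and 1]/text()'))
        rank = first(info.xpath('div[@class="pic"]/em/text()'))
        pf = first(info.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()'))
        pj = first(info.xpath('div[@class="info"]/div[@class="bd"]/p[@class="quote"]/span/text()'))
        rows.append((name, rank, pf, pj))
    return rows


rows = parse_items(SAMPLE_HTML)
for row in rows:
    print(row)
```

If the real page's markup differs from this stand-in, only `SAMPLE_HTML` needs updating; the queries stay the same.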
Preparation
At https://movie.douban.com/top250 each page lists only 25 films.
Page 1: https://movie.douban.com/top250
Page 2: https://movie.douban.com/top250?start=25&filter=
Only the number after start changes, so we fill it in with format() and loop over the values with for to reach every page:

urls = ['https://movie.douban.com/top250?start={}&filter='.format(str(i)) for i in range(0,250,25)]
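As a quick sanity check on the pagination, here is a minimal sketch that builds the page URLs and prints the first and last. The constant and function names (`BASE_URL`, `PAGE_SIZE`, `TOTAL`, `build_page_urls`) are made up for illustration; the URL pattern and the 25-per-page step come from the pages listed above.

```python
BASE_URL = 'https://movie.douban.com/top250?start={}&filter='
PAGE_SIZE = 25   # films per page, as seen on the site
TOTAL = 250      # size of the chart


def build_page_urls(total=TOTAL, page_size=PAGE_SIZE):
    """Return one URL per page, with start = 0, 25, 50, ..."""
    return [BASE_URL.format(start) for start in range(0, total, page_size)]


urls = build_page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

`format()` converts the integer itself, so the `str(i)` call in the original one-liner is harmless but not required.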

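Next come the headers. I won't go into how to get them because it's easy: the User-Agent can be copied from any request shown in the Network panel of the F12 developer tools.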
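Saving to CSV
f = open("where you want to save the file", 'wt', newline='', encoding='utf-8-sig') creates the CSV file and opens it for writing in text mode. newline='' is what the csv module expects, and utf-8-sig adds a BOM so that Excel shows the Chinese text correctly.
Then write = csv.writer(f) wraps the file in a CSV writer, and write.writerow(('name', 'rank', 'pf', 'pj')) writes the header row for title, rank, rating, and quote. The last step is to close the file:

f.close()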
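One possible variation, not what the code in this post does: opening the file in a `with` block closes it automatically, even if an exception interrupts the loop. The sketch below uses made-up rows and a hypothetical temporary path (`out_path`) so it can run without touching the network, then reads the file back to confirm what was written.

```python
import csv
import os
import tempfile

# Made-up rows for illustration; real rows would come from the scraping loop.
rows = [
    ('Film A', '1', '9.7', 'A short quote.'),
    ('Film B', '2', '9.6', ''),
]

# Hypothetical output location inside a fresh temporary directory
out_path = os.path.join(tempfile.mkdtemp(), 'top250_demo.csv')

# The with block closes the file on exit, so no explicit f.close() is needed
with open(out_path, 'wt', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(('name', 'rank', 'pf', 'pj'))  # header row
    writer.writerows(rows)                          # one line per film

# Read the file back; utf-8-sig strips the BOM again on input
with open(out_path, newline='', encoding='utf-8-sig') as f:
    saved = list(csv.reader(f))

print(saved)
```

Note that csv.reader returns every field as a string, so the rank and rating come back as text.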

If you spot any problems, corrections are welcome.
Hahaha