Implementation steps:
- Get the page source
- Build an element object from the page source
- Run XPath queries against the element object
- Save the data
The target page is shown in the figure:
Start by analyzing the URLs.
The URLs of the first, second, and third pages are:
https://movie.douban.com/top250
https://movie.douban.com/top250?start=25&filter=
https://movie.douban.com/top250?start=50&filter=
The first page is also equivalent to:
https://movie.douban.com/top250?start=0&filter=
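The URLs above differ only in the `start` parameter, which grows by 25 per page. A minimal sketch of generating them follows; the names `BASE_URL`, `PAGE_SIZE`, and `build_page_urls` are illustrative and not part of the original code.

```python
# URL template taken from the page URLs listed above.
BASE_URL = "https://movie.douban.com/top250?start={}&filter="
# Assumption: every page lists 25 movies, as the start=0/25/50 steps suggest.
PAGE_SIZE = 25


def build_page_urls(pages):
    """Return the list-page URLs for the first `pages` pages."""
    return [BASE_URL.format(page * PAGE_SIZE) for page in range(pages)]


if __name__ == "__main__":
    for url in build_page_urls(3):
        print(url)
```

If the list really holds 250 entries, as its name suggests, `build_page_urls(10)` would cover all of it.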
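Page analysis:
Each <li> tag on the page represents one movie,
and all of the target data sits inside the <div class="info"> element of that item.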
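To make the page structure concrete, here is a small sketch that runs relative XPath queries against a hand-written HTML fragment. The fragment is a simplified, hypothetical stand-in for the real page (the `example.com` links, movie names, and the function name `parse_sample` are made up for illustration); only the class names `info`, `hd`, `bd`, `title`, and `quote` come from the XPath expressions used in this post.

```python
from lxml import etree

# Hypothetical, simplified fragment: two movies, the second without a quote.
SAMPLE = """
<ol>
  <li>
    <div class="info">
      <div class="hd">
        <a href="https://example.com/subject/1/"><span class="title">Movie A</span></a>
      </div>
      <div class="bd">
        <p class="quote"><span>A one-line quote.</span></p>
      </div>
    </div>
  </li>
  <li>
    <div class="info">
      <div class="hd">
        <a href="https://example.com/subject/2/"><span class="title">Movie B</span></a>
      </div>
      <div class="bd"></div>
    </div>
  </li>
</ol>
"""


def parse_sample(source):
    """Extract title, link, and quote from every div with class 'info'."""
    root = etree.HTML(source)
    rows = []
    for info in root.xpath("//div[@class='info']"):
        # These paths are relative to the current 'info' div.
        title = info.xpath("div[@class='hd']/a/span[@class='title']/text()")
        link = info.xpath("div[@class='hd']/a/@href")
        quote = info.xpath("div[@class='bd']/p[@class='quote']/span/text()")
        rows.append({
            "title": "".join(title),
            "url": link[0] if link else "",
            # xpath() returns an empty list when nothing matches.
            "quote": quote[0] if quote else "",
        })
    return rows


if __name__ == "__main__":
    for row in parse_sample(SAMPLE):
        print(row)
```

Starting the inner expressions without a leading `//` keeps each query scoped to one movie, so fields from different movies cannot get mixed up.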
1. Import the packages
from lxml import etree
import requests
import csv
2. Define the target URL
doubanUrl = 'https://movie.douban.com/top250?start={}&filter='
3. Fetch the page source
def getSource(url):
    # Send a browser-like User-Agent header with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    return response.text
4. Parse the data
def getEveryItem(source):
    html_element = etree.HTML(source)
    # Each div with class='info' holds a movie's title, rating, quote, and detail-page URL
    movieItemList = html_element.xpath("//div[@class='info']")
    # Empty list that collects one dict per movie
    movieList = []
    for eachMovie in movieItemList:
        # Dict holding the data for a single movie
        movieDict = {}
        # Main title
        title = eachMovie.xpath("div[@class='hd']/a/span[@class='title']/text()")
        # Alternative titles
        otherTitle = eachMovie.xpath("div[@class='hd']/a/span[@class='other']/text()")
        # Rating
        star = eachMovie.xpath("div[@class='bd']/div[@class='star']/span[@class='rating_num']/text()")[0]
        # Detail-page URL
        link = eachMovie.xpath('div[@class="hd"]/a/@href')[0]
        # Quote
        quote = eachMovie.xpath("div[@class='bd']/p[@class='quote']/span/text()")
        # Not every movie has a quote, so guard against an empty result
        if quote:
            quote = quote[0]
        else:
            quote = ''
        # Save the data; the stored title is the main title plus the alternative titles
        movieDict['title'] = ''.join(title + otherTitle)
        movieDict['url'] = link
        movieDict['star'] = star
        movieDict['quote'] = quote
        # print(movieDict)
        movieList.append(movieDict)
    return movieList
5. Save the data
def writeData(movieList):
    with open('douban.csv', 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'star', 'quote', 'url'])