Development tools
- Python version: python-3.8.1-amd64
- Python IDE: JetBrains PyCharm 2018.3.6 x64
- Install the requests library (pointing pip at the Aliyun mirror makes the download much faster)
pip install requests -i https://mirrors.aliyun.com/pypi/simple/
- Douban Movie Top 250 page: https://movie.douban.com/top250
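Goal
Extract the title, release year, rating, and number of ratings for each movie in the Douban Movie Top 250.
Page analysis
In the browser, right-click the page and choose "View Page Source".
The page is rendered on the server, so the movie data is already present in the HTML source and can be scraped from it directly.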
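One quick way to confirm that the data sits in the raw source, rather than being filled in later by JavaScript, is to look for a known marker in the HTML text. The sketch below is only an illustration: the helper name `contains_marker` and both sample strings are made up for the example, and the marker is the `class="grid_view"` attribute from the list shown in this article.

```python
def contains_marker(html: str, marker: str = 'class="grid_view"') -> bool:
    """Return True if the raw HTML source already contains the marker."""
    return marker in html


# Hypothetical server-rendered source: the movie list is in the HTML itself.
static_source = '<ol class="grid_view"><li><span class="title">肖申克的救赎</span></li></ol>'

# Hypothetical client-rendered source: only an empty container and a script tag.
script_only_source = '<div id="app"></div><script src="bundle.js"></script>'

print(contains_marker(static_source))
print(contains_marker(script_only_source))
```

If the marker were missing from the source, the data would most likely be loaded by a separate request, and scraping the page HTML alone would not be enough.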
<ol class="grid_view">
<li>
<div class="item">
<div class="pic">
<em class="">1</em>
<a href="https://movie.douban.com/subject/1292052/">
<img width="100" alt="肖申克的救赎" src="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p480747492.webp" class="">
</a>
</div>
<div class="info">
<div class="hd">
<a href="https://movie.douban.com/subject/1292052/" class="">
<span class="title">肖申克的救赎</span>
<span class="title"> / The Shawshank Redemption</span>
<span class="other"> / 月黑高飞(港) / 刺激1995(台)</span>
</a>
<span class="playable">[可播放]</span>
</div>
<div class="bd">
<p class="">
导演: 弗兰克·德拉邦特 Frank Darabont 主演: 蒂姆·罗宾斯 Tim Robbins /...<br>
1994 / 美国 / 犯罪 剧情
</p>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.7</span>
<span property="v:best" content="10.0"></span>
<span>2499740人评价</span>
</div>
<p class="quote">
<span class="inq">希望让人自由。</span>
</p>
</div>
</div>
</div>
</li>
<li>
<div class="item">
<div class="pic">
<em class="">2</em>
<a href="https://movie.douban.com/subject/1291546/">
<img width="100" alt="霸王别姬" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2561716440.webp" class="">
</a>
</div>
<div class="info">
<div class="hd">
<a href="https://movie.douban.com/subject/1291546/" class="">
<span class="title">霸王别姬</span>
<span class="other"> / 再见,我的妾 / Farewell My Concubine</span>
</a>
<span class="playable">[可播放]</span>
</div>
<div class="bd">
<p class="">
导演: 陈凯歌 Kaige Chen 主演: 张国荣 Leslie Cheung / 张丰毅 Fengyi Zha...<br>
1993 / 中国大陆 中国香港 / 剧情 爱情 同性
</p>
<div class="star">
<span class="rating5-t"></span>
<span class="rating_num" property="v:average">9.6</span>
<span property="v:best" content="10.0"></span>
<span>1858334人评价</span>
</div>
<p class="quote">
<span class="inq">风华绝代。</span>
</p>
</div>
</div>
</div>
</li>
Regex analysis
The pattern relies mostly on lazy matching, .*?, and on named groups, (?P<name>.*?), which give each captured value a name.
r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)'
r'</span>.*?<p class="">.*?<br>(?P<year>.*?) .*?'
r'<span class="rating_num" property="v:average">(?P<rating_num>.*?)</span>.*?'
r'<span>(?P<evaluationsNum>.*?)人评价</span>'
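To see what each named group captures, the pattern can be run against a trimmed copy of the first list item shown earlier. This is a minimal sketch: `PATTERN`, `SAMPLE_HTML`, and `record` are names chosen for the example, and the sample is shortened by hand, so it is not the exact page source.

```python
import re

PATTERN = re.compile(
    r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)'
    r'</span>.*?<p class="">.*?<br>(?P<year>.*?) .*?'
    r'<span class="rating_num" property="v:average">(?P<rating_num>.*?)</span>.*?'
    r'<span>(?P<evaluationsNum>.*?)人评价</span>',
    re.S,  # let "." match newlines, since one item spans many lines
)

# Hand-trimmed sample based on the first <li> shown in this article.
SAMPLE_HTML = """<li>
<div class="item">
<div class="hd">
<span class="title">肖申克的救赎</span>
<span class="title"> / The Shawshank Redemption</span>
</div>
<div class="bd">
<p class="">
导演: 弗兰克·德拉邦特 Frank Darabont<br>
1994 / 美国 / 犯罪 剧情
</p>
<div class="star">
<span class="rating_num" property="v:average">9.7</span>
<span property="v:best" content="10.0"></span>
<span>2499740人评价</span>
</div>
</div>
</div>
</li>"""

match = PATTERN.search(SAMPLE_HTML)
# The year group starts right after <br>, so it includes a newline; strip it.
record = {key: value.strip() for key, value in match.groupdict().items()}
print(record)
```

Because the pattern is lazy, `name` stops at the first `</span>`, which is why only the Chinese title is captured and not the English one. Note that the `year` group ends at the first space after `<br>`, so if the line after `<br>` were indented with spaces, the group would come back empty; placing `\s*` before the group would be one way to tolerate that.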
Code example
import csv
import json
import re
import time

import requests
from requests.exceptions import RequestException


def get_one_page(url):
    # Send a browser-like User-Agent and use a timeout so a request cannot hang forever.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
        "Connection": "close",
    }
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    pattern = re.compile(r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)'
                         r'</span>.*?<p class="">.*?<br>(?P<year>.*?) .*?'
                         r'<span class="rating_num" property="v:average">(?P<rating_num>.*?)</span>.*?'
                         r'<span>(?P<evaluationsNum>.*?)人评价</span>', re.S)
    for it in pattern.finditer(html):
        yield {
            'name': it.group("name"),  # movie title
            'year': it.group("year").strip(),  # release year
            'rating_num': it.group("rating_num"),  # rating
            'evaluationsNum': it.group("evaluationsNum"),  # number of ratings
        }


def write_to_txt(content):
    # Append one JSON object per line.
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + "\n")


def write_to_csv(content):
    # newline='' keeps the csv module from writing blank lines on Windows;
    # the with block closes the file after each row.
    with open('douban.csv', mode='a', encoding='utf-8', newline='') as f:
        csvwriter = csv.writer(f)
        csvwriter.writerow(content.values())


def main(offset):
    url = 'https://movie.douban.com/top250?start=' + str(offset)
    html = get_one_page(url)
    if html is None:
        # The request failed or returned a non-200 status; skip this page.
        print('Failed to fetch', url)
        return
    for item in parse_one_page(html):
        print(item)
        # write_to_txt(item)
        write_to_csv(item)


if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 25)
        time.sleep(1)  # pause between pages; requesting too quickly can leave the site unresponsive
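The write_to_csv function above writes only the values, so douban.csv has no header row. If a header is wanted, one option is to collect the records and write them with `csv.DictWriter`. The following is a minimal sketch rather than part of the original script: `FIELDNAMES`, `rows_to_csv`, and `sample_rows` are names introduced for the example, and it writes to an in-memory buffer instead of a file so it can be run on its own.

```python
import csv
import io

# Column order for the header; the keys match the dicts yielded by parse_one_page.
FIELDNAMES = ["name", "year", "rating_num", "evaluationsNum"]


def rows_to_csv(rows, stream):
    """Write a header row followed by one CSV line per movie dict."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Two records taken from the HTML sample shown in this article.
sample_rows = [
    {"name": "肖申克的救赎", "year": "1994", "rating_num": "9.7", "evaluationsNum": "2499740"},
    {"name": "霸王别姬", "year": "1993", "rating_num": "9.6", "evaluationsNum": "1858334"},
]

buffer = io.StringIO()
rows_to_csv(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, the same function could be passed a file opened with `newline=''`, which is what the csv module documentation recommends for CSV files.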
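Summary
1. time.sleep(1) adds a delay between requests so the site is not hit too quickly.
2. Opening a file with mode 'a' appends to its existing content instead of overwriting it.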
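The append behaviour in point 2 can be checked in isolation. The sketch below assumes nothing about the scraper itself: `append_line` is a hypothetical helper, and the file lives in a temporary directory so the example leaves nothing behind.

```python
import os
import tempfile


def append_line(path, line):
    """Open the file in append mode and add one line at the end."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "result.txt")
    append_line(path, "page 1")  # mode 'a' creates the file if it does not exist
    append_line(path, "page 2")  # the second call adds to the end, it does not overwrite
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()

print(lines)
```

One consequence of append mode is that running the scraper twice adds the same 250 rows again, so the output file should be deleted or renamed between runs if duplicates are not wanted.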