Python crawler example: scraping the Douban Movie TOP250 with requests (detailed walkthrough)

Development tools

  • Python version: python-3.8.1-amd64
  • IDE: JetBrains PyCharm 2018.3.6 x64
  • Install the requests library (installing from the Aliyun mirror is much faster)
    pip install requests -i https://mirrors.aliyun.com/pypi/simple/

  • Douban Movie TOP250 page: https://movie.douban.com/top250

Goal

For every movie in the Douban Movie TOP250, extract the title, release year, rating, and number of ratings.

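Page analysis

In the browser, right-click the page and choose "View page source".
The page is rendered on the server, so the movie data is already present in the HTML source. Fetching that source and extracting the fields from it is all that is needed.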

<ol class="grid_view">
        <li>
            <div class="item">
                <div class="pic">
                    <em class="">1</em>
                    <a href="https://movie.douban.com/subject/1292052/">
                        <img width="100" alt="肖申克的救赎" src="https://img2.doubanio.com/view/photo/s_ratio_poster/public/p480747492.webp" class="">
                    </a>
                </div>
                <div class="info">
                    <div class="hd">
                        <a href="https://movie.douban.com/subject/1292052/" class="">
                            <span class="title">肖申克的救赎</span>
                                    <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
                                <span class="other">&nbsp;/&nbsp;月黑高飞(港)  /  刺激1995(台)</span>
                        </a>


                            <span class="playable">[可播放]</span>
                    </div>
                    <div class="bd">
                        <p class="">
                            导演: 弗兰克·德拉邦特 Frank Darabont&nbsp;&nbsp;&nbsp;主演: 蒂姆·罗宾斯 Tim Robbins /...<br>
                            1994&nbsp;/&nbsp;美国&nbsp;/&nbsp;犯罪 剧情
                        </p>

                        
                        <div class="star">
                                <span class="rating5-t"></span>
                                <span class="rating_num" property="v:average">9.7</span>
                                <span property="v:best" content="10.0"></span>
                                <span>2499740人评价</span>
                        </div>

                            <p class="quote">
                                <span class="inq">希望让人自由。</span>
                            </p>
                    </div>
                </div>
            </div>
        </li>
        <li>
            <div class="item">
                <div class="pic">
                    <em class="">2</em>
                    <a href="https://movie.douban.com/subject/1291546/">
                        <img width="100" alt="霸王别姬" src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2561716440.webp" class="">
                    </a>
                </div>
                <div class="info">
                    <div class="hd">
                        <a href="https://movie.douban.com/subject/1291546/" class="">
                            <span class="title">霸王别姬</span>
                                <span class="other">&nbsp;/&nbsp;再见,我的妾  /  Farewell My Concubine</span>
                        </a>


                            <span class="playable">[可播放]</span>
                    </div>
                    <div class="bd">
                        <p class="">
                            导演: 陈凯歌 Kaige Chen&nbsp;&nbsp;&nbsp;主演: 张国荣 Leslie Cheung / 张丰毅 Fengyi Zha...<br>
                            1993&nbsp;/&nbsp;中国大陆 中国香港&nbsp;/&nbsp;剧情 爱情 同性
                        </p>

                        
                        <div class="star">
                                <span class="rating5-t"></span>
                                <span class="rating_num" property="v:average">9.6</span>
                                <span property="v:best" content="10.0"></span>
                                <span>1858334人评价</span>
                        </div>

                            <p class="quote">
                                <span class="inq">风华绝代。</span>
                            </p>
                    </div>
                </div>
            </div>
        </li>

Regex analysis

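The pattern below relies mostly on lazy matching, .*? , together with named groups, (?P<name>.*?) , which give each piece of captured content a name.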
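To make the difference concrete, here is a minimal sketch comparing greedy and lazy matching on a made-up two-title string (the html value is an assumption for illustration, not taken from the real page), and showing how a named group is read back by name.

```python
import re

# Hypothetical input: two title spans on one line.
html = '<span class="title">A</span><span class="title">B</span>'

# Greedy: .* runs on to the LAST </span>, swallowing both titles.
greedy = re.search(r'<span class="title">(.*)</span>', html).group(1)

# Lazy: .*? stops at the FIRST </span> it can reach.
lazy = re.search(r'<span class="title">(.*?)</span>', html).group(1)

# Named group: the captured text is retrieved by name instead of by index.
named = re.search(r'<span class="title">(?P<name>.*?)</span>', html).group("name")

print(greedy)  # A</span><span class="title">B
print(lazy)    # A
print(named)   # A
```

Lazy matching is what keeps each capture confined to a single tag, which matters when 25 movie entries sit in one HTML document.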

r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)'
r'</span>.*?<p class="">.*?<br>(?P<year>.*?)&nbsp.*?'
r'<span class="rating_num" property="v:average">(?P<rating_num>.*?)</span>.*?'
r'<span>(?P<evaluationsNum>.*?)人评价</span>'
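As a quick offline check, the pattern can be tried against a trimmed copy of the first entry from the page source shown earlier. This is only a sketch: the sample string is a shortened version of that HTML, and the names sample and record are introduced here for illustration.

```python
import re

# Trimmed copy of the first <li> from the page source shown earlier.
sample = """
<li>
  <div class="item">
    <div class="hd">
      <span class="title">肖申克的救赎</span>
      <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
    </div>
    <div class="bd">
      <p class="">
        导演: 弗兰克·德拉邦特 Frank Darabont<br>
        1994&nbsp;/&nbsp;美国&nbsp;/&nbsp;犯罪 剧情
      </p>
      <div class="star">
        <span class="rating_num" property="v:average">9.7</span>
        <span property="v:best" content="10.0"></span>
        <span>2499740人评价</span>
      </div>
    </div>
  </div>
</li>
"""

pattern = re.compile(
    r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)'
    r'</span>.*?<p class="">.*?<br>(?P<year>.*?)&nbsp.*?'
    r'<span class="rating_num" property="v:average">(?P<rating_num>.*?)</span>.*?'
    r'<span>(?P<evaluationsNum>.*?)人评价</span>',
    re.S,  # let "." match newlines, since one entry spans many lines
)

match = pattern.search(sample)
# The year group captures leading whitespace after <br>, so strip every value.
record = {key: value.strip() for key, value in match.groupdict().items()}
print(record)
```

Two details are worth noticing. The name group takes only the first title span, so the English title in the second span is skipped. The final <span> in the pattern has no attributes, so it passes over the v:best span and lands on the one holding the rating count.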

Code example

import requests
import re
import json
from requests.exceptions import RequestException
import csv
import time

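def get_one_page(url):
    """Fetch one page and return its HTML text, or None on any failure."""
    try:
        headers = {
            # send a browser-like User-Agent instead of the requests default
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
            'Connection': 'close'
        }
        # timeout keeps a stalled connection from hanging the script forever
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            response.close()
            return response.text
        return None
    except RequestException:
        return None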


def parse_one_page(html):
    pattern = re.compile(r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)'
              r'</span>.*?<p class="">.*?<br>(?P<year>.*?)&nbsp.*?'
              r'<span class="rating_num" property="v:average">(?P<rating_num>.*?)</span>.*?'
              r'<span>(?P<evaluationsNum>.*?)人评价</span>', re.S)
    result = pattern.finditer(html)
    for it in result:
        yield {
            'name': it.group("name"),  # movie title
            'year': it.group("year").strip(),  # release year
            'rating_num': it.group("rating_num"),  # rating
            'evaluationsNum': it.group("evaluationsNum")  # number of ratings
        }


def write_to_txt(content):
    with open('result.txt', 'a', encoding='utf-8') as f:
        # print(type(json.dumps(content)))
        f.write(json.dumps(content, ensure_ascii=False) + "\n")

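def write_to_csv(content):
    # newline='' stops the csv module from emitting blank rows on Windows;
    # the with block closes the file after each row is written
    with open('douban.csv', mode='a', encoding='utf-8', newline='') as f:
        csvwriter = csv.writer(f)
        csvwriter.writerow(content.values())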

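def main(offset):
    url = 'https://movie.douban.com/top250?start=' + str(offset)
    html = get_one_page(url)
    if html is None:  # request failed or returned a non-200 status
        print('Failed to fetch', url)
        return
    for item in parse_one_page(html):
        print(item)
        # write_to_txt(item)
        write_to_csv(item)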


if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 25)
        time.sleep(1)  # pause between pages; requesting too quickly can make the server stop responding
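The CSV produced above has no header row, so the column meaning is only implied by the order of the dict keys. One possible variation, sketched here with csv.DictWriter, writes the header once when the file is new or empty. The helper name append_rows, the FIELDS list, and the temporary demo path are assumptions introduced for this sketch and are not part of the script above.

```python
import csv
import os
import tempfile

# Column order for the output file (same keys as the parsed records).
FIELDS = ["name", "year", "rating_num", "evaluationsNum"]


def append_rows(path, rows):
    """Append dict rows to a CSV file, writing the header only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, mode="a", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if need_header:
            writer.writeheader()
        writer.writerows(rows)


# Demo in a temporary directory so the real douban.csv is left alone.
demo_path = os.path.join(tempfile.mkdtemp(), "douban_demo.csv")
append_rows(demo_path, [{"name": "肖申克的救赎", "year": "1994",
                         "rating_num": "9.7", "evaluationsNum": "2499740"}])
append_rows(demo_path, [{"name": "霸王别姬", "year": "1993",
                         "rating_num": "9.6", "evaluationsNum": "1858334"}])

with open(demo_path, encoding="utf-8", newline="") as f:
    lines = f.read().splitlines()
print(lines)
```

Calling append_rows once per page would also open the file once per page instead of once per movie.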

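Summary

1. time.sleep(1) adds a delay between requests so the pages are not fetched too quickly.
2. Opening the file with mode a appends to the existing content instead of overwriting it.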
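A small sketch of the second point, using a throwaway file in a temporary directory (the file name mode_demo.txt is made up for the demo): mode a keeps what is already in the file, while mode w starts over.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "mode_demo.txt")

# "w" creates the file (or truncates an existing one).
with open(path, "w", encoding="utf-8") as f:
    f.write("first\n")

# "a" keeps the existing content and writes at the end.
with open(path, "a", encoding="utf-8") as f:
    f.write("second\n")

with open(path, encoding="utf-8") as f:
    appended = f.read()

# Opening with "w" again discards everything written so far.
with open(path, "w", encoding="utf-8") as f:
    f.write("third\n")

with open(path, encoding="utf-8") as f:
    overwritten = f.read()

print(repr(appended))
print(repr(overwritten))
```

One consequence for the crawler: because douban.csv is opened in append mode, running the script a second time adds another 250 rows to the same file. Delete the file before a fresh run if duplicates are not wanted.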
