Python Crawler in Practice | (21) Crawling Sina Rolling News with Scrapy + Selenium

Latest recommended article published on 2023-04-10 21:17:49

CoreJT



Category: Python3 Web Crawling from Theory to Practice (Base). Tags: Python crawler in practice, scrapy, selenium, Sina rolling news

Article link: https://blog.youkuaiyun.com/sdu_hao/article/details/97158136


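In this post we connect Scrapy to Selenium to crawl Sina's rolling news. We crawled this page with Selenium alone in an earlier post; it is rendered dynamically by JavaScript. Scrapy fetches pages the same way the requests library does, by issuing HTTP requests directly, so Scrapy on its own cannot crawl pages that are rendered dynamically by JavaScript. That is why Selenium is needed.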

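There are two ways to crawl a JavaScript-rendered page:

1) Analyze the Ajax requests, find the underlying API endpoint, and crawl that. Scrapy can be used this way as well.

2) Drive a real browser with Selenium. There is no need to care about the requests the page makes in the background or to analyze the rendering process; only the final rendered page matters. Whatever you can see, you can crawl.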
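To make the difference concrete, here is a small offline sketch using only the standard library. It applies the same link extractor to two HTML strings. The `rendered_html` string is abbreviated from the page sample quoted in the middleware later in this post; the `raw_html` string is an assumption about what a plain HTTP response might contain before JavaScript fills in the list. The names `TitleLinkParser` and `extract_links` are hypothetical helpers, not part of Scrapy or Selenium.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <span class="c_tit">."""

    def __init__(self):
        super().__init__()
        self.in_title_span = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "c_tit":
            self.in_title_span = True
        elif tag == "a" and self.in_title_span and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        # Any closing </span> ends the title span we may be inside
        if tag == "span":
            self.in_title_span = False


def extract_links(html):
    """Return the news links found in one HTML string."""
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.links


# ASSUMPTION: an empty list container, as a plain HTTP client might receive it
raw_html = '<div class="d_list_txt" id="d_list" style="width:100%;"></div>'

# Abbreviated from the rendered page sample quoted later in this post
rendered_html = (
    '<div class="d_list_txt" id="d_list" style="width:100%;"><ul>'
    '<li><span class="c_chl">[全部]</span><span class="c_tit">'
    '<a href="https://news.sina.com.cn/s/2019-07-24/doc-ihytcerm5961028.shtml" '
    'target="_blank">5000元欠了六年才还上 背后的故事却这么温暖</a></span>'
    '<span class="c_time" s="1563959862">07-24 17:17</span></li>'
    '</ul></div>'
)

raw_links = extract_links(raw_html)
rendered_links = extract_links(rendered_html)
print(len(raw_links), len(rendered_links))
```

Under these assumptions the extractor finds nothing in the unrendered markup and one link in the rendered markup, which is the gap a browser driven by Selenium closes.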

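Create the Scrapy project from the command line

First, change into the PyCharm project directory on the command line, then run scrapy startproject ScrapySinaRollNews to generate the crawler project. This creates the project structure and some files automatically, including items.py, middlewares.py, pipelines.py, settings.py, and a spiders directory.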

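Create the Spider from the command line

A Spider is a user-defined class that Scrapy uses to fetch content from web pages and parse the results. The class must inherit from scrapy.Spider and define the spider's name, its initial requests, and a method that parses the crawled results.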

Command: scrapy genspider <Spider name> <site domain>

Example: scrapy genspider sinanews <site domain>

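Change into the spiders directory generated earlier and run the command above.

This generates a .py file named after the spider (here, sinanews.py) in the spiders directory.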

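Create the Item

An Item is a container for crawled data. To create one, subclass scrapy.Item and define fields of type scrapy.Field.

We mainly want the link, title, time, source, and body text of each news article. Next, customize items.py (initially it contains only a skeleton) and define the fields we need. items.py: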

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapysinarollnewsItem(scrapy.Item):
    # One Field per value collected for each news article
    link = scrapy.Field()     # URL of the article
    title = scrapy.Field()    # headline
    date = scrapy.Field()     # publication time
    source = scrapy.Field()   # news source
    article = scrapy.Field()  # body text

Connecting Selenium

Define the SeleniumDownloaderMiddleware class in middlewares.py:
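The middleware defined in this section only takes effect once it is registered in settings.py, and its from_crawler method reads SELENIUM_TIMEOUT from the same file. A minimal sketch of those two settings follows. The module path assumes the class lives in ScrapySinaRollNews/middlewares.py; the priority 543 and the 20-second timeout are illustrative values, not taken from the original.

```python
# settings.py (fragment)

# Register the Selenium middleware so Scrapy passes requests through it.
# ASSUMPTION: 543 is an illustrative priority, not a value from the original.
DOWNLOADER_MIDDLEWARES = {
    "ScrapySinaRollNews.middlewares.SeleniumDownloaderMiddleware": 543,
}

# Timeout in seconds handed to WebDriverWait by the middleware.
# ASSUMPTION: 20 is an illustrative value.
SELENIUM_TIMEOUT = 20
```

If SELENIUM_TIMEOUT is left undefined, crawler.settings.get returns None and the middleware receives no usable timeout, so it is worth setting explicitly.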

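from logging import getLogger

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait


class SeleniumDownloaderMiddleware():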
    def __init__(self, timeout=None):
        self.logger = getLogger(__name__)
        self.timeout = timeout
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')   # run Chrome without a visible window
        self.browser = webdriver.Chrome(options=chrome_options)
        self.wait = WebDriverWait(self.browser, self.timeout)

    def __del__(self):
        # quit() ends the whole WebDriver session; close() would only close the current window
        self.browser.quit()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            timeout=crawler.settings.get('SELENIUM_TIMEOUT')  # read SELENIUM_TIMEOUT from the settings file; you must define it yourself
        )

    def process_request(self, request, spider):
        self.logger.debug('------------Chrome is starting-------------' + request.url)
        try:
            self.browser.get(request.url)
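            # Two kinds of pages are crawled: first the rolling news page, to collect every news URL; then each news page, for its details
            if 'https://news.sina.com.cn/roll' in request.url:  # the rolling news page
                news_list = ''   # stores the URLs of all news items
                page = 0
                while page < 2:  # crawl only two pages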
                    try:
                        page = page + 1
                        '''
                        <div class="d_list_txt" id="d_list" style="width:100%;">
                        <ul>
                        <li onmouseover="this.className='hover'" onmouseout="this.className=''" class=""><span class="c_chl">[全部]</span><span class="c_tit"><a href="https://finance.sina.com.cn/money/bank/gsdt/2019-07-24/doc-ihytcerm5959531.shtml" target="_blank">招商银行：上半年实现净利润506.12亿 同比增13.08%</a></span><span class="c_time" s="1563959946">07-24 17:19</span></li>
                        <li onmouseover="this.className='hover'" onmouseout="this.className=''" class=""><span class="c_chl">[全部]</span><span class="c_tit"><a href="https://news.sina.com.cn/w/2019-07-24/doc-ihytcerm5973095.shtml" target="_blank">为寻失踪36年少女 梵蒂冈掘公主墓发现数千根人骨</a></span><span class="c_time" s="1563959920">07-24 17:18</span></li>
                        <li onmouseover="this.className='hover'" onmouseout="this.className=''" class=""><span class="c_chl">[全部]</span><span class="c_tit"><a href="https://finance.sina.com.cn/money/forex/forexanaly/2019-07-24/doc-ihytcitm4320345.shtml" target="_blank">李鼎缘:黄金原油怎么操作 日内走势分析及操作建议</a></span><span class="c_time" s="1563959917">07-24 17:18</span></li>
                        <li onmouseover="this.className='hover'" onmouseout="this.className=''" class=""><span class="c_chl">[全部]</span><span class="c_tit"><a href="https://news.sina.com.cn/s/2019-07-24/doc-ihytcerm5961028.shtml" target="_blank">5000元欠了六年才还上 背后的故事却这么温暖</a></span><span class="c_time" s="1563959862">07-24 17:17</span></li>
                        <li onmouseover="this.className='hover'" onmouseout="this.className=''" class=""><span class="c_chl">[全部]</span><span class="c_tit"><a href="https://finance.sina.com.cn/money/future/roll/2019-07-24/doc-ihytcitm4320124.shtml" target="_blank">沪镍下滑震荡 需求疲弱打压</a></span><span class="c_time" s="1563959860">07-24 17:17</span></li>
                        </ul>
                        <ul>
                        <li onmouseover="this.className='hover'" onmouseout="this.className=''" class=""><span class="c_chl">[全部]</span><span class="c_tit"><a href="https://finance.sina.com.cn/stock/relnews/us/2019-07-24/doc-ihytcerm5959813.shtml" target="_blank">美股科技股盘前走低 美司法部启动大范围反垄断调查</a></span><span class="c_time" s="1563959797">07-24 17:16</span></li>
                        <li onmouseover="this.className='hover'" onmouseout="this.className=''" class=""><span class="c_chl">[全部]</span><span class="c_tit"><a href="https://finance.sina.com.cn/stock/relnews/hk/2019-07-24/doc-ihytcerm5965807.shtml" target="_blank">中信建投证券完成兑付30亿元本年度第一期短期融资券</a></span><span class="c_time" s="1563959760">07-24 17:16</span></li>
                        <li onmouseover="this.className='hover'" onmouseout="this.className=''" class=""><span class="c_chl">[全部]</span><span class="c_tit"><a href="https://finance.sina.com.cn/money/forex/forexanaly/2019-07-24/doc-ihytcitm431969
