A Scrapy Spider for the Maoyan TOP100




I. Requirements

[Screenshot of the assignment requirements. In short: crawl the Maoyan TOP100 board (https://maoyan.com/board/4), extract each film's name, actors, and release time, and save the results to a CSV file and a MySQL table.]

II. Steps

1. Import libraries

The project relies on scrapy and pymysql, which are imported in the files below. The following workaround disables HTTPS certificate verification for standard-library requests, in case Maoyan's certificate triggers SSL errors (note that Scrapy's own downloader configures TLS separately):

import ssl
ssl._create_default_https_context = ssl._create_unverified_context
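
If the two third-party dependencies are not installed yet, they can be installed with pip first:

pip install scrapy pymysql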

2. maoyanspider.py

# -*- coding: utf-8 -*-
import scrapy
import urllib.parse

from ..items import MaoyanItem


class MaoyanspiderSpider(scrapy.Spider):
    name = 'maoyanspider'
    allowed_domains = ['maoyan.com']
    start_urls = ['https://maoyan.com/board/4']

    def parse(self, response):
        # Each <dd> under the board list holds one film
        for dd in response.xpath("//dl[@class='board-wrapper']/dd"):
            item = MaoyanItem()
            info = dd.xpath("div[@class='board-item-main']/div[@class='board-item-content']/div[@class='movie-item-info']")
            item['name'] = info.xpath("p[@class='name']/a/text()").extract_first()
            item['actors'] = info.xpath("p[@class='star']/text()").extract_first('').strip()
            item['releasetime'] = info.xpath("p[@class='releasetime']/text()").extract_first()
            yield item
        # Follow the "下一页" (next page) link until the last page
        next_page = response.xpath('//div[@class="pager-main"]/ul/li/a[contains(text(), "下一页")]/@href').extract_first()
        if next_page is not None:
            new_link = urllib.parse.urljoin(response.url, next_page)
            yield scrapy.Request(new_link, callback=self.parse)
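
Before adding the item and pipeline code, the XPath expressions above can be sanity-checked in Scrapy's interactive shell. A minimal session sketch (the browser-style User-Agent is an assumption; Maoyan may block Scrapy's default one):

scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64)" https://maoyan.com/board/4
>>> dd = response.xpath("//dl[@class='board-wrapper']/dd")[0]
>>> dd.xpath(".//p[@class='name']/a/text()").get()  # should print the first film's title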


3. items.py

import scrapy


class MaoyanItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    actors = scrapy.Field()
    releasetime = scrapy.Field()
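
Scrapy items behave like dictionaries restricted to the declared fields. A quick sketch of their usage (the sample values are illustrative, and "maoyan" is the assumed package name):

from maoyan.items import MaoyanItem

item = MaoyanItem(name='霸王别姬', actors='主演:张国荣', releasetime='上映时间:1993-01-01')
print(dict(item))   # converts to a plain dict: {'name': ..., 'actors': ..., 'releasetime': ...}
# item['rank'] = 1  # would raise KeyError: only declared Fields can be set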

4. pipelines.py

import csv

import pymysql


class MaoyanPipeline(object):
    """Append each item as a row of a local CSV file."""

    def process_item(self, item, spider):
        data_list = [item['name'], item['actors'], item['releasetime']]
        head = ('name', 'actors', 'releasetime')
        with open('maoyan.csv', 'a+', encoding='utf-8', newline='') as file:
            writer = csv.writer(file)
            # writer.writerow(head)  # uncomment once to write the header row
            writer.writerow(data_list)
        return item


class MaoyanMysqlPipeline(object):
    """Insert each item into the `maoyan` table of the local MySQL server."""

    def open_spider(self, spider):
        print('Spider started')
        self.db = pymysql.connect(host='localhost', user='root',
                                  password='123456', database='test', port=3306, charset='utf8')
        # Cursor object used to execute SQL statements
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        t = (item['name'], item['actors'], item['releasetime'])
        sql = 'insert into maoyan values (%s, %s, %s)'
        self.cursor.execute(sql, t)
        self.db.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.db.close()
        print('Spider finished')
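
For either pipeline to run, it must be enabled in the project's settings.py. A minimal sketch, assuming the project package is named maoyan (the User-Agent string and the robots setting are assumptions; Maoyan may reject Scrapy's default User-Agent):

# settings.py (sketch)
ROBOTSTXT_OBEY = False  # assumption: do not let robots.txt block the board pages
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # assumption: browser-like UA

ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,       # CSV writer runs first (lower number)
    'maoyan.pipelines.MaoyanMysqlPipeline': 400,  # then the MySQL insert
}

The MySQL pipeline additionally assumes a table named maoyan with three string columns already exists in the test database, e.g. CREATE TABLE maoyan (name VARCHAR(100), actors VARCHAR(200), releasetime VARCHAR(100)); the crawl is then started with scrapy crawl maoyanspider from the project root.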


Below is an alternative, self-contained walkthrough for crawling the Maoyan TOP100 with the Scrapy framework:

1. Create the Scrapy project

Run the following command to create a project named maoyan:

```
scrapy startproject maoyan
```

2. Create the spider

Change into the maoyan project directory and generate a spider named movies:

```
scrapy genspider movies maoyan.com
```

3. Write the spider code

Open maoyan/spiders/movies.py and replace its contents with:

```python
import scrapy
from maoyan.items import MaoyanItem


class MoviesSpider(scrapy.Spider):
    name = 'movies'
    allowed_domains = ['maoyan.com']
    start_urls = ['https://maoyan.com/board/4']

    def parse(self, response):
        # Each <dd> of the board list is one film; the rank index
        # (<i class="board-index ...">) sits directly under the <dd>
        for movie in response.xpath('//dl[@class="board-wrapper"]/dd'):
            item = MaoyanItem()
            item['rank'] = movie.xpath('i[contains(@class, "board-index")]/text()').get(default='').strip()
            item['title'] = movie.xpath('.//p[@class="name"]/a/@title').get(default='').strip()
            item['star'] = movie.xpath('.//p[@class="star"]/text()').get(default='').strip()
            item['release_time'] = movie.xpath('.//p[@class="releasetime"]/text()').get(default='').strip()
            yield item
```

4. Define the Item

Edit maoyan/items.py (generated by startproject) so it declares the four fields:

```python
import scrapy


class MaoyanItem(scrapy.Item):
    rank = scrapy.Field()
    title = scrapy.Field()
    star = scrapy.Field()
    release_time = scrapy.Field()
```

5. Run the spider

From the project directory, run:

```
scrapy crawl movies -o movies.csv
```

6. Check the results

A movies.csv file is generated in the maoyan directory, containing the ranking information of the Maoyan TOP100 movies.