[Learning Notes] Web crawling with the Scrapy framework on Ubuntu

This document records how to set up a Python virtual environment on Ubuntu and use the Scrapy crawling framework inside it. It first covers creating the virtual environment and a crawler project, then walks through writing the spider files, including the database code that stores the scraped data, and a main.py that launches the project. Finally, the spider is run to crawl and store the data.

1. Configuring the system environment

  • First, install the packages needed for Python virtual environments
sudo apt install python3-pip python3-dev build-essential
sudo python3 -m pip install --upgrade pip
sudo pip3 install virtualenvwrapper
  • Create a dedicated directory for the virtual environments
mkdir /var/www/EnvRoot
# Edit the .zshrc file and add the following lines
export WORKON_HOME=/var/www/EnvRoot
export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3
source /usr/local/bin/virtualenvwrapper.sh
# Reload .zshrc
source ~/.zshrc
  • Create a virtual environment and a Scrapy project
mkvirtualenv scrapy
# This virtual environment is dedicated to working with the Scrapy framework
# Install the Scrapy framework
pip install scrapy
# Create a Scrapy project (the generated layout is sketched below)
scrapy startproject douban
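For reference, scrapy startproject douban generates a project skeleton roughly like the following (the generated files may differ slightly between Scrapy versions):
douban/
├── scrapy.cfg          # deploy/run configuration
└── douban/             # the project's Python module
    ├── __init__.py
    ├── items.py        # item definitions
    ├── middlewares.py  # spider and downloader middlewares
    ├── pipelines.py    # item pipelines
    ├── settings.py     # project settings
    └── spiders/        # spider modules live here
        └── __init__.py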

2. Writing the spider files

  • Edit the spider file and the database pipeline so that the scraped data is stored in a database; the data can later be used for some analysis
cd douban
vim douban/spiders/douban.py
# douban.py
import scrapy
import re
from bs4 import BeautifulSoup
from douban.items import DoubanItem

class DbSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ["douban.com"]
    start_urls = ["https://www.douban.com/doulist/43430373"]
   
    def parse(self, response):
        # response.text is already decoded by Scrapy, so it can be handed to BeautifulSoup directly
        soup = BeautifulSoup(response.text, 'html.parser')
        books = soup.select('.doulist-item')
        selector = scrapy.Selector(response)
        for book in books:
            if len(book.select('.title a')) > 0:
                item = DoubanItem()  # a fresh item per book, so the follow-up requests do not share state
                title = book.select('.title a')[0].text
                rate = book.select('.rating span')[1].text
                score = book.select('.rating span')[2].text.lstrip('(').strip('人评价)')  # strip the surrounding "(" and "人评价)"
                author = book.select('.abstract')[0].text
                title = title.replace(' ', '').replace('\n', '')
                author = author.replace('\n\r', '').replace(' ', '')
                aa = re.split('[\n]+', author)  # split the abstract block into its label:value lines
                urlb = book.select('.title a')[0]['href']

                item['title'] = title
                item['rate'] = rate
                item['author'] = aa[1][3:]   # drop the leading 3-character "作者:" prefix
                item['score'] = score
                item['press'] = aa[2][4:]    # drop the leading "出版社:" prefix
                item['pretime'] = aa[3][4:]  # drop the leading "出版年:" prefix
                # Follow the book's own link to collect more fields in parse_book; pass the item along via meta
                yield scrapy.http.Request(urlb, callback=self.parse_book, meta={'item': item})
        # Get the link to the next page (once per page, outside the book loop)
        nextPage = selector.xpath('//span[@class="next"]/link/@href').extract()
        if nextPage:  # empty on the last page
            next_url = nextPage[0]
            # Crawl the book list on the next page with the same parse method
            yield scrapy.http.Request(response.urljoin(next_url), callback=self.parse)

    # Crawl the book's detail page; only ISBN, price and page count are collected in this example
    def parse_book(self, response):
        item = response.meta['item']  # retrieve the item passed along from parse()
        # These values are bare text nodes that follow a label span, hence the following::text() XPath
        ISBN = response.xpath(u'//span[.//text()[normalize-space(.)="ISBN:"]]/following::text()[1]').extract()[0]
        price = response.xpath(u'//span[.//text()[normalize-space(.)="定价:"]]/following::text()[1]').extract()[0]
        number = response.xpath(u'//span[.//text()[normalize-space(.)="页数:"]]/following::text()[1]').extract()[0]
        # Remove stray spaces from the extracted values
        ISBN = ISBN.replace(' ', '')
        price = price.replace(' ', '')
        number = number.replace(' ', '')
        item['ISBN'] = ISBN
        item['price'] = price
        item['number'] = number
        yield item
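The slicing in parse() (aa[1][3:], aa[2][4:], aa[3][4:]) depends entirely on the layout of the .abstract block. A minimal sketch of that splitting step, using a made-up abstract string purely for illustration:
import re

# Hypothetical .abstract text as it looks after the replace() calls in parse():
# newline-separated "label:value" lines with the spaces already removed.
author_block = '\n作者:王小波\n出版社:北京十月文艺出版社\n出版年:2017-6\n'

aa = re.split('[\n]+', author_block)
# aa -> ['', '作者:王小波', '出版社:北京十月文艺出版社', '出版年:2017-6', '']
# (the leading newline yields an empty first element, which is why indexing starts at 1)
print(aa[1][3:])   # 王小波
print(aa[2][4:])   # 北京十月文艺出版社
print(aa[3][4:])   # 2017-6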
# items.py
# -*- coding: utf-8 -*-
import scrapy

class DoubanItem(scrapy.Item):
    ISBN = scrapy.Field()
    title = scrapy.Field()
    rate = scrapy.Field()
    author = scrapy.Field()
    score = scrapy.Field()
    press = scrapy.Field()
    pretime = scrapy.Field()
    price = scrapy.Field()   # the field name must match item['price'] as used in parse_book
    number = scrapy.Field()
# middlewares.py (default scaffolding generated by scrapy startproject, left unchanged)
from scrapy import signals

class DoubanSpiderMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
    def process_spider_input(self, response, spider):
        return None
    def process_spider_output(self, response, result, spider):
        for i in result:
            yield i
    def process_spider_exception(self, response, exception, spider):
        pass
    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
class DoubanDownloaderMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
    def process_request(self, request, spider):
        return None
    def process_response(self, request, response, spider):
        return response
    def process_exception(self, request, exception, spider):
        pass
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
# pipelines.py
import pymysql

# Store the scraped items in a MySQL database

class DoubanPipeline(object):
    def __init__(self):
        # Connect to the database
        self.connect = pymysql.connect(
                host='localhost',
                db='douban',
                user='root',
                password='root',
                port=3306,
                charset='utf8',
                use_unicode=True)
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        try:
            # Skip books whose ISBN is already in the table
            self.cursor.execute(
                    """select * from dbook where ISBN=%s""", (item['ISBN'],))
            repetition = self.cursor.fetchone()
            if not repetition:
                # Insert the new record
                self.cursor.execute(
                        """insert into dbook(ISBN,title,rate,author,score,press,pretime,price,number)
                           values(%s,%s,%s,%s,%s,%s,%s,%s,%s)""",
                        (item['ISBN'], item['title'], item['rate'], item['author'],
                         item['score'], item['press'], item['pretime'], item['price'], item['number']))
            self.connect.commit()

        except Exception as error:
            print(error)
        return item
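The pipeline assumes that a douban database with a dbook table already exists. The original notes do not give the schema; a minimal sketch that matches the insert statement above (the column types are assumptions, adjust as needed):
import pymysql

# Assumed schema: column names follow the insert statement, the varchar types are guesses
create_sql = """
create table if not exists dbook (
    ISBN    varchar(32) primary key,
    title   varchar(256),
    rate    varchar(16),
    author  varchar(128),
    score   varchar(32),
    press   varchar(128),
    pretime varchar(64),
    price   varchar(32),
    number  varchar(32)
)
"""

conn = pymysql.connect(host='localhost', user='root', password='root',
                       port=3306, charset='utf8')
cursor = conn.cursor()
cursor.execute("create database if not exists douban character set utf8")
cursor.execute("use douban")
cursor.execute(create_sql)
conn.commit()
conn.close()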
# settings.py
BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 3
COOKIES_ENABLED = False
ITEM_PIPELINES = {
   'douban.pipelines.DoubanPipeline': 300,
}
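Besides the fixed DOWNLOAD_DELAY, Scrapy's built-in AutoThrottle extension can pace requests automatically; a possible addition to settings.py (not part of the original notes):
# Optional: let Scrapy adapt the crawl speed instead of relying only on a fixed delay
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3
AUTOTHROTTLE_MAX_DELAY = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 1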
  • Create a main.py at the project root to manage launching the project
# main.py
from scrapy import cmdline
cmdline.execute("scrapy crawl douban".split())
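If you would rather not shell out through cmdline, an equivalent sketch using Scrapy's CrawlerProcess API (not the approach in the original notes):
# main.py (alternative sketch using CrawlerProcess)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('douban')   # spider name as defined in DbSpider.name
process.start()           # blocks until the crawl finishes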
  • Start the crawler project to crawl and store the data
python3 main.py