Scrapy: A Getting-Started Example

Claroja

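1. Create the crawler project

scrapy startproject mySpider

This generates the following files and directories:

scrapy.cfg : the project's configuration file
mySpider/items.py : defines the storage format and fields of the scraped data
mySpider/pipelines.py : the pipeline file, used to connect to a database or save items to files
mySpider/settings.py : project settings, such as cookies, headers, and the various components
mySpider/spiders/ : the spider files, which hold the rules for parsing HTML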

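2. Create the spider file

scrapy genspider <spider name> <allowed domain>
scrapy genspider web "web.com"  # "web" becomes the file name (web.py) and the spider's name attribute; "web.com" goes into allowed_domains

This creates the spider file in the spiders folder.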

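3. Write the project files

3.1 web.py

1. A scrapy.Spider subclass needs a parsing method named parse; it is the default callback for the requests built from start_urls.
2. Mind where you start the crawler: run it from inside the project directory.
3. parse() hands data back with yield. Note: a callback may only yield Item objects (BaseItem in older Scrapy releases), Request objects, dicts, or None.
4. response.xpath
response.xpath returns a list-like object containing Selector objects. It behaves like a list but has some extra methods:
extract(): returns a list of strings
extract_first(): returns the first string in the list, or None if the list is empty
5. Commonly used attributes of the response object
response.url: URL of the current response
response.request.url: URL of the request that produced the current response
response.headers: response headers
response.request.headers: headers of the request that produced the current response
response.body: response body, i.e. the HTML source, as bytes
response.status: response status code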

1) Writing the spider with the Spider class

import re

import scrapy
from mySpider.items import webItem  # import the fields declared in items.py


class webSpider(scrapy.Spider):  # must inherit from scrapy.Spider
    name = "web"  # spider name; a class attribute, also readable as self.name
    allowed_domains = ["web.cn"]  # restrict crawling to this domain; other domains are not crawled
    start_urls = ["http://www.web.cn/"]  # first page to request

    def parse(self, response):  # parse the page, usually with XPath
        for each in response.xpath("//div[@class='aaa']"):  # select the target divs
            # Scrape the data on this page
            item = webItem()  # wrap the data in the item class we defined
            item["name"] = each.xpath("h3/text()").extract_first()  # keys must match the declared fields
            item["title"] = each.xpath("h4/text()").extract_first()
            yield item  # the generator hands the item over to the pipelines

        # Add the link to the next page (once per response, outside the loop)
        match = re.search(r"\d+", response.url)  # current page number, if the URL contains one
        if match:
            page = int(match.group()) + 10  # advance by 10, following the site's paging pattern
            url = re.sub(r"\d+", str(page), response.url, count=1)  # next link to crawl
            yield scrapy.Request(url, callback=self.parse)  # queue the new request; self.parse handles its response
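
Picking "the first number in the URL" breaks down as soon as the address contains other digits (a year in the path, an id, a port). As a hedged alternative, the sketch below only touches one named query parameter. The helper name next_page_url and the parameter name start are assumptions for illustration; adapt them to the paging scheme of the site being crawled.

```python
import re


def next_page_url(url, step=10, param="start"):
    """Return `url` with the numeric value of `param` increased by `step`.

    Returns None when the URL does not carry that parameter, which a spider
    can treat as "no further page to request".
    """
    pattern = re.compile(r"([?&]%s=)(\d+)" % re.escape(param))
    match = pattern.search(url)
    if match is None:
        return None
    page = int(match.group(2)) + step
    # Replace only the matched parameter; other digits in the URL stay untouched.
    return pattern.sub(lambda m: m.group(1) + str(page), url, count=1)


print(next_page_url("http://www.web.cn/list2024?start=0"))
print(next_page_url("http://www.web.cn/"))
```

In parse(), the result would be passed to scrapy.Request(url, callback=self.parse) only when it is not None.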

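3.2 items.py

Declare the format of the data produced from each page:

import scrapy

class webItem(scrapy.Item):
    name = scrapy.Field()
    title = scrapy.Field()  # assigned in web.py as item['title']
    level = scrapy.Field()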

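3.3 pipelines.py

Use a pipeline to save the data.
# process_item runs once for every item the spider yields
# process_item is a fixed method name that Scrapy looks for

import json

class webJsonPipeline(object):
    def __init__(self):
        # text mode with an explicit encoding, because json.dumps returns str
        self.file = open('web.json', 'w', encoding='utf-8')
    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(content)
        return item  # pass the item on to the next pipeline
    def close_spider(self, spider):
        self.file.close()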
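
Because a pipeline is a plain Python class, its behaviour can be checked without starting a crawl. The sketch below calls the hooks by hand in the order Scrapy would call them. The class name JsonLinesPipeline, the path constructor argument, and the sample items are all hypothetical; opening the file in open_spider rather than __init__ is a design choice of this sketch, not something the original pipeline does.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline that writes one JSON object per line."""

    def __init__(self, path):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called once per yielded item; ensure_ascii=False keeps non-ASCII text readable.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()


# Drive the hooks manually with plain dicts standing in for items.
path = os.path.join(tempfile.mkdtemp(), "web.json")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
for item in [{"name": "张三", "title": "工程师"}, {"name": "Bob", "title": "Analyst"}]:
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

In a real project the output path would more likely come from the settings than from a constructor argument.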

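3.4 settings.py

Enable the pipeline in settings.py:

ITEM_PIPELINES = {
    'mySpider.pipelines.webJsonPipeline': 400
}

Each key is the pipeline class to use, written as a dotted path: first the project directory, then the file, then the pipeline class defined in it.
Each value sets the order in which the pipelines run: the smaller the number, the earlier the pipeline runs. Values are usually kept within 1000.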
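
To make the ordering rule concrete, here is a small sketch with two entries. CleanupPipeline is a hypothetical second pipeline that does not exist in this project; the sorted() call only mimics the "smaller value runs first" rule and is not how Scrapy itself loads the setting.

```python
ITEM_PIPELINES = {
    "mySpider.pipelines.webJsonPipeline": 400,
    "mySpider.pipelines.CleanupPipeline": 300,  # hypothetical, for illustration only
}

# Items pass through the pipelines in ascending order of their value.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

With these values every item would reach CleanupPipeline first and webJsonPipeline second, so anything the first pipeline changes is what ends up in web.json.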

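4. Run the crawler

scrapy crawl web

Other topics:

2) Writing the spider with the CrawlSpider class

scrapy genspider -t crawl web web.com  # the generation command changes too: -t crawl selects the CrawlSpider template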

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from mySpider.items import webItem  # webItem must declare every field assigned below


class webSpider(CrawlSpider):  # this time the spider inherits from CrawlSpider
    name = "web"
    allowed_domains = ["hr.web.com"]
    start_urls = ["http://hr.web.com/position.php?&start=0#a"]

    page_lx = LinkExtractor(allow=(r"start=\d+",))  # links matching this pattern are extracted

    # Duplicate URLs are filtered out, so each page is requested only once.
    # The callback must not be named "parse": CrawlSpider uses parse internally.
    rules = [Rule(page_lx, callback="parseContent", follow=True)]

    def parseContent(self, response):
        for each in response.xpath('//*[@class="even"]'):
            # Python 3 strings are already Unicode, so no .encode('utf-8') is needed
            item = webItem()
            item['name'] = each.xpath('./td[1]/a/text()').extract_first()
            item['detailLink'] = each.xpath('./td[1]/a/@href').extract_first()
            item['positionInfo'] = each.xpath('./td[2]/text()').extract_first()
            item['peopleNumber'] = each.xpath('./td[3]/text()').extract_first()
            item['workLocation'] = each.xpath('./td[4]/text()').extract_first()
            item['publishTime'] = each.xpath('./td[5]/text()').extract_first()

            yield item
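
Before running the crawl it can help to check which links an allow pattern would accept. As far as I understand, LinkExtractor tests each allow pattern against the absolute URL with a regular-expression search, so the standard re module gives a reasonable approximation. The candidate URLs below are invented for illustration.

```python
import re

allow = re.compile(r"start=\d+")  # same pattern as passed to LinkExtractor(allow=...)

# Invented links of the kind a listing page might contain.
candidates = [
    "http://hr.web.com/position.php?&start=10#a",
    "http://hr.web.com/position.php?keywords=python&start=20",
    "http://hr.web.com/position_detail.php?id=33",
    "http://hr.web.com/about.html",
]

# A link is kept when the pattern matches anywhere in the URL.
followed = [url for url in candidates if allow.search(url)]
print(followed)
```

Only the two paging links survive; detail pages and static pages are ignored by this rule, so they would need their own Rule if they should be parsed as well.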

3) Simulating a login

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        # fill in and submit the login form found on the page
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check that login succeeded before going on
        if "authentication failed" in response.text:  # response.body is bytes; response.text is str
            self.logger.error("Login failed")
            return

References:
http://scrapy-chs.readthedocs.io/zh_CN/1.0/topics/settings.html#topics-settings-ref
http://scrapy-chs.readthedocs.io/zh_CN/1.0/index.html