Define the goal:
Fields to scrape: position name, headcount, category, location, and publish date from the list page, plus the job responsibilities and requirements from each detail page.
1. Configure items.py
Now that the goal is set, define the fields in items.py:
import scrapy


class TtspiderItem(scrapy.Item):
    mc = scrapy.Field()  # position name
    lb = scrapy.Field()  # category
    rs = scrapy.Field()  # headcount
    dd = scrapy.Field()  # location
    sj = scrapy.Field()  # publish date
    zz = scrapy.Field()  # responsibilities
    yq = scrapy.Field()  # requirements
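As a side note, the spider further below fills these fields through a plain dict, which Scrapy also accepts; a minimal sketch of what populating the item class itself would look like inside a callback (the values here are placeholders, not real scraped data):

from ttspider.items import TtspiderItem

item = TtspiderItem()
item['mc'] = 'Example position'  # placeholder value, just for illustration
item['rs'] = '2'
item['dd'] = 'Shenzhen'
print(dict(item))  # an Item converts cleanly to a dict, e.g. before a MongoDB insert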
2. Configure settings.py (enable the pipeline and add the database info)
ITEM_PIPELINES = {
    'ttspider.pipelines.TtspiderPipeline': 300,
}
LOG_LEVEL = "WARNING"

# MongoDB configuration
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'tencent'
MONGODB_COLLECTION = 'zhaopin'
3. Configure pipelines.py
# -*- coding: utf-8 -*-
from pymongo import MongoClient
from scrapy.conf import settings
"""
Notes:
1. Import MongoClient from pymongo (the MongoDB client).
2. Import the settings with `from scrapy.conf import settings`.
   2.1 The MongoDB connection info lives in settings.py, which is why it has to be imported:
       MONGODB_HOST = '127.0.0.1'
       MONGODB_PORT = 27017  # this must be an int, not a string
       MONGODB_DBNAME = 'tencent'
       MONGODB_COLLECTION = 'zhaopin'
   2.2 Alternatively, the connection info could be hard-coded directly in pipelines.py.
"""
class TtspiderPipeline(object):
    def __init__(self):
        # Connect to the MongoDB server
        con = MongoClient(settings.get('MONGODB_HOST'), settings.get('MONGODB_PORT'))
        # Select the database
        db = con[settings.get('MONGODB_DBNAME')]
        # Select the collection
        self.collection = db[settings.get('MONGODB_COLLECTION')]

    def process_item(self, item, spider):
        # Insert the scraped item (insert_one replaces pymongo's deprecated insert)
        self.collection.insert_one(dict(item))
        print(item)
        return item
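Note that `from scrapy.conf import settings` is deprecated and has been removed in newer Scrapy versions. A sketch of the same pipeline using the from_crawler hook instead (assuming the same setting names as above):

from pymongo import MongoClient


class TtspiderPipeline(object):
    def __init__(self, host, port, dbname, collection):
        con = MongoClient(host, port)
        self.collection = con[dbname][collection]

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings is the supported way to read settings inside a component
        s = crawler.settings
        return cls(
            s.get('MONGODB_HOST'),
            s.getint('MONGODB_PORT'),
            s.get('MONGODB_DBNAME'),
            s.get('MONGODB_COLLECTION'),
        )

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item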
4. The spider
# -*- coding: utf-8 -*-
import scrapy


class TencentspiderSpider(scrapy.Spider):
    name = 'tencentSpider'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['http://hr.tencent.com/position.php']

    def parse(self, response):
        # Each job posting sits in a <tr> with class "even" or "odd"
        tr_list = response.xpath('//tr[@class="even" or @class="odd"]')
        for tr in tr_list:
            item = {}
            item['mc'] = tr.xpath('./td[1]/a/text()').extract_first()  # position name
            item['lb'] = tr.xpath('./td[2]/text()').extract_first()    # category
            item['rs'] = tr.xpath('./td[3]/text()').extract_first()    # headcount
            item['dd'] = tr.xpath('./td[4]/text()').extract_first()    # location
            item['sj'] = tr.xpath('./td[5]/text()').extract_first()    # publish date
            # Follow the detail page, carrying the partially filled item along in meta
            yield scrapy.Request(
                url='http://hr.tencent.com/' + tr.xpath('./td[1]/a/@href').extract_first(),
                callback=self.parse_detail,
                meta={'item': item}
            )
        # Pagination: the "next" link becomes "javascript:;" on the last page
        next_url = response.xpath('//a[@id="next"]/@href').extract_first()
        if next_url != 'javascript:;':
            next_url = 'http://hr.tencent.com/' + next_url
            yield scrapy.Request(
                next_url,
                callback=self.parse
            )

    def parse_detail(self, response):
        item = response.meta['item']
        # The first <ul class="squareli"> holds the responsibilities, the second the requirements
        ul_list = response.xpath('//ul[@class="squareli"]')
        if len(ul_list) > 1:
            item['zz'] = ul_list[0].xpath('./li/text()').extract()
            item['yq'] = ul_list[1].xpath('./li/text()').extract()
        else:
            item['zz'] = ul_list[0].xpath('./li/text()').extract()
            item['yq'] = None
        yield item
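Assuming the project is named ttspider, the crawl is started from the project root with the spider's name; with LOG_LEVEL set to WARNING, the only console output should be the print(item) calls from the pipeline:

scrapy crawl tencentSpider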
With this, the spider is basically complete. Cookies and proxy IPs are not covered in this example; I'll practice those later.
Then check the database.
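A quick way to verify the inserts is to query MongoDB directly with pymongo; a minimal sketch, assuming the same connection settings as in settings.py:

from pymongo import MongoClient

client = MongoClient('127.0.0.1', 27017)
collection = client['tencent']['zhaopin']

print(collection.count_documents({}))   # total number of scraped postings
for doc in collection.find().limit(3):  # peek at a few documents
    print(doc)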
This post walked through a Tencent job-listing crawler built with Scrapy: defining the items, configuring settings, and writing the pipeline and the spider.