First of all, this article assumes a Windows 7 system with a working Scrapy environment; the code is written in Python. (If you have questions about the setup itself, please search Google or Baidu first.)
1. Assuming your Scrapy environment is configured, open CMD to get a DOS prompt and create a new project:
scrapy startproject <project-name>    (replace <project-name> with your actual project name)
The generated directory looks like the tree below. The files that need editing are items.py, pipelines.py and settings.py; in addition, you write your own spider file under the spiders folder (here, a new spiders.py).
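For reference, running scrapy startproject computer produces a layout like this (the name computer matches the project used throughout this article):

computer/
    scrapy.cfg
    computer/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py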
2. As a worked example, we will scrape product information from a website.
items.py is as follows:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class ComputerItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = Field(serializer=str)          # product name
    price = Field(serializer=str)         # product price
    jprice = Field(serializer=str)        # list price (stored as max_price in the DB)
    #gprice = Field(serializer=str)
    #sprice = Field(serializer=str)       # second-hand price
    score = Field(serializer=str)         # overall rating
    screval = Field(serializer=str)       # screen quality rating
    buffval = Field(serializer=str)       # battery life rating
    phtval = Field(serializer=str)        # camera quality / running speed rating
    yuval = Field(serializer=str)         # entertainment rating
    desval = Field(serializer=str)        # design/appearance rating
    cpval = Field(serializer=str)         # value-for-money rating
    ImageAddress = Field(serializer=str)  # image URL
The fields above are the features to be scraped; feel free to adjust them to your own needs.
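For illustration, here is a tiny sketch of how a spider callback might fill such an item (the values are made up). Each field is stored as a list because XPath's extract() returns lists, and the pipeline below reads each field with [0]:

from computer.items import ComputerItem

item = ComputerItem()
item['name'] = [u'Some Laptop X1']    # made-up value
item['price'] = [u'4999', u'5299']    # multiple prices get joined with '/' in the pipeline
item['jprice'] = [u'5299']
print item['name'][0]                 # -> Some Laptop X1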
pipelines.py is as follows:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy import log
from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors

class ComputerPipeline(object):
    def __init__(self):
        # asynchronous connection pool managed by Twisted
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
            db='product',
            user='root',
            passwd='123456',
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8',
            use_unicode=False
        )

    def process_item(self, item, spider):
        # run the insert in a pool thread so the reactor is not blocked
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

    def _conditional_insert(self, tx, item):
        # deduplicate by product name: skip items already stored
        tx.execute("select * from computer where m_name = %s", (item['name'][0],))
        result = tx.fetchone()
        if result:
            log.msg("Item already stored in db: %s" % item, level=log.DEBUG)
        else:
            # price and cpval arrive as lists; join their entries with '/'
            price = '/'.join(item['price'])
            cpval = '/'.join(item['cpval'])
            tx.execute(
                "insert into computer (m_name,m_score,m_screval,m_buffval,m_phtval,"
                "m_yuval,m_desval,m_price,max_price,m_cpval,m_image) "
                "values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
                (item['name'][0], item['score'][0], item['screval'][0],
                 item['buffval'][0], item['phtval'][0], item['yuval'][0],
                 item['desval'][0], price, item['jprice'][0], cpval,
                 item['ImageAddress'][0]))
            log.msg("Item stored in db: %s" % item, level=log.DEBUG)

    def handle_error(self, e):
        log.err(e)
settings.py is as follows:
# -*- coding: utf-8 -*-

# Scrapy settings for computer project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html

BOT_NAME = 'computer'

SPIDER_MODULES = ['computer.spiders']
NEWSPIDER_MODULE = 'computer.spiders'

# ITEM_PIPELINES must be a dict mapping the pipeline path to an order value
ITEM_PIPELINES = {
    'computer.pipelines.ComputerPipeline': 300,
}

LOG_LEVEL = 'DEBUG'

# be polite: wait 2 seconds (randomized) between requests
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
COOKIES_ENABLED = True

# custom settings; note the pipeline shown earlier hardcodes its connection
MySQL_SERVER = 'localhost'
MySQL_PORT = 3306
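MySQL_SERVER and MySQL_PORT are custom settings; the pipeline above never actually reads them because it hardcodes its connection parameters. If you want the pipeline to pick them up instead, a minimal sketch looks like this (from_settings is a standard Scrapy hook; the fallback defaults here are assumptions):

from twisted.enterprise import adbapi
import MySQLdb.cursors

class ComputerPipeline(object):
    def __init__(self, host, port):
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
            host=host, port=port,
            db='product', user='root', passwd='123456',
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8', use_unicode=False)

    @classmethod
    def from_settings(cls, settings):
        # read the custom settings defined in settings.py
        return cls(settings.get('MySQL_SERVER', 'localhost'),
                   settings.getint('MySQL_PORT', 3306))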
3. This is the most important part: writing the spider itself. The bulk of the work is inspecting the relevant pages and working out the XPath expressions that locate the information you need (work through anything unclear yourself, or ask me).
The spider goes in spiders.py; a minimal sketch follows below.
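Since every site is different, this is only a sketch: the domain, start URL, and every XPath below are placeholders that you must replace with values found by inspecting your target pages. Note that extract() returns lists, which is exactly why the pipeline indexes each field with [0]:

# -*- coding: utf-8 -*-
# spiders/spiders.py -- minimal sketch; all URLs and XPaths are placeholders
from scrapy.spider import Spider
from scrapy.selector import Selector
from computer.items import ComputerItem

class ComputerSpider(Spider):
    name = "computer"                    # used by: scrapy crawl computer
    allowed_domains = ["example.com"]    # placeholder domain
    start_urls = ["http://www.example.com/computer/"]  # placeholder URL

    def parse(self, response):
        sel = Selector(response)
        # placeholder XPath: one node per product on the listing page
        for node in sel.xpath('//div[@class="product"]'):
            item = ComputerItem()
            item['name'] = node.xpath('.//h3/a/text()').extract()
            item['price'] = node.xpath('.//span[@class="price"]/text()').extract()
            item['jprice'] = node.xpath('.//span[@class="max-price"]/text()').extract()
            item['score'] = node.xpath('.//em[@class="score"]/text()').extract()
            item['screval'] = node.xpath('.//li[1]/span/text()').extract()
            item['buffval'] = node.xpath('.//li[2]/span/text()').extract()
            item['phtval'] = node.xpath('.//li[3]/span/text()').extract()
            item['yuval'] = node.xpath('.//li[4]/span/text()').extract()
            item['desval'] = node.xpath('.//li[5]/span/text()').extract()
            item['cpval'] = node.xpath('.//li[6]/span/text()').extract()
            item['ImageAddress'] = node.xpath('.//img/@src').extract()
            yield item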
4. Next, create the database. MySQL is used here (assuming MySQL is already installed).
Create the database and switch to it:
create database product;
use product;
Then create the computer table with columns matching the fields in items.py; the database name, table name and column names can of course all be changed to suit your needs:
CREATE TABLE `computer` (
`id` int(4) NOT NULL AUTO_INCREMENT,
`m_name` text NOT NULL,
`m_score` text NOT NULL,
`m_screval` text NOT NULL,
`m_buffval` text NOT NULL,
`m_phtval` text NOT NULL,
`m_yuval` text NOT NULL,
`m_desval` text NOT NULL,
`m_cpval` text NOT NULL,
`m_price` text NOT NULL,
`max_price` text NOT NULL,
`predict` int(10) DEFAULT NULL,
`pricedrop` int(4) DEFAULT NULL,
`accuracy` double(4,1) DEFAULT NULL,
`m_image` varchar(500) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=0 DEFAULT CHARSET=utf8;
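After a crawl you can quickly verify that rows were inserted, for example:

mysql> SELECT m_name, m_price, max_price FROM computer LIMIT 5;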
Finally, you only need to start the spider: scrapy crawl computer. The spider's name is whatever you define in spiders.py via name = "computer".
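If you just want to inspect the scraped items without going through MySQL, Scrapy's standard -o option can export them directly to a file (the filename here is only an example):

scrapy crawl computer -o products.json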
That's everything; you can now scrape the information you want from the pages. To finish, a few screenshots of the results.
Locate the address of the corresponding image: http://2d.zol-img.com.cn/product/114_280x210/77/cemR8HO11mOmE.jpg
If you have questions about any of the above, contact hustzhf@163.com
Let's learn from each other and improve together.