First of all, this article assumes a Windows 7 system with a working Scrapy environment; the code is written in Python. (If you have questions about the setup itself, please search Google or Baidu first.)
1. Assuming your Scrapy environment is configured, open CMD to get a DOS prompt and create a new project:
scrapy startproject <project-name>    (replace <project-name> with your actual project name)
The generated directory looks like the tree below. The files that need editing are items.py, pipelines.py and settings.py; in addition, you write your own spider file under the spiders folder (here, a new spiders.py).
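For reference, running scrapy startproject computer produces a layout like this (the name computer matches the project used throughout this article):

computer/
    scrapy.cfg
    computer/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py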
2. As a worked example, we will scrape product information from a website.
items.py is as follows:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class ComputerItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = Field(serializer=str)          # product name
    price = Field(serializer=str)         # product price
    jprice = Field(serializer=str)        # list price (stored as max_price in the DB)
    #gprice = Field(serializer=str)
    #sprice = Field(serializer=str)       # second-hand price
    score = Field(serializer=str)         # overall rating
    screval = Field(serializer=str)       # screen quality rating
    buffval = Field(serializer=str)       # battery life rating
    phtval = Field(serializer=str)        # camera quality / running speed rating
    yuval = Field(serializer=str)         # entertainment rating
    desval = Field(serializer=str)        # design/appearance rating
    cpval = Field(serializer=str)         # value-for-money rating
    ImageAddress = Field(serializer=str)  # image URL
The fields above are the features to be scraped; feel free to adjust them to your own needs.
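For illustration, here is a tiny sketch of how a spider callback might fill such an item (the values are made up). Each field is stored as a list because XPath's extract() returns lists, and the pipeline below reads each field with [0]:

from computer.items import ComputerItem

item = ComputerItem()
item['name'] = [u'Some Laptop X1']    # made-up value
item['price'] = [u'4999', u'5299']    # multiple prices get joined with '/' in the pipeline
item['jprice'] = [u'5299']
print item['name'][0]                 # -> Some Laptop X1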
pipelines.py is as follows:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy import log
from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors

class ComputerPipeline(object):
    def __init__(self):
        # asynchronous connection pool managed by Twisted
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
            db='product',
            user='root',
            passwd='123456',
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8',
            use_unicode=False
        )

    def process_item(self, item, spider):
        # run the insert in a pool thread so the reactor is not blocked
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

    def _conditional_insert(self, tx, item):
        # deduplicate by product name: skip items already stored
        tx.execute("select * from computer where m_name = %s", (item['name'][0],))
        result = tx.fetchone()
        if result:
            log.msg("Item already stored in db: %s" % item, level=log.DEBUG)
        else:
            # price and cpval arrive as lists; join their entries with '/'
            price = '/'.join(item['price'])
            cpval = '/'.join(item['cpval'])
            tx.execute(
                "insert into computer (m_name,m_score,m_screval,m_buffval,m_phtval,"
                "m_yuval,m_desval,m_price,max_price,m_cpval,m_image) "
                "values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
                (item['name'][0], item['score'][0], item['screval'][0],
                 item['buffval'][0], item['phtval'][0], item['yuval'][0],
                 item['desval'][0], price, item['jprice'][0], cpval,
                 item['ImageAddress'][0]))
            log.msg("Item stored in db: %s" % item, level=log.DEBUG)

    def handle_error(self, e):
        log.err(e)
settings.py is as follows:
# -*- coding: utf-8 -*-

# Scrapy settings for computer project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html

BOT_NAME = 'computer'

SPIDER_MODULES = ['computer.spiders']
NEWSPIDER_MODULE = 'computer.spiders'

# ITEM_PIPELINES must be a dict mapping the pipeline path to an order value
ITEM_PIPELINES = {
    'computer.pipelines.ComputerPipeline': 300,
}

LOG_LEVEL = 'DEBUG'

# be polite: wait 2 seconds (randomized) between requests
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
COOKIES_ENABLED = True

# custom settings; note the pipeline shown earlier hardcodes its connection
MySQL_SERVER = 'localhost'
MySQL_PORT = 3306
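MySQL_SERVER and MySQL_PORT are custom settings; the pipeline above never actually reads them because it hardcodes its connection parameters. If you want the pipeline to pick them up instead, a minimal sketch looks like this (from_settings is a standard Scrapy hook; the fallback defaults here are assumptions):

from twisted.enterprise import adbapi
import MySQLdb.cursors

class ComputerPipeline(object):
    def __init__(self, host, port):
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
            host=host, port=port,
            db='product', user='root', passwd='123456',
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8', use_unicode=False)

    @classmethod
    def from_settings(cls, settings):
        # read the custom settings defined in settings.py
        return cls(settings.get('MySQL_SERVER', 'localhost'),
                   settings.getint('MySQL_PORT', 3306))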
3. This is the most important part: writing the spider itself. The bulk of the work is inspecting the relevant pages and working out the XPath expressions that locate the information you need (work through anything unclear yourself, or ask me).
The spider goes in spiders.py; a minimal sketch follows below.
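Since every site is different, this is only a sketch: the domain, start URL, and every XPath below are placeholders that you must replace with values found by inspecting your target pages. Note that extract() returns lists, which is exactly why the pipeline indexes each field with [0]:

# -*- coding: utf-8 -*-
# spiders/spiders.py -- minimal sketch; all URLs and XPaths are placeholders
from scrapy.spider import Spider
from scrapy.selector import Selector
from computer.items import ComputerItem

class ComputerSpider(Spider):
    name = "computer"                    # used by: scrapy crawl computer
    allowed_domains = ["example.com"]    # placeholder domain
    start_urls = ["http://www.example.com/computer/"]  # placeholder URL

    def parse(self, response):
        sel = Selector(response)
        # placeholder XPath: one node per product on the listing page
        for node in sel.xpath('//div[@class="product"]'):
            item = ComputerItem()
            item['name'] = node.xpath('.//h3/a/text()').extract()
            item['price'] = node.xpath('.//span[@class="price"]/text()').extract()
            item['jprice'] = node.xpath('.//span[@class="max-price"]/text()').extract()
            item['score'] = node.xpath('.//em[@class="score"]/text()').extract()
            item['screval'] = node.xpath('.//li[1]/span/text()').extract()
            item['buffval'] = node.xpath('.//li[2]/span/text()').extract()
            item['phtval'] = node.xpath('.//li[3]/span/text()').extract()
            item['yuval'] = node.xpath('.//li[4]/span/text()').extract()
            item['desval'] = node.xpath('.//li[5]/span/text()').extract()
            item['cpval'] = node.xpath('.//li[6]/span/text()').extract()
            item['ImageAddress'] = node.xpath('.//img/@src').extract()
            yield item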
4. Next, create the database. MySQL is used here (assuming MySQL is already installed).
Create the database and switch to it:
create database product;
use product;
Then create the computer table with columns matching the fields in items.py; the database name, table name and column names can of course all be changed to suit your needs:
CREATE TABLE `computer` (
`id` int(4) NOT NULL AUTO_INCREMENT,
`m_name` text NOT NULL,
`m_score` text NOT NULL,
`m_screval` text NOT NULL,
`m_buffval` text NOT NULL,
`m_phtval` text NOT NULL,
`m_yuval` text NOT NULL,
`m_desval` text NOT NULL,
`m_cpval` text NOT NULL,
`m_price` text NOT NULL,
`max_price` text NOT NULL,
`predict` int(10) DEFAULT NULL,
`pricedrop` int(4) DEFAULT NULL,
`accuracy` double(4,1) DEFAULT NULL,
`m_image` varchar(500) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=0 DEFAULT CHARSET=utf8;
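After a crawl you can quickly verify that rows were inserted, for example:

mysql> SELECT m_name, m_price, max_price FROM computer LIMIT 5;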
Finally, you only need to start the spider: scrapy crawl computer. The spider's name is whatever you define in spiders.py via name = "computer".
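If you just want to inspect the scraped items without going through MySQL, Scrapy's standard -o option can export them directly to a file (the filename here is only an example):

scrapy crawl computer -o products.json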
That's everything; you can now scrape the information you want from the pages. To finish, a few screenshots of the results.
Locate the address of the corresponding image: http://2d.zol-img.com.cn/product/114_280x210/77/cemR8HO11mOmE.jpg
If you have questions about any of the above, contact hustzhf@163.com
Let's learn from each other and improve together.