Python Dangdang Scraper

License: CC 4.0 BY-SA

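The goal is to scrape book information from Dangdang (the title, link, and comment count of each book) and store it in a database. Before starting, create the database: it is named dd and contains a table books with the fields title, link, and comment.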
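The database setup described above could be sketched roughly as follows. The article only gives the database, table, and column names, so the column types, the id column, and the helper name create_schema are assumptions, and the credentials are placeholders to replace with your own.

```python
# Assumed schema: the article names only the database, table, and columns,
# so the column types and the id column below are guesses.
CREATE_DATABASE = "CREATE DATABASE IF NOT EXISTS dd DEFAULT CHARACTER SET utf8mb4"
CREATE_TABLE = (
    "CREATE TABLE IF NOT EXISTS books ("
    "id INT AUTO_INCREMENT PRIMARY KEY, "
    "title VARCHAR(512), "
    "link VARCHAR(512), "
    "comment VARCHAR(64)"
    ")"
)


def create_schema(host="localhost", port=3306, user="root", password="123456"):
    """Create the dd database and the books table (credentials are placeholders)."""
    import pymysql  # imported lazily so the SQL above can be inspected without MySQL

    conn = pymysql.connect(host=host, port=port, user=user, password=password)
    try:
        with conn.cursor() as cursor:
            cursor.execute(CREATE_DATABASE)
            cursor.execute("USE dd")
            cursor.execute(CREATE_TABLE)
        conn.commit()
    finally:
        conn.close()
```

Calling create_schema() once before the first crawl should be enough, since both statements use IF NOT EXISTS.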

1. Create the project: scrapy startproject dangdang

2. Enter the project folder and create the spider file

>scrapy genspider -t basic dd dangdang.com

3. Open the project in PyCharm

Edit items.py

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class DangdangItem(scrapy.Item):
    # One field per column of the books table
    title = scrapy.Field()
    link = scrapy.Field()
    comment = scrapy.Field()

Edit dd.py

# -*- coding: utf-8 -*-
import scrapy
from dangdang.items import DangdangItem
from scrapy.http import Request


class DdSpider(scrapy.Spider):
    name = 'dd'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://dangdang.com/']

    def parse(self, response):
        # Each field holds a list with one entry per book found on the page.
        item = DangdangItem()
        item['title'] = response.xpath('//a[@class="pic"]/@title').extract()
        item['link'] = response.xpath('//a[@class="pic"]/@href').extract()
        item['comment'] = response.xpath('//a[@class="search_comment_num"]/text()').extract()
        yield item
        # Loop over listing pages 2-100. Scrapy's duplicate filter drops URLs it has
        # already seen, so yielding them from every parsed page causes no repeat downloads.
        for i in range(2, 101):
            url = 'http://category.dangdang.com/pg' + str(i) + '-cp01.54.06.00.00.00.html'
            yield Request(url, callback=self.parse)
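The spider builds its page URLs by string concatenation and returns three parallel lists per page. As a rough, standalone sketch of those two ideas, the helpers below generate the listing URLs and pair the lists into one record per book. The helper names are hypothetical, and the assumption that the comment label contains the count as its first run of digits is a guess about the page text, not something the article states.

```python
import re

# URL pattern taken from the spider; the helper names are hypothetical.
PAGE_URL = "http://category.dangdang.com/pg{page}-cp01.54.06.00.00.00.html"


def page_urls(first=2, last=100):
    """Return the listing URLs for pages first..last, inclusive."""
    return [PAGE_URL.format(page=page) for page in range(first, last + 1)]


def comment_count(text):
    """Extract the first run of digits from a comment label, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def pair_fields(titles, links, comments):
    """Turn the three parallel lists into one dict per book.

    zip stops at the shortest list, so a short list drops trailing
    entries instead of raising IndexError.
    """
    return [
        {"title": t, "link": l, "comment": c}
        for t, l, c in zip(titles, links, comments)
    ]
```

Pairing by position is only reliable when every book on the page has all three values. If a book in the middle of the page has no comment link, later entries would shift, so selecting per book node may be safer.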

Enable the pipeline in settings.py

ITEM_PIPELINES = {
    'dangdang.pipelines.DangdangPipeline': 300,
}
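Since the spider requests about a hundred listing pages, it may be worth throttling the crawl in the same file. The fragment below is only a sketch: DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_DOMAIN are standard Scrapy settings, but the values shown are illustrative assumptions, not taken from the article.

```python
# settings.py (fragment): the numeric throttling values are illustrative assumptions
ITEM_PIPELINES = {
    # Pipelines with lower numbers run earlier; values are customarily 0-1000.
    'dangdang.pipelines.DangdangPipeline': 300,
}
DOWNLOAD_DELAY = 1                  # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap on parallel requests per domain
```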

pipelines.py, which writes the data to the database

# -*- coding: utf-8 -*-
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class DangdangPipeline(object):
    def process_item(self, item, spider):
        # Adjust the credentials to match your own MySQL setup.
        conn = pymysql.connect(host='localhost', port=3306, user='root',
                               passwd='123456', db='dd', charset='utf8mb4')
        # Parameterized query: pymysql escapes the values, so quotes in a title cannot break the SQL.
        sql = "insert into books(title, link, comment) values (%s, %s, %s)"
        try:
            with conn.cursor() as cursor:
                # zip stops at the shortest list, avoiding IndexError when the lists differ in length.
                for title, link, comment in zip(item['title'], item['link'], item['comment']):
                    cursor.execute(sql, (title, link, comment))
            conn.commit()
        finally:
            conn.close()
        return item
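The pipeline above opens a new connection for every item. One possible variation, sketched below, opens the connection once in open_spider and closes it in close_spider, which are the hooks Scrapy calls on a pipeline at the start and end of a crawl. The class name ReusedConnectionPipeline and the helper rows_from_item are hypothetical, and the credentials are placeholders.

```python
INSERT_SQL = "insert into books(title, link, comment) values (%s, %s, %s)"


def rows_from_item(item):
    """Pair the parallel lists in an item into (title, link, comment) rows."""
    return list(zip(item['title'], item['link'], item['comment']))


class ReusedConnectionPipeline(object):  # hypothetical name, not from the article
    def open_spider(self, spider):
        import pymysql  # lazy import: only needed when a crawl actually starts
        self.conn = pymysql.connect(host='localhost', port=3306, user='root',
                                    password='123456', database='dd')

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        with self.conn.cursor() as cursor:
            # executemany runs the parameterized insert once per row
            cursor.executemany(INSERT_SQL, rows_from_item(item))
        self.conn.commit()
        return item
```

Whichever pipeline class you use must be the one registered in ITEM_PIPELINES. The crawl is then started from the project folder with scrapy crawl dd.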