Creating a pipeline.py file in a Python crawler and passing data to it to persist titles and URLs

Latest recommended article published on 2023-05-15 10:01:15

兜-兜

Category: Crawlers. Tags: pipeline, creating a pipeline file, open() function, yield

Article link: https://blog.youkuaiyun.com/doudou_wsx/article/details/101977627

1. Scrape the title and URL of each news item on cnblogs (博客园); the spider logic goes in cnblog.py
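Before the spider itself, here is a small standalone sketch of the pattern it relies on: select every post container first, then query each container with a relative path to get its title and link. This is only an illustration under stated assumptions. It uses the standard library's `xml.etree.ElementTree` in place of Scrapy's `Selector`, the `SAMPLE` markup is a made-up, simplified stand-in for the cnblogs front page, and `extract_posts` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the cnblogs front-page markup.
SAMPLE = """
<html><body>
<div id="post_list">
  <div class="post_item">
    <div class="post_item_body">
      <h3><a class="titlelnk" href="https://www.cnblogs.com/a/p/1.html">First post</a></h3>
    </div>
  </div>
  <div class="post_item">
    <div class="post_item_body">
      <h3><a class="titlelnk" href="https://www.cnblogs.com/b/p/2.html">Second post</a></h3>
    </div>
  </div>
</div>
</body></html>
"""


def extract_posts(markup):
    """Return one {'title', 'href'} dict per post container in the markup."""
    root = ET.fromstring(markup.strip())
    # Step 1: select every post body inside the list container.
    nodes = root.findall(".//div[@id='post_list']//div[@class='post_item_body']")
    posts = []
    for node in nodes:
        # Step 2: the leading "./" makes the query relative to this node only.
        link = node.find("./h3/a[@class='titlelnk']")
        if link is None:
            continue  # skip containers without a title link
        posts.append({"title": link.text, "href": link.get("href")})
    return posts


posts = extract_posts(SAMPLE)
for post in posts:
    print(post["title"], post["href"])
```

ElementTree supports only a subset of XPath (for instance, it has no `text()` or `@href` steps), so the sketch reads `link.text` and `link.get("href")` instead; the Scrapy code that follows can use the full expressions.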

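import io
import sys

import scrapy
from scrapy.selector import Selector

from ..items import cnlogsItem

# Re-wrap stdout as UTF-8 so Chinese titles print correctly on consoles with a different default encoding
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8")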

class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    allowed_domains = ['cnblogs.com']
    start_urls = ['http://cnblogs.com/']

    def parse(self, response):
        # Select the body of every post in the front-page post list
        line = Selector(response=response).xpath('//div[@id="post_list"]//div[@class="post_item_body"]')
        # href = Selector(response=response).xpath('//div[@id="post_list"]//div[@class="post_item_body"]/h3/a[@class="titlelnk"]/@href').extract()
        items = []
        for node in line:
            # Relative XPath: the title text of this post only; extract() returns a list of strings
            title = node.xpath('./h3/a[@class="titlelnk"]/text()').extract()