# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class AuthorInfo(scrapy.Item):
    authorName = scrapy.Field()    # author nickname
    authorUrl = scrapy.Field()     # author profile URL

class ReplyItem(scrapy.Item):
    content = scrapy.Field()       # reply body
    time = scrapy.Field()          # time posted
    author = scrapy.Field()        # replier (AuthorInfo)

class TopicItem(scrapy.Item):
    title = scrapy.Field()         # topic title
    url = scrapy.Field()           # topic page URL
    content = scrapy.Field()       # topic body
    time = scrapy.Field()          # time posted
    author = scrapy.Field()        # original poster (AuthorInfo)
    reply = scrapy.Field()         # reply list (list of ReplyItem)
    replyCount = scrapy.Field()    # number of replies
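Since reply and author are ordinary Fields, nested items are stored by plain assignment. A quick illustration of how the three item types compose (all field values below are made-up placeholders, not data from the article):

    author = AuthorInfo(authorName='example_user',
                        authorUrl='https://www.douban.com/people/example_user/')
    reply = ReplyItem(content='a reply body', time='2017-01-01 12:00', author=author)
    topic = TopicItem(title='a topic title',
                      url='https://www.douban.com/group/topic/90895393/',
                      content='the topic body', time='2017-01-01 11:00',
                      author=author, reply=[reply], replyCount=1)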
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from kiwi.items import TopicItem, AuthorInfo, ReplyItem

class KiwiSpider(CrawlSpider):
    name = "kiwi"
    allowed_domains = ["douban.com"]

    anchorTitleXPath = 'a/text()'
    anchorHrefXPath = 'a/@href'

    start_urls = [
        "https://www.douban.com/group/topic/90895393/?start=0",
    ]
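The listing breaks off inside start_urls in the source. The following is only a sketch of how such a CrawlSpider body is typically completed; the URL pattern, the XPath expressions, and the parse_page name are assumptions about Douban's markup, not code recovered from the original article:

    rules = (
        # Follow the pagination links (?start=N) inside a topic so every
        # page of replies reaches the same callback. (Assumed URL scheme.)
        Rule(LinkExtractor(allow=(r'/group/topic/\d+/(\?start=\d+)?$',)),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        sel = Selector(response)
        topic = TopicItem()
        topic['url'] = response.url
        # All XPaths below are illustrative; Douban's real markup may differ.
        topic['title'] = sel.xpath('//h1/text()').extract_first(default='').strip()
        topic['content'] = ' '.join(
            sel.xpath('//div[@class="topic-content"]//p/text()').extract())

        replies = []
        for node in sel.xpath('//ul[@id="comments"]/li'):
            reply = ReplyItem()
            reply['content'] = node.xpath('.//p/text()').extract_first()
            reply['time'] = node.xpath(
                './/span[@class="pubtime"]/text()').extract_first()
            # The class-level anchor XPaths are applied relative to the
            # element that wraps the replier's profile link.
            anchor = node.xpath('.//div[@class="user-face"]')
            author = AuthorInfo()
            author['authorName'] = anchor.xpath(self.anchorTitleXPath).extract_first()
            author['authorUrl'] = anchor.xpath(self.anchorHrefXPath).extract_first()
            reply['author'] = author
            replies.append(reply)

        topic['reply'] = replies
        topic['replyCount'] = len(replies)
        yield topic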

This article shows how to write a crawler with the Python Scrapy framework, covering how to define the Item models, set up XPath rules, walk through the pages, and parse topics and replies. It also demonstrates how to extract poster and replier information and how to assemble a topic's list of replies. Finally, it mentions rotating user agents with a UserAgentMiddleware to avoid being banned by the site.
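On that last point, here is a minimal sketch of a rotating user-agent downloader middleware; the class name, the kiwi.middlewares module path, and the agent strings are assumptions for illustration, not code recovered from the article:

    import random


    class RotateUserAgentMiddleware(object):
        """Assign a randomly chosen User-Agent to every outgoing request."""

        # Illustrative strings; a real project would keep a longer, current list.
        user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36',
            'Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0',
        ]

        def process_request(self, request, spider):
            request.headers['User-Agent'] = random.choice(self.user_agents)

It would be enabled in settings.py, disabling Scrapy's built-in user-agent middleware so the two do not conflict:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'kiwi.middlewares.RotateUserAgentMiddleware': 400,
    }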