First Steps with Scrapy (4) — Scraping Keyword Search Results from the Hexun Forum
According to plan, this post was supposed to cover Scrapy's Spider Middleware. But then a friend who studies finance asked me to help scrape the Hexun forum: given a full-text keyword search, he wanted the content, author, and posting time of every post in the results. Since I happen to be learning Scrapy anyway, it made a perfect practice exercise, a favor that benefits us both, so why not?
1. Code Implementation
The overall approach is simple, so without further ado, here is the code:
# -*- coding: utf-8 -*-
# Python 2 code: urllib.quote moved to urllib.parse.quote in Python 3.
import re
import urllib

import scrapy

from Hexun_Forum.items import HexunForumItem
# from scrapy.shell import inspect_response


class HonglingSpider(scrapy.Spider):
    name = "hongling"
    allowed_domains = ["bbs.hexun.com"]

    @staticmethod
    def __remove_html_tags(text):
        # Strip all HTML tags from a string.
        return re.sub(r'<[^>]+>', '', text)

    def start_requests(self):
        # keywords = getattr(self, 'keywords', None)
        # The site is encoded in gb2312, so the keyword must be encoded
        # to gb2312 bytes before it is percent-escaped for the URL.
        keywords = u'红岭'.encode('gb2312')
        requesturl = "http://bbs.hexun.com/search/?q={0}&type=2&Submit=".format(
            urllib.quote(keywords))
        # 'dont_obey_robotstxt' tells Scrapy's RobotsTxtMiddleware to
        # skip the robots.txt check for this request.
        return [scrapy.Request(requesturl,
                               meta={'dont_obey_robotstxt': True},
                               callback=self.__parse_blog_pages)]

    def __parse_blog_pages(self, response):
        # Follow the link to each post on the search-results page.
        # XPath equivalent:
        # response.xpath('//tr[@class="bg"]/td[@class="f14"]/a[@class="f234"]/@href')
        for blog_url in response.css('tr.bg td.f14 a.f234::attr(href)').extract():
            yield scrapy.Request(blog_url,
                                 meta={'dont_obey_robotstxt': True},
                                 callback=self.parse)
        # Follow the "next" link to the next page of search results.
        # inspect_response(response, self)
        # XPath equivalent:
        # response.xpath('//div[@class="pagenum"]/a[@class="next"]/@href').extract_first()
        next_page = response.css('div.pagenum a.next::attr(href)').extract_first()
        if next_page is not None:
            requesturl = "http://bbs.hexun.com{0}".format(next_page)
            yield scrapy.Request(requesturl,
                                 meta={'dont_obey_robotstxt': True},
                                 callback=self.__parse_blog_pages)
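The listing imports HexunForumItem and hands every post page to self.parse, but neither is shown above. As a rough sketch of what they might look like, the field names follow the three values we want (content, author, posting time), and the CSS selectors in parse are my own placeholders, not taken from the real Hexun page markup:

# items.py -- hypothetical field names; the real ones may differ.
import scrapy

class HexunForumItem(scrapy.Item):
    content = scrapy.Field()  # the body of the post, tags stripped
    author = scrapy.Field()   # who posted it
    time = scrapy.Field()     # when it was posted

# Inside HonglingSpider -- the selectors below are placeholders and
# would need to be worked out against a real post page, e.g. in
# `scrapy shell`.
def parse(self, response):
    item = HexunForumItem()
    item['content'] = self.__remove_html_tags(
        response.css('div.post-body').extract_first() or '')
    item['author'] = response.css('span.author::text').extract_first()
    item['time'] = response.css('span.post-time::text').extract_first()
    yield item

With those in place, the spider runs with the standard command, for example scrapy crawl hongling -o posts.json. The commented-out getattr line also hints that the keyword was meant to be passed on the command line (scrapy crawl hongling -a keywords=红岭) instead of being hardcoded. As for robots.txt, setting dont_obey_robotstxt per request keeps the override local to this spider; the blunter alternative is ROBOTSTXT_OBEY = False in the project's settings.py.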