Scraping Qiushibaike

Latest recommended article published on 2020-10-21 14:10:52

理想与少年

License: CC 4.0 BY-SA

Category: python

Article link: https://blog.youkuaiyun.com/xylander23/article/details/52735631

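I've been learning to write web crawlers in Python lately, and I picked Qiushibaike as my first target. I followed an online tutorial (linked in the original post), but by the time I worked through it Qiushibaike had redesigned its pages, so the tutorial's regular expressions no longer matched and I had to write my own.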

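The approach is the same as in the tutorial: first send request headers (a browser-like User-Agent) so the site does not reject the request, then fetch the page with urllib2.Request() and urllib2.urlopen(), and finally pull out the fields I want with a regular expression.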
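
The script in this post targets Python 2, where urllib2 lives. As a rough sketch of the same first step on Python 3, assuming urllib.request as the replacement module, the request can be built like this. The helper name build_request is made up for illustration, and nothing is sent over the network here:

```python
import urllib.request


def build_request(page=1):
    """Build (but do not send) a request for one page of the hot listing.

    build_request is a hypothetical helper; the URL and the User-Agent
    string are the ones used in the script in this post.
    """
    url = 'http://www.qiushibaike.com/hot/page/' + str(page)
    headers = {'User-Agent': 'Mozilla/4.0 (compatible;MSIE 5.5;Windows NT)'}
    return urllib.request.Request(url, headers=headers)


request = build_request(2)
print(request.full_url)
# Request stores header names capitalized, hence 'User-agent'.
print(request.get_header('User-agent'))
```

Passing the returned object to urllib.request.urlopen() would perform the actual fetch. Whether the site still accepts this User-Agent is not something the sketch can promise.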

The fields I extract are the author, the content, the number of likes and the number of comments.
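
To make the four fields concrete, here is a small offline sketch in Python 3. The HTML in ITEM is a made-up fragment shaped to fit the pattern used in the script, not a copy of the real page, and the Joke tuple is only an illustrative container:

```python
import re
from collections import namedtuple

# Hypothetical container for the four extracted fields.
Joke = namedtuple('Joke', 'author content likes comments')

# Made-up markup that mimics the structure the pattern expects.
ITEM = '''<div class="author clearfix">
<a href="#"><img src="#" alt=""></a>
<a href="#"><h2>{0}</h2></a>
</div>
<div class="content">
<span>{1}</span>
</div>
<div class="stats">
<span><i class="number">{2}</i> likes</span>
<span><i class="number">{3}</i> comments</span>
</div>
'''
sample = (ITEM.format('alice', 'first joke', 120, 8)
          + ITEM.format('bob', 'second joke', 45, 3))

# Same pattern as in the script, split across lines for readability.
PATTERN = re.compile(
    r'<div class="author clearfix">\n<a.*?img.*?h2>(.*?)</h2>.*?</div>'
    r'.*?<div class="content">.*?<span>(.*?)</span>'
    r'.*?<div class="stats".*?i class="number">(.*?)</i>'
    r'.*?i class="number">(.*?)</i>',
    re.S)

# findall returns one tuple of four captured strings per matched item.
jokes = [Joke(*fields) for fields in PATTERN.findall(sample)]
for joke in jokes:
    print(joke.author, joke.content, joke.likes, joke.comments)
```

Note that the counts come back as strings; convert them with int() if they need to be compared or summed.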

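The part I found most troublesome was the regular expression. .*? matches any run of characters (including newlines when re.S is set), so all the markup I don't care about can be handed to it. Just remember to follow it with a distinctive literal that marks where it should stop: .*? consumes as little as possible up to that literal, so a vague or missing stop marker makes it capture the wrong span.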
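
The behaviour of .*? is easier to see on a tiny example. The sketch below (Python 3, with made-up HTML strings) compares the lazy form, the greedy form, a lazy group with nothing after it, and the effect of re.S:

```python
import re

html = '<h2>alice</h2><span>one</span><h2>bob</h2><span>two</span>'

# Lazy: stops at the first </h2> after each <h2>.
lazy = re.findall(r'<h2>(.*?)</h2>', html)

# Greedy: runs to the last </h2> in the string, swallowing everything between.
greedy = re.findall(r'<h2>(.*)</h2>', html)

# Lazy with nothing after it: the shortest possible match is the empty string.
unanchored = re.findall(r'<h2>(.*?)', html)

# Without re.S the dot does not cross a newline; with re.S it does.
multiline = '<span>line one\nline two</span>'
without_flag = re.findall(r'<span>(.*?)</span>', multiline)
with_flag = re.findall(r'<span>(.*?)</span>', multiline, re.S)

print(lazy)
print(greedy)
print(unanchored)
print(without_flag, with_flag)
```

This is why every .*? in the pattern is followed by a fixed piece of markup such as </h2> or <span>: the literal tells the lazy match where to stop.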

# -*- coding: utf-8 -*-
__author__ = 'XYlander'


import urllib2
import re

class crawler(object):
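    def requesturl(self):
        # Fetch the first page of the hot listing and keep the response for later parsing.
        page = 1
        url = 'http://www.qiushibaike.com/hot/page/' + str(page)  # URL of the page to fetch
        user_agent = 'Mozilla/4.0 (compatible;MSIE 5.5;Windows NT)'
        headers = {'User-Agent': user_agent}  # browser-like header so the request is not rejected
        try:
            request = urllib2.Request(url, headers=headers)
            self._response = urllib2.urlopen(request)
            #print self._response.read()
            print 'get'
        except urllib2.URLError as e:
            # An HTTPError carries .code; a plain URLError carries .reason.
            if hasattr(e, "code"):
                print e.code
            if hasattr(e, "reason"):
                print e.reason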
    def absruct(self):
        # Nothing to parse if requesturl() failed or was never called.
        if getattr(self, '_response', None) is None:
            print 'no response to parse'
            return
        content = self._response.read().decode('utf-8')
        # Groups, in order: author, content, number of likes, number of comments.
        pattern = re.compile('<div class="author clearfix">\n<a.*?img.*?h2>(.*?)</h2>.*?</div>.*?<div class="content">.*?<span>(.*?)</span>.*?<div class="stats".*?i class="number">(.*?)</i>.*?i class="number">(.*?)</i>',re.S)
        items = re.findall(pattern,content)
        for item in items:
            #print 'the author is {0},the content is {1},the number of like is {2},the number of comment is {3}'.format(item[0].encode('utf-8'),item[1],item[2],item[3])
            print 'the author is %s, the content is %s, the number of likes is %s, the number of comments is %s' % (item[0], item[1], item[2], item[3])

if __name__ == '__main__':
    cr = crawler()
    cr.requesturl()
    cr.absruct()