Writing a simple Python crawler to scrape animated images from Neihan Duanzi

Latest recommended article published on 2021-08-11 14:36:45

小菜鸟bird

Category: python. Tags: python, crawler, regular expressions

Article link: https://blog.youkuaiyun.com/oqqFengniao123456789/article/details/45226791

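This article walks through writing a Python crawler that scrapes animated images from Neihan Duanzi: analyzing the page structure, extracting URLs with regular expressions, downloading the images, and fetching more data by imitating a click on the "Load more" button.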

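A while ago, while browsing Zhihu, I came across a thread about writing crawlers in Python: www.zhihu.com/question/20899988

That made me want to try crawling something with Python myself. I originally planned to crawl and download Baidu Images results by keyword, but I ran into obstacles along the way and put that on hold. I then noticed that the page structure of Neihan Duanzi is somewhat simpler, so I wrote a crawler that downloads its images instead.

While writing this program I followed the blog of one of the authors in that Zhihu thread: blog.youkuaiyun.com/pleasecallmewhy/article/details/8929576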

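Writing the program breaks down into the following steps:

1. Analyze the page structure of the Neihan community site

2. Use regular expressions to find the URLs to download

3. Download the images

The first step is the critical one: if the page is analyzed incorrectly, none of the later steps can proceed.

1. Open the funny-pictures page of Neihan Duanzi: http://neihanshequ.com/pic/

The page that opens is the picture feed.

This page holds the funny pictures we want, but first we need its HTML. For that I used Python's urllib2 library; the code is as follows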

import urllib2

def get_html(url):
    print "---------------now get html from url :" + url + "----------"

    # Request headers copied from the browser
    send_headers = {
     'Host':'neihanshequ.com',
     'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0',
     'Cookie':"pksrqup=1; csrftoken=237f4451075fe45cef3a4f5449f70658; tt_webid=3379513254; uuid=\"w:33266c46f0cc4fa6944c073b1b1bccea\"",
     'Connection':'keep-alive'
    }

    req = urllib2.Request(url, headers=send_headers)
    try:
        response = urllib2.urlopen(req, timeout=100)
        html = response.read()
        return html
    # HTTPError is a subclass of URLError, so it has to be caught first
    except urllib2.HTTPError as e:
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    except urllib2.URLError as e:
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    # Reached only when the request failed
    return None

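The header values to send with urllib2 can be seen with Firefox's Firebug plugin: copy the request headers from there and paste them into the headers dict above. The Cookie header must be included; without it the HTML is not returned. For a detailed introduction to urllib2, see the blog mentioned above, which explains it very clearly.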
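For readers on Python 3, where urllib2 has been folded into urllib.request, the same request could be sketched roughly as follows. This is only an illustration of the idea, not the code used in this article: the names build_request and get_html_py3 are my own, and the cookie value is a placeholder you would copy from your own browser session.

```python
import urllib.error
import urllib.request


def build_request(url, cookie):
    """Build a Request carrying browser-like headers (values are examples)."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) "
                      "Gecko/20100101 Firefox/37.0",
        "Cookie": cookie,  # placeholder: copy the real value from your browser
        "Connection": "keep-alive",
    }
    return urllib.request.Request(url, headers=headers)


def get_html_py3(url, cookie, timeout=100):
    """Return the response body as bytes, or None if the request fails."""
    try:
        request = build_request(url, cookie)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as e:  # subclass of URLError, so catch it first
        print("The server couldn't fulfill the request. Error code:", e.code)
    except urllib.error.URLError as e:
        print("We failed to reach a server. Reason:", e.reason)
    return None
```

The Host header is left out here because urllib.request fills it in from the URL.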

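With the HTML in hand, let's look at it. Its structure is fairly clear.

Each post is a div, and the title, the image and the comments each sit in their own div.

Inside the div with class = content-wrapper we find a tag that carries data-text and data-pic attributes.

data-text is the caption of the picture and data-pic is its address. So our job is to collect every data-pic and data-text on the page (the latter can later serve as the image's file name).

Extracting these two fields from the HTML calls for Python's regular expressions. The patterns used here are very simple, and I wrote them by imitation; a fuller re tutorial is also available on the blog mentioned above.

My re parsing code is the pair of methods __getDataPic and __getDataByRe in the full source further down.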
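To show the idea in isolation, here is a small Python 3 sketch that runs the same kind of pattern over a made-up snippet. The markup in SAMPLE_HTML is an assumption modelled on the description above (the real page may differ), and find_attr is a hypothetical helper name.

```python
import re

# Hypothetical snippet modelled on the description above; real markup may differ.
SAMPLE_HTML = '''
<div class="content-wrapper">
  <div data-text="first caption" data-pic="http://example.com/a.gif"></div>
  <div data-text="second caption" data-pic="http://example.com/b.gif"></div>
</div>
'''


def find_attr(html, name):
    """Return every value of the attribute `name`, in document order."""
    return re.findall(re.escape(name) + r'="([^"]*)"', html)


pics = find_attr(SAMPLE_HTML, "data-pic")
texts = find_attr(SAMPLE_HTML, "data-text")
for text, pic in zip(texts, pics):
    print(text, "->", pic)
```

Because both lists come back in document order, the caption at index i belongs to the image at index i, provided every post carries both attributes.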

With that, every image address can be parsed out of the HTML we just fetched, and the images can then be downloaded with urllib's urlretrieve function.
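One practical wrinkle when captions are used as file names is that they may contain characters the file system rejects. The following Python 3 sketch shows one possible way to handle that; safe_name and download are hypothetical helpers of my own, the 80-character cut-off is arbitrary, and the .jpg extension simply mirrors the source code in this article.

```python
import re
import shutil
import urllib.request
from pathlib import Path


def safe_name(text, fallback="image"):
    """Turn a caption into something usable as a file name."""
    # Replace runs of characters that Windows forbids in file names.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n]+', "_", text).strip(" ._")
    return cleaned[:80] or fallback


def download(url, caption, folder):
    """Save `url` as <folder>/<caption>.jpg and return the target path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / (safe_name(caption) + ".jpg")
    with urllib.request.urlopen(url, timeout=100) as response, \
            open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

A call such as download(pic_url, caption, "pictures") would then write the image into a pictures folder next to the script.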

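----------------- At this point you can already download a few dozen images

Why only a few dozen?

Because so far we have only fetched the HTML of the first page. So how do we get the rest?

Notice the "Load more" button at the bottom of the page: clicking it loads more images.

Again we use Firebug to capture the traffic.

Open the resulting GET request and its response

Request: http://neihanshequ.com/pic/?is_json=1&max_time=1429794628

Response: entering the request URL in a browser returns a JSON response

Expanding the JSON step by step leads to the image entries.

What we need sits right under large_image.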
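As a rough Python 3 sketch of walking that structure, the snippet below parses a made-up payload. The nesting (data, data, group, large_image, url_list, url) follows the keys used in the source code further down; the sample values and the helper name parse_page are assumptions for illustration only.

```python
import json

# Hypothetical payload shaped like the fields used in this article.
SAMPLE = json.dumps({
    "data": {
        "data": [
            {"group": {"content": "caption one",
                       "large_image": {
                           "url_list": [{"url": "http://example.com/1.gif"}]}}},
            {"group": {"content": "a post without an image"}},
        ],
    }
})


def parse_page(raw):
    """Return a list of (caption, url) pairs found in one JSON response."""
    payload = json.loads(raw)["data"]
    items = []
    for entry in payload.get("data", []):
        group = entry.get("group", {})
        urls = group.get("large_image", {}).get("url_list", [])
        if urls:  # skip posts that carry no large image
            items.append((group.get("content", ""), urls[0]["url"]))
    return items


items = parse_page(SAMPLE)
```

Using .get with defaults means a post that lacks an image is skipped rather than aborting the whole page.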

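Looking closely at the JSON response, you will find a min_time field, which is a Unix timestamp. This min_time is exactly the max_time of the next request.

Repeating this cycle retrieves all the images!

The HTML fetched for the first page likewise contains a max_time value.

So our task basically comes down to repeatedly parsing the JSON and downloading.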
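The paging loop described above can be sketched in Python 3 as follows. It follows the prose (min_time of one response becomes max_time of the next request), whereas the source code further down reads the timestamp from data['max_time'], so treat the field name as an assumption. The fetch function is passed in, and fake_fetch is a stand-in for the network with invented data.

```python
import json

BASE_URL = "http://neihanshequ.com/pic/?is_json=1&max_time="


def crawl(first_max_time, fetch, max_pages=3):
    """Follow the min_time -> max_time chain and return all image URLs seen."""
    urls, max_time, pages = [], first_max_time, 0
    while max_time and pages < max_pages:
        raw = fetch(BASE_URL + str(max_time))
        if not raw:
            break
        data = json.loads(raw)["data"]
        for entry in data["data"]:
            urls.append(entry["group"]["large_image"]["url_list"][0]["url"])
        next_time = data.get("min_time")
        if next_time == max_time:  # no progress, so stop instead of looping forever
            break
        max_time = next_time
        pages += 1
    return urls


def fake_fetch(url):
    """Stand-in for the network: two invented pages keyed by max_time."""
    pages = {
        "100": {"data": {"min_time": 90, "data": [
            {"group": {"large_image": {"url_list": [{"url": "u1"}]}}}]}},
        "90": {"data": {"min_time": None, "data": [
            {"group": {"large_image": {"url_list": [{"url": "u2"}]}}}]}},
    }
    page = pages.get(url.rsplit("=", 1)[1])
    return json.dumps(page) if page else None


result = crawl(100, fake_fetch)
```

The max_pages cap and the no-progress check are there so that a test run cannot page through the site indefinitely.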

Below is the source code of my first version

# -*- coding: utf-8 -*-

import urllib2
import urllib
import re
import random
import json


# Crawler class for Neihan Duanzi
class neiHanSpider :
    def  __init__(self):
        self.primer_url = 'http://neihanshequ.com/pic/'
        # URL requested after clicking "Load more"
        self.base_url   = 'http://neihanshequ.com/pic/?is_json=1&max_time='

    def Start(self):
        # Fetch the HTML of the first page and extract data-pic and max_time from it
        primer_html = self.__getHtml(self.primer_url)
        if not primer_html:
            print "failed to fetch the first page"
            return
        data_pic   = self.__getDataPic(primer_html)
        max_time   = self.__getMaxTime(primer_html)
        # Download the images found on the first page
        self.__downloadPic(data_pic)
        count = 0
        # Then download the images returned by each "Load more" request
        while max_time:
            count = count + 1
            print "=--------------------THIS IS THE " + str(count) + " Json Data  Time : " + str(max_time) + "--------------------"
            url = self.base_url + str(max_time)
            json_data = self.__getHtml(url)
            json_ret  = self.__parseJson(json_data)
            # Stop when the response could not be parsed
            if not json_ret:
                break
            max_time =  json_ret['max_time']
            print max_time
            image_url = json_ret['image_url']
            image_content = json_ret['image_content']
            self.__downloadPic(image_url,image_content)

    # In Python, names that start with two underscores are treated as private
    # (__getHtml below retries each request up to 5 times)

    # Parse the JSON response and pull out the fields we need
    def __parseJson(self,json_data):
        print "------This is parse_json --------"
        image_content = []
        image_url = []
        try :
            # json.loads sits inside the try so that an empty or malformed
            # response is reported instead of crashing the crawler
            dct = json.loads(json_data)
            max_time = dct['data']['max_time']
            data = dct['data']['data']
            for item in data:
                content = item['group']['content']
                url     = item['group']['large_image']['url_list'][0]['url']
                image_content.append(content)
                image_url.append(url)

            ret = {}
            ret['image_content'] = image_content
            ret['image_url']    =  image_url
            ret['max_time']   = max_time
            return ret
        except (TypeError, ValueError, KeyError, IndexError):
            print "json_parse error"
            return None

    # Download every image in imageAddressList; contentList optionally supplies the file names
    def __downloadPic(self,imageAddressList,contentList = None):
        print "---download------"
        if contentList is None:
            contentList = []
        count = 0
        for image in imageAddressList :
            print image
            count = count + 1
            # Use the caption as the file name when there is one, otherwise a random number
            if count <= len(contentList) and contentList[count - 1]:
                tail = contentList[count - 1]
            else :
                tail = str(random.randint(0,30000000))
            try :
                fullPath = "C:\\Users\\Administrator\\Desktop\\python\\" + tail + ".jpg"
                urllib.urlretrieve(image , fullPath)
            except Exception:
                failedMsg = "Download of image " + str(count) + " failed, URL: " + str(image)
                print failedMsg


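    def __getDataPic(self,html):
        re_str = r'data-pic="([^"]*)"'
        data_pic = self.__getDataByRe(html,re_str)
        return data_pic

    def __getMaxTime(self,html):
        re_str = r'max_time: \'([\d]*)\''
        matches = self.__getDataByRe(html,re_str)
        # findall returns a list, but the request URL needs the timestamp
        # itself, so return the first match (or None when there is none)
        if matches:
            return matches[0]
        return None

    # Return every match of the pattern re_str in text
    def __getDataByRe(self,text,re_str):
        pattern = re.compile(re_str)
        ret = pattern.findall(text)
        return ret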

        
    # Fetch a URL, trying up to 5 times; returns None if every attempt fails
    def __getHtml(self,url):
        print "GET HTML********"
        count = 0
        while count < 5:
            count = count + 1
            print str(count) + " times ,try download html"
            html = self.__getDataByUrl(url)
            if html:
                return html
        return None

    def __getDataByUrl(self,url):
        print "---------------now get html from url :" + url + "----------"
        send_headers = {
         'Host':'neihanshequ.com',
         'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0',
         'Cookie':"pksrqup=1; csrftoken=237f4451075fe45cef3a4f5449f70658; tt_webid=3379513254; uuid=\"w:33266c46f0cc4fa6944c073b1b1bccea\"",
         'Connection':'keep-alive'
        }
        req = urllib2.Request(url ,headers=send_headers)
        try:
            response = urllib2.urlopen(req ,timeout = 100)
            html = response.read()
            return html
        # HTTPError is a subclass of URLError, so it has to be caught first
        except urllib2.HTTPError as e:
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
        except urllib2.URLError as e:
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
        # Reached only when the request failed
        return None



#------------------------------------------ Program entry point ------------------------------


mySpider = neiHanSpider()
mySpider.Start()

After that I also tried a multithreaded version

# -*- coding: utf-8 -*-

import urllib2
import urllib
import re
import threading
import time
import json


# Crawler class for Neihan Duanzi
class neiHanSpider :
    def  __init__(self ):
        self.primer_url = 'http://neihanshequ.com/pic/'
        # URL requested after clicking "Load more"
        self.base_url   = 'http://neihanshequ.com/pic/?is_json=1&max_time='

    def Start(self):
        # Fetch the HTML of the first page and extract data-pic and max_time from it
        primer_html = self.__getHtml(self.primer_url)
        if not primer_html:
            print "failed to fetch the first page"
            return
        # The first page's own images are not queued in this version
        data_pic   = self.__getDataPic(primer_html)
        max_time   = self.__getMaxTime(primer_html)
        global downloadUrlList
        global downloadTitleList
        count = 0
        # Collect the images returned by "Load more" requests;
        # count <= 1 limits this run to two JSON pages
        while max_time  and count <= 1:
            count = count + 1
            print "=--------------------THIS IS THE " + str(count) + " Json Data  Time : " + str(max_time) + "--------------------"
            url = self.base_url + str(max_time)
            json_data = self.__getHtml(url)
            json_ret  = self.__parseJson(json_data)
            # Stop when the response could not be parsed
            if not json_ret:
                break
            max_time =  json_ret['max_time']
            print max_time
            image_url = json_ret['image_url']
            image_content = json_ret['image_content']
            # Queue the images here; the worker threads do the downloading
            downloadUrlList = downloadUrlList + image_url
            downloadTitleList = downloadTitleList + image_content
    # In Python, names that start with two underscores are treated as private
    # (__getHtml below retries each request up to 5 times)

    # Parse the JSON response and pull out the fields we need
    def __parseJson(self,json_data):
        print "------This is parse_json --------"
        image_content = []
        image_url = []
        try :
            # json.loads sits inside the try so that an empty or malformed
            # response is reported instead of crashing the crawler
            dct = json.loads(json_data)
            max_time = dct['data']['max_time']
            data = dct['data']['data']
            for item in data:
                content = item['group']['content']
                url     = item['group']['large_image']['url_list'][0]['url']
                image_content.append(content)
                image_url.append(url)

            ret = {}
            ret['image_content'] = image_content
            ret['image_url']    =  image_url
            ret['max_time']   = max_time
            return ret
        except (TypeError, ValueError, KeyError, IndexError):
            print "json_parse error"
            return None

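    def __getDataPic(self,html):
        re_str = r'data-pic="([^"]*)"'
        data_pic = self.__getDataByRe(html,re_str)
        return data_pic

    def __getMaxTime(self,html):
        re_str = r'max_time: \'([\d]*)\''
        matches = self.__getDataByRe(html,re_str)
        # findall returns a list, but the request URL needs the timestamp
        # itself, so return the first match (or None when there is none)
        if matches:
            return matches[0]
        return None

    # Return every match of the pattern re_str in text
    def __getDataByRe(self,text,re_str):
        pattern = re.compile(re_str)
        ret = pattern.findall(text)
        return ret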

        
    # Fetch a URL, trying up to 5 times; returns None if every attempt fails
    def __getHtml(self,url):
        print "GET HTML********"
        count = 0
        while count < 5:
            count = count + 1
            print str(count) + " times ,try download html"
            html = self.__getDataByUrl(url)
            if html:
                return html
        return None

    def __getDataByUrl(self,url):
        print "---------------now get html from url :" + url + "----------"
        send_headers = {
         'Host':'neihanshequ.com',
         'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0',
         'Cookie':"pksrqup=1; csrftoken=237f4451075fe45cef3a4f5449f70658; tt_webid=3379513254; uuid=\"w:33266c46f0cc4fa6944c073b1b1bccea\"",
         'Connection':'keep-alive'
        }
        req = urllib2.Request(url ,headers=send_headers)
        try:
            response = urllib2.urlopen(req ,timeout = 100)
            html = response.read()
            return html
        # HTTPError is a subclass of URLError, so it has to be caught first
        except urllib2.HTTPError as e:
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
        except urllib2.URLError as e:
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
        # Reached only when the request failed
        return None



class myDownLoad (threading.Thread):
    def __init__(self, threadID, name):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
    def run(self):
        print "Starting " + self.name
        global pos
        while True:
            # Claim the next index while holding the lock, so that no two
            # threads download the same image and index 0 is not skipped.
            # The with block releases the lock on exit, including on break.
            with threadLock:
                if pos >= size:
                    break
                temp_pos = pos
                pos = pos + 1
            image_url = downloadUrlList[temp_pos]
            tail = downloadTitleList[temp_pos]
            try :
                fullPath = "C:\\Users\\Administrator\\Desktop\\python\\" + tail + ".jpg"
                urllib.urlretrieve(image_url , fullPath)
                print "Pos :" + str(temp_pos) + "  DownLoad Ok----------"
            except Exception:
                failedMsg = "Download of image " + str(temp_pos) + " failed, URL: " + str(image_url)
                print failedMsg
        # The thread ends when run() returns; no explicit exit call is needed


#------------------------------------------ Program entry point ------------------------------
startTime = time.time()
downloadUrlList = []
downloadTitleList = []
pos = 0
size = 0
mySpider = neiHanSpider()
mySpider.Start()

print str(len(downloadUrlList)) + "----->" + str(len(downloadTitleList))

threadLock = threading.Lock()
threads = []
size =  len(downloadUrlList)


# Start the worker threads (range(1,10) yields 9 of them)
for i in range(1,10) :
    thread = myDownLoad(i,"Thread-" + str(i))
    thread.start()
    threads.append(thread)

# Wait for every worker to finish before reporting the elapsed time
for thread in threads :
    thread.join()


endTime = time.time()
print "Downloaded " + str(size) + " images in " + str((endTime - startTime) / 60) + " min"
print "Exiting Main Thread"
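
As an alternative to managing threads and a shared index by hand, a thread pool can hand out the work. The Python 3 sketch below uses concurrent.futures from the standard library; it is not the approach used in this article, and download_all together with its download_one parameter are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(jobs, download_one, workers=10):
    """Run download_one(url, title) for every job and return the failed jobs."""
    def attempt(job):
        try:
            download_one(*job)
            return None
        except Exception:
            return job  # remember the job so the caller can retry or report it

    # Leaving the with block waits for all submitted work to finish.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(attempt, jobs))
    return [job for job in results if job is not None]
```

Here jobs would be a list of (url, title) pairs, for example built with zip from the two download lists, and download_one any function that saves a single image. The pool takes care of handing each job to exactly one worker, so no explicit lock is needed.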

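The code may not be very tidy; I will clean it up when I have time. I am learning Python as I go, so criticism and corrections are welcome.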