Fetching Movie Links with a Python Crawler (Continued)


Licensed under CC 4.0 BY-SA

Tags:

#python #crawler #multithreading #movie-download-links

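This article shows how to use a Python crawler to scrape movie download links from sites whose search requires a POST request. It uses Fiddler breakpoints to inspect the traffic and, taking BT天堂 as the example, walks through finding the POST request URL, submitting the form data, and structuring the crawler so that it can support multiple sites, with the results saved to a TXT file.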

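Both sites in the previous article used GET requests. They were easy to fetch and had no anti-scraping measures at all. When I later tried BT天堂 and 影视大全网, I found that on many more sites the search page is a POST request that requires submitting a form.

So this article extends the earlier program so that it can also crawl sites that require POST requests.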

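First, a small Fiddler tip: use breakpoints. Open the Rules menu, choose Automatic Breakpoints from its drop-down list, and then select After Request. This makes it easier to see what the browser submits and which URL it requests.

BT天堂 serves as the example here; its URL is http://www.bttiantangs.com/. The ideal approach would of course be to load the page first and pick up its cookies, but in practice I found that the site does not check cookies.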

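Fiddler shows that the search is submitted to http://www.bttiantangs.com/e/search/new.php, that the request type is POST, and that only one field, 'keyboard', is submitted. Note that this field is sent as a plain string and needs no further processing.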
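
To make the shape of that request concrete, here is a minimal sketch that builds (but does not send) an equivalent POST using only Python's standard library. The helper name `build_search_request` is made up for illustration. The article's own program uses `requests`, which performs the form encoding itself when given a dictionary.

```python
import urllib.parse
import urllib.request

SEARCH_URL = 'http://www.bttiantangs.com/e/search/new.php'  # URL observed in Fiddler


def build_search_request(keyword):
    """Build (without sending) a POST whose only form field is 'keyboard'.

    Hypothetical helper, for illustration only.
    """
    # urlencode produces an application/x-www-form-urlencoded body
    body = urllib.parse.urlencode({'keyboard': keyword}).encode('utf-8')
    # Supplying data makes urllib treat the request as a POST
    return urllib.request.Request(SEARCH_URL, data=body)


req = build_search_request('matrix')
print(req.get_method())  # POST
print(req.data)          # b'keyboard=matrix'
```

Nothing is sent over the network here, so the sketch is safe to run when you only want to compare the body against what Fiddler captured.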

For extensibility, everything related to the site is kept in a single dictionary:

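# Everything the crawler needs to know about one site
btmovie = {
    'name': 'BT天堂',                                          # site name
    'root': 'http://www.bttiantangs.com',                      # home page address
    'posturl': 'http://www.bttiantangs.com/e/search/new.php',  # request URL (POST)
    'dict': {'keyboard': ''},                                  # form to submit; keyword is filled in later
    'encode': 'utf-8',                                         # page encoding
    'pat': [                                                   # regular expressions for extracting links
        r'<a href="(.*?)" class="zoom" rel="bookmark" target="_blank" title="(.*?)">',
        r'<em>.*?</em></a><a href="(.*?)" target="_blank">',
    ],
}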
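
The entries under 'pat' are ordinary regular expressions. As a rough sketch of how the first one behaves, the snippet below runs it over an HTML fragment that was invented to match the pattern (it is not real output from the site). Joining each link onto 'root' is an assumption about how relative links would be resolved.

```python
import re
from urllib.parse import urljoin

ROOT = 'http://www.bttiantangs.com'
# First pattern from the 'pat' list: captures (link, title) pairs
RESULT_PAT = r'<a href="(.*?)" class="zoom" rel="bookmark" target="_blank" title="(.*?)">'

# Invented fragment shaped to match the pattern; not real site output
sample_html = (
    '<div><a href="/movie/123.html" class="zoom" rel="bookmark" '
    'target="_blank" title="Example Movie"><img src="poster.jpg"></a></div>'
)

# findall returns one (href, title) tuple per match
matches = re.findall(RESULT_PAT, sample_html)
results = [(title, urljoin(ROOT, href)) for href, title in matches]
print(results)  # [('Example Movie', 'http://www.bttiantangs.com/movie/123.html')]
```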

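The dictionary holds the site's home page address, its name, the request URL, the encoding, the form to submit, and the regular expressions used to extract information. The benefit is that supporting another site only takes one more dictionary. As the last step of storing this data, the dictionaries for sites of this type are collected in a list: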

postlist=[]
postlist.append(btmovie)
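
As a sketch of what adding one more dictionary could look like, the snippet below registers a second site and checks that it carries the same keys as the BT天堂 entry. The `register_site` helper, the `REQUIRED_KEYS` set, and every value in `other_site` are placeholders made up for illustration; example.com is not a real movie site.

```python
REQUIRED_KEYS = {'name', 'root', 'posturl', 'dict', 'encode', 'pat'}


def register_site(registry, site):
    """Append a site description after checking it has every expected key."""
    missing = REQUIRED_KEYS - set(site)
    if missing:
        raise ValueError(f"site {site.get('name', '?')!r} is missing keys: {sorted(missing)}")
    registry.append(site)


postlist = []

# Placeholder values only; example.com is not a real movie site
other_site = {
    'name': 'Example Site',
    'root': 'http://example.com',
    'posturl': 'http://example.com/search',
    'dict': {'q': ''},
    'encode': 'utf-8',
    'pat': [r'<a href="(.*?)" title="(.*?)">'],
}
register_site(postlist, other_site)
print([site['name'] for site in postlist])  # ['Example Site']
```

Checking the keys up front means a typo in a new entry fails at registration time rather than in the middle of a crawl.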

Create a thread that prepares the form data to be submitted:

import threading


class posttask(threading.Thread):
    """Fills in each site's form data, then starts the fetching thread."""

    def __init__(self, keyword):
        threading.Thread.__init__(self)
        self.keyword = keyword

    def run(self):
        for item in postlist:  # prepare the dictionary that will be sent
            if item['name'] == 'BT天堂':
                item['dict']['keyboard'] = self.keyword
        se = posturlget(urls=postlist)
        se.start()

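By checking the site name, this class can prepare the form differently for each site. It then passes the list of sites, including their request URLs, on to the next thread.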
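
As more sites are added, the chain of name checks will grow. One possible alternative, sketched here under the assumption that every site needs the keyword written into exactly one form field, is a lookup table from site name to field name. `KEYWORD_FIELD` and `fill_forms` are hypothetical names and are not part of the original program.

```python
# Hypothetical mapping: site name -> form field that carries the search keyword
KEYWORD_FIELD = {
    'BT天堂': 'keyboard',
}


def fill_forms(sites, keyword):
    """Write the search keyword into each known site's form dictionary."""
    for site in sites:
        field = KEYWORD_FIELD.get(site['name'])
        if field is not None:  # unknown sites are left untouched
            site['dict'][field] = keyword
    return sites


sites = [{'name': 'BT天堂', 'dict': {'keyboard': ''}}]
fill_forms(sites, 'matrix')
print(sites[0]['dict'])  # {'keyboard': 'matrix'}
```

Sites whose forms need real preprocessing (an encoded keyword, extra hidden fields) would still need their own branch, so this only covers the simple case.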

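Next, the search results have to be requested and the relevant links extracted from them. This is done in a new thread, the posturlget class used in the program above: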

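import threading
import time

import requests


class posturlget(threading.Thread):
    def __init__(self, urls):
        threading.Thread.__init__(self)
        self.urls = urls
        self.encode = ''
        self.pat = []
        self.id = ''
        self.data = {}  # form data sent with the POST
        self.res = ''   # text of the most recent response
        self.root = ''

    def postlist(self, url, counts=3):
        """POST self.data to url and keep the response text, retrying up to counts times."""
        try:
            # headers: the request-header dict, assumed to be defined elsewhere in the program
            webpage = requests.post(url=url, headers=headers, data=self.data)
            webpage.encoding = self.encode
            self.res = webpage.text
        except Exception as f:
            print(f)
            if counts > 0:
                counts -= 1
                time.sleep(1)  # wait a moment before retrying
                self.postlist(url, counts)
            else:
                print('Request failed')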
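
The retry logic above calls itself recursively. For comparison, here is a sketch of the same idea written as a loop around an arbitrary callable. `retry` and `flaky` are hypothetical names; the default of four attempts mirrors one initial try plus `counts=3` retries.

```python
import time


def retry(func, attempts=4, delay=1.0):
    """Call func() until it succeeds or the attempts run out.

    Returns func()'s result, or None if every attempt raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            print(f'attempt {attempt} failed: {exc}')
            if attempt < attempts:
                time.sleep(delay)  # pause before the next try
    print('request failed')
    return None


# Stand-in for a network call: fails twice, then succeeds
calls = {'n': 0}


def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('temporary failure')
    return 'ok'


result = retry(flaky, delay=0)
print(result)  # ok
```

In the crawler, `func` could be a small function that wraps the `requests.post` call, which keeps the retry policy separate from the request itself.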
