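I have been learning web scraping recently and came across multithreaded crawlers, so I wrote this post on the topic in the hope that it helps beginners. I wrote two crawlers, both of which fetch movie information from Douban Movies. In the end, the multithreaded version did save a lot of time compared with the single-threaded one. I will post the code first and then analyze it briefly. (PS: some prior scraping experience is recommended; the Python version is 2.7.)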
#coding:utf-8
import requests
import json
import time


def get_html(url):
    # Send a browser-like User-Agent with the request.
    header = {
        'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.89 Safari/537.36',
    }
    try:
        res = requests.get(url, headers=header)
        return res.text
    except requests.RequestException:
        print u"Request failed"
        return None


def get_data(html):
    # Skip pages that could not be downloaded.
    if html is None:
        return
    a = json.loads(html)
    a_list = a[u'subjects']
    if not a_list:
        return
    # Print the first movie of the page as a progress indicator.
    item = a_list[0]
    print item[u'title'], item[u'rate']
    save_data(a_list)


def save_data(a_list):
    # Append one line per page: every title followed by its rating.
    res = u""
    for item in a_list:
        s1 = item[u'title']
        s2 = item[u'rate']
        res += s1 + s2
    with open('t1.txt', 'ab') as f:
        f.write(res.encode('utf-8'))
        f.write('\r\n')


if __name__ == '__main__':
    begin = time.time()
    # 15 pages of 20 movies each: page_start = 0, 20, ..., 280.
    for i in range(0, 300, 20):
        url = 'https://movie.douban.com/j/search_subjects?type=movie&tag=%E6%9C%80%E6%96%B0&page_limit=20&page_start=' + str(i)
        html = get_html(url)
        get_data(html)
    end = time.time()
    # Total running time in seconds.
    print end - begin
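That is the single-threaded code for scraping the movie information. The response is JSON, which is why the json module is imported. Anyone with a little scraping experience should be able to follow it, so I will not explain it further.
[Figure: single-threaded run time]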
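The JSON handling in the single-threaded script can be tried without touching the network. The following is a minimal sketch, assuming the response has the shape the script relies on (a `subjects` list whose items carry `title` and `rate`). The sample payload and the helper name `parse_subjects` are made up for illustration, and the code is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import json

# Hypothetical sample payload: only the keys the script reads are included.
SAMPLE = u'{"subjects": [{"title": "Movie A", "rate": "7.9"}, {"title": "Movie B", "rate": "6.5"}]}'


def parse_subjects(text):
    """Return (title, rate) pairs from a search_subjects-style JSON string."""
    data = json.loads(text)
    # .get() with a default keeps a response without 'subjects' from raising.
    return [(item[u'title'], item[u'rate']) for item in data.get(u'subjects', [])]


pairs = parse_subjects(SAMPLE)
for title, rate in pairs:
    print(title, rate)
```

Keeping the parsing in a function that returns data, instead of printing and writing inside it, makes it easier to reuse the same step in the multithreaded version.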
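Below is the code for the multithreaded crawler. Compared with the single-threaded version above, it uses two more modules, Queue and threading. For background on these two modules, see:
Queue: https://www.cnblogs.com/itogo/p/5635629.html
threading: https://www.cnblogs.com/fnng/p/3670789.html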
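Before reading the full crawler, it may help to see the two modules working together on something trivial. This is a small sketch, not the crawler itself: the function `run_workers` and its parameters are hypothetical names chosen for illustration, and the import block is only there so that the same code runs on Python 2.7 (`Queue`) and Python 3 (`queue`).

```python
from __future__ import print_function
import threading

try:
    import Queue as queue  # Python 2.7, as used in this article
except ImportError:
    import queue  # Python 3 name of the same module


def run_workers(jobs, handler, thread_num=3):
    """Process every job with handler() using thread_num threads."""
    job_q = queue.Queue()
    for job in jobs:
        job_q.put(job)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                # Non-blocking get: raises queue.Empty once the queue is drained.
                job = job_q.get_nowait()
            except queue.Empty:
                return
            value = handler(job)
            with lock:
                results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(thread_num)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every worker has finished
    return results


squares = run_workers(range(10), lambda n: n * n)
# Completion order depends on thread scheduling, so sort before printing.
print(sorted(squares))
```

The queue hands each job to exactly one thread, so the workers never process the same item twice, and `join()` makes the main thread wait for all of them.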
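#coding:utf-8
import requests
import json
import time
import Queue
import threading

Share_Q = Queue.Queue()   # URLs waiting to be fetched, shared by all threads
THREAD_NUM = 5            # number of worker threads
title_list = []           # one list of titles per fetched page


def get_html(url):
    # Send a browser-like User-Agent with the request.
    header = {
        'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.89 Safari/537.36',
    }
    try:
        res = requests.get(url, headers=header)
        return res.text
    except requests.RequestException:
        print u"Request failed"
        return None


def get_data(html):
    # Skip pages that could not be downloaded.
    if html is None:
        return
    a = json.loads(html)
    a_list = a[u'subjects']
    if not a_list:
        return
    # Print the first movie of the page as a progress indicator.
    item = a_list[0]
    print item[u'title'], item[u'rate']
    titles = []
    for item in a_list:
        titles.append(item[u'title'] + u',')
    # In CPython, list.append is thread-safe, so no lock is used here.
    title_list.append(titles)


def worker():
    # Each thread keeps taking URLs until the queue is drained.
    while True:
        try:
            # get_nowait() avoids blocking forever when another thread
            # takes the last URL between an empty() check and get().
            url = Share_Q.get_nowait()
        except Queue.Empty:
            break
        html = get_html(url)
        get_data(html)


if __name__ == '__main__':
    begin = time.time()
    threads = []
    # Fill the queue first: page_start = 0, 20, ..., 280.
    for i in range(0, 300, 20):
        url = 'https://movie.douban.com/j/search_subjects?type=movie&tag=%E6%9C%80%E6%96%B0&page_limit=20&page_start=' + str(i)
        Share_Q.put(url)
    for i in range(THREAD_NUM):
        thread = threading.Thread(target=worker)
        thread.start()
        threads.append(thread)
    # Wait for every worker before writing the results.
    for thread in threads:
        thread.join()
    with open('t2.txt', 'ab') as f:
        for title in title_list:
            res = u"".join(title)
            f.write(res.encode('utf-8'))
            f.write('\r\n')
    end = time.time()
    # Total running time in seconds.
    print end - begin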
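In this program we first put the URLs into a queue, and then several threads take URLs from the queue and fetch the pages concurrently. Because each thread spends most of its time waiting for the network, the waits overlap, which makes this much faster than a single thread fetching the URLs one by one.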
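The speed difference can be reproduced without hitting Douban at all. The sketch below replaces the real request with a hypothetical `fake_fetch` that just sleeps for `DELAY` seconds (an assumed stand-in for network latency, not a measured value) and times both approaches over the same 15 page URLs. All function names here are illustrative, and the code runs on both Python 2.7 and Python 3.

```python
from __future__ import print_function
import threading
import time

try:
    import Queue as queue  # Python 2.7
except ImportError:
    import queue  # Python 3

DELAY = 0.05  # assumed per-request latency in seconds


def fake_fetch(url):
    """Stand-in for requests.get(): wait as if on the network."""
    time.sleep(DELAY)
    return len(url)


def crawl_sequential(urls):
    for url in urls:
        fake_fetch(url)


def crawl_threaded(urls, thread_num=5):
    url_q = queue.Queue()
    for url in urls:
        url_q.put(url)

    def worker():
        while True:
            try:
                url = url_q.get_nowait()
            except queue.Empty:
                return  # nothing left to fetch
            fake_fetch(url)

    threads = [threading.Thread(target=worker) for _ in range(thread_num)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def timed(func, *args):
    """Return how many seconds func(*args) took."""
    start = time.time()
    func(*args)
    return time.time() - start


urls = ['page_start=%d' % i for i in range(0, 300, 20)]
seq_time = timed(crawl_sequential, urls)
thr_time = timed(crawl_threaded, urls)
print('sequential: %.2fs, threaded: %.2fs' % (seq_time, thr_time))
```

With five threads the 15 simulated requests are handled in roughly three rounds instead of fifteen, so the threaded run should finish in a fraction of the sequential time. Real requests will vary more, since the server's response time and any rate limiting also play a part.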