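Yu Gong Moves the Mountain Diary · 15
Study progress
The day before yesterday I wrote about scraping free IP addresses from a web page. Yesterday's progress was slow: I only got a small part of the IP check working, so I didn't post. I thought I would get multithreading sorted out today, but it didn't work out; it is harder than I expected.
Code for checking the IPs: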
import re

import requests
from fake_useragent import UserAgent


def get_html(url):
    """Fetch url with a random User-Agent; give up after 3 non-200 responses."""
    count = 0
    while True:
        headers = {'user-agent': UserAgent().random}
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            response.encoding = 'utf-8'
            return response
        count += 1
        if count == 3:
            return None


def get_infos(response):
    """Return the first <td> of every table row, which the script treats as the proxy address."""
    return re.findall(r'<tr>[\s\S]*?<td>(.*?)</td>', response.text)


if __name__ == '__main__':
    urls = ['http://www.xiladaili.com/gaoni/{}/'.format(i) for i in range(1, 2)]
    for url in urls:
        response = get_html(url)
        if response is None:
            # get_html gave up on this page, so there is nothing to parse
            continue
        num = get_infos(response)
        for i in num:
            # Try one request through each proxy, one after another
            try:
                requests.get('http://wenshu.court.gov.cn/', proxies={"http": 'http://' + i})
            except Exception:
                print('connect failed')
            else:
                print('success')
The result is shown in the figure below.

This approach is slow: it can only check the addresses one at a time.
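Part of the slowness may come from the request itself: `requests.get` does not time out unless a `timeout` is passed, so a dead proxy can hold up the loop for a long time. The check also counts any response as a success, and only the `http` scheme is mapped to the proxy. Here is a minimal sketch of a stricter check, under some assumptions: `build_proxies` and `is_working` are hypothetical helper names, the 5-second timeout is an arbitrary choice, and `fetch` stands in for `requests.get` so the logic can be tried without a network.

```python
def build_proxies(address):
    """Map both schemes to the same HTTP proxy, e.g. address = '1.2.3.4:8080'."""
    proxy_url = 'http://' + address
    return {'http': proxy_url, 'https': proxy_url}


def is_working(address, fetch, test_url='http://wenshu.court.gov.cn/', timeout=5):
    """Return True only if the test page answers with status 200 within the timeout.

    fetch is any callable that accepts the same keyword arguments as requests.get
    (hypothetical parameter, added here so the check can be exercised offline).
    """
    try:
        response = fetch(test_url, proxies=build_proxies(address), timeout=timeout)
    except Exception:
        # Timeouts, refused connections and malformed addresses all count as failures
        return False
    return response.status_code == 200
```

In the script above this would be called as `is_working(i, requests.get)`, printing 'success' or 'connect failed' depending on the return value.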
Below is what I got done today. It doesn't solve the main problem, but I hope readers passing by will leave their suggestions.
import re
import threading

import requests
from fake_useragent import UserAgent


def get_html(url):
    """Fetch url with a random User-Agent; give up after 3 non-200 responses."""
    count = 0
    while True:
        headers = {'user-agent': UserAgent().random}
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            response.encoding = 'utf-8'
            return response
        count += 1
        if count == 3:
            return None


def get_infos(response):
    """Return the first <td> of every table row, which the script treats as the proxy address."""
    return re.findall(r'<tr>[\s\S]*?<td>(.*?)</td>', response.text)


def try_ip(ip):
    """Send one request through the proxy at ip and print the outcome."""
    try:
        requests.get('http://wenshu.court.gov.cn/', proxies={"http": 'http://' + ip})
    except Exception:
        print('connect failed')
    else:
        print('success')


if __name__ == '__main__':
    urls = ['http://www.xiladaili.com/gaoni/{}/'.format(i) for i in range(1, 2)]
    for url in urls:
        response = get_html(url)
        if response is None:
            # get_html gave up on this page, so there is nothing to parse
            continue
        num = get_infos(response)
        for i in num:
            # Start one thread per address, then wait for it to finish
            t = threading.Thread(target=try_ip, args=(i,))
            t.start()
            t.join()
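The code above is what I wrote with the threading module, but I can't tell where the problem is: with threading added it is still slow, and the IP check misbehaves and is not accurate.
I hope readers passing by will leave their suggestions.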
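One likely cause of the slowness: in the listing above `t.join()` comes right after `t.start()` inside the loop, and `join()` blocks until that thread has finished, so the next thread does not start until the previous check is done. Here is a minimal sketch of starting every thread first and joining them afterwards. `check_all` and `fake_check` are hypothetical names, and `fake_check` only sleeps to imitate a slow network call; in the real script a function that returns True or False for one address would be passed instead.

```python
import threading
import time


def check_all(addresses, check):
    """Run check(address) for every address in its own thread; return {address: result}."""
    results = {}
    lock = threading.Lock()

    def worker(address):
        ok = check(address)
        with lock:
            # Store the result under its address so it is clear which proxy passed
            results[address] = ok

    threads = [threading.Thread(target=worker, args=(a,)) for a in addresses]
    for t in threads:
        t.start()   # start every thread first...
    for t in threads:
        t.join()    # ...and only then wait for all of them
    return results


def fake_check(address):
    """Stand-in for a real proxy check: sleeps, then 'passes' addresses on port 80."""
    time.sleep(0.3)
    return address.endswith(':80')


if __name__ == '__main__':
    start = time.perf_counter()
    print(check_all(['1.1.1.1:80', '2.2.2.2:81', '3.3.3.3:80', '4.4.4.4:81'], fake_check))
    print('elapsed: %.2f s' % (time.perf_counter() - start))
```

Because the four fake checks sleep at the same time, the whole run should take roughly as long as one check, not four. Returning a dictionary keyed by address also avoids the situation where 'success' is printed without saying which IP it belongs to. For a long proxy list, one thread per address may be too many; a fixed-size pool such as `concurrent.futures.ThreadPoolExecutor` would be the usual alternative.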