Yu Gong Moves the Mountains Diary · 15

Original post · latest recommended article published 2020-05-18 20:53:34 · 162 views
License: CC 4.0 BY-SA

Tags: #python

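Learning progress

The day before yesterday I wrote about scraping free IP addresses from a web page. Yesterday's progress was slow: I only got a small part of the IP checking done, so I didn't post. I expected to get multithreading sorted out today, but things didn't go as planned; it is still somewhat difficult.
The code for checking the IPs: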

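import re

import requests
from fake_useragent import UserAgent


def get_html(url, retries=3):
    """Fetch url with a random User-Agent; return None after `retries` failures."""
    for _ in range(retries):
        headers = {'user-agent': UserAgent().random}
        try:
            response = requests.get(url, headers=headers, timeout=10)
        except requests.RequestException:
            continue
        if response.status_code == 200:
            response.encoding = 'utf-8'
            return response
    return None


def get_infos(response):
    # Capture the first <td> after each matched <tr>; the script treats
    # that cell as the proxy address.
    return re.findall(r'<tr>[\s\S]*?<td>(.*?)</td>', response.text)


if __name__ == '__main__':
    urls = ['http://www.xiladaili.com/gaoni/{}/'.format(i) for i in range(1, 2)]
    for url in urls:
        response = get_html(url)
        if response is None:  # the listing page could not be fetched
            continue
        for ip in get_infos(response):
            try:
                # requests has no default timeout, so a dead proxy could
                # otherwise block for a long time (10 s is an arbitrary choice).
                requests.get('http://wenshu.court.gov.cn/',
                             proxies={'http': 'http://' + ip}, timeout=10)
            except requests.RequestException:
                print('connect failed')
            else:
                print('success')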

The result is shown in the figure below:
[Figure: IP addresses]
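
To see what the regular expression in get_infos picks out, here is a small sketch that runs it on a made-up table. The markup in SAMPLE_HTML and the helper name first_cells are my own assumptions for illustration; the real page may be laid out differently.

```python
import re

# Hypothetical sample of the proxy table; the real page's markup may differ.
SAMPLE_HTML = """
<table>
<tr><th>IP</th><th>Type</th></tr>
<tr>
<td>1.2.3.4:8080</td>
<td>HTTP</td>
</tr>
<tr>
<td>5.6.7.8:3128</td>
<td>HTTPS</td>
</tr>
</table>
"""


def first_cells(html):
    """Return the text of the first <td> that follows each matched <tr>."""
    # [\s\S]*? lazily skips anything (including newlines) up to the next <td>,
    # and (.*?) captures that cell's text up to its closing </td>.
    return re.findall(r'<tr>[\s\S]*?<td>(.*?)</td>', html)


if __name__ == '__main__':
    print(first_cells(SAMPLE_HTML))
```

On this sample it prints ['1.2.3.4:8080', '5.6.7.8:3128']. The header row contains no <td>, so the lazy match simply runs on into the first data row, and only the first cell of each data row is captured.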
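Checking the proxies with this script is slow, because it can only test them one at a time.
Below is what I got done today. It does not solve the main problem yet, so I hope any readers passing by will leave their valuable suggestions.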

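import re
import threading

import requests
from fake_useragent import UserAgent


def get_html(url, retries=3):
    """Fetch url with a random User-Agent; return None after `retries` failures."""
    for _ in range(retries):
        headers = {'user-agent': UserAgent().random}
        try:
            response = requests.get(url, headers=headers, timeout=10)
        except requests.RequestException:
            continue
        if response.status_code == 200:
            response.encoding = 'utf-8'
            return response
    return None


def get_infos(response):
    # Capture the first <td> after each matched <tr>; the script treats
    # that cell as the proxy address.
    return re.findall(r'<tr>[\s\S]*?<td>(.*?)</td>', response.text)


def try_ip(ip):
    """Try to reach the target site through the given proxy and report the result."""
    try:
        requests.get('http://wenshu.court.gov.cn/',
                     proxies={'http': 'http://' + ip}, timeout=10)
    except requests.RequestException:
        print('connect failed')
    else:
        print('success')


if __name__ == '__main__':
    urls = ['http://www.xiladaili.com/gaoni/{}/'.format(i) for i in range(1, 2)]
    for url in urls:
        response = get_html(url)
        if response is None:  # the listing page could not be fetched
            continue
        for ip in get_infos(response):
            t = threading.Thread(target=try_ip, args=(ip,))
            t.start()
            # Joining right away waits for this thread to finish before the
            # next one starts, so the checks still run one after another.
            t.join()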
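
One likely reason this version is no faster: calling t.join() right after t.start() waits for each thread to finish before the next one is created. A minimal sketch of the usual alternative, starting every thread first and joining them afterwards, might look like the following. To keep it runnable without a network, fake_check is a hypothetical stand-in for try_ip that just sleeps, and the addresses are made up.

```python
import threading
import time


def fake_check(ip, results):
    """Hypothetical stand-in for try_ip: sleeps instead of contacting a proxy."""
    time.sleep(0.2)      # pretend this is network latency
    results[ip] = True   # each thread writes only its own key


def check_all(ips):
    """Run fake_check for every address concurrently and collect the results."""
    results = {}
    threads = [threading.Thread(target=fake_check, args=(ip, results))
               for ip in ips]
    for t in threads:
        t.start()        # start every thread before waiting on any of them
    for t in threads:
        t.join()         # only now wait for all of them to finish
    return results


if __name__ == '__main__':
    ips = ['10.0.0.{}:8080'.format(i) for i in range(1, 11)]  # made-up addresses
    started = time.perf_counter()
    results = check_all(ips)
    elapsed = time.perf_counter() - started
    print('checked {} proxies in {:.2f}s'.format(len(results), elapsed))
```

With the sleeps overlapping, the total time should stay close to a single 0.2 s wait rather than growing with the number of addresses. For a long proxy list, concurrent.futures.ThreadPoolExecutor from the standard library may be a better fit, since it caps how many threads run at once.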