Python 3 Crawler in Practice: xicidaili.com

Latest recommended article published on 2024-10-07 06:30:00

hunter_wyh

License: CC 4.0 BY-SA

Category: big data. Tags: python, crawler, browser, data, testing

Article link: https://blog.youkuaiyun.com/hunter_wyh/article/details/78410637

Step 1: Analyze the website

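Open the link http://www.xicidaili.com/nn/1
This is the site we are going to crawl today.
The goal is to scrape the data from nn/1 through nn/<last page number> and save it as an Excel file.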
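The page range described above can be sketched as a list of URLs. This is a minimal illustration, not the author's code: the helper name page_urls and its last_page parameter are assumptions, and the real last page number has to be read from the site.

```python
# Base of the listing pages; the page number is appended to it.
BASE_URL = 'http://www.xicidaili.com/nn/'

def page_urls(last_page):
    """Build the URLs nn/1 .. nn/last_page (last_page is a hypothetical input)."""
    return [BASE_URL + str(page) for page in range(1, last_page + 1)]

# Example with an arbitrary last page of 3.
urls = page_urls(3)
for u in urls:
    print(u)
```

Each URL can then be fetched and parsed in turn, one page at a time.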

After analyzing the page source, we end up with an html_doc similar to this:

<html>
<head>
<title></title>
</head>
<body>
<table>
    <tr>
      <th>国家</th>
      <th>IP地址</th>
      <th>端口</th>
      <th>服务器地址</th>
      <th>是否匿名</th>
      <th>类型</th>
      <th>速度</th>
      <th>连接时间</th>
      <th>存活时间</th>

      <th>验证时间</th>
    </tr>

    <tr>
      <td><img src='http://fs.xicidaili.com/images/flag/cn.png' alt='Cn' /></td>
      <td>27.40.140.40</td>
      <td>61234</td>
      <td>
        <a>广东湛江</a>
      </td>
      <td>高匿</td>
      <td>HTTP</td>
      <td class='country'>
        <div title='0.248秒' class='bar'>
          <div class='bar_inner fast' style='width:92%'>

          </div>
        </div>
      </td>
      <td class='country'>
        <div title='0.049秒' class='bar'>
          <div class='bar_inner fast' style='width:99%'>

          </div>
        </div>
      </td>

      <td>1分钟</td>
      <td>17-10-31 09:01</td>
    </tr>

    <tr>
      <td><img src='http://fs.xicidaili.com/images/flag/cn.png' alt='Cn' /></td>
      <td>222.184.177.6</td>
      <td>23735</td>
      <td>
        <a>江苏南通</a>
      </td>
      <td>高匿</td>
      <td>HTTPS</td>
      <td class='country'>
        <div title='0.248秒' class='bar'>
          <div class='bar_inner fast' style='width:92%'>

          </div>
        </div>
      </td>
      <td class='country'>
        <div title='0.049秒' class='bar'>
          <div class='bar_inner fast' style='width:99%'>

          </div>
        </div>
      </td>

      <td>16分钟</td>
      <td>17-10-31 09:00</td>
    </tr>
</table>
</body>
</html>
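Given markup of this shape, one possible way to pull the fields out with BeautifulSoup is sketched below. The function name parse_rows and the dictionary keys are assumptions made for illustration, and the sample string holds only the first data row from the listing above. Speed and connection time are read from the title attribute of the inner div, since those cells contain no text.

```python
from bs4 import BeautifulSoup

# Shortened sample: the header row plus the first data row shown above.
html_doc = """
<table>
  <tr><th>国家</th><th>IP地址</th><th>端口</th><th>服务器地址</th><th>是否匿名</th>
      <th>类型</th><th>速度</th><th>连接时间</th><th>存活时间</th><th>验证时间</th></tr>
  <tr>
    <td><img src='http://fs.xicidaili.com/images/flag/cn.png' alt='Cn' /></td>
    <td>27.40.140.40</td>
    <td>61234</td>
    <td><a>广东湛江</a></td>
    <td>高匿</td>
    <td>HTTP</td>
    <td class='country'><div title='0.248秒' class='bar'><div class='bar_inner fast' style='width:92%'></div></div></td>
    <td class='country'><div title='0.049秒' class='bar'><div class='bar_inner fast' style='width:99%'></div></div></td>
    <td>1分钟</td>
    <td>17-10-31 09:01</td>
  </tr>
</table>
"""

def parse_rows(html):
    """Return one dict per data row; key names are illustrative assumptions."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.find_all('tr'):
        tds = tr.find_all('td')
        if len(tds) < 10:
            continue  # the header row uses <th>, so it has no <td> cells
        rows.append({
            'ip': tds[1].get_text(strip=True),
            'port': tds[2].get_text(strip=True),
            'location': tds[3].get_text(strip=True),
            'anonymity': tds[4].get_text(strip=True),
            'type': tds[5].get_text(strip=True),
            'speed': tds[6].div['title'],         # value lives in the div's title
            'connect_time': tds[7].div['title'],  # same layout as the speed cell
            'alive': tds[8].get_text(strip=True),
            'verified': tds[9].get_text(strip=True),
        })
    return rows

rows = parse_rows(html_doc)
print(rows[0]['ip'], rows[0]['port'], rows[0]['type'])
```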

Step 2: Import the packages

from urllib import request
import os
import openpyxl
from bs4 import BeautifulSoup

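This is Python 3, so the module to import is urllib.request.
os is used to work with local files.
openpyxl is the Excel library; if it is not installed, run 'pip install openpyxl'.
BeautifulSoup parses the HTML here. I considered regular expressions at first, but BeautifulSoup turned out to be the easier option.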
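As a rough sketch of how os and openpyxl could fit together for the Excel step, the snippet below writes a header row and one data row to a workbook. The function name save_rows, the column headers, and the output file name are assumptions for illustration; the sample row reuses values from the html_doc above.

```python
import os
import tempfile

import openpyxl

# Illustrative column names; the article does not fix an Excel layout.
HEADER = ['IP', 'Port', 'Location', 'Anonymity', 'Type']

def save_rows(rows, path):
    """Write HEADER plus the given rows to a new .xlsx file at path."""
    wb = openpyxl.Workbook()
    ws = wb.active
    ws.title = 'proxies'
    ws.append(HEADER)
    for row in rows:
        ws.append(row)
    wb.save(path)

# Hypothetical output location in the system temp directory.
demo_path = os.path.join(tempfile.gettempdir(), 'xicidaili_demo.xlsx')
save_rows([['27.40.140.40', '61234', '广东湛江', '高匿', 'HTTP']], demo_path)
print(os.path.exists(demo_path))
```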

Step 3: Test access

First we request http://www.xicidaili.com/nn/1 and see whether we can get the HTML.

# -*- coding: utf-8 -*-
from urllib import request
import os
import openpyxl
from bs4 import BeautifulSoup

def get_html(url):
    # Fetch the page with urllib's default headers and print the decoded HTML
    html = request.urlopen(url).read().decode('utf-8')
    print(html)

if __name__ == '__main__':
    url = 'http://www.xicidaili.com/nn/1'
    get_html(url)

It fails. The console output is:

Traceback (most recent call last):
  ...
urllib.error.HTTPError: HTTP Error 503: Service Temporarily Unavailable

See the next step for the fix.
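To report the status code instead of a full traceback, the call can be wrapped in a try/except for urllib.error.HTTPError. This is only a sketch under stated assumptions: the opener parameter and the fake_503 stub are hypothetical additions that stand in for the real server so the example runs offline.

```python
from email.message import Message
from urllib import error, request

def get_html(url, opener=request.urlopen):
    """Return the page HTML, or None if the server answers with an HTTP error."""
    try:
        return opener(url).read().decode('utf-8')
    except error.HTTPError as e:
        # e.code is the numeric status, e.reason the server's message
        print('HTTP error', e.code, e.reason)
        return None

def fake_503(url):
    # Hypothetical stub that mimics the 503 response shown above.
    raise error.HTTPError(url, 503, 'Service Temporarily Unavailable', Message(), None)

result = get_html('http://www.xicidaili.com/nn/1', opener=fake_503)
print(result)
```

Called without the opener argument, get_html falls back to request.urlopen and behaves like the original function, except that an HTTP error yields None rather than a crash.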

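Step 4: Pose as a good guy by simulating a browser

The 503 error above means the server guessed that the visitor was up to no good and refused the request. So we add a header, which makes the request look as if it came from a browser. In other words, we make ourselves look like a good guy.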

# -*- coding: utf-8 -*-
from urllib import request
import os
import openpyxl
from bs4 import BeautifulSoup
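The listing above is cut off in the source. As a minimal sketch of the idea described in this step, and not the author's original code, a browser-like User-Agent header can be attached through urllib.request.Request. The header value and the helper name build_request are assumptions; whether this particular header is enough for the site is not confirmed here.

```python
from urllib import request

# Assumed browser-like header; any common browser User-Agent string could be used.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/61.0.3163.100 Safari/537.36'
}

def build_request(url):
    """Wrap the URL in a Request object that carries the browser-like header."""
    return request.Request(url, headers=HEADERS)

def get_html(url):
    """Fetch the page using the header-carrying request (performs a network call)."""
    with request.urlopen(build_request(url), timeout=10) as resp:
        return resp.read().decode('utf-8')

# Inspect the request without sending it.
req = build_request('http://www.xicidaili.com/nn/1')
print(req.get_header('User-agent'))
```

Note that urllib stores header names capitalized as 'User-agent', which is why the lookup uses that spelling.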
