Depth-First Crawl of All Wikipedia URLs to Depth 3

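This article presents a Wikipedia crawler written in Python that collects the site's URLs by depth-first traversal. It shows how to fetch HTML with the requests library, parse it with BeautifulSoup, and pick out valid entry links with a regular expression.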

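On Wikipedia, we first need to analyze how the links appear on the page and then decide how to crawl them. The site's home page alone has 220k hyperlinks, and each linked page contains many more links of its own. Depth-first traversal works like this: with a depth of 2, we find the first hyperlink at depth 1, visit it, and crawl the URLs on that page. Only after all the links on that page have been collected do we return to the depth-1 page and move on to its second URL, repeating until the whole process is complete.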
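To make the traversal order concrete, here is a minimal sketch of a depth-limited depth-first traversal over a small in-memory link graph. The graph `LINKS` and the function name `dfs_order` are made up for this illustration, and no network access is involved.

```python
# Hypothetical toy link graph: page -> links found on that page.
LINKS = {
    "Home": ["A", "B"],
    "A": ["A1", "A2"],
    "B": ["B1"],
}

def dfs_order(page, depth=1, max_depth=2, visited=None):
    """Return (depth, link) pairs in the order a depth-limited DFS records them."""
    if visited is None:
        visited = set()
    visited.add(page)
    order = []
    for link in LINKS.get(page, []):
        if link in visited:
            continue
        order.append((depth, link))
        # Go deeper into this link before moving on to its siblings.
        if depth < max_depth:
            order.extend(dfs_order(link, depth + 1, max_depth, visited))
    return order

for depth, link in dfs_order("Home"):
    print("  " * (depth - 1) + link)
```

With a depth limit of 2, the links under A are listed before B is reached, which is the behavior described above.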
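To begin with, we can take the most basic approach: fetch the HTML, parse it with BeautifulSoup, call find_all("a") to find every hyperlink, and then loop over the results to read each URL.
The code is as follows: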

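import requests
from bs4 import BeautifulSoup

headers = {
            'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36'
                         ' (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
        }
# 'Wikipedia' is used here as an example entry title
url = 'https://en.wikipedia.org/wiki/' + 'Wikipedia'
r = requests.get(url,headers=headers)
html = r.text
# The 'lxml' parser needs the lxml package; 'html.parser' is a built-in alternative
soup = BeautifulSoup(html,'lxml')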
link_list = soup.find_all("a")
for link in link_list:
    if 'href' in link.attrs:
        print(link.attrs['href'])

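The output of this code contains more than entry URLs: it also includes anchors and links from the sidebar, footer, and header. On inspection, however, all entry links share the following two characteristics: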

The URL is a relative path beginning with /wiki/;
The URL does not contain a colon, #, =, <, or >.

We can filter for these links directly with a regular expression:
<a href="/wiki/([^:#=<>]*?)".*?</a>
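As a quick check of what this pattern accepts and rejects, the sketch below runs it over a hand-written HTML fragment. The fragment `SAMPLE` is invented for this example and is not real Wikipedia output.

```python
import re

# The crawler's pattern: capture the title after /wiki/ and reject any
# title containing : # = < or >.
PATTERN = re.compile(r'<a href="/wiki/([^:#=<>]*?)".*?</a>')

# Hypothetical HTML fragment, written by hand for this example.
SAMPLE = (
    '<a href="/wiki/Python_(programming_language)" title="Python">Python</a>\n'
    '<a href="/wiki/Help:Contents">Help</a>\n'
    '<a href="#mw-head">Jump to navigation</a>\n'
    '<a href="/wiki/Web_crawler#History">History</a>\n'
    '<a href="/w/index.php?title=Foo&amp;action=edit">Edit</a>\n'
    '<a href="/wiki/Web_crawler">Web crawler</a>\n'
)

titles = PATTERN.findall(SAMPLE)
print(titles)  # ['Python_(programming_language)', 'Web_crawler']
```

The namespace link (Help:), the bare anchor, the section link, and the edit link are all dropped. One limitation to keep in mind: the pattern only matches when href is the first attribute of the a tag.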
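Next, we use recursion to implement a simple single-threaded depth-first crawl of Wikipedia URLs to depth 3.
Before doing so, check whether your Python installation has the requests and bs4 libraries. If not, they can be installed with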

pip install bs4
pip install requests

to download them. If pip needs updating, use

python -m pip install --upgrade pip

to upgrade pip itself.
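One way to check for the libraries without leaving Python is sketched below. It uses the standard library's importlib; the helper name `missing_packages` is an assumption made for this example.

```python
import importlib.util

def missing_packages(names):
    """Return the import names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# 'requests' and 'bs4' are the import names used in this article.
missing = missing_packages(["requests", "bs4"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("requests and bs4 are both available")
```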
Now for the recursive code:

This first part contains the shebang line, the encoding declaration, and the library imports:

#!/usr/bin/env python
# coding: utf-8
import requests
import re
import time
from bs4 import BeautifulSoup

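Next comes the core code. It first defines the two variables we need: the list of pages already visited and a running count of all URLs recorded. The algorithm itself is a simple recursion: record every new link found on a page, and descend into each one until depth 3 is reached.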

time1 = time.time()
exist_url = []
g_writecount = 0

def scrappy(url, depth=1):
    global g_writecount
    # Remember the entry title (not the full URL) so that it can be compared
    # with the titles captured by the regular expression below
    exist_url.append(url)
    try:
        headers = {
            'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36'
                         ' (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
        }
        url = 'https://en.wikipedia.org/wiki/' + url
        # A timeout keeps a stalled request from blocking the whole crawl
        r = requests.get(url, headers=headers, timeout=10)
        html = r.text
    except Exception as e:
        print('Failed downloading and saving', url)
        print(e)
        return None
    link_list = re.findall('<a href="/wiki/([^:#=<>]*?)".*?</a>', html)
    # Keep only the titles that have not been visited yet
    unique_list = list(set(link_list) - set(exist_url))
    for each in unique_list:
        g_writecount += 1
        output = "No." + str(g_writecount) + "\tDepth:" + str(depth) + "\t" + url + ' -> ' + each + "\n"
        print(output)
        # Append the record to title.txt; the with block closes the file
        with open('title.txt', 'a+', encoding='utf-8') as file:
            file.write(output)
        # Descend into this link before moving on to the next one,
        # stopping once depth 3 has been reached
        if depth < 3:
            scrappy(each, depth + 1)
scrappy('Wikipedia')
time2 = time.time()
print("Total time elapsed:", time2 - time1)
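A recursive crawler is hard to try out without actually hitting Wikipedia. As a sketch only, the same depth-first idea can be written with an explicit stack and a pluggable fetch function, so that it can be exercised offline. The names `crawl`, `fetch`, and `FAKE_PAGES` are hypothetical and not part of the code above. Unlike the recursive version, this variant records all links of a page before descending into the first one.

```python
import re

LINK_RE = re.compile(r'<a href="/wiki/([^:#=<>]*?)".*?</a>')

def crawl(start, fetch, max_depth=3):
    """Depth-first crawl from `start`; returns (depth, source, target) tuples.

    `fetch(title)` must return the HTML of the page as a string.
    """
    visited = set()
    found = []
    # Each stack entry is (title, depth of the links found on that page).
    stack = [(start, 1)]
    while stack:
        title, depth = stack.pop()
        if title in visited:
            continue
        visited.add(title)
        # dict.fromkeys removes duplicates while keeping page order.
        links = [t for t in dict.fromkeys(LINK_RE.findall(fetch(title)))
                 if t not in visited]
        for target in links:
            found.append((depth, title, target))
        if depth < max_depth:
            # Push in reverse so the first link on the page is visited first.
            for target in reversed(links):
                stack.append((target, depth + 1))
    return found

# Hypothetical offline pages standing in for real HTTP responses.
FAKE_PAGES = {
    "Home": '<a href="/wiki/A">A</a> <a href="/wiki/B">B</a>',
    "A": '<a href="/wiki/A1">A1</a> <a href="/wiki/Home">Home</a>',
    "B": '<a href="/wiki/A">A</a>',
}

result = crawl("Home", lambda title: FAKE_PAGES.get(title, ""), max_depth=2)
for depth, source, target in result:
    print(depth, source, "->", target)
```

To run it against the real site, `fetch` could wrap the requests.get call shown earlier. The explicit stack also avoids Python's recursion limit, although that limit is not a concern at depth 3.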