<span style="font-size:18px;">import re
import urllib.request
import urllib
from collections import deque
queue = deque()
visited = set()
url = 'http://www.baidu.com'
queue.append(url)
cnt = 0
while queue:
url = queue.popleft() # 队首元素出队
visited |= {url} # 标记为已访问
print('已经抓取: ' , cnt, '个链接',' 正在抓取 : ' + url)
cnt += 1
urlop = urllib.request.urlopen(url)
if 'html' not in urlop.getheader('Content-Type'):#抓取到的并没有html格式的数据,则重新开始循环
continue
# 避免程序异常中止, 用try..catch处理异常
try:
data = urlop.read().decode('utf-8')
except:
continue
linkre = re.compile('href=\"(.+?)\"')
for x in linkre.findall(data):
if 'http' in x and x not in visited:
queue.append(x)
print('把 ' + x +'加入队列')</span>
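The regular expression is the only parsing the crawler does, so it helps to see that step in isolation. Below is a minimal sketch of the link-extraction step run against a hard-coded HTML snippet; the sample_html string is made up purely for illustration. It uses the same pattern and the same 'http' filter as the loop above, which is why relative links such as /about are never enqueued.

import re

sample_html = '''
<a href="http://example.com/page1">page 1</a>
<a href="/about">about (relative, skipped)</a>
<a href="https://example.com/page2">page 2</a>
'''

linkre = re.compile(r'href="(.+?)"')
for link in linkre.findall(sample_html):
    if 'http' in link:                 # same filter as the crawler's loop
        print('would enqueue:', link)

# Output:
# would enqueue: http://example.com/page1
# would enqueue: https://example.com/page2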
Simple Python 3 crawler (1): a simple architecture built from a queue, a set, and regular expressions