Reference: https://www.cnblogs.com/goodhacker/p/3353146.html
The crawler traverses pages with breadth-first search (BFS) and is written in Python 3.
Tool: Jupyter Notebook
Start page URL: https://www.cnblogs.com/goodhacker/p/3353146.html
Target URL: http://book.51cto.com/art/201012/236668.htm
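Before the full class-based source, here is a minimal sketch of the breadth-first crawl loop that the code below builds on: a FIFO queue of unvisited URLs is consumed from the front, each page is downloaded and parsed, and newly discovered links are appended to the back. The function name bfs_crawl and the max_pages limit are illustrative assumptions, not part of the original code.

# Minimal BFS crawl sketch (bfs_crawl and max_pages are illustrative, not from the original source)
from collections import deque
from urllib.parse import urljoin
import urllib.request as request
from bs4 import BeautifulSoup

def bfs_crawl(seed, max_pages=10):
    queue = deque([seed])          # FIFO queue of URLs waiting to be visited
    visited = set()                # URLs already fetched
    while queue and len(visited) < max_pages:
        url = queue.popleft()      # breadth-first: take the oldest queued URL first
        if url in visited:
            continue
        try:
            html = request.urlopen(url, timeout=10).read()
        except Exception:
            continue               # skip pages that fail to download
        visited.add(url)
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])   # resolve relative links against the current page
            if link.startswith("http") and link not in visited:
                queue.append(link)           # enqueue newly discovered links
    return visited

For example, bfs_crawl("https://www.cnblogs.com/goodhacker/p/3353146.html", max_pages=20) would crawl outward from the start page above until 20 pages have been visited.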
Source code:
#encoding=utf-8
from bs4 import BeautifulSoup
import socket
import urllib.request as request
import zlib
import re
class MyCrawler:
    def __init__(self, seeds):
        # Initialize the URL queue with the seed URL(s)
        self.linkQuence = linkQuence()
        # A single seed may be passed as a string, or several seeds as a list
        if isinstance(seeds, str):
            self.linkQuence.addUnvisitedUrl(seeds)
        if isinstance(seeds, list):
            for i in seeds:
                self.linkQuence.addUnvisitedUrl(i)
print("Add the seeds url \"%s\" to the unvisited ur