Read the URLs line by line, then use bs4 to pull the href out of every <a> tag. Build a few lists: links are append()ed into them as they are found and analyzed; any link that already contains http is written to the local file first, and the random module then picks a few of the remaining ones to write out as well~~~
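Before the full script, here is a minimal, self-contained sketch of the core bs4 trick used below -- find_all(href=re.compile(...)) returns only the tags whose href matches a pattern (the target URL here is just a placeholder):

import re
import requests
from bs4 import BeautifulSoup

resp = requests.get('http://example.com', timeout=10)
soup = BeautifulSoup(resp.content, 'html.parser')
# every tag whose href contains "php?" -- i.e. a dynamic PHP link
for tag in soup.find_all(href=re.compile(r'php\?')):
    print(tag.get('href'))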
The basic usage of the random library is as follows:
import random
nums = [1, 2, 3, 4, 5, 6, 7, 8, 9]
sss = random.sample(nums, 6)   # pick 6 distinct elements at random
print(sss)
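One thing to keep in mind: random.sample() raises a ValueError when you ask for more elements than the list contains, which is why the script below wraps that call in try/except. A small guard (my own sketch, not part of the original script) looks like this:

import random

nums = [1, 2, 3]
k = 6
# take at most len(nums) items so sample() never raises ValueError
picked = random.sample(nums, min(k, len(nums)))
print(picked)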
The full source code is as follows~
# coding: utf-8
import re
import requests
import time
from bs4 import BeautifulSoup as asp
import random

headeraa = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)',}

hansb = open('urllist.txt', 'r')   # put the URLs into urllist.txt, one per line
hanssb = hansb.readlines()
hansb.close()
print(hanssb)

zhzhzh = open('url.txt', 'a+')     # output file, opened for appending
for urllists in hanssb:
    urllistx = urllists.strip('\n')
    print(urllistx)
    time.sleep(3)
    try:
        sss5 = []       # relative links go here; a few are picked at random later
        bingasp = []
        bingphp = []
        bingaspx = []
        bingjsp = []
        han = requests.get(url=urllistx, headers=headeraa, timeout=10)
        print(han.status_code)
        print(han.content)
        soup = asp(han.content, 'html.parser')

        hrefs = soup.find_all(href=re.compile(r'asp\?'))    # <a> tags whose href contains "asp?"
        for href in hrefs:
            href = href.get('href')
            if 'http' in href:
                bingasp.append(href)
            else:
                sss5.append(href)

        hrefs = soup.find_all(href=re.compile(r'php\?'))    # <a> tags whose href contains "php?"
        for href in hrefs:
            href = href.get('href')
            if 'http' in href:
                bingphp.append(href)
            else:
                sss5.append(href)

        hrefs = soup.find_all(href=re.compile(r'aspx\?'))   # <a> tags whose href contains "aspx?"
        for href in hrefs:
            href = href.get('href')
            if 'http' in href:
                bingaspx.append(href)
            else:
                sss5.append(href)

        hrefs = soup.find_all(href=re.compile(r'jsp\?'))    # <a> tags whose href contains "jsp?"
        for href in hrefs:
            href = href.get('href')
            if 'http' in href:
                bingjsp.append(href)
            else:
                sss5.append(href)
    except Exception:
        print('connect time out')
    finally:
        try:
            bingphp1 = bingphp[0:3]      # keep the first three absolute links of each type
            for bingphp2 in bingphp1:
                zhzhzh.write(bingphp2 + '\n')
            bingaspx1 = bingaspx[0:3]
            for bingaspx2 in bingaspx1:
                zhzhzh.write(bingaspx2 + '\n')
            bingasp1 = bingasp[0:3]
            for bingasp2 in bingasp1:
                zhzhzh.write(bingasp2 + '\n')
            bingjsp1 = bingjsp[0:3]
            for bingjsp2 in bingjsp1:
                zhzhzh.write(bingjsp2 + '\n')
            sss4 = random.sample(sss5, 4)    # pick 4 relative links at random
            print(sss4)
            for sss3 in sss4:
                print(sss3)
                zhzhzh.write(urllistx + sss3 + '\n')   # prepend the base URL to the relative link
        except Exception:
            print('rxxr')
zhzhzh.close()
Crawling inevitably turns up non-standard URLs, so an extra list [] is kept for links that already come as full http:// addresses; the relative ones are collected separately and joined to the page URL when written out.
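Simply concatenating urllistx + sss3 only gives a valid address when the relative path happens to line up with the base URL. A more robust way to join them (my own suggestion, not what the script above does; the URLs are placeholders) is urllib.parse.urljoin:

from urllib.parse import urljoin

base = 'http://www.example.com/news/index.html'   # placeholder base page
href = 'show.php?id=12'                           # placeholder relative link
print(urljoin(base, href))                        # http://www.example.com/news/show.php?id=12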
Run results
The final output is saved to url.txt in the current directory.
As you can see, it runs fairly stably; these URLs can then be used to check for SQL injection.