Read the URLs line by line, then use bs4 to pull the href out of every <a> tag. Build a few lists: links are append()ed into them as they are found and analyzed; any link that already contains http is written to the local file first, and the random module then picks a few of the remaining ones to write out as well~~~
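Before the full script, here is a minimal, self-contained sketch of the core bs4 trick used below -- find_all(href=re.compile(...)) returns only the tags whose href matches a pattern (the target URL here is just a placeholder):

import re
import requests
from bs4 import BeautifulSoup

resp = requests.get('http://example.com', timeout=10)
soup = BeautifulSoup(resp.content, 'html.parser')
# every tag whose href contains "php?" -- i.e. a dynamic PHP link
for tag in soup.find_all(href=re.compile(r'php\?')):
    print(tag.get('href'))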
The basic usage of the random library is as follows:
import random
nums = [1, 2, 3, 4, 5, 6, 7, 8, 9]
sss = random.sample(nums, 6)   # pick 6 distinct elements at random
print(sss)
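One thing to keep in mind: random.sample() raises a ValueError when you ask for more elements than the list contains, which is why the script below wraps that call in try/except. A small guard (my own sketch, not part of the original script) looks like this:

import random

nums = [1, 2, 3]
k = 6
# take at most len(nums) items so sample() never raises ValueError
picked = random.sample(nums, min(k, len(nums)))
print(picked)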
The full source code is as follows~
# coding: utf-8
import re
import requests
import time
from bs4 import BeautifulSoup as asp
import random

headeraa = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)',}

hansb = open('urllist.txt', 'r')   # put the URLs into urllist.txt, one per line
hanssb = hansb.readlines()
hansb.close()
print(hanssb)

zhzhzh = open('url.txt', 'a+')     # output file, opened for appending
for urllists in hanssb:
    urllistx = urllists.strip('\n')
    print(urllistx)
    time.sleep(3)
    try:
        sss5 = []       # relative links go here; a few are picked at random later
        bingasp = []
        bingphp = []
        bingaspx = []
        bingjsp = []
        han = requests.get(url=urllistx, headers=headeraa, timeout=10)
        print(han.status_code)
        print(han.content)
        soup = asp(han.content, 'html.parser')

        hrefs = soup.find_all(href=re.compile(r'asp\?'))    # <a> tags whose href contains "asp?"
        for href in hrefs:
            href = href.get('href')
            if 'http' in href:
                bingasp.append(href)
            else:
                sss5.append(href)

        hrefs = soup.find_all(href=re.compile(r'php\?'))    # <a> tags whose href contains "php?"
        for href in hrefs:
            href = href.get('href')
            if 'http' in href:
                bingphp.append(href)
            else:
                sss5.append(href)

        hrefs = soup.find_all(href=re.compile(r'aspx\?'))   # <a> tags whose href contains "aspx?"
        for href in hrefs:
            href = href.get('href')
            if 'http' in href:
                bingaspx.append(href)
            else:
                sss5.append(href)

        hrefs = soup.find_all(href=re.compile(r'jsp\?'))    # <a> tags whose href contains "jsp?"
        for href in hrefs:
            href = href.get('href')
            if 'http' in href:
                bingjsp.append(href)
            else:
                sss5.append(href)
    except Exception:
        print('connect time out')
    finally:
        try:
            bingphp1 = bingphp[0:3]      # keep the first three absolute links of each type
            for bingphp2 in bingphp1:
                zhzhzh.write(bingphp2 + '\n')
            bingaspx1 = bingaspx[0:3]
            for bingaspx2 in bingaspx1:
                zhzhzh.write(bingaspx2 + '\n')
            bingasp1 = bingasp[0:3]
            for bingasp2 in bingasp1:
                zhzhzh.write(bingasp2 + '\n')
            bingjsp1 = bingjsp[0:3]
            for bingjsp2 in bingjsp1:
                zhzhzh.write(bingjsp2 + '\n')
            sss4 = random.sample(sss5, 4)    # pick 4 relative links at random
            print(sss4)
            for sss3 in sss4:
                print(sss3)
                zhzhzh.write(urllistx + sss3 + '\n')   # prepend the base URL to the relative link
        except Exception:
            print('rxxr')
zhzhzh.close()
Crawling inevitably turns up non-standard URLs, so an extra list [] is kept for links that already come as full http:// addresses; the relative ones are collected separately and joined to the page URL when written out.
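Simply concatenating urllistx + sss3 only gives a valid address when the relative path happens to line up with the base URL. A more robust way to join them (my own suggestion, not what the script above does; the URLs are placeholders) is urllib.parse.urljoin:

from urllib.parse import urljoin

base = 'http://www.example.com/news/index.html'   # placeholder base page
href = 'show.php?id=12'                           # placeholder relative link
print(urljoin(base, href))                        # http://www.example.com/news/show.php?id=12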
Run results
The final output is saved to url.txt in the current directory.
As you can see, it runs fairly stably; these URLs can then be used to check for SQL injection.