[Notes] MOOC: Python Web Crawling and Information Extraction, the re library (4)

License: CC BY-SA 4.0

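These notes cover two optimizations to the Python crawler from the previous note. The first saves time by passing a known encoding to the fetch function instead of detecting it with r.apparent_encoding. The second improves the user experience by displaying the crawl progress dynamically. For a substantial speedup beyond that, the Scrapy library is the tool to use.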

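Optimizing the code from the notes on the re library (3)

Improving user experience: speeding up the crawler (Scrapy library)

With the requests and BeautifulSoup libraries alone, it is hard to improve the speed significantly.

Where time can be saved: r.apparent_encoding has to analyze the response text to guess its encoding, and that takes time. If the encoding of the site is already known, it can be set directly.

Original code: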

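import re
import requests
from bs4 import BeautifulSoup

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding    # guesses the encoding from the response body (slow)
        return r.text
    except Exception:
        return ""

def getStockList(lst, stockURL):
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            # keep the first stock code (sh or sz followed by six digits) found in the link
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except (KeyError, IndexError):
            # the tag has no href, or the href contains no stock code
            continue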
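
The filtering step in getStockList can be tried without any network access. The following is a minimal sketch: the helper name extract_codes and the example.com links are made up for illustration, and only the regular expression comes from the code above.

```python
import re

# Same pattern as in getStockList: "sh" or "sz" followed by exactly six digits.
STOCK_CODE = re.compile(r"[s][hz]\d{6}")

def extract_codes(hrefs):
    """Return the first stock code found in each href, skipping links without one."""
    codes = []
    for href in hrefs:
        matches = STOCK_CODE.findall(href)
        if matches:
            codes.append(matches[0])
    return codes

# Hypothetical links, standing in for the href values of the <a> tags on a list page.
sample_hrefs = [
    "https://example.com/sh600000.html",
    "https://example.com/sz000001.html",
    "https://example.com/center/list.html",
]

print(extract_codes(sample_hrefs))  # ['sh600000', 'sz000001']
```

Checking whether the match list is empty does the same job as the try/except around `[0]` in the original code, without relying on an exception for ordinary links.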

Optimized code:

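def getHTMLText(url, code='utf-8'):     # the encoding is determined by hand in advance and passed in
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = code               # no detection step, so no time is spent guessing
        return r.text
    except Exception:
        return ""

def getStockList(lst, stockURL):
    html = getHTMLText(stockURL, 'GB2312')      # East Money (东方财富网) uses the GB2312 encoding
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except (KeyError, IndexError):
            # the tag has no href, or the href contains no stock code
            continue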
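
Hard-coding the encoding only works if it really matches the page, so it is worth checking once by hand. The sketch below uses only the standard library and a made-up sample string; it simulates the bytes of a GB2312 page and decodes them both ways, which is roughly what setting r.encoding controls when r.text is read.

```python
# Hypothetical sample text standing in for a page body; not taken from the real site.
sample = "东方财富网 sh600000"

# Simulate the raw bytes a server would send for a GB2312-encoded page.
raw = sample.encode("gb2312")

# Decoding with the known, correct encoding restores the text with no detection step.
decoded = raw.decode("gb2312")

# Decoding with the wrong encoding garbles the Chinese characters.
wrong = raw.decode("utf-8", errors="replace")

print(decoded == sample)  # True
print(wrong == sample)    # False
```

If the second case shows up in the crawled output as garbled text, the code argument passed to getHTMLText does not match the site.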

Improving user experience: displaying the crawl progress dynamically

Original code:

import traceback

def getStockInfo(lst, stockURL, fpath):
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            # Q: how do findall and find_all differ? find_all is a BeautifulSoup method that
            # returns matching tags; re.findall is a re function that returns matching substrings
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'股票名称': name.text.split()[0]})     # key means "stock name"
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
            # append one dict per stock to the output file
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except Exception:
            traceback.print_exc()
            continue

Optimized code:

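def getStockInfo(lst, stockURL, fpath):
    count = 0       # new counter for the number of stocks processed so far
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            # Q: how do findall and find_all differ? find_all is a BeautifulSoup method that
            # returns matching tags; re.findall is a re function that returns matching substrings
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'股票名称': name.text.split()[0]})     # key means "stock name"
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
                count = count + 1
                # print normally ends with a newline; end='' turns that off
                print('\rCurrent progress: {:.2f}%'.format(count * 100 / len(lst)), end='')
                # The escape character \r moves the cursor back to the start of the current line,
                # so the next print overwrites what was printed before.
                # The result is a progress display that updates in place without new lines.
                # IDLE does not honor \r, so run the script from a command line to see the effect.
        except Exception:
            count = count + 1
            # a failed stock still counts toward the progress
            print('\rCurrent progress: {:.2f}%'.format(count * 100 / len(lst)), end='')
            traceback.print_exc()
            continue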
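
The progress display can be tried on its own, without crawling anything. This is a minimal sketch under stated assumptions: the names format_progress and run are made up for this example, and time.sleep stands in for fetching and parsing a page.

```python
import time

def format_progress(count, total):
    """Build the progress text; the leading \\r returns the cursor to the start of the line."""
    return '\rCurrent progress: {:.2f}%'.format(count * 100 / total)

def run(items, delay=0.0):
    """Process items one by one, overwriting a single progress line as it goes."""
    count = 0
    for _ in items:
        time.sleep(delay)   # stand-in for fetching and parsing one page
        count += 1
        # end='' suppresses the newline so the next print overwrites this line;
        # flush=True pushes the text to the terminal immediately
        print(format_progress(count, len(items)), end='', flush=True)
    print()                 # finish with a newline once the loop is done
    return count

# Hypothetical stock codes, used only to drive the loop.
run(["sh600000", "sz000001", "sz000002"])
```

Calling run with a delay such as 0.5 in a terminal makes the in-place update visible. As noted in the code comments above, IDLE prints each update one after another instead of overwriting.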