考研学校的爬虫(自己写的可能会有点菜

最新推荐文章于 2024-02-19 21:54:15 发布

原创最新推荐文章于 2024-02-19 21:54:15 发布 · 872 阅读

6 ·

CC 4.0 BY-SA版权

文章标签：

#python #爬虫

数据爬取专栏收录该内容

4 篇文章

订阅专栏

本文介绍了一种使用Python爬虫技术自动抓取特定大学研究生招生简章的方法，包括如何设置请求头避免被服务器识别为脚本、正则表达式解析网页内容以及如何将数据持久化到本地文件。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

最近因为准备考研所以就把要考研的学校的招生简章给爬了下来
开机启动
我爬
我怕我忘了最新的学校通告所以才写的

首先确定目标~~
然后到研究生院查看资料
这里我就不讲了然后
这里进入正题

爬取页面

先是将页面保存的函数

def getpage(url):
	rep=requests.get(url,headers=header)
	rep=rep.text
	#rep=rep.decode("ISO-8859-1")
	rep=rep.encode("utf-8")#这里我将其变成了utf-8的编码所以之后我们写如文件的时候就要用网页的编码可以用rep.encoding查看网页编码
	#print rep
	return rep

这里我转成utf-8编码是在cmd终端可以显示gbk编码的话在cmd会发生错误
用的是request模块header也就是发送的数据包头,防止服务器认为是脚本(通俗一点就是模拟浏览器发送的数据包)

解析页面获得关键点

我只需要招生简章所以我用一个正则匹配获取关键的字符就行了

def parsehtml(wangye):
	global information_2021
	compil=r"<td .*><a href='.*' target='.*' title='(.*?)'>"
	compil1=r"<td .*><a href='(.*)' target='.*' title='.*?'>"
	aim=re.findall(compil,wangye)
	aim2=re.findall(compil1,wangye)
	for i in aim2:
		if "2021" in i:
			information_2021.append(i)
	#print aim
	return aim

ps
这里我多加了一个事2021的招生简章如果出现就会存到另外一个数组因为我是2021的emem

将爬取到的字符写入文本

直接写入文本通过with open
用with open的话就不用写close函数了调用之后就会关闭
还有这里最好写入的编码和网页的编码一致就会写成不然文件就会是乱码
不然在一开始或许可以不变编码直接写入(ps我没试过)

def write_to_txt(list1):
	with open('jianzhang.txt','wb+') as f:
		for i in b:
			try:
				i=i.encode("ISO-8859-1")
				f.writelines(i+'\n')
			except Exception as e:
				print e

2021的招生简章保存成网页

def get_2021(list2):
	url="http://grs.hdu.edu.cn/"
	for i in list2:
		true_url=url+i
		try:
			reponse=requests.get(true_url,headers=header)
			reponse=reponse.text
			with open('./%s.html','w') as f1:
				f1.write(reponse)
		except Exception as e:
			print e

招生简章的网页都是通过上面拼接而成的所以就按正常的写就行了

然后将python脚本写一个批处理加入开机启动

我们通过cmd打开开机启动的文件夹直接输入shell:startup
在这里插入图片描述即可以打开写一个简单批处理

python E:\spider\day14\day14.py
pause

(最好将写文件的路径改一下不然就会出现权限的问题~~)

最后附脚本

#coding=utf-8
import requests
import re
import os
import sys
reload(sys)
sys.setdefaultencoding('utf8')
#sys.setdefaultencoding('gb18030')

header={
	"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
	"Referer":"http://grs.hdu.edu.cn/"
}

information_2021=[]

def getpage(url):
	rep=requests.get(url,headers=header)
	rep=rep.text
	#rep=rep.decode("ISO-8859-1")
	rep=rep.encode("utf-8")#这里我将其变成了utf-8的编码所以之后我们写如文件的时候就要用网页的编码可以用rep.encoding查看网页编码
	#print rep
	return rep


def parsehtml(wangye):
	global information_2021
	compil=r"<td .*><a href='.*' target='.*' title='(.*?)'>"
	compil1=r"<td .*><a href='(.*)' target='.*' title='.*?'>"
	aim=re.findall(compil,wangye)
	aim2=re.findall(compil1,wangye)
	for i in aim2:
		if "2021" in i:
			information_2021.append(i)
	#print aim
	return aim

def write_to_txt(list1):
	with open("E:\\spider\\day14\\jianzhang.txt",'wb+') as f:
		for i in b:
			try:
				i=i.encode("ISO-8859-1")
				f.writelines(i+'\n')
			except Exception as e:
				print e

def get_2021(list2):
	url="http://grs.hdu.edu.cn/"
	for i in list2:
		true_url=url+i
		try:
			reponse=requests.get(true_url,headers=header)
			reponse=reponse.text
			with open('E:\\spider\\day14\\%s.html','w') as f1:
				f1.write(reponse)
		except Exception as e:
			print e

if __name__=="__main__":
	url="http://grs.hdu.edu.cn/1722/list.htm"
	a=getpage(url)	
	b=parsehtml(a)
	write_to_txt(b)
	get_2021(information_2021)