Python Web Scraping: Crawling Library Loan Records

Latest recommended article published on 2023-12-09 11:00:00

隨兴

Category: python  Tags: urllib BeautifulSoup

Article link: https://blog.youkuaiyun.com/a429367172/article/details/86532683

Environment

Python 3.6
BeautifulSoup4 (v4.6)

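Analysis

Many users of the library management system never changed their password from the default. I had recently been learning web scraping and wanted to try it out as practice. There is no malicious intent; this post will be taken down on request.

This post covers the following:

A program that simulates user login
Notes from studying the BeautifulSoup documentation
A small program that scrapes HTML pages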

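Simulating user login

Method 1: requests

First, build the POST data from the username and password and send it to the login endpoint so that the server issues a cookie. (Note that the payload may also need to be JSON.)
Then, with that cookie, request the home page with GET.
With requests, this looks as follows: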

import requests
import json

url = "http://interlib.sdust.edu.cn/opac/reader/space"
url_history = "http://interlib.sdust.edu.cn/opac/loan/historyLoanList"
url_log = "http://interlib.sdust.edu.cn/opac/reader/doLogin"


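# Create a session so that cookies persist across requests
req = requests.Session()

loginid = ''  # username
passwd = ''  # password

# Build the login request data
data = {
    'rdid' : loginid,
    'rdPasswd' : passwd,
    'returnUrl': '',
    'password': ''
}

# POST to simulate the login. The body is sent as a JSON string here;
# if the server expects form fields instead, pass data=data.
response = req.post(url_log,data = json.dumps(data))

# GET the page behind the login (here the loan history list)
index1 = req.get(url_history)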

print(index1.status_code)
print(index1.text)
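The note above that the payload may need to be JSON comes down to how the request body is encoded. As a minimal offline sketch (the host `example.invalid` and the credential values are placeholders, not the real system), preparing a request without sending it shows what `data=` and `json=` each produce:

```python
import json

import requests

# Placeholder credentials and URL, for illustration only.
payload = {"rdid": "demo_user", "rdPasswd": "demo_pass"}
login_url = "http://example.invalid/opac/reader/doLogin"

# data=<dict> produces a form-encoded body.
form_req = requests.Request("POST", login_url, data=payload).prepare()

# json=<dict> produces a JSON body and sets the JSON content type.
json_req = requests.Request("POST", login_url, json=payload).prepare()

form_type = form_req.headers["Content-Type"]
json_type = json_req.headers["Content-Type"]
form_body = form_req.body
json_body = json.loads(json_req.body)

print(form_type, form_body)
print(json_type, json_body)
```

Which of the two a given server accepts has to be read off the real login request in the browser; passing `data=json.dumps(...)` sends a JSON string but does not set the JSON content type.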

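Method 2: log in, then send requests carrying the returned cookie

Use the browser's developer tools: switch to the Network tab and check Preserve log (important!), so that the login request stays visible after the page redirects.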
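The idea behind this method, logging in once and then sending the returned cookie with later requests, can be tried offline with the standard library alone. The sketch below is an assumption-laden illustration: `DemoHandler`, the cookie name `session`, and the paths `/login` and `/history` are hypothetical stand-ins for a login server, not the real library system.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local stand-in for a login server."""

    def do_POST(self):
        # Consume the posted body, then issue a session cookie.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        self.send_response(200)
        self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        # Echo back the Cookie header the client sent, if any.
        body = (self.headers.get("Cookie") or "").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


# Start the demo server on a free local port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

# The cookie jar stores cookies from responses and attaches them to
# later requests made through the same opener.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({}),  # bypass any configured proxy
    urllib.request.HTTPCookieProcessor(jar),
)

opener.open(base + "/login", data=b"rdid=demo").close()  # POST "login"
echoed = opener.open(base + "/history").read().decode("utf-8")  # GET with cookie

server.shutdown()
server.server_close()
print(echoed)
```

Against a real site, the same opener would be pointed at the actual login and history URLs, with headers and form fields copied from the request seen in the developer tools.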

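import json
import hashlib
import sys
import io
import urllib.parse  # needed for urllib.parse.urlencode below
import urllib.request
import http.cookiejar

sys.stdout = io.TextIOWrapper(sys.stdout.buffer,encoding='utf8') # change the default encoding of standard output to UTF-8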

headers = {'User-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36'}

url = "http://interlib.sdust.edu.cn/opac/reader/space"
url_history = "http://interlib.sdust.edu.cn/opac/loan/historyLoanList"
url_log = "http://interlib.sdust.edu.cn/opac/reader/doLogin"


loginid = ''
passwd = ''

md5 = hashlib.md5(passwd.encode("utf-8"))
passwd = md5.hexdigest()

data = {
    "rdid" : loginid,
    "rdPasswd" : passwd,
    'returnUrl': '/loan/historyLoanList',
    'password': ''
}
# data = json.dumps(data)

data = urllib.parse.urlencode(data).encode('utf-8')


# Build the login request
req = urllib.request.Request(url_l