Web Scraping with Python 9: Crawling Through Forms and Login Windows

License: CC 4.0 BY-SA

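This article shows how to use Python's Requests library to send POST requests and handle form data, including submitting a basic form, uploading files, and managing logins and cookies, and it touches on the problems that come up with more complex forms.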

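Up to now we have only used the GET method to request information. This chapter introduces the POST method, which pushes information to the server for storage and analysis.

Forms help users issue POST requests. A scraper can, of course, build these requests itself and submit them to the server.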

9.1 The Python Requests Library

Requests is a third-party Python library that is good at handling complex HTTP requests, cookies, headers (both request and response headers), and more.

https://github.com/requests/requests

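9.2 Submitting a Basic Form

Most web forms consist of a few HTML fields, a submit button, and an action page, given by the form's action attribute, to which the data is sent for processing.

Most mainstream sites use their robots.txt file to forbid crawlers from accessing login forms.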
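
Before submitting anything to a form, a scraper can check whether the site's robots.txt allows it. The sketch below uses the standard library's urllib.robotparser on a made-up robots.txt; the rules and the example.com URLs are assumptions for illustration only, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /login
Disallow: /account/
"""

parser = RobotFileParser()
# parse() accepts the file's lines, so no network access is needed here.
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

login_allowed = parser.can_fetch("*", "http://example.com/login")
about_allowed = parser.can_fetch("*", "http://example.com/about")

print("May fetch /login:", login_allowed)
print("May fetch /about:", about_allowed)
```

Against a live site, set_url() followed by read() would load the real file instead of the sample text.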

Example form: http://pythonscraping.com/pages/files/form.html

Source code of the form:

<form method="post" action="processing.php">
First name: <input type="text" name="firstname"><br>
Last name: <input type="text" name="lastname"><br>
<input type="submit" value="Submit">
</form>

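Remember: the only purpose of an HTML form is to help visitors send a properly formatted request to the page that actually does the work, which is usually not the page that displays the form.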
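
To make the relationship between the form page and its action page concrete, here is a minimal sketch that reads the form source shown above and works out where a browser would send the data. It uses only the standard library; the FormInspector class is a hypothetical helper written for this example, and the page URL is the one given earlier in this section.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FORM_PAGE_URL = "http://pythonscraping.com/pages/files/form.html"
FORM_HTML = """
<form method="post" action="processing.php">
First name: <input type="text" name="firstname"><br>
Last name: <input type="text" name="lastname"><br>
<input type="submit" value="Submit">
</form>
"""


class FormInspector(HTMLParser):
    """Hypothetical helper: records a form's action, method, and named inputs."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
            # HTML forms default to GET when no method is given.
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and "name" in attrs:
            self.field_names.append(attrs["name"])


inspector = FormInspector()
inspector.feed(FORM_HTML)

# A relative action is resolved against the URL of the page holding the form.
target_url = urljoin(FORM_PAGE_URL, inspector.action)

print(inspector.method, target_url, inspector.field_names)
```

A relative action resolved this way lands under /pages/files/, which is not the same path the request script in this section hard-codes. This sketch cannot confirm which of the two the server answers, so if a hard-coded URL fails, the resolved one is worth trying.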

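import requests

# Keys must match the name attributes of the form's <input> elements.
params = {'firstname': 'Ryan', 'lastname': 'Mitchell'}
# POST to the form's action page, not to the page that displays the form.
r = requests.post("http://pythonscraping.com/files/processing.php", data=params)
print(r.text)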
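
To see roughly what the data argument becomes on the wire, the sketch below encodes the same dictionary with the standard library. It assumes the default form encoding (application/x-www-form-urlencoded), which is what a plain HTML form without an enctype attribute uses.

```python
from urllib.parse import urlencode

params = {'firstname': 'Ryan', 'lastname': 'Mitchell'}

# Each key/value pair becomes key=value, and pairs are joined with "&".
body = urlencode(params)
print(body)

# Characters that are not URL-safe are percent-encoded; spaces become "+".
encoded_space = urlencode({'firstname': 'Mary Ann'})
print(encoded_space)
```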

The form on the O'Reilly Media newsletter subscription page:

<form action="http://post.oreilly.com/client/o/oreilly/forms/quicksignup.cgi" id="example_form2" method="POST">
	<input name="client_token" type="hidden" value="oreilly" />
	<input name="subscribe" type="hidden" value="optin" />
	<input name="success_url" type="hidden" value="http://oreilly.com/store/newsletter-thankyou.html" />
	<input name="error_url" type="hidden" value="http://oreilly.com/store/newsletter-signup-error.html" />
	<input name="topic_or_dod" type="hidden" value="1" />
	<input name="source" type="hidden" value="orm-home-t1-dotd" />
	<fieldset>
		<input class="email_address long" maxlength="200" name="email_addr" size="25" type="text" value="Enter your email here" />
		<button alt="Join" class="skinny" name="submit" onclick="return addClickTracking('orm','ebook','rightrail','dod');" value="submit">Join</button>
	</fieldset>
</form>

The corresponding code:

import requests

# Only the visible email field is sent here; the form's hidden fields are omitted.
params = {'email_addr': 'ryan.e.mitchell@gmail.com'}
r = requests.post("http://post.oreilly.com/client/o/oreilly/forms/quicksignup.cgi", data=params)
print(r.text)
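
The script above sends only email_addr. If a server turns out to require the hidden fields as well, they can be read from the form and merged into the parameters. The following is a minimal sketch under that assumption: HiddenFieldCollector is a hypothetical helper, the sample HTML is a shortened copy of the form above, and the email address is a placeholder.

```python
from html.parser import HTMLParser

# Shortened copy of the newsletter form, kept inline so the sketch runs offline.
SAMPLE_FORM = """
<form action="quicksignup.cgi" method="POST">
  <input name="client_token" type="hidden" value="oreilly" />
  <input name="subscribe" type="hidden" value="optin" />
  <input name="topic_or_dod" type="hidden" value="1" />
  <input name="email_addr" type="text" value="Enter your email here" />
</form>
"""


class HiddenFieldCollector(HTMLParser):
    """Hypothetical helper: gathers name/value pairs of hidden inputs."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            self.hidden[attrs["name"]] = attrs.get("value") or ""


collector = HiddenFieldCollector()
collector.feed(SAMPLE_FORM)

# Start from the hidden fields, then add the value a user would type in.
params = dict(collector.hidden)
params["email_addr"] = "user@example.com"  # placeholder address

print(params)
```

The resulting dictionary can be passed as data= to requests.post in the same way as before.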

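9.3 Radio Buttons, Checkboxes, and Other Inputs

The HTML standard provides many kinds of form fields: radio buttons, checkboxes, select boxes, and so on. HTML5 adds further controls such as sliders (range inputs), email fields, and date fields. Custom JavaScript fields can do almost anything, from color pickers and calendars to whatever else a developer can think of.

If you are not sure what data format an input field's value takes, there are tools that can track the GET and POST requests your browser sends to and receives from a site.

The simplest way is to use Chrome's Inspector or developer tools, for example the Network panel, to look at the request.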
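
Whatever the control looks like on screen, what reaches the server is still a list of name/value pairs. The sketch below shows how a radio button, a multi-valued checkbox group, and a single checkbox might be encoded. All field names and values here are hypothetical; the real ones have to be read from the page or from the request shown in the browser's developer tools.

```python
from urllib.parse import urlencode

# Hypothetical fields: a list of tuples lets one name appear several times,
# which is how a group of checked checkboxes with the same name is sent.
fields = [
    ("size", "large"),          # radio button: only the selected value is sent
    ("topping", "cheese"),      # checkbox group: one pair per checked box
    ("topping", "mushroom"),
    ("newsletter", "on"),       # a checked box with no value attribute sends "on"
]

body = urlencode(fields)
print(body)
```

Unchecked checkboxes are simply left out of the request. Requests also accepts a list of tuples like this for its data argument, so the same structure can be passed straight to requests.post.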

9.4 Submitting Files and Images

<form action="processing2.php" method="post" enctype="multipart/form-data">
Submit a jpg, png, or gif: <input type="file" name="image"><br>
<input type="submit" value="Upload File">
</form>

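import requests

# The dictionary key must be the field name the server-side script expects.
# Note: the form shown above names its file input "image", while this script
# sends "uploadFile"; adjust the key if the upload is not recognized.
with open('../files/Python-logo.png', 'rb') as image_file:
    files = {'uploadFile': image_file}
    r = requests.post("http://pythonscraping.com/pages/processing2.php", files=files)
print(r.text)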
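
To check what a file upload looks like without contacting the server, a request can be prepared but not sent. The sketch below is an illustration under stated assumptions: the file contents are placeholder bytes held in memory rather than a real image, and the explicit filename and content type are passed using the (filename, file object, content type) tuple form that Requests accepts in files.

```python
import io

import requests

# Placeholder bytes standing in for a real PNG file.
fake_png = io.BytesIO(b"\x89PNG\r\n\x1a\n placeholder bytes")

# (filename, file object, content type) gives explicit control over the part headers.
files = {"uploadFile": ("Python-logo.png", fake_png, "image/png")}

request = requests.Request(
    "POST", "http://pythonscraping.com/pages/processing2.php", files=files
)
prepared = request.prepare()  # builds headers and body, sends nothing

content_type = prepared.headers["Content-Type"]
print(content_type)
print(b'filename="Python-logo.png"' in prepared.body)
```

The Content-Type header shows the multipart/form-data encoding that the form's enctype attribute asks for, with a boundary string that Requests generates.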

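9.5 Handling Logins and Cookies

Once a site has verified your login credentials, it stores them in a cookie in your browser. The cookie usually contains a server-generated token, an expiry time for the login, and state-tracking information. The site treats this cookie as proof of authentication, and your browser presents it to the server on every page you visit.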
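
As a rough illustration of what such a cookie looks like, the sketch below parses a Set-Cookie header value with the standard library's http.cookies module and builds the Cookie header a client would send back. The cookie name, value, and attributes are invented for the example and are not taken from any real server.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value as a server might send it after login.
set_cookie_header = "loggedin=1; Max-Age=3600; Path=/"

cookie = SimpleCookie()
cookie.load(set_cookie_header)

morsel = cookie["loggedin"]
print("value:", morsel.value)
print("lifetime in seconds:", morsel["max-age"])
print("path:", morsel["path"])

# On later requests the client returns only the name=value pairs.
cookie_header = "; ".join(f"{name}={m.value}" for name, m in cookie.items())
print("Cookie:", cookie_header)
```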

Login form: http://pythonscraping.com/pages/cookies/login.html (any username works; the password must be "password").
Tracking cookies with the Requests library is just as simple:

import requests

params = {'username': 'ryan', 'password': 'password'}
# Log in by posting the credentials to the form's action page.
r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", data=params)
print("Cookie is set to:")
print(r.cookies.get_dict())
print("------------")
print("Going to profile page...")
# Send the cookies from the login response along with the next request.
r = requests.get("http://pythonscraping.com/pages/cookies/profile.php", cookies=r.cookies)
print(r.text)

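This works for simple cases. But what if the site is more complicated and often changes its cookies without warning, or you would rather not deal with cookies by hand at all? The Requests Session object handles both situations: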

import requests

# A Session keeps track of cookies and sends them on every later request.
session = requests.Session()
params = {'username': 'ryan', 'password': 'password'}
s = session.post("http://pythonscraping.com/pages/cookies/welcome.php", data=params)
print("Cookie is set to:")
print(s.cookies.get_dict())
print("------------")
print("Going to profile page...")
# No cookies argument is needed; the session supplies them.
s = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(s.text)
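
The way a session attaches its cookies can be observed without any network traffic. In the sketch below a cookie is placed in the session's jar by hand, standing in for what a login response would normally do, and a request is prepared but never sent. The cookie name and value are assumptions made for the example.

```python
import requests

session = requests.Session()

# Stand-in for a login response: store a cookie in the session's jar by hand.
session.cookies.set("loggedin", "1")

request = requests.Request(
    "GET", "http://pythonscraping.com/pages/cookies/profile.php"
)
# prepare_request() merges session state (cookies, headers) into the request.
prepared = session.prepare_request(request)

cookie_header = prepared.headers.get("Cookie")
print(cookie_header)
```

The prepared request already carries a Cookie header even though no cookies argument was given, which is the bookkeeping the Session object takes over.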

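Requests is an excellent library. In how much it handles without the programmer having to think about it or write the code, it is perhaps second only to Selenium.

HTTP basic access authentication: before cookies were invented, the most common way to handle site logins was HTTP basic access authentication. It is still seen occasionally, on some high-security or corporate sites and in some APIs. The page below is protected this way: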

http://pythonscraping.com/pages/auth/login.php

import requests
from requests.auth import HTTPBasicAuth

# Requests builds the Authorization header from the username and password.
auth = HTTPBasicAuth('ryan', 'password')
r = requests.post(url="http://pythonscraping.com/pages/auth/login.php", auth=auth)
print(r.text)
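
For reference, the header that basic authentication produces can be rebuilt by hand. This sketch uses only the standard library and the same example credentials as above; it illustrates the scheme and is not a replacement for HTTPBasicAuth.

```python
import base64

username, password = "ryan", "password"

# Basic auth joins the credentials with ":" and base64-encodes the result.
token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
auth_header = f"Basic {token}"

print("Authorization:", auth_header)

# Base64 is an encoding, not encryption: anyone can reverse it.
print(base64.b64decode(token))
```

Because the credentials can be decoded by anyone who sees the header, basic authentication should only be used over HTTPS.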