Environment: Python 3, Windows
Modules: requests, BeautifulSoup
Install the modules:
pip3 install beautifulsoup4
pip3 install requests
I. A simple crawler, using Autohome (汽车之家) as the example.
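import re

import requests
from bs4 import BeautifulSoup

# Goal: find every news item and extract its title, summary, URL and image.

# Send a GET request to the Autohome news page and fetch the returned HTML.
response = requests.get('http://www.autohome.com.cn/news/')
# requests guesses the text encoding from the response headers, and that guess is
# often wrong for Chinese sites. The Autohome page is GBK-encoded, so set it explicitly.
response.encoding = 'gbk'
# Parse the HTML with the parser from the Python standard library.
soup = BeautifulSoup(response.text, 'html.parser')
# Find the tag with the given id, then collect every <li> tag beneath it.
li_list = soup.find(id='auto-channel-lazyload-article').find_all(name='li')
# Inspecting the news section with F12 shows that these <li> tags hold the data we need,
# so pull each field out of every <li> with BeautifulSoup.
for li in li_list:
    title = li.find('h3')
    # Some <li> tags have no <h3> (probably ads); skip them.
    if not title:
        continue
    # Summary
    summary = li.find('p').text
    # Detail-page URL: every attribute of the <a> tag lives in its attrs dict,
    # so either index attrs or call get() directly.
    # url = li.find('a').attrs['href']
    url = li.find('a').get('href')
    print(title.text, summary, url)
    # Same idea for the image: get its URL, request it, and write the bytes to disk.
    img_url = li.find('img').get('src')
    img = requests.get(img_url)
    # Real titles contain many special characters that are not allowed in file names,
    # so replace them with a regex (any other naming scheme would work too).
    file_name = re.sub(r'[\\/:*?"<>|]', '_', title.text)
    with open(file_name + '.jpg', 'wb') as f:
        f.write(img.content)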
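The download step above takes the img src value as-is and derives the file name from the article title. Two details commonly trip this up: a src attribute may be relative or protocol-relative (starting with //), which requests.get rejects, and a title may be empty or very long once cleaned. The following is a minimal sketch of two helpers for these cases. The names absolute_url and safe_filename, the 80-character cap, the 'untitled' fallback and the sample inputs are all illustrative assumptions, not part of the original code.

```python
import re
from urllib.parse import urljoin

# The page URL is the one requested in the crawler above.
PAGE_URL = 'http://www.autohome.com.cn/news/'


def absolute_url(src, base=PAGE_URL):
    """Resolve a relative or protocol-relative src against the page URL."""
    return urljoin(base, src)


def safe_filename(title, max_len=80):
    """Replace characters that Windows forbids in file names with '_'."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', '_', title).strip(' .')
    # Fall back to a fixed name if nothing usable is left.
    return cleaned[:max_len] or 'untitled'


# Sample inputs are made up for illustration.
print(absolute_url('//img.example.com/a.jpg'))
print(absolute_url('/img/b.jpg'))
print(safe_filename('a/b:c'))
```

In the loop, img_url would pass through absolute_url before the request, and the title through safe_filename before open().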
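II. Logging in through code:
1. Logging in to GitHub:
First open the GitHub login page and submit a wrong username and password, then watch the HTTP requests in the Network tab of the F12 developer tools.
Click the session request; its Headers tab shows which URL receives our login information.
Next, look for the data the server expects. The Form Data section at the bottom lists it.
Following that format, we send a POST request to the GitHub server: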
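import requests
from bs4 import BeautifulSoup

# Fetch the login page to get the token.
r1 = requests.get('https://github.com/login')
s1 = BeautifulSoup(r1.text, 'html.parser')
# Searching the page source (F12) for "token" finds the input tag that carries the
# CSRF (cross-site request forgery) protection token; parse out its value.
token = s1.find(name='input', attrs={'name': 'authenticity_token'}).get('value')
# Some sites send a set of cookies on the first GET and only accept a login that
# carries those cookies back, so grab the pre-login cookies here.
r1_cookie_dict = r1.cookies.get_dict()
# Send the username, password and token to the server.
r2 = requests.post(
    'https://github.com/session',
    data={
        'utf8': '✓',
        'authenticity_token': token,
        'login': 'Mitsui1993',
        'password': 'pretend-this-is-a-password',
        'commit': 'Sign in',
    },
    cookies=r1_cookie_dict,
)
# Get the cookies returned after login and merge everything into one dict.
r2_cookie_dict = r2.cookies.get_dict()
cookie_dict = {}
cookie_dict.update(r1_cookie_dict)
cookie_dict.update(r2_cookie_dict)
# Check the login by requesting a page that is only visible when logged in,
# sending the merged cookies along.
r3 = requests.get(
    url='https://github.com/settings/emails',
    cookies=cookie_dict,
)
# The returned text contains my username, which confirms the login succeeded.
print(r3.text)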
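Merging the cookie dicts by hand works, but requests also provides requests.Session, which stores the cookies from every response and sends them with later requests made through the same session. The sketch below shows that behavior offline by preparing a request without sending it. The cookie name and value are made up, and using prepare_request purely for inspection is an assumption of this example rather than part of the login flow.

```python
import requests

session = requests.Session()
# Simulate a cookie that an earlier response would have stored (made-up value).
session.cookies.set('example_cookie', 'abc')

# Build a request through the session without sending it, to inspect its headers.
request = requests.Request('GET', 'https://github.com/settings/emails')
prepared = session.prepare_request(request)

# The session attached its stored cookie automatically.
print(prepared.headers.get('Cookie'))
```

In a real login flow, the same session object would make all three calls (session.get for the login page, session.post for the credentials, session.get for the protected page), with no cookies= argument on any of them.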
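2. Upvoting on Chouti (抽屉网) with requests
import requests

# Get the cookies from the first GET, before logging in.
r1 = requests.get('http://dig.chouti.com')
r1_cookies = r1.cookies.get_dict()
# Voting requires logging in first, so, as with GitHub, we inspect the HTTP requests
# to learn the target URL and the data it expects.
r2 = requests.post(
    'http://dig.chouti.com/login',
    data={
        'phone': '8615xxxxx',
        'password': 'woshiniba',
        'oneMonth': 1,
    },
    cookies=r1_cookies,
)
# Get the cookies returned after login.
r2_cookies = r2.cookies.get_dict()
# Merge the cookies.
r_cookies = {}
r_cookies.update(r1_cookies)
r_cookies.update(r2_cookies)
# The vote itself only needs the gpsd cookie from the first GET. This is why we
# recommend sending the pre-login and post-login cookies merged together:
# it makes the requests far more tolerant of such details.
# r_cookies = {'gpsd': r1_cookies['gpsd']}
# Vote URL format: the value after linksId= is the article id.
r3 = requests.post(
    'http://dig.chouti.com/link/vote?linksId=13921736',
    cookies=r_cookies,
)
# A correct status code and response message prove that the vote succeeded.
print(r3.text)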
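Since the vote URL differs only in the article id after linksId=, it can be built from the id instead of being hard-coded. This is a small sketch using only the standard library. The helper name vote_url and the second article id are illustrative assumptions; the endpoint is the one from the example above.

```python
from urllib.parse import urlencode

# Endpoint taken from the vote request above.
VOTE_ENDPOINT = 'http://dig.chouti.com/link/vote'


def vote_url(link_id):
    """Return the vote URL for one article id."""
    return VOTE_ENDPOINT + '?' + urlencode({'linksId': link_id})


# The second id is made up for illustration.
for link_id in (13921736, 13921737):
    print(vote_url(link_id))
```

requests can also build the query string itself when the id is passed as params={'linksId': link_id} alongside the bare endpoint.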
III. Other methods of the requests and BeautifulSoup modules:


def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
        ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
        or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
        to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How long to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'http://httpbin.org/get')
      <Response [200]>
    """
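To make a few of these parameters concrete, the sketch below shows how params, data, json and headers end up in the outgoing request. It builds requests.Request objects and calls prepare() so that nothing is sent over the network. The URLs follow the docstring's httpbin example, and the keys, values and User-Agent string are made up for illustration.

```python
import requests

# params is encoded into the query string; headers are sent as given.
get_req = requests.Request(
    'GET', 'http://httpbin.org/get',
    params={'k1': 'v1', 'k2': 'v2'},
    headers={'User-Agent': 'demo-client'},
).prepare()
print(get_req.url)

# data (a dict) is form-encoded into the request body.
form_req = requests.Request(
    'POST', 'http://httpbin.org/post',
    data={'user': 'alice', 'pwd': 'secret'},
).prepare()
print(form_req.headers['Content-Type'])
print(form_req.body)

# json is serialized and sent with a JSON content type.
json_req = requests.Request(
    'POST', 'http://httpbin.org/post',
    json={'user': 'alice'},
).prepare()
print(json_req.headers['Content-Type'])
```

To actually send such a request, the same keyword arguments go to requests.request, for example requests.request('GET', url, params=..., timeout=5).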