Python Web Scraping: GET and POST Explained

Learning Python, Step Two

Fetching website content
- Differences between GET and POST
- Finally, scraping Qiushibaike

Fetching website content

I adapted the code from this article:
https://blog.youkuaiyun.com/xiaolong_4_2/article/details/86497792

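Write data to a file:
file = open("./1.html", 'wb')  # saves the file in the current directory
file.write(data)
file.close()
Simulate a browser visiting the URL (using the opener object's addheaders attribute).
The most important part is how to obtain the header value.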

Fetching website content. Note that the main thing I changed is the header part.

import urllib.request

"""Simulate a browser visiting a URL.
First, create an opener object (its arguments are handlers).
Next, add request headers through the opener's addheaders attribute.
Finally, request the page with the opener's open() method, or with
urllib.request.urlopen() after calling urllib.request.install_opener(opener).
"""
# URL of the page
url = "http://www.2345.com"
opener = urllib.request.build_opener()
# Request headers for the opener, as a list of (name, value) tuples.
# User-Agent must be a browser identification string; this one is only an
# example value (copy the real one from your browser's developer tools).
header = ("User-Agent",
          "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
          "(KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36")
opener.addheaders = [header]
# Visit the URL with the opener's open() method
data = opener.open(url).read()

# Write data to a file in the current directory
with open("./1.html", "wb") as file:
    file.write(data)
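
An alternative to setting opener.addheaders is to attach the headers to a Request object. The block below is a minimal sketch of that idea, not the method used above; build_request and EXAMPLE_USER_AGENT are illustrative names, and the User-Agent string is an assumed example value.

```python
import urllib.request

# Example browser identification string (an assumption; copy the real one
# from your own browser's developer tools).
EXAMPLE_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
)


def build_request(url, user_agent=EXAMPLE_USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.2345.com")
# Request stores header names capitalized, so the lookup key is "User-agent".
print(req.get_header("User-agent"))
```

Passing req to urllib.request.urlopen(req) would then send the header with the request; nothing is sent over the network until that call.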

Differences between GET and POST

w3school gives the standard comparison (shown as an image in the original post).

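GET puts the parameters in the URL as plain text (including an account name or password).
POST sends them in the request body, so they do not appear in the address bar (which also makes it less convenient to see what was submitted).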
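
The difference can be seen without sending anything over the network. The following is a small sketch using urllib; the example.com URL is a hypothetical placeholder, and the field names are borrowed from the POST example later in this post.

```python
import urllib.parse
import urllib.request

params = {"name": "aas", "pass": "123456"}
encoded = urllib.parse.urlencode(params)

# GET: the parameters become part of the URL itself.
get_req = urllib.request.Request("http://example.com/login?" + encoded)

# POST: the URL stays clean and the parameters travel in the request body.
post_req = urllib.request.Request(
    "http://example.com/login", data=encoded.encode("utf-8")
)

print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

urllib chooses the method from the data argument: a Request with no data is a GET, and a Request with data is a POST.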

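Scraping a page with a GET request
Pay attention to the URL-encoding step: the query value must be percent-encoded before it is appended to the URL, as in value1=urllib.request.quote(value). The documented home of this function is urllib.parse.quote.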

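import urllib.parse
import urllib.request
"""Visit a URL with a GET request.
First, build the URL (it carries the field names and values of the GET request,
so it must be in GET form: ?key=value&key=value).
Next, create a Request object with urllib.request.Request(url).
Finally, visit the URL with urlopen().
"""

# The value for the query key
value = "古风"
# Percent-encode the value so it is safe to put in a URL
value1 = urllib.parse.quote(value)
# Build the URL; only the query value changes
url = "https://www.sogou.com/sogou?pid=sogou-site-488e4104520c6aab&ie=utf8&query=" + value1
# Create the Request object
req = urllib.request.Request(url)
# Visit the URL
ht = urllib.request.urlopen(req)
# Read the page into data
data = ht.read()

# Write data to a file
with open("F:/5.html", "wb") as file:
    file.write(data)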
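
To make the encoding step concrete, here is a short sketch of what quote produces for the value above, and how urllib.parse.urlencode can build the whole query string in one call. Using urlencode is a suggested variation, not what the code above does.

```python
import urllib.parse

value = "古风"

# quote() percent-encodes the UTF-8 bytes of the string.
encoded = urllib.parse.quote(value)
print(encoded)  # %E5%8F%A4%E9%A3%8E

# urlencode() encodes every value and joins the pairs with "&".
query = urllib.parse.urlencode(
    {"pid": "sogou-site-488e4104520c6aab", "ie": "utf8", "query": value}
)
url = "https://www.sogou.com/sogou?" + query
print(url)

# unquote() reverses the encoding.
print(urllib.parse.unquote(encoded) == value)  # True
```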

Accessing a page with a POST request

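import urllib.request
import urllib.parse
"""Visit a URL with a POST request (the page you land on after clicking the
submit button is the URL that processes the POST form data).
First, set the URL (the one the form data is submitted to and processed by).
Next, build the POST form data as a dict (check the form's field names in the
page source) and encode it with urllib.parse.urlencode().
Then, create a Request object.
Finally, visit the URL with urllib.request.urlopen(request).
"""

# URL that processes the form data
url = "http://www.iqianyue.com/mypost/"
# Represent the form fields as a dict, then encode them to bytes
post = {"name": "aas", "pass": "123456"}
post1 = urllib.parse.urlencode(post).encode('utf-8')
# Create a Request object; passing data makes it a POST request
req = urllib.request.Request(url, post1)
# Visit the URL
ht = urllib.request.urlopen(req)
data = ht.read()

# Write data to a file
with open("F:/6.html", "wb") as file:
    file.write(data)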
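
The script above stops with a traceback if the server is unreachable. The following is a minimal sketch of the same POST wrapped in a function with a timeout and basic error handling; encode_form and post_form are hypothetical helper names, and the 10-second timeout is an arbitrary assumption.

```python
import urllib.error
import urllib.parse
import urllib.request


def encode_form(fields):
    """Encode a dict of form fields as the bytes body of a POST request."""
    return urllib.parse.urlencode(fields).encode("utf-8")


def post_form(url, fields, timeout=10):
    """POST the fields to url; return the response bytes, or None on failure."""
    req = urllib.request.Request(url, data=encode_form(fields))
    try:
        with urllib.request.urlopen(req, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, OSError) as exc:
        # URLError covers DNS, connection, and HTTP errors;
        # OSError also covers timeouts raised while reading.
        print("request failed:", exc)
        return None


body = encode_form({"name": "aas", "pass": "123456"})
print(body)  # b'name=aas&pass=123456'
# Decoding the body recovers the original fields.
print(urllib.parse.parse_qs(body.decode("utf-8")))
```

A call such as post_form("http://www.iqianyue.com/mypost/", {"name": "aas", "pass": "123456"}) would then return the page bytes, or None if the request fails.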

Finally, scraping Qiushibaike

This part still has problems; I will fix it later.

import requests
from bs4 import BeautifulSoup


# Fetch the HTML document
def get_html(url):
    response = requests.get(url)
    response.encoding = 'utf-8'
    return response.text


# Extract the text of the div whose id is "content"
def get_certain_joke(html):
    soup = BeautifulSoup(html, 'lxml')
    joke_content = soup.select('div#content')[0].get_text()
    return joke_content


url_joke = "https://www.qiushibaike.com"
html = get_html(url_joke)
joke_content = get_certain_joke(html)
print(joke_content)
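
The post does not say what the remaining problem is. One likely failure point (an assumption) is that soup.select('div#content')[0] raises IndexError when the returned page has no such element. Below is a small sketch of a guarded version, tested against an inline HTML string rather than the live site; extract_content is a hypothetical name, and the built-in "html.parser" backend is used here so that lxml is not required.

```python
from bs4 import BeautifulSoup


def extract_content(html):
    """Return the text of <div id="content">, or None if the page has no such div."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one("div#content")
    if node is None:
        return None
    return node.get_text(strip=True)


sample = '<html><body><div id="content"><p>joke one</p></div></body></html>'
print(extract_content(sample))
print(extract_content("<html><body></body></html>"))
```

Returning None lets the caller decide what to do when the element is missing, for example print the raw HTML to see what the site actually sent back.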