Differences Between the HTTP and HTTPS Protocols
- HTTPS is the secure, upgraded version of HTTP.
- HTTP transmits data unencrypted, in plaintext; HTTPS encrypts traffic with SSL/TLS, making it far more secure.
- HTTP connects on port 80; HTTPS connects on port 443.
- HTTP connections are simple and stateless; HTTPS uses SSL/TLS to build an encrypted channel with identity authentication.
- Tip: if scraping an HTTPS page gives unsatisfactory content, drop the "s" and try fetching the page over plain HTTP.
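A minimal sketch of that tip: swap the scheme from https to http before fetching (the example URL is just a placeholder).

```python
import urllib.request

url = 'https://www.example.com/'
# fall back to plain HTTP if the HTTPS version scrapes poorly
http_url = url.replace('https://', 'http://', 1)
data = urllib.request.urlopen(http_url).read().decode('utf-8', 'ignore')
print(len(data))
```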
Fetching URLs Quickly with urllib
Three approaches:

Approach 1: fetch into memory
- Import the urllib module: import urllib.request
- Open and visit the URL (Baidu as the example): urllib.request.urlopen('http://www.baidu.com')
- Read the response with .read()
- Decode it with .decode('utf-8', 'ignore')

```python
import urllib.request

# fetch the page, read the raw bytes, and decode them into text
data = urllib.request.urlopen('http://www.baidu.com').read().decode('utf-8', 'ignore')
print(len(data))
print(data)
```

- Tip: fetching over http returns the full original page, while fetching over https may return inaccurate content.
Approach 2: fetch into memory, via a Request object
- Import the urllib module: import urllib.request
- Declare the URL to request: url = 'http://www.baidu.com'
- Wrap it with urllib.request.Request(url): req = urllib.request.Request(url)
- Open the request: data = urllib.request.urlopen(req).read().decode('utf-8', 'ignore')

```python
import urllib.request

url = 'http://www.baidu.com'
# wrap the URL in a Request object, then open and decode it
req = urllib.request.Request(url)
data = urllib.request.urlopen(req).read().decode('utf-8', 'ignore')
```
Approach 3: fetch to disk
- Import the urllib module: import urllib.request
- Declare the URL to request: url = 'http://www.baidu.com'
- urllib.request.urlretrieve("url", filename="save path") downloads the page and writes it to an html file:

```python
import urllib.request

url = 'http://www.baidu.com'
# download the page straight into a local html file
urllib.request.urlretrieve(url, filename=r"F:\pycharm\百度爬虫\视频总结\baidu.html")
```
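urlretrieve can also report download progress: it accepts a reporthook callback that it calls after each block, and urllib.request.urlcleanup() clears the temporary cache it may leave behind. A minimal sketch (the local filename is just a placeholder):

```python
import urllib.request

def progress(blocknum, blocksize, totalsize):
    # urlretrieve calls this after each block: block count, block size, total bytes
    if totalsize > 0:
        percent = min(blocknum * blocksize, totalsize) * 100.0 / totalsize
        print("%.1f%%" % percent)

urllib.request.urlretrieve('http://www.baidu.com', filename='baidu.html', reporthook=progress)
urllib.request.urlcleanup()  # clear urlretrieve's temporary cache
```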
Reading the Status Code with urllib
- Open the URL: file = urllib.request.urlopen('url')
- getcode() returns the page's status code: print(file.getcode())
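Put together, a minimal runnable check:

```python
import urllib.request

file = urllib.request.urlopen('http://www.baidu.com')
print(file.getcode())  # prints 200 on a normal response
```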
Getting Cookies with urllib
- Cookie handling for simulated logins.
- Once you have logged in on the site, the cookie is picked up automatically when you visit a logged-in URL.
- Import urllib's cookie-handling module: import http.cookiejar

```python
import urllib.request
import http.cookiejar

# create a cookie jar and an opener that stores cookies in it
cjar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
# install the opener globally so every urlopen call shares the cookies
urllib.request.install_opener(opener)
print(str(cjar))
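A sketch of the simulated-login flow described above, assuming a hypothetical site: login_url, the form field names, and the member page are placeholders, not a real API.

```python
import urllib.request
import urllib.parse
import http.cookiejar

cjar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
urllib.request.install_opener(opener)

# hypothetical login form; the real field names depend on the target site
login_url = 'http://www.example.com/login'
postdata = urllib.parse.urlencode({'user': 'name', 'pass': 'secret'}).encode('utf-8')
urllib.request.urlopen(login_url, postdata)  # the session cookie now sits in cjar

# later requests through the same opener send the cookie automatically
data = urllib.request.urlopen('http://www.example.com/member').read()
print(str(cjar))
```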
Sending a POST Request with urllib
- Test URL that returns the data sent in a POST request: http://www.iqianyue.com/mypost/
- Import urllib's POST-handling module: import urllib.parse

```python
import urllib.request
import urllib.parse

url = 'http://www.iqianyue.com/mypost/'
# urlencode the form fields, then encode the string to bytes
postdata = urllib.parse.urlencode({
    'name': 'ning',
    'pass': 'ning',
}).encode('utf-8')
req = urllib.request.Request(url, postdata)
data = urllib.request.urlopen(req).read()
# save the response page to disk
fh = open("F:/pycharm/百度爬虫/视频总结/post.html", "wb")
fh.write(data)
fh.close()
```
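Instead of saving the response to a file, you can check it in memory; assuming the test page really echoes the posted fields back, something like:

```python
import urllib.request
import urllib.parse

url = 'http://www.iqianyue.com/mypost/'
postdata = urllib.parse.urlencode({'name': 'ning', 'pass': 'ning'}).encode('utf-8')
data = urllib.request.urlopen(urllib.request.Request(url, postdata)).read()
text = data.decode('utf-8', 'ignore')
print('ning' in text)  # True if the posted values were echoed back
```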
urllib.request: Impersonating a Browser
Add headers through an opener. There are two ways to make the request.

Way 1:
- Import the urllib module: import urllib.request
- The header must be a tuple: headers = ("User-Agent", "Mozilla/5.0 ...")
- Call urllib's build_opener to get an opener object and set its addheaders attribute in place of headers; addheaders must be a list containing tuples, [("", ""), ("", "")]: opener.addheaders = [headers]
- Open the URL directly with the opener object: data = opener.open(url).read()
- Full code:

```python
import urllib.request

headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36")
url = 'https://www.qiushibaike.com/'
# call urllib's build_opener method and attach the headers to pose as a browser
opener = urllib.request.build_opener()
opener.addheaders = [headers]
data = opener.open(url).read()
```

Way 2: install the opener globally, so it is applied automatically whenever a URL is opened.
- Build the opener and set addheaders exactly as above.
- Call urllib.request.install_opener(opener) to make it global; after that, a plain urllib.request.urlopen(url) sends the spoofed headers.
- Full code:

```python
import urllib.request

headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36")
url = 'https://www.qiushibaike.com/'
# call urllib's build_opener method and attach the headers to pose as a browser
opener = urllib.request.build_opener()
opener.addheaders = [headers]
# install_opener makes the opener global; urlopen now uses it automatically
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url).read()
print(data)
```
Adding dict-form headers through an opener (spoofing several header fields at once)
- Import the module: import urllib.request
- Write the headers as a dict (see the full code below).
- Call urllib's build_opener, iterate the dict into the list of (key, value) tuples that addheaders expects, and install the opener globally.
- Open the URL: data = urllib.request.urlopen(url).read()
- Full code:

```python
import urllib.request

url = 'https://www.qiushibaike.com/'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0",
    "Content-Type": "application/javascript",
}
opener = urllib.request.build_opener()
# turn the dict into the list of tuples that addheaders expects
headall = []
for key, value in headers.items():
    headall.append((key, value))
opener.addheaders = headall
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url).read()
print(data)
```
Adding headers through urllib.request's Request class
- Import the module: import urllib.request
- Wrap the URL, call add_header() on the wrapped request, then open it:

```python
import urllib.request

url = "https://www.qiushibaike.com/"
req = urllib.request.Request(url)
# attach the User-Agent header to the wrapped request
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0')
reqdata = urllib.request.urlopen(req).read().decode("utf-8", "ignore")
```
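Request can also take a headers dict directly at construction time, which is equivalent to calling add_header for each entry:

```python
import urllib.request

url = "https://www.qiushibaike.com/"
# pass the headers dict straight to the Request constructor
req = urllib.request.Request(url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
})
reqdata = urllib.request.urlopen(req).read().decode("utf-8", "ignore")
```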
urllib: IP Proxies
How to disguise your IP with urllib (by default, requests go out from your local IP):
- Import the module: import urllib.request
- Use urllib.request.ProxyHandler to wrap the proxy IP: proxy = urllib.request.ProxyHandler({"http": thisip})
- Wrap it with urllib.request.build_opener, then install_opener to make it global: opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler) followed by urllib.request.install_opener(opener)
- Full code, picking a proxy at random from a small pool (import random as well):

```python
import random
import urllib.request

# small pool of candidate proxies (ip:port)
ipo = [
    "144.123.70.245:9999",
    "121.15.254.156:888",
    "59.52.185.7:808",
    "123.169.34.249:9999",
    "202.112.51.51:8082",
]

def ip(ipo):
    thisip = random.choice(ipo)  # pick a random proxy from the pool
    print(thisip)
    # use urllib.request.ProxyHandler to route traffic through the proxy
    proxy = urllib.request.ProxyHandler({"http": thisip})
    # wrap it with build_opener and install it globally
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)

ip(ipo)
```
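Once the opener is installed, a plain urlopen goes through the chosen proxy. Free proxies fail often, so it is worth wrapping the call; a sketch:

```python
import urllib.request
import urllib.error

try:
    data = urllib.request.urlopen('http://www.baidu.com', timeout=10).read()
    print(len(data))
except urllib.error.URLError as e:
    # a dead or refusing proxy usually surfaces here
    print("proxy failed:", e)
```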
urllib: Writing to MySQL
- Detail: if the rows you insert come out garbled, locate the pymysql installation, open its connections.py file, search for charset, and set it to utf8 (note: "utf8", with no hyphen).
- Import the module: import pymysql
- Connect to the MySQL database as a user: conn = pymysql.connect(host="127.0.0.1", user="root", passwd="root", db="ts")
- Execute the INSERT statement with conn.query(...)
- When the statement inserts data, follow it with conn.commit()
- Full code (title, teacher, and price are assumed to hold values scraped earlier):

```python
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", passwd="root", db="ts")
# execute the SQL statement (returns no result set)
conn.query("INSERT INTO lesson(title,teacher,stu) VALUES('" + str(title) + "','" + str(teacher) + "','" + str(price) + "')")
conn.commit()  # commit the insert
```
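Building SQL by string concatenation breaks on quotes in the scraped data and invites injection; pymysql's cursor supports parameterized queries instead. A safer sketch of the same insert (the placeholder values stand in for scraped data, and passing charset="utf8" to connect avoids editing connections.py by hand):

```python
import pymysql

title, teacher, price = "demo title", "demo teacher", "100"  # placeholders for scraped values

conn = pymysql.connect(host="127.0.0.1", user="root", passwd="root", db="ts", charset="utf8")
cur = conn.cursor()
# %s placeholders let pymysql escape the values itself
cur.execute("INSERT INTO lesson(title,teacher,stu) VALUES(%s, %s, %s)", (title, teacher, price))
conn.commit()
cur.close()
conn.close()
```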
urllib: Qiushibaike Demo
- Full code. (The extraction regex was mangled in the original notes; the pattern below is a typical one for the old Qiushibaike page layout and may need adjusting.)

```python
import urllib.request
import re
import random
import time

uapools = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
]

def UA():
    opener = urllib.request.build_opener()
    # pick a random User-Agent each time
    thisua = random.choice(uapools)
    ua = ("User-Agent", thisua)
    opener.addheaders = [ua]
    # install the opener globally so urlopen carries the header
    urllib.request.install_opener(opener)
    print("Current UA: " + str(thisua))

UA()
for i in range(0, 35):
    UA()
    # build the URL for each results page
    thisurl = "http://www.qiushibaike.com/8hr/page/" + str(i + 1) + "/?s=4948859"
    data = urllib.request.urlopen(thisurl).read().decode("utf-8", "ignore")
    # pull the jokes out of the page (regex reconstructed, see the note above)
    pat = '<div class="content">.*?<span>(.*?)</span>'
    rst = re.compile(pat, re.S).findall(data)
    # loop over the matched list by index and print each item
    for j in range(0, len(rst)):
        print(rst[j])
        print("-------")
    time.sleep(1)  # time was imported in the original; a short pause between pages is polite
```
What the HTTP Status Codes Mean
- 200: normal access
- 403: access forbidden
- 404: page not found
- 500: internal error on the remote server
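Note that urllib raises HTTPError for 4xx/5xx responses instead of returning them, so in practice these codes are often met in an except block; a minimal sketch:

```python
import urllib.request
import urllib.error

try:
    file = urllib.request.urlopen('http://www.baidu.com')
    print(file.getcode())   # 200: normal access
except urllib.error.HTTPError as e:
    print(e.code)           # e.g. 403 forbidden, 404 not found, 500 server error
except urllib.error.URLError as e:
    print(e.reason)         # network-level failure (DNS, refused connection, ...)
```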