Python Training Notes

Latest recommended article published on 2022-12-02 19:33:39

weixin_45797932

Category: Python Basics. Tags: python

Article link: https://blog.youkuaiyun.com/weixin_45797932/article/details/112723284

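Contents

HTTP
1. Understanding HTTP
2. Google in detail: pros and cons
3. pip package management
4. Basic requests syntax
5. Debug mode
6. HTML parsing: regular expressions
7. HTML parsing: the bs library
8. HTML parsing: XPath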

HTTP

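As artificial intelligence keeps developing, Python is becoming an increasingly important skill, and many people have started learning it. This article introduces some Python basics.

Note: the body of the article follows; the examples below are for reference.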

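1. Understanding HTTP

Overview
HTTP stands for HyperText Transfer Protocol. It has been in wide use on the WWW since 1990 and is the most widely used protocol on the web today. HTTP is an application-layer protocol: when you browse a web page, the browser and the web server send and receive data over the Internet using HTTP. HTTP is a stateless protocol built on the request/response model, which is what we usually call Request/Response.
Features:
Client/server: HTTP supports the client/server model.

Simple and fast: when a client asks the server for a service, it only needs to send the request method and the path. Because the protocol is simple, HTTP server programs are small and communication is fast.

Flexible: HTTP can transfer data objects of any type. The type being transferred is marked by Content-Type.

Connectionless: each connection handles only one request. Once the server has processed the client's request and received the client's acknowledgement, it closes the connection, which saves transmission time. (This describes HTTP/1.0; HTTP/1.1 keeps connections open by default.)

Stateless: HTTP is a stateless protocol, meaning the protocol has no memory of earlier transactions. If later processing needs earlier information, that information must be sent again, which can increase the amount of data sent per connection. On the other hand, the server responds faster when it does not need any earlier information.
[Figure: HTTP diagram]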
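The request/response model described above can be sketched without touching the network. The helper names (`build_get_request`, `parse_response`) and the sample response text are made up for illustration; the sketch only shows the shape of the messages and is not a full HTTP implementation.

```python
def build_get_request(host, path="/"):
    """Build the text of a minimal HTTP/1.1 GET request (illustrative helper)."""
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, path, protocol version
        f"Host: {host}",          # the Host header is required in HTTP/1.1
        "Connection: close",      # ask the server to close the connection afterwards
        "",                       # an empty line ends the header section
        "",
    ]
    return "\r\n".join(lines)


def parse_response(raw):
    """Split a raw HTTP response into status code, headers and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, _reason = status_line.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return int(code), headers, body


# A hand-written sample response, standing in for what a server might send back
sample_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Content-Length: 14\r\n"
    "\r\n"
    "<h1>hello</h1>"
)

request_text = build_get_request("www.baidu.com", "/s")
status, response_headers, body = parse_response(sample_response)
print(request_text)
print(status, response_headers["Content-Type"], body)
```

Nothing in the request refers to an earlier request, which is what "stateless" means in practice: anything the server needs, such as a cookie, has to be sent again each time.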

2. Google in detail: pros and cons

[Figure: image not preserved]

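3. pip package management

Built-in libraries
Package/library: code that someone else has already written. You import it directly, which speeds up development.
Built-in packages: general-purpose libraries that ship with the Python interpreter.
They live in the Lib folder of the interpreter's installation directory, for example os, time and urllib.
A folder that contains an __init__.py file becomes a package.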
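The point that a folder with an `__init__.py` file becomes a package can be checked with a small experiment. The package name `demo_pkg` and its `GREETING` variable are hypothetical, and the folder is created in a temporary directory so nothing is left behind.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Create <tmp>/demo_pkg/__init__.py; "demo_pkg" is a made-up name
    pkg_dir = Path(tmp) / "demo_pkg"
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text(
        "GREETING = 'hello from demo_pkg'\n", encoding="utf-8"
    )

    # Make the parent folder importable, then import the new package
    sys.path.insert(0, tmp)
    try:
        demo_pkg = importlib.import_module("demo_pkg")
        greeting = demo_pkg.GREETING
    finally:
        sys.path.remove(tmp)

print(greeting)
```

Built-in packages such as urllib are found the same way, except that their folder already sits on the interpreter's search path.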

Example code:

from urllib import request

# Send a GET request with the built-in urllib library
response = request.urlopen('http://baidu.com')

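Simulating HTTP requests and parsing HTML source
The built-in urllib and html libraries can do both jobs, but they are fairly basic and low-level and cannot cover every need.
In the Python 2 era there were urllib and urllib2.
A third-party developer then wrote a new HTTP request library, urllib3, which is more convenient than the official ones.
Another developer wrapped and refined urllib3 further, producing requests.
In the Python 3 era the built-in libraries were unified as urllib.
Conclusion: use the third-party requests library directly.
Third-party libraries
pypi.org hosts a rich collection of libraries for all kinds of purposes.
The pip package manager
Servers usually have no browser with a graphical interface, so third-party libraries for a programming language are normally installed with a command-line package manager.
<interpreter directory>/Scripts/pip.exe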

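pip -V  # show the pip version
pip search requests  # search for package information (no longer supported by pypi.org)
pip install requests  # [important] install a third-party library

pip uninstall requests  # remove an installed library
pip list  # list all installed libraries
pip freeze > requirements.txt  # export the libraries the project uses to a file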

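Switching mirrors
Software source: an index that maps tens of thousands of package names to the URLs they are downloaded from.
Downloads from the official pypi.org are slow, so several universities and large companies in China maintain synchronized mirrors.

Method 1: switch temporarily
pip install requests -i https://mirrors.aliyun.com/pypi/simple/

Method 2: switch permanently
Create a .pip folder containing a pip.conf file in the user's home directory and write the configuration there.
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple/
Method 3 (recommended): configure it in the PyCharm settings.
Settings / Interpreter / + button / Manage Repositories / add the address of a mirror in China.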

References
Switching pip mirrors (https://www.cnblogs.com/believepd/p/10499844.html)

Douban: https://pypi.doubanio.com/simple/

Aliyun: http://mirrors.aliyun.com/pypi/simple/
Tsinghua University: https://pypi.tuna.tsinghua.edu.cn/simple/
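For Method 2, the configuration file is a small INI file. The sketch below writes one with the standard configparser module, using the Tsinghua mirror listed above. The helper `write_pip_conf` is hypothetical, and the file goes to a temporary directory as a stand-in for the real location, which depends on the operating system (on Windows the file is named pip.ini).

```python
import configparser
import tempfile
from pathlib import Path

MIRROR_URL = "https://pypi.tuna.tsinghua.edu.cn/simple/"


def write_pip_conf(directory, index_url):
    """Write a pip.conf with a [global] index-url entry (illustrative helper)."""
    config = configparser.ConfigParser()
    config["global"] = {"index-url": index_url}
    conf_path = Path(directory) / "pip.conf"
    with conf_path.open("w", encoding="utf-8") as fh:
        config.write(fh)
    return conf_path


# A temporary directory stands in for the real pip config folder
with tempfile.TemporaryDirectory() as tmp:
    conf_path = write_pip_conf(tmp, MIRROR_URL)
    written = conf_path.read_text(encoding="utf-8")

print(written)
```

Once such a file is in the location pip reads, a plain `pip install requests` should use the mirror without the `-i` option.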

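4. Basic requests syntax

import requests

baidu_index = 'https://www.baidu.com'
baidu_search_url = 'http://www.baidu.com/s'  # Baidu search is reachable over plain http

# Forge the request headers: a basic way around anti-scraping checks
headers = {
    # 'Cookie': '',    # tied to common parameters and the user session
    # 'Referer': '',   # the page the request came from
    # Browser identifier. Easy to forge, but a request without one is easily spotted by the server
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
}

params = {
    'wd': '天气',  # search keyword ("weather")
    'ie': 'utf-8'
}
response = requests.get(url=baidu_search_url, params=params, headers=headers)
# Status code
status_code = response.status_code
if status_code == 200:
    # Page data as bytes
    content = response.content
    # Page data as str. Usually the text attribute is enough, but occasionally it is decoded wrongly and comes out garbled
    text = response.text
    # Baidu's response has to be decoded manually
    text = content.decode('utf-8')
    print(text)
    url = response.url
    headers = response.headers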
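Two details in the code above are easier to see offline: how `params` ends up in the URL, and why `response.text` can come out garbled. The sketch uses only the standard library to mimic both steps; it is an approximation of what requests does, not its actual implementation. The ISO-8859-1 fallback is, as far as I know, what requests assumes for text responses that declare no charset.

```python
from urllib.parse import urlencode

base_url = 'http://www.baidu.com/s'
params = {'wd': '天气', 'ie': 'utf-8'}

# Query parameters are percent-encoded as UTF-8 and appended after "?"
full_url = f'{base_url}?{urlencode(params)}'
print(full_url)

# The response body arrives as bytes and must be decoded with the right charset
raw_bytes = '天气'.encode('utf-8')
decoded_ok = raw_bytes.decode('utf-8')         # correct charset: original text
decoded_bad = raw_bytes.decode('iso-8859-1')   # wrong charset: garbled characters
print(decoded_ok)
print(repr(decoded_bad))
```

This is why the example calls `content.decode('utf-8')` itself instead of relying on `response.text`.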

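5. Debug mode

# Debug run mode
# 1. Set a breakpoint
import requests

response = requests.get(url='https://www.baidu.com')
# Status code
status_code = response.status_code
if status_code == 200:
    # Page data as bytes
    content = response.content
    # Page data as str. Usually the text attribute is enough, but occasionally it is decoded wrongly and comes out garbled
    text = response.text
    text = content.decode('utf-8')
    print(text)
    url = response.url
    headers = response.headers

Conclusion: debug mode makes it easier to track down errors.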
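Outside an IDE, a breakpoint can also be set in the code itself with the built-in `breakpoint()` function (Python 3.7 and later), which by default opens the pdb prompt. The function `classify_status` is a made-up example; the call is left commented out so the snippet does not stop when it is merely loaded.

```python
def classify_status(status_code):
    """Map an HTTP status code to a coarse label (illustrative helper)."""
    breakpoint()  # execution pauses here and the pdb prompt opens
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 500:
        return "client error"
    if 500 <= status_code < 600:
        return "server error"
    return "other"


# Uncomment to step through the function interactively:
# print(classify_status(404))
```

At the pdb prompt, `p status_code` prints a variable, `n` runs the next line, and `c` continues to the end. Setting the environment variable `PYTHONBREAKPOINT=0` turns every `breakpoint()` call into a no-op.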

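6. HTML parsing: regular expressions

# We have already used requests to simulate a request and fetched the page source as a str in HTML format.
# It now needs to be parsed.
html = '<html><body><h1>标题</h1></body></html>'
start_index = html.find('<h1>') + len('<h1>')  # skip past the opening tag itself
end_index = html.find('</h1>')
print(html[start_index:end_index])

# Parsing method 1: regular expressions (regex), a syntax designed for string processing

import re

text1 = 'asdpythonfghjkl;'
pattern1 = re.compile(r'python')
matcher1 = re.search(pattern1, text1)
print(matcher1[0])

text2 = '<h1>hello world</h1>'
pattern2 = re.compile(r'<h1>.+</h1>')
matcher2 = re.search(pattern2, text2)
print(matcher2[0])

# Practice samples
text3 = 'SeLectSELEct '
text4 = 'cat hat pat mat '
text5 = '969501808@qq.com '
# Validate an email address at sign-up; validate a username: a-z0-9, 6 to 10 characters.

# Manual: https://tool.oschina.net/uploads/apidocs/jquery/regexp.html
# Common regexes: https://www.cnblogs.com/qq364735538/p/11099572.html
# Summary: powerful string handling, but there is a lot of syntax and the rules are hard to write.
text6 = """
<html>
aaahelloaa
bbb
world
aaa
</html>
"""
text7 = """
<html>
aaa<h1>aa
bbb
world
aaa</h1>
</html>
"""
pattern10 = re.compile(r'hello.*?world', re.S)  # re.S lets . match newlines too
print(pattern10.findall(text6))
pattern11 = re.compile(r'<h1>(.*?)</h1>', re.S)  # the group captures what is between the tags
print(pattern11.findall(text7))

Experiment results: [screenshot not preserved]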
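The practice strings text3 and text4 and the sign-up rules mentioned in the comments (email, username of a-z0-9 with 6 to 10 characters) are left unsolved above. One possible set of answers is sketched below. The email pattern is deliberately simplified and the sample addresses are placeholders; real-world email validation is considerably looser than this.

```python
import re

text3 = 'SeLectSELEct '
text4 = 'cat hat pat mat '

# Case-insensitive search for the keyword "select"
keywords = re.findall(r'select', text3, flags=re.IGNORECASE)

# A character class matches any one of the listed letters before "at"
words = re.findall(r'[chpm]at', text4)

# Username rule from the notes: a-z or 0-9, 6 to 10 characters
USERNAME_RE = re.compile(r'[a-z0-9]{6,10}')

# Simplified email rule (an assumption, not a complete validator)
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+')


def is_valid_username(name):
    # fullmatch requires the whole string to fit the pattern
    return USERNAME_RE.fullmatch(name) is not None


def is_valid_email(address):
    return EMAIL_RE.fullmatch(address) is not None


print(keywords)
print(words)
print(is_valid_username('abc123'), is_valid_username('abc'))
print(is_valid_email('user@example.com'), is_valid_email('user@qq . com'))
```

Using `fullmatch` instead of `search` matters for validation: `search` would accept any string that merely contains a valid fragment.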

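7. HTML parsing: the bs library

# A web page's HTML is itself a tree of nested layers, so elements can be located level by level
# The library named beautifulsoup dates from the Python 2 era. Gotcha: the one for Python 3 is beautifulsoup4
# pip install beautifulsoup4
from bs4 import BeautifulSoup   # Gotcha: the import name differs from the name in the package metadata

html="""
<html>
    <body>
        <a id="aaa" href="http://www.baidu.com">百度一下</a>
        <a></a>
        <h1>hello</h1>
    </body>
</html>
"""
# First parse the string into an HTML structure, using the built-in html.parser or the third-party lxml
bs = BeautifulSoup(html, 'html.parser')    # or 'lxml'
print(bs.a)              # the first <a> tag
print(bs.find_all('a'))  # a list of all <a> tags
print(bs.a['href'])      # the href attribute of the first <a> tag
# Get parent and child tags

# Summary: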
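The "get parent and child tags" step is left empty above. A minimal sketch of tree navigation with bs4 follows, reusing the same HTML and assuming beautifulsoup4 is installed. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<html>
    <body>
        <a id="aaa" href="http://www.baidu.com">百度一下</a>
        <a></a>
        <h1>hello</h1>
    </body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')

# Locate one tag by name and id
link = soup.find('a', id='aaa')

# Upwards: the tag that directly contains the link
parent_name = link.parent.name

# Downwards: direct child tags of <body> only (recursive=False skips deeper levels)
child_tags = [tag.name for tag in soup.body.find_all(True, recursive=False)]

# Sideways: the next <h1> on the same level
sibling_name = link.find_next_sibling('h1').name

# Content and attributes of the tag itself
link_text = link.get_text()
href = link['href']

print(parent_name, child_tags, sibling_name, link_text, href)
```

Passing `True` as the tag name matches every tag, which is how the child list skips the whitespace text nodes between tags.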

8. HTML parsing: XPath

# XPath expressions have their own syntax, but it is less complex than regex; like bs4, it searches by HTML hierarchy
# pip install lxml
from lxml import etree
html = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8" />
    <title>lxml中xpath的用法</title>
</head>
<body>
    <ul>
        <li><a href="https://www.baidu.com" class="first_a">百度一下</a></li>
        <li><a href="https://mail.qq.com" id="second_a">QQ邮箱</a></li>
        <li><a href="https://www.taobao.com">淘宝网</a></li>
        <li>
            <a href="https://pypi.python.org" class="first_a">Python官网</a>
            <a href="https://pypi.python.org" class="second_a">Python</a>
        </li>
    </ul>
    <p class="one">first_p_tag</p>
    <p id="second">second_p_tag</p>
    <div class="one">
        first_div_tag
        <p class="first second third">11111111</p>
        <a href="#">22222222</a>
    </div>
</body>
</html>
"""

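# Parse the long string into an HTML document tree
dom = etree.HTML(html)
# print(dom)
# /  goes down one level;  // skips any number of ancestor levels
# /body/ul/li/a
# By default the whole document is searched: a match returns a list such as [element, element]; no match returns an empty list
print(dom.xpath('//a'))
print(dom.xpath('//ul/li/a'))


# Get an attribute of an HTML element
# @href returns the attribute value
print(dom.xpath('//a/@href'))
# Get the element's content
# /text()
print(dom.xpath('//a/text()'))
# Filter by attribute
print(dom.xpath('//a[@id="second_a"]/text()')[0])
# The rest of the syntax is rarely used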
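A few of the less common forms are still worth recognising: `contains()` for tags with several classes, positional filters such as `[1]` and `[last()]`, and the empty list returned when nothing matches. The sketch assumes lxml is installed and uses a shortened copy of the HTML above.

```python
from lxml import etree

html = """
<ul>
    <li><a href="https://www.baidu.com" class="first_a">百度一下</a></li>
    <li><a href="https://mail.qq.com" id="second_a">QQ邮箱</a></li>
</ul>
<div class="one">
    <p class="first second third">11111111</p>
</div>
"""
dom = etree.HTML(html)

# contains(): match a tag whose class attribute includes "second"
multi_class_text = dom.xpath('//p[contains(@class, "second")]/text()')

# Positional filters: XPath positions start at 1, and last() is the final one
first_link_text = dom.xpath('//ul/li[1]/a/text()')
last_link_href = dom.xpath('//ul/li[last()]/a/@href')

# No match: the result is an empty list, so check before indexing with [0]
missing = dom.xpath('//table')

print(multi_class_text, first_link_text, last_link_href, missing)
```

Because a failed match yields an empty list, indexing with `[0]` as in the attribute-filter example raises IndexError when the page structure changes; checking the list first avoids that.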