Python Web Crawler and Regular Expression Basics

Article link: https://blog.youkuaiyun.com/WBYLX/article/details/120894504

Study notes for October 21

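I. Introduction to Web Crawlers

1. What the law does not prohibit is permitted, so you may write crawlers, but keep the following in mind:
    ~ Conceal your identity
    ~ Do not act in any way that could be proven to have damaged someone else's property
    ~ Do not publish your code everywhere
    ~ Follow the crawler protocol whenever possible ---> robots.txt ---> a gentlemen's agreement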
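As a minimal sketch of how a crawler might honor robots.txt, the standard library's `urllib.robotparser` can answer whether a given path is allowed. The rules, the `example.com` URLs, and the `my-crawler` agent name below are hypothetical; the rules are inlined so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

# 'my-crawler' is a made-up user agent name used only for illustration.
allowed_public = parser.can_fetch('my-crawler', 'https://example.com/index.html')
allowed_private = parser.can_fetch('my-crawler', 'https://example.com/private/data.html')
print(allowed_public, allowed_private)
```

Against a live site, one would presumably call `set_url()` with the site's robots.txt address and then `read()` instead of `parse()`.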

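2. Types of crawlers:
    ~ General-purpose crawlers (search engines ---> fetch every kind of data)
    ~ Focused crawlers (pick a target and crawl data from one specific domain only)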

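3. Writing a crawler program
    ~ Fetch the page source ---> HTML code
    ~ Parse the page and extract the content ---> ???
    ~ Persist the data ---> CSV, Excel, databases, big-data platforms

    ---> data analysis, data visualization, data mining, predictive modeling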
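The parse and persist steps above can be sketched roughly as follows. This is only an illustration under assumptions: the HTML snippet stands in for a fetched page, regular expressions are just one possible parsing tool, and the names `parse_items` and `save_as_csv` are made up for this example.

```python
import csv
import io
import re

# Hypothetical page source standing in for a real HTTP response body.
SAMPLE_HTML = '<ul><li class="item">Alpha</li><li class="item">Beta</li></ul>'


def parse_items(html):
    """Step 2: extract the text of every <li class="item"> element."""
    return re.findall(r'<li class="item">(.*?)</li>', html)


def save_as_csv(items, file):
    """Step 3: persist the items as CSV rows of (index, name)."""
    writer = csv.writer(file)
    writer.writerow(['index', 'name'])
    for index, name in enumerate(items, start=1):
        writer.writerow([index, name])


items = parse_items(SAMPLE_HTML)
# An in-memory buffer keeps the sketch side-effect free; a real crawler
# could pass open('items.csv', 'w', newline='', encoding='utf-8') instead.
buffer = io.StringIO()
save_as_csv(items, buffer)
print(buffer.getvalue())
```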

4. URL ---> Uniform Resource Locator ---> web address
    Uniquely identifies a (network) resource

    http://www.baidu.com
    https://www.baidu.com:443/index.html
    https://14.215.177.38:443/index.html
    https://www.baidu.com:443/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png
    URI = URL + URN
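To see how a URL like the ones above breaks down into its parts, here is a small sketch using `urllib.parse.urlsplit` from the standard library on one of the example addresses.

```python
from urllib.parse import urlsplit

parts = urlsplit('https://www.baidu.com:443/index.html')

print(parts.scheme)    # protocol: https
print(parts.hostname)  # host name: www.baidu.com
print(parts.port)      # port number: 443
print(parts.path)      # resource path: /index.html
```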

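5. HTTP / HTTPS ---> Hypertext Transfer Protocol ---> a request-response protocol

    ~ HTTP request

    Request line: method, resource path, protocol version
        ~ GET
        ~ POST
    Request headers: key-value pairs (metadata about the user's request and the browser)
        ~ User-Agent
    Blank line: \r\n
    Message body: the data the browser sends to the server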

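    ~ HTTP response

    Status line: protocol version, status code, and (in HTTP/1.x) a reason phrase
        ~ 200
        ~ 404
    Response headers: key-value pairs (metadata about the response and the server)
    Blank line: \r\n
    Message body: the data the server sends to the browser (web pages, images, audio/video, JSON)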


    2xx ---> success
    3xx ---> redirection
    4xx ---> problem with the request (client error)
    5xx ---> server failure
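The message layout described above can be made concrete with a rough sketch that builds an HTTP/1.1 request as text and splits a response into its parts. The raw response and the helper names (`build_request`, `parse_response`, `status_class`) are hypothetical and simplified; real messages travel as bytes and are normally handled by a library such as requests.

```python
def build_request(method, path, host, user_agent):
    """Assemble a bodiless request: request line, headers, blank line."""
    lines = [
        f'{method} {path} HTTP/1.1',
        f'Host: {host}',
        f'User-Agent: {user_agent}',
        '',  # the blank line that ends the header section
        '',  # empty message body
    ]
    return '\r\n'.join(lines)


def parse_response(raw):
    """Split a raw response into status line fields, headers, and body."""
    head, _, body = raw.partition('\r\n\r\n')
    status_line, *header_lines = head.split('\r\n')
    version, code, reason = status_line.split(' ', 2)
    headers = dict(line.split(': ', 1) for line in header_lines)
    return version, int(code), reason, headers, body


def status_class(code):
    """Map a status code to the class named by its first digit."""
    classes = {2: 'success', 3: 'redirection', 4: 'client error', 5: 'server error'}
    return classes.get(code // 100, 'other')


# Hypothetical raw response used only for illustration.
RAW_RESPONSE = (
    'HTTP/1.1 404 Not Found\r\n'
    'Content-Type: text/html; charset=utf-8\r\n'
    '\r\n'
    'not found'
)

request = build_request('GET', '/index.html', 'www.baidu.com', 'demo-client')
version, code, reason, headers, body = parse_response(RAW_RESPONSE)
print(version, code, reason, status_class(code))
```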

6. urllib / requests

7. Regular expressions ---> a tool for defining string-matching patterns
    re (the module in Python's standard library)

II. Applications

import requests
# Fetch the page source
resp = requests.get('https://www.sohu.com/')
if resp.status_code == 200:
    # print(resp.content.decode('utf-8'))
    print(resp.encoding)   # get the encoding, e.g. UTF-8
    print(resp.text)

# Fetch an image
resp = requests.get('https://www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png')
if resp.status_code == 200:
    # Images are binary data, so write resp.content in 'wb' mode
    with open('baidu_logo.png', 'wb') as file:
        file.write(resp.content)

import requests

resp = requests.get('http://movie.douban.com/top250')
print(resp.status_code)   # 418
# Fetch the page source, this time sending browser-like request headers
resp = requests.get(
    url='https://movie.douban.com/top250',
    headers={
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'
    }
)
if resp.status_code == 200:
    print(resp.text)


III. Regular Expressions

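1. Matching from the first character

match - checks whether the string matches the regular expression, starting at the first character (the match may end before the string does)

fullmatch - checks whether the entire string matches the regular expression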
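A short sketch of the difference between the two functions, using a plain literal pattern so that no metacharacters are needed yet:

```python
import re

# match succeeds as long as the pattern fits at the start of the string.
prefix_match = re.match('abc', 'abcdef')

# fullmatch also requires the pattern to consume the whole string.
prefix_fullmatch = re.fullmatch('abc', 'abcdef')
exact_fullmatch = re.fullmatch('abc', 'abc')

# match never skips ahead: 'abc' is present here, but not at position 0.
late_match = re.match('abc', 'xabc')

print(prefix_match, prefix_fullmatch, exact_fullmatch, late_match)
```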

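2. Metacharacters

^ - start of the string

$ - end of the string

\d - a digit

\D - a non-digit

\w - a word character: a letter (upper or lower case), a digit, or an underscore (for str patterns, Python 3 also counts other Unicode word characters unless re.ASCII is set)

\W - anything that is not a word character

\s - a whitespace character

\S - a non-whitespace character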
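The metacharacters above can be tried out on a small string. This is just an illustrative sketch; the sample text is arbitrary, and the last two lines show how the `re.ASCII` flag narrows `\w` to English letters, digits, and underscore.

```python
import re

text = 'id_42 ok!'

digits = re.findall(r'\d', text)           # every digit
non_word = re.findall(r'\W', text)         # everything that is not a word character
spaces = re.findall(r'\s', text)           # every whitespace character
first_two = re.findall(r'^\w\w', text)     # two word characters at the very start
last_char = re.findall(r'\S$', text)       # a non-whitespace character at the very end

# Without re.ASCII, \w also accepts non-English word characters in str patterns.
unicode_words = re.findall(r'\w', '中a')
ascii_words = re.findall(r'\w', '中a', re.ASCII)

print(digits, non_word, spaces, first_two, last_char, unicode_words, ascii_words)
```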

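3. Character sets

[a-z] - matches any single character from those listed in the square brackets (here, one lowercase letter from a to z)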

4. Quantifiers

{n} - exactly n times

{n,m} - at least n times, at most m times

{n,} - at least n times

{,n} - at most n times
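As a quick sketch of how a character set combines with each quantifier form, the calls below use `fullmatch` so that the repetition count has to fit the whole string:

```python
import re

exactly_three = re.fullmatch(r'[a-z]{3}', 'abc')      # 3 letters: matches
one_too_many = re.fullmatch(r'[a-z]{3}', 'abcd')      # 4 letters: no match
two_to_four = re.fullmatch(r'[a-z]{2,4}', 'abcd')     # within 2..4: matches
at_least_two = re.fullmatch(r'[a-z]{2,}', 'a')        # only 1 letter: no match
at_most_two = re.fullmatch(r'[a-z]{,2}', '')          # zero repetitions are allowed

print(exactly_three, one_too_many, two_to_four, at_least_two, at_most_two)
```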

import re

tel = input('Enter a mobile number: ')
# matcher = re.fullmatch(r'1[3-9]\d{9}', tel)
matcher = re.match(r'^1[3-9]\d{9}$', tel)
# If the string matches the regular expression, a Match object is returned; otherwise None
print(matcher)
# Enter a mobile number: 17338822713
# <re.Match object; span=(0, 11), match='17338822713'>

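5. Lookahead and lookbehind

Lookahead (looks at what follows)

(?=\d) - must be followed by a digit

(?!\d) - must not be followed by a digit

Lookbehind (looks at what precedes)

(?<=\d) - must be preceded by a digit

(?<!\d) - must not be preceded by a digit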
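The four assertions only check the neighboring character and do not include it in the match. A small sketch on an arbitrary sample string shows each one selecting a different set of letters:

```python
import re

text = 'a1 b2 3c d'

followed_by_digit = re.findall(r'[a-z](?=\d)', text)       # letters right before a digit
not_followed_by_digit = re.findall(r'[a-z](?!\d)', text)   # letters not right before a digit
preceded_by_digit = re.findall(r'(?<=\d)[a-z]', text)      # letters right after a digit
not_preceded_by_digit = re.findall(r'(?<!\d)[a-z]', text)  # letters not right after a digit

print(followed_by_digit, not_followed_by_digit)
print(preceded_by_digit, not_preceded_by_digit)
```

In every result the digit itself is absent, since only the letter is consumed.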

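Application

import re


content = """Yu Ting's mobile number is 13011223344, not 15800224567, and not 110.
Luo Hao's number starting with 135 is no longer in use; please call 13899887766 instead.
Luo Hao does not have 1350099887766 yuan in the bank."""
# Match 11-digit mobile numbers that are not part of a longer run of digits
items = re.findall(r'(?<!\d)1[3-9]\d{9}(?!\d)', content)
for item in items:
    print(item)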

IV. Application

import requests
import re
from urllib.parse import urljoin

resp = requests.get('https://www.sohu.com/')
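if resp.status_code == 200:
    # Capture the href and title attributes of each <a> tag
    list1 = re.findall(r'<a.*? href="(.*?)".*?title="(.*?)"', resp.text)
    for href, title in list1:
        if not href.startswith('https://'):
            # Resolve relative links against the site root
            href = urljoin('https://www.sohu.com/', href)
        print(title, href)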
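To see what the extraction and `urljoin` steps do without hitting the network, here is a sketch that runs the same idea on a hand-written HTML fragment. The fragment and its three links are hypothetical and are not taken from the real Sohu page.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'https://www.sohu.com/'

# Hypothetical markup covering relative, protocol-relative, and absolute links.
SAMPLE_HTML = (
    '<a class="nav" href="/news" title="News">News</a>'
    '<a href="//sports.sohu.com/" title="Sports">Sports</a>'
    '<a href="https://mail.sohu.com/" title="Mail">Mail</a>'
)

pairs = re.findall(r'<a.*? href="(.*?)".*?title="(.*?)"', SAMPLE_HTML)
# urljoin leaves absolute URLs untouched, so it is safe to apply to every href.
links = [(title, urljoin(BASE_URL, href)) for href, title in pairs]

for title, href in links:
    print(title, href)
```

Regular expressions are fragile on real-world HTML (attribute order and quoting vary), so this pattern should be read as a learning exercise rather than a robust link extractor.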

V. Homework

import requests
import re


resp = requests.get(
    url='https://movie.douban.com/top250',
    headers={
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'
    }
)
if resp.status_code == 200:
    movie_list = re.findall(r'<span class="title">(?!&nbsp;/&nbsp;)(.*?)</span>', resp.text)
    grade_list = re.findall(r'<span class="rating_num" property="v:average">(.*?)</span>', resp.text)
    saying_list = re.findall(r'<span class="inq">(.*?)</span>', resp.text)
    print(movie_list)
    print(grade_list)
    print(saying_list)
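One possible next step for the homework is to pair the three lists up into one record per movie. The sketch below assumes markup shaped like the patterns above and uses a hypothetical two-movie fragment in place of the real Douban page.

```python
import re

# Hypothetical fragment imitating the span elements the patterns target.
SAMPLE_HTML = '''
<span class="title">Movie A</span>
<span class="title">&nbsp;/&nbsp;Alternate Title A</span>
<span class="rating_num" property="v:average">9.7</span>
<span class="inq">Quote A</span>
<span class="title">Movie B</span>
<span class="rating_num" property="v:average">9.6</span>
<span class="inq">Quote B</span>
'''

# The negative lookahead skips alternate titles, which start with "&nbsp;/&nbsp;".
titles = re.findall(r'<span class="title">(?!&nbsp;/&nbsp;)(.*?)</span>', SAMPLE_HTML)
ratings = re.findall(r'<span class="rating_num" property="v:average">(.*?)</span>', SAMPLE_HTML)
quotes = re.findall(r'<span class="inq">(.*?)</span>', SAMPLE_HTML)

# zip pairs items by position, so this is only valid when the lists line up.
movies = list(zip(titles, ratings, quotes))
for title, rating, quote in movies:
    print(title, rating, quote)
```

If any movie on a page lacks one of the fields, the lists would differ in length and `zip` would silently misalign the records, so comparing the three lengths first would be a sensible check.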
