[Python] Study Notes -- CH6. Strings and Regular Expressions

Notes on Python strings and regular expressions


License: CC 4.0 BY-SA


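CH6. Strings and Regular Expressions

Common String Operations

Strings are an immutable data type in Python, so every method below returns a new value and leaves the original string unchanged.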

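Method	Description
`str.lower()`	Returns a new string with every letter of `str` converted to lowercase
`str.upper()`	Returns a new string with every letter of `str` converted to uppercase
`str.split(sep=None)`	Splits `str` on the separator `sep` and returns a list; with `sep=None` it splits on runs of whitespace
`str.count(sub)`	Returns the number of non-overlapping occurrences of `sub` in `str`
`str.find(sub)`	Searches `str` for `sub`; returns the index of the first occurrence, or -1 if `sub` is not found
`str.index(sub)`	Same as `find()`, except that it raises `ValueError` when `sub` is not found
`str.startswith(s)`	Checks whether `str` starts with the substring `s`
`str.endswith(s)`	Checks whether `str` ends with the substring `s`
`str.replace(old, new, count)`	Returns a new string with `old` replaced by `new`; all occurrences are replaced unless `count` limits the number of replacements
`str.center(width, fillchar)`	Centers `str` within the given width, padding with `fillchar` (a space by default)
`str.join(iter)`	Concatenates the strings in `iter`, inserting `str` between adjacent elements
`str.strip(chars)`	Removes from both ends of the string any characters listed in `chars` (whitespace by default); `chars` is treated as a set of characters, so their order does not matter
`str.lstrip(chars)`	Removes from the left end of the string any characters listed in `chars`
`str.rstrip(chars)`	Removes from the right end of the string any characters listed in `chars`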
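
A few of the methods in the table are easy to mix up, so here is a minimal sketch of the differences between `find()` and `index()`, and of how `join()` and `strip()` behave. The sample strings are made up for illustration and do not come from the notes.

```python
# Illustrative sample data (an assumption, not from the notes).
text = '  Hello, World  '

# find() returns -1 for a missing substring, while index() raises ValueError.
found = text.find('World')      # index of the first occurrence
missing = text.find('Python')   # substring is absent
try:
    text.index('Python')
    index_raised = False
except ValueError:
    index_raised = True

# join() places the separator between elements, not after each one.
joined = '-'.join(['a', 'b', 'c'])

# strip() treats its argument as a set of characters, so order is irrelevant.
stripped = 'xyhelloyx'.strip('xy')

print(found, missing, index_raised, joined, stripped)
```

Because strings are immutable, `text` itself is unchanged after all of these calls.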

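Three Ways to Format Strings

Placeholders (the % operator)
- %s: string
- %d: decimal integer
- %f: floating-point number
- …
f-string
- A formatting syntax introduced in Python 3.6; the values to substitute are written inside {}
str.format() method
- template_string.format(comma-separated arguments)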

# Format with placeholders
name = 'Zhang San'
age = 18
score = 99.5
print('Name: %s, Age: %d, Score: %.1f' % (name, age, score))

# f-string
print(f'Name: {name}, Age: {age}, Score: {score}')

# format() method
print('Name: {0}, Age: {1}, Score: {2}'.format(name, age, score))

'''
Name: Zhang San, Age: 18, Score: 99.5
Name: Zhang San, Age: 18, Score: 99.5
Name: Zhang San, Age: 18, Score: 99.5
'''

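Format Specification in Detail

:	Fill	Alignment	Width	,	.Precision	Type
Introducer	A single character used for padding	`<` left-align, `>` right-align, `^` center	Output width of the field	Thousands separator for numbers	Number of decimal places for a float, or maximum output length for a string	Integer types: b, d, o, x, X; float types: e, E, f, %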

s = 'helloworld'
print('{0:*<20}'.format(s)) # width 20, left-aligned, padded with *
print('{0:*>20}'.format(s)) # right-aligned
print('{0:*^20}'.format(s)) # centered
print(s.center(20, '*'))    # centering with the center() method

# Thousands separator (integers and floats only)
print('{0:,}'.format(987654321))
print('{0:,}'.format(987654321.1234))

# Precision of the fractional part of a float
print('{0:.2f}'.format(3.141592653))
print('{0:.3f}'.format(3.141592653))

# For strings, the precision is the maximum display length
print('{0:.5}'.format(s))

# Integers in different bases
a = 425
print('binary:{0:b},decimal:{0:d},octal:{0:o},hex:{0:x},HEX:{0:X}'.format(a))

# Float types
b = 3.1415926
print('{0:.2f},{0:.2E},{0:.2e},{0:.2%}'.format(b))

'''
helloworld**********
**********helloworld
*****helloworld*****
*****helloworld*****
987,654,321
987,654,321.1234
3.14
3.142
hello
binary:110101001,decimal:425,octal:651,hex:1a9,HEX:1A9
3.14,3.14E+00,3.14e+00,314.16%
'''
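
The same format specification also works inside f-strings, after the colon in each replacement field. The following is a small sketch under that assumption; the variable names and sample numbers are illustrative only.

```python
s = 'helloworld'
width = 20  # hypothetical field width

# The spec after ':' is the same as in str.format(); it can even be nested.
left = f'{s:*<{width}}'          # left-aligned, padded with *
center = f'{s:*^20}'             # centered
amount = f'{1234567.891:,.2f}'   # thousands separator plus 2 decimals
percent = f'{0.256:.1%}'         # percentage with 1 decimal

print(left, center, amount, percent, sep='\n')
```

Nesting `{width}` inside the spec is handy when the field width is only known at run time.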

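String Encoding and Decoding

Encoding a string
- Converting str to bytes is done with the string's encode() method
- Syntax
  - str.encode(encoding='utf-8', errors='strict'), where errors can be 'strict', 'ignore', or 'replace'
Decoding a string
- Converting bytes to str is done with the decode() method of bytes
- Syntax
  - bytes.decode(encoding='utf-8', errors='strict'), where errors can be 'strict', 'ignore', or 'replace'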

s = '我爱你'

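scode = s.encode(errors='replace') # UTF-8 by default; each of these Chinese characters takes 3 bytes in UTF-8
print(scode)

scode_gbk = s.encode(encoding='gbk', errors='replace') # in GBK each Chinese character takes 2 bytes
print(scode_gbk)

s2 = '耶☝'
scode2 = s2.encode('gbk', errors='ignore')
print(scode2)

# scode = s2.encode('gbk', errors='strict') # raises UnicodeEncodeError because '☝' cannot be encoded in GBK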

scode2 = s2.encode('gbk', errors='replace')
print(scode2)

print(bytes.decode(scode_gbk, 'gbk'))
print(bytes.decode(scode, 'utf-8'))
print(scode2.decode('gbk'))

'''
b'\xe6\x88\x91\xe7\x88\xb1\xe4\xbd\xa0'
b'\xce\xd2\xb0\xae\xc4\xe3'
b'\xd2\xae'
b'\xd2\xae?'
我爱你
我爱你
耶?
'''
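
Decoding only round-trips when it uses the same codec that produced the bytes. As a minimal sketch of what happens otherwise (the variable names here are illustrative), GBK bytes decoded as UTF-8 either raise an error or, with errors='replace', produce replacement characters:

```python
data = '我爱你'.encode('gbk')

# With the default errors='strict', a mismatched codec raises an exception.
try:
    data.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# errors='replace' substitutes U+FFFD for bytes that cannot be decoded.
fallback = data.decode('utf-8', errors='replace')

# Decoding with the matching codec restores the original text.
round_trip = data.decode('gbk')

print(decoded_ok, round_trip)
```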

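Data Validation

Data validation means checking that the data a user enters is "legal", that is, has the form the program expects.

Method	Description
str.isdigit()	All characters are digits (decimal Arabic digits; Unicode digit characters such as superscript '²' also count)
str.isnumeric()	All characters are numeric (this also accepts Chinese numerals and Roman numeral characters)
str.isalpha()	All characters are letters (Chinese characters count as letters)
str.isalnum()	All characters are digits or letters (Chinese characters count as letters)
str.islower()	All cased characters are lowercase
str.isupper()	All cased characters are uppercase
str.istitle()	Every word starts with an uppercase letter followed by lowercase letters
str.isspace()	All characters are whitespace (\n, \t, etc.)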

print('123'.isdigit()) # True
print('一二三'.isdigit()) # False
print('0b1010'.isdigit()) # False

print('123'.isnumeric()) # True
print('一二三'.isnumeric()) # True
print('0b1010'.isnumeric()) # False
print('壹贰叁'.isnumeric()) # True
print('ⅠⅡⅢ'.isnumeric()) # True

print('hello你好123'.isalpha()) # False
print('hello你好一二三'.isalpha()) # True
print('hello你好ⅠⅡⅢ'.isalpha()) # False

print('hello你好123'.isalnum()) # True
print('hello你好一二三'.isalnum()) # True
print('hello你好ⅠⅡⅢ'.isalnum()) # True

print('HelloWorld'.islower()) # False
print('helloworld'.islower()) # True
print('hello你好'.islower()) # True: Chinese characters have no case and are ignored

print('HelloWorld'.isupper()) # False
print('HELLOWORLD'.isupper()) # True
print('HELLO你好'.isupper()) # True: Chinese characters have no case and are ignored

print('Hello'.istitle()) # True
print('HelloWorld'.istitle()) # False
print('hello'.istitle()) # False
print('Hello World'.istitle()) # True
print('Hello world'.istitle()) # False

print('\t'.isspace()) # True
print(' '.isspace()) # True
print('\n'.isspace()) # True
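
These checks are typically used before converting user input. The following is a minimal sketch of such a validation step; `parse_age` is a hypothetical helper name, and `str.isascii()` requires Python 3.7 or later.

```python
def parse_age(raw):
    """Return the age as an int, or None if raw is not a plain decimal string."""
    raw = raw.strip()
    # isdecimal() is stricter than isdigit(): it rejects characters such as '²'.
    # isascii() additionally restricts the input to the digits 0-9.
    if raw.isdecimal() and raw.isascii():
        return int(raw)
    return None

print(parse_age('18'), parse_age(' 42 '), parse_age('一二三'), parse_age('-5'))
```

Rejecting the input early avoids a `ValueError` from `int()` on strings such as '²' that pass `isdigit()` but are not valid integer literals.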

String Processing

String Concatenation

Using the + operator

s1 = 'hello'
s2 = 'world'
print(s1+s2) # helloworld

Using the str.join() method

print(''.join(['hello', 'world'])) # helloworld
print('*'.join(['hello', 'world'])) # hello*world
print('你好'.join(['hello', 'world'])) # hello你好world

Adjacent string literals (this only works for literals, not for variables)

print('hello''world') # helloworld

Using string formatting

See the section on formatting strings above.

Removing Duplicate Characters from a String

Deduplicate by iterating over the characters

s = 'helloadjakhdiquwnandslajdsio'
new_s = ''
for item in s:
    if item not in new_s:
        new_s += item
print(new_s) # heloadjkiquwns

Deduplicate by iterating over indices

new_s2 = ''
for i in range(len(s)):
    if s[i] not in new_s2:
        new_s2 += s[i]
print(new_s2)

Deduplicate with a set, then sort the list

new_s3 = set(s)
lst = list(new_s3)  # a set is unordered
lst.sort(key=s.index) # sort by each character's first index in the original string
print(''.join(lst))
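
As an alternative to the three approaches above (not one the notes cover), an order-preserving dedup can be sketched with `dict.fromkeys()`, relying on the fact that dictionaries keep insertion order in Python 3.7 and later:

```python
s = 'helloadjakhdiquwnandslajdsio'

# dict.fromkeys() keeps the first occurrence of each key, in insertion order.
unique = ''.join(dict.fromkeys(s))
print(unique)
```

This avoids both the repeated `in` checks on a growing string and the `s.index` lookup used as a sort key.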

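Regular Expressions

Regular Expression Symbols

Metacharacters
- Special-purpose characters that carry a special meaning
- For example, ^ and $ match the start and the end of the string, respectively

Metacharacter	Description	Example string	Matches
`.`	Matches any character (except `\n`)	`'p\nytho\tn'`	p, y, t, h, o, \t, n
`\w`	Matches a letter, digit, or underscore	`'python\n123'`	p, y, t, h, o, n, 1, 2, 3
`\W`	Matches anything that is not a letter, digit, or underscore	`'python\n123'`	\n
`\s`	Matches any whitespace character	`'python\t123'`	\t
`\S`	Matches any non-whitespace character	`'python\t123'`	p, y, t, h, o, n, 1, 2, 3
`\d`	Matches any decimal digit	`'python\t123'`	1, 2, 3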
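
The table can be checked directly with `re.findall()`, which returns every match of a pattern. This is a small sketch using a made-up sample string rather than the ones in the table:

```python
import re

sample = 'py_3\t!'  # illustrative sample string (an assumption)

word_chars = re.findall(r'\w', sample)   # letters, digits, underscore
non_word = re.findall(r'\W', sample)     # everything else
spaces = re.findall(r'\s', sample)       # whitespace only
digits = re.findall(r'\d', sample)       # decimal digits only
dots = re.findall(r'.', 'a\nb')          # . skips the newline

print(word_chars, non_word, spaces, digits, dots)
```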

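Quantifiers
- Limit how many times the preceding character may match

Quantifier	Description	Example	Matches
`?`	Matches the preceding character 0 or 1 times	`'colou?r'`	color or colour
`+`	Matches the preceding character 1 or more times	`'colou+r'`	colour or colouu…r
`*`	Matches the preceding character 0 or more times	`'colou*r'`	color or colouu…r
`{n}`	Matches the preceding character exactly n times	`'colou{2}r'`	colouur
`{n,}`	Matches the preceding character at least n times	`'colou{2,}r'`	colouur or colouuu…r
`{n,m}`	Matches the preceding character at least n and at most m times	`'colou{2,4}r'`	colouur, colouuur, or colouuuur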
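
A quantifier applies to the single character (or group) immediately before it. The sketch below checks a few quantifiers against a hypothetical word list using `re.fullmatch()`, which succeeds only when the whole word matches:

```python
import re

# Hypothetical test words; the last one has five u's.
words = ['color', 'colour', 'colouur', 'colo' + 'u' * 5 + 'r']

def full_matches(pattern):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

optional = full_matches(r'colou?r')        # zero or one u
one_or_more = full_matches(r'colou+r')     # at least one u
exactly_two = full_matches(r'colou{2}r')   # exactly two u's
two_to_four = full_matches(r'colou{2,4}r') # between two and four u's

print(optional, one_or_more, exactly_two, two_to_four)
```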

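Other Characters

Other character	Description	Example	Result
Character class `[]`	Matches any one of the characters listed in []	`[.?!]` `[0-9]`	Matches a period, question mark, or exclamation mark; matches a digit 0-9
Negation `^`	Matches any character not listed in []	`[^0-9]`	Matches any character other than the digits 0-9
Alternation `|`	Matches the expression on either side of |	`\d{18}|…`
Escape `\`	Works like the escape character in Python	`\.`	Treats . as an ordinary character
`[\u4e00-\u9fa5]`	Matches any single Chinese character
Grouping `()`	Changes what a quantifier or alternation applies to	`six|fourth` `(six|…`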
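
To illustrate how grouping changes the scope of `|`, here is a minimal sketch. The pattern `(six|four)th` and the word list are illustrative assumptions, since the example in the table above is incomplete.

```python
import re

words = ['six', 'sixth', 'four', 'fourth']  # hypothetical test words

# Without a group, | splits the whole pattern: 'six' OR 'fourth'.
ungrouped = [w for w in words if re.fullmatch(r'six|fourth', w)]

# With a group, | only chooses between 'six' and 'four'; 'th' always follows.
grouped = [w for w in words if re.fullmatch(r'(six|four)th', w)]

# The character range [\u4e00-\u9fa5] picks out Chinese characters.
chinese = re.findall(r'[\u4e00-\u9fa5]', 'Python字符串')

print(ungrouped, grouped, chinese)
```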

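The re Module

re is a module in Python's standard library.
It implements regular expression operations in Python.

Function	Description
`re.match(pattern, string, flags=0)`	Matches at the start of the string; returns a Match object if the start of the string matches, otherwise None
`re.search(pattern, string, flags=0)`	Searches the whole string for the first match; returns a Match object on success, otherwise None
`re.findall(pattern, string, flags=0)`	Searches the whole string for every match of the pattern; returns a list
`re.sub(pattern, repl, string, count=0, flags=0)`	Replaces the substrings of the string that match the pattern
`re.split(pattern, string, maxsplit=0, flags=0)`	Splits the string, like the string method split(), but on a pattern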
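
The `count` and `maxsplit` parameters in the table limit how many replacements or splits are made; the default of 0 means no limit. The sketch below uses a made-up input string and passes both as keyword arguments:

```python
import re

text = 'a1b22c333d'  # illustrative input (an assumption)

first_only = re.sub(r'\d+', '#', text, count=1)  # replace only the first run of digits
all_replaced = re.sub(r'\d+', '#', text)         # count=0 replaces every match
limited = re.split(r'\d+', text, maxsplit=2)     # at most two splits

print(first_only, all_replaced, limited)
```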

The match function

import re

pattern = r'\d\.\d+' # pattern string (a raw string, so the backslashes need no doubling)
s = 'I study Python 3.12 every day' # string to match against
match = re.match(pattern, s, re.I)  # re.I makes matching case-insensitive
print(match) # match() only matches at the start of the string, so this is None

s2 = '3.12Python I study every day'
match2 = re.match(pattern, s2)
print(match2)  # <re.Match object; span=(0, 4), match='3.12'>

print('Start position:', match2.start())
print('End position:', match2.end())
print('Span of the match:', match2.span())
print('String being matched:', match2.string)
print('Matched text:', match2.group())

'''
None
<re.Match object; span=(0, 4), match='3.12'>
Start position: 0
End position: 4
Span of the match: (0, 4)
String being matched: 3.12Python I study every day
Matched text: 3.12
'''

The search function

import re

pattern = r'\d\.\d+' # pattern string

# search
s1 = 'I study Python 3.12 every day Python2.7 I love you'
match1 = re.search(pattern, s1)
print(match1)

s2 = '8.25 I study every day'
match2 = re.search(pattern, s2)
print(match2)

s3 = 'I study Python every day'
match3 = re.search(pattern, s3)
print(match3)

'''
<re.Match object; span=(15, 19), match='3.12'>
<re.Match object; span=(0, 4), match='8.25'>
None
'''

The findall function

match4 = re.findall(pattern, s1) # the result is a list of every match
print(match4)

# ['3.12', '2.7']

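The sub function

import re

pattern = 'hacking|crack|anti-scraping'
s = 'I want to learn Python hacking and crack some VIP videos. Can Python get around anti-scraping with no limits?'
new_s = re.sub(pattern, '*****', s)
print(new_s)

'''
I want to learn Python ***** and ***** some VIP videos. Can Python get around ***** with no limits?
'''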

The split function

s2 = 'https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu&wd=xfl&oq=xiang&rsv_pq=c178a51b02c90d4e'
pattern2 = '[?&]' # split on ? or &; inside [] a | would be a literal character, so it is not needed
lst = re.split(pattern2, s2)
print(lst)

'''
['https://www.baidu.com/s', 'ie=utf-8', 'f=8', 'rsv_bp=1', 'tn=baidu', 'wd=xfl', 'oq=xiang', 'rsv_pq=c178a51b02c90d4e']
'''
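
Splitting on `?` and `&` is a good regex exercise, but for real URLs the standard library's `urllib.parse` is usually a safer choice. This is a sketch of that alternative (not the approach used in these notes), applied to a shortened version of the URL above:

```python
from urllib.parse import urlsplit, parse_qs

url = 'https://www.baidu.com/s?ie=utf-8&f=8&wd=xfl'  # shortened for illustration

parts = urlsplit(url)            # scheme, netloc, path, query, fragment
params = parse_qs(parts.query)   # maps each key to a list of values

print(parts.netloc, parts.path, params)
```

`parse_qs()` returns lists because a key may legitimately appear more than once in a query string.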

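String Processing in Practice

Exercise 1

Determine where a license plate is registered
- Store N license plate numbers in a list, then iterate over the list and use string slicing to determine each plate's region

# Determine where a license plate is registered

lst = [
    '京A88888',
    '津B66666',
    '吉A66777'
]

for item in lst:
    area = item[0:1]  # the first character is the region abbreviation
    print(item, 'is registered in:', area)

'''
京A88888 is registered in: 京
津B66666 is registered in: 津
吉A66777 is registered in: 吉
'''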

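Exercise 2

Count the occurrences of a given character in a string
- Declare a string with the content "HelloPython,HelloJava,hellophp"; the user types the character to look for (case-insensitive), and the program counts how many times that character occurs in the string

# Count the occurrences of a given character in a string

s = 'HelloPython,HelloJava,hellophp'  # named s rather than str, which would shadow the built-in type
ch = input('Enter the character to count: ')
cnt = s.upper().count(ch.upper())
print(f'{ch} appears {cnt} time(s) in {s}')

'''
Enter the character to count: H
H appears 5 time(s) in HelloPython,HelloJava,hellophp
'''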

Exercise 3

Format product names and unit prices for output
- Store some product data in a list and print it in a loop; the product ID must be padded to 6 digits, and the unit price must show 2 decimal places with a yuan sign in front

# Format product names and unit prices for output

lst = [
    ['01', 'Fan', 'Midea', 500],
    ['02', 'Washer', 'TCL', 1000],
    ['03', 'Microwave', 'Robam', 400]
]

# Plain output, one tab-separated row per product
print('ID\tName\tBrand\tPrice')
for code, name, brand, price in lst:
    print(code, name, brand, price, sep='\t')

# Formatted output: ID zero-padded to 6 digits, price with 2 decimals and a ￥ prefix
print('\nID\tName\tBrand\tPrice')
for code, name, brand, price in lst:
    print(f'{code:0>6}\t{name}\t{brand}\t￥{price:.2f}')

'''
ID      Name    Brand   Price
01      Fan     Midea   500
02      Washer  TCL     1000
03      Microwave       Robam   400

ID      Name    Brand   Price
000001  Fan     Midea   ￥500.00
000002  Washer  TCL     ￥1000.00
000003  Microwave       Robam   ￥400.00
'''

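Exercise 4

Extract image link addresses from text
- Use a regular expression to extract image link addresses from the given text

# Extract the host part of an image link

import re

pattern = r'//\w*\.\w*\.\w*/'
text = 'https://img1.baidu.com/it/u=2721,1962&fm=26&fmt=auto'  # named text rather than str, which would shadow the built-in type

match = re.search(pattern, text)  # finds the first '//host/' part
print(match)
new_str = match.group()
print(new_str[2:-1])  # drop the leading // and the trailing /

'''
<re.Match object; span=(6, 23), match='//img1.baidu.com/'>
img1.baidu.com
'''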
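
The code above handles a single link. To collect every image link in a longer text, `re.findall()` can be used instead. The sketch below is one possible approach: the sample HTML, the example.com URLs, and the list of file extensions are all assumptions for illustration.

```python
import re

# Hypothetical page fragment containing two image links.
html = ('<img src="https://example.com/a.jpg"> some text '
        '<img src="http://example.org/img/b.png">')

# A URL starting with http(s)://, made of characters that cannot end an
# attribute value, and ending in a common image extension.
img_pattern = r'https?://[^\s"<>]+\.(?:jpeg|jpg|png|gif)'

links = re.findall(img_pattern, html)
print(links)
```

Because the group is written as `(?:...)`, it is non-capturing, so `findall()` returns the full URLs rather than just the extensions.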