Python String Operations and Regular Expressions: From Basics to Practice

WhirlwindTourOfPython: The Jupyter Notebooks behind my O'Reilly report, "A Whirlwind Tour of Python". Project URL: https://gitcode.com/gh_mirrors/wh/WhirlwindTourOfPython

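Preface

String handling is one of the most basic and most important programming skills. Python's string operations are concise and efficient, which is a large part of why developers like them. This article walks through Python's built-in string methods and then introduces regular expressions, the tool for pattern matching that string methods cannot express.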

Python String Basics

A Python string can be written with single or double quotes; the two forms are equivalent:

str1 = 'Python string'
str2 = "Python string"
print(str1 == str2)  # Output: True

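For strings that span several lines, use triple quotes. Note that the string below begins and ends with a newline character, because the line breaks right after the opening quotes and right before the closing quotes are part of the string:

multiline_str = """
First line
Second line
Third line
"""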

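Changing Case

Python provides several methods for adjusting the case of a string:

sample = "tHe qUICk bROWn fOx."

# Convert to all uppercase
print(sample.upper())  # Output: THE QUICK BROWN FOX.

# Convert to all lowercase
print(sample.lower())  # Output: the quick brown fox.

# Capitalize the first letter of every word
print(sample.title())  # Output: The Quick Brown Fox.

# Capitalize only the first character; the rest becomes lowercase
print(sample.capitalize())  # Output: The quick brown fox.

# Swap upper and lower case
print(sample.swapcase())  # Output: ThE QuicK BrowN FoX.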

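Handling Whitespace

Removing whitespace from strings is a routine part of data cleaning. The examples print repr() of each result so that the remaining spaces are visible:

line = '   padded text   '

# Remove whitespace from both ends
print(repr(line.strip()))  # Output: 'padded text'

# Remove whitespace from the right end
print(repr(line.rstrip()))  # Output: '   padded text'

# Remove whitespace from the left end
print(repr(line.lstrip()))  # Output: 'padded text   '

You can also strip specific characters by passing them as an argument:

num = "000000435"
print(num.strip('0'))  # Output: 435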
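One detail that is easy to miss: the argument to strip(), lstrip(), and rstrip() is treated as a set of characters, not as a literal prefix or suffix. A minimal sketch of the difference follows; the file name is a made-up example, and str.removesuffix() assumes Python 3.9 or later.

```python
# The argument to rstrip() is a set of characters, not a literal suffix.
filename = "report.txt"  # hypothetical example value

# rstrip(".txt") removes any trailing '.', 't', or 'x' characters,
# so it also eats the final 't' of "report".
stripped = filename.rstrip(".txt")
print(stripped)  # repor

# str.removesuffix() (Python 3.9+) removes the exact suffix only.
exact = filename.removesuffix(".txt")
print(exact)  # report
```

When the goal is to drop a known prefix or suffix, removeprefix() and removesuffix() are usually the safer choice.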

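Padding and Alignment

Python provides several methods for aligning and padding strings. Widths are counted in characters:

text = "centered"

# Center within a total width of 30 characters
print(repr(text.center(30)))
# Output: '           centered           '

# Left-align and pad on the right
print(text.ljust(30, '-'))
# Output: centered----------------------

# Right-align and pad on the left with zeros
print("435".rjust(10, '0'))
# Output: 0000000435

# Dedicated zero-padding method
print("435".zfill(10))
# Output: 0000000435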

Finding and Replacing Substrings

Searching and replacing are among the most common string operations:

sentence = 'the quick brown fox jumps over the lazy dog'

# Find the position of a substring
print(sentence.find('fox'))  # Output: 16
print(sentence.index('fox'))  # Output: 16

# Search from the right
print(sentence.rfind('the'))  # Output: 31

# Check the beginning and the end
print(sentence.startswith('the'))  # Output: True
print(sentence.endswith('cat'))  # Output: False

# Replace a substring
print(sentence.replace('fox', 'cat'))
# Output: the quick brown cat jumps over the lazy dog
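find() and index() return the same position when the substring exists, so the difference only shows up when it is missing. The sketch below illustrates that case, along with the optional count argument of replace(); the search terms are arbitrary examples.

```python
sentence = "the quick brown fox jumps over the lazy dog"

# find() signals a missing substring with -1 ...
missing_pos = sentence.find("cat")
print(missing_pos)  # -1

# ... while index() raises ValueError instead.
try:
    sentence.index("cat")
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# replace() changes every occurrence unless a count is given.
replaced_all = sentence.replace("the", "a")
replaced_once = sentence.replace("the", "a", 1)
print(replaced_all)   # a quick brown fox jumps over a lazy dog
print(replaced_once)  # a quick brown fox jumps over the lazy dog
```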

Splitting Strings

Splitting is usually the first step in parsing delimited data:

data = "apple,orange,banana,grape"

# Simple split
print(data.split(','))
# Output: ['apple', 'orange', 'banana', 'grape']

# Split at most 2 times
print(data.split(',', 2))
# Output: ['apple', 'orange', 'banana,grape']

# Split from the right
print(data.rsplit(',', 1))
# Output: ['apple,orange,banana', 'grape']

# Partition at the first occurrence of the separator
print("hello.world.py".partition('.'))
# Output: ('hello', '.', 'world.py')
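Two related behaviors are worth a quick sketch: split() with no argument treats any run of whitespace as a single separator, and join() is the inverse of split(). The input string here is a made-up example.

```python
messy = "  apple   orange banana "  # hypothetical input with irregular spacing

# With no argument, split() splits on runs of whitespace and drops empty strings.
by_whitespace = messy.split()
print(by_whitespace)  # ['apple', 'orange', 'banana']

# With an explicit ' ', every single space is a separator, so empty strings remain.
by_space = messy.split(" ")
print(by_space)  # ['', '', 'apple', '', '', 'orange', 'banana', '']

# join() reverses a split: the separator string joins the list items.
joined = ",".join(by_whitespace)
print(joined)  # apple,orange,banana
```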

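Regular Expression Basics

Regular expressions offer more flexible pattern matching than string methods. Python supports them through the re module:

import re

# Simple matching
pattern = r'\d+'  # one or more digits
text = "There are 42 apples and 3 oranges"
print(re.findall(pattern, text))  # Output: ['42', '3']

# Substitution
print(re.sub(r'\d+', 'N', text))
# Output: There are N apples and N oranges

# Splitting (the trailing '.' leaves an empty string at the end)
print(re.split(r'\W+', 'Words, words, words.'))
# Output: ['Words', 'words', 'words', '']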
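Beyond findall(), sub(), and split(), re.search() returns a match object that exposes the matched text, its capture groups, and its position. A minimal sketch, assuming a made-up log line as input:

```python
import re

log_line = "2024-01-15 ERROR disk full"  # hypothetical input

# re.search() returns a match object for the first match, or None if there is none.
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", log_line)
if match:
    print(match.group(0))  # 2024-01-15
    print(match.groups())  # ('2024', '01', '15')
    print(match.span())    # (0, 10)

# Filtering out the empty strings that re.split() can leave at the edges.
words = [w for w in re.split(r"\W+", "Words, words, words.") if w]
print(words)  # ['Words', 'words', 'words']
```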

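Common Regular Expression Patterns

Knowing a handful of common patterns covers most everyday needs:

\d matches a digit
\w matches a word character (letter, digit, or underscore)
\s matches a whitespace character
. matches any character except a newline
* matches the preceding element 0 or more times
+ matches the preceding element 1 or more times
? matches the preceding element 0 or 1 times
{n} matches exactly n times
{n,} matches at least n times
{n,m} matches between n and m times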
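The quantifiers in the list above are easiest to compare side by side. The following sketch runs several of them against small made-up inputs:

```python
import re

sample = "a1 bb22 ccc333 dddd4444"  # hypothetical input

# {n}: exactly three digits; the run "4444" yields "444" and leaves one digit over.
exactly_three = re.findall(r"\d{3}", sample)
print(exactly_three)  # ['333', '444']

# {n,}: two or more digits.
two_or_more = re.findall(r"\d{2,}", sample)
print(two_or_more)  # ['22', '333', '4444']

# {n,m}: between two and three digits.
two_to_three = re.findall(r"\d{2,3}", sample)
print(two_to_three)  # ['22', '333', '444']

# ?: the 'u' is optional.
optional_u = re.findall(r"colou?r", "color colour")
print(optional_u)  # ['color', 'colour']

# * allows zero 'b' characters, + requires at least one.
star = re.findall(r"ab*", "a ab abbb")
plus = re.findall(r"ab+", "a ab abbb")
print(star)  # ['a', 'ab', 'abbb']
print(plus)  # ['ab', 'abbb']
```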

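Practical Examples

Extracting email addresses (a deliberately loose pattern, not a full address validator):

text = "Contact me: user@example.com or admin@site.org"
pattern = r'[\w.-]+@[\w.-]+'
print(re.findall(pattern, text))
# Output: ['user@example.com', 'admin@site.org']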

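Validating a mobile phone number (the pattern describes an 11-digit number that starts with 1 followed by a digit from 3 to 9, which is the mainland China mobile format):

def is_valid_phone(phone):
    # fullmatch() requires the whole string to match; with match() and '$',
    # a string ending in a newline would still be accepted
    return bool(re.fullmatch(r'1[3-9]\d{9}', phone))

print(is_valid_phone('13800138000'))  # Output: True
print(is_valid_phone('12345678901'))  # Output: False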

Removing HTML tags (adequate for simple, well-formed fragments):

html = "<p>This is <b>sample</b> text</p>"
clean_text = re.sub(r'<[^>]+>', '', html)
print(clean_text)  # Output: This is sample text
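The tag-stripping regex works on simple fragments but can misfire when a `>` appears inside an attribute value. As a rough alternative, and not something the original example uses, the standard library's html.parser module can collect the text nodes instead. `TextExtractor`, `strip_tags`, and the input string are hypothetical names and data chosen for this sketch.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of tags.
        self.parts.append(data)


def strip_tags(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


tricky = '<p title="a > b">This is <b>sample</b> text</p>'  # made-up input

# The regex treats the '>' inside the attribute as the end of the tag.
regex_result = re.sub(r"<[^>]+>", "", tricky)
print(regex_result)  # leftover attribute text remains in front of the sentence

# The parser understands quoted attribute values.
parser_result = strip_tags(tricky)
print(parser_result)  # This is sample text
```

For untrusted or messy HTML, a real parser is generally the more dependable route; the regex remains fine for quick cleanup of known-simple input.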

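Performance Considerations

When processing large volumes of text, consider the following:

Precompile regular expressions that are reused many times (the re module caches recently used patterns, so the benefit shows up mainly in tight loops):

pattern = re.compile(r'\d+')
result = pattern.findall(text)

Use non-greedy quantifiers (*?, +?) where a greedy quantifier would match more text than intended
For simple operations, prefer string methods over regular expressions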
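The difference between greedy and non-greedy quantifiers, and the reuse of a compiled pattern, can be sketched as follows. The markup string and the list of lines are made-up inputs.

```python
import re

markup = "<b>first</b> and <b>second</b>"  # hypothetical input

# Greedy: .* runs to the last </b>, so the two elements merge into one match.
greedy = re.findall(r"<b>(.*)</b>", markup)
print(greedy)  # ['first</b> and <b>second']

# Non-greedy: .*? stops at the first </b> that allows a match.
lazy = re.findall(r"<b>(.*?)</b>", markup)
print(lazy)  # ['first', 'second']

# A compiled pattern reused across many inputs.
number = re.compile(r"\d+")
lines = ["order 12", "no digits", "items 3 and 45"]
counts = [len(number.findall(line)) for line in lines]
print(counts)  # [1, 0, 2]
```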

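Summary

Python offers a rich set of string methods, from basic case conversion and whitespace handling to searching, replacing, and splitting. Regular expressions complement them by handling patterns that fixed substrings cannot describe. Together they cover most of what data cleaning, text analysis, and everyday scripting require.

Practice is the key to mastering these techniques. Try applying them in real projects and build up experience step by step.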

Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.