53. Regular Expressions and an Introduction to Keras

rust6ferris

License: CC 4.0 BY-SA

Column: A Beginner's Guide to NLP and Machine Learning. Tags: regular expressions, Python re module

Article link: https://blog.youkuaiyun.com/rust6ferris/article/details/152431516

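Regular Expression Basics

Simple Matching

Regular expressions are a powerful tool for processing text. For example, \d+ matches one or more digits. The following code uses a regular expression to check whether a string starts with a date in digits/digits/digits form: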

import re
date1 = "12/31/2023"
date2 = "abc"
if re.match(r'\d+/\d+/\d+', date1):
    print('date1 matches this pattern')
else:
    print('date1 does not match this pattern')
if re.match(r'\d+/\d+/\d+', date2):
    print('date2 matches this pattern')
else:
    print('date2 does not match this pattern')

Output:

date1 matches this pattern
date2 does not match this pattern
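
Note that re.match() only anchors the pattern at the start of the string, so trailing text after the date is still accepted. The sketch below contrasts it with re.fullmatch(), which requires the whole string to match. The helper name is_date and the sample strings are assumptions added for illustration.

```python
import re

DATE_PATTERN = re.compile(r'\d+/\d+/\d+')

def is_date(text):
    """Return True only if the whole string has the form digits/digits/digits."""
    return DATE_PATTERN.fullmatch(text) is not None

samples = ["12/31/2023", "12/31/2023abc", "abc"]
for s in samples:
    # match() accepts a matching prefix; fullmatch() needs the entire string
    print(s, '| match:', bool(DATE_PATTERN.match(s)), '| fullmatch:', is_date(s))
```

Neither call validates the date itself: a string such as 99/99/9999 still passes both checks.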

Reversing Words in a String

The following code swaps a pair of words in a string:

import re
str1 = 'one two'
match = re.search(r'([\w.-]+) ([\w.-]+)', str1)
str2 = match.group(2) + ' ' + match.group(1)
print('str1:', str1)
print('str2:', str2)

Output:

str1: one two
str2: two one
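
The same swap can probably be written more compactly with re.sub() and the backreferences \1 and \2, which also avoids an error when the string has no second word (re.search() would return None in that case). This is a minimal sketch; the function name swap_pair is an assumption for illustration.

```python
import re

def swap_pair(text):
    """Swap the first two space-separated words using backreferences."""
    # \1 and \2 refer to the first and second captured groups
    return re.sub(r'([\w.-]+) ([\w.-]+)', r'\2 \1', text, count=1)

print(swap_pair('one two'))
print(swap_pair('hello'))  # no second word, so the string is returned unchanged
```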

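More Complex Regular Expressions

The following expression matches a string that consists only of digits, uppercase letters, or lowercase letters (that is, no special characters). The + quantifier is required; without it the pattern matches exactly one character:

^[a-zA-Z0-9]+$

The same expression can be written more compactly with the \w character class. Because \w also matches the underscore, the underscore has to be excluded explicitly (and in Python 3, \w matches non-ASCII letters and digits as well unless the re.ASCII flag is set):

^[^\W_]+$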
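
One way to check patterns like these is to run them against a few sample strings. The sketch below assumes the intent is "one or more ASCII letters or digits"; the + quantifier, the re.ASCII flag, and the sample strings are assumptions added for illustration.

```python
import re

# Explicit ranges: one or more ASCII letters or digits
explicit = re.compile(r'^[a-zA-Z0-9]+$')
# Shorthand: any word character except the underscore, restricted to ASCII
shorthand = re.compile(r'^[^\W_]+$', re.ASCII)

samples = ['abc123', 'ABC', 'a_b', 'a-b', '']
for s in samples:
    print(repr(s), bool(explicit.match(s)), bool(shorthand.match(s)))
```

With these definitions the two patterns should agree on every sample: only 'abc123' and 'ABC' are accepted.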

Modifying Text Strings with the re Module

Splitting Text Strings with re.split()

The re.split() method uses a regular expression to split a string into a list. For example:

import re
line1 = "abc def"
result1 = re.split(r'[\s]', line1)
print('result1:', result1)
line2 = "abc1,abc2:abc3;abc4"
result2 = re.split(r'[,:;]', line2)
print('result2:', result2)
line3 = "abc1,abc2:abc3;abc4 123 456"
result3 = re.split(r'[,:;\s]', line3)
print('result3:', result3)

Output:

result1: ['abc', 'def']
result2: ['abc1', 'abc2', 'abc3', 'abc4']
result3: ['abc1', 'abc2', 'abc3', 'abc4', '123', '456']
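
Two further options of re.split() are often useful: the maxsplit argument limits the number of splits, and a capturing group in the pattern keeps the delimiters in the result. A short sketch, reusing the sample string from above; the variable names are illustrative.

```python
import re

line = "abc1,abc2:abc3;abc4"

# maxsplit=1 stops after the first delimiter
first_split = re.split(r'[,:;]', line, maxsplit=1)
print(first_split)   # ['abc1', 'abc2:abc3;abc4']

# A capturing group around the delimiter class keeps the delimiters
with_delims = re.split(r'([,:;])', line)
print(with_delims)   # ['abc1', ',', 'abc2', ':', 'abc3', ';', 'abc4']
```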

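Splitting Text Strings on Digits and Delimiters

The following code splits text strings with a regular expression made up of digits, a period, and a space: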

import re
line1 = '1. Section one 2. Section two 3. Section three'
line2 = '11. Section eleven 12. Section twelve 13. Section thirteen'
print(re.split(r'\d+\. ', line1))
print(re.split(r'\d+\. ', line2))

Output:

['', 'Section one ', 'Section two ', 'Section three']
['', 'Section eleven ', 'Section twelve ', 'Section thirteen']
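
The leading empty string appears because the text starts with a delimiter, and the trailing spaces are left over from the text between delimiters. If that is unwanted, one possible cleanup is sketched below; the helper name split_sections is an assumption for illustration.

```python
import re

def split_sections(text):
    """Split on numbered markers such as '1. ' and drop empty or padded items."""
    parts = re.split(r'\d+\.\s*', text)
    return [p.strip() for p in parts if p.strip()]

print(split_sections('1. Section one 2. Section two 3. Section three'))
# ['Section one', 'Section two', 'Section three']
```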

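Replacing Text with re.sub()

The re.sub() method finds every substring that matches a regular expression and replaces it with a different string; the optional count argument limits the number of replacements. The last call below uses an empty pattern, which matches at every position in the string, so a newline is inserted before and after each character: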

import re
p = re.compile('(one|two|three)')
print(p.sub('some', 'one book two books three books'))
print(p.sub('some', 'one book two books three books', count=1))
line = 'abcde'
line2 = re.sub('', '\n', line)
print('line2:', line2)

Output:

some book some books some books
some book two books three books
line2: 
a
b
c
d
e
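
The replacement passed to re.sub() can also be a function that receives each match object and returns the replacement text. The sketch below is illustrative only; the NUMBERS mapping and the to_digit helper are assumptions, not part of the original example.

```python
import re

NUMBERS = {'one': '1', 'two': '2', 'three': '3'}
pattern = re.compile(r'\b(one|two|three)\b')

def to_digit(match):
    """Map the matched number word to its digit."""
    return NUMBERS[match.group(1)]

text = 'one book two books three books'
replaced = pattern.sub(to_digit, text)
print(replaced)          # 1 book 2 books 3 books

# subn() returns the new string together with the number of replacements
replaced_again, count = pattern.subn(to_digit, text)
print(count)             # 3
```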

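Matching the Start and End of Text Strings

The following code uses the string methods startswith() and endswith() to test the beginning and end of each substring: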

import re
line2 = "abc1,Abc2:def3;Def4"
result2 = re.split(r'[,:;]', line2)
for w in result2:
    if w.startswith('Abc'):
        print('Word starts with Abc:', w)
    elif w.endswith('4'):
        print('Word ends with 4:', w)
    else:
        print('Word:', w)

Output:

Word: abc1
Word starts with Abc: Abc2
Word: def3
Word ends with 4: Def4
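
The same tests can be expressed with regular expressions, which becomes convenient once the condition is more than a fixed prefix or suffix (for example, a case-insensitive prefix). A minimal sketch on the same sample string; the variable names are illustrative.

```python
import re

words = re.split(r'[,:;]', "abc1,Abc2:def3;Def4")

# re.match() tests at the start of the string, like startswith()
starts_with_abc = [w for w in words if re.match(r'Abc', w)]
# $ anchors the pattern at the end, like endswith()
ends_with_4 = [w for w in words if re.search(r'4$', w)]
# re.I makes the prefix test case-insensitive
any_case_abc = [w for w in words if re.match(r'abc', w, re.I)]

print(starts_with_abc)  # ['Abc2']
print(ends_with_4)      # ['Def4']
print(any_case_abc)     # ['abc1', 'Abc2']
```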

Finding Text Strings in a Specific Format

The following code uses regular expressions to find text strings that have a specific format:

import re
line1 = "abcdef"
line2 = "123,abc1,abc2,abc3"
line3 = "abc1,abc2,123,456f"
if re.match(r"^[A-Za-z]*$", line1):
    print('line1 contains only letters:', line1)
# str.isalpha() performs the same test without a regular expression
if line1.isalpha():
    print('line1 contains only letters:', line1)
# \w also matches digits and underscores, so this test is looser
if re.match(r"^\w*$", line1):
    print('line1 contains only word characters:', line1)
# [^\W\d_] matches a word character that is neither a digit nor an underscore
if re.match(r"^[^\W\d_]+$", line1):
    print('line1 contains only letters:', line1)
if re.match(r"^[0-9][0-9][0-9]", line2):
    print('line2 starts with 3 digits:', line2)
if re.match(r"^\d\d\d", line2):
    print('line2 starts with 3 digits:', line2)
if re.match(r".*[0-9][0-9][0-9][a-z]$", line3):
    print('line3 ends with 3 digits and 1 char:', line3)
if re.match(r".*[a-z]$", line3):
    print('line3 ends with 1 char:', line3)

Output:

line1 contains only letters: abcdef
line1 contains only letters: abcdef
line1 contains only word characters: abcdef
line1 contains only letters: abcdef
line2 starts with 3 digits: 123,abc1,abc2,abc3
line2 starts with 3 digits: 123,abc1,abc2,abc3
line3 ends with 3 digits and 1 char: abc1,abc2,123,456f
line3 ends with 1 char: abc1,abc2,123,456f

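Compilation Flags

Compilation flags change how a regular expression behaves. In the re module each flag has a long name (such as IGNORECASE) and a one-letter short form (such as I). Combine several flags with the | operator; for example, re.I | re.M sets both the I and M flags. The Python online documentation lists all available flags.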
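
As a small illustration of combining flags, the sketch below searches a multi-line string with and without re.I and re.M. The sample text and variable names are assumptions chosen for the example.

```python
import re

text = "Alpha one\nbeta two\nALPHA three"

# No flags: ^ matches only at the start of the string, and case matters
no_flags = re.findall(r'^alpha', text)
# re.I ignores case
ignore_case = re.findall(r'^alpha', text, re.I)
# re.I | re.M also lets ^ match at the start of every line
both_flags = re.findall(r'^alpha', text, re.I | re.M)

print(no_flags)     # []
print(ignore_case)  # ['Alpha']
print(both_flags)   # ['Alpha', 'ALPHA']
```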

Compound Regular Expressions

The following code uses the pipe symbol | to specify two alternative regular expressions:

import re
line1 = "This is a line"
line2 = "That is a line"
if re.match("^[Tt]his", line1):
    print('line1 starts with This or this:')
    print(line1)
else:
    print('no match')
if re.match(r"^(This|That)", line2):
    print('line2 starts with This or That:')
    print(line2)
else:
    print('no match')

Output:

line1 starts with This or this:
This is a line
line2 starts with This or That:
That is a line
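
The | operator has the lowest precedence in a regular expression, so a pattern written as ^This|That is read as (^This)|(That): the anchor applies only to the first alternative. With re.match() this makes no difference, because matching always starts at the beginning of the string, but with re.search() it does. A short sketch with an illustrative sample string:

```python
import re

line = "Say That again"

# Without a group, "That" is allowed anywhere in the string
ungrouped = bool(re.search(r'^This|That', line))
# With a group, both alternatives must appear at the start
grouped = bool(re.search(r'^(This|That)', line))

print(ungrouped)  # True
print(grouped)    # False
```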

Counting Character Types in a String

The following code uses regular expressions to count the digits, letters, and other characters in a string:

import re
charCount = 0
digitCount = 0
otherCount = 0
line1 = "A line with numbers: 12 345"
for ch in line1:
    if re.match(r'\d', ch):
        digitCount = digitCount + 1
    elif re.match(r'\w', ch):
        charCount = charCount + 1
    else:
        otherCount = otherCount + 1
print('charcount:', charCount)
print('digitcount:', digitCount)
print('othercount:', otherCount)

Output:

charcount: 16
digitcount: 5
othercount: 6
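
The character-by-character loop can also be replaced by re.findall(), which returns every match in one call. The following is a sketch of that alternative rather than the original approach; unlike \w, the class [^\W\d_] does not count underscores as letters.

```python
import re

line1 = "A line with numbers: 12 345"

digit_count = len(re.findall(r'\d', line1))
# [^\W\d_] matches word characters that are neither digits nor underscores
letter_count = len(re.findall(r'[^\W\d_]', line1))
other_count = len(line1) - digit_count - letter_count

print('letters:', letter_count)  # 16
print('digits:', digit_count)    # 5
print('others:', other_count)    # 6
```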

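Grouping in Regular Expressions

You can group subexpressions of a regular expression with parentheses and then refer to each group by number or by name. Some examples follow. In each pattern, {3} is shorthand for {3,3}.

Match zero or one occurrence of a string of three consecutive letters or digits:

^([a-zA-Z0-9]{3})?

Match a US telephone number:

^\d{3}-\d{3}-\d{4}

Match a US ZIP code, with an optional four-digit ZIP+4 extension:

^\d{5}(-\d{4})?

Partially match an email address: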

import re
email = 'john.doe@google.com'
# \w does not match '.', so only part of the address is matched
match = re.search(r'\w+@\w+', email)
if match:
    print(match.group())

Output:

doe@google
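
To capture the whole address and pick its parts out by group, the character class can be widened to include periods and hyphens. The sketch below also shows named groups on a telephone number. Both patterns are simplified illustrations, not complete validators, and the group names area, exchange, and number are assumptions.

```python
import re

email = 'john.doe@google.com'
match = re.search(r'([\w.-]+)@([\w.-]+)', email)
if match:
    print(match.group(0))  # john.doe@google.com (the whole match)
    print(match.group(1))  # john.doe (first group: local part)
    print(match.group(2))  # google.com (second group: domain)

# Named groups make the parts of a match self-describing
phone = re.match(r'(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<number>\d{4})$',
                 '555-123-4567')
if phone:
    print(phone.groupdict())
```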

Simple String Matching

The following code defines regular expressions that match various text strings:

import re
searchString = "Testing pattern matches"
expr1 = re.compile(r"Test")
expr2 = re.compile(r"^Test")
expr3 = re.compile(r"Test$")
expr4 = re.compile(r"\b\w*es\b")
expr5 = re.compile(r"t[aeiou]", re.I)
if expr1.search(searchString):
    print('"Test" was found.')
if expr2.match(searchString):
    print('"Test" was found at the beginning of the line.')
# Use search() here: match() only tests at the start of the string
if expr3.search(searchString):
    print('"Test" was found at the end of the line.')
result = expr4.findall(searchString)
if result:
    print('There are %d word(s) ending in "es":' % (len(result)))
    for item in result:
        print(" " + item)
result = expr5.findall(searchString)
if result:
    print('The letter t, followed by a vowel, occurs %d times:' % (len(result)))
    for item in result:
        print(" " + item)

Output:

"Test" was found.
"Test" was found at the beginning of the line.
There are 1 word(s) ending in "es":
 matches
The letter t, followed by a vowel, occurs 3 times:
 Te
 ti
 te

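Other Regular Expression Topics

Beyond the Python search and replace features covered above, you can also control whether a quantifier is greedy (matching as much text as possible) or non-greedy (matching as little as possible) when searching and replacing. The Python documentation for the re module describes these features and how to use them in code.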
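
As a brief illustration of greedy versus non-greedy matching, the sketch below applies both forms to a short string containing tags. The sample string is an assumption for the example, and regular expressions are not a robust way to parse real HTML.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* runs to the last '>' in the string
greedy = re.findall(r'<.*>', html)
# Non-greedy: .*? stops at the first '>' after each '<'
lazy = re.findall(r'<.*?>', html)
# Non-greedy replacement removes each tag separately
stripped = re.sub(r'<.*?>', '', html)

print(greedy)    # ['<b>bold</b> and <i>italic</i>']
print(lazy)      # ['<b>', '</b>', '<i>', '</i>']
print(stripped)  # bold and italic
```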

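Regular Expression Exercises

- Given a text string, find the list of words that start or end with a vowel (treat uppercase and lowercase vowels as distinct letters), and display the list in alphabetical order and in descending order of frequency.
- Given a text string, find the list of words that contain lowercase vowels, digits, or both, but no uppercase letters, and display the list in alphabetical order and in descending order of frequency.
- English has a spelling rule, "i before e, except after c". Write a Python script that checks a text string for words that break this rule.
- In English, a subject pronoun cannot follow a preposition. Write a Python script that checks a text string for this grammatical error: search for the prepositions "between", "for", and "with" followed by the subject pronouns "I", "you", "he", and "she", then correct the text and display it.
- Find the words in a text string that are at most four characters long, then print all the substrings of those words.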

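Introduction to Keras

What Is Keras?

Keras is a high-level neural network API that is tightly integrated into TensorFlow 2 under the tf.keras namespace. It is well suited to defining models for a wide range of tasks, such as linear regression, logistic regression, and deep learning tasks that involve convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory networks (LSTMs).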

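Using the Keras Namespaces in TF 2

TF 2 provides the tf.keras namespace, which contains the following namespaces, among others:
- tf.keras.layers: the layers used in Keras models.
- tf.keras.models: the different types of Keras models.
- tf.keras.optimizers: optimizers (such as Adam).
- tf.keras.utils: utility classes.
- tf.keras.regularizers: regularizers (such as L1 and L2).

There are currently three ways to create a Keras model:
- The Sequential class: the simplest and most intuitive approach. You specify a list of layers, most of which are found in the tf.keras.layers namespace.
- The functional API: layers are called like functions and chained in a pipeline-like fashion, which gives additional flexibility (for example, multiple inputs or outputs).
- Model subclassing: the most flexible approach. You define a Python class that encapsulates the semantics of the Keras model; the class is a subclass of tf.keras.Model and must implement the __init__ and call methods.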
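
The three approaches can be sketched side by side by building the same small network each way. This is a minimal illustration assuming TensorFlow 2 is installed; the layer sizes, the input width of 4, and the class name SmallRegressor are arbitrary choices for the example.

```python
import tensorflow as tf

# 1. Sequential: a plain list of layers
sequential_model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1),
])

# 2. Functional API: call each layer on the output of the previous one
inputs = tf.keras.Input(shape=(4,))
hidden = tf.keras.layers.Dense(8, activation='relu')(inputs)
outputs = tf.keras.layers.Dense(1)(hidden)
functional_model = tf.keras.Model(inputs=inputs, outputs=outputs)

# 3. Subclassing: create layers in __init__, define the forward pass in call
class SmallRegressor(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.hidden = tf.keras.layers.Dense(8, activation='relu')
        self.out = tf.keras.layers.Dense(1)

    def call(self, inputs):
        return self.out(self.hidden(inputs))

subclassed_model = SmallRegressor()

# Each model maps a batch of 2 samples with 4 features to 2 predictions
batch = tf.zeros((2, 4))
for model in (sequential_model, functional_model, subclassed_model):
    print(type(model).__name__, tuple(model(batch).shape))
```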

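Using the tf.keras.layers Namespace

The most common and simplest Keras model is the Sequential class, which is built from layers in the tf.keras.layers namespace. Common layers include:
- tf.keras.layers.Conv2D(): a two-dimensional convolution layer for convolutional neural networks (CNNs).
- tf.keras.layers.MaxPooling2D(): a two-dimensional max pooling layer for CNNs.
- tf.keras.layers.Flatten(): flattens each input sample into a one-dimensional vector.
- tf.keras.layers.Dense(): a fully connected layer.
- tf.keras.layers.Dropout(): a dropout layer that helps reduce overfitting.
- tf.keras.layers.BatchNormalization(): a batch normalization layer.
- tf.keras.layers.Embedding(): an embedding layer.
- tf.keras.layers.RNN(): the base recurrent layer, which wraps an RNN cell; tf.keras.layers.SimpleRNN() is the ready-made simple recurrent layer.
- tf.keras.layers.LSTM(): a long short-term memory layer.
- tf.keras.layers.Bidirectional(): a wrapper that runs a recurrent layer such as LSTM in both directions; often used for natural language processing (NLP) tasks.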
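
To show how several of these layers fit together, here is a small sketch of a text classifier that stacks Embedding, Bidirectional(LSTM), Dropout, and Dense. It assumes TensorFlow 2 is installed; the vocabulary size of 1000, the sequence length of 20, and the layer widths are arbitrary values chosen for illustration.

```python
import tensorflow as tf

text_model = tf.keras.Sequential([
    # Each sample is a sequence of 20 integer token ids
    tf.keras.Input(shape=(20,), dtype='int32'),
    # Map each token id (0..999) to a 16-dimensional vector
    tf.keras.layers.Embedding(input_dim=1000, output_dim=16),
    # Run an LSTM forwards and backwards and concatenate the two results
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(8)),
    tf.keras.layers.Dropout(0.5),
    # One sigmoid unit for a binary label
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

tokens = tf.zeros((2, 20), dtype=tf.int32)
print(tuple(text_model(tokens).shape))
```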

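Keras Examples

The sections that follow give Keras examples for linear regression, a multilayer perceptron (MLP) trained on the MNIST dataset, a neural network trained on the CIFAR-10 dataset, and a model that uses early stopping.

Summary

So far, this article has shown how to create various kinds of regular expressions, from basic expressions for digits and letters to more complex expressions built on character classes. It has also shown how to use Python's re library to compile regular expressions and check whether they match substrings of a text string. In addition, it introduced the basic concepts of Keras, its namespaces and common layers, and the different ways to create a Keras model.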

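Keras Code Examples

Linear Regression Example

The following is a simple example of linear regression with Keras. It assumes a simple CSV file in which the last column holds the target value and the other columns hold the input features.

import numpy as np
import pandas as pd
from tensorflow import keras

# Load the data: every column except the last is a feature, the last is the target
data = pd.read_csv('simple_data.csv')
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# Build the model: a single Dense unit computes a weighted sum of the features plus a bias
model = keras.Sequential([
    keras.Input(shape=(X.shape[1],)),
    keras.layers.Dense(1)
])

# Compile the model with mean squared error as the loss
model.compile(optimizer='adam', loss='mse')

# Train the model
model.fit(X, y, epochs=100, batch_size=32)

# Predict for a new sample (this shape assumes the CSV has exactly one feature column)
new_X = np.array([[1.5]])
prediction = model.predict(new_X)
print('Prediction:', prediction)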

An MLP Neural Network on the MNIST Dataset

MNIST is a widely used dataset of handwritten digits. The following example trains an MLP neural network on it with Keras.

from tensorflow import keras

# Load the MNIST dataset
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Preprocess: scale pixel values from [0, 255] to [0, 1]
train_images = train_images / 255.0
test_images = test_images / 255.0

# Build the model
model = keras.Sequential([
    keras.Input(shape=(28, 28)),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

# Compile the model; the labels are integers, hence the sparse loss
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(train_images, train_labels, epochs=5)

# Evaluate the model
test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Test accuracy:', test_acc)

Training a Neural Network on the CIFAR-10 Dataset

CIFAR-10 is a dataset of images in 10 different classes. The following example trains a small convolutional neural network on it with Keras.

from tensorflow import keras

# Load the CIFAR-10 dataset
cifar10 = keras.datasets.cifar10
(train_images, train_labels), (test_images, test_labels) = cifar10.load_data()

# Preprocess: scale pixel values from [0, 255] to [0, 1]
train_images = train_images / 255.0
test_images = test_images / 255.0

# Build the model: one convolution and pooling stage, then a dense classifier
model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    keras.layers.Conv2D(32, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(train_images, train_labels, epochs=10, batch_size=32)

# Evaluate the model
test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Test accuracy:', test_acc)

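A Keras Model with Early Stopping

Early stopping ends training once a monitored metric stops improving. In the example below, training stops when the validation loss has not improved for three consecutive epochs (patience=3).

from tensorflow import keras
from tensorflow.keras.callbacks import EarlyStopping

# Load the data (MNIST is used as an example)
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Preprocess: scale pixel values from [0, 255] to [0, 1]
train_images = train_images / 255.0
test_images = test_images / 255.0

# Build the model
model = keras.Sequential([
    keras.Input(shape=(28, 28)),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Define the early stopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=3)

# Train with early stopping; hold out 10% of the training data for validation
# so that the test set is not used to decide when to stop
history = model.fit(train_images, train_labels, epochs=20,
                    validation_split=0.1,
                    callbacks=[early_stopping])

# Evaluate on the untouched test set
test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Test accuracy:', test_acc)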

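Summary and Outlook

Regular Expressions Summary

Regular expressions are a powerful tool for processing text. This article covered simple matching, reversing words in a string, modifying text strings with the re module (including splitting and replacing), matching the start and end of text strings, compound regular expressions, grouping, and counting character types. It also provided a set of exercises for practicing these techniques.

Keras Summary

Keras is a high-level neural network API integrated into TensorFlow 2 that offers a convenient way to build and train neural network models. This article covered the Keras namespaces, common layers, and the different ways to create a model, and used example code for linear regression, MNIST, and CIFAR-10 to show how to train models with Keras. It also introduced early stopping, which avoids unnecessary computation during training.

Outlook

Regular expressions are widely used in text processing, data cleaning, and information extraction, and more advanced features such as greedy matching and backreferences are worth exploring further. Keras can likewise be used to build more complex models, such as deep convolutional generative adversarial networks (DCGANs) and variational autoencoders (VAEs), for more challenging problems, with model performance tuned for each dataset and task.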

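Flowchart: Keras Model Training Workflow

graph TD;
    A[Load data] --> B[Preprocess data];
    B --> C[Build model];
    C --> D[Compile model];
    D --> F{Use early stopping?};
    F -- Yes --> G[Train with EarlyStopping callback];
    F -- No --> H[Train for a fixed number of epochs];
    G --> I[Evaluate model];
    H --> I;
    I --> J[Make predictions];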

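Table: Regular Expressions and Keras Compared

Technique	Typical uses	Key operations	Example calls
Regular expressions	Text processing, data cleaning, information extraction	Matching, splitting, replacing	`re.match()` , `re.split()` , `re.sub()`
Keras	Deep learning, neural network modeling	Building, compiling, and training models	`keras.Sequential()` , `model.compile()` , `model.fit()`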