Chapter 2 Regular Expressions, Text Normalization, Edit Distance

These are reading notes for Speech and Language Processing, 3rd ed., covering basic regular expression patterns, disjunction, grouping, and precedence; text normalization, including tokenization and lemmatization; and edit distance. Regular expressions specify text search strings, text normalization converts text to a more convenient, standard form, and edit distance measures the similarity of two strings by the number of insertions, deletions, and substitutions needed to turn one into the other.


Reading notes on Speech and Language Processing, 3rd ed.

text normalization: converting text to a more convenient, standard form.

  • tokenization: separate words within a sentence.
  • lemmatization: the task of determining that two words have the same root, despite their surface differences.
  • stemming: strip suffixes from the end of the word.
  • sentence segmentation: breaking up a text into individual sentences, using cues like
    periods or exclamation points.

edit distance: measures how similar two strings are based on the number of edits (insertions, deletions, substitutions) it takes to change one string into the other.
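The definition above can be sketched as a small dynamic program. This is a minimal illustration (not the book's pseudocode), assuming unit cost for all three operations:

```python
def min_edit_distance(source, target):
    """Levenshtein distance with insertion, deletion, substitution all costing 1."""
    n, m = len(source), len(target)
    # dp[i][j] = distance between source[:i] and target[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i          # delete all of source[:i]
    for j in range(m + 1):
        dp[0][j] = j          # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution (or match)
    return dp[n][m]

print(min_edit_distance("intention", "execution"))  # 5
```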

2.1 Regular Expressions

regular expression (RE): an algebraic notation for characterizing a set of strings, a language for specifying text search strings.

python code

```python
import re

# find the first match: lookbehind/lookahead around <h1>…</h1>, non-greedy body
key = "<html><body><h1>hello world</h1></body></html>"
p1 = r"(?<=<h1>).+?(?=</h1>)"
pattern1 = re.compile(p1)
matcher1 = re.search(pattern1, key)
print(matcher1.group(0))  # hello world

# find all matches: raw string so \b is a word boundary, not a backspace
key = "Column 1 Column 2 Column 3 Columna"
p2 = r"\bColumn\b"
pattern2 = re.compile(p2)
print(pattern2.findall(key))  # ['Column', 'Column', 'Column']
```

2.1.1 Basic Regular Expression Patterns

Regular expressions are case sensitive.

[]: a disjunction of characters; matches any single character inside the brackets

/[wW]/: w or W

/[A-Z]/: an upper case letter

/[a-z]/: a lower case letter

When a caret ^ is the first symbol within a [], it means negation:

/[^A-Z]/: not an upper case letter

/[^Ss]/: neither ‘S’ nor ‘s’

/[^\.]/: not a period

/[e^]/: either ‘e’ or ‘^’

/a^b/: the pattern ‘a^b’

? means “the preceding character or nothing”:

/woodchucks?/: woodchuck or woodchucks

/colou?r/: color or colour
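A quick check of the optional operator in Python (a minimal sketch using the example above):

```python
import re

pattern = re.compile(r"colou?r")  # u is optional
print(bool(pattern.fullmatch("color")))    # True
print(bool(pattern.fullmatch("colour")))   # True
print(bool(pattern.fullmatch("colouur")))  # False
```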

**Kleene \*** (pronounced "cleeny star"): zero or more occurrences of the immediately previous character or regular expression

/a*/: any string of zero or more a's (including the empty string)

/aa*/: one or more a's

/[ab]*/: zero or more a's or b's

Kleene +: one or more occurrences of the immediately preceding character or regular expression

/./: a wildcard expression that matches any single character (except a carriage return)

/beg.n/: begin, begun …

/aardvark.*aardvark/: to find any line in which a particular word, for example, aardvark, appears twice.

Anchors are special characters that anchor regular expressions to particular places in a string. The most common anchors are the caret ^ and the dollar sign $. The caret ^ matches the start of a line. The pattern /^The/ matches the word The only at the start of a line.

Thus, the caret ^ has three uses:

to match the start of a line,

to indicate a negation inside of square brackets,

and just to mean a caret.

The dollar sign $ matches the end of a line. So the pattern / $/ (a space followed by the dollar sign) is useful for matching a space at the end of a line, and /^The dog\.$/ matches a line that contains only the phrase The dog.
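The anchored pattern can be checked directly (a minimal sketch):

```python
import re

# ^ and $ restrict the match to a line consisting of exactly "The dog."
anchored = re.compile(r"^The dog\.$")
print(bool(anchored.search("The dog.")))          # True
print(bool(anchored.search("See The dog. run")))  # False
```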

\b matches a word boundary. /\bthe\b/ matches the word the but not the word other.

\B matches a non-boundary

Note that in a Python string literal, \b is the backspace escape, so \b and \B seem not to work. Write patterns as raw strings (r"\b", r"\B") or double the backslash ("\\b", "\\B").
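A minimal sketch of the raw-string issue:

```python
import re

# In a normal Python string, "\b" is a backspace character (U+0008), so the
# regex engine never sees a word-boundary escape. Raw strings (or "\\b") work:
print(re.findall(r"\bthe\b", "the other theory"))   # ['the']
print(re.findall("\\bthe\\b", "the other theory"))  # ['the']
```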

2.1.2 Disjunction, Grouping, and Precedence

The pipe | expresses disjunction: /cat|dog/ matches either the string cat or the string dog.

/gupp(y|ies)/ to match the string guppy or the string guppies

/(Column [0-9]+ *)*/ to match the string Column 1 Column 2 Column 3

operator precedence hierarchy, from highest to lowest:

Parenthesis ()
Counters * + ? {}
Sequences and anchors the ^my end$
Disjunction |

Thus, because counters have a higher precedence than sequences, /the*/ matches theeeee but not thethe. Because sequences have a higher precedence than disjunction, /the|any/ matches the or any but not theny.

We say that patterns are greedy, expanding to cover as much of a string as they can.
There are, however, ways to enforce non-greedy matching, using another meaning of the ? qualifier. The operator *? is a Kleene star that matches as little text as possible. The operator +? is a Kleene plus that matches as little text as possible.
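The greedy/non-greedy contrast shows up directly in practice (a minimal sketch; the HTML string is just an illustration):

```python
import re

html = "<h1>one</h1><h1>two</h1>"
# Greedy .* runs to the LAST </h1>, swallowing both headings:
print(re.findall(r"<h1>.*</h1>", html))   # ['<h1>one</h1><h1>two</h1>']
# Non-greedy .*? stops at the FIRST </h1> each time:
print(re.findall(r"<h1>.*?</h1>", html))  # ['<h1>one</h1>', '<h1>two</h1>']
```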

2.1.3 A Simple Example

The process we just went through was based on fixing two kinds of errors: false positives, strings that we incorrectly matched like other or there, and false negatives, strings that we incorrectly missed, like The. Addressing these two kinds of errors comes up again and again in implementing speech and language processing systems. Reducing the overall error rate for an application thus involves two antagonistic efforts:

  • Increasing precision (minimizing false positives)
  • Increasing recall (minimizing false negatives)

2.1.4 A More Complex Example

2.1.5 More Operators

| RE | Expansion | Match | First Matches |
|----|-----------|-------|---------------|
| \d | [0-9] | any digit | Party of **5** |
| \D | [^0-9] | any non-digit | **B**lue moon |
| \w | [a-zA-Z0-9_] | any alphanumeric/underscore | **D**aiyu |
| \W | [^\w] | a non-alphanumeric | **!**!!! |
| \s | [ \r\t\n\f] | whitespace (space, tab) | the space in "in Concord" |
| \S | [^\s] | non-whitespace | **i**n Concord |

| RE | Match |
|----|-------|
| * | zero or more occurrences of the previous char or expression |
| + | one or more occurrences of the previous char or expression |
| ? | zero or one occurrence of the previous char or expression |
| {n} | exactly n occurrences of the previous char or expression |
| {n,m} | from n to m occurrences of the previous char or expression |
| {n,} | at least n occurrences of the previous char or expression |
| {,m} | up to m occurrences of the previous char or expression |
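A minimal sketch of the range counters (the ZIP-code pattern is a hypothetical example, not from the book):

```python
import re

# \d{5} requires exactly five digits; (?:-\d{4})? optionally allows a +4 suffix.
zip_code = re.compile(r"\d{5}(?:-\d{4})?")
print(bool(zip_code.fullmatch("02139")))       # True
print(bool(zip_code.fullmatch("02139-4307")))  # True
print(bool(zip_code.fullmatch("0213")))        # False (only four digits)
```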

2.1.6 Regular Expression Substitution, Capture Groups, and ELIZA

substitution

s/regexp1/pattern/: replace a string characterized by a regular expression regexp1 with pattern.

number operator

s/([0-9]+)/<\1>/: add angle brackets to integers. For example, change the 35 boxes to the <35> boxes.
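The sed-style substitution above corresponds to re.sub in Python (a minimal sketch):

```python
import re

# \1 in the replacement refers back to the first capture group
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes"))  # the <35> boxes
```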

/the (.*)er they were, the \1er they will be/

will match the bigger they were, the bigger they will be but not the bigger they were, the faster they will be.

\1 will be replaced by whatever string matched the first item in parentheses.

This use of parentheses to store a pattern in memory is called a capture group. Every time a capture group is used (i.e., parentheses surround a pattern), the resulting match is stored in a numbered register. If you match two different sets of parentheses, \2 means whatever matched the second capture group. Thus

/the (.*)er they (.*), the \1er they \2/

will match the faster they ran, the faster we ran but not the faster they ran, the faster we ate.

Parentheses thus have a double function in regular expressions; they are used to group terms for specifying the order in which operators should apply, and they are used to capture something in a register. Occasionally we might want to use parentheses for grouping, but don’t want to capture the resulting pattern in a register. In that case we use a non-capturing group, which is specified by putting the commands ?: after the open paren, in the form (?: pattern ).

/(?:some|a few) (people|cats) like some \1/

will match some cats like some cats but not some cats like some a few.
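The non-capturing-group example can be checked directly (a minimal sketch):

```python
import re

# (?:some|a few) does not capture, so \1 refers to the (people|cats) group
pat = re.compile(r"(?:some|a few) (people|cats) like some \1")
print(bool(pat.search("some cats like some cats")))   # True
print(bool(pat.search("some cats like some a few")))  # False
```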

2.1.7 Lookahead assertions

The operator (?= pattern) is true if pattern occurs at that point, but is zero-width, i.e., the match pointer doesn't advance. The operator (?! pattern) returns true only if pattern does not match; it is likewise zero-width and doesn't advance the cursor.

/(?<=pattern)/ is a lookbehind: it matches at a position immediately preceded by pattern.

/(?=pattern)/ is a lookahead: it matches at a position immediately followed by pattern.
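A minimal sketch of a negative lookahead, in the style of the book's Volcano example:

```python
import re

# Match a word at the start of the line only if it is not "Volcano":
pat = re.compile(r"^(?!Volcano)[A-Za-z]+")
print(pat.findall("Arenal"))   # ['Arenal']
print(pat.findall("Volcano"))  # []
```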

2.2 Words

corpus (plural corpora): a computer-readable collection of text or speech.

Punctuation is critical for finding boundaries of things (commas, periods, colons) and for identifying some aspects of meaning (question marks, exclamation marks, quotation marks). For some tasks, like part-of-speech tagging or parsing or speech synthesis, we sometimes treat punctuation marks as if they were separate words.

An utterance is the spoken correlate of a sentence.

This utterance has two kinds of disfluencies. The broken-off word main- is called a fragment. Words like uh and um are called fillers or filled pauses.

We also sometimes keep disfluencies around. Disfluencies like uh or um are actually helpful in speech recognition in predicting the upcoming word, because they may signal that the speaker is restarting the clause or idea, and so for speech recognition they are treated as regular words. Because people use different disfluencies they can also be a cue to speaker identification.

A lemma is a set of lexical forms having the same stem, the same major part-of-speech, and the same word sense. The wordform is the full inflected or derived form of the word.

Types are the number of distinct words in a corpus; if the set of words in the vocabulary is V, the number of types is the vocabulary size |V|. Tokens are the total number N of running words.

The relationship between the number of types |V| and the number of tokens N is called Herdan's Law (Herdan, 1960) or Heaps' Law (Heaps, 1978) after its discoverers (in linguistics and information retrieval respectively). It is shown in Eq. 2.1, where k and b are positive constants with 0 < b < 1:

|V| = kN^b    (2.1)
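A toy illustration of counting types versus tokens (a sketch that lowercases and keeps letter runs only; not a serious tokenizer):

```python
import re

text = "They picnicked by the pool, then lay back on the grass and looked at the stars."
tokens = re.findall(r"[a-z]+", text.lower())
# "the" occurs three times, so there are two fewer types than tokens
print(len(tokens), len(set(tokens)))  # 16 14
```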
