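What Is a Language?
So far, we have seen several methods for processing sequences of symbols such as words, sentences, and documents:
- Language models
- Hidden Markov models
- Recurrent neural networks
However, none of these models address the nature of language itself, because they can process any sequence of symbols, not only words and sentences.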
Formal Language Theory
Formal language theory gives us a mathematical framework for defining languages.
- It studies classes of languages and their computational properties.
- Language: a set of strings.
- String: a sequence of elements from a finite alphabet.
  - The alphabet can be thought of as a vocabulary.
  - The elements can be thought of as words.
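Motivation
Formal language theory studies classes of languages and their computational properties. In this course, we mainly cover two kinds of formal language:
* Regular languages
* Context-free languages
These are the first two classes in formal language theory. Beyond them lie more complex classes, such as context-sensitive languages, which this course does not explore in depth.
The main goal is to solve the membership problem: does a given string belong to a given language?
How do we do that? We can define a grammar for the language, and then check whether the string conforms to that grammar.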
Examples
- Examples of languages:
  - Binary strings that start with 0 and end with 1:
    - {01, 001, 011, 0001, ...} belong to this language
    - {1, 0, 00, 11, 100, ...} do not belong to this language
  - Even-length sequences from the alphabet {a, b}:
    - {aa, ab, ba, bb, aaaa, ...} belong to this language
    - {aaa, aba, bbb, ...} do not belong to this language
  - English sentences that start with a wh-word and end with a question mark:
    - {what?, where my pants?, ...} belong to this language
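To make "a language is a set of strings" concrete, the sketch below enumerates every string over an alphabet up to a small length and keeps the ones that satisfy a membership test. The helper names (`strings_up_to`, `starts_0_ends_1`, `even_length`) are illustrative and not part of the lecture material.

```python
from itertools import product

def strings_up_to(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            yield "".join(symbols)

def starts_0_ends_1(s):
    # A single symbol cannot be both 0 and 1, so members have length >= 2.
    return s.startswith("0") and s.endswith("1")

def even_length(s):
    return len(s) % 2 == 0

# The finite slice of each (infinite) language that fits in the length bound.
binary_members = [s for s in strings_up_to("01", 3) if starts_0_ends_1(s)]
ab_members = [s for s in strings_up_to("ab", 2) if even_length(s)]

print(binary_members)
print(ab_members)
```

Under this definition the empty string has length 0 and therefore counts as an even-length sequence, even though the example list above does not show it.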
Beyond the Membership Problem
- Membership
  - Does a string belong to a language? Yes/no.
- Beyond the membership problem:
  - Scoring
    - Graded membership: how acceptable is a string? (language models)
  - Transduction
    - Translate one string into another (e.g. stemming)
Regular Language
Regular Languages
- The simplest class of languages.
- Any regular expression defines a regular language.
  - It describes which strings are part of the language, e.g. 0(0|1)*1
- Formally, a regular expression is built from the following operations/definitions:
  - A symbol drawn from the alphabet Σ
  - The empty string ε
  - Concatenation of two regular expressions: RS
  - Alternation of two regular expressions: R | S
  - Kleene star for 0 or more repeats: R*
  - Parentheses () to define the scope of operations
- E.g.
  - Binary strings that start with 0 and end with 1: 0(0|1)*1
  - Even-length sequences from the alphabet {a, b}: ((aa)|(ab)|(ba)|(bb))*
  - English sentences that start with a wh-word and end in ?: ((what)|(where)|(why)|(which)|(whose)|(whom))Σ*?
- Properties of regular languages:
  - Closure: if we take regular languages L1 and L2 and combine them, is the resulting language regular?
  - Regular languages are closed under the following operations:
    - Concatenation and union
    - Intersection: strings that are valid in both L1 and L2
    - Negation: strings that are not in L
  - Extremely versatile: we can have regular languages for different properties of language, and use them together.
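The three example expressions can be tried out with Python's `re` module, as a rough sketch. Two assumptions are made here: `.` stands in for Σ (any character), and the trailing `?` of the wh-question pattern is escaped because it is a literal question mark rather than an operator. Python's regex syntax also offers features beyond formal regular expressions, so only the basic operators are used.

```python
import re

# Patterns transcribed from the examples; names are illustrative.
BINARY = re.compile(r"0(0|1)*1")
EVEN_AB = re.compile(r"((aa)|(ab)|(ba)|(bb))*")
WH_QUESTION = re.compile(r"((what)|(where)|(why)|(which)|(whose)|(whom)).*\?")

def in_language(pattern, string):
    """Membership test: the whole string must match, not just a prefix."""
    return pattern.fullmatch(string) is not None

print(in_language(BINARY, "0001"))                  # member
print(in_language(BINARY, "100"))                   # non-member
print(in_language(EVEN_AB, "abba"))                 # member
print(in_language(WH_QUESTION, "where my pants?"))  # member
```

`fullmatch` is used rather than `search` because membership is a statement about the entire string.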
Finite State Acceptor
Finite State Acceptor
- An FSA consists of:
  - An alphabet of input symbols Σ
  - A set of states Q
  - A start state q0 ∈ Q
  - A set of final states F ⊆ Q
  - A transition function: symbol and state -> next state
- It accepts a string if there is a path from q0 to a final state with transitions matching each symbol.
  - Dijkstra's shortest-path algorithm, complexity O(V log V + E)
- E.g.:
  - Input alphabet: {a, b}
  - States: {q0, q1}
  - Start, final states: q0, {q1}
  - Transition function: {(q0, a) -> q0, (q0, b) -> q1, (q1, b) -> q1}
  - Regular expression defined by this FSA: a*bb*
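The example FSA is small enough to simulate directly. The following is a minimal sketch that encodes the transition function as a dictionary. Because this particular machine is deterministic, following the single possible transition per symbol is enough and no path search is needed. The names `TRANSITIONS`, `START`, `FINAL`, and `accepts` are illustrative.

```python
# Transition function of the example: (state, symbol) -> next state.
TRANSITIONS = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "b"): "q1",
}
START = "q0"
FINAL = {"q1"}

def accepts(string):
    """Return True if the FSA ends in a final state after reading `string`."""
    state = START
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no matching transition: reject
            return False
    return state in FINAL

for s in ["b", "aabbb", "", "a", "aba"]:
    print(repr(s), accepts(s))
```

The accepted strings are exactly those matching a*bb*: any number of a's followed by at least one b.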
Derivational Morphology
- Use of affixes to change a word into another grammatical category.
- E.g. (an asterisk marks an invalid form):
  - grace -> graceful -> gracefully
  - grace -> disgrace -> disgracefully
  - allure -> alluring -> alluringly
  - allure -> *allureful
  - allure -> *disallure
- FSA for morphology:
  - Want to accept valid forms (grace -> graceful)
  - Reject invalid ones (allure -> *allureful)
  - Generalize to other words
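One way to picture an FSA for morphology is to let the input symbols be morphemes rather than letters. The sketch below is a hand-built toy covering only the words in the examples above. The state names and the morpheme segmentation (e.g. treating "alluring" as allure + ing) are assumptions made for illustration.

```python
# Transitions over morphemes: (state, morpheme) -> next state.
MORPH_TRANSITIONS = {
    ("start", "grace"): "grace_stem",
    ("start", "dis"): "dis_prefix",
    ("dis_prefix", "grace"): "grace_stem",
    ("grace_stem", "ful"): "adjective",
    ("start", "allure"): "allure_stem",
    ("allure_stem", "ing"): "adjective",
    ("adjective", "ly"): "adverb",
}
MORPH_FINAL = {"grace_stem", "allure_stem", "adjective", "adverb"}

def accepts_word(morphemes):
    """Return True if the morpheme sequence is a valid word in this toy FSA."""
    state = "start"
    for morpheme in morphemes:
        state = MORPH_TRANSITIONS.get((state, morpheme))
        if state is None:
            return False
    return state in MORPH_FINAL

print(accepts_word(("grace", "ful")))            # graceful
print(accepts_word(("dis", "grace", "ful", "ly")))  # disgracefully
print(accepts_word(("allure", "ful")))           # *allureful
print(accepts_word(("dis", "allure")))           # *disallure
```

Generalizing to other words would mean adding more stems that share the same suffix states, rather than listing every word form separately.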
Weighted FSA
- Some words are more plausible than others:
  - fishful vs. disgracelyful
  - musicky vs. writey
- Weighted FSA: a graded measure of acceptability:
  - Start state weight function: λ: Q -> R
  - Final state weight function: ρ: Q -> R
  - Transition function: δ: (Q, Σ, Q) -> R
- Shortest path:
  - Total score of a path π = t_1, ..., t_N, where each t_i = (q_{i-1}, s_i, q_i) is a transition: λ(q_0) + Σ_{i=1..N} δ(t_i) + ρ(q_N)
  - Use a shortest-path algorithm to find the π with minimum cost. Complexity: O(V log V + E)
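As a rough illustration of scoring with a weighted FSA, the sketch below adds up a start weight, transition weights, and a final weight along each path that reads a given string, and keeps the cheapest total. All weights and the extra (q0, a, q1) arc are invented for this example. For a fixed input string a simple dynamic program over states is used here instead of Dijkstra's algorithm.

```python
import math

# Hypothetical weights; states not listed are treated as infinitely costly.
START_WEIGHT = {"q0": 0.0}   # lambda
FINAL_WEIGHT = {"q1": 0.5}   # rho
ARC_WEIGHT = {               # delta: (source, symbol, target) -> cost
    ("q0", "a", "q0"): 1.0,
    ("q0", "a", "q1"): 5.0,
    ("q0", "b", "q1"): 2.0,
    ("q1", "b", "q1"): 0.5,
}

def min_cost(string):
    """Cost of the cheapest accepting path for `string`, or inf if none."""
    # best[q] = cheapest cost of reading the prefix so far and ending in q
    best = dict(START_WEIGHT)
    for symbol in string:
        nxt = {}
        for (src, sym, dst), weight in ARC_WEIGHT.items():
            if sym == symbol and src in best:
                candidate = best[src] + weight
                if candidate < nxt.get(dst, math.inf):
                    nxt[dst] = candidate
        best = nxt
    totals = [cost + FINAL_WEIGHT[q] for q, cost in best.items() if q in FINAL_WEIGHT]
    return min(totals, default=math.inf)

print(min_cost("ab"))
print(min_cost("a"))
print(min_cost("ba"))
```

For "ab" there are two paths (through q0 or through q1 after the first symbol), and the function returns the cost of the cheaper one.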
Finite State Transducer
Finite State Transducer (FST)
- Often we do not want to just accept or score strings, but to translate them into another string.
- An FST adds string output capability to an FSA:
  - It includes an output alphabet
  - Transitions now take an input symbol and emit an output symbol: (Q, Σ, Σ, Q)
- Can be weighted (WFST): graded scores for transitions
- E.g. edit distance as a WFST: the distance to transform one string into another
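The edit-distance example can be sketched without building a transducer explicitly. The function below is the standard dynamic-programming computation of Levenshtein distance, assuming a cost of 1 for each insertion, deletion, and substitution and 0 for a match. Under those assumed costs, it gives the same number as the cheapest path through a WFST whose arcs copy, substitute, insert, or delete one symbol.

```python
def edit_distance(source, target):
    """Minimum number of single-symbol edits turning `source` into `target`."""
    # prev[j] = distance between the source prefix read so far and target[:j]
    prev = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        curr = [i]  # deleting all i source symbols yields the empty target prefix
        for j, t in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1,             # delete s
                curr[j - 1] + 1,         # insert t
                prev[j - 1] + (s != t),  # match (0) or substitute (1)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))
```

Only two rows of the table are kept at a time, since each cell depends on the current and previous row alone.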
FST for Inflectional Morphology
- Verb inflection in Spanish must match the subject in person and number
- Goal of morphological analysis:
  - canto -> cantar + VERB + present + 1P + singular
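The input/output mapping of such an analyzer can be mimicked with a suffix table. This is only a sketch of the transduction, not a real FST. It assumes regular present-tense -ar verbs and has no lexicon, so it would also "analyze" words that are not verbs at all. The names `AR_PRESENT_ENDINGS` and `analyze_ar_verb` are hypothetical.

```python
# Present-tense endings of regular Spanish -ar verbs, longest suffix first.
AR_PRESENT_ENDINGS = [
    ("amos", ("1P", "plural")),
    ("áis", ("2P", "plural")),
    ("as", ("2P", "singular")),
    ("an", ("3P", "plural")),
    ("o", ("1P", "singular")),
    ("a", ("3P", "singular")),
]

def analyze_ar_verb(form):
    """Map an inflected form to 'lemma + tags', or None if no ending matches."""
    for ending, (person, number) in AR_PRESENT_ENDINGS:
        if form.endswith(ending) and len(form) > len(ending):
            stem = form[: -len(ending)]
            return f"{stem}ar + VERB + present + {person} + {number}"
    return None

print(analyze_ar_verb("canto"))
print(analyze_ar_verb("cantamos"))
```

A real morphological FST would constrain the stems with a lexicon and could return several analyses for an ambiguous form.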
Non-Regular Languages
- Arithmetic expressions with balanced parentheses
  - (a + (b * (c / d)))
  - Can have arbitrarily many opening parentheses
  - Need to remember how many parentheses are open in order to produce the same number of closing parentheses
  - Cannot be done with a finite number of states
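The need for memory shows up clearly in a simple checker. The sketch below tracks the nesting depth with an integer counter. Since the depth can grow without bound, the counter plays the role of the unbounded memory that a machine with a fixed set of states lacks. The function name `balanced` is illustrative.

```python
def balanced(expression):
    """Return True if every '(' has a matching ')' in the right order."""
    depth = 0  # number of currently open parentheses; unbounded in general
    for ch in expression:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared with nothing open
                return False
    return depth == 0

print(balanced("(a + (b * (c / d)))"))
print(balanced("(a + (b"))
```

If the nesting depth were capped at some fixed k, a finite set of k + 1 states would suffice; it is the arbitrary depth that takes the language out of the regular class.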
Center Embedding
- Center embedding of relative clauses
  - The cat loves Mozart
  - The cat the dog chased loves Mozart
  - The cat the dog the rat bit chased loves Mozart
  - The cat the dog the rat the elephant admired bit chased loves Mozart
- Need to remember the n subject nouns, to ensure that n verbs follow
- Requires a context-free grammar
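The "n nouns followed by n verbs" pattern can be checked with a recursive function that mirrors a context-free rule. In this sketch each sentence is abstracted into a list of tags, "N" for a subject noun phrase and "V" for a verb phrase. The rule S -> N V | N S V is an assumption chosen to match the examples, not a grammar given in the lecture.

```python
def n_nouns_n_verbs(tags):
    """Recursive check mirroring the rule S -> N V | N S V."""
    if len(tags) < 2 or tags[0] != "N" or tags[-1] != "V":
        return False
    inner = tags[1:-1]
    # Either this was the innermost N V pair, or the inside is itself an S.
    return not inner or n_nouns_n_verbs(inner)

# "The cat the dog the rat bit chased loves Mozart" -> N N N V V V
print(n_nouns_n_verbs(["N", "N", "N", "V", "V", "V"]))
# One verb missing
print(n_nouns_n_verbs(["N", "N", "V"]))
```

Each recursive call peels off one noun from the front and one verb from the back, which is the pairing a finite-state machine cannot track for arbitrary n.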