PCFG and NLP: Notes on Natural Language Processing

This post covers the main aspects of natural language processing: analysis at the phonological, morphological, syntactic and semantic levels, and models and techniques ranging from (probabilistic) context-free grammars (PCFGs) to dependency parsing. It discusses language models such as n-grams and RNNs, and the key steps of recognition, decoding and parameter estimation. It also touches on methods for handling noise and sparsity in practice, and on the challenges of processing language in the wild.


Notes for the MVA M2 course, tidied up and updated from time to time.

Lecture 1 - part 1

linguistic data

phonological level - sentence-level analysis, sounds

graphemic level: writing systems, e.g. Chinese characters

morphological level (how words are built): compound words, roots

syntactic level: phrase structure, subject+verb+object / subject+object+verb, free word order, relative order

semantic level: the meaning of words (multiple meanings, ...); how languages carve up the colour spectrum; meaning inclusion (hyponymy)

linguistic context: the surrounding text or discourse

extra-linguistic context: everything outside the text, e.g. accompanying images, etc.

Diversity

variation

ambiguity

sparsity

phonological diversity : arm

sociolinguistic variation

lexical ambiguity: homonymy

pronunciation ambiguity

segmentation ambiguity

tokens and form

amalgams: des = de les

à l'instar du

lemma: equivalence class of forms belonging to the same morphological paradigm

lemma + morphological form

syntactic ambiguity :

fruit flies like a banana

to get the dependency structure (POS tags)

automatic syntactic analysis: parsing

eg

pizza with anchovies

garden-path sentences

semantic ambiguity: polysemy

hyponymy

corpora

corpus = a body of text stored in machine-readable form

annotated, serving as training, development or test data

treebanks

...

Zipf's law (distribution): order words by decreasing frequency and plot frequency against rank

long tail (1% - 20% - 80%)
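
A minimal sketch of checking Zipf's law on a tokenized corpus (the toy corpus and the use of Counter are my own illustration, not from the lecture):

```python
from collections import Counter

# Hypothetical toy corpus; in practice use the raw text of a real corpus.
tokens = "the cat sat on the mat and the dog sat on the rug".split()

counts = Counter(tokens)
ranked = counts.most_common()  # words ordered by decreasing frequency

for rank, (word, freq) in enumerate(ranked, start=1):
    # Zipf's law predicts freq roughly proportional to 1 / rank,
    # so log(freq) against log(rank) should look approximately linear.
    print(rank, word, freq)
```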

Lecture 1 - part 2

Deconstructing Siri

- identify the talker

- recognize the words

- understand the query

- respond orally

Lecture 2 - part 1

Chapter 9.3

1. filters and spectrum

2. modeling

classical way: the Bayesian (noisy-channel) model: a language model and an acoustic model (glottal source, etc.)

we can also condition on sequence of phonemes instead of words

use HMM

phone HMMs

yet a per-phone HMM may fall short: a phone's realisation also depends on the next one (coarticulation)

one HMM model per triphone (context-dependent phone)

model the feature space: the dimensions of the MFCCs are roughly uncorrelated

use a Gaussian mixture: it can model (almost) any kind of distribution
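
A minimal sketch of fitting a diagonal-covariance Gaussian mixture to MFCC-like frames, assuming scikit-learn is available (the random frames stand in for real MFCC features):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical stand-in for MFCC frames of one HMM state: 1000 frames x 13 coefficients.
frames = np.random.randn(1000, 13)

# Diagonal covariances reflect the assumption that MFCC dimensions are roughly uncorrelated.
gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(frames)

# Per-frame log-likelihoods, i.e. the emission scores an HMM state would use.
log_likelihoods = gmm.score_samples(frames)
```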

3. identification

female voices are very different from male voices

use different filters from the beginning

should generalise to new speakers

nowadays systems should be speaker-independent

multitask training

4. denoising: speech in the wild

data augmentation: add white noise or other real-world noise to the input voice; speed perturbation

clean speech vs. noisy speech

5. next class: language models, end-to-end systems, the frontier, different kinds of languages

Lecture 3 - part 1

1. transform the dictionary into a graph of phonemes: use a seq2seq p2g model

2. language model: n-gram models: approximate P(w_i | w_1, ..., w_{i-1}) by P(w_i | w_{i-N+1}, ..., w_{i-1})

4-grams seem to work best (for words)

- count words or letters: P(w_i) ≈ c_i / N

- unknown words: no matter how large the training vocabulary, there will be unseen words

in the test set: the lexicon is effectively infinite

- in the n-gram model, a quick-and-dirty solution is a single unknown-word token: as the corpus gets large, the set of unknown words becomes small

- the n-gram model generally produces a sparse count matrix: many combinations will never appear

the problem grows with n (exponentially)

use smoothing: P(w_i) = (c_i + 1) / (N + V), i.e. add-one (Laplace) smoothing (avoids zero probabilities)
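
A minimal sketch of add-one (Laplace) smoothing for a bigram model; the toy corpus and the helper name p_laplace are mine:

```python
from collections import Counter

# Hypothetical tokenized corpus.
sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]

unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))
V = len(unigrams)  # vocabulary size

def p_laplace(w_prev, w):
    # Every bigram gets a pseudo-count of 1, so unseen combinations
    # never receive zero probability.
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(p_laplace("the", "cat"))  # seen bigram
print(p_laplace("cat", "dog"))  # unseen bigram, still non-zero
```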

backoff: approximate an n-gram by some combination of (n-1)-grams, (n-2)-grams, ...;

and interpolation of the matrix (mixing the estimates of different orders)

clustering: if a word is not found, go up the hierarchy and use the probability of its class

- compare different models:

plug into word recognizers

perplexity (intrinsic evaluation):

extrinsic : improved WER
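
A minimal sketch of computing perplexity from a conditional bigram probability such as the smoothed p_laplace above (the function signature is my own):

```python
import math

def perplexity(test_tokens, prob):
    # prob(prev, w) must return the model probability of w given its predecessor.
    log_sum = sum(math.log(prob(test_tokens[i - 1], test_tokens[i]))
                  for i in range(1, len(test_tokens)))
    n = len(test_tokens) - 1
    # Perplexity is the exponentiated average negative log-likelihood per token.
    return math.exp(-log_sum / n)

# e.g. perplexity(["the", "cat", "sat"], p_laplace)
```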

3. acoustic model + speech input

decoding lattices

...

an RNN language model --> can be a character-based model, yet it grows very quickly for an RNN

it will only succeed based on certain contexts

4. the decoding problem

Dynamic Programming :

- define subproblem

given n, find the num:

Viterbi: the best path: with O and θ known, find Q = argmax_Q P(Q, O | θ)

n-best: instead of keeping only the single best path, keep the n best paths

- no dynamic programming with RNNs: no simple recurrence relation between p_i and p_{i-1}
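
A minimal sketch of the Viterbi recursion for an HMM (dense numpy arrays, no log-space; the argument names are mine):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state sequence for observation indices `obs`, given
    initial probabilities pi[s], transitions A[s, s'] and emissions B[s, o]."""
    n_states, T = len(pi), len(obs)
    delta = np.zeros((T, n_states))            # best score ending in state s at time t
    back = np.zeros((T, n_states), dtype=int)  # backpointers

    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for s in range(n_states):
            scores = delta[t - 1] * A[:, s]
            back[t, s] = np.argmax(scores)
            delta[t, s] = scores[back[t, s]] * B[s, obs[t]]

    # Follow the backpointers from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```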

5. Parameter estimation: maximise P(O | θ) over θ using EM

6. end-to-end problem

we can add a right-to-left layer to get a bidirectional RNN

7. finite state machine

language model: transducers and acceptors

pronunciation model: converts a sequence of phones into a sequence of words

8. basic operation on WFST

hard to build everything from scratch

combination is possible

OpenFST and Kaldi libraries

3 exercises

misspelling detector

Lecture 4 -- language processing in the wild

6. Language III: Language Processing in the wild (2h+1h)

Algorithms: text normalization, coreference, distributional semantics, word embeddings

Human processing: conversational & casual language

Assignment: Evaluating Topic Models

Given a dataset of documents and human topic annotation, correlate different topic models with human judgements.

Readings:

J. Chang, J. Boyd-Graber, C. Wang, S. Gerrish, and D. Blei (2009).

Reading Tea Leaves: How Humans Interpret Topic Models. Neural Information Processing Systems.

Lecture 4

the shorter the window (1-3 words), the more syntactic the representation

the longer the window (4-10 words), the more semantic the representation

first-order co-occurrence: syntagmatic association:

wrote + book or poem

second-order co-occurrence: paradigmatic association

similar neighbours : wrote / remarked / said + sentences ...

positive pointwise mutual information (PPMI)

- 'the' and 'of' : very frequent yet not discriminative

- choose the words that are more informative

- this method is often used to find phrases (collocations)

- very rare words cause a problem: their rareness is amplified
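
A minimal sketch of PPMI over (word, context) co-occurrence pairs; the toy pairs and the ppmi helper are my own:

```python
import math
from collections import Counter

# Hypothetical (word, context-word) pairs collected with a sliding window.
pairs = [("wrote", "book"), ("wrote", "poem"),
         ("the", "book"), ("the", "poem"), ("the", "wrote")]

pair_counts = Counter(pairs)
word_counts = Counter(w for w, _ in pairs)
ctx_counts = Counter(c for _, c in pairs)
N = sum(pair_counts.values())

def ppmi(w, c):
    # PMI compares the joint probability with what independence would predict;
    # clipping at 0 gives positive PMI.
    joint = pair_counts[(w, c)] / N
    if joint == 0:
        return 0.0
    expected = (word_counts[w] / N) * (ctx_counts[c] / N)
    return max(0.0, math.log2(joint / expected))

print(ppmi("wrote", "book"))  # informative pair
print(ppmi("the", "book"))    # frequent but less discriminative
```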

sentence : self-contained syntactic structure

word

token

wordform: the unit of syntactic structure (carries a POS)

named entity : utterance

semantic words: meaning cannot be inferred from composition

word correction

Damerau-Levenshtein distance
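
A minimal sketch of the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, useful for a misspelling detector:

```python
def damerau_levenshtein(a, b):
    # Insertions, deletions, substitutions and transpositions of adjacent
    # characters all cost 1 (optimal-string-alignment variant).
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("acress", "across"))  # 1 (substitution)
print(damerau_levenshtein("caht", "chat"))      # 1 (transposition)
```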

02/19/2018

Lecture 5 Parsers

Besides the usual lecture notes, other useful links are as follows. [2] is the advanced NLP course for the paper-reading seminar. [4] explains how to deal with unknown words. [5] describes in detail how to use NLTK's PCFG with a CYK parser. [6] introduces how to evaluate and improve a parser. [9] gives a list of POS abbreviations for French words. [10] offers a full-version language tagger.

========= intro ===============

Syntax and formal grammars

(syntax)

What structures do we need?

- introspection

- corpus study (interesting patterns)

- psycholinguistics or neurolinguistics

formal grammars : what formal devices do we need to represent such structures

treebanks: collection of sentences annotated with syntactic structures (trees). (heavily used)

syntactic lexicons: wordforms or lexemes (important for symbolic approaches)

parsing: producing a parse tree. parser: the machine that does the parsing

structures in trees.

- grammatical sentences

- ungrammatical sentences (oral speech, typos, etc.)

Find a way to distinguish the two.

how to label the tree's nodes ?

- a non-bracketed word : head

- labeled by POS (wordform clusters) and by phrase categories (for groups of two or three words)

constituency tree :

- dependencies: dependency tree (no information about the constituents of the sentence)

- constituency: POS and phrase labels (words at the same level lose their dependency relations)

they can sometimes be converted into one another, using a head percolation table: rules defining how to find the head of each constituent.

non-projective case : crossing edges (difficult to deal with)

Constituency-based and dependency-based structures are two sides of the syntactic structure of a sentence; in the following we limit ourselves to sentences with projective, tree-like structures.

=========== mathematical representation =============

Formal grammars

language = a set of words over an alphabet T, called the vocabulary

In other words, a language is a subset of T* (the set of all strings over T), from which we have:

- a language can be finite or infinite

- T* is infinite yet countable

- the number of languages (subsets of T*) is non-enumerable

Grammar

A grammar G = (V, T, S, P)

where V is a finite set of objects called variables or non-terminal symbols. eg: {V,NP, DET...}

T is a finite set of objects called terminal symbols eg: {abcd...}

S is a special symbol called the start variable

P is a finite set of productions, the rewriting rules

e.g. define G = ({S}, {a, b}, S, P): the non-terminal set contains only the start variable S, and P is the set of productions.

NB: you can find a detailed description in the book Algebra by Michael Artin.

Context-free grammars (CFGs)

In French: grammaires hors-contexte, grammaires non-contextuelles, grammaires algébriques, etc.
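
A minimal sketch of writing and parsing with a toy CFG in NLTK (the grammar and sentence are illustrative; NLTK's chart parser is used rather than the bottom-up/top-down procedures discussed below):

```python
import nltk

# A hypothetical toy grammar with the usual S/NP/VP/DET/N/V non-terminals.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DET N
VP -> V NP
DET -> 'the'
N -> 'cat' | 'mouse'
V -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat chased the mouse".split()):
    print(tree)
```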

Bottom-up method : First Parse

1 construct lexicon

2 first parse

--> construct a derived tree

The nodes with 0 or 2+ anchors

Top-down method: Substitution

Start from the start symbol, replace the non-terminal symbols, and go top-down

Chomsky hierarchy

Watch this.

1 - regular languages

2 - context-free languages

3 - context-sensitive languages

Tree adjunction grammars (TAGs)

1 arguments and modifiers

arguments: Pierre, pomme de terre

modifiers : at , in ...

Tree Insertion Grammars

the sequence obtained while rewriting is called a sentential form; the rules are called productions or rewriting rules

Chomsky hierarchy:

NLP: mildly context-sensitive languages

substitution rule: equivalent elementary trees - combine the elementary trees .

pomme de terre problem :

lexicalisation : non-anchored nodes. (one terminal symbol)

substitution operation.

adjunction operation

initial tree

auxiliary tree (root and foot node carry the same label)

the spine of the auxiliary tree

adjunction of the auxiliary tree: split the node; the auxiliary tree then merges into the gap

full line for a substitution operation

dashed line for an adjunction operation

wrapping tree: the middle operation (add bracket !)

for more, look at the pumping lemma for CFGs and TAGs.

Complexity:

forbid wrapping auxiliary trees,

TIG : tree insertion grammar

major ways out: metagrammars; or extract the grammar from a treebank.

PCFGs: probabilistic CFGs are a direct extension of CFGs,

the probabilities of all possible ways to rewrite a symbol must sum to 1

multiple analyses : redundancies

use a representation that captures the redundancies

- instantiation :

the probability computation can be done along the parse forest

PCFGs (independence between different parts of the forest)
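
A minimal sketch of a toy PCFG in NLTK, illustrating that the probabilities of the rules rewriting each symbol sum to 1 and that the most probable analysis can be read off the parse forest (the grammar is my own invention):

```python
import nltk

# Hypothetical toy PCFG: for each left-hand side the rule probabilities sum to 1.
grammar = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
NP -> N N [0.3] | DET N [0.4] | N [0.3]
VP -> V NP [0.6] | V PP [0.4]
PP -> P NP [1.0]
DET -> 'a' [1.0]
N -> 'fruit' [0.4] | 'flies' [0.3] | 'banana' [0.3]
V -> 'flies' [0.5] | 'like' [0.5]
P -> 'like' [1.0]
""")

# ViterbiParser returns the most probable tree for an ambiguous sentence.
parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("fruit flies like a banana".split()):
    print(tree, tree.prob())
```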

preliminary introduction to the assignment:

parsing - CYK algorithm for CFG

deterministic : one tree for one sentence

nlk: any kind of CFG , including the ambiguous ones: 3 main kinds

- Earley (1970)

- generalized LR algorithm

- CYK algo.

CYK

a recognizer rather than a parser

is the input string in the language defined by the grammar?

Chomsky normal form: all rewriting rules take one of three forms:

A -> B C, A -> a, S -> epsilon

it is a bottom-up algorithm based on dynamic programming
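
A minimal sketch of the CYK recognizer for a grammar already in the normal form above (the data structures and the toy grammar are mine):

```python
from collections import defaultdict

def cyk_recognize(words, lexical, binary, start="S"):
    """CYK recognition for a CNF grammar.
    lexical: terminal -> set of non-terminals (A -> a rules)
    binary:  (B, C)   -> set of non-terminals (A -> B C rules)"""
    n = len(words)
    chart = defaultdict(set)  # chart[(i, j)]: non-terminals deriving words[i:j]
    for i, w in enumerate(words):
        chart[(i, i + 1)] = set(lexical.get(w, ()))
    for span in range(2, n + 1):          # bottom-up over span lengths
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):      # split point
                for B in chart[(i, k)]:
                    for C in chart[(k, j)]:
                        chart[(i, j)] |= binary.get((B, C), set())
    return start in chart[(0, n)]

lexical = {"the": {"DET"}, "cat": {"N"}, "sleeps": {"VP"}}
binary = {("DET", "N"): {"NP"}, ("NP", "VP"): {"S"}}
print(cyk_recognize("the cat sleeps".split(), lexical, binary))  # True
```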

Lecture 6

MST parser

- create all dependencies

- weigh

- optimise and extract the best

Arc standard

- at each step, enumerate the possible transitions and use a score to select the best one (see the sketch below)

- linear complexity !
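
A minimal sketch of the arc-standard transition loop; the oracle is a placeholder that a real parser replaces with a scored classifier. Each word is shifted once and removed by one arc action, hence the linear number of transitions:

```python
def arc_standard(words, oracle):
    """Minimal arc-standard loop. `oracle(stack, buffer)` returns 'SHIFT',
    'LEFT' or 'RIGHT'; heads[i] ends up holding the head index of word i."""
    stack, buffer = [0], list(range(1, len(words) + 1))  # 0 is the artificial root
    heads = {}
    while buffer or len(stack) > 1:
        action = oracle(stack, buffer)
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT":    # second-topmost word takes the topmost as head
            dep = stack.pop(-2)
            heads[dep] = stack[-1]
        elif action == "RIGHT":   # topmost word takes the second-topmost as head
            dep = stack.pop()
            heads[dep] = stack[-1]
    return heads

# Toy oracle for "John sleeps": shift everything, then LEFT, then RIGHT.
def toy_oracle(stack, buffer):
    if buffer:
        return "SHIFT"
    return "LEFT" if len(stack) > 2 else "RIGHT"

print(arc_standard(["John", "sleeps"], toy_oracle))  # {1: 2, 2: 0}
```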

MaltParser

an oracle that correctly predicts the next transition.

approximate it with a linear classifier: score = w · f(c, t)

- sensitive to error propagation

local learning (using the history)

MSTParser

for short dependencies Malt is a little better; on longer dependencies MST is better

Beam search O(n)

Increasing the beam size usually helps, yet it should not be too large.

92% Yue Zhang

- online reordering and projective parsing

Non-projective: crossing arcs (projective: no crossing arcs)

yet a non-projective tree can be turned into a projective one by reordering the words

Given a dependency tree, add a swap operation.

LAS: labeled attachment score

UAS: unlabeled attachment score
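
A minimal sketch of computing UAS and LAS from gold and predicted (head, label) pairs (the representation is my own choice):

```python
def attachment_scores(gold, predicted):
    """gold / predicted: one (head_index, label) pair per word."""
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / len(gold)  # head only
    las = sum(g == p for g, p in zip(gold, predicted)) / len(gold)        # head + label
    return uas, las

print(attachment_scores([(2, "nsubj"), (0, "root")],
                        [(2, "obj"), (0, "root")]))  # (1.0, 0.5)
```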

tagger errors: alleviated by doing tagging and parsing at the same time

shifting and tagging operation

Arc eager:

- cannot guarantee the number of operations (additional)

- formally the output is not guaranteed to be a single tree (it may be several trees).

extending the arc eager algorithm

Disfluencies

10% of the words in speech are meaningless (disfluent)

reparandum (edited phrase)

filled pauses, uh, um

eg: I want a flight to Boston, uh, I mean to Denver.

The best option is a joint architecture that does disfluency detection and parsing together.

Rasooli and Tetreault, 2013-2014

Add three new actions to remove the disfluent words.

============= Neural transition based ============

Chen and Manning 2014

replace the feature-based action scoring with a neural network

1-2 page report, no coding; investigate machine translation

next session: a Facebook researcher
