Natural Language Processing Knowledge

This article surveys key techniques and applications of natural language processing (NLP), including tasks such as spam detection, sentiment analysis, and named entity recognition, and discusses the challenges NLP faces, such as non-standard language and the need for world knowledge. It also covers basic NLP skills, such as regular expressions, tokenization, and various machine learning methods.

Natural Language Processing

[Based on Stanford NLP course]

Preface

Language Technology

  • Mostly solved:
    • Spam detection
    • Part-of-speech (POS) tagging
    • Named Entity Recognition (NER)
  • Making good progress
    • Sentiment analysis
    • Co-reference resolution
    • Word sense disambiguation (WSD)
    • Parsing
    • Machine translation (MT)
    • Information Extraction (IE)
  • Still really hard
    • Question answering (QA)
    • Paraphrase
    • Summarization
    • Dialog

Why is NLP difficult?

  • Non-standard language
  • Segmentation issues
  • Idioms
  • Neologisms
  • World knowledge
  • Tricky entity names

Basic skills

  • Regular Expressions
  • Tokenization
  • Word Normalization and Stemming
  • Classifiers
    • Decision Tree
    • Logistic Regression
    • SVM
    • Neural Nets

Edit Distance

  • Used for
    • Spell correction
    • Computational Biology
  • Basic operations
    • Insertion
    • Deletion
    • Substitution
  • Algorithm
    • Levenshtein
    • Back trace
    • Needleman-Wunsch
    • Smith-Waterman
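
A minimal dynamic-programming sketch of the Levenshtein algorithm above, assuming unit cost for insertion, deletion, and substitution; the function name and example strings are only illustrative.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn `source` into `target` (unit costs assumed)."""
    m, n = len(source), len(target)
    # dp[i][j] = edit distance between source[:i] and target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete everything remaining in source
    for j in range(n + 1):
        dp[0][j] = j          # insert everything remaining in target
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,                 # deletion
                           dp[i][j - 1] + 1,                 # insertion
                           dp[i - 1][j - 1] + sub_cost)      # substitution
    return dp[m][n]

print(levenshtein("intention", "execution"))  # -> 5 with unit substitution cost
```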

Language Model

Probabilistic Language Models

  • Machine translation
  • Spell correction
  • Speech Recognition
  • Summarization
  • Question answering

Markov Assumption

$$ P(\omega_1 \omega_2 \dots \omega_n) \approx \prod_i P(\omega_i | \omega_{i-k} \dots \omega_{i-1}) $$

Unigram Model

$$ P(\omega_1 \omega_2 \dots \omega_n) \approx \prod_i P(\omega_i) $$

Bigram Model

$$ P(\omega_i | \omega_1 \omega_2 \dots \omega_{i-1}) \approx P(\omega_i | \omega_{i-1}) $$

Add-k Smoothing

$$ P_{Add-k}(\omega_i|\omega_{i-1})=\tfrac{c(\omega_{i-1},\omega_i)+k}{c(\omega_{i-1})+kV} $$
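
As a concrete illustration of the add-k estimate, here is a small sketch computed from raw bigram and unigram counts; the toy corpus, the sentence-boundary markers, and the value of k are assumptions.

```python
from collections import Counter

# Toy corpus with assumed sentence-boundary markers.
corpus = [["<s>", "i", "like", "nlp", "</s>"],
          ["<s>", "i", "like", "deep", "learning", "</s>"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
V = len(unigram_counts)  # vocabulary size

def p_add_k(word, prev, k=0.5):
    """Add-k smoothed bigram probability P(word | prev)."""
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * V)

print(p_add_k("like", "i"))    # seen bigram: relatively high probability
print(p_add_k("nlp", "deep"))  # unseen bigram: small but non-zero
```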

Unigram prior smoothing

$$ P_{Add-k}(\omega_i|\omega_{i-1})=\tfrac{c(\omega_{i-1},\omega_i)+m(\tfrac{1}{V})}{c(\omega_{i-1})+m} $$

Smoothing Algorithm

  • Good-Turing
  • Kneser-Ney
  • Witten-Bell

Spelling Correction

  • tasks:
    • Spelling error detection
    • Spelling error correction
      • Autocorrect
      • Suggest a correction
      • Suggestion lists
  • Real word spelling errors:
    • For each word $w$, generate candidate set
    • Choose best candidate
    • Find the correct word $w$
  • Candidate generation
    • words with similar spelling
    • words with similar pronunciation
  • Factors that could influence p(misspelling | word)
    • source letter
    • target letter
    • surrounding letter
    • the position in the word
    • nearby keys on the keyboard
    • homology on the keyboard
    • pronunciation
    • likely morpheme transformations

Text Classification

Used for:

  • Assigning subject categories, topics, or genres
  • Spam detection
  • Authorship identification
  • Age/gender identification
  • Language identification
  • Sentiment analysis

Methods: Supervised Machine Learning

  • Naive Bayes
  • Logistic Regression
  • Support Vector Machines
  • k-Nearest Neighbors

Naive Bayes

$$ C_{MAP}=\arg\max_{c\in C}P(x_1,x_2,\dots,x_n|c)P(c) $$
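
A minimal multinomial Naive Bayes sketch of the MAP rule above with Laplace (add-1) smoothing; the tiny sentiment data set is assumed purely for illustration.

```python
import math
from collections import Counter, defaultdict

# Tiny labeled corpus (assumed for illustration only).
train = [("good great fun", "pos"), ("great plot", "pos"),
         ("boring bad", "neg"), ("bad acting boring plot", "neg")]

class_docs = defaultdict(list)
for text, label in train:
    class_docs[label].append(text.split())

vocab = {w for text, _ in train for w in text.split()}
priors = {c: len(docs) / len(train) for c, docs in class_docs.items()}
word_counts = {c: Counter(w for doc in docs for w in doc) for c, docs in class_docs.items()}
totals = {c: sum(word_counts[c].values()) for c in class_docs}

def classify(text):
    """Return argmax_c [ log P(c) + sum_i log P(x_i | c) ] with add-1 smoothing."""
    scores = {}
    for c in class_docs:
        score = math.log(priors[c])
        for w in text.split():
            if w in vocab:  # words unseen in training are ignored (a modelling choice)
                score += math.log((word_counts[c][w] + 1) / (totals[c] + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("great fun"))        # -> pos
print(classify("boring bad plot"))  # -> neg
```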

  • Laplace (add-1) Smoothing
  • Used for Spam Filtering
  • Training data:
    • No training data: manually written rules
    • Very little data:
      • Use Naive Bayes
      • Get more labeled data
      • Try semi-supervised training methods
    • A reasonable amount of data:
      • All clever Classifiers:
        • SVM
        • Regularized Logistic Regression
      • User-interpretable decision trees
    • A huge amount of data:
      • At a cost
        • SVM (train time)
        • kNN (test time)
        • Regularized Logistic Regression can be somewhat better
      • Naive Bayes
  • Tweak performance
    • Domain-specific
    • Collapse terms
    • Upweighting some words

F Measure

Precision: % of selected items that are correct

Recall: % of correct items that are selected

|              | correct | not correct |
| ------------ | ------- | ----------- |
| selected     | tp      | fp          |
| not selected | fn      | tn          |

$$ F=\tfrac{1}{\alpha \tfrac{1}{P} +(1-\alpha)\tfrac{1}{R}}=\tfrac{(\beta^2+1)PR}{\beta^2 P+R} $$
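
A short sketch computing precision, recall, and the F-measure above from the contingency table; the counts are assumed values.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """Precision, recall, and F_beta from contingency-table counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f

# Assumed example counts: 40 true positives, 10 false positives, 20 false negatives.
p, r, f1 = f_measure(tp=40, fp=10, fn=20)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.8 0.67 0.73
```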

Sentiment Analysis

  • Typology of Affective States
    • Emotion
    • Mood
    • Interpersonal stances
    • Attitudes (Sentiment Analysis)
    • Personality traits
  • Baseline Algorithm
    • Tokenization
    • Feature Extraction
    • Classification
      • Naive Bayes
      • MaxEnt (better)
      • SVM (better)
  • Issues
    • HTML, XML and other markups
    • Capitalization
    • Phone numbers, dates
    • Emoticons

Sentiment Lexicons

  • semi-supervised learning of lexicons
    • use a small amount of information
    • to bootstrap a lexicon
    • adjectives conjoined by “and” have same polarity
    • adjectives conjoined by “but” do not
  • Process
    • label seed set of adjectives
    • expand seed set to conjoined adjectives
    • supervised classifier assigns “polarity similarity” to each word pair
    • clustering for partitioning
  • Turney Algorithm
    • extract a phrasal lexicon from reviews
    • learn polarity of each phrase
    • rate a review by the average polarity of its phrases
  • Advantages
    • domain-specific
    • more robust
  • Assume classes have equal frequencies:
    • if not balanced: need to use F-scores
    • severe imbalancing also can degrade classifier performance
    • solutions:
      • Resampling in training
      • Cost-sensitive learning
        • penalize SVM more for misclassification of the rare thing
  • Features:
    • Negation is important
    • Using all words works well for some tasks (NB)
    • Finding subsets of words may help in other tasks

Features

Joint and Discriminative

  • Joint (generative) models: place probabilities over both observed data and the hidden stuff ---- P(c,d)
    • N-gram models
    • Naive Bayes classifiers
    • Hidden Markov models
    • Probabilistic context-free grammars
    • IBM machine translation alignment models
  • Discriminative (conditional) models: take the data as given, and put a probability over hidden structure given the data ---- P(c|d)
    • Logistic regression
    • Conditional loglinear
    • Maximum Entropy models
    • Conditional Random Fields
    • SVMs
    • Perceptron

Features

  • Feature Expectations:
    • Empirical count
    • Model expectation
  • Feature-Based Models:
    • Text Categorization
    • Word-Sense Disambiguation
    • POS Tagging

Maximum Entropy

$$ \log P(C|D,\lambda)=\sum_{(c,d)\in (C,D)}\log P(c|d,\lambda)=\sum_{(c,d)\in(C,D)}\log \tfrac{\exp \sum_{i} \lambda_i f_i(c,d)}{\sum_{c'} \exp\sum_i \lambda_i f_i(c',d)} $$
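
A small numpy sketch of this conditional log-likelihood, with P(c|d) computed as a softmax over class scores λ·f(c,d); the toy feature tensor, weights, and gold labels are assumptions.

```python
import numpy as np

# Assumed toy setup: 5 documents, 3 classes, 4 binary features f_i(c, d).
rng = np.random.default_rng(0)
num_classes, num_features = 3, 4
lam = rng.normal(size=num_features)                                  # weights lambda_i
features = rng.integers(0, 2, size=(5, num_classes, num_features))   # f_i(c, d)
gold = np.array([0, 2, 1, 0, 1])                                     # gold class per document

def conditional_log_likelihood(lam, features, gold):
    """sum_d log P(c_d | d, lambda), with P(c|d) = softmax_c( lambda . f(c, d) )."""
    scores = features @ lam                         # shape (num_docs, num_classes)
    log_z = np.log(np.exp(scores).sum(axis=1))      # log of the per-document partition
    return float((scores[np.arange(len(gold)), gold] - log_z).sum())

print(conditional_log_likelihood(lam, features, gold))
```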

  • Find the optimal parameters
    • Gradient descent (GD), Stochastic gradient descent (SGD)
    • Iterative proportional fitting methods: Generalized Iterative Scaling (GIS) and Improved Iterative Scaling (IIS)
    • Conjugate gradient (CG), perhaps with preconditioning
    • Quasi-Newton methods - limited memory variable metric (LMVM) methods, in particular, L-BFGS
  • Feature Overlap
    • Maxent models handle overlapping features well
    • Unlike NB, there is no double counting
  • Feature Interaction
    • Maxent models handle overlapping features well, but do not automatically model feature interactions
    • If you want interaction terms, you have to add them
    • A disjunctive feature would also have done it
  • Smoothing:
    • Issues of scale
      • Lots of features
      • Lots of sparsity
      • Optimization problems
    • Methods
      • Early stopping
      • Priors (MAP)
      • Regularization
      • Virtual Data
      • Count Cutoffs

Named Entity Recognition (NER)

  • The uses:
    • Named entities can be indexed, linked off, etc.
    • Sentiment can be attributed to companies or products
    • A lot of IE relations are associations between named entities
    • For question answering, answers are often named entities
  • Training:
    • Collect a set of representative training documents
    • Label each token for its entity class or other
    • Design feature extractors appropriate to the text and classes
    • Train a sequence classifier to predict the labels from the data
  • Inference
    • Greedy
      • Fast, no extra memory requirements
      • Very easy to implement
      • With rich features including observations to the right, it may perform quite well
      • Greedy: we may commit errors we cannot recover from
    • Beam
      • Fast, beam sizes of 3-5 are almost as good as exact inference in many cases
      • Easy to implement
      • Inexact: the globally best sequence can fall off the beam
    • Viterbi
      • Exact: the globally best sequence is returned (a minimal decoding sketch follows this list)
      • Harder to implement long-distance state-state interactions
  • CRFs
    • Training is slower, but CRFs avoid causal-competition biases
    • In practice usually work much the same as MEMMs
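
As referenced above, here is a minimal Viterbi decoding sketch. It assumes per-position local tag scores (e.g. from a MEMM-style classifier) and a tag-to-tag transition score matrix, both invented for this example.

```python
import numpy as np

def viterbi(local_scores, transition):
    """Return the highest-scoring tag sequence.

    local_scores: (T, K) array, score of tag k at position t (assumed given).
    transition:   (K, K) array, score of moving from tag i to tag j.
    """
    T, K = local_scores.shape
    best = np.full((T, K), -np.inf)
    backptr = np.zeros((T, K), dtype=int)
    best[0] = local_scores[0]
    for t in range(1, T):
        # candidate[i, j] = best score ending in tag i at t-1, then tag j at t
        candidate = best[t - 1][:, None] + transition + local_scores[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        best[t] = candidate.max(axis=0)
    # trace back from the best final tag
    tags = [int(best[-1].argmax())]
    for t in range(T - 1, 0, -1):
        tags.append(int(backptr[t][tags[-1]]))
    return tags[::-1]

# Assumed toy scores: 4 positions, 3 tags (e.g. O, PER, LOC).
local = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [1.0, 1.2, 0.2], [1.8, 0.1, 0.4]])
trans = np.array([[0.5, 0.0, 0.0], [0.0, 0.8, 0.1], [0.0, 0.1, 0.7]])
print(viterbi(local, trans))  # -> [0, 1, 1, 0]
```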

Relation Extraction

  • How to build relation extractors
    • Hand-written patterns
      • High-precision and low-recall
      • Specific domains
      • A lot of work
    • Supervised machine learning
      • MaxEnt, Naive Bayes, SVM
      • Can get high accuracies with enough training data
      • Labeling a large training set is expensive
      • Brittle, don’t generalize well to different genres
    • Semi-supervised and unsupervised
      • Bootstrapping (using seeds)
        • Find sentences with these pairs
        • Look at the context between or around the pair and generalize the context to create patterns
        • Use the patterns to grep for more pairs
      • Distant supervision
        • Doesn’t require iteratively expanding patterns
        • Uses very large amounts of unlabeled data
        • Not sensitive to genre issues in training corpus
      • Unsupervised learning from the web
        • Use parsed data to train a “trustworthy tuple” classifier
        • Single-pass extract all relations between NPs, keep if trustworthy
        • Assessor ranks relations based on text redundancy

POS Tagging

  • Performance
    • About 97% currently
    • But baseline is already 90%
    • Partly easy because
      • Many words are unambiguous
      • You get points for them and for punctuation marks
    • Difficulty
      • ambiguous words
      • common words can be ambiguous
  • Source of information
    • knowledge of neighboring words
    • knowledge of word probabilities
    • word
    • lowercased word
    • prefixes
    • suffixes
    • capitalization
    • word shapes
  • Summary
    • the change from generative to discriminative model does not by itself result in great improvement
    • the higher accuracy of discriminative models comes at the price of much slower training

Parsing

  • Treebank
    • reusability of the labor
      • many parsers, POS taggers, etc.
      • valuable resource for linguistics
    • broad coverage
    • frequencies and distributional information
    • a way to evaluate systems
  • Statistical parsing applications
    • high precision question answering
    • improving biological named entity finding
    • syntactically based sentence compression
    • extracting interaction in computer games
    • helping linguists find data
    • source sentence analysis for machine translation
    • relation extraction systems
  • Phrase structure grammars (= context-free grammars, CFGs) in NLP
    • G = (T, C, N, S, L, R)
      • T is a set of terminal symbols
      • C is a set of preterminal symbols
      • N is a set of nonterminal symbols
      • S is the start symbol
      • L is the lexicon, a set of items of the form X -> x
      • R is the grammar, a set of items of the form X -> $\gamma$
      • e is the empty symbol

Probabilistic Parsing

  • Probabilistic - or stochastic - context-free grammars (PCFGs)
    • G = (T, N, S, R, P)
      • P is a probability function
    • Chomsky Normal Form
      • Reconstructing n-aries is easy
      • Reconstructing unaries/empties is trickier
      • Binarization is crucial for cubic time CFG parsing
    • Cocke-Kasami-Younger (CKY) Constituency Parsing (a minimal sketch follows this list)
      • Unaries can be incorporated into the algorithm
      • Empties can be incorporated
      • Binarization is vital
    • Performance
      • Robust
      • Partial solution for grammar ambiguity
      • Give a probabilistic language model
      • The problem seems to be that PCFGs lack the lexicalization of a trigram model
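
As referenced above, a minimal CKY sketch for a PCFG in Chomsky Normal Form: it fills the chart bottom-up and keeps the best probability per (span, nonterminal). The toy grammar and lexicon are assumptions.

```python
from collections import defaultdict

# Assumed toy CNF grammar: binary rules A -> B C and lexical rules A -> word.
binary_rules = {                # (B, C) -> list of (A, rule probability)
    ("NP", "VP"): [("S", 1.0)],
    ("Det", "N"): [("NP", 0.6)],
    ("V", "NP"):  [("VP", 0.8)],
}
lexicon = {                     # word -> list of (A, P(word | A))
    "the": [("Det", 1.0)], "dog": [("N", 0.5)], "cat": [("N", 0.5)],
    "saw": [("V", 1.0)],
}

def cky(words):
    n = len(words)
    # chart[(i, j)] maps nonterminal -> best probability for span words[i:j]
    chart = defaultdict(dict)
    for i, w in enumerate(words):
        for A, p in lexicon.get(w, []):
            chart[(i, i + 1)][A] = p
    for length in range(2, n + 1):              # span length
        for i in range(0, n - length + 1):
            j = i + length
            for k in range(i + 1, j):           # split point
                for B, pb in chart[(i, k)].items():
                    for C, pc in chart[(k, j)].items():
                        for A, p_rule in binary_rules.get((B, C), []):
                            p = p_rule * pb * pc
                            if p > chart[(i, j)].get(A, 0.0):
                                chart[(i, j)][A] = p
    return chart[(0, n)]

print(cky("the dog saw the cat".split()))  # best S probability over the whole sentence (~0.072)
```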

Lexicalized Parsing

  • Charniak
    • Probabilistic conditioning is “top-down” like a regular PCFG, but actual parsing is bottom-up, somewhat like the CKY algorithm we saw
  • Non-Independence
    • The independence assumptions of a PCFG are often too strong
    • We can relax independence assumptions by encoding dependencies into the PCFG symbols, by state splitting (sparseness)
  • Accurate Unlexicalized Parsing
    • Grammar rules are not systematically specified to the level of lexical items
    • Closed vs. open class words
  • Learning Latent Annotations
    • brackets are known
    • base categories are known
    • induce subcategories
    • clever split/merge category refinement
    • EM, like Forward-Backward for HMMs, but constrained by tree

Dependency Parsing

  • Methods
    • Dynamic programming (like in the CKY algorithm)
    • Graph algorithm
    • Constraint Satisfaction
    • Deterministic Parsing
  • Sources of information
    • Bilexical affinities
    • Dependency distance
    • Intervening material
    • Valency of heads
  • MaltParser
    • Greedy
    • Bottom up
    • Has
      • a stack
      • a buffer
      • a set of dependency arcs
      • a set of actions
    • Each action is predicted by a discriminative classifier (SVM)
    • No search
    • Provides close to state of the art parsing performance
    • Provides very fast linear time parsing
  • Projective
    • Dependencies from a CFG tree using heads, must be projective
    • But dependency theory normally does allow non-projective structure to account for displaced constituents
    • The arc-eager algorithm only builds projective dependency trees
  • Stanford Dependencies
    • Projective
    • Can be generated by postprocessing headed phrase structure parses, or dependency parsers like MaltParser or the Easy-First Parser

Information Retrieval

  • Used for
    • web search
    • e-mail search
    • searching your laptop
    • corporate knowledge bases
    • legal information retrieval

Classic search

  • User task
  • Info need
  • Query & Collection
  • Search engine
  • Results

Initial stages of text processing

  • Tokenization
  • Normalization
  • Stemming
  • Stop words

Query processing

  • AND
    • “merge” algorithm
  • Phrase queries
    • Biword indexes
      • false positives
      • bigger dictionary
    • Positional indexes
      • Extract inverted index entries for each distinct term
      • Merge their doc:position lists to enumerate all positions (a minimal merge sketch follows this list)
      • Same general method for proximity searches
    • A positional index is 2-4 times as large as a non-positional index
    • Caveat: all of this holds for “English-like” language
    • These two approaches can be profitably combined
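
As referenced above, a minimal sketch of the positional merge for a two-word phrase query; the tiny positional index is an assumption.

```python
# Assumed positional index: term -> {doc_id: sorted list of token positions}.
positional_index = {
    "new":  {1: [0, 7], 2: [3]},
    "york": {1: [1, 12], 2: [9]},
}

def phrase_query(term1, term2, index):
    """Return doc ids where term2 occurs immediately after term1."""
    postings1, postings2 = index.get(term1, {}), index.get(term2, {})
    hits = []
    for doc in sorted(set(postings1) & set(postings2)):   # docs containing both terms
        positions2 = set(postings2[doc])
        if any(p + 1 in positions2 for p in postings1[doc]):
            hits.append(doc)
    return hits

print(phrase_query("new", "york", positional_index))  # -> [1]
```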

Ranked Retrieval

  • Advantage
    • Free text queries
    • large result sets are not an issue
  • Query-document matching scores
    • Jaccard coefficient
      • Doesn’t consider term frequency
      • Length normalization needed
    • Bag of words model
      • Term frequency (tf)
      • Log-frequency weighting
      • Inverse document frequency (idf)

tf-idf weighting

$$ W_{t,d}=(1+\log tf_{t,d})\times \log_{10}(N/df_t) $$

  • Best known weighting scheme in information retrieval

Distance: cosine(query, document)

$$ \cos(\vec q,\vec d)=\tfrac{\vec q \bullet \vec d}{|\vec q||\vec d|}=\tfrac{\vec q}{|\vec q|}\bullet \tfrac{\vec d}{|\vec d|}=\tfrac{\sum_{i=1}^{|V|}q_i d_i}{\sqrt{\sum_{i=1}^{|V|}q_i^2}\sqrt{\sum_{i=1}^{|V|}d_i^2}} $$
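
A small sketch computing the tf-idf weights and cosine scores above for a toy collection; the documents and query are assumptions.

```python
import math
from collections import Counter

# Assumed toy collection.
docs = ["the cat sat on the mat", "the dog barked at the cat", "stock markets fell sharply"]
N = len(docs)
doc_tfs = [Counter(d.split()) for d in docs]
df = Counter(t for tf in doc_tfs for t in tf)          # document frequency per term

def tfidf_vector(tf):
    """w_{t,d} = (1 + log10 tf) * log10(N / df_t); terms absent from the collection are skipped."""
    return {t: (1 + math.log10(c)) * math.log10(N / df[t]) for t, c in tf.items() if t in df}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

query_vec = tfidf_vector(Counter("cat on the mat".split()))
scores = sorted(((cosine(query_vec, tfidf_vector(tf)), i) for i, tf in enumerate(doc_tfs)),
                reverse=True)
print(scores)  # document 0 should rank highest for this query
```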

Weighting

  • Many search engines allow for different weightings for queries vs. documents
  • A very standard weighting scheme is: lnc.ltc
  • Document: logarithmic tf, no idf, and cosine normalization
  • Query: logarithmic tf, idf, and cosine normalization

Evaluation

  • Mean average precision (MAP)

Semantic

Situation

  • Reminder: lemma and wordform
  • Homonymy
    • Homographs
    • Homophones
  • Polysemy
  • Synonyms
  • Antonyms
  • Hyponymy and Hypernymy
  • Hyponyms and Instances

Applications of Thesauri and Ontologies

  • Information Extraction
  • Information Retrieval
  • Question Answering
  • Bioinformatics and Medical Informatics
  • Machine Translation

Word Similarity

  • Synonymy and similarity
  • Similarity algorithm
    • Thesaurus-based algorithm
    • Distributional algorithms

Thesaurus-based similarity

$LCS(c_1,c_2)$ = the most informative (lowest) node in the hierarchy subsuming both $c_1$ and $c_2$

$$ Sim_{path}(c_1,c_2)=\tfrac{1}{pathlen(c_1,c_2)} $$

$$ Sim_{resnik}(c_1,c_2)=-\log P(LCS(c_1,c_2)) $$

$$ Sim_{lin}(c_1,c_2)=\tfrac{2\log P(LCS(c_1,c_2))}{\log P(c_1)+\log P(c_2)} $$

$$ Sim_{jiangconrath}(c_1,c_2)=\tfrac{1}{\log P(c_1)+\log P(c_2)-2\log P(LCS(c_1,c_2))} $$

$$ Sim_{eLesk}(c_1,c_2)=\sum_{r,q\in RELS}overlap(gloss(r(c_1)),gloss(q(c_2))) $$
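
A small sketch of the path-based and Resnik measures over a toy hypernym hierarchy; the hierarchy, the node probabilities, and the path-length convention are assumptions.

```python
import math

# Assumed toy hierarchy (child -> parent) and node probabilities P(c) from a corpus.
parent = {"dog": "canine", "wolf": "canine", "canine": "animal",
          "cat": "feline", "feline": "animal", "animal": None}
prob = {"dog": 0.05, "wolf": 0.01, "canine": 0.08, "cat": 0.06, "feline": 0.07, "animal": 0.3}

def ancestors(c):
    chain = []
    while c is not None:
        chain.append(c)
        c = parent[c]
    return chain

def lcs(c1, c2):
    """Lowest common subsumer: first ancestor of c1 that also subsumes c2."""
    anc2 = set(ancestors(c2))
    return next(a for a in ancestors(c1) if a in anc2)

def sim_path(c1, c2):
    common = lcs(c1, c2)
    # edges on the path plus 1, so identical concepts get similarity 1 (assumed convention)
    pathlen = ancestors(c1).index(common) + ancestors(c2).index(common) + 1
    return 1.0 / pathlen

def sim_resnik(c1, c2):
    return -math.log(prob[lcs(c1, c2)])

# Siblings (dog, wolf) score higher than cousins (dog, cat) under both measures.
print(lcs("dog", "wolf"), sim_path("dog", "wolf"), round(sim_resnik("dog", "wolf"), 2))
print(lcs("dog", "cat"), sim_path("dog", "cat"), round(sim_resnik("dog", "cat"), 2))
```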

  • Evaluating
    • Intrinsic
      • Correlation between algorithm and human word similarity ratings
    • Extrinsic (task-based, end-to-end)
      • Malapropism (Spelling error) detection
      • WSD
      • Essay grading
      • Taking TOEFL multiple-choice vocabulary tests
  • Problems
    • We don’t have a thesaurus for every language
    • recall
      • missing words
      • missing phrases
      • missing connections between senses
      • works less well for verbs, adj.

Distributional models of meaning

  • For the term-document matrix: tf-idf
  • For the term-context matrix: Positive Pointwise Mutual Information (PPMI) is common

$$ PMI(w_1,w_2)=\log_2\tfrac{P(w_1,w_2)}{P(w_1)P(w_2)} $$
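
A short numpy sketch of PPMI over a toy term-context count matrix; the counts are assumed values.

```python
import numpy as np

# Assumed term-context co-occurrence counts (rows: words, columns: context words).
counts = np.array([[4.0, 0.0, 1.0],
                   [2.0, 3.0, 0.0],
                   [0.0, 1.0, 5.0]])

total = counts.sum()
p_wc = counts / total                      # joint P(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)      # marginal P(w)
p_c = p_wc.sum(axis=0, keepdims=True)      # marginal P(c)

with np.errstate(divide="ignore"):         # log2(0) -> -inf, clipped away below
    pmi = np.log2(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0.0)                # Positive PMI: negative values set to 0
print(np.round(ppmi, 2))
```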

  • PMI is biased toward infrequent events
    • various weighting schemes
    • add-one smoothing

Question Answering

  • Question processing
    • Detect question type, answer type (NER), focus, relations
    • Formulate queries to send a search engine
  • Passage Retrieval
    • Retrieve ranked documents
    • Break into suitable passages and re-rank
  • Answer processing
    • Extract candidate answers
    • Rank candidates

Approaches

  • Knowledge-based
    • build a semantic representation of the query
    • Map from this semantics to query structured data or resources
  • Hybrid
    • build a shallow semantic representation of the query
    • generate answer candidate using IR methods
    • Score each candidate using richer knowledge sources

Answer type taxonomy

  • 6 coarse classes
    • Abbreviation
    • Entity
    • Description
    • Human
    • Location
    • Numeric
  • 50 finer classes
  • Detection
    • Hand-written rules
    • Machine Learning
    • Hybrids
  • Features
    • Question words and phrases
    • Part-of-speech tags
    • Parse features
    • Named Entities
    • Semantically related words

Keyword selection

  1. Non-stop words
  2. NNP words in recognized named entities
  3. Complex nominals with their adjectival modifiers
  4. Other complex nominals
  5. Nouns with their adjectival modifiers
  6. Other nouns
  7. Verbs
  8. Adverbs
  9. QFW word (skipped in all previous steps)
  10. Other words

Passage Retrieval

  • IR engine retrieves documents using query terms
  • Segment the documents into shorter units
  • Passage ranking
    • number of named entities of the right type in passage
    • number of query words in passage
    • number of question N-grams also in passage
    • proximity of query keywords to each other in passage
    • longest sequence of question words
    • rank of the document containing passage

Features for ranking candidate answers

  • answer type match
  • pattern match
  • question keywords
  • keyword distance
  • novelty factor
  • apposition features
  • punctuation location
  • sequences of question terms

Common Evaluation Metrics

  • Accuracy
  • Mean Reciprocal Rank (MRR)

$$ MRR = \tfrac{\sum_{i=1}^N \tfrac{1}{rank_i}}{N} $$
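
A tiny sketch of MRR: average the reciprocal rank of the first correct answer per question (questions with no correct answer contribute 0, an assumed convention).

```python
def mean_reciprocal_rank(first_correct_ranks):
    """first_correct_ranks: rank of the first correct answer per question, or None if absent."""
    rr = [1.0 / r if r is not None else 0.0 for r in first_correct_ranks]
    return sum(rr) / len(rr)

# Assumed example: first correct answer at ranks 1, 3, none, 2 for four questions.
print(mean_reciprocal_rank([1, 3, None, 2]))  # (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458
```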

Summarization

  • Applications
    • outlines or abstracts
    • summaries
    • action items
    • simplifying
  • Three stages
    • content selection
    • information ordering
    • sentence realization
  • salient words
    • tf-idf
    • topic signature
      • mutual information
      • log-likelihood ratio (LLR)

$$ weight(w_i)=\begin{cases}1, & \text{if } -2\log \lambda(w_i)>10 \\ 0, & \text{otherwise}\end{cases} $$

  • Supervised content selection problem
    • hard to get labeled training data
    • alignment difficult
    • performance not better than unsupervised algorithm
  • ROUGE (Recall Oriented Understudy for Gisting Evaluation)
    • Intrinsic metric for automatically evaluating summaries
      • based on BLEU
      • not as good as human evaluation
      • much more convenient

$$ ROUGE\text{-}2=\tfrac{\sum_{S\in RefSummaries}\sum_{bigram_i\in S}\min(count(i,X),count(i,S))}{\sum_{S\in RefSummaries}\sum_{bigram_i\in S}count(i,S)} $$

where $X$ is the candidate summary and $S$ ranges over the reference summaries.
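
A minimal sketch following the formula above: clipped bigram overlap between a candidate summary X and the reference summaries; the example texts are assumptions.

```python
from collections import Counter

def bigrams(text):
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge_2(candidate, references):
    """Clipped bigram recall of `candidate` against the reference summaries."""
    cand = bigrams(candidate)
    overlap, total = 0, 0
    for ref in references:
        ref_bigrams = bigrams(ref)
        overlap += sum(min(cand[b], c) for b, c in ref_bigrams.items())
        total += sum(ref_bigrams.values())
    return overlap / total if total else 0.0

refs = ["the cat sat on the mat"]                # assumed reference summary
print(rouge_2("the cat lay on the mat", refs))   # 3 of 5 reference bigrams matched -> 0.6
```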

  • Maximal Marginal Relevance (MMR)
    • Iteratively (greedily)
      • Relevant: high cosine similarity to the query
      • Novel: low cosine similarity to the summary
    • Stop when desired length
  • Information Ordering
    • Chronological ordering
    • Coherence
    • Topical ordering