Architecture
Disadvantages of Shallow Learning
- requires manual feature engineering
- disregards the natural sequential structure / contextual information of text, so it is hard to learn semantic information
Advantages of Deep Learning
- avoids rules and features designed by hand
- learns semantically meaningful representations
Text Classification Methods:
extracting features from raw text data and predicting the categories of text data
Shallow Learning Methods
- preprocess data: word segmentation, data cleaning, data statistics
- text representation: convert the text into a form that is easier for a computer to compute/understand: Bag-of-words (BOW), N-gram, term frequency-inverse document frequency (TF-IDF), word2vec and GloVe
- BOW: representing each text with a dictionary-sized vector
- drawback: cannot properly capture more complex linguistic phenomena in sentiment analysis (see the sketch after this list)
- “white blood cells destroying an infection” and “an infection destroying white blood cells” have the same bag-of-words representation, but the former describes a positive reaction and the latter a negative one
- N-gram: considers the information of adjacent words by building the dictionary from sequences of adjacent words
- TF-IDF: uses the term frequency and the inverse document frequency to model the text
- word2vec: employs local context information to obtain word vectors
- GloVe: uses both the local context and global statistical features; it trains on the nonzero elements in a word-word co-occurrence matrix
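A minimal sketch of the BOW drawback noted above, assuming scikit-learn's `CountVectorizer` is available (the library choice is an assumption, not part of the original notes): the two example sentences get identical dictionary-sized count vectors because word order is discarded.

```python
# Demonstrate that bag-of-words loses word order: both sentences map to the
# same count vector even though their sentiments differ.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "white blood cells destroying an infection",
    "an infection destroying white blood cells",
]

vectorizer = CountVectorizer()                 # unigram bag-of-words
bow = vectorizer.fit_transform(docs).toarray()

print(vectorizer.get_feature_names_out())      # shared vocabulary
print(bow[0])                                  # identical rows ...
print(bow[1])                                  # ... word order is lost
print((bow[0] == bow[1]).all())                # True
```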
- feed the text representation to a classifier (a pipeline sketch follows the classifier list below)
- SVM
- Naive Bayes
- KNN
- Decision Tree
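A hedged sketch of the full shallow-learning pipeline (text representation fed to a classifier), assuming scikit-learn; the toy corpus, labels, and hyperparameters are made up for illustration, and any of the listed classifiers could replace the SVM.

```python
# Shallow pipeline sketch: TF-IDF representation -> linear SVM classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = ["the match was thrilling", "stocks fell sharply today",
               "the striker scored twice", "the market rallied after the report"]
train_labels = ["sports", "finance", "sports", "finance"]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # unigram + bigram TF-IDF features
    LinearSVC(),                            # SVM / NB / KNN / decision tree all fit here
)
clf.fit(train_texts, train_labels)
print(clf.predict(["a thrilling match for the striker"]))  # predicted topic for an unseen text
```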
Deep Learning Methods
Recursive Neural Network Based Methods
- Recursive Neural Network [https://zybuluo.com/hanbingtao/note/626300]
- motivation
- A recurrent neural network (RNN) handles variable-length input by splitting it into equal-sized pieces, e.g. splitting a sentence into words and feeding one word at a time, so sentences of arbitrary length can be processed. But a pure sequence is not enough: the same sentence can carry different meanings under different parses, and to represent those different meanings we can build a tree/graph structure to store the information. This is where a recursive neural network is needed: it encodes a tree/graph structure into a corresponding vector, and distances between vectors can then be computed.
- forward-propagation
- the child nodes are encoded to produce the parent node; the parent has the same dimensionality as each child (a toy sketch follows after this block).
- the neurons between the child nodes and the parent node form a fully connected structure.
- backward-propagation: BPTS (back propagation through structure)
- errors flow from the parent node to the child nodes
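A minimal NumPy sketch of the forward pass described above, using toy dimensions and random weights (the names `W`, `b`, `encode` are illustrative, not from the linked article): each parent is a fully connected tanh combination of its two children and has the same size as each child, so the same words under two different parses yield two different sentence vectors.

```python
import numpy as np

n = 4                                          # node dimensionality (toy value)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n, 2 * n))     # shared fully connected weights
b = np.zeros(n)

def encode(tree, vectors):
    """Recursively encode a binary parse tree into a single n-dim vector.
    Leaves are word ids; internal nodes are (left, right) pairs."""
    if isinstance(tree, tuple):
        left = encode(tree[0], vectors)
        right = encode(tree[1], vectors)
        return np.tanh(W @ np.concatenate([left, right]) + b)  # parent, size n
    return vectors[tree]                        # leaf: look up the word vector

# two parses of the same 4 words produce different sentence vectors
vectors = {i: rng.normal(size=n) for i in range(4)}
parse_a = ((0, 1), (2, 3))
parse_b = (0, (1, (2, 3)))
print(encode(parse_a, vectors))
print(encode(parse_b, vectors))
```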
- Recursive Autoencoder (RAE)
- Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions
- Contribution
- Instead of using a BOW, it uses a hierarchical structure and compositional semantics to understand sentiment.
- both labeled and unlabeled data can be used
- not limited to positive/negative sentiment; it predicts a multidimensional distribution over several complex, interconnected sentiments.
- Neural Word Representation: Word Embedding
- each word vector is $n\times1$; with a vocabulary of $|V|$ words, the embedding matrix $L$ is $n\times|V|$
- Method 1: each word vector is randomly sampled from a zero-mean Gaussian distribution and then gradually adjusted to fit the label distribution
- Method 2: use a model that jointly learns an embedding of words into a vector space and uses these vectors to predict how likely a word is to occur given its context; gradient ascent is used to learn syntactic and semantic information from co-occurrences
- how to select the corresponding embeddings for a given sentence of m words
- each word has its own index $k$ marking its position in the vocabulary; consider the one-hot vector $b_k=[0,0,\dots,1,0,\dots,0]$
- selecting an embedding can be viewed as a matrix multiplication: $Lb_k$, with dimensions $(n\times|V|)\cdot(|V|\times1)=n\times1$ (see the sketch after this list)
- a binary number representation was used before; now the sentence becomes a list of vectors, which is convenient for continuous sigmoid units.
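A small NumPy sketch of the lookup above, with toy sizes for $n$ and $|V|$ (both are assumptions): multiplying the $n\times|V|$ embedding matrix $L$ by a one-hot vector $b_k$ selects column $k$, i.e. the $n\times1$ vector of word $k$.

```python
import numpy as np

n, V = 5, 8                    # embedding size and vocabulary size (toy values)
rng = np.random.default_rng(0)
L = rng.normal(size=(n, V))    # embedding matrix, one column per word

k = 3                          # index of the word in the vocabulary
b_k = np.zeros(V)
b_k[k] = 1.0                   # one-hot vector [0, 0, ..., 1, ..., 0]

x_k = L @ b_k                  # (n x |V|) @ (|V| x 1) -> (n x 1) word vector
assert np.allclose(x_k, L[:, k])   # in practice this is just a column lookup
```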
- how to obtain the reduced-dimensional vector representation
- traditional approach
- the parent node is obtained from two child nodes; take $x_3, x_4 \rightarrow y_1$ as an example
- $y_1 = f(W^1[x_3,x_4]+b)$, where $[x_3,x_4]$ denotes the concatenation of the two vectors, $W^1$ is the corresponding weight matrix of size $n\times2n$, $b$ is the bias, and the activation function $f$ is usually tanh
- the loss comes from reconstructing $[\hat{x}_3,\hat{x}_4]$ from the obtained $y_1$ and computing the Euclidean distance between the reconstruction and the original (a toy sketch follows below).
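A hedged NumPy sketch of the single composition + reconstruction step above. The decoder weights `W2`, `b2` and the tanh on the decoder are illustrative assumptions, not taken from the paper; all weights are random toys.

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(n, 2 * n)), np.zeros(n)       # encoder
W2, b2 = rng.normal(scale=0.1, size=(2 * n, n)), np.zeros(2 * n)   # decoder (assumed form)

x3, x4 = rng.normal(size=n), rng.normal(size=n)
c = np.concatenate([x3, x4])

y1 = np.tanh(W1 @ c + b1)                 # parent vector, same size n as each child
c_hat = np.tanh(W2 @ y1 + b2)             # reconstructed [x3_hat, x4_hat]
E_rec = 0.5 * np.sum((c - c_hat) ** 2)    # Euclidean reconstruction error
print(E_rec)
```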
- Method in this paper: unsupervised
- assume the sentence has m words, $[x_1,x_2,...,x_m]$; traverse all adjacent pairs $[x_i,x_{i+1}]$ and store their reconstruction errors, take the pair with the smallest error to construct a parent $p$, then replace the two children with the parent $p$, repeating until everything has been merged (a greedy-merge sketch follows this sub-list). e.g. $[x_1,x_2,x_3,x_4] \rightarrow [x_1,x_2,p_{(3,4)}]$
- a problem with this reconstruction error: a parent node that has already been merged many times usually carries more weight than nodes merged fewer times or not at all, so a corresponding weighting coefficient is added here
- length normalization: $y_1 = y_1\div||y_1||$
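A minimal sketch of the greedy (unsupervised) tree construction described above: repeatedly score every adjacent pair, merge the pair with the smallest reconstruction error into a parent, length-normalize it, and continue until one vector remains. The toy encoder/decoder form mirrors the previous sketch, and the child-count weighting coefficient mentioned above is omitted for brevity.

```python
import numpy as np

def compose_and_error(left, right, W1, b1, W2, b2):
    """Encode a parent from two children, length-normalize it, and return
    the parent together with its Euclidean reconstruction error."""
    c = np.concatenate([left, right])
    parent = np.tanh(W1 @ c + b1)
    parent = parent / np.linalg.norm(parent)           # length normalization
    c_hat = np.tanh(W2 @ parent + b2)
    return parent, 0.5 * np.sum((c - c_hat) ** 2)

def greedy_rae_tree(xs, W1, b1, W2, b2):
    """Greedily merge the adjacent pair with the smallest reconstruction
    error until a single root vector remains."""
    nodes = list(xs)                                    # [x1, x2, ..., xm]
    while len(nodes) > 1:
        candidates = [compose_and_error(nodes[i], nodes[i + 1], W1, b1, W2, b2)
                      for i in range(len(nodes) - 1)]
        best = int(np.argmin([err for _, err in candidates]))
        parent, _ = candidates[best]
        nodes[best:best + 2] = [parent]                 # e.g. [x1,x2,x3,x4] -> [x1,x2,p(3,4)]
    return nodes[0]

# toy usage with random weights and a 4-word "sentence"
n = 4
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(n, 2 * n)), np.zeros(n)
W2, b2 = rng.normal(scale=0.1, size=(2 * n, n)), np.zeros(2 * n)
sentence = [rng.normal(size=n) for _ in range(4)]
print(greedy_rae_tree(sentence, W1, b1, W2, b2).shape)  # (4,)
```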
- Method in this paper: semi-supervised
- motivation: extend the RAE to predict a sentence/phrase-level distribution
- the RAE naturally has the ability to capture phrase features, because a parent node also has its own distributed vector, and an upper-level parent can be regarded as a phrase feature
- add a cross-entropy error
- probability of the parent's label: $d = softmax(W^{label}p)$, with $K$ labels in total and $\sum_k d_k = 1$; $d_k$ can be regarded as the conditional probability of the parent belonging to class $k$ given the two child nodes.
- the cross-entropy is then $E_{cE} = -\sum_{k=1}^{K} t_k \log d_k$ (with $t_k$ the target distribution)
- the loss function is then the weighted combination of the normalized reconstruction error and the cross-entropy error: $\alpha E_{reconstruction-error}+(1-\alpha)E_{cross-entropy}$ (a small sketch of this combined loss follows below)
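A hedged NumPy sketch of the semi-supervised objective above: a softmax layer on the parent vector predicts a label distribution $d = softmax(W^{label}p)$, the cross-entropy against the target distribution is combined with the reconstruction error, and $\alpha$ trades the two off. `W_label`, `t`, and the toy sizes are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

def combined_loss(p, E_rec, W_label, t, alpha=0.2):
    d = softmax(W_label @ p)                 # predicted label distribution, sums to 1
    E_ce = -np.sum(t * np.log(d + 1e-12))    # cross-entropy against target distribution t
    return alpha * E_rec + (1 - alpha) * E_ce

rng = np.random.default_rng(0)
K, n = 3, 4                                  # number of labels, node size (toy values)
p = rng.normal(size=n)                       # a parent/phrase vector
W_label = rng.normal(scale=0.1, size=(K, n))
t = np.array([0.0, 1.0, 0.0])                # one-hot target for the true label
print(combined_loss(p, E_rec=0.7, W_label=W_label, t=t))
```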
- MV-RNN
- RNTN
- DeepReNN
Multilayer Perceptron Based Methods
- Paragraph-Vec: Distributed Representations of Sentences and Documents
- Summary: argues that BOW is problematic (it ignores word order and semantics), so the paper proposes the Paragraph Vector to learn fixed-length feature representations from variable-length text, evaluated on several text classification and sentiment analysis tasks (see the sketch below)
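A hedged sketch using gensim's `Doc2Vec` (an implementation of the Paragraph Vector idea) to obtain fixed-length vectors from variable-length texts; gensim is an assumed dependency, and the corpus and hyperparameters are toy stand-ins, not the paper's setup.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "the plot was predictable and dull",
    "a delightful film with a strong cast",
    "the service was slow but the food was great",
]
documents = [TaggedDocument(words=text.split(), tags=[i])
             for i, text in enumerate(corpus)]

# learn fixed-length (50-dim) paragraph vectors from variable-length texts
model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=40)

# infer a vector for an unseen text; it can then be fed to any downstream classifier
vec = model.infer_vector("an enjoyable and well acted movie".split())
print(vec.shape)   # (50,)
```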
Datasets
- Sentiment Analysis (SA): binary class and multi-class
- News Classification (NC): recognizing news topics and recommending related news according to user interest.
- Question Answering (QA)
  - Extractive QA: multiple candidate answers are given for each question, and the task is to choose the right one
  - Generative QA: generates the answer itself; usually not considered text classification
- Natural Language Inference (NLI): identify whether the meaning of one text can be deduced from another
Evaluation Metrics
- Single-label: divides the text into one of the most likely categories
  - Acc + Error Rate
  - Precision + Recall + F1: for unbalanced test sets
  - Exact Match (EM): for QA; measures whether the prediction matches all the ground-truth answers exactly
  - Mean Reciprocal Rank (MRR): for QA and Information Retrieval tasks
  - Hamming-loss (HL): assesses the score of misclassified instance-label pairs, where a related label is omitted or an unrelated one is predicted
- Multi-label (a small computation sketch for these metrics follows below)
  - Micro-F1
  - Macro-F1
  - Precision at Top K (P@K)
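A small sketch, assuming scikit-learn, that computes a few of the metrics listed above on made-up predictions: accuracy (error rate is its complement), per-class precision/recall/F1, and micro-/macro-averaged F1.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_recall_fscore_support

y_true = ["sports", "finance", "sports", "tech", "tech", "finance"]
y_pred = ["sports", "sports", "sports", "tech", "finance", "finance"]

print(accuracy_score(y_true, y_pred))                       # Acc; Error Rate = 1 - Acc
print(precision_recall_fscore_support(y_true, y_pred,       # per-class precision/recall/F1
                                       average=None, zero_division=0))
print(f1_score(y_true, y_pred, average="micro"))            # Micro-F1
print(f1_score(y_true, y_pred, average="macro"))            # Macro-F1
```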