A Text Classifier Based on a Log-Linear Model
A simple model from scratch.
Dataset
Implementation
Preprocess
- Sample the data to reduce the dataset size
- Merge the Title and the Content
- With the help of nltk (see the sketch after this list):
  - Remove punctuation and numbers
  - Remove URLs
  - Split the content into words
  - Filter stopwords
  - Stem and lemmatize
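To make the pipeline concrete, here is a minimal sketch of these steps with nltk; the regex patterns, the tokenizer, and the choice of PorterStemmer and WordNetLemmatizer are illustrative assumptions, not necessarily what the original code used.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads for the resources used below:
# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(title, content):
    """Clean one document and return a list of normalised tokens."""
    text = f"{title} {content}"                          # merge Title and Content
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)             # remove punctuation and numbers
    tokens = nltk.word_tokenize(text.lower())            # split into words
    tokens = [t for t in tokens if t not in STOPWORDS]   # filter stopwords
    # Lemmatise, then stem; the post applies both but does not fix an order.
    return [stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens]
```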
Feature Extraction
Use TF-IDF as the Feature
- Keep only the most frequent words for a reasonable feature size
- Calculate TF-IDF
$$TF(t,d) = \frac{\text{count}(t, d)}{\sum_k \text{count}(k, d)},$$

where $\text{count}(t, d)$ is the count of term $t$ in document $d$.
$$IDF(t, D) = \log \frac{N + 1}{\text{num}(t, D) + 1} + 1,$$

where $\text{num}(t, D)$ is the number of documents in $D$ that contain term $t$, $D$ is the set of all documents, and $N = |D|$.
$$\text{TF-IDF} = TF \times IDF$$
Note that L2 normalisation is applied to the final TF-IDF matrix for better performance.
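As a concrete reference, the numpy sketch below computes the matrix these formulas define, including the frequency cut-off and the final L2 normalisation; the vocabulary size of 5000 is an assumed placeholder, not a value from the post.

```python
import numpy as np
from collections import Counter

def tfidf_features(docs, vocab_size=5000):
    """docs: list of token lists. Returns an (N, F) L2-normalised TF-IDF matrix."""
    # Keep only the most frequent words for a reasonable feature size.
    freq = Counter(t for doc in docs for t in doc)
    vocab = {t: i for i, (t, _) in enumerate(freq.most_common(vocab_size))}

    N, F = len(docs), len(vocab)
    tf = np.zeros((N, F))
    for i, doc in enumerate(docs):
        for t in doc:
            if t in vocab:
                tf[i, vocab[t]] += 1
        tf[i] /= max(len(doc), 1)           # TF: term count over total terms in d

    df = (tf > 0).sum(axis=0)               # num(t, D): documents containing t
    idf = np.log((N + 1) / (df + 1)) + 1    # smoothed IDF, matching the formula above

    X = tf * idf
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12), vocab   # L2 normalisation
```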
Log-Linear Model
- Logistic Regression Model

  $$\hat{y} = \text{softmax}\big( X_{N,F} W_{F,C} + b_{C} \big),$$

  where $N$ is the size of the training data, $F$ is the number of features, and $C$ is the number of classes.

- Cross-Entropy Loss

  $$\text{loss} = -\frac{1}{N} \sum_{i=1}^N \sum_{j=1}^{C} y_{i,j} \log \hat{y}_{i,j},$$

  where $y_{i,j}=1$ if training text $i$ belongs to class $j$, and $0$ otherwise.

- Gradients

  $$dW = \frac{1}{N} X^T \big(\hat{y} - y\big), \qquad db = \frac{1}{N} \sum_{i=1}^{N} \big(\hat{y}_i - y_i\big)$$
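These formulas translate almost line for line into numpy. The sketch below assumes $y$ is one-hot encoded with shape (N, C), and adds a standard max-shift in the softmax and a clip in the loss for numerical stability, neither of which the post mentions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, W, b):
    """X: (N, F), W: (F, C), b: (C,). Returns y_hat with shape (N, C)."""
    return softmax(X @ W + b)

def cross_entropy(y_hat, y):
    """y is one-hot with shape (N, C); clip to avoid log(0)."""
    return -np.mean(np.sum(y * np.log(np.clip(y_hat, 1e-12, 1.0)), axis=1))

def gradients(X, y_hat, y):
    """Gradients of the loss w.r.t. W and b, as derived above."""
    N = X.shape[0]
    dW = X.T @ (y_hat - y) / N              # shape (F, C)
    db = (y_hat - y).sum(axis=0) / N        # shape (C,)
    return dW, db
```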
Update Algorithm
Gradient descent with a shrinking learning rate.
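A minimal training loop, reusing `forward` and `gradients` from the sketch above. The post only says the learning rate shrinks, so the initial rate and the $1/(1 + \text{decay} \cdot t)$ schedule here are assumptions.

```python
import numpy as np

def train(X, y, num_classes, epochs=200, lr0=1.0, decay=0.01):
    """Full-batch gradient descent; reuses forward() and gradients() from above."""
    N, F = X.shape
    W = np.zeros((F, num_classes))
    b = np.zeros(num_classes)
    for t in range(epochs):
        y_hat = forward(X, W, b)
        dW, db = gradients(X, y_hat, y)
        lr = lr0 / (1.0 + decay * t)        # shrinking learning rate (assumed schedule)
        W -= lr * dW
        b -= lr * db
    return W, b
```

A decaying rate lets full-batch gradient descent take large steps early and settle near the optimum later without oscillating.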
Evaluation
- Accuracy

  $$\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN}$$

- F1 Score (macro)

  $$\text{Precision} = \frac{TP}{TP+FP}, \qquad \text{Recall} = \frac{TP}{TP+FN}, \qquad \text{F1} = 2 \times \frac{P \cdot R}{P+R},$$

  $$\text{Macro F1} = \frac{1}{C}\sum_{i=1}^{C}\text{F1}_i,$$

  where the per-class F1 scores are averaged over the $C$ classes.
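Both metrics are straightforward to compute from scratch. The sketch below assumes `y_true` and `y_pred` are integer class indices, computes one-vs-rest counts per class, and averages the per-class F1 scores as in the macro formula above.

```python
import numpy as np

def accuracy(y_true, y_pred):
    return np.mean(y_true == y_pred)

def macro_f1(y_true, y_pred, num_classes):
    """One-vs-rest precision/recall/F1 per class, then the unweighted mean."""
    f1s = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp > 0 else 0.0
        r = tp / (tp + fn) if tp + fn > 0 else 0.0
        f1s.append(2 * p * r / (p + r) if p + r > 0 else 0.0)
    return float(np.mean(f1s))
```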