A Text Classifier Based on a Log-Linear Model
A simple model from scratch.
Dataset
Implementation
Preprocess
- Sample data to reduce dataset size
- Merge the Title and Content
- With the help of nltk:
  - Remove punctuation and numbers
  - Remove URLs
  - Split the Content into words
  - Filter stopwords
  - Stem and lemmatize
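The cleaning steps above can be sketched roughly as follows. This is a standard-library-only illustration, not the original implementation: the function name `preprocess`, the tiny `STOPWORDS` set, and the `normalize` hook are all assumptions made for the example. URLs are stripped before punctuation here, since removing punctuation first would break a URL into fragments that are harder to match. The sampling step is left out.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch);
# nltk.corpus.stopwords offers a fuller list once its data is downloaded.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "are", "for", "on", "at"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]+")


def preprocess(title, content, stopwords=STOPWORDS, normalize=lambda w: w):
    """Merge title and content, clean the text, and return a list of tokens.

    `normalize` is a hypothetical hook for stemming or lemmatizing;
    by default it leaves each word unchanged.
    """
    text = f"{title} {content}".lower()
    text = URL_RE.sub(" ", text)         # remove URLs first
    text = NON_LETTER_RE.sub(" ", text)  # remove punctuation and numbers
    tokens = text.split()                # split into words on whitespace
    return [normalize(w) for w in tokens if w not in stopwords]


tokens = preprocess("Stocks rise 5%", "See https://example.com for the details!")
print(tokens)
```

With nltk installed, a stemmer such as `nltk.stem.PorterStemmer().stem` could be passed as `normalize` to cover the last step.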
Feature Extraction
Use TF-IDF as the Feature
- Keep only the most frequent words for a reasonable feature size
- Calculate TF-IDF
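One way to cap the feature size is to count every token in the corpus and keep the top `max_features` terms. The sketch below assumes "most frequent" means total corpus count (document frequency would be an equally plausible reading); `build_vocabulary` and `max_features` are names chosen for this example.

```python
from collections import Counter


def build_vocabulary(docs, max_features):
    """Map the `max_features` most frequent terms to feature indices.

    `docs` is an iterable of token lists, one list per document.
    """
    counts = Counter(token for doc in docs for token in doc)
    # most_common sorts by count, highest first; equal counts keep
    # the order in which the terms were first seen.
    top_terms = counts.most_common(max_features)
    return {term: index for index, (term, _) in enumerate(top_terms)}


docs = [["a", "b", "a"], ["b", "c", "a"]]
vocab = build_vocabulary(docs, max_features=2)
print(vocab)
```

Terms that fall outside the vocabulary are simply ignored when the feature vectors are built.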
$$TF(t,d) = \frac{\text{count}(t, d)}{\sum_k \text{count}(k, d)}$$
where $\text{count}(t, d)$ is the number of occurrences of term $t$ in document $d$.
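As a small illustration of the term-frequency formula, the sketch below divides each term's count by the total number of tokens in the document. The function name `term_frequency` is an assumption, and so is the choice to sum over all tokens of the document rather than only those kept in the vocabulary.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {term: count(term, d) / sum_k count(k, d)} for one document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}  # an empty document has no term frequencies
    return {term: count / total for term, count in counts.items()}


tf = term_frequency(["cat", "sat", "cat", "mat"])
print(tf["cat"])  # 2 of the 4 tokens are "cat", so this prints 0.5
```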
$$IDF(t, D) = \log \frac{N + 1}{num(t, D) + 1} + 1$$
where num(t,