A Hand-Written Article Classifier Based on Logistic Regression (AG News)

A Text Classifier based on Log-Linear Model

A simple model from scratch.

code repository

Dataset

AG News

Implementation

Preprocess

  • Sample data to reduce dataset size
  • Merge the Title and Content
  • With the help of nltk:
    • Remove punctuation and numbers
    • Remove URLs
    • Split the Content into words
    • Filter stopwords
    • Stem and lemmatize
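
The steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the post's actual code. The names `sample_rows`, `preprocess`, `naive_stem` and `STOPWORDS` are hypothetical. The tiny stopword set and the suffix stripper stand in for the nltk stopword list and stemmer/lemmatizer the post relies on, so that the sketch runs without downloading any nltk data. URLs are removed before punctuation here, because stripping punctuation first would make them hard to detect.

```python
import random
import re

# Hypothetical stand-in for nltk's English stopword list (assumption).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "for", "and",
             "or", "is", "are", "was", "were", "it", "this", "that", "with"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def sample_rows(rows, k, seed=0):
    """Draw k rows without replacement to shrink the dataset."""
    return random.Random(seed).sample(rows, k)


def naive_stem(word):
    """Crude suffix stripper; a placeholder for a real stemmer/lemmatizer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(title, content):
    """Merge title and content, clean the text, and return a list of tokens."""
    text = f"{title} {content}".lower()
    text = URL_RE.sub(" ", text)        # remove URLs
    text = NON_ALPHA_RE.sub(" ", text)  # remove punctuation and numbers
    tokens = text.split()               # split into words
    tokens = [t for t in tokens if t not in STOPWORDS]  # filter stopwords
    return [naive_stem(t) for t in tokens]


tokens = preprocess("Stocks rally",
                    "Markets jumped 3% today, see http://example.com/news for details.")
```

With real data, swapping `STOPWORDS` and `naive_stem` for the nltk equivalents should only require changing those two names.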

Feature Extraction

Use TF-IDF values as the features

  • Keep only the most frequent words for a reasonable feature size
  • Calculate TF-IDF
    $$TF(t, d) = \frac{\text{count}(t, d)}{\sum_k \text{count}(k, d)}$$
    , where $\text{count}(t, d)$ means the count of term $t$ in document $d$.
    $$IDF(t, D) = \log \frac{N + 1}{\text{num}(t, D) + 1} + 1$$
    , where num($t$,
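
The two formulas can be turned into a short, dependency-free sketch. It makes several assumptions that the text does not confirm: `num(t, D)` is read as the number of documents in `D` that contain `t`, `N` as the total number of documents, the logarithm as the natural log, and the TF denominator as a sum over the retained vocabulary terms only. The function names `build_vocab` and `tfidf` are hypothetical, and each document is assumed to be a list of tokens.

```python
import math
from collections import Counter


def build_vocab(docs, max_features):
    """Map the max_features most frequent terms to column indices."""
    counts = Counter(t for doc in docs for t in doc)
    top = counts.most_common(max_features)
    return {term: i for i, (term, _) in enumerate(top)}


def tfidf(docs, vocab):
    """Return one dense TF-IDF row (list of floats) per document."""
    n_docs = len(docs)
    # Assumed meaning of num(t, D): how many documents contain term t.
    doc_freq = Counter(t for doc in docs for t in set(doc) if t in vocab)
    idf = {t: math.log((n_docs + 1) / (doc_freq[t] + 1)) + 1 for t in vocab}

    rows = []
    for doc in docs:
        counts = Counter(t for t in doc if t in vocab)
        total = sum(counts.values())  # denominator of TF (vocabulary terms only)
        row = [0.0] * len(vocab)
        for term, c in counts.items():
            row[vocab[term]] = (c / total) * idf[term]
        rows.append(row)
    return rows


docs = [["stock", "market", "stock"],
        ["market", "game"],
        ["game", "team", "game"]]
vocab = build_vocab(docs, max_features=3)
features = tfidf(docs, vocab)
```

In this toy corpus "team" falls outside the three most frequent terms, so it is dropped from both the vocabulary and the TF denominator. A document with no vocabulary terms simply yields an all-zero row.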