Rethinking "A refinement..."


The paper "Hierarchically classifying documents using very few words" gives a better explanation of why refinement works without overfitting. It proposes a new classification method organized as a hierarchy. The procedure is essentially the same as in "A refinement approach to handling model misfit in text categorization" (a binary classifier), but more complex and more manual (note that the hierarchical method is not a binary classifier). The hierarchy is constructed using mutual information and feature selection. The main idea is:

"...The flattened classifier loses the intuition that topics that are close to each other in the hierarchy have a lot more in common with each other, in general, than topics that are very apart. Therefore, even when it is difficult to find the precise topic of a document, it may be easy to decide whether it is about "agriculture" or about "computers".
...
The key insight is that each of these subtasks is significantly simpler than the original task..."
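The divide-and-conquer scheme the quote describes can be sketched as follows. This is a minimal toy illustration, not the paper's exact algorithm: the corpus, the choice of multinomial Naive Bayes at each node, and the parameter `k` are all assumptions made for the sake of a runnable example. A root classifier first makes the easy coarse decision ("agriculture" vs. "computers") using a few words selected by mutual information; each branch then has its own classifier, trained only on that branch's documents, for the simpler subtask.

```python
import math
from collections import Counter, defaultdict

def mutual_information(docs, labels, k):
    """Pick the k words whose presence tells us most about the label
    (binary-presence MI, a common feature-selection criterion)."""
    n, label_count = len(docs), Counter(labels)
    word_doc = defaultdict(Counter)  # word -> label -> #docs containing word
    for doc, lab in zip(docs, labels):
        for w in set(doc):
            word_doc[w][lab] += 1
    scores = {}
    for w, per_label in word_doc.items():
        total_w, mi = sum(per_label.values()), 0.0
        for lab, n_lab in label_count.items():
            for present in (True, False):
                n_wl = per_label[lab] if present else n_lab - per_label[lab]
                n_w = total_w if present else n - total_w
                if n_wl:  # P(w,c) * log( P(w,c) / (P(w) P(c)) )
                    mi += (n_wl / n) * math.log(n_wl * n / (n_w * n_lab))
        scores[w] = mi
    # deterministic tie-break on the word itself
    return set(sorted(scores, key=lambda w: (-scores[w], w))[:k])

class NaiveBayes:
    """Multinomial Naive Bayes restricted to a given vocabulary."""
    def fit(self, docs, labels, vocab):
        self.vocab, self.n = vocab, len(docs)
        self.prior = Counter(labels)
        self.counts = {lab: Counter() for lab in self.prior}
        for doc, lab in zip(docs, labels):
            self.counts[lab].update(w for w in doc if w in vocab)
        return self

    def predict(self, doc):
        def logp(lab):
            denom = sum(self.counts[lab].values()) + len(self.vocab)  # Laplace
            return math.log(self.prior[lab] / self.n) + sum(
                math.log((self.counts[lab][w] + 1) / denom)
                for w in doc if w in self.vocab)
        return max(self.prior, key=logp)

# toy corpus labelled with (top-level topic, subtopic)
data = [
    ("wheat corn harvest field".split(), ("agriculture", "crops")),
    ("corn field tractor harvest".split(), ("agriculture", "crops")),
    ("cattle sheep pasture field".split(), ("agriculture", "livestock")),
    ("sheep cattle barn pasture".split(), ("agriculture", "livestock")),
    ("cpu ram chip circuit".split(), ("computers", "hardware")),
    ("chip circuit cpu cooling".split(), ("computers", "hardware")),
    ("python compiler code bug".split(), ("computers", "software")),
    ("code bug python debugger".split(), ("computers", "software")),
]
docs = [d for d, _ in data]

# root node: separate the coarse topics using a few MI-selected words
root_labels = [lab[0] for _, lab in data]
root = NaiveBayes().fit(docs, root_labels,
                        mutual_information(docs, root_labels, k=10))

# one child classifier per branch, trained only on that branch's documents
children = {}
for topic in set(root_labels):
    sdocs = [d for d, lab in data if lab[0] == topic]
    slabels = [lab[1] for _, lab in data if lab[0] == topic]
    children[topic] = NaiveBayes().fit(
        sdocs, slabels, mutual_information(sdocs, slabels, k=10))

def classify(doc):
    topic = root.predict(doc)                   # easy coarse decision first
    return topic, children[topic].predict(doc)  # then the simpler subtask
```

Each node only has to tell apart its own children, so a handful of words suffices per node, which is exactly the "each subtask is simpler" point.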

By comparison, the procedure in "A refinement..." is implicit: there is no mutual-information step deciding which features belong to each node, as in a decision tree; rather, like boosting, it operates on the misclassified examples. The effect should be the same: get rid of confusing, noisy, and irrelevant examples (or words) by selecting the misclassified examples (the correctly classified ones need not be considered). For binary classification, however, this explanation is problematic: there is only one category. I think the explanation should instead be: the noise is not semantic word noise but noise caused by data skew. The words in the training examples are not uniformly distributed, so the term P(w|c) is not properly normalized; keeping the misclassified examples in focus alleviates this situation.
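A minimal sketch of this boosting-like refinement idea, as I read it; the toy data, the function names, and the exact splitting scheme are my own illustrative assumptions, not the paper's procedure. The base classifier's predictions partition the training set, and a second-stage classifier is trained inside each partition on the true labels, so it concentrates on exactly the examples the base model confuses. The toy data is deliberately skewed: spam outnumbers ham, and one ham message uses mostly spam-flavoured words, so the base model's P(w|c) estimates pull it the wrong way.

```python
import math
from collections import Counter

def fit_nb(docs, labels):
    """Tiny multinomial Naive Bayes; returns a predict function."""
    n, prior, counts = len(docs), Counter(labels), {}
    for doc, lab in zip(docs, labels):
        counts.setdefault(lab, Counter()).update(doc)
    vocab = {w for c in counts.values() for w in c}
    def predict(doc):
        def logp(lab):
            denom = sum(counts[lab].values()) + len(vocab)  # Laplace smoothing
            return math.log(prior[lab] / n) + sum(
                math.log((counts[lab][w] + 1) / denom) for w in doc)
        return max(prior, key=logp)
    return predict

# skewed toy data: the last ham example shares most of its words with spam
train = [
    ("buy cheap pills now".split(), "spam"),
    ("cheap pills offer now".split(), "spam"),
    ("offer buy cheap pills".split(), "spam"),
    ("meeting notes agenda review".split(), "ham"),
    ("buy offer now invoice".split(), "ham"),   # the confusing ham example
]
docs, labels = zip(*train)
base = fit_nb(docs, labels)

# refinement: the base model's *predictions* partition the training set;
# within each partition, a second classifier is trained on the true labels
buckets = {}
for doc, lab in train:
    buckets.setdefault(base(doc), []).append((doc, lab))
refiners = {}
for pred, items in buckets.items():
    d, l = zip(*items)
    # a bucket whose true labels are already pure needs no refiner
    refiners[pred] = fit_nb(d, l) if len(set(l)) > 1 else (lambda doc, y=l[0]: y)

def refined(doc):
    return refiners[base(doc)](doc)
```

Here the base model misclassifies the skewed ham example as spam; the refiner trained inside the "predicted spam" bucket sees a less skewed mixture (only three spam documents against that ham document, over the bucket's smaller vocabulary) and recovers the correct label, which is the sense in which refinement counteracts the skew in P(w|c).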

The next problem is overfitting. According to the explanation above, it seems inevitable, because the word distribution captured by the classifier is just the distribution of the training examples. Perhaps the experiments in "A refinement..." are biased, especially on the second data collection, Usenet.

 