Rethinking "A refinement..."


The paper "Hierarchically classifying documents using very few words" gives a better explanation of why refinement works without overfitting. It proposes a new classification method organized as a hierarchy. The procedure is essentially the same as in "A refinement approach to handling model misfit in text categorization" (a binary classifier), but more complex and more manual (note that the hierarchical method is not a binary classifier). The hierarchy is constructed using mutual information and feature selection. The main idea is:

"...The flattened classifier loses the intuition that topics that are close to each other in the hierarchy have a lot more in common with each other, in general, than topics that are very apart. Therefore, even when it is difficult to find the precise topic of a document, it may be easy to decide whether it is about "agriculture" or about "computers".
...
The key insight is that each of these subtasks is significantly simpler than the original task..."
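The divide-and-conquer scheme the quote describes can be sketched as follows. This is a minimal toy illustration, not the paper's exact algorithm: the corpus, the choice of multinomial Naive Bayes at each node, and the parameter `k` are all assumptions made for the sake of a runnable example. A root classifier first makes the easy coarse decision ("agriculture" vs. "computers") using a few words selected by mutual information; each branch then has its own classifier, trained only on that branch's documents, for the simpler subtask.

```python
import math
from collections import Counter, defaultdict

def mutual_information(docs, labels, k):
    """Pick the k words whose presence tells us most about the label
    (binary-presence MI, a common feature-selection criterion)."""
    n, label_count = len(docs), Counter(labels)
    word_doc = defaultdict(Counter)  # word -> label -> #docs containing word
    for doc, lab in zip(docs, labels):
        for w in set(doc):
            word_doc[w][lab] += 1
    scores = {}
    for w, per_label in word_doc.items():
        total_w, mi = sum(per_label.values()), 0.0
        for lab, n_lab in label_count.items():
            for present in (True, False):
                n_wl = per_label[lab] if present else n_lab - per_label[lab]
                n_w = total_w if present else n - total_w
                if n_wl:  # P(w,c) * log( P(w,c) / (P(w) P(c)) )
                    mi += (n_wl / n) * math.log(n_wl * n / (n_w * n_lab))
        scores[w] = mi
    # deterministic tie-break on the word itself
    return set(sorted(scores, key=lambda w: (-scores[w], w))[:k])

class NaiveBayes:
    """Multinomial Naive Bayes restricted to a given vocabulary."""
    def fit(self, docs, labels, vocab):
        self.vocab, self.n = vocab, len(docs)
        self.prior = Counter(labels)
        self.counts = {lab: Counter() for lab in self.prior}
        for doc, lab in zip(docs, labels):
            self.counts[lab].update(w for w in doc if w in vocab)
        return self

    def predict(self, doc):
        def logp(lab):
            denom = sum(self.counts[lab].values()) + len(self.vocab)  # Laplace
            return math.log(self.prior[lab] / self.n) + sum(
                math.log((self.counts[lab][w] + 1) / denom)
                for w in doc if w in self.vocab)
        return max(self.prior, key=logp)

# toy corpus labelled with (top-level topic, subtopic)
data = [
    ("wheat corn harvest field".split(), ("agriculture", "crops")),
    ("corn field tractor harvest".split(), ("agriculture", "crops")),
    ("cattle sheep pasture field".split(), ("agriculture", "livestock")),
    ("sheep cattle barn pasture".split(), ("agriculture", "livestock")),
    ("cpu ram chip circuit".split(), ("computers", "hardware")),
    ("chip circuit cpu cooling".split(), ("computers", "hardware")),
    ("python compiler code bug".split(), ("computers", "software")),
    ("code bug python debugger".split(), ("computers", "software")),
]
docs = [d for d, _ in data]

# root node: separate the coarse topics using a few MI-selected words
root_labels = [lab[0] for _, lab in data]
root = NaiveBayes().fit(docs, root_labels,
                        mutual_information(docs, root_labels, k=10))

# one child classifier per branch, trained only on that branch's documents
children = {}
for topic in set(root_labels):
    sdocs = [d for d, lab in data if lab[0] == topic]
    slabels = [lab[1] for _, lab in data if lab[0] == topic]
    children[topic] = NaiveBayes().fit(
        sdocs, slabels, mutual_information(sdocs, slabels, k=10))

def classify(doc):
    topic = root.predict(doc)                   # easy coarse decision first
    return topic, children[topic].predict(doc)  # then the simpler subtask
```

Each node only has to tell apart its own children, so a handful of words suffices per node, which is exactly the "each subtask is simpler" point.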

By comparison, the procedure in "A refinement..." is implicit: there is no mutual-information step deciding which features belong to each node, as in a decision tree; rather, like boosting, it operates on the misclassified examples. The effect should be the same: get rid of confusing, noisy, and irrelevant examples (or words) by selecting the misclassified examples (the correctly classified ones need not be considered). For binary classification, however, this explanation is problematic: there is only one category. I think the explanation should instead be: the noise is not semantic word noise but noise caused by data skew. The words in the training examples are not uniformly distributed, so the term P(w|c) is not properly normalized; keeping the misclassified examples in focus alleviates this situation.
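A minimal sketch of this boosting-like refinement idea, as I read it; the toy data, the function names, and the exact splitting scheme are my own illustrative assumptions, not the paper's procedure. The base classifier's predictions partition the training set, and a second-stage classifier is trained inside each partition on the true labels, so it concentrates on exactly the examples the base model confuses. The toy data is deliberately skewed: spam outnumbers ham, and one ham message uses mostly spam-flavoured words, so the base model's P(w|c) estimates pull it the wrong way.

```python
import math
from collections import Counter

def fit_nb(docs, labels):
    """Tiny multinomial Naive Bayes; returns a predict function."""
    n, prior, counts = len(docs), Counter(labels), {}
    for doc, lab in zip(docs, labels):
        counts.setdefault(lab, Counter()).update(doc)
    vocab = {w for c in counts.values() for w in c}
    def predict(doc):
        def logp(lab):
            denom = sum(counts[lab].values()) + len(vocab)  # Laplace smoothing
            return math.log(prior[lab] / n) + sum(
                math.log((counts[lab][w] + 1) / denom) for w in doc)
        return max(prior, key=logp)
    return predict

# skewed toy data: the last ham example shares most of its words with spam
train = [
    ("buy cheap pills now".split(), "spam"),
    ("cheap pills offer now".split(), "spam"),
    ("offer buy cheap pills".split(), "spam"),
    ("meeting notes agenda review".split(), "ham"),
    ("buy offer now invoice".split(), "ham"),   # the confusing ham example
]
docs, labels = zip(*train)
base = fit_nb(docs, labels)

# refinement: the base model's *predictions* partition the training set;
# within each partition, a second classifier is trained on the true labels
buckets = {}
for doc, lab in train:
    buckets.setdefault(base(doc), []).append((doc, lab))
refiners = {}
for pred, items in buckets.items():
    d, l = zip(*items)
    # a bucket whose true labels are already pure needs no refiner
    refiners[pred] = fit_nb(d, l) if len(set(l)) > 1 else (lambda doc, y=l[0]: y)

def refined(doc):
    return refiners[base(doc)](doc)
```

Here the base model misclassifies the skewed ham example as spam; the refiner trained inside the "predicted spam" bucket sees a less skewed mixture (only three spam documents against that ham document, over the bucket's smaller vocabulary) and recovers the correct label, which is the sense in which refinement counteracts the skew in P(w|c).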

The next problem is overfitting. According to the explanation above, it seems inevitable, because the word distribution captured by the classifier is just the distribution of the training examples. Perhaps the experiments in "A refinement..." are biased, especially on the second data collection, Usenet.

 