数据不平衡是一个非常经典的问题,数据挖掘、计算广告、NLP等工作经常遇到。该文总结了可能有效的方法,值得参考:
- Do nothing. Sometimes you get lucky and nothing needs to be done. You can train on the so-called natural (or stratified) distribution and sometimes it works without need for modification.
- Balance the training set in some way:
- Oversample the minority class.
- Undersample the majority class.
- Synthesize new minority classes.
- Throw away minority examples and switch to an anomaly detection framework.
- At the algorithm level, or after it:
- Adjust the class weight (misclassification costs).
- Adjust the decision threshold.
- Modify an existing algorithm to be more sensitive to rare classes.
- Construct an entirely new algorithm to perform well on imbalanced data.
应对数据不平衡
本文探讨了数据不平衡问题在数据挖掘、计算广告及NLP等领域的挑战,并提出了多种解决策略,包括调整训练集分布、修改算法敏感度及构建专门算法等。
907

被折叠的 条评论
为什么被折叠?



