At RailsConf I picked up 'Programming Collective Intelligence' at the O'Reilly booth. It is really a book about data mining, and it is much easier to follow than the usual textbooks on the subject. Below are excerpts of the useful parts:
=knn=
+ New data can be added at any time and requires no computation at all; the example is simply added to the set (see the sketch after this list).
- All of the training data has to be present in order to make predictions. In a dataset with millions of examples, this is not just a space issue but also a time issue.
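To make the trade-off concrete, here is a minimal k-nearest-neighbors sketch. It is my own illustration, not code from the book; the class name, the value of k and the toy points are all made up (math.dist needs Python 3.8+). Adding an example is just an append, while classifying has to scan every stored example.

    import math
    from collections import Counter

    class KNN:
        # Hypothetical minimal kNN classifier, for illustration only.
        def __init__(self, k=3):
            self.k = k
            self.examples = []  # every training example is kept verbatim

        def add(self, features, label):
            # "Training" is just storing the example -- no computation at all.
            self.examples.append((features, label))

        def classify(self, features):
            # Prediction scans the whole training set, which is the space and
            # time drawback once there are millions of examples.
            by_distance = sorted(self.examples,
                                 key=lambda ex: math.dist(features, ex[0]))
            votes = Counter(label for _, label in by_distance[:self.k])
            return votes.most_common(1)[0][0]

    knn = KNN(k=3)
    for point, label in [((1.0, 1.0), "a"), ((1.2, 0.9), "a"),
                         ((8.0, 9.0), "b"), ((7.5, 8.5), "b")]:
        knn.add(point, label)
    print(knn.classify((1.1, 1.0)))  # -> "a"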
=svm=
+ After training, they are very fast at classifying new observations.
- Black-box technique: an SVM may give great answers, but you will never really know why.
- Requires retraining if the data changes (see the sketch after this list).
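The fast-to-query, expensive-to-change behaviour can be shown in a few lines, assuming scikit-learn is installed (this is not the code from the book, and the toy data is invented):

    from sklearn.svm import SVC

    X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
    y = ["a", "a", "b", "b"]

    clf = SVC(kernel="rbf")
    clf.fit(X, y)                      # training can be slow on large data
    print(clf.predict([[0.1, 0.0]]))   # but classifying a new point is fast

    # The fitted model cannot be patched in place: if the data changes,
    # the whole SVM has to be retrained from scratch.
    X.append([0.5, 0.5])
    y.append("a")
    clf.fit(X, y)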
=neural network=
+ Allows incremental training and generally doesn't require much space to store the trained model (see the sketch after this list).
- Black-box technique.
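The incremental point is easiest to see in the smallest possible network. The sketch below is a single-layer perceptron with online updates, deliberately simpler than the multilayer network you would normally use; all names and numbers are illustrative. Each new example nudges the weights in place, and only the small weight vector needs to be stored afterwards.

    class Perceptron:
        # Tiny online perceptron, for illustration only.
        def __init__(self, n_features, lr=0.1):
            # Only this weight vector has to be stored, not the training data.
            self.w = [0.0] * n_features
            self.bias = 0.0
            self.lr = lr

        def predict(self, x):
            s = self.bias + sum(wi * xi for wi, xi in zip(self.w, x))
            return 1 if s > 0 else 0

        def train_one(self, x, target):
            # Incremental training: each example adjusts the weights in place,
            # so new data can be folded in without revisiting old examples.
            err = target - self.predict(x)
            self.bias += self.lr * err
            for i, xi in enumerate(x):
                self.w[i] += self.lr * err * xi

    p = Perceptron(2)
    for x, t in [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)] * 10:
        p.train_one(x, t)
    print(p.predict((1, 1)), p.predict((0, 1)))  # -> 1 0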
=decision tree=
+ The trained model is easy to interpret and brings the most important factors to the top of the tree (see the sketch after this list).
- Has to start from scratch each time the data changes (decision trees that support incremental training are an active area of research).
- The tree can become extremely large and complex, which makes classification slow.
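Both points are easy to demonstrate, assuming scikit-learn is installed (again not the book's own tree code); the sign-up-style rows and feature names below are invented:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Invented data: [came_from_search, read_faq, pages_viewed] -> subscription
    X = [[0, 1, 21], [1, 0, 12], [0, 0, 24], [1, 1, 18], [0, 1, 23], [1, 0, 11]]
    y = ["basic", "none", "premium", "basic", "premium", "none"]

    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

    # The trained model can be printed as human-readable if/else rules, and the
    # most informative split ends up at the top of the tree.
    print(export_text(tree,
                      feature_names=["came_from_search", "read_faq", "pages_viewed"]))

    # There is no incremental update: adding an example means rebuilding the tree.
    X.append([1, 0, 19])
    y.append("none")
    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)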
=naive bayesian=
+ Training and querying are fast, even with large datasets (see the sketch after this list).
+ Supports incremental training.
+ Easy to interpret what the classifier has actually learned.
- Unable to deal with outcomes that change based on combinations of features.
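A minimal word-count naive Bayes along those lines, written from scratch for illustration (the book's own version is more elaborate than this): training just increments a few counters, which is why it is fast and incremental, and the counters themselves are easy to inspect. The weakness also shows here: each word contributes independently, so combinations of features are never modeled.

    import math
    from collections import defaultdict

    class NaiveBayes:
        # Hypothetical minimal naive Bayes text classifier, for illustration only.
        def __init__(self):
            self.word_count = defaultdict(lambda: defaultdict(int))  # word -> cat -> count
            self.cat_count = defaultdict(int)

        def train(self, text, cat):
            # Incremental: each document only bumps a few counters, so training
            # is fast and new examples can be added at any time.
            for word in set(text.lower().split()):
                self.word_count[word][cat] += 1
            self.cat_count[cat] += 1

        def classify(self, text):
            words = set(text.lower().split())
            total = sum(self.cat_count.values())
            best, best_score = None, float("-inf")
            for cat, count in self.cat_count.items():
                # log P(cat) + sum of log P(word | cat), with add-one smoothing.
                # Every word is treated independently of the others, which is
                # exactly why feature combinations cannot be captured.
                score = math.log(count / total)
                for word in words:
                    score += math.log((self.word_count[word][cat] + 1) / (count + 2))
                if score > best_score:
                    best, best_score = cat, score
            return best

    nb = NaiveBayes()
    nb.train("cheap viagra offer", "spam")
    nb.train("limited time offer buy now", "spam")
    nb.train("meeting schedule for monday", "ham")
    print(nb.classify("cheap offer now"))  # -> "spam"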