AI-014: Study Notes on Professor Andrew Ng's Machine Learning Course (49)

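This post covers strategies for building and improving machine learning systems: error analysis, choosing evaluation metrics such as precision and recall, and handling skewed (imbalanced) classes. Using spam filtering as a case study, it describes how more data and more sophisticated features can improve a classifier's performance.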

These are my study notes for Andrew Ng's machine learning lecture series. The lecture videos are available at:

https://study.163.com/course/introduction.htm?courseId=1004570029#/courseDetail?tab=1

49. Machine learning system design: prioritizing what to work on: spam classification example

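Taking a spam filtering system as the example, start by building a classifier:

Choose high-frequency words as the features.

Ways to reduce the classifier's error rate, for example:

Collect a lot of data
Use more sophisticated features based on email routing information (such as the sender and subject line), for example an empty subject or a sender address at @saler.com
Use more sophisticated features based on the message body, for example words such as "discount" and "promotion"
Detect misspellings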
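
The "high-frequency words as features" step can be sketched as follows. This is a minimal illustration, not the course's own code: the helper names `build_vocabulary` and `to_feature_vector`, the toy emails, and the simple letters-only tokenizer are all assumptions made for the example.

```python
import re
from collections import Counter

def build_vocabulary(emails, size):
    """Return the `size` most frequent words across the training emails."""
    counts = Counter()
    for text in emails:
        # Assumed tokenizer: lowercase, keep runs of ASCII letters only.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return [word for word, _ in counts.most_common(size)]

def to_feature_vector(text, vocabulary):
    """Binary vector: x[j] = 1 if vocabulary[j] appears in the email, else 0."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [1 if word in words else 0 for word in vocabulary]

# Toy training emails, made up for illustration.
emails = [
    "Buy now, big discount, buy today",
    "Meeting moved to Monday",
    "Discount deal, buy now",
]
vocabulary = build_vocabulary(emails, size=3)
x = to_feature_vector("Please buy the discount tickets", vocabulary)
```

In a real system the vocabulary would be far larger than three words, and the resulting vectors would be fed to whatever classifier is being trained.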

50. Machine learning system design: Error analysis

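Methodology:

Error analysis:

Manually examine the misclassified examples and see how they are distributed across categories. Categories that account for a large share of the errors are the ones worth targeting with improvements (more data, more features, ...); afterwards, check again what the main causes of error are.

The algorithm should ideally return a single quantitative evaluation result, such as an error rate. Comparing the error rates obtained with different features or methods (for example, with or without stemming) then tells you which choice is better:

If the error rate is lower with stemming, adopt the version of the algorithm that uses stemming.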
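
One way to carry out this tally is sketched below. The category names and the list of misclassified emails are hypothetical, invented only to show the bookkeeping; in practice the categories come from reading the errors by hand.

```python
from collections import Counter

# Hypothetical hand-assigned categories for misclassified validation emails.
misclassified = [
    {"id": 1, "category": "phishing"},
    {"id": 2, "category": "pharma"},
    {"id": 3, "category": "phishing"},
    {"id": 4, "category": "other"},
    {"id": 5, "category": "phishing"},
]

def error_breakdown(examples):
    """Return (category, count, share of all errors), largest share first."""
    counts = Counter(example["category"] for example in examples)
    total = sum(counts.values())
    return [(category, n, n / total) for category, n in counts.most_common()]

breakdown = error_breakdown(misclassified)
for category, n, share in breakdown:
    print(f"{category}: {n} errors ({share:.0%})")
```

The category at the top of the list is the natural first candidate for new features or more data.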
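
The comparison above reduces to computing one number per variant and keeping the smaller one. The sketch below assumes the predictions of each variant on the same validation set are already available; the labels and predictions are made-up values, and the variant names are placeholders.

```python
def error_rate(predictions, labels):
    """Fraction of examples whose predicted label differs from the true label."""
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)

# Made-up validation labels and predictions (1 = spam, 0 = not spam).
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions_by_variant = {
    "without_stemming": [1, 0, 0, 1, 0, 1, 0, 0],
    "with_stemming":    [1, 0, 1, 1, 0, 1, 1, 0],
}

errors = {
    name: error_rate(predictions, labels)
    for name, predictions in predictions_by_variant.items()
}
# Keep whichever variant has the lowest error rate.
best_variant = min(errors, key=errors.get)
```

With a single number per variant, each new idea can be accepted or rejected quickly instead of being judged by inspection.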

51. Machine learning system design: Error metric for skewed classes

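skewed classes: one class is far rarer than the other

accuracy: the fraction of all examples that are classified correctly

precision: of the examples predicted positive, the fraction that are actually positive, TP / (TP + FP)

recall: of the examples that are actually positive, the fraction that are predicted positive, TP / (TP + FN)

The higher the precision and recall, the better.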

If a classifier is getting high precision and high recall, then we can be confident that the algorithm is doing well, even if the classes are very skewed.

So for the problem of skewed classes, precision and recall give us more direct insight into how the learning algorithm is doing, and this is often a much better way to evaluate our learning algorithms than looking at classification error or classification accuracy when the classes are very skewed.
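
A small worked example shows why accuracy misleads on skewed classes. The data is made up: 1 positive among 100 examples, scored against a "classifier" that always predicts negative. Returning 0.0 when a ratio is undefined (no positive predictions) is a convention chosen for this sketch.

```python
def confusion_counts(predictions, labels):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    return tp, fp, fn, tn

def precision_recall(predictions, labels):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    tp, fp, fn, _ = confusion_counts(predictions, labels)
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up skewed data set: 1% positive.
labels = [1] * 1 + [0] * 99
always_negative = [0] * 100

tp, fp, fn, tn = confusion_counts(always_negative, labels)
accuracy = (tp + tn) / len(labels)
precision, recall = precision_recall(always_negative, labels)
```

Here accuracy is 99% even though the classifier never finds a single positive example, which the recall of 0 makes obvious.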

52. Machine learning system design: Trading off precision and recall

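threshold: the cutoff on the predicted probability above which an example is classified as positive

High threshold: few examples are flagged, but anything that is flagged is almost certainly positive -> high precision, low recall. For example, with spam filtering you really do not want to lose legitimate email.

Low threshold: many examples are flagged, but many of those are false alarms -> low precision, high recall. For example, when predicting cancer, stay suspicious :)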
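
The trade-off can be seen by sweeping the threshold over a fixed set of predicted probabilities. This is a minimal sketch with made-up probabilities and labels; `predict` and `precision_recall` are illustrative helpers, and returning 0.0 for an undefined ratio is an assumed convention.

```python
def predict(probabilities, threshold):
    """Classify as positive (1) when the predicted probability >= threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def precision_recall(predictions, labels):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up classifier outputs and true labels.
probabilities = [0.95, 0.85, 0.75, 0.60, 0.40, 0.30, 0.20, 0.10]
labels        = [1,    1,    0,    1,    1,    0,    0,    0]

results = {
    threshold: precision_recall(predict(probabilities, threshold), labels)
    for threshold in (0.3, 0.5, 0.9)
}
for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.9 raises precision while recall drops, matching the two cases described above.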