Guide for Machine Learning Projects

Checklist for Data Science and Machine Learning Engineers

This checklist can guide you through your Machine Learning projects. There are eight main steps:

1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
5. Explore many different models and short-list the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.

Obviously, you should feel free to adapt this checklist to your needs.

Step 1: Frame the Problem and Look at the Big Picture

1. Define the objective in business terms.
2. How will your solution be used?
3. What are the current solutions/workarounds (if any)?
4. How should you frame this problem (supervised/unsupervised, online/offline, etc.)?
5. How should performance be measured?
6. Is the performance measure aligned with the business objective?
7. What would be the minimum performance needed to reach the business objective?
8. What are comparable problems? Can you reuse experience or tools?
9. Is human expertise available?
10. How would you solve the problem manually?
11. List the assumptions you (or others) have made so far.
12. Verify assumptions if possible.

Step 2: Get the Data

Note: automate as much as possible so you can easily get fresh data.

1. List the data you need and how much you need.
2. Find and document where you can get that data.
3. Check how much space it will take.
4. Check legal obligations, and get authorization if necessary.
5. Get access authorizations.
6. Create a workspace (with enough storage space).
7. Get the data.
8. Convert the data to a format you can easily manipulate (without changing the data itself).
9. Ensure sensitive information is deleted or protected (e.g., anonymized).
10. Check the size and type of data (time series, sample, geographical, etc.).
11. Sample a test set, put it aside, and never look at it (no data snooping!).
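One way to keep the test set from item 11 stable across data refreshes is to assign each record to the test set based on a hash of its identifier, so a record never migrates between sets when fresh data arrives. The sketch below is only an illustration using the Python standard library; the record layout, the `id` key, and the function names are assumptions, not part of the checklist.

```python
import zlib

def is_in_test_set(identifier: str, test_ratio: float) -> bool:
    """Decide test-set membership from a hash of the record's identifier."""
    # crc32 maps the identifier to an integer in [0, 2**32); records whose
    # hash falls in the lowest `test_ratio` fraction go to the test set.
    return zlib.crc32(identifier.encode("utf-8")) < test_ratio * 2**32

def split_train_test(records, id_key, test_ratio=0.2):
    """Split records into (train, test) using only each record's identifier."""
    train, test = [], []
    for record in records:
        in_test = is_in_test_set(str(record[id_key]), test_ratio)
        (test if in_test else train).append(record)
    return train, test

# Hypothetical records; in a real project these would come from your data source.
records = [{"id": i, "value": i * 0.5} for i in range(1000)]
train_set, test_set = split_train_test(records, id_key="id")
print(len(train_set), len(test_set))
```

Because membership depends only on the identifier, rerunning the split after appending new records leaves every existing record in the set it was already in. This assumes identifiers are unique and never change.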

Step 3: Explore the Data

Note: try to get insights from a field expert for these steps.

1. Create a copy of the data for exploration (sampling it down to a manageable size if necessary).
2. Create a Jupyter notebook to keep a record of your data exploration.
3. Study each attribute and its characteristics:
   - Name
   - Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
   - % of missing values
   - Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
   - Possibly useful for the task?
   - Type of distribution (Gaussian, uniform, logarithmic, etc.)
4. For supervised learning tasks, identify the target attribute(s).
5. Visualize the data.
6. Study the correlations between attributes.
7. Study how you would solve the problem manually.
8. Identify the promising transformations you may want to apply.
9. Identify extra data that would be useful (go back to “Get the Data”).
10. Document what you have learned.
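Parts of items 3 and 6 can be scripted. The sketch below summarizes one attribute (observed types, percentage of missing values, basic statistics) and computes a Pearson correlation between two attributes. It is a minimal standard-library illustration: the rows, the attribute names, and the helper names `profile_attribute` and `pearson` are made up for the example, and missing values are assumed to be encoded as `None`.

```python
import math
from statistics import mean, median

def profile_attribute(rows, name):
    """Summarize one attribute: observed types, % missing, basic statistics."""
    values = [row.get(name) for row in rows]
    present = [v for v in values if v is not None]
    summary = {
        "name": name,
        "types": sorted({type(v).__name__ for v in present}),
        "pct_missing": 100.0 * (len(values) - len(present)) / len(values),
    }
    numeric = [v for v in present if isinstance(v, (int, float))]
    if numeric:
        summary.update(min=min(numeric), max=max(numeric),
                       mean=mean(numeric), median=median(numeric))
    return summary

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical exploration sample.
rows = [
    {"rooms": 3, "income": 2.5, "price": 150.0},
    {"rooms": 4, "income": None, "price": 210.0},
    {"rooms": 2, "income": 1.8, "price": 120.0},
    {"rooms": 5, "income": 4.1, "price": 300.0},
]
income_profile = profile_attribute(rows, "income")
rooms_price_corr = pearson([r["rooms"] for r in rows], [r["price"] for r in rows])
print(income_profile)
print(round(rooms_price_corr, 3))
```

Pearson correlation only captures linear relationships, so it complements the visualizations in item 5 and does not replace them.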

Step 4: Prepare the Data

Notes:

- Work on copies of the data (keep the original dataset intact).
- Write functions for all data transformations you apply, for five reasons:
  - So you can easily prepare the data the next time you get a fresh dataset
  - So you can apply these transformations in future projects
  - To clean and prepare the test set
  - To clean and prepare new data instances once your solution is live
  - To make it easy to treat your preparation choices as hyperparameters

1. Data cleaning:
   - Fix or remove outliers (optional).
   - Fill in missing values (e.g., with zero, mean, median…) or drop their rows (or columns).
2. Feature selection (optional):
   - Drop the attributes that provide no useful information for the task.
3. Feature engineering, where appropriate:
   - Discretize continuous features.
   - Decompose features (e.g., categorical, date/time, etc.).
   - Add promising transformations of features (e.g., log(x), sqrt(x), x², etc.).
   - Aggregate features into promising new features.
4. Feature scaling: standardize or normalize features.
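The advice to write functions for every transformation can be illustrated with a fit/apply pair: one function learns the parameters (here an imputation median plus the mean and standard deviation for standardization) from the training data, and a second applies them unchanged to the test set or to live data. This is a sketch for a single numeric feature; the function names and the `None` encoding of missing values are assumptions made for the example.

```python
from statistics import mean, median, pstdev

def fit_preparation(train_values):
    """Learn imputation and scaling parameters from the training data only."""
    present = [v for v in train_values if v is not None]
    fill = median(present)
    filled = [fill if v is None else v for v in train_values]
    # pstdev is 0 for a constant feature; such a feature should be dropped
    # rather than scaled (see the feature selection step).
    return {"fill": fill, "mean": mean(filled), "std": pstdev(filled)}

def apply_preparation(values, params):
    """Impute missing values, then standardize, using learned parameters."""
    filled = [params["fill"] if v is None else v for v in values]
    return [(v - params["mean"]) / params["std"] for v in filled]

# Hypothetical feature column with one missing value in each set.
train = [1.0, 2.0, None, 4.0, 5.0]
test = [None, 3.0]

params = fit_preparation(train)
train_prepared = apply_preparation(train, params)
test_prepared = apply_preparation(test, params)  # reuses the training statistics
print(params)
print(test_prepared)
```

Keeping the learned parameters in a separate object is what lets the same code prepare a fresh dataset, the test set, and production inputs without recomputing statistics on data the model has not seen.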

Step 5: Short-List Promising Models

Notes:

- If the data is huge, you may want to sample smaller training sets so you can train many different models in a reasonable time (be aware that this penalizes complex models such as large neural nets or Random Forests).
- Once again, try to automate these steps as much as possible.

1. Train many quick and dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forests, neural net, etc.) using standard parameters.
2. Measure and compare their performance.
   - For each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measure on the N folds.
3. Analyze the most significant variables for each algorithm.
4. Analyze the types of errors the models make.
   - What data would a human have used to avoid these errors?
5. Have a quick round of feature selection and engineering.
6. Have one or two more quick iterations of the five previous steps.
7. Short-list the top three to five most promising models, preferring models that make different types of errors.
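The N-fold cross-validation in item 2 can be sketched with the standard library alone. The example below compares two deliberately simple one-feature models (a mean baseline and a least-squares line) on synthetic data and reports the mean and standard deviation of the RMSE across folds. The data, the models, and all function names are illustrative assumptions; in practice you would plug in your own candidates, typically through a library such as scikit-learn.

```python
import math
import random
from statistics import mean, stdev

def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle indices once, then yield (train_idx, val_idx) for each fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, val_idx

def fit_mean_baseline(xs, ys):
    """Model that always predicts the training mean of the target."""
    avg = mean(ys)
    return lambda x: avg

def fit_linear(xs, ys):
    """Ordinary least squares for a single feature."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def cross_validate(fit, xs, ys, n_folds=5):
    """Return (mean, standard deviation) of the validation RMSE over the folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), n_folds):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        squared_errors = [(model(xs[i]) - ys[i]) ** 2 for i in val_idx]
        scores.append(math.sqrt(mean(squared_errors)))
    return mean(scores), stdev(scores)

# Synthetic data: a noisy line, so the linear model should win clearly.
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

candidates = [("mean baseline", fit_mean_baseline), ("linear", fit_linear)]
results = {name: cross_validate(fit, xs, ys) for name, fit in candidates}
for name, (avg, sd) in results.items():
    print(f"{name}: RMSE {avg:.3f} +/- {sd:.3f}")
```

Reporting the standard deviation alongside the mean shows whether a gap between two models is larger than the fold-to-fold noise.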

Step 6: Fine-Tune the System

Notes:

- You will want to use as much data as possible for this step, especially as you move toward the end of fine-tuning.
- As always, automate what you can.

1. Fine-tune the hyperparameters using cross-validation.
   - Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g., should I replace missing values with zero or with the median value? Or just drop the rows?).
   - Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimization approach (e.g., using Gaussian process priors, as described by Jasper Snoek, Hugo Larochelle, and Ryan Adams).
2. Try ensemble methods. Combining your best models will often perform better than running them individually.
3. Once you are confident about your final model, measure its performance on the test set to estimate the generalization error.

WARNING: Don’t tweak your model after measuring the generalization error: you would just start overfitting the test set.
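To make item 1 concrete, the sketch below runs a small random search in which the imputation strategy is sampled as a hyperparameter alongside a regularization strength. Several simplifications are assumptions of this example, not recommendations: the model is a one-feature ridge regression, a single validation split stands in for cross-validation to keep the code short, dropping rows is left out, and the data is synthetic. The test set is never touched during the search.

```python
import math
import random
from statistics import mean, median

def impute(values, strategy, reference):
    """Fill None entries using a statistic computed on `reference` (training data)."""
    present = [v for v in reference if v is not None]
    fill = {"zero": 0.0, "mean": mean(present), "median": median(present)}[strategy]
    return [fill if v is None else v for v in values]

def fit_ridge_1d(xs, ys, alpha):
    """One-feature linear regression with an L2 penalty `alpha` on the slope."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)
    return lambda x: slope * x + (my - slope * mx)

def validation_rmse(config, train, val):
    """Prepare the data as the config says, fit on train, score on validation."""
    (train_x, train_y), (val_x, val_y) = train, val
    model = fit_ridge_1d(impute(train_x, config["imputer"], train_x),
                         train_y, config["alpha"])
    preds = [model(x) for x in impute(val_x, config["imputer"], train_x)]
    return math.sqrt(mean((p - y) ** 2 for p, y in zip(preds, val_y)))

def sample_config(rng):
    """Draw one configuration: a preparation choice plus a log-uniform alpha."""
    return {"imputer": rng.choice(["zero", "mean", "median"]),
            "alpha": 10 ** rng.uniform(-3, 2)}

def random_search(sample, score, n_iter, seed=0):
    """Evaluate `n_iter` random configurations and return (best_score, best_config)."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_iter):
        config = sample(rng)
        trials.append((score(config), config))
    return min(trials, key=lambda trial: trial[0])

# Synthetic data with roughly 20% of the feature values missing.
data_rng = random.Random(1)
true_x = [data_rng.uniform(5, 15) for _ in range(200)]
ys = [3 * x + data_rng.gauss(0, 1) for x in true_x]
xs = [None if data_rng.random() < 0.2 else x for x in true_x]
train, val = (xs[:150], ys[:150]), (xs[150:], ys[150:])

best_score, best_config = random_search(
    sample_config, lambda c: validation_rmse(c, train, val), n_iter=30)
print(best_config, round(best_score, 3))
```

Sampling `alpha` on a log scale is a common choice when the useful range spans several orders of magnitude. The imputation statistic is always computed from the training portion, so the validation score is not contaminated.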

Step 7: Present Your Solution

1. Document what you have done.
2. Create a nice presentation.
   - Make sure you highlight the big picture first.
3. Explain why your solution achieves the business objective.
4. Don’t forget to present interesting points you noticed along the way.
   - Describe what worked and what did not.
   - List your assumptions and your system’s limitations.
5. Ensure your key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g., “the median income is the number-one predictor of housing prices”).

Step 8: Launch!

1. Get your solution ready for production (plug into production data inputs, write unit tests, etc.).
2. Write monitoring code to check your system’s live performance at regular intervals and trigger alerts when it drops.
   - Beware of slow degradation too: models tend to “rot” as data evolves.
   - Measuring performance may require a human pipeline (e.g., via a crowdsourcing service).
   - Also monitor your inputs’ quality (e.g., a malfunctioning sensor sending random values, or another team’s output becoming stale). This is particularly important for online learning systems.
3. Retrain your models on a regular basis on fresh data (automate as much as possible).
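The monitoring code in item 2 might look like the sketch below: a sliding-window accuracy tracker that signals an alert when live performance drops below a threshold, plus a crude input-quality check that measures how many incoming values fall outside the range seen during training. The class name, the thresholds, and the window size are all assumptions chosen for illustration; a real system would also need the labeled feedback pipeline mentioned above.

```python
from collections import deque

class PerformanceMonitor:
    """Track accuracy over a sliding window of labeled predictions."""

    def __init__(self, window_size, min_accuracy):
        self.outcomes = deque(maxlen=window_size)  # oldest entries drop out
        self.min_accuracy = min_accuracy

    def record(self, prediction, actual):
        """Store whether one prediction matched its later-obtained label."""
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing was recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        """Alert only once the window is full, to avoid noise on few samples."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.min_accuracy

def out_of_range_fraction(values, low, high):
    """Fraction of inputs outside [low, high], the range seen in training."""
    return sum(not (low <= v <= high) for v in values) / len(values)

# Hypothetical stream: five correct predictions, then two wrong ones.
monitor = PerformanceMonitor(window_size=5, min_accuracy=0.8)
for _ in range(5):
    monitor.record("spam", "spam")
alert_before = monitor.should_alert()
for _ in range(2):
    monitor.record("spam", "ham")
alert_after = monitor.should_alert()
print(alert_before, alert_after, monitor.accuracy())

# A sensor expected to report values in [0, 1] starts sending garbage.
bad_fraction = out_of_range_fraction([0.5, 0.7, 42.0, -3.0], low=0.0, high=1.0)
print(bad_fraction)
```

A short window reacts quickly to sudden failures but can miss the slow "rot" described above, so comparing a long-window accuracy against the accuracy measured at launch is a useful second check.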

I hope you found this article useful. Thank you for reading this far. If you have any questions or suggestions, let me know in the comments. You can also get in touch with me directly through email, LinkedIn, or Twitter.
