data

Latest recommended article published on 2024-04-24 21:41:06

他来自江湖

License: CC 4.0 BY-SA

Category: Reposts and Notes

Article link: https://blog.youkuaiyun.com/hua_007/article/details/43149405

Data Leakage:

Using data in the model that a production system would not have access to at prediction time. This is particularly common in time series problems. It can also happen with fields such as system IDs that may indirectly indicate a class label. Run a model and take a careful look at the attributes that contribute most to its success, then sanity-check whether each one makes sense. (See the referenced paper "Leakage in Data Mining", PDF.)
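One way to make that sanity check concrete is to ask how well each attribute predicts the label on its own. The sketch below is only an illustration: the field names (`system_id`, `cpu_load`), the toy rows, and the 0.95 cutoff are all assumptions, and a single-threshold rule is just one simple way to score an attribute. An attribute that separates the classes almost perfectly by itself deserves a closer look.

```python
def best_threshold_accuracy(values, labels):
    """Best accuracy of the rule 'predict 1 if value >= t', or its inverse."""
    best = 0.0
    for t in set(values):
        correct = sum((v >= t) == bool(y) for v, y in zip(values, labels))
        acc = correct / len(labels)
        # 1 - acc is the accuracy of the inverted rule.
        best = max(best, acc, 1.0 - acc)
    return best


def flag_suspicious(rows, features, label_key="label", cutoff=0.95):
    """Return the features that predict the label almost perfectly alone."""
    labels = [r[label_key] for r in rows]
    return [
        name for name in features
        if best_threshold_accuracy([r[name] for r in rows], labels) >= cutoff
    ]


# Hypothetical toy data: the ID range happens to encode the class.
rows = [
    {"system_id": 101, "cpu_load": 0.30, "label": 0},
    {"system_id": 102, "cpu_load": 0.70, "label": 0},
    {"system_id": 103, "cpu_load": 0.45, "label": 0},
    {"system_id": 104, "cpu_load": 0.60, "label": 0},
    {"system_id": 1001, "cpu_load": 0.55, "label": 1},
    {"system_id": 1002, "cpu_load": 0.80, "label": 1},
    {"system_id": 1003, "cpu_load": 0.40, "label": 1},
    {"system_id": 1004, "cpu_load": 0.65, "label": 1},
]

suspicious = flag_suspicious(rows, ["system_id", "cpu_load"])
print("Attributes to review for leakage:", suspicious)
```

A flagged attribute is not proof of leakage. It is a prompt to ask whether the production system would really have that value, in that form, at prediction time.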

Overfitting:

Modeling the training data so closely that the model also fits the noise in it. The result is a poor ability to generalize to new data. This becomes more of a problem in higher dimensions with more complex class boundaries.
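A gap between training accuracy and held-out accuracy is the usual symptom. The sketch below assumes a synthetic one-dimensional problem with deliberately flipped training labels, and uses a 1-nearest-neighbour rule as a stand-in for any model flexible enough to memorize its training set. None of this comes from the original notes.

```python
def true_label(x):
    """The underlying rule the model should learn (assumed for this demo)."""
    return int(x >= 50)


# Training set: even x values, with some labels flipped to simulate noise.
train = []
for x in range(0, 100, 2):
    y = true_label(x)
    if x % 10 == 0:
        y = 1 - y  # injected label noise
    train.append((x, y))

# Held-out set: odd x values with clean labels.
test = [(x, true_label(x)) for x in range(1, 100, 2)]


def predict_1nn(x, memory):
    """Return the label of the closest memorized training point."""
    _, nearest_y = min(memory, key=lambda p: abs(p[0] - x))
    return nearest_y


def accuracy(data, memory):
    return sum(predict_1nn(x, memory) == y for x, y in data) / len(data)


train_acc = accuracy(train, train)  # every point is its own nearest neighbour
test_acc = accuracy(test, train)    # memorized noise leaks into predictions

print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

The model scores perfectly on the data it memorized, noise included, and worse on the held-out points. That difference is what poor generalization looks like in practice.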

Data Sampling and Splitting:

Related to data leakage: you need to be very careful that the train/test/validation sets are indeed independent samples. Time series problems require much thought and work to ensure that you can replay data to the system chronologically and validate model accuracy.
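For time-ordered data, a random shuffle lets the model train on the future and test on the past. A minimal sketch of the alternative follows, assuming each record carries a `timestamp` field (a hypothetical name). It shows a single chronological cut and a simple walk-forward replay. The helper names and the split sizes are illustrative choices only.

```python
from datetime import datetime, timedelta


def chronological_split(records, train_fraction=0.8):
    """Split so that every training record precedes every test record."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def walk_forward(records, initial_size, step):
    """Replay data in time order: train on the past, test on the next block."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    start = initial_size
    while start < len(ordered):
        yield ordered[:start], ordered[start:start + step]
        start += step


# Hypothetical daily records, deliberately stored out of order.
base = datetime(2015, 1, 1)
records = [
    {"timestamp": base + timedelta(days=d), "value": d}
    for d in [3, 0, 7, 1, 9, 4, 2, 8, 6, 5]
]

train, test = chronological_split(records)
print("train ends:  ", max(r["timestamp"] for r in train).date())
print("test starts: ", min(r["timestamp"] for r in test).date())

folds = list(walk_forward(records, initial_size=4, step=2))
for past, future in folds:
    print(len(past), "training records ->", len(future), "test records")
```

If several records share a timestamp, a cut by position can place some of them on each side. Cutting on a timestamp value instead avoids that.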

Data Quality:

Check the consistency of your data. Ben gave an example of flight data in which some aircraft were landing before taking off. Inconsistent, duplicate, and corrupt data need to be identified and explicitly handled, because they can directly hurt the modeling and the model's ability to generalize.
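Checks like these can be written as explicit rules that run before any modeling. The sketch below is loosely modelled on the flight example. The record layout (`flight_id`, `takeoff`, `landing`), the sample rows, and the decision to reject rather than repair bad records are all assumptions made for illustration.

```python
from datetime import datetime


def audit_flights(flights):
    """Separate usable records from rejected ones, with a reason for each."""
    seen = set()
    clean, rejected = [], []
    for f in flights:
        key = (f["flight_id"], f["takeoff"], f["landing"])
        if f["takeoff"] is None or f["landing"] is None:
            rejected.append((f, "missing timestamp"))
        elif f["landing"] <= f["takeoff"]:
            rejected.append((f, "landing not after takeoff"))
        elif key in seen:
            rejected.append((f, "duplicate"))
        else:
            seen.add(key)
            clean.append(f)
    return clean, rejected


# Hypothetical records covering each kind of problem.
flights = [
    {"flight_id": "A1", "takeoff": datetime(2015, 1, 1, 8, 0),
     "landing": datetime(2015, 1, 1, 10, 0)},
    {"flight_id": "B2", "takeoff": datetime(2015, 1, 1, 9, 0),
     "landing": datetime(2015, 1, 1, 8, 30)},   # lands before taking off
    {"flight_id": "A1", "takeoff": datetime(2015, 1, 1, 8, 0),
     "landing": datetime(2015, 1, 1, 10, 0)},   # exact duplicate
    {"flight_id": "C3", "takeoff": datetime(2015, 1, 1, 11, 0),
     "landing": None},                          # corrupt or missing value
]

clean, rejected = audit_flights(flights)
reasons = [reason for _, reason in rejected]
print(len(clean), "clean record(s)")
for record, reason in rejected:
    print(record["flight_id"], "rejected:", reason)
```

Keeping the rejection reason alongside each record makes the handling explicit and auditable, rather than silently dropping rows.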