Go Through an ML project

東方海竹

Article link: https://blog.youkuaiyun.com/huanghaifeng201213/article/details/127991421


1. Look at the big picture.

Frame the Problem
What exactly is the business objective?
Select a Performance Measure
Root Mean Square Error (RMSE): corresponds to the L2 norm
Mean Absolute Error (MAE): corresponds to the L1 norm
Check the Assumptions
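
The two performance measures above can be sketched with NumPy as follows. This is a minimal illustration, not the article's own code: the helper names rmse and mae and the sample values are made up for the example. The last prediction is deliberately far off, to show that RMSE penalizes large errors more than MAE does.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error: the L2 norm of the errors divided by sqrt(m)."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error: the L1 norm of the errors divided by m."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(errors)))


# Hypothetical values; the last prediction is an outlier (error of -100).
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 300.0]

rmse_value = rmse(y_true, y_pred)
mae_value = mae(y_true, y_pred)
print("RMSE:", rmse_value)
print("MAE: ", mae_value)
```

Because the errors are squared before averaging, the single large error dominates the RMSE, while the MAE weights every error linearly.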

2. Get the data.

2.1 Create the Workspace

Anaconda ( Jupyter | Spyder )

2.2 Download the Data

Popular open data repositories

UC Irvine Machine Learning Repository
Kaggle datasets
Amazon’s AWS datasets

Meta portals (they list open data repositories)

http://dataportals.org/
http://opendatamonitor.eu/
http://quandl.com/

Other pages listing many popular open data repositories

Wikipedia’s list of Machine Learning datasets
Quora.com question
Datasets subreddit

2.3 Take a Quick Look at the Data Structure

head(), info(), ['key'].value_counts() (counts how often each value occurs), describe() (shows a summary of the numerical attributes), hist()
Jupyter's magic command "%matplotlib inline" renders the plots inside the notebook
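
A quick look with these pandas calls might go like the sketch below. The tiny DataFrame is invented for the illustration (the column names only imitate a housing dataset), and hist() is left as a comment because it needs matplotlib.

```python
import pandas as pd

# Hypothetical miniature dataset; one missing value in total_bedrooms.
housing = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 4.0, 3.6],
    "total_bedrooms": [129.0, 1106.0, None, 235.0, 280.0, 213.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND",
                        "INLAND", "INLAND", "NEAR OCEAN"],
})

print(housing.head(3))   # first rows
housing.info()           # column dtypes and non-null counts

# How often each category occurs in a text column.
proximity_counts = housing["ocean_proximity"].value_counts()
print(proximity_counts)

# Summary statistics; only the numerical columns are included by default.
summary = housing.describe()
print(summary)

# housing.hist(bins=50) would draw one histogram per numerical column
# (requires matplotlib).
```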

2.4 Create a Test Set

pick 20% of the dataset randomly, and set them aside

train_set, test_set = sklearn.model_selection.train_test_split(housing, test_size=0.2, random_state=42)
Stratified sampling (the test set keeps the same category proportions as the full dataset)
split = sklearn.model_selection.StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
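
A stratified split can be sketched as follows. The data is synthetic, and the income_cat column and its bin edges are assumptions made for the example: the idea is to bucket a continuous attribute into categories and let StratifiedShuffleSplit preserve their proportions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic incomes (assumption for the demo, not real data).
rng = np.random.default_rng(42)
housing = pd.DataFrame(
    {"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=1000)}
)

# Bucket the continuous attribute into 5 strata.
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.iloc[train_index]
    strat_test_set = housing.iloc[test_index]

# Compare category proportions in the full data and in the test set.
overall = housing["income_cat"].value_counts(normalize=True).sort_index()
in_test = strat_test_set["income_cat"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"overall": overall, "stratified_test": in_test}))
```

The two columns of the printed table should be nearly identical, which is the point of stratifying.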

3. Discover and visualize the data to gain insights.

Visualizing Geographical Data
Looking for Correlations
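corr() shows how strongly every feature correlates with a given feature (for example, the target)
pandas.plotting.scatter_matrix() plots each of the n selected features against every other one (older pandas versions exposed it as pandas.tools.plotting.scatter_matrix)
Experimenting with Attribute Combinations
for example, bedrooms_per_room can be more informative than the total number of rooms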
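
The correlation check and an attribute combination might be sketched like this. All numbers are synthetic, and the target is deliberately constructed to depend on the bedroom ratio, so the output only demonstrates the mechanics and is not evidence about any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
total_rooms = rng.integers(500, 5000, size=n).astype(float)
bedroom_ratio = rng.uniform(0.1, 0.4, size=n)

housing = pd.DataFrame({
    "total_rooms": total_rooms,
    "total_bedrooms": total_rooms * bedroom_ratio,
    "households": total_rooms / rng.uniform(4.0, 7.0, size=n),
})
# Synthetic target (assumption): value falls as the bedroom ratio rises.
housing["median_house_value"] = (
    400_000 - 800_000 * bedroom_ratio + rng.normal(0, 20_000, size=n)
)

# Combined attributes.
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]

# Correlation of every attribute with the target, strongest positive first.
corr_matrix = housing.corr()
target_corr = corr_matrix["median_house_value"].sort_values(ascending=False)
print(target_corr)
```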

4. Prepare the data for Machine Learning algorithms.

4.1 Data Cleaning

Get rid of the corresponding districts.
housing.dropna(subset=["total_bedrooms"]) # option 1, returns a new DataFrame
Get rid of the whole attribute.
housing.drop("total_bedrooms", axis=1) # option 2, returns a new DataFrame
Set the values to some value (zero, the mean, the median, etc.)
For example, fill the missing values with the median directly:
median = housing["total_bedrooms"].median()
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median) # option 3
Or let an imputer handle the missing values (SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed from scikit-learn):
imputer = sklearn.impute.SimpleImputer(strategy="median")
X = imputer.fit_transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns)
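
The imputer variant can be tried on a small example. The four-row housing_num DataFrame below is made up for the illustration; the SimpleImputer calls are the standard scikit-learn ones.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numerical attributes with one missing value.
housing_num = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

imputer = SimpleImputer(strategy="median")
X = imputer.fit_transform(housing_num)  # learns the medians, then fills the gaps

# fit_transform returns a NumPy array; wrap it back into a DataFrame.
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing_num.index)

print(imputer.statistics_)  # the learned median of each column
print(housing_tr)
```

Fitting the imputer on the training set and reusing it later means the same medians are applied to the test set and to new data.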

4.2 Handling Text and Categorical Attributes

convert these text labels to numbers

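sklearn.preprocessing.LabelEncoder converts the categories to integers [0, 1, 2, ...]
[Problem] A category value only identifies a class, so its magnitude means nothing, but an integer encoding can mislead a model into treating nearby numbers as similar
sklearn.preprocessing.OneHotEncoder converts the categories to one-hot form
sklearn.preprocessing.OneHotEncoder #1
sklearn.preprocessing.LabelBinarizer #2
Both #1 and #2 can turn numeric or text data into one-hot form; the difference is that #1 takes a 2-D array and #2 takes a 1-D array. (Older scikit-learn versions of #1 inferred the categories from max(value) and so handled only integer input; string support arrived in 0.20, and since the legacy mode was removed in 0.22 the categories are always taken from the unique values.)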
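
The difference in input shape between the two encoders can be seen in a short sketch. The category strings are made-up sample values; the sparse result of OneHotEncoder is converted with toarray() so the snippet does not depend on version-specific constructor arguments.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

# Hypothetical text categories.
categories = np.array(["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"])

# #1 OneHotEncoder expects a 2-D array (samples x features).
encoder = OneHotEncoder()
onehot_2d = encoder.fit_transform(categories.reshape(-1, 1)).toarray()

# #2 LabelBinarizer expects a 1-D array of labels.
binarizer = LabelBinarizer()
onehot_1d = binarizer.fit_transform(categories)

print(encoder.categories_[0])  # column order of the one-hot matrix
print(onehot_2d)
print(onehot_1d)
```

With three or more classes both produce the same one-hot matrix; LabelBinarizer is intended for target labels, so OneHotEncoder is the usual choice for input features.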

4.3 Custom Transformers

Inherit from sklearn.base.BaseEstimator (provides get_params() and set_params()) and sklearn.base.TransformerMixin (provides fit_transform())
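
A custom transformer built on these two base classes might look like the sketch below. The class name RatioAdder and its parameters are hypothetical, chosen for the example: it appends the ratio of two columns as a new feature.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class RatioAdder(BaseEstimator, TransformerMixin):
    """Append column[numerator_ix] / column[denominator_ix] as a new feature."""

    def __init__(self, numerator_ix=0, denominator_ix=1):
        # Store constructor arguments unchanged so get_params() works.
        self.numerator_ix = numerator_ix
        self.denominator_ix = denominator_ix

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        ratio = X[:, self.numerator_ix] / X[:, self.denominator_ix]
        return np.c_[X, ratio]


# Hypothetical columns: total_bedrooms, total_rooms.
X = np.array([[129.0, 880.0], [190.0, 1467.0]])
adder = RatioAdder(numerator_ix=0, denominator_ix=1)
X_extra = adder.fit_transform(X)  # fit_transform comes from TransformerMixin
print(X_extra)
print(adder.get_params())         # get_params comes from BaseEstimator
```

Exposing the column indices as constructor parameters lets a hyperparameter search toggle or retarget the extra feature.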

4.4 Feature Scaling

min-max scaling ( subtracting the min value and dividing by the max minus the min )
sklearn.preprocessing.MinMaxScaler
standardization ( subtracts the mean value (so standardized values always have a zero mean), and then divides by the standard deviation so that the resulting distribution has unit variance )
sklearn.preprocessing.StandardScaler

Compared with min-max scaling, standardization is much less affected by outliers; on the other hand, its output is not bounded to [0, 1]
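
The two scalers can be compared on a toy column that contains one outlier. The values are invented for the illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature, with 100.0 as a deliberate outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max scaling: (x - min) / (max - min), bounded to [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: (x - mean) / standard deviation, unbounded.
X_std = StandardScaler().fit_transform(X)

print(X_minmax.ravel())
print(X_std.ravel())
print("standardized mean:", X_std.mean(), "std:", X_std.std())
```

In the min-max output the outlier takes the value 1 and the four ordinary values are squeezed into a narrow band near 0, which illustrates why min-max scaling is the more outlier-sensitive of the two.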

4.5 Transformation Pipelines

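sklearn.pipeline.Pipeline chains steps in sequence
sklearn.pipeline.FeatureUnion runs transformers in parallel and concatenates their outputs
Pipeline requires every step except the last one to be a transformer; when the pipeline is fitted, each of those steps calls fit_transform() and passes its output on as the input of the next step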
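
A sequential and a parallel composition can be sketched as follows. The step names and the small input array are assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data with one missing value.
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])

# Sequential: impute first, then standardize the imputed output.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
X_serial = num_pipeline.fit_transform(X)

# Parallel: run two pipelines on the same input and concatenate the columns.
minmax_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("minmax_scaler", MinMaxScaler()),
])
union = FeatureUnion([
    ("standardized", num_pipeline),
    ("minmax", minmax_pipeline),
])
X_parallel = union.fit_transform(X)

print(X_serial.shape)    # same number of columns as the input
print(X_parallel.shape)  # columns of both branches side by side
```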

5. Select a model and train it.

5.1 Training and Evaluating on the Training Set

underfitting, possible fixes:
select a more powerful model
feed the training algorithm better features
reduce the constraints on the model (e.g. lower the regularization)
overfitting, possible fixes:
simplify the model
gather more training data
reduce the noise in the training data
add constraints to the model (i.e. regularize it)
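
Both failure modes show up when a model is scored on the data it was trained on. The sketch below uses synthetic quadratic data (an assumption for the demo): a linear model is too simple for it, while an unconstrained decision tree memorizes it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a quadratic trend plus Gaussian noise.
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(0, 1, size=200)

lin_reg = LinearRegression().fit(X, y)
tree_reg = DecisionTreeRegressor(random_state=42).fit(X, y)

# RMSE measured on the training set itself.
lin_rmse = float(np.sqrt(mean_squared_error(y, lin_reg.predict(X))))
tree_rmse = float(np.sqrt(mean_squared_error(y, tree_reg.predict(X))))

print("linear model, training RMSE: ", lin_rmse)
print("decision tree, training RMSE:", tree_rmse)
```

A training error near zero for the tree is not good news by itself: it suggests the model has fit the noise, which only an evaluation on held-out data can confirm.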

5.2 Better Evaluation Using Cross-Validation

sklearn.model_selection.cross_val_score runs K-fold cross-validation
the mean of the fold scores is an estimate of the performance of the model
their standard deviation is a measure of how precise this estimate is
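
Cross-validation can be sketched on synthetic data as below (the data and the choice of 10 folds are assumptions for the example). scikit-learn scoring functions follow a "greater is better" convention, so the MSE is returned negated and has to be flipped before taking the square root.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a quadratic trend plus Gaussian noise.
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(0, 1, size=200)

tree_reg = DecisionTreeRegressor(random_state=42)

# 10-fold cross-validation: train on 9 folds, evaluate on the remaining one.
scores = cross_val_score(tree_reg, X, y,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)

print("RMSE per fold:", rmse_scores)
print("mean:", rmse_scores.mean())               # estimate of the performance
print("standard deviation:", rmse_scores.std())  # precision of that estimate
```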

5.3 Saving the Model ( joblib; the old sklearn.externals.joblib alias was deprecated in scikit-learn 0.21 and has since been removed )
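
Saving and reloading a fitted model with joblib might look like this. The file name my_model.pkl and the toy data are placeholders; a temporary directory is used so the sketch leaves nothing behind.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "my_model.pkl")  # placeholder name
    joblib.dump(model, model_path)         # serialize the fitted model
    loaded_model = joblib.load(model_path)  # restore it later

print(loaded_model.predict(X))
```

Saving each candidate model, together with its hyperparameters and scores, makes it easy to come back and compare models later.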

6. Fine-tune your model.

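Grid Search (when there are few hyperparameters)
fiddle with the hyperparameters manually (the chore that grid search automates)
sklearn.model_selection.GridSearchCV takes several sets of hyperparameter values, tries every combination automatically, and compares them
Randomized Search (when there are many hyperparameters)
sklearn.model_selection.RandomizedSearchCV evaluates randomly chosen hyperparameter combinations
with 1,000 iterations, for example, it tries 1,000 random values for each hyperparameter instead of a few hand-picked ones; given enough computing budget, it may find a better combination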
Ensemble Methods
Analyze the Best Models and Their Errors
Evaluate Your System on the Test Set
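
The fine-tuning steps above can be sketched end to end on synthetic data. The model choice, the parameter ranges, and the small n_iter value are assumptions picked to keep the example fast, not recommended settings.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     train_test_split)

# Synthetic regression data with three features.
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 3))
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(0, 0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Grid search: every combination of the listed values (2 x 3 = 6 candidates).
param_grid = {"n_estimators": [10, 30], "max_features": [1, 2, 3]}
grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
                           param_grid, cv=3,
                           scoring="neg_mean_squared_error")
grid_search.fit(X_train, y_train)

# Randomized search: n_iter random draws from the given distributions.
param_distributions = {"n_estimators": randint(5, 50),
                       "max_features": randint(1, 4)}
rnd_search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                                param_distributions, n_iter=5, cv=3,
                                scoring="neg_mean_squared_error",
                                random_state=42)
rnd_search.fit(X_train, y_train)

print("grid search best:      ", grid_search.best_params_)
print("randomized search best:", rnd_search.best_params_)

# Evaluate the chosen model once on the held-out test set.
final_model = grid_search.best_estimator_
final_rmse = float(np.sqrt(
    mean_squared_error(y_test, final_model.predict(X_test))))
print("test RMSE:", final_rmse)
```

The test set is touched only once, after the search has finished, so the reported error is not biased by the tuning.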

7. Present your solution.

highlight what you have learned
what worked and what did not
what assumptions were made
what your system’s limitations are
document everything
create nice presentations with clear visualizations and easy-to-remember statements

8. Launch, monitor, and maintain your system.

check your system’s live performance at regular intervals and trigger alerts when it drops
evaluate the quality of the system's input data
set up human evaluation pipelines
automate regular model training

@ WHAT - HOW - WHY