Scikit-Learn Cookbook Summary

1 Data Preprocessing

1.1 Getting the Data

sklearn ships with a number of built-in datasets that can be loaded via the load_* functions of the datasets module; some larger datasets are instead downloaded on demand via the fetch_* functions. The code below loads the Boston house-price data and downloads the California housing data.

from sklearn import datasets

# Small bundled dataset: loaded directly from the sklearn installation
boston = datasets.load_boston()
print(boston.DESCR)

# Larger dataset: downloaded on first use and cached in the given directory
california = datasets.fetch_california_housing(data_home='./temp')
# print(california.DESCR)
Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
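
The object returned by load_boston is a Bunch whose data and target fields are plain numpy arrays, so a quick check of what was loaded looks like this:

print(boston.data.shape)     # (506, 13)
print(boston.target.shape)   # (506,)
print(boston.feature_names)  # names of the 13 feature columns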

1.2 Data Processing

sklearn's preprocessing module provides several utilities for preprocessing data:

| Class or method | Effect |
| --- | --- |
| StandardScaler | subtract the mean, then divide by the standard deviation |
| MinMaxScaler | subtract the minimum, then divide by (max - min) |
| normalize | divide each sample (row) by its norm (L2 by default) |
| Binarizer | binarize by a threshold s: x > s ? 1 : 0 |

Usage is shown below. Since the input is a 2-D array, the scaler statistics are computed per column (axis=0); normalize, by contrast, works per sample (row):

import numpy as np
from sklearn import preprocessing

a = np.array([[4., 2.], [2., 4.], [2., -2.]], dtype=float)

print(a)

# (x - mean) / std, per column
scaler = preprocessing.StandardScaler()
r = scaler.fit_transform(a)
print(r)

# (x - min) / (max - min), per column
scaler = preprocessing.MinMaxScaler()
r = scaler.fit_transform(a)
print(r)

# divide each row by its L2 norm
r = preprocessing.normalize(a)
print(r)

# x > 3.5 -> 1, else 0
binary = preprocessing.Binarizer(threshold=3.5)
r = binary.fit_transform(a)
print(r)
[[ 4.  2.]
 [ 2.  4.]
 [ 2. -2.]]
[[ 1.41421356  0.26726124]
 [-0.70710678  1.06904497]
 [-0.70710678 -1.33630621]]
[[1.         0.66666667]
 [0.         1.        ]
 [0.         0.        ]]
[[ 0.89442719  0.4472136 ]
 [ 0.4472136   0.89442719]
 [ 0.70710678 -0.70710678]]
[[1. 0.]
 [0. 1.]
 [0. 0.]]
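
To make the formulas in the table concrete, the two scalers can be reproduced with plain numpy (a quick sketch reusing the array a from above; StandardScaler uses the population standard deviation, ddof=0):

# StandardScaler: column-wise (x - mean) / std
print((a - a.mean(axis=0)) / a.std(axis=0))

# MinMaxScaler: column-wise (x - min) / (max - min)
print((a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0)))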

1.3 Categorical Encoding

Categorical data must be converted to numbers before it can be used in vector computations.

For numeric categories, OneHotEncoder from the preprocessing package can be used; string categories are handled with the feature_extraction module.

from sklearn import preprocessing
from sklearn.feature_extraction import DictVectorizer


# Numeric categories: one indicator column per distinct value
labels = [[1], [2], [3], [2]]

onehot = preprocessing.OneHotEncoder()
y = onehot.fit_transform(labels)

print(y.toarray())


# String categories: DictVectorizer one-hot encodes the dict values
labels = [{'kind': 'apple'}, {'kind': 'orange'}]
dv = DictVectorizer()
y = dv.fit_transform(labels)
print(y.toarray())


# LabelBinarizer one-hot encodes a label vector
labels = [1, 2, 3, 3, 2, 1]
lb = preprocessing.LabelBinarizer()
vec = lb.fit_transform(labels)
print(vec)


[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]
[[1. 0.]
 [0. 1.]]
[[1 0 0]
 [0 1 0]
 [0 0 1]
 [0 0 1]
 [0 1 0]
 [1 0 0]]
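
The preprocessing package also provides LabelEncoder for mapping string labels to integer indices rather than one-hot columns; a small sketch:

# Each class is mapped to an integer index (classes_ is sorted)
le = preprocessing.LabelEncoder()
print(le.fit_transform(['apple', 'orange', 'apple']))  # [0 1 0]
print(le.classes_)                                     # ['apple' 'orange']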

1.4 Handling Missing Values

Missing values can be represented as NaN, but NaN cannot take part in computation, so it is usually filled with a suitable value. Both sklearn and pandas can handle missing values.

import numpy as np
import pandas as pd
from sklearn import preprocessing


data = np.array([[1, 2], [np.nan, 4]])
print('origin:\n', data)

# sklearn: replace NaN with the column mean
imputer = preprocessing.Imputer(strategy='mean')
r = imputer.fit_transform(data)
print('sklearn:\n', r)

# pandas: fillna with the column mean
data_df = pd.DataFrame(data)
df = data_df.fillna(data_df.mean())
print('pandas:\n', df)
origin:
 [[ 1.  2.]
 [nan  4.]]
sklearn:
 [[1. 2.]
 [1. 4.]]
pandas:
      0    1
0  1.0  2.0
1  1.0  4.0
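
Note that Imputer was deprecated in scikit-learn 0.20 and later removed; in newer versions the equivalent is SimpleImputer from sklearn.impute:

# scikit-learn >= 0.20
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
print(imputer.fit_transform(data))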

1.5 Removing Useless Dimensions

PCA is a class in sklearn's decomposition module and can be used to reduce the dimensionality of data.

The code below applies PCA to the iris features. Inspecting each component's share of the variance (explained_variance_ratio_, printed below) shows that about 98% of the variance is captured by the first two principal components, so the data can be reduced to two dimensions. The n_components parameter of PCA accepts either a target dimension or a variance ratio. Plotting the two-dimensional result verifies that the classes remain separable.

Another option is the FactorAnalysis class, which is used much like PCA. KernelPCA, a kernelized variant, additionally supports the kernels linear, poly, rbf, sigmoid, and cosine.

Finally, a truncated SVD of the data matrix also performs dimensionality reduction. Example code for all three methods, and the resulting plots:

import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.decomposition import TruncatedSVD


iris = datasets.load_iris()

# Fit PCA with all components to inspect the variance each one explains
pca = PCA()
dt = pca.fit_transform(iris.data)
print(pca.explained_variance_ratio_)

fig, axes = plt.subplots(1, 3)

# Keep only the first two principal components
pca = PCA(n_components=2)
dt = pca.fit_transform(iris.data)
axes[0].scatter(dt[:, 0], dt[:, 1], c=iris.target)

# Factor analysis with two latent factors
fa = FactorAnalysis(n_components=2)
dt = fa.fit_transform(iris.data)
axes[1].scatter(dt[:, 0], dt[:, 1], c=iris.target)

# Truncated SVD (defaults to 2 components)
svd = TruncatedSVD()
dt = svd.fit_transform(iris.data)
axes[2].scatter(dt[:, 0], dt[:, 1], c=iris.target)
[0.92461621 0.05301557 0.01718514 0.00518309]

[Figure: iris data projected to two dimensions by PCA, FactorAnalysis, and TruncatedSVD, colored by class]
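
A minimal sketch of the KernelPCA variant mentioned above (the rbf kernel here is just an illustrative choice):

from sklearn.decomposition import KernelPCA

# kernel can be 'linear', 'poly', 'rbf', 'sigmoid' or 'cosine'
kpca = KernelPCA(n_components=2, kernel='rbf')
dt = kpca.fit_transform(iris.data)
print(dt.shape)  # (150, 2)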

1.6 Chaining Transforms with Pipeline

For multi-step processing, Pipeline offers a convenient way to organize the code. For example:

from sklearn import pipeline, preprocessing, decomposition, datasets

iris = datasets.load_iris()

# Each step is a (name, transformer) pair; steps run in order
imputer = preprocessing.Imputer()
pca = decomposition.PCA(n_components=2)
line = [('imputer', imputer), ('pca', pca)]

pipe = pipeline.Pipeline(line)
dt = pipe.fit_transform(iris.data)
print(dt.shape)  # (150, 2)
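
After fitting, the individual steps remain accessible via named_steps, which is convenient for inspecting intermediate results:

# e.g. the variance captured by the two retained components
print(pipe.named_steps['pca'].explained_variance_ratio_)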

1.7 Regression with Gaussian Processes

If the target variable, conditioned on the features, can be assumed to follow a Gaussian (normal) distribution, a Gaussian process can be used for regression.

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.gaussian_process import GaussianProcess

boston = datasets.load_boston()

# Random 75/25 train/test split via a boolean mask
sel = np.random.choice([True, False], len(boston.data), p=[0.75, 0.25])
gp = GaussianProcess()
gp.fit(boston.data[sel], boston.target[sel])

pred = gp.predict(boston.data[~sel])
diff = pred - boston.target[~sel]
xtick = range(len(pred))

fig, axes = plt.subplots(2, 1)

# Predicted vs. actual prices on top, residuals below
axes[0].plot(xtick, pred, c='red', label='predict')
axes[0].plot(xtick, boston.target[~sel], c='blue', label='real')

axes[1].plot(xtick, diff)

plt.show()
/usr/local/lib/python3.6/dist-packages/sklearn/utils/deprecation.py:58: DeprecationWarning: Class GaussianProcess is deprecated; GaussianProcess was deprecated in version 0.18 and will be removed in 0.20. Use the GaussianProcessRegressor instead.
  warnings.warn(msg, category=DeprecationWarning)
(repeated DeprecationWarnings for l1_cross_distances, regression_models.constant and correlation_models.squared_exponential omitted)

[Figure: predicted vs. actual prices on the held-out samples (top) and the prediction error (bottom)]
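
Since GaussianProcess is removed in 0.20+, the same split can be fitted with its replacement, GaussianProcessRegressor; a minimal sketch using the default kernel:

from sklearn.gaussian_process import GaussianProcessRegressor

gpr = GaussianProcessRegressor()  # default kernel: 1.0 * RBF(1.0)
gpr.fit(boston.data[sel], boston.target[sel])
pred = gpr.predict(boston.data[~sel])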

1.8 Regression with SGD

import numpy as np

from sklearn import datasets
from sklearn.linear_model import SGDRegressor

# Synthetic regression problem with 1000 samples
X, y = datasets.make_regression(1000)
sel = np.random.choice([True, False], len(X), p=[0.75, 0.25])
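
A minimal sketch of the remaining steps, in the same spirit as the Gaussian process example (fit on the sampled rows, evaluate on the rest):

sgd = SGDRegressor()
sgd.fit(X[sel], y[sel])

pred = sgd.predict(X[~sel])
print(np.abs(pred - y[~sel]).mean())  # mean absolute error on the held-out rows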