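scikit-learn: a machine learning library for Python
1.1 Features:
Simple and efficient tools for data mining and machine learning analysis
Accessible to everyone and highly reusable across different needs
Built on NumPy, SciPy, and matplotlib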
Open source and commercially usable: released under the BSD license
1.2 Problem areas covered:
Classification, regression, clustering, dimensionality reduction
Model selection, preprocessing
Using scikit-learn:
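Install scikit-learn: pip, easy_install, or the Windows installer
Install the required packages: NumPy, SciPy, and matplotlib. Anaconda can be used instead (it bundles NumPy, SciPy, and other common scientific computing packages)
Example: an organization surveys a group of people about whether they buy a computer.
Information entropy can be used to choose the best decision tree: at each node, the split is made on the feature with the largest information gain, that is, the greatest reduction in entropy.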
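To make the entropy criterion concrete, here is a minimal sketch in plain Python that does not use scikit-learn. The helper names `entropy` and `information_gain`, the feature names, and the four toy records are all made up for illustration and are not taken from the CSV file.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Reduction in entropy obtained by splitting the data on one feature."""
    # Group the class labels by the value each row has for this feature.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Weighted average entropy of the groups after the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data (NOT the contents of the CSV file).
rows = [
    {"student": "yes", "income": "high"},
    {"student": "yes", "income": "low"},
    {"student": "no", "income": "high"},
    {"student": "no", "income": "low"},
]
labels = ["yes", "yes", "no", "no"]  # does the person buy a computer?

gains = {f: information_gain(rows, labels, f) for f in ("student", "income")}
best = max(gains, key=gains.get)  # feature to split on first
print(gains, best)
```

In this toy data, `student` separates the two classes perfectly while `income` tells us nothing, so the sketch picks `student` for the first split. A real decision tree repeats the same choice recursively on each resulting subset.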
The code is as follows. The CSV file can be downloaded from: https://pan.baidu.com/s/1sluPilZ
# -*- coding:utf-8 -*-
from sklearn.feature_extraction import DictVectorizer
import csv
from sklearn import tree
from sklearn import preprocessing
from io import StringIO  # sklearn.externals.six is no longer shipped with recent scikit-learn releases
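# Read the CSV file; the features go into a list of dicts and the class labels into a list
# Text mode with newline='' is what the csv module expects on Python 3
allElectronicsData = open(r'C:\Users\zmj\Desktop\AllElectronics2.csv', 'r', newline='')
reader = csv.reader(allElectronicsData)
headers = next(reader)  # headers holds the feature (column) names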
#