The following notes are based on Learning Data Mining with Python (《Python数据挖掘入门与实战》).
1.4 Classification
Classification: train a model on a dataset whose class labels are known, then use the model to assign classes to data whose labels are unknown. A spam filter is a typical example.
Discretization: when a dataset's features are continuous but the algorithm works with categorical feature values, the continuous values must be converted to categories; this conversion is called discretization.
The simplest discretization scheme: pick a threshold, set feature values below the threshold to 0 and values at or above it to 1.
One way to choose the threshold: the mean of all values of that feature.
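A minimal sketch of mean-threshold discretization (toy data, not from the book; the full script below applies the same idea to the iris features):

import numpy as np

X = np.array([[1.0, 5.0],
              [3.0, 1.0],
              [2.0, 6.0]])
means = X.mean(axis=0)                 # per-feature means: [2.0, 4.0]
X_discrete = (X >= means).astype(int)
# array([[0, 1],
#        [1, 0],
#        [1, 1]])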
The OneR algorithm
For each value of each feature, count how often it appears with each class.
The predicted class for a feature value is the class that occurs with it most often; the error rate for that value = 1 - (count of its most frequent class) / (total number of samples with that value). In practice the algorithm sums the absolute error counts over all values of a feature.
Choose the feature with the lowest total error as the sole classification rule.
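A minimal sketch of the per-value error computation (made-up counts, hypothetical classes "A" and "B"):

# among samples whose feature value is 0, suppose class A appears 20 times and class B 5 times
class_counts = {"A": 20, "B": 5}
most_frequent = max(class_counts, key=class_counts.get)                 # "A"
error = sum(n for c, n in class_counts.items() if c != most_frequent)  # 5 misclassified samples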
Testing the algorithm
A machine-learning workflow has two phases: training and testing. Part of the dataset is used to train the model, and another part is used to test how well the model fits data it has never seen.
Overfitting: the model performs well on the training set but poorly on unseen data. The remedy is not to test the model on its training data, i.e., to split the dataset into two parts, one for training and one for testing.
Code implementation
# load the iris plant classification dataset built into scikit-learn
from sklearn.datasets import load_iris
dataset = load_iris()
print(dataset.DESCR) # prints a detailed description of the dataset
import numpy as np
from collections import defaultdict
from operator import itemgetter
X = dataset.data    # feature values
Y = dataset.target  # class labels
attribute_mean = X.mean(axis=0)
X_d = np.array(X >= attribute_mean, dtype='int') # discretize: compare each continuous value with the feature mean to get 0/1
def train_feature_value(X, y_true, feature_index, value):
    # count how frequently samples with the given value of the given
    # feature appear in each class
    class_counts = defaultdict(int)
    for sample, y in zip(X, y_true):
        if sample[feature_index] == value:
            class_counts[y] += 1
    # the class this feature value most likely belongs to is the one
    # with the highest count
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    # the error is the number of samples with this value that do not
    # belong to the most frequent class
    incorrect_predictions = [class_count for class_value, class_count
                             in class_counts.items()
                             if class_value != most_frequent_class]
    error = sum(incorrect_predictions)
    return most_frequent_class, error
def train_on_value(X, y_true, feature_index):
    # predictors maps each feature value to the class it predicts
    predictors = {}
    errors = []
    values = set(X[:, feature_index])
    for v in values:
        most_frequent_class, error = train_feature_value(X, y_true, feature_index, v)
        predictors[v] = most_frequent_class
        errors.append(error)
    total_error = sum(errors)
    return predictors, total_error
# split the dataset into a training set and a testing set (25% for testing by default);
# the split is random, so each subset roughly reflects the distribution of the full dataset
from sklearn.model_selection import train_test_split  # in older scikit-learn versions this lived in sklearn.cross_validation
Xd_train, Xd_test, y_train, y_test = train_test_split(X_d, Y, random_state=14)
# random_state fixes the random seed so the split is reproducible
all_predictors = {}
errors = {}
# train a OneR predictor for every feature and record its total error
for feature_index in range(Xd_train.shape[1]):
    predictor, error = train_on_value(Xd_train, y_train, feature_index)
    all_predictors[feature_index] = predictor
    errors[feature_index] = error
# pick the feature with the lowest total error and build the final classification model
best_feature, best_error = sorted(errors.items(), key=itemgetter(1))[0]
model = {"feature": best_feature, 'predictor': all_predictors[best_feature]}
def predict(X_test, model):
    feature = model["feature"]
    predictor = model["predictor"]
    # look up the predicted class for each sample's value of the chosen feature
    y_predicted = np.array([predictor[int(sample[feature])] for sample in X_test])
    return y_predicted
# accuracy of the predictions against the true labels
y_predicted = predict(Xd_test, model)
accuracy = np.mean(y_predicted == y_test) * 100
print("The test accuracy is {0:.1f}".format(accuracy))