Problem Description
This report addresses the problem of classifying iris flowers by species. Each flower is described by four features: sepal length, sepal width, petal length and petal width. The task is to classify the samples using the kNN algorithm, a decision tree and naive Bayes.
Data Preparation and Preprocessing
Data Preparation
The data come from the iris dataset, which contains 150 flowers described by the features sepal length, sepal width, petal length and petal width.
Data Preprocessing
The comma-separated data rows are parsed. As preprocessing, the whitespace used to separate records in the text file is removed and replaced with newline characters, so that each record sits on its own line.
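A minimal sketch of this kind of parsing (illustrative only; it assumes each record is four comma-separated numbers followed by a class name, with whitespace between records):
raw = "5.1,3.5,1.4,0.2,Iris-setosa 4.9,3.0,1.4,0.2,Iris-setosa"   # two assumed sample records
rows = raw.split()                      # whitespace between records -> one record per entry
records = []
for row in rows:
    parts = row.split(",")              # split the comma-separated fields
    records.append([float(v) for v in parts[:4]] + [parts[4]])
print(records[0])                       # [5.1, 3.5, 1.4, 0.2, 'Iris-setosa']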
Model Principles and Algorithm Implementation
kNN (k-Nearest Neighbor) Algorithm
(1) Compute the distance between the current point and every point in the dataset with known class labels;
(2) sort the points in order of increasing distance;
(3) select the k points closest to the current point;
(4) determine the frequency of each class among these k points;
(5) return the most frequent class among the k points as the predicted class of the current point (a minimal sketch of these steps is given below).
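A compact sketch of these five steps, assuming each training row is a list of four feature values followed by the class label (the script actually used for the experiments is listed in the Code section):
import numpy as np
from collections import Counter

def knn_predict(train_rows, query, k=3):
    # train_rows: list of [f1, f2, f3, f4, label]; query: [f1, f2, f3, f4]
    X = np.array([r[:-1] for r in train_rows], dtype=float)
    labels = [r[-1] for r in train_rows]
    dists = np.sqrt(((X - np.array(query, dtype=float)) ** 2).sum(axis=1))  # (1) distances
    order = np.argsort(dists)                  # (2) sort by increasing distance
    top_k = [labels[i] for i in order[:k]]     # (3) take the k closest points
    votes = Counter(top_k)                     # (4) class frequencies among them
    return votes.most_common(1)[0][0]          # (5) most frequent class wins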
Decision Tree
The guiding principle when splitting a dataset is to make disordered data more ordered, and information theory can be used to measure this. The change in information before and after a split is called the information gain. Once we know how to compute it, we can evaluate the information gain obtained by splitting on each feature, and the feature with the highest information gain is the best choice.
Entropy is defined as the expected value of the information. The information of a symbol $x_i$ is defined as
$$l(x_i) = -\log_2 p(x_i)$$
where $p(x_i)$ is the probability of choosing that class.
To compute the entropy, we take the expected value of the information over all possible values of all classes:
$$H = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$$
where $n$ is the number of classes.
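As a worked example, the iris labels are balanced, with 50 samples in each of the 3 classes (see the Data Visualization section), so the entropy of the class labels is
$$H = -\sum_{i=1}^{3}\frac{50}{150}\log_2\frac{50}{150} = \log_2 3 \approx 1.585\ \text{bits},$$
and any useful split on a feature must reduce this value.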
Naive Bayes
The classic conditional-probability formula from probability theory (Bayes' theorem) is
$$P(A\mid B) = \frac{P(B\mid A)\,P(A)}{P(B)}$$
A classic application of naive Bayes is spam filtering, which operates on text data, so that setting is commonly used to explain the theorem. Let $D$ be the set of training samples together with their class labels, where each sample has the attribute set $X = \{X_1, X_2, \dots, X_n\}$ with $n$ attributes, and the class labels are $C = \{C_1, C_2, \dots, C_m\}$ with $m$ classes. The naive Bayes theorem states
$$P(C_i\mid X) = \frac{P(X\mid C_i)\,P(C_i)}{P(X)}$$
where $P(C_i\mid X)$ is the posterior probability, $P(C_i)$ is the prior probability, and $P(X\mid C_i)$ is the class-conditional probability. Naive Bayes makes two assumptions: (1) the attributes are mutually independent; (2) every attribute is equally important. Under these assumptions the conditional probability simplifies to
$$P(X\mid C_i) = \prod_{k=1}^{n} P(x_k\mid C_i)$$
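A minimal sketch of the resulting decision rule for the iris data, under the additional assumptions (made here, not stated above) that each continuous feature is Gaussian within each class and that the class priors are equal; the full script used for the experiment is listed in the Code section:
import numpy as np

def gaussian_nb_predict(train, query):
    # train: array of shape (N, 5) whose last column is the class id; query: 4 feature values
    train = np.asarray(train, dtype=float)
    query = np.asarray(query, dtype=float)
    best_class, best_score = None, -np.inf
    for c in np.unique(train[:, -1]):
        feats = train[train[:, -1] == c][:, :-1]
        mean, std = feats.mean(axis=0), feats.std(axis=0, ddof=1)
        # log P(X|C_i) = sum_k log N(x_k; mean_k, std_k), using conditional independence
        log_like = -0.5 * np.sum(np.log(2 * np.pi * std ** 2) + ((query - mean) / std) ** 2)
        if log_like > best_score:    # equal priors assumed, so the likelihood decides
            best_class, best_score = c, log_like
    return best_class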
Data Visualization
For the iris dataset, we first look at the overall shape of the data: how many samples there are, how many features each sample has, and so on.
Next we check whether the classes are balanced, i.e. how many samples each class has; there are three classes with 50 samples each.
We then look at the distribution for outliers with a scatter plot, coloring the points by species. Here only two features are used: sepal length and sepal width.
We also examine the distribution of individual features; first sepal length and sepal width are inspected,
and then petal length and petal width.
From these plots it is already apparent that, among the factors determining the species, the petal features separate the species more easily.
Finally, we look at the pairwise relationships between all variables.
The data confirm that the petal features are indeed more decisive for the species.
Testing Method and Results
Because of my limited experience with Python, the data preprocessing and handling still have considerable shortcomings. When testing the kNN algorithm on the iris dataset, k was set to 3. Since there is no separate test set, the iris dataset was split at random so that roughly two thirds of the samples form the training set and one third the test set (split = 0.67 in the code below).
Because the iris dataset is a classic and therefore very "clean" dataset, the measured test accuracy was 100%.
Trying different values of k shows that the accuracy stays at 100% for k up to about 50; only for larger values does the accuracy start to fluctuate.
When applying the decision tree algorithm, I first ran the textbook code directly, but because the feature values are continuous, the resulting visualized tree was very cluttered.
To reduce the number of branches, every feature column was processed as follows: compute the minimum and maximum of the column and split that range into four equal-width intervals labeled "short", "mid", "longer" and "long". To keep the visualization compact, these four labels are abbreviated to "s", "m", "lger" and "l".
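The equal-width binning described above can also be expressed compactly with pandas.cut; this is only an illustrative alternative to the per-column deal functions listed in the Code section:
import pandas as pd

def discretize_equal_width(df, column, labels=("s", "m", "lger", "l")):
    # Split the [min, max] range of one feature column into four equal-width bins
    # and replace the numeric values with the short bin labels.
    df[column] = pd.cut(df[column], bins=4, labels=list(labels))
    return df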
For testing, single records were drawn from the data as test samples. Sometimes a drawn record is the only one of its kind, so its discretized feature value never occurs in the training set, the lookup in the tree fails and an error is raised.
It turned out that records taken from rows 50-100 of the dataset rarely trigger this error, so 5 records were drawn at random from that range to serve as the test set.
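One way to avoid such lookup errors is to fall back to a default class whenever a feature value has no matching branch. This is only a sketch: it assumes the classify function from the decision tree code in the Code section, and default_label is a hypothetical parameter (e.g. the majority class of the training set):
def classify_safe(tree, feat_labels, test_vec, default_label):
    # Wrap classify() so that an unseen feature value does not crash the test run.
    try:
        return classify(tree, feat_labels, test_vec)
    except KeyError:
        return default_label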
In one such run, 4 of the 5 test samples were predicted with the correct label (see the Summary).
For naive Bayes, the dataset was again split into a training set and a test set, and the test accuracy obtained was high.
Summary
Owing to my own limitations, this exercise drew on a lot of material found online. When testing the decision tree, some records are rather isolated, and using them as test data causes errors; this also highlights one of the weaknesses of decision trees. I used the decision tree for prediction and studied it in some depth; after five different random training/testing runs, in one run 4 of the 5 test samples were predicted with the correct label. Because most randomly drawn records could not be used for testing, the evaluation was quite difficult and these test results are not very reliable. For kNN, since the iris data are so clean, the accuracy reached 100% and only changes once k becomes very large. With naive Bayes the accuracy obtained was also fairly high.
This report showed me the gaps in my programming skills: I cannot yet apply the textbook code flexibly, and my data handling, such as normalization, post-import processing and feature handling, still leaves much to be desired.
Code
Visualization
import pandas as pd
from sklearn.datasets import load_iris
# Load seaborn; importing it produces warnings, so load warnings first and suppress them.
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="white", color_codes=True)
iris = pd.DataFrame(load_iris().data)
iris.columns = ['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']
iris['Species'] = load_iris().target
## First explore the overall shape of the dataset: how many samples, how many features per sample, etc.
#print (iris.shape)
#print (iris.head())
#
#
### Check whether the classification is balanced
### How many classes are there? How many samples per class?
#print (iris["Species"].value_counts())
#
#
#
## plot is the main plotting method; both Series and DataFrame have a plot method.
## plot draws a line chart by default; the kind parameter selects other chart types: line, bar (bar chart), barh,
## kde, density, scatter (scatter plot).
## For coordinate-like data, a scatter plot shows the distribution trend and any outliers.
#iris.plot(kind="scatter", x="SepalLengthCm", y="SepalWidthCm")
# Next, seaborn's FacetGrid is used to color the scatter points by Species (hue selects the coloring variable).
#sns.FacetGrid(iris, hue="Species", size=5).map(plt.scatter, "SepalLengthCm", "SepalWidthCm").add_legend()
# Use a box plot to look at the distribution of a single feature
# For a numerical variable, a box plot gives a direct view of its distribution for each species.
#sns.boxplot(x="Species", y="SepalLengthCm", data=iris)
# Overlay the individual points of each Species on the box plot as a strip plot;
# jitter=True spreads the points apart so they do not collapse onto a single line
#ax = sns.boxplot(x="Species", y="PetalWidthCm", data=iris)
#ax = sns.stripplot(x="Species", y="PetalWidthCm", data=iris, jitter=True, edgecolor="gray")
# violinplot: shows the density distribution; it combines and simplifies the two plots above
# the denser the data, the wider the violin; the sparser, the narrower
#sns.violinplot(x="Species", y="PetalLengthCm", data=iris, size=6)
# sns.kdeplot == kernel density plot (single variable)
#sns.FacetGrid(iris, hue="Species", size=6).map(sns.kdeplot, "PetalLengthCm").add_legend()
# pairplot: the relationship between every pair of variables
sns.pairplot(iris, hue="Species", size=3)
kNN
import csv
import random
import math
import operator
from sklearn import neighbors
def loadDataset(filename,split,trainingSet=[],testSet = []):
with open(filename,"rt") as csvfile:
lines = csv.reader(csvfile)
dataset = list(lines)
for x in range(len(dataset)-1):
for y in range(4):
dataset[x][y] = float(dataset[x][y])
if random.random()<split:
trainingSet.append(dataset[x])
else:
testSet.append(dataset[x])
def euclideanDistance(instance1,instance2,length):
distance = 0
for x in range(length):
distance += pow((instance1[x] - instance2[x]),2)
return math.sqrt(distance)
def getNeighbors(trainingSet,testInstance,k):
distances = []
length = len(testInstance) -1
for x in range(len(trainingSet)):
dist = euclideanDistance(testInstance, trainingSet[x], length)
distances.append((trainingSet[x],dist))
distances.sort(key=operator.itemgetter(1))
neighbors = []
for x in range(k):
neighbors.append(distances[x][0])
return neighbors
def getResponse(neighbors):
classVotes = {}
for x in range(len(neighbors)):
response = neighbors[x][-1]
if response in classVotes:
classVotes[response]+=1
else:
classVotes[response] = 1
sortedVotes = sorted(classVotes.items(),key = operator.itemgetter(1),reverse =True)
return sortedVotes[0][0]
def getAccuracy(testSet,predictions):
correct = 0
for x in range(len(testSet)):
if testSet[x][-1] == predictions[x]:
correct+=1
return (correct/float(len(testSet))) * 100.0
def main():
trainingSet = []
testSet = []
split = 0.67
loadDataset("c:\\Users\\29795\\Desktop\\iris\\iris.data",split,trainingSet,testSet)
print("Train set :" + repr(len(trainingSet)))
print ("Test set :" + repr(len(testSet)) )
predictions = []
k = 70   # k = 3 was used for the main experiment; large values such as 70 were tried to see when the accuracy starts to drop
for x in range(len(testSet)):
neighbors = getNeighbors(trainingSet, testSet[x], k)
result = getResponse(neighbors)
predictions.append(result)
print (">predicted = " + repr(result) + ",actual = " + repr(testSet[x][-1]))
accuracy = getAccuracy(testSet, predictions)
print ("Accuracy:" + repr(accuracy) + "%" )
if __name__ =="__main__":
main()
Decision Tree
from math import log
import matplotlib.pyplot as plt
import numpy as np
import operator
import re
import pandas as pd
import random
from sklearn.datasets import load_iris
def createDataSet():
dataSet = [[1, 1, 'yes'],
[1, 1, 'yes'],
[1, 0, 'no'],
[0, 1, 'no'],
[0, 1, 'no']]
labels = ['no surfacing','flippers']
#change to discrete values
return dataSet, labels
def calcShannonEnt(dataSet):
numEntries = len(dataSet)
labelCounts = {}
for featVec in dataSet: # count the occurrences of each class label
currentLabel = featVec[-1]
if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0
labelCounts[currentLabel] += 1
shannonEnt = 0.0
for key in labelCounts:
prob = float(labelCounts[key])/numEntries
shannonEnt -= prob * log(prob,2) #log base 2
return shannonEnt
def splitDataSet(dataSet, axis, value):
retDataSet = []
for featVec in dataSet:
if featVec[axis] == value:
reducedFeatVec = featVec[:axis] #chop out axis used for splitting
reducedFeatVec.extend(featVec[axis+1:])
retDataSet.append(reducedFeatVec)
return retDataSet
def chooseBestFeatureToSplit(dataSet):
numFeatures = len(dataSet[0]) - 1 #the last column is used for the labels
baseEntropy = calcShannonEnt(dataSet)
bestInfoGain = 0.0; bestFeature = -1
for i in range(numFeatures): #iterate over all the features
featList = [example[i] for example in dataSet]#create a list of all the examples of this feature
uniqueVals = set(featList) #get a set of unique values
newEntropy = 0.0
for value in uniqueVals:
subDataSet = splitDataSet(dataSet, i, value)
prob = len(subDataSet)/float(len(dataSet))
newEntropy += prob * calcShannonEnt(subDataSet)
infoGain = baseEntropy - newEntropy #calculate the info gain; ie reduction in entropy
if (infoGain > bestInfoGain): #compare this to the best gain so far
bestInfoGain = infoGain #if better than current best, set to best
bestFeature = i
return bestFeature #returns an integer
def majorityCnt(classList):
classCount={}
for vote in classList:
if vote not in classCount.keys(): classCount[vote] = 0
classCount[vote] += 1
sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
return sortedClassCount[0][0]
def createTree(dataSet,labels):
classList = [example[-1] for example in dataSet]
if classList.count(classList[0]) == len(classList):
return classList[0]#stop splitting when all of the classes are equal
if len(dataSet[0]) == 1: #stop splitting when there are no more features in dataSet
return majorityCnt(classList)
bestFeat = chooseBestFeatureToSplit(dataSet)
bestFeatLabel = labels[bestFeat]
myTree = {bestFeatLabel:{}}
del(labels[bestFeat])
featValues = [example[bestFeat] for example in dataSet]
uniqueVals = set(featValues)
for value in uniqueVals:
subLabels = labels[:] #copy all of labels, so trees don't mess up existing labels
myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
return myTree
def classify(inputTree,featLabels,testVec):
firstStr = list(inputTree.keys())[0]
secondDict = inputTree[firstStr]
featIndex = featLabels.index(firstStr)
key = testVec[featIndex]
valueOfFeat = secondDict[key]
if isinstance(valueOfFeat, dict):
classLabel = classify(valueOfFeat, featLabels, testVec)
else: classLabel = valueOfFeat
return classLabel
def storeTree(inputTree,filename):
    import pickle
    fw = open(filename,'wb')   # pickle requires binary mode
    pickle.dump(inputTree,fw)
    fw.close()
def grabTree(filename):
    import pickle
    fr = open(filename,'rb')   # binary mode to match storeTree
    return pickle.load(fr)
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")
def getNumLeafs(myTree):
numLeafs = 0
firstStr = list(myTree.keys())[0]
secondDict = myTree[firstStr]
for key in secondDict.keys():
if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictionaries, if not they are leaf nodes
numLeafs += getNumLeafs(secondDict[key])
else: numLeafs +=1
return numLeafs
def getTreeDepth(myTree):
maxDepth = 0
firstStr = list(myTree.keys())[0]
secondDict = myTree[firstStr]
for key in secondDict.keys():
if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictionaries, if not they are leaf nodes
thisDepth = 1 + getTreeDepth(secondDict[key])
else: thisDepth = 1
if thisDepth > maxDepth: maxDepth = thisDepth
return maxDepth
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
xytext=centerPt, textcoords='axes fraction',
va="center", ha="center", bbox=nodeType, arrowprops=arrow_args )
def plotMidText(cntrPt, parentPt, txtString):
xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]
yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)
def plotTree(myTree, parentPt, nodeTxt):#if the first key tells you what feat was split on
numLeafs = getNumLeafs(myTree) #this determines the x width of this tree
depth = getTreeDepth(myTree)
firstStr = list(myTree.keys())[0] #the text label for this node should be this
cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
plotMidText(cntrPt, parentPt, nodeTxt)
plotNode(firstStr, cntrPt, parentPt, decisionNode)
secondDict = myTree[firstStr]
plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD
for key in secondDict.keys():
if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictionaries, if not they are leaf nodes
plotTree(secondDict[key],cntrPt,str(key)) #recursion
else: #it's a leaf node print the leaf node
plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD
#if you do get a dictionary you know it's a tree, and the first element will be another dict
def createPlot(inTree):
fig = plt.figure(1, facecolor='white')
fig.clf()
axprops = dict(xticks=[], yticks=[])
createPlot.ax1 = plt.subplot(111, frameon=False, **axprops) #no ticks
#createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo purposes
plotTree.totalW = float(getNumLeafs(inTree))
plotTree.totalD = float(getTreeDepth(inTree))
plotTree.xOff = -0.5/plotTree.totalW; plotTree.yOff = 1.0;
plotTree(inTree, (0.5,1.0), '')
plt.show()
#def createPlot():
# fig = plt.figure(1, facecolor='white')
# fig.clf()
# createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo purposes
# plotNode('a decision node', (0.5, 0.1), (0.1, 0.5), decisionNode)
# plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
# plt.show()
def retrieveTree(i):
listOfTrees =[{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
{'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
]
return listOfTrees[i]
def autoNorm(dataSet):
    # Min-max normalize each column to [0, 1]; uses numpy (np) instead of the
    # undefined names kNN.array, zeros, shape and tile.
    dataSet = np.array(dataSet)
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))  # element-wise divide
    return normDataSet
def dataDiscretize(dataSet):
    m, n = dataSet.shape                       # number of samples and of columns
    disMat = np.tile([0], dataSet.shape)       # initialize the discretized dataset
    for i in range(n - 1):                     # the last column is the class, so only the feature columns
        x = [l[i] for l in dataSet]            # extract the i-th feature vector
        y = pd.cut(x, 10, labels=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9])  # discretize into 10 bins (adjust as needed)
        for k in range(m):                     # write the discretized values back, one row at a time
            disMat[k][i] = y[k]
    return disMat
def deal(dataset, col):
    # Discretize one feature column into four equal-width bins over [min, max];
    # bin labels are "s", "m", "lger", "l" together with the interval bounds.
    # (Replaces the four nearly identical deal1-deal4 functions.)
    b = dataset[col].max()
    a = dataset[col].min()
    t = round((b - a) / 4, 2)
    def f(x):
        if a <= x <= a + t:
            return "s" + "(" + str(round(a, 2)) + "," + str(round(a + t, 2)) + ")"
        elif x <= a + 2 * t:
            return "m" + "(" + str(round(a + t, 2)) + "," + str(round(a + 2 * t, 2)) + ")"
        elif x <= a + 3 * t:
            return "lger" + "(" + str(round(a + 2 * t, 2)) + "," + str(round(a + 3 * t, 2)) + ")"
        else:
            return "l" + "(" + str(round(a + 3 * t, 2)) + "," + str(round(b, 2)) + ")"
    dataset[col] = dataset[col].apply(f)
    return dataset
data_set = load_iris()
feature_names = data_set["feature_names"]
data = data_set["data"]
labels = data_set["target"]
df = pd.DataFrame(data,columns=feature_names)
df["label"] = labels
for col in feature_names:
    df = deal(df, col)    # apply the four-bin discretization to every feature column
#print(df)
X = df.iloc[:,[0,1,2,3,4]].values.tolist()
#print(X)
y = df.iloc[:,4].values.T.tolist()
#print(y)
lensesLabels=['sepallength','sepalwidth','petallength','petalwidth']
#lensesTree=createTree(X,lensesLabels)
#print(lensesTree)
#createPlot(lensesTree)
num=0
k=0
trainlenses=X[31:]
test=X[:30]
#print(test)
lensesTrees=createTree(test,lensesLabels)
createPlot(lensesTrees)
#test=X[:30]
#lensesTrees=createTree(trainlenses,lensesLabels)
#lensesLabels=['sepallength','sepalwidth','petallength','petalwidth']
#for i in range(30):
# if X[i][-1]==classify(lensesTrees,lensesLabels,test[i]):
# num=num+1
#print(num)
#lensesTree=createTree(lenses,lensesLabels)
#testVect=lenses[0]
Naive Bayes
import pandas as pd
import numpy as np
data='''5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica'''
data = data.replace(' ','').replace("Iris-setosa","1.0").replace("Iris-versicolor","2.0").replace("Iris-virginica","3.0").split('\n')
data = list(filter(lambda x: len(x) > 0,data))
data = [x.split(',') for x in data]
data = np.array(data).astype(np.float16)
def splitData(trainPrecent=0.7):
train = []
test = []
for i in data:
(train if np.random.random() < trainPrecent else test).append(i)
return np.array(train),np.array(test)
trainData,testData = splitData()
print("共有%d条数据,分解为%d条训练集与%d条测试集"%(len(data),len(trainData),len(testData)))
clf=set(trainData[:,-1])
trainClfData={}
for x in clf:
clfItems=np.array(list(filter(lambda i:i[-1]==x ,trainData)))[:,:-1]
mean=clfItems.mean(axis=0)
stdev= np.sqrt(np.sum((clfItems-mean)**2,axis=0)/float(len(clfItems)-1))
trainClfData[x]=np.array([mean,stdev]).T
result=[]
for testItem in testData:
itemData=testItem[0:-1]
itemClf=testItem[-1]
prediction={}
for clfItem in trainClfData:
probabilities= np.exp(-1*(testItem[0:-1]-trainClfData[clfItem][:,0])**2/(trainClfData[clfItem][:,1]**2*2)) / (np.sqrt(2*np.pi)*trainClfData[clfItem][:,1])
clfPrediction=1
for proItem in probabilities:
clfPrediction*=proItem
prediction[clfItem]=clfPrediction
maxProbability = None
for x in prediction:
if maxProbability is None or prediction[x] > prediction[maxProbability]:
maxProbability = x
result.append({'data': itemData.tolist()
, 'actual class': itemClf
, 'class probabilities': prediction
, 'predicted class (max probability)': maxProbability
, 'correct': 1 if itemClf == maxProbability else 0})
rightCount = 0
for x in result:
rightCount += x['correct']
print('%d test samples, %d predicted correctly, accuracy %.2f' % (len(result), rightCount, rightCount / len(result)))
References
[1] Peter Harrington, Machine Learning in Action (《机器学习实战》).