Problem Description
This report addresses the problem of classifying iris flowers by species. Each flower is described by four features: sepal length, sepal width, petal length and petal width. The task is to classify the samples using the kNN algorithm, a decision tree and naive Bayes.
Data Preparation and Preprocessing
Data Preparation
The data come from the iris dataset, which contains 150 flowers described by the features sepal length, sepal width, petal length and petal width.
Data Preprocessing
The comma-separated data rows are parsed. As preprocessing, the whitespace used to separate records in the text file is removed and replaced with newline characters, so that each record sits on its own line.
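A minimal sketch of this kind of parsing (illustrative only; it assumes each record is four comma-separated numbers followed by a class name, with whitespace between records):
raw = "5.1,3.5,1.4,0.2,Iris-setosa 4.9,3.0,1.4,0.2,Iris-setosa"   # two assumed sample records
rows = raw.split()                      # whitespace between records -> one record per entry
records = []
for row in rows:
    parts = row.split(",")              # split the comma-separated fields
    records.append([float(v) for v in parts[:4]] + [parts[4]])
print(records[0])                       # [5.1, 3.5, 1.4, 0.2, 'Iris-setosa']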
Model Principles and Algorithm Implementation
kNN (k-Nearest Neighbor) Algorithm
(1) Compute the distance between the current point and every point in the dataset with known class labels;
(2) sort the points in order of increasing distance;
(3) select the k points closest to the current point;
(4) determine the frequency of each class among these k points;
(5) return the most frequent class among the k points as the predicted class of the current point (a minimal sketch of these steps is given below).
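A compact sketch of these five steps, assuming each training row is a list of four feature values followed by the class label (the script actually used for the experiments is listed in the Code section):
import numpy as np
from collections import Counter

def knn_predict(train_rows, query, k=3):
    # train_rows: list of [f1, f2, f3, f4, label]; query: [f1, f2, f3, f4]
    X = np.array([r[:-1] for r in train_rows], dtype=float)
    labels = [r[-1] for r in train_rows]
    dists = np.sqrt(((X - np.array(query, dtype=float)) ** 2).sum(axis=1))  # (1) distances
    order = np.argsort(dists)                  # (2) sort by increasing distance
    top_k = [labels[i] for i in order[:k]]     # (3) take the k closest points
    votes = Counter(top_k)                     # (4) class frequencies among them
    return votes.most_common(1)[0][0]          # (5) most frequent class wins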
Decision Tree
The guiding principle when splitting a dataset is to make disordered data more ordered, and information theory can be used to measure this. The change in information before and after a split is called the information gain. Once we know how to compute it, we can evaluate the information gain obtained by splitting on each feature, and the feature with the highest information gain is the best choice.
Entropy is defined as the expected value of the information. The information of a symbol $x_i$ is defined as
$$l(x_i) = -\log_2 p(x_i)$$
where $p(x_i)$ is the probability of choosing that class.
To compute the entropy, we take the expected value of the information over all possible values of all classes:
$$H = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$$
where $n$ is the number of classes.
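As a worked example, the iris labels are balanced, with 50 samples in each of the 3 classes (see the Data Visualization section), so the entropy of the class labels is
$$H = -\sum_{i=1}^{3}\frac{50}{150}\log_2\frac{50}{150} = \log_2 3 \approx 1.585\ \text{bits},$$
and any useful split on a feature must reduce this value.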
Naive Bayes
The classic conditional-probability formula from probability theory (Bayes' theorem) is
$$P(A\mid B) = \frac{P(B\mid A)\,P(A)}{P(B)}$$
A classic application of naive Bayes is spam filtering, which operates on text data, so that setting is commonly used to explain the theorem. Let $D$ be the set of training samples together with their class labels, where each sample has the attribute set $X = \{X_1, X_2, \dots, X_n\}$ with $n$ attributes, and the class labels are $C = \{C_1, C_2, \dots, C_m\}$ with $m$ classes. The naive Bayes theorem states
$$P(C_i\mid X) = \frac{P(X\mid C_i)\,P(C_i)}{P(X)}$$
where $P(C_i\mid X)$ is the posterior probability, $P(C_i)$ is the prior probability, and $P(X\mid C_i)$ is the class-conditional probability. Naive Bayes makes two assumptions: (1) the attributes are mutually independent; (2) every attribute is equally important. Under these assumptions the conditional probability simplifies to
$$P(X\mid C_i) = \prod_{k=1}^{n} P(x_k\mid C_i)$$
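A minimal sketch of the resulting decision rule for the iris data, under the additional assumptions (made here, not stated above) that each continuous feature is Gaussian within each class and that the class priors are equal; the full script used for the experiment is listed in the Code section:
import numpy as np

def gaussian_nb_predict(train, query):
    # train: array of shape (N, 5) whose last column is the class id; query: 4 feature values
    train = np.asarray(train, dtype=float)
    query = np.asarray(query, dtype=float)
    best_class, best_score = None, -np.inf
    for c in np.unique(train[:, -1]):
        feats = train[train[:, -1] == c][:, :-1]
        mean, std = feats.mean(axis=0), feats.std(axis=0, ddof=1)
        # log P(X|C_i) = sum_k log N(x_k; mean_k, std_k), using conditional independence
        log_like = -0.5 * np.sum(np.log(2 * np.pi * std ** 2) + ((query - mean) / std) ** 2)
        if log_like > best_score:    # equal priors assumed, so the likelihood decides
            best_class, best_score = c, log_like
    return best_class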
Data Visualization
For the iris dataset, we first look at the overall shape of the data: how many samples there are, how many features each sample has, and so on.
Next we check whether the classes are balanced, i.e. how many samples each class has; there are three classes with 50 samples each.
We then look at the distribution for outliers with a scatter plot, coloring the points by species. Here only two features are used: sepal length and sepal width.
We also examine the distribution of individual features; first sepal length and sepal width are inspected,
and then petal length and petal width.
From these plots it is already apparent that, among the factors determining the species, the petal features separate the species more easily.
Finally, we look at the pairwise relationships between all variables.
The data confirm that the petal features are indeed more decisive for the species.
Testing Method and Results
Because of my limited experience with Python, the data preprocessing and handling still have considerable shortcomings. When testing the kNN algorithm on the iris dataset, k was set to 3. Since there is no separate test set, the iris dataset was split at random so that roughly two thirds of the samples form the training set and one third the test set (split = 0.67 in the code below).
Because the iris dataset is a classic and therefore very "clean" dataset, the measured test accuracy was 100%.
Trying different values of k shows that the accuracy stays at 100% for k up to about 50; only for larger values does the accuracy start to fluctuate.
When applying the decision tree algorithm, I first ran the textbook code directly, but because the feature values are continuous, the resulting visualized tree was very cluttered.
To reduce the number of branches, every feature column was processed as follows: compute the minimum and maximum of the column and split that range into four equal-width intervals labeled "short", "mid", "longer" and "long". To keep the visualization compact, these four labels are abbreviated to "s", "m", "lger" and "l".
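The equal-width binning described above can also be expressed compactly with pandas.cut; this is only an illustrative alternative to the per-column deal functions listed in the Code section:
import pandas as pd

def discretize_equal_width(df, column, labels=("s", "m", "lger", "l")):
    # Split the [min, max] range of one feature column into four equal-width bins
    # and replace the numeric values with the short bin labels.
    df[column] = pd.cut(df[column], bins=4, labels=list(labels))
    return df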
For testing, single records were drawn from the data as test samples. Sometimes a drawn record is the only one of its kind, so its discretized feature value never occurs in the training set, the lookup in the tree fails and an error is raised.
It turned out that records taken from rows 50-100 of the dataset rarely trigger this error, so 5 records were drawn at random from that range to serve as the test set.
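One way to avoid such lookup errors is to fall back to a default class whenever a feature value has no matching branch. This is only a sketch: it assumes the classify function from the decision tree code in the Code section, and default_label is a hypothetical parameter (e.g. the majority class of the training set):
def classify_safe(tree, feat_labels, test_vec, default_label):
    # Wrap classify() so that an unseen feature value does not crash the test run.
    try:
        return classify(tree, feat_labels, test_vec)
    except KeyError:
        return default_label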
In one such run, 4 of the 5 test samples were predicted with the correct label (see the Summary).
For naive Bayes, the dataset was again split into a training set and a test set, and the test accuracy obtained was high.
Summary
Owing to my own limitations, this exercise drew on a lot of material found online. When testing the decision tree, some records are rather isolated, and using them as test data causes errors; this also highlights one of the weaknesses of decision trees. I used the decision tree for prediction and studied it in some depth; after five different random training/testing runs, in one run 4 of the 5 test samples were predicted with the correct label. Because most randomly drawn records could not be used for testing, the evaluation was quite difficult and these test results are not very reliable. For kNN, since the iris data are so clean, the accuracy reached 100% and only changes once k becomes very large. With naive Bayes the accuracy obtained was also fairly high.
This report showed me the gaps in my programming skills: I cannot yet apply the textbook code flexibly, and my data handling, such as normalization, post-import processing and feature handling, still leaves much to be desired.
Code
Visualization
import pandas as pd
from sklearn.datasets import load_iris
# Load seaborn; importing it produces warnings, so load warnings first and suppress them.
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="white", color_codes=True)
iris = pd.DataFrame(load_iris().data)
iris.columns = ['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']
iris['Species'] = load_iris().target
## First explore the overall shape of the dataset: how many samples, how many features per sample, etc.
#print (iris.shape)
#print (iris.head())
#
#
### Check whether the classification is balanced
### How many classes are there? How many samples per class?
#print (iris["Species"].value_counts())
#
#
#
## plot is the main plotting method; both Series and DataFrame have a plot method.
## plot draws a line chart by default; the kind parameter selects other chart types: line, bar (bar chart), barh,
## kde, density, scatter (scatter plot).
## For coordinate-like data, a scatter plot shows the distribution trend and any outliers.
#iris.plot(kind="scatter", x="SepalLengthCm", y="SepalWidthCm")
# Next, seaborn's FacetGrid is used to color the scatter points by Species (hue selects the coloring variable).
#sns.FacetGrid(iris, hue="Species", size=5).map(plt.scatter, "SepalLengthCm", "SepalWidthCm").add_legend()
# Use a box plot to look at the distribution of a single feature
# For a numerical variable, a box plot gives a direct view of its distribution for each species.
#sns.boxplot(x="Species", y="SepalLengthCm", data=iris)
# Overlay the individual points of each Species on the box plot as a strip plot;
# jitter=True spreads the points apart so they do not collapse onto a single line
#ax = sns.boxplot(x="Species", y="PetalWidthCm", data=iris)
#ax = sns.stripplot(x="Species", y="PetalWidthCm", data=iris, jitter=True, edgecolor="gray")
# violinplot: shows the density distribution; it combines and simplifies the two plots above
# the denser the data, the wider the violin; the sparser, the narrower
#sns.violinplot(x="Species", y="PetalLengthCm", data=iris, size=6)
# sns.kdeplot == kernel density plot (single variable)
#sns.FacetGrid(iris, hue="Species", size=6).map(sns.kdeplot, "PetalLengthCm").add_legend()
# pairplot: the relationship between every pair of variables
sns.pairplot(iris, hue="Species", size=3)
kNN
import csv
import random
import math
import operator
from sklearn import neighbors
def loadDataset(filename,split,trainingSet=[],testSet = []):
with open(filename,"rt") as csvfile:
lines = csv.reader(csvfile)
dataset = list(lines)
for x in range(len(dataset)-1):
for y in range(4):
dataset[x][y] = float(dataset[x][y])
if random.random()<split:
trainingSet.append(dataset[x])
else:
testSet.append(dataset[x])
def euclideanDistance(instance1,instance2,length):
distance = 0
for x in range(length):
distance += pow((instance1[x] - instance2[x]),2)
return math.sqrt(distance)
def getNeighbors(trainingSet,testInstance,k):
distances = []
length = len(testInstance) -1
for x in range(len(trainingSet)):
dist = euclideanDistance(testInstance, trainingSet[x], length)
distances.append((trainingSet[x],dist))
distances.sort(key=operator.itemgetter(1))
neighbors = []
for x in range(k):
neighbors.append(distances[x][0])
return neighbors
def getResponse(neighbors):
classVotes = {}
for x in range(len(neighbors)):
response = neighbors[x][-1]
if response in classVotes:
classVotes[response]+=1
else:
classVotes[response] = 1
sortedVotes = sorted(classVotes.items(),key = operator.itemgetter(1),reverse =True)
return sortedVotes[0][0]
def getAccuracy(testSet,predictions):
correct = 0
for x in range(len(testSet)):
if testSet[x][-1] == predictions[x]:
correct+=1
return (correct/float(len(testSet))) * 100.0
def main():
trainingSet = []
testSet = []
split = 0.67
loadDataset("c:\\Users\\29795\\Desktop\\iris\\iris.data",split,trainingSet,testSet)
print("Train set :" + repr(len(trainingSet)))
print ("Test set :" + repr(len(testSet)) )
predictions = []
k = 70   # k = 3 was used for the main experiment; large values such as 70 were tried to see when the accuracy starts to drop
for x in range(len(testSet)):
neighbors = getNeighbors(trainingSet, testSet[x], k)
result = getResponse(neighbors)
predictions.append(result)
print (">predicted = " + repr(result) + ",actual = " + repr(testSet[x][-1]))
accuracy = getAccuracy(testSet, predictions)
print ("Accuracy:" + repr(accuracy) + "%" )
if __name__ =="__main__":
main()
Decision Tree
from math import log
import matplotlib.pyplot as plt
import numpy as np
import operator
import re
import pandas as pd
import random
from sklearn.datasets import load_iris
def createDataSet():
dataSet = [[1, 1, 'yes'],
[1, 1, 'yes'],
[1, 0, 'no'],
[0, 1, 'no'],
[0, 1, 'no']]
labels = ['no surfacing','flippers']
#change to discrete values
return dataSet, labels
def calcShannonEnt(dataSet):
numEntries = len(dataSet)
labelCounts = {}
for featVec in dataSet: # count the occurrences of each class label
currentLabel = featVec[-1]
if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0
labelCounts[currentLabel] += 1
shannonEnt = 0.0
for key in labelCounts:
prob = float(labelCounts[key])/numEntries
shannonEnt -= prob * log(prob,2) #log base 2
return shannonEnt
def splitDataSet(dataSet, axis, value):
retDataSet = []
for featVec in dataSet:
if featVec[axis] == value:
reducedFeatVec = featVec[:axis] #chop out axis used for splitting
reducedFeatVec.extend(featVec[axis+1:])
retDataSet.append(reducedFeatVec)
return retDataSet
def chooseBestFeatureToSplit(dataSet):
numFeatures = len(dataSet[0]) - 1 #the last column is used for the labels
baseEntropy = calcShannonEnt(dataSet)
bestInfoGain = 0.0; bestFeature = -1
for i in range(numFeatures): #iterate over all the features
featList = [example[i] for example in dataSet]#create a list of all the examples of this feature
uniqueVals = set(featList) #get a set of unique values
newEntropy = 0.0
for value in uniqueVals:
subDataSet = splitDataSet(dataSet, i, value)
prob = len(subDataSet)/float(len(dataSet))
newEntropy += prob * calcShannonEnt(subDataSet)
infoGain = baseEntropy - newEntropy #calculate the info gain; ie reduction in entropy
if (infoGain > bestInfoGain): #compare this to the best gain so far
bestInfoGain = infoGain #if better than current best, set to best
bestFeature = i
return bestFeature #returns an integer
def majorityCnt(classList):
classCount={}
for vote in classList:
if vote not in classCount.keys(): classCount[vote] = 0
classCount[vote] += 1
sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
return sortedClassCount[0][0]
def createTree(dataSet,labels):
classList = [example[-1] for example in dataSet]
if classList.count(classList[0]) == len(classList):
return classList[0]#stop splitting when all of the classes are equal
if len(dataSet[0]) == 1: #stop splitting when there are no more features in dataSet
return majorityCnt(classList)
bestFeat = chooseBestFeatureToSplit(dataSet)
bestFeatLabel = labels[bestFeat]
myTree = {bestFeatLabel:{}}
del(labels[bestFeat])
featValues = [example[bestFeat] for example in dataSet]
uniqueVals = set(featValues)
for value in uniqueVals:
subLabels = labels[:] #copy all of labels, so trees don't mess up existing labels
myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
return myTree
def classify(inputTree,featLabels,testVec):
firstStr = list(inputTree.keys())[0]
secondDict = inputTree[firstStr]
featIndex = featLabels.index(firstStr)
key = testVec[featIndex]
valueOfFeat = secondDict[key]
if isinstance(valueOfFeat, dict):
classLabel = classify(valueOfFeat, featLabels, testVec)
else: classLabel = valueOfFeat
return classLabel
def storeTree(inputTree,filename):
    import pickle
    fw = open(filename,'wb')   # pickle requires binary mode
    pickle.dump(inputTree,fw)
    fw.close()
def grabTree(filename):
    import pickle
    fr = open(filename,'rb')   # binary mode to match storeTree
    return pickle.load(fr)
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")
def getNumLeafs(myTree):
numLeafs = 0
firstStr = list(myTree.keys())[0]
secondDict = myTree[firstStr]
for key in secondDict.keys():
if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictionaries, if not they are leaf nodes
numLeafs += getNumLeafs(secondDict[key])
else: numLeafs +=1
return numLeafs
def getTreeDepth(myTree):
maxDepth = 0
firstStr = list(myTree.keys())[0]
secondDict = myTree[firstStr]
for key in secondDict.keys():
if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictionaries, if not they are leaf nodes
thisDepth = 1 + getTreeDepth(secondDict[key])
else: thisDepth = 1
if thisDepth > maxDepth: maxDepth = thisDepth
return maxDepth
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
xytext=centerPt, textcoords='axes fraction',
va="center", ha="center", bbox=nodeType, arrowprops=arrow_args )
def plotMidText(cntrPt, parentPt, txtString):
xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]
yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)
def plotTree(myTree, parentPt, nodeTxt):#if the first key tells you what feat was split on
numLeafs = getNumLeafs(myTree) #this determines the x width of this tree
depth = getTreeDepth(myTree)
firstStr = list(myTree.keys())[0] #the text label for this node should be this
cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
plotMidText(cntrPt, parentPt, nodeTxt)
plotNode(firstStr, cntrPt, parentPt, decisionNode)
secondDict = myTree[firstStr]
plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD
for key in secondDict.keys():
if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictionaries, if not they are leaf nodes
plotTree(secondDict[key],cntrPt,str(key)) #recursion
else: #it's a leaf node print the leaf node
plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD
#if you do get a dictionary you know it's a tree, and the first element will be another dict
def createPlot(inTree):
fig = plt.figure(1, facecolor='white')
fig.clf()
axprops = dict(xticks=[], yticks=[])
createPlot.ax1 = plt.subplot(111, frameon=False, **axprops) #no ticks
#createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo purposes
plotTree.totalW = float(getNumLeafs(inTree))
plotTree.totalD = float(getTreeDepth(inTree))
plotTree.xOff = -0.5/plotTree.totalW; plotTree.yOff = 1.0;
plotTree(inTree, (0.5,1.0), '')
plt.show()
#def createPlot():
# fig = plt.figure(1, facecolor='white')
# fig.clf()
# createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo purposes
# plotNode('a decision node', (0.5, 0.1), (0.1, 0.5), decisionNode)
# plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
# plt.show()
def retrieveTree(i):
listOfTrees =[{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
{'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
]
return listOfTrees[i]
def autoNorm(dataSet):
    # Min-max normalize each column to [0, 1]; uses numpy (np) instead of the
    # undefined names kNN.array, zeros, shape and tile.
    dataSet = np.array(dataSet)
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))  # element-wise divide
    return normDataSet
def dataDiscretize(dataSet):
    m, n = dataSet.shape                       # number of samples and of columns
    disMat = np.tile([0], dataSet.shape)       # initialize the discretized dataset
    for i in range(n - 1):                     # the last column is the class, so only the feature columns
        x = [l[i] for l in dataSet]            # extract the i-th feature vector
        y = pd.cut(x, 10, labels=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9])  # discretize into 10 bins (adjust as needed)
        for k in range(m):                     # write the discretized values back, one row at a time
            disMat[k][i] = y[k]
    return disMat
def deal(dataset, col):
    # Discretize one feature column into four equal-width bins over [min, max];
    # bin labels are "s", "m", "lger", "l" together with the interval bounds.
    # (Replaces the four nearly identical deal1-deal4 functions.)
    b = dataset[col].max()
    a = dataset[col].min()
    t = round((b - a) / 4, 2)
    def f(x):
        if a <= x <= a + t:
            return "s" + "(" + str(round(a, 2)) + "," + str(round(a + t, 2)) + ")"
        elif x <= a + 2 * t:
            return "m" + "(" + str(round(a + t, 2)) + "," + str(round(a + 2 * t, 2)) + ")"
        elif x <= a + 3 * t:
            return "lger" + "(" + str(round(a + 2 * t, 2)) + "," + str(round(a + 3 * t, 2)) + ")"
        else:
            return "l" + "(" + str(round(a + 3 * t, 2)) + "," + str(round(b, 2)) + ")"
    dataset[col] = dataset[col].apply(f)
    return dataset
data_set = load_iris()
feature_names = data_set["feature_names"]
data = data_set["data"]
labels = data_set["target"]
df = pd.DataFrame(data,columns=feature_names)
df["label"] = labels
for col in feature_names:
    df = deal(df, col)    # apply the four-bin discretization to every feature column
#print(df)
X = df.iloc[:,[0,1,2,3,4]].values.tolist()
#print(X)
y = df.iloc[:,4].values.T.tolist()
#print(y)
lensesLabels=['sepallength','sepalwidth','petallength','petalwidth']
#lensesTree=createTree(X,lensesLabels)
#print(lensesTree)
#createPlot(lensesTree)
num=0
k=0
trainlenses=X[31:]
test=X[:30]
#print(test)
lensesTrees=createTree(test,lensesLabels)
createPlot(lensesTrees)
#test=X[:30]
#lensesTrees=createTree(trainlenses,lensesLabels)
#lensesLabels=['sepallength','sepalwidth','petallength','petalwidth']
#for i in range(30):
# if X[i][-1]==classify(lensesTrees,lensesLabels,test[i]):
# num=num+1
#print(num)
#lensesTree=createTree(lenses,lensesLabels)
#testVect=lenses[0]
Naive Bayes
import pandas as pd
import numpy as np
data='''5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica'''
data = data.replace(' ','').replace("Iris-setosa","1.0").replace("Iris-versicolor","2.0").replace("Iris-virginica","3.0").split('\n')
data = list(filter(lambda x: len(x) > 0,data))
data = [x.split(',') for x in data]
data = np.array(data).astype(np.float16)
def splitData(trainPrecent=0.7):
train = []
test = []
for i in data:
(train if np.random.random() < trainPrecent else test).append(i)
return np.array(train),np.array(test)
trainData,testData = splitData()
print("共有%d条数据,分解为%d条训练集与%d条测试集"%(len(data),len(trainData),len(testData)))
clf=set(trainData[:,-1])
trainClfData={}
for x in clf:
clfItems=np.array(list(filter(lambda i:i[-1]==x ,trainData)))[:,:-1]
mean=clfItems.mean(axis=0)
stdev= np.sqrt(np.sum((clfItems-mean)**2,axis=0)/float(len(clfItems)-1))
trainClfData[x]=np.array([mean,stdev]).T
result=[]
for testItem in testData:
itemData=testItem[0:-1]
itemClf=testItem[-1]
prediction={}
for clfItem in trainClfData:
probabilities= np.exp(-1*(testItem[0:-1]-trainClfData[clfItem][:,0])**2/(trainClfData[clfItem][:,1]**2*2)) / (np.sqrt(2*np.pi)*trainClfData[clfItem][:,1])
clfPrediction=1
for proItem in probabilities:
clfPrediction*=proItem
prediction[clfItem]=clfPrediction
maxProbability = None
for x in prediction:
if maxProbability is None or prediction[x] > prediction[maxProbability]:
maxProbability = x
result.append({'data': itemData.tolist()
, 'actual class': itemClf
, 'class probabilities': prediction
, 'predicted class (max probability)': maxProbability
, 'correct': 1 if itemClf == maxProbability else 0})
rightCount = 0
for x in result:
rightCount += x['correct']
print('%d test samples, %d predicted correctly, accuracy %.2f' % (len(result), rightCount, rightCount / len(result)))
References
[1] Peter Harrington, Machine Learning in Action (《机器学习实战》).