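Objectives
Understand the basic principles and methods of naive Bayes, understand maximum likelihood estimation, understand the concepts of prior and posterior probability distributions, and learn how to train a naive Bayes classifier.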
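For reference, the concepts listed above can be written out as follows. This is a sketch of the standard Gaussian naive Bayes formulation; the notation (c for a class, j for a feature index, N_c for the number of training samples in class c) is an assumption introduced here and does not appear in the original write-up.

```latex
% Decision rule: choose the class with the largest (unnormalized) posterior
\hat{y} = \arg\max_{c} \; P(Y=c) \prod_{j=1}^{4} p(x_j \mid Y=c)

% Class-conditional density of feature j, modeled as a Gaussian
p(x_j \mid Y=c) = \frac{1}{\sqrt{2\pi}\,\sigma_{cj}}
    \exp\!\left( -\frac{(x_j - \mu_{cj})^2}{2\sigma_{cj}^2} \right)

% Maximum likelihood estimates from N training samples, N_c of them in class c
\hat{P}(Y=c) = \frac{N_c}{N}, \qquad
\hat{\mu}_{cj} = \frac{1}{N_c} \sum_{i:\, y_i = c} x_{ij}, \qquad
\hat{\sigma}_{cj}^2 = \frac{1}{N_c} \sum_{i:\, y_i = c} (x_{ij} - \hat{\mu}_{cj})^2
```

The posterior P(Y=c | x) is the product in the decision rule divided by a normalizing constant that does not depend on c, so it can be ignored when only the most probable class is needed.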
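Requirements
Given a dataset, implement the naive Bayes classification algorithm, compute the prior probabilities, the conditional probabilities, and the estimated mean and variance of each Gaussian distribution, and report the model's accuracy on the test set.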
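The estimation step required above can be sketched compactly on a toy array. This is a minimal illustration, not the code used in the experiment: the function name `fit_gaussian_nb` and the toy data are made up for this example.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Return (classes, priors, means, stds) estimated by maximum likelihood."""
    classes = np.unique(y)
    # Prior of each class: fraction of training samples with that label
    priors = np.array([np.mean(y == c) for c in classes])
    # Per-class, per-feature mean
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # ddof=0 gives the maximum-likelihood (biased) standard deviation
    stds = np.array([X[y == c].std(axis=0, ddof=0) for c in classes])
    return classes, priors, means, stds

# Toy data (made up for illustration): 5 samples, 2 features, 2 classes
X_toy = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 10.0], [9.0, 6.0]])
y_toy = np.array(["a", "a", "b", "b", "b"])

classes, priors, means, stds = fit_gaussian_nb(X_toy, y_toy)
print(priors)  # class "a" has 2 of 5 samples, class "b" has 3 of 5
print(means)
print(stds)
```

Using `ddof=0` matches the maximum likelihood estimate of the variance; `ddof=1` would give the unbiased sample variance instead.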
Environment
Python, NumPy, SciPy
Code
import numpy as np
from scipy.stats import norm
# Load the training data (feature columns 1-4, label column 5)
train_dataset_data = np.genfromtxt("experiment_07_training_set.csv", delimiter=",", skip_header=1, usecols=(1, 2, 3, 4))
rowOfTrainDataset = train_dataset_data.shape[0]
train_dataset_label = np.genfromtxt("experiment_07_training_set.csv", delimiter=",", skip_header=1, usecols=(5,), dtype="str")
# Load the test data (same column layout)
test_dataset_data = np.genfromtxt("experiment_07_testing_set.csv", delimiter=",", skip_header=1, usecols=(1, 2, 3, 4))
rowOfTestDataset = test_dataset_data.shape[0]
test_dataset_label = np.genfromtxt("experiment_07_testing_set.csv", delimiter=",", skip_header=1, usecols=(5,), dtype="str")
# Define the list of classes and the list of features
species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
xs = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]
# Estimate the prior probabilities
prior = np.zeros(3)
for i in range(3):
    prior[i] = np.sum(train_dataset_label == species[i]) / rowOfTrainDataset
print("Prior probabilities:")
for i in range(3):
    print(f"{species[i]} -> {prior[i]}")
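# Estimate the class-conditional Gaussian parameters
# condition[i, j] holds [mean, standard deviation] of feature j in class i
condition = np.zeros((3, 4, 2))
# Loop over the 3 classes
for i in range(3):
    # Loop over the 4 features of each class
    for j in range(4):
        temp = train_dataset_data[train_dataset_label == species[i], j]
        # Store the mean
        condition[i, j, 0] = np.mean(temp)
        # Store the standard deviation (maximum likelihood estimate, ddof=0)
        condition[i, j, 1] = np.std(temp)
print("Gaussian parameter estimates:")
for i in range(3):
    print(f"P(X|Y={species[i]})->", end='')
    for j in range(4):
        print(f"|X{j + 1}={xs[j]}: mean->{condition[i, j, 0]:.4f} std->{condition[i, j, 1]:.4f}|", end=' ')
    print("")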
# Predict the class of each test sample
# Integer class indices (a string array here would break the int() conversion if a slot were never assigned)
pred = np.zeros(rowOfTestDataset, dtype=int)
for i in range(rowOfTestDataset):
    # Largest unnormalized posterior seen so far
    probability1 = 0
    # Evaluate all 3 classes and keep the one with the largest posterior
    for j in range(3):
        p0 = norm.pdf(test_dataset_data[i, 0], loc=condition[j, 0, 0], scale=condition[j, 0, 1])
        p1 = norm.pdf(test_dataset_data[i, 1], loc=condition[j, 1, 0], scale=condition[j, 1, 1])
        p2 = norm.pdf(test_dataset_data[i, 2], loc=condition[j, 2, 0], scale=condition[j, 2, 1])
        p3 = norm.pdf(test_dataset_data[i, 3], loc=condition[j, 3, 0], scale=condition[j, 3, 1])
        probability2 = prior[j] * p0 * p1 * p2 * p3
        if probability2 > probability1:
            pred[i] = j
            probability1 = probability2
# Convert class indices to species names
pred_species = np.array([species[p] for p in pred])
# Compute the accuracy
accuracy = np.sum(pred_species == test_dataset_label) / rowOfTestDataset
print(f"Model accuracy: {accuracy * 100:.2f}%")
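The prediction loop above multiplies four densities and a prior directly. With only four features this is usually fine, but products of small densities can underflow to zero for samples far from every class mean. A common alternative is to sum log-probabilities instead. The sketch below assumes the same `prior` and `condition` array layout as the code above; the function name `predict_log_space` and the demo parameters are hypothetical and are not part of the original experiment.

```python
import numpy as np
from scipy.stats import norm

def predict_log_space(X, prior, condition):
    """Predict class indices by comparing log-posteriors.

    X:         (n_samples, n_features) feature matrix
    prior:     (n_classes,) prior probabilities
    condition: (n_classes, n_features, 2) array of [mean, std]
    """
    # Broadcast samples against classes: shape (n_samples, n_classes, n_features)
    log_like = norm.logpdf(
        X[:, None, :],
        loc=condition[None, :, :, 0],
        scale=condition[None, :, :, 1],
    )
    # log P(Y=c) + sum_j log p(x_j | Y=c), shape (n_samples, n_classes)
    log_post = np.log(prior)[None, :] + log_like.sum(axis=2)
    return np.argmax(log_post, axis=1)

# Hypothetical parameters for illustration: 2 classes, 2 features
prior_demo = np.array([0.5, 0.5])
condition_demo = np.array([
    [[0.0, 1.0], [0.0, 1.0]],  # class 0: both features have mean 0, std 1
    [[5.0, 1.0], [5.0, 1.0]],  # class 1: both features have mean 5, std 1
])
# The last sample is so far from both means that the plain product underflows to 0
X_demo = np.array([[0.2, -0.1], [4.8, 5.3], [100.0, 100.0]])

pred_demo = predict_log_space(X_demo, prior_demo, condition_demo)
print(pred_demo)  # [0 1 1]
```

For the third sample the direct product evaluates to 0 for both classes, so the comparison cannot separate them, while the log-space version still picks the nearer class.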
Results
Prior probabilities:
| Class | Prior probability |
| --- | --- |
| P(Y=setosa) | 0.4 |
| P(Y=versicolor) | 0.4 |
| P(Y=virginica) | 0.2 |
Gaussian parameter estimates:
| Class | X1=SepalLength mean | X1=SepalLength std | X2=SepalWidth mean | X2=SepalWidth std | X3=PetalLength mean | X3=PetalLength std | X4=PetalWidth mean | X4=PetalWidth std |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| P(X\|Y=setosa) | 5.0375 | 0.3576 | 3.4400 | 0.3597 | 1.4625 | 0.1698 | 0.2325 | 0.0985 |
| P(X\|Y=versicolor) | 6.0150 | 0.5126 | 2.7875 | 0.3257 | 4.3200 | 0.4440 | 1.3500 | 0.2049 |
| P(X\|Y=virginica) | 6.5600 | 0.7130 | 2.9200 | 0.3763 | 5.6550 | 0.6241 | 2.0450 | 0.2673 |
Model accuracy:
92.00%
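A single accuracy figure does not show which classes are being confused. One way to break it down is a confusion matrix, sketched below in plain NumPy. The helper name `confusion_matrix` and the demo labels are made up for illustration; they are not the predictions of this experiment.

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, classes):
    """Count samples per (true class, predicted class) pair.

    Rows are true classes, columns are predicted classes.
    """
    index = {c: k for k, c in enumerate(classes)}
    matrix = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        matrix[index[t], index[p]] += 1
    return matrix

species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
# Made-up labels for illustration only; not the results reported above
true_demo = np.array(["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-virginica"])
pred_demo = np.array(["Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica"])

cm = confusion_matrix(true_demo, pred_demo, species)
# Accuracy is the fraction of samples on the diagonal
accuracy_demo = np.trace(cm) / cm.sum()
print(cm)
print(accuracy_demo)  # 0.75
```

Applied to `test_dataset_label` and `pred_species` from the code section, the off-diagonal entries would show which species pairs account for the misclassified test samples.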