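A Gaussian mixture model (GMM) is a widely used probabilistic model that represents a distribution as a weighted sum of several Gaussian components. The expectation-maximization (EM) algorithm is an iterative procedure for estimating the parameters of probabilistic models that contain latent variables. The steps below fit a GMM to the height data of a single class and use EM to estimate its parameters: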
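### Step 1: Prepare the data
First, read the data from a CSV file. We assume the file has two columns: height and gender. Only the height column is used to fit the model.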
```python
import pandas as pd
import numpy as np
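# Read the CSV file
data = pd.read_csv('height_data.csv')
# Convert to float so that later in-place parameter updates are not truncated to integers
heights = data['height'].to_numpy(dtype=float)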
```
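If no such file is at hand, the remaining steps can still be tried on synthetic data. The sketch below is not part of the original walkthrough: the helper names `make_synthetic_heights` and `write_height_csv`, the `gender` column name, the `F`/`M` labels, and the group means and standard deviations are all illustrative assumptions, and the generated numbers are not real measurements.

```python
import csv
import numpy as np

def make_synthetic_heights(n_per_group=50, seed=0):
    """Return (heights, genders) drawn from two assumed Gaussians (hypothetical values)."""
    rng = np.random.default_rng(seed)
    group_a = rng.normal(loc=165.0, scale=5.0, size=n_per_group)  # assumed mean/std in cm
    group_b = rng.normal(loc=178.0, scale=6.0, size=n_per_group)  # assumed mean/std in cm
    heights = np.concatenate([group_a, group_b])
    genders = np.array(["F"] * n_per_group + ["M"] * n_per_group)
    return heights, genders

def write_height_csv(path, heights, genders):
    """Write a two-column CSV with a header row: height, gender."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["height", "gender"])
        for h, g in zip(heights, genders):
            writer.writerow([f"{h:.1f}", g])

synthetic_heights, synthetic_genders = make_synthetic_heights()
# Uncomment to create the file that Step 1 reads:
# write_height_csv("height_data.csv", synthetic_heights, synthetic_genders)
```

Fixing the seed keeps the generated sample reproducible, which makes it easier to compare runs of the EM code.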
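### Step 2: Initialize the GMM parameters
Next, initialize the GMM parameters: the mean, the covariance (a scalar variance here, because the data is one-dimensional), and the mixing coefficient of each Gaussian component.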
```python
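# Initialize the parameters
num_components = 2  # assume two Gaussian components
# Draw the initial means from the data without replacement; cast to float so updates are not truncated
means = np.random.choice(heights, num_components, replace=False).astype(float)
# Start every component from the overall variance of the data
covariances = np.full(num_components, np.var(heights), dtype=float)
# Start with equal mixing coefficients
mixing_coeffs = np.ones(num_components) / num_components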
```
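### Step 3: Define the EM algorithm
Each EM iteration has two steps: the E-step (expectation), which computes the posterior probability of each component for every data point, and the M-step (maximization), which re-estimates the parameters from those probabilities.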
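For reference, the standard EM updates for a one-dimensional mixture of $K$ Gaussians can be written as follows. The notation is chosen here for illustration: $x_n$ is the $n$-th of $N$ heights, and $\mu_k$, $\sigma_k^2$, $\pi_k$, and $\gamma_{nk}$ correspond to `means`, `covariances`, `mixing_coeffs`, and `responsibilities` in the code.

$$
p(x_n) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \sigma_k^2),
\qquad
\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$

$$
\text{E-step:}\quad
\gamma_{nk} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \sigma_j^2)}
$$

$$
\text{M-step:}\quad
N_k = \sum_{n=1}^{N} \gamma_{nk},\qquad
\mu_k = \frac{1}{N_k}\sum_{n=1}^{N} \gamma_{nk}\, x_n,\qquad
\sigma_k^2 = \frac{1}{N_k}\sum_{n=1}^{N} \gamma_{nk}\,(x_n-\mu_k)^2,\qquad
\pi_k = \frac{N_k}{N}
$$

The quantity printed as the loss in the code is the negative log-likelihood, $-\sum_{n=1}^{N} \log p(x_n)$.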
```python
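# Define the EM algorithm for a one-dimensional GMM
def em_algorithm(data, means, covariances, mixing_coeffs, num_iterations=100):
    # Work on float copies so integer inputs are not truncated and the caller's arrays are not mutated
    data = np.asarray(data, dtype=float)
    means = np.array(means, dtype=float)
    covariances = np.array(covariances, dtype=float)
    mixing_coeffs = np.array(mixing_coeffs, dtype=float)
    num_components = len(means)
    for iteration in range(num_iterations):
        # E-step: posterior probability (responsibility) of each component for each data point
        responsibilities = np.zeros((len(data), num_components))
        for i in range(num_components):
            responsibilities[:, i] = mixing_coeffs[i] * np.exp(-(data - means[i])**2 / (2 * covariances[i])) / np.sqrt(2 * np.pi * covariances[i])
        # Sum of the weighted densities over all components, per data point
        responsibilities_sum = np.sum(responsibilities, axis=1)
        # Normalize so that each row sums to 1
        responsibilities /= responsibilities_sum[:, np.newaxis]
        # M-step: update the parameters
        for i in range(num_components):
            # Update the mean
            means[i] = np.sum(responsibilities[:, i] * data) / np.sum(responsibilities[:, i])
            # Update the variance, using the freshly updated mean
            covariances[i] = np.sum(responsibilities[:, i] * (data - means[i])**2) / np.sum(responsibilities[:, i])
            # Update the mixing coefficient
            mixing_coeffs[i] = np.sum(responsibilities[:, i]) / len(data)
        # Report the loss (negative log-likelihood) under the updated parameters
        loss = -np.sum(np.log(np.sum(mixing_coeffs * np.exp(-(data[:, np.newaxis] - means) ** 2 / (2 * covariances)) / np.sqrt(2 * np.pi * covariances), axis=1)))
        print(f'Iteration {iteration + 1}: Loss = {loss}')
    return means, covariances, mixing_coeffs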
```
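Evaluating the Gaussian densities directly works for typical height data, but `np.exp` can underflow to zero when a point lies far from every component, which turns the normalization into a division by zero. A common remedy is to compute the E-step in the log domain with the log-sum-exp trick. The following is a minimal sketch under that assumption, not a replacement prescribed by the walkthrough; the function name `e_step_log_domain` is hypothetical.

```python
import numpy as np

def e_step_log_domain(data, means, variances, mixing_coeffs):
    """Return (responsibilities, log_likelihood) computed without leaving the log domain."""
    data = np.asarray(data, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    mixing_coeffs = np.asarray(mixing_coeffs, dtype=float)
    # log(pi_k * N(x_n | mu_k, var_k)), shape (n_samples, n_components)
    log_weighted = (
        np.log(mixing_coeffs)
        - 0.5 * np.log(2 * np.pi * variances)
        - (data[:, np.newaxis] - means) ** 2 / (2 * variances)
    )
    # Log-sum-exp: subtract the row maximum before exponentiating
    row_max = log_weighted.max(axis=1, keepdims=True)
    log_norm = row_max + np.log(np.exp(log_weighted - row_max).sum(axis=1, keepdims=True))
    responsibilities = np.exp(log_weighted - log_norm)
    return responsibilities, log_norm.sum()

# The last value is deliberately extreme to show that the result stays finite.
demo_data = np.array([150.0, 160.0, 170.0, 180.0, 1000.0])
demo_resp, demo_ll = e_step_log_domain(demo_data, [160.0, 175.0], [25.0, 36.0], [0.5, 0.5])
```

The returned log-likelihood is the negative of the loss printed by `em_algorithm`, so it can be reused for monitoring without a second pass over the data.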
### Step 4: Train the model
Call the EM function to fit the model.
```python
# Train the model
means, covariances, mixing_coeffs = em_algorithm(heights, means, covariances, mixing_coeffs, num_iterations=100)
```
### Step 5: Inspect the results
Print the fitted parameters.
```python
print('Means:', means)
print('Covariances:', covariances)
print('Mixing Coefficients:', mixing_coeffs)
```
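Beyond printing the parameters, it can help to see which component each height is most likely to belong to. The sketch below assigns every data point to its most probable component; the function name `assign_components` and the example parameter values are hypothetical and serve only to make the snippet runnable on its own.

```python
import numpy as np

def assign_components(data, means, variances, mixing_coeffs):
    """Return the index of the most probable component for each data point."""
    data = np.asarray(data, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    mixing_coeffs = np.asarray(mixing_coeffs, dtype=float)
    # Weighted density of every component at every point, shape (n_samples, n_components)
    weighted = (
        mixing_coeffs
        * np.exp(-(data[:, np.newaxis] - means) ** 2 / (2 * variances))
        / np.sqrt(2 * np.pi * variances)
    )
    # The normalizing sum is the same for all components, so argmax needs no division
    return np.argmax(weighted, axis=1)

# Hypothetical fitted parameters and heights (illustrative values, not real results)
example_means = np.array([165.0, 178.0])
example_variances = np.array([25.0, 36.0])
example_mixing = np.array([0.5, 0.5])
example_heights = np.array([158.0, 166.0, 175.0, 183.0])

labels = assign_components(example_heights, example_means, example_variances, example_mixing)
stds = np.sqrt(example_variances)  # standard deviations in cm are easier to read than variances
```

If the gender column is available, the labels can be compared against it, for instance with `pd.crosstab(labels, data['gender'])`, assuming the column is named `gender`. Component indices are arbitrary, so component 0 does not necessarily correspond to any particular group.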
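With these steps, a Gaussian mixture model can be fitted to a class's height data, with its parameters estimated by the expectation-maximization algorithm.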