Exercise 8: Anomaly Detection and Recommender Systems
Contents:
1. Included files
2. Anomaly detection
3. Recommender systems
1. Included files
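File | Description |
ex8.py | Anomaly detection experiment |
ex8_cofi.py | Recommender system experiment |
ex8data1.mat | Anomaly detection dataset 1 |
ex8data2.mat | Anomaly detection dataset 2 |
ex8_movies.mat | Movie ratings dataset |
ex8_movieParams.mat | Precomputed parameters (for debugging) |
multivariateGaussian.py | Multivariate Gaussian density |
visualizeFit.py | Data visualization |
checkCostFunction.py | Gradient check for collaborative filtering |
computeNumericalGradient.py | Numerical gradient approximation |
loadMovieList.py | Loads the movie list |
movie_ids.txt | List of movie titles |
normalizeRatings.py | Mean normalization for collaborative filtering |
estimateGaussian.py | Gaussian parameter estimation |
selectThreshold.py | Threshold selection for anomaly detection |
cofiCostFunc.py | Collaborative filtering cost function |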
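Note: the files marked in red (estimateGaussian.py, selectThreshold.py, and cofiCostFunc.py) are the ones you need to complete yourself.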
2. Anomaly detection
- Import the required packages and initialize:
import matplotlib.pyplot as plt
import numpy as np
import scipy.io as scio
import estimateGaussian as eg
import multivariateGaussian as mvg
import visualizeFit as vf
import selectThreshold as st
plt.ion()
# np.set_printoptions(formatter={'float': '{: 0.6f}'.format})
2.1 Data visualization
# ===================== Part 1: Load Example Dataset =====================
# We start this exercise by using a small dataset that is easy to visualize.
#
# Our example case consists of two network server statistics across
# several machines: the latency and throughput of each machine.
# This exercise will help us find possibly faulty (or very fast) machines
#
print('Visualizing example dataset for outlier detection.')
# The following command loads the dataset. You should now have the
# variables X, Xval, yval in your environment.
data = scio.loadmat('ex8data1.mat')
X = data['X']
Xval = data['Xval']
yval = data['yval'].flatten()
# Visualize the example dataset
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c='b', marker='x', s=15, linewidth=1)
plt.axis([0, 30, 0, 30])
plt.xlabel('Latency (ms)')
plt.ylabel('Throughput (mb/s)')
input('Program paused. Press ENTER to continue')
- Visualization result
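2.2 Estimating the probability distribution
- To perform anomaly detection, first fit a model to the distribution of the data. The Gaussian distribution is:
$$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
- To estimate the mean of feature $i$ over the $m$ training examples, use:
$$\mu_i = \frac{1}{m} \sum_{j=1}^{m} x_i^{(j)}$$
- For the variance:
$$\sigma_i^2 = \frac{1}{m} \sum_{j=1}^{m} \left(x_i^{(j)} - \mu_i\right)^2$$
- Write the parameter estimation program estimateGaussian.py: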
import numpy as np


def estimate_gaussian(X):
    # Useful variables
    m, n = X.shape

    # You should return these values correctly
    mu = np.zeros(n)
    sigma2 = np.zeros(n)

    # ===================== Your Code Here =====================
    # Instructions: Compute the mean of the data and the variances
    #               In particular, mu[i] should contain the mean of
    #               the data for the i-th feature and sigma2[i]
    #               should contain variance of the i-th feature
    #
    # Per-feature mean, shape (n,), so that mu[i] is the mean of feature i
    mu = X.sum(axis=0) / m
    # Per-feature variance with the 1/m normalization, shape (n,)
    sigma2 = ((X - mu) ** 2).sum(axis=0) / m
    # ==========================================================

    return mu, sigma2
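As a quick sanity check, the estimates above should agree with NumPy's built-in `mean` and `var`, since `np.var` uses the same 1/m normalization by default (`ddof=0`). The sketch below is self-contained: `estimate_gaussian_sketch` is a hypothetical stand-in for the exercise function, and the data is synthetic, chosen only for illustration.

```python
import numpy as np


def estimate_gaussian_sketch(X):
    """Per-feature mean and biased (1/m) variance of the rows of X."""
    m = X.shape[0]
    mu = X.sum(axis=0) / m
    sigma2 = ((X - mu) ** 2).sum(axis=0) / m
    return mu, sigma2


# Synthetic two-feature data (an assumption, not the exercise dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[14.0, 15.0], scale=[1.5, 1.2], size=(500, 2))

mu_demo, sigma2_demo = estimate_gaussian_sketch(X_demo)

# Both estimates are 1-D arrays with one entry per feature
assert mu_demo.shape == (2,) and sigma2_demo.shape == (2,)

# np.var defaults to ddof=0, i.e. the same 1/m normalization
assert np.allclose(mu_demo, X_demo.mean(axis=0))
assert np.allclose(sigma2_demo, X_demo.var(axis=0))
```

Note that passing `ddof=1` to `np.var` would give the unbiased 1/(m-1) estimate instead, which differs slightly from the formula used in this exercise.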
- Estimate the probability distribution of the training set:
# ===================== Part 2: Estimate the dataset statistics =====================
# For this exercise, we assume a Gaussian distribution for the dataset.
#
# We first estimate the parameters of our assumed Gaussian distribution,
# then compute the probabilities for each of the points and then visualize
# both the overall distribution and where each of the points falls in
# terms of that distribution
#
print('Visualizing Gaussian fit.')
# Estimate mu and sigma2
mu, sigma2 = eg.estimate_gaussian(X)
# Returns the density of the multivariate normal at each data point(row) of X
p = mvg.multivariate_gaussian(X, mu, sigma2)
# Visualize the fit
vf.visualize_fit(X, mu, sigma2)
plt.xlabel('Latency (ms)')
plt.ylabel('Throughput (mb/s)')
input('Program paused. Press ENTER to continue')
- The program that computes the probability density, multivariateGaussian.py:
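import numpy as np


def multivariate_gaussian(X, mu, sigma2):
    # Number of features
    k = mu.size

    # If the model is built from independent univariate Gaussians, sigma2 is a
    # vector of per-feature variances: convert it to a diagonal matrix and use it
    # as the covariance matrix in the multivariate Gaussian formula.
    # In that case the univariate and multivariate models are equivalent.
    # If the model is a full multivariate Gaussian, sigma2 is already the
    # covariance matrix and is used directly.
    if sigma2.ndim == 1 or (sigma2.ndim == 2 and (sigma2.shape[1] == 1 or sigma2.shape[0] == 1)):
        # ravel() first: np.diag on a 2-D row or column vector would extract
        # its diagonal instead of building a diagonal matrix
        sigma2 = np.diag(sigma2.ravel())

    x = X - mu
    p = (2 * np.pi) ** (-k / 2) * np.linalg.det(sigma2) ** (-0.5) * np.exp(-0.5 * np.sum(np.dot(x, np.linalg.pinv(sigma2)) * x, axis=1))

    return p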
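The equivalence mentioned in the comments can be checked numerically: with a diagonal covariance matrix, the multivariate density should equal the product of the per-feature univariate densities. The following is a minimal sketch under that assumption. The names `gaussian_density_sketch` and `univariate_density` are hypothetical helpers, and the input values are made up for illustration.

```python
import numpy as np


def gaussian_density_sketch(X, mu, Sigma):
    """Multivariate normal density at each row of X, for a full covariance matrix Sigma."""
    k = mu.size
    d = X - mu
    # Quadratic form d^T Sigma^{-1} d, computed row by row
    quad = np.sum((d @ np.linalg.inv(Sigma)) * d, axis=1)
    norm = (2 * np.pi) ** (-k / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * quad)


def univariate_density(x, mu, sigma2):
    """Univariate normal density, applied element-wise per feature."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)


# Made-up inputs: three points, two features
X_demo = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
mu_demo = np.array([1.0, 0.0])
sigma2_demo = np.array([2.0, 0.5])

# Multivariate density with a diagonal covariance matrix
p_multi = gaussian_density_sketch(X_demo, mu_demo, np.diag(sigma2_demo))

# Product over features of the independent univariate densities
p_product = np.prod(univariate_density(X_demo, mu_demo, sigma2_demo), axis=1)

assert np.allclose(p_multi, p_product)
```

The two results stop matching once the covariance matrix has non-zero off-diagonal entries, which is exactly the case where the full multivariate model captures correlations that the per-feature model cannot.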
- The data visualization program, visualizeFit.py:
import matplotlib.pyplot as plt
import numpy as np
import multivariateGaussian as mvg
def visualize_fit(X, mu