Andrew Ng CS229 | Programming Assignment, Week 8 (Python)

Latest recommended article published 2024-06-14 09:41:17

Original

License: CC 4.0 BY-SA

Exercise 8: Anomaly Detection and Recommender Systems

1. Files included

2. Anomaly detection

3. Recommender systems

1. Files included

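File name	Description
ex8.py	Anomaly detection experiment
ex8_cofi.py	Recommender system experiment
ex8data1.mat	Anomaly detection dataset 1
ex8data2.mat	Anomaly detection dataset 2
ex8_movies.mat	Movie ratings dataset
ex8_movieParams.mat	Parameter optimization
multivariateGaussian.py	Multivariate Gaussian distribution
visualizeFit.py	Data visualization
checkCostFunction.py	Gradient checking for collaborative filtering
computeNumericalGradient.py	Numerical gradient approximation
loadMovieList.py	Loads the movie list
movie_ids.txt	List of movie names
normalizeRatings.py	Mean normalization for collaborative filtering
estimateGaussian.py	Gaussian parameter estimation
selectThreshold.py	Threshold selection for anomaly detection
cofiCostFunc.py	Collaborative filtering cost function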

Note: the files marked in red must be completed by you.

2. Anomaly detection

Import the required packages and initialize:

import matplotlib.pyplot as plt
import numpy as np
import scipy.io as scio

import estimateGaussian as eg
import multivariateGaussian as mvg
import visualizeFit as vf
import selectThreshold as st

plt.ion()
# np.set_printoptions(formatter={'float': '{: 0.6f}'.format})

2.1 Data visualization

# ===================== Part 1: Load Example Dataset =====================
# We start this exercise by using a small dataset that is easy to visualize.
#
# Our example case consists of two network server statistics across
# several machines: the latency and throughput of each machine.
# This exercise will help us find possibly faulty (or very fast) machines
#

print('Visualizing example dataset for outlier detection.')

#  The following command loads the dataset. You should now have the
#  variables X, Xval, yval in your environment.
data = scio.loadmat('ex8data1.mat')
X = data['X']
Xval = data['Xval']
yval = data['yval'].flatten()

# Visualize the example dataset
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c='b', marker='x', s=15, linewidth=1)
plt.axis([0, 30, 0, 30])
plt.xlabel('Latency (ms)')
plt.ylabel('Throughput (mb/s)')

input('Program paused. Press ENTER to continue')

Visualization result

2.2 Estimating the probability distribution

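To perform anomaly detection, we first need to fit a model to the distribution of the data. The Gaussian distribution is:

$$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

To estimate the mean of feature $i$ over the $m$ training examples, use:

$$\mu_i = \frac{1}{m} \sum_{j=1}^{m} x_i^{(j)}$$

For the variance:

$$\sigma_i^2 = \frac{1}{m} \sum_{j=1}^{m} \left(x_i^{(j)} - \mu_i\right)^2$$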

Write the parameter estimation program estimateGaussian.py

import numpy as np


def estimate_gaussian(X):
    # Useful variables
    m, n = X.shape

    # You should return these values correctly
    mu = np.zeros(n)
    sigma2 = np.zeros(n)

    # ===================== Your Code Here =====================
    # Instructions: Compute the mean of the data and the variances
    #               In particular, mu[i] should contain the mean of
    #               the data for the i-th feature and sigma2[i]
    #               should contain variance of the i-th feature
    #
    mu = X.mean(axis=0)                    # per-feature mean, shape (n,)

    sigma2 = ((X - mu) ** 2).mean(axis=0)  # per-feature variance using 1/m, shape (n,)

    # ==========================================================

    return mu, sigma2
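
As a quick sanity check, the estimates should agree with NumPy's built-in `mean` and `var` (whose default, `ddof=0`, also divides by m). The sketch below uses a small made-up array rather than ex8data1.mat, and `estimate_gaussian_sketch` is a stand-in name used only for this illustration.

```python
import numpy as np


def estimate_gaussian_sketch(X):
    """Per-feature mean and (1/m) variance, mirroring the function above."""
    m = X.shape[0]
    mu = X.sum(axis=0) / m
    sigma2 = ((X - mu) ** 2).sum(axis=0) / m
    return mu, sigma2


# Made-up toy data: 4 examples, 2 features (not taken from ex8data1.mat)
X_toy = np.array([[1.0, 10.0],
                  [2.0, 12.0],
                  [3.0, 14.0],
                  [4.0, 16.0]])

mu_toy, sigma2_toy = estimate_gaussian_sketch(X_toy)
print(mu_toy)      # per-feature means
print(sigma2_toy)  # per-feature variances (divided by m, not m - 1)

# Cross-check against NumPy's own estimators
print(np.allclose(mu_toy, X_toy.mean(axis=0)))
print(np.allclose(sigma2_toy, X_toy.var(axis=0)))
```

Note that `np.var(X, axis=0, ddof=1)` would divide by m - 1 instead, which is not what the exercise's formula uses.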

Estimate the probability distribution of the training set

# ===================== Part 2: Estimate the dataset statistics =====================
# For this exercise, we assume a Gaussian distribution for the dataset.
#
# We first estimate the parameters of our assumed Gaussian distribution,
# then compute the probabilities for each of the points and then visualize
# both the overall distribution and where each of the points falls in
# terms of that distribution
#
print('Visualizing Gaussian fit.')

# Estimate mu and sigma2
mu, sigma2 = eg.estimate_gaussian(X)

# Returns the density of the multivariate normal at each data point(row) of X
p = mvg.multivariate_gaussian(X, mu, sigma2)

# Visualize the fit
vf.visualize_fit(X, mu, sigma2)
plt.xlabel('Latency (ms)')
plt.ylabel('Throughput (mb/s)')

input('Program paused. Press ENTER to continue')
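
Once the densities `p` are available, anomaly detection comes down to flagging the points whose density falls below some threshold. The following is a minimal sketch of that idea only: the data is synthetic (a stand-in for the server statistics) and `epsilon` is a hand-picked, hypothetical value, whereas the exercise itself sets the threshold in selectThreshold.py.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 "normal" points plus one obvious outlier at the end
X_demo = rng.normal(loc=[14.0, 15.0], scale=[1.0, 1.0], size=(200, 2))
X_demo = np.vstack([X_demo, [25.0, 5.0]])

# Per-feature Gaussian parameters (variance divided by m)
mu_demo = X_demo.mean(axis=0)
sigma2_demo = X_demo.var(axis=0)

# Density under independent per-feature Gaussians: product of 1-D densities
per_feature = (np.exp(-(X_demo - mu_demo) ** 2 / (2 * sigma2_demo))
               / np.sqrt(2 * np.pi * sigma2_demo))
p_demo = per_feature.prod(axis=1)

epsilon = 1e-6  # hypothetical threshold, chosen by hand for this sketch
outliers = np.where(p_demo < epsilon)[0]
print(outliers)  # indices of the points flagged as anomalies
```

In practice a hand-picked threshold is fragile, which is why a labelled validation set (`Xval`, `yval`) is loaded alongside `X`.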

Review the program that computes the probability density, multivariateGaussian.py

import numpy as np


def multivariate_gaussian(X, mu, sigma2):
    # Number of features
    k = mu.size

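    # For the univariate (per-feature) Gaussian model, sigma2 is a vector of variances:
    # convert it to a diagonal matrix and use it as the covariance matrix in the
    # multivariate formula. In this case the univariate and multivariate models are equivalent.
    # For the multivariate Gaussian model, the covariance matrix sigma2 is used directly.
    if sigma2.ndim == 1 or (sigma2.ndim == 2 and (sigma2.shape[1] == 1 or sigma2.shape[0] == 1)):
        # ravel() first: np.diag on a 2-D (1, n) or (n, 1) array extracts its diagonal instead of building a matrix
        sigma2 = np.diag(sigma2.ravel())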

    x = X - mu
    p = (2 * np.pi) ** (-k / 2) * np.linalg.det(sigma2) ** (-0.5) * np.exp(-0.5*np.sum(np.dot(x, np.linalg.pinv(sigma2)) * x, axis=1))

    return p
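
The comment in the function says that with a diagonal covariance matrix the multivariate model is equivalent to the product of independent per-feature Gaussians. A small sketch to check this numerically is given below. The numbers are made up, and both helper names are hypothetical and not part of the exercise files.

```python
import numpy as np


def density_multivariate(X, mu, Sigma):
    """Multivariate normal density for each row of X, given a full covariance matrix."""
    k = mu.size
    d = X - mu
    # Quadratic form d^T Sigma^{-1} d, evaluated row by row
    quad = np.sum((d @ np.linalg.inv(Sigma)) * d, axis=1)
    return (2 * np.pi) ** (-k / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)


def density_univariate_product(X, mu, sigma2):
    """Product over features of independent 1-D Gaussian densities."""
    per_feature = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return per_feature.prod(axis=1)


# Made-up values for illustration
X_toy = np.array([[0.0, 0.0], [1.0, 2.0], [-1.5, 0.5]])
mu_toy = np.array([0.5, 1.0])
sigma2_toy = np.array([2.0, 0.5])

p_multi = density_multivariate(X_toy, mu_toy, np.diag(sigma2_toy))
p_prod = density_univariate_product(X_toy, mu_toy, sigma2_toy)
print(np.allclose(p_multi, p_prod))  # the two models should agree
```

The agreement holds only for a diagonal covariance matrix. Once off-diagonal terms are non-zero, the multivariate density captures correlations between features that the per-feature product cannot.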

Review the data visualization program visualizeFit.py

import matplotlib.pyplot as plt
import numpy as np
import multivariateGaussian as mvg


def visualize_fit(X, mu
