Statistic Helper Development Self-Check Reference (Part 4): Unsupervised Learning

Article link: https://blog.youkuaiyun.com/TJangun/article/details/105485825

Module 1: PCA
Module 2: Clustering
- K-Means
- Hierarchical

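Unsupervised learning analyzes data without using a response variable. Two families of methods are covered here: the familiar principal component analysis (PCA), and clustering. PCA was introduced earlier in the context of dimensionality reduction, but only briefly, so this part goes into more detail. For clustering, two methods are introduced.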

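Module 1: PCA

PCA looks for a small number of directions in the original variable space along which the variance of the data is as large as possible. Variables that are strongly correlated with each other tend to be grouped onto the same component.

PCA can be performed with prcomp(). Setting scale=TRUE standardizes each variable to standard deviation 1 before the components are computed; the variables are centered by default.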

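(pr_out <- prcomp(USArrests, scale=TRUE))  # outer parentheses print the fitted object
pr_out$center            # variable means used for centering
pr_out$scale             # variable standard deviations used for scaling
biplot(pr_out, scale=0)  # scores and loadings on the first two components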
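To connect the principle described above with what prcomp() returns, here is a minimal sketch, assuming the built-in USArrests data. It compares prcomp() with an eigendecomposition of the correlation matrix; the names eig, same_variances and same_loadings are introduced only for this illustration.

```r
# Sketch: PCA on standardized variables corresponds to the
# eigendecomposition of the correlation matrix.
pr_out <- prcomp(USArrests, scale = TRUE)
eig <- eigen(cor(USArrests))

# Eigenvalues should match the component variances (sdev^2).
same_variances <- all.equal(eig$values, pr_out$sdev^2)

# Eigenvectors should match the loadings up to sign,
# so absolute values are compared.
same_loadings <- all.equal(abs(eig$vectors), abs(unname(pr_out$rotation)))

same_variances
same_loadings
```

The sign of each loading vector is arbitrary, which is why the comparison uses absolute values.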

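The proportion of variance explained (PVE) and the cumulative PVE can be inspected as the number of principal components increases.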

(pr_var <- pr_out$sdev^2)    # variance explained by each component
(pve <- pr_var/sum(pr_var))  # proportion of variance explained (PVE)
par(mfrow=c(1,2), mar=c(4,4,1,1))
plot(pve, xlab="Principal Component", ylab="PVE", ylim=c(0,1), type='b')
plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative PVE", ylim=c(0,1), type='b')
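One common use of the cumulative PVE is to pick how many components to keep. The sketch below assumes a cutoff of 0.9, which is a hypothetical value chosen for illustration and not something the original specifies; threshold and n_comp are likewise illustrative names.

```r
# Sketch: keep the smallest number of components whose
# cumulative PVE reaches a chosen cutoff.
pr_out <- prcomp(USArrests, scale = TRUE)
pr_var <- pr_out$sdev^2
pve <- pr_var / sum(pr_var)

threshold <- 0.9  # hypothetical cutoff, adjust to the application
n_comp <- which(cumsum(pve) >= threshold)[1]
n_comp

# summary() reports the same proportions (rounded) for comparison.
summary(pr_out)$importance["Proportion of Variance", ]
```

The cutoff is a judgment call; the elbow in the PVE plot is another common guide.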

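Module 2: Clustering

Clustering groups data points according to the distances between their vectors. There are many specific methods; only two are used here for now.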

K-Means

K-means clustering requires the number of clusters to be specified in advance. Use kmeans().

Simulated data:

set.seed(2)
x <- matrix(rnorm(50*2), ncol=2)  # zero mean
x[1:25, 1] <- x[1:25, 1] + 3      # shift to right by 3
x[1:25, 2] <- x[1:25, 2] - 4      # shift down by 4
par(mar=c(4,4,1,1))
plot(x[,1], x[,2], col=c(rep("red", 25), rep("blue", 25)), pch=20, cex=2)
km_out <- kmeans(x, 2, nstart=20)  # 20 random starts, keep the best solution
km_out$cluster
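Because the data are simulated, the cluster assignments can be checked against the known groups. The following is a minimal sketch that rebuilds the same data; the label vector truth and the table tab are names introduced here for illustration.

```r
# Sketch: cross-tabulate k-means clusters against the simulated groups.
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

# Rows 1-25 were shifted, rows 26-50 were left as generated.
truth <- rep(c("shifted", "unshifted"), each = 25)

km_out <- kmeans(x, 2, nstart = 20)
tab <- table(truth, cluster = km_out$cluster)
tab

# Total within-cluster sum of squares, the quantity k-means minimizes.
km_out$tot.withinss
```

The cluster numbers themselves are arbitrary labels, so only the pattern of the table matters, not which group is called 1 or 2.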

Hierarchical

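Hierarchical clustering starts with every observation as its own cluster, then repeatedly merges the closest clusters, treating each merged cluster as a new sample point. The number of clusters goes from N down to 1, so it does not need to be set in advance.

Use hclust():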

hc_complete <- hclust(dist(x), method="complete")
hc_average <- hclust(dist(x), method="average")
hc_single <- hclust(dist(x), method="single")
par(mfrow=c(2,2), mar=c(1,4,1,0.5))  # set layout and margins in a single par() call
plot(hc_complete, main="Complete Linkage", xlab="", sub="", cex=.9)
plot(hc_average, main="Average Linkage", xlab="", sub="", cex=.9)
plot(hc_single, main="Single Linkage", xlab="", sub="", cex=.9)
cutree(hc_complete, 2)  # to obtain two clusters
cutree(hc_average, 2)   # to obtain two clusters
cutree(hc_single, 2)    # to obtain two clusters
cutree(hc_single, 4)    # to obtain four clusters
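The "from N clusters down to 1" description can be seen directly in the object returned by hclust(). This is a minimal sketch on the same simulated data; n_merges and n_clusters are names introduced here for illustration.

```r
# Sketch: inspect the merge history stored by hclust().
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

hc_complete <- hclust(dist(x), method = "complete")

# One row per merge step: N observations need N - 1 merges.
n_merges <- nrow(hc_complete$merge)
n_merges

# Negative entries are single observations, positive entries
# refer to clusters formed at an earlier step.
head(hc_complete$merge, 3)

# The same tree can be cut at any number of clusters from 1 to N.
n_clusters <- sapply(c(1, 2, 4, 50),
                     function(k) length(unique(cutree(hc_complete, k = k))))
n_clusters
```

Since the whole tree is stored, choosing a different number of clusters only requires another cutree() call, not a refit.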

Besides the default distance (Euclidean, as computed by dist()), a custom distance can also be supplied.

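n <- 30
x <- matrix(rnorm(3*n), ncol=3)
dd <- as.dist(1 - cor(t(x)))  # high correlation => shorter "distance" between two observations
hc_complete <- hclust(dd, method="complete")
par(mar=c(4,4,1,1))
plot(hc_complete, main="Complete Linkage with Correlation-Based Distance", xlab="", sub="")
# C is not defined in the original; assumed here to hold cluster labels from a two-cluster cut
C <- cutree(hc_complete, 2)
plot(x[1,], col=C[1]+1, type="b", ylim=range(x))  # profile of observation 1, colored by cluster
for (i in 2:n) {
  points(x[i,], col=C[i]+1, type="b")  # overlay the remaining observations
}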
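To see how correlation-based distance differs from the default, here is a minimal sketch with three hand-made observations. The vectors a, b and d and the matrix m are hypothetical toy data, not part of the original example.

```r
# Sketch: correlation-based distance compares the shape of profiles,
# Euclidean distance compares their magnitudes.
a <- c(1, 2, 3)
b <- c(10, 20, 30)  # same shape as a, much larger scale
d <- c(3, 2, 1)     # close to a in magnitude, reversed shape
m <- rbind(a = a, b = b, d = d)

# a and b are perfectly correlated, so their distance is (about) 0;
# a and d are perfectly anti-correlated, so their distance is (about) 2.
d_cor <- as.dist(1 - cor(t(m)))
d_cor

# Under Euclidean distance, a is closer to d than to b.
d_euc <- dist(m)
d_euc
```

Which distance is appropriate depends on whether observations should be grouped by overall level or by the pattern across variables.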