WEEK2-Descriptive statistics and data cleaning

License: CC 4.0 BY-SA

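This post walks through data cleaning in R: reading data, converting data types, and reshaping with the reshape2 package. It also shows how to draw plots with different arguments, such as scatter plots that distinguish points by gender.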

Explore Statistics with R (EDX)

WEEK2-Descriptive statistics and data cleaning: video notes

Example 1:

1. Get part of the data

obesity <- read.csv("http://www.hscic.gov.uk/catalogue/PUB13648/Obes-phys-acti-diet-eng-2014-tab_CSV.csv", skip=4, nrows=12)

#skip the first 4 rows; import 12 rows

2. Look at what the data looks like (its structure)

str(obesity)

It turns out to be a mess, so the next step is to tidy it up.

3. Keep only three variables: date, males, females

obesity$Males <- as.numeric(as.character(gsub(",","",obesity$Males)))
obesity$Females <- as.numeric(as.character(gsub(",","",obesity$Females)))
obesity<- obesity[-1,c(-2, -5:-12)]
obesity

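#Remove the thousands-separator commas from Males and Females (use global substitution, gsub(), to replace the comma with nothing);
#change the data type from factor to numeric (change the factor to a character and then change the character to numeric)
#Drop the first row, then drop column 2 and columns 5 to 12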
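
The reason for the detour through character can be sketched on a made-up vector. This assumes the column was imported as a factor, as older R versions did by default; the values below are invented for illustration and are not taken from the obesity file.

```r
# Hypothetical column imported as a factor with thousands separators
males <- factor(c("1,047", "2,132", "981"))

# Calling as.numeric() directly on a factor returns the internal level codes
codes <- as.numeric(males)

# Strip the commas first; gsub() coerces the factor to character on its own
cleaned <- as.numeric(gsub(",", "", males))

print(codes)
print(cleaned)
```

Because gsub() already returns a character vector, the as.character() wrapper in the original lines is harmless but not strictly needed.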

Now it looks much cleaner and tidier.

#This is the so-called wide format.
#We would like to have the long format: one row - one observation

4. If the package you need is not installed yet, install it first (install a package and activate it)

install.packages("reshape2")
library("reshape2")

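(The instructor's package installation output looked so neat, while mine was a mess. Never mind, as long as it works.)

5. Use melt() from the reshape2 package; if the package has not been loaded with library() in the current session, load it first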

obesitylong <- melt(obesity)
obesitylong #long format

The data now consists of date, sex and value.
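
The wide-to-long step can be sketched on a small made-up table. This assumes reshape2 is installed; the column names Year, Sex and Count and all the numbers are hypothetical. Unlike the call above, the id column is named explicitly here instead of leaving melt() to guess it.

```r
library(reshape2)

# Made-up wide table: one row per year, one column per sex
wide <- data.frame(
  Year    = c("2010/11", "2011/12"),
  Males   = c(100, 120),
  Females = c(300, 310)
)

# Name the id column explicitly instead of relying on melt()'s guess
long <- melt(wide, id.vars = "Year",
             variable.name = "Sex", value.name = "Count")
print(long)
```

Each row of the result holds a single observation: a year, a sex, and the count for that combination.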

6. Draw a plot to take a look

plot(obesitylong$value~obesitylong$variable)

(The default behavior of the function plot(): if I ask to plot the value in obesitylong dependent on the variable in obesitylong, in this case the sex, what I get is a boxplot.) (I don't fully understand this part.)
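
A possible explanation, sketched on made-up data: when the right-hand side of the formula is a factor, plot() draws one box per factor level, which is the same kind of figure that boxplot() produces. The data frame d and its values are invented for illustration.

```r
# Made-up long-format data: a numeric value and a grouping factor
d <- data.frame(
  value = c(10, 12, 11, 20, 22, 21),
  sex   = factor(rep(c("Males", "Females"), each = 3))
)

# A formula with a factor on the right-hand side is drawn as box plots
plot(value ~ sex, data = d)

# The explicit call gives the same kind of figure
stats <- boxplot(value ~ sex, data = d)
print(stats$names)
```

In obesitylong the column variable is a factor created by melt(), which would explain why a boxplot appears rather than a scatter plot.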

Then the instructor recommended installing a few packages and so on.

# install.packages("lubridate")

# setting the argument colClasses= in read.table() can reduce import time of large datasets
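
A minimal sketch of the colClasses= argument, using a made-up two-column table passed inline through text= rather than a real file. The column names id and score are hypothetical.

```r
# Made-up two-column table supplied inline instead of a real file
txt <- "id score
a 1.5
b 2.5"

# Declaring the column types up front spares read.table() from guessing them
d <- read.table(text = txt, header = TRUE,
                colClasses = c("character", "numeric"))
str(d)
```

On a table this small there is no measurable speed difference; the point is only to show where the argument goes.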

Example 2:

1. First repeat steps 1 and 2 from above, that is, read the data and look at it:

body <- read.table("http://www.amstat.org/publications/jse/datasets/body.dat.txt")
dim(body)
str(body)

It turns out the variables have no names. They are just called V1, V2 and so on, which won't do.

2. Give them names

BodyMeasurements <- c("Biacromial_diameter","Biiliac_diameter","Bitrochanteric_diameter","Chest_depth","Chest_diameter","Elbow_diameter","Wrist_diameter","Knee_diameter","Ankle_diameter","Shoulder_girth","Chest_girth","Waist_girth","Navel_girth","Hip_girth","Thigh_girth","Bicep_girth","Forearm_girth","Knee_girth","Calf_max_girth","Ankle_min_girth","Wrist_min_girth","Age","Weight","Height","Gender")
names(body) <- BodyMeasurements

3. Look at the summary statistics and draw a plot

summary(body)
boxplot(body)

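#summary() gives the minimum, maximum, median, mean, quartiles and so on; there are too many variables, so the screenshot shows only part of the output

Then it turns out that the variable names on the horizontal axis are unreadable.

Next, call par() to make the variable names on the horizontal axis show up. (I don't fully understand this part either.)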

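keep.par <- par(no.readonly = TRUE) #save only the settable parameters, so that par(keep.par) restores them without warnings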
par(mar = c(10,4,4,2)+0.1)
boxplot(body, las=3)

# to restore parameters to default, use: par(keep.par)
#or close your plotting window
#A few examples of visualization: Position, colour, size, plot character
#You can visualize many different variables in the same graph.

Yes, it worked for the instructor, but half of my variable names are still cut off and I don't know what to do.
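
One thing that might help with labels that are still cut off, offered as a sketch and not as the course's solution: enlarge the bottom margin further and shrink the axis text with cex.axis. The data frame d, its column names, and the values 14 and 0.7 are made up and would need tuning for the real data.

```r
# Made-up data frame with long column names
d <- data.frame(
  Bitrochanteric_diameter = c(30, 32, 31),
  Wrist_minimum_girth     = c(15, 16, 17)
)

# Save only the settable parameters so they can be restored without warnings
old.par <- par(no.readonly = TRUE)

# Larger bottom margin (in lines of text) plus smaller, vertical axis labels
par(mar = c(14, 4, 4, 2) + 0.1)
boxplot(d, las = 3, cex.axis = 0.7)

# Put the previous settings back
par(old.par)
```

The first element of mar is the bottom margin, so that is the number to raise when vertical labels run off the bottom of the plot.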

Example 3:

1. Some basics about plotting:

x<- 1:10
set.seed(23)
y <- x + rnorm(10)

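#Generate some x values and random y values to plot

plot(x,y) #position: an ordinary scatter plot
plot(x,y, col=x) #colour: the point colour changes
plot(x,y, col=x, cex=x) #size: the point size changes; x runs from 1 to 10 here, so the points get progressively larger
plot(x,y, col=x, cex=x, pch=x) #plot character: the point shape changes

The comments above only explain the feature added in each call. Then the instructor gave another example for pch, which was scary. There were even error messages.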

x <- rep(1:10, 10)
y <- rep(1:10, each=10)
z <- 1:100
plot(x,y,pch =z)

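2. Let's put this into practice:

plot(body$Thigh_girth,body$Bicep_girth) #an ordinary scatter plot
plot(body$Thigh_girth,body$Bicep_girth, pch=body$Gender) #point shape varies by gender, i.e. males and females get different plotting symbols
plot(body$Thigh_girth,body$Bicep_girth, col=body$Gender+1) #point colour varies by gender; I prefer this one because the difference is easier to see

A fairly clear linear relationship is visible.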
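
How the gender column drives shape and colour can be sketched on made-up numbers. This assumes Gender is coded 0 and 1, as the +1 in the call above suggests; the measurements and the legend labels are hypothetical.

```r
# Made-up measurements; gender is assumed to be coded 0/1
thigh  <- c(50, 55, 60, 65)
bicep  <- c(25, 28, 33, 36)
gender <- c(0, 0, 1, 1)

# pch = 0 draws a square and pch = 1 a circle.
# col = gender + 1 picks colours 1 and 2 of the palette; col = 0 would be
# the background colour, which is why 1 is added.
cols <- palette()[gender + 1]
plot(thigh, bicep, pch = gender, col = gender + 1)
legend("topleft", legend = c("Gender = 0", "Gender = 1"),
       pch = c(0, 1), col = c(1, 2))
print(cols)
```

Adding a legend, as here, makes it easier to tell which symbol or colour belongs to which group.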

#Summarize a variable by binning
breaks <- seq(min(body$Age),max(body$Age), 5)
Age_group <- cut(body$Age, breaks)
body$Age_group <- Age_group
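
One caveat with breaks of the form seq(min, max, 5), sketched on made-up ages: cut() uses right-closed intervals by default and leaves the lowest break out, so the minimum value, and any value above the last break, ends up as NA. The second half shows one possible workaround, not necessarily what the course intended.

```r
# Made-up ages
age <- c(18, 22, 23, 30, 34)

# Breaks from min to max in steps of 5: 18, 23, 28, 33
breaks <- seq(min(age), max(age), 5)

# Default intervals are (18,23], (23,28], (28,33]: the minimum, 18, and
# the value above the last break, 34, fall outside every interval
default_groups <- cut(age, breaks)
print(default_groups)

# One possible fix: extend the breaks past the maximum and close the first interval
wider <- seq(min(age), max(age) + 5, 5)
fixed_groups <- cut(age, wider, include.lowest = TRUE)
print(fixed_groups)
```

Points whose group is NA get no colour when the factor is passed to col=, so they would be missing from the plot.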

plot(body$Thigh_girth,body$Bicep_girth, pch=body$Gender, col=body$Age_group)
plot(body$Thigh_girth,body$Bicep_girth, pch=body$Gender, col=body$Age_group, cex=(body$Weight/10))

I feel like I'm going blind...