Differences between View, glimpse, summary, and str
1.View
Displays the dataset as a table in a viewer (in RStudio, a tab above the console)
View(HairEyeColor)
2.glimpse
library(dplyr)
glimpse(HairEyeColor)
glimpse() is provided by the dplyr package, so dplyr must be loaded first.
The output is:
'table' num [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
- attr(*, "dimnames")=List of 3
..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
..$ Sex : chr [1:2] "Male" "Female"
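HairEyeColor is a three-dimensional table rather than a data frame, so glimpse() shows the same structure that str() would: the counts, plus the character labels of the three dimensions, which have 4, 4 and 2 categories respectively.
Converting HairEyeColor to a data frame (here a tibble) before calling glimpse() gives: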
glimpse(as_tibble(HairEyeColor))
Rows: 32
Columns: 4
$ Hair <chr> "Black", "Brown", "Red", "Blond", "Black", "Brown", "Red", "Blond", "Black", "Brown", "Red", "Blond", "Bl…
$ Eye <chr> "Brown", "Brown", "Brown", "Brown", "Blue", "Blue", "Blue", "Blue", "Hazel", "Hazel", "Hazel", "Hazel", "…
$ Sex <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", "…
$ n <dbl> 32, 53, 10, 3, 11, 50, 10, 30, 10, 25, 7, 5, 3, 15, 7, 8, 36, 66, 16, 4, 9, 34, 7, 64, 5, 29, 7, 5, 2, 14…
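For a data frame, glimpse() reports the number of rows and columns, then lists each column with its type and its first few values.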
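The two outputs above differ because of the shape of the object being inspected. As a minimal sketch using only base R (as.data.frame() stands in here for the as_tibble() call used above, an assumption made to avoid loading extra packages; hec_df is a name chosen for illustration), the table and its data-frame form can be compared directly:

```r
# HairEyeColor ships with R in the datasets package
class(HairEyeColor)   # a "table", not a data frame
dim(HairEyeColor)     # hair colours x eye colours x sexes

# One row per cell of the table, with the count in the Freq column
hec_df <- as.data.frame(HairEyeColor)
dim(hec_df)           # rows and columns of the data-frame form
sum(hec_df$Freq)      # total number of people counted
```

The 4 x 4 x 2 = 32 cells of the table become the 32 rows seen in the glimpse() output.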
3.summary
> summary(HairEyeColor)
Number of cases in table: 592
Number of factors: 3
Test for independence of all factors:
Chisq = 164.92, df = 24, p-value = 5.321e-23
Chi-squared approximation may be incorrect
> summary(as_tibble(HairEyeColor))
Hair Eye Sex n
Length:32 Length:32 Length:32 Min. : 2.00
Class :character Class :character Class :character 1st Qu.: 7.00
Mode :character Mode :character Mode :character Median :10.00
Mean :18.50
3rd Qu.:29.25
Max. :66.00
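When applied to a data frame, summary() reports six values for each numeric variable: the minimum, first quartile, median, mean, third quartile and maximum. For character variables it shows only the length, class and mode.
summary() can also report the levels of a categorical variable and the count in each level, provided the variable is stored as a factor: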
a<-as.data.frame(HairEyeColor)
summary(a)
Hair Eye Sex Freq
Black:8 Brown:8 Male :16 Min. : 2.00
Brown:8 Blue :8 Female:16 1st Qu.: 7.00
Red :8 Hazel:8 Median :10.00
Blond:8 Green:8 Mean :18.50
3rd Qu.:29.25
Max. :66.00
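The difference between the two summary() outputs comes down to whether the columns are character vectors or factors. The following sketch, which uses only base R and names chosen for illustration (as_factor, as_char), shows the same column summarised both ways:

```r
# Default conversion: the classifying columns become factors
as_factor <- as.data.frame(HairEyeColor)
summary(as_factor$Hair)        # one count per level

# Same data, but keeping the labels as character vectors
as_char <- as.data.frame(HairEyeColor, stringsAsFactors = FALSE)
summary(as_char$Hair)          # only length, class and mode

# Converting a character column to a factor restores the per-level counts
summary(factor(as_char$Hair))
```

If a character column in your own data should be treated as categorical, wrapping it in factor() before calling summary() is usually enough.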
4.str
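str() comes from the utils package and can serve as a compact alternative to summary(). utils is attached by default in a standard R session, so the library(utils) call below is optional.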
library(utils)
> str(HairEyeColor)
'table' num [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
- attr(*, "dimnames")=List of 3
..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
..$ Sex : chr [1:2] "Male" "Female"
> str(as_tibble(HairEyeColor))
tibble [32 × 4] (S3: tbl_df/tbl/data.frame)
$ Hair: chr [1:32] "Black" "Brown" "Red" "Blond" ...
$ Eye : chr [1:32] "Brown" "Brown" "Brown" "Brown" ...
$ Sex : chr [1:32] "Male" "Male" "Male" "Male" ...
$ n : num [1:32] 32 53 10 3 11 50 10 30 10 25 ...
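Things to watch for when processing data
1.Data format
When reading a txt file, check which separator it uses and whether the first row contains the variable names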
college<-read.table("college.txt",header = TRUE,sep="\t")
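To see what header and sep do without needing college.txt, here is a small self-contained sketch. The file contents and the names tmp and demo are made up for illustration and are not taken from the real dataset:

```r
# Write a tiny tab-separated file to a temporary location (hypothetical data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("School\tTier\tRetention",
             "Alpha College\t1\t95",
             "Beta University\t2\tNA"), tmp)

# header = TRUE: the first row holds the variable names
# sep = "\t":    fields are split on tabs, so spaces inside names are kept
demo <- read.table(tmp, header = TRUE, sep = "\t")
str(demo)

unlink(tmp)  # remove the temporary file
```

With the default separator (any whitespace), "Alpha College" would be split into two fields, which is why the separator needs checking first.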
2.The table function
Shows how many times each value of a variable occurs (here the numeric variable Tier)
table(college$Tier)
1 2 3 4
51 82 66 61
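By default table() leaves missing values out of the counts. A short sketch with hypothetical tier values (the vector tier below is invented for illustration) shows how to include them:

```r
# Hypothetical tier values with one missing entry
tier <- c(1, 2, 2, 3, NA, 4, 4, 4)

table(tier)                                   # NA is silently dropped
tier_counts <- table(tier, useNA = "ifany")   # adds an <NA> column when needed
tier_counts

prop.table(table(tier))                       # counts as proportions
```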
3.Checking for missing values
(1)is.na
Returns TRUE where a value is missing and FALSE where it is not
is.na(college)
School Enrollment Tier Retention Grad.rate Pct.20 Pct.50 Full.time Top.10 Accept.rate Alumni.giving
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[16,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[17,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[19,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[20,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[21,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[22,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[23,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[24,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[26,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[27,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[28,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[29,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[30,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[31,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[32,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[33,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[34,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[35,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[36,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[38,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[39,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[40,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[41,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[42,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[43,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[44,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[45,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[46,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[47,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[48,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[50,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[51,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[52,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[53,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[54,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[55,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[56,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[57,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[58,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[59,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[60,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[62,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[63,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[64,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[65,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[66,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[67,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[68,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[69,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[70,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[71,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[72,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[73,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[74,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[75,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[76,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[77,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[78,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[79,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[80,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[81,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[82,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[83,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[84,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[85,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[86,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[87,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[88,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[89,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[90,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[ reached getOption("max.print") -- omitted 170 rows ]
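Printing the full logical matrix is hard to read for a data frame of this size. As a sketch on a small made-up data frame (scores and its values are hypothetical), the same information can be condensed into counts and positions:

```r
# Hypothetical data with three missing values
scores <- data.frame(
  School    = c("A", "B", "C", "D"),
  Retention = c(90, NA, 75, 82),
  Grad.rate = c(80, 60, NA, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(scores))
na_per_column

# Row and column index of every missing value
na_positions <- which(is.na(scores), arr.ind = TRUE)
na_positions

# Quick yes/no check for the whole data frame
anyNA(scores)
```

The same calls applied to college would give one count per column instead of hundreds of printed rows.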
(2)complete.cases()
Returns TRUE for each row that has no missing values and FALSE for each row that has at least one
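A minimal sketch of complete.cases() on a made-up data frame (scores and its values are hypothetical):

```r
# Hypothetical data: rows B and C each contain a missing value
scores <- data.frame(
  School    = c("A", "B", "C", "D"),
  Retention = c(90, NA, 75, 82),
  Grad.rate = c(80, 60, NA, 70)
)

ok <- complete.cases(scores)   # one logical value per row
ok
sum(!ok)                       # number of rows with at least one NA
scores[ok, ]                   # keep only the complete rows
```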
(3)summary()
Shows descriptive statistics for each variable, including the number of NA's in each column
summary(college)
School Enrollment Tier Retention Grad.rate Pct.20 Pct.50
Length:260 Min. : 1712 Min. :1.000 Min. :54.00 Min. : 9.00 Min. :18.00 Min. : 0.00
Class :character 1st Qu.: 9814 1st Qu.:2.000 1st Qu.:76.00 1st Qu.:49.00 1st Qu.:36.50 1st Qu.: 6.00
Mode :character Median :16478 Median :2.000 Median :83.50 Median :62.00 Median :45.00 Median :10.00
Mean :18875 Mean :2.527 Mean :82.57 Mean :62.98 Mean :47.22 Mean :10.54
3rd Qu.:26856 3rd Qu.:3.000 3rd Qu.:90.00 3rd Qu.:77.00 3rd Qu.:58.00 3rd Qu.:15.00
Max. :67082 Max. :4.000 Max. :99.00 Max. :98.00 Max. :94.00 Max. :30.00
NA's :3 NA's :2 NA's :1 NA's :5 NA's :6
Full.time Top.10 Accept.rate Alumni.giving
Min. : 37.00 Min. : 7.00 Min. : 8.00 Min. : 1.00
1st Qu.: 82.00 1st Qu.: 19.75 1st Qu.: 49.00 1st Qu.: 8.75
Median : 89.00 Median : 29.50 Median : 64.00 Median :12.00
Mean : 86.27 Mean : 39.78 Mean : 60.84 Mean :14.99
3rd Qu.: 93.50 3rd Qu.: 53.00 3rd Qu.: 75.00 3rd Qu.:18.25
Max. :100.00 Max. :100.00 Max. :100.00 Max. :67.00
NA's :5 NA's :24 NA's :3 NA's :8
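4.Exploring variables of particular interest with plots
(1)The stripchart() function draws a strip chart of a single variable
stripchart(college$Retention,method="stack",pch=18,xlab="Retention")
(2)stripchart() with a formula shows the relationship between two variables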
stripchart(Retention~Grad.rate,method="stack",pch=17,xlab="retention",ylab="grad rate",data=college)
The plot suggests that colleges with a high retention rate also tend to have a high graduation rate.
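A visual impression like this can be backed up with a number. The sketch below computes a correlation coefficient on invented values (retention and grad_rate are hypothetical vectors, not the college data); use = "complete.obs" tells cor() to skip pairs in which either value is missing:

```r
# Hypothetical retention and graduation rates, each with one missing value
retention <- c(95, 88, NA, 76, 70, 64)
grad_rate <- c(90, 80, 75, 62, NA, 45)

# Pearson correlation over the pairs where both values are present
r <- cor(retention, grad_rate, use = "complete.obs")
r
```

For the real data the equivalent call would be cor(college$Retention, college$Grad.rate, use = "complete.obs").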
5.Labelling outliers with the identify() function
identify(college$Retention,college$Grad.rate,n=2,labels=college$School)
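identify() works on the plot that is currently open, so first draw a scatter plot of the same two variables, for example plot(college$Retention,college$Grad.rate). After running the line above, click an outlying point in the plot to see the School value for that point.
6.Box plots with boxplot() and the five-number summary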
boxplot(Retention~Tier,data=college,horizontal=TRUE,ylab="Tier",xlab="Retention")
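The five numbers that a box plot draws can also be printed. This sketch uses a small invented data frame (grades and its values are hypothetical) and computes the five-number summary of Retention within each Tier:

```r
# Hypothetical retention rates for two tiers
grades <- data.frame(
  Tier      = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  Retention = c(90, 92, 94, 96, 98, 70, 75, 80, 85, 90)
)

# Minimum, lower hinge, median, upper hinge, maximum for each tier
five_by_tier <- tapply(grades$Retention, grades$Tier, fivenum)
five_by_tier

# boxplot.stats() returns the values boxplot() uses for one group
tier1_stats <- boxplot.stats(grades$Retention[grades$Tier == 1])$stats
tier1_stats
```

fivenum() ignores missing values by default, so it can be applied to college$Retention directly.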
7.Scatter plot of the relationship between key variables with plot()
plot(college$Retention,college$Grad.rate)
The retention rate is strongly associated with the school's graduation rate.
8.Drawing a fitted line
fit=line(college$Retention,college$Grad.rate)  # line() has no na.rm argument; it drops incomplete pairs itself
abline(coef(fit))
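line() fits Tukey's resistant line, which is built from medians rather than from least squares. As a rough sketch on invented data (x, y and the single outlier are hypothetical), comparing it with lm() shows why it suits data with unusual points:

```r
# Points on the line y = 2 + 3x, with the last one replaced by an outlier
x <- 1:9
y <- 2 + 3 * x
y[9] <- 100

resistant <- line(x, y)    # Tukey's resistant line (median based)
least_sq  <- lm(y ~ x)     # ordinary least squares

coef(resistant)            # intercept and slope, barely moved by the outlier
coef(least_sq)             # slope pulled upward by the outlier
```

Either fit can be added to an existing scatter plot with abline(), as done with abline(coef(fit)) above.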
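9.Plotting the residuals and identifying unusual points
The rows containing NA values must be removed first with na.omit(). Otherwise the residuals, which line() computes only for complete pairs, are shorter than the Retention column and the plot fails.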
college1<-na.omit(college)
fit=line(college1$Retention,college1$Grad.rate)
plot(college1$Retention,fit$residuals)
abline(h=0)
identify(college1$Retention,fit$residuals,n=2,labels=college1$School)
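na.omit() removes every row that has an NA in any column, including columns the fit does not use. One alternative, sketched here on a made-up data frame (demo, pair_ok and the values are hypothetical), is to drop only the rows where one of the two plotted variables is missing:

```r
# Hypothetical data: Top.10 has missing values the fit does not care about
demo <- data.frame(
  School    = c("A", "B", "C", "D", "E", "F"),
  Retention = c(60, 70, NA, 80, 90, 95),
  Grad.rate = c(40, 55, 60, 70, 85, 88),
  Top.10    = c(NA, 20, 30, NA, 50, 60)
)

all_complete <- na.omit(demo)   # drops rows A, C and D
pair_ok <- demo[complete.cases(demo$Retention, demo$Grad.rate), ]  # drops only C

# The residuals now line up one-to-one with the rows of pair_ok
fit_pair <- line(pair_ok$Retention, pair_ok$Grad.rate)
length(fit_pair$residuals) == nrow(pair_ok)
```

Whether to keep the extra rows is a judgment call; this version simply retains more schools in the residual plot.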