[Exploratory Data Analysis] Study Notes 9.20

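This post covers four R functions, View, glimpse, summary, and str, and their role in data preprocessing: checking the data format, inspecting variables, computing descriptive statistics, and detecting outliers. It focuses on how the functions differ and how to use them in data exploration and preliminary analysis.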


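Contents

Differences between View, glimpse, summary, and str

1. View

2. glimpse

3. summary

4. str

Things to watch for when processing data

1. Data format

2. The table function

3. Checking for missing values

4. Explore variables of particular interest with plots

5. Flag outliers with identify()

6. Box plots with boxplot() and the five-number summary

7. Scatter plot of key variables with plot()

8. Draw a fitted line

9. Plot the residuals and identify outliers
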
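Differences between View, glimpse, summary, and str

1. View

View() opens the dataset as a table in a spreadsheet-style viewer (in RStudio, a tab in the upper pane).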

View(HairEyeColor)

2. glimpse

glimpse() comes from the dplyr package, so load dplyr first.

library(dplyr)
glimpse(HairEyeColor)

Output:

 'table' num [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
 - attr(*, "dimnames")=List of 3
  ..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
  ..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
  ..$ Sex : chr [1:2] "Male" "Female"

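HairEyeColor is a three-dimensional contingency table rather than a data frame, so glimpse() falls back to str()-style output: a 4 × 4 × 2 numeric array whose dimensions Hair, Eye, and Sex have 4, 4, and 2 levels.

After converting HairEyeColor to a tibble, glimpse() gives a more useful result: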

glimpse(as_tibble(HairEyeColor))
Rows: 32
Columns: 4
$ Hair <chr> "Black", "Brown", "Red", "Blond", "Black", "Brown", "Red", "Blond", "Black", "Brown", "Red", "Blond", "Bl…
$ Eye  <chr> "Brown", "Brown", "Brown", "Brown", "Blue", "Blue", "Blue", "Blue", "Hazel", "Hazel", "Hazel", "Hazel", "…
$ Sex  <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", "Male", "…
$ n    <dbl> 32, 53, 10, 3, 11, 50, 10, 30, 10, 25, 7, 5, 3, 15, 7, 8, 36, 66, 16, 4, 9, 34, 7, 64, 5, 29, 7, 5, 2, 14…

The output lists the number of rows and columns, then the type and first few values of each column.

3. summary
> summary(HairEyeColor)
Number of cases in table: 592 
Number of factors: 3 
Test for independence of all factors:
	Chisq = 164.92, df = 24, p-value = 5.321e-23
	Chi-squared approximation may be incorrect
> summary(as_tibble(HairEyeColor))
     Hair               Eye                Sex                  n        
 Length:32          Length:32          Length:32          Min.   : 2.00  
 Class :character   Class :character   Class :character   1st Qu.: 7.00  
 Mode  :character   Mode  :character   Mode  :character   Median :10.00  
                                                          Mean   :18.50  
                                                          3rd Qu.:29.25  
                                                          Max.   :66.00  

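Applied to a data frame, summary() reports six values for each numeric variable: minimum, first quartile, median, mean, third quartile, and maximum. For character variables it shows only the length, class, and mode.

When the categorical variables are stored as factors, as they are after as.data.frame(), summary() also lists each level and its count.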

a<-as.data.frame(HairEyeColor)
summary(a)
    Hair      Eye        Sex          Freq      
 Black:8   Brown:8   Male  :16   Min.   : 2.00  
 Brown:8   Blue :8   Female:16   1st Qu.: 7.00  
 Red  :8   Hazel:8               Median :10.00  
 Blond:8   Green:8               Mean   :18.50  
                                 3rd Qu.:29.25  
                                 Max.   :66.00  
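4. str

str() is in the utils package, which R attaches by default, so the library() call is optional. It compactly displays the structure of an object and can serve as an alternative to summary().

library(utils)  # optional: utils is attached by default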
> str(HairEyeColor)
 'table' num [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
 - attr(*, "dimnames")=List of 3
  ..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
  ..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
  ..$ Sex : chr [1:2] "Male" "Female"
> str(as_tibble(HairEyeColor))
tibble [32 × 4] (S3: tbl_df/tbl/data.frame)
 $ Hair: chr [1:32] "Black" "Brown" "Red" "Blond" ...
 $ Eye : chr [1:32] "Brown" "Brown" "Brown" "Brown" ...
 $ Sex : chr [1:32] "Male" "Male" "Male" "Male" ...
 $ n   : num [1:32] 32 53 10 3 11 50 10 30 10 25 ...

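Things to watch for when processing data

1. Data format

When reading a txt file, check which separator it uses and whether the first line contains the variable names.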

college<-read.table("college.txt",header = TRUE,sep="\t")
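
The effect of header and sep can be checked without college.txt. The sketch below is only an illustration: it writes a tiny tab-separated file to a temporary path and reads it back, and the file contents and the name demo are made up for this example.

```r
# Write a small tab-separated file to a temporary path (hypothetical data).
tmp <- tempfile(fileext = ".txt")
writeLines(c("School\tTier\tRetention",
             "A College\t1\t95",
             "B College\t2\tNA"), tmp)

# header = TRUE: the first line holds the column names.
# sep = "\t": fields are split on tabs only, so "A College" stays one field.
demo <- read.table(tmp, header = TRUE, sep = "\t")
str(demo)

# The string "NA" in the file is read as a missing value.
is.na(demo$Retention)
```

With the default sep (any whitespace), "A College" would be split into two fields and the columns would no longer line up with the header, which is why the separator is worth checking first.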
2. The table function

table() counts how many times each distinct value of a variable occurs.

table(college$Tier)

 1  2  3  4 
51 82 66 61 
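
By default table() leaves missing values out of the counts. A minimal sketch with a made-up vector (tier is hypothetical, not the college data) shows how to include them and how to turn counts into proportions:

```r
tier <- c(1, 2, 2, 3, NA, 4, 2)   # hypothetical values with one NA

counts <- table(tier)             # the NA is silently dropped
counts

counts_na <- table(tier, useNA = "ifany")   # adds an <NA> entry when NAs exist
counts_na

prop.table(counts)                # proportions instead of counts
```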
3. Checking for missing values

(1) is.na

Returns TRUE where a value is missing and FALSE where it is present.

is.na(college)
       School Enrollment  Tier Retention Grad.rate Pct.20 Pct.50 Full.time Top.10 Accept.rate Alumni.giving
  [1,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
  [2,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
  [3,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
  [4,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
  [5,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
  [6,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
  [7,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
  [8,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
  [9,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [10,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [11,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [12,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [13,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [14,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [15,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [16,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [17,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [18,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [19,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [20,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [21,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [22,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [23,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [24,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [25,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [26,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [27,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [28,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [29,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [30,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [31,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [32,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [33,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [34,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [35,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [36,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [37,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [38,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [39,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [40,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [41,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [42,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [43,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [44,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [45,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [46,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [47,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [48,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [49,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [50,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [51,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [52,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [53,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [54,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [55,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [56,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [57,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [58,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [59,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [60,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [61,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [62,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE   TRUE     FALSE  FALSE       FALSE         FALSE
 [63,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [64,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [65,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [66,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [67,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [68,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [69,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [70,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [71,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [72,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [73,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [74,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [75,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [76,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [77,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [78,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [79,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [80,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [81,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [82,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [83,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [84,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [85,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [86,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [87,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [88,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [89,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [90,]  FALSE      FALSE FALSE     FALSE     FALSE  FALSE  FALSE     FALSE  FALSE       FALSE         FALSE
 [ reached getOption("max.print") -- omitted 170 rows ]

(2) complete.cases()

Returns TRUE for rows with no missing values and FALSE for rows with at least one missing value.
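
The directions of is.na() and complete.cases() are easy to mix up, so here is a small sketch on a made-up data frame (df is hypothetical). It also shows colSums(is.na(...)), which condenses the full logical matrix into one count per column.

```r
df <- data.frame(x = c(1, NA, 3),
                 y = c("a", "b", NA))   # hypothetical data

is.na(df)                          # logical matrix, TRUE where a value is missing

na_per_col <- colSums(is.na(df))   # number of missing values per column
na_per_col

ok <- complete.cases(df)           # TRUE only for rows without any NA
ok

df[!ok, ]                          # rows that contain at least one NA
```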

(3) summary()

Shows descriptive statistics for each variable, including the number of NA's.

summary(college)
    School            Enrollment         Tier         Retention       Grad.rate         Pct.20          Pct.50     
 Length:260         Min.   : 1712   Min.   :1.000   Min.   :54.00   Min.   : 9.00   Min.   :18.00   Min.   : 0.00  
 Class :character   1st Qu.: 9814   1st Qu.:2.000   1st Qu.:76.00   1st Qu.:49.00   1st Qu.:36.50   1st Qu.: 6.00  
 Mode  :character   Median :16478   Median :2.000   Median :83.50   Median :62.00   Median :45.00   Median :10.00  
                    Mean   :18875   Mean   :2.527   Mean   :82.57   Mean   :62.98   Mean   :47.22   Mean   :10.54  
                    3rd Qu.:26856   3rd Qu.:3.000   3rd Qu.:90.00   3rd Qu.:77.00   3rd Qu.:58.00   3rd Qu.:15.00  
                    Max.   :67082   Max.   :4.000   Max.   :99.00   Max.   :98.00   Max.   :94.00   Max.   :30.00  
                    NA's   :3                       NA's   :2       NA's   :1       NA's   :5       NA's   :6      
   Full.time          Top.10        Accept.rate     Alumni.giving  
 Min.   : 37.00   Min.   :  7.00   Min.   :  8.00   Min.   : 1.00  
 1st Qu.: 82.00   1st Qu.: 19.75   1st Qu.: 49.00   1st Qu.: 8.75  
 Median : 89.00   Median : 29.50   Median : 64.00   Median :12.00  
 Mean   : 86.27   Mean   : 39.78   Mean   : 60.84   Mean   :14.99  
 3rd Qu.: 93.50   3rd Qu.: 53.00   3rd Qu.: 75.00   3rd Qu.:18.25  
 Max.   :100.00   Max.   :100.00   Max.   :100.00   Max.   :67.00  
 NA's   :5        NA's   :24       NA's   :3        NA's   :8      
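4. Explore variables of particular interest with plots

(1) A strip chart, drawn with stripchart(), shows the distribution of a single variable

stripchart(college$Retention,method="stack",pch=18,xlab="Retention")

(2) stripchart() with a formula shows the relationship between two variables

stripchart(Retention~Grad.rate,method="stack",pch=17,xlab="retention",ylab="grad rate",data=college)

The plot shows that universities with higher retention rates also tend to have higher graduation rates.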

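5. Flag outliers with identify()
identify(college$Retention,college$Grad.rate,n=2,labels=college$School)

identify() works on the plot that is currently open, so draw the scatter plot first, for example with plot(college$Retention,college$Grad.rate). After running the call, click an outlying point in the plot to see its School value; n=2 allows two points to be labeled.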

6. Box plots with boxplot() and the five-number summary
boxplot(Retention~Tier,data=college,horizontal=TRUE,ylab="Tier",xlab="Retention")
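
The numbers behind a box plot can be computed directly. The sketch below uses a made-up vector (x is hypothetical, not the college data): fivenum() returns the five-number summary, and boxplot.stats() returns the values that boxplot() draws, including the points it would plot as outliers.

```r
x <- c(10, 76, 80, 83, 85, 90, 99)   # hypothetical retention rates

five <- fivenum(x)    # minimum, lower hinge, median, upper hinge, maximum
five

bs <- boxplot.stats(x)
bs$stats              # whisker ends, hinges, and median as drawn in the box plot
bs$out                # points more than 1.5 times the hinge spread beyond a hinge
```

Here the value 10 lies far below the lower hinge, so it is reported in bs$out and the lower whisker stops at 76.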

7. Scatter plot of key variables with plot()
plot(college$Retention,college$Grad.rate)

The scatter plot shows a strong positive association between retention rate and graduation rate.
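
A correlation coefficient puts a number on what the scatter plot shows. Because the college data contains NAs, cor() needs to be told how to handle them. The vectors below are made up for illustration only.

```r
x <- c(1, 2, 3, 4, NA)    # hypothetical values with one NA
y <- c(2, 4, 6, 8, 10)

r_default <- cor(x, y)                          # NA, because one x value is missing
r_complete <- cor(x, y, use = "complete.obs")   # uses only the complete pairs

r_default
r_complete
```

For the data in this post the corresponding call would be cor(college$Retention, college$Grad.rate, use = "complete.obs").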

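8. Draw a fitted line

line() fits Tukey's resistant line. It has no na.rm argument; pairs with missing values are dropped automatically.

fit=line(college$Retention,college$Grad.rate)
abline(coef(fit))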
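
For comparison, the sketch below draws both a resistant line from line() and an ordinary least-squares line from lm(). Since college.txt is not available here, the built-in cars dataset stands in for it; the object names fit_tukey and fit_ols are chosen for this example.

```r
# Built-in `cars` data stands in for the college data.
fit_tukey <- line(cars$speed, cars$dist)     # Tukey's resistant line
fit_ols   <- lm(dist ~ speed, data = cars)   # ordinary least squares

coef(fit_tukey)   # intercept and slope of the resistant line
coef(fit_ols)     # intercept and slope of the least-squares line

plot(cars$speed, cars$dist, xlab = "speed", ylab = "dist")
abline(coef(fit_tukey), lty = 2)   # dashed: resistant line
abline(fit_ols)                    # solid: least-squares line
```

The resistant line is based on medians, so a few extreme points pull it less than they pull the least-squares line.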

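9. Plot the residuals and identify outliers

Remove the rows that contain NA with na.omit() first. Otherwise line() drops the incomplete pairs on its own, the residual vector ends up shorter than college$Retention, and the plot fails.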

college1<-na.omit(college)
fit=line(college1$Retention,college1$Grad.rate)
plot(college1$Retention,fit$residuals)
abline(h=0)
identify(college1$Retention,fit$residuals,n=2,labels=college1$School)
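
identify() needs an interactive session. As a non-interactive alternative, the points with the largest absolute residuals can be labeled with text(). This is only a sketch: the built-in cars data stands in for college1, and the row labels are made up.

```r
# `cars` and the made-up labels stand in for college1 and college1$School.
fit <- line(cars$speed, cars$dist)
res <- fit$residuals
row_labels <- paste0("row", seq_len(nrow(cars)))   # hypothetical labels

plot(cars$speed, res, xlab = "speed", ylab = "residual")
abline(h = 0)

# Label the two points with the largest absolute residuals.
worst <- order(abs(res), decreasing = TRUE)[1:2]
text(cars$speed[worst], res[worst], labels = row_labels[worst], pos = 3)
```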
