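Definitions:
- Factor: a discrete variable that represents categories; often used in table-related computations
- Levels: the set of distinct values that a factor variable takes
Characteristics:
Functions: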
- factor()
- as.factor()
- is.factor()
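- tapply(x, f, g): x is the data (must be a vector), f is a factor or a list of factors (each must have the same length as x and gives the group of the corresponding element), and g is the function applied to each group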
- split()
- by(): similar to tapply, but the data is not restricted to vectors (for example, it can be a data frame)
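The conversion and test functions listed above are not demonstrated elsewhere in these notes. A minimal sketch follows, assuming a made-up character vector `v` (the variable names and data are illustrative only):

```r
# Hypothetical example data: a character vector of category labels.
v <- c("a", "b", "a", "c")

f <- as.factor(v)   # coerce the character vector to a factor

is.factor(v)        # FALSE: v is still a plain character vector
is.factor(f)        # TRUE
levels(f)           # the distinct values, as character strings
nlevels(f)          # number of levels
table(f)            # counts of elements per level
```

as.factor() is the short form for a plain conversion; factor() is the one to use when levels need to be set explicitly.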
Applications:
- Creating factors
- > x <- c(5,12,13,12)
- > xf <- factor(x) # use the default levels, i.e. the set of all distinct elements
- > xf
- [1] 5 12 13 12
- Levels: 5 12 13
- > class(xf)
- [1] "factor"
- > str(xf)
- Factor w/ 3 levels "5","12","13": 1 2 3 2
- > unclass(xf)
- [1] 1 2 3 2
- attr(,"levels")
- [1] "5" "12" "13"
- > levels(xf) # return the levels of the factor
- [1] "5" "12" "13"
- > xff <- factor(x,levels=c(5,12,13,88)) # set the levels explicitly; a level with no matching element is allowed
- > xff
- [1] 5 12 13 12
- Levels: 5 12 13 88
- > x
- [1] 5 12 13 12
- > xff <- factor(x,levels=c(5,12,11,88)) # an element with no matching level is marked <NA>, i.e. missing
- > xff
- [1] 5 12 <NA> 12
- Levels: 5 12 11 88
- > xff[2] <- 88 # factor elements can be accessed and assigned by index
- > xff
- [1] 5 88 <NA> 12
- Levels: 5 12 11 88
- > xff[5] <- 28 # a value that is not one of the levels cannot be added
- Warning message:
- In `[<-.factor`(`*tmp*`, 5, value = 28) :
- invalid factor level, NA generated
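The warning above appears because 28 is not among the levels. One possible way around it, sketched here under the assumption that extending the level set is acceptable, is to add the level first and then assign. The same sketch shows how to recover the original numbers from a factor, since as.integer() returns the internal codes rather than the values.

```r
x <- c(5, 12, 13, 12)
xff <- factor(x, levels = c(5, 12, 13, 88))

# Add "28" to the level set, then the assignment no longer produces NA.
levels(xff) <- c(levels(xff), "28")
xff[5] <- "28"
xff

# as.integer() returns positions in levels(xff), not the original numbers.
codes <- as.integer(xff)

# Go through character to get the numeric values back.
values <- as.numeric(as.character(xff))
```

Here `codes` holds the level indices and `values` holds the numbers the factor was built from.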
- Applying functions over factors:
- > ages <- c(25,26,55,37,21,42)
- > affils <- c("R","D","D","R","U","D")
- > tapply(ages,affils,mean) # use tapply to apply a function within each factor group
- D R U
- 41 31 21
- > d <- data.frame(list(gender=c("M","M","F","M","F","F"), age=c(47,59,21,32,33,24),income=c(55000,88000,32450,76500,123000,45650)))
- > d$over25 <- ifelse(d$age > 25,1,0)
- > tapply(d$income,list(d$gender,d$over25),mean) # joint factors: any number of factors can be combined, giving one group per combination of levels
- 0 1
- F 39050 123000.00
- M NA 73166.67
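The result of tapply is a named array, so individual groups can be looked up by level name, and any function of a vector can take the place of mean. A small sketch reusing the `ages` and `affils` data from above (the names `means` and `counts` are introduced here for illustration):

```r
ages <- c(25, 26, 55, 37, 21, 42)
affils <- c("R", "D", "D", "R", "U", "D")

means <- tapply(ages, affils, mean)
means["D"]            # look up one group by its level name

# Any function of a vector works, e.g. length() for group sizes.
counts <- tapply(ages, affils, length)
counts
```

A character vector such as `affils` is converted to a factor internally, which is why it can be passed directly.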
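- Applying complex functions by factor group
- > by(mtcars, mtcars$gear, function(m) lm(m[,1]~m[,2])) # apply a function that needs more than one input column to each group of a data frame (here a regression of column 1, mpg, on column 2, cyl)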
- mtcars$gear: 3
- Call:
- lm(formula = m[, 1] ~ m[, 2])
- Coefficients:
- (Intercept) m[, 2]
- 29.784 -1.832
- ------------------------------------------------------------
- mtcars$gear: 4
- Call:
- lm(formula = m[, 1] ~ m[, 2])
- Coefficients:
- (Intercept) m[, 2]
- 41.275 -3.588
- ------------------------------------------------------------
- mtcars$gear: 5
- Call:
- lm(formula = m[, 1] ~ m[, 2])
- Coefficients:
- (Intercept) m[, 2]
- 40.58 -3.20
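The printed output above is awkward to work with programmatically. As a sketch, the same per-gear regressions can be written with column names instead of positions, and the coefficients collected into a matrix. This assumes that columns 1 and 2 of mtcars are mpg and cyl (true for the built-in dataset); `fits` and `coefs` are names introduced here.

```r
# Fit mpg ~ cyl separately within each gear group.
fits <- by(mtcars, mtcars$gear, function(m) lm(mpg ~ cyl, data = m))

# The result behaves like a list with one fitted model per level of gear.
# Collect the coefficients: one column per gear, one row per term.
coefs <- sapply(fits, coef)
coefs
```

Using a formula with `data = m` also gives readable coefficient names (`cyl` rather than `m[, 2]`).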
- Splitting data by factor
- > split(d$income,list(d$gender,d$over25)) # split the data by joint factors; unlike tapply, no function is applied to the data
- $F.0
- [1] 32450 45650
- $M.0
- numeric(0)
- $F.1
- [1] 123000
- $M.1
- [1] 55000 88000 76500
- > class(split(d$income,list(d$gender,d$over25))) # split returns a list
- [1] "list"
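Because split only forms the groups, a function can be applied afterwards with sapply or lapply; for a single factor this gives the same numbers as tapply. The sketch below rebuilds the data frame `d` from above so that it is self-contained, and also shows the `drop = TRUE` argument, which omits empty combinations such as M.0. The names `by_split`, `by_tapply` and `groups` are introduced here.

```r
d <- data.frame(
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(47, 59, 21, 32, 33, 24),
  income = c(55000, 88000, 32450, 76500, 123000, 45650)
)
d$over25 <- ifelse(d$age > 25, 1, 0)

# split + sapply computes the same group means as tapply.
by_split  <- sapply(split(d$income, d$gender), mean)
by_tapply <- tapply(d$income, d$gender, mean)

# drop = TRUE leaves out combinations with no observations.
groups <- split(d$income, list(d$gender, d$over25), drop = TRUE)
names(groups)
```

Keeping the split result around is useful when several different functions need to be applied to the same groups.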
- others