R functions such as apply, lapply and sapply

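This article introduces several advanced functions in R that help make code more efficient, including apply, lapply and sapply, and shows how to use them for matrix and data frame operations and for statistical computations.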

Reposted from: http://www.ats.ucla.edu/stat/r/library/advanced_function_r.htm

R Library: Advanced functions

The R program (as a text file) for the code on this page.
To see the commands echoed back in the console as they are processed, rather than just the results of the computations, use the echo=TRUE option of the source function when running the program.

One of the main ways to improve the efficiency of R code is to avoid explicit loops, which tend to be slow in R. On this page we show a number of ways to avoid loops by vectorizing the computation. We cover the following topics:

The apply function
The lapply function
The sapply function
The tapply function
The sweep function
The column functions
The row functions
Miscellaneous

The apply function

Applies a function to sections of an array and returns the results in an array.

apply(array, margin, function, ...)

Note that an array in R is a very generic data type; it is a general structure that can have any number of dimensions. For specific dimensions there are special names for the structures. A zero dimensional array is a scalar or a point; a one dimensional array is a vector; and a two dimensional array is a matrix.
The margin argument specifies which margin we want to apply the function to and which margin we wish to keep. If the array is a matrix then the margin can be either 1 (apply the function to the rows) or 2 (apply the function to the columns). The function can be any built-in or user defined function. The ... after the function stands for any other arguments that are passed on to the function being used.
Note that in R the apply function internally uses a loop, so when time and efficiency are very important one of the other functions on this page may be a better choice.

mat1 <- matrix(rep(seq(4), 4), ncol = 4)
mat1
     [,1] [,2] [,3] [,4] 
[1,]    1    1    1    1
[2,]    2    2    2    2
[3,]    3    3    3    3
[4,]    4    4    4    4

#row sums of mat1
apply(mat1, 1, sum)
[1]  4  8 12 16

#column sums of mat1
apply(mat1, 2, sum)
[1] 10 10 10 10
#using a user defined function
sum.plus.2 <- function(x){
	sum(x) + 2
}

#using the sum.plus.2 function on the rows of mat1
apply(mat1, 1, sum.plus.2)
[1]  6 10 14 18

#the function can be defined inside the apply function
#note the lack of curly brackets 
apply(mat1, 1, function(x) sum(x) + 2)
[1]  6 10 14 18
#generalizing the function to add any number to the sum
#add 3 to the row sums 
apply(mat1, 1, function(x, y) sum(x) + y, y=3)
[1]  7 11 15 19

#add 5 to the column sums
apply(mat1, 2, function(x, y) sum(x) + y, y=5)
[1] 15 15 15 15
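The margin argument is easiest to see on an array with more than two dimensions. The following is a small sketch, not part of the original page; the array arr and its contents are made up for illustration. It also checks that for plain row sums of mat1 the rowSums function gives the same values as apply.

```r
# hypothetical example data: a 2 x 3 x 4 array filled with 1..24
arr <- array(1:24, dim = c(2, 3, 4))

# margin = 3 keeps the third dimension: one sum per 2 x 3 slice
slice.sums <- apply(arr, 3, sum)
slice.sums
# 21 57 93 129

# margin = c(1, 2) keeps rows and columns and sums over the third dimension
cell.sums <- apply(arr, c(1, 2), sum)
dim(cell.sums)
# 2 3

# for row sums of a matrix, rowSums returns the same values as apply
mat1 <- matrix(rep(seq(4), 4), ncol = 4)
all.equal(apply(mat1, 1, sum), rowSums(mat1))
```

The margin names the dimensions that survive in the result; everything else is handed to the function in one piece.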
The lapply function

Applies a function to elements in a list or a vector and returns the results in a list.

lapply(list, function, ...)

The lapply function becomes especially useful when dealing with data frames. In R the data frame is considered a list and the variables in the data frame are the elements of the list. We can therefore apply a function to all the variables in a data frame by using the lapply function.
Note that unlike in the apply function there is no margin argument since we are just applying the function to each component of the list.

#creating a data frame using mat1
mat1.df <- data.frame(mat1)
mat1.df
  X1 X2 X3 X4
1  1  1  1  1
2  2  2  2  2
3  3  3  3  3
4  4  4  4  4

#in the data frame mat1.df the variables X1 - X4 are elements of the list mat1.df
#these variables can thus be accessed by lapply
is.list(mat1.df)
[1] TRUE

#obtaining the sum of each variable in mat1.df
lapply(mat1.df, sum)
$X1
[1] 10

$X2
[1] 10

$X3
[1] 10

$X4
[1] 10

Next we verify that the results are stored in a list, obtain the names of the elements in the result list and display the first element of the result list.

#storing the results of the lapply function in the list y
y <- lapply(mat1.df, sum)

#verifying that y is a list
is.list(y)
[1] TRUE

#names of the elements in y
names(y)
[1] "X1" "X2" "X3" "X4"

#displaying the first element
y[[1]]
[1] 10

y$X1
[1] 10

Just like in the apply function we can use any built in or user defined function and we can define the function to be used inside the lapply function.

#user defined function with multiple arguments 
#function defined inside the lapply function
#displaying the first two results in the list
y1 <- lapply(mat1.df, function(x, y) sum(x) + y, y = 5)
y1[1:2]
$X1
[1] 15

$X2
[1] 15

Another useful application of the lapply function is with a "dummy sequence". The list argument is the dummy sequence and it is only used to specify how many times we would like the function to be executed. When the lapply function is used in this way it can replace a for loop very easily.

#using the lapply function instead of the for loop
unlist(lapply(1:5, function(i) print(i) ))
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 1 2 3 4 5

#using the for loop
for(i in 1:5) print(i)
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
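The print example only shows the iteration itself. In practice a loop usually collects results, and that is where the dummy sequence pays off. The sketch below uses a made-up list xs and is meant as an illustration under that assumption, not as code from the original page.

```r
# hypothetical input: three numeric vectors of different lengths
xs <- list(a = 1:3, b = 4:9, c = 10)

# for loop: pre-allocate a list, then fill it element by element
res.loop <- vector("list", length(xs))
for (i in seq_along(xs)) {
  res.loop[[i]] <- mean(xs[[i]])
}

# lapply over the index sequence: the return values are collected for us
res.lapply <- lapply(seq_along(xs), function(i) mean(xs[[i]]))

# both approaches produce the same unnamed list
identical(res.loop, res.lapply)
```

Here seq_along(xs) plays the role of the dummy sequence; it is used instead of 1:length(xs) because it also behaves sensibly when xs is empty.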
The sapply function

Applies a function to elements in a list and returns the results in a vector, matrix or a list.

sapply(list, function, ..., simplify)

When the argument simplify=F then the sapply function returns the results in a list just like the lapply function. However, when the argument simplify=T, the default, then the sapply function returns the results in a simplified form if at all possible. If the results are all scalars then sapply returns a vector. If the results are all of the same length then sapply will return a matrix with a column for each element in list to which function was applied.

y2 <- sapply(mat1.df, function(x, y) sum(x) + y, y = 5)
y2
X1 X2 X3 X4 
15 15 15 15 
     
is.vector(y2)
[1] TRUE
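The example above only shows the vector case. The following sketch, which is not from the original page, illustrates the other two outcomes described for simplify, using the range function on the same data. The last call uses vapply, a stricter relative of sapply that is not covered in the original text and in which the expected result type is declared up front.

```r
mat1.df <- data.frame(matrix(rep(seq(4), 4), ncol = 4))

# each result has length 2, so sapply returns a 2 x 4 matrix
r.mat <- sapply(mat1.df, range)
dim(r.mat)
# 2 4

# simplify = FALSE keeps the list, just as lapply would
r.list <- sapply(mat1.df, range, simplify = FALSE)
is.list(r.list)

# vapply: like sapply, but fails if a result is not a single number
r.vec <- vapply(mat1.df, function(x) sum(x) + 5, numeric(1))
r.vec
```

Because the shape of the sapply result depends on what the function happens to return, vapply can be the safer choice inside functions.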
The tapply function

Applies a function to each cell of a ragged array.

tapply(array, indices, function, ..., simplify)

The function is applied to each of the cells defined by the categorical variables listed in the argument indices. If applying function to each cell yields a single number then the results are returned in a multi-way array which has as many dimensions as there are components in the argument indices. For example, if the argument indices = list(gender, employed) then the result will be a 2 by 2 matrix with rows defined by male, female and columns defined by employed, unemployed. If the results are not single values then they are returned in a list with a dim attribute, which means that it prints like a list but the user accesses the components by using subscripts as in an array.

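#creating the data set with two categorical variables
#x1 is random: the values printed here come from a different run than
#the data frame and the tapply results shown further down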
x1 <- runif(16)
x1
 [1] 0.83189832 0.93558412 0.59623797 0.71544196 0.79925238 0.44859140
 [7] 0.03347409 0.62955913 0.97507841 0.71243195 0.58639700 0.43562781
[13] 0.23623549 0.97273216 0.72284040 0.25412129

cat1 <- rep(1:4, 4)
cat1
 [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4

cat2 <- c(rep(1, 8), rep(2, 8))
cat2
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2

mat2.df <- data.frame(x1)
names(mat2.df) <- c("x1")
mat2.df$cat1 <- cat1
mat2.df$cat2 <- cat2
mat2.df
          x1 cat1 cat2 
 1 0.9574315    1    1
 2 0.1163076    2    1
 3 0.6661923    3    1
 4 0.8265729    4    1
 5 0.6701039    1    1
 6 0.1478860    2    1
 7 0.8537499    3    1
 8 0.9993158    4    1
 9 0.4189768    1    2
10 0.8830733    2    2
11 0.6114867    3    2
12 0.3111015    4    2
13 0.8834808    1    2
14 0.3606836    2    2
15 0.7056246    3    2
16 0.8052925    4    2

tapply(mat2.df$x1, mat2.df$cat1, mean)
         1         2         3         4 
 0.7324982 0.3769876 0.7092634 0.7355707

tapply(mat2.df$x1, list(mat2.df$cat1, mat2.df$cat2), mean)
          1         2 
1 0.8137677 0.6512288
2 0.1320968 0.6218785
3 0.7599711 0.6585556
4 0.9129443 0.5581970
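The gender and employed case mentioned in the description can be sketched as follows. The variables income, gender and employed and their values are made up for illustration. The second call shows what happens when the function returns more than one number per cell.

```r
# hypothetical data: 8 observations and two categorical variables
income   <- c(10, 20, 30, 40, 50, 60, 70, 80)
gender   <- rep(c("male", "female"), each = 4)
employed <- rep(c("employed", "unemployed"), times = 4)

# scalar result per cell: a 2 x 2 matrix (rows = gender, columns = employed)
cell.means <- tapply(income, list(gender, employed), mean)
cell.means

# non-scalar result per cell: a list with a dim attribute
cell.ranges <- tapply(income, list(gender, employed), range)
dim(cell.ranges)
# 2 2

# components are accessed with array-style subscripts
cell.ranges[["male", "employed"]]
# 10 30
```

Note that the grouping variables are passed as a list; combining them with c() would simply concatenate them into one long vector.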
The sweep function

The sweep function returns an array like the input array with stats swept out.

sweep(array, margin, stats, function, ...)

The input array can have any number of dimensions. The stats argument is a vector containing the summary statistics of array which are to be "swept" out. The argument margin specifies which dimensions of array correspond to the summary statistics in stats. If array is a matrix then margin=1 refers to the rows and stats has to contain row summary statistics; margin=2 refers to the columns and stats then has to contain column summary statistics. The function argument specifies which function is to be used in the "sweep" operation; most often this is either "/" or "-".

#creating the data set
a <- matrix(runif(100, 1, 2),20)
a.df <- data.frame(a)
#subtract column means from each column
#centering each column around mean
colMeans(a)
[1] 1.470437 1.497412 1.553592 1.454150 1.613789

a1 <- sweep(a, 2, colMeans(a), "-")
a1[1:5,  ]
           [,1]        [,2]       [,3]        [,4]        [,5] 
[1,] -0.4285240 -0.42115156 -0.1069188 -0.07067987 -0.46239745
[2,] -0.4087171 -0.26986445  0.1232513 -0.38962010  0.35919248
[3,]  0.5000310 -0.42625368 -0.3284257 -0.42179545  0.06797352
[4,] -0.4324198 -0.22340951  0.2689937  0.01781444 -0.01896476
[5,]  0.5256995  0.06979928 -0.3596666 -0.03256511 -0.25716789

colMeans(a1)
[1] -4.440892e-017  4.440892e-017  8.881784e-017  8.881784e-017  8.881784e-017

#dividing each column by sum
a2 <- sweep(a, 2, colSums(a), "/")
a2[1:5,  ]
           [,1]       [,2]       [,3]       [,4]       [,5] 
[1,] 0.03542869 0.03593735 0.04655898 0.04756972 0.03567355
[2,] 0.03610219 0.04098897 0.05396666 0.03660317 0.06112886
[3,] 0.06700280 0.03576699 0.03943012 0.03549684 0.05210602
[4,] 0.03529621 0.04254014 0.05865716 0.05061254 0.04941242
[5,] 0.06787562 0.05233066 0.03842467 0.04888027 0.04203217

#centering each row around the mean of the row
rowMeans(a)[1:5]
[1] 1.219942 1.400724 1.396182 1.440279 1.507096

a3 <- sweep(a, 1, rowMeans(a), "-")
a3[1:5,  ]
           [,1]        [,2]       [,3]        [,4]        [,5] 
[1,] -0.1780286 -0.14368139  0.2267312  0.16352895 -0.06855023
[2,] -0.3390045 -0.17317704  0.2761186 -0.33619404  0.57225694
[3,]  0.5742861 -0.32502378 -0.1710159 -0.36382689  0.28558047
[4,] -0.4022616 -0.16627648  0.3823066  0.03168612  0.15454532
[5,]  0.4890407  0.06011528 -0.3131707 -0.08551046 -0.15047484

rowMeans(a3)[1:5]
[1]  0.000000e+000 -4.440892e-017 -8.881784e-017  4.440892e-017 -4.440892e-017
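Subtracting and dividing can be chained to standardize columns. The sketch below uses a small made-up matrix m with known column means and standard deviations, so the result is easy to check by hand; the comparison with scale is an addition and not part of the original page.

```r
# hypothetical 3 x 2 matrix: column means 2 and 20, column sds 1 and 10
m <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)

# step 1: sweep out the column means with "-"
centered <- sweep(m, 2, colMeans(m), "-")

# step 2: sweep out the column standard deviations with "/"
scaled <- sweep(centered, 2, apply(m, 2, sd), "/")
scaled
# both columns become -1, 0, 1

# scale() performs the same two steps in a single call
all.equal(scaled, scale(m), check.attributes = FALSE)
```

The check.attributes = FALSE argument is needed because scale attaches the means and standard deviations it used as attributes of its result.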
The column functions

There is a suite of functions whose sole purpose is to compute summary statistics over the columns of matrices, arrays and data frames. These functions include:
colMeans
colSums

#creating the data set
a <- matrix(runif(100, 1, 2), 20)
#rebuilding the data frame so that it matches the new matrix a
a.df <- data.frame(a)
a.df[1:5, ]
        X1       X2       X3       X4       X5
1 1.533694 1.058162 1.739173 1.539331 1.523406
2 1.234076 1.305300 1.621082 1.274907 1.518986
3 1.628392 1.589093 1.067717 1.168978 1.538356
4 1.987724 1.900699 1.271701 1.022540 1.381527
5 1.252216 1.155357 1.441486 1.274234 1.550317

#Get columns means using columns function
#input is the matrix a, results in a vector
col.means <- colMeans(a)
col.means
[1] 1.461788 1.470676 1.451375 1.378107 1.555699

is.vector(col.means)
[1] TRUE
The row functions

There is a suite of functions whose sole purpose is to compute summary statistics over the rows of matrices, arrays and data frames. These functions include:
rowMeans
rowSums

#Get row means using row functions
#input is the matrix a, results are in a vector
row.means <- rowMeans(a)
row.means[1:5]
[1] 1.478753 1.390870 1.398507 1.512838 1.334722

is.vector(row.means)
[1] TRUE
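The row and column functions also take an na.rm argument, which is not discussed above but is often needed with real data. A minimal sketch with a made-up matrix m containing one missing value:

```r
# hypothetical 2 x 3 matrix with one missing value
m <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 2)
m

colSums(m)                  # the column containing NA sums to NA
colSums(m, na.rm = TRUE)    # NA dropped before summing: 3 4 11
rowMeans(m, na.rm = TRUE)   # means over the non-missing values: 3 4

# the same functions accept a numeric data frame
colMeans(data.frame(m), na.rm = TRUE)
```

With na.rm = TRUE each row or column statistic is computed over the values that are present, so different rows or columns may be based on different numbers of observations.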
Miscellaneous

It is important to realize that there are usually many different ways of obtaining the same results and that these methods differ in efficiency and other details. The following examples show three different methods for obtaining the column means and how the structure of the results differs.

#Get columns means using columns function
#input is the matrix a, results in a vector
col.means1 <- colMeans(a)
col.means1
[1] 1.470437 1.497412 1.553592 1.454150 1.613789

is.vector(col.means1)
[1] TRUE

#get column means using apply 
#input is a matrix, results in a vector
col.means2 <- apply(a, 2, mean)
col.means2
[1] 1.470437 1.497412 1.553592 1.454150 1.613789

is.vector(col.means2)
[1] TRUE

#get column means using lapply 
#input is the data frame which is a list, results are in a list
col.means3 <- lapply(a.df, mean)
col.means3
$X1
[1] 1.470437

$X2
[1] 1.497412

$X3
[1] 1.553592

$X4
[1] 1.45415

$X5
[1] 1.613789

is.list(col.means3)
[1] TRUE

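The following examples show three different methods for obtaining the row means and how the structure of the results differs. Note that the values printed below were produced in separate runs with different random draws of a; on the same matrix all three methods give the same numbers.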

#Get row means using row functions
#input is the matrix a, results are in a vector
row.means1 <- rowMeans(a)
row.means1[1:5]
[1] 1.373179 1.550485 1.489238 1.599728 1.522427

is.vector(row.means1)
[1] TRUE

#using apply, input is a matrix, results are in a vector
row.means2 <- apply(a, 1, mean)
row.means2[1:5]
[1] 1.606348 1.350039 1.601302 1.631221 1.616117

is.vector(row.means2)
[1] TRUE

#we can transpose the data frame and create a new data frame
ta.df <- data.frame(t(a.df))

#use lapply on the data frame since it is a list 
#results are in a list
row.means3 <- lapply(ta.df, mean)
row.means3[1:5]
$X1
[1] 1.219942

$X2
[1] 1.400724

$X3
[1] 1.396182

$X4
[1] 1.440279

$X5
[1] 1.507096

is.list(row.means3)
[1] TRUE
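That the three methods agree in value and differ only in the container can be checked directly. The sketch below fixes the random seed so that it is reproducible; the seed value and the names cm1 to cm3, rm1 and rm2 are arbitrary choices made for this illustration.

```r
set.seed(1)   # arbitrary seed, only to make the run reproducible
a <- matrix(runif(100, 1, 2), 20)
a.df <- data.frame(a)

cm1 <- colMeans(a)          # numeric vector
cm2 <- apply(a, 2, mean)    # numeric vector
cm3 <- lapply(a.df, mean)   # named list

# same numbers, different containers
all.equal(cm1, cm2)
all.equal(cm1, unlist(cm3), check.attributes = FALSE)

# the same holds for the row means
rm1 <- rowMeans(a)
rm2 <- apply(a, 1, mean)
all.equal(rm1, rm2)
```

The comparisons use all.equal rather than identical because the functions may accumulate their sums slightly differently, which can lead to differences in the last few digits.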

Any of the functions mentioned above can be used inside a user defined function. In the following example the function f1 multiplies the sequence 1:x by y by using the lapply function instead of a for loop.

f1 <- function(x, y) {
	return(lapply(1:x, function(a, b) b*a, b=y ))
}
#multiplying the sequence 1:3 by 2
f1(3, 2)
[[1]]
[1] 2

[[2]]
[1] 4

[[3]]
[1] 6

#multiplying the sequence 1:4 by 10
f1(4, 10)
[[1]]
[1] 10

[[2]]
[1] 20

[[3]]
[1] 30

[[4]]
[1] 40
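For this particular task no apply function is needed at all, since multiplication in R is already vectorized. The following sketch compares f1 with a vectorized variant; the name f2 is made up for this illustration.

```r
# f1 as defined above: returns a list with one product per element
f1 <- function(x, y) {
  lapply(1:x, function(a, b) b * a, b = y)
}

# hypothetical vectorized variant: one multiplication, numeric vector out
f2 <- function(x, y) {
  y * seq_len(x)
}

f2(3, 2)
# 2 4 6

# same values as f1 once the list is flattened
all.equal(unlist(f1(4, 10)), f2(4, 10))
```

The vectorized form returns a plain numeric vector rather than a list, and seq_len(x) gives an empty result for x = 0, where 1:x would count down from 1 to 0.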

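The lapply function can be used in many clever ways. In the following example two lists of random vectors with lengths 1 to 6 are created, and lapply is then used with an index sequence to add the two lists element by element.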

list1 <- lapply(1:6, runif)
list1
[[1]]
[1] 0.796063

[[2]]
[1] 0.5456884 0.8709621

[[3]]
[1] 0.6957483 0.2939853 0.3849384

[[4]]
[1] 0.77135125 0.14607271 0.58522428 0.09303452

[[5]]
[1] 0.6610859 0.2150228 0.1366291 0.0921625 0.6901002

[[6]]
[1] 0.1368554 0.2575741 0.1218799 0.6293610 0.4628676 0.8303309

list2 <- lapply(1:6, runif)
list2
[[1]]
[1] 0.444944

[[2]]
[1] 0.1236448 0.1709105

[[3]]
[1] 0.01409603 0.76272480 0.65504591

[[4]]
[1] 0.7724182 0.7856118 0.2360862 0.2319794

[[5]]
[1] 0.1785266 0.5621904 0.3170615 0.2320846 0.6087983

[[6]]
[1] 0.67940923 0.99266570 0.05010323 0.50740777 0.11782769 0.71910324


lapply(1:6, function(i, x, y) x[[i]] + y[[i]],
       x = list1, y = list2)
[[1]]
[1] 1.241007

[[2]]
[1] 0.6693333 1.0418726

[[3]]
[1] 0.7098443 1.0567101 1.0399843

[[4]]
[1] 1.5437694 0.9316845 0.8213105 0.3250139

[[5]]
[1] 0.8396126 0.7772132 0.4536906 0.3242471 1.2988985

[[6]]
[1] 0.8162646 1.2502398 0.1719831 1.1367687 0.5806953 1.5494342
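When two lists are walked in parallel like this, the base R function Map offers a variant without an explicit index. This is a sketch and not part of the original page; the seed and the names by.index and by.map are arbitrary choices.

```r
set.seed(42)   # arbitrary seed, only to make the run reproducible
list1 <- lapply(1:6, runif)
list2 <- lapply(1:6, runif)

# index-based version, as in the example above
by.index <- lapply(1:6, function(i, x, y) x[[i]] + y[[i]],
                   x = list1, y = list2)

# Map applies `+` to the first elements, then the second elements, and so on
by.map <- Map(`+`, list1, list2)

all.equal(by.index, by.map)
```

Both versions return a list of six numeric vectors with lengths 1 to 6.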