您的位置:首页 > 其它

R Programming: Part 3 - Code Correctly and Efficiently

2016-07-13 23:45 225 查看

Part 3 - Code Correctly and Efficiently

Part 3 - Code Correctly and Efficiently
Loop Functions on Command Line
Lapply Sapply Vapply

Apply

Mapply

Tapply Split

Debugging Profiling
Debugging Tools

Code Profiling

An Example with R Features

Loop Functions on Command Line

Writing for and while loops is useful when programming but not very easy when working interactively on the command line. There are some functions which implement looping in one line to make life easier.

lapply: loop over a list and implement a function on each element in it

sapply: same as lapply but try to simplify the result

vapply: same as sapply but safer for specifying the expected format

apply: apply a function over the margins of an array

mapply: multivariate version of lapply

tapply: apply a function over subsets of a vector

split: an auxiliary function, particularly useful when in conjunction with lapply

Lapply, Sapply & Vapply

The function lapply(X, FUN, …) works on three arguments:

a list X (if not, it will be coerced into a list);

a function name (or a function body) FUN;

other arguments that FUN needs.

and it always returns a list, preserving the name of each element but not the class.

> x <- list(a = 1:4, b = rnorm(100))
> lapply(x, mean)
$a
[1] 2.5
$b
[1] -0.1094178
## b is a vector of 100 variates that follow N(0, 1)


> x <- 1:4                             ## x will be coerced into a list
> lapply(x, runif, min = 0, max = 10)  ## Generate variates that follow uniform distribution in [0, 10]
[[1]]
[1] 8.293113
[[2]]
[1] 5.702094 1.158094
[[3]]
[1] 4.625228 3.371865 7.122826
[[4]]
[1] 6.150899 9.011356 4.353911 7.532430


Lapply can be used with anonymous functions, which are temporarily defined inside lapply and no longer exist when lapply finishes.

Suppose I want to extract the first column of each element in list x:

> x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))
> x
$a
[,1] [,2]
[1,]    1    3
[2,]    2    4
$b
[,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

> lapply(x, function(arg) {arg[, 1]})   ## An anonymous function with one argument "arg"
$a
[1] 1 2
$b
[1] 1 2 3


Sapply will try to simplify the result of lapply if possible.

If the result is a list where every element has length 1, then a vector is returned.

If the result is a list where every element is a vector of the same length (> 1), then a matrix is returned.

If it cannot figure things out, a list is returned.

> x <- list(1:4, 2:5, 3:6)
> lapply(x, mean)
[[1]]
[1] 2.5
[[2]]
[1] 3.5
[[3]]
[1] 4.5
> sapply(x, mean)
[1] 2.5 3.5 4.5


Vapply is a safer version of sapply, which explicitly specifies the expected format of each element in the resulted list. If the result does not match the format, an error will be thrown.

> x <- list(1:4, 2:5, 3:6)
> sapply(x, mean)
[1] 2.5 3.5 4.5
> vapply(x, mean, numeric(1))   ## Each element is a numeric vector of length 1
[1] 2.5 3.5 4.5
> vapply(x, mean, character(1))
Error in vapply(x, mean, character(1)) : 值的种类必需是'character',
但FUN(X[[1]])结果的种类却是'double'


Apply

Apply is another loop function that implements a function over the margins of an array. Though it is somehow no faster than a for loop for R nowadays, the reason of using apply is that it involves less typing, and good programmers are always lazy. LOL.

The function apply(X, DIM, FUN, …) works on four arguments:

an array X;

an integer vector indicating indices of dimensions which will be worked on;

a function name (or a function body) FUN;

other arguments that FUN needs.

> x <- matrix(1:12, 3, 4)
> apply(x, 2, mean)     ## Compute the mean for each column (2nd dimension)
[1]  2  5  8 11
> apply(x, 1, sum)      ## Compute the sum for each row (1st dimension)
[1] 22 26 30


For sums and means of matrices, we have some shortcuts which are much faster:

rowSums = apply(x, 1, sum)

rowMeans = apply(x, 1, mean)

colSums = apply(x, 2, sum)

colMeans = apply(x, 2, mean)

An apply example passing the … argument.

Quantile(x, probs = seq(0, 1, 0.25)) function gives quantiles of the numeric vector x according to the probs sequence, which will be quartiles by default (0%, 25%, 50%, 75%, 100%).

> x <- matrix(1:12, 3, 4)
> apply(x, 1, quantile, probs = c(0.25, 0.75))    ## 1/4 and 3/4 quantiles of each row
[,1] [,2] [,3]
25% 3.25 4.25 5.25
75% 7.75 8.75 9.75


Apply can be used for multi-dimension arrays as well.

Arrays can be created by array(data = NA, dim = length(data), dimnames = NA, …), where dim is an integer vector giving maximal indices of each dimension and dimnames must be a list giving names to each dimension.

> x <- array(1:40, c(2, 2, 10))
> apply(x, c(1, 2), mean)   ## Compute the average matrix
[,1] [,2]
[1,]   19   21
[2,]   20   22
> rowMeans(x, dims = 2)     ## Perform the same task
[,1] [,2]
[1,]   19   21
[2,]   20   22


If the dims argument is specified for rowSums/rowMeans function, which is 1 by default, the computation will run over all dimensions from dim + 1 thereafter. Similarly, the computation will run over dimensions 1:dims if dims is specified for colSums/colMeans.

Mapply

Mapply is a multivariate version of lapply, and its argument list mapply(FUN, …, MoreArgs = NULL, …) is a little bit different.

a function FUN to apply;

arguments (lists) to apply over;

other arguments needed for FUN;

other options including simplify and use.names.

Mapply will also return a list, and the result can be simplified as well if you set simplify = TRUE.

With mapply, we can simplify the implementation

> list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1))


as follows:

> mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1
[[2]]
[1] 2 2 2
[[3]]
[1] 3 3
[[4]]
[1] 4


So mapply can be used to achieve function vectorization, which has performances similar to vectorized operations, implementing element-by-element in the containers.

> rnorm(5, 5, 2)
[1] 6.077207 1.981472 3.273849 6.054433 4.204565
> rnorm(1:5, 1:5, 2)             ## Does not work!
[1]  1.988842  4.085437 -0.836809  3.073536  1.620521
> mapply(rnorm, 1:5, 1:5, 2)     ## Function vectorization
[[1]]
[1] 0.8973886
[[2]]
[1] 3.320832 2.924064
[[3]]
[1] 5.586013 4.051715 5.002757
[[4]]
[1] 0.8749286 1.6149667 3.9048797 4.8861302
[[5]]
[1] 6.030623 5.734222 4.671075 6.140077 6.166316


Tapply & Split

Tapply is used to apply a function over subsets of a vector.

Argument list: tapply(X, INDEX, FUN, …, SIMPLIFY = TRUE)

X is a vector;

INDEX is a factor or a list of factors (or else be coerced);

FUN is the function to be applied;

… contains other arguments to be passed to FUN;

SIMPLIFY is TRUE by default.

Gl() function (generate factor levels) can be used to label elements in the vector. Gl(n, k) generates a factor with n levels and k replications in each level.

> x <- c(rnorm(6), runif(6), rbinom(6, 1, 0.5))
> f <- gl(3, 6)
> f
[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3
Levels: 1 2 3
> tapply(x, f, mean)
1          2          3
-0.5819306  0.5209379  0.6666667


Split, a bit similar to tapply, takes a vector and splits it into groups determined by a factor or list of factors.

Argument list: split(x, f, DROP = FALSE, …)

x is a vector or list or data frame;

f is a factor (or coerced one) or list of factors;

DROP indicates whether or not empty levels should be dropped.

Split function returns a list as well, so split is often followed by lapply/sapply.

> x <- c(rnorm(6), runif(6), rbinom(6, 1, 0.5))
> f <- gl(3, 6)
> split(x, f)
$`1`
[1]  1.9366825  0.7792120  0.2007633 -0.6015537 -0.6651212  0.1234564
$`2`
[1] 0.8921660 0.6403586 0.4449109 0.1195821 0.1590710 0.5091350
$`3`
[1] 0 1 0 1 1 1

> lapply(split(x, f), mean)   ## Actually tapply works the same
$`1`
[1] 0.2955732
$`2`
[1] 0.4608706
$`3`
[1] 0.6666667


Split is particularly useful to split a data frame apart by some indices.

An instructive case follows.

## Load data of everyday air quality in New York from May to September in 1973
> library(datasets)
> head(airquality)
Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

## Now I’d like to compute the mean of the first three indices in different months
> s <- split(airquality, airquality$Month)        ## airquality$month coerced into a factor
> lapply(s, function(x) {colMeans(x[, 1:3])})     ## Also fine if subsetting it by name
$`5`
Ozone  Solar.R     Wind
NA       NA 11.62258
$`6`
Ozone   Solar.R      Wind
NA 190.16667  10.26667
$`7`
Ozone    Solar.R       Wind
NA 216.483871   8.941935
$`8`
Ozone  Solar.R     Wind
NA       NA 8.793548
$`9`
Ozone  Solar.R     Wind
NA 167.4333  10.1800

## Simplify the result into a matrix
> sapply(s, function(x) {colMeans(x[, 1:3])})
5         6          7        8        9
Ozone         NA        NA         NA       NA       NA
Solar.R       NA 190.16667 216.483871       NA 167.4333
Wind    11.62258  10.26667   8.941935 8.793548  10.1800

## Remove the NA values and recompute the means
> sapply(s, function(x) {colMeans(x[, 1:3], na.rm = TRUE)})
5         6          7          8         9
Ozone    23.61538  29.44444  59.115385  59.961538  31.44828
Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
Wind     11.62258  10.26667   8.941935   8.793548  10.18000


If there are more than one level for each element, interaction(f1, f2) can combine two factors to create a new one. But for split function, you just need to pass f1 and f2 as a list of factors, and the interaction() function will be automatically called.

> x <- rnorm(10)
> f1 <- gl(2, 5)
> f2 <- gl(5, 2)
> f1
[1] 1 1 1 1 1 2 2 2 2 2
Levels: 1 2
> f2
[1] 1 1 2 2 3 3 4 4 5 5
Levels: 1 2 3 4 5
> interaction(f1, f2)
[1] 1.1 1.1 1.2 1.2 1.3 2.3 2.4 2.4 2.5 2.5
Levels: 1.1 2.1 1.2 2.2 1.3 2.3 1.4 2.4 1.5 2.5

> str(split(x, list(f1, f2)))
List of 10
$ 1.1: num [1:2] 0.927 -0.557
$ 2.1: num(0)
$ 1.2: num [1:2] -1.754 0.614
$ 2.2: num(0)
$ 1.3: num -1.48
$ 2.3: num -0.475
$ 1.4: num(0)
$ 2.4: num [1:2] 0.0462 -1.1802
$ 1.5: num(0)
$ 2.5: num [1:2] -2.92 1.17

> str(split(x, list(f1, f2), drop = TRUE))  ## Empty levels will be dropped
List of 6
$ 1.1: num [1:2] 0.927 -0.557
$ 1.2: num [1:2] -1.754 0.614
$ 1.3: num -1.48
$ 2.3: num -0.475
$ 2.4: num [1:2] 0.0462 -1.1802
$ 2.5: num [1:2] -2.92 1.17


Debugging & Profiling

Debugging Tools

Basic indications of problem occurrences:

message: a generic notification, produced by message() function, execution of the function continues.

warning: an indication that something is wrong but not necessarily fatal, produced by warning() function, execution continues.

error: an indication that a fatal problem occurs, produced by stop() function, execution stops.

printmessage <- function(x)
{
if (x > 0) "positive"
else  "not positive"
}

> x <- log(-1)
Warning message:
In log(-1) : 产生了NaNs
> printmessage(x)
Error in if (x > 0) "positive" else "not positive" :
需要TRUE/FALSE值的地方不可以用缺少值


Basic debugging functions in R:

traceback: prints out the function call stack after an error occurs

debug: flags a function for debug mode which allows you to follow the function execution line-by-line

browser: suspends the execution of a function wherever it is called and sets it to debug mode

trace: allows you to insert debugging code into a function

recover: allows you to modify the error behavior so that you can browse the function call stack

> lm(y ~ x)
Error in eval(expr, envir, enclos) : 找不到对象'y'
> traceback()
7: eval(expr, envir, enclos)
6: eval(predvars, data, env)
5: model.frame.default(formula = y ~ x, drop.unused.levels = TRUE)
4: stats::model.frame(formula = y ~ x, drop.unused.levels = TRUE)
3: eval(expr, envir, enclos)
2: eval(mf, parent.frame())
1: lm(y ~ x)


> debug(lm)
> lm(y ~ x)
debugging in: lm(y ~ x)
debug: {
ret.x <- x
ret.y <- y
cl <- match.call()
mf <- match.call(expand.dots = FALSE)
…          ## The rest of the function body omitted
z
}
Browse[2]> n   ## Input n to debug the next line
debug: ret.x <- x
Browse[2]> n
debug: ret.y <- y
...
Browse[2]> n
Error in eval(expr, envir, enclos) : 找不到对象'y'


> options(error = recover)     ## Global setting
> read.csv("nosuchfile")
Error in file(file, "rt") : 无法打开链结
此外: Warning message:
In file(file, "rt") : 无法打开文件'nosuchfile': No such file or directory

Enter a frame number, or 0 to exit
## Browse the environments of functions in the call stack selectively

1: read.csv("nosuchfile")
2: read.table(file = file, header = header, sep = sep, quote = quote, dec
3: file(file, "rt")

Selection:


Debugging tools are not a substitute for thinking and might be a bad habit to get into.

You can get pretty far in R without using any of these functions as well, and meanwhile might become a more intelligent programmer.

Code Profiling

Profiling is a systematic way to examine how much time is spent on different parts of a program. It can be useful in code optimization.

But do design first, then optimize.

Premature optimization is the root of all evil. —- Donald Knuth

(1). System.time() function

System.time() function takes an arbitrary R expression as input (or program parts wrapped in curly braces) and returns the amount of time (in seconds) taken to evaluate the expression.

If there is an error, it returns the time till the error occurs.

The returned time is an object of class proc_time.

Note: not to confuse with Sys.time(), a POSIXct object representing the current time.

User time refers to the time the CPU(s) take to evaluate the expression.

Elapsed time refers to the time the user waits for the expression to be evaluated.

User time and elapsed time are not necessarily the same.

Elapsed time may be greater than user time, if the CPU spends a lot of time waiting other tasks to be finished.

Elapsed time may be less than user time, if your machine has multiple cores or processors.

Basic R programs do not use multiple cores, but they may link to libraries which do use them, e.g. BLAS (basic linear algebra standard) library or parallel processing libraries.

> system.time(readLines("http://www.baidu.com/"))
用户  系统  流逝
0.004 0.002 0.020

> system.time({
r <- numeric(1000)
for (i in 1:1000)
r[i] <- mean(rnorm(1000))
})
用户  系统  流逝
0.107 0.004 0.113


System.time() is useful when we know which part of the program takes extraordinarily long time. But what if we have no idea which part?

(2). The R Profiler

The Rprof() function Is used to enable and disable code profiling.

The summaryRprof() function summarizes the output from Rprof() and presents it in a readable way.

Rprof(filename) keeps track of the function call stack during the program execution at a regular sampling interval, and writes the profiling results to the file “filename”.

By default, the sampling interval is set to 0.02 seconds, and the filename is “Rprof.out”.

Setting filename to NULL disables profiling.

> x <- rnorm(2000000)
> y <- rnorm(2000000, 1)
> Rprof("lm_rprof.out")   ## Enable profiling
> lm(y ~ x)               ## Fitting a linear model for y and x

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
0.9995215   -0.0008899

> Rprof(NULL)             ## Disable profiling


then in the file “lm_rprof.out”, we will see

sample.interval=20000
"na.omit" ".External2" "model.frame.default" "stats::model.frame" "eval" "eval" "lm"
"na.omit" ".External2" "model.frame.default" "stats::model.frame" "eval" "eval" "lm"
"na.omit" ".External2" "model.frame.default" "stats::model.frame" "eval" "eval" "lm"
"na.omit" ".External2" "model.frame.default" "stats::model.frame" "eval" "eval" "lm"
"is.na" "na.omit.data.frame" "na.omit" ".External2" "model.frame.default" "stats::model.frame" "eval" "eval" "lm"
"is.na" "na.omit.data.frame" "na.omit" ".External2" "model.frame.default" "stats::model.frame" "eval" "eval" "lm"
"na.omit.data.frame" "na.omit" ".External2" "model.frame.default" "stats::model.frame" "eval" "eval" "lm"
"[.data.frame" "[" "na.omit.data.frame" "na.omit" ".External2" "model.frame.default" "stats::model.frame" "eval" "eval" "lm"
"[.data.frame" "[" "na.omit.data.frame" "na.omit" ".External2" "model.frame.default" "stats::model.frame" "eval" "eval" "lm"
"[.data.frame" "[" "na.omit.data.frame" "na.omit" ".External2" "model.frame.default" "stats::model.frame" "eval" "eval" "lm"
"[.data.frame" "[" "na.omit.data.frame" "na.omit" ".External2" "model.frame.default" "stats::model.frame" "eval" "eval" "lm"
"[.data.frame" "[" "na.omit.data.frame" "na.omit" ".External2" "model.frame.default" "stats::model.frame" "eval" "eval" "lm"
"[.data.frame" "[" "na.omit.data.frame" "na.omit" ".External2" "model.frame.default" "stats::model.frame" "eval" "eval" "lm"
## The following profiling results are simplified on purpose.
4  * "anyDuplicated.default" "anyDuplicated" "[.data.frame" "[" "na.omit.data.frame" "na.omit" ".External2" "model.frame.default" "stats::model.frame" "eval" "eval" "lm"
2  * "na.omit.data.frame" "na.omit" ".External2" "model.frame.default" "stats::model.frame" "eval" "eval" "lm"
1  * "model.response" "lm"
84 * "as.character" "model.response" "lm"
22 * ".External2" "model.matrix.default" "model.matrix" "lm"
1  * "lazyLoadDBfetch" "lm"
18 * "lm.fit" "lm"
11 * "rep.int" "lm.fit" "lm"
1  * "lm.fit" "lm"


Every 0.02 seconds, the R profiler reports the function call chain in the stack (whose top is on the right side). To make the profiling results more readable, we call the function summaryRprof().

SummaryRprof() returns a list of four elements named by.self, by.total, sample.interval and sampling.time. Sample.interval is the interval specified between two reports of the function call stack and sampling.time is the total time spent on profiling.

By.total and by.self are two ways to normalize the profiling data.

By.total format considers the time spent on a function as its total run time, namely, the clock will continue even if it calls another function.

By.self format considers it as substracting the time spent on deeper function calls from the total run, namely, the clock stops once it calls another function, and continues when the call finishes.

> summaryRprof("lm_rprof.out")
$by.self
self.time self.pct total.time total.pct
"as.character"               1.68    53.50       1.68     53.50
".External2"                 0.44    14.01       0.82     26.11
"lm.fit"                     0.38    12.10       0.60     19.11
"rep.int"                    0.22     7.01       0.22      7.01
"[.data.frame"               0.12     3.82       0.20      6.37
"na.omit"                    0.08     2.55       0.38     12.10
"anyDuplicated.default"      0.08     2.55       0.08      2.55
"na.omit.data.frame"         0.06     1.91       0.30      9.55
"is.na"                      0.04     1.27       0.04      1.27
"model.response"             0.02     0.64       1.70     54.14
"lazyLoadDBfetch"            0.02     0.64       0.02      0.64

$by.total
total.time total.pct self.time self.pct
"lm"                          3.14    100.00      0.00     0.00
"model.response"              1.70     54.14      0.02     0.64
"as.character"                1.68     53.50      1.68    53.50
".External2"                  0.82     26.11      0.44    14.01
"lm.fit"                      0.60     19.11      0.38    12.10
"model.matrix.default"        0.44     14.01      0.00     0.00
"model.matrix"                0.44     14.01      0.00     0.00
"na.omit"                     0.38     12.10      0.08     2.55
"eval"                        0.38     12.10      0.00     0.00
"model.frame.default"         0.38     12.10      0.00     0.00
"stats::model.frame"          0.38     12.10      0.00     0.00
"na.omit.data.frame"          0.30      9.55      0.06     1.91
"rep.int"                     0.22      7.01      0.22     7.01
"[.data.frame"                0.20      6.37      0.12     3.82
"["                           0.20      6.37      0.00     0.00
"anyDuplicated.default"       0.08      2.55      0.08     2.55
"anyDuplicated"               0.08      2.55      0.00     0.00
"is.na"                       0.04      1.27      0.04     1.27
"lazyLoadDBfetch"             0.02      0.64      0.02     0.64

$sample.interval
[1] 0.02

$sampling.time
[1] 3.14


C or Fortran code executed through R code cannot be profiled.

An Example with R Features

This example is intended to write an R function which is able to cache potentially time-consuming computations. For example, taking the mean of a numeric vector is typically a fast operation. However, for a very long vector, it may take too long to compute the mean, especially if it has to be computed repeatedly (e.g. in a loop). If the contents of a vector are not changed, it does make sense to cache the value of the mean so that when we need it again, it can be looked up in the cache rather than recomputed.

The following example will take advantage of the scoping rules of the R language and how they can be manipulated to preserve state inside of an R object.

We introduce the <<- operator which can be used to assign a value to an object in an environment that is different from the current environment. Below are two functions that are used to create a special object that stores a numeric vector and caches its mean.

The first function makeVector() creates a special “vector”, which is in fact a list containing 4 functions to set/get the value of the vector and set/get the value of the mean respectively.

makeVector <- function(x = numeric()) {
m <- NULL
set <- function(y) {
x <<- y
m <<- NULL
}
get <- function() x
setmean <- function(mean) m <<- mean
getmean <- function() m
list(set = set, get = get,
setmean = setmean,
getmean = getmean)
}


The following function calculates the mean of the special “vector” created with the above function. However, it first checks to see if the mean has already been calculated. If so, it gets the mean from the cache and skips the computation. Otherwise, it calculates the mean of the data and sets the value of the mean in the cache via the setmean function.

cachemean <- function(x, ...) {
m <- x$getmean()
if (!is.null(m)) {
message("getting cached data")
return(m)
}
data <- x$get()
m <- mean(data, ...)
x$setmean(m)
m
}
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签:  R 笔记