Statistics doesn’t have to be so hard! Resampling in R and SAS
2014-11-18 15:54
591 查看
Example 2014.13: Statistics doesn’t have to be so hard! Resampling in R and SAS
November 17, 2014By
Nick Horton
(This article was first published on SAS and R, and kindly contributed to
R-bloggers)
A recent
post pointed us to a great
talk that elegantly described how inferences from a trial could be analyzed with a purely resampling-based approach. The talk uses data from a
paper that considered the association between beer consumption and mosquito attraction. We recommend the talk linked above for those thinking about creative ways to teach inference.
In this entry, we demonstrate how straightforward it is to replicate these analyses in R, and show how they can be done in SAS.
R
We'll repeat the exercise twice in R: first using the mosaic package that Nick and colleagues have been developing to help teach statistics, and then in base R.
For mosaic, we begin by entering the data and creating a dataframe. The do() operator and the shuffle() function facilitate carrying out a
permutation test (see also section 5.4.5 of the book).
beer = c(27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24, 28, 24, 29, 21, 21, 18, 27, 20) water = c(21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20) ds = data.frame(y = c(beer, water), x = c(rep("beer", length(beer)), rep("water", length(water)))) require(mosaic) obsdiff = compareMean(y ~ x, data=ds) nulldist = do(999)*compareMean(y ~ shuffle(x), data=ds) histogram(~ result, xlab="permutation differences", data=nulldist) ladd(panel.abline(v=obsdiff, col="red", lwd=2)) > obsdiff [1] -4.377778 > tally(~ abs(result) > abs(obsdiff), format="percent", data=nulldist) TRUE FALSE 0.1 99.9
The do() operator evaluates the expression on the right hand side a specified number of times. In this case we shuffle (or permute) the group indicators.
We observe a mean difference of 4.4 attractions (comparing the beer to water groups). The histogram of the results-- plotted with the lattice graphics package that mosaic loads by default-- demonstrates that this result would be highly unlikely to occur by
chance: if the null hypothesis that the groups were equal was true, results more extreme than this would happen only 1 time out of 1000. This can be displayed using the tally() function, which adds some functionality to table(). We can calculate the p-value
by including the observed statistic in the numerator and the denominator = (1+1)/(999 + 1) = .002.
For those not invested in the mosaic package, base R functions can be used to perform this analysis . We present a version here that begins after making the data vectors.
alldata = c(beer, water) labels = c(rep("beer", length(beer)), rep("water", length(water))) obsdiff = mean(alldata[labels=="beer"]) - mean(alldata[labels=="water"]) > obsdiff [1] -4.377778
The sample() function re-orders the labels, effectively implementing the supposition that the number of bites might have happened under either the water or the beer drinking regimen.
resample_labels = sample(labels) resample_diff = mean(alldata[resample_labels=="beer"]) - mean(alldata[resample_labels=="water"]) resample_diff [1] 1.033333
In a teaching setting, the preceding code could be re-run several times, to mimic the presentation seen in the video linked above. To repeat many times, the most suitable base R tool is
replicate(). To use it, we make a function of the resampling procedure shown above.
resamp_means = function(data, labs){ resample_labels = sample(labs) resample_diff = mean(data[resample_labels=="beer"]) - mean(data[resample_labels=="water"]) return(resample_diff) } nulldist = replicate(9999,resamp_means(alldata,labels)) hist(nulldist, col="cyan") abline(v = obsdiff, col = "red")
The histogram is shown above. The p-value is obtained by counting the proportion of statistics (including the actual observed difference) among greater than or equal to the observed statistic:
alldiffs = c(obsdiff,nulldist) p = sum(abs(alldiffs >= obsdiff)/ 10000)
SAS
The SAS code is relatively cumbersome in comparison. We begin by reading the data in, using the "line hold" double-ampersand and the
infile datalines statement that allows us to specify a delimiter (other than a space) when reading data in directly in a data step. This let us copy the data from the R code. To identify the water and beer regimen subjects, we use the
_n_ implied variable that SAS creates but does not save with the data.
The summary procedure generates the mean for each group and saves the results in a data set with a row for each group; the
transpose procedure makes a data set with a single row and a variable for each group mean. Finally, we calculate the observed difference and use
call symput to make it into a macro variable for later use.
data bites;; if _n_ le 18 then drink="water"; else drink="beer"; infile datalines delimiter=','; input bites @@; datalines; 21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20 27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24 28, 24, 29, 21, 21, 18, 27, 20 ; run; proc summary nway data = bites; class drink; var bites; output out=obsmeans mean=mean; run; proc transpose data = obsmeans out=om2; var mean; id drink; run; data om3; set om2; obsdiff = beer-water; call symput('obsdiff',obsdiff); run; proc print data = om3; var obsdiff; run; Obs obsdiff 1 4.37778
(Yes, we could have done this with proc ttest and ODS. But the spirit of the video is that we don't understand t-tests, so we want to avoid them.)
To rerandomize, we can assign a random number to each row, sort on the random number, and assign drink labels based on the new order of the data.
data rerand; set bites; randorder = uniform(0); run; proc sort data = rerand; by randorder; run; data rerand2; set rerand; if _n_ le 18 then redrink = "water"; else redrink = "beer"; run; proc summary nway data = rerand2; class redrink; var bites; output out=rerand3 mean=mean; run; proc transpose data = rerand3 out=rerand4; var mean; id redrink; run; data rrdiff; set rerand4; rrdiff = beer-water; run; proc print data = rrdiff; var rrdiff; run; Obs rrdiff 1 -1.73778
One way to repeat this a bunch of times would be to make a macro out of the above and collect the resulting
rrdiff into a data set. Instead, we use the surveyselect procedure to do this much more efficiently. The
groups option sample groups of 18 and 25 from the data, while the reps option requests this be done 9,999 times. We can then use the
summary and transpose procedures as before, with the addition of the
by replicate statement to make a data set with columns for each group mean and a row for each replicate.
proc surveyselect data = bites groups=(18,25) reps = 9999 out = ssresamp; run; proc summary nway data = ssresamp; by replicate; class groupid; var bites; output out=ssresamp2 mean=mean; run; proc transpose data = ssresamp2 out=ssresamp3 prefix=group; by replicate; var mean; id groupid; run;
To get a p-value and make a histogram, we use the macro variable created earlier.
data ssresamp4; set ssresamp3; diff = group2 - group1; exceeds = abs(diff) ge &obsdiff; run; proc means data = ssresamp4 sum; var exceeds; run; The MEANS Procedure Analysis Variable : exceeds Sum 9.0000000 proc sgplot data = ssresamp4; histogram diff; refline &obsdiff /axis=x lineattrs=(thickness=2 color=red); run;
The p-value is 0.001 (= (9+1)/10000).
An unrelated note about aggregators:We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences.
SAS and R is aggregated by
R-bloggers,
PROC-X, and
statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at
SAS and R. We answer comments there and offer direct subscriptions if you like our content. With exceptions noted above, no one is allowed to profit by this work under our
license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.
相关文章推荐
- I have a custom view that I want to be able to initialize both in-code and in nib.
- error : A file found in a source-path must have an externally visible definition. If a definition in the file is meant to be ext
- Success and Motivation - You only have to be right once!(ZT)
- Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.
- the win16 subsystem was unable to enter protected mode,DOSX.EXE must be in your AUTOEXEC.NT and pres
- using JS to control two select(html),the data can be loaded from database and XML,and show in the select
- In order to run a trace against SQL Server you must be a member of sysadmin fixed server role or have the ALTER TRACE permission.
- A class file was not written. The project may be inconsistent, if so try refreshing this project and building it. eclipse提示错误
- How to run eclipse in clean mode? and what happens if we do so?
- R and SAS in the curriculum: getting students to "think with data"
- This is probably a good time to review the order in which SELECT statement clauses are to be specified. Table 10.2 lists all the clauses we have learned thus far, in the order they must be used.
- SQL Server2008: The Changes you have made require the following tables to be dropped and re-created
- Saving changes is not permitted. The changes you have made require the following tables to be dropped and re-created.
- 使用Mono Cecil 动态获取运行时数据 (Atribute形式 进行注入 用于写Log) [此文报考 xxx is declared in another module and needs to be imported的解决方法]-摘自网络
- why inverse must be setted in hibernate bidirectional association of one-to-many and many-to-many
- whatever how, this should be record and be researched later. ---- about how to enable gpio value set in mach_smdk6410.c
- A class file was not written. The project may be inconsistent, if so try refreshing this project and building it
- Error occurred in deployment step 'Activate Features': Feature with Id '{GUID}' is not installed in this farm, and cannot be added to this scope.
- I have updated Android SDK to rev. 22 yesterday and there is no apkbuilder in tools
- PH FIN want to have PR/PO in TWD and SGD but do invoice matching and payment in USD