A priori power analysis is used to plan and design studies so that they incorporate the correct sample size to detect a pre-determined magnitude of difference that is of interest to us as biologists. Technically, the change we want to measure reliably through experiments is defined as the effect size (e.g. Cohen’s d, a signal-to-noise ratio: the difference between the group means divided by their weighted average SD). A related parameter for ordinary ANOVAs is partial eta squared (\(\eta^2_p\)), the proportion of variance explained by a factor of interest after accounting for other factors. The examples below show ways of calculating sample sizes and the observed power of a small experiment, first using formulae (e.g. base R and the pwr[1], effectsize[2] and parameters[3] packages), and then by simulating data thousands of times based on a small initial experiment. Install effectsize, parameters, and grafify from GitHub. Observed power is then used to determine sample sizes for a future study.
8.1 Power based on effect sizes
8.1.1 Power of t tests
Let’s assume that the t test in Chapter 4 was a small initial experiment used to design a larger study. If this is the difference we want to measure in a future, larger experiment, what sample size do we need to achieve 80% power? Let’s try the pwr package, starting with the two group means 25.60446 & 29.60889 and the difference between them, -4.00443. We will need Cohen’s d, which the effectsize package can find for us (there is also a base function, power.t.test, which is similar but takes the difference between means and the pooled SD – the same quantities used to calculate Cohen’s d).
library(pwr)
library(effectsize)
library(parameters)

effectsize::cohens_d(Bodyweight ~ Diet, #same formula as the t test
                     data = Micebw)     #data table
Cohen's d | 95% CI
--------------------------
-1.21 | [-2.16, -0.24]
- Estimated using pooled SD.
The effect size is -1.21, which we enter, along with the other parameters, into the pwr.t.test function.
pwr.t.test(d = -1.21,                 #Cohen's d
           sig.level = 0.05,          #desired alpha cut-off
           power = 0.8,               #desired power
           type = "two.sample",       #type of t test
           alternative = "two.sided") #type of P values
Two-sample t test power calculation
n = 11.76412
d = 1.21
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
We will need ~12 mice per group.
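For comparison, the base power.t.test function mentioned above should arrive at roughly the same answer: it takes the difference between means and the pooled SD directly (their ratio, 4.004/3.31, is the Cohen’s d of ~1.21 we used).
#base R equivalent: difference between means and pooled SD
power.t.test(delta = 4.004,            #difference between group means
             sd = 3.31,                #pooled SD
             sig.level = 0.05,
             power = 0.8,
             type = "two.sample",
             alternative = "two.sided")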
But the effect size is a point estimate with a range, and to be sure we have sufficient power we should consider the end of this range closest to zero, in case the true effect is smaller in magnitude than -1.21! We could instead calculate a sample size that “protects” against such variation. A recent recommendation is to use the bound of the 60% confidence interval nearest zero to set what is called “safeguard power”[4].
#get 60% CI range for the effect size
effectsize::cohens_d(Bodyweight ~ Diet, #formula as in the t test
                     data = Micebw,
                     ci = 0.6)          #get 60% confidence interval
Cohen's d | 60% CI
--------------------------
-1.21 | [-1.60, -0.78]
- Estimated using pooled SD.
#use the end of the CI nearest zero to calculate sample size
pwr.t.test(d = -0.78,                 #conservative end of Cohen's d 60% CI
           sig.level = 0.05,
           power = 0.8,
           type = "two.sample",
           alternative = "two.sided")
Two-sample t test power calculation
n = 26.79675
d = 0.78
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
We will need ~27 mice per group for 80% safeguard power! Phew!
It may not always be feasible to do an experiment this large. Randomised block design (e.g., fewer mice in independent repeats rather than one experiment with 10 mice per group) could reduce sample size and still detect the desired effect size.
8.1.2 Power of paired t tests
For paired data, we need to find the number of pairs necessary for 80% power. For the t tests in Chapter 4, we first calculate Cohen’s d using effectsize, including the argument paired = T just as in the t test. This is demonstrated below: we get the 60% confidence interval range for the effect size and then the sample size at 80% safeguard power.
effectsize::cohens_d(log(IL6) ~ Genotype,
                     data = Cytokine, #data table
                     ci = 0.6)        #get 60% confidence interval
Cohen's d | 60% CI
--------------------------
-0.56 | [-0.76, -0.34]
- Estimated using pooled SD.
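A pwr.t.test call along these lines, with type = "paired" and the paired effect size of 1.24, produces the output below (the d of 1.24 presumably comes from the paired, rather than pooled-SD, version of the effect size).
pwr.t.test(d = 1.24,                  #paired effect size
           sig.level = 0.05,
           power = 0.8,
           type = "paired",           #paired t test
           alternative = "two.sided")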
Paired t test power calculation
n = 7.250775
d = 1.24
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number of *pairs*
We will need ~8 pairs to detect this effect size at 80% safeguard power. Note that this dataset had 33 pairs, and our power analysis suggests fewer pairs could have been used if we were limited by time/resources.
8.1.3 Power of ANOVAs
For ordinary ANOVAs, the base function power.anova.test or pwr.anova.test from the pwr package can be used by filling in the required parameters. For mixed models, we use simulations.
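As a sketch with assumed numbers: for a one-way design with 4 groups and a hypothetical \(\eta^2_p\) of 0.2, Cohen’s f is \(\sqrt{0.2/0.8} = 0.5\), and pwr.anova.test returns the per-group n for 80% power.
#a sketch with assumed values: 4 groups, hypothetical partial eta-squared of 0.2
#Cohen's f = sqrt(eta2 / (1 - eta2)) = sqrt(0.2/0.8) = 0.5
pwr.anova.test(k = 4,            #number of groups (assumed)
               f = 0.5,          #Cohen's f (assumed)
               sig.level = 0.05,
               power = 0.8)      #returns n per group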
8.2 Power by simulation
8.2.1 Simulating t tests
We start with two independent groups, which are easier to simulate than the multiple groups of one-way or two-way ANOVAs with random effects (see below). In the Micebw dataset, the two groups had means of 25.6 and 29.6, SDs of 3.44 and 3.18 (average SD = 3.31), and n = 10 mice each. We can simulate independent-group data using the base R rnorm function (which draws random variates from a normal distribution of specified mean and SD), and then calculate the observed power of the same t test we analysed above with formulae. Below, random rnorm draws are wrapped within t.test and replicate calls to generate P values from multiple independent repeats.
replicate(n = 20,                     #20 simulations
          t.test(x = rnorm(n = 10,      #n in group 1
                           mean = 25.6, #mean 1
                           sd = 3.44),  #sd 1
                 y = rnorm(n = 10,      #n in group 2
                           mean = 29.6, #mean 2
                           sd = 3.18)   #sd 2
                 )$p.value)             #get P values
You can see the variability in the 20 independently simulated P values – this is what we can expect in independent “studies”. Let’s repeat this with n = 12 and n = 27 mice per group as above (in practice we would test a range of n values to find one that achieves 80% power).
pt.test_12 <- replicate(n = 1000,               #1000 simulations
                        t.test(x = rnorm(n = 12,      #n in group 1
                                         mean = 25.6, #mean 1
                                         sd = 3.44),  #sd 1
                               y = rnorm(n = 12,      #n in group 2
                                         mean = 29.6, #mean 2
                                         sd = 3.18)   #sd 2
                               )$p.value)             #get P values
pt.test_27 <- replicate(n = 1000,               #1000 simulations
                        t.test(x = rnorm(n = 27,      #n in group 1
                                         mean = 25.6, #mean 1
                                         sd = 3.44),  #sd 1
                               y = rnorm(n = 27,      #n in group 2
                                         mean = 29.6, #mean 2
                                         sd = 3.18)   #sd 2
                               )$p.value)             #get P values
#percentage of P values <= 0.05 is the observed power
length(pt.test_12[pt.test_12 <= 0.05]) * 100 / length(pt.test_12)
length(pt.test_27[pt.test_27 <= 0.05]) * 100 / length(pt.test_27)
As you can see, power increases with larger sample sizes and roughly agrees with the formula-based results above: roughly 80% with 12 mice and 99% with 27 mice (the n suggested by safeguard power). This experiment could be re-designed as randomised blocks to use fewer mice.
For paired t tests, we could simulate ratios or differences and their SDs, or use randomised block designs as below.
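A minimal sketch with assumed numbers: if the mean within-pair difference in log(IL6) were 0.5 with an SD of 0.9, a one-sample t test on simulated differences (equivalent to a paired t test) estimates the power for 8 pairs.
#simulate within-pair differences (assumed mean 0.5, SD 0.9) and
#test them with a one-sample t test, equivalent to a paired t test
pt.paired_8 <- replicate(n = 1000,
                         t.test(rnorm(n = 8,      #8 pairs
                                      mean = 0.5, #mean difference (assumed)
                                      sd = 0.9)   #SD of differences (assumed)
                                )$p.value)
length(pt.paired_8[pt.paired_8 <= 0.05]) * 100 / length(pt.paired_8)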
8.2.2 Simulating ANOVAs as mixed effects models
Getting effect sizes for mixed models is not reliable, so power analysis is better done by simulating data. Simulation approaches are fully flexible when we know the various SDs that make up the linear model.
The idea is simple: simulate data based on the linear model (described by \(y = mx + c + \epsilon\), where \(\epsilon\) is residual error drawn from a normal distribution with mean = 0 and SD equal to the residual SD), re-fit the linear model with the same fixed and random factors to the simulated data, and get a P value. Do this thousands of times and measure the proportion of P values \(\le\) 0.05. This is the observed power based on the parameters (e.g. means and SDs) of a small study or preliminary data. Simulations over a range of sample sizes tell us the sample size that generates P \(\le\) 0.05 in 80% of simulations (1 – \(\beta\) = 0.8). These repetitive tasks can be automated in R by creating functions, as shown below.
The overall steps with the example of the two-way randomised block data from Chapter 6 (1 mouse each in 2 experimental blocks, with 4 Strains, 2 Treatments, and GST as the response variable) are listed below:
1. Fit the measured preliminary data to a mixed model (we did this in Chapter 6).
2. Use simulate to generate new data based on the linear model (this is an independent repeat of the study).
3. Re-fit or update the linear model using the simulated data, and get a P value.
4. Repeat steps 2 & 3 thousands of times using replicate, and count the proportion of P values that are \(\le\) 0.05 (the conventional \(\alpha\) cut-off). This is the observed power at the sample size of the study used at step 1.
5. If power is less than 80%, repeat steps 2-4 by simulating data over a range of larger sample sizes using the same model parameters (e.g. means, residual SDs) as in step 1.
6. Plan the main study with a sample size at least as large as the one with P \(\le\) 0.05 in 80% of simulations.
The code below implements the above steps using the Micelmer linear mixed effects model we fit in Chapter 6. These are random simulations whose results change every time the simulate function is run – you will get different results if you copy the code and run it! In lm and lmer models, the original data can be accessed with $model or @frame on the model object, respectively. The update function calls the same formula used to fit the original model but fits it to the new (simulated) data.
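For instance, a quick way to peek at the data stored in the fitted mixed model:
head(Micelmer@frame) #original data used to fit the mixed model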
#simulate data from the model
simulate(Micelmer) #data frame of new responses for "GST"
#save simulated data to a new table with the original data
SimMiceData <- cbind(Micelmer@frame,     #original data
                     simulate(Micelmer)) #simulated data
#fit the model again to sim_1 instead of GST
MicelmerUpdate <- update(Micelmer,           #update the model
                         sim_1 ~ .,          #new response variable, same RHS
                         data = SimMiceData) #new data
library(car, quietly = T) #car package for the Anova call
car::Anova(MicelmerUpdate, test.statistic = "F")
Warning in printHypothesis(L, rhs, names(b)): one or more coefficients in the hypothesis include
arithmetic operators in their names;
the printed representation of the hypothesis will be omitted
We’ve just simulated one independent repeat of the preliminary experiment! Note that these numbers are different from those with original GST data in Chapter 6 (and will be different for you if you copied the code because these are random draws based on the linear model).
Next, we need to do this thousands of times and count how many P values are \(\le\) 0.05! I created a function to do this; as shown below, it starts with a linear model and first checks whether it is an ordinary (lm) or mixed effects (lmer) model. This flexibility lets us run the power analysis on ordinary linear models fit with lm, although below we only run it on lmer models. The internal dopower function performs the simulation once, and replicate repeats it as many times as we ask via the Nsim argument (the default is 20 to save time, but it should be 1000 or more). The code performs an ANOVA (generating the F statistic and its P value based on the Dfs in the data) for each factor in the model.
#function to automate simulation of ordinary and mixed linear models
power_linmodels <- function(model = mix_mod,
                            Nsim = 20,
                            alpha = 0.05,
                            Factor_row = 1){
  if(class(model) %in% c("lmerTest",        #check if model is a mixed model
                         "lme4",
                         "lmerModLmerTest",
                         "lmerMod")){
    datacols <- model@frame                 #extract original data
  }
  if(class(model) %in% c("lm")){            #check if model is a linear model
    datacols <- model$model                 #extract original data
  }
  if(Factor_row > length(labels(terms(model)))){
    stop("Error: Factor row is greater than terms in the model.") #stop if Factor_row input is wrong
  }
  dopower <- function() {                   #internal function for one simulation
    sim_1 <- simulate(model)                #simulate data
    datacols2 <- cbind(datacols, sim_1)     #make a new data frame
    modupdate <- update(model,              #update the model
                        sim_1 ~ .,          #with the new formula
                        data = datacols2)   #pass on new data to be fit
    pval <- car::Anova(modupdate,           #run ANOVA on the updated model
                       test.statistic = "F")[[4]][[Factor_row]] #extract P value for required factor
    pval                                    #result of 1 simulation
  }
  pvals <- replicate(n = Nsim, dopower())   #replicate Nsim times (usually ~1000)
  length(pvals[pvals <= alpha]) * 100 / length(pvals) #percent <= alpha cut-off is power
}
Now that we have created the function power_linmodels, we can run it on the Micelmer object for both factors.
suppressWarnings(power_linmodels(Micelmer,        #run the function
                                 Nsim = 20,       #saving time; use at least 1000
                                 Factor_row = 1)) #factor on row 1 of ANOVA table (Strain)
boundary (singular) fit: see help('isSingular')
[1] 45
#repeat for row 2 for Treatment
suppressWarnings(power_linmodels(Micelmer,        #run the function
                                 Nsim = 20,       #saving time; use at least 1000
                                 Factor_row = 2)) #factor on row 2 of ANOVA table (Treatment)
boundary (singular) fit: see help('isSingular')
[1] 100
The power for Strain is quite low, but it is high for Treatment. We need a larger sample size to reliably detect an effect of Strain. How many? This dataset had 2 blocks with one mouse per block. We need to simulate additional blocks and calculate power over a range, say 3-5 blocks. The code for doing this is below.
To add more blocks, we will have to generate new data based on the means & SDs we already have from our experiment. The grafify package includes functions to simulate data from ordinary or randomised block one-way or two-way designs. These functions use \(y = mx + c + \epsilon\) to generate random draws based on the SDs of the random factor and the residuals from the Gaussian/normal distribution. To use them (or to simulate data with your own similar functions), we need the experimental and residual error from the model: the block or experimental SD = 123.14 and the residual SD = 54.38, which can be read off the fitted model as shown next. Means for each Strain and Treatment, calculated with dplyr on the data, are entered below. I simulate data with 3, 4 and 5 randomised blocks.
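A quick way to see these SDs (a sketch, assuming the Micelmer model from Chapter 6 is still in the workspace):
lme4::VarCorr(Micelmer) #Std.Dev. column lists the block (random factor) and residual SDs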
library(grafify)
#Group1 is level 1 of Treatment (i.e. Control) & Group2 is Treated
#the four means within Group1 are the four Strain means for Control
#the four means within Group2 are the four Strain means for Treated
#add one more block, i.e. 3 blocks
mice_rb_3 <- make_2way_rb_data(Group1_means = c(526, #Control means for each strain
                                                508,
                                                504,
                                                604),
                               Group2_means = c(742, #Treated means for each strain
                                                929,
                                                704,
                                                722),
                               Num_exp = 3,          #3 randomised blocks
                               Exp_SD = 123.14,      #block SD
                               Residual_SD = 54.38)  #residual SD
head(mice_rb_3, n = 16) #see the first 16 rows
#the function outputs factor names as FixFac_1 (Treatment) and FixFac_2 (Strain)
#the response variable is "Values"
#note the formula for lmer uses these variable names
mice_rb_3lmer <- lmer(Values ~ FixFac_1 * FixFac_2 + (1|RandFac),
                      data = mice_rb_3)
#four blocks
mice_rb_4 <- make_2way_rb_data(Group1_means = c(526, #Control means for each strain
                                                508,
                                                504,
                                                604),
                               Group2_means = c(742, #Treated means for each strain
                                                929,
                                                704,
                                                722),
                               Num_exp = 4,          #4 randomised blocks
                               Exp_SD = 123.14,      #block SD
                               Residual_SD = 54.38)  #residual SD
mice_rb_4lmer <- lmer(Values ~ FixFac_1 * FixFac_2 + (1|RandFac),
                      data = mice_rb_4)
#five blocks
mice_rb_5 <- make_2way_rb_data(Group1_means = c(526, #Control means for each strain (FixFac_2)
                                                508,
                                                504,
                                                604),
                               Group2_means = c(742, #Treated means for each strain
                                                929,
                                                704,
                                                722),
                               Num_exp = 5,          #5 randomised blocks
                               Exp_SD = 123.14,      #block SD
                               Residual_SD = 54.38)  #residual SD
mice_rb_5lmer <- lmer(Values ~ FixFac_1 * FixFac_2 + (1|RandFac),
                      data = mice_rb_5)
#power of each model (running fewer simulations to save time here)
Now that we’ve created three models with increasing sample size, we can run power_linmodels on them to find which size achieves ~80% power.
suppressWarnings(power_linmodels(mice_rb_3lmer,
                                 Nsim = 100,
                                 Factor_row = 2)) #Strain is FixFac_2
boundary (singular) fit: see help('isSingular')
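The same call works on the other two models to get power at 4 and 5 blocks (output not reproduced here):
suppressWarnings(power_linmodels(mice_rb_4lmer,
                                 Nsim = 100,
                                 Factor_row = 2)) #4 blocks
suppressWarnings(power_linmodels(mice_rb_5lmer,
                                 Nsim = 100,
                                 Factor_row = 2)) #5 blocks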
Results are more reliable with thousands of simulations, which can take about 2-5 min for these models (longer for more complex designs). Given the relatively small effect of Strain, we will likely need many blocks (independent experiments) to detect it reliably. If we were more interested in the Treatment factor, we could address it with fewer mice because of the larger effect of Treatment on the response GST.
Simulating data structure based on initial experiments is thus quite flexible for assessing power and determining sample sizes.
[2]
M.S. Ben-Shachar, D. Lüdecke, D. Makowski, effectsize: Estimation of effect size indices and standardized parameters, Journal of Open Source Software 5 (2020) 2815. https://doi.org/10.21105/joss.02815.
[3]
D. Lüdecke, M.S. Ben-Shachar, I. Patil, D. Makowski, Parameters: Extracting, computing and exploring the parameters of statistical models using R., Journal of Open Source Software 5 (2020) 2445. https://doi.org/10.21105/joss.02445.
[4]
M. Perugini, M. Gallucci, G. Costantini, Safeguard power as a protection against imprecise power estimates, Perspectives on Psychological Science 9 (2014) 319–332. https://doi.org/10.1177/1745691614528519.