8  Power Analysis

A priori power analysis is used to plan and design studies with the correct sample size to detect a pre-determined magnitude of difference that is of interest to us as biologists. The change we want to measure reliably in experiments is defined as the effect size (e.g. Cohen’s d, a \(signal/noise\) ratio: the difference between means divided by the weighted average SD). A related parameter for ordinary ANOVAs is partial eta squared (\(\eta^2_p\)), the proportion of variance explained by a factor of interest after accounting for other factors. The examples below show ways of calculating sample sizes and the observed power of a small experiment, first using formulae (base R and the pwr [1], effectsize [2] and parameters [3] packages), and then by simulating data thousands of times based on a small initial experiment. Install effectsize, parameters, and grafify from GitHub. Observed power is then used to determine sample sizes for a future study.
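
Installation should work along these lines (a sketch: grafify’s GitHub repository is assumed to be ashenoy-cmbi/grafify, and pwr, effectsize and parameters are on CRAN):

install.packages(c("pwr", "effectsize", "parameters", "remotes")) #CRAN packages
remotes::install_github("ashenoy-cmbi/grafify")  #assumed repository name for grafify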

8.1 Power based on effect sizes

8.1.1 Power of t tests

Let’s assume that the t test in Chapter 4 was a small initial experiment used to design a larger study. If this is the difference we want to measure in a future, larger experiment, what sample size do we need to achieve 80% power? Let’s try the pwr package, starting with the two group means 25.60446 & 29.60889 and the difference between them, -4.00443. We will need Cohen’s d, which the effectsize package can calculate for us (there is also a base function power.t.test, which is similar but needs the difference between means and the pooled SD – the same quantities used to calculate Cohen’s d).
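
For comparison, a minimal sketch of the equivalent base R call, using the difference between means and a pooled SD of ~3.31 (the group SDs, 3.44 and 3.18, are listed in Section 8.2.1 below):

power.t.test(delta = 4.00,     #difference between the two group means
             sd = 3.31,        #pooled SD of the two groups
             sig.level = 0.05, #desired alpha cut-off
             power = 0.8)      #desired power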

library(pwr) 
library(effectsize)
library(parameters)

effectsize::cohens_d(Bodyweight ~ Diet, #same formula as the t-test
                     data = Micebw)     #data table
Cohen's d |         95% CI
--------------------------
-1.21     | [-2.16, -0.24]

- Estimated using pooled SD.

The effect size is -1.21, which we pass, along with the other parameters, to the pwr.t.test function.

pwr.t.test(d = -1.21,  #Cohen's d
           sig.level = 0.05,  #desired alpha cut-off
           power = 0.8,       #desired power 
           type = "two.sample",  #type of t test
           alternative = "two.sided")  #type of P values

     Two-sample t test power calculation 

              n = 11.76412
              d = 1.21
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

We will need ~12 mice per group.

But the effect size is a point estimate with a confidence interval, and to be sure we have sufficient power we should plan for the possibility that the true effect is smaller in magnitude than -1.21. We can instead calculate a sample size that “protects” against such variation. A recent recommendation is to use the end of the 60% confidence interval closer to zero (i.e. the smaller effect size) to calculate what is called “safeguard power” [4].

#get 60% CI range for the effect size
effectsize::cohens_d(Bodyweight ~ Diet, #formula as the t-test
                     data = Micebw,
                     ci = 0.6) #get 60% confidence interval
Cohen's d |         60% CI
--------------------------
-1.21     | [-1.60, -0.78]

- Estimated using pooled SD.
#use the lower range to calculate sample size
pwr.t.test(d = -0.78,  #lower end of Cohen's d 60% CI
           sig.level = 0.05,
           power = 0.8, 
           type = "two.sample",
           alternative = "two.sided") 

     Two-sample t test power calculation 

              n = 26.79675
              d = 0.78
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

We will need ~27 mice per group for 80% safeguard power! Phew!

It may not always be feasible to do an experiment this large. A randomised block design (e.g. fewer mice per group in each of several independent repeats, rather than one large experiment) could reduce the sample size needed while still detecting the desired effect size.

8.1.2 Power of paired t tests

For paired data, we need to find the number of pairs necessary for 80% power. For the paired t test in Chapter 4, we first calculate Cohen’s d with effectsize, including the argument paired = TRUE just as in the t test. This is demonstrated below to get the 60% confidence interval for the effect size, and then the sample size at 80% safeguard power.

effectsize::cohens_d(log(IL6) ~ Genotype, 
                     data = Cytokine, #data table
                     ci = 0.6)
Cohen's d |         60% CI
--------------------------
-0.56     | [-0.76, -0.34]

- Estimated using pooled SD.
pwr.t.test(d = 1.24,  #paired Cohen's d (lower end of 60% CI)
           power = 0.8,
           sig.level = 0.05,
           type = "paired")

     Paired t test power calculation 

              n = 7.250775
              d = 1.24
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number of *pairs*

We will need ~8 pairs to detect this effect size at 80% safeguard power. Note that this dataset had 33 pairs, and our power analysis suggests fewer pairs could have been used if we were limited by time/resources.

8.1.3 Power of ANOVAs

For ordinary ANOVAs, the base function power.anova.test or pwr.anova.test from the pwr package can be used by filling in the required parameters. For mixed models, we use simulations.
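
As a minimal sketch with hypothetical values (4 groups and a Cohen’s f of 0.4), pwr.anova.test returns the n per group needed for 80% power:

pwr.anova.test(k = 4,            #number of groups
               f = 0.4,          #Cohen's f effect size (hypothetical value)
               sig.level = 0.05, #alpha cut-off
               power = 0.8)      #desired power; n per group is returned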

8.2 Power by simulation

8.2.1 Simulating t tests

We start with two independent groups, which are easier to simulate than multiple groups in one-way ANOVAs or two-way ANOVAs with random effects (see below). In the Micebw dataset, the two groups had means of 25.6 and 29.6, SDs of 3.44 and 3.18 (average SD = 3.31), and n = 10 mice each. We can simulate independent group data with the base R rnorm function (which draws random variates from a normal distribution with a specified mean and SD), and then calculate the observed power of the same t test we analysed above with formulae. Below, random rnorm draws are wrapped within t.test and replicate calls to generate P values from multiple independent repeats.

#simulate 20 independent two-group experiments with rnorm
#means, SDs and n per group are taken from the Micebw dataset

replicate(n = 20,        #20 simulations
          t.test(x = rnorm(n = 10,       #n in group 1
                           mean = 25.6,  #mean 1
                           sd = 3.44),   #sd 1
                 y = rnorm(n = 10,       #n in group 2
                           mean = 29.6,  #mean 2
                           sd = 3.18)    #sd 2
          )$p.value)             #get P values           
 [1] 0.0442516494 0.0057505891 0.0032096972 0.0044031555 0.0214233486
 [6] 0.1841017261 0.0011047518 0.0256390468 0.0028990793 0.0001051730
[11] 0.0100479428 0.0344615803 0.1214839301 0.0230059582 0.0112841499
[16] 0.0042022864 0.1757594740 0.0032513279 0.0000854722 0.0099559794

You can see the variability in the 20 independently simulated P values – this is what we can expect in independent “studies”. Let’s repeat this with n = 12 and n = 27 mice per group as above (in practice we will need to test a range of n values to achieve 80% power).

pt.test_12 <-
  replicate(n = 1000,        #1000 simulations
            t.test(x = rnorm(n = 12,       #n in group 1
                             mean = 25.6,  #mean 1
                             sd = 3.44),   #sd 1
                   y = rnorm(n = 12,       #n in group 2
                             mean = 29.6,  #mean 2
                             sd = 3.18)    #sd 2
            )$p.value)             #get P values           
pt.test_27 <-
  replicate(n = 1000,        #1000 simulations
            t.test(x = rnorm(n = 27,       #n in group 1
                             mean = 25.6,  #mean 1
                             sd = 3.44),   #sd 1
                   y = rnorm(n = 27,       #n in group 2
                             mean = 29.6,  #mean 2
                             sd = 3.18)    #sd 2
            )$p.value)             #get P values           

length(pt.test_12[pt.test_12 <= 0.05])*100/length(pt.test_12)
[1] 80.8
length(pt.test_27[pt.test_27 <= 0.05])*100/length(pt.test_27)
[1] 99.5

As you can see, power increases with larger sample size and roughly agrees with the results from the formulae above: power is roughly 80% with 12 mice per group and over 99% with 27 mice per group (the sample size suggested by safeguard power). This experiment could be re-designed as randomised blocks to use fewer mice.
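
Rather than picking n values by hand, the same simulation can be wrapped in sapply to scan a range of sample sizes (a sketch using the same means and SDs as above):

#percent of simulations with P <= 0.05 for n = 8 to 16 mice per group
sapply(8:16, function(n_per_group) {
  pvals <- replicate(n = 1000,
                     t.test(x = rnorm(n_per_group, mean = 25.6, sd = 3.44),
                            y = rnorm(n_per_group, mean = 29.6, sd = 3.18))$p.value)
  length(pvals[pvals <= 0.05])*100/length(pvals) #observed power (%)
})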

For paired t tests, we could simulate ratios or differences and their SDs, or use randomised block designs as below.
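
As a minimal sketch (the mean and SD of the paired differences below are hypothetical values), paired data can be simulated as differences and tested with a one-sample t test against 0:

paired_p <- replicate(n = 1000,
                      t.test(rnorm(n = 8,      #number of pairs
                                   mean = -4,  #assumed mean of paired differences
                                   sd = 3)     #assumed SD of paired differences
                      )$p.value)               #one-sample t test against 0
length(paired_p[paired_p <= 0.05])*100/length(paired_p) #observed power (%)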

8.2.2 Simulating ANOVAs as mixed effects models

Effect sizes for mixed models cannot be estimated reliably, so power analysis is better done by simulating data. Simulation-based approaches are fully flexible as long as we know the various SDs that make up the linear model.

The idea is simple: simulate data based on the linear model (described by \(y = mx + c + \epsilon\), where \(\epsilon\) is the residual error drawn from a normal distribution with mean = 0 and SD equal to the residual SD), re-fit the same linear model (same fixed and random factors) to the simulated data, and get P values. Do this thousands of times and measure the proportion of simulations with P \(\le\) 0.05. This is the observed power based on the parameters (e.g. means and SDs) of a small study or preliminary data. Simulations over a range of sample sizes tell us the sample size that gives P \(\le\) 0.05 in 80% of simulations (1 – \(\beta\) = 0.8). These repetitive tasks can be automated in R by creating functions, as shown below.

The overall steps with the example of the two-way randomised block data from Chapter 6 (1 mouse each in 2 experimental blocks, with 4 Strains, 2 Treatments, and GST as the response variable) are listed below:

  1. Fit measured preliminary data to a mixed model (we did this in Chapter 6)
  2. Use simulate to generate new data based on the linear model (this is an independent repeat of the study).
  3. Re-fit or update the linear model using simulated data, and get a P value.
  4. Repeat steps 2 & 3 thousands of times using replicate, and count the proportion of P values that are \(\le\) 0.05 (the conventional \(\alpha\) cut-off). This is the observed power at the sample size of study used at step 1.
  5. If power is less than 80%, repeat steps 2-4 by simulating data for a range of larger sample size using the same model parameters (e.g. means, residual SDs) as in step 1.
  6. Plan the main study with a sample size at least as large as that with P \(\le\) 0.05 in 80% of simulations.

See the code below for the above steps, using the Micelmer linear mixed effects model we fit in Chapter 6. These are random simulations and will change every time the simulate function is run, so you will get different results if you copy and run the code. In lm and lmer models, the original data can be accessed with $model or @frame on the model object, respectively. The update function calls the same formula used to fit the original model but fits it to the new (simulated) data.

#simulate data from model
simulate(Micelmer) #data frame of new responses for "GST"
      sim_1
1  651.5060
2  698.9782
3  445.8033
4  753.0420
5  463.9971
6  953.4387
7  518.9556
8  919.9650
9  555.7366
10 650.9038
11 570.9212
12 683.4539
13 451.2191
14 879.0971
15 551.7927
16 655.6375
#save simulated data to a new table with original data
SimMiceData <- cbind(Micelmer@frame,     #original data
                     simulate(Micelmer)) #simulated data

#fit the model again to sim_1 instead of GST
MicelmerUpdate <- update(Micelmer,         #update the model
                         sim_1 ~ .,        #new response variable and same RHS
                         data = SimMiceData) #new data

library(car, quietly = T)     #car package for Anova call
car::Anova(MicelmerUpdate, 
           test.statistic = "F")
Analysis of Deviance Table (Type II Wald F tests with Kenward-Roger df)

Response: sim_1
                        F Df Df.res   Pr(>F)    
Strain             2.8182  3      7 0.116969    
Treatment        134.8624  1      7 7.92e-06 ***
Strain:Treatment  12.0427  3      7 0.003748 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We’ve just simulated one independent repeat of the preliminary experiment! Note that these numbers are different from those with original GST data in Chapter 6 (and will be different for you if you copied the code because these are random draws based on the linear model).

Next, we need to do this thousands of times and count how many P values are \(\le\) 0.05! I created a function to do this, shown below, which starts with a linear model and first checks whether it is an ordinary (lm) or mixed effects (lmer) model. This flexibility lets us run the power analysis on ordinary linear models fit with lm, although below we only run it on lmer models. The internal dopower function performs the simulation once, and replicate repeats it as many times as requested through the Nsim argument (the default is 20 to save time, but it should be 1000 or more). Each simulation runs an ANOVA (F statistics and P values, with degrees of freedom based on the data) and extracts the P value for the factor on the requested row of the ANOVA table.

#function to automate simulation of ordinary and mixed linear models
power_linmodels <- function(model, Nsim = 20, alpha = 0.05, Factor_row = 1){ #create a function
  if(inherits(model, c("lmerTest",            #check if model is a mixed model
                       "lme4",
                       "lmerModLmerTest",
                       "lmerMod"))){
    datacols <- model@frame                   #extract original data from the lmer fit
  } else if(inherits(model, "lm")){           #check if model is an ordinary linear model
    datacols <- model$model                   #extract original data from the lm fit
  }
  if(Factor_row > length(labels(terms(model)))){
    stop("Error: Factor_row is greater than the number of terms in the model.") #stop if Factor_row input is wrong
  }
  dopower <- function() {                #internal function for one simulation
    sim_1 <- simulate(model)             #simulate new responses from the model
    datacols2 <- cbind(datacols, sim_1)  #new data frame with simulated responses
    modupdate <- update(model,           #update the model
                        sim_1 ~ .,       #sim_1 as the new response, same RHS
                        data = datacols2) #fit to the new data
    car::Anova(modupdate,                #ANOVA on the updated model
               test.statistic = "F")[[4]][[Factor_row]] #extract P value for the required factor
  }
  pvals <- replicate(n = Nsim, dopower())         #replicate Nsim times (usually >= 1000)
  length(pvals[pvals <= alpha])*100/length(pvals) #percent of P values <= alpha is the power
}

Now that we have created the function power_linmodels, we can run it on the Micelmer object for both factors.

suppressWarnings(power_linmodels(Micelmer,    #run the function
                Nsim = 20,  #saving time, use 1000 at least
                Factor_row = 1)) #factor on row 1 of ANOVA table (Strain)
boundary (singular) fit: see help('isSingular')
boundary (singular) fit: see help('isSingular')
boundary (singular) fit: see help('isSingular')
[1] 45
#repeat for row 2 for Treatment
suppressWarnings(power_linmodels(Micelmer,    #run the function
                Nsim = 20,  #saving time, use 1000 at least
                Factor_row = 2)) #factor on row 2 of ANOVA table (Treatment)
boundary (singular) fit: see help('isSingular')
boundary (singular) fit: see help('isSingular')
boundary (singular) fit: see help('isSingular')
[1] 100

The power for Strain is quite low, but it is high for Treatment. We need a larger sample size to reliably detect an effect of Strain. How much larger? This data set had 2 blocks with one mouse per group in each block. We can simulate additional blocks and calculate power over a range, say 3-5 blocks. The code for doing this is below.

To add more blocks, we have to generate new data based on the means and SDs we already have from our experiment. The grafify package includes functions to simulate data for ordinary or randomised block one-way and two-way designs. These functions use \(y = mx + c + \epsilon\) to generate random draws from the Gaussian/normal distribution based on the SDs of the random factor and the residuals. To use them (or to simulate data with your own similar functions), we need the experimental and residual errors from the model: the Block (experimental) SD is 123.14 and the residual SD is 54.38. The group means for each Strain and Treatment combination, calculated with dplyr on the original data, are entered below. I simulate data with 3, 4 and 5 randomised blocks.
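
A sketch of how those group means can be obtained with dplyr, using the data stored in the fitted model (column names as in Chapter 6):

library(dplyr)
Micelmer@frame %>%                   #original data stored in the lmer object
  group_by(Treatment, Strain) %>%    #group by both fixed factors
  summarise(mean_GST = mean(GST))    #group means entered below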

library(grafify)
#Group1 is level 1 of Treatment (i.e.Control) & Group2 is Treated
#Four means within Group1 are four Strain means for Control
#Four means within Group2 are four Strain means for Treated

#add one more block, 3 blocks
mice_rb_3 <-
  make_2way_rb_data(Group1_means = c(526, #Control means for each strain
                                     508,
                                     504,
                                     604), 
                    Group2_means = c(742, #Treated means for each strain
                                     929,
                                     704,
                                     722),
                    Num_exp = 3,     #3 randomised blocks
                    Exp_SD = 123.14, #block SD
                    Residual_SD = 54.38) #residual SD

head(mice_rb_3, n = 16) #see the first 16 rows
   RandFac FixFac_1 FixFac_2    Values
1    Exp_1 Fx1Lev_1 Fx2Lev_1  750.9885
2    Exp_1 Fx1Lev_1 Fx2Lev_2  669.8272
3    Exp_1 Fx1Lev_1 Fx2Lev_3  744.4586
4    Exp_1 Fx1Lev_1 Fx2Lev_4  806.4465
5    Exp_1 Fx1Lev_2 Fx2Lev_1  887.1313
6    Exp_1 Fx1Lev_2 Fx2Lev_2 1127.1146
7    Exp_1 Fx1Lev_2 Fx2Lev_3  925.0431
8    Exp_1 Fx1Lev_2 Fx2Lev_4  914.4053
9    Exp_2 Fx1Lev_1 Fx2Lev_1  694.6045
10   Exp_2 Fx1Lev_1 Fx2Lev_2  567.9822
11   Exp_2 Fx1Lev_1 Fx2Lev_3  633.1064
12   Exp_2 Fx1Lev_1 Fx2Lev_4  744.1263
13   Exp_2 Fx1Lev_2 Fx2Lev_1  861.9655
14   Exp_2 Fx1Lev_2 Fx2Lev_2 1005.6116
15   Exp_2 Fx1Lev_2 Fx2Lev_3  825.8770
16   Exp_2 Fx1Lev_2 Fx2Lev_4  859.1151
#the function names the factors FixFac_1 (Treatment) and FixFac_2 (Strain)
#response variable is "Values"
#note formula for lmer uses these variable names

mice_rb_3lmer <- lmer(Values ~ FixFac_1 * FixFac_2 +
                        (1|RandFac), 
                      data = mice_rb_3)
#four blocks
mice_rb_4 <-
  make_2way_rb_data(Group1_means = c(526, #Control means for each strain
                                     508,
                                     504,
                                     604), 
                    Group2_means = c(742, #Treatment means
                                     929,
                                     704,
                                     722),
                    Num_exp = 4,     #4 randomised blocks
                    Exp_SD = 123.14, #block SD
                    Residual_SD = 54.38) #residual SD
mice_rb_4lmer <- lmer(Values ~ FixFac_1 * FixFac_2 +
                        (1|RandFac), 
                      data = mice_rb_4)

#five blocks
mice_rb_5 <-
  make_2way_rb_data(Group1_means = c(526, #Control means for each strain (FixFac_2)
                                     508,
                                     504,
                                     604), 
                    Group2_means = c(742, #Treatment means
                                     929,
                                     704,
                                     722),
                    Num_exp = 5,     #5 randomised blocks
                    Exp_SD = 123.14, #block SD
                    Residual_SD = 54.38) #residual SD

mice_rb_5lmer <- lmer(Values ~ FixFac_1 * FixFac_2 +
                        (1|RandFac), 
                      data = mice_rb_5)

Now that we have created three models with increasing sample size, we can run power_linmodels on each of them to find which sample size achieves ~80% power for Strain (fewer simulations are run here to save time).

suppressWarnings(power_linmodels(mice_rb_3lmer, 
                Nsim = 100, Factor_row = 2)) #Strain is FixFac_2
[1] 100
suppressWarnings(power_linmodels(mice_rb_4lmer, 
                Nsim = 100, Factor_row = 2))
boundary (singular) fit: see help('isSingular')
(this message was repeated for many of the simulations)
[1] 99
suppressWarnings(power_linmodels(mice_rb_5lmer, 
                Nsim = 100, Factor_row = 2))
[1] 100

Results are more reliable with 1000s of simulations, which can take about 2-5 minutes for these models (and longer for more complex designs). Because the effect of Strain is smaller than that of Treatment, more blocks (independent experiments) are needed to detect it reliably. If we were mainly interested in the Treatment factor, we could use fewer mice because of the larger effect of Treatment on the response GST.

Simulating data structure based on initial experiments is thus quite flexible for assessing power and determining sample sizes.

Head over to grafify to see the make_...data functions, and other functions that make it easier to run linear models, perform post-hoc comparisons and plot graphs.