Chapter 27 Simulating correlated variables
library(pwr)
library(tidyverse)
Experimental designs involving paired (or related/repeated) measures are used when two or more groups of measurements are expected to be intrinsically linked.
Take, for example, a before-and-after design. A measure is taken before the imposition of some level of a predictor variable, and then again afterwards. The difference between those two measures is the size of the effect.
Those two measures are intrinsically linked because they arise from a common subject, and subjects are tuned differently. You can imagine that a subject with a low basal level of the measure will generate a low-end response after some inducer, whereas one with a high basal level will generate a high-end response.
Statistically, these intrinsically linked measurements within such designs are said to be correlated.
Monte Carlo simulations of experimental power afford the opportunity to account for the level of correlation between such paired measurements. Building in an expectation for correlation can dramatically affect the expected power, and thus the sample size to plan for.
27.1 Estimating correlation between two variables
How do you estimate correlation?
Inevitably you’ll run an experiment where the actual values of the dependent variable, at first blush, differ wildly from replicate to replicate.
But on closer inspection, a more consistent pattern emerges. For example, an inducer always seems to elicit close to a 2-fold response relative to a control, and this response is consistently inhibited by about half by a suppressor. That consistency in the fold-response, irrespective of the absolute values of the variable, is the mark of high correlation!
Here are some data to illustrate this problem: four independent replicates of the same experiment, which measures NFAT-driven luciferase reporter gene output, each performed on a different passage of a cultured cell line. The data set has several other treatment levels, but those corresponding to vehicle and drug represent the negative and positive responses, respectively.
Luciferase reacts with luciferin to produce light. The values here are in arbitrary light units, on a continuous scale that begins at zero and is linear for at least 5 orders of magnitude. Thus, the values of the variable can be assumed to be normally distributed.
Here’s the experimental data:

id | vehicle | drug
---|---|---
P11 | 20.2 | 38.3
P12 | 5.7 | 9.1
P13 | 2.1 | 3.6
P14 | 9.9 | 15.5
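To reproduce the calculations below, these values can be entered into a data frame named df, the same object referenced in the code that follows. A minimal sketch:

df <- tibble(
  id = c("P11", "P12", "P13", "P14"),
  vehicle = c(20.2, 5.7, 2.1, 9.9),
  drug = c(38.3, 9.1, 3.6, 15.5)
)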
The data show that the luciferase values in response to vehicle wander substantially, over a roughly 10-fold range, across passages. Yet the drug response expressed as a ratio to the vehicle is much more consistent from passage to passage.
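You can see this by computing the fold-response within each passage, using the df object defined above:

df %>%
  mutate(fold = drug/vehicle)

Each passage yields a fold-response in the neighborhood of roughly 1.6 to 1.9, despite the roughly 10-fold spread in absolute values.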
In fact, the two variables, vehicle and drug, are very highly correlated:
cor(df$vehicle, df$drug)
## [1] 0.9955158
ggplot(df, aes(vehicle, drug)) +
  geom_point(size = 4, color = "#012169")
This example points to how you can derive an estimate for the correlation coefficient between two variables. Simply plot their replicates as XY pairs and calculate the correlation coefficient using R’s cor function.
Where do you find values for these variables? They can come from pilot data or from published data.
27.3 Monte Carlo simulation
Here’s a Monte Carlo simulation of a paired t-test between an A and a B group. The “true” effect size is programmed to be very modest. The code also factors in a fairly strong correlation between the two measures of the variable.
# Initial parameters

# sample A initial true parameters
nA <- 3
meanA <- 1.5
sdA <- 0.5

# sample B initial true parameters
nB <- 3
meanB <- 2.0
sdB <- 0.5

alpha <- 0.05
nSims <- 10000 # number of simulated experiments
p <- numeric(nSims) # set up empty container for all simulated p-values

# correlation coefficient
r <- 0.8

# the Monte Carlo loop
for(i in 1:nSims){ # for each simulated experiment
  x <- rnorm(n = nA, mean = meanA, sd = sdA) # nA simulated replicates with mean and SD
  y <- rnorm(n = nB, mean = meanB, sd = sdB) # nB simulated replicates with mean and SD
  w <- r*x + sqrt(1-r^2)*y # make w correlated with x at level r
  z <- t.test(x, w, paired = TRUE) # perform the paired t-test
  p[i] <- z$p.value # get the p-value and store it
}

# the output
hits <- length(which(p < alpha)); hits
## [1] 7027
power <- hits/nSims; power
## [1] 0.7027

# now plot the histogram of simulated p-values
hist(p, col = "blue", ylim = c(0, 10000), xlim = c(0, 1),
     main = "Histogram of simulated p-values", xlab = "Observed p-value")
The result as written above is a bit underpowered, but not too shabby.
Now re-run the code after dialing the correlation down to r = 0. How much more underpowered is the planned experiment? Factoring in the correlation between variables makes a huge difference.
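To make that comparison easy, the simulation can be wrapped in a function that takes the correlation coefficient as an argument. Here is a minimal sketch using the same parameters as above; the function name simPower and its arguments are illustrative, not part of the original code:

# a sketch: wrap the simulation so power can be estimated at any correlation r
simPower <- function(r, nSims = 10000, n = 3,
                     meanA = 1.5, meanB = 2.0, sd = 0.5, alpha = 0.05) {
  p <- numeric(nSims)
  for (i in 1:nSims) {
    x <- rnorm(n, mean = meanA, sd = sd)
    y <- rnorm(n, mean = meanB, sd = sd)
    w <- r*x + sqrt(1 - r^2)*y        # same correlation trick as above
    p[i] <- t.test(x, w, paired = TRUE)$p.value
  }
  sum(p < alpha)/nSims                # proportion of hits = estimated power
}

simPower(r = 0.8) # should land close to the power simulated above
simPower(r = 0)   # no correlation: how much power is lost?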