Chapter 7 Statistical Sampling

A statistically valid sample is comprised of independent replicates of the experimental unit, which are generated using some random process.

To unpack this let’s think about each of the following terms:

  • What are experimental units?
  • What do we mean by independent replicates?
  • What is a random process?
  • When is statistical validity even important?

7.1 Experimental units

The experimental unit is the source of the measurement. An experimental unit can generate one or many measurement values.

I prefer the concept of an experimental unit to the concept of subject, though they often mean the same thing. I’ve found the word subject carries more ambiguity, especially for people first learning sampling and sample size concepts.

In some experimental designs (eg, unpaired or completely randomized) each experimental unit generates a single measurement value. Here a one-to-one correspondence exists between the number of experimental units and the number of measurement values within a data set.

In other designs (eg, paired or matched or repeated/related measure), a single experimental unit can generate more than one measurement values. Such data sets have more measurement values than experimental units.

Here are some guidelines for deciding what is the experimental unit in an experiment, with full recognition that sometimes there are gray areas. Ultimately the researcher has to use scientific judgment to recognize or define the experimental unit.

7.1.1 A simple test to define the experimental unit

When defining an experimental unit I recommend using a simple test:

Are these measurements intrinsically-linked?

If two or more measurement values are intrinsically-linked then they would comprise paired or matched or related measures from a single experimental unit.

So how could you judge whether two or more measurements are intrinsically-linked? For the most part, this happens when the source of those measurements doesn’t differ.

Here are a few examples:

A before and after design. A mouse is scored on how well it performs a behavioral task at baseline. After that same mouse receives a treatment. Then it is run through the behavioral test once more to get a second score. Those two scores are intrinsically-linked because they were taken from the same mouse. All that differs between the measured scores is the absence or presence of the treatment.

We would also say those two scores are matched, paired or related/repeated measures. A single mouse from which two scores are derived is an independent replicate of the experimental unit.

Twinning. Take for example a study involving human identical twins. In these studies identical twin pairs are modeled as a single experimental unit due to their high level of intrinsic relatedness. They are two distinct human subjects. But for statistical purposes they are modeled as a single pair; as a single experimental unit. The two measurements would be analyzed using a statistical method configured for paired or matched or repeated/related measures.

One of the pair receives a control condition while the other receives a treatment condition. A measurement is taken from each person. There are two measurements in total, and two people, but only a single experimental unit.

The two measurements would be analyzed using a statistical method configured for paired or matched or repeated/related measures.

Unpaired or completely randomized In contrast, imagine a study using the same control and treatment conditions using unrelated humans (or some other outbred animal species) as subjects. Each subject is assigned either a treatment or a control, and only a single measurement is taken from them. Since the subjects are each very different from each other, we could not conclude that measurements taken from them are intrinsically-linked. Each person stands alone as an experimental unit. The data would be analyzed using an unpaired, unmatched or completely randomized test.

Intrinsically-linked measurements are very common in bench work. In fact they are too often overlooked for what they are and mistakenly analyzed as unmatched. Experiments involving batches of biological material, cultured cells and/or littermates of inbred animal strains routinely involve intrinsically-linked measurements. As a general rule, these should always be designed and analyzed using matched/paired/related measures procedures.

Cell cultures Cell cultures are remarkably homogeneous. The typical continuous cell line is a monoculture passaged across many doubling generations.

Imagine a test conducted on a 6 well multi-well cell culture plate. Each well receives a different level of some treatment condition, such as a dosing or time-course study.

All of the wells were laid down at the same time from a common batch of cells. Each well is very highly related to all of the other wells. The intrinsic differences between wells would be relatively minor and mostly due to technical variation. There’s no real inherent biological variation from well-to-well other than that attributable to the level of treatment the well receives.

As a result, all of the measurements taken from a plate of wells are intrinsically-linked to each other. The experimental unit is the plate. They should be designed and analyzed using matched/paired/related measure statistical procedures.

Furthermore, any other plates laid down at the same time from the same source of cells are virtually identical clones of each other. If we were to expose the wells in all of those plates to various treatments followed by taking some measurement, then it is pretty easy to argue that all of those measurements taken on that passage of cells are intrinsically-linked. None of the wells are independent of any of the other wells, irrespective of the plate. Together, all of the plates represent a single experimental unit.

Inbred mice In many regards, the high level of relatedness within inbred mouse strains doesn’t differ from human identical twins, or from cultured cells, for that matter.

A given strain of these animals are inbred to genetic homogeneity across several generations. For all intents and purposes all mice derived from a given strain are immortalized clones of each other. Two mice from the same litter are identical twins. Indeed, two mice from different litters from the same strain are identical twins.

Due to their clonal identity all measurements taken from any of these highly related subjects are intrinsically-linked.

Just as for cell culture, protocols must be contrived to break up the homogeneity. A common approach is to treat the litter as the experimental unit and take measures from littermates as intrinsically-linked.

Split tissue Imagine two slices of an organ (or two drops of blood) taken from a single animal. Although the two slices (or drops of blood) are obviously different from each other, any measurements derived from each are intrinsically-linked. The experimental unit would be the animal from which that biological material is derived.

Batches Finally, imagine a batch of a purified protein or other biochemical material. The batch was isolated from a single source and prepared through a single process. The material in the batch is highly homogeneous, irrespective of whether it is stored away in aliquots. Any measurement taken from that batch are highly related to any other measurement. They are intrinsically-linked. The batch would be the experimental unit.

###Blocking

We have to contrive protocols to break up experimental units that have high inherent homogeneity. The statistical jargon used for this is blocking, such that blocks are essentially grouping factors that are not scientifically interesting.

Going back to culture plates. Let’s say we prepared three plates on Friday. An assay performed on one plate on Monday would represent one experimental unit of intrinsically-linked measures. An assay repeated on Tuesday on a second plate would represent a second experimental unit. Wednesday’s assay on the third plate is also its own experimental unit.

Here the blocking factor is the day of the week. Assuming we created fresh batches of reagents each day, there would be some day-to-day variation that wouldn’t exist if we assayed all three plates at once on a single day. But we’re not particularly interested in that daily variation, either.

More conservatively, cell line passage number can be used as a blocking factor to delineate experimental units. Each passage number would represent an experimental unit and the overall replicated experiment would be said to be blocked on passage number.

Defining the experimental unit and any blocking factors requires scientific judgement. That can be difficult to do when dealing with highly homogeneous material. What should be avoided is creating a design that limits random chance too severely. To measure on Monday all three plates that were laid down on Friday will probably yield tighter results than if they were blocked over the course of the week.

This has to be thought through carefully by the researcher in each and every case. Reasonable people can disagree what whether one approach is superior to some other. Therefore, what is important is to make defensible decisions. To do that, you need to think through this problem carefully. When in doubt, I suggest leaning towards giving random chance a fair shot at explaining the result you’re observing.

For example, you can make the case that measurements from two cell culture plates that were laid down on the same day but are collected on different days are not intrinsically-linked. That’s a harder case to make if they are collected on the same day.

You will almost certainly have to make the case that measurements taken from two mice on different days or if they are from different litters are not intrinsically-linked.

Before going there, we need to chat about what we mean by independent replication.

7.2 Independent Replicates

That we should strive for biological observations that are repeatable seems self evident.

An experiment is comprised of independent replicates of treatment conditions on experimental units. The total number of independent replicates comprises an experiment’s sample size.

A primary goal in designing an experiment is to assess independent replicates that are not biased to the biological response of a more narrowly defined group of experimental units.

A replicate is therefore independent when a repeat is on an experimental unit that differs materially from a previous experimental unit. A material difference could involve a true biological replicate. Measurements taken from two unrelated human subjects have a material difference. In bench biological work with fairly homogeneous systems (eg, cell lines and inbred animals) a material difference will usually need to be some separation among replicates in time and space in applying the experimental treatments.

7.2.1 A simple test for independence

How willing am I to certify this is a truly repeatable phenomenon when replicated in this way?

A new scientific discovery would be some kind of repeatable phenomenon.

7.2.2 Some replication examples

If we are performing an experiment using pairs of human twins, each pair that is studied stands as an independent replicate. Because the pair is the experimental unit, a study involving 5 pairs will have five, rather than ten, independent replicates.

If we conduct an experiment using unrelated human volunteers, or some other out bred animals, each person or animal from whom a measurement is recorded is considered an independent replicate. Their biological uniqueness defines their independence.

We wander into gray areas pretty quickly when thinking about the independence of experimental units in studies involving cultured cells, batches of biological material, and inbred mice. Working with these systems it is difficult to achieve the gold standard of true biological independence. The focus instead should be on repeatability….“Working with new batches of reagents and different days do I get the same response?”

Imagine a 6 well plate of cultured cells. No well differs biologically from any other. If each well received a repeat of the same treatment at the same time we shouldn’t consider any measurements from that plate independent from others. Otherwise, the sample would be biased to that plate of cells measured at that particular time with a given set of reagents under those particular conditions. It is too biased to that moment. What if we screwed up the reagents and don’t know it?

Rather than being independent, it is best to consider the 6 measurements drawn from the plate as technical replicates or pseudo replicates. The data from the 6 wells should be averaged or totaled somehow to improve the estimate of what happened on that plate that day.

A better approach with cultured cells is to use passage numbers to delineate independence. Thus, a 6 well plates from any one passage are independent experimental units relative to all other passages.

Obviously, given the homogeneity of cells in culture, it’s unlikely there is much biological variation even by these criteria. But to achieve true biological independence would require re-establishing the cell line each time an independent replicate was needed. That’s rarely feasible.

Inbred mice pose much the same problem. Scientific judgment is needed to decide when 2 mice from the same strain are independent of each other.

One mark of delineation is the litter. Each litter would be independent of other litters. Outcomes of two (or more) littermates could be considered matched or related-measures and thus one experimental unit.

7.3 Random process

You can probably sense intuitively how randomization can guard against a number of biases, both systematic and cognitive. Systematic artifacts become randomly distributed among the sample replicates, whereas you are less tempted to treated a replicate as preferred if you don’t know what is its treatment level.

Mathematical statistics offers another important reason for randomization. In classical statistics the effect size of some treatment is assumed to be fixed. Our estimate of that real value is the problem. Thus, when we measure a value for some replicate, that value is comprised of a combination of these fixed effects and unexplained effects.

The variation we observe in our outcome variables, the reason it is a random variable, arises from these unexplained effects. These can be particularly prominent in biological systems. Randomization procedures assures those random effects are truly random. Otherwise we might mistake them for the fixed effects that are of more interest us! This concept will be discussed more formally in the section on general linear models.

Suffice to say for pragmatic purposes that random sampling is crucial for limiting intentional and unintentional researcher biases.

Either the experimental units should be selected at random, or the experimental units should be assigned treatments at random, and/or the outcome data should be evaluated at random (eg, blind). Sometimes, doing a combination of these would be even better.

Usually, the researcher supervises this randomization using some kind of random number generator. R’s sample() function gets that job done for most situations.

Let’s design an experiment that involves two treatments and a total of 12 independent experimental units. Thus, 6 experimental units will each receive either of the two treatments. Let’s say that my experimental units each have an ID, in this case, a unique letter from the alphabet. Using sample(1:12) we randomly assign a numeric value to each ID. This numeric value will be the order by which the experimental unit, relative to the other experimental units, is subjected to the experimental treatment. ID’s that are assigned even random numbers get one of the two treatments, and odd numbered ID’s get the other treatment.

What we’ve done here is randomize both the order of replication and the assignment of treatment. That’s a well-shuffled deck. You can see how this approach can be readily adapted to different numbers of treatment levels and sample sizes.

set.seed(1234)
ID <- letters[1:12]
order <- sample(1:12, replace=F)
plan <- data.frame(ID, order)
plan
##    ID order
## 1   a     2
## 2   b     7
## 3   c    11
## 4   d     6
## 5   e    10
## 6   f     5
## 7   g     1
## 8   h    12
## 9   i     3
## 10  j     8
## 11  k     4
## 12  l     9

7.4 Statistically valid samples

For any statistical test to be valid, each replicate within a sample must satisfy the following two criteria:

  • The replicate should be generated by some random process.
  • The replicate must be independent of all other replicates.

Why? Statistical tests are one of the last stages of a hypothesis testing process. All of these tests operate, formally, on the premise that at least these two conditions are true.

When these conditions have not been met the researcher is collecting data without testing a hypothesis. To run a statistical test is to pretend a hypothesis has been tested, when it has not.

7.4.1 Select random subjects

Let’s say we want to do an experiment on graduate students and need to generate a representative sample. There are 5 million people in the US who are in graduate school at an given time. Let’s imagine they each have a unique ID number, ranging from 1 to 5,000,000. We can use R’s sample() function to randomly select three individuals with numbers corresponding to that range.

Sampling with replacement involves throwing a selection back into a population, where it can potentially be selected again. In that way, the probability of any selection stays the same throughout the random sampling process.

Here, the replace = FALSE argument is there to ensure I don’t select the same individual twice.

sample(x=1:5000000, size=3, replace = FALSE)
## [1] 1413668 4617167 1461579

All that needs to be done is to notify the three people corresponding to those IDs and schedule a convenient time for them to visit so we can do our experiment.

You can imagine several variations to randomly select graduate students for measurements. You just need a way to find graduate students, then devise a way(s) to ensure the sampling is as representative as possible. Selecting subjects from a real population is pretty straight forward, a bit like picking 8 lotto balls from a spinning container.

A lot of times in experimental work the number of subjects available to the researcher is fixed and smaller. The size of the population to be sampled can be much closer to the number of replicates needed for the experiment rather than a sample from a large pool.

In these cases we have to come up with other ways to randomize.

7.4.2 Randomize to sequence

For example, let’s say we want to compare condition A to condition B. We have 6 subjects to work with, each of which will serve as an independent replicate. We want a balanced design so will have 3 replicates for each of the 2 conditions.

Let’s imagine we can only perform an experiment on one subject, one day at a time. In that case, it makes sense to randomize treatment to sequence.

We can randomly generate a sequence of 6 even and odd numbers, and assign them to the daily sequence (MTWTFM) based on which random number is first on its list. We can make a rule that subjects assigned even numbers will receive condition A, whereas condition B is meted out to subjects associated with odd numbers.

sample(x=11:16, size=6, replace = FALSE)
## [1] 16 12 15 11 13 14

7.4.3 Randomize to location

Let’s imagine 3 treatments (negative control, positive control, experimental), that we will code 1,1,2,2,3,3. These will be applied in duplicate to cells on 6-well cell culture plate. We’ll code the plate wells with letters, a, b, c, d, e, f from top left to bottom right (ie, a and b are wells in the top row).

Now we’ll generate a random sequence of those six letters.

sample(letters[1:6], replace=F)
## [1] "b" "a" "e" "d" "f" "c"

Next, we’ll map the sequence 1,1,2,2,3,3 to those letters. Thus, negative control goes to the wells corresponding to the first two letters in that sequence, positive control to the 3rd and 4th letters, and so forth.

7.4.4 Randomize to block

In statistical lingo, a block is a subgroup within a sample. A blocked subject shares some feature(s) in common with other members of its block compared to other subjects in the overall sample. But usually, we’re not interested in block as a variable, per se.

Here are some common blocks at the bench are

  • One purified enzyme preparation vs a second preparation of the same enzyme, nominally purified the same way. The two enzyme preps represent two different blocks.
  • A bunch of cell culture dishes plated on Friday from passage number 15 vs ones plated on Tuesday from passage number 16. The two passages represent 2 different blocks.
  • A litter of mouse pups born in January vs a litter born in February. The two different litters represent two different blocks.
  • An experiment run with freshly prepared reagents on Monday vs one run on Tuesday, with a new set of freshly prepared reagents. Each experimental day represents a block.

Frequently, each block is taken as an independent replicate.

7.5 Independence of replicates

In biomedical research the standard is for biological independence; when we speak of “biological replicates” we mean that each independent replicate represents a distinct biological entities.

That standard is difficult to meet when working with many common biological model systems, particularly cell lines and inbred animals.

The definition of statistical independence is grounded in the mathematics of probability: Two events are statistically independent when they convey no information about the other, or \[p(A \cap B)=p(A)p(B)\].

Here the mathematics is not particularly helpful. Imagine two test tubes on the bench, each receives an aliquot of biological material from a common prep (eg, a purified protein). One tube then receives treatment A and the other treatment B. As best we know, the two tubes aren’t capable of influencing each other. But we can reasonably assume their responses to the treatments will at least be correlated, given the common source of biological material. Should each tube be treated as if it were statistically independent?

Replicate independence that meets statistical validity therefore has to take on a more pragmatic and nuanced definition. My preference is to define a replicate as the independent experimental unit receiving treatment. I like this because it allows for defining the experimental unit differently depending upon the experimental design.