When the general public critiques research, I often hear them say that the samples are "too small." It's true that sample sizes (N) in psychology research should be large. One of the outcomes of the so-called "replication crisis" is that large samples are more and more important in psychology. But why?

A common misconception--held by both students and the general public--is that large samples are important because they ensure external validity. This belief is incorrect. External validity (that is, the ability to generalize from a sample to a population of interest) is about *how* a sample has been recruited, not *how many* people are in it (see Chapters 7 and 14). For example, say you recruited a sample of 1000 fans attending the national championship college football game. You'd have a pretty large sample, but you couldn't generalize from that sample to, say, college students in the U.S. In fact, unless the 1000 fans were selected *at random* from the 70,000 fans at the game, you couldn't even generalize from this sample to "people attending the national championship football game."

If not external validity, why are large samples important? It's about the accuracy of our statistical estimates. When we estimate values in the population, such as means or differences between means, estimates based on large samples are less likely to be influenced by chance variability. For example, imagine you're estimating the mean height of kindergarteners in your local school. Now imagine that you select 5 kindergarteners at random, one of whom, by chance, turns out to be extremely tall for her age. That tall kindergartener is going to "pull" the mean estimate upwards when combined with only 4 other kids. But what if you select 25 kindergarteners instead? Now the tall kindergartener is going to be balanced out by 24 other scores, and her height will have less influence on the mean estimate.
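To put rough numbers on the kindergartener example, here is a minimal Python sketch. The heights are made up for illustration (a typical height of 110 cm and one tall child at 130 cm are assumptions, not data), and it simplifies by giving every other child exactly the typical height.

```python
def mean_with_outlier(n, typical=110.0, outlier=130.0):
    """Mean height (cm) of n children: n - 1 at `typical`, one at `outlier`.

    Both default heights are hypothetical values chosen for illustration.
    """
    heights = [typical] * (n - 1) + [outlier]
    return sum(heights) / n


for n in (5, 25):
    m = mean_with_outlier(n)
    # The "pull" is how far the one tall child drags the mean above 110 cm.
    print(f"n = {n}: mean = {m:.1f} cm, pulled up by {m - 110.0:.1f} cm")
```

Under these assumptions, the same tall child raises the mean by 4 cm in the sample of 5 but by less than 1 cm in the sample of 25, because her extra 20 cm is spread over more scores.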

Below is a pair of animations that illustrate this principle. They come from the data science blog R Explorations, and they were made by using the program R to run a simulation over and over. First, the simulation created a very large population of scores whose mean was known to be 10.0 and whose standard deviation was known to be 1.0. Then it drew a random sample (of size 10 in the first animation), computed the mean of the sample's scores, and plotted the scores. You can watch the samples appear in real time in the animations below. Here, *xbar* is the sample's mean and *s* is the sample's standard deviation. The red line represents the mean for each sample as it is drawn:
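If you'd like to try a version of this simulation yourself, here is a rough sketch in Python rather than R. It is not the blog's code. It assumes the population is normally distributed (the description gives only the mean and standard deviation), it draws scores straight from that distribution instead of building a finite population first, and the function name, the number of draws (200), and the random seed are all arbitrary choices made for illustration.

```python
import random
import statistics


def simulate_sample_means(n, draws=200, mu=10.0, sigma=1.0, seed=0):
    """Draw `draws` random samples of size n and return each sample's mean.

    Assumes a normal population with mean `mu` and standard deviation `sigma`.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(draws):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


for n in (10, 1000):
    means = simulate_sample_means(n)
    # Low-to-high range and spread of the sample means across all draws.
    print(
        f"N = {n}: means range from {min(means):.2f} to {max(means):.2f}, "
        f"SD of the means = {statistics.stdev(means):.3f}"
    )
```

Each value in `means` plays the role of the red line for one sample, so comparing the printed ranges for the two sample sizes is a numeric counterpart to watching the two animations.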

a) First, watch the top animation, where N = 10. What do you notice about the movement of the vertical red line representing the mean in the top animation? What is it doing, and what does that represent?

b) Now watch the bottom animation, where N = 1000. What do you notice about the movement of the vertical red line representing the mean in this second animation? What is it doing, and what does that represent?

c) What do you notice about the s values of the two animations? Which animation has a steadier estimate of s?

d) Answer this one only if you've had a statistics course: Which of the two animations will have a smaller standard error? How is the standard error represented in the two animations?

e) Given the behavior of the two animations, explain why a large sample is important for research.

f) Which validity does sample size best address, if not external validity?

g) Let's tie this concept back to the "replication crisis" (or, as some are now calling it, "credibility revolution"*). When a finding in psychology has not replicated in a direct replication study, one reason might be that the *original study* used a small sample. Another reason might be that the *replication study* used a small sample. Why might the sample size of a study be linked to its replicability? Explain in your own words.

Both animations come from the R Explorations blog

*Thanks to Dr. Simine Vazire for showing us this animation at the NITOP conference and for introducing me to the term "credibility revolution."