**What Underlies Sample Size Calculations**

Gerard E. Dallal, Ph.D.

Just as the analysis of a set of data is determined by the research question and the study design, the way the sample size is estimated is determined by the way the data will be analyzed. This note (at least until the next draft!) is concerned with comparing population means. There are similar methods for comparing proportions and different methods for assessing correlation coefficients. Unfortunately, it is not uncommon to see sample size calculations that are totally divorced from the study for which they are being constructed, because the sample sizes are calculated for analyses that will never be used to answer the question prompting the research. The way to begin, then, is by thinking of the analysis that will ultimately be performed, to ensure that the corresponding sample size calculations are used. This applies even to comparing two population means. If experience suggests a logarithmic transformation will be applied to the data prior to formal analysis, then the sample size calculations should be performed in the log scale.

Studies are generally conducted because an investigator expects to see
a specific treatment effect.^{*} Critical regions and tests of
significance are determined by the way data should behave if there is no
treatment effect. Sample sizes are determined by the way data should
behave if the investigator has estimated the treatment effect
correctly.^{**}

**Independent Samples**

Consider a study using two independent samples to compare their population means. Let the common population standard deviation be 60. The behavior of the difference in sample means under the null hypothesis of equal population means is illustrated by the normal distributions on the left-hand side of displays (a) through (d) for sample sizes of 12, 24, 48, and 96 per group, respectively.

Suppose the investigator expects the difference in population means to be 50 units. Then, the behavior of the difference in sample means is described by the curves on the right-hand side of the displays.

Things to notice about (a)--(d):

- The horizontal scales are the same.
- The normal curves on the left-hand side of the display are centered at 0.
- As the sample size increases, the distribution of the difference in sample means, as given by the normal curves on the left-hand side of the displays, is more tightly concentrated about 0.
- The critical values for a 0.05 level test--sample mean differences that will lead to rejecting the hypothesis of equal population means--are given by the vertical dashed lines. The critical region is shaded red. If the mean difference falls outside the vertical lines (in the critical region), the hypothesis of equal population means is rejected.
- As the sample size increases, the critical values move closer to 0. This reflects the common sense notion that the larger the sample size, the harder it is (less likely) for the sample mean difference to be at any distance from 0.

Other things to notice about (a)--(d):

- The normal curves on the right-hand side of the display are centered at 50.
- As the sample size increases, the distribution of the difference in sample means, as given by the normal curves on the right-hand side of the displays, is more tightly concentrated about 50.
- As the sample size increases, more of the curve on the right-hand side of the displays falls into the critical region. The portion of the distribution on the right-hand side of the displays that falls into the critical region is shaded blue.

The region shaded blue gives the power of the test. It is 0.497, 0.807, 0.981, and 1.000 for panels (a) through (d), respectively.
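
The blue area can be approximated numerically. The sketch below is a minimal illustration using only the normal distribution from Python's standard library; the function name `approx_power` is hypothetical. The power values quoted above appear to rest on the t distribution, so this normal-theory approximation should run somewhat higher for the smaller sample sizes.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(n_per_group, delta=50.0, sigma=60.0, alpha=0.05):
    """Normal-approximation power of a two-sided, two-sample test of equal means."""
    std_normal = NormalDist()
    # Standard error of the difference between two independent sample means
    se = sigma * sqrt(2.0 / n_per_group)
    # Two-sided critical value (the dashed lines), in standard-error units
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected difference, in standard-error units
    shift = delta / se
    # Area of the right-hand curve that lands in either tail of the critical region
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

for n in (12, 24, 48, 96):
    print(n, round(approx_power(n), 3))
```

The qualitative pattern is the same as in the displays: power is close to one-half at 12 per group, crosses 80% at about 24 per group, and is essentially 1 at 96 per group.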

Choosing a sample size is just a matter of getting the picture "just right", that is, seeing to it that there's just the right amount of blue.

It seems clear that a sample size of 12 is too small because there's a large chance that the expected effect will not be detected even if it is true. At the other extreme, a sample size of 96 is unnecessarily large. Standard practice is to choose a sample size such that the power of the test is no less than 80% when the effect is as expected. In this case, the sample size would be 24 per group. Whether a sample size larger than 24 should be used is a matter of balancing cost, convenience, and concern that the effect not be missed.

The pictures show how the sample size is a function of four quantities.

- the presumed underlying difference ($\delta$), that is, the *expected difference* between the two population means should they be unequal. In each of the displays, changing the expected difference moves the two distributions farther apart or closer together. This will affect the amount of area that is shaded blue. Move them farther apart and the area increases. Move them closer together and the area decreases.
- the *within group standard deviation* ($\sigma$), which is a measure of the variability of the response. The width of the curves in the displays is determined by the within group standard deviation and the sample size. If the sample size is fixed, then the greater/smaller the standard deviation, the wider/narrower the curves. If the standard deviation is fixed, then the larger/smaller the sample size, the narrower/wider the curves. Changing the width of the curves will move the critical values, too. Displays (a)--(d) were constructed for different sample sizes with the population standard deviation fixed. However, the same pictures could have been obtained by holding the sample size fixed but changing the population standard deviation.
- the size or *level of the statistical test* ($\alpha$). Decreasing the level of the test--from 0.05 to 0.01, say--moves the critical values farther away from 0, reducing the amount of area that is shaded red. It also reduces the amount of area shaded blue. This represents a trade-off. Reducing the amount of area shaded red reduces the probability of making an error when there is no difference. This is good. Reducing the amount of area shaded blue reduces the probability of making the correct decision when the difference is as expected. This is bad.
- the probability of rejecting the hypothesis of equal means if the difference is as specified, that is, the *power of the test* ($\pi$) when the difference in means is as expected. This is the area that is shaded blue.

The sample size is determined by the values of these four quantities. Specifying the expected mean difference locates the centers of the distributions on the number line. Picking the size of the test determines the amount of area that will be shaded red. For a fixed sample size, it also determines the critical values and the amount of area that will be shaded blue. Increasing the sample size makes the distributions narrower, which moves the critical values closer to the mean of the distribution of the test statistic under the null hypothesis. This increases the amount of area shaded blue.

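In practice, we don't draw lots of diagrams. Instead, there is a formula that yields the per group sample size when the four quantities are specified. For large samples, the per group sample size is given by

$$n = \frac{2\,(z_{1-\alpha/2} + z_{\pi})^2\,\sigma^2}{\delta^2},$$

where $\delta$ is the expected difference, $\sigma$ is the within group standard deviation, $\alpha$ is the level of the test, $\pi$ is the power, and $z_p$ denotes the 100p-th percentile of the standard normal distribution.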

Technical detail: For small sample sizes, percentiles of the t distribution replace the percentiles of the normal distribution. Since the particular t distribution depends on the sample size, the equation must be solved iteratively (trial-and-error). There are computer programs that do this with little effort.
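
As a rough illustration of the large-sample calculation, the sketch below evaluates the standard normal-theory expression 2[(z for the test level + z for the power) x SD / difference]^2 using Python's standard library. The function name `n_per_group` is hypothetical, and the iterative t-based refinement described in the technical detail is deliberately not implemented.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Large-sample per group size for a two-sided, two-sample comparison of means."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # percentile for the level of the test
    z_power = std_normal.inv_cdf(power)          # percentile for the desired power
    return 2 * ((z_alpha + z_power) * sigma / delta) ** 2

# Values from the independent-samples displays: SD 60, expected difference 50
n = n_per_group(delta=50, sigma=60)
print(ceil(n))  # large-sample answer, rounded up
```

With these inputs the large-sample answer comes out a little below the 24 per group suggested by the displays, which is the direction one would expect given that the t-based calculation asks for slightly more subjects.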

The sample size increases with the **square** of the within group
standard deviation and decreases with the **square** of the expected
mean difference. If, for example, when testing a new treatment a
population can be found where the standard deviation is half that of
other populations, the sample size will be cut by a factor of 4.

The alternative to equality must be realistic. The larger the expected difference, the smaller the required sample size. It can be QUITE TEMPTING to overstate the expected difference to lower the sample size and convince one's self or a funding agency of the feasibility of the study. All this strategy will do, however, is cause a research team to spend months or years engaged in a hopeless investigation--an underpowered study that cannot meet its goals. A good rule is to ask whether the estimated difference would still seem reasonable if the study were being proposed by someone else.

The power, $\pi$ (that is, the probability of rejecting H0 when the alternative holds), can, in theory, be made as large or small as desired. Larger values of $\pi$ require larger sample sizes, so the experiment might prove too costly. Smaller values of $\pi$ require smaller sample sizes, but only by reducing the chances of observing a significant difference if the alternative holds. Most funding agencies look for studies with at least 80% power. In general, they do not question the study design if the power is 80% or greater. Experiments with less power are considered too chancy to fund.

**When the Response Is a Single Measurement**

The estimate of the within group standard deviation often comes from similar studies, sometimes even 50 years old. If previous human studies are not available to estimate the variability in a proposed human study, animal studies might be used, but animals in captivity usually show much less variability than do humans. Sometimes it is necessary to guess or run a pilot study solely to get some idea of the inherent variability.

Many investigators have difficulty estimating standard deviations simply because it is not something they do on a regular basis. However, standard deviations can often be obtained in terms of other measures that are more familiar to researchers. For example, a researcher might specify a range of values that contains most of the observations. If the data are roughly normally distributed, this range could be treated as an interval that contains 95% of the observations, that is, as an interval of length $4\sigma$. The standard deviation, then, is taken to be one-fourth of this range. If the range were such that it contains virtually all of the population, it might be treated as an interval of length $6\sigma$. The standard deviation, then, is taken to be one-sixth of this range.
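
The range-based guess is simple arithmetic, sketched here with a hypothetical helper `sd_from_range` and a made-up range of 140 to 260 mg/dl chosen only for illustration.

```python
def sd_from_range(low, high, covers="most"):
    """Rough SD guess from a range, assuming approximately normal data.

    covers="most": the range holds about 95% of observations (4 SDs wide).
    covers="all":  the range holds virtually everyone (6 SDs wide).
    """
    divisors = {"most": 4.0, "all": 6.0}
    if covers not in divisors:
        raise ValueError("covers must be 'most' or 'all'")
    return (high - low) / divisors[covers]

# Hypothetical range, not taken from any study
print(sd_from_range(140, 260, "most"))  # 30.0
print(sd_from_range(140, 260, "all"))   # 20.0
```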

Underestimating the standard deviation to make a study seem more feasible is as foolhardy as overestimating an expected difference. Such estimates result in the investment of resources in studies that should never have been performed. Conservative estimates (estimates that lead to a slightly larger sample size) are preferable. If a study is feasible when conservative estimates are used, then it is well worth doing.

**When the Response Is a Difference**

When the response being studied is a change or a difference, the sample size formulas require the standard deviation of the difference between measurements, not the standard deviation of the individual measurements. It is one thing to estimate the standard deviation of total cholesterol when many individuals are measured once; it is quite another to estimate the standard deviation of the change in cholesterol levels when changes are measured.

**One trick that might help:** Often a good estimate of the
standard deviation of the differences is unavailable, but we have
reasonable estimates of the standard deviation of a single measurement.
The standard deviations of the individual measurements will often be
roughly equal. Call that standard deviation $\sigma$. Then, the standard deviation of the paired
differences is equal to

$$\sigma\sqrt{2(1-\rho)},$$

where $\rho$ is the correlation between the two measurements.
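
To see how this works numerically: when two measurements share a standard deviation and have correlation rho, the standard deviation of their difference is the common SD times the square root of 2(1 - rho). The sketch below tabulates that expression; the function name `sd_of_difference` and the example SD of 60 are assumptions made for illustration.

```python
from math import sqrt

def sd_of_difference(sigma, rho):
    """SD of (second - first) when both measurements have SD sigma and correlation rho."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie between -1 and 1")
    return sigma * sqrt(2.0 * (1.0 - rho))

# Illustrative single-measurement SD of 60 at several assumed correlations
for rho in (0.0, 0.5, 0.8, 0.9):
    print(rho, round(sd_of_difference(60.0, rho), 1))
```

At a correlation of 0.5 the standard deviation of the differences equals that of a single measurement, and it shrinks as the correlation rises.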

**Many Means**

Sometimes a study involves the comparison of many treatments. The
statistical methods are discussed in detail under *Analysis of Variance
(ANOVA)*. Historically, the analysis of many groups begins by asking
whether all means are the same. There are formulas for calculating the
sample size necessary to reject this hypothesis according to the
particular configuration of population means the researchers expect to
encounter. These formulas are usually a bad way to choose a sample size
because the purpose of the experiment is rarely (never?) to see whether
all means are the same. Rather, it is to catalogue the differences. The
sample size that may be adequate to demonstrate that the population means
are not all the same may be inadequate to demonstrate exactly where the
differences occur.

When many means are compared, statisticians worry about the problem of
multiple comparisons, that is, the possibility that some comparison may
be called statistically significant simply because so many comparisons were performed. Common sense
says that if there are no differences among the treatments but six
comparisons are performed, then the chance that something reaches the
level of statistical significance is a lot greater than 0.05. There are
special statistical techniques such as *Tukey's Honestly Significant
Differences (HSD)* that adjust for multiple comparisons, but there are
no easily accessible formulas or computer programs for basing sample size
calculations on them. Instead, sample sizes are calculated by using a
Bonferroni adjustment to the size of the test, that is, the nominal size
of the test is divided by the number of comparisons that will be
performed. When there are three means, there are three possible
comparisons (AB,AC,BC). When there are four means, there are six
possible comparisons (AB,AC,AD,BC,BD,CD), and so on. Thus, when three
means are to be compared at the 0.05 level, the two-group sample size
formula is used, but the size of each individual comparison is taken to
be 0.05/3 (=0.0167). When four means are compared, the size of the test
is 0.05/6 (=0.0083).
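
The Bonferroni bookkeeping can be sketched as follows. The helper name `bonferroni_plan` is hypothetical, the large-sample normal formula stands in for whatever program would be used in practice, and the SD of 60 and difference of 50 are borrowed from the earlier two-group illustration rather than from any multi-group study.

```python
from math import ceil, comb
from statistics import NormalDist

def bonferroni_plan(k_means, delta, sigma, alpha=0.05, power=0.80):
    """Return (number of pairwise comparisons, per-comparison level, per group n)."""
    n_comparisons = comb(k_means, 2)      # 3 means -> 3 pairs, 4 means -> 6 pairs
    alpha_each = alpha / n_comparisons    # Bonferroni-adjusted size of each test
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(1 - alpha_each / 2) + std_normal.inv_cdf(power)
    # Two-group large-sample formula evaluated at the adjusted level
    n = 2 * (z_sum * sigma / delta) ** 2
    return n_comparisons, alpha_each, ceil(n)

for k in (2, 3, 4):
    print(k, bonferroni_plan(k, delta=50, sigma=60))
```

The per group sample size grows as the number of means grows, because each individual comparison is held to a stricter level.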

**The Log Scale**

Sometimes experience suggests a logarithmic transformation will be
applied to the data prior to formal analysis. This corresponds to
looking at ratios of population parameters rather than differences. When
the analysis will be performed in the log scale, the sample size
calculations should be performed in the log scale, too. If only summary
data are available for sample size calculations and they are in the
original scale, the behavior in the log scale can be readily
approximated. The expected difference in means in the log scale is
approximately equal to the log of the ratio of means in the original
scale. The common within group standard deviation in the natural log
scale (base *e*) is approximately equal to the coefficient of
variation in the original scale (the roughly constant ratio of the within
standard deviation to the mean). If the calculations are being performed
in the common log scale (base 10), divide the cv by 2.3026 to estimate
the common within group standard deviation.

Example: ($\alpha$=0.05, $\pi$=0.80) Suppose a response will be analyzed in the
log scale and that in the original scale, the population means are
expected to be 40 and 50 mg/dl and the common coefficient of variation
($\sigma/\mu$) is estimated
to be 0.30. Then, in the (natural) log scale the estimated effect is
ln(50/40) = ln(1.25) = 0.2231 and the common within group standard deviation
is estimated to be 0.30 (the cv). The per group sample size is
approximately 1+16(0.30/0.2231)^2 or 30. In the common log scale, the
estimated effect is log(50/40) = 0.0969 and the common within
group standard deviation is estimated to be 0.30/2.3026 = 0.1303. The per
group sample size is approximately 1+16(0.1303/0.0969)^2 or 30. It is not
an accident that the sample sizes are the same. The choice of a
particular base for logarithms is like choosing to measure height in cm
or in. It doesn't matter which you use **as long as you are
consistent!** No mixing allowed!

A few things worth noting:

- log(40/50) = -0.0969, that is, -log(50/40). Since this quantity is squared when sample sizes are being estimated, it doesn't matter which way the ratio is calculated.
- The cv estimates the common within group SD only for natural-log transformed data. When you take the log of the ratio to estimate the treatment effect in the log scale, you can pick whichever type of log you prefer, but if you calculate the treatment effect in logs of a different base, you have to adjust the cv accordingly.
- 2.3026, the factor which, when divided into natural logs, converts *ln*s to *log*s, is ln(10).
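
The worked example can be checked with a few lines of Python. The sketch uses the same 1+16(SD/difference)^2 approximation as the example; the function name `approx_n` is hypothetical.

```python
from math import ceil, log, log10

def approx_n(effect, sd):
    """Per group size from the 1 + 16 (SD / difference)^2 approximation used above."""
    return 1 + 16 * (sd / effect) ** 2

mean_a, mean_b, cv = 40.0, 50.0, 0.30

# Natural log scale: effect is ln(ratio), SD is approximately the cv
n_natural = approx_n(log(mean_b / mean_a), cv)

# Common log scale: effect is log10(ratio), SD is approximately cv / ln(10)
n_common = approx_n(log10(mean_b / mean_a), cv / log(10))

print(ceil(n_natural), ceil(n_common))  # the two answers agree
```

The two bases agree because both the effect and the standard deviation are divided by the same constant, ln(10), which cancels in the ratio.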

A potential **gotcha!**: When calculating the treatment effect in
the log scale, you can never go wrong calculating the log of the ratio of
the means in the original scale. However, you have to be careful if the
effect is stated in terms of a percent increase or decrease. Increases
and decreases are not equivalent. Suppose the standard treatment yields
a mean of 100. A 50% increase gives a mean of 150. The ratio of the
means is 150/100 (=3/2) or 100/150 (=2/3). Now consider a 50% decrease from
standard. This leads to a mean of 50. The ratio is now 100/50(=2) or
50/100(=1/2). There's no trick here. The mathematics is correct. The
message is that you have to be careful when you translate statements
about expected effects into numbers needed for the formal
calculations.
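
A short sketch makes the asymmetry concrete. The helper `log_effect` is hypothetical; it converts a stated percent change into the treatment effect in the natural log scale.

```python
from math import log

def log_effect(percent_change):
    """Treatment effect in the natural log scale for a stated percent change."""
    ratio = 1 + percent_change / 100.0
    if ratio <= 0:
        raise ValueError("a decrease of 100% or more has no log-scale effect")
    return log(ratio)

increase = log_effect(50)    # ratio 150/100 = 3/2
decrease = log_effect(-50)   # ratio 50/100 = 1/2

print(round(increase, 4), round(decrease, 4))
# Sample size varies with 1 / effect^2, so the 50% increase needs this many
# times the sample size of the 50% decrease, all else being equal:
print(round((decrease / increase) ** 2, 2))
```

In the log scale a 50% decrease is the mirror image of a 100% increase, not of a 50% increase, so the two statements lead to different sample sizes.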

**Dealing With Paired Responses**

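Sometimes responses are truly paired. Two treatments are applied to the same individual or the study involves matched or paired subjects. In the case of paired samples, the formula for the total number of pairs is the same as for the number of independent samples except that the factor of 2 is dropped, that is,

$$n = \frac{(z_{1-\alpha/2} + z_{\pi})^2\,\sigma_d^2}{\delta^2},$$

where $n$ is the number of pairs, $\sigma_d$ is the standard deviation of the within-pair differences, $\delta$ is the expected mean difference, and $z_{1-\alpha/2}$ and $z_{\pi}$ are the standard normal percentiles for a test of level $\alpha$ and power $\pi$.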

It is clear from the formulas why paired studies are so attractive. First, there is the factor of 2. All other things being equal, a study of independent samples that requires, say, 100 subjects per group or a total of 200 subjects, requires only 50 pairs for a total of 100 subjects. Also, if the pairing is highly effective, the standard deviation of the differences within pairs can be quite small, thereby reducing the sample size even further. However, these savings occur because elements within the same pair are expected to behave somewhat the same. If the pairing is ineffective, that is, if the elements within each pair are independent of each other, the standard deviation of the difference will be such that the number of pairs for the paired study turns out to be equal to the number of subjects per group for the independent samples study so that the total sample size is the same.
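
The comparison can be sketched numerically with the large-sample normal formulas. The function names are hypothetical, and the SD of 60 and difference of 50 are borrowed from the independent-samples illustration. When the members of a pair are independent, the standard deviation of their difference is the single-measurement SD times the square root of 2, which is what makes the two designs coincide.

```python
from math import sqrt
from statistics import NormalDist

def z_sum_squared(alpha=0.05, power=0.80):
    """Squared sum of the normal percentiles for the test level and the power."""
    std_normal = NormalDist()
    return (std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)) ** 2

def n_independent(delta, sigma):
    """Subjects per group for two independent samples."""
    return 2 * z_sum_squared() * (sigma / delta) ** 2

def n_pairs(delta, sigma_diff):
    """Number of pairs; sigma_diff is the SD of the within-pair differences."""
    return z_sum_squared() * (sigma_diff / delta) ** 2

sigma, delta = 60.0, 50.0
print(round(n_independent(delta, sigma), 1))        # per group, independent samples
print(round(n_pairs(delta, sigma * sqrt(2)), 1))    # ineffective pairing: same number
print(round(n_pairs(delta, sigma), 1))              # SD of differences = sigma: half
```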

There is a more important concern than ineffective pairing. When some investigators see how the sample sizes required for paired studies compare with those involving independent samples, their first thought is to drop any control group in favor of "using subjects as their own control". Who wouldn't prefer to recruit 50 subjects and look at whether their cholesterol levels change over time rather than 200 subjects (100 on treatment; 100 on placebo) to see if the mean change in the treatment group is different from that in the control group? However, this is not an issue of sample size. It is an issue of study design. An investigator who measured only the 50 subjects at two time points would be able to determine whether there was a change over time, but s/he would not be able to say how it compared to what would have happened over the same time period in the absence of any intervention.

----------------

^{*}There are exceptions such as equivalence trials where the
goal is to show that two population means are the same, but they will not
concern us here.

^{**}It may sound counter-intuitive for the investigator to
have to estimate the difference when the purpose of the study is to
determine the difference. However, it can't be any other way. Common
sense suggests it takes only a small number of observations to detect a
large difference while it takes a much larger sample size to detect a
small difference. Without some estimate of the likely effect, the sample
size cannot be determined. Sometimes there will be no basis for
estimating the likely effect. The best that can be done in such
circumstances is a pilot study to generate some preliminary data and
estimates.
