Statistical notes for clinical researchers: post-hoc multiple comparisons
To compare three or more group means, we apply analysis of variance (ANOVA) to decide whether all means are equal or at least one mean differs from the others. A significant result supports the global conclusion that the group means differ. However, we then need to know which specific pairs of group means differ and which do not. This question is answered by post-hoc multiple comparison procedures.
Multiple comparisons and type I error (α error)
Multiple comparisons are procedures for comparing many group means simultaneously. For example, when we are interested in comparing the means of groups A, B, and C, we may consider performing a set of three comparisons as follows:
Hypothesis 1: mean values of group A and group B are equal (comparison of A and B).
Hypothesis 2: mean values of group A and group C are equal (comparison of A and C).
Hypothesis 3: mean values of group B and group C are equal (comparison of B and C).
The set of comparisons is referred to as a 'family of tests'. A multiple comparison procedure that tests a set of hypotheses at the same time is also called a 'simultaneous test'. The most important issue in multiple comparisons is the control of type I error. Type I error is the error of rejecting a true null hypothesis, and its probability is called the α error. An α error level of 0.05 is frequently used for assessing a single hypothesis. The overall error level for the family of tests differs from the α error level for a single comparison (Table 1).
If the same α error level is adopted for each comparison (αPC) in a family of k comparisons, the overall α error level for the family of tests (αFW) is calculated as follows:
Probability of no α error for a comparison = $1 - \alpha_{PC}$, where $\alpha_{PC}$ is the probability of α error per comparison (αPC).
Probability of no α error for the overall family of k tests = $(1 - \alpha_{PC}) \times (1 - \alpha_{PC}) \times \dots \times (1 - \alpha_{PC}) = (1 - \alpha_{PC})^k$, for comparisons independent of each other.
Probability of α error for the overall family of k independent tests (αFW): $\alpha_{FW} = 1 - (1 - \alpha_{PC})^k$
For the example family of three independent tests, if we set αPC at the conventional α error level of 0.05, αFW is obtained as $1 - (1 - 0.05)^3 = 1 - 0.8574 = 0.1426$. The familywise α level is thus not only greater than αPC but also greater than an acceptable α error level. If we want to keep αFW below 5 percent (0.05), we need to reduce αPC. For example, if we set αPC at 0.01, αFW is calculated as $1 - (1 - 0.01)^3 = 1 - 0.9703 = 0.0297$, which is smaller than 0.05.
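The arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `familywise_alpha` is a hypothetical helper introduced here for illustration, and the formula holds only under the independence assumption stated above.

```python
def familywise_alpha(alpha_pc: float, k: int) -> float:
    """Familywise type I error for k independent tests, each run at alpha_pc."""
    # Probability that none of the k independent tests commits a type I error
    prob_no_error = (1.0 - alpha_pc) ** k
    return 1.0 - prob_no_error


# Reproduce the two worked examples for a family of three tests
for alpha_pc in (0.05, 0.01):
    alpha_fw = familywise_alpha(alpha_pc, 3)
    print(f"alpha_PC = {alpha_pc}: alpha_FW = {alpha_fw:.4f}")
```

Running the loop with larger k shows how quickly the familywise level grows when αPC is left at 0.05.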
Various multiple comparison methods
Statisticians have devised various multiple comparison methods to keep αFW within an acceptable level of 0.05. The methods can be categorized into three types according to the size and nature of the family of tests: restricted sets of contrasts, pairwise comparisons, and post-hoc error correction. Table 2 shows the categories and characteristics of the various multiple comparison tests.
1. Restricted sets of contrasts
The multiple comparison methods in 'Restricted sets of contrasts' are appropriate for relatively small families of tests composed of fewer than approximately ten tests (or contrasts). Results from the methods in this category can be somewhat conservative when applied to a large number of tests. Therefore, the contrasts of specific interest should be chosen before testing, such as a set of planned comparisons or comparisons between a control group and the other experimental groups.
Before introducing each specific method, a brief overview of related statistical terms may help.
- Liberal/conservative: 'Liberal' refers to a tendency to reject the null hypothesis relatively easily. A liberal test has high power to detect a true alternative hypothesis. In contrast, 'conservative' refers to relative difficulty in rejecting the null hypothesis and correspondingly low power.
- Balanced design/unbalanced design: A balanced design is one in which all the compared groups have equal sample sizes; an unbalanced design is one in which the groups have unequal sample sizes.
- Single-step/step-down methods: A single-step method is carried out in one testing procedure; a step-down method is carried out through repeated sequential steps. Generally, a single-step test provides a confidence interval, whereas a step-down method does not.
- Common steps of step-down methods: In the first step, all k means are tested at the αFW level appropriate for a comparison of k means. If the result is significant, the next step follows; if it is not, the procedure stops. Next, each subset of k - 1 means is tested at an increased αFW level appropriate for a comparison of k - 1 means. The procedure continues in this manner until no subsets remain to be tested.
1.1. Single step methods
1.1.1. The Bonferroni procedure
The Bonferroni correction of α error is a completely general method that applies to any sort of statistical procedure, not only to multiple comparisons following ANOVA. For a family of k independent tests, the familywise α error is $\alpha_{FW} = 1 - (1 - \alpha_{PC})^k$. For any set of tests, whether independent or correlated, the Bonferroni inequality gives $\alpha_{FW} \le k\alpha_{PC}$. Therefore, to keep αFW no larger than 0.05, we set the α error level for each comparison as $\alpha_{PC} = \alpha_{FW}/k$. For the family of three tests in the example above, we apply $\alpha_{PC} = 0.05/3 \approx 0.0167$.
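As a rough illustration of the Bonferroni rule αPC = αFW/k, the sketch below computes the per-comparison level and applies it to a family of p-values. The helper names `bonferroni_alpha` and `bonferroni_reject` are hypothetical, and the three p-values are made-up numbers used only to show the mechanics.

```python
def bonferroni_alpha(alpha_fw: float, k: int) -> float:
    """Per-comparison level that keeps the familywise level at or below alpha_fw."""
    return alpha_fw / k


def bonferroni_reject(p_values, alpha_fw=0.05):
    """Return one True/False decision per p-value (True = reject the null)."""
    threshold = bonferroni_alpha(alpha_fw, len(p_values))
    return [p <= threshold for p in p_values]


# Hypothetical p-values for the comparisons A-B, A-C, and B-C
p_values = [0.010, 0.020, 0.040]
print(f"per-comparison level: {bonferroni_alpha(0.05, 3):.4f}")
print(bonferroni_reject(p_values))
```

With these illustrative values only the first comparison falls below the reduced per-comparison level, although all three are below the unadjusted 0.05.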
1.1.2. The Šidák-Bonferroni procedure
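The Šidák-Bonferroni procedure was developed to improve the power of the tests, because the Bonferroni procedure produces conservative results. The significance level per comparison is set as $\alpha_{PC} = 1 - (1 - \alpha_{FW})^{1/k}$, which is obtained by solving the familywise error formula for k independent tests for αPC.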
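The gain over the Bonferroni procedure can be checked numerically. The sketch below assumes the standard Šidák level, 1 - (1 - αFW)^(1/k), which follows from inverting the familywise formula for independent tests; the function names are hypothetical helpers, not part of any statistical package.

```python
def sidak_alpha(alpha_fw: float, k: int) -> float:
    """Sidak per-comparison level, assuming k independent tests."""
    return 1.0 - (1.0 - alpha_fw) ** (1.0 / k)


def bonferroni_alpha(alpha_fw: float, k: int) -> float:
    """Bonferroni per-comparison level, shown for comparison."""
    return alpha_fw / k


# Compare the two per-comparison levels for several family sizes
for k in (3, 5, 10):
    print(f"k = {k:2d}: Bonferroni = {bonferroni_alpha(0.05, k):.5f}, "
          f"Sidak = {sidak_alpha(0.05, k):.5f}")
```

The Šidák level is always slightly larger than the Bonferroni level for k > 1, which is the source of its small gain in power.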
1.1.3. Dunnett's test
When one control group is compared with each of the other experimental groups, Dunnett's test is appropriate, and in this situation it shows high power. The standard, uncorrected t statistic is used as the test statistic and is compared with the critical value tDunnett devised by Charles Dunnett.
1.2. Step-down procedures
1.2.1. Holm-Bonferroni procedure
This is a step-down procedure similar to the Bonferroni procedure, applied to the comparisons in order of their p-values. As the steps progress, comparisons are assessed at successively increased α error levels. It has more power than the Bonferroni procedure.
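A minimal sketch of the step-down logic follows, assuming the usual Holm rule in which the smallest p-value is compared with αFW/k, the next with αFW/(k - 1), and so on, stopping at the first non-significant result. The function name `holm_reject` and the p-values are hypothetical.

```python
def holm_reject(p_values, alpha_fw=0.05):
    """Holm step-down decisions, returned in the original order of p_values."""
    k = len(p_values)
    # Indices of the comparisons, ordered from smallest to largest p-value
    order = sorted(range(k), key=lambda i: p_values[i])
    reject = [False] * k
    for step, idx in enumerate(order):
        # The threshold is relaxed at each step: alpha/k, alpha/(k-1), ..., alpha
        threshold = alpha_fw / (k - step)
        if p_values[idx] > threshold:
            break  # stop: this and all larger p-values are not rejected
        reject[idx] = True
    return reject


# Made-up p-values for three comparisons
print(holm_reject([0.010, 0.020, 0.040]))
```

For these illustrative values all three hypotheses are rejected, whereas a plain Bonferroni threshold of 0.05/3 would reject only the first.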
1.2.2. Shaffer's modified sequentially rejective Bonferroni procedure
This is a modification of the Holm-Bonferroni procedure that partly adopts increased α error levels and therefore has more power than the Holm-Bonferroni procedure.
2. Pairwise comparisons
A pairwise comparison procedure compares all possible pairs of group means. If we want to compare all possible pairs from k groups, the total number of comparisons is k(k - 1)/2. The following procedures are appropriate for all pairwise comparisons and are expected to give reasonable results. Although the Bonferroni correction can also be applied, it is expected to overcorrect.
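The count k(k - 1)/2 can be verified by enumerating the pairs directly. The snippet below is only an illustration; `pairwise_comparisons` is a hypothetical helper and the group labels are placeholders.

```python
from itertools import combinations


def pairwise_comparisons(groups):
    """List every unordered pair of groups to be compared."""
    return list(combinations(groups, 2))


groups = ["A", "B", "C", "D"]
pairs = pairwise_comparisons(groups)
k = len(groups)

print(pairs)
print(len(pairs), k * (k - 1) // 2)  # both counts agree
```

Because the number of pairs grows quadratically with k, a Bonferroni divisor based on it becomes large quickly, which is one way to see why the correction tends to be overly strict for all pairwise comparisons.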
2.1. Single step methods
2.1.1. Tukey's honestly significant difference (HSD) procedure
Tukey's HSD procedure provides the simplest way to control αFW and is considered the most preferable method when all pairwise comparisons are performed. The studentized range statistic (q statistic) is used to determine the critical values based on the number of groups and the number of observations per group. Because Tukey's HSD procedure assumes that all compared groups are of equal size, the modified Tukey-Kramer method can be applied to comparisons of unequal-sized groups.
2.2. Step-down procedure
2.2.1. Student-Newman-Keuls (SNK) procedure
The Student-Newman-Keuls (SNK) procedure is a step-down procedure that constructs equivalent subsets in a manner similar to Tukey's procedure. The SNK procedure follows a very complex process. Although it shows increased power, it often comes with an increased family-wise error level and may be too liberal.
2.2.2. Duncan's multiple range test
Duncan's multiple range test is performed using steps similar to those of the SNK procedure, with changed α error levels applied through the step-down procedure. It is more liberal than the SNK procedure. Generally, Duncan's multiple range test is not recommended when sample sizes are unequal because of this liberal tendency.
2.2.3. Ryan-Einot-Gabriel-Welsch (REGW) procedure
The REGW procedure is a modification of the SNK procedure that introduces stricter control of the family-wise α error. The REGW procedure is considered recommendable because it shows not only good power but also tight error control. REGWF uses the F statistic and REGWQ uses the studentized range statistic (q statistic).
3. Post-hoc error correction
The procedure in this category is performed as a completely post-hoc analysis after all planned comparisons have been assessed. The procedure explores all possible complex relationships and applies the most stringent error control.
3.1. Scheffé's procedure
Scheffé's procedure covers all possible contrasts, not only paired comparisons. Its advantage is that it covers a broad range of complex tests, including post-hoc relationships among many groups. The procedure tends to be too conservative and has less power than other methods. Generally, Scheffé's procedure is not recommended when only pairwise comparisons are of interest.
4. No control of family-wise α error level
4.1. Least Significant Difference (LSD) test
The LSD method does not control the family-wise α error level. Therefore, it is inappropriate for multiple comparison procedures in which control of the family-wise α error level is necessary.
Summary of multiple comparison methods
In choosing a multiple comparison method, it is important to consider the exact situation. The criteria for the choice are the ability to control the family-wise α error level and the power to detect significant differences.
For usual post-hoc pairwise comparisons, Tukey's HSD procedure or REGWQ may be preferable.
For comparisons of a small number of group means or preplanned comparisons of selected groups, the Bonferroni procedure or the Šidák-Bonferroni procedure may be preferable.
When a control group is compared with the other experimental groups, Dunnett's test may be the method of choice.
If a broad range of complex tests is of interest, Scheffé's procedure may be appropriate.
Note: For practical convenience, most statistical packages report adjusted p-values, which can be compared with a conventional α error level instead of the reduced αPC. Clinical researchers may therefore simply compare the p-values provided by statistical packages with a conventional α error level such as 0.05 when making decisions.
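To make the idea of an adjusted p-value concrete, the sketch below uses the Bonferroni adjustment as an example, in which each raw p-value is multiplied by the number of comparisons k and capped at 1. This is an illustrative assumption: packages offer several adjustment methods and their output depends on the method selected. The function name and the p-values are hypothetical.

```python
def bonferroni_adjusted(p_values):
    """Bonferroni-adjusted p-values: min(1, k * p) for each raw p-value."""
    k = len(p_values)
    return [min(1.0, k * p) for p in p_values]


# Made-up raw p-values for three comparisons
raw = [0.010, 0.020, 0.400]
adjusted = bonferroni_adjusted(raw)

for p_raw, p_adj in zip(raw, adjusted):
    # The adjusted value is compared with the conventional 0.05 directly
    decision = "significant" if p_adj <= 0.05 else "not significant"
    print(f"raw p = {p_raw:.3f}, adjusted p = {p_adj:.3f}: {decision}")
```

Comparing the adjusted p-value with 0.05 leads to the same decision as comparing the raw p-value with 0.05/k, which is why the adjusted form is convenient to report.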