Well-designed and conducted clinical research can support clinician utilization, regulatory approval, and payer coverage decisions for orthopaedic surgical procedures. Distinct study designs have advantages and drawbacks in each of these three areas.

Although experimental designs such as the randomized clinical trial (RCT) may be ideal in principle, well-conducted observational studies also have value, especially with respect to clinically established practice. Within RCTs, superiority studies (in which the experimental treatment is shown to be superior to the control treatment) are the gold standard. However, even well-conducted superiority trials can fail to maintain randomization when patients are lost to follow-up or cross over between randomized arms. This so-called partial compliance blurs the distinction between experimental and observational studies and complicates the interpretation of the resulting data. A separate article will address noninferiority trials, a recent trend in orthopaedic RCTs.

**Basics of superiority trials**

A superiority trial is a classic RCT that tests whether the experimental treatment is superior to the control treatment, which could be nonsurgical care or, if ethical, a sham procedure. The null hypothesis is that the effect of the experimental treatment is less than or equal to that of the control treatment; superiority is suggested if the effect of the experimental treatment is greater than that of the control in a statistically significant way.

In the language of statistical hypothesis testing, this sets a *P*-value cutoff (the significance level) corresponding to a confidence level, commonly *P* < 0.05 and a 95 percent confidence interval. In effect, this means researchers are willing to accept a small chance of a false-positive difference: rejecting the null hypothesis when the null hypothesis is in fact true (a type I error). These concepts are familiar and clear to many practicing clinicians.
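The meaning of the type I error rate can be made concrete with a short simulation (an illustration, not part of the article): when two groups are drawn from the same distribution, a test run at α = 0.05 still declares a "significant" difference about 5 percent of the time.

```python
# Illustration: simulating the type I (false-positive) error rate.
# Both arms are sampled from the SAME distribution, so any detected
# "difference" is noise; at alpha = 0.05 we expect ~5% rejections.
import random
from statistics import NormalDist, mean

random.seed(0)
ALPHA = 0.05
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)  # two-sided cutoff, ~1.96
N_PER_ARM = 50
N_TRIALS = 20_000

false_positives = 0
for _ in range(N_TRIALS):
    a = [random.gauss(0, 1) for _ in range(N_PER_ARM)]
    b = [random.gauss(0, 1) for _ in range(N_PER_ARM)]
    # Two-sample z statistic (known unit variance, for simplicity).
    z = (mean(a) - mean(b)) / ((2 / N_PER_ARM) ** 0.5)
    if abs(z) > Z_CRIT:
        false_positives += 1

rate = false_positives / N_TRIALS
print(f"observed type I error rate: {rate:.3f}")  # close to 0.05
```

Out of 20,000 null trials, roughly 1,000 yield a spurious "significant" result, which is exactly the risk the α = 0.05 convention accepts.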

**Power and effect size**

However, the related concept of power is less universally understood by clinicians. A type I error reflects the chance of accepting a false-positive difference when, in reality, no difference exists. By the nature of hypothesis testing, a second error rate must also be specified: the chance of a false-negative difference, in which the null hypothesis that no difference exists is accepted when, in reality, a difference between experiment and control does exist (a type II error). Power is the probability of detecting a difference and accepting the alternative hypothesis when a real difference between the experimental and control groups exists; it equals one minus the type II error rate.

If a conceptual symmetry exists between setting the false-positive (type I) and false-negative (type II) error rates, why is the concept of power not universally understood? The reason is that the type II error rate and the power are not set as explicitly as the type I error rate. Instead, they are set implicitly by choosing the sample size. Specifically, for a given statistic and a type I error rate, equations and computational methods are used to estimate the power of the study as a function of the sample size (N).

However, the equations for estimating power also require knowledge of another parameter, the effect size. The effect size is an estimate of the magnitude of the difference between the experimental treatment and the control treatment, taking into account both the size of the difference and its variability (error bars). Although estimating this difference is the point of the experiment in the first place, a rough estimate of effect size can be obtained from a smaller pilot experiment (or from very well-controlled historical data). This estimate can then be used to determine the N required to achieve a given power in the full study.
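The relationship among effect size, power, and N can be sketched with the standard normal-approximation sample-size formula for a two-arm comparison of means (an illustration, not part of the article; the 0.5 effect size stands in for a hypothetical pilot estimate):

```python
# Minimal sketch of the normal-approximation sample-size formula:
#   n per arm = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2
# where d is the standardized effect size (difference in means / SD),
# e.g. estimated from a pilot study.
import math
from statistics import NormalDist

def n_per_arm(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Sample size per arm for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # controls type I error
    z_beta = NormalDist().inv_cdf(power)           # controls type II error
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return math.ceil(n)

# A "medium" hypothetical effect size of 0.5 at 80% power, alpha = 0.05:
print(n_per_arm(0.5))  # 63 per arm (small-sample corrections push this a bit higher)
```

Note how the required N grows with the square of the inverse effect size: halving the expected effect roughly quadruples the sample size, which is why a credible pilot estimate matters so much.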

After the trial is complete, the effect size can be recalculated from the full data. The power actually achieved in the full study (as opposed to predicted from the pilot) can also be recalculated as a double-check for the possibility of a type II (false-negative) error.
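The inverse calculation the paragraph describes can be sketched as follows (hypothetical numbers; under the same normal approximation as the sample-size formula, with the effect size recalculated from the full data):

```python
# Sketch: given the N actually enrolled and an effect size recalculated
# from the full data, estimate the power the study achieved.
# Normal approximation; all numbers are illustrative, not from a real trial.
from statistics import NormalDist

def achieved_power(effect_size: float, n_per_arm: int, alpha: float = 0.05) -> float:
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Probability that the z statistic clears the cutoff when the true
    # standardized difference equals `effect_size`.
    return NormalDist().cdf(effect_size * (n_per_arm / 2) ** 0.5 - z_alpha)

# If the full-study effect size (0.35) came in below the pilot estimate
# (0.5) used to plan 64 patients per arm, the study was underpowered:
print(round(achieved_power(0.50, 64), 2))
print(round(achieved_power(0.35, 64), 2))
```

In this hypothetical, power falls from roughly 80 percent to near 50 percent, flagging a real possibility that a "negative" result is a type II error.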

In summary, estimating the type II error rate, the power, the N, and the effect size includes a component of circularity. Further, this estimation requires mathematical knowledge and is best done iteratively, with a pilot experiment followed by a full experiment (and additional experiments if necessary).

Underpowered experiments have led to numerous false negatives when authors lacked a pilot study to estimate the effect size, lacked the funding to conduct a larger study regardless, or used inefficient statistical methods. Type II errors also occur when the effect size observed in the actual study differs substantially from the effect size assumed *a priori* when calculating N.

**Clinical versus statistical significance**

Based on this, it might seem prudent to increase N to a very large number to maximize power and minimize false negatives. In practice, however, a large N is cost-prohibitive, and in principle it raises concerns about clinical significance: a trial with a very large N may demonstrate a small but statistically significant difference that is of little clinical importance.

The minimum clinically important difference (MCID) attempts to quantify what difference is clinically significant and must be estimated for each outcome measure. For example, the MCID for the Physical Component Subscale of the Short Form-36 clinical outcome instrument has been estimated quantitatively. So, there is little point in designing an experiment so large that it can detect a difference smaller than the MCID.
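One way to see the point is to size the trial to the MCID itself (a hypothetical sketch; the MCID of 5 points and outcome SD of 10 points below are illustrative placeholders, not published values for any instrument):

```python
# Hypothetical sketch: size the trial to detect the MCID, not an
# arbitrarily small difference. MCID = 5 and SD = 10 are placeholders.
import math
from statistics import NormalDist

def n_per_arm(diff: float, sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm N to detect a raw difference `diff` with outcome SD `sd`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / diff) ** 2)

MCID, SD = 5.0, 10.0
print(n_per_arm(MCID, SD))      # N to detect the smallest difference that matters
print(n_per_arm(MCID / 2, SD))  # ~4x larger N to detect half the MCID
```

Enrolling the larger cohort buys sensitivity to a difference that, by definition, patients would not notice, which is the waste the MCID concept guards against.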

Because the MCID depends on the diagnosis and procedures, an estimate may not be readily available. Even when possible, estimating an MCID is difficult. It requires substantial clinical data, and the estimate may be sensitive to the so-called *anchor*, the measure of whole-person well-being that is the basis for determining what matters to patients.

The MCID (as typically estimated) may also underestimate the degree of improvement that patients actually expect and desire if they were to recommend a procedure to a friend or revisit the decision to have the procedure. So, although a useful concept, the MCID should be regarded as only a rough guide to both patient expectations and experiment design.

**RCT analysis sets**

Even well-conducted RCTs have imperfections in the outcome measures, follow-up, and inclusion/exclusion criteria. One particularly important issue relates directly to public policy and payer decisions: the degree of randomization as it is affected by dropout and crossover of patients between arms as the study progresses. These are termed RCTs with partial compliance. They blur the line between an experiment (randomized trial) and an observational study and can turn an RCT into an observational study if a large number of patients cross over from their randomized arm to another arm.

For example, one of the early publications from the Spine Patient Outcomes Research Trial (SPORT) appeared to demonstrate no difference between surgical and nonsurgical treatment of disk herniation at 2-year follow-up. However, the value of this result was clouded by a high crossover rate and the use of an intent-to-treat (ITT) analysis. Under ITT, the data are analyzed as if each patient had received the assigned treatment, regardless of what treatment was actually received.

Although it seems counterintuitive, an ITT approach is highly regarded in the analysis of superiority RCTs. It is conservative in the statistical sense, tending to favor the null hypothesis of no difference in treatments. This is viewed as beneficial because ITT is biased against false-positive treatment effects and preserves the control of confounding that randomization provides. Unfortunately, ITT may mask an important treatment effect, especially in the context of very high crossover.

Alternatives to ITT include the per-protocol analysis, which includes only those patients who complied with randomization, and the as-treated (AT) analysis, in which the data are analyzed by the treatment each patient actually received. Although conceptually simple, these data subsets and analysis methods can be confounded by covariates and, like observational studies, do not necessarily isolate the placebo effect.
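The dilution of the ITT estimate under crossover can be seen in a toy simulation (hypothetical numbers, not SPORT data). Note one deliberate simplification: here crossover is random, so the AT estimate happens to be unbiased; in a real trial, patients who cross over differ systematically from those who do not, which is exactly how AT analyses become confounded.

```python
# Illustrative simulation: with heavy crossover from the nonsurgical arm
# to surgery, the intent-to-treat (ITT) contrast shrinks toward zero,
# while the as-treated (AT) contrast tracks the treatment received.
import random
from statistics import mean

random.seed(1)
N = 1000                 # patients per randomized arm
TRUE_EFFECT = 10.0       # assumed benefit of surgery on some outcome score
CROSSOVER = 0.40         # 40% of the nonsurgical arm crosses over to surgery

itt_surg, itt_nonop = [], []   # grouped by treatment ASSIGNED
at_surg, at_nonop = [], []     # grouped by treatment RECEIVED

for assigned_surgery in [True] * N + [False] * N:
    received_surgery = assigned_surgery or random.random() < CROSSOVER
    outcome = random.gauss(TRUE_EFFECT if received_surgery else 0.0, 15.0)
    (itt_surg if assigned_surgery else itt_nonop).append(outcome)
    (at_surg if received_surgery else at_nonop).append(outcome)

itt_est = mean(itt_surg) - mean(itt_nonop)
at_est = mean(at_surg) - mean(at_nonop)
print(f"ITT estimate: {itt_est:.1f}")  # diluted well below the true effect
print(f"AT  estimate: {at_est:.1f}")   # near the true effect (crossover random here)
```

With 40 percent crossover, the ITT contrast lands near 6 points rather than the true 10: large enough dilution to turn a real treatment effect into an apparent null result in an underpowered trial.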

A mathematical approach known as instrumental variables can mitigate the effects of confounding in RCTs with partial compliance, but its application is complex and subject to its own limitations. Although the ITT analysis of the SPORT data for herniated disk did not support a well-established surgical treatment, the observational cohort showed a significant treatment effect for surgery, as did subsequent AT analyses of randomized SPORT data.
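The simplest instrumental-variables estimator, the Wald (or complier-average) estimator, can be sketched as follows (hypothetical numbers; randomization serves as the instrument, and the ITT effect is rescaled by the difference in treatment-received rates between arms):

```python
# Sketch of the instrumental-variables idea via the Wald estimator:
#   IV effect = ITT effect / (difference in treatment-received rates).
# All numbers are hypothetical; this is an illustration, not a recipe.
import random
from statistics import mean

random.seed(2)
N = 5000
TRUE_EFFECT = 10.0
CROSSOVER = 0.40  # share of the control arm that crosses over to surgery

assigned = [True] * N + [False] * N
received = [a or random.random() < CROSSOVER for a in assigned]
outcome = [random.gauss(TRUE_EFFECT if r else 0.0, 15.0) for r in received]

def arm_mean(values, flags, flag):
    """Mean of `values` restricted to patients whose flag matches."""
    return mean(v for v, f in zip(values, flags) if f == flag)

itt = arm_mean(outcome, assigned, True) - arm_mean(outcome, assigned, False)
compliance = arm_mean(received, assigned, True) - arm_mean(received, assigned, False)
iv = itt / compliance  # rescaled estimate, roughly the true per-treatment effect
print(f"ITT: {itt:.1f}, IV: {iv:.1f}")
```

Dividing the diluted ITT contrast by the 0.6 net compliance rate recovers an estimate near the true 10-point effect, at the cost of additional assumptions (for example, that randomization affects outcomes only through the treatment received) that carry their own limitations.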

A faulty hypothesis may be another problem of surgical versus nonsurgical RCTs. In many conditions, surgery is indicated only after suitable nonsurgical treatments have failed. Patients randomized to nonsurgical treatment that has already failed may readily cross over and bias any results.

**Conclusion**

Though the RCT is a powerful scientific tool, it is not a panacea. Even well-designed and conducted trials, fueled by adequate funding, are subject to false conclusions, biases, and complex interpretations. As orthopaedic surgery emerges from a historical preponderance of small observational studies into the current and future reality of more RCTs, the interplay of these issues will become more important for patient care and public policy.

**S. Raymond Golish, MD, PhD**, is a spine surgeon and medical director of research at Jupiter Medical Center, Palm Beach, Fla.; **Paul A. Anderson, MD**, is a professor in the department of orthopaedic surgery at the University of Wisconsin, Madison, Wisc. Both are members of the AAOS Biomedical Engineering Committee.

*Information on potential conflicts of interest by the authors may be accessed at www.aaos.org/disclosure*