Published 6/1/2015
S. Raymond Golish, MD, PhD; Paul A. Anderson, MD

Non-inferiority Trials for Orthopaedic Implants

An increasing trend in randomized controlled trials (RCTs), specifically those involving orthopaedic implants, is the use of non-inferiority trials. In these trials, a novel treatment is compared to a clinically established treatment to show that the effect of the novel treatment is not inferior to the effect of the established treatment, or at least falls short of it by no more than a small margin.

Although non-inferiority trials can be valuable, interpreting the results may be challenging for clinicians assessing the implications for patient care, for regulators, and for payers making coverage decisions.

The classic RCT is a superiority trial, testing whether the experimental treatment is better than the control treatment. The null hypothesis is that the effect of the experimental device is less than (or equal to) the effect of the control procedure. Superiority is suggested if the treatment effect of the experimental device is greater than the effect of the control in a statistically significant way.

In principle, an experimental device could be compared to either a sham procedure that isolates the placebo effect or an active control in a superiority trial. But increasingly, the goal of an RCT with an active control arm is to show that the experimental device is not less effective (non-inferior) than the control (at least not by much) (Fig. 1).

The most powerful rationale for this approach has to do with ethics. If a treatment is strongly believed to be effective and the condition being treated is serious, researchers may consider it unethical to withhold that treatment from patients in the sham group of an RCT comparing an experimental treatment to a sham control procedure.

Another practical consideration for selecting a non-inferiority design might be sample size. Demonstrating non-inferiority to an active control may require a smaller sample size than attempting to demonstrate superiority to that control. As treatments improve over time, the bar for new treatments is raised, necessitating ever larger sample sizes.
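The sample-size argument can be illustrated with a rough sketch. It uses the standard normal-approximation formula for comparing two means with equal arms; the standard deviation, the assumed true advantage, and the margin below are hypothetical planning values chosen only for illustration, and `n_per_arm` is a name introduced here.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(sigma, detectable_gap, alpha=0.025, power=0.80):
    """Approximate patients per arm for a two-arm comparison of means.

    detectable_gap is the distance between the assumed true difference
    and the boundary being tested (normal approximation, equal arms).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / detectable_gap**2)


# Hypothetical planning values, not taken from any real trial.
sigma = 10.0           # assumed standard deviation of the outcome score
true_advantage = 2.0   # assumed true benefit of experimental over control
margin = 4.0           # assumed non-inferiority margin

# Superiority: the trial must distinguish the true advantage from zero.
n_superiority = n_per_arm(sigma, true_advantage)

# Non-inferiority, assuming the two devices are truly equal:
# the trial must distinguish a true difference of 0 from -margin.
n_non_inferiority = n_per_arm(sigma, 0.0 + margin)

print(n_superiority, n_non_inferiority)
```

Under these assumptions the non-inferiority design needs fewer patients, but only because the margin is larger than the advantage the superiority trial would have to detect. A tighter margin reverses the comparison.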

Sample size is often coupled with another argument. If the experimental device is shown to be non-inferior, it may have other advantages not directly measured in the trial, such as cost, ease-of-use, or additional safety considerations beyond those included in the trial’s primary outcome measure.

Basics of non-inferiority trials
In a non-inferiority trial, the null hypothesis is that the effect of the experimental device is inferior to the effect of the control procedure by more than a small amount, called the inferiority margin (M). Non-inferiority is suggested if the experimental device is inferior to the control device by less than M (ie, the null hypothesis is rejected).
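One common way to carry out this test is to compare a confidence bound on the difference between arms with the margin. The sketch below assumes a continuous, higher-is-better outcome and a normal approximation, which is crude for samples this small; the function name and the patient scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def non_inferiority_check(experimental, control, margin, alpha=0.025):
    """Return (lower_bound, is_non_inferior) for higher-is-better scores.

    Null hypothesis: mean(experimental) - mean(control) <= -margin.
    It is rejected when the one-sided lower confidence bound on the
    difference lies above -margin.
    """
    diff = mean(experimental) - mean(control)
    se = sqrt(stdev(experimental) ** 2 / len(experimental)
              + stdev(control) ** 2 / len(control))
    z = NormalDist().inv_cdf(1 - alpha)
    lower = diff - z * se
    return lower, lower > -margin


# Hypothetical improvement scores, invented for illustration only.
experimental = [22, 25, 19, 24, 21, 23, 20, 26, 22, 24]
control = [23, 24, 21, 25, 22, 24, 20, 26, 23, 25]

for m in (3.0, 2.0):
    lower, ok = non_inferiority_check(experimental, control, margin=m)
    print(f"M={m}: lower bound={lower:.2f}, non-inferior={ok}")
```

The same data support non-inferiority at one margin and not at the other, which is why the choice of M matters so much.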

M is critical to designing and understanding non-inferiority trials. In a typical two-armed trial, this number is not measured but must be assumed. Selecting such a number can be complex, but a few key points are relevant clinically.

At the very least, a well-conducted placebo (sham)-controlled RCT is required to estimate the effect size of the active control, and the more trials in the literature that confirm the estimate, the better. Logically, this estimate of the effect size of the active control places an “upper bound” on M. It makes no sense to have M larger than the effect size of the control; otherwise, the experimental device could be non-inferior to the control but inferior to sham.

For this reason, the effect size of the active control is sometimes called M1. (In practice, M1 may not be the mean of the estimated effect sizes from all prior RCTs, but the lower end of the confidence interval, a stricter criterion.) Additionally, the non-inferiority margin can be smaller than M1. It may be desirable to guarantee that the experimental device preserves some of the effect of the active control; defining M2 as a non-inferiority margin smaller than M1 on clinical grounds establishes this stricter criterion.
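As a rough sketch of how M1 and M2 might be derived, the example below pools prior sham-controlled estimates with a fixed-effect (inverse-variance) average, takes the lower confidence bound as M1, and sets M2 so that a chosen fraction of M1 is preserved. The pooling method, the 50% preserved fraction, and the prior-trial numbers are all assumptions made for illustration; in practice M2 is a clinical judgment.

```python
from math import sqrt
from statistics import NormalDist


def pooled_effect(estimates, std_errors):
    """Fixed-effect (inverse-variance) pooled estimate and its standard error."""
    weights = [1 / se**2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, sqrt(1 / sum(weights))


def margins(estimates, std_errors, preserved_fraction=0.5, confidence=0.95):
    """Return (M1, M2) from prior control-versus-sham effect estimates."""
    effect, se = pooled_effect(estimates, std_errors)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    m1 = effect - z * se  # lower end of the confidence interval
    if m1 <= 0:
        # The control is not clearly superior to sham, so no margin is defensible.
        raise ValueError("control effect not clearly above zero")
    m2 = (1 - preserved_fraction) * m1  # stricter margin, smaller than M1
    return m1, m2


# Hypothetical control-versus-sham effects from two prior trials.
m1, m2 = margins(estimates=[8.0, 10.0], std_errors=[2.0, 2.0])
print(f"M1={m1:.2f}, M2={m2:.2f}")
```

The `ValueError` branch reflects the point that an active control without clear evidence of superiority to sham cannot anchor a margin.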

Thus, non-inferiority studies require both an active control that is clearly superior to sham and an estimate of the control’s effect size from a well-done sham-controlled trial. This is a critical point, because many clinically accepted procedures in orthopaedic surgery are not supported by that degree of scientific evidence. If the estimate of the non-inferiority margin is based on equivocal evidence, a non-inferiority trial could demonstrate that the experimental device is non-inferior to control without being superior to sham.

Assay sensitivity
Another key clinical concept is that of assay sensitivity, which is defined as the ability to detect a difference based on all trial design parameters, outcome measures, and time points. This general concept applies to both superiority and non-inferiority trials.

To illustrate, suppose one conducts an idealized superiority trial for osteoarthritis of the knee. An injectable drug is compared with a placebo injection in a two-armed study. The primary outcome measure is the Western Ontario and McMaster Universities Arthritis Index (WOMAC) instrument, measured at numerous time points both early and late. Patient compliance is ideal, and follow-up is perfect. The trial shows that the drug is superior to placebo. Success!

By definition, the assay is sensitive to the clinical effect of the injected drug. Not only is the WOMAC instrument sensitive to the disease state of knee arthritis in general (from prior studies), but also the time points, follow-up, trial sites, biases, and entire conduct of the trial are sensitive to a clinical improvement in patients. A positive trial demonstrates assay sensitivity.

But what if the trial were negative? Beyond being undersized, any number of factors could have caused the trial to appear negative due to lack of assay sensitivity. Perhaps the drug is effective but only in the short term, and mostly longer-term endpoints were measured. Perhaps the WOMAC was not the best measure to use, and some other outcome metric might have shown a difference.

Three-armed trials
Non-inferiority trials are built on the assumption, based on previous superiority trials comparing the predicate treatment to sham (or other control), that the trial methodology, including the outcome measures, time points, and other structural elements, is sensitive to real differences.

Importantly, the issue of assay sensitivity transcends choosing the best outcome measure. Numerous disease-specific clinical instruments and general health-related quality-of-life measures have been devised and validated. However, the concept of assay sensitivity applies to the entire structure of the trial, not just to a well-validated outcome metric. So how can one design a two-arm non-inferiority trial with a reasonable expectation of assay sensitivity?

Perhaps the simplest answer is to design a non-inferiority trial that is otherwise similar to the sham-controlled superiority trial that demonstrated a positive effect for the active control. By definition, that trial demonstrated assay sensitivity, so similar outcome measures and time points in the context of a well-conducted trial should preserve assay sensitivity. But this important question has another answer.

Perhaps the ideal approach would be to design a three-arm trial by adding a sham arm to a non-inferiority study. A clinical effect would be confirmed by demonstrating non-inferiority of the experimental device to the active control device, and superiority of the control to sham. The experimental device would also be expected to be superior to the sham. Although this would be a very powerful design in principle, the obvious drawbacks include increasing recruitment requirements, cost, and the ethical question of subjecting patients to a sham treatment.

Analyzing non-inferiority trials
Even the best designed RCTs may be affected by dropouts and crossovers of patients between arms as the study progresses. For this reason, superiority trials use the intent-to-treat (ITT) principle, analyzing the data as if each patient had received the assigned treatment, regardless of what treatment was actually received.

Although ITT analysis may seem counterintuitive, it is often regarded as a gold standard in the analysis of superiority RCTs because it is conservative in the statistical sense: it tends to favor the null hypothesis that there is no difference in treatments. This is viewed as a beneficial bias toward avoiding false-positive treatment effects. Using ITT for superiority trials is not perfect, but it is a clear, well-known principle that superiority trials and their analysts aspire to.

The situation for non-inferiority trials is more complex. For non-inferiority trials, ITT is not conservative. Rather, because dropouts and crossovers tend to blur the difference between arms, ITT is biased toward making the treatments look similar, whether or not the similarity is genuine, and thus may favor the alternative hypothesis of non-inferiority. The alternatives include the per-protocol (PP) analysis, which analyzes only those patients who adhered to their randomly assigned treatment, and the as-treated analysis, which analyzes the data based on the treatment each patient actually received. Neither represents the ideal that ITT represents for superiority trials. When multiple approaches are used, interpreting the data becomes more complicated.
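The three analysis populations differ only in which patients are counted and in which arm. A minimal sketch, using a hypothetical six-patient roster with two crossovers and treating adherence simply as "received the assigned treatment" (real protocols define adherence in more detail):

```python
from dataclasses import dataclass


@dataclass
class Patient:
    pid: int
    assigned: str  # arm assigned at randomization
    received: str  # treatment actually received


def group_by(patients, key):
    """Map each arm label to the list of patient IDs counted in it."""
    groups = {}
    for p in patients:
        groups.setdefault(key(p), []).append(p.pid)
    return groups


def intent_to_treat(patients):
    # Everyone is analyzed in the arm they were randomized to.
    return group_by(patients, lambda p: p.assigned)


def per_protocol(patients):
    # Only patients who received their assigned treatment are analyzed.
    adherent = [p for p in patients if p.received == p.assigned]
    return group_by(adherent, lambda p: p.assigned)


def as_treated(patients):
    # Everyone is analyzed according to the treatment actually received.
    return group_by(patients, lambda p: p.received)


# Hypothetical roster; patients 2 and 4 crossed over.
roster = [
    Patient(1, "experimental", "experimental"),
    Patient(2, "experimental", "control"),
    Patient(3, "control", "control"),
    Patient(4, "control", "experimental"),
    Patient(5, "experimental", "experimental"),
    Patient(6, "control", "control"),
]

for name, fn in [("ITT", intent_to_treat), ("PP", per_protocol), ("AT", as_treated)]:
    print(name, fn(roster))
```

Running the same outcome analysis on each grouping, and checking whether the conclusion about non-inferiority changes, is one way to see how sensitive a result is to crossovers.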

In the design of randomized trials for orthopaedic devices, the trend is toward the use of non-inferiority trials when ethical considerations favor avoiding a sham procedure as a control group. Even well-designed and well-conducted non-inferiority trials are subject to complex technical issues and nuances in their interpretation. Non-inferiority trials require the following:

  • an active control that is clearly superior to sham
  • an estimate of the control’s effect size relative to sham from at least one well-done RCT
  • an inferiority margin that is no larger than, and perhaps substantially smaller than, the control’s effect size
  • a reassurance of assay sensitivity for the trial’s outcome measures and structure
  • a thorough analysis of how the ITT and PP datasets affect the results of the trial

Without these elements, a trial could result in uncertain conclusions that carry significant ethical risk. Before designing and conducting a two-armed non-inferiority study, the researcher must carefully appraise these ethical and scientific tradeoffs.

S. Raymond Golish, MD, PhD, is director of research, Jupiter Medical Center, Palm Beach, Fla.; Paul A. Anderson, MD, is a professor in the department of orthopaedic surgery at the University of Wisconsin, Madison. Both are members of the AAOS Biomedical Engineering Committee.