Fig. 1 Shaded area (green) represents values in the distribution with probability greater than alpha (typically set at alpha = 0.05). (Adapted from Wikimedia Commons, https://commons.wikimedia.org)

AAOS Now

Published 8/1/2019
Ayoosh Pareek, MD; Chad Parkes, MD; R. Kyle Martin, MD, FRCSC; Lars Engebretsen, MD, PhD; Aaron J. Krych, MD

P Value: Purpose, Power, and Potential Pitfalls

A P value indicates the probability of obtaining a result at least as extreme as the one observed, assuming no true difference exists; informally, it estimates how likely an observed result is to have arisen by chance. In most modern literature, this is interpreted in the context of a type 1 error, which is defined as the probability of finding a difference between treatments by chance when a difference does not actually exist (Fig. 1). To reject a null hypothesis, which is the assumption that two groups are the same, it is generally accepted that a P value must be less than the standard significance level (alpha) of 0.05. This establishes a less than 5 percent chance of finding a difference between two groups when no difference actually exists and supports an alternative hypothesis, which is the assumption that the two groups are, in fact, different.
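To make this concrete, here is a minimal sketch in Python (hypothetical data and SciPy's ttest_ind; not part of the original article) showing a P value being computed and compared against alpha:

```python
# A minimal sketch with hypothetical outcome scores: compute a two-sample
# t-test P value and compare it against the conventional alpha of 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)  # hypothetical treatment A scores
group_b = rng.normal(loc=11.5, scale=2.0, size=30)  # hypothetical treatment B scores

t_stat, p_value = stats.ttest_ind(group_a, group_b)  # null: group means are equal
alpha = 0.05
print(f"t = {t_stat:.2f}, P = {p_value:.4f}")
print("reject the null" if p_value < alpha else "fail to reject the null")
```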

Many misconceptions exist regarding the purpose and power of P values. For example, a common notion is that a smaller P value implies that a treatment or variable has a larger effect size than a comparatively larger P value. In fact, P values provide no direct measure of the magnitude or direction of an effect. A separate measure of the difference between groups (e.g., a confidence interval, odds ratio, or hazard ratio) must be reported to assess the magnitude or direction of the effect. Additionally, many do not realize that the standard significance level (alpha) of 0.05 is an arbitrary convention and, in many cases, does not accurately represent the data. A significant difference may not be meaningful if the effect size is small, and a marginally insignificant difference may be clinically relevant if the effect size is large. Ideally, each study should examine its underlying assumptions and associated statistical analyses and modify its alpha accordingly.
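The point about magnitude can be demonstrated with a short sketch (hypothetical data, not from the article): with a large enough sample, even a trivially small effect yields P < 0.05, while the effect size (here Cohen's d) remains negligible.

```python
# A tiny true effect (0.02 SD) becomes "statistically significant" with a
# large sample, even though the effect size remains trivially small.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.00, 1.0, 100_000)
b = rng.normal(0.02, 1.0, 100_000)

_, p = stats.ttest_ind(a, b)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = (b.mean() - a.mean()) / pooled_sd
print(f"P = {p:.2g}, Cohen's d = {cohens_d:.3f}")  # P well below 0.05; d ~ 0.02
```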

As intuitive beings, we must not simply dichotomize study results into P < 0.05 as significant and P > 0.05 as insignificant. Clearly, P = 0.045 and P = 0.055 are intuitively similar and may not reflect meaningfully different findings, depending on study characteristics. Effects are often on a continuous scale, and keeping this complexity in mind allows us to understand the data more comprehensively. Moreover, most studies now use two-sided statistical tests even when they are not appropriate. A t-test is used to determine whether there is a significant difference between the means of two groups. Most t-tests conducted in the orthopaedic literature are two-sided, indicating that the authors want to determine whether group A is either greater or less than group B (simply, whether group A is different from group B). Hence, with such a test, the alpha (0.05) is divided between the two tails (0.025 per tail). The sidedness of the test should be based on the hypothesis of the study. If the hypothesis is to assess superiority alone, a one-sided test may be more appropriate.
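The distinction between sidedness can be sketched as follows (hypothetical data; SciPy's `alternative` argument, available in SciPy 1.6 and later, is an assumption of this illustration, not something the article specifies):

```python
# Contrast a two-sided t-test (is A different from B?) with a one-sided
# t-test (is A greater than B?); when the t statistic falls in the
# hypothesized direction, the one-sided P is half the two-sided P.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(10.8, 2.0, 25)
group_b = rng.normal(10.0, 2.0, 25)

_, p_two = stats.ttest_ind(group_a, group_b, alternative="two-sided")
_, p_one = stats.ttest_ind(group_a, group_b, alternative="greater")  # H1: mean A > mean B
print(f"two-sided P = {p_two:.3f}, one-sided P = {p_one:.3f}")
```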

As a thought experiment, one can understand why researchers, especially those who work closely with data, have recently called for the P value to be regarded as nothing more than another hammer in the toolbox, not the toolbox itself. In a project with an alpha of 0.05, it would be common to conduct 15 to 20 statistical tests (univariate, multivariate, survival, etc.). An alpha of 0.05 implies that each individual test has a one-in-20 chance of reporting a false-positive significant effect (P < 0.05). Across 20 independent tests, the chance of at least one false positive rises to roughly 64 percent (1 - 0.95^20), so finding at least one "significant" effect certainly should not be surprising and may very well be a false positive.
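The arithmetic behind that intuition is short enough to verify directly (a sketch of the reasoning above, assuming the 20 tests are independent and no true effects exist):

```python
# With 20 independent tests at alpha = 0.05 and no true effects, the chance
# of at least one false-positive "significant" result is 1 - 0.95**20.
alpha, n_tests = 0.05, 20
p_at_least_one = 1 - (1 - alpha) ** n_tests
print(f"P(at least one false positive) = {p_at_least_one:.2f}")  # ~0.64
```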

We may want to curb our enthusiasm for such a result by (1) preemptively adjusting alpha to a lower value (0.01 or lower, or applying a Bonferroni correction for multiple comparisons), (2) exploring the data further, and (3) reviewing effect sizes to see whether the statistical significance is of clinical relevance. If multiple hypotheses are tested in one analysis (e.g., a comparison among multiple groups), the chance of obtaining a false-positive result increases, as does the chance of a type 1 error (incorrectly rejecting the null hypothesis). The Bonferroni correction compensates for this by dividing alpha by the number of comparisons and treating that corrected value as the new significance threshold. Parsons et al. found that 39 percent of 100 surveyed orthopaedic studies did not use the correct statistical analysis; in 17 percent of those, using the correct test would have changed the outcome. Similarly, Bhandari et al. reported that less than 10 percent of the orthopaedic studies they examined applied an appropriate Bonferroni correction for multiple comparisons, revealing a significantly higher risk of false-positive results in orthopaedic studies (37 percent) compared with other medical studies (10 percent).
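A minimal sketch of the Bonferroni arithmetic (the P values below are hypothetical, chosen only to illustrate the threshold shift):

```python
# Bonferroni correction: divide alpha by the number of comparisons and
# compare each P value against the corrected threshold.
p_values = [0.001, 0.020, 0.049, 0.300]  # four hypothetical pairwise comparisons
alpha = 0.05
alpha_corrected = alpha / len(p_values)  # 0.05 / 4 = 0.0125

for p in p_values:
    verdict = "significant" if p < alpha_corrected else "not significant"
    print(f"P = {p:.3f} -> {verdict} at corrected alpha = {alpha_corrected:.4f}")
```

Note that the P values of 0.020 and 0.049 would pass the uncorrected 0.05 threshold but fail the corrected one.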

Many scientists have recently come to question the significance of P values in the current literature and have called for their removal. This is easy to understand, as multiple studies have found P values to (1) misrepresent the data because the underlying statistical tests are often performed incorrectly, (2) be frequently misinterpreted, and (3) be interpreted without the nuance they require.

The appeal of P values lies in their apparent simplicity and interpretability, but we must remember that study characteristics, effect sizes, confidence limits, and statistical tests must all be taken into account when interpreting data. Even then, a statistical difference may not equate to a meaningful clinical difference.

An editorial by Katz and Losina commented on this exact phenomenon: although one study they examined showed an increased risk of pulmonary embolism (PE) in older patients after total joint arthroplasty, with a P value < 0.0001, the risk ratio (RR) of 1.12 (signifying a 12 percent increase) may not be clinically relevant because clinically significant PEs are very rare. Conversely, the editorial discussed a study that found age was not significantly associated with readmission after lumbar arthrodesis (P = 0.07), although the RR was 1.96. In that setting, the effect size (RR) is quite large and potentially clinically important; if we treat the P value nondichotomously, we can more appropriately appreciate the results of the study and its associated drawbacks (such as the possibility that it was underpowered).
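Why an RR of 1.12 may not matter clinically becomes clear from the absolute numbers. A minimal sketch, assuming a hypothetical baseline risk (the 0.1 percent figure below is an assumption for illustration, not from the cited study):

```python
# With a rare outcome, a relative risk of 1.12 implies a tiny absolute
# risk increase, which is why it may not be clinically relevant.
baseline_risk = 0.001  # assumed 0.1% baseline rate of clinically significant PE
rr = 1.12
absolute_increase = baseline_risk * (rr - 1)
print(f"absolute risk increase = {absolute_increase:.3%}")
# ~0.012%, i.e., roughly 1 extra event per 8,300 patients under this assumption
```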

Alternatives exist, such as different significance cutoffs or Bayesian statistics. Frequentist statistics (the approach most commonly used) draw conclusions from the analyzed data alone, emphasizing long-run frequencies without incorporating prior information; the observed data are used to determine how probable a result would be under a given hypothesis. Bayesian statistics, on the other hand, combine prior information (previous experiments and studies) with the observed data to provide the conditional probability of a hypothesis given that evidence. Still, different cutoffs or types of statistical analysis do not completely solve the issue. As stated in a recent article, the only way to be sure we are doing a commendable job is to “accept uncertainty [and] be thoughtful, open, and modest” when examining the literature and conducting data analysis.
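To make the contrast concrete, here is a minimal Bayesian sketch (the prior and trial counts are hypothetical, chosen only to show how prior information is combined with new data):

```python
# Beta-binomial updating: a Beta prior encoding earlier studies is combined
# with new trial data to give the posterior probability that a treatment's
# success rate exceeds 50%.
from scipy import stats

prior_a, prior_b = 8, 4      # hypothetical prior: earlier studies suggested ~2/3 success
successes, failures = 14, 6  # hypothetical new study: 14 successes in 20 patients

posterior = stats.beta(prior_a + successes, prior_b + failures)
print(f"P(success rate > 0.5 | data) = {posterior.sf(0.5):.3f}")
```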

Certainly, the reproducibility of research hinges on more than data analysis and P value interpretation, as issues such as faulty study design and inherent bias still plague studies. Whichever method of statistical analysis is used, replacing our dichotomous way of thinking with informed contextual judgment will assist in making appropriate decisions in both clinical research and daily patient care.

Ayoosh Pareek, MD, is an orthopaedic surgery resident and chair of the AAOS Resident Assembly Research Committee.

Chad Parkes, MD, is an orthopaedic surgery resident at Mayo Clinic in Rochester, Minn.

R. Kyle Martin, MD, FRCSC, is an orthopaedic surgery fellow at Mayo Clinic in Rochester, Minn.

Lars Engebretsen, MD, PhD, is a professor of orthopaedic surgery at University of Oslo and head of medical sciences for the International Olympic Committee.

Aaron J. Krych, MD, is a professor of orthopaedic surgery at Mayo Clinic in Rochester, Minn., and director of the sports medicine fellowship.

References:

  1. Ranganathan P, Pramesh CS, Buyse M: Common pitfalls in statistical analysis: “P” values, statistical significance and confidence intervals. Perspect Clin Res 2015;6:116-7.
  2. Gagnier JJ, Morgenstern H: Misconceptions, misuses, and misinterpretations of P values and significance testing. J Bone Joint Surg Am 2017;99:1598-603.
  3. Ludbrook J: Should we use one-sided or two-sided P values in tests of significance? Clin Exp Pharmacol Physiol 2013;40:357-61.
  4. Amrhein V, Greenland S, McShane B: Scientists rise up against statistical significance. Nature 2019;567:305-7.
  5. Wasserstein RL, Schirm AL, Lazar NA: Moving to a world beyond “p < 0.05”. Am Stat 2019;73(suppl 1):1-19.
  6. Parsons NR, Price CL, Hiskens R, et al: An evaluation of the quality of statistical design and analysis of published medical research: results from a systematic survey of general orthopaedic journals. BMC Med Res Methodol 2012;12:60.
  7. Bhandari M, Whang W, Kuo JC, et al: The risk of false-positive results in orthopaedic surgical trials. Clin Orthop Relat Res 2003:63-9.
  8. Katz JN, Losina E: Uses and misuses of the P value in reporting results of orthopaedic research studies. J Bone Joint Surg Am 2017;99:1507-8.
  9. Amrhein V, Korner-Nievergelt F, Roth T: The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research. PeerJ 2017;5:e3544.