Fig. 1 2 × 2 contingency table for a diagnostic test.
Courtesy of S. Raymond Golish, MD, PhD, MBA


Published 6/1/2017
S. Raymond Golish, MD, PhD, MBA

No Test Is Too Sensitive, But Those Who Think So Are Almost Correct

Diagnostic tests, especially laboratory tests and imaging, are at the heart of orthopaedic decision-making. Some tests are highly sensitive, generating a lot of true positive results, but also generating a lot of false positive results. For example, the use of broad-range PCR (polymerase chain reaction) testing to detect joint infection may have high sensitivity but has been critiqued for a high rate of false positives. MRI for ligamentous injuries of the cervical spine has been subject to similar critiques.

Such tests are sometimes colloquially said to be "too sensitive." Most orthopaedic surgeons recognize that the term "too sensitive" is scientifically loose language, and such tests are more accurately said to be "nonspecific." However, that loose language captures an important intuition about the nature of diagnostic tests that should be clarified, not discarded.

Using this common colloquialism as a starting point to review sensitivity, specificity, and analysis of 2 × 2 tables is instructive. Taken as a whole, the statistical, scientific, ethical, and economic trade-offs involved in such analyses strike to the heart of surgical decision-making for individual patients. Even more deeply, they underpin the development of evidence-based algorithms and public policy considerations for healthcare utilization and economics.

2 × 2 contingency table analysis
Most orthopaedic surgeons are familiar with the properties of diagnostic tests derived from the analysis of 2 × 2 tables, called contingency tables in probability theory. Fewer surgeons have memorized the exact definitions from the tables. Knowing these definitions sharpens understanding and is easy to do (Fig. 1).

A good memory aid is to make a habit of writing down the label "diagnosis" first and on top, since the true diagnosis is the primary concern. A positive diagnosis is D+, and a negative one is D–. Next, write "test" on the side, since a secondary concern is using the value of the test as an aid to the diagnosis. A positive test is T+, and a negative one is T–. Finally, label the cells of the table a, b, c, d in typical left to right, top to bottom writing order. Surgeons are concerned with simple probabilities defined from the columns of the table and determined as follows:

  • sensitivity = probability of a positive test given a known-positive diagnosis = a/(a + c)
  • specificity = probability of a negative test given a known-negative diagnosis = d/(b + d)

As shown in Fig. 1, sensitivity is also called the true positive rate, and the specificity is also called the true negative rate.
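The column-wise definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a validated tool: the function names and the cell counts are hypothetical, chosen only to follow the a, b, c, d labeling of Fig. 1.

```python
def sensitivity(a, c):
    """True positive rate: positive tests among known-positive diagnoses."""
    return a / (a + c)


def specificity(b, d):
    """True negative rate: negative tests among known-negative diagnoses."""
    return d / (b + d)


# Hypothetical cell counts (assumed for illustration, not taken from any study):
# a = T+ and D+, b = T+ and D-, c = T- and D+, d = T- and D-
a, b, c, d = 90, 30, 10, 170

sens = sensitivity(a, c)  # 90 / (90 + 10)
spec = specificity(b, d)  # 170 / (30 + 170)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Note that each function uses only one column of the table, which is why neither value depends on how common the disease is.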

ROC analysis
Most diagnostic tests have cutoff values that take a continuous parameter and dichotomize it into a positive result versus a negative result. For example, in an imaging test, the total percentage of the area exhibiting a certain signal characteristic might be used as a cutoff value to determine a positive versus a negative result.

In practice, such cutoff values and parameters may or may not be under the direct control of the surgeon at any point. But in principle, most tests have values that can be varied or "tuned" to trade sensitivity for specificity. Plotting sensitivity against the false positive rate (1 − specificity) as the cutoff value varies yields the receiver operating characteristic (ROC) curve (Fig. 2).

As shown in Fig. 2, varying a cutoff value permits trading sensitivity for specificity in any given test, at least conceptually. This explains the intuition behind the loose language that a test is "too sensitive." More precisely, a test that is sensitive but nonspecific can be made more specific at the expense of decreased sensitivity by tuning the cutoff value. In the practical design of diagnostic tests and diagnostic algorithms, optimizing such cutoff values is routine practice.
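The trade-off can be illustrated with a short sketch that sweeps a cutoff over a continuous measurement. The measurement values, the cutoffs, and the convention that a value at or above the cutoff counts as a positive test are all assumptions made for this example.

```python
def sens_spec_at_cutoff(diseased, healthy, cutoff):
    """Return (sensitivity, specificity) when value >= cutoff means T+."""
    true_pos = sum(v >= cutoff for v in diseased)
    true_neg = sum(v < cutoff for v in healthy)
    return true_pos / len(diseased), true_neg / len(healthy)


# Hypothetical measurements (e.g., percentage of area with abnormal signal)
diseased = [40, 55, 60, 70, 85]  # patients with a known-positive diagnosis
healthy = [10, 20, 30, 45, 65]   # patients with a known-negative diagnosis

roc_points = []
for cutoff in (25, 50, 75):
    sens, spec = sens_spec_at_cutoff(diseased, healthy, cutoff)
    roc_points.append((cutoff, sens, spec))
    print(f"cutoff {cutoff}: sensitivity {sens:.1f}, specificity {spec:.1f}")
```

With these made-up numbers, raising the cutoff lowers sensitivity and raises specificity. Each (sensitivity, 1 − specificity) pair is one point on an ROC curve.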

Fig. 2 Idealized receiver operating characteristic (ROC) curve, shown as a solid line.
Courtesy of S. Raymond Golish, MD, PhD, MBA

In an ideal world, any test should be made both more sensitive and more specific simultaneously. This reveals the looseness in the phrase "too sensitive," because increased sensitivity cannot be bad if it is accompanied by increased specificity. In Fig. 2, this corresponds to pushing the curve farther toward the upper left corner, resulting in an overall superior test.

By contrast, the diagonal line in Fig. 2 represents a test with no discriminating ability, the null hypothesis of a completely random ROC curve (such as tossing two fair coins and trying to predict the results of one with those of the other). In practice, however, improving sensitivity and specificity simultaneously often involves issues of biology and physics that are constrained by Mother Nature or issues of engineering economics that are not under the surgeon's control.

Predictive values for clinical decision-making
In Fig. 1, sensitivity and specificity are defined column-wise. For that reason, they are properties of a diagnostic test itself, irrespective of how much disease occurs in any population. In one sense, that definition is beneficial, because the properties of the test can be measured and discussed in isolation. However, the probabilities needed for clinical decision-making are the probabilities of the diagnosis for the population of which the patient is a member and are defined as follows:

  • positive predictive value = probability of disease given a positive test = a/(a + b)
  • negative predictive value = probability of health given a negative test = d/(c + d)

Note that these definitions are row-wise and take into account the patient population. The positive predictive value is the probability of disease given a positive test result, a satisfying concept. It equals the sensitivity multiplied by the prevalence of disease in the population, divided by the overall probability of a positive test.

A similar result holds for the negative predictive value. In probability, this is known as Bayes' theorem. It encapsulates the intuitive notion that the probability of a disease, given a test result, is affected by both the probability of disease in the population and the sensitivity/specificity of the test.
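As a rough sketch of how prevalence enters the predictive values, the function below applies Bayes' theorem to a test's sensitivity and specificity. The function name and all numeric inputs are hypothetical and serve only to show the calculation.

```python
def predictive_values(sens, spec, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity, and prevalence."""
    # Overall probability of a positive test: true positives + false positives
    p_pos = sens * prevalence + (1 - spec) * (1 - prevalence)
    ppv = sens * prevalence / p_pos
    npv = spec * (1 - prevalence) / (1 - p_pos)
    return ppv, npv


# Same hypothetical test (sensitivity 0.90, specificity 0.85) applied to
# two assumed populations: one with rare disease, one with common disease.
ppv_rare, npv_rare = predictive_values(0.90, 0.85, prevalence=0.05)
ppv_common, npv_common = predictive_values(0.90, 0.85, prevalence=0.50)

print(f"prevalence 5%:  PPV {ppv_rare:.2f}, NPV {npv_rare:.2f}")
print(f"prevalence 50%: PPV {ppv_common:.2f}, NPV {npv_common:.2f}")
```

In this example the test itself is unchanged, yet a positive result is far less convincing when the disease is rare, because false positives from the large healthy group outnumber true positives.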

Decision trees
Importantly, Bayes' theorem also allows the results of multiple tests to be "chained" together. Thus, the probability of a diagnosis can be updated in light of new information. Updating can be done with all tests available at the same time or sequentially as new tests are ordered over time, based on the current working diagnosis and the results obtained.
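One common way to sketch this chaining is with odds and likelihood ratios. The example below assumes the tests are conditionally independent given the true diagnosis, which is a simplifying assumption that may not hold for real tests. The function name, the prior, and the test characteristics are all hypothetical.

```python
def update_probability(prior, sens, spec, positive):
    """Update the probability of disease after one test result.

    Assumes this test is conditionally independent of earlier tests
    given the true diagnosis (a simplification).
    """
    if positive:
        likelihood_ratio = sens / (1 - spec)   # positive likelihood ratio
    else:
        likelihood_ratio = (1 - sens) / spec   # negative likelihood ratio
    posterior_odds = prior / (1 - prior) * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


# Hypothetical sequence: 5% pretest probability, then two positive tests.
p0 = 0.05
p1 = update_probability(p0, sens=0.90, spec=0.85, positive=True)
p2 = update_probability(p1, sens=0.80, spec=0.90, positive=True)

print(f"pretest {p0:.2f} -> after test 1 {p1:.2f} -> after test 2 {p2:.2f}")
```

Under the independence assumption, the final probability is the same whether the results are applied one at a time or in the opposite order, which mirrors the point that updating can be done all at once or sequentially.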

This mathematical process is equivalent to the clinical concept of a diagnostic algorithm. In the example of diagnostic imaging, the addition of MRI to an algorithm might increase sensitivity slightly, at the expense of decreased specificity, leaving overall accuracy largely unchanged. The development of sequential diagnostic algorithms (and their close cousins, scoring systems) is a major endeavor within evidence-based medicine. Such algorithms are often analyzed quantitatively as decision trees.

The ultimate goal of probabilistic modeling of diagnostic algorithms is combining them with models of surgical outcomes and economics to set healthcare policy. Each test or procedure in a decision tree can be assigned an economic cost. What's more, each outcome, state of health, duration, and quality of life must be measured and valued to support policy decisions. Clearly, ethics, values, and patient care interact in complex ways in assigning value to such human concepts. To advocate for patients, orthopaedic surgeons should use quantitative tools deftly, judiciously, and accurately to frame decisions that are uniquely human.

S. Raymond Golish, MD, PhD, MBA, chairs the AAOS Biomedical Engineering Committee. He can be reached at


  1. Panousis K, Grigoris P, Butcher I, Rana B, Reilly JH, Hamblen DL. Poor predictive value of broad-range PCR for the detection of arthroplasty infection in 92 cases. Acta Orthop, 76-3, 341–346.
  2. Mascarenhas D, Dreizin D, Bodanapally UK, Stein DM. Parsing the Utility of CT and MRI in the Subaxial Cervical Spine Injury Classification (SLIC) System: Is CT SLIC Enough? AJR Am J Roentgenol, 206-6, 1292–1297.
  3. Golish SR, Hanna LS, Bowser RP, Montesano PX, Carragee EJ, Scuderi GJ. Outcome of lumbar epidural steroid injection is predicted by assay of a complex of fibronectin and aggrecan from epidural lavage. Spine, 36-18, 1464–1469.