Fig. 1 Artificial neural networks pass the input through multiple layers (input layer, hidden layer, and output layer) to determine which aspects of the input are important for predictive ability. Each line is a connection between the neurons in adjacent layers, and each layer may perform a different transformation on the input.
Adapted from Wikimedia Commons at


Published 11/1/2020
Ayoosh Pareek, MD; Yining Lu, MD; R. Kyle Martin, MD, FRCSC; Lars Engebretsen, MD, PhD; Aaron Krych, MD, FAAOS

Machine Learning in Orthopaedics Is Ready for Prime Time

Over the past decade, machine learning (ML) has become an important tool, with promising early results, yet its adoption in orthopaedic surgery has been slow. Although the ideal role of ML in orthopaedics is still being determined, we can rest assured that it is here to stay. Therefore, we should become familiar with both its potential to improve patient care and the pitfalls of creating and interpreting ML models.

ML can be defined as the study of computer algorithms that improve automatically with experience rather than through explicit programming. ML is a branch of artificial intelligence (AI). Deep learning, a subset of ML and the technique usually meant when this article refers to AI, uses neural networks (NNs), algorithms initially modeled on the workings of the human brain, to find patterns for decision making (Fig. 1).
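To make the layered structure in Fig. 1 concrete, the following is a minimal sketch of a forward pass through a network with one hidden layer. The weights, biases, layer sizes, and input features are hypothetical, hand-picked numbers for illustration only; a real network learns its weights from data, and none of the studies discussed here used this code.

```python
import math


def sigmoid(x: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense_layer(inputs, weights, biases):
    """One fully connected layer: a weighted sum per neuron, then a sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = dense_layer(inputs, hidden_w, hidden_b)
    return dense_layer(hidden, output_w, output_b)


# Hypothetical weights (assumption): 3 inputs, 2 hidden neurons, 1 output neuron.
HIDDEN_W = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
HIDDEN_B = [0.0, 0.1]
OUTPUT_W = [[1.2, -0.7]]
OUTPUT_B = [0.05]

features = [0.6, 0.1, 0.9]  # three made-up, already-scaled input features
prediction = forward(features, HIDDEN_W, HIDDEN_B, OUTPUT_W, OUTPUT_B)[0]
print(f"predicted probability: {prediction:.3f}")
```

Each inner list in the weight matrices corresponds to the lines entering one neuron in the figure; training adjusts those numbers so that the output matches known outcomes.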

In the real world, ML has already made palpable contributions. For example, Hershey was able to use premade ML algorithms to save $500,000 per batch of candy without ever employing a data scientist. Using premade algorithms in a cloud-based ML platform, Hershey analyzed data from its sensors to optimize candy manufacturing in two ways: finding the conditions that give the most accurate candy weight, so none is wasted, and determining the optimal production settings so the same process could be run for multiple candy lines.

The use of ML in medicine and orthopaedic surgery can be divided simply into fields of diagnosis, prediction, and automation.

ML and AI have already made a mark on improving diagnostic capabilities. In two papers describing the use of NNs to assess thousands of hip and wrist fracture radiographs, AI had an overall accuracy of 93.7 percent and 98.0 percent, respectively. In both studies, AI alone performed at an “expert” level of reading radiographs. That performance level is not always available to patients presenting to a primary care physician or an emergency department. Additionally, in both studies, AI was able to help the nonexpert reach similar levels of accuracy, detecting fractures as an expert would.

Algorithms in ML have also demonstrated exceptional predictive capabilities given the appropriate circumstances. In a study published in Scientific Reports, ML analysis of 59,000 ICU patients produced a model with an overall accuracy of 83 percent in predicting septic shock. The accuracy was similar to that of previous ML models. However, the algorithm maintained that accuracy while predicting shock more than 20 hours earlier than other currently available clinical methods and algorithms. In orthopaedics, ML has not only been able to predict which patients will achieve a minimal clinically important difference after total joint arthroplasty; it also has used MRI to determine the most important factors for predicting progression to total knee arthroplasty within five years, with an accuracy of 94 percent. In addition, multiple studies have used these advanced methods to predict patient survival, complications, cost, and outcomes to assess how these areas can be optimized.

Automation remains an area in which ML implementation can significantly increase accuracy while maintaining, or in some cases decreasing, costs. In a recent study, authors utilized natural language processing, a method to automatically extract data from unstructured free text, to collect data retrospectively for an arthroplasty registry with an overall accuracy of 96 percent. Another study utilized deep learning to automatically detect implant loosening with radiographic and patient data with an accuracy of close to 86 percent. Those types of studies can be used clinically to automatically provide physicians with information that can help guide patient-specific treatment.

Although ML is a valuable tool, it is important to illustrate some potential pitfalls that are critical for clinicians to understand. First, ML requires significant resources, the biggest of which is data. Although most ML studies rely on the analysis of data from thousands of patients to make accurate predictive models, the quality of the data is paramount. Additionally, although ML can be used on smaller datasets, its predictive ability may not be significantly greater than that of traditional statistical analyses. ML also may be nuanced in some situations, as making models more transparent may actually hamper people’s ability to detect when a model has made a sizeable mistake.

Fig. 2 Receiver operating characteristic curves illustrate the performance of a binary classifier by plotting the true-positive rate, also known as sensitivity, against the false-positive rate (1 minus specificity) across decision thresholds. The curve helps in selecting an optimal model or threshold based on the desired sensitivity, specificity, and overall accuracy.
Adapted from Wikimedia Commons at

The increased implementation of learned algorithms in healthcare research has highlighted the need for standardized reporting guidelines to optimize the interpretability and effectiveness of published data. The Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement was established in 2015 to ensure “full and clear reporting of information on all aspects of a prediction model.” The Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research were published in 2016 with the similar goal of standardizing the reporting of models developed with ML methods. The guidelines outline a workflow for model development and propose minimum requirements for reporting model performance metrics so that bias and usefulness can be adequately assessed. In the current literature, these requirements most often fall into four categories: (1) overall performance, (2) concordance, (3) goodness of fit, and (4) clinical utility. For the most commonly encountered ML classification problems, the corresponding measures are the Brier score for overall performance, the area under the receiver operating characteristic curve for concordance (Fig. 2), calibration curves for goodness of fit, and decision curve analysis for clinical utility.
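As a rough illustration of three of these measures, the sketch below computes the area under the receiver operating characteristic curve, the Brier score, and the net benefit used in decision curve analysis. It uses the standard textbook definitions, and the outcomes and predicted probabilities are made-up numbers rather than data from any cited study. Calibration curves are omitted for brevity, and in practice an established statistics library would normally be used instead of hand-written functions.

```python
def auc(y_true, y_prob):
    """Concordance: chance that a random positive case is scored above a
    random negative case (ties count as half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Overall performance: mean squared gap between predicted probability
    and observed outcome (0 is perfect)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def net_benefit(y_true, y_prob, threshold):
    """Clinical utility at one threshold probability: true positives minus
    false positives weighted by the odds of the threshold, per patient."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Made-up outcomes (1 = event occurred) and model probabilities (assumption).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]

print(f"AUC:         {auc(y_true, y_prob):.4f}")
print(f"Brier score: {brier_score(y_true, y_prob):.4f}")
print(f"Net benefit at 0.5: {net_benefit(y_true, y_prob, 0.5):.4f}")
```

A decision curve is obtained by repeating the net benefit calculation over a range of thresholds and comparing the model with the strategies of treating every patient or no patient.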

One may wonder, then, how ML can be incorporated into a physician’s practice; ultimately, it starts with the quality and quantity of data. Many ML algorithms can tolerate some missing data, but the quality of the input data remains essential. Typically, advanced algorithms such as deep learning require thousands, if not tens of thousands, of rows of data (especially for rare events). Premade algorithms are available in cloud-based ML services (also known as “AutoML”), and customized algorithms can be built in popular platforms such as R (R Core Team, Vienna, Austria) or Python (Python Software Foundation). It is important to note that those platforms require experienced ML data scientists or engineers who understand the nuances of the data analyses. Typically, data are split into training and testing sets so the algorithm can be fit on one portion and its predictions checked on unseen data. After an accurate and well-calibrated model is obtained, it can be utilized right away, validated with external data from other centers, or improved with additional data. Continuous improvement from additional data over time is often called continual (or incremental) learning; it should not be confused with reinforcement learning, in which an algorithm learns from rewards for its actions. The algorithm can also be hosted online for use by the public.
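The training and testing split described above can be sketched as follows. This is a minimal illustration under stated assumptions: the data are synthetic, the 80/20 split is a common convention rather than a rule from the guidelines, and the "model" is a deliberately simple single-feature cut-off standing in for a real ML algorithm. The function names are hypothetical.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(rows, threshold):
    """Share of rows where 'feature >= threshold' matches the true label."""
    return sum((x >= threshold) == bool(y) for x, y in rows) / len(rows)


def fit_threshold(train_rows):
    """Toy 'model': the cut-off that maximizes accuracy on the training set."""
    candidates = sorted({x for x, _ in train_rows})
    return max(candidates, key=lambda t: accuracy(train_rows, t))


# Synthetic data (assumption): one feature that tends to be higher when the
# outcome is 1. Each row is (feature value, outcome).
rng = random.Random(42)
data = []
for _ in range(200):
    label = int(rng.random() < 0.5)
    data.append((rng.gauss(1.0 if label else 0.0, 0.5), label))

train, test = train_test_split(data, test_fraction=0.2, seed=1)
threshold = fit_threshold(train)  # the model only ever sees the training set

print(f"training accuracy: {accuracy(train, threshold):.2f}")
print(f"testing accuracy:  {accuracy(test, threshold):.2f}")
```

The testing accuracy is the more honest estimate of how the model will behave on new patients, because those rows played no part in choosing the cut-off. External validation repeats the same check on data from another center.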

In summary, although many have recognized the value of ML in today’s era of “big data,” dissemination of this innovation has been slow in health care. Previous articles have discussed this exact phenomenon and attributed it largely to the fact that healthcare professionals are “late majority” adopters. In this sense, it is our duty to invest in and become early adopters of ML, trust and enable this innovation, create room for this change in our system, and lead by example. When we evaluate which innovations will help our patients in the future, ML should be at the forefront.

Ayoosh Pareek, MD, is an orthopaedic surgery resident and chair of the AAOS Resident Assembly Research Committee.

Yining Lu, MD, is a first-year orthopaedic surgery resident with a research interest in machine learning and data analytics.

R. Kyle Martin, MD, FRCSC, is an assistant professor and orthopaedic surgeon specializing in knee, hip, and shoulder orthopaedic sports medicine with the University of Minnesota and CentraCare in Saint Cloud, Minn.

Lars Engebretsen, MD, PhD, is a professor in the orthopaedic clinic at the University of Oslo, cochair of the Oslo Sports Trauma Research Center, and head of scientific activities for the International Olympic Committee.

Aaron Krych, MD, FAAOS, is cochair of sports medicine and professor of orthopaedic surgery at Mayo Clinic in Rochester, Minn.


  1. Obermeyer Z, Emanuel EJ: Predicting the future – big data, machine learning, and clinical medicine. N Engl J Med 2016;375:1216-9.
  2. Chen JH, Asch SM: Machine learning and prediction in medicine – beyond the peak of inflated expectations. N Engl J Med 2017;376:2507-9.
  3. Liu Y, Chen PC, Krause J, et al: How to read articles that use machine learning: users’ guides to the medical literature. JAMA 2019;322:1806-16.
  4. Maddox T: How Hershey used IoT to save $500K for every 1% of improved efficiency in making Twizzlers. Available at: Accessed September 22, 2020.
  5. Krogue JD, Cheng KV, Hwang KM, et al: Automatic hip fracture identification and functional subclassification with deep learning. Radiology: Artificial Intelligence 2020;2:e190023.
  6. Lindsey R, Daluiski A, Chopra S, et al: Deep neural network improves fracture detection by clinicians. Proc Natl Acad Sci U S A 2018;115:11591-6.
  7. Fagerström J, Bång M, Wilhelms D, et al: LiSep LSTM: a machine learning algorithm for early detection of septic shock. Sci Rep 2019;9:15132.
  8. Fontana MA, Lyman S, Sarker GK, et al: Can machine learning algorithms predict which patients will achieve minimally clinically important differences from total joint arthroplasty? Clin Orthop Relat Res 2019;477:1267-79.
  9. Tolpadi AA, Lee JJ, Pedoia V, et al: Deep learning predicts total knee replacement from magnetic resonance images. Sci Rep 2020;10:6371.
  10. Bini SA: Artificial intelligence, machine learning, deep learning, and cognitive computing: what do these terms mean and how will they impact health care? J Arthroplasty 2018;33:2358-61.
  11. Shah RF, Bini S, Vail T: Data for registry and quality review can be retrospectively collected using natural language processing from unstructured charts of arthroplasty patients. Bone Joint J 2020;102-B(7_Supple_B):99-104.
  12. Shah RF, Bini SA, Martinez AM, et al: Incremental inputs improve the automated detection of implant loosening using machine-learning algorithms. Bone Joint J 2020;102-B(6_Supple_A):101-6.
  13. Poursabzi-Sangdeh F, Goldstein D, Hofman J, et al: Manipulating and measuring model interpretability. ArXiv 2018;abs/1802.07810.
  14. Shillan D, Sterne JAC, Champneys A, et al: Use of machine learning to analyse routinely collected intensive care unit data: a systematic review. Crit Care 2019;23:284.
  15. Collins GS, Reitsma JB, Altman DG, et al: Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 2015;350:g7594.
  16. Luo W, Phung D, Tran T, et al: Guidelines for developing and reporting machine learning predictive models in biomedical research: a multidisciplinary view. J Med Internet Res 2016;18:e323.
  17. Steyerberg EW, Vickers AJ, Cook NR, et al: Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology 2010;21:128-38.
  18. Dankers F, Traverso A, Wee L, et al: Prediction modeling methodology. In: Kubben P, Dumontier M, Dekker A, eds. Fundamentals of Clinical Data Science. Cham (CH)2019:101-20.
  19. Berwick DM: Disseminating innovations in health care. JAMA 2003;289:1969-75.