Development of machine learning-based mpox surveillance models in a learning health system
  1. Harry Reyes Nieva (1,2,3)
  2. Jason Zucker (1,3)
  3. Emma Tucker (4)
  4. Jacob McLean (3)
  5. Clare DeLaurentis (3)
  6. Shauna Gunaratne (3)
  7. Noémie Elhadad (1,5)
  1. Department of Biomedical Informatics, Columbia University, New York, New York, USA
  2. Department of Medicine, Harvard Medical School, Boston, Massachusetts, USA
  3. Division of Infectious Diseases, NewYork-Presbyterian Hospital/Columbia University Irving Medical Center, New York, New York, USA
  4. Vagelos College of Physicians and Surgeons, Columbia University, New York, New York, USA
  5. Department of Computer Science, Columbia University, New York, New York, USA
Correspondence to Dr Harry Reyes Nieva; harry.reyes{at}columbia.edu

Abstract

Objectives This study aimed to develop robust machine learning (ML)-based and deep learning (DL)-based models that detect mpox cases from clinical notes to support surveillance efforts.

Methods As part of a learning health system initiative, we conducted a retrospective study of clinical encounters at the Columbia University Irving Medical Center in New York City. We included patients with mpox diagnoses confirmed by PCR testing between 15 May 2022 and 15 October 2022 and three matched controls for each case based on patient age, sex, race, ethnicity and visit month. We trained three mpox surveillance models using: (1) logistic regression with L1 regularisation (least absolute shrinkage and selection operator (LASSO)), (2) ClinicalBERT and (3) ClinicalLongformer. We evaluated model performance using precision, recall, F1 score, area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC) and recall at 80% precision (RP80).
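To illustrate why L1 regularisation doubles as feature selection, the following is a minimal sketch of L1-penalised logistic regression fitted by proximal gradient descent. It is not the authors' implementation: the solver, the penalty strength `lam`, the learning rate, the function names and the two-column toy matrix are all assumptions made for illustration, and the intercept is left unpenalised by choice.

```python
import math
from typing import List, Tuple


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink value toward zero by threshold (the proximal step for an L1 penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(
    X: List[List[float]],
    y: List[int],
    lam: float = 0.1,    # assumed penalty strength, not taken from the study
    lr: float = 0.5,     # assumed learning rate
    epochs: int = 500,   # assumed number of full passes over the data
) -> Tuple[List[float], float]:
    """Fit logistic regression with an L1 penalty on the weights only."""
    n, d = len(X), len(X[0])
    weights = [0.0] * d
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for row, label in zip(X, y):
            z = bias + sum(w * x for w, x in zip(weights, row))
            prob = 1.0 / (1.0 + math.exp(-z))
            error = prob - label
            for j in range(d):
                grad_w[j] += error * row[j] / n
            grad_b += error / n
        # Gradient step on the log-loss, then soft-thresholding for the L1 term.
        weights = [soft_threshold(w - lr * g, lr * lam) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


# Made-up toy data: columns are hypothetical indicators for the phrases
# "lesions" (present only in cases) and "follow-up" (equally common in both groups).
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
weights, bias = fit_l1_logistic(X, y)
print(weights)
```

On this toy matrix the uninformative column is shrunk to exactly zero while the informative one keeps a positive weight, which is the property that lets the surviving coefficients be ranked as predictive phrases.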

Results The study included 228 PCR-confirmed mpox cases and 698 controls. LASSO regression outperformed the DL models, with precision, recall and F1 score each of 0.93, AUROC of 0.97, AUPRC of 0.93 and RP80 of 0.89. ClinicalBERT achieved a precision of 0.88, recall of 0.89, F1 score of 0.88 and AUROC of 0.93. ClinicalLongformer achieved a precision of 0.87, recall of 0.88, F1 score of 0.87 and AUROC of 0.92. Phrases related to symptoms (eg, lesions and pain) were among the most predictive features in LASSO regression.
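For readers less familiar with the threshold-based metrics reported here, the sketch below computes precision, recall, F1 score and recall at a minimum precision from labels and model scores. It reads RP80 as the highest recall reached at any score threshold whose precision is at least 0.80. That is one common reading, and the abstract does not state the authors' exact computation. The function names and the example labels and scores are made up for illustration.

```python
from typing import List, Tuple


def precision_recall_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for binary labels (1 = case, 0 = control)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def recall_at_precision(y_true: List[int], scores: List[float], min_precision: float = 0.80) -> float:
    """Highest recall over all score thresholds with precision >= min_precision."""
    total_pos = sum(y_true)
    if total_pos == 0:
        return 0.0
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    best_recall = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Admit every example tied at this score before evaluating the threshold.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if tp / (tp + fp) >= min_precision:
            best_recall = max(best_recall, tp / total_pos)
    return best_recall


# Made-up example: four cases, four controls, scores from a hypothetical model.
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.55, 0.30, 0.10]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(precision_recall_f1(y_true, y_pred))
print(recall_at_precision(y_true, scores, min_precision=0.80))
```

Precision, recall and F1 depend on a single chosen cut-off, whereas recall at a fixed precision scans all cut-offs, which makes it a useful summary when false positives are costly for surveillance.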

Conclusions ML and DL models based on clinical notes show promise for identifying mpox cases. In this study, LASSO regression outperformed DL models and excelled in minimising false positives. These findings highlight the potential for ML and DL methods to support case surveillance for mpox and other infectious diseases. These methods may also prove helpful for flagging missed or delayed diagnoses as part of continuous quality improvement.

  • Population Surveillance
  • Disease Transmission, Infectious
  • INFORMATION TECHNOLOGY

Data availability statement

The electronic health record and administrative data underlying this article are not available due to restrictions to preserve patient confidentiality. Code used to perform the study is available on reasonable request.

Footnotes

  • Handling editor Eric P F Chow

  • X @harryreyesnieva, @jason10033

  • Presented at Preliminary data from this study were presented at the Symposium on Artificial Intelligence in Learning Health Systems in Rio Grande, Puerto Rico, on 10 May 2023; STI & HIV 2023 World Congress in Chicago, IL, USA, on 25 July 2023; and the American Medical Informatics Association 2023 Annual Symposium in New Orleans, LA, USA, on 13 November 2023.

  • Contributors All authors: acquisition, analysis or interpretation of data; critical revision of the manuscript for important intellectual content; and approved the final manuscript. HRN, NE and JZ: concept and design. HRN: drafting of the manuscript; statistical analysis; and administrative, technical or material support. NE and JZ: obtained funding and supervision. All authors had full access to the data in the study and they take responsibility for the integrity of the data and accuracy of the analysis. HRN is the guarantor.

  • Funding This work was supported by grants from the National Institutes of Health (T15-LM007079 to NE and K23-AI150378 and UM1AI069470 to JZ) and Association for Computing Machinery Special Interest Group in High Performance Computing (Computational and Data Science Fellowship to HRN).

  • Competing interests None declared.

  • Patient and public involvement statement There was no patient or public involvement during the design, conduct, reporting, interpretation or dissemination of the study.

  • Provenance and peer review Not commissioned; externally peer reviewed.