Using Machine Learning to Estimate the Prevalence and Onset of a Disease
Machine learning tools can be used to detect previously unobserved relationships among different kinds of data. This was an important consideration for an Analysis Group team – Managing Principal James Signorovitch, Principal Jimmy Royer, Vice President Irina Pivneva, and Managers Jutong Pan and Tom Cornwall – tasked with estimating the probability of the onset of a difficult-to-diagnose disorder, as well as the true prevalence of this disorder in the population at large. The team constructed an optimal classification tree (OCT) model to analyze data from both health care claims and electronic medical records (EMRs). Drawing on both sources was critical for a disease whose symptoms often overlap with those of other diseases.
The team’s estimation model, pictured below, sorts patients through a series of predictors – for example, whether a particular patient had undergone an immunoglobulin test or visited an internal medicine specialist. The colored boxes (the “leaves” at the end of each “branch,” shown in either green or yellow) denote both the probability (p) of developing the condition, based on the preceding series of decisions, and the number (n) of instances on which that probability is based.
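To make the reading of such a tree concrete, here is a minimal Python sketch of how a leaf's p and n can be derived once the branching questions are fixed. It is an illustration only, not the team's model: the tree shape, the feature names (`had_immunoglobulin_test`, `visited_internal_medicine`), and the toy patient records are all hypothetical. The splits are also hard-coded, whereas an optimal classification tree chooses them by optimization.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    """End of a branch: counts the patients that reach it."""
    positives: int = 0  # patients in this leaf who developed the condition
    n: int = 0          # all patients in this leaf

    @property
    def p(self) -> float:
        # Estimated probability = share of positive cases in the leaf.
        return self.positives / self.n if self.n else float("nan")


@dataclass
class Split:
    """Yes/no question on one predictor."""
    feature: str
    if_true: Union["Split", Leaf]
    if_false: Union["Split", Leaf]


def find_leaf(node, patient):
    """Follow the series of decisions until a leaf is reached."""
    while isinstance(node, Split):
        node = node.if_true if patient[node.feature] else node.if_false
    return node


def fit_leaves(tree, records):
    """Fill each leaf's counts from (patient, has_condition) pairs."""
    for patient, has_condition in records:
        leaf = find_leaf(tree, patient)
        leaf.n += 1
        leaf.positives += int(has_condition)


# Hypothetical tree: the splits are fixed by hand, NOT learned.
tree = Split(
    "had_immunoglobulin_test",
    if_true=Split("visited_internal_medicine", Leaf(), Leaf()),
    if_false=Leaf(),
)


def patient(test, visit):
    return {"had_immunoglobulin_test": test, "visited_internal_medicine": visit}


# Made-up toy records, for illustration only.
records = (
    [(patient(True, True), True)] * 2
    + [(patient(True, True), False)]
    + [(patient(True, False), True)]
    + [(patient(True, False), False)] * 3
    + [(patient(False, False), True)]
    + [(patient(False, False), False)] * 4
)
fit_leaves(tree, records)

leaf = find_leaf(tree, patient(True, True))
print(f"p = {leaf.p:.2f}, n = {leaf.n}")
```

A leaf with a high p but a small n rests on few cases, which is why the two numbers are reported together.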
Work of this kind can give health care providers a starting point for more effective diagnostic and therapeutic progress. ■