Machine Learning Struggles to Predict Antimicrobial Resistance

Sampling biases driven by the structure of bacterial populations dramatically affect the ability of machine learning (ML) to predict antimicrobial resistance (AMR), researchers have warned.
The findings, in PLOS Biology, stress the importance of methods that take this population structure into account and use more diverse datasets to improve AMR prediction and surveillance.
“Addressing the interplay between population structure and AMR prediction will require a multifaceted approach that includes improved sampling, algorithmic innovation, and systematic evaluation of proposed prediction methods,” the researchers, led by Yanying Yu, PhD, from Harvard Medical School, recommend.
“Only by confronting these challenges can we unlock the full potential of machine learning to provide actionable insights into AMR, advancing both our surveillance capabilities and our understanding of resistance mechanisms.
“These efforts will be critical to combating the global threat of AMR and ensuring the continued efficacy of life-saving antimicrobial therapies.”
AMR poses a severe threat and is associated with nearly five million deaths each year. ML methods have emerged as promising detection tools, uncovering determinants of resistance from genomic data.
However, most classical ML methods assume training data are independent and identically distributed, which is not true of pathogen surveillance samples due to the underlying structure of bacterial populations.
During an epidemic, successful clones spread rapidly, and if this spread is due in part to acquisition of AMR determinants, it could lead to an association between the phenotype and phylogenetic markers that do not directly contribute to AMR.
These non-causative associations are likely further exacerbated by biased sampling that focuses on human disease in high-income countries, leaving large regions of the phylogeny unexplored.
By constructing real pathological scenarios, the researchers comprehensively evaluated the impact of bacterial population structure in predicting AMR.
They collected between 3,204 and 7,188 genomes for three Gram-negative and two Gram-positive species representing current WHO priority pathogens.
These included the gastrointestinal and urinary tract pathogen Escherichia coli; the opportunistic pathogen Klebsiella pneumoniae; the gastrointestinal pathogen Salmonella enterica; the skin commensal and opportunistic pathogen Staphylococcus aureus; and the major agent of community-acquired pneumonia Streptococcus pneumoniae.
The dataset included resistance phenotypes for 27 antibiotics altogether, spanning multiple drug classes and diverse sequence types.
To limit the effects of sample size and class imbalance, the researchers excluded antibiotic-organism combinations with fewer than 1,000 genomes or in which either resistant or susceptible strains made up more than 80% of the dataset.
The average number of genomes was around 2,700, with approximately 44% resistant strains, and 80% of the datasets contained between 1,134 and 3,955 genomes, with 25% to 71% resistant.
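The exclusion rule described above can be written as a short check. The following is a minimal sketch in Python, not the researchers' code: the function name, its parameters, and the example counts are hypothetical, and only the two thresholds (1,000 genomes and 80%) come from the article.

```python
def keep_combination(n_resistant, n_susceptible,
                     min_genomes=1000, max_class_fraction=0.80):
    """Return True if an antibiotic-organism combination passes both filters.

    Hypothetical helper: the thresholds follow the article, everything
    else is an assumption made for illustration.
    """
    total = n_resistant + n_susceptible
    # Filter 1: drop combinations with fewer than 1,000 genomes.
    if total < min_genomes:
        return False
    # Filter 2: drop combinations where one class exceeds 80% of the data.
    majority_fraction = max(n_resistant, n_susceptible) / total
    return majority_fraction <= max_class_fraction


# Illustrative counts only; these are not figures from the study.
print(keep_combination(1200, 1500))  # enough genomes, reasonably balanced
print(keep_combination(500, 400))    # too few genomes
print(keep_combination(2500, 500))   # resistant strains dominate
```

Applying both conditions together keeps only combinations that are large enough to train on and not dominated by a single class.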
A whole-genome alignment was constructed for each species, and discrete clades were identified based on deep divergences between branches in the phylogenetic tree.
Results showed that ML models continued to conflate lineage markers with genuine indicators of resistance, even in the presence of a large training dataset.
“Our results demonstrate that current ML approaches are particularly vulnerable to the confounding effects of biased sampling and population structure,” the researchers reported. “Even when sample sizes are increased, models struggle to generalize beyond the specific clades included in the training set, underscoring the limitations of scaling as a solution to bias.”
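One common way to probe whether a model generalizes beyond the clades it was trained on is to hold out an entire clade for testing. The sketch below illustrates that general idea in plain Python; it is not necessarily the protocol used in the study, and the function name and the toy clade labels are assumptions made for illustration.

```python
from collections import defaultdict


def leave_one_clade_out(clade_labels):
    """Yield (held_out_clade, train_indices, test_indices) for each clade.

    Hypothetical helper: every genome from the held-out clade goes into
    the test set, so the model never sees that lineage during training.
    """
    by_clade = defaultdict(list)
    for index, clade in enumerate(clade_labels):
        by_clade[clade].append(index)

    for held_out in sorted(by_clade):
        test = by_clade[held_out]
        train = sorted(
            i
            for clade, indices in by_clade.items()
            if clade != held_out
            for i in indices
        )
        yield held_out, train, test


# Toy example: six genomes assigned to three made-up clades.
clades = ["A", "A", "B", "C", "B", "C"]
splits = list(leave_one_clade_out(clades))
for held_out, train, test in splits:
    print(held_out, train, test)
```

A random split would scatter genomes from every clade across training and test sets, which can let a model score well by recognizing lineage markers alone. Holding out whole clades removes that shortcut.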
While their machine-learning framework used a combination of core genome single-nucleotide polymorphisms and accessory gene presence–absence patterns to capture broad genomic differences beyond the strict core genome, it still did not capture sequence variation within accessory genes.
“This limitation is particularly relevant for species with large and diverse accessory genomes, such as E. coli and K. pneumoniae, where within-gene variation may carry additional predictive signals for AMR,” the authors noted.
They concluded: “Expanding surveillance efforts to underrepresented regions and settings, particularly in low- and middle-income countries (LMICs), will provide more balanced datasets and reduce biases toward high-income countries and outbreak scenarios.
“Targeted sampling that prioritizes diversity, both within and across clades, will be critical for developing models capable of generalizing across phylogenetic backgrounds.
“Beyond sampling, careful experimental design, including the use of balanced test sets and rigorous evaluation metrics such as those employed here, will further ensure that models are robust to confounding effects.”
