A machine learning approach to leveraging electronic health records for enhanced omics analysis

In general, COMET can be applied when EHR data are available for a large cohort of patients, and omics data are available for a smaller sub-cohort. A model trained on patients with only EHR data (the ‘pretraining cohort’) has its weights transferred to a multimodal network, which is further trained and tested on the smaller population with both EHR and omics data (the ‘omics cohort’) (Fig. 1a). COMET consists of three parts: a method to embed longitudinal EHR data20 (Fig. 1b), pretraining and multimodal modelling (Fig. 1c). Here we used COMET to analyse two independent cohorts, one pregnancy cohort from Stanford Health Care and one cancer cohort from the UK Biobank. In each cohort, we demonstrate COMET’s state-of-the-art performance for a clinically meaningful predictive modelling task: days to the onset of labour or three-year all-cause mortality, respectively. We perform all the modelling experiments 25 times with different train, test and validation splits, and compute the performance metrics using the average predictions from the validation set.
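The repeated-split evaluation described above can be sketched as follows. This is a simplified illustration, not the authors' code: it collapses the train, test and validation splits into a single train/held-out split, and `predict_fn`, `holdout_frac` and the function name are hypothetical.

```python
import numpy as np

def averaged_holdout_predictions(n_samples, n_repeats, predict_fn,
                                 holdout_frac=0.2, seed=0):
    """Average each sample's held-out predictions over repeated random splits.

    predict_fn(train_idx, holdout_idx) is a hypothetical callable that fits a
    model on train_idx and returns one prediction per entry of holdout_idx.
    """
    rng = np.random.default_rng(seed)
    sums = np.zeros(n_samples)
    counts = np.zeros(n_samples)
    n_holdout = max(1, int(round(holdout_frac * n_samples)))
    for _ in range(n_repeats):
        perm = rng.permutation(n_samples)
        holdout_idx, train_idx = perm[:n_holdout], perm[n_holdout:]
        sums[holdout_idx] += predict_fn(train_idx, holdout_idx)
        counts[holdout_idx] += 1
    # Samples that were never held out get NaN rather than a made-up value.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counts > 0, sums / counts, np.nan)
```

Performance metrics would then be computed once on the averaged predictions, rather than averaged across the individual repeats.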

Fig. 1: COMET is a deep learning framework that uses large, observational EHR databases and transfer learning to improve the analysis of small datasets from omics studies.

a, The input to COMET is EHR data and (for a subset of patients) paired, tabular omics data. The patients who only have EHR data are used to pretrain (PT) a neural network to predict patient outcomes using only EHR data. The weights from this EHR network are transferred to a multimodal neural network used to analyse both EHR and omics data; the neural network is used for predictive modelling and post hoc analysis of the network is used for biological discovery. The COMET framework is flexible and can be used to predict any continuous or binary outcome. b, One-hot encoded vectors of EHR data (shown in white) are converted into embeddings (shown in blue) using word2vec; the embeddings of all codes that occur within a particular day are averaged to compute sequential, summary embeddings. c, COMET uses a multimodal deep learning architecture to analyse both EHR data and omics data. Only EHR data are used in the pretraining stage; the core architecture is an RNN with gated recurrent units. After pretraining, the RNN weights are frozen and transferred into a multimodal architecture that analyses both EHR and omics data. Panel a created with BioRender.com.

COMET accurately predicted days to the onset of labour

We first applied COMET to predict days to the onset of labour in a population of pregnant people (n = 30,904 patients) who delivered newborns at Stanford from 2013 to 2021. The EHR data for all individuals were extracted from the Stanford STARR OMOP database. For a subset of pregnant patients (n = 61 patients, the omics cohort), multiple blood (plasma) samples were collected throughout the last 100 days of their pregnancy and used to generate a targeted proteomics dataset that measured 1,317 different proteins21. We used the EHR data from the beginning of pregnancy up to the time of the blood sampling and aimed to predict days to the onset of labour (from the time of sampling). For the patients with only EHR data (n = 30,843 patients, the pretraining cohort), there is no sampling time (as these patients do not have proteomics data). Therefore, we randomly chose a time point within the last 100 days of pregnancy, used the EHR data up to the sampled time as features and predicted days to the onset of labour from the sampled time as the pretraining task (Fig. 2a,b).
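As a minimal sketch of how a pretraining example could be built for a patient without a sampling time, the snippet below picks a random cutoff in the last 100 days of pregnancy and derives the target from it. The uniform sampling, the day-level resolution and all names (`sample_pretraining_example`, `event_days`, `delivery_day`) are assumptions for illustration; the text does not specify these details.

```python
import numpy as np

def sample_pretraining_example(event_days, delivery_day, rng, window=100):
    """Choose a random cutoff within the last `window` days before labour.

    event_days: days since the start of pregnancy on which EHR codes occur.
    Returns (indices of events at or before the cutoff,
             days from the cutoff to the onset of labour).
    """
    event_days = np.asarray(event_days)
    earliest = max(0, delivery_day - window)
    # Assumption: cutoff drawn uniformly, both endpoints included.
    cutoff = int(rng.integers(earliest, delivery_day + 1))
    keep = np.flatnonzero(event_days <= cutoff)
    return keep, delivery_day - cutoff

rng = np.random.default_rng(0)
keep, days_to_labour = sample_pretraining_example([5, 40, 120, 200, 260, 275], 280, rng)
```

Only the events indexed by `keep` would be used as features, so no information recorded after the cutoff leaks into the pretraining task.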

Fig. 2: Multimodal data reveals EHR–proteomics interactions related to pregnancy progression and time to the onset of labour.

a, For patients with proteomics data, input features were constructed using EHR data from the beginning of pregnancy up to the proteomics sampling time (shaded in green); for patients in the pretraining (PT) cohort without proteomics data (and therefore without a sampling time), we randomly sample a time point at which to cut off the EHR data (features are constructed from the time shown in blue; we use days from that time point until labour as our predictive modelling task). b, We utilized data from women who gave birth at Stanford, and split the women into two populations based on whether or not they had omics data available. c, Predictions using the COMET framework are compared with actual days to the onset of labour, with the regression line shown in red. d, t-SNE visualization of the onset of labour data. The dots represent individual features and are coloured based on modality; dots are sized based on the feature’s univariate Pearson correlation with days to the onset of labour. The clusters with only protein variables are annotated based on gene ontology enrichment analysis and clusters containing both clinical and protein variables are annotated based on clinical themes. e, Heat map showing the number of significant correlations (after Bonferroni correction) between the EHR features and proteins; the 25 proteins with the greatest number of statistically significant correlations with EHR features are shown. f, Distribution of the maximum absolute correlation of each individual EHR feature with all proteins in the onset of labour data.

We embedded longitudinal EHR data using word2vec, averaging embeddings for codes within each day. After pretraining the EHR-only architecture with the data from the pretraining cohort, we transferred weights to the full multimodal architecture, which was trained to make predictions on the omics cohort. The Pearson correlation between the predicted days to the onset of labour and actual days to the onset of labour using COMET was strong, indicating that COMET can make highly accurate predictions in small cohorts with high-dimensional data (r = 0.868, 95% confidence interval (CI) [0.825, 0.900], P = 3.9 × 10–53, root mean square error (r.m.s.e.) = 16.0) (Fig. 2c). Agreement is measured using Lin’s concordance correlation coefficient and reported in Supplementary Table 1, which confirms that COMET’s predictions align well with actual time-to-onset values without systematic bias.
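The per-day averaging step can be illustrated with a short sketch. It assumes the word2vec vectors have already been learned and are available as a dictionary; `daily_summary_embeddings`, `events` and `code_vectors` are hypothetical names, and skipping codes without a vector is an assumption.

```python
from collections import defaultdict
import numpy as np

def daily_summary_embeddings(events, code_vectors):
    """Average the embeddings of all codes recorded on the same day.

    events: iterable of (day, code) pairs for one patient.
    code_vectors: dict mapping code -> 1-D embedding (for example from word2vec).
    Returns (sorted list of days, array of shape [n_days, embedding_dim]).
    """
    by_day = defaultdict(list)
    for day, code in events:
        if code in code_vectors:  # assumption: codes without a vector are skipped
            by_day[day].append(code_vectors[code])
    days = sorted(by_day)
    if not days:
        return days, np.empty((0, 0))
    return days, np.stack([np.mean(by_day[d], axis=0) for d in days])
```

The resulting sequence of daily vectors is the kind of input a recurrent network would consume, one time step per day with recorded codes.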

We compared COMET with baseline models using only EHR data, only proteomics data or both (‘joint baseline’). These baselines use only omics cohort data without pretraining, with architectures matching the corresponding parts of COMET. The EHR-only baseline uses only the EHR part of the network (Fig. 1c, light blue). The proteomics-only baseline uses only the omics part of the network (Fig. 1c, light green). Last, the joint baseline uses both data modalities and matches the full COMET architecture. The only difference between the joint baseline and the COMET framework is that the joint baseline excludes the pretraining stage. The EHR-only baseline performed the worst (r = 0.768, 95% CI [0.699, 0.823], P = 1.55 × 10–34, r.m.s.e. = 20.4 days), and was slightly outperformed by the proteomics-only baseline (r = 0.796, 95% CI [0.733, 0.845], P = 1.3 × 10–38, r.m.s.e. = 20.2 days). The joint baseline was the highest-performing baseline (r = 0.815, 95% CI [0.757, 0.860], P = 7.8 × 10–42, r.m.s.e. = 18.4 days), but was still inferior to COMET. To confirm that COMET provides benefit across different omics modalities, we ran a similar set of experiments using metabolomics for the same cohort and show that predictive modelling results with COMET (r = 0.839, 95% CI: [0.782, 0.881]) exceed the performance of predictions from metabolites alone (r = 0.758, 95% CI: [0.678, 0.820]). Supplementary Table 2 lists the full results.

We have also compared COMET with baselines using ridge regression, and computed performance for EHR-only, proteomics-only and joint baselines. To determine if we could incorporate EHR pretraining in different ways, we trained an EHR-only ridge regression model using data from the pretraining cohort and used an adapted version of ridge regression inspired by another work7 that incorporates the coefficients from the pretrained model as priors on the weights in the joint (that is, multimodal) model. Incorporating pretraining improves the Pearson correlation of the joint baseline (from r = 0.572, 95% CI: [0.461, 0.665] to r = 0.799, 95% CI: [0.737, 0.847]), with COMET still outperforming all approaches (full results are in Supplementary Table 3).
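One common way to use pretrained coefficients as priors in ridge regression is to shrink the weights towards a prior vector instead of towards zero. The sketch below shows that formulation; it is an assumption about the general idea, not necessarily the exact adaptation used here, and `ridge_with_prior`, `w_prior` and `lam` are hypothetical names.

```python
import numpy as np

def ridge_with_prior(X, y, w_prior, lam):
    """Minimise ||y - X w||^2 + lam * ||w - w_prior||^2 in closed form.

    Setting the gradient to zero gives (X^T X + lam I) w = X^T y + lam w_prior.
    With w_prior = 0 this reduces to ordinary ridge regression.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w_prior = np.asarray(w_prior, dtype=float)
    A = X.T @ X + lam * np.eye(X.shape[1])
    b = X.T @ y + lam * w_prior
    return np.linalg.solve(A, b)
```

In a joint model, one plausible choice (again an assumption) would be to set the prior for the EHR coefficients to the pretrained values and the prior for the protein coefficients to zero. A larger `lam` keeps the fitted weights closer to the prior.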

Last, we compared COMET’s approach to computing a latent representation of EHR data, which combines word2vec with a recurrent neural network (RNN), with an approach that uses a transformer and learns token embeddings in an end-to-end manner (which we call COMET Transformer). There is strong correlation between the predictions from COMET and COMET Transformer (r = 0.94). The Pearson correlation for the transformer variant is 0.848 (95% CI: [0.800, 0.885]), slightly underperforming COMET (full results are listed in Supplementary Table 4). Taken together, these results demonstrate the value of incorporating pretraining regardless of the model architecture, and the superior ability of COMET to predict days to the onset of labour.

Analysis of COMET EHR–proteomics feature correlations revealed biological insights into pregnancy

COMET’s superior performance prompted us to further investigate the relationships between EHR and proteomics features, with the goal of gaining deeper insights into the complex biological processes during pregnancy. First, we used t-distributed stochastic neighbour embedding (t-SNE) to visualize the multimodal data by projecting the correlation matrix into two dimensions; features close together in this space have similar correlations with all other variables (Fig. 2d). We annotated these clusters based on the medical concepts that the EHR and/or protein features within each cluster represent. For example, the ‘metabolic dysregulation and abnormal foetal growth’ cluster contains clinical codes representing abnormal glucose tolerance in the mother, maternal obesity and excessive foetal growth. It also contains proteins like betacellulin and oncostatin M, which are known to play a role in glucose homeostasis and insulin sensitivity22,23.

We similarly visualized each EHR modality individually, and used lines to connect significantly correlated cross-modality variables (Supplementary Fig. 1). These visualizations revealed that there are many EHR variables that are highly correlated with other features (including proteins), suggesting redundancy in information across modalities. However, 46.5% of proteins have no significant correlations with any EHR features, indicating that the proteomics data also provide some complementary information (Supplementary Fig. 2).

Several proteins showed high numbers of significant correlations with EHR variables (Fig. 2e), such as interferon alpha and beta receptor subunit 1, which correlates with multiple infection-related variables, aligning with its known role in immune function. To investigate the additional value of the clinical data, we performed a complementary analysis and computed the correlation of each clinical variable with all proteins and plotted the distribution of maximum correlation (Fig. 2f). These analyses show both overlapping and unique information across both modalities. The pretraining stage of COMET allows the RNN to extract the most useful information and avoids the inclusion of redundant, highly correlated EHR features, which may contribute to its superior performance compared with the baseline models.
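The cross-modality analysis behind the distribution of maximum correlations can be sketched with plain NumPy. The function names are hypothetical, the inputs are assumed to be dense numeric matrices with no constant columns, and the significance testing and Bonferroni correction used for the heat map are not reproduced here.

```python
import numpy as np

def cross_correlation(ehr, proteins):
    """Pearson correlation between every EHR column and every protein column.

    ehr: array [n_samples, n_ehr_features]; proteins: array [n_samples, n_proteins].
    Assumes no column is constant (otherwise the standard deviation is zero).
    """
    ehr_z = (ehr - ehr.mean(axis=0)) / ehr.std(axis=0)
    prot_z = (proteins - proteins.mean(axis=0)) / proteins.std(axis=0)
    return ehr_z.T @ prot_z / ehr.shape[0]

def max_abs_correlation_per_ehr_feature(ehr, proteins):
    """For each EHR feature, the largest |r| it reaches with any protein."""
    return np.abs(cross_correlation(ehr, proteins)).max(axis=1)
```

A histogram of the values returned by `max_abs_correlation_per_ehr_feature` would correspond to the kind of distribution described for Fig. 2f: EHR features with low maxima carry information that no single protein captures.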

COMET aligned EHR and proteomics data

We examined EHR–proteomics relationships through the EHR latent representation, visualizing correlations between each of the 400 latent dimensions and each protein (Fig. 3a,b). There were 3,201 significant correlations (after multiple hypothesis test correction) between the dimensions of the EHR latent representations learned in the joint baseline models and all proteins. COMET’s EHR latent representations showed 5,364 significant correlations, indicating better alignment of the EHR and proteomics data. The increased alignment suggests that the information COMET learns from the EHR data more closely captures the underlying biological processes of the patient.

Fig. 3: COMET induced alignment between EHR latent representations and proteomics data.

a, t-SNE visualization of the proteomics data and EHR latent representation in the joint baseline models; lines connect statistically significantly correlated proteins and dimensions of the EHR latent representation. The red dots represent three proteins with the greatest number of statistically significant correlations with the dimensions of the EHR latent representation. sICAM-3, soluble intercellular adhesion molecule 1; LRRTM1, leucine-rich repeat transmembrane neuronal protein 1; ANGPT4, angiopoietin-4; CST3, cystatin C; PLXB2, plexin-B2; IL1RL1, interleukin-1 receptor-like 1. b, t-SNE visualization of the proteomics data and EHR latent representation in the COMET models. The lines connect statistically significantly correlated proteins and dimensions of the EHR latent representation. The red dots represent three proteins with the greatest number of statistically significant correlations with the dimensions of the EHR latent representation. c, Comparison of protein feature importance in the COMET models and joint baseline models. d, Distribution of absolute correlations between protein abundance and days to the onset of labour in an external dataset (n = 12 correlations from important proteins in the baseline model and n = 14 correlations from important proteins in the COMET models). The box plots show the median (centre line), 25th and 75th percentiles (box bounds), with whiskers extending to the most extreme data points within 1.5 times the interquartile range from the box edges. The difference between the average absolute correlations is 0.109 (95% CI: [0.0134, 0.2052]) with a two-sided t-statistic of –2.39 (P = 0.0276) with an estimated degree of freedom of 19.1.

This pattern was the strongest in proteins most correlated with EHR latent representation dimensions. Using COMET, interleukin-1 receptor-like 1 (also known as the suppression of tumourigenicity 2 protein), cystatin C and plexin-B2 showed significant correlations with 76%, 68% and 68% of the dimensions, respectively. These proteins are known to play a role in pregnancy progression and labour timing, and are consistent with discoveries from previous studies24,25,26,27,28. The EHR latent representation’s high correlation with these proteins suggests that it was capturing meaningful information about the patients’ underlying biological state, potentially contributing to improved predictive modelling performance. By contrast, the joint baseline models do not exhibit this phenomenon. In the baseline experiments, the proteins most correlated with the EHR latent representation were soluble intercellular adhesion molecule 1, leucine-rich repeat transmembrane neuronal protein 1 and angiopoietin-4. Although angiopoietin-4 does have a known association with pregnancy progression, the other two proteins are primarily known for other biological functions unrelated to pregnancy, suggesting that the EHR latent representation from the baseline models does not reflect the underlying pregnancy biology as strongly. Further discussion of these proteins is provided in Supplementary Note 1.

COMET identified proteins associated with labour onset timing

Last, we computed the feature importance for each protein using integrated gradients to understand how the alignment between the EHR latent representation and the proteins influenced the features ultimately used by the models to make their predictions (Fig. 3c; full feature importance is provided in Supplementary Data). The proteins with greater feature importance in the COMET models are known to be associated with gestational age, foetal development or pregnancy complications, all of which have implications for time to labour. Conversely, the proteins that are more important in the joint baseline models have no known role in pregnancy. We expand on the known biological role of these proteins in Supplementary Note 2 and further validate the relevance of the important proteins in the COMET models by computing the correlation between these proteins and days to the onset of labour in an external dataset (Fig. 3d). The average Pearson correlation magnitude for the proteins more important in COMET was 0.22 (s.d. = 0.13), whereas the average for the proteins less important with COMET was 0.12 (s.d. = 0.09). These analyses suggest that COMET improves predictive modelling not only by learning a more biologically meaningful representation of the EHR data but also by helping the model learn accurate biology.
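For readers unfamiliar with integrated gradients, the following sketch shows the attribution rule on a toy differentiable model. It is illustrative only: the toy model, the all-zeros baseline and the midpoint Riemann sum are assumptions, and the text does not state which baseline or implementation was used.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=200):
    """Approximate integrated gradients with a midpoint Riemann sum.

    grad_fn(z) returns the gradient of the model output with respect to z.
    Attribution_i = (x_i - baseline_i) * average gradient along the straight
    path from the baseline to x.
    """
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean(
        [grad_fn(baseline + a * (x - baseline)) for a in alphas], axis=0
    )
    return (x - baseline) * avg_grad

# Toy model F(z) = tanh(w . z) with an analytic gradient (hypothetical).
w = np.array([0.5, -1.0, 2.0])
model = lambda z: np.tanh(w @ z)
model_grad = lambda z: (1.0 - np.tanh(w @ z) ** 2) * w

attributions = integrated_gradients(model_grad, np.array([1.0, 0.5, 0.25]), np.zeros(3))
```

A useful sanity check is the completeness property: the attributions should sum to approximately the model output at the input minus the output at the baseline.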

COMET improved cancer prognosis prediction

To show the generalizability of our COMET framework, we next applied it to a different prediction problem in an independent population. We used COMET to predict the three-year cancer mortality from a population of cancer patients in the UK Biobank (n = 36,901 patients)11. The studied population consisted of all the patients who received a diagnosis of any type of cancer (determined by the presence of an ICD10 code beginning with C) within 5 years of enrolment in UK Biobank, or up to 12 months prior. A subset of these patients had blood samples collected when they enrolled in the UK Biobank study, which were analysed for the proteomics data29. We included these patients in our omics cohort if they had their samples collected within 12 months following their initial cancer diagnosis (n = 559 patients, the omics cohort). For patients with proteomic data, we used EHR data from the time of sampling and earlier as features; for other patients (n = 36,342 patients, the pretraining cohort), we used EHR data from the time of cancer diagnosis and earlier (Fig. 4a,b).

Fig. 4: Multimodal data provided insights into cancer mortality risk.

a, For patients with proteomics data, we construct input features from all EHR data up to the sampling time (shaded in green); for patients without proteomics data, we use EHR data up until cancer diagnosis (shaded in blue). b, We utilized data from patients with a cancer diagnosis in the UK Biobank (UKBB), and split the population into two groups based on whether or not they had omics data available. c, Predictions from COMET were better than predictions from the highest-performing baseline (n = 559 predictions). The box plots show the median (centre line), 25th and 75th percentiles (box bounds), with whiskers extending to the most extreme data points within 1.5 times the interquartile range from the box edges. The difference in mean for the baseline predictions was 0.089 (95% CI: [0.0469, 0.1277]) with a two-sided Wilcoxon rank sum statistic of 3,431 (P = 8.37 × 10–8). The difference in mean for the COMET predictions was 0.149 (95% CI: [0.0892, 0.3084]) with a two-sided Wilcoxon rank sum statistic of 2,537 (P = 1.54 × 10–10). d, t-SNE visualization of cancer mortality data. The dots represent individual features and are coloured based on modality. They are sized based on univariate correlation with cancer mortality. The clusters with only protein variables are annotated based on GO enrichment analysis and clusters containing both clinical and protein variables are annotated based on clinical themes. e, Heat map showing the number of significant correlations (after Bonferroni correction) between the EHR features and all proteins. f, Distribution of the maximum absolute correlation between each EHR feature and all proteins in the cancer mortality data.

When COMET is used to predict three-year cancer mortality, with the pretraining cohort used to pretrain the EHR part of the model and those weights transferred to a multimodal model that makes predictions on the omics cohort, it demonstrates superior performance compared with all the baselines (area under the receiver operating characteristic curve (AUROC) = 0.842, 95% CI: [0.744, 0.922], P = 0, area under the precision–recall curve (AUPRC) = 0.504, 95% CI: [0.341, 0.670], P = 0; Fig. 4c). The prevalence of three-year mortality is 5.5% in the omics cohort. The baselines have the same design as the onset of labour analyses (see the ‘COMET accurately predicted days to the onset of labour’ section for details). The joint baseline performed the best (AUROC = 0.786, 95% CI: [0.664, 0.882], P = 0, AUPRC = 0.365, 95% CI: [0.217, 0.555], P = 0). The EHR-only (AUROC = 0.749, 95% CI: [0.636, 0.843], P = 0, AUPRC = 0.205, 95% CI: [0.122, 0.349], P = 0) and proteomics-only (AUROC = 0.737, 95% CI: [0.634, 0.838], P = 0, AUPRC = 0.325, 95% CI: [0.179, 0.495], P = 0) baselines also show some signal for predictive modelling. Agreement is measured using Cohen’s kappa and is reported in Supplementary Table 5, demonstrating consistent and reliable classification performance that exceeds all baselines.
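As a reminder of what the AUROC measures, it can be computed directly as the probability that a randomly chosen patient who died receives a higher score than a randomly chosen survivor. The pairwise implementation below is a sketch for small cohorts, not the code used in this study; `auroc` is a hypothetical helper.

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    Requires at least one positive and one negative label.
    """
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)
```

Unlike the AUROC, whose chance level is 0.5, the expected AUPRC of an uninformative classifier is close to the outcome prevalence, which is why the reported AUPRC values should be read against the 5.5% mortality rate.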

As in the onset of labour experiments, we compared the performance of COMET with a logistic regression baseline, including an adaptation that incorporates prior knowledge. This adaptation similarly shows a benefit: the AUPRC improves from the baseline value of 0.263 to 0.279 when priors from the pretrained model are incorporated. Full results are listed in Supplementary Table 6, which show that COMET exceeds all logistic regression baselines, including the adaptation that incorporates priors from pretraining. We also ran the experiments using COMET Transformer, which again show a strong correlation between predictions (r = 0.72), with COMET outperforming COMET Transformer (Supplementary Table 7). Regardless of the model architecture, predictive modelling performance improved when pretraining was included, and the performance of COMET exceeds all other approaches.

Multimodal data uncovered biology of cancer prognosis

We used t-SNE to visualize the correlation matrix among all pairs of variables across modalities to better understand their relationships (Fig. 4d). In contrast to the onset of labour data, there was less overlap between the proteomics data and the EHR data modalities. However, we do see significant correlations between the proteomics data and EHR data modalities when visualizing a correlation network with each modality individually projected into two dimensions (Supplementary Fig. 3).

To gain insights into this phenomenon, we computed the number of significant correlations each protein variable has with all the EHR variables (Fig. 4e). Among all the proteins, mortality factor 4-like protein 2 had the greatest number of correlations with EHR variables, especially drug prescriptions. Mortality factor 4-like protein 2 has been associated with tumour dynamics and treatment response, which may explain its high correlation to drug orders30. We found a large proportion of the proteins in cancer patients (65.9%) had no significant correlation with any of their EHR variables (Supplementary Fig. 4). We computed the correlation of each EHR feature with all proteins and computed the maximum correlation across all proteins for each EHR feature (Fig. 4f) and found many EHR features with low correlations to all proteins in the cancer patients. This finding reiterates the value of including multiple data modalities in our analysis. Examining the strong correlations between EHR features and proteins allowed us to uncover interesting relationships across data modalities. For example, a diagnosis of chronic B cell lymphocytic leukaemia has the highest correlation with lymphocyte-activation gene 3 protein intensity (r = 0.46, 95% CI: [0.333, 0.571], P = 8.4 × 10–31); lymphocyte-activation gene 3 is an immune checkpoint that is expressed on leukaemia cells and has been shown to be an effective prognostic marker (Supplementary Fig. 5)31.

COMET EHR representations reflected known cancer biology

We again visualize the relationship between EHR latent representation and proteomics data (Fig. 5a,b). The dimensions of the EHR latent representation learned in the joint baseline experiments have no significant correlations with any proteins, whereas the dimensions of the EHR latent representation from COMET had 7,591 statistically significant correlations, showing that this alignment effect occurs across datasets. All the proteins with the greatest number of significant correlations with the COMET EHR latent representation have been shown to be prognostic biomarkers for cancer. We elaborate on these proteins in Supplementary Note 3. These findings demonstrate that COMET not only effectively aligns the EHR and protein data but also reveals biologically meaningful correlations that are consistent with known cancer prognostic markers, underscoring the potential of this approach for identifying clinically relevant biomarkers and therapeutic targets across diverse datasets.

Fig. 5: COMET induced alignment between EHR latent representations and proteomics data, and produced models that are more biologically aligned with known cancer biology.

a, t-SNE visualization of the proteomics data and EHR latent representation in the joint baseline models. The lines connect statistically significantly correlated proteins and dimensions of the EHR latent representation. The red dots represent three proteins with the greatest number of statistically significant correlations with the dimensions of the EHR latent representation. b, t-SNE visualization of the proteomics data and EHR latent representation in the COMET models. The lines connect statistically significantly correlated proteins and dimensions of the EHR latent representation. c, Comparison of protein feature importance in the COMET models and joint baseline models. d, Distribution of univariate P values (from a t-test) comparing protein levels based on three-year mortality in an external dataset (n = 18 P values from important proteins in the baseline model and n = 18 P values from important proteins in the COMET models). The green dotted line represents the Bonferroni-adjusted significance threshold.

COMET models validated established cancer prognostic markers

Proteins with higher feature importance in COMET models aligned with known prognostic biomarkers (Fig. 5c; full feature importance is provided in Supplementary Data). We elaborate on these proteins in Supplementary Note 4. We further validated that the proteins more important in the COMET models are more highly associated with mortality than the proteins that are more important in the baseline models (Fig. 5d). We found that 9 out of the 18 matching proteins that were most important in the COMET models are statistically significantly associated with mortality status, whereas only 8 of those most important in the joint baseline models are. Furthermore, the median P value for the COMET proteins was lower. These findings further validate that COMET models better align with known biology.

COMET acted as a form of regularization by initialization

To better understand which part of the network was responsible for the predictive modelling improvements, we looked at the performance of the intermediate nodes in the penultimate layer of the network (Fig. 6a,b). As expected, we saw improvements in the EHR node with COMET, presumably due to the additional EHR data used to pretrain the model. The improvements in the biological representations discussed above also suggest that the proteomics and/or joint nodes may have improved as well. Indeed, we see that effect (from the proteomics node in the onset of labour analysis and from the joint node in the cancer analysis). These findings support the hypothesis that COMET improves the model’s ability to learn not only from the EHR data but also from the omics data. We also show that the weights in the omics and joint parts of the network are a function of these transferred weights (Supplementary Note 5); therefore, such a finding is also supported theoretically.

Fig. 6: COMET acted as a form of regularization, allowing the neural network to access parts of the parameter space that would not be accessible otherwise.

a, Pearson correlation of the values at each intermediate node with days to the onset of labour in the joint baseline model compared with COMET. b, AUROC of the values at each intermediate node for predicting three-year mortality in the joint baseline model compared with COMET. c, Training loss versus test loss for each iteration of the onset of labour experiment at each epoch, comparing the joint baseline with COMET; the mean loss is shown in bold. d, Training loss versus test loss for each iteration of the cancer mortality experiment at each epoch, comparing the joint baseline with COMET; the mean loss is shown in bold. e, Visualization of the parameter space for joint baseline and COMET models to predict days to the onset of labour; each point represents the parameters at one epoch during training. Earlier epochs are shown in lighter colours. Protein parameter space (i), EHR parameter space (ii), joint parameter space (iii) and overall parameter space (iv). f, Visualization of the parameter space for joint baseline and COMET models to predict cancer mortality; each point represents the parameters at one epoch during training. Earlier epochs are shown in lighter colours. Protein parameter space (i), EHR parameter space (ii), joint parameter space (iii) and overall parameter space (iv).

To understand the mechanism of this improvement, we compared the training loss against the test loss between the COMET models and the baseline models (Fig. 6c,d). We observed that the test loss was lower for any given training loss when using COMET, suggesting that COMET improves generalizability, potentially by acting as a form of regularization. We explored how this regularization effect impacted the actual model parameters.

By visualizing the parameter space (Methods), we can see that the COMET models occupy separate parts of the parameter space compared with the baseline models (Fig. 6e,f). This suggests that the regularization effect allows parameters to converge to a part of the parameter space that leads to more generalizable and more biologically accurate models. The paths of each of the 25 iterations of the models through the parameter space throughout training are visualized in Supplementary Figs. 6–13. In conclusion, the improved performance of the COMET models occurs due to the model’s ability to better learn from both EHR and omics data, enabled by the regularization effect that is a result of COMET’s initialization of weights in the RNN from transfer learning.
