Accurate RNA 3D structure prediction using a language model-based deep learning approach


Automated end-to-end platform for RNA 3D structure prediction

The development of RhoFold+ was guided by RNA-specific knowledge and the limitations of existing RNA 3D structure data. To build our training dataset, we curated all available RNA 3D structures from the PDB, using the BGSU representative sets of RNA structures (version 2022-04-13)²⁴. We focused on single-chain RNAs and reduced redundancy by clustering sequences with Cd-hit²⁵ at an 80% sequence similarity threshold, resulting in 782 unique sequence clusters from 5,583 RNA chains. These RNA sequences were then processed through our pipeline, RhoFold+. First, the sequences were transformed using RNA-FM, our large RNA language model, to extract evolutionarily and structurally informed embeddings. Concurrently, MSAs were generated by searching through extensive sequence databases. The embeddings and MSA features were then fed into our transformer network, Rhoformer, and iteratively refined for ten cycles. Following this, our structure module employed a geometry-aware attention mechanism and an invariant point attention (IPA) module to optimize local frame coordinates and torsion angles for key atoms in the RNA backbone. Structural constraints, such as secondary structure and base pairing, were applied after reconstructing the full-atom coordinates (Fig. 1a and detailed discussion in Supplementary information). After developing RhoFold+, we rigorously benchmarked and evaluated its performance across a broad range of tests (Fig. 1b).

**Fig. 1: The architecture of RhoFold+ and the tasks used for performance evaluation.**

Benchmarking RhoFold+ on RNA-Puzzles

We performed a comprehensive retrospective comparison between RhoFold+ and other existing computational methods on two previously held community-wide challenges: RNA-Puzzles and CASP15. We first used the results from the RNA-Puzzles^{26,27,28,29,30} competition, where the submissions were produced and optimized using human knowledge or computational methods. Importantly, here RhoFold+ was trained using nonoverlapping training data with respect to the RNA-Puzzles targets tested (Methods). We conducted preprocessing to obtain 24 single-chain RNA targets and excluded RNA complexes. This set of RNA targets contained two puzzles (PZs), PZ34 and PZ38, that were introduced after our development of RhoFold+ (Fig. 2a and Supplementary Fig. 3) and thus served as a blind test. After collecting the predictions of other methods from the official server, we found that the performance of RhoFold+ surpassed that of all other methods, including FARFAR2/ARES, on nearly all targets, except for PZ24. Notably, RhoFold+ outperformed the second-best method on more than half of the targets by ~4 Å r.m.s.d. On 17 targets, RhoFold+ achieved r.m.s.d. values of <5 Å, and only one target exhibited an r.m.s.d. of >10 Å (Fig. 2a and Supplementary Table 5). As a whole, RhoFold+ produced an average r.m.s.d. of 4.02 Å, 2.30 Å better than that of the second-best model (FARFAR2: top 1%, 6.32 Å). Assessed using the template modeling (TM) score³¹, RhoFold+ achieved an average of 0.57 (Supplementary Table 5), higher than the scores of other top performers (0.41 and 0.44).
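
For readers less familiar with the r.m.s.d. values quoted above, the following is a minimal sketch of the metric itself. It assumes the predicted and experimental coordinates have already been optimally superposed and that atoms are paired one to one; the function name and toy coordinates are illustrative only and are not taken from the RhoFold+ code.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized lists of (x, y, z) points.

    Assumption: both structures are already superposed and atom i in
    coords_a corresponds to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Toy example (hypothetical coordinates, in Å): only the third atom is displaced.
native = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 3.0, 0.0)]
print(rmsd(native, model))
```

Because the squared deviations are averaged over all atoms, a single badly placed region can dominate the value, which is one reason the text also reports TM scores.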

**Fig. 2: Benchmarking RhoFold+ on previously held community-wide challenges.**

To show that the promising results on RNA-Puzzles did not arise from overfitting, we studied whether the sequence similarity between the test set and our training data was substantially positively correlated with the performance of RhoFold+, as measured by the TM score and the local distance difference test (LDDT), a superposition-free score that evaluates local distance differences for all atoms in a model^32,33. Such a correlation was previously found in protein structure prediction¹³, yet here we found R² values of only 0.23 for the TM score and 0.11 for the LDDT (Fig. 2b,c), meaning that sequence similarity explained little of the variance in performance and indicating no significant correlation between model performance and the similarity of our training and testing sets. These results suggest that RhoFold+ can generalize in predicting accurate RNA structures. A case study of a representative RNA-Puzzles target, PZ7 (a 186-nucleotide-long Varkud satellite ribozyme RNA), exemplifies this finding. Here, the structure of the most similar RNA in the training set differed substantially from the structure of PZ7 (Fig. 2b): the r.m.s.d. between these structures was 34.48 Å. As another example, PZ38 exhibited the highest sequence similarity of 53% with respect to all RNAs in our training set, and the r.m.s.d. between the structure of the most sequence-similar RNA and PZ38 was 16.46 Å (Fig. 2b). This was larger than the r.m.s.d. of 8.92 Å between PZ38 and the RhoFold+ prediction.
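
As a rough illustration of the correlation analysis described above, the sketch below computes R² for a simple least-squares line relating sequence similarity to a performance score. It assumes an ordinary single-predictor linear fit, for which R² equals the squared Pearson correlation; the paper's exact fitting procedure is not given in this section, and the data points are made up.

```python
def r_squared(x, y):
    """Coefficient of determination for a least-squares line y = a + b*x.

    For a single predictor this equals the squared Pearson correlation.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    return (sxy * sxy) / (sxx * syy)

# Hypothetical values: sequence similarity to the training set vs. TM score.
similarity = [0.20, 0.35, 0.41, 0.53, 0.60]
tm_score = [0.55, 0.48, 0.66, 0.52, 0.61]
print(r_squared(similarity, tm_score))
```

A low R² means that similarity to the training set explains only a small share of the variation in the score, which is the pattern the text reports for both the TM score and the LDDT.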

To test the ability of RhoFold+ to generalize for structure-dissimilar (in addition to mainly sequence-dissimilar) targets, we sought to determine whether the predictions of RhoFold+ could surpass the best single template (the most structurally similar model) in the training set for a given query. To investigate this, we compared the TM scores between our predictions and experimentally determined structures against the TM scores between the best single templates and experimentally determined structures across all RNA-Puzzles. For the majority of puzzles, RhoFold+ produced predictions with a higher global similarity and an average TM score of 0.574, surpassing the best single template by 0.05 (Fig. 2e and Supplementary Table 13). It is important to highlight that for proteins, surpassing the best single template required substantial progress. Indeed, it was only during CASP14 that computational methods outperformed the best single template. Although RhoFold+ generated considerably more accurate predictions than other methods under the conventional sequence similarity data splitting paradigm, we further tested the adaptability of RhoFold+ by eliminating 3D structures from the training set whose TM score, with respect to any target, surpassed a specified threshold (Supplementary Fig. 6 and Supplementary Tables 6 and 10). Even under this more demanding condition, RhoFold+ continued to exhibit a promising performance (Supplementary Table 10).

In applying computational models to large-scale, real-world settings, speed is often a top priority. In addition to generating largely accurate folding results, we found that RhoFold+ is fast, with typical RNA-Puzzles predictions completed within ~0.14 s (Fig. 2d). In contrast, other approaches, including SimRNA¹², FARFAR2¹⁰ and RNAComposer³⁴, exhibited significantly longer running times, probably due to the large-scale sampling processes employed by these methods (Fig. 2d).

Benchmarking RhoFold+ on CASP15 targets

As RNA-Puzzles was first released over a decade ago²⁶, we next used RhoFold+ to predict RNA targets from the more recent CASP15 (refs. ^35,36). We focused on CASP15’s six natural RNA targets (Fig. 2h and Supplementary Fig. 4). Artificially designed targets, which fell outside the expected domain of application for RhoFold+, were not included: in particular, the excluded targets were characterized by their lack of homology and divergence from our training set or their being RNA–protein complexes. We followed the CASP15 guidelines, which specified that participating teams were permitted to submit up to five models. Utilizing different, randomly sampled MSAs (Methods), we modeled five candidate structures for each target using RhoFold+ and considered only the highest-performing prediction (Supplementary Table 6).

Several top-ranking CASP15 groups and recent published works on RNA 3D structure prediction^{17,18,20,21,23} were included in our benchmarking. In particular, CASP15 groups were divided into two categories, ‘server’ and ‘expert’, depending on whether or not human expert knowledge and fine tuning were used. Regardless of the category, many CASP15 groups employed computational pipelines that were based on comparative or statistical learning for natural targets, thus allowing us to assess the learning capability of RhoFold+. Our preliminary model, AIchemy_RNA (RhoFold), was a participant in the ‘expert’ category. Building on RhoFold, RhoFold+ represents a fully automated and end-to-end pipeline that is more similar to participants in the ‘server’ category. Here, we found that RhoFold+ outperformed RhoFold on CASP15’s natural RNA targets by an average r.m.s.d. of ~1 Å. Furthermore, RhoFold+ outperformed other methods whose predictions were available for all six natural RNA targets, including the first-ranked AIchemy_RNA2, the second-ranked Chen method and other computational methods, including DRfold²³, DeepFoldRNA¹⁷, AlphaFold3²¹ and trRosettaRNA¹⁸ (Fig. 2h,i). Although RhoFold+ outperformed AIchemy_RNA2 only marginally, by 0.06 Å (average r.m.s.d.; Fig. 2i), AIchemy_RNA2 required expert knowledge, whereas RhoFold+ is fully automated. Additionally, RhoFold+ demonstrated accuracy comparable to each top-performing method on almost every natural RNA target, with the exception of R1156 (Fig. 2h).

Following CASP15’s assessment approach³⁶, we also computed Z-scores for the predictions from all participating groups. CASP15 prioritized the TM score and the global distance test-total score (GDT-TS), which evaluates both overall structure similarity and local alignment, leading us to assess these models based on the cumulative Z-scores of these metrics (Fig. 2h). On the six natural RNA targets and among the subset of all CASP15 participants ranked on these specific targets, RhoFold (AIchemy_RNA) was fourth, while the performance of RhoFold+ was on par with that of AIchemy_RNA2 (with a difference of 0.4 in the Z-score) and surpassed that of other methods. In a detailed analysis of performance on specific targets, we found that, for target R1108, RhoFold+ achieved the best Z-score and r.m.s.d. Interestingly, RhoFold+ also attained the best Z-score for R1116, although the r.m.s.d. was ~1 Å higher than that of UltraFold (other methods produced predictions with significantly lower accuracy, all with r.m.s.d. >10 Å). Upon further investigation, we found that, while UltraFold outperformed RhoFold+ on this metric by producing accurate local predictions, the predicted global structure was less accurate, as evidenced by a TM score of 0.497 and a GDT-TS score of <0.4. In contrast, RhoFold+ inaccurately predicted a helix angle, resulting in an r.m.s.d. of 8.92 Å, but its correctly predicted topology resulted in a higher TM score of >0.55. For this target, AIchemy_RNA2 incorrectly predicted the stem stackings and RNA topology, resulting in a high r.m.s.d. of 17.26 Å and a TM score of ~0.49. Notably, the RhoFold+ prediction for R1116 did not arise from overfitting, as indicated by the low maximum structural similarity (TM score) and maximum sequence similarity of R1116 with respect to the training set (Fig. 2k and Supplementary Table 6).
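
The cumulative Z-score ranking described above can be sketched as follows. This is a simplified illustration, assuming a plain Z-score across groups for each target and metric and an unweighted sum over the TM score and GDT-TS; the official CASP15 protocol may include further steps (such as outlier handling or weighting) that are not reproduced here. Group names and scores are hypothetical.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize a list of scores; returns zeros if all values are identical."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]

def cumulative_z(scores_by_target, metrics=("tm", "gdt_ts")):
    """Sum per-target Z-scores over all targets and metrics for each group.

    scores_by_target: {target: {group: {metric: value}}}
    """
    totals = {}
    for by_group in scores_by_target.values():
        groups = sorted(by_group)
        for metric in metrics:
            zs = z_scores([by_group[g][metric] for g in groups])
            for g, z in zip(groups, zs):
                totals[g] = totals.get(g, 0.0) + z
    return totals

# Hypothetical scores for three groups on one target.
toy_scores = {
    "T1": {
        "A": {"tm": 0.6, "gdt_ts": 0.5},
        "B": {"tm": 0.4, "gdt_ts": 0.3},
        "C": {"tm": 0.2, "gdt_ts": 0.1},
    }
}
totals = cumulative_z(toy_scores)
for group in sorted(totals, key=totals.get, reverse=True):
    print(group, round(totals[group], 3))
```

Standardizing per target before summing keeps an easy target, where every group scores well, from dominating the ranking.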

We also looked into targets where RhoFold+ may achieve reduced performance and found that higher MSA quality correlated with better performance. While RhoFold+ accurately predicted local structural topologies, it struggled with aligning helices, particularly at junctions. This discrepancy may be due to the dynamic and flexible nature of RNA junctions, which often adopt multiple conformations^37,38,39, making them challenging for fully automated models to represent accurately (Fig. 2k,l and detailed discussion in Supplementary information).

Factors influencing prediction accuracy

Building on the findings above, we performed a more comprehensive study involving all CASP15 natural RNAs and RNA-Puzzles targets. We observed that the prediction accuracy of RhoFold+ is sensitive toward the query’s MSA profile similarity (Supplementary information) against the training set (Fig. 2g) and the complexity of RNA structures (query length; Fig. 2j). Additionally, predicted LDDT (pLDDT) scores were found to correlate with the confidence of RhoFold+, providing a useful metric for identifying regions with lower prediction accuracy, especially in more complex or less homologous queries (Fig. 2f and detailed discussion and analysis in Supplementary information).

Benchmarking RhoFold+ on all determined RNA 3D structures

After benchmarking RhoFold+ with RNA-Puzzles and CASP15, we next evaluated RhoFold+ in greater detail using all experimentally determined RNA structures, as defined by the BGSU representative sets of RNA structures (preprocessed to remove redundancy). To further study the performance of RhoFold+, we performed tenfold cross-validation by iteratively masking 80 sequence clusters for validation and leaving 702 sequence clusters for training. We found that the performance of RhoFold+ across all RNA structures was robust regardless of the train–test data split and fairly consistent across all folds (Fig. 3a–c). Slight variations in TM score might be caused by challenging targets, such as pseudoknot cases in Fold2 and Fold7 similar to PZ24 (Fig. 3c,e), and we expect that the predictions of RhoFold+ on such targets could be improved if secondary structure constraints were provided. Also, during our cross-validation test, the accurate predictions of RhoFold+ were not due to merely mimicking the most sequence-similar training data (Fig. 3b,d,e). A plot of the r.m.s.d. against the sequence length shows that r.m.s.d. values were largely distributed below 10 Å, independent of the sequence length (Fig. 3a). Outliers with r.m.s.d. >20 Å were more likely to occur for sequences longer than 200 nt, where we expect further improvement from additional tuning on long RNAs (detailed discussion in Supplementary information).
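
The cluster-level cross-validation described above can be illustrated with a short sketch. The key point is that whole sequence clusters, not individual chains, are held out, so that similar sequences never appear on both sides of a split. The round-robin assignment below is an assumption made for illustration: with 782 clusters it yields folds of 78 or 79 clusters, so it does not reproduce the exact 80/702 split reported in the text.

```python
import random

def cluster_kfold(cluster_ids, n_folds=10, seed=0):
    """Yield (train_clusters, validation_clusters) pairs, one per fold.

    Clusters are shuffled once and dealt round-robin into n_folds groups;
    each group is held out in turn.
    """
    ids = list(cluster_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for held_out in folds:
        held = set(held_out)
        train = [c for c in ids if c not in held]
        yield train, held_out

# Hypothetical cluster identifiers standing in for the 782 sequence clusters.
clusters = [f"cluster_{i}" for i in range(782)]
for fold_index, (train, val) in enumerate(cluster_kfold(clusters)):
    print(fold_index, len(train), len(val))
```

All chains belonging to a held-out cluster would then be excluded from training for that fold.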

**Fig. 3: Benchmarking RhoFold+ on all experimentally determined RNA structures supports the accuracy and ability of RhoFold+ to generalize to unseen structures.**

As a further evaluation of the capabilities of RhoFold+, we considered the model’s performance on newly determined single-stranded RNA structures released after the compilation of our training dataset. This approach acted as an additional blind test, similar to the CASP15 competition. We included comparisons against FARFAR2 and recent deep learning methods^17,18,21,23, all of which have inference code and/or servers available and some of which also participated in CASP15 (Methods). RhoFold+ outperformed all benchmarked models, achieving the highest average accuracy as measured by r.m.s.d. RhoFold+ produced an average r.m.s.d. of 7.74 Å, which was approximately 0.8 Å and 10.5 Å better than the second-ranked DeepFoldRNA and the lowest-ranked FARFAR2, respectively. Notably, on average, RhoFold+ also outperformed AlphaFold3 and RoseTTAFold2NA by approximately 2.2 Å and 1.8 Å, respectively (Fig. 3f and detailed discussion in Supplementary information). These results were consistent with the performance observed in our previous benchmark on CASP15, suggesting that RhoFold+ accurately generalizes to newly determined structures not seen in our training set. Furthermore, these results support that AlphaFold3 and RoseTTAFold2NA, which are designed to predict biomolecular complexes, do not perform as well as RhoFold+ when applied to single RNA molecules. Further examining sequence and structural similarities to our training set reveals that RhoFold+ maintained strong performance even with sequence similarities below 0.5 (Fig. 3g), and the TM score was greatly influenced by MSA profile similarity while local accuracy (LDDT) remained high and robust (Fig. 3h). Additionally, RhoFold+ demonstrated strong generalizability, accurately folding structures such as 7QR3 despite its low similarity to the closest training template, 7DLZ (TM score of 0.40, r.m.s.d. of 16.45 Å; Fig. 3e).

RhoFold+ generalizes to unseen RNA types and families

Having demonstrated that RhoFold+ can generalize to predicting RNA structures with divergent sequence similarities, structural similarities and dates of release, we next investigated the ability of RhoFold+ to handle different RNA types and families defined by expert knowledge. In particular, RNA types and families—such as those curated in Rfam⁴⁰—are often classified manually based on factors including function, structure and co-evolutionary information. Addressing the challenge of generalizing to different RNA types and families may be considerably more demanding for deep learning methods such as RhoFold+ as such a task requires larger domain shifts.

We benchmarked the cross-type performance of RhoFold+ by training the model on a subset of all RNA types while testing on the others. RhoFold+ showed robustness across RNA types. Though struggling with introns and riboswitches, it performed well on transfer RNA (tRNA) and microRNA (miRNA) types, achieving TM scores up to 0.73 (Fig. 3i). When compared with FARFAR2, RhoFold+ outperformed it across all RNA types, particularly in tRNAs and ribosomal RNAs (rRNAs), with smaller margins for riboswitches (detailed discussion in Supplementary information). For cross-family tests, RhoFold+ achieved an average r.m.s.d. of 6.69 Å (Fig. 3j), but struggled with complex families such as group I introns (RF00028). This difficulty is consistent with the challenges observed in cross-type tests for complex RNA types such as introns and CRISPR RNA elements (RF01344). These elements interact with various proteins and enzymes, and focusing solely on RNA structure without considering these interactions may limit the prediction accuracy (detailed discussion in Supplementary information). Overall, these tests demonstrate the ability of RhoFold+ to generalize across unseen RNA types and families, though challenges remain for complex structures and datasets with limited available data.

RhoFold+ predicts secondary structures and substructures

RhoFold+ can accurately predict RNA 3D structures, but the limited number of experimentally determined RNA structures and types makes it difficult to understand the space of all possible RNA folds. This is particularly true for complicated and large RNA types, including internal ribosomal entry sites, introns, synthetic RNAs and long noncoding RNAs. RNA secondary structures, however, can be more easily determined in experiments and accurate secondary structure predictions can supplement the predictions of 3D structures, offering valuable insights into RNA folding and function. Therefore, we adapted RhoFold+ to predict secondary structures as well. As RhoFold+ was designed to predict RNA 3D structures, we incorporated a postprocessing module that utilizes the features retrieved from RhoFold+’s Rhoformer to predict secondary structures (since Rhoformer’s features show attention maps highly aligned with the contact maps; Supplementary Fig. 8 and Supplementary Table 14). This module takes into account the same structural information as the module performing 3D reconstruction but operates under distinct geometric and biological constraints imposed to predict secondary structure.

We benchmarked the performance of RhoFold+ on newly determined PDB structures (the ‘new PDB set’) and the ArchiveII dataset⁴¹, which includes secondary structure information for diverse RNAs. On the new PDB set, RhoFold+ outperformed UFold⁴¹ by 0.035 in the average F1 score (Fig. 4a), even when UFold was trained on all available data (PDB and bpRNA-1M, a database with over 100,000 annotated RNA secondary structures). On the ArchiveII dataset comprising 2,975 RNA samples, RhoFold+ also outperformed other secondary structure prediction methods (Fig. 4b), particularly on larger RNA types (Fig. 4c). For instance, it achieved an F1 score of 0.60 on structured domains in the dengue virus transcriptome (Supplementary Table 19), aligning with results from mutational profiling (RING-MaP)^42,43. Similarly, the strong performance of RhoFold+ did not stem from mimicking training data, as it maintained an F1 score of ~0.7 even when sequence similarity dropped below 50% (Fig. 4e), and achieved a perfect F1 score of 1.0 on the CASP15 target R1117 (Fig. 4f). These results suggest that RhoFold+ not only excels in predicting 3D structures, but also generates rich, meaningful representations that enable state-of-the-art secondary structure prediction.
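
To make the F1 scores above concrete, here is a minimal sketch of an F1 computation over predicted and reference base pairs. It assumes exact matching of index pairs; some benchmarks tolerate a one-position shift, and the exact protocol used in the paper is not stated in this section. The example pairs are invented.

```python
def base_pair_f1(predicted, reference):
    """F1 score between two collections of base pairs given as (i, j) index tuples.

    Pairs are treated as unordered, so (3, 8) and (8, 3) count as the same pair.
    """
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in reference}
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical hairpin: two of three reference pairs are recovered.
reference_pairs = [(1, 10), (2, 9), (3, 8)]
predicted_pairs = [(10, 1), (2, 9), (4, 7)]
print(base_pair_f1(predicted_pairs, reference_pairs))
```

An F1 score of 1.0, as reported for R1117, corresponds to the predicted and reference pair sets being identical under the chosen matching rule.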

**Fig. 4: RhoFold+ accurately predicts secondary structures and IHAs from experimental data.**

We further evaluated substructures within RNA secondary structures, finding that RhoFold+ consistently outperformed SPOT-RNA⁴⁴ and UFold⁴¹ across all substructures, with the most significant improvements in multiloops and external loops, while internal loops and pseudoknots showed similar performance across methods (Fig. 4d). These results underscore the potential capability of RhoFold+ in predicting RNA secondary structures and enhancing our understanding of RNA function.

Correcting artifacts and IHA prediction

As RhoFold+ accurately predicts RNA structures at both the secondary and tertiary levels, we asked whether we could leverage RhoFold+ for experimental efforts. Toward this, we investigated two use cases of RhoFold+: (1) for correcting experimental structural artifacts and (2) for guiding RNA construct engineering.

X-ray crystallography is widely used to resolve RNA 3D structures, but it can introduce artifacts such as domain-swapped dimers⁴⁵, potentially misleading machine learning models that do not generalize well. In one case, the RhoFold+ prediction for 3SUH initially yielded a high r.m.s.d. of 10.11 Å compared with the PDB structure. However, further analysis revealed that the crystal structure involved a domain-swapped dimer. When comparing the RhoFold+ prediction with the inferred monomeric structure, the r.m.s.d. improved to 5.71 Å, indicating RhoFold+ accurately predicted the biologically relevant structure (Fig. 4g). Similar findings were also observed for the ZTP riboswitch⁴⁶ (Supplementary Fig. 9), suggesting that RhoFold+ can effectively correct for such experimental artifacts.

When comparing experimental data with RNA 3D models, additional geometric metrics, such as interhelical angles (IHAs), can provide insights beyond standard global alignment measures such as r.m.s.d., LDDT and TM score. IHAs, which can be estimated using experimental methods, are useful for validating predicted models and guiding RNA nanostructure design. We introduced the IHA difference (IHAD) as a metric to benchmark the predictions of RhoFold+ (Fig. 4h and Supplementary information), finding that IHAD can reveal discrepancies in stem orientations that are not captured by r.m.s.d. alone (Fig. 4i). Our analysis shows that RhoFold+ generally predicted stem directions accurately (Fig. 4j,k), though performance decreased for IHAs near 0° or 180°, probably due to underfitting of parallel stems in large and complex structures (Fig. 4k and detailed discussion in Supplementary information). We further demonstrated the practical application of IHAs by predicting values for RNA constructs such as the FMN riboswitch and the P4–P6 domain from the Tetrahymena group I intron (Supplementary Fig. 9).
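
As a sketch of the geometry behind IHAs, the code below computes the angle between two helix axis vectors and compares a predicted angle with a reference angle. Two assumptions are made here: the helix axes are supplied as direction vectors (how they are fitted is not covered), and the IHA difference is taken as a simple absolute difference in degrees. The paper's actual IHAD definition is given in its Supplementary information and may differ.

```python
import math

def interhelical_angle(axis_a, axis_b):
    """Angle in degrees (0 to 180) between two helix axis direction vectors."""
    dot = sum(a * b for a, b in zip(axis_a, axis_b))
    norm_a = math.sqrt(sum(a * a for a in axis_a))
    norm_b = math.sqrt(sum(b * b for b in axis_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("axis vectors must be non-zero")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_theta))

def iha_difference(predicted_axes, reference_axes):
    """Absolute difference between predicted and reference interhelical angles.

    Assumption: a plain absolute difference; not necessarily the paper's IHAD.
    """
    return abs(interhelical_angle(*predicted_axes) - interhelical_angle(*reference_axes))

# Hypothetical axes: the reference helices are perpendicular, the predicted pair is at 45°.
reference_axes = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
predicted_axes = ((1.0, 0.0, 0.0), (1.0, 1.0, 0.0))
print(iha_difference(predicted_axes, reference_axes))
```

Two models can have similar r.m.s.d. values yet differ in such an angle, which is the kind of discrepancy the text says the IHAD is meant to expose.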

Ablation studies and generation of multiple predictions

Given the high accuracy and speed of RhoFold+, we finally conducted ablation studies to understand which components and information are important to the RhoFold+ predictions. The architectural components we investigated included four different modules (Fig. 5a and Methods). Ablation studies were performed on 138 PDB targets (collected between April 2022 and December 2023) with sequence similarities below 80% to our training set and lengths ranging from 16 to 300 nt (the ‘Ablation set’). By removing each RhoFold+ component, we observed that all contributed to improving the performance, with the MSA module being the most critical, followed by the RNA-FM language model (Fig. 5a). The RNA-modified version of AlphaFold2, without the MSA module, performed worse than RhoFold+ (Fig. 5a). Notably, removing RNA-FM led to a sharper performance decline for dissimilar sequences (Fig. 5b), and the RNA-FM module seemed to compensate for the loss of the MSA module, maintaining higher TM scores (Fig. 5c). Additionally, removing the recycling module most significantly affected predictions for longer sequences, probably due to its role in effectively deepening the model (Supplementary Fig. 7 and detailed discussion in Supplementary information).

**Fig. 5: Ablation studies of RhoFold+ and sampling of multiple models.**

These findings are consistent with our results for CASP15’s natural RNA targets and RNA-Puzzles, where MSA quality significantly impacts predictions. We also explored how the number of sequences in the extracted MSA influences accuracy. While RhoFold+ is limited to 256 MSA sequences due to training constraints, this limit did not compromise its effectiveness. A key enhancement in RhoFold+ is its ability to generate multiple predictions by sampling or clustering from a fixed number of MSA sequences, allowing for broader prediction selection and improved outcomes. Performance on RNA-Puzzles declined as the number of MSA sequences was reduced, with a marked improvement when the number exceeded 100 (Fig. 5d), indicating that a larger MSA pool enhances model optimization (detailed discussion in Supplementary information). With this expanded MSA sampling, the lowest r.m.s.d. of the RhoFold+ Top5 predictions significantly decreased compared with RhoFold, correlating positively with increased MSA depth and yielding an up to 10 Å improvement (Fig. 5e). This improvement was more pronounced when the MSA profile similarity between the query and training sequences was low, with smaller gains when similarity was already strong (Fig. 5f). Overall, additional MSA sampling is crucial for high performance, as demonstrated for CASP15 target R1116 and PDB 7VPX_L (Fig. 5g,h).
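
The idea of generating several candidate inputs by subsampling an MSA can be sketched as follows. This is an illustrative sketch only: it assumes the query is the first row of the alignment, uses uniform random sampling (the text also mentions clustering, which is not shown), and takes the cap of 256 sequences from the text. Function names and the toy alignment are hypothetical and not part of the RhoFold+ code.

```python
import random

def sample_msa(msa, max_seqs=256, seed=None):
    """Return the query (first row) plus a random subset of the other rows.

    The result never exceeds max_seqs sequences in total.
    """
    if not msa:
        raise ValueError("MSA must contain at least the query sequence")
    if max_seqs < 1:
        raise ValueError("max_seqs must be at least 1")
    query, rest = msa[0], msa[1:]
    k = min(len(rest), max_seqs - 1)
    rng = random.Random(seed)
    return [query] + rng.sample(rest, k)

def sample_msa_variants(msa, n_models=5, max_seqs=256, seed=0):
    """Draw n_models independent subsamples, one per candidate prediction."""
    return [sample_msa(msa, max_seqs, seed + i) for i in range(n_models)]

# Toy alignment with placeholder strings standing in for aligned sequences.
toy_msa = ["QUERY"] + [f"SEQ{i}" for i in range(999)]
variants = sample_msa_variants(toy_msa)
print([len(v) for v in variants])
```

Each subsample would be passed to the model separately, giving several candidate structures from which the best can be chosen, in line with the five-model submissions described for CASP15.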
