Multistage deep learning methods for automating radiographic sharp score prediction in rheumatoid arthritis

This study enhances automated joint assessment in rheumatoid arthritis (RA) through deep learning. Our model effectively manages variability in X-ray images and significantly improves the evaluation of joint damage, particularly for milder cases where early intervention is critical. We demonstrate notable progress in automatic segmentation and joint identification in RA hand X-rays. Using the ResNet50 architecture for image orientation and a combination of U-Net and YOLOv7 for segmentation and joint identification, we achieved an accuracy of 99%, outperforming the 87% accuracy reported by Radke KL et al. using RetinaNet. This level of performance closely aligns with the results of Chaturvedi N, who reported near-perfect joint identification rates using RetinaNet, as shown in Table 2.

Table 2 Comparison of segmentation and identification of regions of interest (ROI) in X-ray images of rheumatoid arthritis between the current and previous studies.

Compared with previous studies, such as one that used 226 hand X-ray images from 40 RA patients and employed DeepLabCut for joint detection and classification, our study used a much larger dataset (970 patients).

As illustrated in Fig. 3c, our model demonstrates flexibility in handling varying numbers of joints, including cases with missing joints due to disease progression. Figure 3c(I) shows successful detection of all joints, while Fig. 3c(II) depicts instances where the model misses some joints due to hand deformation. Figure 3c(III) highlights severe cases where some joints have disappeared entirely.

Computational trade-offs in fine-tuning ViT models

We used a range of fine-tuning techniques, including adjusting the model’s architecture, applying stratified sampling, and using customized data augmentation. Among these, customized data augmentation had the largest impact on the model’s performance.

The embedding dimension and patch size are critical hyperparameters in Vision Transformer (ViT) models and must be chosen according to the available computational resources and the complexity of the task. The embedding dimension determines how the high-dimensional input patch vectors are reduced to a manageable size while retaining meaningful features. Together with the patch size, it directly affects the number of model parameters, the computational complexity, and the model’s ability to capture and process features. Smaller patch sizes and higher embedding dimensions provide better feature representation but increase the parameter count and computational cost.

Because of our computational limitations, we had to simplify the model architecture. Increasing the number of attention heads or the embedding dimension resulted in memory crashes, while reducing these parameters led to underfitting.

We adjusted the ViT’s architecture by setting the number of layers to 3, using two attention heads, and setting the embedding dimension to 8. These adjustments make the model more computationally efficient.
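The trade-off described above can be made concrete with a rough parameter and memory estimate. The following is a minimal sketch, not the implementation used in this study: it assumes a plain ViT encoder in which each head has dimension embed_dim / num_heads, an MLP expansion ratio of 2, no class token, and no prediction head. The image size (224), channel count (1), and patch sizes (16 and 8) are illustrative assumptions that are not reported in the text; only the three layers, two attention heads, and embedding dimension of 8 come from the configuration above.

```python
def vit_param_count(image_size, patch_size, channels, embed_dim, num_layers, mlp_ratio=2):
    """Rough trainable-parameter count for a plain ViT encoder (assumed layout)."""
    num_patches = (image_size // patch_size) ** 2
    patch_dim = patch_size * patch_size * channels           # flattened patch length
    patch_embed = patch_dim * embed_dim + embed_dim          # linear projection + bias
    pos_embed = num_patches * embed_dim                      # learned position embeddings
    attention = 4 * (embed_dim * embed_dim + embed_dim)      # Q, K, V, output projections
    hidden = mlp_ratio * embed_dim
    mlp = (embed_dim * hidden + hidden) + (hidden * embed_dim + embed_dim)
    layer_norms = 2 * (2 * embed_dim)                        # two LayerNorms per block
    total = patch_embed + pos_embed + num_layers * (attention + mlp + layer_norms)
    return {"num_patches": num_patches, "params": total}


def attention_score_elements(num_patches, num_heads, num_layers):
    """Attention-score entries held per image: one N x N map per head per layer."""
    return num_layers * num_heads * num_patches ** 2


# Assumed input geometry (hypothetical, not stated in the paper).
IMAGE_SIZE, CHANNELS = 224, 1

# Reported configuration: 3 layers, 2 heads, embedding dimension 8.
small = vit_param_count(IMAGE_SIZE, 16, CHANNELS, embed_dim=8, num_layers=3)
# Same model with a larger embedding dimension.
wide = vit_param_count(IMAGE_SIZE, 16, CHANNELS, embed_dim=64, num_layers=3)
# Same model with a smaller patch size (more patches per image).
fine = vit_param_count(IMAGE_SIZE, 8, CHANNELS, embed_dim=8, num_layers=3)

for name, cfg in [("small", small), ("wide", wide), ("fine", fine)]:
    scores = attention_score_elements(cfg["num_patches"], num_heads=2, num_layers=3)
    print(name, cfg["params"], "parameters,", scores, "attention-score entries")
```

Under these assumptions, the parameter count grows roughly quadratically with the embedding dimension, while halving the patch size quadruples the number of patches and multiplies the attention-score memory by 16. This is consistent with the memory crashes observed when the model was scaled up.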

In Fig. 5a, the RMSE for the full score range (0–270) dropped from 49.41 to 44.28 after fine-tuning. For the < 50 score range, the RMSE decreased even more, from 14.19 to 9.74. A similar trend is seen with the MAE and the Huber loss, which dropped from 29.31 to 20.71 for scores 0–270 and from 11.37 to 4.96 for scores < 50. These metrics indicate that the model made more accurate predictions after fine-tuning.

Figure 5b shows the improvements in correlation and reliability metrics achieved through fine-tuning. The ICC values improved from 0.06 to 0.31 for scores 0–270 and from 0.19 to 0.70 for scores < 50, indicating improved reliability, particularly in the lower score range. Figure 6 gives a closer look at the model’s prediction accuracy, showing the distribution of prediction errors for the full score range (0–270) and for scores < 50, where the error distribution narrows even more after fine-tuning. This narrowing implies that the model is more precise in predicting mild joint damage.

Figure 7 illustrates the relationship between the predicted and expert-assessed overall Sharp score, using a joint plot to visualize the spread of predictions across the score range. After fine-tuning, predictions lie closer to the line of agreement, especially for scores < 50. This tighter clustering indicates the model’s improved ability to detect patterns in early-stage joint damage.
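For readers who want to reproduce this kind of range-specific evaluation, the sketch below shows how RMSE, MAE, and Huber loss can be computed on the full score range and on the < 50 subset. It is illustrative only: the Huber threshold (delta = 1.0) is an assumption, since the value used in the study is not reported, and the expert and predicted scores are made-up toy values, not study data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def huber(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear beyond delta (assumed 1.0)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err ** 2
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(y_true)


def evaluate(y_true, y_pred, max_score=None):
    """Compute all three metrics, optionally restricted to expert scores < max_score."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if max_score is None or t < max_score]
    if not pairs:
        raise ValueError("no samples fall in the requested score range")
    truth = [t for t, _ in pairs]
    preds = [p for _, p in pairs]
    return {"rmse": rmse(truth, preds), "mae": mae(truth, preds), "huber": huber(truth, preds)}


# Toy values for illustration only (not taken from the study).
expert_scores = [0, 10, 40, 120, 200]
predicted_scores = [2, 13, 36, 100, 230]

full_range = evaluate(expert_scores, predicted_scores)               # scores 0-270
mild_range = evaluate(expert_scores, predicted_scores, max_score=50)  # scores < 50
print("full range:", full_range)
print("scores < 50:", mild_range)
```

Because RMSE squares each error, a few large misses on high scores dominate the full-range value, whereas MAE and Huber loss grow only linearly with large errors. This is one reason the full-range and < 50 results can differ so much.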

We focused on the < 50 score range for several reasons. First, this range covers most of our data set, giving the model more data to learn from. Second, lower scores indicate milder joint damage, and identifying this stage is essential for starting treatment early and preventing further disease progression. Lastly, focusing on this range makes our results more comparable to previous studies, which also concentrated on this score range.

Despite successfully enhancing predictive accuracy for early and moderate joint damage, our approach still encounters challenges when predicting across the score range (0–270). The main issue is the imbalance in the dataset; even with stratified sampling, cases with scores > 50 are limited. This shortage makes it harder for the model to learn patterns associated with higher scores, leading to a broader error distribution. Although fine-tuning reduced the RMSE for scores > 50, the model still struggles to capture the full range of variations in higher scores. Additionally, in severe RA cases, joints can be absent due to disease progression, leaving the model with fewer features to learn from and potentially impacting its prediction accuracy.

Compared to the RA2-DREAM challenge’s winning model (Team Shirin), which was trained on 367 patients (each with four images: left/right hand and foot), our model demonstrates significant advantages in handling data imbalances. Team Shirin’s model achieved an RMSE of 0.44 for overall joint damage prediction and 0.7 for joint-by-joint scores. However, their dataset was highly imbalanced, with over 90% of cases showing no joint damage, which limited the model’s ability to generalize, particularly in cases with moderate to severe joint damage ¹⁴,⁴⁷.

In another study, a Tiny-ViT model was used to predict erosion and joint space narrowing (JSN), achieving mean absolute errors (MAEs) of 0.7 for erosion and 0.36 for JSN ⁴⁹. These findings demonstrate the potential of Vision Transformers (ViTs) to improve Sharp scoring. However, direct comparisons with our model are challenging due to differences in dataset composition. Although the study identified Tiny-ViT as the most effective backbone, it did not report specific details on key hyperparameters, such as bounding box adjustments, pre-training strategies, and augmentation methods.

Furthermore, their dataset included only 330 hand X-rays, which did not cover the full range of Sharp scores or extreme rheumatoid arthritis (RA) cases with joint disappearance or severe deformation. Additionally, the model was not externally validated. External validation is crucial for robust performance, as illustrated by the RA2-DREAM model. When applied to an external dataset of 205 patients through reverse engineering, its root mean square error (RMSE) increased dramatically from 0.44 to 23.65 ⁵⁰. This result underscores the importance of external validation to ensure reliable performance beyond the training dataset.

In contrast, our model was developed using a broader and more diverse dataset totaling 970 patients, with stratified sampling across score bins (scores < 50 and > 50) to ensure a balanced representation of varying damage levels. Including patients with missing joints enhanced our model’s ability to generalize across all RA stages, making it more adaptable to real-world clinical applications. After fine-tuning, our ViT model achieved an RMSE of 9.74, MAE of 5.36, and Huber loss of 4.36 on external data set 291 for OSS < 50. These results demonstrate our model’s ability to reliably predict early-stage joint damage, where accurate scoring is essential for effective RA management and intervention.
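Stratified sampling across the two score bins can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the 20% hold-out fraction, the random seed, and the choice to place a score of exactly 50 in the higher bin are all assumptions, and the score list is synthetic.

```python
import random
from collections import defaultdict


def score_bin(score, threshold=50):
    """Assign a score to a bin; a score of exactly 50 goes to 'high' (assumption)."""
    return "low" if score < threshold else "high"


def stratified_split(scores, test_fraction=0.2, threshold=50, seed=0):
    """Split sample indices so each score bin keeps the same train/test proportion."""
    rng = random.Random(seed)
    by_bin = defaultdict(list)
    for idx, score in enumerate(scores):
        by_bin[score_bin(score, threshold)].append(idx)

    train, test = [], []
    for indices in by_bin.values():
        rng.shuffle(indices)
        # Hold out at least one sample per bin when the bin has more than one sample.
        n_test = max(1, round(len(indices) * test_fraction)) if len(indices) > 1 else 0
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Synthetic, imbalanced scores for illustration: 80 mild cases and 20 severe cases.
scores = [10] * 80 + [120] * 20
train_idx, test_idx = stratified_split(scores, test_fraction=0.2)

high_in_test = sum(1 for i in test_idx if score_bin(scores[i]) == "high")
print(len(train_idx), "train,", len(test_idx), "test,", high_in_test, "high-score test cases")
```

A purely random split of such imbalanced data could leave very few high-score cases in the held-out set. Splitting within each bin keeps the proportion of scores above the threshold the same in both partitions, although it cannot compensate for the small absolute number of severe cases.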

Limitations and Challenges

Our analysis was performed using Google Colab Pro+, which presented certain limitations. Our custom Vision Transformer (ViT) model, developed in TensorFlow, performed better on the TPU v2 (Tensor Processing Unit) than on the NVIDIA V100 GPU. The TPU v2 comprises 8 cores, each with 8 GB of High Bandwidth Memory (HBM), for a total of 64 GB. With a peak memory bandwidth of approximately 600 GB/s per core, the TPU v2 is particularly efficient for TensorFlow/Keras models, especially those involving large batch sizes, extensive data sets, or distributed training.

However, the cap of 500 compute units on Google Colab Pro+ posed a significant challenge for processing our large dataset and complex model. To overcome this limitation, we had to purchase additional compute units multiple times, as the model frequently exceeded the available resources during training. Under these constraints, the training process took approximately 7 hours.

Although our approach successfully enhanced predictive accuracy, particularly for early and moderate joint damage, it still faces challenges when predicting scores in the > 50 range. Future work will explore advanced techniques, such as attention mechanisms and the integration of radiomic features ⁵¹, to more effectively capture the complexities of severe joint damage.

Another limitation is that our current framework is based solely on hand radiographs and focuses on predicting the Overall Sharp Score (OSS) by identifying joints and assessing overall joint damage. However, our data set lacks detailed information on joint narrowing and erosion, as it only includes the OSS without a joint-by-joint assessment. In future research, we plan to work closely with expert radiologists to develop a more comprehensive data set with detailed joint-level annotations.

Despite these limitations, the models have demonstrated substantial predictive accuracy for OSS using the available data. In future research, we aim to collaborate with expert radiologists to obtain joint-specific data, which will help improve the model’s ability to predict localized changes in individual joints.

Future Outlook

Our study presents significant advancements in automated radiographic scoring for rheumatoid arthritis (RA). Although this work primarily focuses on hand radiographs, future work will extend the approach to include foot joints.

We also aim to enhance the model by incorporating joint-specific narrowing and erosion data to provide a more detailed analysis of disease progression. We will work closely with expert radiologists to build a comprehensive dataset with detailed joint-level annotations.

Additionally, we encourage further research to continue improving the capabilities of the automated scoring system. Developing a publicly accessible, multicenter RA database that includes molecular markers, blood tests, and radiographic imaging would foster global collaboration and allow for deeper, more comprehensive analyses. Integrating these data with clinical assessments could significantly improve the accuracy of automated scoring systems and advance the use of precision medicine in managing rheumatoid arthritis.
