Osteoporosis is a geriatric disease characterized by decreased bone density and is commonly treated with bisphosphonates. This class of drugs inhibits bone resorption and increases bone mineral density, effectively reducing overall fracture risk. However, long-term bisphosphonate therapy has been associated with atypical femoral fractures: insufficiency fractures that can occur with minimal or no trauma. This risk necessitates careful assessment when considering extended bisphosphonate treatment, and accurately distinguishing atypical femoral fractures from normal femoral fractures, which usually stem from high-energy trauma, is therefore of clinical importance.
This thesis explores deep learning models, specifically Transformer-based models, to aid in distinguishing atypical femoral fractures from normal femoral fractures. Two modalities, radiographic images of the fractures and electronic health records of the patients, were fused to allow the networks to make predictions from more of the available information. For this, a vision transformer and a tabular transformer were employed. Fusion was performed using both a conventional strategy and attention-based strategies.
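To illustrate the two fusion styles, the following is a minimal PyTorch sketch, not the thesis implementation: the embedding dimensions, pooling, and module names are assumptions, and the unimodal backbones (a vision transformer for the radiographs, a tabular transformer for the register data) are taken as given and assumed to produce the embeddings and token sequences used below.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Conventional late fusion: concatenate the two unimodal embeddings
    and classify with a shared linear head (dimensions are illustrative)."""
    def __init__(self, img_dim=768, tab_dim=64, n_classes=2):
        super().__init__()
        self.head = nn.Linear(img_dim + tab_dim, n_classes)

    def forward(self, img_emb, tab_emb):
        return self.head(torch.cat([img_emb, tab_emb], dim=-1))

class CrossAttentionFusion(nn.Module):
    """Inter-modal cross-attention fusion: image tokens act as queries over
    tabular tokens, letting one modality weight the other's features."""
    def __init__(self, dim=256, n_heads=4, n_classes=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, img_tokens, tab_tokens):
        # Queries come from the image tokens; keys/values from the tabular tokens.
        fused, _ = self.attn(img_tokens, tab_tokens, tab_tokens)
        return self.head(fused.mean(dim=1))  # mean-pool fused tokens, then classify
```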
The dataset comprises data from 1,073 patients in Sweden who suffered a femoral fracture, with radiographs obtained from 72 hospitals in 2011, totaling 4,014 images, each with a fracture present. This was coupled with detailed register information from the Swedish National Patient Register. The data were preprocessed using common techniques for each modality and split at the patient level using 5-fold cross-validation.
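A patient-level split can be obtained with scikit-learn's GroupKFold, which groups images by patient ID so that no patient appears in both the training and test folds; whether the thesis used this exact utility is an assumption, and the arrays below are placeholders rather than the thesis data.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
image_ids = np.arange(4014)                      # one entry per radiograph (placeholder)
patient_ids = rng.integers(0, 1073, size=4014)   # placeholder patient labels

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(image_ids, groups=patient_ids)):
    # All images of a given patient fall entirely on one side of the split.
    assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
```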
Five models were employed to perform the binary classification of the fracture: (1) a unimodal vision transformer, (2) a unimodal tabular transformer, (3) multimodal conventional late fusion, (4) multimodal intra self-attention fusion, and (5) multimodal inter cross-attention fusion. The models were assessed using performance metrics (accuracy, AUC, F1-score, and Matthews correlation coefficient) and prediction uncertainties from Monte Carlo (MC) dropout. Model predictions were also aggregated from the image level to the patient level by weighting with the inverse uncertainties, increasing the clinical relevance of the results. Models were compared using the DeLong test for AUCs and the Wilcoxon rank-sum test for uncertainties.
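A hedged sketch of this uncertainty pipeline follows, under assumed function names and a plain inverse-uncertainty weighting (the thesis's exact scheme may differ in detail): dropout is kept active at test time for several stochastic forward passes, the spread of the predictions serves as the uncertainty, and a patient's image-level predictions are combined into one patient-level prediction. For the statistical comparison, scipy's ranksums implements the Wilcoxon rank-sum test; the DeLong test has no scipy implementation and is omitted here.

```python
import torch
from scipy.stats import ranksums

def mc_dropout_predict(model, x, n_samples=20):
    """MC dropout: run n_samples stochastic forward passes with dropout
    active; the mean is the prediction, the standard deviation the uncertainty."""
    model.train()  # re-enables dropout at test time (assumes no batch-norm layers)
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(dim=-1) for _ in range(n_samples)])
    return probs.mean(dim=0), probs.std(dim=0)

def aggregate_to_patient(image_probs, image_uncerts, eps=1e-8):
    """Combine one patient's image-level predictions into a single prediction,
    weighting each image by its inverse uncertainty (illustrative scheme)."""
    weights = 1.0 / (image_uncerts + eps)
    return (weights * image_probs).sum(dim=0) / weights.sum(dim=0)

# Comparing the uncertainty distributions of two models (placeholder inputs):
# stat, p_value = ranksums(uncerts_model_a, uncerts_model_b)
```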
Among the models, the unimodal vision model achieved the majority of the best test metrics at the patient level when averaged over the folds, with an accuracy of 99.07%, an F1-score of 0.9714, and an MCC of 0.9665. The multimodal conventional late fusion model and the multimodal inter cross-attention fusion model had the highest average AUC of 0.9994. There were no significant differences between the AUCs of the respective models. The comparison of uncertainties showed significant differences for all tests except between the conventional late fusion model and the multimodal inter cross-attention fusion model. In conclusion, the unimodal vision model performed best according to the performance metrics; however, when accounting for uncertainties, the fusion models showed improved performance.