Preparing for downstream tasks in artificial intelligence for dental radiology: a baseline performance comparison of deep learning models
Fernandes, Fara Aninha; Ge, Mouzhi; Chaltikyan, Georgi; Gerdes, Martin; Omlin, Christian Walter Peter
Journal article, Peer reviewed
Published version

Date
2024
Original version
Fernandes, F. A., Ge, M., Chaltikyan, G., Gerdes, M. W., & Omlin, C. W. (2024). Preparing for downstream tasks in artificial intelligence for dental radiology: A baseline performance comparison of deep learning models. Dentomaxillofacial Radiology. Article twae056. https://doi.org/10.1093/dmfr/twae056
Abstract
Objectives: To compare the performance of the convolutional neural network (CNN) with that of the vision transformer (ViT) and the gated multilayer perceptron (gMLP) in the classification of radiographic images of dental structures.
Methods: Retrospectively collected two-dimensional images derived from cone beam computed tomographic volumes were used to train CNN, ViT, and gMLP architectures as classifiers for four different cases. Cases selected for training the architectures were the classification of the radiographic appearance of maxillary sinuses, maxillary and mandibular incisors, the presence or absence of the mental foramen, and the positional relationship of the mandibular third molar to the inferior alveolar nerve canal. The performance metrics (sensitivity, specificity, precision, accuracy, and F1-score) and the area under the curve (AUC) of the receiver operating characteristic and precision-recall curves were calculated.
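For readers unfamiliar with the performance metrics named in the Methods, the following is a minimal sketch of how they are conventionally computed from confusion-matrix counts in a binary task (for example, presence or absence of the mental foramen). It is not the authors' code; the function name `binary_metrics` and the example counts are hypothetical, and a multi-class task would need per-class or averaged variants.

```python
def _safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp, fp, tn, fn: true positives, false positives, true negatives,
    false negatives. Hypothetical helper, not taken from the article.
    """
    sensitivity = _safe_div(tp, tp + fn)          # recall / true positive rate
    specificity = _safe_div(tn, tn + fp)          # true negative rate
    precision = _safe_div(tp, tp + fp)            # positive predictive value
    accuracy = _safe_div(tp + tn, tp + fp + tn + fn)
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = _safe_div(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


if __name__ == "__main__":
    # Illustrative counts only; these are not results from the study.
    for name, value in binary_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.3f}")
```

The AUC values are a separate calculation: they summarise the receiver operating characteristic and precision-recall curves across all decision thresholds, so they require the model's predicted scores rather than a single confusion matrix.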
Results: The ViT, with an accuracy of 0.74-0.98, performed on par with the CNN model (accuracy 0.71-0.99) in all tasks. The gMLP displayed marginally lower performance (accuracy 0.65-0.98) than the CNN and ViT. For certain tasks, the ViT outperformed the CNN. The AUCs ranged from 0.77 to 1.00 (CNN), 0.80 to 1.00 (ViT), and 0.73 to 1.00 (gMLP) across the four cases.
Conclusions: The ViT and gMLP exhibited performance comparable to that of the CNN (the current state-of-the-art). However, for certain tasks, there was a significant difference in the performance of the ViT and gMLP when compared to the CNN. This difference in model performance across tasks indicates that the capabilities of different architectures may be leveraged.