Diagnostic accuracy of ChatGPT-5 in evaluating root canal treatment and periapical pathosis on periapical radiographs

Article information

Restor Dent Endod. 2026;51(3):e29
Publication date (electronic) : 2026 May 12
doi : https://doi.org/10.5395/rde.2026.51.e29
1Division of Endodontics, Department of Conservative Dental Sciences, College of Dentistry, Qassim University, Qassim, Saudi Arabia
2General Practitioner, Dr. Tooth Private Clinics, Qassim, Saudi Arabia
3Department of Endodontics, Faculty of Dentistry, King Abdulaziz University, Jeddah, Saudi Arabia
*Correspondence to Waleed Almutairi, BDS, MS. Division of Endodontics, Department of Conservative Dental Sciences, College of Dentistry, Qassim University, Qassim, Saudi Arabia. Email: wgm1410@gmail.com

Citation: Almutairi W, Alnasser M, Alharbi H, Austah O. Diagnostic accuracy of ChatGPT-5 in evaluating root canal treatment and periapical pathosis on periapical radiographs. Restor Dent Endod 2026;51(3):e29.

Received 2025 November 8; Revised 2026 January 16; Accepted 2026 January 20.

Abstract

Objectives

Artificial intelligence (AI) chatbots such as ChatGPT-5 (OpenAI) are increasingly used for dental radiograph interpretation, especially among patients seeking self-assessment. However, their diagnostic accuracy in endodontics remains unclear.

Methods

This cross-sectional, STARD-AI-compliant study analyzed 271 anonymized periapical radiographs of endodontically treated posterior teeth, classified as straightforward (n = 167) or complex (n = 104), using standardized ChatGPT-5 prompts. Diagnostic criteria included obturation length (short, adequate, long), presence of voids, and periapical pathosis. Results were compared with those of a panel of general dentists against a reference standard set by endodontic specialists. Sensitivity, specificity, and accuracy were calculated, and the two evaluators were compared using the McNemar test (p < 0.05).

Results

ChatGPT-5 demonstrated high specificity (up to 99.3%) for normal or adequately treated findings but low sensitivity for short (13.3%) or long (0.1%) obturations, voids (9.0%–22.7%), and periapical lesions (10.5%–28.6%). Overall accuracy (54.0%–63.2%) was significantly lower than that of general dentists (76.0%–85.6%) (p < 0.001).

Conclusions

Although ChatGPT-5 achieved high specificity, its low sensitivity and overall accuracy limit diagnostic reliability. Expert clinician oversight remains essential for accurate interpretation and treatment planning.

INTRODUCTION

Artificial intelligence (AI) chatbots have transformed digital communication by enabling people to ask targeted, individualized questions and receive context-specific answers [1]. These systems are built on artificial neural networks, which are loosely modeled on the neural networks of the human brain. One of the most prominent examples is ChatGPT-5 (OpenAI, San Francisco, CA, USA), a multimodal AI system capable of processing both textual and visual inputs [2]. In this context, ‘multimodal’ refers to the model’s ability to process and integrate both textual input and visual data, such as radiographic images, within a single analytical framework.

ChatGPT-5 has become deeply integrated into modern daily life, with applications extending from education and business to healthcare clinical decision support [3]. This expanding scope of use has naturally extended into healthcare contexts, where patients increasingly upload medical and dental images to AI platforms in search of preliminary interpretations or reassurance [4]. This trend reflects the rising influence of AI as an auxiliary source of information that allows patients to engage more actively in health-related decision-making [5]. In dentistry, ChatGPT-5 can provide valuable insights and descriptive assessment of radiographic density patterns, quality of root canal treatment, and pathological changes [6,7]. However, it is important to acknowledge that the diagnostic reliability of general-purpose AI models in radiographic interpretation has not been scientifically validated [8,9].

Despite the widespread use of ChatGPT-5 in healthcare-related contexts, its diagnostic accuracy in dental radiography, particularly in endodontics, remains largely unexplored. Only a few preliminary studies have evaluated AI models’ performance in dental imaging [10], and most have focused on domain-specific algorithms rather than generalist large language models. The lack of validation raises concerns regarding potential misinterpretations when patients or clinicians rely on these systems for diagnostic insights.

Therefore, this study aimed to evaluate the diagnostic accuracy of ChatGPT-5 in interpreting periapical radiographs of endodontically treated teeth and to compare its performance with that of practicing general dentists and an expert endodontic reference standard. To our knowledge, this is the first investigation to benchmark a general-purpose AI chatbot against human clinicians using real radiographic data in endodontics.

METHODS

This diagnostic accuracy study was conducted in accordance with the Standards for Reporting Diagnostic Accuracy Studies 2015 (STARD 2015) and the STARD-Artificial Intelligence (STARD-AI) extension for studies involving AI-based diagnostic systems [11]. The study protocol was retrospectively registered and made publicly available on the Open Science Framework (https://osf.io/kmsqx). All key methodological components of the study, including design, participant selection, index test, reference standard, and statistical analysis, were reported in compliance with these guidelines to ensure transparency and reproducibility. STARD-compliant flow diagrams are presented in Figures 1 to 3.

Figure 1.

STARD (Standards for Reporting of Diagnostic Accuracy Studies) flow diagram for obturation length evaluation.

Figure 3.

STARD (Standards for Reporting of Diagnostic Accuracy Studies) flow diagram for pathology detection in the apical area.

Ethical approval was granted by the Institutional Review Board of the Ministry of Health, Qassim Province, Saudi Arabia (H-040Q-001), and the study complied with the principles of the Declaration of Helsinki. Because the study used anonymized retrospective data, informed consent was waived.

Radiographic data collection

Periapical radiographs were retrospectively obtained from a private dental clinic archive. Imaging was performed using a standardized digital intraoral X-ray unit (Planmeca ProX; Planmeca Oy, Helsinki, Finland) with consistent exposure parameters (70 kVp, 8 mA, and an exposure time of 0.2 seconds). All images were exported as high-resolution, uncompressed JPEG files. Two trained independent investigators, who were not involved in the diagnostic assessment phase, anonymized and pre-classified the radiographs into two categories: straightforward or complex.

A radiograph was classified as “complex” if it met any of the following criteria:

1. Overlapping or superimposed anatomical structures (e.g., zygomatic arch or maxillary sinus shadow obstructing apex visibility),

2. Evidence of procedural complications (e.g., ledges, perforations, separated instruments), or

3. Fuzzy or blurred radiographic apex.

This classification was designed to reflect real-world diagnostic conditions, as radiographic complexity may differentially affect both human interpretation and AI-based assessment, thereby influencing diagnostic accuracy. Radiographs that did not meet any of these criteria and showed clearly interpretable root canal fillings and apices were classified as “straightforward.” Inter-rater agreement was reached through discussion. Radiographs of posterior teeth were included; anterior teeth and non-endodontically treated cases were excluded. Inclusion criteria consisted of periapical radiographs of endodontically treated teeth that provided clear visualization of the root apex and extended at least 2 mm beyond the radiographic apex. Images with artifacts, poor contrast, or incomplete apical capture were omitted.
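The "any of the following" rule described above can be sketched in a few lines of Python. This is only an illustration: the `RadiographFlags` structure and its field names are hypothetical, and in the study the flags were judged visually by the two investigators rather than computed.

```python
from dataclasses import dataclass


@dataclass
class RadiographFlags:
    """Hypothetical per-image flags recorded by the pre-classifying investigators."""
    superimposed_anatomy: bool      # e.g., zygomatic arch or sinus shadow over the apex
    procedural_complication: bool   # e.g., ledge, perforation, separated instrument
    blurred_apex: bool              # fuzzy or blurred radiographic apex


def classify_complexity(flags: RadiographFlags) -> str:
    """Return "complex" if any criterion is met, otherwise "straightforward"."""
    if (flags.superimposed_anatomy
            or flags.procedural_complication
            or flags.blurred_apex):
        return "complex"
    return "straightforward"


# One criterion is enough to make a radiograph "complex".
example = RadiographFlags(superimposed_anatomy=False,
                          procedural_complication=True,
                          blurred_apex=False)
label = classify_complexity(example)
```

Because the criteria are combined with a logical OR, a single flagged feature places the image in the complex group.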

Diagnostic evaluation

The diagnostic assessment involved three evaluators grouped into two categories: the AI-based model (ChatGPT-5) and a consensus panel of two licensed general dentists, each with at least five years of clinical experience. The general dentists jointly reviewed the radiographs and reached a consensus through discussion.

ChatGPT-5 independently evaluated the same radiographs using the platform’s image input interface. To simulate realistic public-facing use, radiographs were submitted using the platform’s visual input functionality. Following multiple preliminary tests to optimize prompt clarity, the model was provided with the following prompt: “As a professional dentist, assess the uploaded radiographs using these criteria: (1) length of obturation—short, adequate, or long; (2) presence of voids—yes or no; and (3) presence of periapical pathosis—yes or no.” The same standardized prompt was applied uniformly to all 271 radiographs. While ChatGPT-5 generally provided definitive categorical outputs, responses in complex cases frequently included uncertainty modifiers such as ‘appears to be’ or ‘may suggest,’ indicating reduced confidence in visually ambiguous scenarios.
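As a rough illustration of how hedged responses could be flagged for review, the Python sketch below searches a response for uncertainty modifiers. The phrase list contains only the two modifiers quoted above, the function name is hypothetical, and the study does not describe any automated screening of this kind.

```python
# The two uncertainty modifiers quoted in the text; a real audit would need a fuller list.
UNCERTAINTY_MODIFIERS = ("appears to be", "may suggest")


def contains_uncertainty(response: str) -> bool:
    """Return True if the response text contains any listed hedging phrase."""
    lowered = response.lower()
    return any(phrase in lowered for phrase in UNCERTAINTY_MODIFIERS)


# Invented example responses, not actual model output from the study.
hedged = contains_uncertainty("The obturation appears to be adequate.")
definite = contains_uncertainty("Obturation length: adequate. Voids: no.")
```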

For human evaluators, standardized diagnostic definitions were applied:

• Root canal filling length:

- Short: obturation ending >2 mm short of the radiographic apex.

- Adequate: obturation terminating 0–2 mm short of the radiographic apex.

- Long: obturation extending beyond the apex.

• A void was defined as a radiolucent gap within the filling material, and a periapical lesion as a radiolucency exceeding twice the width of the normal periodontal ligament space at the apex; both were evaluated as binary variables (yes or no).
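The definitions above can be written as simple threshold rules. The following Python sketch assumes hypothetical numeric measurements (a signed distance to the apex and two widths in millimetres); the human evaluators in the study applied these definitions by visual assessment, so this is an illustration rather than the study's procedure.

```python
def classify_obturation_length(distance_short_of_apex_mm: float) -> str:
    """Classify filling length from the signed distance between filling end and apex.

    Positive values: the filling ends short of the radiographic apex.
    Negative values: the filling extends beyond the apex.
    """
    if distance_short_of_apex_mm < 0:
        return "long"
    if distance_short_of_apex_mm > 2:
        return "short"
    return "adequate"  # 0 to 2 mm short of the apex


def has_periapical_lesion(radiolucency_width_mm: float,
                          normal_pdl_width_mm: float) -> bool:
    """Apply the rule: radiolucency wider than twice the normal PDL space."""
    return radiolucency_width_mm > 2 * normal_pdl_width_mm
```

Under these assumptions a filling ending exactly 2 mm short is still "adequate", and a radiolucency exactly twice the ligament width does not count as a lesion, because the definitions use "more than 2 mm" and "exceeding twice".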

Diagnostic criteria were intentionally withheld from ChatGPT-5 to simulate typical real-world public usage, where users commonly upload images without structured clinical definitions. This approach was adopted to enhance the external validity of the findings.

Reference standard

The reference standard was established by two board-certified endodontists, each with more than 5 years of clinical experience in endodontic diagnosis and treatment. All radiographs were reviewed independently on a calibrated 24-inch LCD monitor (1,920 × 1,080 resolution) under standardized lighting in a quiet environment, without knowledge of the ChatGPT-5 or general dentists’ assessments. Disagreements between the endodontists were resolved by discussion to establish the reference standard.

Sample size calculation

The required sample size was estimated using an expected agreement proportion of 0.80 (based on prior AI-dentist diagnostic concordance studies [12,13]), a 95% confidence level (Z = 1.96), and a 5% margin of error (E = 0.05). The calculation indicated a minimum of 246 radiographs. The following formula was used:

$$n = \frac{Z^{2}\,P\,(1-P)}{E^{2}}$$

where n = required sample size; Z = Z value corresponding to the 95% confidence level (1.96); P = expected proportion of agreement between the AI model and dentists, set to 0.80 based on the initial estimates; and E = acceptable margin of error, set to 0.05.

The calculation was performed under a two-tailed test assumption, aiming to detect any statistically significant difference—either higher or lower—in agreement between the AI model and human evaluators. A two-sided 95% confidence interval was thus applied.

The values were substituted into the formula as follows:

$$n = \frac{(1.96)^{2} \times 0.8 \times (1-0.8)}{(0.05)^{2}} \approx 246$$
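The arithmetic can be checked with a short Python sketch. The function name `required_sample_size` is illustrative and not taken from the study; it evaluates the single-proportion formula with the reported values and rounds up to the next whole radiograph.

```python
import math


def required_sample_size(z: float, p: float, e: float) -> int:
    """Minimum n for estimating a proportion p within margin e at z-score z."""
    n = (z ** 2) * p * (1 - p) / (e ** 2)
    # Round up, because a fractional radiograph cannot be sampled.
    return math.ceil(n)


# Values reported in the text: Z = 1.96, P = 0.80, E = 0.05.
n_min = required_sample_size(z=1.96, p=0.80, e=0.05)
print(n_min)  # 246
```

The unrounded value is about 245.9, so the 271 radiographs analyzed exceed the calculated minimum.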

Statistical analysis

Sensitivity, specificity, and overall accuracy were calculated for ChatGPT-5 and general dentists relative to the reference standard. Sensitivity reflected the ability to detect inadequate obturations, voids, or apical lesions; specificity reflected the correct identification of normal findings of the apical area and adequate root canal filling. The McNemar test was used to compare diagnostic proportions (α = 0.05). Statistical analyses were performed using IBM SPSS Statistics version 29.0 (IBM Corp., Armonk, NY, USA).
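For readers who want to reproduce these measures, the Python sketch below computes sensitivity, specificity, and accuracy against a reference standard, and an exact two-sided McNemar test on paired correct/incorrect outcomes. It is a minimal illustration under stated assumptions: the study used SPSS, the exact binomial form of the test is an assumption (the variant used is not reported), and the ten-case data set is invented.

```python
from math import comb


def diagnostic_metrics(reference, predicted):
    """Return (sensitivity, specificity, accuracy); True means finding present."""
    pairs = list(zip(reference, predicted))
    tp = sum(r and p for r, p in pairs)
    tn = sum((not r) and (not p) for r, p in pairs)
    fp = sum((not r) and p for r, p in pairs)
    fn = sum(r and (not p) for r, p in pairs)
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(pairs)


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for paired correct/incorrect outcomes."""
    # Only discordant pairs (one rater right, the other wrong) are informative.
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    # Under H0 each discordant pair favours either rater with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented toy data: 4 cases with a finding, 6 without.
reference = [True, True, True, True, False, False, False, False, False, False]
model = [False, False, False, True, False, False, False, False, False, True]
dentist = [True, True, True, False, False, False, False, False, True, False]

model_metrics = diagnostic_metrics(reference, model)
dentist_metrics = diagnostic_metrics(reference, dentist)

# Compare overall accuracy: was each rater correct on each case?
model_correct = [m == r for m, r in zip(model, reference)]
dentist_correct = [d == r for d, r in zip(dentist, reference)]
p_value = mcnemar_exact(model_correct, dentist_correct)
```

Restricting the paired comparison to reference-positive cases would compare sensitivities, and restricting it to reference-negative cases would compare specificities.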

RESULTS

A total of 271 anonymized periapical radiographs met the inclusion criteria and were analyzed. Of these, 167 (61.6%) were classified as straightforward and 104 (38.4%) as complex based on radiographic interpretability.

Root canal filling length

Assessment of obturation length revealed notable variability among evaluators (Table 1). The reference standard identified 69 cases (25.5%) as short, 166 (61.3%) as adequate, and 36 (13.3%) as long. ChatGPT-5 identified fewer short (13.7%, n = 37) and long fillings (1.5%, n = 4) while overestimating adequate cases (83.4%, n = 226). In contrast, general dentists’ classifications were more closely aligned with the reference standard (short, 28.8%; adequate, 58.7%; and long, 12.5%).


In straightforward radiographs, ChatGPT-5 achieved high sensitivity for adequate fillings (92.7%) but markedly low sensitivity for short (13.3%) and long (0.1%) fillings. Specificity ranged from 10.4% (adequate) to 99.3% (long), yielding an overall accuracy of 58.9%. In complex radiographs, sensitivity for short fillings increased to 39.1%, with accuracy reaching 61.5%.

General dentists demonstrated consistently superior diagnostic performance. For straightforward radiographs, their overall accuracy was 85.6%, with high sensitivity for all filling categories (≥81.8%). In complex cases, accuracy remained high (81.7%). Statistical analysis confirmed significant differences in sensitivity and overall accuracy between ChatGPT-5 and general dentists for both radiograph types (p = 0.003 and p < 0.001, respectively), while specificity differences were not significant (p = 0.239 and p = 0.427, respectively) (Table 1, Figure 1).
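Per-category sensitivity and specificity for a three-level outcome such as filling length are commonly obtained by treating each category in turn as the positive class (one-vs-rest). Whether the authors computed Table 1 exactly this way is an assumption; the Python sketch below uses invented labels to show the calculation and how over-calling "adequate" produces high sensitivity but low specificity for that category.

```python
def one_vs_rest_metrics(reference, predicted, positive_label):
    """Sensitivity, specificity, accuracy with `positive_label` vs all other labels.

    Assumes the positive label and at least one other label occur in `reference`.
    """
    tp = fn = tn = fp = 0
    for ref, pred in zip(reference, predicted):
        if ref == positive_label:
            if pred == positive_label:
                tp += 1
            else:
                fn += 1
        elif pred == positive_label:
            fp += 1
        else:
            tn += 1
    total = tp + fn + tn + fp
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / total


# Invented toy data: the rater labels almost everything "adequate".
reference = ["short", "short", "adequate", "adequate", "adequate", "long"]
predicted = ["adequate", "short", "adequate", "adequate", "adequate", "adequate"]

adequate = one_vs_rest_metrics(reference, predicted, "adequate")
short = one_vs_rest_metrics(reference, predicted, "short")
long_ = one_vs_rest_metrics(reference, predicted, "long")
```

In this toy example the rater reaches 100% sensitivity for "adequate" but only one-third specificity, while missing the single "long" case entirely.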

Detection of voids in root canal fillings

The reference standard detected voids in 122 radiographs (45.0%), whereas ChatGPT-5 identified only 34 (12.5%), indicating frequent under-detection. The general dentists recognized voids in 80 cases (29.5%), which was closer to the reference standard than ChatGPT-5 but still below it.

For straightforward radiographs, ChatGPT-5 demonstrated a sensitivity of 9%, specificity of 91%, and overall accuracy of 54%. In complex radiographs, sensitivity improved to 22.7% and accuracy to 58.7%, with specificity of 85%. In contrast, the general dentists achieved sensitivity values of 57.7% (straightforward) and 56.8% (complex), with overall accuracies of 76% and 78.8%, respectively. Differences in sensitivity and accuracy between ChatGPT-5 and the dentists were statistically significant (p = 0.01 and p < 0.001), while specificity differences were not (p = 0.781 and p = 0.149) (Table 2, Figure 2).


Figure 2.

STARD (Standards for Reporting of Diagnostic Accuracy Studies) flow diagram for void detection in root canal fillings.

Detection of periapical lesions

The reference standard classified 92 radiographs (33.9%) as having periapical lesions. ChatGPT-5 detected only 49 (18.1%), showing limited sensitivity but high specificity. For straightforward radiographs, sensitivity was 10.5%, specificity 88.2%, and overall accuracy 63.2%. For complex radiographs, sensitivity rose to 28.6%, specificity declined to 71.0%, and accuracy was 56.7%.

General dentists demonstrated markedly higher diagnostic performance, with sensitivity of 93.0% and specificity of 79.1% in straightforward radiographs, and 80.0% and 87.0%, respectively, in complex radiographs. Their overall accuracy exceeded 83% across all conditions. Statistical comparisons confirmed significantly higher sensitivity and accuracy for general dentists (p < 0.001 for straightforward, p = 0.002 for complex), with specificity differences significant only in complex cases (p = 0.029) (Table 3, Figure 3).


In summary, across all diagnostic parameters, ChatGPT-5 demonstrated high specificity but consistently low sensitivity, particularly in detecting voids and periapical lesions. General dentists maintained more balanced sensitivity and specificity, with overall accuracies between 76.0% and 85.6%, whereas ChatGPT-5’s overall accuracy ranged between 54.0% and 63.2%.

DISCUSSION

This study evaluated the diagnostic accuracy of ChatGPT-5 as a public source of medical information in interpreting periapical radiographs of endodontically treated teeth. As of February 2025, ChatGPT had over 400 million weekly active users, reflecting its rapid growth and widespread adoption [12]. A recent Cleveland Clinic survey reported that a significant portion of the public turns to AI chatbots, such as ChatGPT, for medical information, with 72% of Americans believing the health advice they received from these chatbots is accurate [13]. Moreover, four out of five AI users considered information obtained from ChatGPT to be at least as reliable as that provided by their physicians [14]. This growing trust in AI-driven health guidance raises concerns, particularly regarding the accuracy and reliability of the medical advice provided by AI chatbots.

In this study, the model demonstrated high specificity but consistently low sensitivity across all diagnostic parameters, indicating that while it could correctly identify normal findings, it frequently failed to detect pathological or suboptimal conditions. These results align with the study’s initial hypothesis and emphasize that generalist large-language-model systems, although capable of image recognition, remain unreliable for clinical diagnostic purposes.

The overall accuracy of ChatGPT-5 (54%–63%) was significantly lower than that of general dentists (76%–86%), reaffirming that professional human judgment remains essential for radiographic interpretation. The pattern of high specificity but poor sensitivity is characteristic of untrained or nonspecialized classifiers that default to “normal” findings when diagnostic confidence is low. This outcome is in agreement with previous AI radiology investigations, where general-purpose models underperformed compared with domain-trained algorithms designed specifically for dental imaging [15].
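To make the "default to normal" pattern concrete, the Python sketch below computes the metrics of a hypothetical rater that labels every radiograph as normal. The function name is illustrative; the only study figures used are the reference-standard void counts reported in the Results (122 of 271 radiographs).

```python
def always_negative_baseline(n_positive: int, n_total: int):
    """Metrics of a rater that labels every case as normal (no finding)."""
    n_negative = n_total - n_positive
    sensitivity = 0.0                 # no positive case is ever detected
    specificity = 1.0                 # every negative case is labelled correctly
    accuracy = n_negative / n_total   # equals 1 - prevalence
    return sensitivity, specificity, accuracy


# Void prevalence per the reference standard: 122 of 271 radiographs.
_, _, acc = always_negative_baseline(n_positive=122, n_total=271)
print(round(100 * acc, 1))  # 55.0
```

Such a rater would reach roughly 55% accuracy on void detection without identifying a single void, which shows why accuracy and specificity alone can look acceptable while sensitivity is near zero.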

The limited diagnostic performance of ChatGPT-5 can be attributed to its architecture and training paradigm. Although the model is multimodal, its foundational structure is language-oriented, optimized for text generation rather than pixel-based image analysis. It lacks supervised exposure to annotated dental radiographs and thus cannot accurately recognize subtle endodontic features such as obturation length deviations, apical radiolucencies, or voids. Dedicated deep-learning networks trained on labeled datasets—such as convolutional neural networks or transformer-based vision models—have achieved sensitivities above 85% in similar diagnostic tasks [16,17], underscoring the gap between purpose-built systems and generalist AI. Although domain-specific fine-tuning of large language models may improve performance, such adaptations may not fully overcome the inherent limitations of language-first architectures. Diagnostic tasks in endodontic radiology likely require purpose-built vision models trained on large, annotated dental imaging datasets, rather than general-purpose AI systems adapted post hoc.

Clinically, these findings have practical implications. With the increasing use of ChatGPT-based platforms by patients seeking self-assessment, there is a risk that misinterpretation of radiographs may lead to delayed professional consultation, unwarranted anxiety, or misplaced reassurance. Endodontists and general practitioners should be aware of these limitations and proactively educate patients on the boundaries of AI-generated health information. The ChatGPT-5 evaluation was intentionally conducted using natural language prompts that simulate realistic public-facing interactions, without embedding detailed operational definitions. Future research may explore whether incorporating standardized diagnostic definitions into AI prompts enhances interpretive reliability.

Several factors may explain the observed discrepancies between ChatGPT-5 and human evaluators. The lack of diagnostic calibration, absence of case-specific context (such as patient symptoms or clinical notes), and the inherent ambiguity of radiographic features contribute to performance variability [18,19]. Despite standardized training, dental professionals may interpret radiographic features differently due to variations in clinical experience, perceptual sensitivity, and diagnostic thresholds for pathology. Factors such as contrast perception, anatomical variations, and subjective assessment of radiolucencies contribute to these discrepancies, particularly in complex cases where the pathological findings are subtle or ambiguous [20]. Although expert disagreement reinforces the inherent subjectivity of human evaluation, the collective accuracy of the included dentists remained superior to that of the AI model. The persistence of interobserver variability suggests that AI should not replace expert interpretation but rather serve as an adjunct tool. Additionally, the model’s reliance on a single image per case contrasts with the multi-image or three-dimensional (3D) datasets often available to clinicians, further reducing diagnostic precision.

This study has limitations inherent to retrospective radiographic designs. The dataset comprised periapical radiographs from a single clinical setting, which may limit generalizability. The ChatGPT-5 prompt was standardized to simulate public usage; however, different phrasing or context could influence responses. Moreover, the model’s outputs were not repeated across sessions to test reproducibility, even though its image-based responses may vary stochastically between runs. Future research should explore incorporating multiple AI systems, larger multicenter datasets, and prospective designs that evaluate human-AI collaborative workflows. Developing specialized, dentistry-trained multimodal models may substantially enhance diagnostic accuracy and clinical reliability. Additionally, this study relied on static two-dimensional periapical radiographs, whereas 3D imaging modalities such as cone-beam computed tomography are increasingly used in endodontic practice and may yield different diagnostic performance profiles.

Overall, ChatGPT-5’s performance in this study underscores that general-purpose AI remains an adjunctive tool rather than a diagnostic substitute. While it demonstrated a capacity for identifying normal cases, it failed to match the diagnostic precision of trained clinicians. Continuous refinement of domain-specific AI frameworks, combined with clinician oversight, is essential to ensure safe and effective integration of AI into endodontic practice.

CONCLUSIONS

ChatGPT-5 demonstrated high specificity but low sensitivity in interpreting periapical radiographs of endodontically treated teeth, limiting its diagnostic reliability. While the model can recognize normal findings, it frequently fails to identify inadequate root canal fillings, voids, and periapical pathoses. These findings underscore that general-purpose AI systems remain unsuitable for autonomous diagnostic use in endodontics. Clinician expertise and critical interpretation remain indispensable, and ongoing development of domain-specific, dentistry-trained AI models is essential to enhance diagnostic accuracy, patient safety, and the responsible integration of AI into clinical practice.

Notes

CONFLICT OF INTEREST

No potential conflict of interest relevant to this article was reported.

FUNDING/SUPPORT

The authors have no financial relationships relevant to this article to disclose.

ACKNOWLEDGMENTS

The authors thank the IT department from Dr. Tooth Clinic for their help in retrieving radiographs.

AUTHOR CONTRIBUTIONS

Conceptualization: Almutairi W, Alharbi H. Data curation: Alnasser M, Alharbi H, Austah O. Formal analysis: Almutairi W, Austah O. Investigation: Almutairi W, Alnasser M, Alharbi H. Methodology: Almutairi W, Alharbi H. Project administration: Almutairi W, Alnasser M. Validation: Austah O, Alharbi H. Writing - original draft: Almutairi W, Alharbi H. Writing - review & editing: Austah O.

DATA SHARING STATEMENT

The datasets are available upon reasonable request from the corresponding author.

DISCLOSURE OF GENERATIVE AI IN SCIENTIFIC WRITING

We confirm that generative AI (specifically ChatGPT-5) was indeed central to the conduct of this study, as detailed in the methodology section, which investigates its accuracy. Regarding the preparation of the manuscript, generative AI tools were used only for grammar checking. All content was originally written by the authors and fully reviewed and edited for accuracy and originality.

References

1. Goodfellow I, Bengio Y, Courville A. Deep learning. Cambridge, MA: MIT Press; 2016.
2. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. In: Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NeurIPS 2017); December 4–9, 2017; Long Beach, CA, USA. Red Hook, NY: Curran Associates, Inc; 2017. p. 5998–6008.
3. Rutledge GW. Diagnostic accuracy of GPT-4 on common clinical scenarios and challenging cases. Learn Health Syst 2024;8:e10438. 10.1002/lrh2.10438. 39036534.
4. Amin K, Khosla P, Doshi R, Chheang S, Forman HP. Artificial intelligence to improve patient understanding of radiology reports. Yale J Biol Med 2023;96:407–417. 10.59249/nkoy5498. 37780992.
5. Bickmore TW, Pfeifer LM, Jack BW. Taking the time to care: empowering low-health-literacy hospital patients with virtual nurse agents. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. Boston (MA): ACM Press; 2009. p. 126–35.
6. Asgary S. Artificial intelligence in endodontics: a scoping review. Iran Endod J 2024;19:85–98. 10.22037/iej.v19i2.44842. 38577001.
7. Sadr S, Mohammad-Rahimi H, Motamedian SR, Zahedrozegar S, Motie P, Vinayahalingam S, et al. Deep learning for detection of periapical radiolucent lesions: a systematic review and meta-analysis of diagnostic test accuracy. J Endod 2023;49:248–261. 10.1016/j.joen.2022.12.007. 36563779.
8. Ghaffari M, Zhu Y, Shrestha A. A review of advancements of artificial intelligence in dentistry. Dent Rev 2024;4:100081. 10.1016/j.dentre.2024.100081.
9. Keshavarz P, Bagherieh S, Nabipoorashrafi SA, Chalian H, Rahsepar AA, Kim GHJ, et al. ChatGPT in radiology: a systematic review of performance, pitfalls, and future perspectives. Diagn Interv Imaging 2024;105:251–265. 10.1016/j.diii.2024.04.003. 38679540.
10. Stephan D, Bertsch A, Burwinkel M, Vinayahalingam S, Al-Nawas B, Kämmerer PW, et al. AI in dental radiology-improving the efficiency of reporting with ChatGPT: comparative study. J Med Internet Res 2024;26:e60684. 10.2196/60684. 39714078.
11. Sounderajah V, Ashrafian H, Golub RM, Shetty S, De Fauw J, Hooft L, et al. Developing a reporting guideline for artificial intelligence-centred diagnostic test accuracy studies: the STARD-AI protocol. BMJ Open 2021;11:e047709. 10.1136/bmjopen-2020-047709. 34183345.
12. Singh S. ChatGPT statistics 2024 - 300 million active users (December) [Internet]. San Francisco, CA: DemandSage; 2024. [cited 2025 May 23]. Available from: https://www.demandsage.com/ChatGPT-statistics.
13. Cleveland Clinic. Cleveland Clinic survey: most Americans using health-monitoring technology are experiencing significant physical and mental benefits [Internet]. Cleveland, OH: Cleveland Clinic; 2024. Feb. 1. [cited 2025 May 23]. Available from: https://newsroom.clevelandclinic.org/2024/02/01/cleveland-clinic-survey-most-americans-using-health-monitoring-technology-are-experiencing-significant-physical-and-mental-benefits.
14. Ayo-Ajibola O, Davis RJ, Lin ME, Riddell J, Kravitz RL. Characterizing the adoption and experiences of users of artificial intelligence-generated health information in the United States: cross-sectional questionnaire study. J Med Internet Res 2024;26:e55138. 10.2196/55138. 39141910.
15. Baza R. Evaluating artificial intelligence in dental radiography: an experimental study and a comparative analysis of AI ability in diagnosing panoramic X-ray infections [master’s thesis]. Stockholm (Sweden): KTH Royal Institute of Technology; 2024.
16. Jurafsky D, Martin JH. Speech and language processing. 3rd ed. London: Pearson; 2021.
17. Moor M, Banerjee O, Abad ZS, Krumholz HM, Leskovec J, Topol EJ, et al. Foundation models for generalist medical artificial intelligence. Nature 2023;616:259–265. 10.1038/s41586-023-05881-4. 37045921.
18. Goldman M, Pearson AH, Darzenta N. Endodontic success: who’s reading the radiograph? Oral Surg Oral Med Oral Pathol 1972;33:432–437. 10.1016/0030-4220(72)90473-2. 4501172.
19. Goldman M, Pearson AH, Darzenta N. Reliability of radiographic interpretations. Oral Surg Oral Med Oral Pathol 1974;38:287–293. 10.1016/0030-4220(74)90070-x. 4528712.
20. Hegde S, Gao J, Vasa R, Cox S. Factors affecting interpretation of dental radiographs. Dentomaxillofac Radiol 2023;52:20220279. 10.1259/dmfr.20220279. 36472942.

Table 1.

Performance of ChatGPT-5 and general dentists in detecting root canal filling length

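| Evaluator | Length | Straightforward: Sensitivity (%) | Straightforward: Specificity (%) | Straightforward: Accuracy (%) | Complex: Sensitivity (%) | Complex: Specificity (%) | Complex: Accuracy (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ChatGPT-5 | Short | 13.3 | 94.1 | 71.8 | 39.1 | 81.5 | 72.1 |
| ChatGPT-5 | Adequate | 92.7 | 10.4 | 58.9 | 77.6 | 32.4 | 61.5 |
| ChatGPT-5 | Long | 0.1 | 99.3 | 85.9 | 7.1 | 97.8 | 85.6 |
| General dentists | Short | 84.8 | 91.7 | 89.8 | 91.3 | 90.1 | 90.4 |
| General dentists | Adequate | 86.9 | 83.8 | 85.6 | 82.1 | 81.1 | 81.7 |
| General dentists | Long | 81.8 | 97.9 | 95.8 | 57.1 | 94.4 | 89.4 |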

Table 2.

Performance of ChatGPT-5 and general dentists in detecting voids in root canal fillings

| Measure | Straightforward: ChatGPT-5 (%) | Straightforward: General dentists (%) | Complex: ChatGPT-5 (%) | Complex: General dentists (%) |
| --- | --- | --- | --- | --- |
| Sensitivity | 9.0 | 57.7 | 22.7 | 56.8 |
| Specificity | 91.0 | 92.1 | 85.0 | 95.0 |
| Accuracy | 54.0 | 76.0 | 58.7 | 78.8 |

Table 3.

Performance of ChatGPT-5 and general dentists in detecting periapical lesions

| Measure | Straightforward: ChatGPT-5 (%) | Straightforward: General dentists (%) | Complex: ChatGPT-5 (%) | Complex: General dentists (%) |
| --- | --- | --- | --- | --- |
| Sensitivity | 10.5 | 93.0 | 28.6 | 80.0 |
| Specificity | 88.2 | 79.1 | 71.0 | 87.0 |
| Accuracy | 63.2 | 83.8 | 56.7 | 84.6 |