Large language models for patient education prior to interventional radiology procedures: a comparative study

PDF (Original Article) - 1MB
Other (Supplementary Information) - 33kB

Item Type: Article
Title: Large language models for patient education prior to interventional radiology procedures: a comparative study
Creators Name: Levita, Bogdan; Eminovic, Semil; Lüdemann, Willie Magnus; Schnapauff, Dirk; Schmidt, Robin; Haack, Anna-Maria; Dell'Orco, Andrea; Nawabi, Jawed and Penzkofer, Tobias
Abstract: PURPOSE: This study evaluates the ability of four large language models (LLMs) to answer common patient questions preceding transarterial periarticular embolization (TAPE), computed tomography (CT)-guided high-dose-rate (HDR) brachytherapy, and bleomycin electrosclerotherapy (BEST). The goal is to assess their potential to enhance clinical workflows and patient comprehension, while also assessing the associated risks.
MATERIALS AND METHODS: Thirty-five TAPE-related, 34 CT-HDR brachytherapy-related, and 36 BEST-related questions were presented to ChatGPT-4o, DeepSeek-V3, OpenBioLLM-8b, and BioMistral-7b. The LLM-generated responses were independently assessed by two board-certified radiologists, with accuracy rated on a 5-point Likert scale. Statistical analyses compared LLM performance across question categories with respect to suitability for patient education.
RESULTS: DeepSeek-V3 attained the highest mean scores for BEST (4.49 ± 0.77) and CT-HDR (4.24 ± 0.81) and performed comparably to ChatGPT-4o on TAPE-related questions (4.20 ± 0.77 vs. 4.17 ± 0.64; p = 1.000). In contrast, OpenBioLLM-8b (BEST 3.51 ± 1.15, CT-HDR 3.32 ± 1.13, TAPE 3.34 ± 1.16) and BioMistral-7b (BEST 2.92 ± 1.35, CT-HDR 3.03 ± 1.06, TAPE 3.33 ± 1.28) performed significantly worse than DeepSeek-V3 and ChatGPT-4o across all procedures. Preparation/Planning was the only question category without statistically significant differences across all three procedures.
CONCLUSION: DeepSeek-V3 and ChatGPT-4o excelled on TAPE, BEST, and CT-HDR brachytherapy questions, indicating potential to enhance patient education in interventional radiology, where complex but minimally invasive procedures are often explained in brief consultations. However, OpenBioLLM-8b and BioMistral-7b produced more frequent inaccuracies, suggesting that LLMs cannot yet replace comprehensive clinical consultations. These findings should be validated through patient feedback and implementation in clinical workflows.
Keywords: Large Language Models, Interventional Radiology, Patient Education
Source: CVIR Endovascular
ISSN: 2520-8934
Publisher: Springer Nature
Volume: 8
Number: 1
Page Range: 81
Date: 13 October 2025
Official Publication: https://doi.org/10.1186/s42155-025-00609-z
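The Materials and Methods describe rating each model's answers on a 5-point Likert scale and then statistically comparing models per question category; the abstract does not name the specific tests used. The Python sketch below illustrates one plausible analysis of this kind, using a Kruskal-Wallis test across models followed by Bonferroni-corrected pairwise Mann-Whitney U tests. The choice of tests and all ratings shown are illustrative assumptions, not the authors' data or analysis.

```python
# Minimal sketch (assumed tests, placeholder ratings): compare per-model
# Likert ratings with a Kruskal-Wallis test, then Bonferroni-corrected
# pairwise Mann-Whitney U tests.
from itertools import combinations
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

# Hypothetical 5-point Likert ratings per model for one procedure
ratings = {
    "ChatGPT-4o":    [5, 4, 4, 5, 4, 4, 5, 4],
    "DeepSeek-V3":   [5, 5, 4, 4, 5, 4, 4, 5],
    "OpenBioLLM-8b": [3, 4, 2, 3, 4, 3, 3, 4],
    "BioMistral-7b": [2, 3, 3, 2, 4, 3, 2, 3],
}

# Mean (± sample SD) per model, as reported in the abstract's notation
for model, scores in ratings.items():
    print(f"{model}: {np.mean(scores):.2f} (± {np.std(scores, ddof=1):.2f})")

# Global nonparametric test across the four models
h, p = kruskal(*ratings.values())
print(f"Kruskal-Wallis: H = {h:.2f}, p = {p:.4f}")

# Post-hoc pairwise comparisons with Bonferroni correction
pairs = list(combinations(ratings, 2))
for a, b in pairs:
    _, p_pair = mannwhitneyu(ratings[a], ratings[b], alternative="two-sided")
    p_corr = min(p_pair * len(pairs), 1.0)
    print(f"{a} vs. {b}: corrected p = {p_corr:.4f}")
```

Nonparametric tests are a natural fit here because Likert ratings are ordinal; a study could equally use other post-hoc procedures, so this is only one reasonable reading of "statistical analyses compared LLM performance."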

