Optimizing the Clinical Application of Rheumatology Guidelines Using Large Language Models: A Retrieval-Augmented Generation Framework Integrating ACR and EULAR Recommendations

Abstract

Objectives To develop and evaluate a Retrieval-Augmented Generation (RAG) system integrating European Alliance of Associations for Rheumatology (EULAR) and American College of Rheumatology (ACR) guidelines to provide rheumatologists with timely, evidence-based recommendations at the point of care.

Methods EULAR and ACR management guidelines were selected by rheumatologists according to their relevance to clinical decision making, then processed and chunked. A RAG system was implemented using the LangChain framework, the voyage-3 embedding model, and a Qdrant vector database. Answers to 740 guideline-specific questions were generated by ChatGPT-o3-mini with context retrieval (RAG) and without it (baseline). Performance was evaluated using an LLM-as-a-judge (Gemini 2.0 Flash) that assessed factual accuracy, safety, completeness, faithfulness, and preference, with Wilcoxon signed-rank and binomial tests for statistical significance.
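The chunk, retrieve, and prompt steps described in the Methods can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the bag-of-words `embed` function stands in for the voyage-3 embedding model, an in-memory list stands in for the Qdrant vector database, and every function name, chunk size, and prompt wording here is a hypothetical choice made for illustration.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative assumptions)."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Assemble a grounded prompt; the wording is a hypothetical example."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the guideline context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

In the baseline condition the question would be sent to the model on its own; in the RAG condition the output of `build_prompt` would be sent instead, so that the answer is conditioned on the retrieved guideline chunks.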

Results After agreement among the rheumatologists, 74 guidelines were included. The RAG-based system achieved median scores higher than or comparable to those of the baseline across all criteria: relevance, factual accuracy, safety, completeness, and conciseness (p<0.001). Moreover, the LLM judge preferred the RAG-based system in 92.8% of comparisons (p<0.001).
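As a rough illustration of how a preference rate like the one above can be tested, the sketch below computes an exact two-sided binomial p-value against a null hypothesis of no preference (p = 0.5). The null value, the function name, and the count of 687 out of 740 are assumptions: the abstract reports only the percentage, and 687 is simply a count that rounds to 92.8%. The Wilcoxon signed-rank comparison of scores is not shown.

```python
from math import comb


def binomial_two_sided_p(successes, n):
    """Exact two-sided binomial p-value under H0: p = 0.5.

    Under p = 0.5 the distribution is symmetric, so the two-sided
    p-value is twice the tail probability of the more extreme side,
    capped at 1.
    """
    extreme = max(successes, n - successes)
    tail = sum(comb(n, i) for i in range(extreme, n + 1))
    return min(1.0, 2 * tail / 2 ** n)


# Hypothetical count: 687 of 740 comparisons rounds to 92.8%.
p_value = binomial_two_sided_p(687, 740)
print(f"two-sided p-value: {p_value:.3g}")
```

With a preference rate this far from 50% over several hundred comparisons, the p-value falls far below the 0.001 threshold quoted in the Results.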

Conclusion This study demonstrates the successful development and validation of a RAG system integrating extensive ACR/EULAR guidelines. The system significantly improves answer quality compared to a baseline LLM, providing a promising foundation for reliable, AI-driven clinical decision support tools in rheumatology to enhance guideline adherence.

Key messages

Large language models, combined with EULAR and ACR guidelines, may enhance rheumatology clinical decision support.

Retrieval-augmented generation (RAG) responses showed significantly greater accuracy, safety, and completeness than baseline LLM responses.

RAG is a promising architecture for reducing hallucinations and providing grounded, reliable answers.

Competing Interest Statement

DB has received payment or honoraria for lectures, presentations, speakers bureaus, or support for meeting attendance from AbbVie, Galapagos, Janssen, UCB, Pfizer, and Novartis, and support for attending meetings from UCB, Novartis, and AbbVie. He works part-time as an advisor at Savana Research, a company working on AI in medicine. AMG works at Roche as a data scientist.

Funding Statement

This study did not receive any funding.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data availability statement

The clinical guidelines used in this study are freely available. All LLM prompts are documented, and the code is available upon individual request.

For further details or additional information, please contact the corresponding authors.

The Supplementary File with Questions, Answers, and the Evaluation contains the data used to evaluate the RAG system's performance.


Medrxiv - Rheumatology
