Clinical Risk Computation by Large Language Models Using Validated Risk Scores

Recent advances in artificial intelligence have propelled Large Language Models (LLMs) to strong performance in natural language understanding, enabling new healthcare applications. While LLMs can analyze health data, having them predict patient risk directly can be unreliable due to inaccuracies, biases, and difficulty interpreting complex medical data. A more trustworthy approach is to use LLMs to calculate traditional clinical risk scores: validated, evidence-based formulas widely accepted in medicine. This improves validity, transparency, and safety by relying on established scoring systems rather than LLM-generated risk assessments, while still allowing LLMs to enhance clinical workflows through clear and interpretable explanations. In this study, we evaluated three public LLMs (GPT-4o-mini, DeepSeek v3, and Google Gemini 2.5 Flash) on calculating five clinical risk scores: CHA₂DS₂-VASc, HAS-BLED, the Wells score, the Charlson Comorbidity Index, and the Framingham Risk Score. We created 100 patient profiles (20 per score) representing diverse clinical scenarios and converted them into natural-language clinical notes, which served as prompts for the LLMs to extract the relevant variables and compute each score. We compared the LLM-generated scores to reference scores from the validated formulas using accuracy, precision, recall, F1 score, and Pearson correlation. GPT-4o-mini and Gemini 2.5 Flash outperformed DeepSeek v3, showing near-perfect agreement with the reference scores on most instruments. However, all three models struggled with the more complex Framingham Risk Score, indicating that multi-step risk calculations remain challenging for general-purpose LLMs.
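
To make the pipeline concrete, below is a minimal Python sketch (not the study's code) of the reference-score side of the evaluation: a deterministic CHA₂DS₂-VASc calculator using the published point assignments, compared against LLM-extracted scores with some of the metrics named above. The function name, the example profiles, and the `llm_predicted` values are illustrative assumptions.

```python
# Minimal sketch of the reference-score evaluation loop; illustrative only.
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score

def cha2ds2_vasc(age, female, chf, hypertension,
                 diabetes, stroke_tia, vascular_disease):
    """CHA2DS2-VASc with the published point assignments.

    Age >= 75 and prior stroke/TIA/thromboembolism score 2 points each;
    every other risk factor scores 1 point (maximum 9).
    """
    score = 2 if age >= 75 else (1 if 65 <= age < 75 else 0)
    score += sum([female, chf, hypertension, diabetes, vascular_disease])
    score += 2 if stroke_tia else 0
    return score

# Hypothetical patient profiles (age, female, CHF, HTN, DM, stroke/TIA, vascular).
profiles = [
    (78, True,  False, True,  True,  False, False),   # reference score 5
    (54, False, False, False, False, False, False),   # reference score 0
    (67, False, True,  True,  False, True,  True),    # reference score 6
]
reference = [cha2ds2_vasc(*p) for p in profiles]

# Scores parsed from the LLM's free-text responses (hypothetical values).
llm_predicted = [5, 0, 4]

# Exact-match agreement and correlation, as in the evaluation described above.
print("accuracy :", accuracy_score(reference, llm_predicted))
print("macro F1 :", f1_score(reference, llm_predicted,
                             average="macro", zero_division=0))
r, _ = pearsonr(reference, llm_predicted)
print("Pearson r:", round(r, 3))
```

Because the reference side is a fixed formula, any disagreement isolates the LLM's extraction or arithmetic errors rather than ambiguity in the scoring rule itself.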
