Assessing accuracy and legitimacy of multimodal large language models on Japan Diagnostic Radiology Board Examination

Study design

This retrospective study did not directly involve human subjects. The data contain no information that could identify individuals and are available online to members of the Japan Radiological Society (JRS). Model inputs were submitted via the application programming interfaces (APIs) of OpenAI, Anthropic, Google Cloud, or Azure AI Foundry, as detailed in later sections. The providers' privacy policies guarantee that data submitted via their APIs are handled securely and not used for model training. Therefore, Institutional Review Board approval was waived.

Question dataset

The questions included in our study were entirely derived from the JDRBE, which evaluates comprehensive knowledge of diagnostic radiology. Candidates must complete at least five years of training in radiology to be eligible for the JDRBE.

For the JDRBE 2021 and 2023, we used the same dataset as in our previous report [9]. We additionally prepared the dataset for the 2024 examination following the same method. Briefly, we downloaded the examination papers in the Portable Document Format (PDF) from the member-only section of the JRS website, and extracted text and images using Adobe Acrobat (Adobe, San Jose, CA), preserving the original image resolutions and file formats. For most questions, the extracted images were used as-is. For some questions with inappropriate extracted images (such as those with multiple images superimposed), we used screenshots captured from the PDF files in PNG format instead. The images were either in PNG or JPEG format, with heights ranging from 134 to 1,708 pixels (mean, 456) and widths from 143 to 1,255 pixels (mean, 482). Figure 1 shows an example of a question processed using this method. Questions from the 2022 examination were excluded because we failed to extract relevant data from the PDF file.
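
The reported dimension statistics can be reproduced from the extracted image files with a short script such as the following (a minimal sketch; the directory name and use of the Pillow library are assumptions, not part of the authors' workflow):

```python
from pathlib import Path
from PIL import Image  # Pillow

# Hypothetical directory holding the extracted PNG/JPEG question images
IMAGE_DIR = Path("jdrbe_images")

heights, widths = [], []
for path in sorted(IMAGE_DIR.glob("*")):
    if path.suffix.lower() not in {".png", ".jpg", ".jpeg"}:
        continue
    with Image.open(path) as img:
        w, h = img.size
        widths.append(w)
        heights.append(h)

print(f"height: min {min(heights)}, max {max(heights)}, "
      f"mean {sum(heights) / len(heights):.0f}")
print(f"width:  min {min(widths)}, max {max(widths)}, "
      f"mean {sum(widths) / len(widths):.0f}")
```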

Fig. 1

Example of text and image extraction from a question. The main text and input images were extracted and provided to the model, while the question number (“18”) and image captions (“reconstructed sagittal slice” and “axial slice at the level of the first lumbar spine”) were omitted. In this example, the main text states, “A reconstructed sagittal slice and an axial slice at the level of the first lumbar spine are shown”

Questions without images were excluded; the remaining questions contained one to four images each. All questions had five answer choices. Approximately 90% were single-answer questions, while the remaining 10% were two-answer questions that required choosing both correct answers. The required number of choices was specified in each question statement.
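
For illustration, each retained question can be thought of as a record of the following shape (a hypothetical representation; the paper does not specify an internal data format):

```python
from dataclasses import dataclass, field

@dataclass
class ExamQuestion:
    """One JDRBE question as used in this study (illustrative only)."""
    exam_year: int                 # 2021, 2023, or 2024
    question_text: str             # main text in Japanese, captions omitted
    image_paths: list[str]         # one to four PNG/JPEG files
    choices: list[str]             # always five answer choices
    n_required_answers: int        # 1 for ~90% of questions, 2 otherwise
    ground_truth: set[str] = field(default_factory=set)  # radiologist consensus
```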

Since there were no officially published answers to the examinations, ground-truth answers were determined through consensus by three or more board-certified diagnostic radiologists. The answers for the 2024 examination were determined by S.M., S.H., and T.Y., with 18, 23, and 30 years of experience in diagnostic radiology, respectively. Questions without unanimous agreement on answers were excluded from the study. Figure 2 illustrates a flow chart detailing the inclusion and exclusion processes for questions.

Fig. 2

Summary of questions included in this study

Model evaluations

The following eight models were evaluated: GPT-4 Turbo, GPT-4o, GPT-4.5, GPT-4.1, o3, and o4-mini (all developed by OpenAI, San Francisco, CA); Claude 3.7 Sonnet (Anthropic, San Francisco, CA); and Gemini 2.5 Pro (Google DeepMind, London, UK). Details of the models are described in Table 1. Some of these models are reasoning models designed to solve complex tasks by employing logical reasoning [12,13,14]. GPT-4 Turbo and GPT-4o were chosen as baselines for comparison with our previous reports; the others are recent models released between February and April 2025. Since the GPT-4 Turbo model used in our previous study was a preview version and has since become unavailable, we used the closest available version in the same model family. All models were tested under two conditions: one using both text and image inputs (hereafter, “vision”) and one using only text inputs (“text-only”).

Table 1 Details of tested large language models

We used the official Anthropic API for Claude 3.7 Sonnet, the Google Cloud API for Gemini 2.5 Pro, and either the OpenAI API or the Azure AI Foundry (Microsoft, Redmond, WA) API for OpenAI models. For Claude 3.7 Sonnet, the required max_tokens parameter was set to 4,096; all other parameters for this model and all parameters for the other models, including the temperature parameter (if applicable), were left at their default values. All questions from the examinations were in Japanese, and the textual data were passed to the models without translation. We provided a system prompt similar to the one used in our previous report, as shown in Table 2. All experiments were conducted between April 18 and May 1, 2025.
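
As an illustration of the vision condition, a request to Claude 3.7 Sonnet through the Anthropic Python SDK might look roughly like the sketch below (hypothetical file names; the actual system prompt and question text, which were in Japanese, are elided as "..."):

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

SYSTEM_PROMPT = "..."  # system prompt as in Table 2 (not reproduced here)

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=4096,               # required parameter, set as in the study
    system=SYSTEM_PROMPT,
    messages=[{
        "role": "user",
        "content": [
            # one block per question image (one to four images per question)
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/png",
                        "data": encode_image("question_018_img1.png")}},
            # question text in Japanese, passed without translation
            {"type": "text", "text": "..."},
        ],
    }],
)
print(response.content[0].text)
```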

Table 2 Prompts used in the experiments

To examine which image modalities yielded the greatest performance gains from image input, we categorized the images into five groups (CT, MRI, X-ray, nuclear medicine, and others) and counted the number of correct responses for each group.
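
Such a tally can be computed along the following lines (an illustrative sketch; the record format and variable names are assumptions):

```python
from collections import Counter

# Hypothetical per-question records: modality group plus whether the
# vision-condition response matched the ground truth
results = [
    {"modality": "CT",  "correct": True},
    {"modality": "MRI", "correct": False},
    # ...
]

totals = Counter(r["modality"] for r in results)
correct = Counter(r["modality"] for r in results if r["correct"])

for modality in ("CT", "MRI", "X-ray", "nuclear medicine", "others"):
    print(f"{modality}: {correct[modality]}/{totals[modality]} correct")
```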

Legitimacy assessment

To assess the legitimacy of model responses, two diagnostic radiologists with different levels of experience (Y.Y., 2 years; S.M., 18 years, board-certified) independently rated the responses to all 92 questions from the 2024 examination. For this evaluation, we included one representative model from each LLM vendor—Claude 3.7 Sonnet from Anthropic, Gemini 2.5 Pro from Google DeepMind, and the best-performing model from OpenAI. GPT-4 Turbo was also included as a baseline model. A five-point Likert scale (1 = very poor to 5 = excellent) was used to rate each of the 368 responses based on a comprehensive assessment of response quality, including image interpretation and explanation. The responses were presented in randomized order, and the raters were blinded to the model identities.
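
A blinded, randomized rating sheet for the 368 responses (92 questions × 4 models) could be prepared with a script along these lines (a hypothetical sketch; the paper does not describe the tooling used):

```python
import csv
import random

# Hypothetical mapping: (model name, question number) -> model response text
responses = {
    ("GPT-4 Turbo", 1): "...",
    ("Claude 3.7 Sonnet", 1): "...",
    # ... 368 entries in total (92 questions x 4 models)
}

items = list(responses.items())
random.seed(42)          # fixed seed so both raters receive the same order
random.shuffle(items)

with open("rating_sheet.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["item_id", "question_id", "response", "score_1_to_5"])
    for item_id, ((_model, qid), text) in enumerate(items, start=1):
        # the model name is deliberately omitted so raters remain blinded
        writer.writerow([item_id, qid, text, ""])
```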

Statistical analysis

Differences in performance between the vision and text-only results were analyzed using McNemar’s exact test. For the legitimacy scores, we first applied Friedman’s test, followed by Wilcoxon’s signed-rank test with Holm’s correction for post-hoc pairwise comparisons. In addition, the quadratic weighted kappa was calculated to assess agreement between the two raters. Statistical significance was set at P < 0.05. All analyses were conducted using Python (version 3.12.6) with the scipy (version 1.15.2) and statsmodels (version 0.14.4) libraries.
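
The analyses map onto scipy and statsmodels roughly as follows (a minimal sketch using synthetic placeholder data, not the authors' analysis code; the quadratic weighted kappa is computed directly from its definition):

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)  # synthetic placeholder data throughout

# McNemar's exact test: paired correctness (1 = correct) under vision vs text-only
vision = rng.integers(0, 2, size=92)
text_only = rng.integers(0, 2, size=92)
table = [
    [np.sum((vision == 1) & (text_only == 1)), np.sum((vision == 1) & (text_only == 0))],
    [np.sum((vision == 0) & (text_only == 1)), np.sum((vision == 0) & (text_only == 0))],
]
print("McNemar exact p =", mcnemar(table, exact=True).pvalue)

# Friedman test on legitimacy scores (rows: 92 questions, columns: 4 models),
# followed by pairwise Wilcoxon signed-rank tests with Holm correction
scores = rng.integers(1, 6, size=(92, 4))
print("Friedman p =", friedmanchisquare(*scores.T).pvalue)

pvals = [wilcoxon(scores[:, i], scores[:, j]).pvalue
         for i in range(4) for j in range(i + 1, 4)]
print("Holm-adjusted p =", multipletests(pvals, method="holm")[1])

# Quadratic weighted kappa between the two raters, from its definition
def quadratic_weighted_kappa(r1, r2, k=5):
    r1, r2 = np.asarray(r1) - 1, np.asarray(r2) - 1        # map scores 1-5 to 0-4
    observed = np.zeros((k, k))
    for a, b in zip(r1, r2):
        observed[a, b] += 1
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    weights = np.array([[(i - j) ** 2 for j in range(k)] for i in range(k)]) / (k - 1) ** 2
    return 1 - (weights * observed).sum() / (weights * expected).sum()

rater1 = rng.integers(1, 6, size=368)
rater2 = rng.integers(1, 6, size=368)
print("Quadratic weighted kappa =", quadratic_weighted_kappa(rater1, rater2))
```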
