Artificial intelligence software to detect small hepatic lesions on hepatobiliary-phase images using multiscale sampling

This retrospective, observational study was performed in accordance with relevant guidelines and regulations of our institution. This report adheres to the checklist for AI in medical imaging guidelines [15].

Lightweight detection architecture adapted to small lesions using multiscale sampling method

Details have been reported [14]. Briefly, this model was developed using HBP images from 45 cases (31 for training and 14 for validation), all of which were completely independent from the cases included in the observer performance study, thereby eliminating the risk of data leakage. The image acquisition parameters (including scanner model, sequence type, and imaging protocol) used for model development were identical to those used in the observer study, as all images were acquired at the same institution under the same protocol. Therefore, there is no domain shift between the model development data and the test data.

This model applies multiscale sampling (ms) and 2D patches generated using the minimum-intensity projection of 3 orthogonal planes.

1. Multiscale sampling

Depending on the target to be detected, the size of the patches sampled for training the AI network must be optimized. Small lesions cannot be detected when the network is trained on large patches, and large lesions cannot be detected when the patch size is small. Therefore, the msAI software samples 6-, 12-, and 24-mm patch images.
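The multiscale sampling step can be sketched as follows. This is a minimal illustration, not the authors' implementation; the helper names (`extract_patch`, `multiscale_patches`), the output grid size, and the nearest-neighbour resampling are assumptions.

```python
import numpy as np

PATCH_SIZES_MM = (6, 12, 24)  # the three patch scales described in the text

def extract_patch(volume, center, size_mm, spacing_mm, out_px=16):
    """Crop a cubic patch of roughly size_mm around a voxel-coordinate
    center, then resample it to out_px^3 by nearest-neighbour indexing."""
    half = np.maximum((size_mm / (2 * np.asarray(spacing_mm))).round().astype(int), 1)
    lo = np.maximum(np.asarray(center) - half, 0)
    hi = np.minimum(np.asarray(center) + half, np.asarray(volume.shape) - 1)
    patch = volume[lo[0]:hi[0] + 1, lo[1]:hi[1] + 1, lo[2]:hi[2] + 1]
    idx = [np.linspace(0, s - 1, out_px).round().astype(int) for s in patch.shape]
    return patch[np.ix_(idx[0], idx[1], idx[2])]

def multiscale_patches(volume, center, spacing_mm):
    """One patch per scale, stacked along a leading scale axis."""
    return np.stack([extract_patch(volume, center, s, spacing_mm)
                     for s in PATCH_SIZES_MM])
```

Because all three scales are resampled to a common grid, a single network can consume them jointly; a small lesion dominates the 6-mm patch while still being visible in context in the 24-mm patch.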

2. 2D patches using the minimum-intensity projection of 3 orthogonal planes

The detection of lesions on HBP images requires 3D information. Therefore, the patch images were converted from 3 to 2D using the minimum-intensity projection of axial, coronal, and sagittal planes to preserve the 3D information even on 2D images. The binary classification (hepatic lesion yes/no) is obtained using the lightweight 2D deep convolutional neural network model with 2D minimum-intensity projection images of the 3 orthogonal planes as the inputs.
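The 3D-to-2D conversion described above can be sketched in a few lines. This is an illustrative sketch under the assumption that each patch is a cubic numpy array; the function name is hypothetical.

```python
import numpy as np

def minip_three_planes(patch3d):
    """Collapse a cubic 3D patch into three 2D minimum-intensity
    projections (axial, coronal, sagittal), stacked as channels so a
    2D CNN can consume them as a single (3, H, W) input."""
    assert patch3d.ndim == 3 and len(set(patch3d.shape)) == 1, "cubic patch expected"
    return np.stack([patch3d.min(axis=a) for a in range(3)])
```

The minimum projection is the natural choice here because hepatic lesions appear hypointense against the enhancing parenchyma on HBP images, so a dark voxel anywhere along the projection axis survives into the 2D image.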

Study design

This retrospective and observational study was approved by our institutional review board; prior informed consent from participants was waived. Patient records and information were anonymized and de-identified prior to analysis. This study was conducted to evaluate the ability of msAI software to assist in the detection of hepatic lesions on HBP images. Readers with different levels of experience were enrolled, and their performance with and without msAI software was compared in a setting similar to the clinical practice.

Study population

We collected 4,012 scans from patients who had undergone EOB-MRI studies between March 2018 and November 2023. The inclusion criteria included the presence of hepatic metastases, hemangiomas, or simple cysts. The following scans were excluded: (1) scans from patients with underlying diseases or treatment histories, such as chemotherapy and transcatheter arterial chemo-embolization that could potentially reduce the uptake of EOB by the liver parenchyma; (2) scans with notable noise or artifacts due to air in the stomach or body motion; (3) scans from patients harboring fewer than 2 or more than 11 lesions (because a small number of lesions renders their detection too easy, while too many lesions increases the burden on the readers); (4) scans from patients harboring hepatic lesions other than metastases, hemangiomas or simple cysts (other hepatic tumors, calcifications, postoperative scars, or undiagnosed lesions); (5) duplicate scans of the same patient. Finally, HBP images of 30 scans with 186 hepatic lesions were included in the observer performance study (Fig. 1).

Fig. 1

Flowchart of case enrollment

The ground truth (GT) for the hepatic lesions was determined by 2 board-certified radiologists (SM and YN, with 6 and 21 years of experience, respectively) who were excluded from the observer performance study. The GT was established with reference to all other EOB-MRI sequences, PET scans obtained with fluorine-18 fluorodeoxyglucose, and contrast-enhanced CT images. Pathologic findings obtained at definitive surgery were also considered. Patient and lesion details recorded as the GT are presented in Table 1.

Image acquisition

Scans were performed on a 3 T MRI instrument (TRILLIUM OVAL; FUJIFILM, Tokyo, Japan) using a 28-channel coil. EOB (25 μmol/kg, Primovist; Bayer Yakuhin, Osaka, Japan) was injected at 2.0 ml/s and flushed with 20 ml of saline using a power injector (Sonic Shot 50; Nemoto Kyorindo, Tokyo, Japan).

HBP imaging was started 20 min after the EOB injection. Images were obtained with a fat-saturated T1-weighted gradient-echo sequence with parallel imaging (rapid acquisition through a parallel imaging design; RAPID, FUJIFILM Corporation). The scan parameters were: slice thickness and interval, 3.0 mm; TR/TE, 4.0/1.8 ms; flip angle, 15°; field of view, 36 cm; and matrix, 320 × 224. Only the HBP images were used for the observer performance study. Although other sequences, including dynamic EOB-enhanced MRI scans, were obtained for the clinical studies, they were not evaluated.

Observer performance study

The number of readers required for the observer performance study was calculated using G*Power 3.1.9.7 [16, 17]. In the observer performance study, lesion-level sensitivity, i.e., the lesion localization fraction (LLF), was calculated. Although the LLF is derived from binary outcomes, the calculation of the required number of readers focused on the per-reader LLF, which was treated as a continuous variable. A preliminary analysis, conducted on the same dataset used in the observer performance study with readers who did not participate in the actual study, yielded an effect size of 1.15 for the difference in per-reader LLF with and without the msAI software. Based on this effect size, a significance level of 5%, and a power of 80%, the minimum required number of readers was calculated to be 9 using the Wilcoxon signed-rank test. Our study involved 14 readers: 3 board-certified radiologists with 5–7 years of experience, 9 radiology residents with less than 4 years of experience, and 2 general physicians.
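The sample-size reasoning can be reproduced approximately in code. This is a sketch, assuming the standard noncentral-t power formula for a paired (one-sample) t test; the division by an asymptotic relative efficiency (ARE, here 3/π for a normal parent) to approximate the Wilcoxon signed-rank requirement is an assumption about how G*Power handles that test, and the function names are hypothetical.

```python
import math
from scipy import stats

def paired_ttest_power(n, effect_size, alpha=0.05):
    """Two-sided power of a paired t test via the noncentral t distribution."""
    df = n - 1
    ncp = effect_size * math.sqrt(n)
    tcrit = stats.t.ppf(1 - alpha / 2, df)
    return (1 - stats.nct.cdf(tcrit, df, ncp)) + stats.nct.cdf(-tcrit, df, ncp)

def min_readers(effect_size=1.15, alpha=0.05, power=0.80, are=3 / math.pi):
    """Smallest n reaching the target power, inflated by 1/ARE to
    approximate the Wilcoxon signed-rank test."""
    n = 2
    while paired_ttest_power(n, effect_size, alpha) < power:
        n += 1
    return math.ceil(n / are)
```

With the study's effect size of 1.15, this yields a minimum on the order of 9 readers; the exact value depends on the ARE assumed for the Wilcoxon correction.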

All readers received standardized instructions, were instructed in the operation of the software, and were trained on 12 patients excluded from the observer performance study. They were informed only that images showing cysts, hemangiomas, or hepatic metastases were included. Post-training, each reader interpreted the HBP image data sets twice, once with and once without the msAI software in the concurrent-reader mode. In each session, they were instructed to detect and annotate all hepatic lesions and to rate their confidence level for each detected lesion on a scale from 0 to 100. They were also required to complete each interpretation session within 3 min. To minimize any memory bias, we imposed an interval of at least 2 weeks between the sessions.

We created a graphical user interface program for displaying Digital Imaging and Communications in Medicine images and ancillary information, lesion annotations, confidence level inputs, and lesion candidate displays. Lesion annotations recorded by the readers are shown in blue, and suspected lesions identified by the software in red (Fig. 2). The reading environment included one monitor for viewing and inputting ancillary information, one high-resolution monitor for viewing the HBP images (EIZO RadiForce RX240, 21.3 inch, 2 M color), and a personal computer that ran the dedicated applications (Intel(R) Xeon(R) W-2104 CPU @ 3.20 GHz, RAM 16 GB).

Fig. 2

The original graphical user interface program for the observer performance test. The hepatic lesion detected by the reader is circled in blue; a red square shows the lesion identified by the AI software

Data analysis

Annotations made by the readers were recorded as true positives (TP) when the distance between the center of the annotation marker and that of the GT was within 3/4 of the GT diameter and equal to or less than 3 mm. As the msAI software evaluates annotations in three-dimensional space, the z-axis was also taken into account; annotations placed on adjacent slices were therefore not considered false positives as long as the 3D Euclidean distance between the annotation and the GT remained within the defined threshold. The LLF was calculated without using the confidence score information: any detection that correctly localized a lesion, regardless of its confidence score, was counted as a TP. As a result, the LLF was calculated under conditions that inherently allow a relatively high number of false positives. This approach was chosen to assess lesion-level sensitivity comprehensively, without discarding low-confidence but correctly localized detections, as the primary goal of this study was to ensure that lesions were detected in the first place. The figure of merit (FOM) for detecting hepatic lesions was calculated using jackknife free-response receiver-operating characteristic (JAFROC) analysis [18, 19]. Because our evaluation focused on lesion-level detection performance with localization, conventional receiver-operating characteristic (ROC) analysis, which assesses classification performance on a per-case (e.g., per-patient) basis, was not suitable. Both free-response ROC (FROC) and JAFROC allow multiple lesion markings per image and evaluate localization accuracy; unlike FROC, which is primarily descriptive and visual (e.g., plotting sensitivity versus false positives), JAFROC enables statistical comparison between conditions (e.g., with vs. without the software) by computing an FOM using a jackknife resampling method.
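The matching rule and the resulting LLF can be sketched as follows, under the assumption that annotations and GT lesions are given as 3D coordinates in millimeters; the function names and data layout are hypothetical.

```python
import numpy as np

def is_true_positive(marker_mm, gt_center_mm, gt_diameter_mm):
    """TP when the 3D Euclidean distance to the GT center is within
    3/4 of the GT diameter and no more than 3 mm (the stated criterion)."""
    d = np.linalg.norm(np.asarray(marker_mm) - np.asarray(gt_center_mm))
    return d <= 0.75 * gt_diameter_mm and d <= 3.0

def lesion_localization_fraction(markers, gt_lesions):
    """LLF = localized GT lesions / all GT lesions.  Confidence scores
    are ignored: any marker that hits a lesion counts."""
    hit = [any(is_true_positive(m, center, diam) for m in markers)
           for center, diam in gt_lesions]
    return sum(hit) / len(gt_lesions)
```

Note that this computes lesion-level sensitivity only; the JAFROC FOM additionally penalizes false-positive markings and requires the jackknife resampling procedure of the cited references.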

The negative consultation ratio (NCR), the percentage of correct diagnoses that were turned into incorrect ones by the AI software, was calculated [20]. High NCR values indicate that the reader relied too heavily on the AI software. Negative consultations include 2 decision patterns: true negative (TN) without and false positive (FP) with the AI software, and TP without and false negative (FN) with the AI software. The NCR was calculated by dividing the number of negative consultations by the number of all patterns. TN both with and without the AI software was not included in the set of all patterns, since the analysis includes only findings that were detected either with or without the software (or both). Readers with an NCR lower than 10% were labeled as those who evaluated the results presented by the AI software correctly.
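The NCR computation can be sketched as follows, assuming per-finding decision pairs (without-AI, with-AI) coded as "TP"/"FP"/"TN"/"FN"; the function name and encoding are hypothetical.

```python
def negative_consultation_ratio(decision_pairs):
    """decision_pairs: list of (without_AI, with_AI) labels per finding.
    Negative consultations are TN->FP and TP->FN transitions.  TN->TN
    pairs are excluded from the denominator, since only findings detected
    in at least one session enter the analysis."""
    considered = [p for p in decision_pairs if p != ("TN", "TN")]
    negative = [p for p in considered if p in (("TN", "FP"), ("TP", "FN"))]
    return len(negative) / len(considered) if considered else 0.0
```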

Statistical analysis

Statistical analyses were performed using R version 4.5.1 (R Foundation for Statistical Computing, Vienna, Austria). The difference in LLF with and without the msAI software was tested with the paired t test, while the difference in FOM was evaluated using the Dorfman–Berbaum–Metz method; a p value < 0.05 was considered statistically significant. We also performed subset analyses based on the lesion size using a 6-mm (smallest patch size) threshold and on the NCR using a 10% threshold.
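The per-reader comparison can be sketched in code. A minimal example, assuming two arrays of per-reader LLF values (the data below are illustrative, not the study's results); the analysis in the paper was done in R, so this Python equivalent is only a sketch.

```python
from scipy import stats

def compare_llf(llf_without, llf_with, alpha=0.05):
    """Paired t test on per-reader LLF with vs. without the software."""
    t, p = stats.ttest_rel(llf_with, llf_without)
    return {"t": t, "p": p, "significant": p < alpha}
```

The FOM comparison is not reproduced here: the Dorfman–Berbaum–Metz method models reader and case variance components jointly and is implemented in dedicated multi-reader multi-case packages rather than a plain paired test.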
