Classification of fundus autofluorescence images based on macular function in retinitis pigmentosa using convolutional neural networks

Patient selection

This retrospective observational study was conducted at Nagoya University Hospital. All procedures conformed to the tenets of the Declaration of Helsinki and were approved by the Institutional Review Board/Ethics Committee of Nagoya University Hospital (approval number: 2023-0382). Instead of obtaining written informed consent, patients were given the opportunity to refuse participation through an opt-out procedure.

RP was diagnosed based on the ocular history, concentric contraction of the visual field, pathognomonic fundus features (e.g., pigmentary changes and attenuation of blood vessels), and reduced electroretinogram responses.

In this study, medical records of 326 patients with RP who visited Nagoya University Hospital between February 2007 and March 2023 were reviewed. Of these patients, those who underwent fundus color photography and FAF imaging with an ultra-wide field imaging device (Optos P200Tx; Optos) and visual field testing with the 10-2 program of the Humphrey Field Analyzer (HFA; Carl Zeiss Meditec, Inc.) were enrolled.

We excluded eyes with atypical RP, such as sectorial RP or nonpigmented RP, vitrectomized eyes, and eyes with macular holes.

None of the included eyes had severe epiretinal membranes. Patients were divided into two groups, severe and mild, according to the criteria described below. We also evaluated whether there were any differences between these two groups in terms of sex ratio, age at the last visit, phakia to pseudophakia ratio, and the ratio of eyes used in the analysis [both eyes (OU), right eye (OD), or left eye (OS)].

Severity classification

Based on previous reports showing that the HFA 10-2 program is useful in evaluating visual function in patients with RP [21,22,23], we assigned images with a mean deviation (MD) of > − 10 decibels (dB) on the HFA 10-2 program to the mild group and images with an MD of < − 20 dB to the severe group. Whenever an HFA 10-2 test had been performed within a year of the image acquisition date, that MD value was used to classify the image. If no HFA 10-2 test had been performed within a year, the image was classified only when the MD values obtained before and after the image acquisition date fell into the same group (both mild or both severe).
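A minimal sketch of this MD-based labeling rule is shown below; the function name and the handling of intermediate MD values (between − 20 and − 10 dB) as unassigned are our own reading of the thresholds, not code from the study.

```python
def classify_severity(md_db):
    """Assign a severity label from an HFA 10-2 mean deviation (MD) value in dB."""
    if md_db > -10:
        return "mild"
    if md_db < -20:
        return "severe"
    return None  # intermediate MD values (-20 to -10 dB) are assigned to neither group
```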

Images of both eyes (OU) were included only if both eyes were classified into the same group; when the two eyes were classified into different groups, only the right eye (OD) image was included.

Images taken on different dates that met the HFA 10-2 criteria were included even if they were from the same patient and eye; however, for patients who had undergone cataract surgery, only images taken after the surgery were included.

There was one case that had initially been determined to be mild but was later reclassified as severe; in this case, only images determined to be severe were analyzed.

Only the OD image was included in the analysis when the OD and OS had different lens conditions (e.g., phakia in OD and pseudophakia in OS).

Dataset

In both groups, 40 cases were randomly selected, and one image per case was randomly extracted as test data. Each image was cropped to retain the central half, which is critical for assessing central visual function; concentrating on the region most relevant to macular function should enhance the accuracy of the model. The extracted images were used as test data to confirm the performance of the CNN model. Images of the remaining cases were used as training data. The training images were cropped in the same way as the test images, retaining the central half, and were augmented 16-fold by rotation and flipping.
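As a sketch of this preprocessing, the snippet below crops the central half and generates a 16-fold augmentation; combining eight rotation angles with an optional horizontal flip is one way to reach 16 variants, but the exact angles and tooling used in the study are not specified, so the choices here (NumPy and SciPy's `rotate`) are our assumptions.

```python
import numpy as np
from scipy.ndimage import rotate  # assumed utility; the study's actual tooling is not stated


def crop_central_half(img):
    """Keep the central half of the image (half the height and half the width)."""
    h, w = img.shape[:2]
    return img[h // 4: h // 4 + h // 2, w // 4: w // 4 + w // 2]


def augment_16x(img):
    """Generate 16 variants by rotation and flipping.

    Eight rotation angles combined with an optional horizontal flip is one
    way to obtain the 16-fold augmentation described in the text; the exact
    scheme used in the study is an assumption here.
    """
    variants = []
    for angle in range(0, 360, 45):                        # 0, 45, ..., 315 degrees
        rotated = rotate(img, angle, reshape=False, mode="nearest")
        variants.append(rotated)
        variants.append(np.fliplr(rotated))                # horizontally flipped copy
    return variants
```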

Deep learning model

We created a CNN model to classify the fundus color or FAF images into the severe and mild groups. CNN models for FAF images and fundus color images were created separately. This study used the visual geometry group-16 (VGG16) architecture to create the CNN [24]. All images were resized to 228 × 228 pixels. Transfer learning was applied to VGG16 using initial weights obtained from training on the ImageNet dataset. The original output layer was replaced; several head configurations were examined, each ending with a dense layer for binary classification. Optimization was performed using the Adam optimizer with several learning rate settings. The loss function was categorical cross-entropy. The models were trained for up to 1,000 epochs with 32 images per step. Fivefold cross-validation on the training data was used to select the optimal CNN model. The model trained on the training images was then used to classify each test image as severe or mild (the group with a predicted probability of ≥ 0.5 was taken as the prediction). Under the same conditions, transfer learning with Xception [25], DenseNet201 [26], and MobileNet [27] was preliminarily evaluated to determine whether these CNN models outperformed VGG16 in this study.
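A minimal Keras sketch of this transfer-learning setup is shown below; the intermediate head layer, the frozen-versus-trainable choice for the backbone, and the learning rate are assumptions (several patterns were compared in the study), and the labels are assumed to be one-hot encoded to match categorical cross-entropy.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# VGG16 backbone initialized with ImageNet weights; the original classification head is removed.
base = VGG16(weights="imagenet", include_top=False, input_shape=(228, 228, 3))

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),   # assumed intermediate layer; several head patterns were tried
    layers.Dense(2, activation="softmax"),  # final dense layer for binary (severe/mild) classification
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # one of several learning rates evaluated
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Training for up to 1,000 epochs with 32 images per step (batch size 32);
# train_images and one-hot train_labels are placeholders for the prepared dataset.
# model.fit(train_images, train_labels, batch_size=32, epochs=1000, validation_data=...)
```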

All models were trained on a computer with Ubuntu 18.04, an Intel® Xeon Gold 6134 central processing unit, four Quadro P6000 GPUs, and 192 GB of system memory. The CNN models were developed with Python Keras (https://keras.io/ja/) using TensorFlow (https://www.tensorflow.org/) as the backend.

Heatmaps

To visualize the image regions on which the CNN model focused, heatmaps were created using Gradient-weighted Class Activation Mapping (Grad-CAM) [28] and overlaid on the corresponding fundus images. The target layer was the third convolutional layer of block 5.
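A compact Grad-CAM sketch targeting the third convolutional layer of block 5 (named `block5_conv3` in the Keras VGG16 implementation) is shown below; it follows the standard Grad-CAM formulation and assumes a model in which the VGG16 layers are accessible by name, so the details are illustrative rather than the study's exact code.

```python
import numpy as np
import tensorflow as tf


def grad_cam(model, image, class_index, layer_name="block5_conv3"):
    """Compute a Grad-CAM heatmap for one preprocessed image of shape (H, W, 3)."""
    # Sub-model that returns both the target convolutional feature maps and the predictions.
    grad_model = tf.keras.models.Model(
        model.inputs, [model.get_layer(layer_name).output, model.output]
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)          # d(class score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))          # global-average-pooled gradients per channel
    cam = tf.reduce_sum(weights[:, tf.newaxis, tf.newaxis, :] * conv_out, axis=-1)
    cam = tf.nn.relu(cam)[0]                              # keep positive contributions only
    cam = cam / (tf.reduce_max(cam) + 1e-8)               # normalize to [0, 1]
    return cam.numpy()                                    # upsample and overlay on the fundus image
```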

Metrics

To evaluate model performance, we calculated the accuracy and the area under the receiver operating characteristic curve (AUC) with its 95% confidence interval (CI); the receiver operating characteristic curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity). Accuracy is the ratio of the number of correctly predicted images to the total number of images. The 95% CI for the AUC was estimated by bootstrap resampling with 1,000 iterations.
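A sketch of the bootstrap estimate of the 95% CI for the AUC is shown below; scikit-learn's `roc_auc_score` is used here as an assumed AUC implementation, since the study's exact code is not stated.

```python
import numpy as np
from sklearn.metrics import roc_auc_score  # assumed AUC implementation


def bootstrap_auc_ci(y_true, y_score, n_boot=1000, seed=0):
    """Estimate a 95% CI for the AUC by resampling the test set with replacement."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:   # skip resamples containing only one class
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.percentile(aucs, [2.5, 97.5])
    return roc_auc_score(y_true, y_score), (lower, upper)
```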

Statistical analysis

The chi-square test was employed to assess differences in the male–female, phakia–IOL, and OU–OD–OS ratios between the severe and mild groups. The Mann–Whitney U test was used to compare the age at the last visit between the severe and mild groups. Python SciPy (https://scipy.org/) and Python Statsmodels (https://www.statsmodels.org/stable/index.html) were used for the statistical analyses.
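A minimal SciPy sketch of the two tests is shown below; the contingency table and age lists are placeholders, not data from the study.

```python
from scipy.stats import chi2_contingency, mannwhitneyu

# Placeholder data; the actual counts and ages come from the patient records.
sex_table = [[30, 25],    # severe group: male, female
             [28, 27]]    # mild group:   male, female
ages_severe = [52, 61, 47, 66, 58]
ages_mild = [44, 55, 49, 63, 51]

# Chi-square test for the male-female ratio between the two groups.
chi2, p_sex, dof, expected = chi2_contingency(sex_table)

# Mann-Whitney U test comparing age at the last visit between the two groups.
p_age = mannwhitneyu(ages_severe, ages_mild, alternative="two-sided").pvalue

print(f"sex ratio: p = {p_sex:.3f}; age: p = {p_age:.3f}")
```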
