In this retrospective study, we consecutively enrolled 104 patients (36–91 years old; 81 males; 23 females; GTV volume range: 1.361–164.1 cc; mean GTV volume: 17.01 cc) with peripheral early-stage non-small cell lung cancer who underwent radiotherapy at our institution from December 2017 to March 2025. This study was approved by the Institutional Ethics Committee of the University of Yamanashi (Approval No.: CS0010). The requirement for individual informed consent was waived because this was a retrospective study. Information about the study was disclosed on our institution's website, providing patients with the opportunity to opt out. To ensure patient privacy, the data were de-identified and assigned a unique study-specific number.
Planning CT images were acquired using an Aquilion LB (Canon Medical Systems, Japan) with a matrix size of 512 × 512, slice thickness of 2.0 mm, and pixel size of 1.074 × 1.074 mm. All planning CT images were obtained during inspiratory breath-hold using Abches (APEX Medical Inc., Japan). GTV contours were delineated by a radiation oncologist with over 7 years of experience using the commercial treatment planning system Pinnacle (Philips Radiation Oncology Systems, USA). Subsequently, all cases were reviewed and approved by at least three senior radiation oncologists (with over 10 years of experience) during departmental conferences.
All image processing, computation, and statistical analyses in this study were performed using Python (version 3.12). The segmentation models were implemented using the PyTorch framework (version 2.6.0).
Data preprocessing
In GTV auto-segmentation tasks, the tumor is typically very small relative to the entire CT image, which leads to a severe class imbalance between the tumor and the background [4, 5]. Additionally, ResNet-18, which we adopted as the encoder in this study, repeatedly halves the image size through downsampling. Setting the input size (in pixels) to a power of two therefore allows for stable computation, as it avoids remainders and the need for exception handling. For these reasons, all CT images and corresponding GTV binary masks were cropped to a 64 × 64 pixel (68.736 × 68.736 mm) region of interest (ROI) centered on the GTV centroid, as this was the smallest power-of-two size that could fully encompass the tumor while minimizing the inclusion of unnecessary background. All GTVs in the included cases were fully contained within this ROI, with axial maximum diameters ranging from 12.9 to 65.6 mm (mean: 27.2 mm).
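The exact preprocessing code is not described in the text; the following is a minimal sketch of the ROI cropping step, assuming each axial CT slice and its GTV mask are available as 2D NumPy arrays (the helper name crop_roi is hypothetical).

```python
import numpy as np

def crop_roi(ct_slice: np.ndarray, gtv_mask: np.ndarray, roi_size: int = 64):
    """Crop a roi_size x roi_size window centered on the GTV centroid.

    Hypothetical helper: assumes ct_slice and gtv_mask have the same shape
    and that the GTV fits entirely within the window.
    """
    rows, cols = np.nonzero(gtv_mask)
    cy, cx = int(rows.mean()), int(cols.mean())          # GTV centroid (pixel indices)
    half = roi_size // 2
    # Clamp the window so it stays inside the image bounds
    y0 = min(max(cy - half, 0), ct_slice.shape[0] - roi_size)
    x0 = min(max(cx - half, 0), ct_slice.shape[1] - roi_size)
    return (ct_slice[y0:y0 + roi_size, x0:x0 + roi_size],
            gtv_mask[y0:y0 + roi_size, x0:x0 + roi_size])
```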
Fractal image generation and FractalDB construction
The fractal images used in this study are mathematically generated synthetic images that mimic ubiquitous fractal structures found in nature. The dataset composed of these images is named FractalDB [18]. The procedure for generating fractal images is shown in Fig. 2. Although the fractal image generation algorithm has been detailed by Kataoka et al., we provide a brief overview below, as it is a core technology of this study.
Fig. 2Workflow of fractal image generation. A fractal image is generated by starting from an initial point at t = 0 and repeatedly applying randomly selected affine transformations until t = T
Iterated Function Systems (IFS) were used for image generation. An IFS is a family of functions composed of \(N\) affine transformations, defined in the following form:
$$\text{IFS}=\left\{X; w_1, w_2, \dots, w_N; p_1, p_2, \dots, p_N\right\}$$
Here, \(X\) is a complete metric space, \(w_i\) are affine transformations, and \(p_i\) is the probability that \(w_i\) is selected. Each affine transformation \(w_i\) is defined in the following form, where \(\theta_i=(a_i, b_i, c_i, d_i, e_i, f_i)\) represents the affine transformation parameters:
$$\mathbf{x}_{t+1}=w_i\left(\mathbf{x}_t;\theta_i\right)=\left[\begin{array}{cc}a_i & b_i\\ c_i & d_i\end{array}\right]\mathbf{x}_t+\left[\begin{array}{c}e_i\\ f_i\end{array}\right]$$
$$\mathbf{x}=\left[\begin{array}{c}x\\ y\end{array}\right]$$
By iteratively applying these transformations an arbitrary number of times, self-similar fractal patterns can be generated. For image generation, many points are rendered to form a 2D fractal image by starting from an initial point \((x_0, y_0)\) and repeatedly applying a transformation \(w_i\) selected according to its probability \(p_i\).
When the IFS configuration parameters (the number of affine transformations \(N\), the affine transformation parameters \(\theta_i\), and the probabilities \(p_i\) of selecting each affine transformation) are randomly determined, the resulting random parameter set \(\Theta\) can be expressed as follows:
$$\Theta =\left\{\left(\theta_i, p_i\right)\right\}_{i=1}^{N}$$
Kataoka et al. define this as a fractal category. Furthermore, by slightly varying these parameters, we can generate image variations (referred to as instances) within a fractal category.
FractalDB is a fractal image dataset composed of these fractal categories and the instances contained within them. In other words, a fractal category in FractalDB is analogous to a class in ImageNet, and the images within each class correspond to instances. Notably, while the fractal category is defined by the IFS configuration parameters \(\Theta\), FractalDB trains the model to estimate these parameters as a task. This approach allows for the automatic assignment of pseudo-labels to fractal images, even without explicit ground truth labels like cats or birds found in ImageNet.
This study utilized publicly available models pre-trained on the FractalDB dataset, which was constructed by Kataoka et al. According to the original publication, the parameters used to generate FractalDB were configured as follows: the IFS parameters \(a\) through \(f\), as well as \(p\), were set within the range of − 1.0 to 1.0, with the constraint that the sum of the \(p\) values equals 1. The number of applied affine transformations was 200,000 per image. Additionally, to increase the variation in fractal shapes, each parameter set was multiplied by a weight ranging from 0.8 to 1.2, in increments of 0.1 [18].
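As an illustration of the procedure described above, the following is a minimal chaos-game rendering sketch in NumPy for a single fractal instance. The parameter ranges and the 200,000 iterations follow the description above, but details such as the handling of negative selection weights, divergent parameter sets, and the point-to-pixel mapping are simplifying assumptions rather than the FractalDB implementation.

```python
import numpy as np

def render_fractal(theta, probs, n_points=200_000, size=256, rng=None):
    """Render one fractal instance by the chaos game.

    theta: array of shape (N, 6) with rows (a, b, c, d, e, f).
    probs: selection probabilities p_i (non-negative, summing to 1).
    """
    rng = rng if rng is not None else np.random.default_rng()
    pts = np.empty((n_points, 2))
    x = np.zeros(2)                                      # initial point (x0, y0)
    for t in range(n_points):
        a, b, c, d, e, f = theta[rng.choice(len(theta), p=probs)]
        # Clip to avoid overflow for divergent parameter sets
        # (FractalDB instead filters out such categories).
        x = np.clip(np.array([[a, b], [c, d]]) @ x + np.array([e, f]), -1e6, 1e6)
        pts[t] = x
    pts -= pts.min(axis=0)                               # normalize point cloud to [0, 1]
    pts /= pts.max(axis=0) + 1e-8
    img = np.zeros((size, size), dtype=np.uint8)
    ij = (pts * (size - 1)).astype(int)
    img[ij[:, 1], ij[:, 0]] = 255                        # plot points into the image
    return img

# Randomly sample one fractal category: a..f drawn from [-1, 1]; selection
# weights are made non-negative and normalized to sum to 1 (a simplification).
rng = np.random.default_rng(0)
N = int(rng.integers(2, 9))
theta = rng.uniform(-1.0, 1.0, size=(N, 6))
p = np.abs(rng.uniform(-1.0, 1.0, size=N))
p /= p.sum()
image = render_fractal(theta, p, rng=rng)
```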
Pre-trained model
We evaluated DeepLabV3+, a convolutional neural network (CNN)-based encoder-decoder model, and the Dense Prediction Transformer (DPT), a Transformer-based encoder-decoder model, for 2D lung cancer GTV auto-contouring. The ResNet-101 backbone of DeepLabV3+ was replaced with the more lightweight ResNet-18, because the deeper structure would excessively downsample the small 64 × 64 inputs, resulting in feature maps with insufficient spatial resolution [21]. DPT, which originally uses a ViT-Hybrid or ViT-Large backbone, was adapted to use the more efficient Data-efficient image Transformers Tiny (DeiT-Tiny) backbone to avoid overfitting on our limited number of cases [22]. These models were built using the segmentation_models_pytorch (v0.5.0) and pytorch-image-models (v1.0.15) libraries.
We investigated four distinct training strategies for the encoders (i.e., ResNet-18 or DeiT-Tiny) of both models: training from scratch (random initialization), pre-training on ImageNet-1K (ImageNet-1K_PT), pre-training on FractalDB-1K (FractalDB-1K_PT), and pre-training on FractalDB-10K (FractalDB-10K_PT). Due to computational resource constraints, we used publicly available pre-trained models for the latter three strategies. For the ImageNet-1K_PT model, we loaded the standard weights provided by the segmentation_models_pytorch library. For the FractalDB models, we used the versions publicly released by Kataoka et al. (https://github.com/hirokatsukataoka16/FractalDB-Pretrained-ResNet-PyTorch) and Nakashima et al. (https://github.com/nakashima-kodai/FractalDB-Pretrained-ViT-PyTorch).
Training from scratch and ImageNet-1K_PT are commonly used approaches in semantic segmentation tasks and serve as the baseline comparisons in this study. FractalDB-1K, containing 1 million images across the same number of classes (1,000) as ImageNet-1K (which has 1.28 million images), was tested to evaluate whether comparable or superior accuracy could be achieved. Additionally, we employed FractalDB-10K to leverage the inherent scalability of the fractal dataset. This dataset, containing 10 million images across 10 times as many classes (10,000) as FractalDB-1K, was tested to determine whether even higher accuracy could be achieved.
Unlike ImageNet, which provides ground-truth labels (e.g., cats, birds) for training models on a classification task, FractalDB lacks such annotations. FractalDB therefore formulates a pretext task in which the model is trained to estimate the fractal categories \(\Theta\) (introduced in the "Fractal image generation and FractalDB construction" section), using them as pseudo-labels. This approach, known as SSL, offers the considerable advantages of reducing manual labeling effort and mitigating potential human-induced biases [23, 24].
Transfer learning
The encoders used in this study, ResNet-18 and DeiT-Tiny, were pre-trained on image classification tasks using the ImageNet or FractalDB datasets. To adapt these models for lung cancer segmentation, architectural modifications are necessary. Specifically, the layers specialized for the classification task need to be removed, and a segmentation decoder needs to be connected.
When ResNet-18 is used as the backbone for DeepLabV3+, its final fully-connected layer, which serves as the classification head, is removed. The high-level feature maps from the encoder are fed into the Atrous Spatial Pyramid Pooling module of DeepLabV3+, while the spatially rich, low-level feature maps are fused with the decoder during the upsampling process [21].
For DeiT-Tiny, used as the backbone for the DPT, the classification head and the CLS token are discarded. The remaining patch tokens output by the encoder are transformed into multi-resolution 2D feature maps by DPT's Reassemble module. These feature maps are then progressively fused and upsampled by a subsequent convolutional decoder to generate the final segmentation mask [22].
Constructing these model architectures from scratch is a considerably complex and time-consuming process. To streamline development, this study leveraged the PyTorch deep learning framework, which supports these intricate operations. Specifically, high-level libraries such as segmentation-models-pytorch enable the rapid construction of transfer learning models simply by specifying the desired encoder and segmentation architecture. All models in this study were implemented using this high-level library, built upon the PyTorch framework.
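As a concrete illustration, the sketch below builds DeepLabV3+ with a ResNet-18 encoder via segmentation_models_pytorch; the encoder_weights argument switches between random and ImageNet-1K initialization. The FractalDB checkpoint filename and key layout shown here are assumptions, as the publicly released weights may require key remapping before loading.

```python
import torch
import segmentation_models_pytorch as smp

# DeepLabV3+ with a ResNet-18 encoder; encoder_weights="imagenet" gives the
# ImageNet-1K_PT initialization, encoder_weights=None gives random (scratch) init.
model = smp.DeepLabV3Plus(
    encoder_name="resnet18",
    encoder_weights="imagenet",   # or None for training from scratch
    in_channels=1,                # single-channel CT input
    classes=1,                    # binary GTV mask
)

# For the FractalDB strategies, the encoder weights are instead loaded from the
# publicly released checkpoints (the file name below is a placeholder).
state = torch.load("FractalDB-1000_resnet18.pth", map_location="cpu")
model.encoder.load_state_dict(state, strict=False)  # strict=False: ignore the old classification head
```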
Fine-tuning process
Figure 3 illustrates an overview of this study. We fine-tuned the models using a dataset of 104 patient cases, consisting of 2D CT images and their corresponding 2D GTV binary masks. Of these, 78 cases were allocated to the training dataset (63 for training and 15 for validation), and 26 cases were assigned to the test dataset. The validation set was used not for hyperparameter tuning but for selecting the optimal training epoch, specifically the one that yielded the lowest validation loss. Given the limited number of training samples, we employed fourfold cross-validation to appropriately evaluate the models' generalization performance (the detailed breakdown of the number of 2D slices for cross-validation is provided in Table 1). To prevent overfitting to the training dataset, we employed on-the-fly data augmentation using the Albumentations library. The specific set of transformations, detailed in Table 2, was based on the methods described in the official tutorials of segmentation-models-pytorch (https://github.com/qubvel-org/segmentation_models.pytorch) and the original Albumentations paper [25].
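As an illustration of such a pipeline, the sketch below shows an on-the-fly Albumentations setup; the specific transforms and parameters here are placeholders, not the settings actually used in the study (those are listed in Table 2).

```python
import numpy as np
import albumentations as A

# Illustrative on-the-fly augmentation pipeline (the actual transforms and
# parameters used in the study are those listed in Table 2).
train_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=15, p=0.5),
    A.RandomBrightnessContrast(p=0.3),
    A.GaussNoise(p=0.2),
])

ct_roi = np.random.rand(64, 64).astype(np.float32)   # placeholder 64 x 64 CT ROI
gtv_roi = np.zeros((64, 64), dtype=np.uint8)         # placeholder binary GTV mask

# Applied jointly to the CT slice and its GTV mask at each training iteration
augmented = train_transform(image=ct_roi, mask=gtv_roi)
ct_aug, mask_aug = augmented["image"], augmented["mask"]
```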
Fig. 3Schematic overview of the entire workflow in this study
Table 1 Data distribution for the fourfold cross-validation
Table 2 Data augmentation parameters and settings
We created a total of eight distinct deep learning (DL) models by combining the two base models with the four pre-training strategies. While DeepLabV3+ directly processed the 64 × 64 ROIs, DPT required the input to be resized to 224 × 224 to match its DeiT-Tiny encoder. For the DPT model's training, both CT and mask images were upsampled to 224 × 224 using nearest-neighbor interpolation, and the resulting predicted masks were downsampled back to the original 64 × 64 resolution using the same method for evaluation. For all DL models, we fine-tuned the entire model using the Adam optimizer with a batch size of 8 for up to 50 epochs. To avoid disrupting the pre-trained weights, we employed a learning-rate warm-up strategy: the learning rate started at 2e-5 for the first epoch, was gradually increased to the target rate of 1e-4 by the fifth epoch, and then remained constant. This approach was chosen because aggressive fine-tuning is known to degrade performance by disrupting the learned pre-trained parameters [26].
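A minimal sketch of this warm-up schedule is shown below, assuming the model from the earlier sketch; implementing it as a LambdaLR multiplier is our assumption, not necessarily the authors' implementation.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # target learning rate

def warmup_factor(epoch: int) -> float:
    """Linear warm-up from 2e-5 (first epoch) to 1e-4 (fifth epoch), then constant."""
    start, target, warmup_epochs = 2e-5, 1e-4, 5
    if epoch >= warmup_epochs - 1:
        return 1.0
    lr = start + (target - start) * epoch / (warmup_epochs - 1)
    return lr / target

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)

for epoch in range(50):
    # ... one pass over the training loader with batch size 8 ...
    scheduler.step()
```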
Quantitative evaluation of segmentation
We compared the GTV segmentations predicted by our models with the contours manually delineated by radiation oncologists. Given the strengths and weaknesses of the various contour similarity metrics, we used a combination of several metrics for evaluation in this study [27].
The volume Dice coefficient (\(\text{vDSC}\)) quantifies the volumetric overlap between two volumes, A and B, and is defined as:
$$\text{vDSC}=\frac{2\left|A\cap B\right|}{\left|A\right|+\left|B\right|}$$
The \(\text{vDSC}\) is a very simple metric: 1 indicates complete overlap and 0 indicates no overlap. Despite its simplicity, it has drawbacks, including a weak correlation with clinical validity and overestimation for larger volumes. Nevertheless, we adopted the \(\text{vDSC}\) because it is a widely used and frequently reported metric for quantifying region similarity [27].
The surface Dice coefficient (\(\text{sDSC}\)), a newer metric proposed by Nikolov et al., addresses the shortcomings of the \(\text{vDSC}\) by evaluating contour agreement rather than volumetric overlap [28]. For two contours, their surface point clouds, \(S_A\) and \(S_B\), are extracted from the respective 3D volumes. An allowed tolerance \(\tau\) (mm) is then introduced, and the subsets of points that match within \(\tau\) are defined as \(S_A^{\tau}\) and \(S_B^{\tau}\), respectively. The \(\text{sDSC}\) is then defined as:
$$\text{sDSC}\left(\tau \right)=\frac{\left|S_A^{\tau}\right|+\left|S_B^{\tau}\right|}{\left|S_A\right|+\left|S_B\right|}$$
Because the \(\text{sDSC}\) focuses on the contours, it is particularly useful in tasks such as radiation therapy, where peripheral accuracy is crucial. In this study, we used a commonly accepted tolerance value of \(\tau = 1\ \text{mm}\) [29, 30].
The Hausdorff distance (\(\text{HD}\)) measures the maximum of the shortest distances from any point on one contour to the other contour. The \(\text{HD}\) is known to have a low correlation with the \(\text{vDSC}\) and can thus complement its shortcomings [31]. However, because it relies on a maximum value, it is sensitive to outliers; we therefore adopted the 95th percentile Hausdorff distance (\(\text{HD95}\)), which is commonly used to mitigate this drawback [27].
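The paper does not specify the metric implementation used; the sketch below computes the three metrics for a single case with DeepMind's surface-distance package, assuming 3D boolean masks and the voxel spacing reported above.

```python
import numpy as np
from surface_distance import (compute_surface_distances,
                              compute_surface_dice_at_tolerance,
                              compute_robust_hausdorff)

def evaluate_case(gt: np.ndarray, pred: np.ndarray, spacing_mm=(2.0, 1.074, 1.074)):
    """Return vDSC, sDSC at 1 mm tolerance, and HD95 for one 3D case.

    gt / pred: boolean 3D masks; spacing_mm: (slice thickness, row, column) in mm.
    """
    vdsc = 2.0 * np.logical_and(gt, pred).sum() / (gt.sum() + pred.sum())
    dists = compute_surface_distances(gt, pred, spacing_mm)
    sdsc = compute_surface_dice_at_tolerance(dists, tolerance_mm=1.0)
    hd95 = compute_robust_hausdorff(dists, percent=95)
    return vdsc, sdsc, hd95
```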
To evaluate the differences in segmentation accuracy among the pre-training strategies, we used the contours manually delineated by radiation oncologists as the ground truth and assessed the predictions using the \(\text{vDSC}\), \(\text{sDSC}\), and \(\text{HD95}\). For statistical comparison among groups, we employed the Friedman test, which is suitable for non-parametric repeated-measures data. If a significant difference was observed with the Friedman test, we performed pairwise comparisons using the Wilcoxon signed-rank test to clarify differences between individual groups, with Holm's method applied for multiple-comparison correction. Statistical analysis was conducted using Python (scipy and statsmodels libraries), with a significance level of p < 0.05.
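A sketch of this statistical workflow with scipy and statsmodels is shown below; the data layout and the placeholder values are assumptions, not the study's data.

```python
from itertools import combinations
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

# Placeholder layout: one metric value (e.g., vDSC) per test case and strategy,
# with cases aligned across strategies.
rng = np.random.default_rng(0)
scores = {name: rng.uniform(0.6, 0.9, size=26)
          for name in ("Scratch", "ImageNet-1K_PT", "FractalDB-1K_PT", "FractalDB-10K_PT")}

stat, p_friedman = friedmanchisquare(*scores.values())

if p_friedman < 0.05:
    pairs = list(combinations(scores.keys(), 2))
    p_raw = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]
    reject, p_holm, _, _ = multipletests(p_raw, alpha=0.05, method="holm")
    for (a, b), p in zip(pairs, p_holm):
        print(f"{a} vs {b}: Holm-adjusted p = {p:.4f}")
```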
Subgroup analysis
Our hypothesis posits that pre-training with fractal images enables models to more accurately segment complex shapes exhibiting fractal structures, such as lung cancer tumors. Specifically, we anticipated that FractalDB-1K_PT would handle complex shapes better than ImageNet-1K_PT.
To test this hypothesis, we conducted a subgroup analysis focusing on tumor shape complexity to compare how the segmentation accuracy of each model changes with complexity. We divided the 104 cases into two groups based on a shape-complexity index: the lower 52 cases (simple group) and the upper 52 cases (complex group). We used the surface-to-volume ratio (\(\text{SVR}\)) as the complexity metric. The \(\text{SVR}\), calculated by dividing a GTV's surface area by its volume, is higher for intricate, non-spherical shapes and lower for simple, sphere-like shapes. The resulting \(\text{SVR}\) was 0.325 ± 0.058 for the simple group and 0.502 ± 0.073 for the complex group. For this comparison, we did not retrain the models on each group; instead, we used the inference results from the fourfold cross-validation performed on all 104 cases, as described in the "Fine-tuning process" section.
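The text does not state how the SVR was computed; one possible sketch, estimating the surface area via marching cubes in scikit-image from a 3D GTV mask, is shown below.

```python
import numpy as np
from skimage import measure

def surface_to_volume_ratio(gtv_mask: np.ndarray, spacing_mm=(2.0, 1.074, 1.074)) -> float:
    """One possible SVR estimate: mesh surface area (mm^2) / voxel volume (mm^3)."""
    verts, faces, _, _ = measure.marching_cubes(gtv_mask.astype(np.uint8), level=0.5,
                                                spacing=spacing_mm)
    surface_area = measure.mesh_surface_area(verts, faces)        # mm^2
    volume = gtv_mask.sum() * np.prod(spacing_mm)                 # mm^3
    return surface_area / volume
```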