A temporal enhanced semi-supervised training framework for needle segmentation in 3D ultrasound images

Prostate cancer and kidney cancer are highly prevalent in men around the world and pose a serious threat to their health. The gold standard for diagnosing these cancers is biopsy. Accurate biopsy navigation can increase the sampling accuracy of the biopsy and reduce trauma to patients (Van Der Aa et al 2019). Real-time 3D ultrasound has attracted much attention in the field of surgical navigation (Fenster et al 2004, Abul et al 2007, Fiard et al 2013, Marien et al 2017). Automatic segmentation of the biopsy needle in 3D ultrasound images is the key technology for accurate intraoperative navigation of the biopsy. Many researchers have proposed a variety of traditional needle detection algorithms, including the Hough transform (HT) (Ding and Fenster 2003), parallel integral projection (PIP) (Barva et al 2008), random sample consensus (RANSAC) (Uhercik et al 2010), phase grouping (Qiu et al 2014), and a Kalman filter (KF) based algorithm (Yan et al 2021). Because of the blurring and discontinuity of the biopsy needle and unwanted interference with a needle-like appearance in ultrasound images, most of the traditional algorithms either fail to achieve satisfactory segmentation accuracy or suffer from low detection speed.

To solve the problems with the conventional segmentation techniques, many deep learning (DL) models have been proposed (Kamnitsas et al 2017, Zhao et al 2020, Chen et al 2023, Shi et al 2023). Among these models, the convolutional neural network (CNN) (Ronneberger et al 2015, Lian et al 2018) and the transformer (Vaswani et al 2017, Tao et al 2022) have become two popular choices for segmentation tasks. In recent years, the CNN has been combined with the transformer to further improve the accuracy of segmentation networks (Wang et al 2021a, Zhang et al 2021, Hatamizadeh et al 2022). Because the transformer module involves complicated computation and consumes considerable storage resources, hybrid CNN-transformer models mostly rely on the encoder-decoder structure, which can eliminate redundant information through encoding and thus lower the model complexity. However, models with static image inputs are still not accurate enough for needle detection owing to the low ultrasound image quality and interference with a needle-like appearance.

Recently, the utilization of temporal information has been found to be effective for improving medical image segmentation accuracy (Van De Leemput et al 2019, Sun et al 2020, Tian et al 2020, Zheng et al 2022). When the biopsy needle moves, the temporal information provides a reference for its relative movement and shape, which greatly reduces the difficulty of detection. Thanks to its ability to learn global feature correlations, the transformer is well suited to processing multi-frame images, and many researchers have applied it to video segmentation tasks (Oh et al 2019, Hwang et al 2021, Wang et al 2021b, Yang et al 2021, Yang and Yang 2022). However, the existing work mainly deals with 2D natural image sequences, and very little research has been done on the segmentation of 3D ultrasound volume sequences.

Another issue with medical image segmentation is the high expense of data annotation. Therefore, numerous attempts have been made at semi-supervised image segmentation, in which only a small fraction of the images in the dataset are annotated while the segmentation accuracy remains close to that achieved with fully labeled data. One popular semi-supervised segmentation strategy is consistency learning (Tarvainen and Valpola 2017, Ke et al 2020, Chen et al 2021, Zou et al 2021, Wu et al 2022), which encourages the model to produce similar outputs when the sample or the parameters are slightly disturbed. This forces the output features of similar samples to move closer together while those of different categories move farther apart; only in this way can the disturbed sample features stay within the scope of their original category, so that the unlabeled samples indirectly enhance the model's performance. The existing consistency learning schemes are mainly realized by setting up parallel networks or by transforming the input data. The former consumes a lot of additional computing resources, and the latter cannot deliver sufficient segmentation performance.

To achieve fast and accurate biopsy needle segmentation in real-time 3D ultrasound image sequences, this paper proposes a 4D motion information based temporal enhanced semi-supervised training framework for segmentation models. Firstly, the encoder of the segmentation model extracts the features of each volume. Then, a circle transformer module based on static and dynamic features is designed to extract and combine the temporal information between different volumes. Finally, the segmentation result of each volume is generated by the decoder of the segmentation model. In our circle transformer module, we analyze the characteristics of self-attention and optimize its calculation under the condition of double inputs. Two attention calculations with exchanged query features are proposed to generate the output features as a weighted sum of the current volume instead of the adjacent volumes. The accurate correspondence at the feature level simplifies the optimization process of our model and leads to higher segmentation accuracy. The proposed framework is suitable for any encoder-decoder segmentation model and can significantly increase its segmentation accuracy. Unlike other models that use temporal information, our method removes the dependence of the model on the temporal framework by simultaneously constraining the segmentation results before and after the temporal information is combined. Therefore, inference only needs the segmentation model, which amounts to a cost-free optimization: it is not necessary to use an image sequence as input during inference, so the application mode is more flexible. Besides, the proposed framework can also be used for semi-supervised segmentation by calculating a consistency loss between the output probability maps of the unlabeled data before and after the temporal information is combined, which is more conducive to improving the model performance and does not require additional memory. Experiments on three ultrasound datasets acquired during biopsies of beagles show that our method outperforms several 4D segmentation networks and semi-supervised segmentation methods in terms of needle detection efficiency and accuracy.

The main contributions of this work can be summarized as follows:

(1) A novel temporal enhanced semi-supervised training framework based on 4D motion information is proposed for needle segmentation, which can attenuate the influence of the low ultrasound image resolution and needle-like interference on the segmentation result.

(2) The segmentation results produced after the temporal module on the unlabeled data are used as pseudo-labels to constrain those produced before the temporal module, thereby enabling effective semi-supervised segmentation.

(3) The circle transformer is designed between the encoder and decoder to fully utilize the temporal information for representing the motion of needle, thereby improving the accuracy of needle segmentation.

(4) Extensive experiments are done to analyze the influence of network structure and hyperparameters on segmentation accuracy of the proposed method and demonstrate its effectiveness.

2.1. The encoder-decoder structure based segmentation model

The segmentation model based on the encoder-decoder structure is mostly derived from U-Net (Ronneberger et al 2015). Its encoder includes multiple convolution layers for feature extraction and pooling layers for resolution reduction, and its decoder includes multiple convolution layers and interpolation operations for restoring the image resolution. To make up for the information lost during encoding, skip connections combine the multi-scale encoded features with the corresponding decoded ones. Based on U-Net, researchers have proposed many variants using the attention mechanism (Lian et al 2018, Lv et al 2020, Zhao et al 2022), residual modules (Xiao et al 2018), multi-scale supervision (Huang et al 2020) and so on. In encoder-decoder models combining the transformer and CNN, the transformer can be used directly to encode images, replacing the CNN (Hatamizadeh et al 2022) or connected with the CNN in parallel (Zhang et al 2021). The TransBTS (Wang et al 2021a) places a multi-layer transformer module between the encoder and decoder, where the transformer can also be considered part of the encoder in order to unify the structure.

2.2. Temporal information based learning method

The segmentation model based on CNN can learn temporal information through channel expansion (Van De Leemput et al 2019, Sun et al 2020), the long short-term memory (LSTM) module (Zheng et al 2022) and the attention module (Tian et al 2020). Channel expansion means that the input image has multiple channels, with different channels representing different moments of a sequence; the network also has multiple output channels, which represent the category prediction probabilities of the target frame. The LSTM module has been transferred from natural language processing to image processing, leading to the convolution-based Conv-LSTM. Zheng et al (2022) proposed using a CNN to extract spatial features with the same resolution as the original image but multiple channels, then using the Conv-LSTM to extract the temporal information and finally fusing the spatio-temporal features to produce the probability map of the target frame. The triple attention network (TriANet) (Tian et al 2020) fuses the features of the temporal series by employing channel, spatial and temporal attention modules between the CNN-based encoder and decoder.

Many video segmentation methods are based on the transformer because its structure is suitable for the fusion of sequence features. This kind of model generally uses a CNN-based backbone to extract the initial image features for each input image of the sequence and adds the position encoding. Then, the features of different frames are linked by global correlation calculations when they are updated. The video instance segmentation framework based on transformer (VisTR) (Wang et al 2021b) feeds the multi-frame image features into the transformer encoder module to fuse and update all features in the sequence, modeling the entire temporal and spatial feature space. In the decoder module of VisTR, which is similar to that in the detection transformer (DETR) (Carion et al 2020), the instance queries and the dense input feature sequence are combined by the attention operation to decode each instance feature. Unlike VisTR, the inter-frame communication (IFC) transformer (Hwang et al 2021) divides the features of each frame encoded by the transformer into frame tokens and memory tokens. Only the memory tokens are sent to the transformer encoder for inter-frame information transmission. Finally, the memory tokens are restored and combined with their corresponding frame tokens as the inputs of the decoding module. The space-time memory (STM) network (Oh et al 2019) divides the image sequence into past and current frames, and aggregates the instance information in the past frames into the current frame through a unique STM reading operation; the weight calculation involved in the aggregation is not self-attention but attention between the past and current frames. The associating objects with transformers (AOT) method (Yang et al 2021) adopts a multi-level information embedding and output model based on transformer blocks, which is similar to an LSTM network: the encoded features of each frame are combined with memory features generated from the image features of all previous frames through the multi-head attention module, and the segmentation results are decoded and output step by step.

2.3. Consistency learning method

The construction of a consistency learning framework is mainly achieved by setting different inputs or using different model parameters. Mean-teacher (Tarvainen and Valpola 2017) adds different Gaussian noise to the input image to obtain two different inputs for the student and teacher models, and uses the exponential moving average of the student model's parameters as the parameters of the teacher model. The PseudoSeg (Zou et al 2021) applies weak and strong augmentations to the input image and feeds both augmented images into the same network; the output for the weakly augmented image is used to produce pseudo labels that supervise the output for the strongly augmented image. In the cross probability consistency (CPC) method, a simplified version of the guided collaborative training (Ke et al 2020) framework, an image is input into two different networks, and the prediction probabilities of the two networks are constrained to be similar. The cross pseudo supervision (CPS) (Chen et al 2021) method combines the generation of pseudo labels with the idea of CPC: for the two different outputs, CPS derives pseudo labels through the argmax operation so that the outputs supervise each other, and uses the cross entropy loss as the constraint. Generally, the CPC and CPS networks have the same structure but different initializations. MC-Net+ (Wu et al 2022) includes a shared encoder and multiple slightly different decoders, and uses a mutual consistency constraint: the probability output of each decoder is supervised by the soft pseudo labels of the other decoders to provide stricter consistency constraints.

3.1. Network structure

Our method is based on a DL model with the encoder-decoder structure. Figure 1 shows the corresponding overall structure. Firstly, all volumes are processed by an encoder with shared parameters to obtain the encoded features, from which the circle transformer module generates the pre-decoded features of each volume by combining the temporal information. Then, the pre-decoded features are sent into the decoder with shared parameters to generate the segmentation result of each volume. If a skip connection exists between the encoder and decoder, its features are directly input into the decoder of the corresponding volume. The two inputs of the circle transformer module are denoted as static features and dynamic features, respectively, and the pre-decoded features generated for different volumes correspond to different input groupings. The static features and dynamic features are the encoded features of the current volume and of the other volumes, respectively. Let $F_e=\{f_e^1, f_e^2, \ldots, f_e^n\}$ and $F_p=\{f_p^1, f_p^2, \ldots, f_p^n\}$ represent the encoded and pre-decoded features of the n image volumes, respectively, and let CircleT denote the circle transformer module. The calculation of the pre-decoded features can be expressed as:

$f_p^i = \mathrm{CircleT}\big(f_e^i,\; F_e \setminus \{f_e^i\}\big), \quad i = 1, 2, \ldots, n$ (1)
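To make the data flow concrete, the following minimal PyTorch-style sketch mirrors figure 1 and equation (1). `Encoder`, `CircleTransformer` and `Decoder` are hypothetical stand-ins for the actual modules, skip connections are omitted, and the encoded features are assumed to be flattened into token form.

```python
import torch
import torch.nn as nn

class TemporalSegFramework(nn.Module):
    """Sketch of the training-time forward pass (not the exact implementation)."""

    def __init__(self, encoder: nn.Module, circle_t: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder    # shared by all volumes in the sequence
        self.circle_t = circle_t  # fuses static and dynamic features, equation (1)
        self.decoder = decoder    # shared by all volumes in the sequence

    def forward(self, volumes):
        # volumes: list of n tensors, each of shape (B, 1, D, H, W)
        encoded = [self.encoder(v) for v in volumes]  # f_e^1, ..., f_e^n as (B, N, C)
        out_static, out_temporal = [], []
        for i, f_s in enumerate(encoded):
            # dynamic features: encoded features of all other volumes
            f_d = torch.cat([f for j, f in enumerate(encoded) if j != i], dim=1)
            f_p = self.circle_t(f_s, f_d)             # pre-decoded features, equation (1)
            out_static.append(self.decoder(f_s))      # result before temporal fusion
            out_temporal.append(self.decoder(f_p))    # result after temporal fusion
        return out_static, out_temporal
```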

Figure 1. The proposed training framework for biopsy needle segmentation in 3D ultrasound volume sequences. The encoder and the decoder each share their parameters across volumes. The pre-decoded features of each volume are generated by the circle transformer module according to the different grouping methods of the input features.


To attenuate the dependence of the trained model on the temporal module, our method uses the same loss function to simultaneously constrain the segmentation results obtained from the encoded and pre-decoded features. This strategy forces our model to produce similar outputs before and after the temporal information is combined, thereby effectively improving the segmentation accuracy when the encoder-decoder model is applied alone. The advantage of this method lies in the introduction of a more effective temporal information learning module and the improvement of encoder-decoder segmentation models, achieved by using the temporal information without additional inference cost. To realize semi-supervised model training, a new consistency learning strategy is proposed for constraining the unlabeled data. Specifically, we use the probability maps generated from the pre-decoded features to constrain the ones generated from the encoded features. The total loss function is defined as:

$L_{total} = L_{seg}\big(D(f_e), \mathrm{label}\big) + L_{seg}\big(D(f_p), \mathrm{label}\big) + L_{cons}\big(D(f_e), D(f_p)\big)$ (2)

where Lcons and Lseg represent the consistency loss for the unlabeled data and the segmentation loss for the labeled data, respectively; D( · ) denotes the calculation process of the decoder; and label denotes the segmentation ground truth of the labeled data.
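As a sketch of how equation (2) could be assembled during training (the function and variable names here are illustrative, and detaching the pseudo-label branch is an assumption not stated explicitly above):

```python
def total_loss(prob_e_lab, prob_p_lab, label,
               prob_e_unlab, prob_p_unlab,
               seg_loss, cons_loss):
    # labeled data: the same segmentation loss constrains the outputs obtained
    # before (prob_e_lab) and after (prob_p_lab) the temporal information is combined
    l_seg = seg_loss(prob_e_lab, label) + seg_loss(prob_p_lab, label)
    # unlabeled data: the output after the temporal module serves as the pseudo-label
    # for the output before it; stopping its gradient here is an assumption
    l_cons = cons_loss(prob_e_unlab, prob_p_unlab.detach())
    return l_seg + l_cons
```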

3.2. Circle transformer

In our method, each volume is encoded by the encoder. For the segmentation of a given volume, these encoded features are divided into the static features fs and the dynamic features fd. The circle transformer module generates the pre-decoded features by fusing the two groups of features based on self-attention calculations.

Let fa represent the output features of the self-attention calculation, and fm represent an intermediate variable. We consider the case of fs ∈ R^{N×C} and fd ∈ R^{T·N×C}, where N and C denote the number and length of the feature vectors, respectively, and T is the number of volumes. The self-attention operation requires that the key and value have the same size and that the output has the same size as the query. Therefore, the common practice of fusing temporal features based on self-attention is to treat fd as the key and value and fs as the query to obtain the output feature fa ∈ R^{N×C}. The calculation process is as follows:

$f_a = \mathrm{Softmax}\!\left(\dfrac{(f_s W_q)(f_d W_k)^{\mathrm T}}{\sqrt{C}}\right)(f_d W_v)$ (3)

where Wq, Wk and Wv denote the query, key and value projection matrices.

Equation (3) shows that fa is a weighted sum of fd in the traditional self-attention calculation. We assume that the best feature-level correspondence is obtained when fa is a weighted sum of fs, and that non-corresponding features have a negative impact on segmentation accuracy. The STM reading module proposed in STM (Oh et al 2019) concatenates fa and fs along the channel dimension to maintain the integrity of information, but it still fails to solve the problem of feature mismatch.
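A single-head sketch of the conventional fusion in equation (3) makes this mismatch explicit: because fd plays the role of value, every output row is a weighted sum of (projected) fd rather than of fs. The projection matrices are assumed single-head for brevity.

```python
import torch
import torch.nn.functional as F

def cross_attention(f_q, f_kv, w_q, w_k, w_v):
    # f_q: query features, f_kv: key/value features; all projections single-head
    q, k, v = f_q @ w_q, f_kv @ w_k, f_kv @ w_v
    attn = F.softmax(q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5), dim=-1)
    return attn @ v  # each output row is a weighted sum of the value features

# equation (3): f_s (N, C) as query, f_d (T*N, C) as key/value
# f_a = cross_attention(f_s, f_d, w_q, w_k, w_v)   # (N, C), a weighted sum of f_d
```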

In this paper, the circle transformer module is presented to realize the weighted sum of fs through two self-attention calculations with exchanged query features. Figure 2 shows two schemes for the circle transformer. In the first scheme, the dynamic features are preprocessed by the first self-attention so that their size matches that of the static features, providing the basis for the second self-attention. Specifically, the first self-attention still treats fd as the key and value and fs as the query, and the intermediate variable is obtained as a weighted sum of fd. The second self-attention treats fs as the key and value and the intermediate variable as the query. Accordingly, the module produces its output as a weighted sum of fs. The calculation is expressed as:

$f_m = \mathrm{Softmax}\!\left(\dfrac{(f_s W_{q1})(f_d W_{k1})^{\mathrm T}}{\sqrt{C}}\right)(f_d W_{v1})$ (4)

$f_a = \mathrm{Softmax}\!\left(\dfrac{(f_m W_{q2})(f_s W_{k2})^{\mathrm T}}{\sqrt{C}}\right)(f_s W_{v2})$ (5)

Figure 2. Two different designs for the circle transformer. (a) and (b) correspond to the described first and second strategies, respectively.


The second scheme treats fs as the key and value and fd as the query in the first self-attention, yielding an intermediate variable of the same size as fd as a weighted sum of fs. The second self-attention treats this intermediate variable as the key and value and fs as the query. The final output is therefore a twice-weighted sum of fs, which is exactly the weighted summation of fs that we expect, and the output size is kept the same as that of the static features. The calculation process can be expressed as:

$f_m = \mathrm{Softmax}\!\left(\dfrac{(f_d W_{q1})(f_s W_{k1})^{\mathrm T}}{\sqrt{C}}\right)(f_s W_{v1})$ (6)

$f_a = \mathrm{Softmax}\!\left(\dfrac{(f_s W_{q2})(f_m W_{k2})^{\mathrm T}}{\sqrt{C}}\right)(f_m W_{v2})$ (7)
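Reusing the single-head `cross_attention` helper sketched above, the second scheme (equations (6) and (7)) can be written as follows; the feedforward and normalization layers described next are omitted, and the parameter names are illustrative.

```python
def circle_attention(f_s, f_d, p):
    # equation (6): f_d as query, f_s as key/value -> f_m of shape (T*N, C),
    # a weighted sum of the static features
    f_m = cross_attention(f_d, f_s, p["w_q1"], p["w_k1"], p["w_v1"])
    # equation (7): f_s as query, f_m as key/value -> f_a of shape (N, C),
    # a twice-weighted sum of the static features
    f_a = cross_attention(f_s, f_m, p["w_q2"], p["w_k2"], p["w_v2"])
    return f_a
```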

In this paper, the second scheme has been chosen because its intermediate variable contains more elements (it has the size of fd rather than fs), so the module can better fit the correlations in the temporal information. Similar to the traditional transformer, the circle transformer adds a feedforward neural network (FFN) after the attention structure. Both the self-attention and the FFN are followed by layer normalization (LN) and a residual connection from their inputs. The final output feature fo of the circle transformer module is calculated as:

$f_o = \mathrm{LN}\big(\mathrm{FFN}(\hat{f}_a) + \hat{f}_a\big), \quad \hat{f}_a = \mathrm{LN}(f_a + f_s)$ (8)

3.3. Encoder-decoder model

In the proposed method, the encoder and decoder are constructed from the short-term and long-term memory self-attention model (SLM-SA) (Wen et al 2023) proposed in our previous research. The SLM-SA is designed based on the CNN and an improved transformer module; it achieves better segmentation accuracy than U-Net models while occupying less memory, which makes it suitable for 4D segmentation with a large data volume. Figure 3 illustrates the detailed structure of the model.

Figure 3. The structure of SLM-SA segmentation model. (a) The overall model framework. (b) The SM-Block and LM-Block. (c) The improved MSA module including reattention mechanism and learnable weights for output features.


The encoder of the SLM-SA model can be divided into three stages. Firstly, image features are extracted sequentially through multiple CNN-based convolutional modules. Then, the short-term memory block (SM-Block) extracts the useful information and optimizes the adjacent convolution process based on the attention mechanism. Finally, the long-term memory block (LM-Block) combines all SM-Block outputs and generates the most effective encoded result. In the encoder, the transformer-based SM-Block and LM-Block are the key components; they include an improved multi-head self-attention (MSA) module, which performs a re-attention operation on the attention weight matrix realized by convolution and assigns learnable weights to each channel of the module's output features.
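The following is a highly schematic interpretation of the improved MSA module based only on the description above (re-attention on the weight matrix via convolution, plus learnable per-channel output weights); it is not the exact SLM-SA implementation, and the head count is a placeholder.

```python
import torch
import torch.nn as nn

class ImprovedMSA(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3)
        # re-attention: a convolution applied to the attention weight maps
        self.reattn = nn.Conv2d(heads, heads, kernel_size=3, padding=1)
        self.proj = nn.Linear(dim, dim)
        # learnable weight for each channel of the output features
        self.channel_weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):                                   # x: (B, N, C)
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (B, N, self.heads, C // self.heads)
        q, k, v = (t.view(*shape).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-1, -2)) * (C // self.heads) ** -0.5
        attn = self.reattn(attn.softmax(dim=-1))            # (B, heads, N, N)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out) * self.channel_weight
```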

The decoder of the SLM-SA uses the convolution and deconvolution functions to gradually restore image resolution and reduce channels, thereby generating the probability map of pixel-wise class prediction.

The detailed design of the convolution modules in the SLM-SA model is shown in table 1. Here, the Conv-Block is a residual module, BN means batch normalization and ReLU means rectified linear unit. Convk_s_c and ConvTk_s_c denote the convolution layer and transposed convolution layer, respectively, and they both have a k × k × k kernel, a stride of s and c output channels.
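For concreteness, the Conv-Block column of table 1 can be read as a 3D residual unit along the following lines; whether the shortcut adds the block input or the output of row (1) is ambiguous in the table, so the standard residual form (adding the block input) is assumed here.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Sketch of rows (1)-(3) of the Conv-Block column in table 1."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, stride=1, padding=1),  # Conv3_1_32
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, stride=1, padding=1),  # Conv3_1_32
            nn.BatchNorm3d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # "Add", then ReLU
```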

Table 1. The detailed design of convolution module in SLM-SA model.

No. | Stem                 | Conv-Block           | Reconstruction
(1) | Conv3_2_32, BN, ReLU | Conv3_1_32, BN, ReLU | BN, ReLU, Conv3_1_32
(2) | Conv3_1_32, BN, ReLU | Conv3_1_32, BN       | BN, ReLU, ConvT3_2_32
(3) | Conv3_2_32, BN, ReLU | Add (1), ReLU        | BN, ReLU, Conv3_1_32
(4) | Conv3_1_32, BN, ReLU | Conv3_1_32, BN, ReLU | BN, ReLU, ConvT3_2_32
(5) | —                    | Conv3_1_32, BN       | BN, ReLU, Conv3_1_16
(6) | —                    | Add (4), ReLU        | BN, ReLU, Conv3_1_2

3.4. Loss function

As shown in equation (2), the loss function includes a segmentation loss and a consistency loss, i.e. Ltotal = Lseg + Lcons. The segmentation loss of our method combines LDSC (Milletari et al 2016) and LCE, which are related to the Dice similarity coefficient (DSC) and the cross entropy, respectively. The DSC measures the similarity between the model prediction results and the segmentation ground truths: a larger DSC means higher similarity. The cross entropy measures the difference between two probability distributions and reflects the uncertainty of each pixel prediction: a smaller cross entropy means a better model prediction. For single-category segmentation, the loss function is designed as:

$L_{seg} = 1 - \dfrac{2\sum_{i=1}^{M} y_i \hat{y}_i}{\sum_{i=1}^{M} y_i + \sum_{i=1}^{M} \hat{y}_i} - \dfrac{1}{M}\sum_{i=1}^{M}\big[y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\big]$ (9)

where $y_i$ and $\hat{y}_i$ represent the label and the prediction for the ith voxel in the considered sample, respectively, and M is the total number of voxels in the sample.
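A sketch of equation (9) for a single foreground class; `pred` holds voxel-wise foreground probabilities, `target` the binary ground truth, and the smoothing constant `eps` is an assumption added for numerical stability.

```python
import torch

def seg_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    pred, target = pred.flatten(), target.flatten().float()
    # Dice term: 1 - DSC between the soft prediction and the ground truth
    dice = (2 * (pred * target).sum() + eps) / (pred.sum() + target.sum() + eps)
    # cross-entropy term averaged over all M voxels
    ce = -(target * torch.log(pred + eps)
           + (1 - target) * torch.log(1 - pred + eps)).mean()
    return (1 - dice) + ce
```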

For the unlabeled data, the mean squared error (MSE) will be adopted to construct the consistency loss in our method, and it is defined as:

$L_{cons} = \dfrac{1}{M}\sum_{i=1}^{M}\big(D(f_e)_i - D(f_p)_i\big)^2$ (10)

4.1. Datasets

The first dataset is obtained from transabdominal ultrasound-guided kidney biopsies of ten beagles. Our biopsies target suspected cancerous lesions in the kidneys. An ultrasound machine (S60 Pro, Sonoscape Inc, China) equipped with a 3D abdominal probe (VC2-9, Sonoscape Inc, China) is used to acquire the real-time 3D needle ultrasound image sequences. The probe is based on sector mechanical scanning and operates at a frequency of 3.8–5 MHz, acquiring 3D ultrasound images at a frame rate of 3 Hz with a volume size of approximately 145 mm × 95 mm × 105 mm. The voxel size of the ultrasound volumes is 0.455 mm × 0.455 mm × 0.455 mm. We collect ten videos comprising a dataset of 433 3D volumes. When our model is trained with the temporal information, short sequences of a specified number of volumes are extracted from these data. The extraction is done by sliding a window of the specified length across the entire video with a step of 1, so that the number of sequences extracted from each video is L − (l − 1), where L and l denote the length of the video and the length of a short sequence, respectively. Here, each sequence includes 3 volumes, and a total of 423 sequences are obtained. Among them, 393 sequences are used for training and 30 sequences are used to test the model. The volume data in each sequence undergo online random data augmentation with the same parameters during training, including translation, scaling, flipping and rotation. Then the size of each volume is cropped to 96 × 96 × 96.
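A minimal sketch of the sliding-window extraction described above, assuming each video is given as a list of 3D volumes; with a window of length l and a step of 1, each video of length L yields L − (l − 1) sequences.

```python
def extract_sequences(videos, seq_len: int = 3):
    """Slide a window of `seq_len` volumes over each video with a step of 1."""
    sequences = []
    for video in videos:                               # video: list of L volumes
        for start in range(len(video) - seq_len + 1):  # L - (l - 1) windows
            sequences.append(video[start:start + seq_len])
    return sequences
```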

The second and third datasets are acquired during transrectal and transabdominal ultrasound-guided prostate biopsy procedures on beagles, and they are pre-processed in the same way as the first dataset. The probe used for the transabdominal ultrasound is the same as that in the first dataset, while the transrectal ultrasound uses another probe (VE9-5, Sonoscape Inc, China). Based on sector mechanical scanning, the transrectal probe operates at a frequency of 7.5–9.8 MHz and acquires 3D ultrasound images at a frame rate of 3.3 Hz with voxel and volume sizes identical to those of the first dataset. For these two datasets, we use 404 and 421 sequences for model training, and 25 and 30 sequences for model testing, respectively.

The preoperative preparation for a biopsy includes anesthesia, skin preparation and disinfection. The biopsies were performed using a Bard automatic biopsy gun (MC1816, 18 g (1.2 mm) × 16 cm, BARD, America), and the depths of needle insertion in the three datasets were approximately 50 mm, 80 mm and 80 mm, respectively. All the animal experiments mentioned above were approved by the Ethics Committee of Tongji Medical College, Huazhong University of Science and Technology (permit number: 2019S2164).

4.2. Training settings

Our research platform consists of an Intel Xeon Gold 6240 CPU and an NVIDIA A100-PCIE-40 GB GPU. Our method has been implemented with the MindSpore Lite tool (Mindspore). We fix the batch size to 6 and use the cosine annealing algorithm to automatically adjust the learning rate in each experiment. The Adam optimizer with a weight decay of 3e−4 and smoothing values of 0.9 and 0.999 is selected to train our model. To ensure a fair comparison, the hyperparameters of each method have been fully fine-tuned to achieve the best training effect. Each experiment in this paper is repeated three times and the optimal segmentation result is chosen as the evaluation standard, and 10-fold cross validation is implemented for the model test.
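For illustration, the optimizer settings above can be expressed as follows with PyTorch APIs (the actual implementation uses MindSpore Lite); `model`, `num_epochs` and the base learning rate are placeholders.

```python
import torch

def build_optimizer(model, num_epochs: int, base_lr: float = 1e-3):
    # Adam with weight decay 3e-4 and smoothing values (betas) of 0.9 and 0.999
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr,
                                 betas=(0.9, 0.999), weight_decay=3e-4)
    # cosine annealing automatically adjusts the learning rate over training
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
    return optimizer, scheduler
```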

4.3. Evaluation metrics

For all models, their performance was evaluated using DSC, needle tip positioning error (Etip), needle length error (Elen) and needle angle error (Eang). The DSC is given by:

$DSC = \dfrac{2 \cdot TP}{2 \cdot TP + FP + FN}$ (11)

where FN, FP and TP represent the numbers of false negative, false positive and true positive pixels in the predicted pixel classification results in comparison with the ground truth. The needle parameters (needle tip position, length and angle) were extracted from the segmentation results and label maps by a linear fitting algorithm. Etip, Elen and Eang were computed as the MSE between the results detected by our model and their ground truths. Statistical tests with false discovery rate (FDR) (Benjamini et al 1995) correction were conducted based on the needle tip position errors, and the corrected p-values (i.e. q-values) are listed.
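A sketch of the DSC computation in equation (11) from binary prediction and ground-truth masks:

```python
import numpy as np

def dice_coefficient(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()       # true positives
    fp = np.logical_and(pred, ~gt).sum()      # false positives
    fn = np.logical_and(~pred, gt).sum()      # false negatives
    denom = 2.0 * tp + fp + fn
    return 2.0 * tp / denom if denom > 0 else 1.0
```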

5.1. Ablation experiments

Several ablation experiments were specially designed and performed on the first dataset to demonstrate the effectiveness of our proposed method. Firstly, we compared the segmentation accuracy before and after combining the temporal module to illustrate the effect of the presented training framework on the segmentation accuracy. During single-volume training, the volume data were input into the segmentation model in turn. In the training combined with the temporal module, a short sequence was regarded as one sample, and the segmentation model was divided into its encoder and decoder. To further prove that our method can improve mainstream encoder-decoder segmentation models, U-Net (Ronneberger et al 2015), Att-Unet (Lian et al 2018), TransBTS (Wang et al 2021a), UNETR (Hatamizadeh et al 2022) and SLM-SA (Wen et al 2023) were applied in the experiments.

The DSC indicators and needle detection errors are shown in figure 4 and table 2. From figure 4 and table 2, we can see that the SLM-SA has the highest segmentation accuracy, and it is therefore used as the encoder and decoder of our method in the following experiments. Meanwhile, the needle segmentation and detection accuracy are greatly improved in most cases when the compared methods are combined with the proposed temporal information based training framework. In particular, the simplest U-Net shows more than a 7% improvement in DSC after combining the temporal module (q < 0.01). These quantitative comparisons indicate the effectiveness of the proposed training framework.

Figure 4. The comparison of DSC indicators before and after combination of temporal module.


Table 2. The comparison of needle detection accuracy before and after combination of temporal module.

Models   | Temporal | Etip/mm ↓   | Elen/mm ↓   | Eang^x ↓    | Eang^y ↓    | Eang^z ↓    | q
U-Net    | ×        | 2.47 ± 1.16 | 4.56 ± 2.04 | 2.95 ± 1.30 | 3.86 ± 1.51 | 0.86 ± 0.33 | 0.001
U-Net    | √        | 1.81 ± 0.45 | 3.01 ± 1.17 | 2.88 ± 1.23 | 3.89 ± 1.45 | 0.55 ± 0.24 |
Att-Unet | ×        | 1.40 ± 0.54 | 2.69 ± 1.33 | 3.04 ± 1.55 | 3.97 ± 1.90 | 0.72 ± 0.43 | 0.037
Att-Unet | √        | 1.33 ± 0.58 | 1.94 ± 0.81 | 2.46 ± 1.24 | 3.46 ± 1.54 | 0.46 ± 0.20 |
TransBTS | ×        | 1.82 ± 0.47 | 3.07 ± 1.50 | 3.18 ± 1.15 | 4.15 ± 1.67 | 0.68 ± 0.43 | 0.036
TransBTS | √        | 1.75 ± 0.69 | 3.06 ± 1.16 | 3.01 ± 1.31 | 3.69 ± 1.36 | 0.85 ± 0.44 |
UNETR    | ×        | 1.67 ± 0.69 | 2.72 ± 1.55 | 2.85 ± 1.40 | 4.18 ± 2.05 |             |
