The image dataset comprised 30 non-contrast body CT images acquired at Gifu University Hospital. The 30 cases were acquired from 30 different patients, none of whom presented with musculoskeletal disorders. The scanning device was a LightSpeed Ultra 16 (GE Healthcare). The spatial resolution was 0.625 \(\times\) 0.625 \(\times\) 0.625 [mm], and the image size was 512 \(\times\) 512 \(\times\) 802-1104 [voxels].
2.2 Experimental environment

The experimental setup for this study was as follows. The computer was equipped with a Ryzen Threadripper PRO 5965WX CPU, NVIDIA RTX A6000 (48 GB \(\times\) 3) GPUs, and 256 GB RAM, although the process could be executed on an NVIDIA Quadro RTX 5000 (16 GB \(\times\) 1) GPU. The operating system was Ubuntu 20.04 LTS, and the deep learning library used was TensorFlow [21] 2.6.0+nv.
2.3 Joint segmentation of sternocleidomastoid and skeletal muscles

An overview of the proposed method is shown in Fig. 1. This study focused on the depiction of skeletal muscles in all cross-sectional slices within the scanning range of body CT images. Previously, Kawamoto et al. [22] proposed a method for recognizing skeletal muscles around the L3 cross-section by learning a site-specific muscle and the erector spinae muscle simultaneously. In contrast, our method targets skeletal muscles that are visible over a wider range. The main contribution of this work is the automatic segmentation of the sternocleidomastoid muscle using a two-dimensional (2D) U-Net [17] through multiclass learning of the sternocleidomastoid and skeletal muscles in the input body CT images. The U-Net architecture has demonstrated accuracies of 87.24% for COVID-19 lesion segmentation and 97.75% for lung segmentation in CT images [20]. These accuracies are within 1% of those of the successor network, the multi-transformer U-Net [23], which achieved accuracies of 88.18% and 98.01%, respectively. Moreover, U-Net has been employed as the base architecture in recently proposed networks. Thus, U-Net remains a strong baseline for segmentation methods, and this study proposes a learning and segmentation method using 2D U-Net.
The process begins with inputting 3D body CT images and corresponding ground truth labels. The 2D U-Net is then trained on all axial slices extracted from these 3D volumes, and it learns to segment multiple muscle classes simultaneously. After it is trained, the U-Net processes each axial slice of a new 3D body CT image for segmentation. Finally, the individual 2D segmentation results are reconstructed to form a complete 3D segmentation of the sternocleidomastoid and other skeletal muscles. This method leverages the efficiency of 2D U-Net processing while effectively handling 3D volumetric data, resulting in a comprehensive 3D muscle segmentation. The 2D U-Net architecture consists of an encoder and a decoder. In this study, the encoder applies a process of two 3\(\times\)3 convolutions followed by 2\(\times\)2 max pooling for downsampling, repeated four times. Subsequently, two additional 3\(\times\)3 convolutions are applied. The decoder then employs a process of 2\(\times\)2 upsampling, concatenation with the corresponding encoder feature maps, and two 3\(\times\)3 convolutions, iterated four times. Batch normalization and rectified linear unit (ReLU) activation are applied after each 3\(\times\)3 convolution. Finally, a 1\(\times\)1 convolution followed by a softmax function is utilized. This architecture enables the segmentation of the sternocleidomastoid muscle and skeletal muscles.
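The encoder-decoder structure described above can be sketched in Keras as follows. This is a minimal illustration, not the authors' implementation: the layer sequence (two 3\(\times\)3 convolutions with batch normalization and ReLU, four pooling/upsampling stages, a final 1\(\times\)1 convolution with softmax) follows the text, while the channel widths (`base_filters=64`) and the single-channel 512\(\times\)512 input are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers


def conv_block(x, filters):
    # Two 3x3 convolutions, each followed by batch normalization and ReLU.
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return x


def build_unet(input_shape=(512, 512, 1), num_classes=3, base_filters=64):
    inputs = tf.keras.Input(shape=input_shape)
    skips, x = [], inputs
    # Encoder: (two 3x3 convs -> 2x2 max pooling), repeated four times.
    for i in range(4):
        x = conv_block(x, base_filters * 2**i)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    # Bottleneck: two additional 3x3 convolutions.
    x = conv_block(x, base_filters * 16)
    # Decoder: (2x2 upsampling -> concat with encoder features -> two 3x3 convs), x4.
    for i in reversed(range(4)):
        x = layers.UpSampling2D(2)(x)
        x = layers.Concatenate()([x, skips[i]])
        x = conv_block(x, base_filters * 2**i)
    # Final 1x1 convolution with softmax over the three classes
    # (sternocleidomastoid, other skeletal muscle, background).
    outputs = layers.Conv2D(num_classes, 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)


model = build_unet()
```

Because the network is fully convolutional, each axial slice is processed independently, and the per-slice outputs are stacked to reconstruct the 3D segmentation.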
Fig. 1Segmentation of sternocleidomastoid muscle using multiclass learning with skeletal muscles (proposed method)
The proposed method considered a three-class classification problem that distinguished between the sternocleidomastoid muscle, skeletal muscle, and background regions. Therefore, the loss function combined cross-entropy (CE) and Dice loss with respective weights of 0.5 and 1 (loss = \(0.5 \times CE + Dice\)), as in [21]. The learning parameters were set as follows: the number of epochs was 50, the learning rate was \(3 \times 10^\), the batch size was four, and the optimization function was Adam [24]. In addition, data augmentation was applied to the training images, an approach shown to be effective for skeletal muscle segmentation of the lower and upper legs [25]. Specifically, augmentation techniques such as scaling, translation, rotation, shear transformation, and flipping were randomly applied to expand the training data size by a factor of eight.
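The weighted CE-plus-Dice loss above can be written compactly in TensorFlow. This is a sketch under stated assumptions: the paper does not specify the exact Dice formulation, so a per-class soft Dice with a small smoothing constant is assumed here, with one-hot ground truth and softmax predictions of shape (batch, H, W, classes).

```python
import tensorflow as tf


def dice_loss(y_true, y_pred, eps=1e-6):
    # Soft Dice averaged over classes; the smoothing constant eps is an assumption.
    axes = (1, 2)  # spatial axes of (batch, H, W, classes) tensors
    intersection = tf.reduce_sum(y_true * y_pred, axis=axes)
    denom = tf.reduce_sum(y_true, axis=axes) + tf.reduce_sum(y_pred, axis=axes)
    dice = (2.0 * intersection + eps) / (denom + eps)
    return 1.0 - tf.reduce_mean(dice)


def combined_loss(y_true, y_pred):
    # loss = 0.5 * CE + Dice, with the weights given in the text.
    ce = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(y_true, y_pred))
    return 0.5 * ce + dice_loss(y_true, y_pred)
```

For the two-class baseline in Sect. 2.4, `categorical_crossentropy` would be replaced by binary cross-entropy over a sigmoid output.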
2.4 Experimental settings

In this study, the automatic segmentation of the sternocleidomastoid muscle was performed using a 2D U-Net based on multiclass learning with skeletal muscles. In deep learning-based methods, learning only the single target region is the most fundamental approach. Therefore, we compare our method against the recognition results obtained by learning only the sternocleidomastoid muscle region as a baseline. Furthermore, as a deep learning-based method for the automatic recognition of specific skeletal muscles, a technique involving multiclass learning with the erector spinae muscle was previously proposed [22]. Multiclass learning of the target region and the erector spinae muscle effectively segmented six regions: the trapezius, supraspinatus, rectus abdominis, obliquus abdominis, quadratus lumborum, and psoas major muscles. Therefore, this study also performed segmentation through multiclass learning of the sternocleidomastoid and erector spinae muscles and compared the results with those of our method.
Learning and segmenting only the sternocleidomastoid muscle is a two-class classification problem that distinguishes the target sternocleidomastoid muscle from the background. Consequently, the loss function combined binary cross-entropy (BCE) and Dice loss with respective weights of 0.5 and 1 (loss = \(0.5 \times BCE + Dice\)), and the sigmoid function was employed as the activation function, as in a previous method [22]. In contrast, multiclass learning of the sternocleidomastoid and erector spinae muscles results in a three-class classification problem, similar to the proposed method. Therefore, we use the CE and Dice loss described in Sect. 2.3 (loss = \(0.5 \times CE + Dice\)), with the softmax function as the activation function. All other aspects of the network architecture and hyperparameters are the same as those of the proposed method described in Sect. 2.3.
2.5 Ground truth and evaluation metrics

The 2D U-Net is a supervised learning method. Therefore, this study required ground truth images for all 30 cases in the image dataset described in Sect. 2.1. Ground truth was required both for the target sternocleidomastoid and skeletal muscles used in the proposed method and for the erector spinae muscles used for comparison with the conventional method [22]. The ground truth images were created by the first author (K. A.), a student majoring in computer science, using thresholding and the graph cut tool [26] implemented in PLUTO [27], a common platform for the computer-aided diagnosis of medical images. For overall skeletal muscle regions, the results of threshold processing were manually corrected. For specific skeletal muscle regions, manual marking was performed at intervals of several slices, followed by the application of the graph cut method. These results were then iteratively refined by manual correction and ultimately verified by the second author (N. K.), who holds a Ph.D. in medical science.
In this study, the dataset was divided into training and test sets on a per-patient basis, and threefold cross-validation was employed for accuracy verification. Using the experimental environment described in Sect. 2.2, we conducted the training for the three folds simultaneously, assigning each fold to a different GPU. Experiments were conducted using the proposed method and the two learning approaches described in Sect. 2.4. The effectiveness of the proposed method was validated based on segmentation accuracy. The evaluation metrics were Dice (Dice = \(2|A \cap B|/(|A| + |B|)\)), Recall (Recall = \(|A \cap B|/|A|\)), and Precision (Precision = \(|A \cap B|/|B|\)). Here, A represents the target region in the ground truth image, and B represents the target region in the segmentation result. Dice, Recall, and Precision were used to evaluate the overlap ratio, the degree of under-segmentation, and the degree of over-segmentation, respectively, between the ground truth image and the segmentation result.
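The three metrics follow directly from the set definitions above. The short NumPy sketch below illustrates them for binary masks; the function name and array layout are illustrative, not part of the authors' code.

```python
import numpy as np


def segmentation_metrics(gt, seg):
    """Dice, Recall, and Precision for binary masks.

    gt:  ground-truth mask (region A), any array of 0/1 values.
    seg: segmentation result (region B), same shape as gt.
    """
    gt, seg = gt.astype(bool), seg.astype(bool)
    inter = np.logical_and(gt, seg).sum()
    dice = 2.0 * inter / (gt.sum() + seg.sum())  # 2|A ∩ B| / (|A| + |B|)
    recall = inter / gt.sum()                    # |A ∩ B| / |A|
    precision = inter / seg.sum()                # |A ∩ B| / |B|
    return dice, recall, precision
```

A Recall below 1 indicates that part of the ground-truth region was missed (under-segmentation), while a Precision below 1 indicates that the result spills outside it (over-segmentation).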