Decentralized Distributed Multi-institutional PET Image Segmentation Using a Federated Deep Learning Framework

Purpose: The generalizability and trustworthiness of deep learning (DL)-based algorithms depend on the size and heterogeneity of training datasets. However, because of patient privacy concerns and ethical and legal issues, sharing medical images between different centers is restricted. Our objective is to build a federated DL-based framework for PET image segmentation utilizing a multicentric dataset and to compare its performance with the centralized DL approach.
Methods: PET images from 405 head and neck cancer patients from 9 different centers formed the basis of this study. All tumors were segmented manually. PET images converted to SUV maps were resampled to isotropic voxels (3 × 3 × 3 mm³) and then normalized. PET image subvolumes (12 × 12 × 12 cm³) consisting of whole tumors and background were analyzed. Data from each center were divided into train/validation (80% of patients) and test sets (20% of patients). The modified R2U-Net was used as the core DL model. A parallel federated DL model was developed and compared with the centralized approach, in which the data sets are pooled on one server. Segmentation metrics, including Dice similarity and Jaccard coefficients, and percent relative errors (RE%) of SUVpeak, SUVmean, SUVmedian, SUVmax, metabolic tumor volume, and total lesion glycolysis were computed and compared with manual delineations.
Results: The performance of the centralized versus federated DL methods was nearly identical for segmentation metrics: Dice (0.84 ± 0.06 vs 0.84 ± 0.05) and Jaccard (0.73 ± 0.08 vs 0.73 ± 0.07). For quantitative PET parameters, we obtained comparable RE% for SUVmean (6.43% ± 4.72% vs 6.61% ± 5.42%), metabolic tumor volume (12.2% ± 16.2% vs 12.1% ± 15.89%), and total lesion glycolysis (6.93% ± 9.6% vs 7.07% ± 9.85%) and negligible RE% for SUVmax and SUVpeak. No significant differences in performance (P > 0.05) between the 2 frameworks (centralized vs federated) were observed.
Conclusion: The developed federated DL model achieved comparable quantitative performance with respect to the centralized DL model. Federated DL models could provide robust and generalizable segmentation, while addressing patient privacy and legal and ethical issues in clinical data sharing.

tional and metabolic information of the underlying tissues at the molecular level. 18F-FDG PET imaging plays a major role in clinical diagnosis, evaluation of prognosis, treatment planning including external radiation therapy (RT), and posttreatment follow-up. 1 Radiation therapy is a standard treatment modality for head and neck cancer (HNC), 2 and the precise delineation of tumor boundaries is crucial because segmentation accuracy not only affects survival of HNC patients but is also essential in avoiding irradiation of organs at risk. 3 A number of studies have demonstrated that the ability of 18F-FDG PET to characterize tumor metabolism facilitates its segmentation for RT planning. 4 Currently, tumor segmentation is performed manually by the radiation oncologist. This task, however, is time-consuming and labor-intensive and suffers from interobserver and intraobserver variability because of the complex head and neck anatomy on the one hand and the considerable operator experience required on the other. 5 Even in experienced hands, interobserver variability can be substantial. In a recent study, 6 the interobserver variability of PET/CT-based gross target volume (GTV) segmentation in HNC patients undergoing RT resulted in a mean GTV overlap, as reflected by the Dice similarity coefficient, of only 69%, although 3 experienced radiation oncologists had manually segmented the tumors.
To segment HNCs, data from 18F-FDG PET and CT acquisitions are usually used. 7 In this approach, it is assumed that 18F-FDG tumor uptake and anatomic tumor boundaries correspond on coregistered PET and CT images. 8 However, anatomic and metabolic tumor boundaries may not coincide because of PET/CT coregistration errors and peritumoral inflammation, which may lead to overestimation of tumor volume on morphologic images. 9
In addition, accurate and precise delineation of tumor contours plays a critical role in the reliability of quantitative analysis of 18F-FDG uptake, including texture analysis. Such analysis (referred to as radiomics) can be utilized to evaluate tumor changes during treatment and establish prognostic models for predicting survival and treatment outcome. 10 It has been shown that the variability of tumor contouring can jeopardize the robustness and reproducibility of quantitative metrics, including radiomics features extracted from PET images. 10 Moreover, tumor segmentation has been recognized as a stumbling block and a time-consuming step in radiomics studies, 11 hindering the utilization of large multicenter data sets, an essential step for successful clinical implementation of texture analysis. 12 Apart from the aforementioned reasons, newly developed RT techniques and radiomics models based on PET alone, without relying on other imaging modalities, hold promise for fully automated segmentations. Therefore, the development of accurate fully automated segmentation methods is highly desirable. 4][15]
An array of DL techniques has been developed for medical image segmentation and has produced promising results for different modalities, especially PET. 16 Among the different AI techniques, DL-based methods have gained special attention because of their ability to automatically extract high-throughput features and generate probability maps to segment and delineate normal and abnormal tissues. 17 In a PET segmentation study by Czakon et al, 18 3 different AI-based methods, namely, a model based on spatial distance weighted fuzzy c-means, another based on dictionary learning, and a DL approach, were compared. The DL approach achieved the highest performance. Moreover, in the MICCAI (Medical Image Computing and Computer Assisted Intervention) 19 and HECKTOR (HEad and neCK TumOR) 20 automated segmentation challenges, the best performing models were all DL-based. Nevertheless, only a limited number of studies have so far investigated the potential of DL segmentation methods based solely on PET. 21 In a more recent study, 3 DL algorithms with a combination of 8 loss functions were assessed for HNC tumor segmentation from PET images, reporting promising results. 22 However, DL-based algorithms are known to be data-hungry, and as such, their generalizability is largely dependent on the size and heterogeneity of the datasets used. 4][25]
To overcome this limitation, the concept of federated learning (FL) is being increasingly explored in the context of medical data and, more recently, in medical imaging. ][25] Traditionally, DL models are developed in a single center, where the data owner trains the application-specific model using available local training data sets. However, this approach has 2 major limitations. First, the development of an accurate and robust DL model requires massive data sets, which are unlikely to be obtained from a single center. Second, data acquired in a single center may be homogeneous, resulting in poor generalizability and poor performance on independent unseen samples. To address these limitations, data owners (users) send their data sets to a central server with significant computational power and storage capacity, which pools the data for meaningful model implementation. This approach is known as the centralized framework. In the last decade, a wide range of applications has been proposed based on standalone and centralized frameworks. However, data sets often contain sensitive information that data owners (users) may prefer to keep private. ][25] Privacy-preserving mechanisms can be utilized to ensure the privacy of the data owners (users). Alternatively, a centralized model can be trained in a decentralized fashion. 7][28][29][30][31][32] That is, models can be trained using distributed or decentralized approaches. 0][31][32]
In the FL framework, data owners (users) train their networks locally and send the trained model updates to a central server. The server can then, for instance, iteratively aggregate the local updates into a global network. As a result, FL allows training collaborative DL models using localized data from different centers, addressing privacy concerns to some extent. Federated learning-based models have been developed for different medical imaging tasks, including abnormality detection and classification, 34 prognostic modeling, 35 and segmentation. 36,37
In the current study, we have developed a federated DL-based framework for PET image segmentation using multicenter data sets and compared its performance with a centralized model, which uses data pooling in a central server for model building.

Manual Image Segmentation and Preprocessing
Image quality of all PET images was first evaluated by an experienced nuclear medicine physician. Subsequently, manual segmentation of primary tumors was performed for each center on axial slices, starting from the initial segmentation provided by each center, by an experienced nuclear medicine physician or in consensus by 2 radiologists, depending on the center. The delineations were used as the standard of reference for evaluation. PET images were converted to SUV maps. Because the data sets were acquired at different centers with different scanners, image acquisition protocols, and reconstruction settings, PET images were interpolated to an isotropic voxel of 3 × 3 × 3 mm³, which resulted in rotationally invariant uniform (matrix size and voxel size) data sets. In addition, to make the computations tractable, all PET images were cropped to 12 × 12 × 12 cm³ subvolumes (uniform matrix size and voxel size) including whole tumor and background. Cropped PET images were normalized to the range [0,1]. These straightforward steps were adopted for easy implementation and to ensure reproducibility of image preprocessing in a clinical setting.
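The preprocessing chain described above (resampling to 3 mm isotropic voxels, cropping a 12 cm cube, scaling to [0,1]) can be sketched as follows. This is a minimal illustration assuming NumPy, not the study's implementation: nearest-neighbour resampling stands in for whichever interpolation was actually used, and the function names, the input voxel spacing, and the crop centre are hypothetical.

```python
import numpy as np

VOXEL_MM = 3.0                     # isotropic voxel size after resampling
CROP_VOX = int(120 / VOXEL_MM)     # 12 cm / 3 mm = 40 voxels per axis


def resample_nearest(vol, spacing_mm, new_mm=VOXEL_MM):
    """Nearest-neighbour resampling to isotropic voxels (illustrative stand-in)."""
    spacing = np.asarray(spacing_mm, dtype=float)
    new_shape = np.maximum(np.round(np.array(vol.shape) * spacing / new_mm), 1).astype(int)
    # For every output voxel centre, look up the input voxel that contains it.
    idx = [np.minimum((np.arange(n) + 0.5) * new_mm / s, d - 1).astype(int)
           for n, s, d in zip(new_shape, spacing, vol.shape)]
    return vol[np.ix_(*idx)]


def crop_around(vol, center, size=CROP_VOX):
    """Crop a size^3 subvolume centred on `center`, zero-padding at the borders."""
    out = np.zeros((size,) * 3, dtype=vol.dtype)
    start = [int(c) - size // 2 for c in center]
    src = tuple(slice(max(s, 0), min(s + size, d)) for s, d in zip(start, vol.shape))
    dst = tuple(slice(sl.start - s, sl.stop - s) for sl, s in zip(src, start))
    out[dst] = vol[src]
    return out


def normalize01(vol):
    """Min-max scaling to the range [0, 1]."""
    lo, hi = float(vol.min()), float(vol.max())
    return np.zeros_like(vol, dtype=float) if hi == lo else (vol - lo) / (hi - lo)


# Synthetic SUV map with a hypothetical 4 x 4 x 6 mm voxel spacing.
suv = np.random.default_rng(0).random((60, 60, 30)) * 10.0
iso = resample_nearest(suv, spacing_mm=(4.0, 4.0, 6.0))
patch = normalize01(crop_around(iso, center=(40, 40, 30)))
```

In practice the crop centre would be chosen so that the whole tumor lies inside the subvolume; how the study positioned the subvolume is not stated here.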

Federated Learning Framework
Consider a centralized server that aims to train a DL model consisting of d parameters based on data samples available at K data centers (owners/users). The objective is to minimize some loss function. Let $\mathcal{D}_k = \{(x_i^{(k)}, y_i^{(k)})\}_{i=1}^{N_k}$ denote the set of $N_k$ labeled training data samples available at the kth data center (owner), where k ∈ {1, 2, …, K}. The vector $x_i^{(k)}$ represents the ith PET image at center k, where i ∈ {1, 2, …, $N_k$}, and $y_i^{(k)}$ represents the corresponding label. The goal of the FL model is to find the vector $\theta^{\circ}$ satisfying:
$$\theta^{\circ} = \arg\min_{\theta} \sum_{k=1}^{K} \alpha_k F_k(\theta),$$
where $F_k$ is the local objective function of center k, defined as the empirical average of the loss over $\mathcal{D}_k$, and the $\alpha_k$ are nonnegative weighting coefficients satisfying $\sum_{k=1}^{K} \alpha_k = 1$ (for example, $\alpha_k = N_k / N$ with $N = \sum_{k=1}^{K} N_k$).
In our framework, each center (data owner) minimizes its empirical loss function with respect to the local data samples, using the stochastic gradient descent (SGD) algorithm. Let t denote the global iteration and suppose each data owner performs τ-step local SGD, for some τ ∈ ℕ. Upon receiving the global model parameter θ(t) from the server, the jth step of the local SGD at data owner k, k ∈ {1, 2, …, K}, corresponds to the following update:
$$\theta_k^{j+1}(t) = \theta_k^{j}(t) - \eta_k^{j}(t)\, \hat{\nabla} F_k\big(\theta_k^{j}(t)\big), \qquad j = 1, \ldots, \tau,$$
where $\hat{\nabla} F_k$ is a stochastic estimate of the gradient of $F_k$ computed on local samples, $\eta_k^{j}(t)$ denotes the learning rate, and the first local update is set as $\theta_k^{1}(t) = \theta(t)$.
Depending on the strategy used to update the parameters of the global and local models, various techniques have been proposed [44][45][46] to improve communication efficiency compared with the naive SGD method. We used federated averaging (FedAvg) in our framework. A schematic description of the FL process is presented in Figure 1. First, the global model generated by the server is distributed to the different centers (A). Next, the models are trained separately in each center (B) using the local data set, and finally, trained models from all centers are returned to the server, which aggregates them to update the central global model (C). These steps are repeated until some convergence criterion is met, for example, until no significant decrease in loss is observed. The model learns from the data sets using SGD for all optimizations. 22
Similar to previous studies, [47][48][49][50][51][52][53] the FL process in this work was performed on a server with multiple local graphics processing units (GPUs), where each local GPU was considered a different center.
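The distribute, train locally, aggregate loop can be illustrated with a small simulation in a single process, in the same spirit as treating each GPU as a center. The sketch below assumes NumPy and replaces the segmentation network with a least-squares model so that it runs in a fraction of a second; the aggregation weights N_k / N, the hyperparameters, and all function names are assumptions made for illustration rather than the settings used in the study.

```python
import numpy as np


def local_sgd(theta, X, y, tau, lr, batch, rng):
    """Run tau mini-batch SGD steps on a least-squares loss, starting from the global model."""
    theta = theta.copy()                       # first local iterate equals the global model
    for _ in range(tau):
        idx = rng.choice(len(X), size=min(batch, len(X)), replace=False)
        grad = 2.0 * X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
        theta -= lr * grad
    return theta


def fedavg(clients, d, rounds=100, tau=5, lr=0.05, batch=16, seed=0):
    """clients: list of (X_k, y_k) pairs, one per simulated center."""
    rng = np.random.default_rng(seed)
    n_k = np.array([len(X) for X, _ in clients], dtype=float)
    alpha = n_k / n_k.sum()                    # assumed weights: alpha_k = N_k / N
    theta = np.zeros(d)                        # global model held by the server
    for _ in range(rounds):
        # (A) distribute theta, (B) train locally, (C) aggregate on the server.
        local = [local_sgd(theta, X, y, tau, lr, batch, rng) for X, y in clients]
        theta = sum(a * th for a, th in zip(alpha, local))
    return theta


# Three simulated centers of different size with shifted feature distributions.
rng = np.random.default_rng(1)
theta_true = np.array([1.0, -2.0, 0.5])
clients = []
for n, shift in [(30, 0.0), (60, 0.5), (120, -0.5)]:
    X = rng.normal(loc=shift, scale=1.0, size=(n, 3))
    clients.append((X, X @ theta_true))

theta_hat = fedavg(clients, d=3)
```

Only model parameters cross the boundary between `local_sgd` and `fedavg`; the arrays `X` and `y` never leave the simulated center, which is the property the framework relies on.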

Deep Neural Network
As for the DL architecture, we used a modified R2U-Net, 54 which is composed of recurrent residual connections as well as convolutional blocks (Fig. 1). The most established neural structure for image segmentation in the medical community is U-Net, on which many further variations have been based. R2U-Net builds on this by adding recurrence to the convolutional residual blocks, which helps the network increase its effective capacity without increasing the number of parameters. In fact, recurrence can be thought of as the operation of unrolling a network block through time to provide more effective depth. Moreover, R2U-Net uses feature accumulation, which helps extract low-level features. We used 3 down- and up-sampling levels with 16, 32, and 64 channels in our R2U-Net structure, as well as 2 recurrent convolutional layers with 2 iterations per down- and up-sampling level, along with batch normalization layers. As for the activation function, we used the standard ReLU, except for the sigmoid in the output layer. All implementations were performed in TensorFlow.
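To make the idea of recurrence without extra parameters concrete, the following toy example unrolls a recurrent residual block on a 1D signal with NumPy. It is a sketch under simplifying assumptions (single channel, no batch normalization, one common formulation of the recurrent convolutional layer) and is not the TensorFlow model used in the study; the function names are hypothetical.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def recurrent_conv(x, w_f, w_r, steps=2):
    """Recurrent convolutional layer: kernel w_f acts on the input, kernel w_r on the
    previous state, and the same two kernels are reused at every unrolled step."""
    ff = np.convolve(x, w_f, mode="same")
    h = relu(ff)
    for _ in range(steps):
        h = relu(ff + np.convolve(h, w_r, mode="same"))
    return h


def recurrent_residual_block(x, kernels, steps=2):
    """Stack recurrent conv layers and add the block input back (residual connection)."""
    h = x
    for w_f, w_r in kernels:
        h = recurrent_conv(h, w_f, w_r, steps)
    return x + h


rng = np.random.default_rng(0)
signal = rng.normal(size=16)
kernels = [(rng.normal(size=3), rng.normal(size=3)) for _ in range(2)]  # 2 recurrent layers
out = recurrent_residual_block(signal, kernels, steps=2)                # 2 iterations each

# The parameter count depends only on the kernels, not on the number of iterations.
n_params = sum(w_f.size + w_r.size for w_f, w_r in kernels)
```

Raising `steps` deepens the unrolled computation while `n_params` stays fixed, which is the sense in which recurrence adds effective depth at constant model size.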

Training
All reported evaluations were performed on the 81 test patients (20% of each center). All PET images were fed as input to the R2U-Net in both the FL and centralized frameworks to generate the corresponding binary masks of tumors. We trained the DL model on axial slices as 1-channel images with a batch size of 64.
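A per-center 80%/20% split of the kind described can be sketched with the Python standard library. The per-center patient counts are not given here, so the example assumes 9 hypothetical centers of 45 patients each (405 in total); the function name, the patient identifiers, and the seed are likewise illustrative.

```python
import random


def split_by_center(patients_by_center, test_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the patients of every center as its test set."""
    rng = random.Random(seed)
    train, test = {}, {}
    for center, ids in patients_by_center.items():
        ids = sorted(ids)                     # fixed order so the seed fully determines the split
        rng.shuffle(ids)
        n_test = max(1, round(test_fraction * len(ids)))
        test[center] = ids[:n_test]
        train[center] = ids[n_test:]
    return train, test


# Hypothetical cohort: 9 centers with 45 patients each.
cohort = {f"center_{c}": [f"c{c}_p{i:03d}" for i in range(45)] for c in range(1, 10)}
train, test = split_by_center(cohort)
n_train = sum(len(v) for v in train.values())
n_test = sum(len(v) for v in test.values())
```

Splitting within each center, rather than over the pooled cohort, keeps every center represented in both the training and the test data.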

Quantitative Evaluation
To evaluate the performance of the 2 models, standard segmentation metrics, including the Dice similarity coefficient, Jaccard similarity coefficient, false-negative rate, false-positive rate, volume similarity, and mean and SD of surface distance, were calculated with respect to manual segmentations. In addition, DL-guided segmentations from both the centralized and FL frameworks were evaluated clinically through a number of image-derived PET metrics, including SUVpeak, SUVmean, SUVmedian, SUVmax, metabolic tumor volume (MTV), and total lesion glycolysis (TLG). We also extracted a number of shape radiomic features, including sphericity, asphericity, elongation, and flatness, using the SERA package, 55 which is compliant with the Image Biomarker Standardization Initiative guidelines. 56 We calculated the mean relative error (RE%) and the mean absolute relative error (ARE%) with respect to manual segmentation.
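The overlap metrics and relative errors named above can be written compactly. The sketch below assumes NumPy, binary masks on the 3 mm isotropic grid (so one voxel is 0.027 mL), and the usual definition TLG = SUVmean × MTV; the toy volume, masks, and function names are hypothetical, and ARE% is simply the absolute value of RE%.

```python
import numpy as np


def dice(a, b):
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom


def jaccard(a, b):
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return 1.0 if union == 0 else np.logical_and(a, b).sum() / union


def pet_metrics(suv, mask, voxel_ml=0.027):
    """SUVmean, SUVmax, MTV (mL), and TLG inside a mask; 3 x 3 x 3 mm voxels are 0.027 mL."""
    vals = suv[mask.astype(bool)]
    mtv = vals.size * voxel_ml
    return {"SUVmean": vals.mean(), "SUVmax": vals.max(), "MTV": mtv, "TLG": vals.mean() * mtv}


def relative_error(pred, ref):
    """RE% of a predicted value with respect to the reference (manual) value."""
    return 100.0 * (pred - ref) / ref


# Toy example: a hot 2 x 2 x 2 lesion (SUV 5) in a uniform background (SUV 2).
suv = np.full((4, 4, 4), 2.0)
suv[1:3, 1:3, 1:3] = 5.0
manual = np.zeros((4, 4, 4), dtype=bool)
manual[1:3, 1:3, 1:3] = True
predicted = np.zeros((4, 4, 4), dtype=bool)
predicted[1:3, 1:3, 1:4] = True              # over-segments by one slice

ref, pred = pet_metrics(suv, manual), pet_metrics(suv, predicted)
re = {name: relative_error(pred[name], ref[name]) for name in ref}
are = {name: abs(value) for name, value in re.items()}
```

In this toy case the over-segmented mask still contains the hottest voxel, so the SUVmax error is zero while MTV is overestimated and SUVmean is underestimated, which mirrors why peak-type metrics tend to be less sensitive to contour differences.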

Statistical Analysis
Descriptive statistics included mean ± SD and 95% confidence interval (CI) for different image quantification metrics. The Kolmogorov-Smirnov test showed that the data were not normally distributed. Therefore, pairwise comparison between parameters was performed using the nonparametric 2-sample Wilcoxon test.
FIGURE 2. 3D views of PET segmentations obtained from manual (red), centralized learning (green), and federated learning (blue) methods, on representative patients from different centers (cases 1 and 2 are from 2 centers).
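For readers who want to see what the 2-sample Wilcoxon (rank sum, Mann-Whitney U) comparison in the Statistical Analysis computes, here is a small self-contained sketch using NumPy and the standard library. It uses the normal approximation without tie or continuity correction, so its P values will differ slightly from those of a statistics package; the Dice values in the example are made up for illustration and are not study data.

```python
import math
import numpy as np


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided rank sum test via the normal approximation (no tie/continuity correction)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(np.concatenate([x, y]))
    u1 = ranks[:n1].sum() - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mu) / sigma
    return u1, math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical per-patient Dice scores for the two frameworks.
dice_centralized = [0.81, 0.86, 0.79, 0.90, 0.84, 0.88]
dice_federated = [0.82, 0.85, 0.80, 0.89, 0.83, 0.87]
u_stat, p_value = rank_sum_test(dice_centralized, dice_federated)
not_significant = p_value > 0.05
```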

RESULTS
Figure 2 illustrates the 3D-rendered volumes of segmented GTVs, sampled from different clinical centers for manual, centralized, and federated segmentation approaches. The model performance of the centralized (green) and federated (blue) DL algorithms is visually compared against manual segmentation. Supplemental Figure 1 (http://links.lww.com/CNM/A378) presents additional cases of segmented GTVs categorized by clinical center.
In addition, through center-based analysis, we observed consistency between centralized and FL models in terms of relative bias from ground truth segmentation in SUVmean (6.43% ± 4.72% vs 6.61% ± 5.42%), MTV (12.23% ± 16.19% vs 12.1% ± 15.89%), and TLG (6.93% ± 9.6% vs 7.07% ± 9.85%). For some indices, such as SUVmax and SUVpeak, the difference between the prediction and ground truth was negligible. For shape features, similar consistency was observed between centralized and FL models (elongation = 5.5% ± 7.43% vs 5.67% ± 8.19% and sphericity = 3.39% ± 4.6% vs 3.81% ± 5.35%). Table 1 summarizes the mean ± SD of RE% and ARE% of quantitative PET metrics for the centralized and FL approaches compared with the ground truth using the center-based approach. Center-based statistical analysis revealed that differences between all derived quantitative metrics were not significant (P > 0.05). Selected image-derived features (signal intensity features and shape features) of the segmented volumes of the whole test set are illustrated in Figure 6. Furthermore, the AREs of these metrics for the centralized and FL models are summarized in Table 2, categorized by center. Lower and upper bounds of the 95% CI are summarized in Supplemental Tables 4 and 5 (http://links.lww.com/CNM/A378) along with statistical analysis, in which no significant differences between the 2 approaches were observed (Supplemental Tables 6 and 7, http://links.lww.com/CNM/A378).
FIGURE 3. 2D views of segmentations obtained from manual (red), centralized learning (green), and federated learning (blue) methods on representative patients from the 9 different centers.

DISCUSSION
Accurate and reproducible tumor segmentation from noisy PET images faces many challenges. 57,58 Inherent limitations in PET imaging, such as low spatial resolution, partial volume effect, high noise characteristics, and motion artifacts, result in blurred boundaries between tumor and background. In addition, different shapes, textures, and locations of tumors render the development of generalized segmentation methods difficult. Furthermore, the variability of PET scanners, imaging protocols, and reconstruction/correction algorithms challenges the reproducibility of segmentation results. 9
Multiple computer-aided methods have been proposed for PET image segmentation that addressed the aforementioned challenges to some extent. 58 Conventional segmentation methods range from simple algorithms, such as threshold-based, 59 region-growing, and active contours, to more sophisticated approaches based on clustering and classification algorithms trained on PET features, such as fuzzy locally adaptive Bayesian, atlas-based, fuzzy c-means iterative clustering, and Gaussian mixture models. Although these algorithms provided promising results, their translation into the clinic faced multiple impediments. Some techniques require manual identification of the central tumor voxel or a bounding box encompassing the entire tumor, 60 some are limited by partial volume effect, 61 and some require additional tuning on different scanners. 62
Compared with the aforementioned methods, DL-based methods have shown promising results. In the first MICCAI PET segmentation challenge, 19 the performance of conventional and machine learning algorithms was evaluated on a dataset of 176 PET images consisting of simulated, phantom, and clinical studies. Deep learning-based algorithms outperformed other techniques, achieving a Dice score of 0.80. Huang et al 63 applied a U-Net architecture for HNC segmentation from PET/CT images on dual-center datasets comprising 22 patient studies, evaluated using a leave-one-out scheme, and reported a Dice coefficient of 0.73. Andrearczyk et al 64 segmented HNC tumors using the V-Net architecture, evaluated their model using a leave-one-center-out scheme on a 4-center database, and reported Dice coefficients of 0.58 and 0.60 for PET and fused PET/CT images, respectively. Leung et al 65 proposed a physics-guided tumor segmentation method from PET images using DL. They simulated realistic tumors and trained the model on these data, followed by fine-tuning on clinical datasets. They reported Dice coefficients of 0.87 (95% CI, 0.86-0.88) and 0.73 (95% CI, 0.71-0.76) for simulated and clinical studies, respectively.
In a more recent study, Shiri et al 22 developed fully automated tumor segmentation for HNC PET studies using DL algorithms and multicentric datasets. They evaluated 24 different models implemented through the combination of 3 DL algorithms and 8 different loss functions. Deep learning models were trained on 370 images and tested on 100 PET images on 12-cm³ subvolumes that included both tumor and background. They reported a Dice coefficient (mean ± SD and 95% CI) for Res-Net with cross-entropy loss (0.86 ± 0.05 and 0.85-0.87), Dense-VNet with cross-entropy loss (0.85 ± 0.058 and 0.84-0.86), and NN-UNet with Dice plus XEnt (0.87 ± 0.05 and 0.86-0.88). There were no statistically significant differences between the 3 models for various quantitative segmentation metrics. In addition, they reported an RE% <5% for SUVmax, SUVmean, and SUVmedian in the NN-UNet with Dice plus XEnt model. 22
Despite the potential of DL-based segmentation models, their performance depends highly on the specific datasets used for training. These algorithms require large, heterogeneous data sets to provide robust and generalizable models. Creating large data sets for data-hungry DL models requires collaboration among different centers. Meanwhile, owing to legal/ethical and privacy issues, direct data sharing between centers is not always feasible. The FL framework can address these challenges by providing decentralized training procedures for DL models. ][25]
In the current study, we compared the performance of centralized and FL models for the segmentation of HNC PET images. Overall, a high consistency was observed between centralized and FL approaches in terms of quantitative image segmentation metrics, including Dice coefficient (0.84 ± 0.06 vs 0.84 ± 0.05) and Jaccard coefficient (0.73 ± 0.08 vs 0.73 ± 0.07). In terms of conventional PET image-derived quantitative metrics, consistency between the centralized and FL approaches was confirmed for SUVmean (6.43% ± 4.72% vs 6.61% ± 5.42%) and TLG (6.93% ± 9.6% vs 7.07% ± 9.85%). For SUVmax and SUVpeak, RE% and ARE% were almost zero. Overall, statistical analysis showed no significant differences (P > 0.05) between these 2 strategies for different quantitative metrics. 4][25]
Dayan et al 35 developed an FL-based model for predicting oxygen requirements in COVID-19 patients using vital signs, laboratory data, and chest x-ray images. They reported 16% and 38% improvement in average area under the curve and generalizability for FL-based models compared with center-based models. Gawali et al 66 reported an area under the receiver operating characteristic curve/F1 score of 0.95/0.72 and 0.93/0.62 with centralized and FL models for chest x-ray classification.
In a recent study by Feki et al, 53 FL-based models were evaluated for COVID-19 detection from chest x-ray images. They evaluated 2 different DL architectures for centralized and FL frameworks with different settings and reported that the FL-based method can achieve results comparable to centralized methods and remain robust in the presence of non-independent and identically distributed (non-IID) and unbalanced data. In another study by Lee et al, 67 an FL framework was tested for thyroid nodule malignancy classification using ultrasound images. They enrolled 8457 ultrasound images from 6 different centers and compared the performance of 5 different DL-based networks. They reported areas under the receiver operating characteristic curve of 78.88% to 87.56% and 82.61% to 91.57% for FL and centralized learning methods, respectively. It was concluded that FL-based techniques could potentially achieve the performance of centralized methods in the classification of benign and malignant lesions from ultrasound images. In Li et al, 68 multisite functional MRI analysis was performed using FL and domain adaptation to classify autism spectrum disorder versus healthy control subjects using brain functional connectivity. To tackle the domain shift issue, they proposed 2 different methods for boosting FL model performance and showed that FL model performance could be improved by domain adaptation. The proposed method could potentially be implemented in nuclear medicine FL studies to improve model performance.
In this work, we showed that building DL models from multiple decentralized data sets at multiple centers is possible via FL, where the local data sets remain in the respective centers. 69 Federated learning algorithms bear some inherent limitations. For instance, curious servers may be able to infer sensitive local data from trained models via different types of attacks, 70,71 such as membership inference attacks 72 and model inversion attacks. 73 Privacy-aware FL models have recently been introduced to address these additional privacy challenges at the expense of additional computational complexity and performance loss. 36,74,75 Another issue in FL is that malicious parties could perform data poisoning attacks during the training process, 76,77 that is, modify the labels of the data and upload random updates to the global model. 73,78 Two different categories of noise could arise during training. The first is the inherent noise present in the data sets (either in PET images or in segmentations), which could cause the learning process to diverge in local centers and affect the whole learning process when the noise magnitude exceeds what the network can handle. The second, specific to FL, is data poisoning induced by malicious parties during the training process, 76,77 that is, by modifying the labels of the data or by uploading random updates to the global model. 73,78
Another challenge in FL is how to harmonize data preprocessing, because it must be performed on data from different centers acquired with different protocols and settings. In the present study, all preprocessing was performed in a uniform manner, including conversion to SUV, cropping, and normalization, to provide reproducibility across different centers. 8][49][50][51][52][53] A number of challenges are linked to training on the data sets when implementing the FL approach, such as local computer capacity and communication between centers and local sites. Further studies should be carried out involving multiple real clinical centers (using a leave-one-center-out strategy) to tackle these challenges, specifically the communication bottleneck. Another limitation of this work is the lack of comparison of DL algorithms with conventional segmentation techniques. Yet, the main aim of the current study was the comparison of the FL approach with centralized training. In the current study, the data from each center were randomly divided into train/validation (80% of patients) and test sets (20% of patients) as a standard evaluation method, owing to the computational burden associated with alternative, more complex data-splitting methods. Further studies should use cross-validation strategies (eg, 10-fold) to assess the effect of permutations of data splitting on the robustness of FL compared with centralized DL models.

CONCLUSION
We evaluated the performance of a federated DL framework for PET image segmentation to enable robust decentralized learning without directly sharing data among clinical centers. We compared our proposed model with centralized models and achieved similar performance for an array of image segmentation metrics and quantitative PET features. Federated learning-based models provide robust and generalizable segmentation models while addressing the privacy concerns and legal and ethical issues in medical data sharing among clinical centers.

FIGURE 1. Federated learning flowchart. The global model generated by the server is distributed to different centers (A). Models are trained separately in each center (B). Trained models from all centers are returned to the server to be aggregated to form a new global model (C). Schematic of deep recurrent residual neural network (R2U-Net) (D).
Figure 3 presents a visual comparison between the 2 different learning strategies against manual segmentation (ground truth) via multiple 2D axial views of different lesions from each center. A magnified version of the same GTVs is illustrated in Figure 4. In Supplemental Figures 2-10 (http://links.lww.com/CNM/A378), different cases from various centers are shown, depicting the accuracy and complexity of the segmentation task according to textural GTV characteristics. In separate center-by-center analyses, strong consistency was observed between the 2 approaches in terms of quantitative image segmentation performance metrics. In Figure 5, model performance

FIGURE 4. Magnified 2D views of segmentations obtained from manual (red), centralized learning (green), and federated learning (blue) methods on patients from the 9 different centers.

FIGURE 5. Comparison of the performance of the centralized versus federated learning frameworks in terms of quantitative metrics.

FIGURE 6. Violin plots of RE% for quantitative PET metrics (SUV mean , MTV, and TLG) and radiomics shape features (sphericity, elongation, and asphericity) for centralized versus federated learning models.
In the federated objective, $F_k$, k ∈ {1, …, K}, are the local objective functions, and $\alpha_k$, k ∈ {1, …, K}, are nonnegative weighting coefficients satisfying $\sum_{k=1}^{K} \alpha_k = 1$. Let us consider the collection of centers (hospitals) to have a total of $N = \sum_{k=1}^{K} N_k$ data samples; then one can assign $\alpha_k = N_k / N$. The local objective functions are defined as empirical averages over the associated training sets as follows:
$$F_k(\theta) = \frac{1}{N_k} \sum_{i=1}^{N_k} \ell\big(\theta; x_i^{(k)}, y_i^{(k)}\big),$$
where $\ell$ denotes the per-sample loss.
Pairwise comparisons used the Wilcoxon rank sum test (Mann-Whitney U test), with P < 0.05 defined as the threshold for statistical significance.
Clinical Nuclear Medicine • Volume 47, Number 7, July 2022. PET Image Segmentation Using Federated Learning

TABLE 2. Summary of ARE (%) in Quantitative PET Metrics (Mean ± SD) for Centralized and Federated Learning Models Within Different Centers