Federated learning on AWS using FedML, Amazon EKS, and Amazon SageMaker

This post is co-written with Chaoyang He, Al Nevarez and Salman Avestimehr from FedML.

Many organizations are implementing machine learning (ML) to enhance their business decision-making through automation and the use of large distributed datasets. With increased access to data, ML has the potential to provide unparalleled business insights and opportunities. However, the sharing of raw, non-sanitized sensitive information across different locations poses significant security and privacy risks, especially in regulated industries such as healthcare.

To address this issue, federated learning (FL) is a decentralized and collaborative ML training technique that offers data privacy while maintaining accuracy and fidelity. Unlike traditional ML training, FL training occurs within an isolated client location using an independent secure session. The client only shares its output model parameters with a centralized server, known as the training coordinator or aggregation server, and not the actual data used to train the model. This approach alleviates many data privacy concerns while enabling effective collaboration on model training.

Although FL is a step towards achieving better data privacy and security, it’s not a guaranteed solution. Insecure networks lacking access control and encryption can still expose sensitive information to attackers. Additionally, locally trained information can expose private data if reconstructed through an inference attack. To mitigate these risks, the FL model uses personalized training algorithms and effective masking and parameterization before sharing information with the training coordinator. Strong network controls at local and centralized locations can further reduce inference and exfiltration risks.

In this post, we share an FL approach using FedML, Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker to improve patient outcomes while addressing data privacy and security concerns.
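
To make the role of the aggregation server concrete, here is a minimal sketch of one aggregation round using federated averaging, a common scheme in which client parameters are averaged with weights proportional to each client's local sample count. This is an illustration under assumptions, not the aggregation code used by FedML or by this post: the `Params` type, the `federated_average` function, and the toy numbers are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a model is a mapping from parameter name
# to a flat list of floats. Real frameworks use tensors instead.
Params = Dict[str, List[float]]


def federated_average(updates: List[Tuple[Params, int]]) -> Params:
    """Average client parameters, weighted by each client's sample count.

    Only parameters and counts reach this function; raw training data
    never leaves the client.
    """
    if not updates:
        raise ValueError("need at least one client update")
    total = sum(count for _, count in updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    reference, _ = updates[0]
    averaged: Params = {}
    for name, values in reference.items():
        acc = [0.0] * len(values)
        for params, count in updates:
            weight = count / total
            for i, value in enumerate(params[name]):
                acc[i] += weight * value
        averaged[name] = acc
    return averaged


# Toy, made-up updates from two clients (parameters, local sample count).
client_a = ({"w": [1.0, 2.0], "b": [0.0]}, 100)
client_b = ({"w": [3.0, 4.0], "b": [1.0]}, 300)

global_params = federated_average([client_a, client_b])
print(global_params)  # {'w': [2.5, 3.5], 'b': [0.75]}
```

Because client B holds three times as many samples as client A, its parameters carry three times the weight in the global model.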
The need for federated learning in healthcare

Healthcare relies heavily on distributed data sources to make accurate predictions and assessments about patient care. Limiting the available data sources to protect privacy negatively affects result accuracy and, ultimately, the quality of patient care. Therefore, ML creates challenges for AWS customers who need to ensure privacy and security across distributed entities without compromising patient outcomes.

Healthcare organizations must navigate strict compliance regulations, such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States, while implementing FL solutions. Ensuring data privacy, security, and compliance becomes even more critical in healthcare, requiring robust encryption, access controls, auditing mechanisms, and secure communication protocols. Additionally, healthcare datasets often contain complex and heterogeneous data types, making data standardization and interoperability a challenge in FL settings.

Use case overview

The use case in this post involves heart disease data held by different organizations, on which an ML model runs classification algorithms to predict heart disease in patients. Because this data is spread across organizations, we use federated learning to collate the findings.

The Heart Disease dataset from the University of California Irvine’s Machine Learning Repository is a widely used dataset for cardiovascular research and predictive modeling. It consists of 303 samples, each representing a patient, and contains a combination of clinical and demographic attributes, as well as the presence or absence of heart disease. This multivariate dataset has 76 attributes in the patient information, of which 14 are most commonly used for developing and evaluating ML algorithms that predict the presence of heart disease.
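
As an illustration of the kind of binary classifier a single organization might train on its own records, the sketch below fits a logistic regression with plain gradient descent. Everything here is an assumption for illustration: the post does not name the model at this point, the function names are hypothetical, and the four toy rows are made up rather than drawn from the Heart Disease dataset.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x: List[float], weights: List[float], bias: float) -> float:
    """Probability of the positive class (for example, disease present)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_local(
    features: List[List[float]],
    labels: List[int],
    lr: float = 0.5,
    epochs: int = 1000,
) -> Tuple[List[float], float]:
    """Fit logistic regression by batch gradient descent on local data only."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = predict_proba(x, weights, bias) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        # Step along the negative mean gradient of the log loss.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Made-up, pre-scaled toy rows with two hypothetical features each.
toy_features = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.7]]
toy_labels = [0, 0, 1, 1]

weights, bias = train_local(toy_features, toy_labels)
predictions = [int(predict_proba(x, weights, bias) >= 0.5) for x in toy_features]
print(predictions)
```

In a federated setting, only the fitted `weights` and `bias` (plus the local sample count) would be shared with the coordinator, never the rows themselves.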
FedML framework

There is a wide selection of FL frameworks, but we decided to use the FedML framework for this use case because it is open source and supports several FL paradigms. FedML provides a popular open source library, MLOps platform, and application ecosystem for FL. These facilitate the development and deployment of FL solutions. It provides a comprehensive suite of tools, libraries, and algorithms that enable researchers and practitioners to implement and experiment with FL algorithms in a distributed environment. FedML addresses the challenges of data privacy, communication, and model aggregation in FL, offering a user-friendly interface and customizable components. With its focus on collaboration and knowledge sharing, FedML aims to accelerate the adoption of FL and drive innovation in this emerging field. The FedML framework is model agnostic, including recently added support for large language models (LLMs). For more information, refer to Releasing FedLLM: Build Your Own Large Language Models on Proprietary Data using the FedML Platform.

FedML Octopus

System hierarchy and heterogeneity are key challenges in real-life FL use cases, where different data silos may have different infrastructure with CPUs and GPUs. In such scenarios, you can use FedML Octopus.

FedML Octopus is the industrial-grade platform of cross-silo FL for cross-organization and cross-account training. Coupled with FedML MLOps, it enables developers or organizations to conduct open collaboration from anywhere at any scale in a secure manner. FedML Octopus runs a distributed training paradigm inside each data silo and uses synchronous or asynchronous training.

FedML MLOps

FedML MLOps enables local development of code that can later be deployed anywhere using FedML frameworks. Before initiating training, you must create a FedML account, as well as create and upload the server and client packages in FedML Octopus.
For more details, refer to steps and Introducing FedML Octopus: scaling federated learning into production with simplified MLOps.

Solution overview

We deploy FedML into multiple EKS clusters integrated with SageMaker for experiment tracking. We use Amazon EKS Blueprints for Terraform to deploy the required infrastructure. EKS Blueprints helps compose complete EKS clusters that are fully bootstrapped with the operational software that is needed to deploy and operate workloads. With EKS Blueprints, the configuration for the desired state of EKS environment, such as the control plane, worker nodes, and Kubernetes add-ons, is described as an infrastructure as code (IaC) blueprint. After a blueprint is configured, it can be used to create consistent environments across multiple AWS accounts and Regions using continuous deployment automation.

The content shared in this post reflects real-life situations and experiences, but it’s important to note that the deployment of these situations in different locations may vary. Although we utilize a single AWS account with separate VPCs, it’s crucial to understand that individual circumstances and configurations may differ. Therefore, the information provided should be used as a general guide and may require adaptation based on specific requirements and local conditions.

The following diagram illustrates our solution architecture.

In addition to the tracking provided by FedML MLOps for each training run, we use Amazon SageMaker Experiments to track the performance of each client model and the centralized (aggregator) model. SageMaker Experiments is a capability of SageMaker that lets you create, manage, analyze, and compare your ML experiments. By recording experiment details, parameters, and results, researchers can accurately reproduce and validate their work. It allows for effective comparison and analysis of different approaches, leading to informed decision-making.
Additionally, tracking experiments facilitates iterative improvement by providing insights into the progression of models and enabling researchers to learn from previous iterations, ultimately accelerating the development of more effective solutions.

We send the following to SageMaker Experiments for each run:

Model evaluation metrics – Training loss and Area Under the Curve (AUC)
Hyperparameters – Epoch, learning rate, batch size, optimizer, and weight decay

Prerequisites

To follow along with this post, you should have the following prerequisites:

Deploy the solution

To begin, clone the repository hosting the sample code locally:

    git clone git@ssh.gitlab.aws.dev:west-ml-sa/fl_fedml.ai.git

Then deploy the use case infrastructure using the following commands:

    terraform init
    terraform apply

The Terraform template may take 20–30 minutes to fully deploy. After it’s deployed, follow the steps in the next sections to run the FL application.

Create an MLOps deployment package

As a part of the FedML documentation, we need to create the client and server packages, which the MLOps platform will distribute to the server and clients to begin training.

To create these packages, run the following script found in the root directory:

This will create the respective packages in the following directory in the project’s root directory:

Upload the packages to the FedML MLOps platform

Complete the following steps to upload the packages:

On the FedML UI, choose My Applications in the navigation pane.
Choose New Application.
Upload the client and server packages from your workstation.
You can also adjust the hyperparameters or create new ones.

Trigger federated training

To run federated training, complete the following steps:

On the FedML UI, choose Project List in the navigation pane.
Choose Create a new project.
Enter a group name and a project name, then choose OK.
Choose the newly created project and choose Create new run to trigger a…
