Today, we are announcing a new one-click, validated cluster creation experience for Amazon SageMaker HyperPod that accelerates setup and prevents common misconfigurations, so you can launch your distributed training and inference clusters complete with Slurm or Amazon Elastic Kubernetes Service (Amazon EKS) orchestration, Amazon Virtual Private Cloud (Amazon VPC) networking, high-performance storage, and security built in by default.
With SageMaker HyperPod, you can efficiently scale tasks such as generative AI training, fine-tuning, or inference over clusters with hundreds or thousands of AI accelerators. The system continuously checks for hardware problems, resolves them automatically, and makes sure your workloads recover without manual intervention.
Previously, customers were required to set up a VPC, an Amazon Simple Storage Service (Amazon S3) bucket, AWS Identity and Access Management (IAM) roles, and other AWS resources as prerequisites for creating a SageMaker HyperPod cluster. This multi-step process created manual touch points where misconfiguration could occur.
With the new cluster creation experience, you can create your SageMaker HyperPod clusters, including the required prerequisite AWS resources, in one click, with prescriptive default values automatically applied. In this post, we explore the new cluster creation experience for Amazon SageMaker HyperPod.
Solution overview
SageMaker HyperPod offers two new options for creating clusters orchestrated by Slurm or Amazon EKS: quick setup and custom setup. Both options are available on the Amazon SageMaker AI console.
When you create a cluster, SageMaker HyperPod creates an AWS CloudFormation stack to deploy your cluster and supporting resources with your specified configurations.
With AWS CloudFormation, you can declaratively express the desired state of your cloud architectures using infrastructure as code (IaC) so that even complex compositions using multiple managed services—such as SageMaker HyperPod clusters and prerequisite resources—can be deployed in a single request consistently across multiple environments.
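If you want to see exactly which resources a deployment created on your behalf, you can inspect the stack. The following is a minimal sketch using boto3; the stack name is a hypothetical placeholder, and the actual name appears on the AWS CloudFormation console after cluster creation:

```python
import boto3

cfn = boto3.client("cloudformation")

# List the resources deployed by the HyperPod-created stack; the stack
# name below is a hypothetical placeholder.
response = cfn.describe_stack_resources(StackName="sagemaker-hyperpod-cluster-stack")
for resource in response["StackResources"]:
    print(resource["ResourceType"], resource["PhysicalResourceId"])
```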
In the following sections, we walk through the details of the quick setup and custom setup options, and provide screenshots of key configurations.
Quick setup
With quick setup, SageMaker HyperPod uses sensible defaults for instance groups, networking, orchestration, lifecycle configuration, permissions, and storage. You can also view which configurations are editable after the cluster is created and which would require the corresponding AWS resources to be recreated; if you want to edit such configurations, use the custom setup. Quick setup also offers automatic instance recovery for instances that become unhealthy or unresponsive.
For networking, quick setup creates a new VPC with subnets spread across the Availability Zones in your AWS Region. Within each Availability Zone, a public /24 subnet is created for internet access through a NAT gateway, a private /24 subnet is created to facilitate EKS control plane communications, and a private /16 subnet is created for placing accelerated instance group capacity. A new security group is also configured with the required rules to allow Elastic Fabric Adapter (EFA) and Amazon FSx for Lustre network traffic.
Using a private /16 subnet as the default for SageMaker HyperPod instances supports over 65,000 private IP addresses, which is important for accommodating large clusters of accelerated instances, where each host can consume multiple IP addresses.
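You can sanity-check that address math with Python's standard ipaddress module; the CIDR shown is only an example, and the usable count accounts for the five addresses AWS reserves in every subnet:

```python
import ipaddress

# A /16 network contains 65,536 addresses in total; AWS reserves the first
# four addresses and the last address of each subnet.
subnet = ipaddress.ip_network("10.1.0.0/16")  # example CIDR only
print(subnet.num_addresses)      # 65536
print(subnet.num_addresses - 5)  # 65531 usable private IPs
```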
For Amazon EKS orchestration, quick setup creates a new EKS cluster using the latest supported Kubernetes version with available operators enabled, including the EFA, Neuron, and NVIDIA device plugins; the health monitoring agent (HMA); the Kubeflow training operators; and the SageMaker HyperPod inference operator.
Quick setup also creates a new S3 bucket to store the default lifecycle scripts for instance setup and configuration, a new IAM role with the necessary permissions for the SageMaker HyperPod cluster, and a new FSx for Lustre file system for high-performance data storage and retrieval.
Custom setup
With a custom setup, you have the flexibility to choose how your SageMaker HyperPod cluster is configured at a more granular level across the same dimensions.
Although automatic node recovery is still recommended to reboot or replace faulty nodes when issues are detected, with a custom setup for Amazon EKS orchestration, you can selectively disable this feature if you need more control over the recovery process for manual troubleshooting or testing.

When continuous provisioning mode is enabled, SageMaker HyperPod allows concurrent initiation of multiple operations; parallel execution of scale-up, scale-down, and AMI update operations within a single instance group; and cluster creation even if not all requested instances are immediately available. This option provides more flexibility and faster operations by allowing multiple changes to be made simultaneously, which can reduce overall deployment and update times.
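If you choose to disable automatic recovery, the setting corresponds to the NodeRecovery field of the SageMaker CreateCluster API. The following is a minimal, hedged sketch using boto3; the cluster name, ARNs, and IDs are hypothetical placeholders, and a real call would use your own instance group configuration:

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Minimal sketch: create an EKS-orchestrated HyperPod cluster with automatic
# node recovery disabled. All names, ARNs, and IDs are hypothetical.
sagemaker.create_cluster(
    ClusterName="my-hyperpod-cluster",
    Orchestrator={
        "Eks": {"ClusterArn": "arn:aws:eks:us-west-2:111122223333:cluster/my-eks-cluster"}
    },
    NodeRecovery="None",  # "Automatic" (recommended) reboots or replaces faulty nodes
    InstanceGroups=[
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.g5.8xlarge",
            "InstanceCount": 4,
            "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole",
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/",
                "OnCreate": "on_create.sh",
            },
        }
    ],
    VpcConfig={
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "Subnets": ["subnet-0123456789abcdef0"],
    },
)
```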
Custom setup gives you the option to create a new VPC with a custom CIDR range and target specific Availability Zones for subnet creation based on the location of your accelerated compute capacity. You can also reference an existing VPC and security group for SageMaker HyperPod cluster deployment, which is useful if you intend to use an existing EKS cluster for orchestration or attach an existing FSx for Lustre file system.
For Amazon EKS orchestration, you can create a new EKS cluster with the option to select supported Kubernetes versions along with two or more private subnets that Amazon EKS will use to provision two elastic network interfaces (ENIs) to establish network connectivity between the Kubernetes API server and your VPC. If you prefer to use an existing EKS cluster, you can select it by name using the custom setup.
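For reference, creating an EKS cluster with two private subnets looks roughly like the following boto3 sketch; the subnet IDs, security group, role ARN, and Kubernetes version are hypothetical examples:

```python
import boto3

eks = boto3.client("eks")

# Minimal sketch: create an EKS cluster for HyperPod orchestration.
# EKS provisions ENIs in the private subnets to connect the Kubernetes
# API server with your VPC. All IDs and ARNs are placeholders.
eks.create_cluster(
    name="my-hyperpod-eks-cluster",
    version="1.31",  # example version; choose a supported one
    roleArn="arn:aws:iam::111122223333:role/EKSClusterRole",
    resourcesVpcConfig={
        "subnetIds": [
            "subnet-0123456789abcdef0",  # private subnet, AZ 1
            "subnet-0fedcba9876543210",  # private subnet, AZ 2
        ],
        "securityGroupIds": ["sg-0123456789abcdef0"],
    },
)
```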
You also have granular control over which operators are installed in your EKS cluster using the default Helm charts, based on the specific requirements of your workload. Note that some of these components are required and must be installed for SageMaker HyperPod clusters to operate successfully.
With a custom setup, you can choose to use custom lifecycle scripts from an existing S3 bucket for advanced configuration needs like installing custom machine learning (ML) frameworks or specific versions of dependencies, deploying proprietary software or tools, and configuring specific network optimizations. You can also assign an existing IAM role to the SageMaker HyperPod cluster to accommodate specific permission requirements. For storage, you have the flexibility to integrate an existing FSx for Lustre file system, provision a new file system with multiple throughput and storage capacity options, or skip file system provisioning if it’s not yet needed.
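These options map to fields on each instance group. The following hypothetical fragment, shaped like an InstanceGroups entry in the create_cluster call sketched earlier, points at custom lifecycle scripts in an existing S3 bucket and an existing IAM role; all names are placeholders:

```python
# Hypothetical InstanceGroups entry using custom lifecycle scripts and an
# existing IAM role; bucket, script, and role names are placeholders.
instance_group = {
    "InstanceGroupName": "worker-group-1",
    "InstanceType": "ml.p5.48xlarge",
    "InstanceCount": 8,
    # HyperPod downloads scripts from this prefix and runs the OnCreate
    # entry point during instance setup
    "LifeCycleConfig": {
        "SourceS3Uri": "s3://amzn-s3-demo-bucket/custom-lifecycle-scripts/",
        "OnCreate": "on_create.sh",
    },
    # Existing role granting the permissions your cluster instances need
    "ExecutionRole": "arn:aws:iam::111122223333:role/ExistingHyperPodRole",
}
```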
Add an instance group
For both quick and custom setup options, you can add a new instance group to your SageMaker HyperPod cluster from the SageMaker AI console.
You can choose between standard instance groups, which provide a general-purpose computing environment without additional security restrictions, or restricted instance groups (RIGs) to provision a specialized environment within SageMaker HyperPod that provides an isolated space for training customized Amazon Nova models.
You can select On-Demand capacity for one-time workloads and testing, or flexible training plans for predictable access to accelerated compute resources within your timeline and budget for planned, large-scale training jobs. With flexible training plans, you can schedule capacity on the latest P6-B200 instances and P6e-GB200 UltraServers powered by NVIDIA Blackwell Tensor Core GPUs. If you need to provision an instance group for long-term usage, you can contact AWS to reserve capacity for a longer duration.
With Amazon EKS orchestration, for each instance group that you add, you can enable stress and connectivity deep health checks. These deep health checks are performed in addition to the orchestrator-agnostic basic health checks that also apply to Slurm-orchestrated clusters. Stress checks exercise hardware components under load to identify potential issues with GPUs, memory, and other hardware. Connectivity checks test network connectivity between nodes to verify that they can communicate properly for distributed training.
With advanced configuration, you can choose the number of threads that run on each CPU core of your Amazon Elastic Compute Cloud (Amazon EC2) instances. Choosing one thread per core disables multi-threading. Each core runs a single thread, which can provide more predictable performance for applications that benefit from dedicated core resources, such as certain high-performance computing workloads. Choosing two threads per core enables multi-threading. Each physical core runs two threads simultaneously, potentially increasing throughput for multi-threaded applications at the cost of some per-thread performance.
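Both the deep health checks and the threads-per-core setting are configured per instance group. The following is a hedged sketch of an InstanceGroups entry enabling stress and connectivity checks and one thread per core; all names and ARNs are placeholders:

```python
# Hypothetical InstanceGroups entry with deep health checks enabled and
# multi-threading disabled; names and ARNs are placeholders.
instance_group = {
    "InstanceGroupName": "training-group",
    "InstanceType": "ml.p5.48xlarge",
    "InstanceCount": 16,
    "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole",
    "LifeCycleConfig": {
        "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/",
        "OnCreate": "on_create.sh",
    },
    # Run stress and connectivity deep health checks when instances start
    "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"],
    # 1 = one thread per core (multi-threading disabled); 2 enables it
    "ThreadsPerCore": 1,
}
```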
Download your CloudFormation template parameters
For further customization and reuse, you can download a copy of the CloudFormation template from the SageMaker AI console with the parameters you selected preconfigured. You can use this template with continuous delivery tools like AWS CodePipeline to automatically build and test changes before promoting them to production stacks. With CodePipeline, you can create parameter overrides in a template configuration file to input custom values when you create or update a stack across different dev, test, and production environments.
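As an illustration, the following boto3 sketch redeploys the downloaded template with overridden parameters; the template URL, stack name, and parameter keys are hypothetical and depend on the template you downloaded:

```python
import boto3

cfn = boto3.client("cloudformation")

# Minimal sketch: redeploy the downloaded template in another environment
# with parameter overrides. The template URL, stack name, and parameter
# keys are hypothetical placeholders.
cfn.create_stack(
    StackName="hyperpod-cluster-dev",
    TemplateURL="https://amzn-s3-demo-bucket.s3.amazonaws.com/hyperpod-template.yaml",
    Parameters=[
        {"ParameterKey": "ClusterName", "ParameterValue": "dev-cluster"},
        {"ParameterKey": "AcceleratedInstanceCount", "ParameterValue": "2"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],  # template creates IAM resources
)
```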
Conclusion
SageMaker HyperPod now offers an enhanced one-click deployment experience to set up purpose-built, resilient infrastructure for training and deploying large ML models. With the quick setup option, you can take advantage of prescriptive defaults, and the custom setup option provides the flexibility to tailor distributed training environments to meet specialized requirements. With IaC through AWS CloudFormation, you get a declarative expression of your SageMaker HyperPod cluster environment that can be version controlled, further customized, and integrated into continuous delivery pipelines.
Get started today by visiting the SageMaker AI console and creating a new SageMaker HyperPod cluster.
About the authors
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.
Cindy Zhao is a Software Development Engineer based in Seattle. She focuses on building large-scale ML infrastructure with Amazon SageMaker HyperPod, helping customers set up secure and reliable clusters for foundation model training. Outside of work, she enjoys traveling and spending time with her cat.
Nathan Arnold is a Senior AI/ML Specialist Solutions Architect at AWS based in Austin, Texas. He helps AWS customers, from small startups to large enterprises, train and deploy foundation models efficiently on AWS. When he’s not working with customers, he enjoys hiking, trail running, and playing with his dogs.