Chat-based assistants powered by Retrieval Augmented Generation (RAG) are transforming customer support, internal help desks, and enterprise search by delivering fast, accurate answers grounded in your own data. With RAG, you can use a ready-to-deploy foundation model (FM) and enrich it with your own data, making responses relevant and context-aware without the need for fine-tuning or retraining. Running these chat-based assistants on Amazon Elastic Kubernetes Service (Amazon EKS) gives you the flexibility to use a variety of FMs while retaining full control over your data and infrastructure.
Amazon EKS scales with your workload and is cost-efficient for both steady and fluctuating demand. Because Amazon EKS is certified Kubernetes-conformant, it is compatible with existing applications running in any standard Kubernetes environment, whether hosted in on-premises data centers or public clouds. For your data plane, you can take advantage of a wide range of compute options, including CPUs, GPUs, AWS purpose-built AI chips (AWS Inferentia and AWS Trainium), and Arm-based CPUs (AWS Graviton), to match performance and cost requirements. This flexibility makes Amazon EKS an ideal candidate for running heterogeneous workloads, because you can combine different compute types within the same cluster to optimize both performance and cost efficiency.
NVIDIA NIM microservices are containerized services that deploy and serve FMs, integrating with AWS services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon EKS, and Amazon SageMaker. NIM microservices are distributed as Docker containers and are available through the NVIDIA NGC Catalog. Deploying GPU-accelerated models manually requires you to select and configure runtimes such as PyTorch or TensorFlow, set up inference servers such as Triton, implement model optimizations, and troubleshoot compatibility issues, which takes engineering time and expertise. NIM microservices eliminate this complexity by automating these technical decisions and configurations for you.
The NVIDIA NIM Operator is a Kubernetes management tool that facilitates the operation of model-serving components and services. It handles large language models (LLMs), embedders, and other model types through NVIDIA NIM microservices within Kubernetes environments. The Operator streamlines microservice management through three primary custom resources. First, the NIMCache resource facilitates model downloading from NGC and persistence on network storage. This enables multiple microservice instances to share a single cached model, improving microservice startup time. Second, the NIMService resource manages individual NIM microservices, creating Kubernetes deployments within specified namespaces. Third, the NIMPipeline resource functions as an orchestrator for multiple NIM service resources, allowing coordinated management of service groups. This architecture enables efficient operation and lifecycle management, with particular emphasis on reducing inference latency through model caching and supporting automated scaling capabilities.
NVIDIA NIM, coupled with the NVIDIA NIM Operator, provides a streamlined solution to the deployment complexities described earlier. In this post, we demonstrate the implementation of a practical RAG chat-based assistant using a comprehensive stack of modern technologies. The solution uses NVIDIA NIM microservices for both LLM inference and text embedding, with the NIM Operator handling their deployment and management. The architecture incorporates Amazon OpenSearch Serverless to store and query high-dimensional vector embeddings for similarity search.
The underlying Kubernetes infrastructure of the solution is provided by EKS Auto Mode, which supports GPU-accelerated Amazon Machine Images (AMIs) out of the box. These images include the NVIDIA device plugin, the NVIDIA container toolkit, precompiled NVIDIA kernel drivers, the Bottlerocket operating system, and Elastic Fabric Adapter (EFA) networking. You can use Auto Mode with accelerated AMIs to spin up GPU instances without manually installing and configuring GPU software components. Simply specify GPU-based instance types when creating Karpenter NodePools, and EKS Auto Mode will launch GPU-ready worker nodes to run your accelerated workloads.
Solution overview
The following architecture diagram shows how NVIDIA NIM microservices running on Amazon EKS Auto Mode power our RAG chat-based assistant solution. The design, shown in the following diagram, combines GPU-accelerated model serving with vector search in Amazon OpenSearch Serverless, using the NIM Operator to manage model deployment and caching through persistent Amazon Elastic File System (Amazon EFS) storage.
Solution diagram (numbers indicate steps in the solution walkthrough section)
The solution follows these high-level steps:
- Create an EKS cluster
- Set up Amazon OpenSearch Serverless
- Create an EFS file system and set up necessary permissions
- Create a Karpenter GPU NodePool
- Install NVIDIA Node Feature Discovery (NFD) and the NIM Operator
- Create the nim-service namespace and NVIDIA secrets
- Create NIMCaches
- Create NIMServices
Solution walkthrough
In this section, we walk through the implementation of this RAG chat-based assistant solution step by step. We create an EKS cluster, configure Amazon OpenSearch Serverless and EFS storage, set up GPU-enabled nodes with Karpenter, deploy NVIDIA components for model serving, and finally integrate a chat-based assistant client using Gradio and LangChain. This end-to-end setup demonstrates how to combine LLM inference on Kubernetes with vector search capabilities, forming the foundation for a scalable, production-grade system—pending the addition of monitoring, auto scaling, and reliability features.
Prerequisites
To begin, ensure you have installed and set up the following required tools:
- AWS CLI (version aws-cli/2.27.11 or later)
- kubectl
- eksctl (version v0.195.0 or later to support Auto Mode)
- Helm
These tools need to be properly configured according to the Amazon EKS setup documentation.
Clone the reference repository and change into its infra folder:
git clone https://github.com/aws-samples/sample-rag-chatbot-nim
cd sample-rag-chatbot-nim/infra
Environment setup
You need an NGC API key to authenticate and download NIM models. To generate the key, you can enroll (for free) in the NVIDIA Developer Program and then follow the NVIDIA guidelines.
Next, set up a few environment variables (replace the values with your information):
export CLUSTER_NAME=automode-nims-blog-cluster
export AWS_DEFAULT_REGION={your region}
export NVIDIA_NGC_API_KEY={your key}
Pattern deployment
To deploy the solution, complete the steps in the following sections.
Create an EKS cluster
Deploy the EKS cluster using EKS Auto Mode with eksctl:
CHATBOT_SA_NAME=${CLUSTER_NAME}-client-service-account
IAM_CHATBOT_ROLE=${CLUSTER_NAME}-client-eks-pod-identity-role
cat << EOF | eksctl create cluster -f -
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${CLUSTER_NAME}
  region: ${AWS_DEFAULT_REGION}
autoModeConfig:
  enabled: true
iam:
  podIdentityAssociations:
    - namespace: default
      serviceAccountName: ${CHATBOT_SA_NAME}
      createServiceAccount: true
      roleName: ${IAM_CHATBOT_ROLE}
      permissionPolicy:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action:
              - "aoss:*"
            Resource: "*"
addons:
  - name: aws-efs-csi-driver
    useDefaultPodIdentityAssociations: true
EOF
Pod Identity Associations connect Kubernetes service accounts to AWS Identity and Access Management (IAM) roles, allowing pods to access AWS services securely. In this configuration, a service account will be created and associated with an IAM role, granting it full permissions to OpenSearch Serverless (in a production environment, restrict privileges according to the principle of least privilege).
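One possible tightening is to limit the role to data-plane access on the specific collection. The following fragment is a sketch of what the permissionPolicy block could look like once the collection exists; the Resource ARN is a placeholder you would fill in with your AWS Region, account ID, and the collection ID created later in this walkthrough:

permissionPolicy:
  Version: "2012-10-17"
  Statement:
    - Effect: Allow
      Action:
        - "aoss:APIAccessAll"   # data-plane API access only, instead of aoss:*
      Resource: "arn:aws:aoss:<region>:<account-id>:collection/<collection-id>"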
NIMCaches require volume AccessMode: ReadWriteMany. Amazon Elastic Block Store (Amazon EBS) volumes provided by EKS Auto Mode aren’t suitable because they support ReadWriteOnce only and can’t be mounted by multiple nodes. Storage options that support AccessMode: ReadWriteMany include Amazon EFS, as shown in this example, or Amazon FSx for Lustre, which offers higher performance for workloads with more demanding throughput or latency requirements.
The preceding command takes a few minutes to complete. When it finishes, eksctl configures your kubeconfig and points it to the new cluster. You can validate that the cluster is up and running and that the EFS CSI driver add-on is installed by entering the following command:
kubectl get pods --all-namespaces
Expected output:
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system efs-csi-controller-55b8dd6f57-wpzbg 3/3 Running 0 3m7s
kube-system efs-csi-controller-55b8dd6f57-z2gzc 3/3 Running 0 3m7s
kube-system efs-csi-node-6k5kz 3/3 Running 0 3m7s
kube-system efs-csi-node-pvv2v 3/3 Running 0 3m7s
kube-system metrics-server-6d67d68f67-7x4tg 1/1 Running 0 6m15s
kube-system metrics-server-6d67d68f67-l4xv6 1/1 Running 0 6m15s
Set up Amazon OpenSearch Serverless
A vector database stores and searches through numerical representations of text (embeddings). Such a component is essential in RAG chat-based assistant architectures because it facilitates finding relevant information related to a user question based on semantic similarity rather than exact keyword matches.
We use Amazon OpenSearch Service as the vector database. OpenSearch Service provides a managed solution for deploying, operating, and scaling OpenSearch clusters within AWS Cloud infrastructure. As part of this service, Amazon OpenSearch Serverless offers an on-demand configuration that automatically handles scaling to match your application’s requirements.
First, using AWS PrivateLink, create a private connection between the cluster’s Amazon Virtual Private Cloud (Amazon VPC) and Amazon OpenSearch Serverless. This keeps traffic within the AWS network and avoids public internet routing.
Enter the following commands to retrieve the cluster’s virtual private cloud (VPC) ID, CIDR block range, and subnet IDs, and store them in corresponding environment variables:
VPC_ID=$(aws eks describe-cluster \
  --name $CLUSTER_NAME \
  --query "cluster.resourcesVpcConfig.vpcId" \
  --output text \
  --region=$AWS_DEFAULT_REGION) &&
CIDR_RANGE=$(aws ec2 describe-vpcs \
  --vpc-ids $VPC_ID \
  --query "Vpcs[].CidrBlock" \
  --output text \
  --region $AWS_DEFAULT_REGION) &&
SUBNET_IDS=($(aws eks describe-cluster \
  --name $CLUSTER_NAME \
  --query "cluster.resourcesVpcConfig.subnetIds[]" \
  --region $AWS_DEFAULT_REGION \
  --output text))
Use the following code to create a security group for OpenSearch Serverless in the VPC, add an inbound rule to the security group allowing HTTPS traffic (port 443) from your VPC’s CIDR range, and create an OpenSearch Serverless VPC endpoint connected to the subnets and security group:
AOSS_SECURITY_GROUP_ID=$(aws ec2 create-security-group \
  --group-name ${CLUSTER_NAME}-AOSSSecurityGroup \
  --description "${CLUSTER_NAME} AOSS security group" \
  --vpc-id $VPC_ID \
  --region $AWS_DEFAULT_REGION \
  --query 'GroupId' \
  --output text) &&
aws ec2 authorize-security-group-ingress \
  --group-id $AOSS_SECURITY_GROUP_ID \
  --protocol tcp \
  --port 443 \
  --region $AWS_DEFAULT_REGION \
  --cidr $CIDR_RANGE &&
VPC_ENDPOINT_ID=$(aws opensearchserverless create-vpc-endpoint \
  --name ${CLUSTER_NAME}-aoss-vpc-endpoint \
  --subnet-ids "${SUBNET_IDS[@]}" \
  --security-group-ids $AOSS_SECURITY_GROUP_ID \
  --region $AWS_DEFAULT_REGION \
  --vpc-id $VPC_ID \
  --query 'createVpcEndpointDetail.id' \
  --output text)
In the following steps, create an OpenSearch Serverless collection (a logical unit to store and organize documents).
- Create an encryption policy for the collection:
AOSS_COLLECTION_NAME=${CLUSTER_NAME}-collection
ENCRYPTION_POLICY_NAME=${CLUSTER_NAME}-encryption-policy
aws opensearchserverless create-security-policy \
  --name ${ENCRYPTION_POLICY_NAME} \
  --type encryption \
  --policy "{\"Rules\":[{\"ResourceType\":\"collection\",\"Resource\":[\"collection/${AOSS_COLLECTION_NAME}\"]}],\"AWSOwnedKey\":true}"
- Create the network policy that restricts access to the collection so that it only comes through the VPC endpoint created earlier:
NETWORK_POLICY_NAME=${CLUSTER_NAME}-network-policy
aws opensearchserverless create-security-policy \
  --name ${NETWORK_POLICY_NAME} \
  --type network \
  --policy "[{\"Description\":\"Allow VPC endpoint access\",\"Rules\":[{\"ResourceType\":\"collection\",\"Resource\":[\"collection/${AOSS_COLLECTION_NAME}\"]}],\"SourceVPCEs\":[\"$VPC_ENDPOINT_ID\"]}]"
- Create the data access policy that grants the chat-based assistant IAM role permissions to interact with indexes in the collection:
DATA_POLICY_NAME=${CLUSTER_NAME}-data-policy
IAM_CHATBOT_ROLE_ARN=$(aws iam get-role --role-name ${IAM_CHATBOT_ROLE} --query 'Role.Arn' --output text)
aws opensearchserverless create-access-policy \
  --name ${DATA_POLICY_NAME} \
  --type data \
  --policy "[{\"Rules\":[{\"ResourceType\":\"index\",\"Resource\":[\"index/${AOSS_COLLECTION_NAME}/*\"],\"Permission\":[\"aoss:CreateIndex\",\"aoss:DescribeIndex\",\"aoss:ReadDocument\",\"aoss:WriteDocument\",\"aoss:UpdateIndex\",\"aoss:DeleteIndex\"]}],\"Principal\":[\"${IAM_CHATBOT_ROLE_ARN}\"]}]"
- Create the OpenSearch Serverless collection itself:
AOSS_COLLECTION_ID=$(aws opensearchserverless create-collection \
  --name ${AOSS_COLLECTION_NAME} \
  --type VECTORSEARCH \
  --region ${AWS_DEFAULT_REGION} \
  --query 'createCollectionDetail.id' \
  --output text)
Create an EFS file system and set up necessary permissions
Create an EFS file system:
EFS_FS_ID=$(aws efs create-file-system \
  --region $AWS_DEFAULT_REGION \
  --performance-mode generalPurpose \
  --query 'FileSystemId' \
  --output text)
EFS requires mount targets, which are VPC network endpoints that connect your EKS nodes to the EFS file system. These mount targets must be reachable from your EKS worker nodes, and access is controlled using security groups.
- Execute the following command to set up the mount targets and configure the necessary security group rules:
EFS_SECURITY_GROUP_ID=$(aws ec2 create-security-group \
  --group-name ${CLUSTER_NAME}-EfsSecurityGroup \
  --description "${CLUSTER_NAME} EFS security group" \
  --vpc-id $VPC_ID \
  --region $AWS_DEFAULT_REGION \
  --query 'GroupId' \
  --output text) &&
aws ec2 authorize-security-group-ingress \
  --group-id $EFS_SECURITY_GROUP_ID \
  --protocol tcp \
  --port 2049 \
  --region $AWS_DEFAULT_REGION \
  --cidr $CIDR_RANGE &&
for subnet in "${SUBNET_IDS[@]}"; do
  aws efs create-mount-target \
    --file-system-id $EFS_FS_ID \
    --subnet-id $subnet \
    --security-groups $EFS_SECURITY_GROUP_ID \
    --region $AWS_DEFAULT_REGION
done
- Create the StorageClass in Amazon EKS for Amazon EFS:
cat << EOF | kubectl apply -f -
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: ${EFS_FS_ID}
  directoryPerms: "777"
EOF
- Validate the EFS storage class:
kubectl get storageclass efs
These are the expected results:
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
efs efs.csi.aws.com Delete Immediate false 9s
Create Karpenter GPU NodePool
To create the Karpenter GPU NodePool, enter the following code:
cat << EOF | kubectl apply -f -
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-node-pool
spec:
  template:
    metadata:
      labels:
        type: karpenter
        NodeGroupType: gpu-node-pool
    spec:
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
      taints:
        - key: nvidia.com/gpu
          value: "Exists"
          effect: "NoSchedule"
      requirements:
        - key: "eks.amazonaws.com/instance-family"
          operator: In
          values: ["g5"]
        - key: "eks.amazonaws.com/instance-size"
          operator: In
          values: ["2xlarge", "4xlarge", "8xlarge", "16xlarge", "12xlarge", "24xlarge"]
        - key: "kubernetes.io/arch"
          operator: In
          values: ["amd64"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["on-demand"]
  limits:
    cpu: "1000"
EOF
This NodePool is designed for GPU workloads using AWS G5 instances, which feature NVIDIA A10G GPUs. The taint ensures that only workloads specifically designed for GPU usage are scheduled on these nodes, maintaining efficient resource utilization. In a production environment, you might also consider using Amazon EC2 Spot Instances to optimize costs.
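For example, to let Karpenter also draw from Spot capacity for this pool, you could broaden the capacity-type requirement as shown in the following sketch; weigh Spot interruptions against the availability needs of your inference workloads:

- key: "karpenter.sh/capacity-type"
  operator: In
  values: ["spot", "on-demand"]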
Enter the following command to validate successful creation of the NodePool:
kubectl get nodepools
These are the expected results:
NAME NODECLASS NODES READY AGE
general-purpose default 0 True 15m
gpu-node-pool default 0 True 8s
system default 2 True 15m
The gpu-node-pool was created and has 0 nodes. To inspect the nodes further, enter this command:
kubectl get nodes -o custom-columns=NAME:.metadata.name,READY:"status.conditions[?(@.type=='Ready')].status",OS-IMAGE:.status.nodeInfo.osImage,INSTANCE-TYPE:.metadata.labels.'node.kubernetes.io/instance-type'
This is the expected output:
NAME READY OS-IMAGE INSTANCE-TYPE
i-0b0c1cd3d744883cd True Bottlerocket (EKS Auto) 2025.4.26 (aws-k8s-1.32) c6g.large
i-0e1f33e42fac76a09 True Bottlerocket (EKS Auto) 2025.4.26 (aws-k8s-1.32) c6g.large
There are two instances, launched by EKS Auto Mode with the non-accelerated Bottlerocket Amazon Machine Image (AMI) variant aws-k8s-1.32 and the CPU-only (non-GPU) instance type c6g.
Install NVIDIA NFD and NIM Operator
NFD is a Kubernetes add-on that detects available hardware capabilities and system settings and advertises them as node labels. NFD and the NIM Operator are installed using Helm charts, each with its own custom resource definitions (CRDs).
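Once NFD is running (installed in the following steps), its workers publish detected capabilities as node labels under the feature.node.kubernetes.io/ prefix. A quick way to inspect them on any node (replace the placeholder node name with one from kubectl get nodes):

kubectl get node <node-name> -o jsonpath='{.metadata.labels}' | tr ',' '\n' | grep feature.node.kubernetes.io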
- Before proceeding with installation, verify if related CRDs exist in your cluster:
# Check for NFD-related CRDs
kubectl get crds | grep nfd
# Check for NIM-related CRDs
kubectl get crds | grep nim
If these CRDs aren’t present, both commands will return no results.
- Add Helm repos:
helm repo add nfd https://kubernetes-sigs.github.io/node-feature-discovery/charts
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
- Install the NFD dependency for NIM Operator:
helm install node-feature-discovery nfd/node-feature-discovery \
  --namespace node-feature-discovery \
  --create-namespace
- Validate the pods are up and CRDs were created:
kubectl get po -n node-feature-discovery
Expected output:
NAME READY STATUS RESTARTS AGE
node-feature-discovery-gc-5b65f7f5b6-q4hlr 1/1 Running 0 79s
node-feature-discovery-master-767dcc6cb8-6hc2t 1/1 Running 0 79s
node-feature-discovery-worker-sg852 1/1 Running 0 43s
kubectl get crds | grep nfd
Expected output:
nodefeaturegroups.nfd.k8s-sigs.io 2025-05-05T01:23:16Z
nodefeaturerules.nfd.k8s-sigs.io 2025-05-05T01:23:16Z
nodefeatures.nfd.k8s-sigs.io 2025-05-05T01:23:16Z
- Install the NIM Operator:
helm install nim-operator nvidia/k8s-nim-operator \
  --namespace nim-operator \
  --create-namespace \
  --version v2.0.0
If you receive a “402 Payment Required” message when installing version v2.0.0 of the NIM Operator, use version v1.0.1 instead.
- Validate the pod is up and CRDs were created:
kubectl get po -n nim-operator
Expected output:
NAME READY STATUS RESTARTS AGE
nim-operator-k8s-nim-operator-6d988f78df-h4nqn 1/1 Running 0 24s
kubectl get crds | grep nim
Expected output:
nimcaches.apps.nvidia.com 2025-05-05T01:18:00Z
nimpipelines.apps.nvidia.com 2025-05-05T01:18:00Z
nimservices.apps.nvidia.com 2025-05-05T01:18:01Z
Create the nim-service namespace and NVIDIA secrets
In this section, create the nim-service namespace and add two secrets containing your NGC API key.
- Create namespace and secrets:
kubectl create namespace nim-service
kubectl create secret -n nim-service docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=$NVIDIA_NGC_API_KEY
kubectl create secret -n nim-service generic ngc-api-secret \
  --from-literal=NGC_API_KEY=$NVIDIA_NGC_API_KEY
- Validate secrets were created:
kubectl -n nim-service get secrets
The following is the expected result:
NAME TYPE DATA AGE
ngc-api-secret Opaque 1 13s
ngc-secret kubernetes.io/dockerconfigjson 1 14s
ngc-secret is a Docker registry secret used to authenticate and pull NIM container images from NVIDIA’s NGC container registry. ngc-api-secret is a generic secret used by the model puller init container to authenticate and download models from the same registry.
Create NIMCaches
RAG enhances chat applications by enabling AI models to access either internal domain-specific knowledge or external knowledge bases, reducing hallucinations and providing more accurate, up-to-date responses. In a RAG system, a knowledge base is created from domain-specific documents. These documents are sliced into smaller pieces of text. The text pieces and their generated embeddings are then uploaded to a vector database. Embeddings are numerical representations (vectors) that capture the meaning of text, where similar text content results in similar vector values. When questions are received from users, they’re also sent with their respective embeddings to the database for semantic similarity search. The database returns the closest matching chunks of text, which are used by an LLM to provide a domain-specific answer.
We use Meta’s llama-3-2-1b-instruct as the LLM and NVIDIA Retrieval QA E5 (embedqa-e5-v5) as the embedder.
This section covers the deployment of NIMCaches for storing both the LLM and embedder models. Local storage of these models speeds up pod initialization by eliminating the need for repeated downloads. Our llama-3-2-1b-instruct LLM, with 1B parameters, is a relatively small model and uses 2.5 GB of storage space. The storage requirements and initialization time increase when larger models are used. Although the initial setup of the LLM and embedder caches takes 10–15 minutes, subsequent pod launches will be faster because the models are already available in the cluster’s local storage.
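The nim-caches.yaml file in the repository defines one NIMCache per model. For illustration, the following is a minimal sketch of what such a NIMCache could look like, based on the NIM Operator documentation; the container image, tag, and any model-specific fields are assumptions, so refer to the repository file for the exact manifests:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:<tag>   # image and tag from the NGC catalog (illustrative)
      pullSecret: ngc-secret        # Docker registry secret created earlier
      authSecret: ngc-api-secret    # NGC API key secret created earlier
  storage:
    pvc:
      create: true
      storageClass: efs             # the EFS StorageClass created earlier
      size: 50Gi
      volumeAccessMode: ReadWriteMany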
Enter the following command:
kubectl apply -f nim-caches.yaml
This is the expected output:
nimcache.apps.nvidia.com/nv-embedqa-e5-v5 created
nimcache.apps.nvidia.com/meta-llama-3-2-1b-instruct created
NIMCaches will create PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs) to store the models, with STORAGECLASS efs:
kubectl get -n nim-service pv,pvc
The following is the expected output:
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS VOLUMEATTRIBUTESCLASS REASON AGE
persistentvolume/pvc-5fa98625-ea65-4aef-99ff-ca14001afb47 50Gi RWX Delete Bound nim-service/nv-embedqa-e5-v5-pvc efs <unset> 77s
persistentvolume/pvc-ab67e4dc-53df-47e7-95c8-ec6458a57a01 50Gi RWX Delete Bound nim-service/meta-llama-3-2-1b-instruct-pvc efs <unset> 76s
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
persistentvolumeclaim/meta-llama-3-2-1b-instruct-pvc Bound pvc-ab67e4dc-53df-47e7-95c8-ec6458a57a01 50Gi RWX efs <unset> 77s
persistentvolumeclaim/nv-embedqa-e5-v5-pvc Bound pvc-5fa98625-ea65-4aef-99ff-ca14001afb47 50Gi RWX efs <unset> 77s
Enter the following to validate the NIMCaches:
kubectl get nimcaches -n nim-service
This is the expected output (STATUS is blank initially, then becomes InProgress for 10–15 minutes until the model download is complete):
NAME STATUS PVC AGE
meta-llama-3-2-1b-instruct Ready meta-llama-3-2-1b-instruct-pvc 13m
nv-embedqa-e5-v5 Ready nv-embedqa-e5-v5-pvc 13m
Create NIMServices
NIMServices are custom resources that manage NVIDIA NIM microservices. To deploy the LLM and embedder services, enter the following:
kubectl apply -f nim-services.yaml
The following is the expected output:
nimservice.apps.nvidia.com/meta-llama-3-2-1b-instruct created
nimservice.apps.nvidia.com/nv-embedqa-e5-v5 created
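For illustration, a NIMService manifest like the ones just applied has roughly the following shape, based on the NIM Operator documentation; the image repository, tag, and resource sizing are assumptions, so refer to the repository’s nim-services.yaml for the exact manifests:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct   # illustrative; must match the cached model
    tag: "<tag>"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct   # reuse the model cached by the NIMCache
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1                  # schedules the pod onto the GPU NodePool
  expose:
    service:
      type: ClusterIP
      port: 8000                         # matches the LLM_URL used by the client later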
Validate the NIMServices:
kubectl get nimservices -n nim-service
The following is the expected output:
NAME STATUS AGE
meta-llama-3-2-1b-instruct Ready 5m25s
nv-embedqa-e5-v5 Ready 5m24s
Our models are stored in an EFS volume, which is mounted to the EC2 instances as a PVC. That translates to faster pod startup times. In fact, notice in the preceding example that the NIMServices are ready in approximately 5 minutes. This time includes the GPU node launch by Karpenter and the container image pull and start. Compared to the 10–15 minutes required for internet-based model downloads, as experienced during the NIMCaches deployment, loading models from the local cache reduces startup time considerably, enhancing the overall system scaling speed. Should you need higher-performance storage, you could explore options such as Amazon FSx for Lustre.
Enter the following command to check the nodes again:
kubectl get nodes -o custom-columns=NAME:.metadata.name,READY:"status.conditions[?(@.type=='Ready')].status",OS-IMAGE:.status.nodeInfo.osImage,INSTANCE-TYPE:.metadata.labels.'node.kubernetes.io/instance-type'
The following is the expected output:
NAME READY OS-IMAGE INSTANCE-TYPE
i-0150ecedccffcc17f True Bottlerocket (EKS Auto) 2025.4.26 (aws-k8s-1.32) c6g.large
i-027bf5419d63073cf True Bottlerocket (EKS Auto) 2025.4.26 (aws-k8s-1.32) c5a.large
i-0a1a1f39564fbf125 True Bottlerocket (EKS Auto, Nvidia) 2025.4.21 (aws-k8s-1.32-nvidia) g5.2xlarge
i-0d418bd8429dd12cd True Bottlerocket (EKS Auto, Nvidia) 2025.4.21 (aws-k8s-1.32-nvidia) g5.2xlarge
Karpenter launched two new GPU instances to support the NIMServices, with the accelerated Bottlerocket AMI variant Bottlerocket (EKS Auto, Nvidia) 2025.4.21 (aws-k8s-1.32-nvidia). The number and type of instances launched might vary depending on Karpenter’s algorithm, which takes into consideration parameters such as instance availability and cost.
Confirm that the NIMService STATUS is Ready before progressing further.
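Optionally, you can smoke-test the endpoints before deploying the client. Each NIMService exposes an OpenAI-compatible API on port 8000. The following sketch port-forwards both services and queries them with curl; the model IDs in the request bodies are assumptions, so copy the exact IDs returned by the /v1/models calls:

# Port-forward the LLM and the embedder services to the local machine
kubectl -n nim-service port-forward service/meta-llama-3-2-1b-instruct 8000:8000 &
kubectl -n nim-service port-forward service/nv-embedqa-e5-v5 8001:8000 &

# List the model IDs each endpoint serves
curl -s http://localhost:8000/v1/models
curl -s http://localhost:8001/v1/models

# Chat completion against the LLM (model ID assumed; use the one listed above)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama-3.2-1b-instruct", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'

# Embedding request against the embedder (model ID and input_type assumed)
curl -s http://localhost:8001/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/nv-embedqa-e5-v5", "input": ["What is Amazon Nova Canvas?"], "input_type": "query"}'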
Chat-based assistant client
We now use a Python client that implements the chat-based assistant interface using the Gradio and LangChain libraries. Gradio creates the web interface and chat components, handling the frontend presentation. LangChain connects the various components and implements RAG through multiple services in our EKS cluster. Meta’s llama-3-2-1b-instruct serves as the base language model, and nv-embedqa-e5-v5 creates text embeddings. OpenSearch acts as the vector store, managing these embeddings and enabling similarity search. This setup allows the chat-based assistant to retrieve relevant information and generate contextual responses.
Sequence diagram showing question-answering workflow with document upload process
- Enter the following commands to deploy the client, hosted on Amazon Elastic Container Registry (Amazon ECR) as a container image in the public gallery (the application’s source files are available in the client folder of the cloned repository):
AOSS_INDEX=${CLUSTER_NAME}-index
CHATBOT_CONTAINER_IMAGE=public.ecr.aws/h6c7e9p3/aws-rag-chatbot-eks-nims:1.0
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: rag-chatbot
  labels:
    app: rag-chatbot
spec:
  ports:
    - port: 7860
      protocol: TCP
  selector:
    app: rag-chatbot
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-chatbot
spec:
  selector:
    matchLabels:
      app: rag-chatbot
  template:
    metadata:
      labels:
        app: rag-chatbot
    spec:
      serviceAccountName: ${CHATBOT_SA_NAME}
      containers:
        - name: rag-chatbot
          image: ${CHATBOT_CONTAINER_IMAGE}
          ports:
            - containerPort: 7860
              protocol: TCP
          env:
            - name: AWS_DEFAULT_REGION
              value: ${AWS_DEFAULT_REGION}
            - name: OPENSEARCH_COLLECTION_ID
              value: ${AOSS_COLLECTION_ID}
            - name: OPENSEARCH_INDEX
              value: ${AOSS_INDEX}
            - name: LLM_URL
              value: "http://meta-llama-3-2-1b-instruct.nim-service.svc.cluster.local:8000/v1"
            - name: EMBEDDINGS_URL
              value: "http://nv-embedqa-e5-v5.nim-service.svc.cluster.local:8000/v1"
EOF
- Check the client pod status:
kubectl get pods
The following is the example output:
NAME READY STATUS RESTARTS AGE
rag-chatbot-6678cd95cb-4mwct 1/1 Running 0 60s
- Port-forward the client’s service:
kubectl port-forward service/rag-chatbot 7860:7860 &
- Open a browser window at http://127.0.0.1:7860.
In the following screenshot, we prompted the chat-based assistant about a topic that isn’t in its knowledge base yet: “What is Amazon Nova Canvas.”
The chat-based assistant can’t find information on the topic and can’t formulate a proper answer.
- Download the file at https://docs.aws.amazon.com/pdfs/ai/responsible-ai/nova-canvas/nova-canvas.pdf and upload its embeddings to OpenSearch Serverless using the client UI by switching to the Document upload tab in the top left, as shown in the following screenshot.
The expected result is nova-canvas.pdf appearing in the list of uploaded files, as shown in the following screenshot.
- Wait 15–30 seconds for OpenSearch Serverless to process and index the data. Ask the same question, “What is Amazon Nova Canvas,” and you will receive a different answer, as shown in the following screenshot.
Cleanup
To clean up the cluster and the EFS resources created so far, enter the following command:
aws efs describe-mount-targets \
  --region $AWS_DEFAULT_REGION \
  --file-system-id $EFS_FS_ID \
  --query 'MountTargets[*].MountTargetId' \
  --output text \
  | xargs -n1 aws efs delete-mount-target \
      --region $AWS_DEFAULT_REGION \
      --mount-target-id
Wait approximately 30 seconds for the mount targets to be removed, then enter the following command:
aws efs delete-file-system --file-system-id $EFS_FS_ID --region $AWS_DEFAULT_REGION
eksctl delete cluster --name=$CLUSTER_NAME --region $AWS_DEFAULT_REGION
To delete the OpenSearch Serverless collection and policies, enter the following commands:
aws opensearchserverless delete-collection \
  --id ${AOSS_COLLECTION_ID}
aws opensearchserverless delete-security-policy \
  --name ${ENCRYPTION_POLICY_NAME} \
  --type encryption
aws opensearchserverless delete-security-policy \
  --name ${NETWORK_POLICY_NAME} \
  --type network
aws opensearchserverless delete-access-policy \
  --name ${DATA_POLICY_NAME} \
  --type data
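The OpenSearch Serverless VPC endpoint and the two security groups created earlier aren’t removed by the preceding commands. Assuming the corresponding environment variables are still set in your shell, you can delete them as well; remove the VPC endpoint before its security group, and delete the EFS security group only after the mount targets and the cluster are gone:

aws opensearchserverless delete-vpc-endpoint \
  --id ${VPC_ENDPOINT_ID} \
  --region $AWS_DEFAULT_REGION
aws ec2 delete-security-group \
  --group-id $AOSS_SECURITY_GROUP_ID \
  --region $AWS_DEFAULT_REGION
aws ec2 delete-security-group \
  --group-id $EFS_SECURITY_GROUP_ID \
  --region $AWS_DEFAULT_REGION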
Conclusion
In this post, we showed how to deploy a RAG-enabled chat-based assistant on Amazon EKS using NVIDIA NIM microservices, integrating an LLM for text generation, an embedding model, and Amazon OpenSearch Serverless for vector storage. Using EKS Auto Mode with GPU-accelerated AMIs, we streamlined our deployment by automating the setup of GPU infrastructure. We specified GPU-based instance types in our Karpenter NodePools, and the system automatically provisioned worker nodes with all necessary NVIDIA components, including device plugins, the container toolkit, and kernel drivers. The implementation demonstrated the effectiveness of RAG, with the chat-based assistant providing informed responses when accessing relevant information from its knowledge base. This architecture showcases how Amazon EKS can streamline the deployment of AI solutions while maintaining production-grade reliability and scalability.
As a challenge, try enhancing the chat-based assistant application by implementing chat history functionality to preserve context across conversations. This allows the LLM to reference previous exchanges and provide more contextually relevant responses. To further learn how to run artificial intelligence and machine learning (AI/ML) workloads on Amazon EKS, check out our EKS best practices guide for running AI/ML workloads, join one of our Get Hands On with Amazon EKS event series, and visit AI on EKS deployment-ready blueprints.
About the authors
Riccardo Freschi is a Senior Solutions Architect at AWS who specializes in Modernization. He helps partners and customers transform their IT landscapes by designing and implementing modern cloud-native architectures on AWS. His focus areas include container-based applications on Kubernetes, cloud-native development, and establishing modernization strategies that drive business value.
Christina Andonov is a Sr. Specialist Solutions Architect at AWS, helping customers run AI workloads on Amazon EKS with open source tools. She’s passionate about Kubernetes and known for making complex concepts easy to understand.