The recommended production topology runs the sandbox Kubernetes cluster on AWS EKS while the web and worker processes run on Render. EKS gives you managed node groups, native ECR integration, and IAM-based access control for the cluster API. Render handles TLS termination, auto-deploys from GitHub, and provides a managed Postgres database.
This guide covers the recommended Render + EKS topology. If you prefer to run web and worker inside EKS alongside the sandbox pods, refer to the all-EKS deployment guide in the repository's `deploy/` directory.

Prerequisites
Before you begin, make sure you have:

- AWS CLI configured with an IAM user or role that can create EKS clusters and ECR repositories
- `eksctl` or `kubectl` installed locally
- A Render account
- A LiteLLM gateway URL and API key
- Docker installed locally to build harness images
Architecture overview

The Render-hosted web and worker processes reach the EKS API using `aws-iam-authenticator` with static IAM credentials; the build command installs the authenticator binary automatically.
Step 1: Provision the EKS cluster
Run the provisioning script from the repository root. The script creates the EKS cluster, installs the `agent-sandbox` controller, and prints the base64-encoded kubeconfig you will need for Render.
When the script completes, verify that the cluster reports a Ready node and the `sandboxes.agents.x-k8s.io` CRD.
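The step can be sketched as follows. `bin/eks-up.sh` and `kube-config.b64` are the names referenced elsewhere in this guide; the last command is an assumption about how the kubeconfig is captured, since the script may already write the file for you:

```shell
# Provision the cluster and install the agent-sandbox controller.
./bin/eks-up.sh

# Verify a Ready node and the sandbox CRD.
kubectl get nodes
kubectl get crd sandboxes.agents.x-k8s.io

# Capture a base64-encoded kubeconfig for Render as a single line
# (assumption: the script may already produce kube-config.b64).
kubectl config view --raw --minify | base64 | tr -d '\n' > kube-config.b64
```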
Step 2: Push harness images to ECR
Create an ECR repository for the harness image and push it. Note the full image URI; you will reference it as `K8S_HARNESS_IMAGE` in the next step.
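A hedged sketch of the ECR steps; the repository name and tag are assumptions, and the region must match your cluster:

```shell
AWS_REGION=us-east-1                  # match your EKS cluster region
REPO=litellm-agents-harness           # assumption: choose your own repository name
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
URI="$ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO:latest"

# Create the repository and authenticate Docker against the registry.
aws ecr create-repository --repository-name "$REPO" --region "$AWS_REGION"
aws ecr get-login-password --region "$AWS_REGION" |
  docker login --username AWS --password-stdin "$ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com"

# Build and push the harness image.
docker build -t "$URI" .
docker push "$URI"
echo "$URI"   # use this value for K8S_HARNESS_IMAGE
```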
Step 3: Deploy web and worker on Render
Click the Deploy to Render button in the repository's `deploy/render/README.md`, or follow the manual steps below.
The Render Blueprint in `deploy/render/` creates three resources from `render.yaml`:
| Resource | Type |
|---|---|
| Render Postgres | Managed PostgreSQL database |
| `litellm-agents-web` | Web Service (Next.js) |
| `litellm-agents-worker` | Background Worker (reconciler) |

`MASTER_KEY` is auto-generated and `DATABASE_URL` is wired automatically by Render.
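As an illustration, a Blueprint with that shape might look like the following sketch. This is not the actual file; the database name, runtime, and commands are assumptions, and `deploy/render/render.yaml` in the repository is authoritative:

```yaml
databases:
  - name: litellm-agents-db          # assumption: actual database name may differ

services:
  - type: web
    name: litellm-agents-web
    runtime: node
    buildCommand: npm ci --include=dev && npm run build
    startCommand: npm start          # assumption
    envVars:
      - key: DATABASE_URL
        fromDatabase:
          name: litellm-agents-db
          property: connectionString
      - key: MASTER_KEY
        generateValue: true          # Render generates the value at deploy time

  - type: worker
    name: litellm-agents-worker
    runtime: node
    buildCommand: npm ci --include=dev && npm run build
    startCommand: npm run worker     # assumption
```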
After provisioning completes, open the Environment tab for both the web service and the worker and add the following variables:
| Variable | Value |
|---|---|
| `LITELLM_API_BASE` | Your LiteLLM gateway URL |
| `LITELLM_API_KEY` | Your LiteLLM gateway API key |
| `KUBE_CONFIG_B64` | Contents of `kube-config.b64` from Step 1 |
| `AWS_ACCESS_KEY_ID` | Access key for the IAM principal that ran `bin/eks-up.sh` |
| `AWS_SECRET_ACCESS_KEY` | Secret key for the same IAM principal |
| `AWS_REGION` | EKS cluster region (e.g. `us-east-1`) |
| `K8S_NODE_HOST` | `auto` (recommended; discovers the node IP at request time) |
| `K8S_HARNESS_IMAGE` | Full ECR URI from Step 2 |
| `K8S_IMAGE_PULL_POLICY` | `IfNotPresent` |
| `ENCRYPTION_KEY` | Base64-encoded 32 bytes (generate with the command below) |
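The generation command referenced for `ENCRYPTION_KEY` was likely along these lines; any tool that emits 32 random bytes base64-encoded works:

```shell
# Prints a 44-character base64 string encoding 32 random bytes.
openssl rand -base64 32
```

Set the output as `ENCRYPTION_KEY` on both the web service and the worker.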
Step 4: Verify the deployment
Check the web service
Open the web service URL in a browser. You should see the LAP login screen. Log in with `MASTER_KEY`.

Check the worker logs

In the Render dashboard, open the worker service logs and confirm the line `reconciler worker started` appears.

Create an agent and spawn a session
In the web UI, create an agent and click Spawn session. The first cold start takes 30–60 s because the EKS node pulls the harness image from ECR. Subsequent starts with the warm pool active should complete in under 2 s.
Day-to-day operations
Update a secret value
Scale the warm pool
Clean up stale sandbox pods
If the reconciler was down and pods accumulated, delete them by label with `kubectl delete pods -l <sandbox-label>`, substituting the label your controller applies.

Common deployment issues
| Symptom | Cause | Fix |
|---|---|---|
| `Cannot find module '@tailwindcss/postcss'` | Build command missing `--include=dev` | Confirm the Render build command includes `npm ci --include=dev` |
| `HTTP-Code: 401` on every Kubernetes API call | AWS credentials wrong or IAM principal not mapped to a cluster role | Verify `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_REGION` are set on both services; re-run `bin/eks-up.sh` to refresh the `aws-auth` mapping |
| `aws-iam-authenticator: command not found` | Binary not installed during build | Confirm `bin/install-aws-iam-authenticator.sh` runs in the build command and `./bin` is on `PATH` at start |
| `ImagePullBackOff` on sandbox pods | Cluster nodes cannot pull from ECR | Check IAM permissions for the node role and confirm the ECR repository is in the same region as the cluster |
| Stale `ready` session rows after re-pointing the cluster | Previous sessions point at unreachable sandbox URLs | Wait 60 s for the reconciler ghost-reaper, or run `DELETE FROM "Session" WHERE status='ready';` before the first request |