The recommended production topology runs the sandbox Kubernetes cluster on AWS EKS while the web and worker processes run on Render. EKS gives you managed node groups, native ECR integration, and IAM-based access control for the cluster API. Render handles TLS termination, auto-deploys from GitHub, and provides a managed Postgres database.
This guide covers the recommended Render + EKS topology. If you prefer to run web and worker inside EKS alongside the sandbox pods, refer to the all-EKS deployment guide in the repository's `deploy/` directory.

Prerequisites
Before you begin, make sure you have:

- AWS CLI configured with an IAM user or role that can create EKS clusters and ECR repositories
- `eksctl` or `kubectl` installed locally
- A Render account
- A LiteLLM gateway URL and API key
- Docker installed locally to build harness images
Architecture overview

The Render-hosted web and worker processes reach the EKS API using `aws-iam-authenticator` with static IAM credentials; the build command installs the authenticator binary automatically.
Step 1: Provision the EKS cluster
Run the provisioning script from the repository root. The script creates the EKS cluster, installs the `agent-sandbox` controller, and prints the base64-encoded kubeconfig you will need for Render.
When the script completes, verify that the cluster reports a Ready node and the `sandboxes.agents.x-k8s.io` CRD.
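The step can be sketched as follows. `bin/eks-up.sh` and `kube-config.b64` are the names referenced elsewhere in this guide; the last command is an assumption about how the kubeconfig is captured, since the script may already write the file for you:

```shell
# Provision the cluster and install the agent-sandbox controller.
./bin/eks-up.sh

# Verify a Ready node and the sandbox CRD.
kubectl get nodes
kubectl get crd sandboxes.agents.x-k8s.io

# Capture a base64-encoded kubeconfig for Render as a single line
# (assumption: the script may already produce kube-config.b64).
kubectl config view --raw --minify | base64 | tr -d '\n' > kube-config.b64
```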
Step 2: Push harness images to ECR
Create an ECR repository for the harness image and push it. Note the full image URI; you will reference it as `K8S_HARNESS_IMAGE` in the next step.
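A hedged sketch of the ECR steps; the repository name and tag are assumptions, and the region must match your cluster:

```shell
AWS_REGION=us-east-1                  # match your EKS cluster region
REPO=litellm-agents-harness           # assumption: choose your own repository name
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
URI="$ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO:latest"

# Create the repository and authenticate Docker against the registry.
aws ecr create-repository --repository-name "$REPO" --region "$AWS_REGION"
aws ecr get-login-password --region "$AWS_REGION" |
  docker login --username AWS --password-stdin "$ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com"

# Build and push the harness image.
docker build -t "$URI" .
docker push "$URI"
echo "$URI"   # use this value for K8S_HARNESS_IMAGE
```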
Step 3: Deploy web and worker on Render
Click the Deploy to Render button in the repository's `deploy/render/README.md`, or follow the manual steps below.
The Render Blueprint in `deploy/render/` creates three resources from `render.yaml`:
| Resource | Type |
|---|---|
| Render Postgres | Managed PostgreSQL database |
| `litellm-agents-web` | Web Service (Next.js) |
| `litellm-agents-worker` | Background Worker (reconciler) |

`MASTER_KEY` is auto-generated and `DATABASE_URL` is wired automatically by Render.
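As an illustration, a Blueprint with that shape might look like the following sketch. This is not the actual file; the database name, runtime, and commands are assumptions, and `deploy/render/render.yaml` in the repository is authoritative:

```yaml
databases:
  - name: litellm-agents-db          # assumption: actual database name may differ

services:
  - type: web
    name: litellm-agents-web
    runtime: node
    buildCommand: npm ci --include=dev && npm run build
    startCommand: npm start          # assumption
    envVars:
      - key: DATABASE_URL
        fromDatabase:
          name: litellm-agents-db
          property: connectionString
      - key: MASTER_KEY
        generateValue: true          # Render generates the value at deploy time

  - type: worker
    name: litellm-agents-worker
    runtime: node
    buildCommand: npm ci --include=dev && npm run build
    startCommand: npm run worker     # assumption
```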
After provisioning completes, open the Environment tab for both the web service and the worker and add the following variables:
| Variable | Value |
|---|---|
| `LITELLM_API_BASE` | Your LiteLLM gateway URL |
| `LITELLM_API_KEY` | Your LiteLLM gateway API key |
| `KUBE_CONFIG_B64` | Contents of `kube-config.b64` from Step 1 |
| `AWS_ACCESS_KEY_ID` | Access key for the IAM principal that ran `bin/eks-up.sh` |
| `AWS_SECRET_ACCESS_KEY` | Secret key for the same IAM principal |
| `AWS_REGION` | EKS cluster region (e.g. `us-east-1`) |
| `K8S_NODE_HOST` | `auto` (recommended; discovers the node IP at request time) |
| `K8S_HARNESS_IMAGE` | Full ECR URI from Step 2 |
| `K8S_IMAGE_PULL_POLICY` | `IfNotPresent` |
| `ENCRYPTION_KEY` | Base64-encoded 32 bytes (generate with the command below) |
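The generation command referenced for `ENCRYPTION_KEY` was likely along these lines; any tool that emits 32 random bytes base64-encoded works:

```shell
# Prints a 44-character base64 string encoding 32 random bytes.
openssl rand -base64 32
```

Set the output as `ENCRYPTION_KEY` on both the web service and the worker.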
Step 4: Verify the deployment
Check the web service
Open the web service URL in a browser. You should see the LAP login screen. Log in with `MASTER_KEY`.

Check the worker logs

In the Render dashboard, open the worker service logs and confirm the line `reconciler worker started` appears.

Create an agent and spawn a session
In the web UI, create an agent and click Spawn session. The first cold start takes 30–60 s because the EKS node pulls the harness image from ECR. Subsequent starts with the warm pool active should complete in under 2 s.
Day-to-day operations
Update a secret value
Scale the warm pool
Clean up stale sandbox pods
If the reconciler was down and pods accumulated, delete them by label with `kubectl delete pods -l <sandbox-label>`, substituting the label your controller applies.

Common deployment issues
| Symptom | Cause | Fix |
|---|---|---|
| `Cannot find module '@tailwindcss/postcss'` | Build command missing `--include=dev` | Confirm the Render build command includes `npm ci --include=dev` |
| `HTTP-Code: 401` on every Kubernetes API call | AWS credentials wrong or IAM principal not mapped to a cluster role | Verify `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_REGION` are set on both services; re-run `bin/eks-up.sh` to refresh the `aws-auth` mapping |
| `aws-iam-authenticator: command not found` | Binary not installed during build | Confirm `bin/install-aws-iam-authenticator.sh` runs in the build command and `./bin` is on `PATH` at start |
| `ImagePullBackOff` on sandbox pods | Cluster nodes cannot pull from ECR | Check IAM permissions for the node role and confirm the ECR repository is in the same region as the cluster |
| Stale `ready` session rows after re-pointing the cluster | Previous sessions point at unreachable sandbox URLs | Wait 60 s for the reconciler ghost-reaper, or run `DELETE FROM "Session" WHERE status='ready';` before the first request |