The recommended production topology runs the sandbox Kubernetes cluster on AWS EKS while the web and worker processes run on Render. EKS gives you managed node groups, native ECR integration, and IAM-based access control for the cluster API. Render handles TLS termination, auto-deploys from GitHub, and provides a managed Postgres database.
This guide covers the recommended Render + EKS topology. If you prefer to run web and worker inside EKS alongside the sandbox pods, refer to the all-EKS deployment guide in the repository’s deploy/ directory.

Prerequisites

Before you begin, make sure you have:
  • AWS CLI configured with an IAM user or role that can create EKS clusters and ECR repositories
  • eksctl and kubectl installed locally
  • A Render account
  • A LiteLLM gateway URL and API key
  • Docker installed locally to build harness images
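A quick way to sanity-check these prerequisites before you start (illustrative; any reasonably recent versions are fine):
aws sts get-caller-identity          # confirms the AWS credentials the provisioning step will use
eksctl version && kubectl version --client
docker version --format '{{.Server.Version}}'   # confirms the Docker daemon is running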

Architecture overview

Render (web + worker)
  ├── litellm-agents-web     ← Next.js, public HTTPS
  ├── litellm-agents-worker  ← reconciler + warm pool
  └── Render Postgres        ← managed database

AWS EKS (us-east-1 or your region)
  └── sandbox pods (s-*, w-*)  ← agent harnesses (Sandbox CRs)
The web and worker services connect to EKS using a kubeconfig that is base64-encoded and stored as a Render environment variable. Authentication uses aws-iam-authenticator with static IAM credentials, which the build command installs automatically.
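As a rough sketch of what that wiring amounts to at service start (the /tmp path and the readiness call are illustrative, not the literal commands the build scripts run):
echo "$KUBE_CONFIG_B64" | base64 -d > /tmp/kubeconfig   # decode the kubeconfig stored in the Render env var
export KUBECONFIG=/tmp/kubeconfig
export PATH="$PWD/bin:$PATH"      # ./bin is where the build places aws-iam-authenticator
kubectl get --raw /readyz         # any API call now authenticates via aws-iam-authenticator and the AWS_* credentials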

Step 1: Provision the EKS cluster

Run the provisioning script from the repository root. The script creates the EKS cluster, installs the agent-sandbox controller, and prints the base64-encoded kubeconfig you will need for Render.
bin/eks-up.sh 2>eks-stderr.log | tee kube-config.b64
# K8S_NODE_HOST is printed to stderr:
grep K8S_NODE_HOST eks-stderr.log
Verify the cluster is reachable before continuing:
base64 -d < kube-config.b64 > /tmp/kc
KUBECONFIG=/tmp/kc kubectl get nodes
KUBECONFIG=/tmp/kc kubectl get crd | grep sandboxes.agents.x-k8s.io
You should see at least one Ready node and the sandboxes.agents.x-k8s.io CRD.
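Optionally, confirm the agent-sandbox controller installed by the script is running; its namespace depends on how bin/eks-up.sh deploys it, so this simply searches every namespace:
KUBECONFIG=/tmp/kc kubectl get pods -A | grep -i sandbox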

Step 2: Push harness images to ECR

Create an ECR repository for the harness image and push it:
aws ecr create-repository \
  --repository-name litellm-agent-platform \
  --region us-east-1

aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin \
    <account-id>.dkr.ecr.us-east-1.amazonaws.com

docker build -t litellm-agent-platform/opencode-sandbox:latest \
  -f harnesses/opencode/Dockerfile .

docker tag litellm-agent-platform/opencode-sandbox:latest \
  <account-id>.dkr.ecr.us-east-1.amazonaws.com/litellm-agent-platform:latest

docker push <account-id>.dkr.ecr.us-east-1.amazonaws.com/litellm-agent-platform:latest
Note the full ECR image URI — you will set it as K8S_HARNESS_IMAGE in the next step.
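If you prefer not to copy the account ID by hand, this prints the URI to use (adjust the region if you chose one other than us-east-1):
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
echo "${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/litellm-agent-platform:latest"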

Step 3: Deploy web and worker on Render

Click the Deploy to Render button in the repository’s deploy/render/README.md, or follow the manual steps below. The Render Blueprint in deploy/render/ creates three resources from render.yaml:
Resource                Type
Render Postgres         Managed PostgreSQL database
litellm-agents-web      Web Service (Next.js)
litellm-agents-worker   Background Worker (reconciler)
MASTER_KEY is auto-generated and DATABASE_URL is wired automatically by Render. After provisioning completes, open the Environment tab for both the web service and the worker and add the following variables:
Variable                 Value
LITELLM_API_BASE         Your LiteLLM gateway URL
LITELLM_API_KEY          Your LiteLLM gateway API key
KUBE_CONFIG_B64          Contents of kube-config.b64 from Step 1
AWS_ACCESS_KEY_ID        Access key for the IAM principal that ran bin/eks-up.sh
AWS_SECRET_ACCESS_KEY    Secret key for the same IAM principal
AWS_REGION               EKS cluster region (e.g. us-east-1)
K8S_NODE_HOST            auto (recommended; discovers the node IP at request time)
K8S_HARNESS_IMAGE        Full ECR URI from Step 2
K8S_IMAGE_PULL_POLICY    IfNotPresent
ENCRYPTION_KEY           Base64-encoded 32 bytes (generate with the command below)
Generate an encryption key:
node -e "console.log(require('crypto').randomBytes(32).toString('base64'))"
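If Node.js is not available locally, openssl produces an equivalent value:
openssl rand -base64 32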
Set K8S_IMAGE_PULL_POLICY to IfNotPresent or Always for production. The default value of Never is only correct for local kind clusters that have the image loaded directly — on EKS, Never causes pods to fail with ErrImageNeverPull.
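Once a sandbox pod exists (after the first spawn in Step 4), you can confirm which image and pull policy it was created with; this assumes sandbox pods land in the default namespace, as the clean-up commands later in this guide do:
KUBECONFIG=/tmp/kc kubectl get pods -n default \
  -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[0].image,PULL_POLICY:.spec.containers[0].imagePullPolicy'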
After saving all variables, trigger a manual redeploy of both services from the Render dashboard.

Step 4: Verify the deployment

1. Check the web service. Open the web service URL in a browser. You should see the LAP login screen. Log in with MASTER_KEY.
2. Check the worker logs. In the Render dashboard, open the worker service logs and confirm the line reconciler worker started appears.
3. Create an agent and spawn a session. In the web UI, create an agent and click Spawn session. The first cold start takes 30–60 s because the EKS node pulls the harness image from ECR; subsequent starts with the warm pool active should complete in under 2 s (a pod-watch sketch follows this list).
4. Point the lap CLI at production. Configure the CLI to talk to your production deployment:
lap login
# Agent platform URL: https://<your-render-web-url>
# Master key:         <MASTER_KEY>
Then open a sandbox:
lap <agent-name>
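If the first spawn seems slow, you can watch the sandbox pod come up from your workstation using the kubeconfig decoded in Step 1; replace <sandbox-pod-name> with the s-* pod that appears:
KUBECONFIG=/tmp/kc kubectl get pods -n default -w
KUBECONFIG=/tmp/kc kubectl describe pod <sandbox-pod-name> -n default | tail -n 20   # recent events, including image pulls and any errors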

Day-to-day operations

Update a secret value

kubectl patch secret litellm-env -n default \
  --type='json' \
  -p="[{\"op\":\"replace\",\"path\":\"/data/MASTER_KEY\",
         \"value\":\"$(echo -n 'newvalue' | base64)\"}]"

# Restart both deployments to pick up the change:
kubectl rollout restart deployment/litellm-web deployment/litellm-worker -n default
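To confirm the new value landed, decode it back out:
kubectl get secret litellm-env -n default -o jsonpath='{.data.MASTER_KEY}' | base64 -d; echo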

Scale the warm pool

kubectl patch secret litellm-env -n default \
  --type='json' \
  -p="[{\"op\":\"replace\",\"path\":\"/data/WARM_POOL_SIZE\",
         \"value\":\"$(echo -n '6' | base64)\"}]"
kubectl rollout restart deployment/litellm-worker -n default

Clean up stale sandbox pods

If the reconciler was down and pods accumulated, delete them by label:
kubectl delete sandboxes.agents.x-k8s.io -n default \
  -l litellm-session-id --grace-period=0
kubectl delete sandboxes.agents.x-k8s.io -n default \
  -l litellm-warm-task-id --grace-period=0
kubectl delete services -n default \
  -l litellm-session-id --grace-period=0
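To see what those label selectors match before (or after) deleting, list the resources with their labels:
kubectl get sandboxes.agents.x-k8s.io,services -n default --show-labels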

Common deployment issues

  • Symptom: Cannot find module '@tailwindcss/postcss'
    Cause: Build command missing --include=dev
    Fix: Confirm the Render build command includes npm ci --include=dev
  • Symptom: HTTP-Code: 401 on every Kubernetes API call
    Cause: AWS credentials wrong, or IAM principal not mapped to a cluster role
    Fix: Verify AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION are set on both services; re-run bin/eks-up.sh to refresh the aws-auth mapping
  • Symptom: aws-iam-authenticator: command not found
    Cause: Binary not installed during build
    Fix: Confirm bin/install-aws-iam-authenticator.sh runs in the build command and ./bin is on PATH at start
  • Symptom: ImagePullBackOff on sandbox pods
    Cause: Cluster nodes cannot pull from ECR
    Fix: Check IAM permissions for the node role and confirm the ECR repository is in the same region
  • Symptom: Stale ready session rows after re-pointing the cluster
    Cause: Previous sessions point at unreachable sandbox URLs
    Fix: Wait 60 s for the reconciler ghost-reaper, or run DELETE FROM "Session" WHERE status='ready'; before the first request
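For the last fix above, one way to run the statement against the Render Postgres instance, assuming psql is installed locally and DATABASE_URL holds the external connection string from the Render dashboard:
psql "$DATABASE_URL" -c "DELETE FROM \"Session\" WHERE status='ready';"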