Documentation Index
Fetch the complete documentation index at: https://docs.litellm-agent-platform.ai/llms.txt
Use this file to discover all available pages before exploring further.
When a session gets stuck or fails, the first tool to reach for is the built-in diagnose endpoint. It aggregates the session row, pod state, recent pod logs, and a live harness probe into a single JSON response, so you can identify the root cause without running multiple kubectl and curl commands by hand.
Session status values
The status field on a session row takes one of four values:
| Status | Meaning |
|---|---|
| creating | The platform is provisioning the sandbox pod and waiting for the harness to become ready. |
| ready | The harness is reachable and accepting messages. |
| failed | The session failed during bring-up (e.g. creating timeout, image pull error). The failure_reason field contains a short description. |
| dead | The session was reaped — either by the idle timeout (24 h of inactivity), by the reconciler detecting a gone pod, or by an explicit delete. |
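For scripts that poll the session API, the lifecycle above boils down to one question: can this status still change? A minimal sketch, using only the four status values from the table (the helper name and polling framing are illustrative, not part of the platform API):

```python
# Status strings come from the table above; the helper itself is illustrative.
TERMINAL_STATUSES = {"failed", "dead"}


def should_keep_polling(status: str) -> bool:
    """Return True while a session may still become usable."""
    if status == "ready":
        return False  # usable now; start sending messages
    if status in TERMINAL_STATUSES:
        return False  # inspect failure_reason or call diagnose instead
    if status == "creating":
        return True   # bring-up still in flight
    raise ValueError(f"unknown session status: {status!r}")
```

A caller would loop on should_keep_polling and, on a failed result, read failure_reason rather than retrying blindly.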
Session phases
While a session is in creating, the phase field tracks where in the bring-up sequence the platform is:
| Phase | Owner | Meaning |
|---|---|---|
| creating_sandbox | Platform | Submitting the Sandbox CR and NodePort Service to Kubernetes. |
| pod_pending | Platform | Kubernetes has accepted the CR; the pod is not yet scheduled or running. |
| pod_running | Platform | The pod is running; the platform is waiting for the harness HTTP endpoint to respond. |
| injecting_files | Platform | Injecting configuration files into the pod before the harness starts. |
| waiting_harness | Platform | Polling the harness HTTP health endpoint. |
| harness_ready | Platform | Harness responded healthy; creating the harness-side session. |
| cloning_repo | Harness | The harness is cloning the agent’s repository inside the container. |
| installing_deps | Harness | The harness is installing dependencies. |
| harness_listening | Harness | The harness is fully initialized and listening for messages. |
| ready | Platform | Session is ready; status flips to ready at the same time. |
If a session is stuck in pod_pending for more than a minute, the node may not have enough capacity to schedule the pod. If it is stuck in waiting_harness, the pod is running but the harness has not bound its HTTP port yet — usually a slow git clone or dependency install.
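The two heuristics above can be expressed as a small lookup, useful in a monitoring script. The 60-second threshold comes from the paragraph above; the function name and return strings are illustrative:

```python
# Stuck-phase heuristics from the docs; thresholds and wording are illustrative.
def suspect_bottleneck(phase: str, seconds_in_phase: float) -> "str | None":
    """Map a long-running creating-phase to its likely cause, or None."""
    if phase == "pod_pending" and seconds_in_phase > 60:
        # Kubernetes accepted the CR but cannot schedule the pod.
        return "node capacity: pod cannot be scheduled"
    if phase == "waiting_harness":
        # Pod is running but the harness has not bound its HTTP port.
        return "harness not listening yet: slow git clone or dependency install"
    return None
```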
The diagnose endpoint
For any session that is stuck in creating or behaving unexpectedly, call the diagnose endpoint:
GET /api/v1/managed_agents/sessions/{session_id}/diagnose
Authorization: Bearer $MASTER_KEY
Example:
curl -H "Authorization: Bearer $MASTER_KEY" \
https://lap.acme.dev/api/v1/managed_agents/sessions/<session_id>/diagnose
The response is a single JSON object containing:
- The full session and agent rows from the database
- The Sandbox CR and NodePort Service state from Kubernetes
- The last 200 lines of pod logs
- The node’s Ready condition, CPU/memory capacity, and oversubscription ratio
- Warm pool counts for the agent
- A direct HTTP probe to the harness via the node’s ExternalIP (bypassing the platform’s internal node host cache)
- A detected_issues array with machine-readable codes
Read detected_issues first. If the array is non-empty, each code maps to a specific cause and fix.
detected_issues codes
| Code | Meaning | Fix |
|---|---|---|
| dead_node_assigned | The pod is scheduled on a node whose Ready condition is not True. | Drain and remove the unhealthy node; the pod will be rescheduled on a healthy node. |
| stale_node_host_cache_suspect | The pod, service, and harness are all healthy, but the session has been in creating for more than 120 s. The platform’s in-process node host cache is almost certainly stuck on a terminated node’s IP. | Restart the platform service (web pod) to flush the cache. |
| pod_image_pull_backoff | The pod is in ImagePullBackOff, ErrImagePull, or ErrImageNeverPull. | Check that the image name in K8S_HARNESS_IMAGE is correct, that the node can reach the registry, and that K8S_IMAGE_PULL_POLICY is not Never on a cluster that requires registry pulls. |
| pod_not_ready_old | The pod has been not-Ready for more than 180 s. | Inspect pod events with kubectl describe pod <name> to see why the readiness probe is failing. |
| harness_unreachable | The pod is Running but a direct HTTP probe to the harness failed. | Check the pod logs for startup errors. The harness may still be cloning the repository or installing dependencies. |
| node_oversubscribed | The node’s allocated CPU or memory requests exceed 150% of its capacity. | Scale the node group, delete stale session pods, or reduce WARM_POOL_SIZE. |
| service_missing | The pod exists but the corresponding -np NodePort Service does not. | Delete and recreate the session; the platform creates the Service together with the Sandbox CR. |
| warm_pool_empty_for_agent | This agent has zero warm pool rows and WARM_POOL_SIZE > 0. | Wait for the next worker tick to top up the pool, or check the worker logs for provisioning errors. |
If detected_issues is empty and the session is still in creating, the bring-up is mid-flight. Check the platform (web or worker) logs for the session ID to see where it is.
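The triage order above (read detected_issues first, fall back to log inspection when it is empty) can be sketched as a lookup over the diagnose response. The fix strings condense the table; the function name is illustrative, while the detected_issues field name matches the response:

```python
# Triage a diagnose response: detected_issues first, then the mid-flight fallback.
# Fix strings are condensed from the codes table in the docs.
FIXES = {
    "dead_node_assigned": "drain and remove the unhealthy node",
    "stale_node_host_cache_suspect": "restart the platform web pod to flush the node host cache",
    "pod_image_pull_backoff": "check K8S_HARNESS_IMAGE, registry access, and K8S_IMAGE_PULL_POLICY",
    "pod_not_ready_old": "kubectl describe pod to see why the readiness probe fails",
    "harness_unreachable": "check pod logs for startup errors (clone/install may be in progress)",
    "node_oversubscribed": "scale the node group, delete stale pods, or reduce WARM_POOL_SIZE",
    "service_missing": "delete and recreate the session",
    "warm_pool_empty_for_agent": "wait for the next worker tick or check worker logs",
}


def triage(diagnose: dict) -> list:
    """Return one suggested fix per detected issue, or the mid-flight fallback."""
    issues = diagnose.get("detected_issues", [])
    if not issues:
        return ["no detected issues: bring-up may be mid-flight; check platform logs for the session ID"]
    return [FIXES.get(code, f"unknown code: {code}") for code in issues]
```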
If the diagnose endpoint returns stale_node_host_cache_suspect, the platform’s in-process node host cache is pointing at a terminated node’s IP address. This cache lives in the web pod’s memory and is not shared across replicas or persisted to the database.
Restarting the web pod flushes the cache:
# On EKS:
kubectl rollout restart deployment/litellm-web -n default
# On Render:
# Trigger a manual restart from the dashboard or use the Render API.
After the restart, the next session create or message will re-resolve the node IP from the Kubernetes API.
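Re-resolving the node IP amounts to reading the ExternalIP entry off the Node object’s status.addresses list, the same shape kubectl get node -o json returns. A sketch with an illustrative helper name (the sample data in the usage note below is fabricated for illustration):

```python
# Pick the ExternalIP out of a Kubernetes Node object, as the platform does
# when it re-resolves a node host after the in-memory cache is flushed.
# The dict shape mirrors `kubectl get node -o json`; the helper name is ours.
def node_external_ip(node: dict) -> "str | None":
    for addr in node.get("status", {}).get("addresses", []):
        if addr.get("type") == "ExternalIP":
            return addr.get("address")
    return None  # nodes without a public address have no ExternalIP entry
```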
Common lap CLI symptoms
The following errors appear in the lap CLI output and map to specific root causes:
| Symptom | Likely cause | Fix |
|---|---|---|
| ✗ no agent named '…' | The agent name does not match any agent on the platform you are logged in to. | Run lap config to see which platform URL you are targeting, then check the agent list in the web UI or with lap agents. |
| ✗ session create failed: 401 | The master key stored in your local lap config is wrong or expired. | Re-run lap login and enter the correct MASTER_KEY. |
| [ws closed] immediately after attaching | The harness pod was reaped between session creation and attach, or the bearer token is wrong. | Check lap config, then restart the session from the CLI or web UI. |
| upgrade rejected with 401 from harness | The tty_token does not match what the sandbox pod expects. | Contact your platform administrator to verify the harness authentication token is correctly configured on the platform. |