When a session gets stuck or fails, the first tool to reach for is the built-in diagnose endpoint. It aggregates the session row, pod state, recent pod logs, and a live harness probe into a single JSON response, so you can identify the root cause without running multiple kubectl and curl commands by hand.

Session status values

The status field on a session row takes one of four values:
| Status | Meaning |
| --- | --- |
| `creating` | The platform is provisioning the sandbox pod and waiting for the harness to become ready. |
| `ready` | The harness is reachable and accepting messages. |
| `failed` | The session failed during bring-up (e.g. a `creating` timeout or an image pull error). The `failure_reason` field contains a short description. |
| `dead` | The session was reaped: by the idle timeout (24 h of inactivity), by the reconciler detecting a gone pod, or by an explicit delete. |
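The status lifecycle above can be driven from a simple client-side poller. This is an illustrative sketch, not a platform SDK: fetch_status is a placeholder for however you read the session row's status field (for example, a GET on the session endpoint).

```python
import time

# Terminal statuses, taken from the table above: once a session is
# ready, failed, or dead, its status will not change on its own.
TERMINAL = {"ready", "failed", "dead"}

def wait_for_session(fetch_status, timeout=300, interval=5, sleep=time.sleep):
    """Poll until the session leaves 'creating' or the timeout expires.

    fetch_status: callable returning the session's current status string.
    """
    waited = 0
    while waited <= timeout:
        status = fetch_status()
        if status in TERMINAL:
            return status
        sleep(interval)
        waited += interval
    raise TimeoutError(f"session still not terminal after {timeout}s")
```

If the poller times out, the session is stuck in bring-up; the phase field (next section) tells you where.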

Session phases

While a session is in creating, the phase field tracks where in the bring-up sequence the platform is:
| Phase | Owner | Meaning |
| --- | --- | --- |
| `creating_sandbox` | Platform | Submitting the Sandbox CR and NodePort Service to Kubernetes. |
| `pod_pending` | Platform | Kubernetes has accepted the CR; the pod is not yet scheduled or running. |
| `pod_running` | Platform | The pod is running; the platform is waiting for the harness HTTP endpoint to respond. |
| `injecting_files` | Platform | Injecting configuration files into the pod before the harness starts. |
| `waiting_harness` | Platform | Polling the harness HTTP health endpoint. |
| `harness_ready` | Platform | Harness responded healthy; creating the harness-side session. |
| `cloning_repo` | Harness | The harness is cloning the agent’s repository inside the container. |
| `installing_deps` | Harness | The harness is installing dependencies. |
| `harness_listening` | Harness | The harness is fully initialized and listening for messages. |
| `ready` | Platform | Session is ready; status flips to `ready` at the same time. |
If a session is stuck in pod_pending for more than a minute, the node may not have enough capacity to schedule the pod. If it is stuck in waiting_harness, the pod is running but the harness has not bound its HTTP port yet — usually a slow git clone or dependency install.
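The two stuck-phase heuristics above can be expressed directly in code. A minimal sketch, assuming you already have the session's phase and how long it has been in that phase:

```python
# Thresholds and hints taken from the guidance above: >60 s in
# pod_pending suggests a scheduling-capacity problem, and any time
# stuck in waiting_harness means the harness has not bound its port.
def stuck_hint(phase, seconds_in_phase):
    if phase == "pod_pending" and seconds_in_phase > 60:
        return "node may not have enough capacity to schedule the pod"
    if phase == "waiting_harness":
        return "pod is running but harness has not bound its HTTP port (slow clone or dependency install?)"
    return None
```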

The diagnose endpoint

For any session that is stuck in creating or behaving unexpectedly, call the diagnose endpoint:
```
GET /api/v1/managed_agents/sessions/{session_id}/diagnose
Authorization: Bearer $MASTER_KEY
```

Example:

```shell
curl -H "Authorization: Bearer $MASTER_KEY" \
  https://lap.acme.dev/api/v1/managed_agents/sessions/<session_id>/diagnose
```
The response is a single JSON object containing:
  • The full session and agent rows from the database
  • The Sandbox CR and NodePort Service state from Kubernetes
  • The last 200 lines of pod logs
  • The node’s Ready condition, CPU/memory capacity, and oversubscription ratio
  • Warm pool counts for the agent
  • A direct HTTP probe to the harness via the node’s ExternalIP (bypassing the platform’s internal node host cache)
  • A detected_issues array with machine-readable codes
Read detected_issues first. If the array is non-empty, each code maps to a specific cause and fix.
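The same call can be made from Python. This is a sketch, not an official client: the URL, key, and fallback message are placeholders, and only the detected_issues-first triage rule comes from the docs above.

```python
import json
import urllib.request

def diagnose(base_url, session_id, master_key):
    """Fetch the diagnose payload for a session (sketch; URL is a placeholder)."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/managed_agents/sessions/{session_id}/diagnose",
        headers={"Authorization": f"Bearer {master_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def triage(report):
    """Return the machine-readable issue codes, or a fallback hint if empty."""
    issues = report.get("detected_issues") or []
    if issues:
        return issues
    return ["no_detected_issues: bring-up may be mid-flight; check platform logs"]
```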

detected_issues codes

| Code | Meaning | Fix |
| --- | --- | --- |
| `dead_node_assigned` | The pod is scheduled on a node whose Ready condition is not True. | Drain and remove the unhealthy node; the pod will be rescheduled on a healthy node. |
| `stale_node_host_cache_suspect` | The pod, service, and harness are all healthy, but the session has been in `creating` for more than 120 s. The platform’s in-process node host cache is almost certainly stuck on a terminated node’s IP. | Restart the platform service (web pod) to flush the cache. |
| `pod_image_pull_backoff` | The pod is in ImagePullBackOff, ErrImagePull, or ErrImageNeverPull. | Check that the image name in `K8S_HARNESS_IMAGE` is correct, that the node can reach the registry, and that `K8S_IMAGE_PULL_POLICY` is not `Never` on a cluster that requires registry pulls. |
| `pod_not_ready_old` | The pod has been not-Ready for more than 180 s. | Inspect pod events with `kubectl describe pod <name>` to see why the readiness probe is failing. |
| `harness_unreachable` | The pod is Running but a direct HTTP probe to the harness failed. | Check the pod logs for startup errors. The harness may still be cloning the repository or installing dependencies. |
| `node_oversubscribed` | The node’s allocated CPU or memory requests exceed 150% of its capacity. | Scale the node group, delete stale session pods, or reduce `WARM_POOL_SIZE`. |
| `service_missing` | The pod exists but the corresponding `-np` NodePort Service does not. | Delete and recreate the session; the platform creates the Service together with the Sandbox CR. |
| `warm_pool_empty_for_agent` | This agent has zero warm pool rows and `WARM_POOL_SIZE` > 0. | Wait for the next worker tick to top up the pool, or check the worker logs for provisioning errors. |
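For automation, the Fix column above can be turned into a lookup table. This mapping is a sketch transcribed from the table, condensed for brevity; the code strings are the platform's, the fix strings are paraphrases.

```python
# Remediation strings condensed from the detected_issues table above.
FIXES = {
    "dead_node_assigned": "Drain and remove the unhealthy node; the pod will be rescheduled.",
    "stale_node_host_cache_suspect": "Restart the platform service (web pod) to flush the cache.",
    "pod_image_pull_backoff": "Check K8S_HARNESS_IMAGE, registry reachability, and K8S_IMAGE_PULL_POLICY.",
    "pod_not_ready_old": "Inspect pod events with 'kubectl describe pod <name>'.",
    "harness_unreachable": "Check pod logs; the harness may still be cloning or installing deps.",
    "node_oversubscribed": "Scale the node group, delete stale pods, or reduce WARM_POOL_SIZE.",
    "service_missing": "Delete and recreate the session.",
    "warm_pool_empty_for_agent": "Wait for the next worker tick or check worker logs.",
}

def remediation(codes):
    """Map each detected_issues code to its suggested fix."""
    return [FIXES.get(code, f"unknown code: {code}") for code in codes]
```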
If detected_issues is empty and the session is still in creating, the bring-up is mid-flight. Check the platform (web or worker) logs for the session ID to see where it is.

When to restart the platform service

If the diagnose endpoint returns stale_node_host_cache_suspect, the platform’s in-process node host cache is pointing at a terminated node’s IP address. This cache lives in the web pod’s memory and is not shared across replicas or persisted to the database. Restarting the web pod flushes the cache:
```shell
# On EKS:
kubectl rollout restart deployment/litellm-web -n default

# On Render:
# Trigger a manual restart from the dashboard or use the Render API.
```
After the restart, the next session create or message will re-resolve the node IP from the Kubernetes API.

Common lap CLI symptoms

The following errors appear in the lap CLI output and map to specific root causes:
| Symptom | Likely cause | Fix |
| --- | --- | --- |
| `✗ no agent named '…'` | The agent name does not match any agent on the platform you are logged in to. | Run `lap config` to see which platform URL you are targeting, then check the agent list in the web UI or with `lap agents`. |
| `✗ session create failed: 401` | The master key stored in your local `lap` config is wrong or expired. | Re-run `lap login` and enter the correct `MASTER_KEY`. |
| `[ws closed]` immediately after attaching | The harness pod was reaped between session creation and attach, or the bearer token is wrong. | Check `lap config`, then restart the session from the CLI or web UI. |
| upgrade rejected with 401 from harness | The `tty_token` does not match what the sandbox pod expects. | Contact your platform administrator to verify the harness authentication token is correctly configured on the platform. |