Documentation Index
Fetch the complete documentation index at: https://docs.litellm-agent-platform.ai/llms.txt
Use this file to discover all available pages before exploring further.
When a session gets stuck or fails, the first tool to reach for is the built-in diagnose endpoint. It aggregates the session row, pod state, recent pod logs, and a live harness probe into a single JSON response, so you can identify the root cause without running multiple kubectl and curl commands by hand.
Session status values
The status field on a session row takes one of four values:
| Status | Meaning |
|---|---|
| creating | The platform is provisioning the sandbox pod and waiting for the harness to become ready. |
| ready | The harness is reachable and accepting messages. |
| failed | The session failed during bring-up (e.g. creating timeout, image pull error). The failure_reason field contains a short description. |
| dead | The session was reaped — either by the idle timeout (24 h of inactivity), by the reconciler detecting a gone pod, or by an explicit delete. |
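For scripts that poll the session API, the lifecycle above boils down to one question: can this status still change? A minimal sketch, using only the four status values from the table (the helper name and polling framing are illustrative, not part of the platform API):

```python
# Status strings come from the table above; the helper itself is illustrative.
TERMINAL_STATUSES = {"failed", "dead"}


def should_keep_polling(status: str) -> bool:
    """Return True while a session may still become usable."""
    if status == "ready":
        return False  # usable now; start sending messages
    if status in TERMINAL_STATUSES:
        return False  # inspect failure_reason or call diagnose instead
    if status == "creating":
        return True   # bring-up still in flight
    raise ValueError(f"unknown session status: {status!r}")
```

A caller would loop on should_keep_polling and, on a failed result, read failure_reason rather than retrying blindly.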
Session phases
While a session is in creating, the phase field tracks where in the bring-up sequence the platform is:
| Phase | Owner | Meaning |
|---|---|---|
| creating_sandbox | Platform | Submitting the Sandbox CR and NodePort Service to Kubernetes. |
| pod_pending | Platform | Kubernetes has accepted the CR; the pod is not yet scheduled or running. |
| pod_running | Platform | The pod is running; the platform is waiting for the harness HTTP endpoint to respond. |
| injecting_files | Platform | Injecting configuration files into the pod before the harness starts. |
| waiting_harness | Platform | Polling the harness HTTP health endpoint. |
| harness_ready | Platform | Harness responded healthy; creating the harness-side session. |
| cloning_repo | Harness | The harness is cloning the agent’s repository inside the container. |
| installing_deps | Harness | The harness is installing dependencies. |
| harness_listening | Harness | The harness is fully initialized and listening for messages. |
| ready | Platform | Session is ready; status flips to ready at the same time. |
If a session is stuck in pod_pending for more than a minute, the node may not have enough capacity to schedule the pod. If it is stuck in waiting_harness, the pod is running but the harness has not bound its HTTP port yet — usually a slow git clone or dependency install.
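The two heuristics above can be expressed as a small lookup, useful in a monitoring script. The 60-second threshold comes from the paragraph above; the function name and return strings are illustrative:

```python
# Stuck-phase heuristics from the docs; thresholds and wording are illustrative.
def suspect_bottleneck(phase: str, seconds_in_phase: float) -> "str | None":
    """Map a long-running creating-phase to its likely cause, or None."""
    if phase == "pod_pending" and seconds_in_phase > 60:
        # Kubernetes accepted the CR but cannot schedule the pod.
        return "node capacity: pod cannot be scheduled"
    if phase == "waiting_harness":
        # Pod is running but the harness has not bound its HTTP port.
        return "harness not listening yet: slow git clone or dependency install"
    return None
```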
The diagnose endpoint
For any session that is stuck in creating or behaving unexpectedly, call the diagnose endpoint:
GET /api/v1/managed_agents/sessions/{session_id}/diagnose
Authorization: Bearer $MASTER_KEY
Example:
curl -H "Authorization: Bearer $MASTER_KEY" \
https://lap.acme.dev/api/v1/managed_agents/sessions/<session_id>/diagnose
The response is a single JSON object containing:
- The full session and agent rows from the database
- The Sandbox CR and NodePort Service state from Kubernetes
- The last 200 lines of pod logs
- The node’s Ready condition, CPU/memory capacity, and oversubscription ratio
- Warm pool counts for the agent
- A direct HTTP probe to the harness via the node’s ExternalIP (bypassing the platform’s internal node host cache)
- A detected_issues array with machine-readable codes
Read detected_issues first. If the array is non-empty, each code maps to a specific cause and fix.
detected_issues codes
| Code | Meaning | Fix |
|---|---|---|
| dead_node_assigned | The pod is scheduled on a node whose Ready condition is not True. | Drain and remove the unhealthy node; the pod will be rescheduled on a healthy node. |
| stale_node_host_cache_suspect | The pod, service, and harness are all healthy, but the session has been in creating for more than 120 s. The platform’s in-process node host cache is almost certainly stuck on a terminated node’s IP. | Restart the platform service (web pod) to flush the cache. |
| pod_image_pull_backoff | The pod is in ImagePullBackOff, ErrImagePull, or ErrImageNeverPull. | Check that the image name in K8S_HARNESS_IMAGE is correct, that the node can reach the registry, and that K8S_IMAGE_PULL_POLICY is not Never on a cluster that requires registry pulls. |
| pod_not_ready_old | The pod has been not-Ready for more than 180 s. | Inspect pod events with kubectl describe pod <name> to see why the readiness probe is failing. |
| harness_unreachable | The pod is Running but a direct HTTP probe to the harness failed. | Check the pod logs for startup errors. The harness may still be cloning the repository or installing dependencies. |
| node_oversubscribed | The node’s allocated CPU or memory requests exceed 150% of its capacity. | Scale the node group, delete stale session pods, or reduce WARM_POOL_SIZE. |
| service_missing | The pod exists but the corresponding -np NodePort Service does not. | Delete and recreate the session; the platform creates the Service together with the Sandbox CR. |
| warm_pool_empty_for_agent | This agent has zero warm pool rows and WARM_POOL_SIZE > 0. | Wait for the next worker tick to top up the pool, or check the worker logs for provisioning errors. |
If detected_issues is empty and the session is still in creating, the bring-up is mid-flight. Check the platform (web or worker) logs for the session ID to see where it is.
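The triage order above (read detected_issues first, fall back to log inspection when it is empty) can be sketched as a lookup over the diagnose response. The fix strings condense the table; the function name is illustrative, while the detected_issues field name matches the response:

```python
# Triage a diagnose response: detected_issues first, then the mid-flight fallback.
# Fix strings are condensed from the codes table in the docs.
FIXES = {
    "dead_node_assigned": "drain and remove the unhealthy node",
    "stale_node_host_cache_suspect": "restart the platform web pod to flush the node host cache",
    "pod_image_pull_backoff": "check K8S_HARNESS_IMAGE, registry access, and K8S_IMAGE_PULL_POLICY",
    "pod_not_ready_old": "kubectl describe pod to see why the readiness probe fails",
    "harness_unreachable": "check pod logs for startup errors (clone/install may be in progress)",
    "node_oversubscribed": "scale the node group, delete stale pods, or reduce WARM_POOL_SIZE",
    "service_missing": "delete and recreate the session",
    "warm_pool_empty_for_agent": "wait for the next worker tick or check worker logs",
}


def triage(diagnose: dict) -> list:
    """Return one suggested fix per detected issue, or the mid-flight fallback."""
    issues = diagnose.get("detected_issues", [])
    if not issues:
        return ["no detected issues: bring-up may be mid-flight; check platform logs for the session ID"]
    return [FIXES.get(code, f"unknown code: {code}") for code in issues]
```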
If the diagnose endpoint returns stale_node_host_cache_suspect, the platform’s in-process node host cache is pointing at a terminated node’s IP address. This cache lives in the web pod’s memory and is not shared across replicas or persisted to the database.
Restarting the web pod flushes the cache:
# On EKS:
kubectl rollout restart deployment/litellm-web -n default
# On Render:
# Trigger a manual restart from the dashboard or use the Render API.
After the restart, the next session create or message will re-resolve the node IP from the Kubernetes API.
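Re-resolving the node IP amounts to reading the ExternalIP entry off the Node object’s status.addresses list, the same shape kubectl get node -o json returns. A sketch with an illustrative helper name (the sample data in the usage note below is fabricated for illustration):

```python
# Pick the ExternalIP out of a Kubernetes Node object, as the platform does
# when it re-resolves a node host after the in-memory cache is flushed.
# The dict shape mirrors `kubectl get node -o json`; the helper name is ours.
def node_external_ip(node: dict) -> "str | None":
    for addr in node.get("status", {}).get("addresses", []):
        if addr.get("type") == "ExternalIP":
            return addr.get("address")
    return None  # nodes without a public address have no ExternalIP entry
```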
Common lap CLI symptoms
The following errors appear in the lap CLI output and map to specific root causes:
| Symptom | Likely cause | Fix |
|---|---|---|
| ✗ no agent named '…' | The agent name does not match any agent on the platform you are logged in to. | Run lap config to see which platform URL you are targeting, then check the agent list in the web UI or with lap agents. |
| ✗ session create failed: 401 | The master key stored in your local lap config is wrong or expired. | Re-run lap login and enter the correct MASTER_KEY. |
| [ws closed] immediately after attaching | The harness pod was reaped between session creation and attach, or the bearer token is wrong. | Check lap config, then restart the session from the CLI or web UI. |
| upgrade rejected with 401 from harness | The tty_token does not match what the sandbox pod expects. | Contact your platform administrator to verify the harness authentication token is correctly configured on the platform. |