scripts.process_cleanup
Reusable process lifecycle management for vLLM serve scripts.
Handles graceful shutdown, orphan cleanup, and health monitoring for multiprocessing-based server architectures where a main process dispatches work to worker subprocesses that spawn GPU-heavy children (e.g., vLLM EngineCore).
Usage:
from axolotl.scripts.process_cleanup import ProcessManager
manager = ProcessManager(processes, connections)
manager.register_signal_handlers()
# In FastAPI lifespan:
async with manager.lifespan_context():
    yield  # server runs here
# In endpoints:
manager.check_workers_alive()  # raises if dead
# In worker command loop:
if manager.is_fatal_error(exc):
    break  # exit worker
Classes
| Name | Description |
|---|---|
| ProcessManager | Manages worker process lifecycle for a FastAPI-based serve script. |
ProcessManager
scripts.process_cleanup.ProcessManager(
processes,
connections,
orphan_patterns=None,
monitor_interval=5.0,
shutdown_timeout=30.0,
kill_timeout=15.0,
)
Manages worker process lifecycle for a FastAPI-based serve script.
Handles:
- Signal-based shutdown (SIGTERM)
- Background health monitoring (detects dead workers)
- Process tree cleanup on exit
- Orphan EngineCore cleanup
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| processes | list[Process] | List of worker Process objects. | required |
| connections | list[Connection] | List of parent-side Pipe connections to workers. | required |
| orphan_patterns | list[str] \| None | Process name patterns to search for orphans on cleanup. Defaults to ["VLLM::EngineCore"]. | None |
| monitor_interval | float | Seconds between worker health checks. | 5.0 |
| shutdown_timeout | float | Seconds to wait for graceful worker exit before SIGTERM. | 30.0 |
| kill_timeout | float | Seconds to wait after SIGTERM before SIGKILL. | 15.0 |
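The two timeouts describe a three-step escalation: wait for a graceful exit, then SIGTERM, then SIGKILL. A minimal sketch of that sequence for a single worker follows. The function name `stop_worker` and its return values are hypothetical and are not part of this module. It assumes only the standard `multiprocessing.Process` interface (`join`, `is_alive`, `terminate`, `kill`).

```python
def stop_worker(proc, shutdown_timeout=30.0, kill_timeout=15.0):
    """Stop one worker, escalating only as far as needed.

    `proc` is anything with the multiprocessing.Process interface.
    Returns the name of the step that ended the process (illustrative only).
    """
    # Step 1: give the worker time to exit on its own.
    proc.join(timeout=shutdown_timeout)
    if not proc.is_alive():
        return "graceful"

    # Step 2: send SIGTERM, then wait again.
    proc.terminate()
    proc.join(timeout=kill_timeout)
    if not proc.is_alive():
        return "sigterm"

    # Step 3: last resort. SIGKILL cannot be caught or ignored.
    proc.kill()
    proc.join()
    return "sigkill"
```

How ProcessManager itself signals a graceful exit to workers (for example, through the pipe connections) is not documented here, so the sketch leaves that part out.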
Methods
| Name | Description |
|---|---|
| check_workers_alive | Raise RuntimeError if any worker process has died. |
| get_health_status | Return health status dict. Use as the /health endpoint response. |
| monitor_workers | Background coroutine that detects dead workers and exits. |
| register_cleanup | Register atexit cleanup for orphan processes. |
check_workers_alive
scripts.process_cleanup.ProcessManager.check_workers_alive()
Raise RuntimeError if any worker process has died.
Call this at the start of request handlers to fail fast instead of hanging on a broken pipe.
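A fail-fast check of this kind can be sketched with `Process.is_alive()`. The helper name `check_alive` and the error message format are assumptions for illustration; the actual method may report dead workers differently.

```python
def check_alive(processes):
    """Raise RuntimeError naming every dead worker; return None otherwise.

    `processes` is a list of objects exposing is_alive(), such as
    multiprocessing.Process instances.
    """
    dead = [index for index, proc in enumerate(processes) if not proc.is_alive()]
    if dead:
        raise RuntimeError(f"worker process(es) {dead} have died")
```

Running the check before writing to a worker's pipe turns a request that would otherwise block forever into an immediate error the endpoint can report.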
get_health_status
scripts.process_cleanup.ProcessManager.get_health_status()
Return health status dict. Use as the /health endpoint response.
monitor_workers
scripts.process_cleanup.ProcessManager.monitor_workers()
Background coroutine that detects dead workers and exits.
When all workers are dead, cleans up their process trees and orphan subprocesses, then force-exits the server.
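The polling pattern described above might look roughly like this. `watch_workers` and its `on_all_dead` callback are hypothetical names. The real coroutine performs its own cleanup and force-exit, and how it does so is not specified in this reference.

```python
import asyncio


async def watch_workers(processes, interval, on_all_dead):
    """Poll worker liveness every `interval` seconds.

    Calls `on_all_dead()` once, when no worker is alive, then returns.
    """
    while True:
        if processes and not any(proc.is_alive() for proc in processes):
            # In a real server this is where process trees and orphans
            # would be cleaned up before exiting.
            on_all_dead()
            return
        await asyncio.sleep(interval)
```

In a FastAPI lifespan, a coroutine like this would typically be started with `asyncio.create_task(...)` on startup and cancelled on shutdown.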
register_cleanup
scripts.process_cleanup.ProcessManager.register_cleanup()
Register atexit cleanup for orphan processes.
Does NOT override SIGTERM — let uvicorn handle it naturally,
which triggers the lifespan shutdown where _shutdown_workers
runs. The atexit handler is a safety net for abnormal exits.
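As an illustration of the safety-net idea, the following sketch registers a cleanup callable with the standard `atexit` module. `register_orphan_cleanup` is a hypothetical helper, and the `cleanup` argument stands in for whatever orphan-cleanup routine is used.

```python
import atexit


def register_orphan_cleanup(cleanup, *patterns):
    """Register `cleanup(*patterns)` to run at interpreter exit.

    Returns the registered callable so it can be passed to
    atexit.unregister() later (useful in tests).
    """
    def _run():
        try:
            cleanup(*patterns)
        except Exception:
            # An exit handler has no caller left to handle errors,
            # so failures are swallowed here.
            pass

    atexit.register(_run)
    return _run
```

Note that `atexit` handlers do not run when the process is killed by a signal Python does not handle, or when `os._exit()` is called, so this covers only exits that go through normal interpreter shutdown.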
Functions
| Name | Description |
|---|---|
| cleanup_orphan_processes | Kill orphan processes matching any of the given patterns. |
| is_fatal_worker_error | Check if an exception indicates the worker should exit. |
| kill_process_tree | Kill a process and all its descendants (depth-first). |
| safe_recv | Receive from a pipe, returning an error dict if the pipe is broken. |
cleanup_orphan_processes
scripts.process_cleanup.cleanup_orphan_processes(*patterns)
Kill orphan processes matching any of the given patterns.
Uses pgrep -f to find processes. Skips the current process.
Intended for cleaning up GPU-holding subprocesses (EngineCore)
that survive their parent’s death.
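A rough sketch of a `pgrep -f` based cleanup is shown below. The helper names `find_pids` and `kill_matching`, the default SIGKILL signal, and the returned list of PIDs are all assumptions; the documented function may use a different signal or return nothing. The sketch requires a POSIX system with `pgrep` on the PATH.

```python
import os
import signal
import subprocess


def find_pids(pattern):
    """Return PIDs whose full command line matches `pattern` (pgrep -f)."""
    try:
        result = subprocess.run(
            ["pgrep", "-f", pattern], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return []  # pgrep is not installed
    # pgrep prints one PID per line and prints nothing when no process matches.
    return [int(tok) for tok in result.stdout.split() if tok.isdigit()]


def kill_matching(*patterns, sig=signal.SIGKILL):
    """Signal every process matching any pattern, skipping the current one."""
    me = os.getpid()
    killed = []
    for pattern in patterns:
        for pid in find_pids(pattern):
            if pid == me or pid in killed:
                continue
            try:
                os.kill(pid, sig)
                killed.append(pid)
            except (ProcessLookupError, PermissionError):
                pass  # already gone, or owned by another user
    return killed
```

Because `pgrep -f` matches against the whole command line, patterns should be specific enough (such as `VLLM::EngineCore`) that they cannot match unrelated processes.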
is_fatal_worker_error
scripts.process_cleanup.is_fatal_worker_error(exc)
Check if an exception indicates the worker should exit.
Returns True for errors from which the worker cannot recover, such as the vLLM EngineCore dying.
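The criteria this function applies are not listed in this reference. Purely as an illustration of how such a check fits into a worker command loop, the sketch below classifies exceptions by matching marker strings. `FATAL_MARKERS`, `looks_fatal`, and `run_commands` are hypothetical, and the marker values are placeholders rather than the module's real rules.

```python
# Hypothetical markers; the real function's criteria are not documented here.
FATAL_MARKERS = ("EngineDeadError", "EngineCore")


def looks_fatal(exc):
    """Return True if the exception type or message contains a fatal marker."""
    text = f"{type(exc).__name__}: {exc}"
    return any(marker in text for marker in FATAL_MARKERS)


def run_commands(commands, handle):
    """Process commands until one fails with an unrecoverable error."""
    results = []
    for command in commands:
        try:
            results.append(handle(command))
        except Exception as exc:
            if looks_fatal(exc):
                break  # unrecoverable: stop the worker loop
            # Recoverable: report the error and keep serving.
            results.append({"error": str(exc)})
    return results
```

Separating fatal from recoverable errors lets a worker survive a bad request while still exiting promptly when its engine is gone, which in turn lets the parent's health monitoring notice the dead process.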
kill_process_tree
scripts.process_cleanup.kill_process_tree(pid)
Kill a process and all its descendants (depth-first).
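One way to walk a process tree depth-first is sketched here, using `pgrep -P` to list direct children. The names `child_pids`, `kill_order`, and `kill_tree` are hypothetical. Killing descendants before their parent (post-order) and using SIGKILL are assumptions, since the reference does not state the order or the signal.

```python
import os
import signal
import subprocess


def child_pids(pid):
    """Direct children of `pid` via `pgrep -P` (empty if none, or no pgrep)."""
    try:
        out = subprocess.run(
            ["pgrep", "-P", str(pid)], capture_output=True, text=True, check=False
        ).stdout
    except FileNotFoundError:
        return []
    return [int(tok) for tok in out.split() if tok.isdigit()]


def kill_order(pid, children_of=child_pids):
    """Depth-first, post-order list: each descendant precedes its parent."""
    order = []
    for child in children_of(pid):
        order.extend(kill_order(child, children_of))
    order.append(pid)
    return order


def kill_tree(pid, sig=signal.SIGKILL):
    """Signal `pid` and all of its descendants, deepest processes first."""
    for target in kill_order(pid):
        try:
            os.kill(target, sig)
        except (ProcessLookupError, PermissionError):
            pass  # already exited, or not ours to signal
```

Passing `children_of` as a parameter keeps the traversal testable without spawning real processes.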
safe_recv
scripts.process_cleanup.safe_recv(conn)
Receive from a pipe, returning an error dict if the pipe is broken.
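A minimal sketch of this pattern with `multiprocessing.Pipe` follows. The name `recv_or_error` and the shape of the error dict are assumptions; the actual keys returned by `safe_recv` are not documented here.

```python
from multiprocessing import Pipe


def recv_or_error(conn):
    """Return the next message, or an error dict if the pipe is broken."""
    try:
        return conn.recv()
    except (EOFError, OSError) as exc:
        # EOFError: the other end was closed with nothing left to read.
        # OSError covers BrokenPipeError, ConnectionResetError, and use
        # of a connection that has already been closed locally.
        return {"error": f"worker pipe broken: {type(exc).__name__}"}


# Usage: simulate a worker that died by closing its end of the pipe.
parent_conn, child_conn = Pipe()
child_conn.close()
reply = recv_or_error(parent_conn)  # an error dict instead of an exception
parent_conn.close()
```

Returning a dict rather than raising lets an endpoint forward the failure to the client in the same shape as a normal worker reply.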