scripts.process_cleanup

Reusable process lifecycle management for vLLM serve scripts.

Handles graceful shutdown, orphan cleanup, and health monitoring for multiprocessing-based server architectures where a main process dispatches work to worker subprocesses that spawn GPU-heavy children (e.g., vLLM EngineCore).

Usage:

from axolotl.scripts.process_cleanup import ProcessManager, is_fatal_worker_error

manager = ProcessManager(processes, connections)
manager.register_cleanup()

# In FastAPI lifespan:
async with manager.lifespan_context():
    yield  # server runs here

# In endpoints:
manager.check_workers_alive()  # raises RuntimeError if any worker is dead

# In worker command loop:
if is_fatal_worker_error(exc):
    break  # exit worker

Classes

| Name | Description |
| --- | --- |
| ProcessManager | Manages worker process lifecycle for a FastAPI-based serve script. |

ProcessManager

scripts.process_cleanup.ProcessManager(
    processes,
    connections,
    orphan_patterns=None,
    monitor_interval=5.0,
    shutdown_timeout=30.0,
    kill_timeout=15.0,
)

Manages worker process lifecycle for a FastAPI-based serve script.

Handles:

- Signal-based shutdown (SIGTERM)
- Background health monitoring (detects dead workers)
- Process tree cleanup on exit
- Orphan EngineCore cleanup

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| processes | list[Process] | List of worker Process objects. | required |
| connections | list[Connection] | List of parent-side Pipe connections to workers. | required |
| orphan_patterns | list[str] \| None | Process name patterns to search for orphans on cleanup. Defaults to ["VLLM::EngineCore"]. | None |
| monitor_interval | float | Seconds between worker health checks. | 5.0 |
| shutdown_timeout | float | Seconds to wait for graceful worker exit before SIGTERM. | 30.0 |
| kill_timeout | float | Seconds to wait after SIGTERM before SIGKILL. | 15.0 |
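
As a minimal sketch of how the two required lists might be built: each worker gets one `multiprocessing.Pipe`, and the parent keeps its end in the same order as the process list. The names `worker_loop` and `start_workers` are hypothetical and not part of this module. The `fork` context is used only so the sketch runs as a plain script on Unix; a real GPU serve script may need `spawn` instead.

```python
import multiprocessing as mp

# Assumption: "fork" keeps the sketch self-contained on Unix.
ctx = mp.get_context("fork")


def worker_loop(conn):
    """Hypothetical worker: echo each command until told to stop."""
    while True:
        msg = conn.recv()
        if msg == "stop":
            break
        conn.send({"echo": msg})
    conn.close()


def start_workers(n):
    """Start n workers; return (processes, connections) in matching order."""
    processes, connections = [], []
    for _ in range(n):
        parent_conn, child_conn = ctx.Pipe()
        proc = ctx.Process(target=worker_loop, args=(child_conn,), daemon=True)
        proc.start()
        child_conn.close()  # the parent keeps only its own end of the pipe
        processes.append(proc)
        connections.append(parent_conn)
    return processes, connections
```

The two returned lists have the shapes described for `processes` and `connections` above.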

Methods

| Name | Description |
| --- | --- |
| check_workers_alive | Raise RuntimeError if any worker process has died. |
| get_health_status | Return health status dict. Use as the /health endpoint response. |
| monitor_workers | Background coroutine that detects dead workers and exits. |
| register_cleanup | Register atexit cleanup for orphan processes. |

check_workers_alive
scripts.process_cleanup.ProcessManager.check_workers_alive()

Raise RuntimeError if any worker process has died.

Call this at the start of request handlers to fail fast instead of hanging on a broken pipe.
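
The fail-fast check can be pictured roughly as follows. This is a sketch of the idea only, not the module's implementation: `check_alive_sketch` and `StubProcess` are hypothetical names, and the error message format is an assumption.

```python
class StubProcess:
    """Stand-in exposing the same is_alive() method as multiprocessing.Process."""

    def __init__(self, alive):
        self._alive = alive

    def is_alive(self):
        return self._alive


def check_alive_sketch(processes):
    """Raise RuntimeError naming every worker whose process is no longer alive."""
    dead = [i for i, proc in enumerate(processes) if not proc.is_alive()]
    if dead:
        # Assumption: the real message wording may differ.
        raise RuntimeError(f"worker process(es) {dead} have died")
```

Running this before sending anything over a pipe means a request fails immediately rather than waiting on a connection that will never answer.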

get_health_status
scripts.process_cleanup.ProcessManager.get_health_status()

Return health status dict. Use as the /health endpoint response.
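
The reference does not specify the keys of the returned dict. Purely as an illustration, a status dict derived from worker liveness might look like this; the function name `health_status_sketch` and every key below are assumptions.

```python
def health_status_sketch(processes):
    """Summarise worker liveness; the key names are hypothetical."""
    alive = sum(1 for proc in processes if proc.is_alive())
    total = len(processes)
    return {
        "status": "ok" if alive == total else "unhealthy",
        "workers_alive": alive,
        "workers_total": total,
    }
```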

monitor_workers
scripts.process_cleanup.ProcessManager.monitor_workers()

Background coroutine that detects dead workers and exits.

When all workers are dead, cleans up their process trees and orphan subprocesses, then force-exits the server.
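
A polling loop of this kind could be sketched as below. The name `monitor_sketch` and the `on_all_dead` callback are hypothetical; the callback stands in for the cleanup and force-exit steps described above so the sketch stays side-effect free.

```python
import asyncio


async def monitor_sketch(processes, interval, on_all_dead):
    """Poll worker liveness; call on_all_dead once when no worker is alive."""
    while True:
        await asyncio.sleep(interval)
        if not any(proc.is_alive() for proc in processes):
            # Assumption: the real coroutine cleans up process trees and
            # orphans here, then force-exits the server.
            on_all_dead()
            return
```

Inside an async server, such a coroutine would typically be started with `asyncio.create_task(...)` during startup and cancelled on shutdown.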

register_cleanup
scripts.process_cleanup.ProcessManager.register_cleanup()

Register atexit cleanup for orphan processes.

Does NOT override SIGTERM: uvicorn is left to handle it, which triggers the lifespan shutdown where _shutdown_workers runs. The atexit handler is a safety net for abnormal exits.
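
A safety-net registration along these lines can be sketched with the standard `atexit` module. The name `register_cleanup_sketch` is hypothetical, and the run-once guard is an assumption, added so that cleanup already performed during a normal shutdown is not repeated at interpreter exit.

```python
import atexit


def register_cleanup_sketch(cleanup):
    """Register cleanup to run at interpreter exit, at most once."""
    state = {"done": False}

    def _run_once():
        if state["done"]:
            return  # already ran, e.g. during a normal shutdown
        state["done"] = True
        cleanup()

    atexit.register(_run_once)
    return _run_once  # returned so a shutdown path can call it early
```

Note that `atexit` handlers do not run when the process is killed by an unhandled signal or exits via `os._exit()`, so this only covers exits that unwind through the interpreter.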

Functions

| Name | Description |
| --- | --- |
| cleanup_orphan_processes | Kill orphan processes matching any of the given patterns. |
| is_fatal_worker_error | Check if an exception indicates the worker should exit. |
| kill_process_tree | Kill a process and all its descendants (depth-first). |
| safe_recv | Receive from a pipe, returning an error dict if the pipe is broken. |

cleanup_orphan_processes

scripts.process_cleanup.cleanup_orphan_processes(*patterns)

Kill orphan processes matching any of the given patterns.

Uses pgrep -f to find processes. Skips the current process. Intended for cleaning up GPU-holding subprocesses (EngineCore) that survive their parent’s death.
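
The `pgrep -f` lookup can be sketched as follows. The function names and the choice of SIGKILL are assumptions, not the module's confirmed behaviour. Because `pgrep -f` matches against the full command line, a loose pattern can match unrelated processes, so patterns should be as specific as possible.

```python
import os
import signal
import subprocess


def find_pids_sketch(pattern):
    """Return PIDs whose command line matches pattern, excluding this process."""
    try:
        result = subprocess.run(
            ["pgrep", "-f", pattern],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return []  # pgrep is not installed on this system
    pids = [int(token) for token in result.stdout.split()]
    return [pid for pid in pids if pid != os.getpid()]


def cleanup_orphans_sketch(*patterns, sig=signal.SIGKILL):
    """Signal every process matching any pattern; return the PIDs signalled."""
    signalled = []
    for pattern in patterns:
        for pid in find_pids_sketch(pattern):
            try:
                os.kill(pid, sig)
                signalled.append(pid)
            except (ProcessLookupError, PermissionError):
                pass  # already gone, or not ours to kill
    return signalled
```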

is_fatal_worker_error

scripts.process_cleanup.is_fatal_worker_error(exc)

Check if an exception indicates the worker should exit.

Returns True for errors from which the worker cannot recover, such as the vLLM EngineCore dying.
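
The exact criteria are not listed in this reference. One plausible shape, shown only as a sketch, is to treat broken-pipe conditions and exceptions that mention the engine as unrecoverable. The name `is_fatal_sketch`, the marker tuple, and the choice of exception types are all assumptions.

```python
# Hypothetical markers; the real function may check different conditions.
FATAL_MARKERS = ("EngineCore",)


def is_fatal_sketch(exc):
    """Return True if exc looks unrecoverable for a worker (illustrative rules)."""
    if isinstance(exc, (BrokenPipeError, EOFError)):
        return True  # assumption: a lost pipe cannot be repaired in place
    text = f"{type(exc).__name__}: {exc}"
    return any(marker in text for marker in FATAL_MARKERS)
```

A worker command loop would call such a check in its `except` block and break out of the loop when it returns True, letting recoverable errors be reported back to the parent instead.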

kill_process_tree

scripts.process_cleanup.kill_process_tree(pid)

Kill a process and all its descendants (depth-first).
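
A depth-first traversal could be sketched like this, under the assumption that descendants are signalled before their parent (post-order). Children are discovered here with `pgrep -P`, which may differ from what the module actually uses; all function names are hypothetical.

```python
import os
import signal
import subprocess


def child_pids(pid):
    """Return the direct children of pid, using pgrep -P."""
    try:
        result = subprocess.run(
            ["pgrep", "-P", str(pid)],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return []  # pgrep is not installed on this system
    return [int(token) for token in result.stdout.split()]


def kill_order_sketch(pid, children_of=child_pids):
    """List pid and its descendants depth-first, descendants before parents."""
    order = []
    for child in children_of(pid):
        order.extend(kill_order_sketch(child, children_of))
    order.append(pid)
    return order


def kill_tree_sketch(pid, sig=signal.SIGKILL, children_of=child_pids):
    """Send sig to every process in the tree rooted at pid."""
    for target in kill_order_sketch(pid, children_of):
        try:
            os.kill(target, sig)
        except ProcessLookupError:
            pass  # the process exited before it could be signalled
```

Computing the full order before sending any signal matters: once a parent is killed, its children are reparented and can no longer be found through it.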

safe_recv

scripts.process_cleanup.safe_recv(conn)

Receive from a pipe, returning an error dict if the pipe is broken.
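
The pattern can be sketched as below. `Connection.recv()` raises `EOFError` when the other end has been closed and nothing is left to read; `OSError` covers `BrokenPipeError` and `ConnectionResetError`. The name `safe_recv_sketch` and the shape of the error dict are assumptions.

```python
from multiprocessing import Pipe


def safe_recv_sketch(conn):
    """Receive one object from conn, or an error dict if the pipe is broken."""
    try:
        return conn.recv()
    except (EOFError, OSError) as exc:
        # Assumption: the real error dict may use different keys.
        return {"error": f"worker pipe broken: {type(exc).__name__}"}


parent_end, worker_end = Pipe()
worker_end.send({"ok": 1})
first = safe_recv_sketch(parent_end)    # the object that was sent
worker_end.close()
second = safe_recv_sketch(parent_end)   # an error dict instead of an exception
```

Returning a dict keeps request handlers on a single code path: they inspect the result for an error key instead of wrapping every receive in its own `try` block.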