integrations.nemo_gym.server
NeMo Gym server lifecycle management.
Handles cloning the NeMo Gym repo, starting resource servers, waiting for readiness, and cleanup on exit.
Functions
| Name | Description |
|---|---|
| ensure_gym_repo | Clone the NeMo Gym repo if it doesn’t exist. |
| ensure_gym_venv | Set up the NeMo Gym Python venv if not present. |
| get_agent_servers | Discover NeMo Gym agent servers from the global config. |
| get_server_base_url | Get the base URL for a given resource server. |
| get_server_configs | Fetch the global config from the NeMo Gym head server. |
| get_verify_endpoint | Get the /verify endpoint URL for a given resource server. |
| start_servers | Start NeMo Gym resource servers via ng_run. |
| wait_for_resource_servers | Wait for all resource servers in the config to become reachable. |
ensure_gym_repo
integrations.nemo_gym.server.ensure_gym_repo(gym_dir, auto_clone=True)
Clone the NeMo Gym repo if it doesn’t exist.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| gym_dir | str | Path to the NeMo Gym directory. | required |
| auto_clone | bool | Whether to auto-clone if missing. | True |
Returns
| Name | Type | Description |
|---|---|---|
| | str | Resolved path to the NeMo Gym directory. |
ensure_gym_venv
integrations.nemo_gym.server.ensure_gym_venv(gym_dir)
Set up the NeMo Gym Python venv if not present.
get_agent_servers
integrations.nemo_gym.server.get_agent_servers(
global_config,
head_host='127.0.0.1',
)
Discover NeMo Gym agent servers from the global config.
Agent servers handle multi-turn orchestration via the /run endpoint. Returns a mapping of agent_name → URL (e.g., {"simple_agent": "http://host:port"}).
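As a rough illustration of how the returned mapping might be consumed, the sketch below builds the /run URL for a named agent. The helper `agent_run_url` is hypothetical and not part of this module, and the host and port in the sample mapping are placeholders; only the mapping shape and the /run path come from the description above.

```python
def agent_run_url(agent_servers: dict[str, str], agent_name: str) -> str:
    """Return the /run URL for one agent (hypothetical helper, not in the module)."""
    try:
        base_url = agent_servers[agent_name]
    except KeyError:
        known = ", ".join(sorted(agent_servers)) or "none"
        raise KeyError(
            f"unknown agent {agent_name!r}; known agents: {known}"
        ) from None
    # Strip a trailing slash so the result never contains "//run".
    return base_url.rstrip("/") + "/run"


# Placeholder mapping in the shape described for get_agent_servers.
agent_servers = {"simple_agent": "http://127.0.0.1:8080"}
print(agent_run_url(agent_servers, "simple_agent"))
```

Raising a descriptive KeyError for an unknown agent is a design choice of this sketch; the real module may handle missing agents differently.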
get_server_base_url
integrations.nemo_gym.server.get_server_base_url(global_config, server_name)
Get the base URL for a given resource server.
get_server_configs
integrations.nemo_gym.server.get_server_configs(head_port=11000, timeout=30.0)
Fetch the global config from the NeMo Gym head server.
Retries up to 3 times with exponential backoff. The default per-attempt timeout is 30s (raised from the original 5s) because head servers can be slow to respond when they’re concurrently serving rollouts from a prior training run. A 5s timeout was empirically too tight to survive a kill-and-relaunch cycle.
Returns
| Name | Type | Description |
|---|---|---|
| | dict | Dict mapping server_name -> server config. |
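The retry behaviour described for get_server_configs can be sketched roughly as follows. This is not the module's implementation: `fetch_with_backoff`, the `fetch` callable, the 1-second base delay, and the reading of "up to 3 times" as three attempts in total are all assumptions made for illustration. Only the 30-second per-attempt timeout and the use of exponential backoff come from the description above.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def fetch_with_backoff(
    fetch: Callable[[float], T],
    timeout: float = 30.0,      # per-attempt timeout, as documented
    attempts: int = 3,          # assumed: three attempts in total
    base_delay: float = 1.0,    # assumed: not stated in the docs
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(timeout), retrying with exponentially growing pauses."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return fetch(timeout)
        except Exception as exc:
            last_error = exc
            # Pause 1x, 2x, 4x ... the base delay, but not after the last try.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"fetch failed after {attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real delays; a caller would normally leave it at the default.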
get_verify_endpoint
integrations.nemo_gym.server.get_verify_endpoint(global_config, server_name)
Get the /verify endpoint URL for a given resource server.
start_servers
integrations.nemo_gym.server.start_servers(
gym_dir,
config_paths,
head_port=11000,
timeout=360,
)
Start NeMo Gym resource servers via ng_run.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| gym_dir | str | Path to the NeMo Gym directory. | required |
| config_paths | list[str] | List of config YAML paths relative to gym_dir. | required |
| head_port | int | Port for the head server. | 11000 |
| timeout | int | Max seconds to wait for servers. | 360 |
wait_for_resource_servers
integrations.nemo_gym.server.wait_for_resource_servers(
global_config,
timeout=180,
)
Wait for all resource servers in the config to become reachable.
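One way to picture this kind of readiness wait is a polling loop with a deadline. The sketch below is an assumption-laden illustration, not the module's code: `wait_until_reachable` is a hypothetical name, it takes a plain list of URLs rather than the global config, and it treats "reachable" as "accepts a TCP connection", which may differ from the check the real function performs.

```python
import socket
import time
from urllib.parse import urlparse


def wait_until_reachable(urls, timeout=180.0, poll_interval=1.0):
    """Block until every URL accepts a TCP connection, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    pending = set(urls)
    while pending:
        for url in list(pending):
            parsed = urlparse(url)
            address = (parsed.hostname, parsed.port or 80)
            try:
                # A successful connect is taken as "server is up" (assumption).
                with socket.create_connection(address, timeout=2.0):
                    pending.discard(url)
            except OSError:
                pass  # not up yet; try again on the next pass
        if not pending:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"servers not reachable: {sorted(pending)}")
        time.sleep(poll_interval)
```

Using `time.monotonic()` for the deadline keeps the wait unaffected by system clock adjustments.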