integrations.nemo_gym.server

NeMo Gym server lifecycle management.

Handles cloning the NeMo Gym repo, starting resource servers, waiting for them to become ready, and cleaning up on exit.

Functions

| Name | Description |
| --- | --- |
| ensure_gym_repo | Clone the NeMo Gym repo if it doesn’t exist. |
| ensure_gym_venv | Set up the NeMo Gym Python venv if not present. |
| get_agent_servers | Discover NeMo Gym agent servers from the global config. |
| get_server_base_url | Get the base URL for a given resource server. |
| get_server_configs | Fetch the global config from the NeMo Gym head server. |
| get_verify_endpoint | Get the /verify endpoint URL for a given resource server. |
| start_servers | Start NeMo Gym resource servers via ng_run. |
| wait_for_resource_servers | Wait for all resource servers in the config to become reachable. |

ensure_gym_repo

integrations.nemo_gym.server.ensure_gym_repo(gym_dir, auto_clone=True)

Clone the NeMo Gym repo if it doesn’t exist.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| gym_dir | str | Path to the NeMo Gym directory. | required |
| auto_clone | bool | Whether to auto-clone if missing. | True |

Returns

| Name | Type | Description |
| --- | --- | --- |
| | str | Resolved path to the NeMo Gym directory. |
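
The clone-if-missing behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual implementation: the function name `ensure_repo`, the `repo_url` parameter, and the choice to raise `FileNotFoundError` when `auto_clone` is false are all assumptions made for the example.

```python
import os
import subprocess


def ensure_repo(gym_dir: str, repo_url: str, auto_clone: bool = True) -> str:
    """Return the absolute path to gym_dir, cloning repo_url into it if missing.

    Hypothetical helper: name, repo_url parameter, and error type are assumptions.
    """
    path = os.path.abspath(os.path.expanduser(gym_dir))
    if os.path.isdir(path):
        return path  # directory already present, nothing to clone
    if not auto_clone:
        # Assumed behaviour when the directory is missing and cloning is disabled.
        raise FileNotFoundError(f"{path} not found and auto_clone is False")
    # Plain `git clone`; the real function may pass additional options.
    subprocess.run(["git", "clone", repo_url, path], check=True)
    return path
```

Passing the repository URL explicitly keeps the sketch free of any hard-coded remote; the documented function takes only `gym_dir` and `auto_clone`.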

ensure_gym_venv

integrations.nemo_gym.server.ensure_gym_venv(gym_dir)

Set up the NeMo Gym Python venv if not present.

get_agent_servers

integrations.nemo_gym.server.get_agent_servers(
    global_config,
    head_host='127.0.0.1',
)

Discover NeMo Gym agent servers from the global config.

Agent servers handle multi-turn orchestration via the /run endpoint. Returns a mapping of agent_name → URL (e.g., {"simple_agent": "http://host:port"}).
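
As a small illustration of how the returned mapping might be consumed, the sketch below derives each agent's /run URL from its base URL. The helper name `run_endpoints` and the port number are hypothetical and not part of this module.

```python
def run_endpoints(agent_servers: dict[str, str]) -> dict[str, str]:
    """Map each agent name to the URL of its /run endpoint (hypothetical helper)."""
    return {
        name: url.rstrip("/") + "/run"  # tolerate a trailing slash on the base URL
        for name, url in agent_servers.items()
    }


# Example mapping in the shape get_agent_servers is documented to return;
# the port is made up for illustration.
agents = {"simple_agent": "http://127.0.0.1:8000"}
endpoints = run_endpoints(agents)
```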

get_server_base_url

integrations.nemo_gym.server.get_server_base_url(global_config, server_name)

Get the base URL for a given resource server.

get_server_configs

integrations.nemo_gym.server.get_server_configs(head_port=11000, timeout=30.0)

Fetch the global config from the NeMo Gym head server.

Retries up to 3 times with exponential backoff. The default per-attempt timeout is 30s (raised from the original 5s) because head servers can be slow to respond when they’re concurrently serving rollouts from a prior training run. A 5s timeout was empirically too tight to survive a kill-and-relaunch cycle.

Returns

| Name | Type | Description |
| --- | --- | --- |
| | dict | Dict mapping server_name -> server config. |
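
The retry policy described above (a fixed per-attempt timeout with exponential backoff between attempts) can be sketched generically as below. The helper `fetch_with_backoff`, the 1-second base delay, and the reading of "up to 3 times" as three attempts in total are assumptions; the actual delays used by get_server_configs are not documented here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_backoff(
    fetch: Callable[[float], T],
    timeout: float = 30.0,
    attempts: int = 3,
    base_delay: float = 1.0,  # assumed value, not taken from the module
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(timeout) up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fetch(timeout)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits base_delay, then 2 * base_delay, ...
    raise ValueError("attempts must be at least 1")
```

Here `fetch` stands in for whatever performs the HTTP request to the head server, and injecting `sleep` lets the backoff schedule be checked without waiting.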

get_verify_endpoint

integrations.nemo_gym.server.get_verify_endpoint(global_config, server_name)

Get the /verify endpoint URL for a given resource server.
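
To make the relationship between the base URL and the /verify endpoint concrete, here is a minimal sketch. It assumes each server's config entry exposes `host` and `port` keys and that servers speak plain HTTP; the real layout of the global config is not shown on this page and may differ. The function names are hypothetical.

```python
def base_url(global_config: dict, server_name: str) -> str:
    """Build a base URL from an assumed {"host": ..., "port": ...} config entry."""
    entry = global_config[server_name]
    # Assumed keys and scheme: the actual NeMo Gym config may be structured differently.
    return f"http://{entry['host']}:{entry['port']}"


def verify_endpoint(global_config: dict, server_name: str) -> str:
    """Append /verify to the server's base URL."""
    return base_url(global_config, server_name) + "/verify"
```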

start_servers

integrations.nemo_gym.server.start_servers(
    gym_dir,
    config_paths,
    head_port=11000,
    timeout=360,
)

Start NeMo Gym resource servers via ng_run.

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| gym_dir | str | Path to the NeMo Gym directory. | required |
| config_paths | list[str] | List of config YAML paths relative to gym_dir. | required |
| head_port | int | Port for the head server. | 11000 |
| timeout | int | Max seconds to wait for servers. | 360 |
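
The module overview mentions cleanup on exit. One common way to get that behaviour for a child process is to register an `atexit` handler, sketched below. This is a generic pattern under stated assumptions, not the module's code: `launch_with_cleanup` is a hypothetical name, and the ng_run command line is deliberately left out because its arguments are not documented here.

```python
import atexit
import subprocess
from typing import Callable, List, Optional, Tuple


def launch_with_cleanup(
    cmd: List[str], cwd: Optional[str] = None
) -> Tuple[subprocess.Popen, Callable[[], None]]:
    """Start cmd as a child process and stop it when the interpreter exits."""
    proc = subprocess.Popen(cmd, cwd=cwd)

    def stop() -> None:
        if proc.poll() is None:  # child is still running
            proc.terminate()
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()  # escalate if terminate is ignored
                proc.wait()

    atexit.register(stop)  # runs on normal interpreter shutdown
    return proc, stop
```

A caller would pass the server command as `cmd` and the Gym directory as `cwd`; returning `stop` also allows an explicit shutdown before exit. Note that `atexit` handlers do not run if the parent is killed by an unhandled signal.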

wait_for_resource_servers

integrations.nemo_gym.server.wait_for_resource_servers(
    global_config,
    timeout=180,
)

Wait for all resource servers in the config to become reachable.
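
Waiting for reachability is typically a poll-until-deadline loop. The sketch below shows one way to write it with a TCP connect probe; the names `tcp_reachable` and `wait_until_reachable`, the 1-second poll interval, and returning `False` on timeout (rather than raising) are assumptions, since the page does not say how the real function probes servers or reports a timeout.

```python
import socket
import time
from typing import Callable, Iterable, Tuple

Address = Tuple[str, int]


def tcp_reachable(addr: Address, connect_timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to addr succeeds."""
    try:
        with socket.create_connection(addr, timeout=connect_timeout):
            return True
    except OSError:
        return False


def wait_until_reachable(
    addrs: Iterable[Address],
    timeout: float = 180.0,
    interval: float = 1.0,  # assumed poll interval
    probe: Callable[[Address], bool] = tcp_reachable,
) -> bool:
    """Poll until every address answers; return False once timeout elapses."""
    pending = set(addrs)
    deadline = time.monotonic() + timeout
    while True:
        # Keep only the addresses that still fail the probe.
        pending = {a for a in pending if not probe(a)}
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Using `time.monotonic()` for the deadline keeps the loop immune to wall-clock adjustments, and the injectable `probe` makes it easy to swap the TCP check for an HTTP health request.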