SSH Manager (ssh_manager.py)
File: head/ssh_manager.py
Manages SSH connections, port-forwarding tunnels, remote daemon deployment, and skills synchronization. This is the bridge between the local Head Node and remote machines.
Purpose
- Maintain a pool of SSH connections and tunnels to remote machines
- Create local port-forwarding tunnels to access remote daemons
- Deploy daemon code to remote machines via SCP
- Start and health-check daemons on remote machines
- Sync skills files to remote project directories
- List machines with their online/daemon status
Classes
SSHTunnel
Represents an active SSH tunnel to a remote machine.
class SSHTunnel:
    machine_id: str                # Machine this tunnel connects to
    local_port: int                # Local port (e.g., 19100)
    conn: SSHClientConnection      # asyncssh connection
    listener: SSHListener          # Port forwarding listener
Properties:
alive -- Returns True if the underlying SSH connection is still open.
Methods:
close() -- Closes the port forwarding listener and SSH connection.
SSHManager
Main class managing all SSH operations.
class SSHManager:
    config: Config
    machines: dict[str, MachineConfig]
    tunnels: dict[str, SSHTunnel]  # machine_id -> active tunnel
Key Methods
ensure_tunnel(machine_id: str) -> int
Ensures an SSH tunnel exists to the specified machine. Returns the local port number for accessing the daemon.
Flow:
- Check if a tunnel already exists and is alive -- return existing local port
- If the tunnel is dead, close and remove it
- Allocate a new local port (starting from 19100, incrementing)
- Establish SSH connection via _connect_ssh()
- Create local port forwarding: 127.0.0.1:<local_port> -> 127.0.0.1:<daemon_port>
- Ensure the daemon is running on the remote machine via _ensure_daemon()
- Store the tunnel and return the local port
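The reuse-or-create logic of this flow can be sketched without any real SSH. The following is a minimal illustration, not the module's code: FakeTunnel, TunnelPool, and the open_tunnel factory are hypothetical stand-ins, and the sketch assumes a port is never handed out twice, which the description above does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class FakeTunnel:
    """Hypothetical stand-in for SSHTunnel; `alive` is a plain flag here."""
    machine_id: str
    local_port: int
    alive: bool = True

    def close(self) -> None:
        self.alive = False


class TunnelPool:
    """Reuse-or-create bookkeeping only; no SSH connection is opened."""

    def __init__(
        self,
        open_tunnel: Callable[[str, int], FakeTunnel],
        base_port: int = 19100,
    ) -> None:
        # open_tunnel is a hypothetical factory standing in for
        # "connect, forward the port, make sure the daemon runs".
        self._open_tunnel = open_tunnel
        self._next_port = base_port
        self.tunnels: Dict[str, FakeTunnel] = {}

    def ensure_tunnel(self, machine_id: str) -> int:
        tunnel = self.tunnels.get(machine_id)
        if tunnel is not None:
            if tunnel.alive:
                return tunnel.local_port  # live tunnel: reuse its port
            tunnel.close()                # dead tunnel: clean up first
            del self.tunnels[machine_id]

        # Assumption: ports only ever increment and are not recycled.
        local_port = self._next_port
        self._next_port += 1
        self.tunnels[machine_id] = self._open_tunnel(machine_id, local_port)
        return local_port
```

Calling ensure_tunnel() twice for the same machine returns the same port as long as the tunnel stays alive; a dead tunnel is replaced by a new one on a new port.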
_connect_ssh(machine: MachineConfig) -> SSHClientConnection
Establishes an SSH connection to a machine. Handles:
- SSH key authentication: Uses client_keys if ssh_key is configured
- Password authentication: Supports direct passwords and file:/path syntax
- ProxyJump: Connects through a jump host by first establishing a connection to the jump machine, then using it as a tunnel for the final connection
- Known hosts: Disabled (known_hosts=None) for simplicity in trusted environments
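As a rough sketch of how these options might map onto asyncssh.connect() keyword arguments: build_connect_kwargs and resolve_password are hypothetical helper names, the parameter list is a guess at the relevant MachineConfig fields, and the sketch assumes that file:/path means "read the password from that file" with surrounding whitespace stripped.

```python
from pathlib import Path
from typing import Any, Dict, Optional


def resolve_password(value: str) -> str:
    """Return the password itself, or the contents of the file in 'file:/path'."""
    if value.startswith("file:"):
        # Assumption: the file holds only the password; whitespace is stripped.
        return Path(value[len("file:"):]).expanduser().read_text().strip()
    return value


def build_connect_kwargs(
    host: str,
    user: str,
    port: int = 22,
    ssh_key: Optional[str] = None,
    password: Optional[str] = None,
    tunnel: Any = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments intended for asyncssh.connect()."""
    kwargs: Dict[str, Any] = {
        "host": host,
        "port": port,
        "username": user,
        "known_hosts": None,  # host key checking disabled, as described above
    }
    if ssh_key:
        kwargs["client_keys"] = [str(Path(ssh_key).expanduser())]
    if password:
        kwargs["password"] = resolve_password(password)
    if tunnel is not None:
        # For ProxyJump: an already-open connection to the jump host.
        kwargs["tunnel"] = tunnel
    return kwargs
```

With asyncssh installed, the result could be used as `conn = await asyncssh.connect(**build_connect_kwargs(...))`, passing the jump host's connection as tunnel when a jump host is configured.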
_ensure_daemon(machine_id: str, conn: SSHClientConnection) -> None
Ensures the daemon process is running on the remote machine.
Flow:
- Check if a node.*dist/server.js process is already running via pgrep
- If running, return immediately
- Check if daemon code exists at install_dir (both dist/server.js and node_modules/)
- If missing and auto_deploy is enabled, call _deploy_daemon()
- Start the daemon with nohup, setting:
  - DAEMON_PORT environment variable
  - PATH including the Node.js binary directory and ~/.local/bin (for Claude CLI)
- Poll the health endpoint (health.check RPC) every 2 seconds for up to 30 seconds
- Raise RuntimeError if the daemon does not respond within the timeout
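The polling step at the end of this flow can be sketched as follows. This is an illustration under assumptions: wait_until_healthy is a hypothetical name, the check callable stands in for the health.check RPC, and treating OSError as "not ready yet" is a choice made for this sketch only.

```python
import asyncio
from typing import Awaitable, Callable


async def wait_until_healthy(
    check: Callable[[], Awaitable[bool]],
    interval: float = 2.0,
    timeout: float = 30.0,
) -> None:
    """Poll `check` until it returns True; raise RuntimeError on timeout."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        try:
            if await check():
                return
        except OSError:
            # Assumption: a refused connection means the daemon is still starting.
            pass
        if loop.time() + interval > deadline:
            raise RuntimeError(
                f"daemon did not become healthy within {timeout} seconds"
            )
        await asyncio.sleep(interval)
```

Passing the check in as a callable keeps the timing logic separate from the RPC transport, so the loop can be exercised with short intervals and a fake check.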
_deploy_daemon(machine_id: str, conn: SSHClientConnection) -> None
Deploys daemon code to a remote machine via SCP.
Flow:
- Build the daemon locally if daemon/dist/ does not exist (npm run build)
- Create the remote install directory
- SCP package.json and package-lock.json to the remote
- SCP the entire dist/ directory recursively
- Run npm install --production on the remote machine
- If npm is in a non-standard location, derive its path from node_path
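One plausible way to derive the npm path from node_path is to assume npm sits in the same directory as the node binary. The helper below is a hypothetical sketch of that idea, not the module's code; it also assumes install_dir and node_path are absolute POSIX paths, because quoting a leading ~ would prevent the remote shell from expanding it.

```python
import posixpath
import shlex
from typing import Optional


def npm_install_command(install_dir: str, node_path: Optional[str] = None) -> str:
    """Build a remote shell command that installs production dependencies."""
    npm = "npm"  # default: rely on the remote PATH
    if node_path:
        # Assumption: npm lives beside node,
        # e.g. /opt/node/bin/node -> /opt/node/bin/npm
        bin_dir = posixpath.dirname(node_path)
        if bin_dir:
            npm = posixpath.join(bin_dir, "npm")
    return f"cd {shlex.quote(install_dir)} && {shlex.quote(npm)} install --production"
```

The resulting string would be the command run over the SSH connection after the files have been copied.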
sync_skills(machine_id: str, remote_path: str) -> None
Syncs skills files from the local skills.shared_dir to a remote project path.
Behavior:
- Skips entirely if skills.sync_on_start is false
- Copies CLAUDE.md to the remote project root, but only if it does not already exist there
- Copies the .claude/skills/ directory recursively to the remote project
- Uses existing SSH tunnel connection if available, otherwise creates a new connection
- Errors are logged as warnings and do not fail the session creation
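The copy rules above can be illustrated with a local-filesystem analogue. This sketch replaces SCP with shutil and is not the module's implementation: sync_skills_locally is a hypothetical name, and it assumes the shared directory contains CLAUDE.md and .claude/skills/ at its top level and that the project directory already exists.

```python
import logging
import shutil
from pathlib import Path

logger = logging.getLogger("skills_sync_sketch")


def sync_skills_locally(
    shared_dir: Path, project_dir: Path, sync_on_start: bool = True
) -> None:
    """Mirror the described sync rules between two local directories."""
    if not sync_on_start:
        return  # sync disabled: do nothing
    try:
        src_md = shared_dir / "CLAUDE.md"
        dst_md = project_dir / "CLAUDE.md"
        # Never overwrite a CLAUDE.md that the project already has.
        if src_md.is_file() and not dst_md.exists():
            shutil.copy2(src_md, dst_md)

        src_skills = shared_dir / ".claude" / "skills"
        if src_skills.is_dir():
            # Recursive copy; existing files with the same name are replaced.
            shutil.copytree(
                src_skills,
                project_dir / ".claude" / "skills",
                dirs_exist_ok=True,
            )
    except OSError as exc:
        # Failures are logged as warnings and do not propagate to the caller.
        logger.warning("skills sync failed: %s", exc)
```

Swallowing the error matches the stated intent that a failed sync should not block session creation.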
list_machines() -> list[dict]
Lists all configured machines with their online and daemon status.
Behavior:
- Skips machines that are only used as jump hosts (referenced by proxy_jump and having no default_paths)
- For each machine, attempts an SSH connection with a 15-second timeout
- If reachable, checks if the daemon process is running via pgrep
- Returns a list of dicts with: id, host, user, status (online/offline), daemon (running/stopped/unknown), default_paths
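The jump-host filter in the first step can be sketched as a pure function. The Machine dataclass below is a hypothetical subset of MachineConfig containing only the two fields the rule depends on, and listable_machines is an illustrative name.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Machine:
    """Hypothetical subset of MachineConfig used for this illustration."""
    id: str
    proxy_jump: Optional[str] = None
    default_paths: List[str] = field(default_factory=list)


def listable_machines(machines: Dict[str, Machine]) -> List[str]:
    """Return machine ids, skipping pure jump hosts."""
    # Every id that some other machine names as its jump host.
    jump_ids = {m.proxy_jump for m in machines.values() if m.proxy_jump}
    return [
        machine_id
        for machine_id, m in machines.items()
        # Skip only when BOTH hold: used as a jump host AND no default_paths.
        if not (machine_id in jump_ids and not m.default_paths)
    ]
```

A machine that serves as a jump host but also has default_paths stays in the list, since it can host sessions of its own.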
get_local_port(machine_id: str) -> Optional[int]
Returns the local tunnel port for a machine if a live tunnel exists, otherwise None.
close_all() -> None
Closes all SSH tunnels and connections. Called during graceful shutdown.
Port Allocation
Local ports for SSH tunnels are allocated sequentially starting from 19100:
gpu-1 -> localhost:19100
gpu-2 -> localhost:19101
gpu-3 -> localhost:19102
...
This simple allocation works because the Head Node manages all tunnels in a single process.
Connection to Other Modules
- main.py creates the SSHManager with the full config and calls close_all() on shutdown
- BotBase calls ensure_tunnel() before every daemon RPC call and sync_skills() on /start
- BotBase calls list_machines() for the /ls machine command
- BotBase calls get_local_port() for the /health command when checking all connected machines