feat: remove rpyc#2094
Conversation
Greptile SummaryThis PR removes RPyC from the dimos codebase and replaces it with the existing LCM-based RPC infrastructure already used for inter-module communication. The coordinator and worker-side RPyC servers are deleted; a new
Confidence Score: 5/5The RPyC removal is clean and complete — all call sites, registry fields, worker messages, and server threads are excised. The LCM-based replacement reuses the existing transport layer consistently. The change is a well-scoped excision of one dependency and its replacement with an already-proven transport. The new CoordinatorRPC wrapper is straightforward, Blueprint pickling is tested, and the forward-compatible RunEntry.load() handles upgrade scenarios gracefully. The only new finding is a cosmetic concern in an unrelated file ('true' as an alias in make_connection). All previously flagged issues are carry-overs from earlier threads. The make_connection change in dimos/robot/unitree/go2/connection.py introduces an undocumented 'true' synonym for the mujoco connection type that deserves a second look. Important Files Changed
Sequence DiagramsequenceDiagram
participant CLI as dimos CLI
participant MC as ModuleCoordinator
participant CRPC as CoordinatorRPC (LCM)
participant Client as Dimos.connect()
participant RMS as RemoteModuleSource
participant Proxy as RPCClient / _RemoteProxy
CLI->>MC: start_rpc_service()
MC->>CRPC: CoordinatorRPC.serve(coordinator)
CRPC-->>CRPC: _ensure_no_existing_service (0.5s probe)
CRPC-->>MC: CoordinatorRPC instance
Client->>RMS: "RemoteModuleSource(timeout=5.0)"
RMS->>CRPC: "CoordinatorRPC.connect(timeout=5.0)"
CRPC-->>CRPC: call("ping") — liveness check
CRPC-->>RMS: connected
Client->>RMS: list_module_names()
RMS->>CRPC: call("list_modules")
CRPC->>MC: list_modules()
MC-->>CRPC: [ModuleDescriptor, ...]
CRPC-->>RMS: descriptors
Client->>RMS: get_module("StressTestModule")
RMS-->>RMS: importlib.import_module(qualified_path)
alt class importable
RMS-->>Proxy: "RPCClient(None, cls, rpc=coord.rpc)"
else ImportError / AttributeError
RMS-->>Proxy: _RemoteProxy(coord.rpc, name, rpc_names)
end
RMS-->>Client: proxy
Client->>Proxy: proxy.ping()
Proxy->>CRPC: call_sync("StressTestModule/ping", ...)
CRPC-->>Proxy: "pong"
Reviews (3): Last reviewed commit: "feat: remove rpyc" | Re-trigger Greptile |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
afbb2d4 to
0b95215
Compare
| maintains for inter-module calls. Method calls flow over the same LCM | ||
| bus the modules use to talk to each other. | ||
| """ | ||
|
|
||
| is_remote = False |
There was a problem hiding this comment.
_deployed_modules iterated without the coordinator's lock
ModuleCoordinator always acquires _modules_lock before touching _deployed_modules (see list_module_names, load_module, _restart_module, etc.), but get_module here iterates the raw dict directly. Concurrent calls to Dimos.run() or a module restart on any other thread/task while get_module is mid-iteration will raise RuntimeError: dictionary changed size during iteration. Use the existing list_module_names() method (which holds the lock) together with a direct lookup, or add a lock-guarded helper on the coordinator.
| def connect(cls, *, run_id: str | None = None, timeout: float = 5.0) -> Dimos: | ||
| """Connect to an already-running DimOS instance over LCM. | ||
|
|
||
| With no arguments, finds the most recent alive `RunEntry` in the | ||
| registry and connects to its coordinator RPyC endpoint. Use `run_id=` to | ||
| select a specific run, or `host=` + `port=` to bypass the registry. | ||
| registry and connects via LCM to its `Coordinator` @rpc service. Use | ||
| `run_id=` to select a specific run. | ||
|
|
||
| Returns a `Dimos` instance in read/call mode: `skills`, attribute | ||
| access, `__repr__` and `__dir__` work, but `run()` and `restart()` raise | ||
| `NotImplementedError`. `stop()` closes the connection without | ||
| terminating the remote process. | ||
| access, `__repr__` and `__dir__` work, but only methods marked with | ||
| `@rpc` (and `@skill`, which implies `@rpc`) on a module are callable. | ||
| `stop()` closes the connection without terminating the remote process. | ||
| """ | ||
| if host is not None and port is not None: | ||
| source: ModuleSource = RemoteModuleSource(host, port) | ||
| if run_id is not None: | ||
| entries = [e for e in list_runs(alive_only=True) if e.run_id == run_id] | ||
| if not entries: | ||
| raise RuntimeError(f"No running DimOS instance with run_id={run_id!r}") | ||
| else: | ||
| rpyc_port = get_most_recent_rpyc_port(run_id=run_id) | ||
| source = RemoteModuleSource("localhost", rpyc_port) | ||
| if get_most_recent(alive_only=True) is None: | ||
| raise RuntimeError( | ||
| "No running DimOS instance. Start one with `dimos run <blueprint>`." | ||
| ) | ||
|
|
||
| source = RemoteModuleSource(timeout=timeout) | ||
| instance = cls() | ||
| instance._source = source | ||
| return instance | ||
|
|
||
| @classmethod | ||
| def connect_in_process(cls, *, timeout: float = 5.0) -> Dimos: | ||
| """Connect over LCM without consulting the run registry. | ||
|
|
||
| For tests where the coordinator and the client live in the same | ||
| process and there is no `RunEntry` on disk. | ||
| """ | ||
| source = RemoteModuleSource(timeout=timeout) | ||
| instance = cls() |
There was a problem hiding this comment.
run_id= validates existence but no longer routes the connection
The old implementation used get_most_recent_rpyc_port(run_id=run_id) to retrieve the specific TCP port for the requested run, so the socket actually landed on that process. The new code checks the registry entry exists, then creates RemoteModuleSource() which connects to whatever Coordinator happens to answer first on the shared LCM bus — regardless of which run_id was requested. In a registry with two alive entries (e.g. two daemon restarts whose old PIDs are still considered live), passing a run_id gives false confidence: the validation passes but the connection may go to the "wrong" coordinator. The docstring still says "Use run_id= to select a specific run", which is no longer accurate. Either remove the run_id parameter or document that it is now only a liveness guard, not a routing mechanism.
c063cac to
93fc89c
Compare
Problem
Solution
Contributor License Agreement