fix: make V1Channel re-subscribable after a failed subscribe#845
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes a regression where transient failures during device/channel setup could leave subscriptions partially active, preventing subsequent reconnects (e.g., “Only one subscription allowed at a time”).
Changes:
- Make
V1Channel.subscribe()teardown atomic and reusable via a shared_teardown()path. - Ensure
RoborockDevice.connect()always unsubscribes ifstart()fails (including non-RoborockExceptionerrors). - Add regression tests covering resubscribe/retry and atomic failure cleanup.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
roborock/devices/rpc/v1_channel.py |
Makes subscription setup atomic and introduces _teardown() to reliably reset internal state on error/unsubscribe. |
roborock/devices/device.py |
Broadens cleanup on start() failure to always unsubscribe before re-raising. |
tests/devices/test_v1_device.py |
Adds regression tests ensuring connect() unsubscribes on any start() failure and can retry cleanly. |
tests/devices/rpc/test_v1_channel.py |
Adds tests for subscribe→unsub→subscribe and for atomic cleanup after subscribe failures. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Can you take a look at Copilots first round of reviews @pike00. I can then take a look afterwards, just mark the PR as 'ready'. Thank you! |
|
Thank you for the fixes here.
Makes sense to me. Unsetting
How would that happen? I would think we wouldn't retry this. Notice we only deal with known RoborockException cases and not other uncaught exceptions which are considered failure. co-pilot made a comment about this catching BaseException and I agree we shouldn't catch such a broad exception here. if we're missing a new unexpeceted exception type we want to fix the bug instead. |
…nonce Address review feedback on Python-roborock#845: - Narrow the cleanup guards in RoborockDevice.connect() and V1Channel.subscribe() from BaseException to Exception so KeyboardInterrupt/SystemExit/CancelledError propagate untouched while still releasing the channel on normal failures (incl. the ValueError and RoborockException retry paths the fix targets). - Use an actually-16-byte test nonce to match its name and avoid relying on absent length validation.
|
On the partway leak — it's reachable through the existing retry path, not a new one. In On where Agreed on dropping |
allenporter
left a comment
There was a problem hiding this comment.
I had some pending review comments i forgot ot send.
Rewrite the V1Channel/RoborockDevice regression tests to assert observable behavior (the callback fires on message arrival; re-subscribe and re-connect succeed) instead of private attributes (_callback, _reconnect_task, _mqtt_unsub, _unsub), per review feedback. Add type: ignore for the test-only mock assignments and FakeChannel construction that were failing the mypy pre-commit hook.
allenporter
left a comment
There was a problem hiding this comment.
Looks great, thank you for this! Ready to merge but one last comment
| self._logger.debug("MQTT connection also failed: %s", err) | ||
| raise | ||
| self._logger.debug("MQTT subscription failed, continuing with local-only connection: %s", err) | ||
| except Exception: |
There was a problem hiding this comment.
This is a style nit but: I'd like to make this clear if its expected or unexpected exceptions here by explicitly catching and forwarding both RoborockException and Exception. w/ teardown code. Essentially i'd say it like: We do expect RoborockException but this is so critical to get right to avoid leaks that we also catch Exception. I'd also like to log that it was un expected uncaught exception here. Same in Device.
That is, i'd not like it to be normal to catch Exception in this code base. We can violate the rules here given this case is important to not leak, but it's a "shouldn't happen".
There was a problem hiding this comment.
Done in 5f25def. Both V1Channel.subscribe() and RoborockDevice.connect() now catch RoborockException (expected) and Exception (unexpected) as separate branches; each runs the teardown/unsub and re-raises, so behavior is unchanged.
The except Exception branch logs at exception level ("Unexpected error during subscribe/connect; ... to avoid a leak") and carries a comment making clear it is a deliberate, signposted exception to the usual "don't catch Exception" rule, kept only because a leaked subscription or reconnect task would brick the device. The expected RoborockException path stays quiet (the transient cause is already debug-logged at the point it happens).
…connect cleanup Split the leak-safety catch-all in V1Channel.subscribe() and RoborockDevice.connect() into an explicit `except RoborockException` (the expected failure path) and an `except Exception` that logs the unexpected, shouldn't-happen error before tearing down. Both branches still run teardown/unsub and re-raise, so behavior is unchanged; the broad catch is now a deliberate, signposted exception to the usual don't-catch-Exception rule, and unexpected errors become observable in logs.
Summary
A device whose first connect attempt hits a transient error is bricked for the rest of the session: it never gets entities and the log shows the misleading
ValueError: Only one subscription allowed at a time. This was observed with two V1 vacuums on one account — the second vacuum (sharing the MQTT session) reliably failed while the first connected fine.Root cause
The connect path is
RoborockDevice.start_connect()→connect_loop()→connect()→V1Channel.subscribe().V1Channel.subscribe()setself._callback = callbacklast, and the returnedunsub()closure tore down the reconnect task and MQTT/local subscriptions but never clearedself._callback.connect(), ifv1_properties.start()raised a transientRoborockException(e.g. the initial status fetch timing out for a second device contending for the MQTT session), theexcept RoborockExceptionblock calledunsub()and re-raised — leavingself._callbackstill set.connect_loop()caught theRoborockExceptionand retriedconnect(). The retry'ssubscribe()sawself._callback is not Noneand raisedValueError("Only one subscription allowed at a time"). ThatValueErroris not aRoborockException, so it escaped toconnect_loop()'s outerexcept Exception, which logsUncaught error during connectand returns — permanently giving up. No ready callbacks fire, so no entities are created.A secondary leak: if
subscribe()raised partway through (after starting_reconnect_task/ setting_mqtt_unsub), it returned nounsub, so the reconnect task and subscriptions leaked on every retry.Fix
V1Channel.subscribe()now claimsself._callbackup front and wraps the local-connect / reconnect-task / MQTT-subscribe setup intry/except BaseException: self._teardown(); raise._teardown()(also returned as the unsub function) that cancels the reconnect task, drops the MQTT/local subscriptions, closes the local channel, and clearsself._callback— leaving the channel cleanly re-subscribable.RoborockDevice.connect()'s cleanup is broadened fromexcept RoborockExceptiontoexcept BaseExceptionso anystart()failure (not just Roborock errors) releases the channel before propagating.Net effect: a retry after a transient first-attempt failure now re-subscribes cleanly instead of tripping the stale-callback guard. A genuinely persistent failure becomes a normal backoff-retry loop surfacing the real error, rather than a hard brick behind a misleading
ValueError.Tests
Added regression tests (all fail against the pre-fix code):
tests/devices/rpc/test_v1_channel.pysubscribe → unsub → subscribeno longer raises._reconnect_task is Noneand the channel re-callable (no leaked task/subscription).tests/devices/test_v1_device.pyconnect()unsubscribes whenstart()fails (parametrized:RoborockExceptionand a non-Roborock error).start()raises a transientRoborockExceptiononce then succeeds →connect_loop()reconnects and the device becomes ready, with noValueError.uv run pytest— full suite passes;ruff checkandmypyclean.