PYTHON-5903 : Retry attempts should share a single stable operation_id#2901
PYTHON-5903 : Retry attempts should share a single stable operation_id#2901shrey-mongo wants to merge 1 commit into
Conversation
247b26f to
682dbac
Compare
|
Please fill out the pull request template completely. |
There was a problem hiding this comment.
Pull request overview
This PR ensures command-monitoring operation_id remains stable across retry attempts for retryable reads/writes and relevant cursor operations, improving APM correlation for a single logical operation.
Changes:
- Mint a single per-operation
operation_idin the retryable-operation wrapper and stamp it onto the checked-out connection for each attempt. - Thread the connection’s
op_idthrough pool/server command execution intocommand_runnerso published command-monitoring events use it. - Add sync + async integration tests that force multiple retries via
failCommandand assert all related command events share oneoperation_id.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
tools/synchro.py |
Registers the new test file in the synchro conversion list. |
test/test_operation_id_retry.py |
Sync integration coverage asserting operation_id stability across retries (and no retry for non-retryable multi-writes). |
test/asynchronous/test_operation_id_retry.py |
Async equivalent of the retry operation_id stability tests. |
pymongo/synchronous/server.py |
Passes conn.op_id into cursor command execution for APM publishing. |
pymongo/synchronous/pool.py |
Adds Connection.op_id, forwards it into run_command, and clears it on check-in. |
pymongo/synchronous/mongo_client.py |
Mints/stabilizes per-operation ids across retries and stamps them on checked-out connections. |
pymongo/synchronous/command_runner.py |
Accepts/threads op_id through run_command / run_cursor_command into _run_command. |
pymongo/asynchronous/server.py |
Async counterpart: passes conn.op_id for cursor command execution. |
pymongo/asynchronous/pool.py |
Async counterpart: adds/forwards/clears op_id on connections. |
pymongo/asynchronous/mongo_client.py |
Async counterpart: stable per-operation id and per-attempt stamping onto connections. |
pymongo/asynchronous/command_runner.py |
Async counterpart: threads op_id through command execution to APM event publishing. |
|
Updated description to include full template |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
| "data": { | ||
| "failCommands": [command_name], | ||
| "closeConnection": True, | ||
| "appName": _APP_NAME, |
There was a problem hiding this comment.
This class needs to add the @async_client_context.require_failCommand_appName annotation to safely use appName.
There was a problem hiding this comment.
This needs to be updated to use the new operation_id, if present.
|
There are merge conflicts, please resolve. |
5546477 to
75f4e21
Compare
|
I fixed the merge conflict and the typing error that was causing the workflow to fail |
| func=func, | ||
| ) | ||
|
|
||
| def require_failCommand_appName(self, func): |
There was a problem hiding this comment.
What's the purpose of this change?
There was a problem hiding this comment.
It was needed to fix an untyped-decorator mypy error that was occurring in my test. This matches the signature of require_failCommand_fail_point.
There was a problem hiding this comment.
You don't need both the require_failCommand_fail_point and require_failCommand_appName annotations, the latter is a strict subset of the former.
| # For gossiping $clusterTime from the connection handshake to the client. | ||
| self._cluster_time = None | ||
| # Stable operation id for the operation currently using this connection. | ||
| self.op_id: Optional[int] = None |
There was a problem hiding this comment.
I don't think we want to track op_id on connections. It introduces more risk of accidental inclusion or staleness on operations like auth compared to explicitly passing it through each operation. What about passing op_id as a kwarg on Connection.command() and on _Query/_GetMore where operations are actually initiated?
PYTHON-5903: Retry attempts should share a single stable operation_id
Changes in this PR
Command-monitoring events carry an
operation_idto correlate the events of one logical operation. For single-document CRUD and cursor operations, each retry attempt got a differentoperation_id(it defaulted to the per-attemptrequest_id), so consumers couldn't group a retried operation's events. Bulk writes already shared one across batches._ClientConnectionRetryablemints theoperation_idonce and stamps it on the checked-out connection (conn.op_id) on every attemptConnectiongains anop_idattribute, reset on check-in.pool.commandandserver.run_operationsourceop_idfrom the connection and pass it tocommand_runner.run_command/run_cursor_command, which thread it through to thepublish_command_*events.Test Plan
test/asynchronous/test_operation_id_retry.pyforces multiple retries via afailCommandfailpoint across a read/write matrix and asserts all command events share oneoperation_id; verifies non-retryable multi-writes don't retry.Checklist
Checklist for Author
Checklist for Reviewer