Skip to content

PYTHON-5903 : Retry attempts should share a single stable operation_id#2901

Open
shrey-mongo wants to merge 1 commit into
mongodb:masterfrom
shrey-mongo:shrey-mongo/PYTHON-5903
Open

PYTHON-5903 : Retry attempts should share a single stable operation_id#2901
shrey-mongo wants to merge 1 commit into
mongodb:masterfrom
shrey-mongo:shrey-mongo/PYTHON-5903

Conversation

@shrey-mongo

@shrey-mongo shrey-mongo commented Jun 29, 2026

Copy link
Copy Markdown

PYTHON-5903: Retry attempts should share a single stable operation_id

Changes in this PR

Command-monitoring events carry an operation_id to correlate the events of one logical operation. For single-document CRUD and cursor operations, each retry attempt got a different operation_id (it defaulted to the per-attempt request_id), so consumers couldn't group a retried operation's events. Bulk writes already shared one across batches.

  • _ClientConnectionRetryable mints the operation_id once and stamps it on the checked-out connection (conn.op_id) on every attempt
  • Connection gains an op_id attribute, reset on check-in.
  • pool.command and server.run_operation source op_id from the connection and pass it to command_runner.run_command / run_cursor_command, which thread it through to the publish_command_* events.

Test Plan

  • New test/asynchronous/test_operation_id_retry.py forces multiple retries via a failCommand failpoint across a read/write matrix and asserts all command events share one operation_id; verifies non-retryable multi-writes don't retry.
  • Ran the new tests plus the existing command-monitoring and retryable reads/writes suites against a single-node replica set which passed.

Checklist

Checklist for Author

  • Did you update the changelog (if necessary)?
  • Is there test coverage?
  • Is any followup work tracked in a JIRA ticket? If so, add link(s).

Checklist for Reviewer

  • Does the title of the PR reference a JIRA Ticket?
  • Do you fully understand the implementation? (Would you be comfortable explaining how this code works to someone else?)
  • Is all relevant documentation (README or docstring) updated?

@shrey-mongo shrey-mongo requested a review from a team as a code owner June 29, 2026 15:26
@shrey-mongo shrey-mongo requested a review from aclark4life June 29, 2026 15:26
@shrey-mongo shrey-mongo force-pushed the shrey-mongo/PYTHON-5903 branch 2 times, most recently from 247b26f to 682dbac Compare June 29, 2026 15:56
@NoahStapp NoahStapp requested review from NoahStapp and Copilot and removed request for aclark4life June 29, 2026 16:44
@NoahStapp

Copy link
Copy Markdown
Contributor

Please fill out the pull request template completely.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR ensures command-monitoring operation_id remains stable across retry attempts for retryable reads/writes and relevant cursor operations, improving APM correlation for a single logical operation.

Changes:

  • Mint a single per-operation operation_id in the retryable-operation wrapper and stamp it onto the checked-out connection for each attempt.
  • Thread the connection’s op_id through pool/server command execution into command_runner so published command-monitoring events use it.
  • Add sync + async integration tests that force multiple retries via failCommand and assert all related command events share one operation_id.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.

Show a summary per file
File Description
tools/synchro.py Registers the new test file in the synchro conversion list.
test/test_operation_id_retry.py Sync integration coverage asserting operation_id stability across retries (and no retry for non-retryable multi-writes).
test/asynchronous/test_operation_id_retry.py Async equivalent of the retry operation_id stability tests.
pymongo/synchronous/server.py Passes conn.op_id into cursor command execution for APM publishing.
pymongo/synchronous/pool.py Adds Connection.op_id, forwards it into run_command, and clears it on check-in.
pymongo/synchronous/mongo_client.py Mints/stabilizes per-operation ids across retries and stamps them on checked-out connections.
pymongo/synchronous/command_runner.py Accepts/threads op_id through run_command / run_cursor_command into _run_command.
pymongo/asynchronous/server.py Async counterpart: passes conn.op_id for cursor command execution.
pymongo/asynchronous/pool.py Async counterpart: adds/forwards/clears op_id on connections.
pymongo/asynchronous/mongo_client.py Async counterpart: stable per-operation id and per-attempt stamping onto connections.
pymongo/asynchronous/command_runner.py Async counterpart: threads op_id through command execution to APM event publishing.

@shrey-mongo

Copy link
Copy Markdown
Author

Updated description to include full template

@codecov-commenter

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

"data": {
"failCommands": [command_name],
"closeConnection": True,
"appName": _APP_NAME,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class needs to add the @async_client_context.require_failCommand_appName annotation to safely use appName.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment thread pymongo/asynchronous/command_runner.py Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be updated to use the new operation_id, if present.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@shrey-mongo shrey-mongo requested a review from NoahStapp June 30, 2026 15:04
@NoahStapp

Copy link
Copy Markdown
Contributor

There are merge conflicts, please resolve.

@shrey-mongo shrey-mongo force-pushed the shrey-mongo/PYTHON-5903 branch from 5546477 to 75f4e21 Compare June 30, 2026 15:46
@shrey-mongo

Copy link
Copy Markdown
Author

I fixed the merge conflict and the typing error that was causing the workflow to fail

func=func,
)

def require_failCommand_appName(self, func):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this change?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was needed to fix an untyped-decorator mypy error that was occurring in my test. This matches the signature of require_failCommand_fail_point.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need both the require_failCommand_fail_point and require_failCommand_appName annotations, the latter is a strict subset of the former.

# For gossiping $clusterTime from the connection handshake to the client.
self._cluster_time = None
# Stable operation id for the operation currently using this connection.
self.op_id: Optional[int] = None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to track op_id on connections. It introduces more risk of accidental inclusion or staleness on operations like auth compared to explicitly passing it through each operation. What about passing op_id as a kwarg on Connection.command() and on _Query/_GetMore where operations are actually initiated?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants