Skip to content

feat(server-ng): commit-time metadata validation with result codes#3553

Open
krishvishal wants to merge 4 commits into
apache:masterfrom
krishvishal:workload-prereqs
Open

feat(server-ng): commit-time metadata validation with result codes#3553
krishvishal wants to merge 4 commits into
apache:masterfrom
krishvishal:workload-prereqs

Conversation

@krishvishal

Copy link
Copy Markdown
Contributor

Why

server-ng previously validated metadata business rules during preflight: primary-only, before replication. Invalid requests were then silently dropped.

This caused two issues:

1. VSR correctness. A rejection must be a deterministic function of the committed log. It should be computed during apply on every replica and recorded so that all replicas agree. Preflight validation uses a primary-local snapshot, and its verdict is never replicated. After a view change, the new primary is not guaranteed to reach the same decision.

2. Infinite retry. The in-process path mapped the preflight Err to Canceled, leaving the home shard silent. As a result, the SDK kept retrying permanently invalid requests forever, such as CreatePartitions on a missing topic.

What Changed

Result taxonomy. Added core/metadata/src/stm/result.rs. There is now one closed #[repr(u32)] enum per operation: Ok = 0, with other discriminants reused from the existing IggyError code space. This keeps SDK error mapping compatible with existing codes.

Apply now returns committed results. StateHandler::apply now returns ApplyReply { code, body } instead of Bytes. Every previous silent Bytes::new() no-op is now represented as a committed result code. MuxStateMachine::update now reserves Err for decode/corruption failures only.

Result is encoded in the reply body. The result rides in the reply body. The reply body now starts with a sparse result section, [count][index, result]*, followed by the payload. This is written in place via build_reply_message_with, using one allocation and one payload copy. There is no ReplyHeader layout change.

Eviction instead of silent drop. Structurally invalid requests are now evicted rather than silently dropped: not-client-allowed maps to InvalidRequestOperation, while undecodable/overflow maps to InvalidRequestBody.

Removed preflight partition-count read. Deleted the current_partition_count preflight read. Parent existence is now checked at commit time and returned as CreatePartitionsResult::{Stream, Topic}NotFound, removing the TOCTOU window.

Consensus frame size floor. Every consensus frame’s size must now span its header. This is asserted during consensus_message construction, PrepareOk projection, and in the simulator. This prevents reply-body slicing from underflowing.

Simulator changes. The simulator now uses outcome-first generation: operations target a chosen outcome, including error outcomes; on_reply decodes the committed code from the body; and the decoded code is asserted to be a declared result code. The strict targeted-equals-committed oracle is gated to serial runs only: client_count == 1 && CLIENT_REQUEST_QUEUE_MAX == 1.

Wire / SDK Impact

ReplyHeader is unchanged. This is not a #[repr(C)] layout change. However, the reply body format changes. A result section now precedes the payload, so body decoding changes for success replies as well. There is no released server-ng wire format to break.

@github-actions github-actions Bot added the S-waiting-on-review PR is waiting on a reviewer label Jun 23, 2026
@codecov

codecov Bot commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 64.02715% with 318 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.48%. Comparing base (d5461a1) to head (21ac524).

Files with missing lines Patch % Lines
core/metadata/src/stm/stream.rs 70.83% 42 Missing ⚠️
...re/simulator/src/workload/ops/create_partitions.rs 0.00% 20 Missing and 1 partial ⚠️
...re/simulator/src/workload/ops/delete_partitions.rs 0.00% 20 Missing ⚠️
core/simulator/src/workload/ops/create_topic.rs 0.00% 19 Missing ⚠️
core/simulator/src/workload/ops/update_user.rs 0.00% 19 Missing ⚠️
...imulator/src/workload/ops/delete_consumer_group.rs 0.00% 17 Missing ⚠️
core/metadata/src/stm/user.rs 79.48% 16 Missing ⚠️
core/simulator/src/workload/ops/update_topic.rs 0.00% 16 Missing ⚠️
core/metadata/src/stm/result.rs 92.06% 15 Missing ⚠️
core/simulator/src/workload/ops/create_user.rs 0.00% 13 Missing ⚠️
... and 19 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #3553      +/-   ##
============================================
- Coverage     74.44%   72.48%   -1.97%     
  Complexity      937      937              
============================================
  Files          1243     1244       +1     
  Lines        125987   121816    -4171     
  Branches     101856    97730    -4126     
============================================
- Hits          93795    88293    -5502     
- Misses        29180    30250    +1070     
- Partials       3012     3273     +261     
Components Coverage Δ
Rust Core 72.80% <64.02%> (-2.40%) ⬇️
Java SDK 62.44% <ø> (ø)
C# SDK 71.41% <ø> (-0.70%) ⬇️
Python SDK 88.88% <ø> (ø)
PHP SDK 84.29% <ø> (ø)
Node SDK 91.22% <ø> (-0.13%) ⬇️
Go SDK 40.14% <ø> (ø)
Files with missing lines Coverage Δ
core/common/src/error/iggy_error.rs 100.00% <ø> (ø)
core/consensus/src/metadata_helpers.rs 91.51% <100.00%> (ø)
core/consensus/src/plane_helpers.rs 92.53% <100.00%> (+0.14%) ⬆️
core/metadata/src/stm/mod.rs 75.44% <ø> (ø)
core/metadata/src/stm/mux.rs 94.64% <100.00%> (ø)
core/shard/src/lib.rs 73.97% <100.00%> (ø)
core/simulator/src/lib.rs 93.04% <100.00%> (+0.54%) ⬆️
core/simulator/src/packet.rs 80.21% <100.00%> (+0.25%) ⬆️
core/simulator/src/workload/auditor.rs 77.19% <100.00%> (+5.76%) ⬆️
core/simulator/src/workload/ops/create_stream.rs 100.00% <100.00%> (ø)
... and 35 more

... and 127 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@numinnex

Copy link
Copy Markdown
Contributor

I think is worth addressing as this PR is using the custom decoder in the SDK (hidden behind the vsr feature flag) to take into account those status error codes and map them to IggyError. Alternatively you could put that inside of the binary_protocol crate, and reuse that inside of simulator for the mapping.

@numinnex numinnex left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add support to SDK to parse the new status codes

@github-actions github-actions Bot added S-waiting-on-author PR is waiting on author response and removed S-waiting-on-review PR is waiting on a reviewer labels Jun 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

S-waiting-on-author PR is waiting on author response

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants