Benchmark comparison gates#804
Conversation
Implement repeated benchmark runs and the fedify bench compare command so CI can compare a head ref against a base ref on the same runner. The comparison path checks out temporary worktrees, starts each target sequentially, and fails when performance regressions exceed the configured tolerance plus the measured noise band. Publish benchmark report schema version 3 and comparison report schema version 1, including fixtures and schema drift guards for both formats. fedify-dev#744 fedify-dev#786 Assisted-by: Codex:gpt-5.5
Explain repeated benchmark runs, the fedify bench compare workflow, and the CI versus controlled-runner profile split. Record the CLI changelog entry for the comparison command and the report schema updates. fedify-dev#744 fedify-dev#786 Assisted-by: Codex:gpt-5.5
Keep benchmark target output away from stdout, enforce readiness timeouts, terminate Windows target process trees, and avoid infinite noise allowances in comparison reports. fedify-dev#744 fedify-dev#786 Assisted-by: Codex:gpt-5.5
Treat missing server-side measurements in any repeated benchmark run as missing from the aggregate so signature and queue expectations cannot pass on a partial subset of runs. fedify-dev#744 fedify-dev#786 Assisted-by: Codex:gpt-5.5
Show repeated run counts, total duration, and open-loop request volume in benchmark dry-run plans. Compare benchmark reports by scenario position plus name and type so duplicate scenario names cannot be matched to the wrong base result. fedify-dev#744 fedify-dev#786 Assisted-by: Codex:gpt-5.5
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a bench "compare" mode, enables repeated scenario runs (default 3), aggregates per-run metrics into report-v3, computes noise-adjusted regressions, implements worktree/target orchestration and rendering, and updates schemas, fixtures, tests, docs, and changelog. ChangesBenchmark Compare and Repeated Runs
Sequence Diagram(s)sequenceDiagram
participant CLI
participant GitWorktree
participant TargetProcess
participant BenchRunner
participant ReportWriter
CLI->>GitWorktree: create worktree(base)
CLI->>GitWorktree: create worktree(head)
GitWorktree-->>CLI: worktrees ready
CLI->>TargetProcess: startBenchmarkTarget (base worktree)
TargetProcess-->>CLI: started (pid)
CLI->>CLI: waitReadyUrl(base.readyUrl)
CLI->>BenchRunner: defaultRunBenchInWorktree(base)
BenchRunner-->>CLI: BenchReport(base)
CLI->>TargetProcess: stopTargetProcess(base)
CLI->>TargetProcess: startBenchmarkTarget (head worktree)
CLI->>CLI: waitReadyUrl(head.readyUrl)
CLI->>BenchRunner: defaultRunBenchInWorktree(head)
BenchRunner-->>CLI: BenchReport(head)
CLI->>ReportWriter: buildCompareReport(base, head) -> render/write (json/markdown/text)
ReportWriter-->>CLI: written output
CLI->>GitWorktree: remove worktrees
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@codex review |
There was a problem hiding this comment.
Code Review
This pull request introduces the fedify bench compare command to enable CI-friendly performance regression testing by comparing base and head git revisions. It also adds support for repeated benchmark runs, aggregates metrics (such as median latency and pessimistic success rates), and updates the benchmark report schema to version 3 alongside a new comparison report schema. The review feedback highlights opportunities to improve robustness, specifically by adding defensive checks in metricValue to prevent null pointer exceptions, implementing a secondary timeout in stopTargetProcess to avoid indefinite hangs, and prioritizing abort signal reasons over transport-level errors in waitReadyUrl.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
Codex Review: Didn't find any major issues. What shall we delve into next? Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/cli/src/bench/command.test.ts`:
- Around line 108-120: Add a regression test that asserts a bare "bench compare"
subcommand without any flags or refs fails parsing to guard against parser
disambiguation; specifically, in the existing test file add a new test that
calls parse(benchCommand, [COMMAND, "compare"]) and asserts result.success is
false (using the same assert.ok(!result.success) style), ensuring you reference
the existing benchCommand, parse, and COMMAND symbols so this covers the bare
subcommand case.
In `@packages/cli/src/bench/compare/schema.ts`:
- Around line 83-106: The benchmarkReport schema currently marks "target" as
required but doesn't define its shape, allowing invalid values; update the
benchmarkReport.properties to include a "target" entry (e.g., "target": {
"$ref": "`#/`$defs/target" } or an appropriate explicit object/type schema) and
add a corresponding $defs.target definition (or refine the existing $defs) that
constrains allowed fields/types for the target object so null/number/other
invalid types are rejected; ensure the new target schema matches the project's
expected target fields and validation style used elsewhere (e.g., other $defs
like tool/environment) so the required "target" is properly typed and validated.
In `@packages/cli/src/bench/mod.ts`:
- Around line 7-11: Add a JSDoc comment above the exported runBench function
describing its purpose (dispatching to runBenchCompare or runBenchSuite based on
the command.mode), document the parameter (BenchCommand) including expected
properties used (e.g., mode values "compare" vs others), and specify the return
type Promise<void>; reference the exported symbol runBench so the comment is
placed immediately above that function and include any relevant tags like `@param`
and `@returns`.
In `@packages/cli/src/bench/result/schema.ts`:
- Around line 353-367: The schema for scenarioResult in reportSchemaV2 currently
adds "runCount" but leaves "runs" optional; update the scenarioResult required
list to include "runs" as well so that whenever runCount is present the runs
array is required — specifically modify the spread of
reportSchemaV2.$defs.scenarioResult.required inside the scenarioResult
definition to add "runs" (in addition to "runCount") so the properties runCount
and runs are both required in the scenarioResult schema.
In `@schema/bench/report-v3.json`:
- Around line 392-478: The schema allows runCount>1 without requiring per-run
data; update the "scenarioResult" schema so that when "runCount" is >= 2 the
"runs" property is required. Add an if/then at the same level as "properties" in
the "scenarioResult" definition: if {"properties":{"runCount":{"minimum":2}}}
then {"required":["runs"]}. Target the "scenarioResult" object and its
"runCount" and "runs" properties to enforce this invariant.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 74e2c8ec-7ff9-44f4-8596-5b8b851aff26
📒 Files selected for processing (23)
CHANGES.mddocs/manual/benchmarking.mdpackages/cli/src/bench/__fixtures__/compare-reports/basic.jsonpackages/cli/src/bench/__fixtures__/reports/inbox-report.jsonpackages/cli/src/bench/action.test.tspackages/cli/src/bench/action.tspackages/cli/src/bench/command.test.tspackages/cli/src/bench/command.tspackages/cli/src/bench/compare.test.tspackages/cli/src/bench/compare.tspackages/cli/src/bench/compare/schema.tspackages/cli/src/bench/mod.tspackages/cli/src/bench/render/render.test.tspackages/cli/src/bench/result/build.test.tspackages/cli/src/bench/result/build.tspackages/cli/src/bench/result/model.tspackages/cli/src/bench/result/schema.tspackages/cli/src/bench/scenario/normalize.test.tspackages/cli/src/bench/scenario/normalize.tspackages/cli/src/bench/schema.test.tspackages/cli/src/bench/schemas.tsschema/bench/compare-report-v1.jsonschema/bench/report-v3.json
Bound target shutdown after SIGKILL, keep readiness timeout errors tied back to the abort reason, and make malformed comparison metrics fail instead of throwing. Tighten benchmark report schemas so embedded targets are typed and repeated reports must include per-run data. Add parser and public-export coverage requested in review. fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) Assisted-by: Codex:gpt-5.5
✅ Deploy Preview for fedify-json-schema ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a new fedify bench compare command for CI-friendly performance regression gates, allowing sequential benchmark runs of base and head git revisions inside temporary worktrees. It also adds support for repeated benchmark runs (defaulting to 3) and aggregates their results (median latency/throughput, pessimistic success rate, merged histograms) into a new benchmark report schema version 3. The review feedback focuses on improving the robustness of the comparison and process management logic in packages/cli/src/bench/compare.ts. Specifically, the reviewer suggests matching scenarios by name rather than relying on exact ordering, cleaning up partial git worktrees on checkout failures to prevent leaks, and resolving target process termination immediately if the process ID is null to avoid hanging.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
packages/cli/src/bench/mod.ts (1)
7-7:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winExpand JSDoc to document parameters, return type, and dispatch behavior.
The JSDoc is too brief for a public API. It should document the
commandparameter (including the mode discriminator), thePromise<void>return, and clarify the dispatch behavior (compare vs. run modes).📝 Proposed expanded JSDoc
-/** Runs a parsed benchmark command. */ +/** + * Runs a parsed benchmark command. + * + * Dispatches to {`@link` runBenchCompare} when `command.mode === "compare"`, + * otherwise dispatches to the suite runner for `mode === "run"`. + * + * `@param` command The parsed benchmark command with mode discriminator. + * `@returns` A promise that resolves when the benchmark completes. + */ export function runBench(command: BenchCommand): Promise<void> {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/cli/src/bench/mod.ts` at line 7, Replace the terse JSDoc comment above the benchmark entry function with a full docblock that documents the parameter `command` (describe its shape and the mode discriminator values, e.g., "run" vs "compare"), the return type (Promise<void>), and the dispatch behavior (explain that when command.mode === "compare" it dispatches comparison logic and when command.mode === "run" it executes a single benchmark run, including any side effects such as logging or process exit). Mention any thrown errors or rejection conditions and note that the function performs asynchronous work. Keep the docblock adjacent to the same function so the symbols (the `command` parameter and the function itself) are clearly described.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@packages/cli/src/bench/mod.ts`:
- Line 7: Replace the terse JSDoc comment above the benchmark entry function
with a full docblock that documents the parameter `command` (describe its shape
and the mode discriminator values, e.g., "run" vs "compare"), the return type
(Promise<void>), and the dispatch behavior (explain that when command.mode ===
"compare" it dispatches comparison logic and when command.mode === "run" it
executes a single benchmark run, including any side effects such as logging or
process exit). Mention any thrown errors or rejection conditions and note that
the function performs asynchronous work. Keep the docblock adjacent to the same
function so the symbols (the `command` parameter and the function itself) are
clearly described.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c2b9d19a-c0a4-4840-9225-3826a851781c
📒 Files selected for processing (9)
packages/cli/src/bench/command.test.tspackages/cli/src/bench/compare.test.tspackages/cli/src/bench/compare.tspackages/cli/src/bench/compare/schema.tspackages/cli/src/bench/mod.tspackages/cli/src/bench/result/schema.tspackages/cli/src/bench/schema.test.tsschema/bench/compare-report-v1.jsonschema/bench/report-v3.json
Document every published benchmark schema on the schema landing page, including the current scenario and report versions plus the comparison report schema. Keep the schema directory README aligned with the same published files and source locations. Assisted-by: Codex:gpt-5.5
Match scenarios by name, type, and occurrence so reordered suites do not produce false missing-scenario comparisons. Clean partial worktree registrations after failed checkout and treat child processes without a pid as already stopped. fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) Assisted-by: Codex:gpt-5.5
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces the fedify bench compare command to enable CI-friendly performance regression checks by comparing base and head git revisions. It also updates the benchmark report schema to version 3 and scenario schema to version 2 to support and aggregate repeated runs. Review feedback suggests gracefully handling cases where the base metric is zero in regressionRatio for 'higher-is-better' metrics, and ensuring that the Windows process tree is only force-killed with /F when the signal is SIGKILL to allow graceful shutdowns on SIGTERM.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7b880cc7c4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Keep compare's derived target separate from the explicit CLI target signal, so unsafe overrides still require --target. Also handle zero throughput baselines as improvements and let Windows targets receive a graceful taskkill before forced cleanup. fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) Assisted-by: Codex:gpt-5.5
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces the fedify bench compare command to facilitate CI-friendly performance regression comparisons between Git revisions, updating the benchmark report schema to version 3 to support repeated runs and adding a new comparison report schema. The review feedback suggests merging duplicate imports from node:fs/promises, adding a JSDoc comment to the public windowsTaskkillArgs function to adhere to the repository style guide, and using safer optional chaining when canceling response bodies to prevent potential runtime crashes.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ab34e0e986
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
The public-target unsafe override already required an explicit target, load, and duration. Include the repeated-run count in that contract so the default run multiplier cannot expand public benchmark traffic without an explicit operator choice. fedify-dev#804 (comment) Assisted-by: Codex:gpt-5.5
Merge the duplicate fs/promises imports, document the taskkill argument helper, and tolerate custom response bodies without cancel methods while polling readiness endpoints. fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) Assisted-by: Codex:gpt-5.5
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces the fedify bench compare command to enable comparing performance regression between base and head Git revisions in CI environments. It also implements support for repeated benchmark runs (defaulting to 3 runs) and aggregates repeated runs using median metrics for latency/throughput and pessimistic correctness. This change updates the benchmark report schema to version 3 and introduces the comparison report schema version 1. Feedback on the changes highlights two key improvement opportunities: first, regressionRatio returns null when base is 0 and the direction is "lower-is-better", which can cause microbenchmarks with millisecond precision to fail on negligible differences; second, the partialField aggregation discards standard-specific or queue metrics entirely if they are missing in a single run, whereas it should filter out undefined values and compute the median of the present values.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 82c39d39c8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Treat one-millisecond latency movement from a zero baseline as noise, preserve by-standard report detail across repeated runs, and avoid publishing a pooled aggregate histogram that disagrees with the reported median run latency fields. fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) Assisted-by: Codex:gpt-5.5
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces the fedify bench compare command to facilitate CI-friendly performance regression testing by checking out base and head refs into temporary worktrees, running the benchmark suite sequentially, and failing if regression exceeds a configured threshold. Additionally, benchmark scenarios now run three times by default to aggregate repeated runs with median latency/throughput and pessimistic correctness results, updating the report schema to version 3 and introducing comparison report schema version 1. Feedback on the implementation highlights a compatibility issue with toReversed() in Node.js 18, a potential CI failure when comparing newly introduced metrics without a baseline, and a TOCTOU security vulnerability caused by deleting temporary directories immediately after creation.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
Codex Review: Didn't find any major issues. Delightful! Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Avoid Node 18-incompatible reverse iteration, let newly introduced head-side metrics pass when the baseline lacks that measurement, and keep temporary worktree directories intact until git owns them. fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) Assisted-by: Codex:gpt-5.5
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces the fedify bench compare command to compare performance metrics between two git revisions (base and head) in a CI-friendly manner, failing if the head revision regresses beyond a specified threshold plus measured noise. It updates the benchmark report schema to version 3 to support repeated runs (defaulting to 3 runs) and aggregates their results, and adds a new comparison report schema version 1. Feedback on the changes highlights two key improvements: wrapping killTargetProcess in a try...catch block inside stopTargetProcess to prevent unhandled exceptions from crashing the CLI, and registering SIGINT/SIGTERM signal handlers to clean up temporary worktrees and stop active target processes upon user interruption.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1d2edea8be
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Race target exits against readiness, make interrupted comparisons stop the active target and clean temporary worktrees, and reject cleanly if forced target termination throws. fedify-dev#804 (comment) fedify-dev#804 (comment) fedify-dev#804 (comment) Assisted-by: Codex:gpt-5.5
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces the fedify bench compare command to enable CI-friendly performance regression gates by comparing base and head git revisions. It also implements support for repeated benchmark runs (defaulting to three runs), aggregating their results (using median values for latency/throughput and pessimistic correctness results), and updates the benchmark report schema to version 3. Feedback on the changes suggests using a type assertion to cast fieldValues to readonly number[] in packages/cli/src/bench/result/build.ts to resolve a TypeScript compilation error under strict null checks.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 41d38f42bc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Carry interrupt and target-exit cancellation through compare readiness, in-process benchmark runs, runner fetches, and load-generator sleeps so raced benchmark work does not continue after cleanup starts. fedify-dev#804 (comment) fedify-dev#804 (comment) Assisted-by: Codex:gpt-5.5
|
@codex review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces the fedify bench compare command, which enables CI-friendly performance regression testing by comparing a base and head git revision on the same runner. It automates checking out revisions into temporary worktrees, starting target processes, waiting for readiness, and running the benchmark suite. Additionally, the PR implements support for repeated benchmark runs (defaulting to three) and aggregates their metrics (using median values for latency/throughput and pessimistic success rates) to ensure stable results. To support these features, the benchmark report schema has been updated to version 3, a new comparison report schema version 1 has been introduced, and abort signals have been integrated to cancel in-flight benchmark tasks. Since no review comments were provided, there is no further feedback to address.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
Codex Review: Didn't find any major issues. Nice work! Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
This adds
fedify bench compare, a CI-oriented command for checking benchmark regressions between two Git refs. The command checks out the base/head refs in temporary worktrees, starts each target with the same command, waits for a ready URL, runs the same benchmark suite, and reports whether the head result stays within the configured regression allowance.The comparison uses the explicit
--max-regressionvalue plus a measured noise band from repeated runs, so the gate can be stricter than a broad absolute threshold without failing on small local variance. When a scenario does not name performance metrics inexpect, the comparison falls back tolatency.p95andthroughputPerSec.This also makes repeated runs real in
fedify bench. Scenarios now run three times by default, aggregate latency and throughput with median values, keep correctness gates pessimistic, and preserve per-run data in the JSON report. Dry-run output now shows the run count, total scheduled duration, and open-loop request volume so users can review benchmark impact before sending load.The report format is versioned as benchmark report schema v3 and comparison report schema v1. The published schemas live in schema/bench/report-v3.json and schema/bench/compare-report-v1.json, with user documentation in docs/manual/benchmarking.md and a release note in CHANGES.md.
Refs #744. Fixes #786.