Skip to content

Benchmark comparison gates#804

Open
dahlia wants to merge 17 commits into
fedify-dev:mainfrom
dahlia:feat/bench/compare-ci-gates
Open

Benchmark comparison gates#804
dahlia wants to merge 17 commits into
fedify-dev:mainfrom
dahlia:feat/bench/compare-ci-gates

Conversation

@dahlia

@dahlia dahlia commented Jun 13, 2026

Copy link
Copy Markdown
Member

This adds fedify bench compare, a CI-oriented command for checking benchmark regressions between two Git refs. The command checks out the base/head refs in temporary worktrees, starts each target with the same command, waits for a ready URL, runs the same benchmark suite, and reports whether the head result stays within the configured regression allowance.

The comparison uses the explicit --max-regression value plus a measured noise band from repeated runs, so the gate can be stricter than a broad absolute threshold without failing on small local variance. When a scenario does not name performance metrics in expect, the comparison falls back to latency.p95 and throughputPerSec.

This also makes repeated runs real in fedify bench. Scenarios now run three times by default, aggregate latency and throughput with median values, keep correctness gates pessimistic, and preserve per-run data in the JSON report. Dry-run output now shows the run count, total scheduled duration, and open-loop request volume so users can review benchmark impact before sending load.

The report format is versioned as benchmark report schema v3 and comparison report schema v1. The published schemas live in schema/bench/report-v3.json and schema/bench/compare-report-v1.json, with user documentation in docs/manual/benchmarking.md and a release note in CHANGES.md.

Refs #744. Fixes #786.

dahlia added 6 commits June 13, 2026 11:35
Implement repeated benchmark runs and the fedify bench compare command so CI
can compare a head ref against a base ref on the same runner.  The comparison
path checks out temporary worktrees, starts each target sequentially, and
fails when performance regressions exceed the configured tolerance plus the
measured noise band.

Publish benchmark report schema version 3 and comparison report schema
version 1, including fixtures and schema drift guards for both formats.

fedify-dev#744
fedify-dev#786

Assisted-by: Codex:gpt-5.5
Explain repeated benchmark runs, the fedify bench compare workflow, and the
CI versus controlled-runner profile split.  Record the CLI changelog entry
for the comparison command and the report schema updates.

fedify-dev#744
fedify-dev#786

Assisted-by: Codex:gpt-5.5
Keep benchmark target output away from stdout, enforce readiness timeouts,
terminate Windows target process trees, and avoid infinite noise allowances in
comparison reports.

fedify-dev#744
fedify-dev#786

Assisted-by: Codex:gpt-5.5
Treat missing server-side measurements in any repeated benchmark run as missing
from the aggregate so signature and queue expectations cannot pass on a partial
subset of runs.

fedify-dev#744
fedify-dev#786

Assisted-by: Codex:gpt-5.5
Show repeated run counts, total duration, and open-loop request volume in
benchmark dry-run plans.  Compare benchmark reports by scenario position plus
name and type so duplicate scenario names cannot be matched to the wrong base
result.

fedify-dev#744
fedify-dev#786

Assisted-by: Codex:gpt-5.5
@dahlia dahlia added this to the Fedify 2.3 milestone Jun 13, 2026
@dahlia dahlia self-assigned this Jun 13, 2026
@dahlia dahlia added the component/cli CLI tools related label Jun 13, 2026
@dahlia dahlia added the component/otel OpenTelemetry integration label Jun 13, 2026
@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a bench "compare" mode, enables repeated scenario runs (default 3), aggregates per-run metrics into report-v3, computes noise-adjusted regressions, implements worktree/target orchestration and rendering, and updates schemas, fixtures, tests, docs, and changelog.

Changes

Benchmark Compare and Repeated Runs

Layer / File(s) Summary
Documentation and changelog
CHANGES.md, docs/manual/benchmarking.md
Documents fedify bench compare, repeated-runs default/aggregation, report-v3 and compare-report schema references, CI instructions, and adds changelog references.
CLI Command Structure: Run and Compare Modes
packages/cli/src/bench/command.ts, packages/cli/src/bench/command.test.ts
Refactors benchmark command parser to or(compareParser, runParser) with mode: "run"/"compare", adds typed variants and parsing tests for compare flags.
Repeated Runs Foundation: Scenario Normalization
packages/cli/src/bench/scenario/normalize.ts, packages/cli/src/bench/scenario/normalize.test.ts
Defaults runs to 3, removes rejection of runs > 1, and updates tests for defaults and overrides.
Repeated Runs Execution: Action Loop and Dry-run Plan
packages/cli/src/bench/action.ts, packages/cli/src/bench/action.test.ts
Changes runBench to accept BenchRunCommand, executes each scenario scenario.runs times collecting per-run measurements, and expands dry-run plan output to include runs and total duration; adds related tests.
Result Aggregation and Benchmark Report v3
packages/cli/src/bench/result/build.ts, packages/cli/src/bench/result/model.ts, packages/cli/src/bench/result/schema.ts, packages/cli/src/bench/result/build.test.ts, packages/cli/src/bench/render/render.test.ts
Aggregates repeated-run measurements (medians for throughput/latency, pessimistic successRate), adds runCount and optional runs[] to scenario results, bumps report schemaVersion to 3, updates schemas and validators, and adds aggregation tests including missing server stats scenarios.
Benchmark Compare: Orchestration and Process Management
packages/cli/src/bench/compare.ts
Implements runBenchCompare: create/remove worktrees, start/stop target processes, waitReadyUrl polling, run benchmarks in worktrees, compute noise-aware comparisons, render/write outputs, and set exit codes.
Compare Helpers: start/stop, waitReadyUrl, worktree/run helpers
packages/cli/src/bench/compare.ts
Exports startBenchmarkTarget, stopTargetProcess (platform-aware), waitReadyUrl (fetch with abort/backoff), defaultRunBenchInWorktree, git worktree helpers, and defaultWriteOutput.
Compare Report Rendering
packages/cli/src/bench/compare.ts
Renders compare reports in json, markdown, or text with formatted numeric/regression/noise fields and pass/fail markers.
Compare Report Schema
packages/cli/src/bench/compare/schema.ts, schema/bench/compare-report-v1.json
Adds compare-report schema (compare-report-v1) validating envelope, sides (embedded report-v3), comparisons array, and numeric constraints for noise/allowedRegression.
Published JSON Schema Definitions & Registry
packages/cli/src/bench/result/schema.ts, packages/cli/src/bench/schemas.ts, schema/bench/report-v3.json
Introduces report-v3 schema and registers report-v3.json and compare-report-v1.json in PUBLISHED_SCHEMAS; adjusts v2/v3 IDs and schema objects.
Tests: compare, command, action, result, render, schema
packages/cli/src/bench/compare.test.ts, packages/cli/src/bench/command.test.ts, packages/cli/src/bench/action.test.ts, packages/cli/src/bench/result/build.test.ts, packages/cli/src/bench/render/render.test.ts, packages/cli/src/bench/schema.test.ts
Adds comprehensive tests for compare mode, tolerance parsing, noise-band logic, process lifecycle (including Windows), waitReadyUrl behavior, command parsing, action dry-run and runs behavior, result aggregation, and schema guards.
Fixtures and Examples
packages/cli/src/bench/__fixtures__/reports/inbox-report.json, packages/cli/src/bench/__fixtures__/compare-reports/basic.json
Updates inbox-report fixture to schema v3 with runCount: 1 and adds a compare-report fixture demonstrating a single comparison entry.

Sequence Diagram(s)

sequenceDiagram
  participant CLI
  participant GitWorktree
  participant TargetProcess
  participant BenchRunner
  participant ReportWriter

  CLI->>GitWorktree: create worktree(base)
  CLI->>GitWorktree: create worktree(head)
  GitWorktree-->>CLI: worktrees ready
  CLI->>TargetProcess: startBenchmarkTarget (base worktree)
  TargetProcess-->>CLI: started (pid)
  CLI->>CLI: waitReadyUrl(base.readyUrl)
  CLI->>BenchRunner: defaultRunBenchInWorktree(base)
  BenchRunner-->>CLI: BenchReport(base)
  CLI->>TargetProcess: stopTargetProcess(base)
  CLI->>TargetProcess: startBenchmarkTarget (head worktree)
  CLI->>CLI: waitReadyUrl(head.readyUrl)
  CLI->>BenchRunner: defaultRunBenchInWorktree(head)
  BenchRunner-->>CLI: BenchReport(head)
  CLI->>ReportWriter: buildCompareReport(base, head) -> render/write (json/markdown/text)
  ReportWriter-->>CLI: written output
  CLI->>GitWorktree: remove worktrees
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • fedify-dev/fedify#795: Overlaps at run-mode dry-run/plan generation and safety gating around run execution.
  • fedify-dev/fedify#802: Related changes to benchmark result building and metric representations; both touch aggregation and schema boundaries.
  • fedify-dev/fedify#791: Earlier bench engine changes that this PR extends (run behavior and report aggregation).

Suggested reviewers

  • sij411
  • 2chanhaeng
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 17.39% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Benchmark comparison gates' clearly and concisely summarizes the main feature addition of a CI-oriented comparison command for benchmark regression detection.
Description check ✅ Passed The description provides comprehensive context about the new fedify bench compare command, repeated run aggregation, noise-aware gating, and all associated documentation/schema changes.
Linked Issues check ✅ Passed The PR fully implements the acceptance criteria from issue #786: same-runner compare with noise-aware gating, median-of-N aggregation for latency/throughput, and documented CI-safe/perf-lab profiles.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the benchmark comparison feature: new CLI command, schema versioning, documentation, aggregation logic, and report fixtures—no unrelated modifications present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@dahlia

dahlia commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

@codex review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the fedify bench compare command to enable CI-friendly performance regression testing by comparing base and head git revisions. It also adds support for repeated benchmark runs, aggregates metrics (such as median latency and pessimistic success rates), and updates the benchmark report schema to version 3 alongside a new comparison report schema. The review feedback highlights opportunities to improve robustness, specifically by adding defensive checks in metricValue to prevent null pointer exceptions, implementing a secondary timeout in stopTargetProcess to avoid indefinite hangs, and prioritizing abort signal reasons over transport-level errors in waitReadyUrl.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread packages/cli/src/bench/compare.ts
Comment thread packages/cli/src/bench/compare.ts
Comment thread packages/cli/src/bench/compare.ts
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. What shall we delve into next?

Reviewed commit: b18c937e5e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/cli/src/bench/command.test.ts`:
- Around line 108-120: Add a regression test that asserts a bare "bench compare"
subcommand without any flags or refs fails parsing to guard against parser
disambiguation; specifically, in the existing test file add a new test that
calls parse(benchCommand, [COMMAND, "compare"]) and asserts result.success is
false (using the same assert.ok(!result.success) style), ensuring you reference
the existing benchCommand, parse, and COMMAND symbols so this covers the bare
subcommand case.

In `@packages/cli/src/bench/compare/schema.ts`:
- Around line 83-106: The benchmarkReport schema currently marks "target" as
required but doesn't define its shape, allowing invalid values; update the
benchmarkReport.properties to include a "target" entry (e.g., "target": {
"$ref": "`#/`$defs/target" } or an appropriate explicit object/type schema) and
add a corresponding $defs.target definition (or refine the existing $defs) that
constrains allowed fields/types for the target object so null/number/other
invalid types are rejected; ensure the new target schema matches the project's
expected target fields and validation style used elsewhere (e.g., other $defs
like tool/environment) so the required "target" is properly typed and validated.

In `@packages/cli/src/bench/mod.ts`:
- Around line 7-11: Add a JSDoc comment above the exported runBench function
describing its purpose (dispatching to runBenchCompare or runBenchSuite based on
the command.mode), document the parameter (BenchCommand) including expected
properties used (e.g., mode values "compare" vs others), and specify the return
type Promise<void>; reference the exported symbol runBench so the comment is
placed immediately above that function and include any relevant tags like `@param`
and `@returns`.

In `@packages/cli/src/bench/result/schema.ts`:
- Around line 353-367: The schema for scenarioResult in reportSchemaV2 currently
adds "runCount" but leaves "runs" optional; update the scenarioResult required
list to include "runs" as well so that whenever runCount is present the runs
array is required — specifically modify the spread of
reportSchemaV2.$defs.scenarioResult.required inside the scenarioResult
definition to add "runs" (in addition to "runCount") so the properties runCount
and runs are both required in the scenarioResult schema.

In `@schema/bench/report-v3.json`:
- Around line 392-478: The schema allows runCount>1 without requiring per-run
data; update the "scenarioResult" schema so that when "runCount" is >= 2 the
"runs" property is required. Add an if/then at the same level as "properties" in
the "scenarioResult" definition: if {"properties":{"runCount":{"minimum":2}}}
then {"required":["runs"]}. Target the "scenarioResult" object and its
"runCount" and "runs" properties to enforce this invariant.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 74e2c8ec-7ff9-44f4-8596-5b8b851aff26

📥 Commits

Reviewing files that changed from the base of the PR and between d351e55 and b18c937.

📒 Files selected for processing (23)
  • CHANGES.md
  • docs/manual/benchmarking.md
  • packages/cli/src/bench/__fixtures__/compare-reports/basic.json
  • packages/cli/src/bench/__fixtures__/reports/inbox-report.json
  • packages/cli/src/bench/action.test.ts
  • packages/cli/src/bench/action.ts
  • packages/cli/src/bench/command.test.ts
  • packages/cli/src/bench/command.ts
  • packages/cli/src/bench/compare.test.ts
  • packages/cli/src/bench/compare.ts
  • packages/cli/src/bench/compare/schema.ts
  • packages/cli/src/bench/mod.ts
  • packages/cli/src/bench/render/render.test.ts
  • packages/cli/src/bench/result/build.test.ts
  • packages/cli/src/bench/result/build.ts
  • packages/cli/src/bench/result/model.ts
  • packages/cli/src/bench/result/schema.ts
  • packages/cli/src/bench/scenario/normalize.test.ts
  • packages/cli/src/bench/scenario/normalize.ts
  • packages/cli/src/bench/schema.test.ts
  • packages/cli/src/bench/schemas.ts
  • schema/bench/compare-report-v1.json
  • schema/bench/report-v3.json

Comment thread packages/cli/src/bench/command.test.ts
Comment thread packages/cli/src/bench/compare/schema.ts
Comment thread packages/cli/src/bench/mod.ts
Comment thread packages/cli/src/bench/result/schema.ts
Comment thread schema/bench/report-v3.json Outdated
@codecov

codecov Bot commented Jun 13, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 78.22129% with 311 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
packages/cli/src/bench/compare.ts 67.47% 195 Missing and 61 partials ⚠️
packages/cli/src/bench/result/build.ts 84.97% 12 Missing and 14 partials ⚠️
packages/cli/src/bench/load/generator.ts 55.81% 15 Missing and 4 partials ⚠️
packages/cli/src/bench/action.ts 90.36% 4 Missing and 4 partials ⚠️
packages/cli/src/bench/load/clock.ts 90.00% 0 Missing and 2 partials ⚠️
Files with missing lines Coverage Δ
packages/cli/src/bench/command.ts 100.00% <100.00%> (ø)
packages/cli/src/bench/compare/schema.ts 100.00% <100.00%> (ø)
packages/cli/src/bench/result/schema.ts 100.00% <100.00%> (ø)
packages/cli/src/bench/safety/gate.ts 100.00% <100.00%> (ø)
packages/cli/src/bench/scenario/normalize.ts 100.00% <100.00%> (ø)
packages/cli/src/bench/scenarios/failure.ts 84.45% <100.00%> (+0.03%) ⬆️
packages/cli/src/bench/scenarios/fanout.ts 87.93% <100.00%> (+0.03%) ⬆️
packages/cli/src/bench/scenarios/inbox.ts 89.87% <100.00%> (+0.06%) ⬆️
packages/cli/src/bench/scenarios/read.ts 95.07% <100.00%> (+0.03%) ⬆️
packages/cli/src/bench/scenarios/runner.ts 95.12% <ø> (ø)
... and 7 more

... and 17 files with indirect coverage changes

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Bound target shutdown after SIGKILL, keep readiness timeout errors tied
back to the abort reason, and make malformed comparison metrics fail
instead of throwing.  Tighten benchmark report schemas so embedded
targets are typed and repeated reports must include per-run data.

Add parser and public-export coverage requested in review.

fedify-dev#804 (comment)
fedify-dev#804 (comment)
fedify-dev#804 (comment)
fedify-dev#804 (comment)
fedify-dev#804 (comment)
fedify-dev#804 (comment)
fedify-dev#804 (comment)
fedify-dev#804 (comment)

Assisted-by: Codex:gpt-5.5
@netlify

netlify Bot commented Jun 13, 2026

Copy link
Copy Markdown

Deploy Preview for fedify-json-schema ready!

Name Link
🔨 Latest commit 356539c
🔍 Latest deploy log https://app.netlify.com/projects/fedify-json-schema/deploys/6a2d3f5280431d0008503090
😎 Deploy Preview https://deploy-preview-804--fedify-json-schema.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@dahlia

dahlia commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new fedify bench compare command for CI-friendly performance regression gates, allowing sequential benchmark runs of base and head git revisions inside temporary worktrees. It also adds support for repeated benchmark runs (defaulting to 3) and aggregates their results (median latency/throughput, pessimistic success rate, merged histograms) into a new benchmark report schema version 3. The review feedback focuses on improving the robustness of the comparison and process management logic in packages/cli/src/bench/compare.ts. Specifically, the reviewer suggests matching scenarios by name rather than relying on exact ordering, cleaning up partial git worktrees on checkout failures to prevent leaks, and resolving target process termination immediately if the process ID is null to avoid hanging.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread packages/cli/src/bench/compare.ts
Comment thread packages/cli/src/bench/compare.ts Outdated
Comment thread packages/cli/src/bench/compare.ts Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
packages/cli/src/bench/mod.ts (1)

7-7: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Expand JSDoc to document parameters, return type, and dispatch behavior.

The JSDoc is too brief for a public API. It should document the command parameter (including the mode discriminator), the Promise<void> return, and clarify the dispatch behavior (compare vs. run modes).

📝 Proposed expanded JSDoc
-/** Runs a parsed benchmark command. */
+/**
+ * Runs a parsed benchmark command.
+ *
+ * Dispatches to {`@link` runBenchCompare} when `command.mode === "compare"`,
+ * otherwise dispatches to the suite runner for `mode === "run"`.
+ *
+ * `@param` command The parsed benchmark command with mode discriminator.
+ * `@returns` A promise that resolves when the benchmark completes.
+ */
 export function runBench(command: BenchCommand): Promise<void> {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/cli/src/bench/mod.ts` at line 7, Replace the terse JSDoc comment
above the benchmark entry function with a full docblock that documents the
parameter `command` (describe its shape and the mode discriminator values, e.g.,
"run" vs "compare"), the return type (Promise<void>), and the dispatch behavior
(explain that when command.mode === "compare" it dispatches comparison logic and
when command.mode === "run" it executes a single benchmark run, including any
side effects such as logging or process exit). Mention any thrown errors or
rejection conditions and note that the function performs asynchronous work. Keep
the docblock adjacent to the same function so the symbols (the `command`
parameter and the function itself) are clearly described.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@packages/cli/src/bench/mod.ts`:
- Line 7: Replace the terse JSDoc comment above the benchmark entry function
with a full docblock that documents the parameter `command` (describe its shape
and the mode discriminator values, e.g., "run" vs "compare"), the return type
(Promise<void>), and the dispatch behavior (explain that when command.mode ===
"compare" it dispatches comparison logic and when command.mode === "run" it
executes a single benchmark run, including any side effects such as logging or
process exit). Mention any thrown errors or rejection conditions and note that
the function performs asynchronous work. Keep the docblock adjacent to the same
function so the symbols (the `command` parameter and the function itself) are
clearly described.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c2b9d19a-c0a4-4840-9225-3826a851781c

📥 Commits

Reviewing files that changed from the base of the PR and between b18c937 and 662c265.

📒 Files selected for processing (9)
  • packages/cli/src/bench/command.test.ts
  • packages/cli/src/bench/compare.test.ts
  • packages/cli/src/bench/compare.ts
  • packages/cli/src/bench/compare/schema.ts
  • packages/cli/src/bench/mod.ts
  • packages/cli/src/bench/result/schema.ts
  • packages/cli/src/bench/schema.test.ts
  • schema/bench/compare-report-v1.json
  • schema/bench/report-v3.json

dahlia added 2 commits June 13, 2026 15:03
Document every published benchmark schema on the schema landing page,
including the current scenario and report versions plus the comparison
report schema.  Keep the schema directory README aligned with the same
published files and source locations.

Assisted-by: Codex:gpt-5.5
Match scenarios by name, type, and occurrence so reordered suites do not
produce false missing-scenario comparisons.  Clean partial worktree
registrations after failed checkout and treat child processes without a
pid as already stopped.

fedify-dev#804 (comment)
fedify-dev#804 (comment)
fedify-dev#804 (comment)

Assisted-by: Codex:gpt-5.5
@dahlia

dahlia commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the fedify bench compare command to enable CI-friendly performance regression checks by comparing base and head git revisions. It also updates the benchmark report schema to version 3 and scenario schema to version 2 to support and aggregate repeated runs. Review feedback suggests gracefully handling cases where the base metric is zero in regressionRatio for 'higher-is-better' metrics, and ensuring that the Windows process tree is only force-killed with /F when the signal is SIGKILL to allow graceful shutdowns on SIGTERM.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread packages/cli/src/bench/compare.ts
Comment thread packages/cli/src/bench/compare.ts
@dahlia

dahlia commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7b880cc7c4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/cli/src/bench/compare.ts
Keep compare's derived target separate from the explicit CLI target
signal, so unsafe overrides still require --target.  Also handle zero
throughput baselines as improvements and let Windows targets receive a
graceful taskkill before forced cleanup.

fedify-dev#804 (comment)
fedify-dev#804 (comment)
fedify-dev#804 (comment)

Assisted-by: Codex:gpt-5.5
@dahlia

dahlia commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the fedify bench compare command to facilitate CI-friendly performance regression comparisons between Git revisions, updating the benchmark report schema to version 3 to support repeated runs and adding a new comparison report schema. The review feedback suggests merging duplicate imports from node:fs/promises, adding a JSDoc comment to the public windowsTaskkillArgs function to adhere to the repository style guide, and using safer optional chaining when canceling response bodies to prevent potential runtime crashes.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread packages/cli/src/bench/compare.ts Outdated
Comment thread packages/cli/src/bench/compare.ts
Comment thread packages/cli/src/bench/compare.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab34e0e986

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/cli/src/bench/scenario/normalize.ts
dahlia added 2 commits June 13, 2026 16:01
The public-target unsafe override already required an explicit target,
load, and duration.  Include the repeated-run count in that contract so
the default run multiplier cannot expand public benchmark traffic without
an explicit operator choice.

fedify-dev#804 (comment)

Assisted-by: Codex:gpt-5.5
Merge the duplicate fs/promises imports, document the taskkill argument
helper, and tolerate custom response bodies without cancel methods while
polling readiness endpoints.

fedify-dev#804 (comment)
fedify-dev#804 (comment)
fedify-dev#804 (comment)

Assisted-by: Codex:gpt-5.5
@dahlia

dahlia commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

@codex review

@dahlia

dahlia commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the fedify bench compare command to enable comparing performance regression between base and head Git revisions in CI environments. It also implements support for repeated benchmark runs (defaulting to 3 runs) and aggregates repeated runs using median metrics for latency/throughput and pessimistic correctness. This change updates the benchmark report schema to version 3 and introduces the comparison report schema version 1. Feedback on the changes highlights two key improvement opportunities: first, regressionRatio returns null when base is 0 and the direction is "lower-is-better", which can cause microbenchmarks with millisecond precision to fail on negligible differences; second, the partialField aggregation discards standard-specific or queue metrics entirely if they are missing in a single run, whereas it should filter out undefined values and compute the median of the present values.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread packages/cli/src/bench/compare.ts
Comment thread packages/cli/src/bench/result/build.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 82c39d39c8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/cli/src/bench/result/build.ts Outdated
Treat one-millisecond latency movement from a zero baseline as noise,
preserve by-standard report detail across repeated runs, and avoid
publishing a pooled aggregate histogram that disagrees with the reported
median run latency fields.

fedify-dev#804 (comment)
fedify-dev#804 (comment)
fedify-dev#804 (comment)

Assisted-by: Codex:gpt-5.5
@dahlia

dahlia commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

@codex review

@dahlia

dahlia commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the fedify bench compare command to facilitate CI-friendly performance regression testing by checking out base and head refs into temporary worktrees, running the benchmark suite sequentially, and failing if regression exceeds a configured threshold. Additionally, benchmark scenarios now run three times by default to aggregate repeated runs with median latency/throughput and pessimistic correctness results, updating the report schema to version 3 and introducing comparison report schema version 1. Feedback on the implementation highlights a compatibility issue with toReversed() in Node.js 18, a potential CI failure when comparing newly introduced metrics without a baseline, and a TOCTOU security vulnerability caused by deleting temporary directories immediately after creation.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread packages/cli/src/bench/compare.ts Outdated
Comment thread packages/cli/src/bench/compare.ts Outdated
Comment thread packages/cli/src/bench/compare.ts Outdated
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Delightful!

Reviewed commit: de36221102

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Avoid Node 18-incompatible reverse iteration, let newly introduced
head-side metrics pass when the baseline lacks that measurement, and keep
temporary worktree directories intact until git owns them.

fedify-dev#804 (comment)
fedify-dev#804 (comment)
fedify-dev#804 (comment)

Assisted-by: Codex:gpt-5.5
@dahlia

dahlia commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

@codex review

@dahlia

dahlia commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the fedify bench compare command to compare performance metrics between two git revisions (base and head) in a CI-friendly manner, failing if the head revision regresses beyond a specified threshold plus measured noise. It updates the benchmark report schema to version 3 to support repeated runs (defaulting to 3 runs) and aggregates their results, and adds a new comparison report schema version 1. Feedback on the changes highlights two key improvements: wrapping killTargetProcess in a try...catch block inside stopTargetProcess to prevent unhandled exceptions from crashing the CLI, and registering SIGINT/SIGTERM signal handlers to clean up temporary worktrees and stop active target processes upon user interruption.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread packages/cli/src/bench/compare.ts
Comment thread packages/cli/src/bench/compare.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1d2edea8be

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/cli/src/bench/compare.ts Outdated
Race target exits against readiness, make interrupted comparisons stop
the active target and clean temporary worktrees, and reject cleanly if
forced target termination throws.

fedify-dev#804 (comment)
fedify-dev#804 (comment)
fedify-dev#804 (comment)

Assisted-by: Codex:gpt-5.5
@dahlia

dahlia commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

@codex review

@dahlia

dahlia commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the fedify bench compare command to enable CI-friendly performance regression gates by comparing base and head git revisions. It also implements support for repeated benchmark runs (defaulting to three runs), aggregating their results (using median values for latency/throughput and pessimistic correctness results), and updates the benchmark report schema to version 3. Feedback on the changes suggests using a type assertion to cast fieldValues to readonly number[] in packages/cli/src/bench/result/build.ts to resolve a TypeScript compilation error under strict null checks.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread packages/cli/src/bench/result/build.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 41d38f42bc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/cli/src/bench/compare.ts
Carry interrupt and target-exit cancellation through compare readiness,
in-process benchmark runs, runner fetches, and load-generator sleeps so
raced benchmark work does not continue after cleanup starts.

fedify-dev#804 (comment)
fedify-dev#804 (comment)

Assisted-by: Codex:gpt-5.5
@dahlia

dahlia commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

@codex review

@dahlia

dahlia commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the fedify bench compare command, which enables CI-friendly performance regression testing by comparing a base and head git revision on the same runner. It automates checking out revisions into temporary worktrees, starting target processes, waiting for readiness, and running the benchmark suite. Additionally, the PR implements support for repeated benchmark runs (defaulting to three) and aggregates their metrics (using median values for latency/throughput and pessimistic success rates) to ensure stable results. To support these features, the benchmark report schema has been updated to version 3, a new comparison report schema version 1 has been introduced, and abort signals have been integrated to cancel in-flight benchmark tasks. Since no review comments were provided, there is no further feedback to address.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Nice work!

Reviewed commit: 356539c6c8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@dahlia dahlia requested review from 2chanhaeng and sij411 June 13, 2026 11:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/cli CLI tools related component/otel OpenTelemetry integration

Development

Successfully merging this pull request may close these issues.

Benchmarking: fedify bench compare A/B mode and CI gate robustness

1 participant