feat(harbor): expose score_baseline in the build YAML #24
Open
shehabyasser-scale wants to merge 2 commits into
ServeConfig.score_baseline (baseline scored at finalize so regressions are visible) existed but was unreachable: BuildConfig had no such field, pydantic silently dropped it from YAML, and the compiler never emitted the key, so it was always False in a compiled task. The regression it detects is real: in a live weak-model trial, auto_best's winner scored 0.2 on validation against the untouched repo's 0.3, invisibly.

Adds the field to BuildConfig and emits it into serve.json. On stacks where ServeConfig predates the field the extra key is ignored (pydantic extra=ignore), so this composes with the sidecar-side feature branch in either merge order.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Greptile on #24: only the default-False case went through the full pipeline; a compile_task refactor dropping the field would have kept the True test green. Now both paths hit the written serve.json.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Companion to #19 (sidecar stack); based on the compiler stack.

`ServeConfig.score_baseline` (admin-score the unmodified baseline at finalize, write `<admin_volume>/baseline.json` so regressions are visible) exists on the sidecar branch but was unreachable: `BuildConfig` had no such field, pydantic silently dropped it from YAML, and the compiler never emitted the key, so it was always False in a compiled task. The regression it detects is real: in our weak-model trial, auto_best's winner scored 0.2 on validation against the untouched repo's 0.3, invisibly.

This adds the field to `BuildConfig` and emits it into `serve.json`. Merge-order safe in both directions: where `ServeConfig` predates the field, the extra key is ignored (pydantic `extra=ignore`, pinned by a raw-JSON test); once #19 lands, the knob becomes live with no further change.

Tests: default `false` reaches serve.json through a full `compile_task`; `score_baseline: true` travels the actual YAML path (`yaml.safe_load` -> `model_validate` -> `_serve_config`).

Follow-ups deliberately not in scope: `extra="forbid"` on BuildConfig (a typo'd key anywhere in build.yaml is still silently dropped, worth its own PR), and documenting the knob in the harbor-4 docs branch.

🤖 Generated with Claude Code
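
The silent drop and the fix described above can be reproduced in a few lines. This is a minimal sketch, assuming pydantic v2; the model classes are hypothetical reductions of the real ones (which carry more fields), `OldBuildConfig` and `StrictBuildConfig` are names invented here for contrast, and the `submit_enabled` default is an assumption.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class OldBuildConfig(BaseModel):
    """Hypothetical pre-PR shape: no score_baseline field."""
    submit_enabled: bool = True  # default assumed for illustration


class BuildConfig(BaseModel):
    """Hypothetical post-PR shape."""
    submit_enabled: bool = True
    # Score the unmodified baseline at finalize so regressions are visible.
    score_baseline: bool = False


class StrictBuildConfig(BaseModel):
    """The out-of-scope follow-up: reject unknown keys instead of dropping them."""
    model_config = ConfigDict(extra="forbid")
    score_baseline: bool = False


def _serve_config(config: BuildConfig) -> dict:
    # Hypothetical reduction of the compiler helper of the same name.
    return {
        "submit_enabled": config.submit_enabled,
        "score_baseline": config.score_baseline,
    }


# What yaml.safe_load would return for a build.yaml containing `score_baseline: true`.
raw = {"score_baseline": True}

# Before: pydantic's default extra handling ignores the unknown key without error.
key_dropped = "score_baseline" not in OldBuildConfig.model_validate(raw).model_dump()

# After: the field survives validation and is emitted into the serve config dict.
emitted = _serve_config(BuildConfig.model_validate(raw))

# With extra="forbid", a typo'd key fails loudly instead of vanishing.
try:
    StrictBuildConfig.model_validate({"score_basline": True})
    typo_rejected = False
except ValidationError:
    typo_rejected = True
```

In this sketch the first case is the bug (the key disappears with no warning), the second is the behavior this PR adds, and the third shows why `extra="forbid"` would catch the same class of mistake earlier.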
Greptile Summary
This PR exposes `score_baseline` from `build.yaml` through the full compiler pipeline into `serve.json`, fixing a silent Pydantic drop that had kept the feature permanently `False` in compiled tasks.

- `config.py`: Adds `score_baseline: bool = False` to `BuildConfig` with an explanatory comment.
- `compiler.py`: Emits `"score_baseline": config.score_baseline` into the `_serve_config` dict, wiring it into the compiler-to-sidecar contract.
- `test_harbor_build.py`: Adds three tests: default-`False` via raw JSON, `True` through the YAML → `model_validate` → `_serve_config` path, and `True` through the full `compile_task` pipeline, addressing the previous review comment about the `True` path lacking end-to-end coverage.

Confidence Score: 5/5

Safe to merge: the change is a minimal, additive field that defaults to `False`, leaving all existing behavior unchanged.

The diff is a one-field addition to `BuildConfig`, a one-line addition to `_serve_config`, and three well-layered tests. The default is `False`, so no existing compiled tasks change behavior. Forward compatibility is covered by pydantic `extra=ignore` on `ServeConfig`. The previously flagged gap (True path not tested through `compile_task`) is now closed by `test_score_baseline_true_through_compile_task`.

No files require special attention.
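
The two guarantees the review relies on (both values reaching the written `serve.json`, and an older `ServeConfig` tolerating the extra key) can be sketched as follows. This is an illustration only, assuming pydantic v2: `compile_task` here is a hypothetical stand-in that writes nothing but `serve.json`, and `LegacyServeConfig` and `written_serve_json` are names invented for the example.

```python
import json
import tempfile
from pathlib import Path

from pydantic import BaseModel, ConfigDict


class BuildConfig(BaseModel):
    score_baseline: bool = False


class LegacyServeConfig(BaseModel):
    """Hypothetical ServeConfig that predates score_baseline."""
    model_config = ConfigDict(extra="ignore")
    submit_enabled: bool = True  # default assumed for illustration


def compile_task(config: BuildConfig, out_dir: Path) -> Path:
    """Hypothetical stand-in for the compiler: only serve.json is written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "serve.json"
    path.write_text(json.dumps({"score_baseline": config.score_baseline}))
    return path


def written_serve_json(raw: dict) -> str:
    """Validate raw build config, compile it, and return the serve.json text on disk."""
    config = BuildConfig.model_validate(raw)
    with tempfile.TemporaryDirectory() as tmp:
        return compile_task(config, Path(tmp)).read_text()


# Both the default and the explicit True case are read back from the written file,
# so a refactor that stops emitting the key would fail either check.
default_value = json.loads(written_serve_json({}))["score_baseline"]
enabled_text = written_serve_json({"score_baseline": True})
enabled_value = json.loads(enabled_text)["score_baseline"]

# An older sidecar model parses the same JSON and simply ignores the extra key.
legacy = LegacyServeConfig.model_validate_json(enabled_text)
```

Asserting on the file contents rather than on the in-memory dict is the point of the design: it exercises the same artifact the sidecar will read.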
Important Files Changed

- `config.py`: […] `score_baseline: bool = False` field to `BuildConfig` with a clear inline comment explaining its purpose; minimal and correct change.
- `compiler.py`: […] `score_baseline` into the `_serve_config` dict alongside `submit_enabled`; one-line change, correctly placed in the compiler-to-serve contract dict.
- `test_harbor_build.py`: […] `_serve_config`, and True through the full `compile_task` pipeline (addressing the previous review comment).

Sequence Diagram
```mermaid
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant YAML as build.yaml
    participant BC as BuildConfig
    participant SC as _serve_config()
    participant SJ as serve.json
    participant SVC as ServeConfig (sidecar)
    YAML->>BC: "yaml.safe_load → model_validate<br/>(score_baseline: true)"
    BC->>SC: config.score_baseline passed in
    SC->>SJ: "{"score_baseline": true, ...}"
    SJ->>SVC: "parsed at sidecar runtime<br/>(extra=ignore if field absent)"
```

Reviews (2): Last reviewed commit: "test(harbor): drive score_baseline=True ..."