feat(harbor): admin /experiments endpoint for mid-run observability #14
shehabyasser-scale wants to merge 1 commit into
Today, when operating a live optimization run, the only way to see what the optimizer has measured mid-flight is to mine its transcript from outside or exec into the sidecar. Add a token-gated GET /experiments that returns every recorded experiment (commit, dataset/split, recorded score, created_at) so an operator or outer harness can watch the trajectory as it happens. Admin-only on purpose: recorded scores on non_viewable splits must not reach the agent, so the endpoint reuses the finalize bearer-token gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    """
    if self.engine.db is None:
        return []
    df = self.engine.db.get_experiments_df()
**`mean_score` silently replaced with `default_minimum_score` for errored experiments**
`get_experiments_df()` is called with its default `fill_score=default_minimum_score`, which propagates as `nan_score_fill_value` into `as_pandas_series()` → `sample_results_statistics()`. Any experiment whose samples errored has its per-sample scores replaced with the minimum score floor before `mean_score` is averaged, so the response returns a synthetic floor value that looks like a real measured score. An operator watching the optimization trajectory mid-run would incorrectly treat a failed evaluation as having scored at the minimum, skewing their assessment. Pass `fill_score=None` so that errored experiments surface `mean_score: null`, and add `"status"` from `r.get("status")` so operators can distinguish genuine scores from failures.
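The effect described in this comment can be illustrated with a small stand-alone sketch. It does not use the project's `get_experiments_df()` or `sample_results_statistics()`; `mean_score` below is a hypothetical stand-in, and it assumes that an errored sample is represented as NaN and that unfilled NaNs are skipped when averaging.

```python
import math
from typing import Optional


def mean_score(sample_scores: list[float], fill_score: Optional[float]) -> Optional[float]:
    """Hypothetical stand-in: average per-sample scores, where NaN marks an errored sample."""
    if fill_score is not None:
        # Filled path: errored samples are replaced by the floor before averaging.
        sample_scores = [fill_score if math.isnan(s) else s for s in sample_scores]
    # Assumption: remaining NaNs are skipped rather than poisoning the mean.
    measured = [s for s in sample_scores if not math.isnan(s)]
    if not measured:
        return None  # serializes as JSON null
    return sum(measured) / len(measured)


all_errored = [math.nan, math.nan]
genuinely_zero = [0.0, 0.0]

# With a fill value, a failed evaluation and a real zero look identical.
print(mean_score(all_errored, fill_score=0.0), mean_score(genuinely_zero, fill_score=0.0))

# Without a fill value, the failure surfaces as None while the real zero stays 0.0.
print(mean_score(all_errored, fill_score=None), mean_score(genuinely_zero, fill_score=None))
```

Under these assumptions the filled path cannot tell a failed run from a run that truly scored at the floor, which is why the comment also asks for a separate `status` field.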
    "dataset_id": r.get("dataset_subset_dataset_id"),
    "split": r.get("dataset_subset_split"),
    "mean_score": r.get("mean_score"),
    "created_at": str(r.get("candidate_created_at")),
**`"None"` / `"NaT"` string instead of JSON `null` for missing timestamps**
`str(r.get("candidate_created_at"))` produces the literal strings `"None"` or `"NaT"` when the Series value is absent or a pandas `NaT`, rather than a JSON `null`. These string sentinels are hard for consumers to detect and handle programmatically. Use an explicit ISO-format conversion with a `None` guard so the JSON field is either a proper timestamp string or `null`.
```suggestion
"created_at": (
v.isoformat()
if (v := r.get("candidate_created_at")) is not None
and str(v) not in ("NaT", "None")
else None
),
```
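As an alternative to comparing against the strings `"NaT"` and `"None"`, the guard could live in a small helper. The sketch below is not part of the PR; `to_json_timestamp` is a hypothetical name. It relies on the fact that NaN-like sentinels compare unequal to themselves, which holds for `float("nan")` and, to my knowledge, for pandas `NaT` as well.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def to_json_timestamp(value: Any) -> Optional[str]:
    """Return an ISO-8601 string, or None so the JSON encoder emits null."""
    if value is None:
        return None
    # NaN-like sentinels are the only values that compare unequal to themselves.
    if value != value:
        return None
    if hasattr(value, "isoformat"):
        return value.isoformat()
    # Fall back to str() for values that are already stored as text.
    return str(value)


print(to_json_timestamp(datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)))
print(to_json_timestamp(None))
```

The row-shaping code would then read `"created_at": to_json_timestamp(r.get("candidate_created_at"))`, keeping the dict literal flat.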
Stacked on #8. Operational gap found while running live GAIA optimization trials: mid-run, the only visibility into what the optimizer had measured was mining its transcript or exec-ing into the sidecar container.

Adds a token-gated `GET /experiments` returning every recorded experiment (commit, dataset/split, recorded score, created_at). An operator or the outer harness can now watch an optimization trajectory as it happens.

Admin-only on purpose (reuses the finalize bearer gate): recorded scores on `non_viewable` splits must not reach the agent. Test covers the 403 paths and the row shape. 11 pass.

🤖 Generated with Claude Code
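A minimal sketch of how an operator script might poll the endpoint, using only the standard library. The base URL, the token value, and the `Authorization: Bearer <token>` header format are assumptions (the PR only says the endpoint reuses the finalize bearer gate); the `experiments` envelope key and the `mean_score` row field follow the response shape shown elsewhere in this PR.

```python
import json
import urllib.request
from typing import Any, Optional


def build_experiments_request(base_url: str, admin_token: str) -> urllib.request.Request:
    """Build the GET request; the Bearer header format is an assumption."""
    url = base_url.rstrip("/") + "/experiments"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {admin_token}"})


def fetch_experiments(base_url: str, admin_token: str, timeout: float = 10.0) -> list[dict[str, Any]]:
    """Fetch all recorded experiments; raises urllib.error.HTTPError on a 403."""
    req = build_experiments_request(base_url, admin_token)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload.get("experiments", [])


def best_so_far(rows: list[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Return the highest-scoring row, ignoring rows without a score."""
    scored = [r for r in rows if r.get("mean_score") is not None]
    return max(scored, key=lambda r: r["mean_score"], default=None)


if __name__ == "__main__":
    # Hypothetical address and token, for illustration only.
    rows = fetch_experiments("http://localhost:8000", "ADMIN_TOKEN")
    print(best_so_far(rows))
```

Calling `fetch_experiments` in a loop with a sleep between calls gives a simple trajectory monitor for the outer harness.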
Greptile Summary

Adds a token-gated `GET /experiments` admin endpoint that returns every recorded experiment (commit, dataset/split, mean score, timestamp) for mid-run observability, reusing the existing bearer-token gate from `/finalize`.

- `app.py`: New `GET /experiments` route behind `check_admin`; clean auth delegation to `sidecar.list_experiments()`.
- `server.py`: New `list_experiments()` iterates the engine's experiment DataFrame and shapes rows for the HTTP response; silently applies a NaN-fill that replaces error-experiment scores with the minimum score floor.
- `test_harbor_app.py`: New `TestExperimentsEndpoint` covers the 403 paths and basic row shape; happy-path mocking avoids the fill-score path.

Confidence Score: 3/5

The auth gate is correctly applied and the endpoint is properly admin-only, but the score data returned for errored experiments is misleading in a way that directly undermines the stated purpose of the feature.

The `list_experiments()` method calls `get_experiments_df()` with its default fill, so any experiment whose samples errored will have its `mean_score` synthetically floored to `default_minimum_score` and returned as a real-looking number. An operator using this endpoint to watch an optimization trajectory mid-run would see a fabricated score for failed evaluations, which is the opposite of the observability guarantee the feature is meant to provide.

vero/src/vero/harbor/server.py: the `list_experiments()` implementation needs the `fill_score=None` fix and ideally a `status` field before this endpoint can be trusted for production use.

Important Files Changed

- `server.py`: Adds `list_experiments()` for admin observability; silently fills errored experiment scores with `default_minimum_score` instead of `None`, and serializes missing timestamps as the string "None" instead of JSON null.
- `app.py`: Adds the `GET /experiments` admin endpoint, correctly reusing the same `check_admin` bearer-token gate as `/finalize`; clean and straightforward.
- `test_harbor_app.py`: Adds `TestExperimentsEndpoint` covering auth rejection (no token, wrong token) and basic row shape; does not cover the `db is None` / `df.empty` paths or errored-experiment score representation.

Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    actor Operator
    participant App as FastAPI (app.py)
    participant Sidecar as EvaluationSidecar (server.py)
    participant DB as ExperimentDatabase
    Operator->>App: GET /experiments
    App->>App: check_admin(token)
    alt invalid token
        App-->>Operator: 403 Forbidden
    else valid token
        App->>Sidecar: list_experiments()
        Sidecar->>DB: "get_experiments_df(fill_score=default_minimum_score)"
        DB-->>Sidecar: DataFrame (NaN scores filled)
        Sidecar->>Sidecar: iterate rows, shape dicts
        Sidecar-->>App: list[dict]
        App-->>Operator: "200 {experiments: [...]}"
    end
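The gate-then-delegate flow in the diagram can be sketched without any web framework. This is not the PR's `check_admin` or its FastAPI route; `is_admin_token` and `handle_get_experiments` are hypothetical names, and the `Bearer ` header prefix and the 403 body are assumptions.

```python
import hmac
from typing import Any, Callable, Optional


def is_admin_token(authorization: Optional[str], admin_token: str) -> bool:
    """Hypothetical gate: accept only 'Bearer <admin_token>'."""
    prefix = "Bearer "
    if not authorization or not authorization.startswith(prefix):
        return False
    presented = authorization[len(prefix):]
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(presented.encode(), admin_token.encode())


def handle_get_experiments(
    authorization: Optional[str],
    admin_token: str,
    list_experiments: Callable[[], list[dict[str, Any]]],
) -> tuple[int, dict[str, Any]]:
    """Return (status_code, body), mirroring the two branches in the diagram."""
    if not is_admin_token(authorization, admin_token):
        return 403, {"detail": "Forbidden"}
    return 200, {"experiments": list_experiments()}


# Usage with a stubbed sidecar method standing in for list_experiments().
rows = [{"commit": "abc123", "split": "train", "mean_score": 0.5}]
print(handle_get_experiments("Bearer secret", "secret", lambda: rows))
print(handle_get_experiments(None, "secret", lambda: rows))
```

Keeping the gate ahead of the delegation means the sidecar is never queried for an unauthenticated caller, which matches the stated goal that scores on `non_viewable` splits never reach the agent.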
Reviews (1): Last reviewed commit: "feat(harbor): admin /experiments endpoin..."