Skip to content

feat(harbor): admin /experiments endpoint for mid-run observability#14

Open
shehabyasser-scale wants to merge 1 commit into
harbor-2-sidecar-fixesfrom
harbor-2-sidecar-observability
Open

feat(harbor): admin /experiments endpoint for mid-run observability#14
shehabyasser-scale wants to merge 1 commit into
harbor-2-sidecar-fixesfrom
harbor-2-sidecar-observability

Conversation

@shehabyasser-scale

@shehabyasser-scale shehabyasser-scale commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

Stacked on #8. Operational gap found running live GAIA optimization trials: mid-run, the only visibility into what the optimizer had measured was mining its transcript or exec-ing into the sidecar container.

Adds token-gated GET /experiments returning every recorded experiment (commit, dataset/split, recorded score, created_at). An operator or the outer harness can now watch an optimization trajectory as it happens, e.g.:

TOKEN=$(cat /state/token/admin.token)
curl -s -H "Authorization: Bearer $TOKEN" localhost:8000/experiments

Admin-only on purpose (reuses the finalize bearer gate): recorded scores on non_viewable splits must not reach the agent. Test covers the 403 paths and the row shape. 11 pass.

🤖 Generated with Claude Code

Greptile Summary

Adds a token-gated GET /experiments admin endpoint that returns every recorded experiment (commit, dataset/split, mean score, timestamp) for mid-run observability, reusing the existing bearer-token gate from /finalize.

  • app.py: New GET /experiments route behind check_admin; clean auth delegation to sidecar.list_experiments().
  • server.py: New list_experiments() iterates the engine's experiment DataFrame and shapes rows for the HTTP response; silently applies a NaN-fill that replaces error-experiment scores with the minimum score floor.
  • test_harbor_app.py: New TestExperimentsEndpoint covers the 403 paths and basic row shape; happy-path mocking avoids the fill-score path.

Confidence Score: 3/5

The auth gate is correctly applied and the endpoint is properly admin-only, but the score data returned for errored experiments is misleading in a way that directly undermines the stated purpose of the feature.

The list_experiments() method calls get_experiments_df() with its default fill — any experiment whose samples errored will have its mean_score synthetically floored to default_minimum_score and returned as a real-looking number. An operator using this endpoint to watch an optimization trajectory mid-run would see a fabricated score for failed evaluations, which is the opposite of the observability guarantee the feature is meant to provide.

vero/src/vero/harbor/server.py — the list_experiments() implementation needs the fill_score=None fix and ideally a status field before this endpoint can be trusted for production use.

Important Files Changed

Filename Overview
vero/src/vero/harbor/server.py Adds list_experiments() for admin observability; silently fills errored experiment scores with default_minimum_score instead of None, and serializes missing timestamps as the string "None" instead of JSON null.
vero/src/vero/harbor/app.py Adds GET /experiments admin endpoint, correctly reusing the same check_admin bearer-token gate as /finalize; clean and straightforward.
vero/tests/test_harbor_app.py Adds TestExperimentsEndpoint covering auth rejection (no token, wrong token) and basic row shape; does not cover the db is None / df.empty paths or errored-experiment score representation.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    actor Operator
    participant App as FastAPI (app.py)
    participant Sidecar as EvaluationSidecar (server.py)
    participant DB as ExperimentDatabase

    Operator->>App: GET /experiments
    App->>App: check_admin(token)
    alt invalid token
        App-->>Operator: 403 Forbidden
    else valid token
        App->>Sidecar: list_experiments()
        Sidecar->>DB: "get_experiments_df(fill_score=default_minimum_score)"
        DB-->>Sidecar: DataFrame (NaN scores filled)
        Sidecar->>Sidecar: iterate rows, shape dicts
        Sidecar-->>App: list[dict]
        App-->>Operator: "200 {experiments: [...]}"
    end
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    actor Operator
    participant App as FastAPI (app.py)
    participant Sidecar as EvaluationSidecar (server.py)
    participant DB as ExperimentDatabase

    Operator->>App: GET /experiments
    App->>App: check_admin(token)
    alt invalid token
        App-->>Operator: 403 Forbidden
    else valid token
        App->>Sidecar: list_experiments()
        Sidecar->>DB: "get_experiments_df(fill_score=default_minimum_score)"
        DB-->>Sidecar: DataFrame (NaN scores filled)
        Sidecar->>Sidecar: iterate rows, shape dicts
        Sidecar-->>App: list[dict]
        App-->>Operator: "200 {experiments: [...]}"
    end
Loading

Fix All in Cursor Fix All in Claude Code Fix All in Codex

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
vero/src/vero/harbor/server.py:115
**`mean_score` silently replaced with `default_minimum_score` for errored experiments**

`get_experiments_df()` is called with its default `fill_score=default_minimum_score`, which propagates as `nan_score_fill_value` into `as_pandas_series()``sample_results_statistics()`. Any experiment whose samples errored will have its per-sample scores replaced with the minimum score floor before `mean_score` is averaged — so the response returns a synthetic floor value that looks like a real measured score. An operator watching the optimization trajectory mid-run would incorrectly treat a failed evaluation as having scored at the minimum, skewing their assessment. Pass `fill_score=None` so that errored experiments surface `mean_score: null` and add `"status"` from `r.get("status")` so operators can distinguish genuine scores from failures.

### Issue 2 of 2
vero/src/vero/harbor/server.py:126
**`"None"` / `"NaT"` string instead of JSON `null` for missing timestamps**

`str(r.get("candidate_created_at"))` produces the literal strings `"None"` or `"NaT"` when the Series value is absent or a pandas `NaT`, rather than a JSON `null`. These string sentinels are hard for consumers to detect and handle programmatically. Use an explicit ISO-format conversion with a `None` guard so the JSON field is either a proper timestamp string or `null`.

```suggestion
                    "created_at": (
                        v.isoformat()
                        if (v := r.get("candidate_created_at")) is not None
                        and str(v) not in ("NaT", "None")
                        else None
                    ),
```

Reviews (1): Last reviewed commit: "feat(harbor): admin /experiments endpoin..." | Re-trigger Greptile

Greptile also left 2 inline comments on this PR.

Operating a live optimization run today, the only way to see what the
optimizer had measured mid-flight was mining its transcript from outside
or exec-ing into the sidecar. Add a token-gated GET /experiments that
returns every recorded experiment (commit, dataset/split, recorded
score, created_at) so an operator or outer harness can watch the
trajectory as it happens.

Admin-only on purpose: recorded scores on non_viewable splits must not
reach the agent; the endpoint reuses the finalize bearer-token gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
"""
if self.engine.db is None:
return []
df = self.engine.db.get_experiments_df()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 mean_score silently replaced with default_minimum_score for errored experiments

get_experiments_df() is called with its default fill_score=default_minimum_score, which propagates as nan_score_fill_value into as_pandas_series()sample_results_statistics(). Any experiment whose samples errored will have its per-sample scores replaced with the minimum score floor before mean_score is averaged — so the response returns a synthetic floor value that looks like a real measured score. An operator watching the optimization trajectory mid-run would incorrectly treat a failed evaluation as having scored at the minimum, skewing their assessment. Pass fill_score=None so that errored experiments surface mean_score: null and add "status" from r.get("status") so operators can distinguish genuine scores from failures.

Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/harbor/server.py
Line: 115

Comment:
**`mean_score` silently replaced with `default_minimum_score` for errored experiments**

`get_experiments_df()` is called with its default `fill_score=default_minimum_score`, which propagates as `nan_score_fill_value` into `as_pandas_series()``sample_results_statistics()`. Any experiment whose samples errored will have its per-sample scores replaced with the minimum score floor before `mean_score` is averaged — so the response returns a synthetic floor value that looks like a real measured score. An operator watching the optimization trajectory mid-run would incorrectly treat a failed evaluation as having scored at the minimum, skewing their assessment. Pass `fill_score=None` so that errored experiments surface `mean_score: null` and add `"status"` from `r.get("status")` so operators can distinguish genuine scores from failures.

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Claude Code Fix in Codex

"dataset_id": r.get("dataset_subset_dataset_id"),
"split": r.get("dataset_subset_split"),
"mean_score": r.get("mean_score"),
"created_at": str(r.get("candidate_created_at")),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 "None" / "NaT" string instead of JSON null for missing timestamps

str(r.get("candidate_created_at")) produces the literal strings "None" or "NaT" when the Series value is absent or a pandas NaT, rather than a JSON null. These string sentinels are hard for consumers to detect and handle programmatically. Use an explicit ISO-format conversion with a None guard so the JSON field is either a proper timestamp string or null.

Suggested change
"created_at": str(r.get("candidate_created_at")),
"created_at": (
v.isoformat()
if (v := r.get("candidate_created_at")) is not None
and str(v) not in ("NaT", "None")
else None
),
Prompt To Fix With AI
This is a comment left during a code review.
Path: vero/src/vero/harbor/server.py
Line: 126

Comment:
**`"None"` / `"NaT"` string instead of JSON `null` for missing timestamps**

`str(r.get("candidate_created_at"))` produces the literal strings `"None"` or `"NaT"` when the Series value is absent or a pandas `NaT`, rather than a JSON `null`. These string sentinels are hard for consumers to detect and handle programmatically. Use an explicit ISO-format conversion with a `None` guard so the JSON field is either a proper timestamp string or `null`.

```suggestion
                    "created_at": (
                        v.isoformat()
                        if (v := r.get("candidate_created_at")) is not None
                        and str(v) not in ("NaT", "None")
                        else None
                    ),
```

How can I resolve this? If you propose a fix, please make it concise.

Fix in Cursor Fix in Claude Code Fix in Codex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant