feat(harbor): admin /experiments endpoint for mid-run observability by shehabyasser-scale · Pull Request #14 · scaleapi/vero

shehabyasser-scale · 2026-07-03T03:05:43Z

Stacked on #8. Operational gap found running live GAIA optimization trials: mid-run, the only visibility into what the optimizer had measured was mining its transcript or exec-ing into the sidecar container.

Adds token-gated GET /experiments returning every recorded experiment (commit, dataset/split, recorded score, created_at). An operator or the outer harness can now watch an optimization trajectory as it happens, e.g.:

TOKEN=$(cat /state/token/admin.token)
curl -s -H "Authorization: Bearer $TOKEN" localhost:8000/experiments

Admin-only on purpose (reuses the finalize bearer gate): recorded scores on non_viewable splits must not reach the agent. Test covers the 403 paths and the row shape. 11 pass.

🤖 Generated with Claude Code

Greptile Summary

Adds a token-gated GET /experiments admin endpoint that returns every recorded experiment (commit, dataset/split, mean score, timestamp) for mid-run observability, reusing the existing bearer-token gate from /finalize.

app.py: New GET /experiments route behind check_admin; clean auth delegation to sidecar.list_experiments().
server.py: New list_experiments() iterates the engine's experiment DataFrame and shapes rows for the HTTP response; silently applies a NaN-fill that replaces error-experiment scores with the minimum score floor.
test_harbor_app.py: New TestExperimentsEndpoint covers the 403 paths and basic row shape; happy-path mocking avoids the fill-score path.

Confidence Score: 3/5

The auth gate is correctly applied and the endpoint is properly admin-only, but the score data returned for errored experiments is misleading in a way that directly undermines the stated purpose of the feature.

The list_experiments() method calls get_experiments_df() with its default fill — any experiment whose samples errored will have its mean_score synthetically floored to default_minimum_score and returned as a real-looking number. An operator using this endpoint to watch an optimization trajectory mid-run would see a fabricated score for failed evaluations, which is the opposite of the observability guarantee the feature is meant to provide.

vero/src/vero/harbor/server.py — the list_experiments() implementation needs the fill_score=None fix and ideally a status field before this endpoint can be trusted for production use.

Important Files Changed

Filename	Overview
vero/src/vero/harbor/server.py	Adds `list_experiments()` for admin observability; silently fills errored experiment scores with `default_minimum_score` instead of `None`, and serializes missing timestamps as the string "None" instead of JSON null.
vero/src/vero/harbor/app.py	Adds `GET /experiments` admin endpoint, correctly reusing the same `check_admin` bearer-token gate as `/finalize`; clean and straightforward.
vero/tests/test_harbor_app.py	Adds `TestExperimentsEndpoint` covering auth rejection (no token, wrong token) and basic row shape; does not cover the `db is None` / `df.empty` paths or errored-experiment score representation.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    actor Operator
    participant App as FastAPI (app.py)
    participant Sidecar as EvaluationSidecar (server.py)
    participant DB as ExperimentDatabase

    Operator->>App: GET /experiments
    App->>App: check_admin(token)
    alt invalid token
        App-->>Operator: 403 Forbidden
    else valid token
        App->>Sidecar: list_experiments()
        Sidecar->>DB: "get_experiments_df(fill_score=default_minimum_score)"
        DB-->>Sidecar: DataFrame (NaN scores filled)
        Sidecar->>Sidecar: iterate rows, shape dicts
        Sidecar-->>App: list[dict]
        App-->>Operator: "200 {experiments: [...]}"
    end

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    actor Operator
    participant App as FastAPI (app.py)
    participant Sidecar as EvaluationSidecar (server.py)
    participant DB as ExperimentDatabase

    Operator->>App: GET /experiments
    App->>App: check_admin(token)
    alt invalid token
        App-->>Operator: 403 Forbidden
    else valid token
        App->>Sidecar: list_experiments()
        Sidecar->>DB: "get_experiments_df(fill_score=default_minimum_score)"
        DB-->>Sidecar: DataFrame (NaN scores filled)
        Sidecar->>Sidecar: iterate rows, shape dicts
        Sidecar-->>App: list[dict]
        App-->>Operator: "200 {experiments: [...]}"
    end

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
vero/src/vero/harbor/server.py:115
**`mean_score` silently replaced with `default_minimum_score` for errored experiments**

`get_experiments_df()` is called with its default `fill_score=default_minimum_score`, which propagates as `nan_score_fill_value` into `as_pandas_series()` → `sample_results_statistics()`. Any experiment whose samples errored will have its per-sample scores replaced with the minimum score floor before `mean_score` is averaged — so the response returns a synthetic floor value that looks like a real measured score. An operator watching the optimization trajectory mid-run would incorrectly treat a failed evaluation as having scored at the minimum, skewing their assessment. Pass `fill_score=None` so that errored experiments surface `mean_score: null` and add `"status"` from `r.get("status")` so operators can distinguish genuine scores from failures.

### Issue 2 of 2
vero/src/vero/harbor/server.py:126
**`"None"` / `"NaT"` string instead of JSON `null` for missing timestamps**

`str(r.get("candidate_created_at"))` produces the literal strings `"None"` or `"NaT"` when the Series value is absent or a pandas `NaT`, rather than a JSON `null`. These string sentinels are hard for consumers to detect and handle programmatically. Use an explicit ISO-format conversion with a `None` guard so the JSON field is either a proper timestamp string or `null`.

```suggestion
                    "created_at": (
                        v.isoformat()
                        if (v := r.get("candidate_created_at")) is not None
                        and str(v) not in ("NaT", "None")
                        else None
                    ),
```

_{Reviews (1): Last reviewed commit: "feat(harbor): admin /experiments endpoin..." | Re-trigger Greptile}

Greptile also left 2 inline comments on this PR.

Operating a live optimization run today, the only way to see what the optimizer had measured mid-flight was mining its transcript from outside or exec-ing into the sidecar. Add a token-gated GET /experiments that returns every recorded experiment (commit, dataset/split, recorded score, created_at) so an operator or outer harness can watch the trajectory as it happens. Admin-only on purpose: recorded scores on non_viewable splits must not reach the agent; the endpoint reuses the finalize bearer-token gate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-07-03T03:08:32Z

+        """
+        if self.engine.db is None:
+            return []
+        df = self.engine.db.get_experiments_df()


mean_score silently replaced with default_minimum_score for errored experiments

get_experiments_df() is called with its default fill_score=default_minimum_score, which propagates as nan_score_fill_value into as_pandas_series() → sample_results_statistics(). Any experiment whose samples errored will have its per-sample scores replaced with the minimum score floor before mean_score is averaged — so the response returns a synthetic floor value that looks like a real measured score. An operator watching the optimization trajectory mid-run would incorrectly treat a failed evaluation as having scored at the minimum, skewing their assessment. Pass fill_score=None so that errored experiments surface mean_score: null and add "status" from r.get("status") so operators can distinguish genuine scores from failures.

Prompt To Fix With AI

This is a comment left during a code review. Path: vero/src/vero/harbor/server.py Line: 115 Comment: **`mean_score` silently replaced with `default_minimum_score` for errored experiments** `get_experiments_df()` is called with its default `fill_score=default_minimum_score`, which propagates as `nan_score_fill_value` into `as_pandas_series()` → `sample_results_statistics()`. Any experiment whose samples errored will have its per-sample scores replaced with the minimum score floor before `mean_score` is averaged — so the response returns a synthetic floor value that looks like a real measured score. An operator watching the optimization trajectory mid-run would incorrectly treat a failed evaluation as having scored at the minimum, skewing their assessment. Pass `fill_score=None` so that errored experiments surface `mean_score: null` and add `"status"` from `r.get("status")` so operators can distinguish genuine scores from failures. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-07-03T03:08:33Z

+                    "dataset_id": r.get("dataset_subset_dataset_id"),
+                    "split": r.get("dataset_subset_split"),
+                    "mean_score": r.get("mean_score"),
+                    "created_at": str(r.get("candidate_created_at")),


"None" / "NaT" string instead of JSON null for missing timestamps

str(r.get("candidate_created_at")) produces the literal strings "None" or "NaT" when the Series value is absent or a pandas NaT, rather than a JSON null. These string sentinels are hard for consumers to detect and handle programmatically. Use an explicit ISO-format conversion with a None guard so the JSON field is either a proper timestamp string or null.

Suggested change

"created_at": str(r.get("candidate_created_at")),

"created_at": (

v.isoformat()

if (v := r.get("candidate_created_at")) is not None

and str(v) not in ("NaT", "None")

else None

),

Prompt To Fix With AI

This is a comment left during a code review. Path: vero/src/vero/harbor/server.py Line: 126 Comment: **`"None"` / `"NaT"` string instead of JSON `null` for missing timestamps** `str(r.get("candidate_created_at"))` produces the literal strings `"None"` or `"NaT"` when the Series value is absent or a pandas `NaT`, rather than a JSON `null`. These string sentinels are hard for consumers to detect and handle programmatically. Use an explicit ISO-format conversion with a `None` guard so the JSON field is either a proper timestamp string or `null`. ```suggestion "created_at": ( v.isoformat() if (v := r.get("candidate_created_at")) is not None and str(v) not in ("NaT", "None") else None ), ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps Bot reviewed Jul 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(harbor): admin /experiments endpoint for mid-run observability#14

feat(harbor): admin /experiments endpoint for mid-run observability#14
shehabyasser-scale wants to merge 1 commit into
harbor-2-sidecar-fixesfrom
harbor-2-sidecar-observability

shehabyasser-scale commented Jul 3, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

greptile-apps Bot Jul 3, 2026

Uh oh!

greptile-apps Bot Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

shehabyasser-scale commented Jul 3, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

shehabyasser-scale commented Jul 3, 2026 •

edited by greptile-apps Bot

Loading