Skip to content

Add slm.evaluate() — a standalone eval harness (SDK + CLI)#1

Open
shreyas-lyzr wants to merge 3 commits into
mainfrom
feat/evaluate-harness
Open

Add slm.evaluate() — a standalone eval harness (SDK + CLI)#1
shreyas-lyzr wants to merge 3 commits into
mainfrom
feat/evaluate-harness

Conversation

@shreyas-lyzr

Copy link
Copy Markdown

What & why

ShadowLM ships the full capture → judge → train → own loop, but there was no way to score a model on a task. The only "eval" today is train-time eval_loss (next-token loss on held-out data) — not task quality. The README roadmap calls task-level evals an unbuilt "eval gate," and the product thesis ("run the shadow until it does the job as well as the frontier, then switch") is unprovable without a quality score.

This adds the smallest meaningful slice of that: slm.evaluate(model, dataset, metric=...) and a shadowlm eval CLI command. Purely additive — no behavior changes.

Changes

  • shadowlm/eval.py (new) — evaluate(model, data, *, metric="contains", judge=None, system=None, sample=None, ...) returning an EvalResult (aggregate score, per-row scores/examples, plus .sparkline(), .worst(k), .to_dict()).
    • Metrics: contains (default — expected answer appears in output), exact (normalized equality), judge (LLM-as-judge), or a custom (output, expected, prompt) -> float callable.
    • Reuses existing scorers rather than reimplementing them: apo._contains_score, apo._judge_one, and apo._cols column detection. Handles chat / instruction / preference rows and dataset-path inputs.
  • shadowlm/__init__.py — export evaluate, EvalResult.
  • shadowlm/cli.pyshadowlm eval <model> <dataset> [--metric contains|exact|judge] [--judge <id>] [--sample N] ...], prints a headline score + a worst-examples table.
  • examples/evaluate.py (new) — end-to-end demo.

Usage

res = slm.evaluate(model, "qa.jsonl")            # contains-match (default)
res = slm.evaluate(model, ds, metric="exact")     # exact-match
res = slm.evaluate(model, ds, judge=judge)        # LLM-as-judge
res = slm.evaluate(model, ds, metric=my_score_fn) # custom scorer
print(res.score, res.sparkline())
shadowlm eval mlx-community/Qwen2.5-0.5B-Instruct-4bit data.jsonl --metric contains

Verification

  • python -m compileall shadowlm clean; import shadowlm as slm; slm.evaluate resolves.
  • No-GPU stub tests pass for all metrics (contains/exact/custom/judge), chat + preference formats, dataset-path input, the judge-implies-metric default, and the error cases (judge-without-model, missing-prompt-column).
  • CLI shadowlm eval --help registers and renders; eval appears in the top-level Models panel.
  • Not run in this environment (needs a backend + model weights): the live examples/evaluate.py mlx run and a real CLI run against a downloaded model. The logic they exercise is covered by the stub tests.

Out of scope

No /v1/evaluate server endpoint, no studio UI, no train-time auto-eval hook, no token-F1 metric — those can follow once this SDK surface lands.

ShadowLM had no way to score a model on a task — only train-time eval_loss
(next-token loss), not task quality. This adds the smallest meaningful slice
of an "eval gate": evaluate a loaded model over a dataset and get an aggregate
score plus a per-row breakdown.

- shadowlm/eval.py: evaluate(model, data, metric=...) + EvalResult. Metrics:
  contains (default), exact, judge (LLM-as-judge), or a custom
  (output, expected, prompt) -> float callable. Reuses APO's existing scorers
  (_contains_score, _judge_one) and column detection; handles chat /
  instruction / preference rows and dataset-path inputs.
- shadowlm/__init__.py: export evaluate, EvalResult.
- shadowlm/cli.py: `shadowlm eval <model> <dataset>` command.
- examples/evaluate.py: end-to-end demo.

@patel-lyzr patel-lyzr left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — slm.evaluate() eval harness

@shreyas-lyzr — reviewed the implementation end to end. The core is correct: scorer arg mapping (out, expected, prompt), [0,1] clamping, the judge-implies-metric flip, format dispatch via _row_io, and the APO scorer reuse all hold up. Is it needed? Yes — there's no task-quality score today (only train-time eval_loss), and the "switch when the shadow does the job" thesis needs one. It also reuses the APO scorers rather than reinventing them. Fixes below before merge.

Must fix

  1. The demo crashes. examples/evaluate.py loads examples/sample_dataset.jsonl, which doesn't exist in the repo — python examples/evaluate.py (the command in its own docstring) dies with FileNotFoundError. Point it at examples/shadowlm_qa.jsonl or examples/data/chat.jsonl.
  2. CLI eval --metric judge with no --judge throws a raw traceback. metric passes the (contains|exact|judge) check, judge_model is None, then _resolve_scorer raises a bare ValueErrorafter _resolve_target has already loaded/downloaded the model. So the user waits for a full model load, then gets a stack trace instead of the typer.BadParameter the rest of the command uses. Validate judge-presence up front, next to the metric check.

Structural — the judge metric duplicates-and-downgrades our RL judge

metric="judge" routes through apo._judge_one, which is the thin scorer: a bare "score 0–1" prompt and a single regex grab ([01](?:\.\d+)?). Our RL judge in rl.py is the one we actually trust — it has an explicit DEFAULT_RUBRIC and tolerant JSON parsing (_parse_scores).

To be precise about what transfers (I checked): judge_group's flat-score → best/worst ranking fallback is relative within a group and RL-only — it can't score eval rows (one trajectory per row hits len(set(scores))==1, and _rank_scores raises on best == worst). So don't route eval through judge_group. What is worth sharing is the rubric + tolerant single-number parse: factor that into one _judge_one-style helper that both apo and eval call, instead of eval inheriting the weaker regex scorer. On the 0.5B–8B judges we target, the parse robustness is the part that matters.

Worth fixing

  1. Multi-turn chat rows lose context. _row_io takes the first user turn as the prompt and the last assistant turn as the reference — so a multi-turn row is scored as "answer the opening question" against a final-turn answer. Fine for single-turn QA, wrong for real conversations.
  2. --sample 0 evaluates the whole datasetif sample: treats 0 as falsy. Use if sample is not None:.
  3. Tests. The description lists passing no-GPU stub tests for every metric and both error paths, but no tests/test_eval.py is in the diff. Please commit it — the two raise ValueError paths have no coverage otherwise.

One YAGNI nit: _exact_score re-implements the same " ".join(str(x).lower().split()) normalization as apo._contains_score — pull it into one helper. Otherwise the surface matches models.py/TrainingRun conventions. Fix 1–3 and this is good to merge.

@patel-lyzr patel-lyzr left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superseded — folded into the consolidated review above (the judge-reuse note is corrected there: judge_group is relative/ranking and RL-only, so the fix is to share the rubric + parse, not route eval through it).

…ests

- apo._judge_one: add a rubric + tolerant number parse (_parse_judge_score
  handles "0.7", "7/10", "8"→0.8); evaluate routes through this shared scorer,
  so APO and eval agree on a good answer. Not routing eval through judge_group
  (it's group-relative / RL-only and can't score a lone row).
- eval._row_io: feed the full conversation prefix for multi-turn chat rows and
  compare to the final assistant turn (was: first-user vs last-assistant).
- cli eval: reject --metric judge with no --judge up front (typer.BadParameter)
  instead of a raw ValueError after the model has loaded.
- eval: `if sample is not None` so --sample 0 doesn't mean "evaluate everything".
- Pull shared whitespace/lowercase normalization into apo._norm (used by both
  _contains_score and _exact_score).
- tests/test_eval.py: no-GPU stub coverage for every metric, both error paths,
  multi-turn context, preference rows, sample=0, and the judge parser.
@shreyas-lyzr

Copy link
Copy Markdown
Author

Thanks @patel-lyzr — addressed in 5ec2028. Point by point:

Must fix

  1. Demo crash — couldn't reproduce: examples/sample_dataset.jsonl is tracked (it's the file examples/quickstart.py already loads), and python examples/evaluate.py's dataset resolves. The two suggested replacements (examples/shadowlm_qa.jsonl, examples/data/chat.jsonl) aren't in the repo — pointing at them would introduce the FileNotFoundError. Left the demo on sample_dataset.jsonl. Shout if you're seeing it missing on a clean checkout.
  2. CLI judge w/o --judge — fixed. The command now raises typer.BadParameter up front (next to the metric check), before _resolve_target loads anything. Verified via CliRunner: exit 2, no model load.

Structural — judge scorer
3. Agreed on not routing eval through judge_group (group-relative ranking, raises on a lone row). Strengthened the shared single-answer scorer instead: apo._judge_one now carries an explicit rubric and a tolerant parse, _parse_judge_score, handling 0.7, 7/10, and integer 80.8 (was a bare [01](?:\.\d+)? grab that read 7/10 as 1.0). evaluate already routes through _judge_one, so APO and eval now share one scorer.

Worth fixing
4. Multi-turn context — fixed. _row_io now feeds the full conversation prefix and scores against the final assistant turn; single-turn QA is unchanged. Covered by test_chat_multiturn_keeps_context (asserts the model receives all 3 prior turns).
5. --sample 0 — fixed: if sample is not None. --sample 0 now yields no rows (a clear error) rather than the whole set. Covered by test_sample_zero_is_not_whole_dataset.
6. Tests — added tests/test_eval.py: no-GPU stub coverage for all metrics, both ValueError paths, multi-turn, preference rows, sample=0, and the judge parser. python tests/test_eval.py → 11 passed.

YAGNI — pulled the shared " ".join(str(x).lower().split()) normalization into apo._norm, used by both _contains_score and _exact_score.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants