Add slm.evaluate() — a standalone eval harness (SDK + CLI) by shreyas-lyzr · Pull Request #1 · open-gitagent/shadowLM

shreyas-lyzr · 2026-06-18T17:36:43Z

What & why

ShadowLM ships the full capture → judge → train → own loop, but there was no way to score a model on a task. The only "eval" today is train-time eval_loss (next-token loss on held-out data) — not task quality. The README roadmap calls task-level evals an unbuilt "eval gate," and the product thesis ("run the shadow until it does the job as well as the frontier, then switch") is unprovable without a quality score.

This adds the smallest meaningful slice of that: slm.evaluate(model, dataset, metric=...) and a shadowlm eval CLI command. Purely additive — no behavior changes.

Changes

shadowlm/eval.py (new) — evaluate(model, data, *, metric="contains", judge=None, system=None, sample=None, ...) returning an EvalResult (aggregate score, per-row scores/examples, plus .sparkline(), .worst(k), .to_dict()).
- Metrics: contains (default — expected answer appears in output), exact (normalized equality), judge (LLM-as-judge), or a custom (output, expected, prompt) -> float callable.
- Reuses existing scorers rather than reimplementing them: apo._contains_score, apo._judge_one, and apo._cols column detection. Handles chat / instruction / preference rows and dataset-path inputs.
shadowlm/__init__.py — export evaluate, EvalResult.
shadowlm/cli.py — shadowlm eval <model> <dataset> [--metric contains|exact|judge] [--judge <id>] [--sample N] ...], prints a headline score + a worst-examples table.
examples/evaluate.py (new) — end-to-end demo.

Usage

res = slm.evaluate(model, "qa.jsonl")            # contains-match (default)
res = slm.evaluate(model, ds, metric="exact")     # exact-match
res = slm.evaluate(model, ds, judge=judge)        # LLM-as-judge
res = slm.evaluate(model, ds, metric=my_score_fn) # custom scorer
print(res.score, res.sparkline())

shadowlm eval mlx-community/Qwen2.5-0.5B-Instruct-4bit data.jsonl --metric contains

Verification

python -m compileall shadowlm clean; import shadowlm as slm; slm.evaluate resolves.
No-GPU stub tests pass for all metrics (contains/exact/custom/judge), chat + preference formats, dataset-path input, the judge-implies-metric default, and the error cases (judge-without-model, missing-prompt-column).
CLI shadowlm eval --help registers and renders; eval appears in the top-level Models panel.
Not run in this environment (needs a backend + model weights): the live examples/evaluate.py mlx run and a real CLI run against a downloaded model. The logic they exercise is covered by the stub tests.

Out of scope

No /v1/evaluate server endpoint, no studio UI, no train-time auto-eval hook, no token-F1 metric — those can follow once this SDK surface lands.

ShadowLM had no way to score a model on a task — only train-time eval_loss (next-token loss), not task quality. This adds the smallest meaningful slice of an "eval gate": evaluate a loaded model over a dataset and get an aggregate score plus a per-row breakdown. - shadowlm/eval.py: evaluate(model, data, metric=...) + EvalResult. Metrics: contains (default), exact, judge (LLM-as-judge), or a custom (output, expected, prompt) -> float callable. Reuses APO's existing scorers (_contains_score, _judge_one) and column detection; handles chat / instruction / preference rows and dataset-path inputs. - shadowlm/__init__.py: export evaluate, EvalResult. - shadowlm/cli.py: `shadowlm eval <model> <dataset>` command. - examples/evaluate.py: end-to-end demo.

patel-lyzr

Review — `slm.evaluate()` eval harness

@shreyas-lyzr — reviewed the implementation end to end. The core is correct: scorer arg mapping (out, expected, prompt), [0,1] clamping, the judge-implies-metric flip, format dispatch via _row_io, and the APO scorer reuse all hold up. Is it needed? Yes — there's no task-quality score today (only train-time eval_loss), and the "switch when the shadow does the job" thesis needs one. It also reuses the APO scorers rather than reinventing them. Fixes below before merge.

Must fix

The demo crashes. examples/evaluate.py loads examples/sample_dataset.jsonl, which doesn't exist in the repo — python examples/evaluate.py (the command in its own docstring) dies with FileNotFoundError. Point it at examples/shadowlm_qa.jsonl or examples/data/chat.jsonl.
CLI eval --metric judge with no --judge throws a raw traceback. metric passes the (contains|exact|judge) check, judge_model is None, then _resolve_scorer raises a bare ValueError — after _resolve_target has already loaded/downloaded the model. So the user waits for a full model load, then gets a stack trace instead of the typer.BadParameter the rest of the command uses. Validate judge-presence up front, next to the metric check.

Structural — the judge metric duplicates-and-downgrades our RL judge

metric="judge" routes through apo._judge_one, which is the thin scorer: a bare "score 0–1" prompt and a single regex grab ([01](?:\.\d+)?). Our RL judge in rl.py is the one we actually trust — it has an explicit DEFAULT_RUBRIC and tolerant JSON parsing (_parse_scores).

To be precise about what transfers (I checked): judge_group's flat-score → best/worst ranking fallback is relative within a group and RL-only — it can't score eval rows (one trajectory per row hits len(set(scores))==1, and _rank_scores raises on best == worst). So don't route eval through judge_group. What is worth sharing is the rubric + tolerant single-number parse: factor that into one _judge_one-style helper that both apo and eval call, instead of eval inheriting the weaker regex scorer. On the 0.5B–8B judges we target, the parse robustness is the part that matters.

Worth fixing

Multi-turn chat rows lose context. _row_io takes the first user turn as the prompt and the last assistant turn as the reference — so a multi-turn row is scored as "answer the opening question" against a final-turn answer. Fine for single-turn QA, wrong for real conversations.
--sample 0 evaluates the whole dataset — if sample: treats 0 as falsy. Use if sample is not None:.
Tests. The description lists passing no-GPU stub tests for every metric and both error paths, but no tests/test_eval.py is in the diff. Please commit it — the two raise ValueError paths have no coverage otherwise.

One YAGNI nit: _exact_score re-implements the same " ".join(str(x).lower().split()) normalization as apo._contains_score — pull it into one helper. Otherwise the surface matches models.py/TrainingRun conventions. Fix 1–3 and this is good to merge.

patel-lyzr

Superseded — folded into the consolidated review above (the judge-reuse note is corrected there: judge_group is relative/ranking and RL-only, so the fix is to share the rubric + parse, not route eval through it).

…ests - apo._judge_one: add a rubric + tolerant number parse (_parse_judge_score handles "0.7", "7/10", "8"→0.8); evaluate routes through this shared scorer, so APO and eval agree on a good answer. Not routing eval through judge_group (it's group-relative / RL-only and can't score a lone row). - eval._row_io: feed the full conversation prefix for multi-turn chat rows and compare to the final assistant turn (was: first-user vs last-assistant). - cli eval: reject --metric judge with no --judge up front (typer.BadParameter) instead of a raw ValueError after the model has loaded. - eval: `if sample is not None` so --sample 0 doesn't mean "evaluate everything". - Pull shared whitespace/lowercase normalization into apo._norm (used by both _contains_score and _exact_score). - tests/test_eval.py: no-GPU stub coverage for every metric, both error paths, multi-turn context, preference rows, sample=0, and the judge parser.

shreyas-lyzr · 2026-06-20T13:20:41Z

Thanks @patel-lyzr — addressed in 5ec2028. Point by point:

Must fix

Demo crash — couldn't reproduce: examples/sample_dataset.jsonl is tracked (it's the file examples/quickstart.py already loads), and python examples/evaluate.py's dataset resolves. The two suggested replacements (examples/shadowlm_qa.jsonl, examples/data/chat.jsonl) aren't in the repo — pointing at them would introduce the FileNotFoundError. Left the demo on sample_dataset.jsonl. Shout if you're seeing it missing on a clean checkout.
CLI judge w/o --judge — fixed. The command now raises typer.BadParameter up front (next to the metric check), before _resolve_target loads anything. Verified via CliRunner: exit 2, no model load.

Structural — judge scorer
3. Agreed on not routing eval through judge_group (group-relative ranking, raises on a lone row). Strengthened the shared single-answer scorer instead: apo._judge_one now carries an explicit rubric and a tolerant parse, _parse_judge_score, handling 0.7, 7/10, and integer 8→0.8 (was a bare [01](?:\.\d+)? grab that read 7/10 as 1.0). evaluate already routes through _judge_one, so APO and eval now share one scorer.

Worth fixing
4. Multi-turn context — fixed. _row_io now feeds the full conversation prefix and scores against the final assistant turn; single-turn QA is unchanged. Covered by test_chat_multiturn_keeps_context (asserts the model receives all 3 prior turns).
5. --sample 0 — fixed: if sample is not None. --sample 0 now yields no rows (a clear error) rather than the whole set. Covered by test_sample_zero_is_not_whole_dataset.
6. Tests — added tests/test_eval.py: no-GPU stub coverage for all metrics, both ValueError paths, multi-turn, preference rows, sample=0, and the judge parser. python tests/test_eval.py → 11 passed.

YAGNI — pulled the shared " ".join(str(x).lower().split()) normalization into apo._norm, used by both _contains_score and _exact_score.

shreyas-lyzr added 2 commits June 18, 2026 13:36

README: add Shreyas Kapale as second maintainer

1b5544c

patel-lyzr requested changes Jun 19, 2026

View reviewed changes

patel-lyzr reviewed Jun 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add slm.evaluate() — a standalone eval harness (SDK + CLI)#1

Add slm.evaluate() — a standalone eval harness (SDK + CLI)#1
shreyas-lyzr wants to merge 3 commits into
mainfrom
feat/evaluate-harness

shreyas-lyzr commented Jun 18, 2026

Uh oh!

patel-lyzr left a comment •

edited

Loading

Uh oh!

patel-lyzr left a comment •

edited

Loading

Uh oh!

shreyas-lyzr commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

shreyas-lyzr commented Jun 18, 2026

What & why

Changes

Usage

Verification

Out of scope

Uh oh!

patel-lyzr left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Review — slm.evaluate() eval harness

Must fix

Structural — the judge metric duplicates-and-downgrades our RL judge

Worth fixing

Uh oh!

patel-lyzr left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

shreyas-lyzr commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

patel-lyzr left a comment •

edited

Loading

Review — `slm.evaluate()` eval harness

patel-lyzr left a comment •

edited

Loading