Add slm.evaluate() — a standalone eval harness (SDK + CLI)#1
shreyas-lyzr wants to merge 3 commits into
ShadowLM had no way to score a model on a task: it had only train-time eval_loss (next-token loss), not task quality. This adds the smallest meaningful slice of an "eval gate": evaluate a loaded model over a dataset and get an aggregate score plus a per-row breakdown.
- shadowlm/eval.py: evaluate(model, data, metric=...) + EvalResult. Metrics: contains (default), exact, judge (LLM-as-judge), or a custom (output, expected, prompt) -> float callable. Reuses APO's existing scorers (_contains_score, _judge_one) and column detection; handles chat / instruction / preference rows and dataset-path inputs.
- shadowlm/__init__.py: export evaluate, EvalResult.
- shadowlm/cli.py: `shadowlm eval <model> <dataset>` command.
- examples/evaluate.py: end-to-end demo.
Review — slm.evaluate() eval harness
@shreyas-lyzr — reviewed the implementation end to end. The core is correct: scorer arg mapping (out, expected, prompt), [0,1] clamping, the judge-implies-metric flip, format dispatch via _row_io, and the APO scorer reuse all hold up. Is it needed? Yes — there's no task-quality score today (only train-time eval_loss), and the "switch when the shadow does the job" thesis needs one. It also reuses the APO scorers rather than reinventing them. Fixes below before merge.
Must fix
- The demo crashes. `examples/evaluate.py` loads `examples/sample_dataset.jsonl`, which doesn't exist in the repo, so `python examples/evaluate.py` (the command in its own docstring) dies with `FileNotFoundError`. Point it at `examples/shadowlm_qa.jsonl` or `examples/data/chat.jsonl`.
- CLI `eval --metric judge` with no `--judge` throws a raw traceback. `metric` passes the `(contains|exact|judge)` check, `judge_model` is `None`, then `_resolve_scorer` raises a bare `ValueError`, and that happens after `_resolve_target` has already loaded/downloaded the model. So the user waits for a full model load, then gets a stack trace instead of the `typer.BadParameter` the rest of the command uses. Validate judge-presence up front, next to the metric check.
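The ordering asked for in the second item can be sketched as follows. This is a minimal illustration, not the actual `shadowlm/cli.py`: `eval_command`, `load_model`, and the local `BadParameter` class are hypothetical stand-ins (the real command would raise `typer.BadParameter`), chosen so the sketch runs without typer installed.

```python
class BadParameter(ValueError):
    """Local stand-in for typer.BadParameter (assumption: keeps the sketch dependency-free)."""


VALID_METRICS = ("contains", "exact", "judge")
load_calls = []  # records every model load so the ordering is observable


def load_model(model_id):
    # Hypothetical placeholder for the slow load/download step.
    load_calls.append(model_id)
    return object()


def eval_command(model_id, dataset, metric="contains", judge=None):
    # 1) Cheap argument checks first, so bad flags fail immediately.
    if metric not in VALID_METRICS:
        raise BadParameter(f"--metric must be one of: {', '.join(VALID_METRICS)}")
    if metric == "judge" and judge is None:
        raise BadParameter("--metric judge requires --judge <model-id>")
    # 2) Only now pay for the model load.
    model = load_model(model_id)
    return model, dataset, metric, judge
```

With this ordering, `eval_command("some/model", "data.jsonl", metric="judge")` fails before `load_model` is ever called.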
Structural — the judge metric duplicates-and-downgrades our RL judge
metric="judge" routes through apo._judge_one, which is the thin scorer: a bare "score 0–1" prompt and a single regex grab ([01](?:\.\d+)?). Our RL judge in rl.py is the one we actually trust — it has an explicit DEFAULT_RUBRIC and tolerant JSON parsing (_parse_scores).
To be precise about what transfers (I checked): judge_group's flat-score → best/worst ranking fallback is relative within a group and RL-only — it can't score eval rows (one trajectory per row hits len(set(scores))==1, and _rank_scores raises on best == worst). So don't route eval through judge_group. What is worth sharing is the rubric + tolerant single-number parse: factor that into one _judge_one-style helper that both apo and eval call, instead of eval inheriting the weaker regex scorer. On the 0.5B–8B judges we target, the parse robustness is the part that matters.
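To make the weakness of the thin scorer concrete, here is a small sketch that applies the regex quoted above to a few plausible judge replies. The `thin_parse` wrapper is hypothetical: how `apo._judge_one` actually applies the pattern (search vs. match, and what it does when nothing matches) is an assumption here.

```python
import re

# The pattern quoted in the review: a 0 or 1, optionally followed by decimals.
THIN_PATTERN = re.compile(r"([01](?:\.\d+)?)")


def thin_parse(reply):
    """Return the first regex hit as a float, or None when nothing matches.

    Assumption: the pattern is applied with re.search; the real scorer may differ.
    """
    match = THIN_PATTERN.search(reply)
    return float(match.group(1)) if match else None


for reply in ["0.7", "Score: 0.85", "7/10", "8", "2 out of 10"]:
    print(repr(reply), "->", thin_parse(reply))
```

Under that assumption, "7/10" and "2 out of 10" both come back as 1.0 (the pattern latches onto the "1" in "10"), and a bare "8" yields no score at all, which is the kind of parse fragility a small judge model is likely to trigger.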
Worth fixing
- Multi-turn chat rows lose context. `_row_io` takes the first user turn as the prompt and the last assistant turn as the reference, so a multi-turn row is scored as "answer the opening question" against a final-turn answer. Fine for single-turn QA, wrong for real conversations.
- `--sample 0` evaluates the whole dataset: `if sample:` treats `0` as falsy. Use `if sample is not None:`.
- Tests. The description lists passing no-GPU stub tests for every metric and both error paths, but no `tests/test_eval.py` is in the diff. Please commit it; the two `raise ValueError` paths have no coverage otherwise.
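The first two items above can be sketched in a few lines. This is an illustration under assumptions, not the `_row_io` implementation: the row shape (a list of `{"role", "content"}` dicts), the helper names, and the choice that `sample=0` yields zero rows are all hypothetical.

```python
import random


def split_chat_row(messages):
    """Return (prefix, reference): every turn before the final assistant turn,
    and that final assistant turn's text."""
    for i in range(len(messages) - 1, -1, -1):
        if messages[i]["role"] == "assistant":
            return messages[:i], messages[i]["content"]
    raise ValueError("row has no assistant turn to use as a reference")


def take_sample(rows, sample=None, seed=0):
    """None means 'use everything'; an integer N means 'use at most N rows'."""
    if sample is not None:  # `if sample:` would wrongly treat 0 like None
        k = min(sample, len(rows))
        return random.Random(seed).sample(rows, k)
    return list(rows)
```

Feeding the whole prefix to the model means the final answer is judged in the context it was actually written for, and the explicit `is not None` check keeps `0` distinct from "no sampling requested".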
One YAGNI nit: _exact_score re-implements the same " ".join(str(x).lower().split()) normalization as apo._contains_score — pull it into one helper. Otherwise the surface matches models.py/TrainingRun conventions. Fix 1–3 and this is good to merge.
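A shared helper along the lines suggested could look like the sketch below. The normalization expression is the one quoted above; the function names and the 0.0/1.0 return convention are illustrative assumptions rather than the actual `apo` code.

```python
def norm(x):
    """Lowercase and collapse all whitespace runs to single spaces."""
    return " ".join(str(x).lower().split())


def contains_score(output, expected):
    # 1.0 when the normalized expected answer appears anywhere in the output.
    return 1.0 if norm(expected) in norm(output) else 0.0


def exact_score(output, expected):
    # 1.0 only when the normalized strings are identical.
    return 1.0 if norm(output) == norm(expected) else 0.0
```

For example, `contains_score("The  capital is PARIS.", "paris")` is 1.0, while `exact_score("Paris.", "paris")` is 0.0 because punctuation is not stripped by this normalization.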
…ests
- apo._judge_one: add a rubric + tolerant number parse (_parse_judge_score handles "0.7", "7/10", "8"→0.8); evaluate routes through this shared scorer, so APO and eval agree on a good answer. Not routing eval through judge_group (it's group-relative / RL-only and can't score a lone row).
- eval._row_io: feed the full conversation prefix for multi-turn chat rows and compare to the final assistant turn (was: first-user vs last-assistant).
- cli eval: reject --metric judge with no --judge up front (typer.BadParameter) instead of a raw ValueError after the model has loaded.
- eval: `if sample is not None` so --sample 0 doesn't mean "evaluate everything".
- Pull shared whitespace/lowercase normalization into apo._norm (used by both _contains_score and _exact_score).
- tests/test_eval.py: no-GPU stub coverage for every metric, both error paths, multi-turn context, preference rows, sample=0, and the judge parser.
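A tolerant parse covering the three cases named above ("0.7", "7/10", "8"→0.8) might look like the following. It is a sketch only: the function name, the regexes, and the handling of values above 10 or of replies with no number are assumptions, not the committed `_parse_judge_score`.

```python
import re

_FRACTION = re.compile(r"(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)")
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_judge_score(reply):
    """Map a free-form judge reply to a score in [0, 1], or None if unparseable."""
    # "7/10" style: take the ratio and clamp it.
    frac = _FRACTION.search(reply)
    if frac:
        num, den = float(frac.group(1)), float(frac.group(2))
        if den == 0:
            return None
        return max(0.0, min(1.0, num / den))
    # Bare number: already on a 0-1 scale, or assumed to be on a 0-10 scale.
    match = _NUMBER.search(reply)
    if not match:
        return None
    value = float(match.group(0))
    if value <= 1.0:
        return value
    if value <= 10.0:
        return value / 10.0
    return None  # assumption: anything above 10 is treated as unparseable
```

Checking for a fraction before a bare number is the design choice that matters: it stops "7/10" from being read as a lone "7" or a lone "1".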
Thanks @patel-lyzr, addressed in 5ec2028. Point by point:
Must fix
Structural: judge scorer
Worth fixing
YAGNI: pulled the shared …
What & why
ShadowLM ships the full capture → judge → train → own loop, but there was no way to score a model on a task. The only "eval" today is train-time `eval_loss` (next-token loss on held-out data), not task quality. The README roadmap calls task-level evals an unbuilt "eval gate," and the product thesis ("run the shadow until it does the job as well as the frontier, then switch") is unprovable without a quality score.
This adds the smallest meaningful slice of that: `slm.evaluate(model, dataset, metric=...)` and a `shadowlm eval` CLI command. Purely additive, no behavior changes.
Changes
- `shadowlm/eval.py` (new): `evaluate(model, data, *, metric="contains", judge=None, system=None, sample=None, ...)` returning an `EvalResult` (aggregate `score`, per-row `scores` / `examples`, plus `.sparkline()`, `.worst(k)`, `.to_dict()`).
  - Metrics: `contains` (default; expected answer appears in output), `exact` (normalized equality), `judge` (LLM-as-judge), or a custom `(output, expected, prompt) -> float` callable.
  - Reuses `apo._contains_score`, `apo._judge_one`, and `apo._cols` column detection. Handles chat / instruction / preference rows and dataset-path inputs.
- `shadowlm/__init__.py`: export `evaluate`, `EvalResult`.
- `shadowlm/cli.py`: `shadowlm eval <model> <dataset> [--metric contains|exact|judge] [--judge <id>] [--sample N] ...`, prints a headline score + a worst-examples table.
- `examples/evaluate.py` (new): end-to-end demo.
Usage
`shadowlm eval mlx-community/Qwen2.5-0.5B-Instruct-4bit data.jsonl --metric contains`
Verification
- `python -m compileall shadowlm` clean; `import shadowlm as slm; slm.evaluate` resolves.
- … judge-implies-metric default, and the error cases (judge-without-model, missing-prompt-column).
- `shadowlm eval --help` registers and renders; `eval` appears in the top-level Models panel.
- … `examples/evaluate.py` mlx run and a real CLI run against a downloaded model. The logic they exercise is covered by the stub tests.
Out of scope
No `/v1/evaluate` server endpoint, no studio UI, no train-time auto-eval hook, no token-F1 metric. Those can follow once this SDK surface lands.
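For readers who want a feel for the shape of the surface described under Changes, here is a self-contained toy version. It is not the ShadowLM implementation: the model is replaced by a plain `generate(prompt) -> str` callable, rows are assumed to be `(prompt, expected)` pairs, the aggregate is assumed to be the mean, and `.sparkline()` is omitted.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class EvalResult:
    score: float                                   # aggregate (assumed: mean of per-row scores)
    scores: list = field(default_factory=list)     # one float per row
    examples: list = field(default_factory=list)   # one dict per row

    def worst(self, k=5):
        """Return the k lowest-scoring examples."""
        order = sorted(range(len(self.scores)), key=lambda i: self.scores[i])
        return [self.examples[i] for i in order[:k]]

    def to_dict(self):
        return asdict(self)


def contains(output, expected, prompt):
    """Custom-metric signature from the description: (output, expected, prompt) -> float."""
    squash = lambda s: " ".join(str(s).lower().split())
    return 1.0 if squash(expected) in squash(output) else 0.0


def evaluate(generate, rows, *, metric=contains):
    scores, examples = [], []
    for prompt, expected in rows:
        output = generate(prompt)
        # Clamp to [0, 1] so a misbehaving custom metric cannot skew the aggregate.
        s = max(0.0, min(1.0, float(metric(output, expected, prompt))))
        scores.append(s)
        examples.append({"prompt": prompt, "expected": expected, "output": output, "score": s})
    mean = sum(scores) / len(scores) if scores else 0.0
    return EvalResult(score=mean, scores=scores, examples=examples)


# Stub "model": a lookup table standing in for real generation.
answers = {"Capital of France?": "It is Paris.", "2 + 2?": "five"}
result = evaluate(answers.get, [("Capital of France?", "Paris"), ("2 + 2?", "4")])
print(result.score, result.worst(1)[0]["prompt"])
```

In this toy run one of two rows matches, so the aggregate is 0.5 and `worst(1)` surfaces the "2 + 2?" row.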