feat(benchmark): add SkillSpector benchmark harness#5
Conversation
Standalone uv project that runs SkillSpector over a labeled corpus, persists scan results to DuckDB, and ships SQL queries for metrics (recall, false positives/negatives, threshold sweeps, timing).
| source_path=f"{f}#{i}", | ||
| display_name=f"{f.stem}:{pyfile}", | ||
| is_malicious=True, | ||
| attack_vector="CI", |
There was a problem hiding this comment.
how is attach-vector use? kinda surprised it just always pin to CI
| ) | ||
| # Default path: OpenAI-compatible Bedrock endpoint with a minted token. | ||
| model = os.environ.get("MODEL_OVERRIDE") or os.environ.get("AGENTGUARD_MODEL_ID", MODEL_ID) | ||
| region = os.environ.get("AWS_REGION") or BEDROCK_REGION |
There was a problem hiding this comment.
if we support region runtime overload, we should build the BEDROCK_BASE_URL based on the read region value
|
|
||
|
|
||
| def record_result(con, run_id: str, res: ScanResult, no_llm: bool) -> None: | ||
| u = res.unit |
There was a problem hiding this comment.
is this really inserting a row at a time into the database? probably the anti-pattern for duckdb since its storage format is row group. (it prefers batches of rows).
smoy
left a comment
There was a problem hiding this comment.
i am surprise F1 result is not surface in the console.
Since F1 captures both precision and recall into a single number, its easier to compare. The key trap is accuracy is misleading under class imbalance. If the corpus is 90% benign, a scanner that flags nothing scores 90% accuracy while catching zero malware (recall = 0). That's why this harness reports precision and recall separately, and why F1
(their harmonic mean, in the SQL layer) exists — to summarize the precision/recall tradeoff in one number without letting a lopsided class ratio paper over misses.
smoy
left a comment
There was a problem hiding this comment.
i didn't find anything particular show stopper. I do have some usability questions we should go over in real time.
Review complete — 8 finder angles produced 27 candidates, each independently verified; 9 were refuted, 10 survived (the "18" in the workflow's own summary line is a miscount — stats.reported is 10). Here they are, most severe first.
High — silent metric corruption
-
Bad-input units silently vanish from precision/recall · runner.py:108
The broad except Exception collapses SkillSpector's ValueError/FileNotFoundError (which the real CLI treats as exit-2 input errors) into a generic scan_status="error". The evaluation view then drops every ERROR row from TP/FP/TN/FN. So if a whole corpus fails
to resolve (e.g. a materialized SKILL.md SkillSpector can't parse), those units disappear from the headline numbers — you get an artificially clean precision/recall over a biased subset, with only an opaque count in 08_errors.sql. -
Resume blends incompatible runs under one run_id · db.py:144
open_db resumes by reusing the most-recent run_id without checking the stored config matches. Re-run the same -o file with the LLM on after a --no-llm run (or different --categories) and the new scans append to the old run — print_summary and the view mix
StaticAnalysis and LLM verdicts into one set of metrics, no warning. -
Probability-style labels corrupt ground truth · malskillbench.py:337 (plausible — depends on dataset)
_prompt_field_label does float(v) != 0, so a label of 0.3 or '0.2' (a probability, not a 0/1 class) flips a benign sample to is_malicious=True. That poisons the ground truth every metric is computed against. Plausible because it depends on whether the prompt
corpus actually carries non-binary label fields.
Medium — failure-mode handling
-
--timeout 0 silently disables the timeout · runner.py:86,89
timeout = cfg.get("timeout") or 0 turns a falsy --timeout 0 into "no timeout," so no SIGALRM is armed and a hung LLM call blocks its worker indefinitely. (Default is 300, so only bites if someone passes 0.) -
Three failure modes collapse to one DB status · db.py:101,114
classification_status maps error, timeout, and auth_failed all to 'Error'. 08_errors.sql then reports recoverable SSO auth-aborts and genuine crashes and timeouts as indistinguishable "errors," and 12_scan_timing can't tell a timeout from a crash. -
Auth-abort discards already-completed scans · main.py:199
On auth_failed the loop breaks after shutdown(cancel_futures=True), so futures that already finished successfully but weren't yet consumed are dropped instead of recorded. On resume they're re-scanned — wasted work, and non-deterministic LLM verdicts can
differ from the silently-discarded first result. -
Crashed/aborted units pollute timing stats · main.py:196 + runner.py:100
Error and auth_failed ScanResults carry the default scan_seconds=0.0, which lands in classifications.run_time. 12_scan_timing averages those 0.0 rows in, understating avg/p95 scan time and total CPU cost.
Low — design caveats (all plausible)
- Verdict ignores risk_score · db.py:196 — outcome is purely recommendation == "DO_NOT_INSTALL"; a high score with a different recommendation reads as benign, so 11_threshold_sweep.sql (score-based) can disagree with the recorded outcomes.
- Temp-dir leak on the error path · runner.py:45 — cleanup only runs on normal return; the CLI did it in a finally. When graph.invoke raises, the resolver's temp dir leaks — accumulates over a large run with a non-trivial error rate.
- No token refresh with a caller-supplied provider · runner.py:94 — configure_run sets mint_bedrock=False when SKILLSPECTOR_PROVIDER is pre-set, so a multi-hour run with an externally-supplied credential never refreshes; the back half silently turns to
errors once it expires.
A theme worth noting: findings 1, 2, 3, 5, 7 all share one root cause — failure and edge-case states silently degrade the headline metrics instead of being surfaced. That's the cluster I'd prioritize, since the benchmark's whole value is trustworthy numbers.
smoy
left a comment
There was a problem hiding this comment.
good enough stand alone for the benchmark

Summary
benchmark/, a standaloneuvproject that runs SkillSpector over a labeled corpus and scores its classificationsDetails
Standalone project with an editable dependency on
skillspector; root files stay at upstream parity. Worker timeouts usesetitimerso sub-second--timeoutvalues arm correctly, and--overwriteclears the sibling.walto avoid replaying stale state into a fresh DB.