feat(benchmark): add SkillSpector benchmark harness by will-exaforce · Pull Request #5 · ExaForce/SkillSpector

will-exaforce · 2026-06-19T21:22:06Z

Summary

Add benchmark/, a standalone uv project that runs SkillSpector over a labeled corpus and scores its classifications
Persist per-unit scan results to DuckDB and ship 14 SQL queries for metrics (recall, false positives/negatives, threshold sweeps, scan timing, per-category/vector breakdowns)
MalSkillBench dataset handler with parallel scan workers, per-worker auth handling, and per-unit timeouts

Details

Standalone project with an editable dependency on skillspector; root files stay at upstream parity. Worker timeouts use setitimer so sub-second --timeout values arm correctly, and --overwrite clears the sibling .wal to avoid replaying stale state into a fresh DB.

Standalone uv project that runs SkillSpector over a labeled corpus, persists scan results to DuckDB, and ships SQL queries for metrics (recall, false positives/negatives, threshold sweeps, timing).

smoy

Am i reading this correctly that it takes $70 to run the benchmark (i guess twice to before-modify and after-modify)?

It's not too surprising. i guess, roughly each run is 4000 (mal skills), 4000 (benignn skills). so, $35 dollars to process 8000 skills.

I point this out for context understanding.

smoy · 2026-06-22T19:16:58Z

+                    source_path=f"{f}#{i}",
+                    display_name=f"{f.stem}:{pyfile}",
+                    is_malicious=True,
+                    attack_vector="CI",


how is attach-vector use? kinda surprised it just always pin to CI

smoy · 2026-06-22T19:22:39Z

+        )
+    # Default path: OpenAI-compatible Bedrock endpoint with a minted token.
+    model = os.environ.get("MODEL_OVERRIDE") or os.environ.get("AGENTGUARD_MODEL_ID", MODEL_ID)
+    region = os.environ.get("AWS_REGION") or BEDROCK_REGION


if we support region runtime overload, we should build the BEDROCK_BASE_URL based on the read region value

smoy · 2026-06-22T19:24:31Z

+
+
+def record_result(con, run_id: str, res: ScanResult, no_llm: bool) -> None:
+    u = res.unit


is this really inserting a row at a time into the database? probably the anti-pattern for duckdb since its storage format is row group. (it prefers batches of rows).

smoy

i am surprise F1 result is not surface in the console.

Since F1 captures both precision and recall into a single number, its easier to compare. The key trap is accuracy is misleading under class imbalance. If the corpus is 90% benign, a scanner that flags nothing scores 90% accuracy while catching zero malware (recall = 0). That's why this harness reports precision and recall separately, and why F1
(their harmonic mean, in the SQL layer) exists — to summarize the precision/recall tradeoff in one number without letting a lopsided class ratio paper over misses.

smoy

i didn't find anything particular show stopper. I do have some usability questions we should go over in real time.

Review complete — 8 finder angles produced 27 candidates, each independently verified; 9 were refuted, 10 survived (the "18" in the workflow's own summary line is a miscount — stats.reported is 10). Here they are, most severe first.

High — silent metric corruption

Bad-input units silently vanish from precision/recall · runner.py:108
The broad except Exception collapses SkillSpector's ValueError/FileNotFoundError (which the real CLI treats as exit-2 input errors) into a generic scan_status="error". The evaluation view then drops every ERROR row from TP/FP/TN/FN. So if a whole corpus fails
to resolve (e.g. a materialized SKILL.md SkillSpector can't parse), those units disappear from the headline numbers — you get an artificially clean precision/recall over a biased subset, with only an opaque count in 08_errors.sql.
Resume blends incompatible runs under one run_id · db.py:144
open_db resumes by reusing the most-recent run_id without checking the stored config matches. Re-run the same -o file with the LLM on after a --no-llm run (or different --categories) and the new scans append to the old run — print_summary and the view mix
StaticAnalysis and LLM verdicts into one set of metrics, no warning.
Probability-style labels corrupt ground truth · malskillbench.py:337 (plausible — depends on dataset)
_prompt_field_label does float(v) != 0, so a label of 0.3 or '0.2' (a probability, not a 0/1 class) flips a benign sample to is_malicious=True. That poisons the ground truth every metric is computed against. Plausible because it depends on whether the prompt
corpus actually carries non-binary label fields.

Medium — failure-mode handling

--timeout 0 silently disables the timeout · runner.py:86,89
timeout = cfg.get("timeout") or 0 turns a falsy --timeout 0 into "no timeout," so no SIGALRM is armed and a hung LLM call blocks its worker indefinitely. (Default is 300, so only bites if someone passes 0.)
Three failure modes collapse to one DB status · db.py:101,114
classification_status maps error, timeout, and auth_failed all to 'Error'. 08_errors.sql then reports recoverable SSO auth-aborts and genuine crashes and timeouts as indistinguishable "errors," and 12_scan_timing can't tell a timeout from a crash.
Auth-abort discards already-completed scans · main.py:199
On auth_failed the loop breaks after shutdown(cancel_futures=True), so futures that already finished successfully but weren't yet consumed are dropped instead of recorded. On resume they're re-scanned — wasted work, and non-deterministic LLM verdicts can
differ from the silently-discarded first result.
Crashed/aborted units pollute timing stats · main.py:196 + runner.py:100
Error and auth_failed ScanResults carry the default scan_seconds=0.0, which lands in classifications.run_time. 12_scan_timing averages those 0.0 rows in, understating avg/p95 scan time and total CPU cost.

Low — design caveats (all plausible)

Verdict ignores risk_score · db.py:196 — outcome is purely recommendation == "DO_NOT_INSTALL"; a high score with a different recommendation reads as benign, so 11_threshold_sweep.sql (score-based) can disagree with the recorded outcomes.
Temp-dir leak on the error path · runner.py:45 — cleanup only runs on normal return; the CLI did it in a finally. When graph.invoke raises, the resolver's temp dir leaks — accumulates over a large run with a non-trivial error rate.
No token refresh with a caller-supplied provider · runner.py:94 — configure_run sets mint_bedrock=False when SKILLSPECTOR_PROVIDER is pre-set, so a multi-hour run with an externally-supplied credential never refreshes; the back half silently turns to
errors once it expires.

A theme worth noting: findings 1, 2, 3, 5, 7 all share one root cause — failure and edge-case states silently degrade the headline metrics instead of being surfaced. That's the cluster I'd prioritize, since the benchmark's whole value is trustworthy numbers.

smoy

good enough stand alone for the benchmark

feat(benchmark): add SkillSpector benchmark harness

6dd230f

Standalone uv project that runs SkillSpector over a labeled corpus, persists scan results to DuckDB, and ships SQL queries for metrics (recall, false positives/negatives, threshold sweeps, timing).

will-exaforce requested review from pupapaik and smoy June 19, 2026 21:22

smoy reviewed Jun 22, 2026

View reviewed changes

smoy reviewed Jun 23, 2026

View reviewed changes

smoy approved these changes Jun 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(benchmark): add SkillSpector benchmark harness#5

feat(benchmark): add SkillSpector benchmark harness#5
will-exaforce wants to merge 1 commit into
mainfrom
wbeasley/add-benchmark

will-exaforce commented Jun 19, 2026

Uh oh!

smoy left a comment

Uh oh!

smoy Jun 22, 2026

Uh oh!

smoy Jun 22, 2026

Uh oh!

smoy Jun 22, 2026

Uh oh!

smoy left a comment

Uh oh!

smoy left a comment

Uh oh!

smoy left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants



		def record_result(con, run_id: str, res: ScanResult, no_llm: bool) -> None:
		u = res.unit

Uh oh!

Conversation

will-exaforce commented Jun 19, 2026

Summary

Details

Uh oh!

smoy left a comment

Choose a reason for hiding this comment

Uh oh!

smoy Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

smoy Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

smoy Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

smoy left a comment

Choose a reason for hiding this comment

Uh oh!

smoy left a comment

Choose a reason for hiding this comment

Uh oh!

smoy left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants