Skip to content

feat(benchmark): add SkillSpector benchmark harness#5

Open
will-exaforce wants to merge 1 commit into
mainfrom
wbeasley/add-benchmark
Open

feat(benchmark): add SkillSpector benchmark harness#5
will-exaforce wants to merge 1 commit into
mainfrom
wbeasley/add-benchmark

Conversation

@will-exaforce

Copy link
Copy Markdown

Summary

  • Add benchmark/, a standalone uv project that runs SkillSpector over a labeled corpus and scores its classifications
  • Persist per-unit scan results to DuckDB and ship 14 SQL queries for metrics (recall, false positives/negatives, threshold sweeps, scan timing, per-category/vector breakdowns)
  • MalSkillBench dataset handler with parallel scan workers, per-worker auth handling, and per-unit timeouts

Details

Standalone project with an editable dependency on skillspector; root files stay at upstream parity. Worker timeouts use setitimer so sub-second --timeout values arm correctly, and --overwrite clears the sibling .wal to avoid replaying stale state into a fresh DB.

Standalone uv project that runs SkillSpector over a labeled corpus,
persists scan results to DuckDB, and ships SQL queries for metrics
(recall, false positives/negatives, threshold sweeps, timing).
@will-exaforce will-exaforce requested review from pupapaik and smoy June 19, 2026 21:22

@smoy smoy left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am i reading this correctly that it takes $70 to run the benchmark (i guess twice to before-modify and after-modify)?

Image

It's not too surprising. i guess, roughly each run is 4000 (mal skills), 4000 (benignn skills). so, $35 dollars to process 8000 skills.

I point this out for context understanding.

source_path=f"{f}#{i}",
display_name=f"{f.stem}:{pyfile}",
is_malicious=True,
attack_vector="CI",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is attach-vector use? kinda surprised it just always pin to CI

)
# Default path: OpenAI-compatible Bedrock endpoint with a minted token.
model = os.environ.get("MODEL_OVERRIDE") or os.environ.get("AGENTGUARD_MODEL_ID", MODEL_ID)
region = os.environ.get("AWS_REGION") or BEDROCK_REGION

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we support region runtime overload, we should build the BEDROCK_BASE_URL based on the read region value

Comment thread benchmark/benchmark/db.py


def record_result(con, run_id: str, res: ScanResult, no_llm: bool) -> None:
u = res.unit

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this really inserting a row at a time into the database? probably the anti-pattern for duckdb since its storage format is row group. (it prefers batches of rows).

@smoy smoy left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am surprise F1 result is not surface in the console.

Since F1 captures both precision and recall into a single number, its easier to compare. The key trap is accuracy is misleading under class imbalance. If the corpus is 90% benign, a scanner that flags nothing scores 90% accuracy while catching zero malware (recall = 0). That's why this harness reports precision and recall separately, and why F1
(their harmonic mean, in the SQL layer) exists — to summarize the precision/recall tradeoff in one number without letting a lopsided class ratio paper over misses.

@smoy smoy left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i didn't find anything particular show stopper. I do have some usability questions we should go over in real time.

Review complete — 8 finder angles produced 27 candidates, each independently verified; 9 were refuted, 10 survived (the "18" in the workflow's own summary line is a miscount — stats.reported is 10). Here they are, most severe first.

High — silent metric corruption

  1. Bad-input units silently vanish from precision/recall · runner.py:108
    The broad except Exception collapses SkillSpector's ValueError/FileNotFoundError (which the real CLI treats as exit-2 input errors) into a generic scan_status="error". The evaluation view then drops every ERROR row from TP/FP/TN/FN. So if a whole corpus fails
    to resolve (e.g. a materialized SKILL.md SkillSpector can't parse), those units disappear from the headline numbers — you get an artificially clean precision/recall over a biased subset, with only an opaque count in 08_errors.sql.

  2. Resume blends incompatible runs under one run_id · db.py:144
    open_db resumes by reusing the most-recent run_id without checking the stored config matches. Re-run the same -o file with the LLM on after a --no-llm run (or different --categories) and the new scans append to the old run — print_summary and the view mix
    StaticAnalysis and LLM verdicts into one set of metrics, no warning.

  3. Probability-style labels corrupt ground truth · malskillbench.py:337 (plausible — depends on dataset)
    _prompt_field_label does float(v) != 0, so a label of 0.3 or '0.2' (a probability, not a 0/1 class) flips a benign sample to is_malicious=True. That poisons the ground truth every metric is computed against. Plausible because it depends on whether the prompt
    corpus actually carries non-binary label fields.

Medium — failure-mode handling

  1. --timeout 0 silently disables the timeout · runner.py:86,89
    timeout = cfg.get("timeout") or 0 turns a falsy --timeout 0 into "no timeout," so no SIGALRM is armed and a hung LLM call blocks its worker indefinitely. (Default is 300, so only bites if someone passes 0.)

  2. Three failure modes collapse to one DB status · db.py:101,114
    classification_status maps error, timeout, and auth_failed all to 'Error'. 08_errors.sql then reports recoverable SSO auth-aborts and genuine crashes and timeouts as indistinguishable "errors," and 12_scan_timing can't tell a timeout from a crash.

  3. Auth-abort discards already-completed scans · main.py:199
    On auth_failed the loop breaks after shutdown(cancel_futures=True), so futures that already finished successfully but weren't yet consumed are dropped instead of recorded. On resume they're re-scanned — wasted work, and non-deterministic LLM verdicts can
    differ from the silently-discarded first result.

  4. Crashed/aborted units pollute timing stats · main.py:196 + runner.py:100
    Error and auth_failed ScanResults carry the default scan_seconds=0.0, which lands in classifications.run_time. 12_scan_timing averages those 0.0 rows in, understating avg/p95 scan time and total CPU cost.

Low — design caveats (all plausible)

  1. Verdict ignores risk_score · db.py:196 — outcome is purely recommendation == "DO_NOT_INSTALL"; a high score with a different recommendation reads as benign, so 11_threshold_sweep.sql (score-based) can disagree with the recorded outcomes.
  2. Temp-dir leak on the error path · runner.py:45 — cleanup only runs on normal return; the CLI did it in a finally. When graph.invoke raises, the resolver's temp dir leaks — accumulates over a large run with a non-trivial error rate.
  3. No token refresh with a caller-supplied provider · runner.py:94 — configure_run sets mint_bedrock=False when SKILLSPECTOR_PROVIDER is pre-set, so a multi-hour run with an externally-supplied credential never refreshes; the back half silently turns to
    errors once it expires.

A theme worth noting: findings 1, 2, 3, 5, 7 all share one root cause — failure and edge-case states silently degrade the headline metrics instead of being surfaced. That's the cluster I'd prioritize, since the benchmark's whole value is trustworthy numbers.

@smoy smoy left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good enough stand alone for the benchmark

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants