fix(eval): fix search_tool correctness always scoring 0%#1675
Conversation
_args_match checked emit_widget (renamed to user_requested_search in the tool schema) and limit (optional, server-side default). Both mismatched on every real model call, so tool_correctness was always False regardless of whether the model used the right keywords and object types. Fix: evaluate only keywords (case-insensitive) and object_types — the two fields that actually determine whether the search was semantically correct.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthrough
ChangesSearch Tool Argument Matching
Estimated code review effort: 1 (Trivial) | ~5 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py`:
- Around line 12-16: The _args_match comparison is not defensive enough and can
crash on malformed tool arguments from parsed_arguments(). Update _args_match to
validate and normalize actual_args["keywords"] and actual_args["object_types"]
before lowercasing or sorting, treating any non-string or unexpected entry as a
mismatch instead of raising. Keep the existing matching behavior for valid
inputs, but ensure bad JSON like mixed types or numeric keywords returns False
rather than aborting evaluation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1e3e175f-7f97-436d-9802-bd0a292bd4bb
📒 Files selected for processing (1)
packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1675 +/- ##
=======================================
Coverage 79.21% 79.21%
=======================================
Files 232 232
Lines 15809 15809
=======================================
Hits 12523 12523
Misses 3286 3286 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
parsed_arguments() returns raw model-emitted JSON, so a bad tool call
like {"keywords":[1]} or mixed-type object_types would raise on
.lower()/sorted() and abort the whole eval run instead of scoring
tool_correctness=False. Add _normalize_str_list to drop non-string
entries defensively; valid comparisons (incl. case-insensitive keyword
match) are unchanged.
Addresses CodeRabbit review on PR #1675.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
The
tool_correctnessmetric insearch_toolevaluations was alwaysFalse(0%), making the dashboard metric useless.Root cause:
_args_matchchecked two fields that never matched real model calls:emit_widget— this parameter was renamed touser_requested_searchin the tool schema; the model never sendsemit_widget, soactual_args.get("emit_widget")was alwaysNonevs the expectedfalselimit— declared as optional (int | None) in the schema; the model omits it and relies on the server default, while fixtures hardcode10Fix: Only check
keywords(case-insensitive) andobject_types— the two fields that determine whether the search was semantically correct.limitand the widget display flag are not part of search correctness.Test Plan
uv run pytest tests/ -x -qinpackages/gooddata-evaltool_correctnessnow reflects actual model behaviour (models that call with the right keywords and object types should score True)Risk
Low — single function change in the evaluator, no effect on eval runner or production code.
Summary by CodeRabbit