Skip to content

fix(eval): fix search_tool correctness always scoring 0%#1675

Open
Tomkess wants to merge 2 commits into
masterfrom
fix/eval-search-tool-correctness
Open

fix(eval): fix search_tool correctness always scoring 0%#1675
Tomkess wants to merge 2 commits into
masterfrom
fix/eval-search-tool-correctness

Conversation

@Tomkess

@Tomkess Tomkess commented Jun 27, 2026

Copy link
Copy Markdown

Summary

The tool_correctness metric in search_tool evaluations was always False (0%), making the dashboard metric useless.

Root cause: _args_match checked two fields that never matched real model calls:

  • emit_widget — this parameter was renamed to user_requested_search in the tool schema; the model never sends emit_widget, so actual_args.get("emit_widget") was always None vs the expected false
  • limit — declared as optional (int | None) in the schema; the model omits it and relies on the server default, while fixtures hardcode 10

Fix: Only check keywords (case-insensitive) and object_types — the two fields that determine whether the search was semantically correct. limit and the widget display flag are not part of search correctness.

Test Plan

  • Run uv run pytest tests/ -x -q in packages/gooddata-eval
  • Re-run search_tool eval items and verify tool_correctness now reflects actual model behaviour (models that call with the right keywords and object types should score True)

Risk

Low — single function change in the evaluator, no effect on eval runner or production code.

Summary by CodeRabbit

  • Bug Fixes
    • Improved search tool call matching to evaluate correctness using the most relevant inputs.
    • Search keywords are now compared case-insensitively, and object types are normalized for more reliable matching.
    • Search options like limit and widget emission are no longer required to match exactly when judging search quality, reducing false mismatches.
    • Malformed or non-string keyword/object type inputs are handled more defensively, preventing errors during matching.

_args_match checked emit_widget (renamed to user_requested_search in the
tool schema) and limit (optional, server-side default). Both mismatched on
every real model call, so tool_correctness was always False regardless of
whether the model used the right keywords and object types.

Fix: evaluate only keywords (case-insensitive) and object_types — the two
fields that actually determine whether the search was semantically correct.
@Tomkess Tomkess requested review from hkad98, lupko and pcerny as code owners June 27, 2026 20:34
@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ad59c555-206e-48c1-b386-fb7c35f68da0

📥 Commits

Reviewing files that changed from the base of the PR and between 2ebc8b0 and f847679.

📒 Files selected for processing (1)
  • packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py

📝 Walkthrough

Walkthrough

_args_match now compares only normalized keywords and object_types, while malformed list inputs are normalized to empty lists instead of being matched through direct list handling.

Changes

Search Tool Argument Matching

Layer / File(s) Summary
Normalize string lists
packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py
Adds a helper that filters list inputs to strings, optionally lowercases them, sorts them, and returns an empty list for malformed values.
Relax _args_match criteria
packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py
Compares normalized keywords and object_types only, and removes the previous limit and emit_widget equality checks.

Estimated code review effort: 1 (Trivial) | ~5 minutes

Poem

🐇 I sniff the keywords, round and neat,
Object types line up to the beat.
Limits drift off into the mist,
Widget flags no longer insist.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the PR’s main change: fixing search_tool evaluation correctness always scoring 0%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py`:
- Around line 12-16: The _args_match comparison is not defensive enough and can
crash on malformed tool arguments from parsed_arguments(). Update _args_match to
validate and normalize actual_args["keywords"] and actual_args["object_types"]
before lowercasing or sorting, treating any non-string or unexpected entry as a
mismatch instead of raising. Keep the existing matching behavior for valid
inputs, but ensure bad JSON like mixed types or numeric keywords returns False
rather than aborting evaluation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1e3e175f-7f97-436d-9802-bd0a292bd4bb

📥 Commits

Reviewing files that changed from the base of the PR and between 726a0b0 and 2ebc8b0.

📒 Files selected for processing (1)
  • packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py

Comment thread packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py Outdated
@codecov

codecov Bot commented Jun 27, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.21%. Comparing base (653f5bc) to head (f847679).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1675   +/-   ##
=======================================
  Coverage   79.21%   79.21%           
=======================================
  Files         232      232           
  Lines       15809    15809           
=======================================
  Hits        12523    12523           
  Misses       3286     3286           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

parsed_arguments() returns raw model-emitted JSON, so a bad tool call
like {"keywords":[1]} or mixed-type object_types would raise on
.lower()/sorted() and abort the whole eval run instead of scoring
tool_correctness=False. Add _normalize_str_list to drop non-string
entries defensively; valid comparisons (incl. case-insensitive keyword
match) are unchanged.

Addresses CodeRabbit review on PR #1675.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant