fix(eval): fix search_tool correctness always scoring 0% by Tomkess · Pull Request #1675 · gooddata/gooddata-python-sdk

Tomkess · 2026-06-27T20:34:01Z

Summary

The tool_correctness metric in search_tool evaluations was always False (0%), making the dashboard metric useless.

Root cause: _args_match checked two fields that never matched real model calls:

emit_widget — this parameter was renamed to user_requested_search in the tool schema; the model never sends emit_widget, so actual_args.get("emit_widget") was always None vs the expected false
limit — declared as optional (int | None) in the schema; the model omits it and relies on the server default, while fixtures hardcode 10

Fix: Only check keywords (case-insensitive) and object_types — the two fields that determine whether the search was semantically correct. limit and the widget display flag are not part of search correctness.

Test Plan

Run uv run pytest tests/ -x -q in packages/gooddata-eval
Re-run search_tool eval items and verify tool_correctness now reflects actual model behaviour (models that call with the right keywords and object types should score True)

Risk

Low — single function change in the evaluator, no effect on eval runner or production code.

Summary by CodeRabbit

Bug Fixes
- Improved search tool call matching to evaluate correctness using the most relevant inputs.
- Search keywords are now compared case-insensitively, and object types are normalized for more reliable matching.
- Search options like limit and widget emission are no longer required to match exactly when judging search quality, reducing false mismatches.
- Malformed or non-string keyword/object type inputs are handled more defensively, preventing errors during matching.

_args_match checked emit_widget (renamed to user_requested_search in the tool schema) and limit (optional, server-side default). Both mismatched on every real model call, so tool_correctness was always False regardless of whether the model used the right keywords and object types. Fix: evaluate only keywords (case-insensitive) and object_types — the two fields that actually determine whether the search was semantically correct.

coderabbitai · 2026-06-27T20:34:19Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ad59c555-206e-48c1-b386-fb7c35f68da0

📥 Commits

Reviewing files that changed from the base of the PR and between 2ebc8b0 and f847679.

📒 Files selected for processing (1)

packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py

🚧 Files skipped from review as they are similar to previous changes (1)

packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py

📝 Walkthrough

Walkthrough

_args_match now compares only normalized keywords and object_types, while malformed list inputs are normalized to empty lists instead of being matched through direct list handling.

Changes

Search Tool Argument Matching

Layer / File(s)	Summary
Normalize string lists `packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py`	Adds a helper that filters list inputs to strings, optionally lowercases them, sorts them, and returns an empty list for malformed values.
Relax `_args_match` criteria `packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py`	Compares normalized `keywords` and `object_types` only, and removes the previous `limit` and `emit_widget` equality checks.

Estimated code review effort: 1 (Trivial) | ~5 minutes

Poem

🐇 I sniff the keywords, round and neat,
Object types line up to the beat.
Limits drift off into the mist,
Widget flags no longer insist.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly matches the PR’s main change: fixing search_tool evaluation correctness always scoring 0%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py`:
- Around line 12-16: The _args_match comparison is not defensive enough and can
crash on malformed tool arguments from parsed_arguments(). Update _args_match to
validate and normalize actual_args["keywords"] and actual_args["object_types"]
before lowercasing or sorting, treating any non-string or unexpected entry as a
mismatch instead of raising. Keep the existing matching behavior for valid
inputs, but ensure bad JSON like mixed types or numeric keywords returns False
rather than aborting evaluation.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1e3e175f-7f97-436d-9802-bd0a292bd4bb

📥 Commits

Reviewing files that changed from the base of the PR and between 726a0b0 and 2ebc8b0.

📒 Files selected for processing (1)

packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py

codecov · 2026-06-27T20:37:49Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.21%. Comparing base (653f5bc) to head (f847679).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1675   +/-   ##
=======================================
  Coverage   79.21%   79.21%           
=======================================
  Files         232      232           
  Lines       15809    15809           
=======================================
  Hits        12523    12523           
  Misses       3286     3286

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

parsed_arguments() returns raw model-emitted JSON, so a bad tool call like {"keywords":[1]} or mixed-type object_types would raise on .lower()/sorted() and abort the whole eval run instead of scoring tool_correctness=False. Add _normalize_str_list to drop non-string entries defensively; valid comparisons (incl. case-insensitive keyword match) are unchanged. Addresses CodeRabbit review on PR #1675. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Tomkess requested review from hkad98, lupko and pcerny as code owners June 27, 2026 20:34

coderabbitai Bot reviewed Jun 27, 2026

View reviewed changes

Comment thread packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py Outdated

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(eval): fix search_tool correctness always scoring 0%#1675

fix(eval): fix search_tool correctness always scoring 0%#1675
Tomkess wants to merge 2 commits into
masterfrom
fix/eval-search-tool-correctness

Tomkess commented Jun 27, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 27, 2026 •

edited

Loading

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

codecov Bot commented Jun 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Tomkess commented Jun 27, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Plan

Risk

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov Bot commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Tomkess commented Jun 27, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 27, 2026 •

edited

Loading

codecov Bot commented Jun 27, 2026 •

edited

Loading