opentelemetry-sdk: drop NaN measurements at instrument level to prevent aggregation poisoning#5336
Conversation
| for callback in self._callbacks: | ||
| try: | ||
| for api_measurement in callback(callback_options): | ||
| if math.isnan(api_measurement.value): |
There was a problem hiding this comment.
This does not catch math.inf though which I guess will have the same issue as math.nan, what about using math.isfinite to cover both instead?
There was a problem hiding this comment.
Good catch sir, completely missed the inf check, apologies, will update soon
|
hey sir, switched all isnan checks to math.isfinite to cover both NaN and Inf, and added inf test cases for each instrument type |
|
|
||
| if not math.isfinite(amount): | ||
| _logger.warning( | ||
| "Add amount is not finite on Counter %s, ignoring measurement.", |
|
done, added the actual value to all 5 warning messages |
There was a problem hiding this comment.
Pull request overview
This pull request updates the OpenTelemetry Python SDK metrics instrument implementations to proactively drop non-finite numeric measurements (NaN, +/-Inf) at the instrument/callback layer, preventing permanent corruption (“poisoning”) of downstream aggregations.
Changes:
- Added
math.isfinite(...)guards to synchronous instrument methods (Counter.add,UpDownCounter.add,Histogram.record,Gauge.set) and to the shared async callback path (_Asynchronous.callback) to skip non-finite values with a warning. - Added unit tests covering
NaNandInfinputs for the affected instruments and async callbacks. - Added a changelog entry documenting the fix.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/instrument.py |
Drops non-finite values before they reach measurement consumption/aggregation for both sync instruments and async callbacks. |
opentelemetry-sdk/tests/metrics/test_instrument.py |
Adds regression tests ensuring non-finite measurements are ignored (and valid measurements still pass through). |
.changelog/5336.fixed |
Records the behavioral fix for release notes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Can you add an integration test for this either in an existing file or feel free to make a new one opentelemetry-sdk/tests/metrics/integration_test/.
(it helps to capture the fix if we later refactor the SDK internals)

Summary
NaNmeasurements passed to metric instruments (e.g.counter.add(float("nan"))) were silently folded into running aggregations, permanently poisoning the sum, all subsequent exports emittednanwith no recovery pathif amount < 0guard onCounterandHistogramis bypassed byNaNbecause all IEEE 754 comparisons againstNaNreturnFalsemath.isnan()check at the instrument level (same layer as the existing negative-value guard) for all sync instruments (Counter,UpDownCounter,Histogram,Gauge) and in the async callback path, matching the approach taken in the Java SDK (open-telemetry/opentelemetry-java#5846, fixed in PR #5859)Closes #5330
What was tested
test_add_nantoTestCounterandTestUpDownCountertest_record_nantoTestHistogramtest_set_nantoTestGaugetest_nan_callback_value_is_droppedtoTestObservableGauge(verifies NaN from async callbacks is skipped, valid measurements still pass through)python -m pytest opentelemetry-sdk/tests/metrics/test_instrument.py)counter.add(1) → add(NaN) → add(5)now yieldssum = 6, notnan