Skip to content

feat: support TimestampNTZ inputs natively for hour/minute/second#4753

Open
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:fix-3180-timestamp-ntz-hms
Open

feat: support TimestampNTZ inputs natively for hour/minute/second#4753
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:fix-3180-timestamp-ntz-hms

Conversation

@andygrove

Copy link
Copy Markdown
Member

Which issue does this PR close?

Closes #3180.

Rationale for this change

The native hour, minute, and second implementations applied a session timezone offset to every input. That conversion is correct for timezone-aware timestamps (stored in UTC) but wrong for TimestampNTZ values, which already hold local wall-clock time. As a short-term fix these expressions were marked Incompatible for TimestampNTZ inputs so they fell back to Spark.

This PR implements the follow-up enhancement tracked in the issue: handling TimestampNTZ correctly on the native path so the expressions run in Comet for all supported input types.

What changes are included in this PR?

  • Native (extract_date_part.rs): detect TimestampNTZ inputs (including dictionary-encoded) and extract the date part directly with no timezone conversion. Timezone-aware timestamps keep the existing UTC to session-timezone shift.
  • Scala serde: CometHour, CometMinute, and CometSecond no longer special-case TimestampNTZ. They are Compatible for all supported input types and no longer need the Incompatible reason or the codegen-dispatch fallback.
  • Retargeted the native opt-in hint shown for codegen-dispatch path test in CometExpressionSuite to use from_utc_timestamp, since Hour is no longer Incompatible for any input.

How are these changes tested?

  • New Rust unit tests in extract_date_part.rs covering: timezone-aware conversion, TimestampNTZ direct extraction, timezone independence of the NTZ result, minute/second on NTZ, the NTZ-detection helper, and dictionary-encoded NTZ input.
  • CometTemporalExpressionSuite hour/minute/second - timestamp_ntz input now verifies native execution and Spark-identical results across UTC, America/Los_Angeles, Europe/London, and Asia/Tokyo. Two tests that only covered the previous fallback and allowIncompatible behavior were removed as obsolete.

The native `hour`, `minute`, and `second` implementations applied a session
timezone offset to every input. That is correct for timezone-aware timestamps
(stored in UTC) but wrong for TimestampNTZ values, which already hold local
wall-clock time. The expressions were therefore marked Incompatible for
TimestampNTZ inputs and fell back to Spark.

Detect TimestampNTZ inputs (including dictionary-encoded) in the native
`extract_date_part` macro and extract the date part directly, with no timezone
conversion. The Scala serde no longer special-cases TimestampNTZ, so these
expressions are Compatible for all supported input types and run natively.
@andygrove andygrove force-pushed the fix-3180-timestamp-ntz-hms branch from c1a1de2 to 4571032 Compare June 29, 2026 16:58
makeRawTimeParquetFile(path, dictionaryEnabled = false, n = 5)
withSQLConf("spark.comet.expression.Hour.allowIncompatible" -> "false") {
val df = spark.read.parquet(path.toString).selectExpr("hour(_5) as h")
withSQLConf("spark.comet.expression.FromUTCTimestamp.allowIncompatible" -> "false") {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is false expected here?

@mbutrovich mbutrovich left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove! This is a clean fix and the root cause is pinned down well. I confirmed it in the source: array_with_timezone routes a Timestamp(_, None) (NTZ) input through timestamp_ntz_to_timestamp, which interprets the wall-clock value as local time in the session zone and shifts it, so a later date_part applied the offset twice for NTZ. Bypassing that conversion for NTZ and extracting directly is exactly right.

I also like that the fix sits at the call site rather than inside array_with_timezone. Other callers (the cast paths) genuinely want the NTZ to session-zone conversion, so changing the shared helper would have been the wrong altitude. Special-casing in extract_date_part keeps the shared helper intact. The NTZ branch is zero-copy too, since array is already owned and just passes through.

Dropping CodegenDispatchFallback and the Incompatible reason from the three serdes looks safe to me. Spark's Hour, Minute, and Second only accept TimestampType and TimestampNTZType, and both are now handled on the native path, so there is no remaining input type that needs the fallback. The Rust unit tests cover the timezone-aware, NTZ, timezone-independence, and dictionary cases, and the obsolete fallback and allowIncompatible tests were correctly removed. CI is green.

Thanks for closing this one out.

fn timestamp_ntz_dictionary_input_extracts_local_time() {
use arrow::array::{DictionaryArray, Int8Array};
let values = Arc::new(TimestampMicrosecondArray::from(vec![Some(MICROS)])) as ArrayRef;
let keys = Int8Array::from(vec![0i8]);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small test-fidelity thought. Comet's dictionary columns use Int32 keys, and return_type declares the dictionary result as Dictionary(Int32, Int32). arrow's date_part preserves the key type (it does array.with_values(...)), so an Int8-keyed input actually comes back as Dictionary(Int8, Int32), which does not match the declared return type. The test gets away with it because it casts the result to Int32 before checking, which is what the "regardless of whether the kernel kept the dictionary" comment is papering over.

Would it be worth switching this to an Int32-keyed dictionary so the test exercises the real production shape and the declared return type? As a bonus you could then assert the result is still a dictionary, which would tighten the coverage rather than casting it away. Not a correctness problem in production since Comet only emits Int32 keys, just making the test match what actually flows through.

/// Returns true when the type is a timestamp without a timezone (Spark's TimestampNTZType),
/// including when wrapped in a dictionary. Such values store local wall-clock time and must not
/// have any session timezone offset applied when extracting date parts.
fn is_timestamp_ntz(data_type: &DataType) -> bool {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice little helper, and recursing through the dictionary wrapper is the right move. Purely a forward pointer, no action needed here: the CometTemporalExpressionSuite comments note date_trunc has the same "do not apply a session offset to NTZ" issue for non-UTC zones (#2649). If that one gets picked up later, this same predicate looks like it would carry over, so it might be worth keeping in mind as a shared piece rather than re-deriving it there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

hour/minute/second expressions incorrectly apply timezone conversion to TimestampNTZ inputs

3 participants