feat: support TimestampNTZ inputs natively for hour/minute/second by andygrove · Pull Request #4753 · apache/datafusion-comet

andygrove · 2026-06-29T16:53:36Z

Which issue does this PR close?

Closes #3180.

Rationale for this change

The native hour, minute, and second implementations applied a session timezone offset to every input. That conversion is correct for timezone-aware timestamps (stored in UTC) but wrong for TimestampNTZ values, which already hold local wall-clock time. As a short-term fix these expressions were marked Incompatible for TimestampNTZ inputs so they fell back to Spark.

This PR implements the follow-up enhancement tracked in the issue: handling TimestampNTZ correctly on the native path so the expressions run in Comet for all supported input types.

What changes are included in this PR?

Native (extract_date_part.rs): detect TimestampNTZ inputs (including dictionary-encoded) and extract the date part directly with no timezone conversion. Timezone-aware timestamps keep the existing UTC to session-timezone shift.
Scala serde: CometHour, CometMinute, and CometSecond no longer special-case TimestampNTZ. They are Compatible for all supported input types and no longer need the Incompatible reason or the codegen-dispatch fallback.
Retargeted the native opt-in hint shown for codegen-dispatch path test in CometExpressionSuite to use from_utc_timestamp, since Hour is no longer Incompatible for any input.

How are these changes tested?

New Rust unit tests in extract_date_part.rs covering: timezone-aware conversion, TimestampNTZ direct extraction, timezone independence of the NTZ result, minute/second on NTZ, the NTZ-detection helper, and dictionary-encoded NTZ input.
CometTemporalExpressionSuite hour/minute/second - timestamp_ntz input now verifies native execution and Spark-identical results across UTC, America/Los_Angeles, Europe/London, and Asia/Tokyo. Two tests that only covered the previous fallback and allowIncompatible behavior were removed as obsolete.

The native `hour`, `minute`, and `second` implementations applied a session timezone offset to every input. That is correct for timezone-aware timestamps (stored in UTC) but wrong for TimestampNTZ values, which already hold local wall-clock time. The expressions were therefore marked Incompatible for TimestampNTZ inputs and fell back to Spark. Detect TimestampNTZ inputs (including dictionary-encoded) in the native `extract_date_part` macro and extract the date part directly, with no timezone conversion. The Scala serde no longer special-cases TimestampNTZ, so these expressions are Compatible for all supported input types and run natively.

comphead · 2026-06-29T22:41:02Z

      makeRawTimeParquetFile(path, dictionaryEnabled = false, n = 5)
-      withSQLConf("spark.comet.expression.Hour.allowIncompatible" -> "false") {
-        val df = spark.read.parquet(path.toString).selectExpr("hour(_5) as h")
+      withSQLConf("spark.comet.expression.FromUTCTimestamp.allowIncompatible" -> "false") {


is false expected here?

mbutrovich

Thanks @andygrove! This is a clean fix and the root cause is pinned down well. I confirmed it in the source: array_with_timezone routes a Timestamp(_, None) (NTZ) input through timestamp_ntz_to_timestamp, which interprets the wall-clock value as local time in the session zone and shifts it, so a later date_part applied the offset twice for NTZ. Bypassing that conversion for NTZ and extracting directly is exactly right.

I also like that the fix sits at the call site rather than inside array_with_timezone. Other callers (the cast paths) genuinely want the NTZ to session-zone conversion, so changing the shared helper would have been the wrong altitude. Special-casing in extract_date_part keeps the shared helper intact. The NTZ branch is zero-copy too, since array is already owned and just passes through.

Dropping CodegenDispatchFallback and the Incompatible reason from the three serdes looks safe to me. Spark's Hour, Minute, and Second only accept TimestampType and TimestampNTZType, and both are now handled on the native path, so there is no remaining input type that needs the fallback. The Rust unit tests cover the timezone-aware, NTZ, timezone-independence, and dictionary cases, and the obsolete fallback and allowIncompatible tests were correctly removed. CI is green.

Thanks for closing this one out.

mbutrovich · 2026-06-30T15:44:28Z

+    fn timestamp_ntz_dictionary_input_extracts_local_time() {
+        use arrow::array::{DictionaryArray, Int8Array};
+        let values = Arc::new(TimestampMicrosecondArray::from(vec![Some(MICROS)])) as ArrayRef;
+        let keys = Int8Array::from(vec![0i8]);


Small test-fidelity thought. Comet's dictionary columns use Int32 keys, and return_type declares the dictionary result as Dictionary(Int32, Int32). arrow's date_part preserves the key type (it does array.with_values(...)), so an Int8-keyed input actually comes back as Dictionary(Int8, Int32), which does not match the declared return type. The test gets away with it because it casts the result to Int32 before checking, which is what the "regardless of whether the kernel kept the dictionary" comment is papering over.

Would it be worth switching this to an Int32-keyed dictionary so the test exercises the real production shape and the declared return type? As a bonus you could then assert the result is still a dictionary, which would tighten the coverage rather than casting it away. Not a correctness problem in production since Comet only emits Int32 keys, just making the test match what actually flows through.

mbutrovich · 2026-06-30T15:44:48Z

+/// Returns true when the type is a timestamp without a timezone (Spark's TimestampNTZType),
+/// including when wrapped in a dictionary. Such values store local wall-clock time and must not
+/// have any session timezone offset applied when extracting date parts.
+fn is_timestamp_ntz(data_type: &DataType) -> bool {


Nice little helper, and recursing through the dictionary wrapper is the right move. Purely a forward pointer, no action needed here: the CometTemporalExpressionSuite comments note date_trunc has the same "do not apply a session offset to NTZ" issue for non-UTC zones (#2649). If that one gets picked up later, this same predicate looks like it would carry over, so it might be worth keeping in mind as a shared piece rather than re-deriving it there.

andygrove force-pushed the fix-3180-timestamp-ntz-hms branch from c1a1de2 to 4571032 Compare June 29, 2026 16:58

fix: remove unused DataType import in datetime serde

3a8f7bb

comphead reviewed Jun 29, 2026

View reviewed changes

mbutrovich approved these changes Jun 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: support TimestampNTZ inputs natively for hour/minute/second#4753

feat: support TimestampNTZ inputs natively for hour/minute/second#4753
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:fix-3180-timestamp-ntz-hms

andygrove commented Jun 29, 2026

Uh oh!

comphead Jun 29, 2026

Uh oh!

mbutrovich left a comment

Uh oh!

mbutrovich Jun 30, 2026

Uh oh!

mbutrovich Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

andygrove commented Jun 29, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Uh oh!

comphead Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

mbutrovich left a comment

Choose a reason for hiding this comment

Uh oh!

mbutrovich Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

mbutrovich Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants