feat: support TimestampNTZ inputs natively for hour/minute/second#4753
feat: support TimestampNTZ inputs natively for hour/minute/second#4753andygrove wants to merge 2 commits into
Conversation
The native `hour`, `minute`, and `second` implementations applied a session timezone offset to every input. That is correct for timezone-aware timestamps (stored in UTC) but wrong for TimestampNTZ values, which already hold local wall-clock time. The expressions were therefore marked Incompatible for TimestampNTZ inputs and fell back to Spark. Detect TimestampNTZ inputs (including dictionary-encoded) in the native `extract_date_part` macro and extract the date part directly, with no timezone conversion. The Scala serde no longer special-cases TimestampNTZ, so these expressions are Compatible for all supported input types and run natively.
c1a1de2 to
4571032
Compare
| makeRawTimeParquetFile(path, dictionaryEnabled = false, n = 5) | ||
| withSQLConf("spark.comet.expression.Hour.allowIncompatible" -> "false") { | ||
| val df = spark.read.parquet(path.toString).selectExpr("hour(_5) as h") | ||
| withSQLConf("spark.comet.expression.FromUTCTimestamp.allowIncompatible" -> "false") { |
mbutrovich
left a comment
There was a problem hiding this comment.
Thanks @andygrove! This is a clean fix and the root cause is pinned down well. I confirmed it in the source: array_with_timezone routes a Timestamp(_, None) (NTZ) input through timestamp_ntz_to_timestamp, which interprets the wall-clock value as local time in the session zone and shifts it, so a later date_part applied the offset twice for NTZ. Bypassing that conversion for NTZ and extracting directly is exactly right.
I also like that the fix sits at the call site rather than inside array_with_timezone. Other callers (the cast paths) genuinely want the NTZ to session-zone conversion, so changing the shared helper would have been the wrong altitude. Special-casing in extract_date_part keeps the shared helper intact. The NTZ branch is zero-copy too, since array is already owned and just passes through.
Dropping CodegenDispatchFallback and the Incompatible reason from the three serdes looks safe to me. Spark's Hour, Minute, and Second only accept TimestampType and TimestampNTZType, and both are now handled on the native path, so there is no remaining input type that needs the fallback. The Rust unit tests cover the timezone-aware, NTZ, timezone-independence, and dictionary cases, and the obsolete fallback and allowIncompatible tests were correctly removed. CI is green.
Thanks for closing this one out.
| fn timestamp_ntz_dictionary_input_extracts_local_time() { | ||
| use arrow::array::{DictionaryArray, Int8Array}; | ||
| let values = Arc::new(TimestampMicrosecondArray::from(vec![Some(MICROS)])) as ArrayRef; | ||
| let keys = Int8Array::from(vec![0i8]); |
There was a problem hiding this comment.
Small test-fidelity thought. Comet's dictionary columns use Int32 keys, and return_type declares the dictionary result as Dictionary(Int32, Int32). arrow's date_part preserves the key type (it does array.with_values(...)), so an Int8-keyed input actually comes back as Dictionary(Int8, Int32), which does not match the declared return type. The test gets away with it because it casts the result to Int32 before checking, which is what the "regardless of whether the kernel kept the dictionary" comment is papering over.
Would it be worth switching this to an Int32-keyed dictionary so the test exercises the real production shape and the declared return type? As a bonus you could then assert the result is still a dictionary, which would tighten the coverage rather than casting it away. Not a correctness problem in production since Comet only emits Int32 keys, just making the test match what actually flows through.
| /// Returns true when the type is a timestamp without a timezone (Spark's TimestampNTZType), | ||
| /// including when wrapped in a dictionary. Such values store local wall-clock time and must not | ||
| /// have any session timezone offset applied when extracting date parts. | ||
| fn is_timestamp_ntz(data_type: &DataType) -> bool { |
There was a problem hiding this comment.
Nice little helper, and recursing through the dictionary wrapper is the right move. Purely a forward pointer, no action needed here: the CometTemporalExpressionSuite comments note date_trunc has the same "do not apply a session offset to NTZ" issue for non-UTC zones (#2649). If that one gets picked up later, this same predicate looks like it would carry over, so it might be worth keeping in mind as a shared piece rather than re-deriving it there.
Which issue does this PR close?
Closes #3180.
Rationale for this change
The native
hour,minute, andsecondimplementations applied a session timezone offset to every input. That conversion is correct for timezone-aware timestamps (stored in UTC) but wrong for TimestampNTZ values, which already hold local wall-clock time. As a short-term fix these expressions were markedIncompatiblefor TimestampNTZ inputs so they fell back to Spark.This PR implements the follow-up enhancement tracked in the issue: handling TimestampNTZ correctly on the native path so the expressions run in Comet for all supported input types.
What changes are included in this PR?
extract_date_part.rs): detect TimestampNTZ inputs (including dictionary-encoded) and extract the date part directly with no timezone conversion. Timezone-aware timestamps keep the existing UTC to session-timezone shift.CometHour,CometMinute, andCometSecondno longer special-case TimestampNTZ. They areCompatiblefor all supported input types and no longer need theIncompatiblereason or the codegen-dispatch fallback.native opt-in hint shown for codegen-dispatch pathtest inCometExpressionSuiteto usefrom_utc_timestamp, sinceHouris no longerIncompatiblefor any input.How are these changes tested?
extract_date_part.rscovering: timezone-aware conversion, TimestampNTZ direct extraction, timezone independence of the NTZ result,minute/secondon NTZ, the NTZ-detection helper, and dictionary-encoded NTZ input.CometTemporalExpressionSuitehour/minute/second - timestamp_ntz inputnow verifies native execution and Spark-identical results across UTC, America/Los_Angeles, Europe/London, and Asia/Tokyo. Two tests that only covered the previous fallback andallowIncompatiblebehavior were removed as obsolete.