feat: Lambda function support from DataFusion, illustrated with array_filter by kazantsev-maksim · Pull Request #4744 · apache/datafusion-comet

kazantsev-maksim · 2026-06-28T18:58:36Z

Which issue does this PR close?

N/A

Rationale for this change

Running higher-order functions through JVM codegen is expensive: each batch incurs a JNI call into Spark's own implementation. Moving the lambda evaluation into the native DataFusion engine removes that overhead and brings the plan closer to fully native execution.

What changes are included in this PR?

Protobuf (native/proto/src/proto/expr.proto): Added three new messages: HigherOrderFunc (function name + value arguments + lambdas), LambdaFunction (body + arguments), and NamedLambdaVariable (name, type, nullable). Added high_order_func (71) and named_lambda_variable (72) fields to Expr.
Native planner (native/core/src/execution/planner.rs): Added handling for the new HighOrderFunc and NamedLambdaVariable expression variants. New create_high_order_function_expr and create_lambda_expr methods build DataFusion's HigherOrderFunctionExpr and LambdaExpr, unpacking the value arguments and lambdas. Wired in LambdaExpr, LambdaVariable, and HigherOrderFunctionExpr from DataFusion.
Native function registry (native/spark-expr/src/comet_high_order_funcs.rs, lib.rs): New module with create_comet_hof_func, which looks up a higher-order function in the FunctionRegistry by name and returns a clear error if it isn't found.
Spark serialization (CometHighOrderFunction.scala, QueryPlanSerde.scala, arrays.scala): New generic serializer CometHighOrderFunction[T] that converts a Spark HigherOrderFunction (along with LambdaFunction and NamedLambdaVariable) into protobuf. CometArrayFilter now extends this serializer: when spark.comet.exec.scalaUDF.codegen.enabled is set to false it takes the new native path; otherwise the old behavior is preserved (including the fast path for array_compact).
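
As a rough illustration of the registry lookup described in the list above, here is a minimal Rust sketch. It does not use DataFusion's FunctionRegistry trait: SketchRegistry, HigherOrderFn, LookupError, and lookup_hof are hypothetical names invented for this example, and the real create_comet_hof_func may differ in signature and error type.

```rust
use std::collections::HashMap;
use std::fmt;

/// Hypothetical stand-in for a registered higher-order function.
#[derive(Debug, Clone, PartialEq)]
pub struct HigherOrderFn {
    pub name: String,
}

/// Hypothetical registry backed by a map; NOT DataFusion's FunctionRegistry.
#[derive(Default)]
pub struct SketchRegistry {
    pub hofs: HashMap<String, HigherOrderFn>,
}

impl SketchRegistry {
    /// Registers a function under its name.
    pub fn register(&mut self, name: &str) {
        self.hofs.insert(
            name.to_string(),
            HigherOrderFn { name: name.to_string() },
        );
    }
}

/// Error returned when a function name is unknown to the registry.
#[derive(Debug, PartialEq)]
pub struct LookupError(pub String);

impl fmt::Display for LookupError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

/// Looks a higher-order function up by name and reports a descriptive
/// error (including the missing name) instead of panicking.
pub fn lookup_hof(
    registry: &SketchRegistry,
    name: &str,
) -> Result<HigherOrderFn, LookupError> {
    registry.hofs.get(name).cloned().ok_or_else(|| {
        LookupError(format!(
            "higher-order function '{}' is not registered",
            name
        ))
    })
}
```

Returning a Result rather than unwrapping keeps a missing registration visible as a named error at plan time, which is the behavior the description attributes to the new module.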

How are these changes tested?

Added new SQL tests.
Added a new benchmark test.

Simple benchmark result:


andygrove

Thanks for this, it is a nicely scoped foundation for native lambda support. I verified the native array_filter null semantics against datafusion-functions-nested 54.0.0 and they line up with Spark (null predicate drops the element, a null array is preserved as null, null elements are passed to the lambda and dropped). The HOF is registered into the session via functions_nested::register_all, so the runtime lookup resolves. A few discussion points and one robustness fix below.

One overall direction: I think the native lambda path should be the default, with the codegen dispatcher as the fallback for shapes the native path cannot handle, rather than the native path being off by default. See the inline comment on arrays.scala.

andygrove · 2026-06-30T01:27:38Z


-  override def getSupportLevel(expr: ArrayFilter): SupportLevel = Compatible()
+  override def getSupportLevel(expr: ArrayFilter): SupportLevel = {
+    if (!CometConf.COMET_SCALA_UDF_CODEGEN_ENABLED.get()) {


Could we flip this so the native lambda path is the default and the codegen dispatcher becomes the fallback for shapes the native path cannot handle (2-arg lambdas, non-LambdaFunction bodies), with Spark only as the last resort? Right now the native path is off by default, and an unsupported shape with codegen disabled drops straight to Spark instead of degrading to codegen. Reusing scalaUDF.codegen.enabled to toggle the native path also couples two separate concerns. A dedicated config plus a three-tier fallback (native -> codegen -> Spark) would read more cleanly.
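
One way to picture the three-tier fallback suggested above is the following Rust sketch. Everything in it is hypothetical: FilterShape, Config, the native_lambda_enabled flag, and choose_path are illustrative names, and the actual decision lives in Scala (getSupportLevel), not in Rust.

```rust
/// Where an array_filter call ends up executing.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExecPath {
    Native,
    Codegen,
    Spark,
}

/// Hypothetical summary of the call shape the serializer sees.
pub struct FilterShape {
    /// Number of lambda parameters (1 for `x -> ...`, 2 for `(x, i) -> ...`).
    pub lambda_arg_count: usize,
    /// Whether the function argument is a plain LambdaFunction body.
    pub body_is_lambda_function: bool,
}

/// Hypothetical configuration with a dedicated flag for the native path,
/// kept separate from the codegen flag.
pub struct Config {
    pub native_lambda_enabled: bool,
    pub codegen_enabled: bool,
}

/// Picks native first, degrades to codegen, and uses Spark as last resort.
pub fn choose_path(shape: &FilterShape, conf: &Config) -> ExecPath {
    // Shapes the native path is assumed to handle: single-argument lambdas
    // with a LambdaFunction body.
    let native_ok =
        shape.lambda_arg_count == 1 && shape.body_is_lambda_function;
    if conf.native_lambda_enabled && native_ok {
        ExecPath::Native
    } else if conf.codegen_enabled {
        ExecPath::Codegen
    } else {
        ExecPath::Spark
    }
}
```

With both flags on, a two-argument lambda would degrade to codegen rather than dropping straight to Spark, which is the gap the comment points out.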

andygrove · 2026-06-30T01:27:39Z

+      .newBuilder()
+      .setName(nlv.name)
+      .setNullable(nlv.nullable)
+      .setDataType(serializeDataType(nlv.dataType).get)


serializeDataType(nlv.dataType).get will throw if the lambda variable's type is not serializable by Comet, which fails during planning rather than falling back. Could we thread the Option through and return None from convert in that case so it falls back gracefully?
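
The same idea, threading the Option through instead of unwrapping it, can be sketched in Rust. The code under review is Scala, so this is only an analogue: DataType, LambdaVar, LambdaVarProto, serialize_data_type, and convert are hypothetical stand-ins, and the numeric type ids are made up for the example.

```rust
/// Hypothetical set of types; `Unsupported` models a type the
/// serializer cannot encode.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Utf8,
    Unsupported,
}

/// Hypothetical lambda variable as seen by the serializer.
pub struct LambdaVar {
    pub name: String,
    pub nullable: bool,
    pub data_type: DataType,
}

/// Hypothetical serialized form of a lambda variable.
#[derive(Debug, PartialEq)]
pub struct LambdaVarProto {
    pub name: String,
    pub nullable: bool,
    pub type_id: u32,
}

/// Returns None for types that cannot be serialized (ids are illustrative).
pub fn serialize_data_type(dt: &DataType) -> Option<u32> {
    match dt {
        DataType::Int32 => Some(1),
        DataType::Utf8 => Some(2),
        DataType::Unsupported => None,
    }
}

/// Propagates a serialization failure as None so the caller can fall
/// back, instead of panicking the way an unconditional unwrap would.
pub fn convert(var: &LambdaVar) -> Option<LambdaVarProto> {
    let type_id = serialize_data_type(&var.data_type)?;
    Some(LambdaVarProto {
        name: var.name.clone(),
        nullable: var.nullable,
        type_id,
    })
}
```

In the Scala serializer the equivalent would presumably be a map or for-comprehension over the Option, so that convert returns None for an unserializable variable type.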

andygrove · 2026-06-30T01:27:39Z

+  private val UNARY_FUNCTION_EXPECTED =
+    "DataFusion higher-order functions support only 1 argument"
+
+  override def getIncompatibleReasons(): Seq[String] =


These three strings map to Unsupported(...) support levels in getSupportLevel, so they probably belong in getUnsupportedReasons() rather than getIncompatibleReasons(). They describe subsets that fall back, not cases that produce incorrect results, so this is the bucket the generated compat docs expect.

andygrove · 2026-06-30T01:27:39Z

+        let mut args: Vec<Arc<dyn PhysicalExpr>> =
+            Vec::with_capacity(value_args.len() + lambdas.len());
+        args.extend(value_args);
+        args.extend(lambdas);


A short comment that this assumes all value args precede all lambdas would help. That holds for array_filter and the current single-lambda functions, but would not generalize to a future HOF with interleaved value and lambda args.
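
A self-contained sketch of what such a comment could look like, together with a check of the ordering assumption. HofArg, build_args, and values_precede_lambdas are hypothetical; the planner itself works with Arc<dyn PhysicalExpr> values rather than strings.

```rust
/// Hypothetical tagged argument; strings stand in for physical expressions.
#[derive(Debug, Clone, PartialEq)]
pub enum HofArg {
    Value(String),
    Lambda(String),
}

/// Builds the final argument list for a higher-order function.
///
/// NOTE: this assumes all value arguments precede all lambdas, which holds
/// for array_filter and other single-lambda functions. A function with
/// interleaved value and lambda arguments would need explicit positions.
pub fn build_args(value_args: Vec<String>, lambdas: Vec<String>) -> Vec<HofArg> {
    let mut args = Vec::with_capacity(value_args.len() + lambdas.len());
    args.extend(value_args.into_iter().map(HofArg::Value));
    args.extend(lambdas.into_iter().map(HofArg::Lambda));
    args
}

/// Returns true when no value argument appears after the first lambda.
pub fn values_precede_lambdas(args: &[HofArg]) -> bool {
    let first_lambda = args
        .iter()
        .position(|a| matches!(a, HofArg::Lambda(_)))
        .unwrap_or(args.len());
    args[first_lambda..]
        .iter()
        .all(|a| matches!(a, HofArg::Lambda(_)))
}
```

A check like values_precede_lambdas could also be applied to the original argument order on the serialization side, so that an interleaved signature is rejected rather than silently reordered.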

andygrove · 2026-06-30T01:27:39Z

+SELECT filter(arr, x -> x > 2) FROM test_array_filter_native
+
+query
+SELECT filter(arr, x -> x >= 0) FROM test_array_filter_native


Might be worth adding cases for an array with null elements (e.g. array(1, null, 3) with x -> x > 0), an empty array (array()), and a predicate that itself returns null. The native filter has dedicated handling for those, and the current data only covers all-non-null arrays plus one fully-null row.
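
The expected results for those cases can be pinned down with a small model of the semantics stated in the review summary (a null predicate result drops the element, a null array stays null, null elements reach the lambda). This is a plain Rust sketch over Option values, not the datafusion-functions-nested implementation; array_filter and greater_than here are illustrative helpers, and None stands for SQL NULL.

```rust
/// Model of filter(arr, pred) with SQL NULL represented as None.
pub fn array_filter<T: Clone>(
    arr: Option<&[Option<T>]>,
    pred: impl Fn(&Option<T>) -> Option<bool>,
) -> Option<Vec<Option<T>>> {
    // A null array is preserved as null.
    let elems = arr?;
    Some(
        elems
            .iter()
            // Null elements are still passed to the predicate. An element
            // is kept only when the predicate is definitely true, so both
            // false and null results drop it.
            .filter(|e| pred(*e) == Some(true))
            .cloned()
            .collect(),
    )
}

/// Models `x -> x > threshold`: comparing a null input yields null.
pub fn greater_than(threshold: i32) -> impl Fn(&Option<i32>) -> Option<bool> {
    move |x| x.as_ref().map(|v| *v > threshold)
}
```

Under this model, array(1, null, 3) with x -> x > 0 yields [1, 3], an empty array yields an empty array, a null array yields null, and a predicate that always returns null yields an empty array.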

Kazantsev Maksim and others added 30 commits December 14, 2025 16:24

impl map_from_entries

768b3e9

Revert "impl map_from_entries"

c68c342

This reverts commit 768b3e9.

Merge branch 'apache:main' into main

d887555

Merge branch 'apache:main' into main

231aa90

Merge branch 'apache:main' into main

9500bbb

Merge branch 'apache:main' into main

9577481

Merge branch 'apache:main' into main

3791557

Merge branch 'apache:main' into main

7c2f082

Merge branch 'apache:main' into main

609a605

Merge branch 'apache:main' into main

a151b2c

Merge branch 'apache:main' into main

ad3e7f5

Merge branch 'apache:main' into main

ea92e4b

Merge branch 'apache:main' into main

8dfeca3

Merge branch 'apache:main' into main

559741e

Merge branch 'apache:main' into main

ebda14e

Merge branch 'apache:main' into main

408152e

Merge branch 'apache:main' into main

d7857b2

Merge branch 'apache:main' into main

aef41be

Merge branch 'apache:main' into main

5ac1c58

Merge branch 'apache:main' into main

9ae8e23

Merge branch 'apache:main' into main

5ca3888

Merge branch 'apache:main' into main

160a817

Merge branch 'apache:main' into main

88fc313

Merge branch 'apache:main' into main

e14c180

Merge branch 'apache:main' into main

610a885

Merge branch 'apache:main' into main

f8acb2c

Merge branch 'apache:main' into main

ec94897

Merge branch 'apache:main' into main

43405e4

Merge branch 'apache:main' into main

47b4915

Merge branch 'apache:main' into main

26e2682

kazantsev-maksim and others added 22 commits April 22, 2026 21:13

Merge branch 'apache:main' into main

c9f52d1

Merge branch 'apache:main' into main

67f72d9

Merge branch 'apache:main' into main

314e594

Merge branch 'apache:main' into main

ac8292f

Merge branch 'apache:main' into main

c9c140e

Merge branch 'apache:main' into main

decca58

Merge branch 'apache:main' into main

0919b33

Merge branch 'apache:main' into main

7495e21

Merge branch 'apache:main' into main

0a37a60

Merge branch 'apache:main' into main

abbba84

Merge branch 'apache:main' into main

6020560

Merge branch 'apache:main' into main

e2bdfb1

Merge branch 'apache:main' into main

3edfc33

Merge branch 'apache:main' into main

a39e860

Merge branch 'apache:main' into main

e88dd7b

Merge branch 'apache:main' into main

3e29d37

Merge branch 'apache:main' into main

4068359

Merge branch 'apache:main' into main

a3cb8de

Merge branch 'apache:main' into main

b33726f

Merge branch 'apache:main' into main

698f7a1

Merge branch 'apache:main' into main

18162a6

Support native DataFusion lambda functions

6b0d500

kazantsev-maksim marked this pull request as draft June 28, 2026 18:58

more tests

4281483

kazantsev-maksim marked this pull request as ready for review June 29, 2026 18:14

kazantsev-maksim and others added 3 commits June 29, 2026 22:14

Merge branch 'main' into array_filter

d157db2

fix

d056f64

Merge remote-tracking branch 'origin/array_filter' into array_filter

6037e7a

andygrove requested a review from comphead June 30, 2026 01:24

andygrove reviewed Jun 30, 2026

kazantsev-maksim wants to merge 63 commits into apache:main from kazantsev-maksim:array_filter
