[AURON #1853] Convert Flink Calc operators to Native Calc operators#2283

Open
weiqingy wants to merge 4 commits into apache:master from weiqingy:AURON-1853-impl


@weiqingy
Contributor

Which issue does this PR close?

Closes #1853

Rationale for this change

FlinkAuronCalcOperator (#1857) can execute Flink Calc plans natively but is unreachable from real Flink SQL jobs — the planner instantiates Flink's stock StreamExecCalc which builds a JVM codegen operator via CodeGenOperatorFactory<RowData>. This PR closes the loop: a shadowed StreamExecCalc in auron-flink-planner (at the same FQCN as Flink's, picked up by classpath ordering) attempts to translate its projection + condition RexNodes into a native Project[Filter?[FFIReader]] plan using the converter framework (#1856 / #1859). On success, it constructs a FlinkAuronCalcOperator inline; on any failure, it falls back transparently to Flink's stock Calc via super.translateToPlanInternal.

After this PR, a SELECT a + 1 FROM t query routes through Auron's native arithmetic instead of Flink's codegen-Calc bytecode — the first end-to-end Flink-on-Auron acceleration path is operational. Subsequent sub-issues (#1860 / #1861 / #1862 / #1863 / #1864) layer on more RexNode converters; #1865 adds source-Calc fusion. This PR provides the substitution mechanism they all plug into.
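
The try-native-then-fall-back control flow described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `StockCalc`, `ShadowCalc`, `tryBuildNativePlan`, and the string-based plan representation are hypothetical stand-ins, and plain subclassing stands in for the same-FQCN classpath shadowing, which a single-file example cannot reproduce.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

final class ShadowCalcSketch {
    /** Hypothetical stand-in for the stock Calc node that builds a codegen operator. */
    static class StockCalc {
        String translateToPlan(String sqlKind) {
            return "codegen-calc(" + sqlKind + ")";
        }
    }

    /** Hypothetical stand-in for the shadowed node: native first, stock path otherwise. */
    static class ShadowCalc extends StockCalc {
        // Assumed supported set for illustration only.
        private static final Set<String> SUPPORTED =
                new HashSet<>(Arrays.asList("PLUS", "MINUS", "TIMES", "DIVIDE", "MOD", "CAST"));

        /** Returns a native plan, or empty when the expression cannot be converted. */
        Optional<String> tryBuildNativePlan(String sqlKind) {
            return SUPPORTED.contains(sqlKind)
                    ? Optional.of("native-calc(" + sqlKind + ")")
                    : Optional.empty();
        }

        @Override
        String translateToPlan(String sqlKind) {
            Optional<String> nativePlan = tryBuildNativePlan(sqlKind);
            if (nativePlan.isPresent()) {
                return nativePlan.get();
            }
            // Conversion failed: defer to the parent's stock translation.
            return super.translateToPlan(sqlKind);
        }
    }

    public static void main(String[] args) {
        StockCalc calc = new ShadowCalc();
        System.out.println(calc.translateToPlan("PLUS"));
        System.out.println(calc.translateToPlan("UPPER"));
    }
}
```

Returning `Optional` from the helper keeps the "could not convert" outcome out of the exception path, so the fallback decision stays in one place.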

What changes are included in this PR?

Three commits, each independently reviewable.

Commit 1: [AURON #1853] Register built-in RexNode converters; add fallback config option

  • FlinkNodeConverterFactory now ships the three built-in converters (RexInputRefConverter, RexLiteralConverter, RexCallConverter) on its singleton at class load. Production callers no longer need to register them manually; tests using the package-private constructor remain unaffected.
  • FlinkAuronConfiguration gains FAIL_BACK_FLINK_ENGINE_ENABLED (boolean, default true, key auron.failback.flink.engine.enabled). Controls whether conversion failure falls back to Flink's engine silently (default) or fails the job at submit time.
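
A rough sketch of the "built-ins registered on the singleton at class load" pattern, under assumed names: `ConverterFactorySketch`, its string-keyed map, and the `Function<String, String>` converter type are all hypothetical and do not reflect the real `FlinkNodeConverterFactory` API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

final class ConverterFactorySketch {
    // Eager singleton: created once when the class is initialized.
    private static final ConverterFactorySketch INSTANCE = new ConverterFactorySketch();

    static {
        // Built-ins are registered at class load, so production callers need no setup.
        INSTANCE.register("RexInputRef", node -> "column(" + node + ")");
        INSTANCE.register("RexLiteral", node -> "literal(" + node + ")");
        INSTANCE.register("RexCall", node -> "call(" + node + ")");
    }

    private final Map<String, Function<String, String>> converters = new LinkedHashMap<>();

    // Package-private: a test can build an empty factory and register only what it needs.
    ConverterFactorySketch() {}

    static ConverterFactorySketch getInstance() {
        return INSTANCE;
    }

    void register(String nodeType, Function<String, String> converter) {
        converters.put(nodeType, converter);
    }

    /** Empty when no converter is registered for the node type. */
    Optional<String> convert(String nodeType, String node) {
        Function<String, String> converter = converters.get(nodeType);
        return converter == null ? Optional.empty() : Optional.of(converter.apply(node));
    }

    int size() {
        return converters.size();
    }
}
```

Because the static block touches only the singleton, instances created through the constructor start empty, which is the property the description relies on for existing tests.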

Commit 2: [AURON #1853] Shadow Flink's StreamExecCalc to attempt native Calc

  • New org.apache.flink.table.planner.plan.nodes.exec.stream.StreamExecCalc in auron-flink-planner. Same FQCN as Flink's; classpath ordering (auron-flink-planner ahead of flink-table-planner via the standard auron-flink-assembly packaging) makes the planner construct this class whenever it builds a Calc ExecNode. Same pattern as Apache Gluten's gluten-flink.
  • Override of translateToPlanInternal(PlannerBase, ExecNodeConfig) builds the native plan inline via a small private helper returning Optional<PhysicalPlanNode>. Plan shape is Project[Filter?[FFIReader]] (Filter wrapper only when condition != null); the FFIReader leaf carries a placeholder resource ID that FlinkAuronCalcOperator.open() rewrites at runtime per the #1857 contract (Introduce FlinkAuronCalcOperator).
  • Observability: WARN per unique unsupported RexNode class (deduplicated by class via a ThreadLocal<Set>) + WARN per plan-composition exception (per-occurrence with stack trace). Per-submission INFO summary deferred — Flink 1.18's PlannerBase exposes no clean submission-end hook; the per-fallback WARN already provides actionable signal.
  • One-line cross-commit fix to a pre-existing latent defect in #1857 (Introduce FlinkAuronCalcOperator): the FlinkAuronCalcOperator.NativeRuntimeFactory interface now extends java.io.Serializable. Without the marker, Flink's operator dispatch to TaskManagers throws NotSerializableException. The bug was latent because #1857 had no E2E test exercising the operator-serialization path. Marker interfaces are non-breaking; the interface is package-private and annotated @VisibleForTesting, so the public API surface is unchanged.
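
To make the Project[Filter?[FFIReader]] shape concrete, here is a small sketch of how such a plan might be composed. The `Node` type, the `compose` helper, and the placeholder string are hypothetical; the real code builds `PhysicalPlanNode` messages, which are not modeled here.

```java
import java.util.Arrays;
import java.util.List;

final class PlanShapeSketch {
    /** Minimal plan node: a label plus at most one child. */
    static final class Node {
        final String label;
        final Node child;

        Node(String label, Node child) {
            this.label = label;
            this.child = child;
        }

        @Override
        public String toString() {
            return child == null ? label : label + "[" + child + "]";
        }
    }

    // Assumed placeholder; per the PR, the operator rewrites the real resource ID at open().
    static final String PLACEHOLDER_RESOURCE_ID = "<placeholder>";

    /** Builds Project[Filter?[FFIReader]]; the Filter layer exists only for a non-null condition. */
    static Node compose(List<String> projections, String conditionOrNull) {
        Node plan = new Node("FFIReader(" + PLACEHOLDER_RESOURCE_ID + ")", null);
        if (conditionOrNull != null) {
            plan = new Node("Filter(" + conditionOrNull + ")", plan);
        }
        return new Node("Project(" + String.join(", ", projections) + ")", plan);
    }

    public static void main(String[] args) {
        System.out.println(compose(Arrays.asList("a + 1"), null));
        System.out.println(compose(Arrays.asList("a + 1", "a * 2"), "a > 1"));
    }
}
```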

Commit 3: [AURON #1853] Add AuronCalcRewriteITCase for end-to-end Flink SQL coverage

  • New AuronCalcRewriteITCase extends AuronFlinkTableTestBase. Four tests cover distinct paths: multi-column Auron-converted projection (select `int` + 1, `int` * 2 from T1), filter-with-fallback (where `int` > 1; GREATER_THAN is not yet converter-supported), unsupported-function silent fallback (UPPER(string)), and mixed-Calcs per-Calc granularity (UNION ALL of one convertible + one non-convertible Calc).

Are there any user-facing changes?

Three new behaviors visible to operators and SQL users:

  1. Native acceleration for Calc operators: any Flink SQL SELECT … WHERE … whose projection and condition use only converter-supported RexNodes now runs through Auron's native engine. Today's supported set (from #1859, Convert Math operators to Auron Native operators): RexInputRef, RexLiteral, RexCall with SqlKind in {+, -, *, /, %, MINUS_PREFIX, PLUS_PREFIX, CAST}. The Auron operator emits the same logical results; throughput improves on supported workloads.

  2. flink.auron.failback.flink.engine.enabled config option — boolean, default true. When true, any Calc with an unsupported RexNode silently falls back to Flink's stock codegen Calc (the user sees identical behavior to a non-Auron Flink cluster). When false, conversion failure throws IllegalStateException at job submission — useful for CI gates and during new-operator development.

  3. WARN log lines on fallback — TaskManager logs now surface per-unique-RexNode-class WARN lines like Auron StreamExecCalc fallback (node 17): unsupported RexNode org.apache.calcite.rex.RexFieldAccess; using Flink CodeGen Calc. and per-exception lines like Auron StreamExecCalc fallback (node 17): plan composition threw java.lang.UnsupportedOperationException; using Flink CodeGen Calc. Users with monitoring see fallbacks immediately and can decide whether to file feature requests for missing converters.
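
The difference between the default (silent fallback) and strict mode can be sketched as a single gate function. This is an illustration under assumed names: `onConversionFailure` and its parameters are hypothetical, and the boolean argument stands in for whatever value the failback config option resolves to.

```java
import java.util.function.Supplier;

final class FallbackGateSketch {
    /**
     * Decides what happens after a failed native conversion.
     * fallbackEnabled = true  -> run the stock (codegen) path.
     * fallbackEnabled = false -> fail fast, e.g. for CI gates.
     */
    static <T> T onConversionFailure(boolean fallbackEnabled, String reason, Supplier<T> stockPath) {
        if (!fallbackEnabled) {
            throw new IllegalStateException(
                    "Native conversion failed and fallback is disabled: " + reason);
        }
        return stockPath.get();
    }

    public static void main(String[] args) {
        // Default mode: the stock path is used.
        System.out.println(onConversionFailure(true, "unsupported UPPER", () -> "codegen-calc"));
        // Strict mode: the same failure surfaces at submission time.
        try {
            onConversionFailure(false, "unsupported UPPER", () -> "codegen-calc");
        } catch (IllegalStateException e) {
            System.out.println("strict mode: " + e.getMessage());
        }
    }
}
```

Passing the stock path as a `Supplier` means it is never evaluated in strict mode.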

No deprecations. No removed APIs.

How was this patch tested?

Unit tests:

  • StreamExecCalcTest — 12 tests covering plan-build paths, fallback paths, strict mode, and WARN dedup contracts.
  • FlinkAuronConfigurationTest — verifies the new config option default and proxy-lookup behavior.
  • FlinkNodeConverterFactoryTest — verifies the singleton ships with built-ins registered.

Integration tests:

  • AuronCalcRewriteITCase — 4 SQL-level tests against TestValuesTableFactory + StreamTableEnvironment. Two exercise the Auron path (multi-column arithmetic, union with convertible branch) and require the native library; two exercise the fallback path (UPPER, > in condition) and pass without the native library.

Command:

./build/mvn test -Pspark-3.5,scala-2.12,flink-1.18 \
    -pl auron-flink-extension/auron-flink-planner,auron-flink-extension/auron-flink-runtime \
    -Dtest=StreamExecCalcTest,FlinkAuronConfigurationTest,FlinkNodeConverterFactoryTest,FlinkAuronCalcOperatorTest,AuronCalcRewriteITCase

(The AuronCalcRewriteITCase Auron-path tests share AuronFlinkCalcITCase.testPlus's native-library prerequisite — build the native lib first via ./auron-build.sh --pre --sparkver 3.5 --scalaver 2.12.)

Checkstyle: 0 violations on both modules.

weiqingy added 3 commits May 21, 2026 23:35
…k config option

Prerequisite infrastructure for the shadowed StreamExecCalc landing in
a subsequent commit:

- FlinkNodeConverterFactory now ships the three built-in converters
  (RexInputRefConverter, RexLiteralConverter, RexCallConverter) on the
  singleton at class load, so production callers reach a usable
  factory without explicit registration. Tests creating fresh
  instances via the package-private constructor stay unaffected.

- FlinkAuronConfiguration gains FAIL_BACK_FLINK_ENGINE_ENABLED
  (boolean, default true), keyed auron.failback.flink.engine.enabled.
  Controls whether conversion failure falls back to the Flink engine
  silently (default) or fails the job at submit time.

Tests: FlinkAuronConfigurationTest 2/2, FlinkNodeConverterFactoryTest
9/9, checkstyle 0 violations.

Issue: apache#1853

The shadowed class lives at the same FQCN as Flink's stock
StreamExecCalc; classpath ordering (auron-flink-planner ahead of
flink-table-planner via the standard auron-flink-assembly packaging)
makes the planner construct this class whenever it builds a Calc
ExecNode. Same pattern as Apache Gluten's gluten-flink.

At translateToPlanInternal time, the class attempts to translate its
projection + condition into a native Project[Filter?[FFIReader]] plan
using the converter framework. On success, constructs a
FlinkAuronCalcOperator inline and wraps it in a OneInputTransformation.
On any failure (unsupported RexNode or exception during composition),
falls back to Flink's stock CodeGenOperator via
super.translateToPlanInternal — gated by FAIL_BACK_FLINK_ENGINE_ENABLED
(default true falls back, false throws IllegalStateException).

Observability: WARN per unique unsupported RexNode class
(deduplicated), WARN per plan-composition exception (per-occurrence
with stack trace). Per-submission INFO summary deferred — Flink
1.18's PlannerBase exposes no clean submission-end hook.
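
One way the per-class WARN deduplication could look, as a sketch only: `warnOnce` and `reset` are hypothetical names, `System.err` stands in for the logger, and the thread-local set follows the ThreadLocal<Set> approach mentioned in the PR description.

```java
import java.util.HashSet;
import java.util.Set;

final class WarnDedupSketch {
    // One set of already-reported classes per thread.
    private static final ThreadLocal<Set<Class<?>>> WARNED =
            ThreadLocal.withInitial(HashSet::new);

    /** Emits a warning the first time a class is seen on this thread; returns whether it emitted. */
    static boolean warnOnce(int nodeId, Class<?> unsupported) {
        if (!WARNED.get().add(unsupported)) {
            return false; // already reported on this thread
        }
        System.err.println("fallback (node " + nodeId + "): unsupported " + unsupported.getName());
        return true;
    }

    /** Clears this thread's dedup state, e.g. between tests. */
    static void reset() {
        WARNED.remove();
    }

    public static void main(String[] args) {
        warnOnce(17, String.class);  // emits
        warnOnce(18, String.class);  // suppressed: same class
        warnOnce(18, Integer.class); // emits: new class
    }
}
```

`Set.add` returning `false` for an existing element does the check and the insert in one step.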

Also: NativeRuntimeFactory now extends java.io.Serializable
(@VisibleForTesting interface). Without the marker, Flink's operator
dispatch to TaskManagers throws NotSerializableException — a latent
defect in apache#1857 that the E2E ITCase in the next commit will exercise.
Marker interfaces are non-breaking.
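
The effect of the Serializable marker can be reproduced with plain Java serialization. In this sketch, `PlainFactory`, `SerializableFactory`, and `Operator` are hypothetical stand-ins for the real types; it shows only that a lambda serializes when its target interface extends `java.io.Serializable` and fails otherwise.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class SerializableFactorySketch {
    /** Factory interface without the marker. */
    interface PlainFactory {
        String create();
    }

    /** Same contract, plus the Serializable marker. */
    interface SerializableFactory extends PlainFactory, Serializable {}

    /** Stand-in for an operator that is shipped to workers together with its factory. */
    static final class Operator implements Serializable {
        private static final long serialVersionUID = 1L;
        final PlainFactory factory;

        Operator(PlainFactory factory) {
            this.factory = factory;
        }
    }

    static Operator withPlainFactory() {
        PlainFactory factory = () -> "runtime";
        return new Operator(factory);
    }

    static Operator withSerializableFactory() {
        SerializableFactory factory = () -> "runtime";
        return new Operator(factory);
    }

    /** Returns false if Java serialization rejects the object graph. */
    static boolean canSerialize(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(withPlainFactory()));
        System.out.println(canSerialize(withSerializableFactory()));
    }
}
```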

Tests: StreamExecCalcTest 12/12, FlinkAuronCalcOperatorTest 14/14,
checkstyle 0 violations.

Issue: apache#1853

…QL coverage

Four tests exercising distinct paths of the shadowed StreamExecCalc:

- testMultiColumnArithmeticProjection — Auron path with multi-expression
  projection (`int + 1, int * 2`).
- testFilterAndProjectEndToEnd — Calc-with-condition; GREATER_THAN is
  not yet converter-supported, so this verifies fallback-path
  correctness. The Auron-side Filter[FFIReader] plan-shape coverage
  lands once a predicate-returning converter does.
- testFallbackOnUnsupportedExprStillExecutes — UPPER(string) triggers
  silent fallback; the job emits the correct UPPERed rows.
- testMixedSupportedAndUnsupportedCalcs — UNION ALL of one convertible
  and one non-convertible Calc; verifies per-Calc granularity at the
  job-level correctness layer.

No duplicate with AuronFlinkCalcITCase.testPlus (single-expression
arithmetic). Two of the four tests pass without the native library;
the other two share testPlus's native-lib prerequisite.

Tests: compile clean, checkstyle 0 violations.

Issue: apache#1853
@github-actions github-actions Bot added the flink label May 24, 2026
…ecCalc binding

The shadowed StreamExecCalc shares its FQCN with Flink's stock class.
Maven's scala-maven-plugin testCompile classpath ordering is not always
deterministic across environments: on Linux CI runners, javac resolves
StreamExecCalc to flink-table-planner_2.12-1.18.1.jar instead of the
local target/classes, so symbols only present on the shadow
(peekWarnEmitCountForTest, translateToFlinkCalc) are not visible at
compile time.

- Add invokeStaticInt helper and reach peekWarnEmitCountForTest via
  reflection, matching the existing pattern used for
  resetWarnDedupForTest.
- Drop @Override on CapturingTranslator.translateToFlinkCalc so javac
  no longer requires the parent class to declare it. Runtime virtual
  dispatch is unaffected: the loaded StreamExecCalc is the shadow,
  signatures match, and translateToPlanInternal's invocation routes
  to the subclass override as before.
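
A reflection helper of the kind described in the first bullet might look like the following. The signature of `invokeStaticInt` here is an assumption (the commit message names the helper but does not show it), and `Target` is a hypothetical class standing in for the shadowed StreamExecCalc.

```java
import java.lang.reflect.Method;

final class ReflectiveAccessSketch {
    /** Hypothetical class exposing a test-only static accessor. */
    static final class Target {
        private static int warnCount = 2;

        static int peekWarnEmitCountForTest() {
            return warnCount;
        }
    }

    /**
     * Looks up a no-arg static method by name at runtime, so the caller does not
     * need the symbol to be visible on the compile-time classpath.
     */
    static int invokeStaticInt(Class<?> owner, String methodName) {
        try {
            Method method = owner.getDeclaredMethod(methodName);
            method.setAccessible(true);
            return (Integer) method.invoke(null); // null receiver: static method
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                    "Cannot invoke " + owner.getName() + "#" + methodName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(invokeStaticInt(Target.class, "peekWarnEmitCountForTest"));
    }
}
```

The trade-off is that a renamed or removed method now fails at test runtime rather than at compile time, which is why the helper wraps the reflective exception with the class and method name.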

Tested: StreamExecCalcTest 12/12 locally on JDK 8 + JDK 11; spotless
and checkstyle clean; isolated javac run against only the stock Flink
JAR (no local target/classes) compiles the test cleanly, reproducing
the CI classpath condition.