Optimize GpuLiteral with per-task GpuScalar cache #14049
base: main
Conversation
- Add per-task cache for GpuScalar in GpuLiteral to reuse identical scalars
- Make GpuScalar thread-safe with volatile fields and double-check locking
- Throw IllegalStateException instead of NullPointerException when closed
- Register cleanup callback on TaskContext completion

Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
Force-pushed from 480e346 to ef7a060
Greptile Summary: This PR optimizes GpuLiteral by introducing a per-task cache for GpuScalar instances, reusing identical scalars within the same Spark task to reduce memory allocation and GPU data copies. The implementation adds thread-safety improvements to GpuScalar with volatile fields and double-check locking patterns.

Confidence Score: 4/5
Sequence Diagram

```mermaid
sequenceDiagram
    participant Task as Spark Task
    participant GL as GpuLiteral
    participant Cache as GpuLiteral.taskScalarCache
    participant GS as GpuScalar
    participant TC as TaskContext
    Task->>GL: columnarEvalAny()
    GL->>TC: TaskContext.get()
    alt Has TaskContext
        GL->>Cache: getOrCreateCachedScalar(literal)
        alt First access for task
            Cache->>TC: register cleanup callback
            Cache->>Cache: create new ConcurrentHashMap
        end
        alt Scalar exists in cache
            Cache-->>GS: return cached GpuScalar
        else Scalar not in cache
            Cache->>GS: create new GpuScalar
            Cache->>Cache: store in cache
        end
        GS->>GS: incRefCount()
        GS-->>GL: return cached scalar
    else No TaskContext (unit test)
        GL->>GS: create new GpuScalar
        GS-->>GL: return new scalar
    end
    GL-->>Task: return GpuScalar
    Task->>Task: task completes
    TC->>Cache: cleanup callback triggered
    Cache->>Cache: remove taskId entry
    Cache->>GS: close all scalars in task cache
```
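A minimal sketch of the lookup path this diagram implies is shown below. The names getOrCreateCachedScalar and taskScalarCache appear in the diff later in this thread; the exact types and the registerCleanup helper are assumptions for illustration, not the PR's actual code:

```scala
import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.TaskContext

// Assumed shape: taskId -> (GpuLiteral -> GpuScalar)
private val taskScalarCache =
  new ConcurrentHashMap[java.lang.Long, ConcurrentHashMap[GpuLiteral, GpuScalar]]()

private def getOrCreateCachedScalar(lit: GpuLiteral): GpuScalar = {
  val tc = TaskContext.get()
  val taskCache = taskScalarCache.computeIfAbsent(tc.taskAttemptId(),
    (_: java.lang.Long) => {
      // First access for this task: register cleanup (hypothetical helper,
      // sketched after the second diagram below) and create the per-task map.
      registerCleanup(tc)
      new ConcurrentHashMap[GpuLiteral, GpuScalar]()
    })
  // Equal literals (same value and dataType) resolve to the same scalar.
  taskCache.computeIfAbsent(lit, (l: GpuLiteral) => GpuScalar(l.value, l.dataType))
}
```

Per the diagram, the caller then calls incRefCount on the returned scalar, so every consumer holds its own reference and can close it independently of the cache.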
2 files reviewed, 1 comment
build
Greptile Summary: This PR optimizes GpuLiteral by adding a per-task cache for GpuScalar and improving thread-safety. The changes include: (1) introducing a per-task cache (taskScalarCache) in GpuLiteral so identical literals within a task share one scalar, (2) making GpuScalar thread-safe with volatile fields and double-check locking, and (3) throwing IllegalStateException instead of NullPointerException when accessing closed scalars. The implementation leverages Spark's task-completion listeners to clean up cached scalars when a task finishes.

Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Task as Spark Task
    participant GpuLiteral as GpuLiteral
    participant Cache as taskScalarCache
    participant TaskCtx as TaskContext
    participant GpuScalar as GpuScalar
    Task->>GpuLiteral: columnarEvalAny(batch)
    GpuLiteral->>TaskCtx: get()
    TaskCtx-->>GpuLiteral: taskContext
    GpuLiteral->>Cache: getOrCreateCachedScalar(literal)
    Cache->>Cache: computeIfAbsent(taskId)
    Cache->>TaskCtx: addTaskCompletionListener
    Cache->>GpuScalar: create(value, dataType)
    GpuScalar-->>Cache: cachedScalar
    Cache-->>GpuLiteral: cachedScalar
    GpuLiteral->>GpuScalar: incRefCount
    GpuScalar-->>GpuLiteral: this
    GpuLiteral-->>Task: GpuScalar
    Task->>Task: task completion
    TaskCtx->>Cache: onTaskCompletion callback
    Cache->>GpuScalar: close()
    Cache->>Cache: remove(taskId)
```
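The cleanup half of the lifecycle could be sketched as follows. The PR reportedly wires this through the plugin's ScalableTaskCompletion utility; a plain Spark TaskContext listener is used here as a stand-in:

```scala
import org.apache.spark.TaskContext

// Hypothetical registerCleanup used in the earlier sketch: when the task
// finishes, drop its cache entry and close every scalar cached for it.
private def registerCleanup(tc: TaskContext): Unit = {
  tc.addTaskCompletionListener[Unit] { _ =>
    val taskCache = taskScalarCache.remove(tc.taskAttemptId())
    if (taskCache != null) {
      taskCache.values().forEach(s => s.close())
    }
  }
}
```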
Pull request overview
This PR optimizes GpuLiteral by introducing a per-task cache for GpuScalar instances, allowing identical literals within the same task to share the same underlying GPU scalar. The changes also enhance thread-safety of GpuScalar through volatile fields and double-check locking, and improve error handling by throwing IllegalStateException instead of NullPointerException for most closed-state access.
Key changes:
- Adds per-task GpuScalar caching with automatic cleanup on task completion
- Makes GpuScalar thread-safe with volatile fields and synchronized access (see the locking sketch after this list)
- Changes exception types from NullPointerException to IllegalStateException for most methods when accessing closed scalars
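For readers unfamiliar with the pattern, here is an illustrative, generic sketch of volatile-plus-double-check locking; the class and field names are inventions for this example, not the PR's actual code:

```scala
// Illustrative only: a volatile field plus double-check locking so exactly
// one thread builds the underlying value and others see it safely published.
final class LazyRef[T <: AnyRef](build: () => T) {
  @volatile private var cached: T = null.asInstanceOf[T]

  def get: T = {
    var v = cached          // first, unlocked read of the volatile field
    if (v == null) {
      synchronized {
        v = cached          // re-check under the lock
        if (v == null) {
          v = build()       // only one thread ever runs the build
          cached = v
        }
      }
    }
    v
  }
}
```

The volatile read keeps the fast path lock-free once the value is published, while the second check under the lock guarantees a single initialization.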
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/src/test/scala/com/nvidia/spark/rapids/unit/GpuScalarUnitTest.scala | Updates copyright year and modifies test expectations to assert IllegalStateException instead of NullPointerException for most methods when accessing closed GpuScalar instances |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/literals.scala | Implements per-task GpuScalar caching in GpuLiteral, adds thread-safety to GpuScalar with volatile fields and double-check locking, integrates cleanup via ScalableTaskCompletion, and updates columnarEvalAny to use cached scalars when TaskContext is available |
```diff
-assertThrows[NullPointerException](gsv.getValue)
+assertThrows[IllegalStateException](gsv.getBase)
+assertThrows[IllegalStateException](gsv.getValue)
 assertThrows[NullPointerException](gsv.isValid)
```
Copilot AI commented on Dec 22, 2025:
The isValid method still throws NullPointerException after close, but other methods like getBase, getValue, and isNan now throw IllegalStateException. This inconsistency should be addressed. The isValid method accesses the value field directly via value.map(), which will throw NullPointerException when value is null (after close), while the updated methods explicitly check for null and throw IllegalStateException.
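A possible fix in the spirit of the other updated methods is to add the same explicit guard before touching value. This is a sketch; the tail of the method is an assumption, since the original body is not shown here:

```scala
// Sketch: use the same closed-state guard the other accessors now use,
// instead of letting value.map(...) hit an NPE once `value` is nulled.
def isValid: Boolean = {
  val v = value // volatile read; null once the scalar has been closed
  if (v == null) {
    throw new IllegalStateException("The GpuScalar has been closed")
  }
  v.map(_ != null).getOrElse(getBase.isValid) // tail here is an assumption
}
```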
```diff
 override def columnarEvalAny(batch: ColumnarBatch): Any = {
   // Returns a Scalar instead of the value to support the scalar of nested type, and
   // simplify the handling of result from a `expr.columnarEval`.
-  GpuScalar(value, dataType)
+  // When there's no TaskContext (e.g., unit tests), create a new GpuScalar
+  // each time like original behavior, because there's no cleanup callback
+  // to properly close the cached scalar. Otherwise use the cached scalar.
+  if (TaskContext.get() == null) {
+    GpuScalar(value, dataType)
+  } else {
+    cachedScalar.incRefCount
+  }
 }
```
Copilot AI commented on Dec 22, 2025:
The new per-task GpuScalar caching mechanism introduced in columnarEvalAny lacks test coverage. Tests should verify: (1) that identical GpuLiterals in the same task share the same GpuScalar instance, (2) that scalars from different tasks are separate, (3) that reference counting works correctly with caching, and (4) that cached scalars are properly cleaned up at task completion.
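A rough shape for the first of those tests might be the following; the suite name, the Arm.withResource import, and the TaskContext setup (elided) are assumptions based on typical spark-rapids unit tests:

```scala
import com.nvidia.spark.rapids.Arm.withResource
import org.apache.spark.sql.types.IntegerType
import org.scalatest.funsuite.AnyFunSuite

class GpuLiteralCacheSuite extends AnyFunSuite {
  // Sketch only: exercising the cache path needs a live TaskContext
  // (a real task or a test harness that installs one), elided here.
  test("identical GpuLiterals in the same task share one GpuScalar") {
    val lit = GpuLiteral(1, IntegerType)
    withResource(lit.columnarEvalAny(null).asInstanceOf[GpuScalar]) { s1 =>
      withResource(lit.columnarEvalAny(null).asInstanceOf[GpuScalar]) { s2 =>
        assert(s1 eq s2) // incRefCount returns the same cached instance
      }
    }
  }
}
```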
revans2 left a comment:
I would like to see some performance numbers for this optimization.
```scala
// Per-task cache: taskId -> (GpuLiteral -> GpuScalar)
// GpuLiterals with same value and dataType in the same task share the same GpuScalar.
// GpuLiteral's equals/hashCode already handles Array[Byte] correctly.
private val taskScalarCache =
```
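The declaration above is truncated in the diff view. Given the inline comment and the ConcurrentHashMap mentioned in the sequence diagrams, a plausible completion (an assumption, not the PR's exact code) is:

```scala
import java.util.concurrent.ConcurrentHashMap

// Assumed completion of the truncated declaration: outer key is the task
// id, the inner map deduplicates equal GpuLiterals within that task.
private val taskScalarCache =
  new ConcurrentHashMap[java.lang.Long, ConcurrentHashMap[GpuLiteral, GpuScalar]]()
```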
nit: Can we please have a follow-on issue to look at making some, or all, of this spillable? Not that I actually want to spill a Literal, but if we do run out of memory and this is causing fragmentation, I want to be able to remove and close things in the cache to try to let more stuff succeed.
@binmahone agree with @revans2. We need perf numbers for this. I like the idea of caching scalars, and with the async allocator this strategy should be OK, since we will eventually see defragmentation. Otherwise you can leave holes in non-defragmenting allocators. For example, if your pattern was allocate 1GB, then a scalar, then allocate 1GB, and you OOM, you would free the 1GB allocations but not the scalar, leaving a hole. A second task wanting to allocate a bit above 1GB would fail, unless async defragmented.

Could we do the cache opportunistically? Could we allocate up front, say 8MB, and place scalars there when we can?

I was going to suggest making this spillable. I don't know how to do that, actually. Because you are proposing reuse, every scalar we hand out we are going to incRefCount, so there's no way to spill it if one or more threads owns it. Unfortunately. This last piece makes me propose more strongly that there should be a scalar block of memory we allocate up front and we go there when we can.
abellina left a comment:
Need to update the copyright; also, you were going to try some other things and hopefully post perf numbers. Let us know your status here when you get a chance.
NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release.
This PR is broken out from #14032; please visit that PR for more details.