Conversation

@binmahone (Collaborator) commented on Dec 22, 2025:

  • Add per-task cache for GpuScalar in GpuLiteral to reuse identical scalars
  • Make GpuScalar thread-safe with volatile fields and double-checked locking
  • Throw IllegalStateException instead of NullPointerException when closed
  • Register cleanup callback on TaskContext completion

This PR is broken out from #14032; please see that PR for more details.

Copilot AI review requested due to automatic review settings on December 22, 2025 14:14
@binmahone force-pushed the 251222_gpu_literal_cache branch from 480e346 to ef7a060 on December 22, 2025 14:16
greptile-apps bot (Contributor) commented on Dec 22, 2025:

Greptile Summary

This PR optimizes GpuLiteral by introducing a per-task cache for GpuScalar instances, reusing identical scalars within the same Spark task to reduce memory allocation and GPU data copies. The implementation adds thread-safety improvements to GpuScalar with volatile fields and double-checked locking in the getBase and getValue methods, enabling fast-path access when values are already initialized. The cache is cleaned up via TaskContext completion callbacks, and the code falls back to creating new scalars in unit tests, where TaskContext is unavailable. Exception handling is improved by throwing IllegalStateException instead of NullPointerException when accessing closed scalars.
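As a concrete illustration, here is a minimal sketch of the caching shape this summary describes. The key type, the getOrCreate name, and the raw TaskContext listener are assumptions made for the sketch; the PR's own code differs in its details, and a later review notes it registers cleanup through ScalableTaskCompletion.

import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.TaskContext

object PerTaskScalarCacheSketch {
  // taskId -> (literal key -> cached scalar). ConcurrentHashMap keeps lookups
  // thread-safe when several threads run inside the same task.
  private val taskScalarCache =
    new ConcurrentHashMap[Long, ConcurrentHashMap[Any, AutoCloseable]]()

  def getOrCreate(literalKey: Any, create: () => AutoCloseable): AutoCloseable = {
    val tc = TaskContext.get() // callers fall back to create() when this is null
    val taskId = tc.taskAttemptId()
    val perTask = taskScalarCache.computeIfAbsent(taskId, _ => {
      // First access for this task: register a completion callback that
      // closes every cached scalar and drops the task's entry.
      tc.addTaskCompletionListener[Unit] { _ =>
        val cached = taskScalarCache.remove(taskId)
        if (cached != null) {
          cached.values().forEach(_.close())
        }
      }
      new ConcurrentHashMap[Any, AutoCloseable]()
    })
    // Identical literals hit the same entry, so each distinct literal pays
    // for at most one device allocation per task.
    perTask.computeIfAbsent(literalKey, _ => create())
  }
}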

Confidence Score: 4/5

  • This PR is safe to merge with minor considerations around task context behavior
  • The implementation correctly introduces per-task caching with proper cleanup mechanisms via TaskContext callbacks. Thread-safety is improved with volatile fields and double-checked locking. The fast/slow path optimization in getBase and getValue is sound. Tests are updated appropriately to verify the new exception types. The primary consideration is ensuring correct refCount management when using incRefCount on cached scalars, but the implementation appears correct with proper synchronization.
  • No files require special attention - all changes are focused and well-tested

Sequence Diagram

sequenceDiagram
    participant Task as Spark Task
    participant GL as GpuLiteral
    participant Cache as GpuLiteral.taskScalarCache
    participant GS as GpuScalar
    participant TC as TaskContext

    Task->>GL: columnarEvalAny()
    GL->>TC: TaskContext.get()
    alt Has TaskContext
        GL->>Cache: getOrCreateCachedScalar(literal)
        alt First access for task
            Cache->>TC: register cleanup callback
            Cache->>Cache: create new ConcurrentHashMap
        end
        alt Scalar exists in cache
            Cache-->>GS: return cached GpuScalar
        else Scalar not in cache
            Cache->>GS: create new GpuScalar
            Cache->>Cache: store in cache
        end
        GS->>GS: incRefCount()
        GS-->>GL: return cached scalar
    else No TaskContext (unit test)
        GL->>GS: create new GpuScalar
        GS-->>GL: return new scalar
    end
    GL-->>Task: return GpuScalar

    Task->>Task: task completes
    TC->>Cache: cleanup callback triggered
    Cache->>Cache: remove taskId entry
    Cache->>GS: close all scalars in task cache


greptile-apps bot left a comment:

2 files reviewed, 1 comment

@binmahone (Collaborator, Author) commented:

build

greptile-apps bot commented on Dec 22, 2025:

Greptile Summary

This PR optimizes GpuLiteral by adding a per-task cache for GpuScalar and improving thread-safety. The changes include: (1) introducing a per-task cache (taskScalarCache) that reuses GpuScalar objects for identical literals within a task, (2) making GpuScalar thread-safe by adding @volatile annotations and implementing double-checked locking in getter methods, (3) changing exception types from NullPointerException to IllegalStateException when accessing closed scalars, and (4) registering cleanup callbacks via ScalableTaskCompletion to properly close cached scalars on task completion.

The implementation leverages Spark's TaskContext for lifecycle management and uses ConcurrentHashMap for thread-safe cache storage. However, there is a potential race condition in the isValid() method, which accesses the value field without proper synchronization to detect the closed state.
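For reference, a minimal sketch of the double-checked locking shape described here, assuming a close()-able base resource; the names mirror this summary rather than the PR's exact code. Note that the closed check throws IllegalStateException on both the fast and the locked path, which is exactly the consistency the isValid() concern is about:

class ScalarHolderSketch(buildBase: () => AutoCloseable) {
  @volatile private var closed = false
  @volatile private var base: AutoCloseable = null

  def getBase: AutoCloseable = {
    if (closed) {
      throw new IllegalStateException("The GpuScalar has been closed")
    }
    val fast = base // single volatile read
    if (fast != null) {
      // Fast path: already initialized, no lock taken. Callers are expected
      // to hold a refcount that keeps close() from racing this read.
      fast
    } else {
      synchronized {
        // Re-check under the lock: another thread may have closed or
        // initialized the holder since the reads above.
        if (closed) {
          throw new IllegalStateException("The GpuScalar has been closed")
        }
        if (base == null) {
          base = buildBase() // slow path: first caller materializes the scalar
        }
        base
      }
    }
  }

  def close(): Unit = synchronized {
    if (!closed) {
      closed = true
      if (base != null) {
        base.close()
        base = null
      }
    }
  }
}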

Confidence Score: 3/5

  • This PR introduces performance optimizations and thread-safety improvements, but contains a race condition in the isValid() method that needs to be fixed before merging.
  • The score reflects the significant functionality improvement from per-task caching and proper cleanup via task completion callbacks. However, the isValid() method accesses volatile fields without synchronization after close() sets them to null, creating a potential race condition. The double-checked locking pattern in getBase() and getValue() is correctly implemented, but the incomplete synchronization in isValid() prevents a higher confidence score. The test expectations also show an inconsistency: isValid() is expected to throw NullPointerException but the method doesn't explicitly handle the closed state like other getters do.
  • The isValid() method in literals.scala requires attention to properly synchronize access to volatile fields when the scalar is in a closed state.

Important Files Changed

sql-plugin/src/main/scala/com/nvidia/spark/rapids/literals.scala
    Adds a per-task GpuScalar cache and thread-safety with volatile fields and double-checked locking. Key improvements: per-task caching in GpuLiteral to reuse identical scalars, volatile fields for thread-safe access, and double-checked locking in the getter methods. However, the isValid() method has a potential race condition, accessing the closed state without synchronization.

tests/src/test/scala/com/nvidia/spark/rapids/unit/GpuScalarUnitTest.scala
    Updated to reflect the exception type change from NullPointerException to IllegalStateException for getBase(), getValue(), and isNan() when called on a closed GpuScalar. The isValid() method still expects NullPointerException, which aligns with the current behavior but may need review if the implementation changes.

Sequence Diagram

sequenceDiagram
    participant Task as Spark Task
    participant GpuLiteral as GpuLiteral
    participant Cache as taskScalarCache
    participant TaskCtx as TaskContext
    participant GpuScalar as GpuScalar
    
    Task->>GpuLiteral: columnarEvalAny(batch)
    GpuLiteral->>TaskCtx: get()
    TaskCtx-->>GpuLiteral: taskContext
    GpuLiteral->>Cache: getOrCreateCachedScalar(literal)
    Cache->>Cache: computeIfAbsent(taskId)
    Cache->>TaskCtx: addTaskCompletionListener
    Cache->>GpuScalar: create(value, dataType)
    GpuScalar-->>Cache: cachedScalar
    Cache-->>GpuLiteral: cachedScalar
    GpuLiteral->>GpuScalar: incRefCount
    GpuScalar-->>GpuLiteral: this
    GpuLiteral-->>Task: GpuScalar
    
    Task->>Task: task completion
    TaskCtx->>Cache: onTaskCompletion callback
    Cache->>GpuScalar: close()
    Cache->>Cache: remove(taskId)

Copilot AI left a comment:

Pull request overview

This PR optimizes GpuLiteral by introducing a per-task cache for GpuScalar instances, allowing identical literals within the same task to share the same underlying GPU scalar. The changes also enhance the thread-safety of GpuScalar through volatile fields and double-checked locking, and improve error handling by throwing IllegalStateException instead of NullPointerException for most closed-state accesses.

Key changes:

  • Adds per-task GpuScalar caching with automatic cleanup on task completion
  • Makes GpuScalar thread-safe with volatile fields and synchronized access
  • Changes exception types from NullPointerException to IllegalStateException for most methods when accessing closed scalars

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

tests/src/test/scala/com/nvidia/spark/rapids/unit/GpuScalarUnitTest.scala
    Updates the copyright year and modifies test expectations to assert IllegalStateException instead of NullPointerException for most methods when accessing closed GpuScalar instances.

sql-plugin/src/main/scala/com/nvidia/spark/rapids/literals.scala
    Implements per-task GpuScalar caching in GpuLiteral, adds thread-safety to GpuScalar with volatile fields and double-checked locking, integrates cleanup via ScalableTaskCompletion, and updates columnarEvalAny to use cached scalars when a TaskContext is available.


-    assertThrows[NullPointerException](gsv.getValue)
+    assertThrows[IllegalStateException](gsv.getBase)
+    assertThrows[IllegalStateException](gsv.getValue)
     assertThrows[NullPointerException](gsv.isValid)
Copilot AI commented on Dec 22, 2025:
The isValid method still throws NullPointerException after close, but other methods like getBase, getValue, and isNan now throw IllegalStateException. This inconsistency should be addressed. The isValid method accesses the value field directly via value.map(), which will throw NullPointerException when value is null (after close), while the updated methods explicitly check for null and throw IllegalStateException.
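One possible way to resolve the inconsistency, sketched under the assumption that value is a volatile Option field that close() sets to null, as this comment describes; the final expression is a placeholder, not the PR's actual validity logic:

def isValid: Boolean = {
  val v = value // single read of the volatile field
  if (v == null) {
    // close() nulled the field; fail the same way getBase/getValue now do
    throw new IllegalStateException("The GpuScalar has been closed")
  }
  v.isDefined // placeholder for the real validity computation
}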

Comment on lines 782 to 791
 override def columnarEvalAny(batch: ColumnarBatch): Any = {
   // Returns a Scalar instead of the value to support the scalar of nested type, and
   // simplify the handling of result from a `expr.columnarEval`.
-  GpuScalar(value, dataType)
+  // When there's no TaskContext (e.g., unit tests), create a new GpuScalar
+  // each time like original behavior, because there's no cleanup callback
+  // to properly close the cached scalar. Otherwise use the cached scalar.
+  if (TaskContext.get() == null) {
+    GpuScalar(value, dataType)
+  } else {
+    cachedScalar.incRefCount
+  }
 }
Copilot AI commented on Dec 22, 2025:
The new per-task GpuScalar caching mechanism introduced in columnarEvalAny lacks test coverage. Tests should verify: (1) that identical GpuLiterals in the same task share the same GpuScalar instance, (2) that scalars from different tasks are separate, (3) that reference counting works correctly with caching, and (4) that cached scalars are properly cleaned up at task completion.
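A rough shape for that coverage, as hypothetical test code: runInTask is assumed scaffolding that runs a closure inside a real TaskContext, and getBase reference identity stands in for "same cached scalar".

import org.apache.spark.sql.types.IntegerType
import org.scalatest.funsuite.AnyFunSuite
import com.nvidia.spark.rapids.{GpuLiteral, GpuScalar}

class GpuLiteralCacheSuite extends AnyFunSuite {
  test("identical literals in one task share the underlying scalar") {
    runInTask {
      val a = GpuLiteral(1, IntegerType).columnarEvalAny(null).asInstanceOf[GpuScalar]
      val b = GpuLiteral(1, IntegerType).columnarEvalAny(null).asInstanceOf[GpuScalar]
      try {
        assert(a.getBase eq b.getBase) // one cudf scalar shared via the cache
      } finally {
        a.close() // each handle is refcounted and must be closed by its user
        b.close()
      }
    }
  }

  // Hypothetical helper; a real version would set up a TaskContext for the
  // body and trigger the completion callbacks afterwards.
  private def runInTask(body: => Unit): Unit = body
}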

@revans2 (Collaborator) left a comment:
I would like to see some performance numbers for this optimization.

// Per-task cache: taskId -> (GpuLiteral -> GpuScalar)
// GpuLiterals with same value and dataType in the same task share the same GpuScalar.
// GpuLiteral's equals/hashCode already handles Array[Byte] correctly.
private val taskScalarCache =
A collaborator commented:
nit: Can we please have a follow-on issue to look at making some, or all, of this spillable? Not that I actually want to spill a Literal, but more around: if we do run out of memory and this is causing fragmentation, then I want to be able to remove and close things in the cache to try and let more stuff succeed.
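Purely to illustrate the follow-on idea, a sketch assuming a refcounted scalar type where refCount == 1 means only the cache still holds a reference; a real version would also have to guard against racing incRefCount calls:

import java.util.concurrent.ConcurrentHashMap

trait RefCountedScalar extends AutoCloseable {
  def refCount: Int
}

object CacheEvictionSketch {
  // Walk the per-task cache and close any scalar that no task thread
  // currently holds, removing its entry as we go.
  def closeIdleCachedScalars(
      cache: ConcurrentHashMap[Long, ConcurrentHashMap[Any, RefCountedScalar]]): Unit = {
    cache.values().forEach { perTask =>
      perTask.entrySet().removeIf { entry =>
        val scalar = entry.getValue
        val idle = scalar.refCount == 1 // only the cache's own reference remains
        if (idle) {
          scalar.close() // free the device memory to relieve pressure
        }
        idle
      }
    }
  }
}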

@abellina (Collaborator) commented on Dec 22, 2025:

@binmahone I agree with @revans2. We need perf numbers for this.

I like the idea of caching scalars, and with the async allocator this strategy should be OK, since we will eventually see defragmentation. Otherwise, with pool_memory_resource or arena_memory_resource, I could see the non-spillable scalar creating holes in the address space and causing fragmentation that we would really like to stay away from. Some of the allocators we have do move tiny blocks to a specific area (local arenas in the arena case), but I don't remember if pool does something similar. We are more likely to use arena than pool, at this rate.

For example, if your pattern was: allocate 1 GB, then a scalar, then allocate 1 GB, and you OOM, you would free the 1 GB allocations but not the scalar, leaving a hole in non-defragmenting allocators. A second task wanting to allocate a bit above 1 GB would fail, unless the async allocator defragmented.

Could we do the cache opportunistically? Could we allocate, say, 8 MB up front and have something like the tryAllocPinned API, where we would cacheScalarIfPossible if it fits in the scalar cache? Allocating a single block at the beginning of time means we have less available overall, so we would want to make it relatively small, but it works around the hole problem.

I was going to suggest making this spillable. I don't know how to do that, actually: because you are proposing reuse, every scalar we hand out we are going to incRefCount, so there's no way to spill it if one or more threads own it. Unfortunately. This last piece makes me propose more strongly that there should be a scalar block of memory we allocate up front, and we go there when we can.
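A sketch of what that opportunistic, pre-allocated direction could look like; every name here, including cacheScalarIfPossible and the 8 MB budget taken from the comment above, is illustrative, not an existing API:

import java.util.concurrent.atomic.AtomicLong

object ScalarBlockSketch {
  // A small fixed budget reserved for cached scalars, so the cache can never
  // punch large holes into the rest of the address space.
  private val budgetBytes = 8L * 1024 * 1024
  private val usedBytes = new AtomicLong(0)

  // Returns true if the scalar was cached; false means the caller should use
  // an ordinary, uncached scalar and close it itself.
  def cacheScalarIfPossible(sizeBytes: Long)(cachePut: () => Unit): Boolean = {
    val prev = usedBytes.getAndAdd(sizeBytes)
    if (prev + sizeBytes <= budgetBytes) {
      cachePut()
      true
    } else {
      usedBytes.getAndAdd(-sizeBytes) // over budget: roll back and skip caching
      false
    }
  }
}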

@binmahone (Collaborator, Author) commented:
I think a scalar block of memory is a good idea. Let me find a way to manage all scalars in this block.

@abellina (Collaborator) left a comment:
Need to update the copyright. Also, you were going to try other things and hopefully post perf numbers. Let us know your status here when you get a chance.

@sameerz added the performance label on Jan 4, 2026
nvauto (Collaborator) commented on Jan 26, 2026:

NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release.
