Do Not Merge: Integration Branch for GT4Py Next #1
Draft
philip-paul-mueller wants to merge 64 commits into main
Conversation
This was referenced Apr 30, 2025
philip-paul-mueller added a commit to GridTools/gt4py that referenced this pull request on Apr 30, 2025
Instead of pulling directly from the official DaCe repo, we now (for the time being) pull from [this PR](GridTools/dace#1). This became necessary because we have a lot of open PRs in DaCe and need some custom fixes (which, in their current form, cannot be merged into DaCe). In the long term, however, we should switch back to the main DaCe repo.
…#2294) This PR enables enumerations to contain attributes via definition as a dataclass. It is also better than the previous `aenum`-based approach for type checkers and IDEs, as it transparently keeps the enumeration members. This feature will be useful for nesting attributes and methods into the classes themselves, improving extensibility. It also enables support for dataclass serialization/deserialization, and removes `aenum` as a requirement.

The syntax is as follows (for example):

```python
from dace.attr_enum import ExtensibleAttributeEnum
from enum import auto

class ScheduleType(ExtensibleAttributeEnum):
    Default = auto()     #: Scope-default parallel schedule
    Sequential = auto()  #: Sequential code (single-thread)
    MPI = auto()         #: MPI processes

    @DataClass(frozen=True)
    class CPU_Multicore:
        omp_schedule_type: OMPScheduleType = OMPScheduleType.Default
        # ...
```

Setting `CPU_Multicore = CPUData` to an external dataclass is also possible. As a result, `ScheduleType.CPU_Multicore` is now a _template_ enum member, and `CPU_Multicore(OMPScheduleType.Static)` is an instance. Registering a new template externally looks like:

```python
ScheduleType.register_template("CPU_Multicore", CPUData)
```
A student had a problem because `np.int8` maps to `char`. According to the C++ standard, `char` can be either signed or unsigned (https://en.cppreference.com/w/cpp/language/types.html). I propose we either use `int8_t` directly or `signed char`; I updated the dictionary according to this proposal.
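A minimal sketch of the idea behind the fix (the dictionary name and the string keys are illustrative; DaCe's actual table maps dtype objects, not names):

```python
# Hypothetical type-mapping table illustrating the proposed change:
# np.int8 maps to "signed char" (or "int8_t") instead of plain "char",
# whose signedness is implementation-defined in C++.
DTYPE_TO_CTYPE = {
    "int8": "signed char",     # was "char" before the fix
    "uint8": "unsigned char",
    "int16": "short",
    "int32": "int",
    "int64": "long long",
}

print(DTYPE_TO_CTYPE["int8"])
```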
After a brief discussion, `subsets.Indices` was deprecated last week with PR spcl#2282. Since then, many DaCe tests emit warnings because of remaining usage of `Indices` in the offset member functions of `subsets.Range`. This PR suggests adapting `Range.from_indices()` to add support for a sequence of (symbolic) numbers or strings (as suggested in Mattermost). This allows removing the remaining usage of `subsets.Indices` constructors in the DaCe codebase, which gets rid of a bunch of warnings emitted in test or upstream/user code.

The only hiccup I had doing this was the function `_add_read_slice()`, called from `visit_Subscript()` of the `ProgramVisitor` in `newast.py`. That function would check whether subsets are ranges or indices, and if subsets were indices, we'd take another code path. That code path separation is apparently loosely tied to some other place in the codebase, because we'd get errors if we took the sub-optimal ranges path with indices. I now check whether ranges are indices and set the flag accordingly, which seems to fix the issues in tests. I've also checked (manually) all other cases where we'd take a different code path in case subsets are indices. There are some, and the remaining ones all "upgrade" indices to ranges. They can be removed once we remove the deprecated `Indices` class.

---------

Co-authored-by: Roman Cattaneo <>
Co-authored-by: Tal Ben-Nun <tbennun@gmail.com>
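The conversion that such a `from_indices()` overload performs can be sketched in isolation (the function name and signature here are hypothetical, not DaCe's actual implementation):

```python
def range_from_indices(indices):
    """Convert a sequence of index expressions (numbers, or symbolic
    names given as strings) into degenerate ranges (start, end, stride)
    with start == end and stride 1, i.e. the range a single index
    selects."""
    return [(idx, idx, 1) for idx in indices]

# Each index becomes a one-element range:
print(range_from_indices(["i", 3]))
```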
This PR replaces `is_start_state` with `is_start_block` because the former is deprecated. The PR is part of an ongoing effort to reduce warnings emitted in tests. Unrelated to this change, the PR removes unused imports and fixes a couple of typos in changed files. --------- Co-authored-by: Roman Cattaneo <>
Updated GitHub Actions dependencies to the latest versions. No breakage is expected. I've checked the logs and most of them just updated to node20, which is a breaking change because it requires an up-to-date runner. Since we rely on GitHub's runners, this should be no problem. Co-authored-by: Roman Cattaneo <> Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
Reduces the number of warnings
The default configuration of CUDA MPS does not support the number of pytest workers (32) used by the CI job. Besides, CUDA MPS is not needed because the GPU is not configured in exclusive mode.
A student gets a CMake error with some older CMake versions.
```
CMake Error at CMakeLists.txt:191 (if):
if given arguments:
"3.28.3" "VERSION_LESS"
Unknown arguments specified
```
I think it is better to check whether the variable is defined first, so as not to trigger a missing-argument error.
`MapFusionVertical` must create new data (reduced versions of the intermediate data) and hence name it. Before it was using a naming scheme based on the node id, which might not be stable. The new scheme uses the name of the intermediate data and guarantees stable names for exclusive intermediate nodes and for shared intermediate nodes under the condition that they are only involved in one MapFusionVertical operation. --------- Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
spcl#2298) Get the number of warnings in tests down by avoiding usage of `state.add_*` functions like `state.add_array(...)`. --------- Co-authored-by: Roman Cattaneo <>
The `ControlFlowReachability` pass gets prohibitively expensive on particular graphs. Updating from `v1/maintenance` to current `main`, we have seen `simplify()` runtimes of 10-15 minutes where previous runtimes were on the order of tens of seconds. The slowdown turned out to be caused by not caching closures per region. Some of our graphs generate large, nested control flow regions (if statements) from iterative solvers with conditional returns that we map to if/else blocks with a boolean mask. In such a scenario, four layers of nesting are easily reached, and then `_region_closure()` gets called again and again for previously calculated closures of regions. Because of the transitive requirement of these closures, deeply nested control flow regions have to "go up" and re-evaluate the same closures for "upper" regions again and again. This PR suggests a simple cache of closures per region to avoid this duplicate evaluation. Co-authored-by: Roman Cattaneo <>
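The caching idea can be sketched generically (all names here are hypothetical; DaCe's `_region_closure()` works on SDFG control flow regions, not plain dicts):

```python
# Minimal sketch of caching per-region closures so that deeply nested
# regions do not recompute the closures of their ancestors on every query.
def make_closure_resolver(parent_of, direct_closure):
    cache = {}

    def region_closure(region):
        if region in cache:
            return cache[region]
        result = set(direct_closure(region))
        parent = parent_of(region)
        if parent is not None:
            # Reuses the cached ancestor closure instead of re-walking it.
            result |= region_closure(parent)
        cache[region] = result
        return result

    return region_closure

# Example: region C nested in B, which is nested in A.
parents = {"A": None, "B": "A", "C": "B"}
direct = {"A": {"x"}, "B": {"y"}, "C": {"z"}}
closure = make_closure_resolver(parents.get, lambda r: direct[r])
print(closure("C"))
```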
Reduces SDFG size when serialized, using the following methods:
* Non-human-readable JSON dumping by default
* Consolidating file names in DebugInfo to be per-SDFG
* Reducing the size of the DebugInfo JSON object based on fields
* Saving transformation history set to off by default
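Non-human-readable JSON dumping generally means compact separators and no indentation; a generic sketch of the size difference (not DaCe's serializer itself):

```python
import json

data = {"sdfg": {"nodes": [1, 2, 3], "label": "example"}}

# Human-readable vs. compact dumping of the same object.
pretty = json.dumps(data, indent=2)
compact = json.dumps(data, separators=(",", ":"))

# The compact form is strictly smaller but loads back identically.
print(len(pretty), len(compact))
```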
This PR suggests writing a `CACHEDIR.TAG` file into the program folder. The tag is an attempt to signal (e.g. to backup software) that the containing folder has no archival value, see http://www.brynosaurus.com/cachedir/. While the convention started with cache directories for things like browser thumbnails, I'd argue the same argumentation (no archival value, frequent changes, unsuitable to be located in `/var/cache` or `/tmp`) applies to build folders. Instead of writing the file by hand, we could also use a library like https://pypi.org/project/cachedir-tag/. --------- Co-authored-by: Roman Cattaneo <>
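Writing the tag by hand is only a few lines; the signature line below is fixed by the Cache Directory Tagging specification, while the helper name is illustrative:

```python
import os

# First line is mandated by http://www.brynosaurus.com/cachedir/;
# the comment lines are free-form.
CACHEDIR_TAG = (
    "Signature: 8a477f597d28d172789f06886806bc55\n"
    "# This file is a cache directory tag.\n"
    "# For information about cache directory tags, see:\n"
    "#   http://www.brynosaurus.com/cachedir/\n"
)

def write_cachedir_tag(folder):
    """Write CACHEDIR.TAG into `folder` if it does not exist yet."""
    path = os.path.join(folder, "CACHEDIR.TAG")
    if not os.path.exists(path):
        with open(path, "w") as f:
            f.write(CACHEDIR_TAG)
```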
`set` is not hashable and therefore does not work with the `lru_cache` decorator; I propose using `frozenset` here.
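A quick illustration of why the change is needed, with a generic cached function (the function itself is made up for the example):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def total(symbols):
    # lru_cache hashes its arguments: frozenset works, plain set does not.
    return sum(symbols)

print(total(frozenset({1, 2, 3})))  # cached call succeeds

try:
    total({1, 2, 3})  # plain set: unhashable type
except TypeError as e:
    print("TypeError:", e)
```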
Some parts of DaCe currently rely on `six`, a Python 2/Python 3 compatibility library. Given that DaCe only supports Python 3.10 - 3.14 now, I think we no longer need the `six` dependency.
Following up on PR spcl#2312, this PR proposes to save compressed SDFGs in the program folder. As discussed in the last meeting: - no change to the API, i.e. we keep keyword arguments of `sdfg.save()` as they are. - save `program.sdfg` as compressed `program.sdfgz` inside the program folder - make sure that changes are backwards compatible and that the program folder is still found regardless of `program.sdfg` or `program.sdfgz` --------- Co-authored-by: Roman Cattaneo <>
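The compressed save can be sketched as gzip-compressed JSON with format detection by extension (function names here are hypothetical; DaCe keeps the `sdfg.save()` keyword arguments unchanged):

```python
import gzip
import json

def save_sdfg(obj, path):
    """Save a serialized SDFG, compressed if the path ends in .sdfgz."""
    payload = json.dumps(obj).encode("utf-8")
    if path.endswith(".sdfgz"):
        with gzip.open(path, "wb") as f:
            f.write(payload)
    else:
        with open(path, "wb") as f:
            f.write(payload)

def load_sdfg(path):
    """Load either format transparently, keyed off the file extension."""
    opener = gzip.open if path.endswith(".sdfgz") else open
    with opener(path, "rb") as f:
        return json.loads(f.read().decode("utf-8"))
```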
If we unroll a top-level loop CFG, the connectivity might be broken; I added a unit test and the fix for it. `replace_dict` on a loop does not properly update the init statement, which can be exposed by loop unrolling when the parent loop's parameter is used inside an inner loop. I fixed it and added a unit test in loop unrolling that exposes it.
`Range` subsets have a `reorder()` function that re-orders the dimensions in place. So far, it only re-ordered the ranges, but not the tile sizes (which are stored in a separate list). This PR makes sure that both ranges and tile sizes are re-ordered according to the given permutation. The PR adds a simple test case.
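The essence of the fix, as a standalone sketch (illustrative; not the actual `Range.reorder()` implementation):

```python
def reorder(ranges, tile_sizes, order):
    """Apply the same permutation, in place, to both the ranges and
    their associated tile sizes, keeping the two lists consistent."""
    ranges[:] = [ranges[i] for i in order]
    tile_sizes[:] = [tile_sizes[i] for i in order]

# Two dimensions with distinct tile sizes, swapped by the permutation:
ranges = [(0, 9, 1), (0, 4, 1)]
tiles = [8, 2]
reorder(ranges, tiles, [1, 0])
print(ranges, tiles)
```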
The initial idea was to reduce the size of the cache folder by only storing the components that are needed. To this end, the code generator was modified such that different versions of the build folder can be generated. Currently there are only two versions:
- `development`: the old full version, i.e. everything in a single folder.
- `production`: a reduced folder that only contains the libraries (stub and program library) as well as the version.

The implementation has two parts. First, `generate_program_folder()` was modified such that it only generates the parts that are absolutely needed, such as source files; anything else is not generated in the first place. Then `compile_and_configure()` was modified such that it removes the parts that are no longer needed (an example would be the source files, which are needed for compilation but not afterwards). The changes are backwards compatible; thus, caches that were generated _before_ this PR will still work, but they should be phased out. The changes are also done in a way that should make it simple to add new modes later, if the need arises.

In addition:
- Also fixes that `sdfg.view()` fails if there are external SDFGs ([727cfb1](spcl@727cfb1))
- `sdfg.generate_code()` no longer generates the source maps as a side effect and writes them to disk. Instead, they are generated by `generate_program_folder()` ([5e1694b](spcl@5e1694b)).
- Dumping of the configuration in the build folder ([eb54062](spcl@eb54062)).
- Miscellaneous refactoring in the `ReloadableDLL` that changed its interface.
The old version is now deprecated.
Unused imports add a performance overhead at runtime and risk creating import cycles. To automatically detect and remove them in the future, this PR suggests adding [ruff](https://docs.astral.sh/ruff/) as a `pre-commit` hook to detect unused imports. I've added an exception for `__init__.py` files to allow re-exports without adding an explicit `__all__` list or extra annotations.

The PR grew quite big. However, non-automatic changes are only in the following seven files:
- `.pre-commit-config.yaml`: configure `pre-commit` to run the `ruff` linter
- `ruff.toml`: configure the `ruff` linter to only search for unused imports (rule `F401`)
- `dace/autodiff/library/library.py`: make sure we keep the `ParameterArray` import for backwards compatibility
- `dace/frontend/python/replacements/operators.py`: make sure we keep the `dace` import for evaluation of data types
- `tests/library/include_test.py`: make sure we keep the necessary import in the middle of the test
- `dace/sdfg/analysis/schedule_tree/treenodes.py`: manually remove the now trivial `if TYPE_CHECKING` branch
When we started to enforce consistent formatting on the CI (PR spcl#1957), an important discussion point was for developers to see the changes that needed to be applied. For lack of better knowledge, I added a small script to show the `git diff` output in case of failure. I recently learned that `pre-commit` has a built-in `--show-diff-on-failure` option, which enables exactly this behavior out of the box ([link to failing workflow run](https://github.com/spcl/dace/actions/runs/24556723300/job/71794961944?pr=2337)). It even comes with colored output. I suggest we drop the custom script since this option is much simpler.
- Fix script to query the HIP architecture of the machine - Remove explicit setting of `--offload-arch` in `HIP_HIPCC_FLAGS` - Set `CMAKE_HIP_FLAGS` instead of `EXTRA_HIP_FLAGS`
When we build an SDFG, there's the option to store `DebugInfo` with some SDFG nodes. For example, this `DebugInfo` can be used to store file and line information of parsed code when building an SDFG. When using the SDFG API, the default is to inspect the Python stack and extract file and line information from there. These calls to `inspect` can be expensive, especially for bigger graphs. This PR proposes adding a configuration option, `compiler.lineinfo`, to drive this behavior from a single place. The defaults are kept as is, i.e. we keep inspecting the stack by default. However, the config option provides a central place to turn `DebugInfo` off, which could be configured in production scenarios.
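The gating can be sketched with a small helper (the helper name is hypothetical; only the option name `compiler.lineinfo` comes from the PR):

```python
import inspect

def get_debuginfo(lineinfo_enabled):
    """Return (filename, lineno) of the caller when line info is
    enabled; skip the expensive stack walk entirely otherwise."""
    if not lineinfo_enabled:
        return None
    frame = inspect.stack()[1]
    return (frame.filename, frame.lineno)
```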
This is the PR/branch that GT4Py.Next uses to pull DaCe.
It is essentially DaCe main together with our fixes that, for various reasons, have not yet made it into DaCe main.
The process for updating this branch is as follows; there are no exceptions:

1. Update the `version.py` file in the `dace/` subfolder. Make sure that there is no newline at the end. For `next` we are using the epoch 43 (`cartesian` would use 42). The date is used as the version number, so the version (for `next`) would look something like `'43!YYYY.MM.DD'`.
2. Push the branch (`gt4py-next-integration`).
3. Create the tag `__gt4py-next-integration_YYYY_MM_DD` and push it as well.

Afterwards you have to update GT4Py's `pyproject.toml` file:

1. Update the version requirement of DaCe in the `dace-next` group at the beginning of the file to the version you just created, i.e. change it to `dace==43!YYYY.MM.DD`.
2. Update the source in the `uv`-specific parts of the file: change the source to the new tag you have just created.
3. Update the uv lock by running `uv sync --extra next --group dace-next`; if you have installed the pre-commit hooks, this will be done automatically.

NOTE: Once PR #2423 has been merged, the second step (adapting the tag in the `uv`-specific parts) is no longer needed.

On top of `DaCe/main` we are using the following PRs: no open PRs currently, all changes are in DaCe main.
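The steps above can be sketched as shell commands (illustrative only; the date is a placeholder and the actual `git push`/`git tag` invocations are shown via `echo` rather than executed):

```shell
# Placeholder date; substitute the actual release date.
DATE="YYYY.MM.DD"
VERSION="43!${DATE}"
# The tag replaces the dots in the date with underscores.
TAG="__gt4py-next-integration_$(echo "$DATE" | tr . _)"

echo "1. Set dace/version.py to '${VERSION}' (no trailing newline)"
echo "2. git push origin gt4py-next-integration"
echo "3. git tag ${TAG} && git push origin ${TAG}"
echo "4. In GT4Py: set dace==${VERSION} in pyproject.toml and update the uv source to ${TAG}"
echo "5. uv sync --extra next --group dace-next"
```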
No Longer Needed
- `DaCe.Config`
- `PruneSymbols`
- `scope_tree_recursive()`
- `MapFusion` `other_subset` validation
- `state_fission()` `SubgraphView`
- `try_initialize()`
- edges in Map fusion
- `MapFusionVertical`
- `RedundantSecondArray`
- `import` in `fast_call()`
- `compiled_sdfg_call_hooks_manager`
- `self._lastargs` Mutable: no longer needed since we now use GT4Py PR#2353 and DaCe PR#2206.
- `self._lastargs` Mutable (should be replaced by a more permanent solution).
- `MapFusion*`
- `AddThreadBlockMap`
- `apply_transformation_once_everywhere()`
- `CompiledSDFG` refactoring (archive): for some reason the original PR has been "taken over" by Tal.
Due to the inherent dependency that GT4Py has on this PR, we should use the archive (linked at the top).
MapFusionVertical