The first ~560 instructions of the warpspeed scan kernel (everything up to CCTL.IVALL) only handle the setup of mbarriers. That is excessive for a kernel with a total size of ~1800 instructions, and I believe there is great potential to reduce it.
I made a few attempts in: https://github.com/bernhardmgruber/cccl/tree/warpspeed_barrer_init. They reduced the setup code to 140 instructions, but a few runs showed small regressions, so I did not integrate the branch.
We should revisit this.
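
To make the prologue size easy to track while revisiting this, here is a minimal sketch of a helper that counts SASS instructions up to a marker instruction. It assumes the input is the text printed by `cuobjdump -sass` for a single kernel, that each instruction line starts with a `/*hex address*/` comment, and that the marker instruction itself counts as part of the setup. The function name `count_prologue` is hypothetical and not part of any existing tooling.

```python
import re

# Assumed SASS line shape from `cuobjdump -sass`:
#   /*0010*/   MOV R1, c[0x0][0x28] ;   /* 0x... */
# Lines that only carry an encoding comment do not match and are skipped.
INSTR = re.compile(r"^\s*/\*[0-9a-fA-F]+\*/\s+(?P<text>[^;]+);")


def count_prologue(sass: str, marker: str = "CCTL.IVALL"):
    """Return (instructions up to and including marker, total instructions).

    The first element is None if the marker never appears.
    """
    total = 0
    prologue = None
    for line in sass.splitlines():
        match = INSTR.match(line)
        if match is None:
            continue  # not an instruction line (header, label, encoding)
        total += 1
        if prologue is None and marker in match.group("text"):
            prologue = total
    return prologue, total
```

Running it on the dump of the scan kernel before and after a change gives a quick check of whether the setup code actually shrank; it says nothing about runtime, so the regressions still need to be measured separately.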