
Vulkan 1.3 has pointers, thanks to buffer device address[1]. It took a while to get there, and earlier pointer support was flawed. I also don't know of any major applications that use this.
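For anyone who hasn't seen it, the host-side part is small. A minimal sketch (my own function name; buffer and memory creation boilerplate omitted): you create the buffer with the device-address usage bit, allocate its memory with the matching flag, then query a 64-bit GPU pointer you can hand to a shader, e.g. via a push constant.

    // Sketch: querying a buffer device address (core since Vulkan 1.2).
    // Assumes the buffer was created with
    // VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT and its memory allocated
    // with VkMemoryAllocateFlagsInfo::flags containing
    // VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT.
    #include <vulkan/vulkan.h>

    VkDeviceAddress get_buffer_address(VkDevice device, VkBuffer buffer) {
        VkBufferDeviceAddressInfo info{};
        info.sType  = VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO;
        info.buffer = buffer;
        // The returned 64-bit address is typically passed to the shader in a
        // push constant and dereferenced there via GL_EXT_buffer_reference.
        return vkGetBufferDeviceAddress(device, &info);
    }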

Modern Vulkan is looking pretty good now. Cooperative matrix multiplication has also landed (as a widely supported extension), and I think it's fair to say it's gone past OpenCL.

Whether we get significant adoption of all this I think is too early to say, but I think it's a plausible foundation for real stuff. It's no longer just a toy.

[1] https://community.arm.com/arm-community-blogs/b/graphics-gam...



Is IREE the main runtime doing Vulkan or are there others? Who should we be listening to (oh wise @raphlinus)?

It's been awesome seeing folks like Keras 3.0 kicking out broad intercompatibility across JAX, TF, and PyTorch, powered by flexible execution engines. Looking forward to seeing more Vulkan-based runs getting socialized, benchmarked & compared. https://news.ycombinator.com/item?id=38446353


The two I know of are IREE and Kompute[1]. I'm not sure how much momentum the latter has; I don't see it referenced much. There's also a growing body of work that uses Vulkan indirectly through WebGPU. This currently lags in performance due to the lack of subgroups and cooperative matrix multiply, but I see that gap closing. There I think wonnx[2] has the most momentum, but I am aware of other efforts.

[1]: https://kompute.cc/

[2]: https://github.com/webonnx/wonnx


How feasible would it be to target Vulkan 1.3 or later from standard SYCL (as first seen in Sylkan, for earlier Vulkan Compute)? Is Vulkan still lacking the numerical precision guarantees for some math functions that OpenCL and SYCL seem to expect?


That's a really good question. I don't know enough about SYCL to be able to tell you the answer, but I've heard rumblings that it may be the thing to watch. I think there may be some other limitations; for example, SYCL 2020 relies on unified shared memory, and that is definitely not something you can count on in compute shader land (in some cases you can get some of it, for example with resizable BAR, but it depends).

In researching this answer, I came across a really interesting thread[1] on diagnosing performance problems with USM in SYCL (running on AMD HIP in this case). It's a good tour of why this is hard, and why for the vast majority of users it's far better to just use CUDA and not have to deal with any of this bullshit - things pretty much just work.

When targeting compute shaders, you pretty much have to manage buffers manually, and also do copying between host and device memory explicitly (when needed - on hardware such as Apple Silicon, you prefer not to copy). I personally don't have a problem with this, as I like things being explicit, but it is definitely one of the ergonomic advantages of modern CUDA, and one of the reasons why fully automated conversion to other runtimes is not going to work well.
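To make that concrete, here's roughly what the explicit path looks like in Vulkan (a sketch with my own names; buffer/memory setup, synchronization, and error handling omitted, and assuming HOST_COHERENT staging memory):

    // Sketch: explicit host -> device upload via a staging buffer.
    #include <vulkan/vulkan.h>
    #include <cstring>

    void upload(VkDevice dev, VkDeviceMemory staging_mem, VkCommandBuffer cmd,
                VkBuffer staging, VkBuffer device_local,
                const void* src, VkDeviceSize size) {
        // 1. Map the HOST_VISIBLE staging memory and copy the data in.
        //    (With non-coherent memory you'd also vkFlushMappedMemoryRanges.)
        void* dst = nullptr;
        vkMapMemory(dev, staging_mem, 0, size, 0, &dst);
        std::memcpy(dst, src, size);
        vkUnmapMemory(dev, staging_mem);

        // 2. Record a GPU copy into the DEVICE_LOCAL buffer. On unified-memory
        //    hardware (e.g. Apple Silicon) you'd skip this step and use the
        //    mapped buffer directly.
        VkBufferCopy region{0, 0, size};  // srcOffset, dstOffset, size
        vkCmdCopyBuffer(cmd, staging, device_local, 1, &region);
    }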

[1]: https://stackoverflow.com/questions/76700305/4000-performanc...


Unified shared memory is an Intel-specific extension of OpenCL.

SYCL builds on top of OpenCL, so you need to know the history of OpenCL. OpenCL 2.0 introduced shared virtual memory, which is basically the most insane way of doing it. Even with coarse-grained shared virtual memory, memory pages can transparently migrate from host to device on access. This is difficult to implement in hardware; the only good implementations were on iGPUs, simply because the memory is already shared there. No vendor, not even AMD, could implement this demanding feature. You would need full cache coherence from the processor to the GPU, something that is only possible with an interconnect like CXL, and CXL isn't ready even to this day.
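For reference, the coarse-grained flavor looks roughly like this on the host (a sketch assuming an OpenCL 2.0 context, queue, and kernel already exist; error handling omitted):

    // Sketch: coarse-grained SVM in OpenCL 2.0. A single pointer is valid on
    // both host and device, but the host must bracket its accesses with
    // map/unmap so the runtime can migrate and synchronize the buffer.
    #include <CL/cl.h>

    void svm_demo(cl_context ctx, cl_command_queue queue,
                  cl_kernel kernel, size_t n) {
        float* data = static_cast<float*>(
            clSVMAlloc(ctx, CL_MEM_READ_WRITE, n * sizeof(float), 0));

        // Host access must be wrapped in map/unmap for coarse-grained SVM.
        clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data,
                        n * sizeof(float), 0, nullptr, nullptr);
        for (size_t i = 0; i < n; ++i) data[i] = float(i);
        clEnqueueSVMUnmap(queue, data, 0, nullptr, nullptr);

        // The same pointer is handed straight to the kernel; no clCreateBuffer.
        clSetKernelArgSVMPointer(kernel, 0, data);
        clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &n, nullptr,
                               0, nullptr, nullptr);
        clFinish(queue);

        clSVMFree(ctx, data);
    }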

So OpenCL 2.x was basically dead. It had unimplementable mandatory features, so nobody wrote software for it.

Khronos then decided to make OpenCL 3.0, which gets rid of all these difficult-to-implement features (everything beyond the OpenCL 1.2 baseline became optional) so vendors could finally move on.

So while Intel was building their Arc GPUs, they decided to create a variant of shared virtual memory that is actually implementable, called unified shared memory.

The idea is the following: all USM buffers are accessible by both CPU and GPU, but the location is defined by the developer. Host memory stays on the host, and the GPU must access it over PCIe. Device memory stays on the GPU, and the host must access it over PCIe. These two types already cover the vast majority of use cases and can be implemented by anyone. Then finally, there is "shared" memory, which can migrate between CPU and GPU in a coarse-grained manner. This isn't page-level; the entire buffer gets moved, as far as I am aware. This allows you to do CPU work, then GPU work, then CPU work on the same allocation. What doesn't exist is a fully cache-coherent form of shared memory.
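In SYCL 2020, which adopted this design, the three kinds map onto malloc_host, malloc_device and malloc_shared. A minimal sketch, assuming a USM-capable device and implementation:

    // Sketch of the three USM allocation kinds in SYCL 2020.
    #include <sycl/sycl.hpp>
    #include <iostream>

    int main() {
        sycl::queue q;
        constexpr size_t n = 1024;

        // Host USM: stays in host memory; the device reads it over PCIe.
        float* h = sycl::malloc_host<float>(n, q);
        // Device USM: stays in device memory; the host must memcpy to/from it.
        float* d = sycl::malloc_device<float>(n, q);
        // Shared USM: the whole allocation migrates between host and device.
        float* s = sycl::malloc_shared<float>(n, q);

        for (size_t i = 0; i < n; ++i) h[i] = float(i);
        q.memcpy(d, h, n * sizeof(float)).wait();  // explicit copy into device USM

        // Shared memory supports the CPU -> GPU -> CPU pattern described above.
        for (size_t i = 0; i < n; ++i) s[i] = float(i);
        q.parallel_for(sycl::range<1>(n),
                       [=](sycl::id<1> i) { s[i] += d[i]; }).wait();
        std::cout << s[0] << "\n";

        sycl::free(h, q); sycl::free(d, q); sycl::free(s, q);
    }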

https://registry.khronos.org/OpenCL/extensions/intel/cl_inte...


https://enccs.github.io/sycl-workshop/unified-shared-memory/ seems to suggest that USM is still a hardware-specific feature in SYCL 2020, so compatibility with hardware that requires a buffer copying approach is still maintained. Is this incorrect?


Good call. So this doesn't look like a blocker to SYCL compatibility. I'm interested in learning more about this.
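For comparison, the portable fallback that page describes is SYCL's buffer/accessor model, where the runtime schedules the copies for you. A rough sketch:

    // Sketch: SYCL's portable buffer/accessor model (no USM required).
    // The runtime decides when to copy between host and device.
    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
        std::vector<float> v(1024, 1.0f);
        sycl::queue q;
        {
            sycl::buffer<float, 1> buf(v.data(), sycl::range<1>(v.size()));
            q.submit([&](sycl::handler& h) {
                sycl::accessor acc{buf, h, sycl::read_write};
                h.parallel_for(sycl::range<1>(v.size()),
                               [=](sycl::id<1> i) { acc[i] *= 2.0f; });
            });
        }   // buffer destruction waits and writes results back into v
    }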


> Vulkan 1.3 has pointers, thanks to buffer device address[1].

> [1] https://community.arm.com/arm-community-blogs/b/graphics-gam...

"Using a pointer in a shader - In Vulkan GLSL, there is the GL_EXT_buffer_reference extension "

That extension is utter garbage. I tried it. It was the last thing I tried before giving up on GLSL/Vulkan and switching to CUDA. It was the nail in the coffin that made me go "okay, if that's the best Vulkan can do, then I need to switch to CUDA". It's incredibly cumbersome, confusing and verbose.

What's needed are regular, simple, C-like pointers.



