Vk cuda interop by kevyuu · Pull Request #1061 · Devsh-Graphics-Programming/Nabla

kevyuu · 2026-04-23T10:43:32Z

Description

Bridging CUDA and Vulkan

Testing


devshgraphicsprogramming · 2026-05-04T21:31:42Z

 }


-core::smart_refctd_ptr<ISemaphore> CVulkanLogicalDevice::createSemaphore(const uint64_t initialValue)


initialValue should go into the creation params, and the old signature should be implemented in terms of the new one and marked deprecated
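A minimal sketch of what this could look like, using hypothetical stand-in types (`Semaphore`, `SemaphoreCreationParams`, `LogicalDevice`) rather than Nabla's real `ISemaphore` and device classes. It only illustrates the forwarding pattern: the old overload stays for source compatibility and delegates to the params-based one.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical stand-ins, not the real Nabla types.
struct Semaphore
{
    uint64_t value;
    uint32_t externalHandleTypes;
};

struct SemaphoreCreationParams
{
    uint64_t initialValue = 0;        // moved here from the old positional argument
    uint32_t externalHandleTypes = 0; // 0 means "not exportable" in this sketch
};

class LogicalDevice
{
public:
    // New entry point: everything arrives through the params struct.
    std::shared_ptr<Semaphore> createSemaphore(const SemaphoreCreationParams& params)
    {
        return std::make_shared<Semaphore>(Semaphore{params.initialValue, params.externalHandleTypes});
    }

    // Old signature, implemented in terms of the new one and flagged for removal.
    [[deprecated("pass SemaphoreCreationParams instead")]]
    std::shared_ptr<Semaphore> createSemaphore(const uint64_t initialValue)
    {
        SemaphoreCreationParams params;
        params.initialValue = initialValue;
        return createSemaphore(params);
    }
};
```

Existing callers keep compiling (with a warning), while new code can set the external handle types in the same struct.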

devshgraphicsprogramming · 2026-05-04T21:32:49Z

+    if (m_externalHandle != ExternalHandleNull)
+    {
+        bool re = CloseExternalHandle(m_externalHandle);
+        assert(re);


can log an error using the m_vulkanDevice's debug callback

devshgraphicsprogramming · 2026-05-04T21:37:20Z

+    // The Vulkan spec states: If the pNext chain includes a VkExternalMemoryImageCreateInfo or VkExternalMemoryImageCreateInfoNV structure whose handleTypes member is not 0, initialLayout must be VK_IMAGE_LAYOUT_UNDEFINED
+    vk_createInfo.initialLayout = external ? VK_IMAGE_LAYOUT_UNDEFINED : (params.preinitialized ? VK_IMAGE_LAYOUT_PREINITIALIZED : VK_IMAGE_LAYOUT_UNDEFINED);


I would put a check in the valid() of the params that preinitialized is not true if external is true
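As a rough illustration of that check, here is a sketch with a hypothetical, cut-down params struct (`ImageCreationParams` and its `externalHandleTypes` field are assumptions, not Nabla's actual layout). The point is that the invalid combination is rejected up front instead of being silently overridden when the Vulkan create info is filled.

```cpp
#include <cstdint>

// Hypothetical subset of the image creation params.
struct ImageCreationParams
{
    bool preinitialized = false;
    uint32_t externalHandleTypes = 0; // non-zero means the image is external

    bool valid() const
    {
        const bool external = externalHandleTypes != 0;
        // External images must start in the UNDEFINED layout,
        // so asking for preinitialized contents is a caller error.
        if (external && preinitialized)
            return false;
        return true;
    }
};
```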

devshgraphicsprogramming · 2026-05-04T21:38:49Z

+    const auto handleType = static_cast<VkExternalSemaphoreHandleTypeFlagBits>(creationParams.externalHandleTypes.value);
+    if (handleType != 0)
+    {
+#ifdef _WIN32
+        VkSemaphoreGetWin32HandleInfoKHR props = {
+            .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_WIN32_HANDLE_INFO_KHR,
+            .semaphore = semaphore,
+            .handleType = handleType,
+        };
+
+        if (VK_SUCCESS != m_devf.vk.vkGetSemaphoreWin32HandleKHR(m_vkdev, &props, &externalHandle))
+        {
+            m_devf.vk.vkDestroySemaphore(m_vkdev, semaphore, nullptr);
+            return nullptr;
+        }
+#else
+        VkSemaphoreGetFdInfoKHR props = {
+            .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
+            .semaphore = vkSemaphore,
+            .handleType = handleType,
+        };
+        if (VK_SUCCESS != m_devf.vk.vkGetSemaphoreFdKHR(m_vkdev, &props, &externalHandle))
+        {
+            m_devf.vk.vkDestroySemaphore(m_vkdev, semaphore, nullptr);
+            return nullptr;
+        }
+#endif
+    }


this should be handled through a switch with a default case so we don't die when more handle types are added

ok I see you can support multiple handles at once, so a switch doesn't make sense, but you should still gate/assert that the external handle type you're writing the code for is present (e.g. WIN32 or FD) and that the others are not
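One possible shape for such a gate, sketched with a hypothetical handle-type enum and helper names (`HandleType`, `supportedHandleTypes`, `canExport` are all made up for illustration). The request passes only if it is non-empty and contains nothing outside the set the current platform codepath handles.

```cpp
#include <cstdint>

// Hypothetical handle-type bits for this sketch.
enum HandleType : uint32_t
{
    HT_NONE         = 0x0,
    HT_OPAQUE_FD    = 0x1,
    HT_OPAQUE_WIN32 = 0x2,
    HT_OTHER        = 0x8, // stands for any type added later
};

// The set of handle types the platform-specific code below knows how to export.
constexpr uint32_t supportedHandleTypes()
{
#ifdef _WIN32
    return HT_OPAQUE_WIN32;
#else
    return HT_OPAQUE_FD;
#endif
}

// True when `requested` is non-empty and holds only supported bits.
constexpr bool canExport(const uint32_t requested)
{
    return requested != 0u && (requested & ~supportedHandleTypes()) == 0u;
}
```

With a check like this at the top, the `#ifdef` branches run only for the handle type they were written for, and a newly added type fails early instead of reaching the wrong `vkGetSemaphore*` call.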

devshgraphicsprogramming · 2026-05-04T21:41:23Z

+#ifdef _WIN32
+    VkImportMemoryWin32HandleInfoKHR importInfo = { 
+        .sType = VK_STRUCTURE_TYPE_IMPORT_MEMORY_WIN32_HANDLE_INFO_KHR,
+        .handleType = static_cast<VkExternalMemoryHandleTypeFlagBits>(info.externalHandleType),


you need to check the handle types at the start IMHO, before just going through with the win32 or POSIX codepaths

devshgraphicsprogramming · 2026-05-04T21:46:24Z

why are we passing dedicatedOnly around ?

ok I see, IMHO the solution is to have IGPUBuffer and IGPUImage SConstructionParams and SCachedConstructionParams (which can contain the creation and cached creation params) to pass things between the inside of the factory creation function and the CVulkan constructor

devshgraphicsprogramming · 2026-05-04T21:47:44Z

+    friend class IDeviceMemoryAllocator;
+    friend class ILogicalDevice;


write docs about why you need the friendship

devshgraphicsprogramming · 2026-05-04T21:50:40Z

+        struct SCreationParams: SInfo
+        {
+            core::bitflag<E_MEMORY_PROPERTY_FLAGS> memoryPropertyFlags = E_MEMORY_PROPERTY_FLAGS::EMPF_NONE;
+            const bool dedicated = false;


you don't need to make the struct member const, it's better if you make the m_params const!
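A small sketch of the difference, with simplified stand-ins for the types in the diff (`Allocation` is hypothetical). Keeping the members non-const leaves the params struct assignable for callers who build it up step by step, while the `const` on the stored copy still guarantees the owner never mutates it.

```cpp
#include <cstdint>
#include <type_traits>

struct SCreationParams
{
    uint64_t allocationSize = 0;
    bool dedicated = false; // non-const member: the struct stays copy-assignable
};

class Allocation // hypothetical owner of the params
{
public:
    explicit Allocation(const SCreationParams& params) : m_params(params) {}

    const SCreationParams& getCreationParams() const { return m_params; }

private:
    const SCreationParams m_params; // immutability enforced here instead
};

// A `const bool dedicated` member would make this assertion fail.
static_assert(std::is_copy_assignable_v<SCreationParams>);
```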

devshgraphicsprogramming · 2026-05-04T21:51:27Z

+
+        inline const SCreationParams& getCreationParams() const { return m_params; }
+
+        virtual external_handle_t getExternalHandle() const = 0;


I don't get the point of having this, and a getCreationParams() when I can go getCreationParams().externalHandle

devshgraphicsprogramming · 2026-05-04T21:52:34Z

+        struct SInfo
+        {
+            uint64_t allocationSize = 0;
+            core::bitflag<IDeviceMemoryAllocation::E_MEMORY_ALLOCATE_FLAGS> allocateFlags = IDeviceMemoryAllocation::EMAF_NONE;
+            // Handle Type for external resources
+            IDeviceMemoryAllocation::E_EXTERNAL_HANDLE_TYPE externalHandleType = IDeviceMemoryAllocation::EHT_NONE;
+            //! Imports the given handle  if externalHandle != nullptr && externalHandleType != EHT_NONE
+            //! Creates exportable memory if externalHandle == nullptr && externalHandleType != EHT_NONE
+            external_handle_t externalHandle = 0;
+        };
+
+        struct SCreationParams: SInfo


any reason for the split between SInfo and SCreationParams? We could shove everything into 24 bytes by packing the allocate flags, handle types, memoryPropertyFlags and dedicated into 8 bytes

ok I see it's because you reuse the SInfo in the allocator, but still it would help to pack the allocate flags and external handle type into 4 bytes in case the external handle is 4 bytes on Linux
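To make the packing argument concrete, here is a sketch under stated assumptions: both flag enums are narrowed to `uint16_t`, and the external handle is a 4-byte file descriptor as on POSIX. The type names and enum values are hypothetical. On common 64-bit ABIs the two enums and the handle then share a single 8-byte slot after the size.

```cpp
#include <cstdint>

// Hypothetical narrowed enums; names and values are for illustration only.
enum AllocateFlags : uint16_t { AF_NONE = 0, AF_DEVICE_ADDRESS_BIT = 0x2 };
enum ExternalHandleType : uint16_t { XHT_NONE = 0, XHT_OPAQUE_FD = 0x1 };

struct SPackedInfo
{
    uint64_t allocationSize = 0;                    // 8 bytes
    AllocateFlags allocateFlags = AF_NONE;          // 2 bytes
    ExternalHandleType externalHandleType = XHT_NONE; // 2 bytes
    int32_t externalHandle = -1;                    // 4 bytes, assuming an fd-sized handle
};

// 8 + (2 + 2 + 4) = 16 bytes with no padding on typical x86-64 / AArch64 ABIs.
static_assert(sizeof(SPackedInfo) == 16, "unexpected padding in SPackedInfo");
```

With `uint32_t` enums and the same 4-byte handle, the struct would grow to 24 bytes because of the trailing alignment padding.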

devshgraphicsprogramming · 2026-05-04T21:53:30Z

-        inline bool isDedicated() const {return m_dedicated;}
+        inline bool isDedicated() const {return m_params.dedicated;}

        //! Returns the size of the memory allocation
-        inline size_t getAllocationSize() const {return m_allocationSize;}
+        inline size_t getAllocationSize() const {return m_params.allocationSize;}

        //!
-        inline core::bitflag<E_MEMORY_ALLOCATE_FLAGS> getAllocateFlags() const { return m_allocateFlags; }
+        inline core::bitflag<E_MEMORY_ALLOCATE_FLAGS> getAllocateFlags() const { return m_params.allocateFlags; }

        //!
-        inline core::bitflag<E_MEMORY_PROPERTY_FLAGS> getMemoryPropertyFlags() const { return m_memoryPropertyFlags; }
+        inline core::bitflag<E_MEMORY_PROPERTY_FLAGS> getMemoryPropertyFlags() const { return m_params.memoryPropertyFlags; }


mark whatever I can get from getCreationParams() with [[deprecated]]

devshgraphicsprogramming · 2026-05-04T21:53:57Z

        enum E_MEMORY_HEAP_FLAGS : uint32_t
        {
            EMHF_NONE               = 0,
            EMHF_DEVICE_LOCAL_BIT   = 0x00000001,
            EMHF_MULTI_INSTANCE_BIT = 0x00000002,
        };


you can take this enum down to uint8_t

devshgraphicsprogramming · 2026-05-04T21:54:14Z

+        //! Flags for imported/exported allocation
+        enum E_EXTERNAL_HANDLE_TYPE : uint32_t


and this to uint16_t

devshgraphicsprogramming · 2026-05-04T21:56:37Z

-			size_t size : 54 = 0ull;
-			size_t flags : 5 = 0u; // IDeviceMemoryAllocation::E_MEMORY_ALLOCATE_FLAGS
-			size_t memoryTypeIndex : 5 = 0u;
+			size_t memoryTypeIndex = 0u;


don't use `size_t` for the index, use `uint8_t` at most

devshgraphicsprogramming · 2026-05-04T21:58:06Z

 		class IMemoryTypeIterator
 		{
 			public:
-				IMemoryTypeIterator(const IDeviceMemoryBacked::SDeviceMemoryRequirements& reqs, core::bitflag<IDeviceMemoryAllocation::E_MEMORY_ALLOCATE_FLAGS> allocateFlags)
-					: m_allocateFlags(static_cast<uint32_t>(allocateFlags.value)), m_reqs(reqs) {}
+				IMemoryTypeIterator(const IDeviceMemoryBacked::SDeviceMemoryRequirements& reqs, 
+					core::bitflag<IDeviceMemoryAllocation::E_MEMORY_ALLOCATE_FLAGS> allocateFlags,
+					IDeviceMemoryAllocation::E_EXTERNAL_HANDLE_TYPE handleType,
+					external_handle_t handle) : 
+					m_allocateFlags(static_cast<uint32_t>(allocateFlags.value)), 
+					m_reqs(reqs), 
+					m_handleType(handleType),
+					m_handle(handle)
+				{}


why does your memory type iterator care about the handle value ? I can understand caring about the handle type and whether we're importing or exporting, but not the actual handle

devshgraphicsprogramming · 2026-05-04T21:59:36Z

+			IDeviceMemoryBacked* dedication = nullptr,
+			const core::bitflag<IDeviceMemoryAllocation::E_MEMORY_ALLOCATE_FLAGS> allocateFlags = IDeviceMemoryAllocation::E_MEMORY_ALLOCATE_FLAGS::EMAF_NONE,
+			IDeviceMemoryAllocation::E_EXTERNAL_HANDLE_TYPE externalHandleType = IDeviceMemoryAllocation::EHT_NONE,
+			external_handle_t externalHandle = {})


the last 4 arguments should be wrapped up in a structure, it kinda looks like half of your SInfo

devshgraphicsprogramming · 2026-05-04T22:02:03Z

+        enum E_EXTERNAL_MEMORY_FEATURE_FLAGS : uint32_t
+        {
+          EEMF_NONE = 0x0,
+          EEMF_DEDICATED_ONLY_BIT = 0x1,
+          EEMF_EXPORTABLE_BIT = 0x2,
+          EEMF_IMPORTABLE_BIT = 0x4,
+        };
+
+        struct SExternalMemoryProperties
+        {
+            IDeviceMemoryAllocation::E_EXTERNAL_HANDLE_TYPE exportableTypes : 7;
+            IDeviceMemoryAllocation::E_EXTERNAL_HANDLE_TYPE compatibleTypes : 7;
+            // TODO(kevin): This should actually be core::bitflag to be semantically correct. What should we do? Should we use bool for each flag instead of enum?
+            E_EXTERNAL_MEMORY_FEATURE_FLAGS features : 3;


uint16_t for the class

its fine as it is, can make a core::bitflag<> get method
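A sketch of the "keep the bit-field, add a typed getter" idea. `Bitflag` below is a minimal hypothetical stand-in for `core::bitflag<>` so the example compiles on its own, and the bit-fields are stored as plain unsigned integers with the cast done in the getter.

```cpp
#include <cstdint>

enum E_EXTERNAL_MEMORY_FEATURE_FLAGS : uint32_t
{
    EEMF_NONE               = 0x0,
    EEMF_DEDICATED_ONLY_BIT = 0x1,
    EEMF_EXPORTABLE_BIT     = 0x2,
    EEMF_IMPORTABLE_BIT     = 0x4,
};

// Minimal hypothetical stand-in for core::bitflag<>.
template<typename Enum>
struct Bitflag
{
    Enum value;

    constexpr bool hasFlags(const Enum flags) const
    {
        return (static_cast<uint32_t>(value) & static_cast<uint32_t>(flags)) == static_cast<uint32_t>(flags);
    }
};

struct SExternalMemoryProperties
{
    // Storage stays compact; the widths follow the diff above.
    uint32_t exportableTypes : 7;
    uint32_t compatibleTypes : 7;
    uint32_t features : 3;

    // Typed view over the raw bits, so callers never touch the bit-field directly.
    Bitflag<E_EXTERNAL_MEMORY_FEATURE_FLAGS> getFeatures() const
    {
        return {static_cast<E_EXTERNAL_MEMORY_FEATURE_FLAGS>(features)};
    }
};
```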

devshgraphicsprogramming · 2026-05-04T22:02:43Z

+            // TODO(kevinyu): Should we cached the properties like Atil does. If yes, needs mutex and mutable specifier. Class become not that simple anymore. 
+            // {
+            //     std::shared_lock lock(m_externalBufferPropertiesMutex);
+            //     auto it = m_externalBufferProperties.find({ usage, handleType });
+            //     if (it != m_externalBufferProperties.end())
+            //         return it->second;
+            // }
+            //
+            // std::unique_lock lock(m_externalBufferPropertiesMutex);
+            // return m_externalBufferProperties[{ usage, handleType }] = getExternalBufferProperties_impl(usage, handleType);
+            return getExternalMemoryProperties_impl(usages, handleType);


we do it for image formats, I guess we should make it cached here too
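A sketch of how such a cache might look, following the commented-out code in the diff but with hypothetical simplified types (`PhysicalDevice`, `ExternalMemoryProperties`, and the fake `query_impl` are assumptions). Readers take a shared lock; a miss takes the exclusive lock and uses `try_emplace` so a racing thread that already inserted the entry is not overwritten.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <utility>

struct ExternalMemoryProperties
{
    uint32_t exportableTypes = 0;
    uint32_t compatibleTypes = 0;
};

class PhysicalDevice // hypothetical, simplified
{
public:
    ExternalMemoryProperties getExternalMemoryProperties(const uint32_t usage, const uint32_t handleType) const
    {
        const std::pair<uint32_t, uint32_t> key{usage, handleType};
        {
            // Fast path: many readers may look up concurrently.
            std::shared_lock<std::shared_mutex> lock(m_mutex);
            const auto found = m_cache.find(key);
            if (found != m_cache.end())
                return found->second;
        }
        // Slow path: exclusive access, query only if nobody inserted in the meantime.
        std::unique_lock<std::shared_mutex> lock(m_mutex);
        const auto [it, inserted] = m_cache.try_emplace(key);
        if (inserted)
            it->second = query_impl(usage, handleType);
        return it->second;
    }

    // Number of real queries issued; only meant for single-threaded inspection.
    uint32_t queryCount() const { return m_queries; }

private:
    // Stand-in for the driver query; always called with the exclusive lock held.
    ExternalMemoryProperties query_impl(const uint32_t usage, const uint32_t handleType) const
    {
        ++m_queries;
        return {handleType, handleType & usage};
    }

    mutable std::shared_mutex m_mutex;
    mutable std::map<std::pair<uint32_t, uint32_t>, ExternalMemoryProperties> m_cache;
    mutable uint32_t m_queries = 0;
};
```

This sketch holds the exclusive lock during the query, which keeps it simple at the cost of blocking readers during a miss.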

devshgraphicsprogramming · 2026-05-04T22:03:12Z

+        struct SImageFormatInfo
+        {
+            asset::E_FORMAT format;
+            IGPUImage::E_TYPE type;
+            IGPUImage::TILING tiling;
+            core::bitflag<IGPUImage::E_USAGE_FLAGS> usage;
+            core::bitflag<IGPUImage::E_CREATE_FLAGS> flags;
+        };


assert this is 5 bytes, if not find the padding
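One way to enforce this at compile time is sketched below. It assumes, hypothetically, that every field can be stored in a single byte, which may not match the real Nabla enum widths; the struct names are made up. Besides `sizeof`, `std::has_unique_object_representations_v` is a handy padding detector for integer-only structs, since it is false whenever padding bytes are present.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical: all five fields narrowed to one byte each.
struct SPackedFormatInfo
{
    uint8_t format;
    uint8_t type;
    uint8_t tiling;
    uint8_t usage;
    uint8_t flags;
};
static_assert(sizeof(SPackedFormatInfo) == 5, "padding crept into SPackedFormatInfo");
static_assert(std::has_unique_object_representations_v<SPackedFormatInfo>);

// Counter-example: one wider field between narrow ones introduces padding
// on common ABIs (1 byte after `format`, 1 byte at the end).
struct SPaddedFormatInfo
{
    uint8_t format;
    uint16_t usage;
    uint8_t type;
};
static_assert(sizeof(SPaddedFormatInfo) == 6);
static_assert(!std::has_unique_object_representations_v<SPaddedFormatInfo>);
```

If the real struct fails such an assertion, reordering the members from widest to narrowest is usually the first thing to try.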

devshgraphicsprogramming · 2026-05-04T22:09:57Z

-		static CUresult acquireAndGetMipmappedArray(GraphicsAPIObjLink<video::IGPUImage>* linksBegin, GraphicsAPIObjLink<video::IGPUImage>* linksEnd, CUstream stream);
-		static CUresult acquireAndGetArray(GraphicsAPIObjLink<video::IGPUImage>* linksBegin, GraphicsAPIObjLink<video::IGPUImage>* linksEnd, uint32_t* arrayIndices, uint32_t* mipLevels, CUstream stream);
-#endif
+		size_t roundToGranularity(CUmemLocationType location, size_t size) const;


isn't this just core::roundUp or just align ?

devshgraphicsprogramming · 2026-05-04T22:11:03Z

+      CCUDAImportedSemaphore(core::smart_refctd_ptr<CCUDADevice> device, 
+        core::smart_refctd_ptr<ISemaphore> src, 
+        CUexternalSemaphore semaphore)
+          : m_device(std::move(device))
+          , m_src(std::move(src))
+          , m_handle(semaphore)
+      {}
+      ~CCUDAImportedSemaphore() override;
+


again these are public and should be private

devshgraphicsprogramming · 2026-05-04T22:11:12Z

+namespace nbl::video
+{
+
+class NBL_API2 CCUDAImportedSemaphore : public core::IReferenceCounted


use final whenever possible

devshgraphicsprogramming · 2026-05-04T22:13:14Z

+			// ASK(kevin): What initial_modified_time should I use? Is this how this parameter is used?
+			std::chrono::clock_cast<system::IFile::time_point_t::clock>(std::chrono::system_clock::now()),


this is fine, we don't have a file watcher system yet to reload the file contents if they change

devshgraphicsprogramming · 2026-05-04T22:16:00Z

+  auto& cu = m_device->getHandler()->getCUDAFunctionTable();
+  return cu.pcuExternalMemoryGetMappedBuffer(mappedBuffer, m_handle, &bufferDesc);
+
+}


oh, the syntax for this hurts me, can we turn it into a method that returns CUdeviceptr (and null on failure) ?

If I want CUDA like API usage then I guess I can call getInternalObject and go to town

devshgraphicsprogramming · 2026-05-04T22:21:10Z

+	req.prefersDedicatedAllocation  = nullptr != dedication;
+	req.requiresDedicatedAllocation = nullptr != dedication;


I don't get this, if you pass a dedication it means you've already allocated the memory, so you may have allocated it on the wrong heap, which is not compatible.

IMHO it wholesale doesn't make sense, because if you have a dedicated allocation made before, you must have known the external handle to even make it, so it's a chicken-and-egg problem.

HOWEVER you should have a method that returns the Device Memory Requirements (same as IGPUBuffer or IGPUImage) for the CCUDAExportableMemory so that these are known before we attempt an export

devshgraphicsprogramming · 2026-05-04T22:21:51Z

+
+	return device->allocate(req, 
+		dedication, 
+		IDeviceMemoryAllocation::E_MEMORY_ALLOCATE_FLAGS::EMAF_NONE, 


you've robbed me of specifying EMAF_DEVICE_ADDRESS_BIT and getting BDA for a CUDA buffer

devshgraphicsprogramming · 2026-05-04T22:24:30Z

+	return device->allocate(req, 
+		dedication, 
+		IDeviceMemoryAllocation::E_MEMORY_ALLOCATE_FLAGS::EMAF_NONE, 
+		CCUDADevice::EXTERNAL_MEMORY_HANDLE_TYPE, 
+		m_params.externalHandle).memory;
+}


you need the equivalent of ICleanup from IGPUBuffer and IGPUImage but for the IDeviceMemoryAllocation so you can keep a smart pointer back to the Imported thing which provides the external Handle

Right now you're not ensuring lifetimes when CUDA exports and Vulkan imports
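A sketch of the lifetime idea using standard smart pointers instead of Nabla's `smart_refctd_ptr`. All names here (`ICleanup`, `KeepAlive`, `ExportedMemory`, `ImportedAllocation`) are hypothetical stand-ins. The importing allocation owns a cleanup object, and that object holds a strong reference to whatever exported the handle, so the exporter cannot be destroyed first.

```cpp
#include <memory>
#include <utility>

// Hypothetical cleanup hook, destroyed together with the allocation that owns it.
struct ICleanup
{
    virtual ~ICleanup() = default;
};

// Cleanup that does nothing except keep another object alive.
template<typename T>
struct KeepAlive final : ICleanup
{
    explicit KeepAlive(std::shared_ptr<T> ref) : m_ref(std::move(ref)) {}
    std::shared_ptr<T> m_ref;
};

// Stand-in for the CUDA-side object that exported the handle.
struct ExportedMemory
{
    int handle = 42;
};

// Stand-in for the Vulkan-side allocation that imported the handle.
class ImportedAllocation
{
public:
    ImportedAllocation(const int handle, std::unique_ptr<ICleanup> cleanup)
        : m_handle(handle), m_cleanup(std::move(cleanup)) {}

    int getHandle() const { return m_handle; }

private:
    int m_handle;
    std::unique_ptr<ICleanup> m_cleanup; // releases the exporter when the allocation dies
};
```

The caller can drop its own reference to the exporter right after the import; the memory stays valid until the imported allocation is destroyed.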

devshgraphicsprogramming · 2026-05-04T22:37:28Z

+	m_handler(std::move(handler)),
+	m_allocationGranularity{}
 {
 	m_defaultCompileOptions.push_back("--std=c++14");


can you do C++20 or will NVCC die?

devshgraphicsprogramming · 2026-05-04T22:38:13Z

+size_t CCUDADevice::roundToGranularity(CUmemLocationType location, size_t size) const
 {
-	assert(link->obj);
-	auto glbuf = static_cast<video::COpenGLBuffer*>(link->obj.get());
-	auto retval = cuda.pcuGraphicsGLRegisterBuffer(&link->cudaHandle,glbuf->getOpenGLName(),flags);
-	if (retval!=CUDA_SUCCESS)
-		link->obj = nullptr;
-	return retval;
+	return ((size - 1) / m_allocationGranularity[location] + 1) * m_allocationGranularity[location];


ah okay you do it based on location, still please use the hlsl or core utilities, and inline the function, no need to be going across DLL boundaries for stuff like that
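For reference, a header-only rounding helper could look like the sketch below. `roundUpToGranularity` is a hypothetical local function, not the existing Nabla utility the comment refers to. Unlike the `(size - 1) / g + 1` form in the diff, this variant maps a size of 0 to 0 instead of wrapping around.

```cpp
#include <cassert>
#include <cstddef>

// Rounds `size` up to the next multiple of `granularity`.
// Assumes granularity != 0 and that size + granularity - 1 does not overflow.
constexpr std::size_t roundUpToGranularity(const std::size_t size, const std::size_t granularity)
{
    assert(granularity != 0);
    return ((size + granularity - 1) / granularity) * granularity;
}

static_assert(roundUpToGranularity(4097, 4096) == 8192);
```

Being `constexpr` and defined in a header, the call is resolved in the caller's translation unit, so nothing crosses the DLL boundary.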

devshgraphicsprogramming · 2026-05-04T22:40:24Z

 }
-CUresult CCUDAHandler::registerImage(GraphicsAPIObjLink<video::IGPUImage>* link, uint32_t flags)
+
+CUresult CCUDADevice::reserveAddressAndMapMemory(CUdeviceptr* outPtr, size_t size, size_t alignment, CUmemLocationType location, CUmemGenericAllocationHandle memory) const


I find `STATUS func(RETURN*)` really awkward, maybe make a `template<T> struct nbl::video::SCUDAResult {T result; CUresult status;};` with a `void` specialization that's just the status
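A sketch of that result wrapper. To compile without the CUDA headers, `CUresult` below is a local stand-in enum, `DevicePtr` stands in for `CUdeviceptr`, and `reserveAddress` is a fake function that only demonstrates the calling convention.

```cpp
#include <cstddef>

// Local stand-in for the CUDA driver API status type (assumption for this sketch).
enum CUresult : int { CUDA_SUCCESS = 0, CUDA_ERROR_INVALID_VALUE = 1 };

template<typename T>
struct SCUDAResult
{
    T result;
    CUresult status;

    explicit operator bool() const { return status == CUDA_SUCCESS; }
};

// Specialization for calls that produce nothing but a status.
template<>
struct SCUDAResult<void>
{
    CUresult status;

    explicit operator bool() const { return status == CUDA_SUCCESS; }
};

using DevicePtr = unsigned long long; // stand-in for CUdeviceptr

// Fake operation: fails on a zero size, otherwise returns a dummy address.
inline SCUDAResult<DevicePtr> reserveAddress(const std::size_t size)
{
    if (size == 0)
        return {0, CUDA_ERROR_INVALID_VALUE};
    return {0x1000, CUDA_SUCCESS};
}

// Fake operation with no payload.
inline SCUDAResult<void> unmap(const DevicePtr ptr)
{
    return {ptr != 0 ? CUDA_SUCCESS : CUDA_ERROR_INVALID_VALUE};
}
```

Call sites then read as `if (auto r = reserveAddress(size)) use(r.result);` and the status is still available for logging when the call fails.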

kevyuu added 30 commits March 23, 2026 17:37

Add more cuda function to load

0cd109a

Add _NBL_COMPILE_WITH_CUDA_ compile definition on CMakeLists.txt

bbe25ab

Move CCudaHandler constructor to cpp and query device info and attrib…

d74349e

…utes

Add missing CFileView.h header in CCudaHandler.cpp

38ed6db

Fix indentation of CCudaHandler.cpp

95338cd

Add NBL_API2 to CCudaHandler

3e9dfd2

Fix fetching deviceUUID logic

1ae7747

Fix usage of CFileView

a3150dc

Fix use after move of ptx cpuBuffer

5018be7

Improve cpuBuffer initialization using params instead of aggregrate i…

5251b4d

…nitializer

Fix indentation of CCudaHandler.cpp into tabs

d655b19

Iterate m_availableDevices when creatingDevice

454710b

Implement context creation in CCUDADevice

4645bc4

Implement physical device getExternalMemoryProperties

3172ae7

Dedicated buffer and image

f9b8b4f

External Memory Feature flags should not be enum class

a2357e2

External Vulkan Buffer Creation

0d9c3d8

Temporary enable compile with cuda flag

89f5ae5

Update examples_tests submodule to vk_cuda interop demo branch

152830f

External memory allocation

ea3b49b

Fix indentation on CAssetConverter.cpp

77b92ab

Update jitify submodule

68f740f

External memory allocation cleanup

1c93a91

Implement proper CCUDADevice destructor.

ae0e177

Implementation of Shared memory between vulkan and cuda

c83942a

Add NBL_API2 modifier to CCUDADevice

2e45702

Implementation of Shared semaphore between Vulkan and CUDA

741252f

Update to CUDA Toolkit version 13.0+

fe75ce0

Fix external semaphore

78fc0df

External image implementation

5d19c5b

devshgraphicsprogramming reviewed May 4, 2026
