CUDA Python

CUDA Python provides direct Python bindings to the CUDA Runtime API (cuda.bindings.runtime), offering the lowest-level control over CUDA operations. Because CUDA Python mirrors the CUDA Runtime API one-to-one, you can use it to allocate memory on the GPU, copy data between CPU and GPU, and even write your own CUDA kernels while using CV-CUDA.
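
The bindings follow a consistent convention: every call returns a tuple whose first element is the error code, followed by any results. A minimal sketch of this pattern (assuming the cuda-python package is installed):

from cuda.bindings import runtime as cudart

# Every runtime call returns its error code first, then any results
err, device_count = cudart.cudaGetDeviceCount()
if err != cudart.cudaError_t.cudaSuccess:
    raise RuntimeError(f"cudaGetDeviceCount failed: {err}")
print(f"Found {device_count} CUDA device(s)")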

Common Utilities

To simplify working with CUDA Python, we provide utility functions in cuda_python_common.py:

CudaBuffer Class:

The CudaBuffer class is a lightweight wrapper that implements the __cuda_array_interface__ protocol, making raw CUDA memory accessible to CV-CUDA and other frameworks:

import numpy as np
from cuda.bindings import runtime as cudart


class CudaBuffer:
    """Wrapper for CUDA memory buffer that implements __cuda_array_interface__."""

    def __init__(self, shape, dtype, ptr=None):
        """Initialize CUDA buffer.

        Args:
            shape: tuple of dimensions
            dtype: numpy dtype
            ptr: CUDA device pointer (if None, allocates new memory)
        """
        self.shape = shape
        self.dtype = np.dtype(dtype)
        self.size = int(np.prod(shape)) * self.dtype.itemsize

        if ptr is None:
            err, self.ptr = cudart.cudaMalloc(self.size)
            if err != cudart.cudaError_t.cudaSuccess:
                raise RuntimeError(f"cudaMalloc failed: {err}")
            self.owns_memory = True
        else:
            self.ptr = ptr
            self.owns_memory = False

    @staticmethod
    def from_cuda(cuda_obj) -> "CudaBuffer":
        """Create from __cuda_array_interface__ or object that implements it."""
        if hasattr(cuda_obj, "__cuda_array_interface__"):
            cuda_array_interface = cuda_obj.__cuda_array_interface__
        else:
            cuda_array_interface = cuda_obj
        return CudaBuffer(
            shape=cuda_array_interface["shape"],
            dtype=np.dtype(cuda_array_interface["typestr"]),
            ptr=cuda_array_interface["data"][0],
        )

    @property
    def __cuda_array_interface__(self):
        """CUDA Array Interface for zero-copy interop."""
        return {
            "version": 3,
            "shape": self.shape,
            "typestr": self.dtype.str,
            "data": (int(self.ptr), False),
            "strides": None,
        }

    def __del__(self):
        """Free CUDA memory if we own it."""
        if self.owns_memory and hasattr(self, "ptr"):
            cudart.cudaFree(self.ptr)


Key features:

  • Allocates CUDA memory using cudart.cudaMalloc()

  • Implements __cuda_array_interface__ for zero-copy interop

  • Manages memory lifetime (frees memory in __del__ if it owns it)

  • Can wrap existing CUDA pointers without taking ownership via the from_cuda() static method
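
The ownership distinction matters for lifetime management: a buffer created with ptr=None frees its memory on garbage collection, while one created via from_cuda() does not. A short usage sketch (shape and dtype are arbitrary examples):

# Fresh allocation: CudaBuffer owns and will free this memory
owned = CudaBuffer(shape=(4, 4), dtype=np.float32)
assert owned.owns_memory

# Wrapping an existing buffer: no ownership, so no double-free
view = CudaBuffer.from_cuda(owned)
assert not view.owns_memory
assert view.ptr == owned.ptr  # same device memory, zero-copy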

Memory Copy Utilities:

def cuda_memcpy_h2d(
    host_array: np.ndarray,
    device_array: int | dict | object,
) -> None:
    """
    Copy host array to device array.

    Args:
        host_array: Host array to copy.
        device_array: Destination device memory: a raw device pointer (int),
            a __cuda_array_interface__ dict, or an object implementing the interface.
    """
    if hasattr(device_array, "__cuda_array_interface__"):
        device_array = device_array.__cuda_array_interface__["data"][0]
    elif isinstance(device_array, dict) and "data" in device_array:
        device_array = device_array["data"][0]
    elif not isinstance(device_array, int):
        raise ValueError("Invalid device array")
    (err,) = cudart.cudaMemcpy(
        device_array,
        host_array.ctypes.data,
        host_array.nbytes,
        cudart.cudaMemcpyKind.cudaMemcpyHostToDevice,
    )
    if err != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(f"cudaMemcpy failed: {err}")


def cuda_memcpy_d2h(
    device_array: int | dict | object,
    host_array: np.ndarray,
) -> None:
    """
    Copy device array to host array.

    Args:
        device_array: Source device memory: a raw device pointer (int),
            a __cuda_array_interface__ dict, or an object implementing the interface.
        host_array: Host array to copy to.
    """
    if hasattr(device_array, "__cuda_array_interface__"):
        device_array = device_array.__cuda_array_interface__["data"][0]
    elif isinstance(device_array, dict) and "data" in device_array:
        device_array = device_array["data"][0]
    elif not isinstance(device_array, int):
        raise ValueError("Invalid device array")
    (err,) = cudart.cudaMemcpy(
        host_array.ctypes.data,
        device_array,
        host_array.nbytes,
        cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost,
    )
    if err != cudart.cudaError_t.cudaSuccess:
        raise RuntimeError(f"cudaMemcpy failed: {err}")


These functions wrap cudart.cudaMemcpy() to transfer data between host (CPU) and device (GPU). Each accepts the device array as a raw device pointer (int), a __cuda_array_interface__ dict, or any object implementing the interface; for a cvcuda.Tensor, pass the buffer returned by its .cuda() method.
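
For illustration, this sketch shows the three equivalent ways of passing the device side, using the CudaBuffer class defined above:

host_array = np.arange(16, dtype=np.float32)
device_buffer = CudaBuffer(shape=(16,), dtype=np.float32)

# 1. An object implementing __cuda_array_interface__
cuda_memcpy_h2d(host_array, device_buffer)

# 2. The interface dictionary itself
cuda_memcpy_h2d(host_array, device_buffer.__cuda_array_interface__)

# 3. A raw device pointer
cuda_memcpy_h2d(host_array, int(device_buffer.ptr))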


Approach 1: Custom CudaBuffer

This approach uses the custom CudaBuffer class to manually allocate CUDA memory and transfer data.

Required Imports:

import numpy as np
import cvcuda

from cuda_python_common import CudaBuffer, cuda_memcpy_h2d, cuda_memcpy_d2h

CUDA Python to CV-CUDA:

# Allocate host data and CUDA buffer
numpy_array = np.random.randn(10, 10).astype(np.float32)
cuda_buffer = CudaBuffer(shape=(10, 10), dtype=np.float32)

# Copy host to GPU
cuda_memcpy_h2d(
    numpy_array,
    cuda_buffer,
)

# Create CV-CUDA tensor from the CUDA buffer
cvcuda_tensor = cvcuda.as_tensor(cuda_buffer)

CV-CUDA to CUDA Python:

# Get the CUDA array interface from the CV-CUDA tensor
# and create a new CUDA buffer from it
new_cuda_buffer = CudaBuffer.from_cuda(cvcuda_tensor.cuda())
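
To complete the round trip, copy the device data back to the host and compare it with the original array; a brief usage sketch:

# Copy device data back into a host array and verify the round trip
round_trip = np.empty((10, 10), dtype=np.float32)
cuda_memcpy_d2h(new_cuda_buffer, round_trip)
assert np.array_equal(round_trip, numpy_array)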

When to use this approach:

  • You need full control over memory allocation

  • You want explicit memory management

Complete Example: See samples/interoperability/cuda_python_interop_1.py


Approach 2: Using CV-CUDA Tensor as Buffer

This approach leverages CV-CUDA’s own memory allocation by using cvcuda.Tensor as a raw buffer, then copying data into it. This is useful when you want CV-CUDA to manage the memory.

Required Imports:

import numpy as np
import cvcuda

from cuda_python_common import cuda_memcpy_h2d, cuda_memcpy_d2h

CUDA Python to CV-CUDA:

# Create host data
numpy_array = np.random.randint(0, 256, size=(10, 10, 3), dtype=np.uint8)
# Allocate device memory of matching shape and dtype using a cvcuda.Tensor
cvcuda_tensor = cvcuda.Tensor((10, 10, 3), dtype=cvcuda.Type.U8)

# Copy host to GPU
cuda_memcpy_h2d(
    numpy_array,
    cvcuda_tensor.cuda(),
)
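
Reading results back works the same way in the other direction; a brief sketch:

# Copy device data back into a host array of matching shape and dtype
result = np.empty((10, 10, 3), dtype=np.uint8)
cuda_memcpy_d2h(cvcuda_tensor.cuda(), result)
assert np.array_equal(result, numpy_array)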

Key differences from Approach 1:

  • Uses cvcuda.Tensor to allocate GPU memory instead of CudaBuffer

  • CV-CUDA manages the memory lifecycle

Conversion to Tensor:
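
Since the buffer already lives in a cvcuda.Tensor, conversion amounts to re-wrapping it with the desired layout. A hedged sketch (the "HWC" layout is an assumption matching the (10, 10, 3) shape above):

# Re-wrap the tensor's buffer with an explicit HWC layout, zero-copy;
# the "HWC" layout here is an illustrative assumption
cvcuda_hwc_tensor = cvcuda.as_tensor(cvcuda_tensor.cuda(), "HWC")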

When to use this approach:

  • You want CV-CUDA to manage memory allocation

  • You’re working with contiguous data that will be reshaped

  • You want simpler memory management (no manual CudaBuffer)

  • You’re already using CV-CUDA Tensors in your pipeline

Complete Example: See samples/interoperability/cuda_python_interop_2.py


Summary of CUDA Python Approaches

Comparison of CUDA Python Approaches:

  Approach     Memory Allocation    Data Transfer                 Best Use Case
  Approach 1   Custom CudaBuffer    Manual cuda_memcpy_h2d/d2h    Full control, interfacing with existing CUDA code
  Approach 2   cvcuda.Tensor        Manual cuda_memcpy_h2d/d2h    Let CV-CUDA manage memory, simpler code

Both approaches achieve zero-copy interoperability between CUDA Python and CV-CUDA, but differ in memory management and allocation strategy.