CUDA Python
CUDA Python provides direct Python bindings to the CUDA Runtime API (cuda.bindings.runtime),
offering the lowest-level control over CUDA operations. Because CUDA Python maps directly onto
the CUDA Runtime API, you can use it to allocate memory on the GPU, copy data between CPU and GPU,
and even write your own CUDA kernels while using CV-CUDA.
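For example, a minimal allocation round trip with the runtime bindings looks like this (a sketch; every binding call returns its cudaError_t status first, which production code should check):

from cuda.bindings import runtime as cudart

# Allocate 1 KiB of device memory; calls return (status, result...)
err, dev_ptr = cudart.cudaMalloc(1024)
assert err == cudart.cudaError_t.cudaSuccess

# Free it again; cudaFree returns only the status tuple
(err,) = cudart.cudaFree(dev_ptr)
assert err == cudart.cudaError_t.cudaSuccess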
Common Utilities
To simplify working with CUDA Python, we provide utility functions in cuda_python_common.py:
CudaBuffer Class:
The CudaBuffer class is a lightweight wrapper that implements the __cuda_array_interface__
protocol, making raw CUDA memory accessible to CV-CUDA and other frameworks:
import numpy as np
from cuda.bindings import runtime as cudart


class CudaBuffer:
    """Wrapper for CUDA memory buffer that implements __cuda_array_interface__."""

    def __init__(self, shape, dtype, ptr=None):
        """Initialize CUDA buffer.

        Args:
            shape: tuple of dimensions
            dtype: numpy dtype
            ptr: CUDA device pointer (if None, allocates new memory)
        """
        self.shape = shape
        self.dtype = np.dtype(dtype)
        self.size = int(np.prod(shape)) * self.dtype.itemsize
        if ptr is None:
            err, self.ptr = cudart.cudaMalloc(self.size)
            if err != cudart.cudaError_t.cudaSuccess:
                raise RuntimeError(f"cudaMalloc failed: {err}")
            self.owns_memory = True
        else:
            self.ptr = ptr
            self.owns_memory = False

    @staticmethod
    def from_cuda(cuda_obj) -> "CudaBuffer":
        """Create from __cuda_array_interface__ or object that implements it."""
        if hasattr(cuda_obj, "__cuda_array_interface__"):
            cuda_array_interface = cuda_obj.__cuda_array_interface__
        else:
            cuda_array_interface = cuda_obj
        return CudaBuffer(
            shape=cuda_array_interface["shape"],
            dtype=np.dtype(cuda_array_interface["typestr"]),
            ptr=cuda_array_interface["data"][0],
        )

    @property
    def __cuda_array_interface__(self):
        """CUDA Array Interface for zero-copy interop."""
        return {
            "version": 3,
            "shape": self.shape,
            "typestr": self.dtype.str,
            "data": (int(self.ptr), False),
            "strides": None,
        }

    def __del__(self):
        """Free CUDA memory if we own it."""
        if self.owns_memory and hasattr(self, "ptr"):
            cudart.cudaFree(self.ptr)
Key features:
Allocates CUDA memory using cudart.cudaMalloc()
Implements __cuda_array_interface__ for zero-copy interop
Manages memory lifetime (frees memory in __del__ if it owns it)
Can wrap existing CUDA pointers without taking ownership via the from_cuda() class method
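As a quick sketch of how the wrapper behaves (assuming cuda_python_common.py is importable), allocating a buffer and creating a non-owning view of it looks like this:

import numpy as np
from cuda_python_common import CudaBuffer

# Owning buffer: allocates 4 * 4 * 4 = 64 bytes of device memory
buf = CudaBuffer(shape=(4, 4), dtype=np.float32)
print(buf.__cuda_array_interface__)
# {'version': 3, 'shape': (4, 4), 'typestr': '<f4', 'data': (<ptr>, False), 'strides': None}

# Non-owning view of the same memory: freed only by `buf`, never by `view`
view = CudaBuffer.from_cuda(buf)
assert view.owns_memory is False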
Memory Copy Utilities:
def cuda_memcpy_h2d(
host_array: np.ndarray,
device_array: int | dict | object,
) -> None:
"""
Copy host array to device array.
Args:
host_array: Host array to copy.
device_array: Device array to copy to, __cuda_array_interface__ or object with interface.
"""
if hasattr(device_array, "__cuda_array_interface__"):
device_array = device_array.__cuda_array_interface__["data"][0]
elif isinstance(device_array, dict) and "data" in device_array:
device_array = device_array["data"][0]
elif not isinstance(device_array, int):
raise ValueError("Invalid device array")
(err,) = cudart.cudaMemcpy(
device_array,
host_array.ctypes.data,
host_array.nbytes,
cudart.cudaMemcpyKind.cudaMemcpyHostToDevice,
)
if err != cudart.cudaError_t.cudaSuccess:
raise RuntimeError(f"cudaMemcpy failed: {err}")
def cuda_memcpy_d2h(
device_array: int | dict | object,
host_array: np.ndarray,
) -> None:
"""
Copy device array to host array.
Args:
device_array: Device array to copy from, __cuda_array_interface__ or object with interface.
host_array: Host array to copy to.
"""
if hasattr(device_array, "__cuda_array_interface__"):
device_array = device_array.__cuda_array_interface__["data"][0]
elif isinstance(device_array, dict) and "data" in device_array:
device_array = device_array["data"][0]
elif not isinstance(device_array, int):
raise ValueError("Invalid device array")
(err,) = cudart.cudaMemcpy(
host_array.ctypes.data,
device_array,
host_array.nbytes,
cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost,
)
if err != cudart.cudaError_t.cudaSuccess:
raise RuntimeError(f"cudaMemcpy failed: {err}")
These functions wrap cudart.cudaMemcpy() to transfer data between host (CPU) and device (GPU).
Each accepts the device-side argument as a raw device pointer, a __cuda_array_interface__ dictionary,
or any object that implements the interface. For a cvcuda.Tensor, pass the buffer returned by its .cuda() method.
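To make the accepted forms concrete, the following sketch (reusing the CudaBuffer from above) passes the same destination in all three ways:

import numpy as np
from cuda_python_common import CudaBuffer, cuda_memcpy_h2d

host = np.arange(16, dtype=np.float32).reshape(4, 4)
buf = CudaBuffer(shape=(4, 4), dtype=np.float32)

cuda_memcpy_h2d(host, buf)                                      # object implementing the interface
cuda_memcpy_h2d(host, buf.__cuda_array_interface__)             # interface dictionary
cuda_memcpy_h2d(host, buf.__cuda_array_interface__["data"][0])  # raw device pointer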
Approach 1: Custom CudaBuffer
This approach uses the custom CudaBuffer class to manually allocate CUDA memory and transfer data.
Required Imports:
import numpy as np
import cvcuda
from cuda_python_common import CudaBuffer, cuda_memcpy_h2d, cuda_memcpy_d2h
CUDA Python to CV-CUDA:
# Allocate host data and CUDA buffer
numpy_array = np.random.randn(10, 10).astype(np.float32)
cuda_buffer = CudaBuffer(shape=(10, 10), dtype=np.float32)
# Copy host to GPU
cuda_memcpy_h2d(
    numpy_array,
    cuda_buffer,
)
# Create CV-CUDA tensor from the CUDA buffer
cvcuda_tensor = cvcuda.as_tensor(cuda_buffer)
CV-CUDA to CUDA Python:
# Get the CUDA array interface from the CV-CUDA tensor
# and create a new CUDA buffer from it
new_cuda_buffer = CudaBuffer.from_cuda(cvcuda_tensor.cuda())
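To complete the round trip, the device data can be copied back to the host with cuda_memcpy_d2h (the verification step here is illustrative):

# Copy the device data back into a host array and verify the round trip
result = np.empty((10, 10), dtype=np.float32)
cuda_memcpy_d2h(new_cuda_buffer, result)
assert np.array_equal(result, numpy_array)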
When to use this approach:
You need full control over memory allocation
You want explicit memory management
Complete Example: See samples/interoperability/cuda_python_interop_1.py
Approach 2: Using CV-CUDA Tensor as Buffer
This approach leverages CV-CUDA’s own memory allocation by using cvcuda.Tensor as a
raw buffer, then copying data into it. This is useful when you want CV-CUDA to manage the memory.
Required Imports:
import numpy as np
import cvcuda
from cuda_python_common import cuda_memcpy_h2d, cuda_memcpy_d2h
CUDA Python to CV-CUDA:
# Allocate host data with random 8-bit values
numpy_array = np.random.randint(0, 256, size=(10, 10, 3), dtype=np.uint8)
# Allocate data of identical size to the numpy_array using a cvcuda.Tensor
cvcuda_tensor = cvcuda.Tensor((10, 10, 3), dtype=cvcuda.Type.U8)
# Copy host to GPU
cuda_memcpy_h2d(
    numpy_array,
    cvcuda_tensor.cuda(),
)
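The copy back to the host works the same way, using the tensor's .cuda() buffer as the source (a sketch):

# Copy device data back to the host via the tensor's CUDA buffer
round_trip = np.empty((10, 10, 3), dtype=np.uint8)
cuda_memcpy_d2h(cvcuda_tensor.cuda(), round_trip)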
Key differences from Approach 1:
Uses cvcuda.Tensor to allocate GPU memory instead of CudaBuffer
CV-CUDA manages the memory lifecycle
Conversion to Tensor:
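The conversion step itself can be done as a zero-copy reinterpretation of the tensor's raw CUDA buffer. A minimal sketch follows; the "HWC" layout string is an assumption for illustration:

# Zero-copy view of the same device memory with an explicit layout
# (the "HWC" layout here is an assumed example)
hwc_tensor = cvcuda.as_tensor(cvcuda_tensor.cuda(), layout="HWC")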
When to use this approach:
You want CV-CUDA to manage memory allocation
You’re working with contiguous data that will be reshaped
You want simpler memory management (no manual CudaBuffer)
You’re already using CV-CUDA Tensors in your pipeline
Complete Example: See samples/interoperability/cuda_python_interop_2.py
Summary of CUDA Python Approaches
| Approach | Memory Allocation | Data Transfer | Best Use Case |
|---|---|---|---|
| Approach 1 | Custom CudaBuffer | Manual | Full control, interfacing with existing CUDA code |
| Approach 2 | cvcuda.Tensor | Manual | Let CV-CUDA manage memory, simpler code |
Both approaches achieve zero-copy interoperability between CUDA Python and CV-CUDA, but differ in memory management and allocation strategy.