Object Cache
CV-CUDA includes an internal resource management system that caches allocated objects for efficient reuse.
Objects used within CV-CUDA, such as cvcuda.Image, cvcuda.Tensor, cvcuda.ImageBatchVarShape, and cvcuda.TensorBatch, are automatically managed by the CV-CUDA cache.
Note
Only Python objects are cached; there is no C/C++ object caching.
Note
CV-CUDA is device agnostic and does not track which device the data resides on.
Wrapped vs Non-Wrapped Objects
The cache distinguishes between two types of objects based on memory ownership:
Non-wrapped objects are allocated by CV-CUDA and increase the cache size:
import cvcuda
import numpy as np


def main() -> None:
    # Non-wrapped object that increases cache memory
    tensor = cvcuda.Tensor(  # noqa: F841
        (16, 32, 4), np.float32, cvcuda.TensorLayout.HWC
    )
Wrapped objects wrap externally-managed memory and do not increase the cache size:
import cvcuda
import torch


def main() -> None:
    # Wrapped object that does not increase cache memory
    torch_tensor = torch.tensor([1], device="cuda", dtype=torch.uint8)
    cvcuda_tensor = cvcuda.as_tensor(torch_tensor, layout="N")  # noqa: F841
Del and Garbage Collection
Both cvcuda.Image and cvcuda.Tensor are managed by CV-CUDA if they are allocated via CV-CUDA.
When a cvcuda.Tensor or cvcuda.Image has been allocated by CV-CUDA and it goes out of scope, the underlying memory is not released.
Instead, it is stored and will be reused for future allocations via the CV-CUDA object cache.
For example, when using the del keyword in Python, only the reference to the object is removed.
When the only remaining reference to the object is from the CV-CUDA object cache, the underlying memory can be reused.
As such, it is best practice not to attempt to free memory manually.
Note
You can manually free all cached memory allocations by calling cvcuda.clear_cache().
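As a minimal sketch of this behavior (using cvcuda.current_cache_size_inbytes(), covered under Querying cache size below, to observe the cache):

import cvcuda
import numpy as np


def main() -> None:
    tensor = cvcuda.Tensor((16, 32, 4), np.float32, cvcuda.TensorLayout.HWC)
    size_before_del = cvcuda.current_cache_size_inbytes()

    # del only drops this reference; the allocation stays in the cache
    del tensor
    size_after_del = cvcuda.current_cache_size_inbytes()
    print(size_before_del == size_after_del)  # expected: True

    # Explicitly release all cached allocations
    cvcuda.clear_cache()
    print(cvcuda.current_cache_size_inbytes())  # expected: 0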
Cache Reuse
When a CV-CUDA object goes out of scope, its memory is retained in the cache for efficient reuse. Creating a new object with identical specifications (shape, data type, etc.) will reuse the cached memory:
import cvcuda
import numpy as np


def create_tensor1():
    tensor1 = cvcuda.Tensor(  # noqa: F841
        (16, 32, 4), np.float32, cvcuda.TensorLayout.HWC
    )


def create_tensor2():
    # Re-use the cache
    tensor2 = cvcuda.Tensor(  # noqa: F841
        (16, 32, 4), np.float32, cvcuda.TensorLayout.HWC
    )


def main() -> None:
    create_tensor1()
    # tensor1 goes out of scope after leaving create_tensor1()
    create_tensor2()
In this example, tensor2 reuses the memory from tensor1 since they have identical shapes and data types. No new memory allocation occurs.
Cache reuse also applies to wrapped objects, improving efficiency even though they don’t consume cache memory.
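One way to confirm that no new allocation occurs is to compare the cache size after each call; a minimal sketch (the make_tensor helper is illustrative):

import cvcuda
import numpy as np


def make_tensor() -> None:
    # Same shape and dtype each time, so the cached allocation is reused
    _ = cvcuda.Tensor((16, 32, 4), np.float32, cvcuda.TensorLayout.HWC)


def main() -> None:
    make_tensor()
    size_after_first = cvcuda.current_cache_size_inbytes()

    make_tensor()
    size_after_second = cvcuda.current_cache_size_inbytes()

    print(size_after_first == size_after_second)  # expected: True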
Controlling Cache Growth
Certain workflows can cause unbounded cache growth, particularly when creating many non-wrapped objects with different shapes:
import random

import cvcuda
import numpy as np


def create_tensor(h: int, w: int) -> None:
    _ = cvcuda.Tensor((h, w, 3), np.float32, cvcuda.TensorLayout.HWC)


def main(iters: int = 10_000) -> None:
    for _ in range(iters):
        h = random.randint(1000, 2000)
        w = random.randint(1000, 2000)
        create_tensor(h, w)
To manage cache growth, CV-CUDA provides a configurable cache limit. When the cache reaches this limit, it is automatically cleared. Objects larger than the cache limit are not cached at all.
Configuring the cache limit
import cvcuda


def main() -> None:
    # Get the cache limit (in bytes)
    current_cache_limit = cvcuda.get_cache_limit_inbytes()  # noqa: F841

    # Set the cache limit (in bytes)
    my_new_cache_limit = 12345  # in bytes
    cvcuda.set_cache_limit_inbytes(my_new_cache_limit)
By default, the cache limit is set to half the total GPU memory of the current device when importing cvcuda:
import cvcuda
import torch


def set_default_cache_limit() -> None:
    # Set the cache limit to half of the total GPU memory (in bytes)
    total_mem = torch.cuda.mem_get_info()[1]
    cvcuda.set_cache_limit_inbytes(total_mem // 2)
You can set the cache limit larger than a single GPU’s memory since CV-CUDA is device agnostic. For example, with two 24GB GPUs, you could set a cache limit exceeding 40GB if you distribute data across both devices.
Setting the cache limit to 0 effectively disables caching, though this may impact performance since memory cannot be reused.
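For example, a minimal sketch of disabling the cache:

import cvcuda
import numpy as np


def main() -> None:
    # A limit of 0 disables caching: every object is larger than the limit,
    # so nothing is retained for reuse
    cvcuda.set_cache_limit_inbytes(0)

    tensor = cvcuda.Tensor((16, 32, 4), np.float32, cvcuda.TensorLayout.HWC)  # noqa: F841
    print(cvcuda.current_cache_size_inbytes())  # expected: 0, the tensor is not cached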
Querying cache size
import cvcuda


def query_cache_size() -> None:
    print(cvcuda.current_cache_size_inbytes())
    img = cvcuda.Image.zeros((1, 1), cvcuda.Format.F32)  # noqa: F841
    print(cvcuda.current_cache_size_inbytes())
Multithreading Considerations
The cache uses thread-local storage internally. Objects created in one thread cannot be reused by another thread when they go out of scope.
Warning
Cache size and limits are shared between threads. Exercise caution in multithreaded applications.
You can clear the cache for the current thread by calling cvcuda.clear_cache() with cvcuda.ThreadScope.LOCAL, and query the thread-local cache size by calling cvcuda.cache_size() with cvcuda.ThreadScope.LOCAL.
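A minimal sketch (passing the scope as an argument is an assumption based on the description above):

import cvcuda


def thread_worker() -> None:
    # Query the cache size for objects created by this thread only
    print(cvcuda.cache_size(cvcuda.ThreadScope.LOCAL))

    # Clear only this thread's cache; other threads' caches are unaffected
    cvcuda.clear_cache(cvcuda.ThreadScope.LOCAL)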