Object Cache
CV-CUDA includes an internal resource management system that caches allocated objects for efficient reuse.
Objects used within CV-CUDA, such as cvcuda.Image, cvcuda.Tensor, cvcuda.ImageBatchVarShape, and cvcuda.TensorBatch, are automatically managed by the CV-CUDA cache.
Note
Only Python objects are cached; there is no C/C++ object caching.
Note
CV-CUDA is device agnostic and does not track which device the data resides on.
Wrapped vs Non-Wrapped Objects
The cache distinguishes between two types of objects based on memory ownership:
Non-wrapped objects are allocated by CV-CUDA and increase the cache size:
import cvcuda
import numpy as np


def main() -> None:
    # Non-wrapped object that increases cache memory
    tensor = cvcuda.Tensor(  # noqa: F841
        (16, 32, 4), np.float32, cvcuda.TensorLayout.HWC
    )
Wrapped objects wrap externally-managed memory and do not increase the cache size:
import cvcuda
import torch


def main() -> None:
    # Wrapped object that does not increase cache memory
    torch_tensor = torch.tensor([1], device="cuda", dtype=torch.uint8)
    cvcuda_tensor = cvcuda.as_tensor(torch_tensor, layout="N")  # noqa: F841
Del and Garbage Collection
Both cvcuda.Image and cvcuda.Tensor are managed by CV-CUDA if they are allocated via CV-CUDA.
When a cvcuda.Tensor or cvcuda.Image has been allocated by CV-CUDA and it goes out of scope, the underlying memory is not released.
Instead, it is stored and will be reused for future allocations via the CV-CUDA object cache.
For example, when using the del keyword in Python, only the reference to the object is removed.
When the only remaining reference to the object is from the CV-CUDA object cache, the underlying memory can be reused.
As such, it is best practice not to attempt to free memory manually.
Note
You can manually free all cached memory allocations by calling cvcuda.clear_cache().
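As a minimal sketch of this behavior (using cvcuda.current_cache_size_inbytes(), covered under Querying cache size below, to observe the cache):

import cvcuda
import numpy as np


def main() -> None:
    tensor = cvcuda.Tensor((16, 32, 4), np.float32, cvcuda.TensorLayout.HWC)
    size_before_del = cvcuda.current_cache_size_inbytes()

    # del only drops this reference; the allocation stays in the cache
    del tensor
    size_after_del = cvcuda.current_cache_size_inbytes()
    print(size_before_del == size_after_del)  # expected: True

    # Explicitly release all cached allocations
    cvcuda.clear_cache()
    print(cvcuda.current_cache_size_inbytes())  # expected: 0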
Cache Reuse
When a CV-CUDA object goes out of scope, its memory is retained in the cache for efficient reuse. Creating a new object with identical specifications (shape, data type, etc.) will reuse the cached memory:
import cvcuda
import numpy as np


def create_tensor1():
    tensor1 = cvcuda.Tensor(  # noqa: F841
        (16, 32, 4), np.float32, cvcuda.TensorLayout.HWC
    )


def create_tensor2():
    # Re-use the cache
    tensor2 = cvcuda.Tensor(  # noqa: F841
        (16, 32, 4), np.float32, cvcuda.TensorLayout.HWC
    )


def main() -> None:
    create_tensor1()
    # tensor1 goes out of scope after leaving create_tensor1()
    create_tensor2()
In this example, tensor2 reuses the memory from tensor1 since they have identical shapes and data types. No new memory allocation occurs.
Cache reuse also applies to wrapped objects, improving efficiency even though they don’t consume cache memory.
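One way to confirm that no new allocation occurs is to compare the cache size after each call; a minimal sketch (the make_tensor helper is illustrative):

import cvcuda
import numpy as np


def make_tensor() -> None:
    # Same shape and dtype each time, so the cached allocation is reused
    _ = cvcuda.Tensor((16, 32, 4), np.float32, cvcuda.TensorLayout.HWC)


def main() -> None:
    make_tensor()
    size_after_first = cvcuda.current_cache_size_inbytes()

    make_tensor()
    size_after_second = cvcuda.current_cache_size_inbytes()

    print(size_after_first == size_after_second)  # expected: True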
Controlling Cache Growth
Certain workflows can cause unbounded cache growth, particularly when creating many non-wrapped objects with different shapes:
import random

import cvcuda
import numpy as np


def create_tensor(h: int, w: int) -> None:
    _ = cvcuda.Tensor((h, w, 3), np.float32, cvcuda.TensorLayout.HWC)


def main(iters: int = 10_000) -> None:
    for _ in range(iters):
        h = random.randint(1000, 2000)
        w = random.randint(1000, 2000)
        create_tensor(h, w)
To manage cache growth, CV-CUDA provides a configurable cache limit. When the cache reaches this limit, it is automatically cleared. Objects larger than the cache limit are not cached at all.
Configuring the cache limit
import cvcuda


def main() -> None:
    # Get the cache limit (in bytes)
    current_cache_limit = cvcuda.get_cache_limit_inbytes()  # noqa: F841

    # Set the cache limit (in bytes)
    my_new_cache_limit = 12345  # in bytes
    cvcuda.set_cache_limit_inbytes(my_new_cache_limit)
By default, the cache limit is set to half the total GPU memory of the current device when importing cvcuda:
import cvcuda
import torch


def set_default_cache_limit() -> None:
    # Set the cache limit to half of the total GPU memory (in bytes)
    total_mem = torch.cuda.mem_get_info()[1]
    cvcuda.set_cache_limit_inbytes(total_mem // 2)
You can set the cache limit larger than a single GPU’s memory since CV-CUDA is device agnostic. For example, with two 24GB GPUs, you could set a cache limit exceeding 40GB if you distribute data across both devices.
Setting the cache limit to 0 effectively disables caching, though this may impact performance since memory cannot be reused.
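For example, a minimal sketch of disabling the cache:

import cvcuda
import numpy as np


def main() -> None:
    # A limit of 0 disables caching: every object is larger than the limit,
    # so nothing is retained for reuse
    cvcuda.set_cache_limit_inbytes(0)

    tensor = cvcuda.Tensor((16, 32, 4), np.float32, cvcuda.TensorLayout.HWC)  # noqa: F841
    print(cvcuda.current_cache_size_inbytes())  # expected: 0, the tensor is not cached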
Querying cache size
import cvcuda


def query_cache_size() -> None:
    print(cvcuda.current_cache_size_inbytes())
    img = cvcuda.Image.zeros((1, 1), cvcuda.Format.F32)  # noqa: F841
    print(cvcuda.current_cache_size_inbytes())
Multithreading Considerations
The cache uses thread-local storage internally. Objects created in one thread cannot be reused by another thread when they go out of scope.
Warning
Cache size and limits are shared between threads. Exercise caution in multithreaded applications.
You can clear the cache for the current thread by calling cvcuda.clear_cache() with cvcuda.ThreadScope.LOCAL, and query the thread-local cache size by calling cvcuda.cache_size() with cvcuda.ThreadScope.LOCAL.
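A minimal sketch (passing the scope as an argument is an assumption based on the description above):

import cvcuda


def thread_worker() -> None:
    # Query the cache size for objects created by this thread only
    print(cvcuda.cache_size(cvcuda.ThreadScope.LOCAL))

    # Clear only this thread's cache; other threads' caches are unaffected
    cvcuda.clear_cache(cvcuda.ThreadScope.LOCAL)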