NVCV Object Cache

CV-CUDA has internal resource management: Python objects that are used within CV-CUDA are added to CV-CUDA's NVCV cache.

Note: CV-CUDA is device agnostic, i.e., CV-CUDA does not know on which device the data resides!

Basics

The most prominent cached objects are of the following classes: Image, ImageBatch, Stream, Tensor, TensorBatch, ExternalCacheItem (e.g., an operator's payload).

With respect to the cache, we differentiate objects by how much cache memory they use: wrapped objects do not increase the cache's memory, whereas non-wrapped objects do.

An example of a non-wrapped object that increases the cache’s memory:

import nvcv
import numpy as np

tensor = nvcv.Tensor((16, 32, 4), np.float32, nvcv.TensorLayout.HWC)

Wrapped objects do not have their memory hosted by CV-CUDA, hence they do not increase the cache's memory. In the following Python snippet, cvcuda_tensor wraps an existing PyTorch tensor and therefore does not increase the cache's memory:

import nvcv
import torch

torch_tensor = torch.tensor([1], device="cuda", dtype=torch.uint8)
cvcuda_tensor = nvcv.as_tensor(torch_tensor)
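
One way to convince yourself of this is to query the cache size before and after wrapping. The following is a minimal sketch, assuming the cache is empty at the start; it uses nvcv.current_cache_inbytes(), which is described further down:

import nvcv
import torch

before = nvcv.current_cache_inbytes()

torch_tensor = torch.tensor([1], device="cuda", dtype=torch.uint8)
cvcuda_tensor = nvcv.as_tensor(torch_tensor)

after = nvcv.current_cache_inbytes()
# the data is owned by PyTorch, so the cache's memory should not grow
print(before, after)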

Cache Re-use

If a CV-CUDA object is created and goes out of scope, we can leverage the cache to efficiently create a new CV-CUDA object with the same characteristics, e.g., the same shape and data type:

import nvcv
import numpy as np

def create_tensor1():
    tensor1 = nvcv.Tensor((16, 32, 4), np.float32, nvcv.TensorLayout.HWC)
    return

def create_tensor2():
    # re-use the cache
    tensor2 = nvcv.Tensor((16, 32, 4), np.float32, nvcv.TensorLayout.HWC)
    return

create_tensor1()
# tensor1 goes out of scope after leaving ``create_tensor1()``
create_tensor2()

In this case, no new memory is allocated for tensor2; the memory from tensor1 is re-used, because both tensors have the same shape and data type.
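This re-use can also be observed via nvcv.current_cache_inbytes() (described further down). A minimal sketch, assuming the cache is empty at the start:

import nvcv
import numpy as np

def create_tensor():
    tensor = nvcv.Tensor((16, 32, 4), np.float32, nvcv.TensorLayout.HWC)
    return

create_tensor()
size_after_first = nvcv.current_cache_inbytes()
create_tensor()
size_after_second = nvcv.current_cache_inbytes()

# both sizes should match: the second call re-used the cached allocation
# instead of adding a new one
print(size_after_first, size_after_second)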

Cache re-use is also possible for wrapped objects: even though they do not increase the cache's memory, it is still more efficient to re-use the cache, as sketched below.
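
A minimal sketch of re-use with wrapped objects; the re-use happens internally, so it is not directly observable from Python:

import nvcv
import torch

torch_tensor = torch.zeros((16, 32, 4), device="cuda", dtype=torch.float32)

def wrap_tensor():
    # the wrapper (not the data) is kept in the cache after going out of scope
    cvcuda_tensor = nvcv.as_tensor(torch_tensor)
    return

wrap_tensor()
# the second call can re-use the cached wrapper object
wrap_tensor()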

Controlling the cache limit

Some workflows can cause the cache to grow significantly, e.g., when one keeps creating non-wrapped tensors of different shapes, so the cache is rarely re-used:

import nvcv
import numpy as np
import random

def create_tensor(h, w):
    tensor = nvcv.Tensor((h, w, 3), np.float32, nvcv.TensorLayout.HWC)
    return

while True:
    h = random.randint(1000, 2000)
    w = random.randint(1000, 2000)
    create_tensor(h, w)

To control this cache growth, CV-CUDA implements a user-configurable cache limit and an automatic clearance mechanism. When the cache hits that limit, it is automatically cleared. Similarly, if a single object is larger than the cache limit, it is not added to the cache at all. The cache limit can be controlled in the following manner:

import nvcv

# Get the cache limit (in bytes)
current_cache_limit = nvcv.get_cache_limit()

# Set the cache limit (in bytes)
my_new_cache_limit = 12345  # in bytes
nvcv.set_cache_limit(my_new_cache_limit)
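
The behavior for objects exceeding the limit, described above, can be sketched as follows (assuming an empty cache; the cache size query is described below):

import nvcv
import numpy as np

nvcv.set_cache_limit(1024)  # deliberately tiny limit, in bytes

# this tensor needs 16 * 32 * 4 * 4 = 8192 bytes, more than the limit,
# so it is not added to the cache at all
tensor = nvcv.Tensor((16, 32, 4), np.float32, nvcv.TensorLayout.HWC)

print(nvcv.current_cache_inbytes())  # expected: 0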

By default, the cache limit is set to half the total GPU memory of the current device at import time. This default is equivalent to:

import nvcv
import torch

# Equivalent to the default cache limit (in bytes)
total_mem = torch.cuda.mem_get_info()[1]
nvcv.set_cache_limit(total_mem // 2)

It is also possible to set the cache limit to a value larger than the total memory of a single GPU. Because CV-CUDA is device agnostic, cached data may be distributed across multiple devices. Consider a scenario where two GPUs with 24 GB of memory each are available, and 20 GB of data resides on each GPU. Setting the cache limit to more than 40 GB allows all of that data to be kept in the cache, even though the limit exceeds a single GPU's total memory. It is, however, the user's responsibility to distribute the data accordingly.
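
For such a multi-GPU setup, one possible (hypothetical) policy is to size the cache limit against the combined memory of all visible devices, here queried via torch:

import nvcv
import torch

total_mem_all_gpus = sum(
    torch.cuda.mem_get_info(device)[1]
    for device in range(torch.cuda.device_count())
)
nvcv.set_cache_limit(total_mem_all_gpus // 2)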

A cache limit of 0 effectively disables the cache. However, a low cache limit or a disabled cache can degrade performance, as already allocated memory is not re-used and new memory has to be allocated and deallocated instead.
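
A minimal sketch of the disabled-cache behavior, using the cache size query described below:

import nvcv
import numpy as np

nvcv.set_cache_limit(0)  # a limit of 0 disables the cache

tensor = nvcv.Tensor((16, 32, 4), np.float32, nvcv.TensorLayout.HWC)
print(nvcv.current_cache_inbytes())  # expected: 0, nothing is cached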

CV-CUDA also allows querying the current cache size (in bytes), which can be helpful for debugging:

import nvcv

# Cache size before the allocation
print(nvcv.current_cache_inbytes())

# Creating an image adds its allocation to the cache
img = nvcv.Image.zeros((1, 1), nvcv.Format.F32)

# Cache size afterwards: it has grown by the image's allocation
print(nvcv.current_cache_inbytes())