..
   # SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
   # SPDX-License-Identifier: Apache-2.0
   #
   # Licensed under the Apache License, Version 2.0 (the "License");
   # you may not use this file except in compliance with the License.
   # You may obtain a copy of the License at
   #
   # http://www.apache.org/licenses/LICENSE-2.0
   #
   # Unless required by applicable law or agreed to in writing, software
   # distributed under the License is distributed on an "AS IS" BASIS,
   # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   # See the License for the specific language governing permissions and
   # limitations under the License.

.. _nvcvobjectcache:

NVCV Object Cache
=================

CV-CUDA performs internal resource management: Python objects that are used within CV-CUDA are added to CV-CUDA's NVCV cache.

.. note::
   CV-CUDA is device agnostic, i.e. CV-CUDA does not know on which device the data resides.

Basics
------

The most prominent cached objects are of the following classes: ``Image``, ``ImageBatch``, ``Stream``, ``Tensor``, ``TensorBatch``, ``ExternalCacheItem`` (e.g. an operator's payload).

With respect to the cache, we distinguish objects by whether their memory counts towards the cache: non-wrapped objects increase the cache's size, while wrapped objects do not.

An example of a non-wrapped object, which increases the cache's memory::

    import nvcv
    import numpy as np

    tensor = nvcv.Tensor((16, 32, 4), np.float32, nvcv.TensorLayout.HWC)

Wrapped objects are objects whose memory is not hosted by CV-CUDA, hence they do not increase the cache's memory. In the following Python snippet, ``cvcuda_tensor`` wraps a PyTorch tensor and therefore does not increase the cache's memory::

    import nvcv
    import torch

    torch_tensor = torch.tensor([1], device="cuda", dtype=torch.uint8)
    cvcuda_tensor = nvcv.as_tensor(torch_tensor)

Cache Re-use
------------

If a CV-CUDA object is created and runs out of scope, we can leverage the cache to efficiently create a new CV-CUDA object with the same specifics, e.g. the same shape and data type::

    import nvcv
    import numpy as np

    def create_tensor1():
        tensor1 = nvcv.Tensor((16, 32, 4), np.float32, nvcv.TensorLayout.HWC)
        return

    def create_tensor2():
        # re-use the cache
        tensor2 = nvcv.Tensor((16, 32, 4), np.float32, nvcv.TensorLayout.HWC)
        return

    create_tensor1()
    # tensor1 runs out of scope after leaving create_tensor1()
    create_tensor2()

In this case, no new memory is allocated for ``tensor2``; the memory of ``tensor1`` is re-used, because ``tensor1`` and ``tensor2`` have the same shape and data type.

Cache re-use is also possible for wrapped objects (even though they do not increase the cache's memory, re-using the cache is still more efficient).

Controlling the cache limit
---------------------------

Some workflows can cause the cache to grow significantly, e.g. if one keeps creating non-wrapped tensors of different shapes, so the cache is rarely re-used::

    import nvcv
    import numpy as np
    import random

    def create_tensor(h, w):
        tensor = nvcv.Tensor((h, w, 3), np.float32, nvcv.TensorLayout.HWC)
        return

    while True:
        h = random.randint(1000, 2000)
        w = random.randint(1000, 2000)
        create_tensor(h, w)

To control that cache growth, CV-CUDA implements a user-configurable cache limit and an automatic clearance mechanism. When the cache hits that limit, it is automatically cleared. Similarly, if a single object is larger than the cache limit, it is not added to the cache at all.
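A minimal sketch of this behavior, using the limit-setting function described below (the 1 MB limit and the tensor shapes are only illustrative values)::

    import nvcv
    import numpy as np

    # Illustrative limit: allow at most ~1 MB of cached memory
    nvcv.set_cache_limit_inbytes(1024 * 1024)

    # 16 * 32 * 4 floats = 8 KiB: fits within the limit,
    # so the memory is kept in the cache for later re-use
    tensor = nvcv.Tensor((16, 32, 4), np.float32, nvcv.TensorLayout.HWC)
    del tensor

    # 1024 * 1024 * 3 floats = 12 MiB: larger than the cache limit,
    # so this tensor is not added to the cache when it runs out of scope
    big_tensor = nvcv.Tensor((1024, 1024, 3), np.float32, nvcv.TensorLayout.HWC)
    del big_tensor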
The cache limit can be controlled in the following manner::

    import nvcv

    # Get the cache limit (in bytes)
    current_cache_limit = nvcv.get_cache_limit_inbytes()

    # Set the cache limit (in bytes)
    my_new_cache_limit = 12345  # in bytes
    nvcv.set_cache_limit_inbytes(my_new_cache_limit)

By default, the cache limit is set to half the total GPU memory of the current device when importing cvcuda, which is equivalent to::

    import nvcv
    import torch

    # Set the cache limit (in bytes) to half the total GPU memory
    total_mem = torch.cuda.mem_get_info()[1]
    nvcv.set_cache_limit_inbytes(total_mem // 2)

It is also possible to set the cache limit to a value larger than the total memory of a single GPU. Because CV-CUDA is device agnostic, the cache can span data residing on multiple devices. Consider a scenario with two GPUs of 24GB each, where 20GB of data resides on each GPU. Setting the cache limit to more than 40GB allows all of that data to be kept in the cache, even though the limit exceeds a single GPU's total memory. It is, however, the user's responsibility to distribute the data accordingly.

A cache limit of 0 effectively disables the cache. Note, however, that a low cache limit or a disabled cache can hurt performance, since already allocated memory is not re-used and new memory has to be allocated and deallocated instead.

CV-CUDA also allows querying the current cache size (in bytes), which can be helpful for debugging::

    import nvcv

    print(nvcv.current_cache_size_inbytes())

    img = nvcv.Image.zeros((1, 1), nvcv.Format.F32)
    print(nvcv.current_cache_size_inbytes())

Using the cache with multiple threads
-------------------------------------

Internally, the cache uses thread-local storage. As a result, CV-CUDA objects created in one thread cannot be re-used by another thread when they run out of scope.

.. warning::
   Since the cache size and limit are shared between threads, care must be taken in multithreaded applications.

It is possible to clear the cache of the current thread using ``nvcv.clear_cache(nvcv.ThreadScope.LOCAL)``. Similarly, ``nvcv.cache_size(nvcv.ThreadScope.LOCAL)`` allows querying the number of elements in the cache for the current thread:

.. code-block:: python

    import threading
    import nvcv
    import numpy as np

    def create_tensor_and_clear():
        tensor = nvcv.Tensor((16, 32, 4), np.float32, nvcv.TensorLayout.HWC)
        print(nvcv.cache_size(), nvcv.cache_size(nvcv.ThreadScope.LOCAL))  # 2 1
        nvcv.clear_cache(nvcv.ThreadScope.LOCAL)
        print(nvcv.cache_size(), nvcv.cache_size(nvcv.ThreadScope.LOCAL))  # 1 0

    tensor = nvcv.Tensor((16, 32, 4), np.float32, nvcv.TensorLayout.HWC)
    threading.Thread(target=create_tensor_and_clear).start()
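Because entries cached by a worker thread cannot be re-used by any other thread, one possible pattern (a sketch, not the only option) is to clear a thread's local cache once the thread no longer needs its cached objects, so those entries stop counting against the cache size and limit shared by all threads:

.. code-block:: python

    import threading
    import nvcv
    import numpy as np

    def worker():
        # Non-wrapped tensors created here land in this thread's local cache;
        # other threads cannot re-use them once they run out of scope.
        for _ in range(10):
            tensor = nvcv.Tensor((720, 1280, 3), np.float32, nvcv.TensorLayout.HWC)
            # ... work with tensor ...

        # Drop this thread's cached entries, reducing the shared cache size.
        nvcv.clear_cache(nvcv.ThreadScope.LOCAL)

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print(nvcv.cache_size())  # only entries created by the main thread (if any) remain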