v0.10.0-beta
Release Highlights
CV-CUDA v0.10.0 includes a critical bug fix (cache growth management) alongside the following changes:
New Features:
Added a mechanism to limit and manage cache memory consumption (includes new “Best Practices” documentation) [1].
Improved performance of color conversion operators (e.g., 2x faster RGB2YUV).
Refactored codebase to allow independent build of NVCV library (data structures).
Bug Fixes:
Fixed unbounded cache memory consumption issue [1].
Improved management of Python-created object lifetimes, now decoupled from cache management [1].
Fixed a potential crash in the Resize operator’s linear and nearest-neighbor interpolation caused by non-aligned vectorized writes.
Fixed Python CvtColor operator to correctly handle NV12 and NV21 outputs.
Fixed Resize and RandomResizedCrop linear interpolation weight for border rows and columns.
Fixed missing parameter in C API for fused ResizeCropConvertReformat.
Fixed several minor documentation and error output issues.
Fixed minor compiler warning while building Resize operator.
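To make the cache limit mentioned under New Features more concrete, here is a minimal, self-contained Python sketch of a byte-bounded cache. It is an illustration only: the class `BoundedCache`, its methods, and the oldest-first eviction policy are all hypothetical and are not CV-CUDA's implementation or API. The actual mechanism and its configuration are described in the “Best Practices” documentation.

```python
from collections import OrderedDict


class BoundedCache:
    """Toy byte-bounded cache (hypothetical; not CV-CUDA's implementation)."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.current_bytes = 0
        self._items = OrderedDict()  # key -> size in bytes, oldest first

    def add(self, key, size_bytes):
        # An item larger than the whole limit is never cached.
        if size_bytes > self.limit_bytes:
            return False
        # Replace an existing entry rather than counting it twice.
        if key in self._items:
            self.current_bytes -= self._items.pop(key)
        # Evict the oldest entries until the new item fits (assumed policy).
        while self.current_bytes + size_bytes > self.limit_bytes:
            _, evicted_size = self._items.popitem(last=False)
            self.current_bytes -= evicted_size
        self._items[key] = size_bytes
        self.current_bytes += size_bytes
        return True


cache = BoundedCache(limit_bytes=100)
cache.add("a", 60)
cache.add("b", 30)
cache.add("c", 40)  # evicts "a" so the total stays within 100 bytes
```

The point of the sketch is only that total cached bytes never exceed the configured limit; without such a bound, a workload that keeps creating differently shaped objects could grow the cache indefinitely, which is the issue this release addresses.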
Compatibility and Known Limitations
New limitations:
Cache/resource management introduced in v0.10 adds microsecond-level overhead to Python operator calls. Based on the performance analysis of our Python samples, we expect the production- and pipeline-level impact to be negligible. CUDA kernel and C++ call performance is not affected. We aim to investigate and reduce this overhead further in a future release.
Sporadic Pybind11 deallocation crashes have been reported in long-running multi-threaded Python pipelines with externally allocated memory (e.g., wrapped PyTorch buffers). We are evaluating an upgrade of Pybind11 (currently 2.10) as a potential fix in an upcoming release.
For the full list, see the main README in the CV-CUDA GitHub repository.
License
CV-CUDA is licensed under the Apache 2.0 license.
Resources
Acknowledgements
CV-CUDA is developed jointly by NVIDIA and the ByteDance Machine Learning team.
[1] These fixes and features add microsecond-level overhead to Python operator calls. Based on the performance analysis of our Python samples, we expect the production- and pipeline-level impact to be negligible. CUDA kernel and C++ call performance is not affected. We aim to investigate and reduce this overhead further in a future release.