Classification
In this example, we use CVCUDA to accelerate the pre- and post-processing pipelines in a deep learning inference use case involving an image classification model. The deep learning model can use either PyTorch or TensorRT to run the inference. The pre-processing pipeline converts the input into the format required by the input layer of the model, whereas the post-processing pipeline converts the output produced by the model into a visualization-friendly format. We use the ResNet50 model (from torchvision) to generate the predictions. This sample can work on a single image, a folder full of images, or a single video. All images must be in JPEG format and have the same dimensions unless the sample is run with a batch size of one. The video must be in MP4 format with a fixed frame rate. We use nvImageCodec based decoding to read the images and PyNvVideoCodec based decoding to read the videos.
The exact pre-processing operations are:
Decode Data -> Resize -> Convert Datatype (Float) -> Normalize (to 0-1 range, mean and stddev) -> Convert to NCHW
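The normalization and layout steps above can be sketched in plain Python for illustration. The real pipeline runs these as CVCUDA operations on the GPU; the mean/std values below are the standard ImageNet ones used by torchvision's ResNet50, and the helper names are hypothetical.

```python
# Illustrative pure-Python sketch of the per-pixel pre-processing math.
# The actual sample performs these steps as batched CVCUDA ops on the GPU.

MEAN = [0.485, 0.456, 0.406]  # per-channel ImageNet mean (R, G, B)
STD = [0.229, 0.224, 0.225]   # per-channel ImageNet stddev (R, G, B)

def normalize_pixel(value_uint8, channel):
    """Convert a uint8 pixel to float, scale to the 0-1 range, then
    normalize with the per-channel mean and stddev."""
    scaled = value_uint8 / 255.0  # Convert Datatype (Float) + 0-1 range
    return (scaled - MEAN[channel]) / STD[channel]

def hwc_to_chw(image):
    """Reorder an H x W x C nested list into C x H x W, i.e. the per-image
    layout inside an NCHW batch."""
    h, w, c = len(image), len(image[0]), len(image[0][0])
    return [
        [[image[y][x][ch] for x in range(w)] for y in range(h)]
        for ch in range(c)
    ]
```

In the actual sample the same math runs as a handful of CVCUDA operator calls over the whole batch at once, which is what makes the pipeline fast.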
The exact post-processing operations are:
Sorting the probabilities to get the top N -> Print the top N classes with the confidence scores
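The post-processing step boils down to a top-N selection over the probability vector. A minimal sketch in plain Python (label names here are placeholders, not the actual ImageNet labels):

```python
# Sketch of the post-processing step: given a flat list of class
# probabilities, pick the top N classes and print them with confidences.
import heapq

def top_n_classes(probabilities, labels, n=5):
    """Return (label, probability) pairs for the N highest probabilities,
    sorted from most to least confident."""
    top = heapq.nlargest(n, range(len(probabilities)), key=lambda i: probabilities[i])
    return [(labels[i], probabilities[i]) for i in top]

for label, prob in top_n_classes([0.1, 0.6, 0.3], ["cat", "dog", "fox"], n=2):
    print("%s: %.3f%%" % (label, prob * 100.0))
```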
Writing the Sample App
The classification sample app has been designed to be modular in all aspects. It imports and uses various modules such as data decoders, pipeline pre- and post-processors and the model inference. Some of these modules are defined in the same folder as the sample, whereas the rest are defined in the common scripts folder for wider re-use.
Modules used by this sample app that are defined in the common folder (i.e. not specific to just this sample) are:

- ImageBatchDecoder for nvImageCodec based image decoding
- VideoBatchDecoder for PyNvVideoCodec based video decoding

Modules specific to this sample (i.e. defined in the classification sample folder) are:

- PreprocessorCvcuda and PostprocessorCvcuda for CVCUDA based pre- and post-processing pipelines
- ClassificationPyTorch and ClassificationTensorRT for the model inference
The first stage in our pipeline is importing all necessary Python modules. Apart from the modules described above, this also includes torch, the main cvcuda package and pycuda, among others. Be sure to import pycuda.driver before importing any other GPU packages like torch or cvcuda to ensure proper initialization.
# NOTE: One must import PyCuda driver first, before CVCUDA or VPF otherwise
# things may throw unexpected errors.
import pycuda.driver as cuda
import os
import sys
import logging
import cvcuda
import torch

# Bring the commons folder from the samples directory into our path so that
# we can import modules from it.
common_dir = os.path.join(
    os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))),
    "common",
    "python",
)
sys.path.insert(0, common_dir)

from perf_utils import (  # noqa: E402
    CvCudaPerf,
    get_default_arg_parser,
    parse_validate_default_args,
)

from nvcodec_utils import (  # noqa: E402
    VideoBatchDecoder,
    ImageBatchDecoder,
)

from pipelines import (  # noqa: E402
    PreprocessorCvcuda,
    PostprocessorCvcuda,
)

from model_inference import (  # noqa: E402
    ClassificationPyTorch,
    ClassificationTensorRT,
)
39
Then we define the main function, which helps us parse the various configuration parameters used throughout this sample as command-line arguments. This sample allows configuring the following parameters. All of them have default values already set, so one can execute the sample without supplying any. Some of these arguments are shared across many other CVCUDA samples and hence come from the perf_utils.py module's get_default_arg_parser() method.
- -i, --input_path : Either a path to a JPEG image/MP4 video or a directory containing JPG images to use as input. When pointing to a directory, only JPG images will be read.
- -o, --output_dir : The directory where the outputs should be stored.
- -th, --target_img_height : The height to which you want to resize the input image before running inference.
- -tw, --target_img_width : The width to which you want to resize the input image before running inference.
- -b, --batch_size : The batch size used during inference. If only one image is used as input, the same input image will be read and used this many times. Useful for performance benchmarking.
- -d, --device_id : The GPU device to use for this sample.
- -bk, --backend : The inference backend to use. Currently supports PyTorch or TensorRT.
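The flags above could be wired up with argparse roughly as follows. This is a hedged sketch, not the sample's actual parser (which is built largely via perf_utils.get_default_arg_parser()); the default values shown here are illustrative.

```python
# Illustrative argparse wiring for the command-line flags described above.
# Defaults are assumptions for this sketch, not the sample's actual values.
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="CVCUDA classification sample")
    parser.add_argument("-i", "--input_path", default="assets/images/tabby_tiger_cat.jpg",
                        help="JPEG image/MP4 video or a directory of JPG images.")
    parser.add_argument("-o", "--output_dir", default="/tmp",
                        help="Directory where outputs are stored.")
    parser.add_argument("-th", "--target_img_height", type=int, default=224,
                        help="Resize height before inference.")
    parser.add_argument("-tw", "--target_img_width", type=int, default=224,
                        help="Resize width before inference.")
    parser.add_argument("-b", "--batch_size", type=int, default=4,
                        help="Batch size used during inference.")
    parser.add_argument("-d", "--device_id", type=int, default=0,
                        help="GPU device to use.")
    parser.add_argument("-bk", "--backend", choices=["pytorch", "tensorrt"],
                        default="tensorrt", help="Inference backend.")
    return parser

args = build_parser().parse_args([])  # no flags supplied -> all defaults
```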
Once we are done parsing all the command-line arguments, we set up the CvCudaPerf object for any performance benchmarking needs and simply call the run_sample function with all those arguments.
cvcuda_perf = CvCudaPerf("classification_sample", default_args=args)
run_sample(
    args.input_path,
    args.output_dir,
    args.batch_size,
    args.target_img_height,
    args.target_img_width,
    args.device_id,
    args.backend,
    cvcuda_perf,
)
The run_sample function is the primary function that runs this sample. It sets up the requested CUDA device, CUDA context and CUDA stream. CUDA streams help us execute CUDA operations on a non-default stream and enhance overall performance. Additionally, NVTX markers are used throughout this sample (via CvCudaPerf) to facilitate performance benchmarking using NVIDIA Nsight Systems and benchmark.py. To keep things simple, we create only one CUDA stream to run all the stages of this sample. The same stream is available in CVCUDA, PyTorch and TensorRT.
cvcuda_perf.push_range("run_sample")

# Define the objects that handle the pipeline stages
image_size = (target_img_width, target_img_height)

# Define the cuda device, context and streams.
cuda_device = cuda.Device(device_id)
cuda_ctx = cuda_device.retain_primary_context()
cuda_ctx.push()
# Use the default stream for cvcuda and torch
# Since we never created a stream current will be the CUDA default stream
cvcuda_stream = cvcuda.Stream().current
torch_stream = torch.cuda.default_stream(device=cuda_device)
Next, we instantiate various classes to help us run the sample. These classes are:

- PreprocessorCvcuda : A CVCUDA based pre-processing pipeline for classification.
- ImageBatchDecoder : An nvImageCodec based image decoder to read the images.
- VideoBatchDecoder : A PyNvVideoCodec based video decoder to read the video.
- PostprocessorCvcuda : A post-processing pipeline for classification.
- ClassificationPyTorch : A PyTorch based classification model to execute inference.
- ClassificationTensorRT : A TensorRT based classification model to execute inference.
These classes are defined in a modular fashion and expose a uniform interface, which allows them to be easily plugged into appropriate places. For example, one can use the same API to decode a batch of images with the image decoder as to decode a batch of video frames with the video decoder. Similarly, one can invoke the inference in exactly the same way with PyTorch as with TensorRT.
Additionally, the decoder interface also exposes start and join methods, making it easy to upgrade the decoders to a multi-threaded environment (if needed). Such multi-threading capabilities are slated for a future release.
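The uniform decoder interface described above can be sketched as a small base class. The class names and details here are illustrative, not the sample's actual implementation: each decoder exposes start()/join() lifecycle hooks and is callable, returning the next batch or None when the input is exhausted.

```python
# Sketch of a uniform batch-decoder interface with start/join lifecycle
# hooks. Names and structure are illustrative assumptions.

class BatchDecoder:
    def start(self):      # acquire resources (files, codec sessions)
        pass

    def __call__(self):   # return the next batch, or None when done
        raise NotImplementedError

    def join(self):       # release resources; also the natural hook for
        pass              # a future multi-threaded upgrade

class ListBatchDecoder(BatchDecoder):
    """Toy decoder that serves pre-loaded items in fixed-size batches."""
    def __init__(self, items, batch_size):
        self.items, self.batch_size, self.pos = items, batch_size, 0

    def __call__(self):
        if self.pos >= len(self.items):
            return None
        batch = self.items[self.pos : self.pos + self.batch_size]
        self.pos += len(batch)
        return batch

# Any decoder implementing this interface can be driven by the same loop:
decoder = ListBatchDecoder(list(range(10)), batch_size=4)
decoder.start()
batches = []
while (batch := decoder()) is not None:
    batches.append(batch)
decoder.join()
```

Because every decoder is driven through the same three entry points, swapping the image decoder for the video decoder requires no change to the pipeline loop.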
# Now define the object that will handle pre-processing
preprocess = PreprocessorCvcuda(device_id, cvcuda_perf)

if os.path.splitext(input_path)[1] == ".jpg" or os.path.isdir(input_path):
    # Treat this as data modality of images
    decoder = ImageBatchDecoder(
        input_path,
        batch_size,
        device_id,
        cuda_ctx,
        cvcuda_stream,
        cvcuda_perf,
    )

else:
    # Treat this as data modality of videos
    decoder = VideoBatchDecoder(
        input_path,
        batch_size,
        device_id,
        cuda_ctx,
        cvcuda_stream,
        cvcuda_perf,
    )

# Define the post-processor
postprocess = PostprocessorCvcuda(
    "NCHW",
    device_id,
    cvcuda_perf,
)

# Setup the classification models.
if backend == "pytorch":
    inference = ClassificationPyTorch(
        output_dir,
        batch_size,
        image_size,
        device_id,
        cvcuda_perf,
    )
elif backend == "tensorrt":
    inference = ClassificationTensorRT(
        output_dir,
        batch_size,
        image_size,
        device_id,
        cvcuda_perf,
    )
else:
    raise ValueError("Unknown backend: %s" % backend)
With all of these components initialized, the overall data flow per data batch looks like the following:
Decode batch -> Preprocess Batch -> Run Inference -> Post Process Batch -> Encode batch
# Define and execute the processing pipeline ------------
cvcuda_perf.push_range("pipeline")

# Fire up the decoder
decoder.start()

# Loop through all input frames
batch_idx = 0
while True:
    cvcuda_perf.push_range("batch", batch_idx=batch_idx)

    with cvcuda_stream, torch.cuda.stream(torch_stream):
        # Stage 1: decode
        batch = decoder()
        if batch is None:
            cvcuda_perf.pop_range(total_items=0)  # for batch
            break  # No more frames to decode
        assert batch_idx == batch.batch_idx

        logger.info("Processing batch %d" % batch_idx)

        # Stage 2: pre-processing
        orig_tensor, resized_tensor, normalized_tensor = preprocess(
            batch.data,
            out_size=image_size,
        )

        # Stage 3: inference
        probabilities = inference(normalized_tensor)

        # Stage 4: post-processing
        postprocess(
            probabilities,
            top_n=5,
            labels=inference.labels,
        )

        batch_idx += 1

    cvcuda_perf.pop_range(total_items=batch.data.shape[0])  # for batch

cvcuda_perf.pop_range()  # for pipeline

cuda_ctx.pop()
That’s it for the classification sample. To understand more about how each stage in the pipeline works, please explore the following sections:
Running the Sample
This sample can be invoked without any command-line arguments like the following, in which case it will use the default values: TensorRT as the inference engine, tabby_tiger_cat.jpg as the input image and a batch size of 4. Upon the first run, generating the serialized TensorRT model for a given batch size may take some time.
python3 classification/python/main.py
To run it on a single image with batch size 4 with the default backend of TensorRT:
python3 classification/python/main.py -i assets/images/tabby_tiger_cat.jpg -b 4
To run it on a folder worth of images with batch size 4 using the PyTorch backend instead of TensorRT:
python3 classification/python/main.py -i assets/images/ -b 4 -bk pytorch
To run on a single video with batch size of 1:
python3 classification/python/main.py -i assets/videos/pexels-ilimdar-avgezer-7081456.mp4 -b 1
Understanding the Output
This sample takes one or more images or a single video as input and generates the classification probabilities for the top 5 classes. Since this sample works on batches, the number of images read may not be a perfect multiple of the batch size; in such cases, the last batch may have a smaller batch size. If the batch size is set to anything above 1 in the single-image case, the same image is fed to the entire batch and identical classification results are generated for all of them.
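How a non-multiple input splits into batches can be worked out with a one-line helper. This is purely illustrative arithmetic, not code from the sample:

```python
# Split num_items into batches of batch_size; the last batch is simply
# smaller when the counts don't divide evenly.
def batch_sizes(num_items, batch_size):
    full, rem = divmod(num_items, batch_size)
    return [batch_size] * full + ([rem] if rem else [])

print(batch_sizes(10, 4))  # -> [4, 4, 2]: ten images at batch size 4
```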
The top 5 classification results for the tabby_tiger_cat.jpg image are as follows:
user@machine:~/cvcuda/samples$ python3 classification/python/main.py -b 1
[perf_utils:85] 2023-07-27 22:27:17 WARNING perf_utils is used without benchmark.py. Benchmarking mode is turned off.
[perf_utils:89] 2023-07-27 22:27:17 INFO Using CV-CUDA version: 0.6.0-beta
[pipelines:35] 2023-07-27 22:27:17 INFO Using CVCUDA as preprocessor.
[torch_utils:77] 2023-07-27 22:27:17 INFO Using torchnvjpeg as decoder.
[pipelines:122] 2023-07-27 22:27:17 INFO Using CVCUDA as post-processor.
[model_inference:230] 2023-07-27 22:27:18 INFO Using TensorRT as the inference engine.
[classification:161] 2023-07-27 22:27:18 INFO Processing batch 0
[pipelines:144] 2023-07-27 22:27:18 INFO Classification probabilities for the image: 1 of 1
[pipelines:151] 2023-07-27 22:27:18 INFO tiger cat: 20.844%
[pipelines:151] 2023-07-27 22:27:18 INFO tabby: 18.831%
[pipelines:151] 2023-07-27 22:27:18 INFO Egyptian cat: 4.073%
[pipelines:151] 2023-07-27 22:27:18 INFO lynx: 0.276%
[pipelines:151] 2023-07-27 22:27:18 INFO Persian cat: 0.228%