Semantic Segmentation Post-processing Pipeline using CVCUDA

CVCUDA accelerates the post-processing pipeline of the semantic segmentation sample by running each step on the GPU. Its interoperability with PyTorch tensors also makes it easy to integrate with PyTorch and with other data loaders that support the same tensor layout.

The exact post-processing operations are:

Create Binary mask -> Upscale the mask -> Blur the input frames -> Joint Bilateral filter to smooth the mask -> Overlay the masks onto the original frame

Since the network outputs class probabilities (0-1) for every class it supports, we must first extract the class of interest and scale its values into the uint8 (0-255) range. These operations are done with PyTorch math, and the resulting tensor is then converted to a CVCUDA tensor.

# We assume that everything other than probabilities is a CVCUDA tensor.
# probabilities has to be a torch tensor because we need to perform a few
# math operations on it. Even if the TensorRT backend was used to run inference,
# it would have generated its output as a torch.Tensor.

actual_batch_size = resized_tensor.shape[0]

class_probs = probabilities[:actual_batch_size, class_index, :, :]
class_probs = torch.unsqueeze(class_probs, dim=-1)
class_probs *= 255
class_probs = class_probs.type(torch.uint8)

cvcuda_class_masks = cvcuda.as_tensor(class_probs.cuda(), "NHWC")
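
The slicing and scaling above can be reproduced on the CPU to check shapes and value ranges. The following is a minimal NumPy sketch, not the sample's code: the function name `class_mask_uint8` and the random input are hypothetical, and NumPy stands in for PyTorch so that the example runs without a GPU.

```python
import numpy as np


def class_mask_uint8(probabilities, class_index):
    """Turn (N, C, H, W) probabilities in [0, 1] into a (N, H, W, 1) uint8 mask."""
    class_probs = probabilities[:, class_index, :, :]  # (N, H, W)
    class_probs = class_probs[..., np.newaxis]         # (N, H, W, 1), i.e. NHWC
    # Multiply out of place so the caller's array is left untouched.
    # The NumPy cast to uint8 truncates the fractional part.
    return (class_probs * 255).astype(np.uint8)


# Hypothetical input: a batch of 2 images, 3 classes, 4x5 pixels.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 3, 4, 5))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

mask = class_mask_uint8(probs, class_index=1)
print(mask.shape, mask.dtype)
```

The trailing channel axis of size 1 is what allows the result to be described with the "NHWC" layout string.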

The remaining pipeline code is easy to follow.

# Upscale the resulting masks to the full resolution of the input image.
self.cvcuda_perf.push_range("resize")
cvcuda_class_masks_upscaled = cvcuda.resize(
    cvcuda_class_masks,
    (frame_nhwc.shape[0], frame_nhwc.shape[1], frame_nhwc.shape[2], 1),
    cvcuda.Interp.NEAREST,
)
self.cvcuda_perf.pop_range()
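
Nearest-neighbour interpolation copies existing mask values instead of blending them, so the upscaled mask contains no values that were absent from the low-resolution one. The sketch below illustrates the idea in NumPy. The helper `upscale_nearest` is hypothetical and uses a simple floor-based index mapping, which is an assumption: the exact sampling convention of `cvcuda.resize` may differ.

```python
import numpy as np


def upscale_nearest(masks, out_h, out_w):
    """Resize (N, H, W, C) masks to (N, out_h, out_w, C) by index lookup."""
    n, h, w, c = masks.shape
    # Map every output row/column back to a source row/column.
    rows = np.minimum(np.arange(out_h) * h // out_h, h - 1)
    cols = np.minimum(np.arange(out_w) * w // out_w, w - 1)
    return masks[:, rows[:, None], cols[None, :], :]


# Hypothetical 2x2 mask upscaled to 4x4.
small = np.array([[0, 255], [255, 0]], dtype=np.uint8).reshape(1, 2, 2, 1)
big = upscale_nearest(small, 4, 4)
print(big[0, :, :, 0])
```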

# Blur the down-scaled input images and upscale them back to their original resolution.
# A part of this will be used to create a background blur effect later when the
# overlay happens.
# Note: We apply blur on the low-res version of the images to save computation time.
self.cvcuda_perf.push_range("gaussian")
cvcuda_blurred_input_imgs = cvcuda.gaussian(
    resized_tensor, kernel_size=(15, 15), sigma=(5, 5)
)
self.cvcuda_perf.pop_range()

self.cvcuda_perf.push_range("resize")
cvcuda_blurred_input_imgs = cvcuda.resize(
    cvcuda_blurred_input_imgs,
    (frame_nhwc.shape[0], frame_nhwc.shape[1], frame_nhwc.shape[2], 3),
    cvcuda.Interp.LINEAR,
)
self.cvcuda_perf.pop_range()

# Next, we apply a joint bilateral filter on the up-scaled masks, using the grayscale
# version of the input image as guidance, to smooth out the edges of the masks. This is
# needed because the mask was generated at a lower resolution and then up-scaled, which
# leaves blocky edges. The joint bilateral filter smooths them along the image contours.
cvcuda_frame_nhwc = cvcuda.as_tensor(frame_nhwc.cuda(), "NHWC")

self.cvcuda_perf.push_range("cvtcolor")
cvcuda_image_tensor_nhwc_gray = cvcuda.cvtcolor(
    cvcuda_frame_nhwc, cvcuda.ColorConversion.RGB2GRAY
)
self.cvcuda_perf.pop_range()

self.cvcuda_perf.push_range("joint_bilateral_filter")
cvcuda_jb_masks = cvcuda.joint_bilateral_filter(
    cvcuda_class_masks_upscaled,
    cvcuda_image_tensor_nhwc_gray,
    diameter=5,
    sigma_color=50,
    sigma_space=1,
)
self.cvcuda_perf.pop_range()
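
A joint bilateral filter averages the mask over a small window, but takes the range weights from the guidance image rather than from the mask itself, so smoothing stops at edges in the guidance image. The following is a minimal single-image NumPy sketch of that idea. The function `joint_bilateral`, the edge padding and the radius of `diameter // 2` are assumptions made for illustration, and the CVCUDA operator may differ in these details.

```python
import numpy as np


def joint_bilateral(src, guide, diameter, sigma_color, sigma_space):
    """Filter 2-D `src` using range weights computed from 2-D `guide`."""
    radius = diameter // 2
    h, w = src.shape
    src_p = np.pad(src.astype(np.float64), radius, mode="edge")
    guide_p = np.pad(guide.astype(np.float64), radius, mode="edge")
    guide_c = guide_p[radius:radius + h, radius:radius + w]  # window centres

    acc = np.zeros((h, w))
    norm = np.zeros((h, w))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ys = slice(radius + dy, radius + dy + h)
            xs = slice(radius + dx, radius + dx + w)
            # Spatial weight: depends only on the distance to the centre pixel.
            spatial = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_space ** 2))
            # Range weight: depends on the difference in the guidance image.
            diff = guide_p[ys, xs] - guide_c
            weight = spatial * np.exp(-(diff ** 2) / (2.0 * sigma_color ** 2))
            acc += weight * src_p[ys, xs]
            norm += weight
    return acc / norm  # norm >= 1 because the centre weight is always 1


# Hypothetical 8x8 mask with a vertical edge and a guide with the same edge.
step_mask = np.zeros((8, 8))
step_mask[:, 4:] = 255.0
edge_guide = np.zeros((8, 8))
edge_guide[:, 4:] = 200.0

kept = joint_bilateral(step_mask, edge_guide, 5, 50, 1)            # edge preserved
blurred = joint_bilateral(step_mask, np.zeros((8, 8)), 5, 50, 1)   # edge smeared
```

With a flat guide the filter reduces to a plain Gaussian blur, which is why the guidance image matters for keeping the mask aligned with object boundaries.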

# Create an overlay image. We do this by selectively blurring out pixels
# in the input image where the class mask prediction was absent (i.e. False).
# We already have everything required for this: the input images,
# the blurred version of the input images and the upscaled version
# of the mask.
self.cvcuda_perf.push_range("composite")
cvcuda_composite_imgs_nhwc = cvcuda.composite(
    cvcuda_frame_nhwc,
    cvcuda_blurred_input_imgs,
    cvcuda_jb_masks,
    3,
)
self.cvcuda_perf.pop_range()
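
Compositing with a mask is commonly written as a per-pixel alpha blend, out = fg * a + bg * (1 - a), with a = mask / 255. The NumPy sketch below shows that blend under the assumption that the CVCUDA operator follows the same formula. The function `composite_reference` and its rounding behaviour are illustrative choices and are not taken from the library.

```python
import numpy as np


def composite_reference(foreground, background, mask):
    """Blend uint8 (N, H, W, 3) images using a uint8 (N, H, W, 1) mask."""
    alpha = mask.astype(np.float32) / 255.0          # broadcasts over channels
    out = foreground * alpha + background * (1.0 - alpha)
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


# Hypothetical 1x1x3 batch: sharp frame = 200, blurred frame = 50.
fg = np.full((1, 1, 3, 3), 200, dtype=np.uint8)
bg = np.full((1, 1, 3, 3), 50, dtype=np.uint8)
m = np.array([255, 128, 0], dtype=np.uint8).reshape(1, 1, 3, 1)

blend = composite_reference(fg, bg, m)
print(blend[0, 0, :, 0])
```

Where the mask is 255 the original frame shows through, where it is 0 the blurred frame is used, and intermediate mask values produced by the smoothing step give a soft transition between the two.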

# Based on the output requirements, we return appropriate tensors.
if self.output_layout == "NCHW":
    cvcuda_composite_imgs_out = cvcuda.reformat(
        cvcuda_composite_imgs_nhwc, "NCHW"
    )
else:
    assert self.output_layout == "NHWC"
    cvcuda_composite_imgs_out = cvcuda_composite_imgs_nhwc

if self.gpu_output:
    if self.torch_output:
        cvcuda_composite_imgs_out = torch.as_tensor(
            cvcuda_composite_imgs_out.cuda(), device="cuda:%d" % self.device_id
        )
else:
    cvcuda_composite_imgs_out = (
        torch.as_tensor(cvcuda_composite_imgs_out.cuda()).cpu().numpy()
    )

self.cvcuda_perf.pop_range()  # postprocess
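
The layout switch at the end only reorders axes: NHWC keeps the channels last, while NCHW moves them directly after the batch axis. A minimal NumPy sketch of the same reordering is shown below. The helper `to_layout` is hypothetical and only mirrors the branch on the layout string.

```python
import numpy as np


def to_layout(batch_nhwc, layout):
    """Return the batch in the requested layout, starting from NHWC."""
    if layout == "NCHW":
        # (N, H, W, C) -> (N, C, H, W), copied into contiguous memory.
        return np.ascontiguousarray(batch_nhwc.transpose(0, 3, 1, 2))
    if layout == "NHWC":
        return batch_nhwc
    raise ValueError("unsupported layout: %s" % layout)


# Hypothetical batch of 2 RGB images of 4x5 pixels.
batch = np.arange(2 * 4 * 5 * 3, dtype=np.uint8).reshape(2, 4, 5, 3)
batch_nchw = to_layout(batch, "NCHW")
print(batch_nchw.shape)
```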