Semantic Segmentation Inference Using TensorRT
The semantic segmentation sample in CVCUDA uses the fcn_resnet101 deep learning model from the torchvision library. Since the model does not come with a softmax layer at the end, we add one. The following code snippet shows how the model is set up for the inference use case with TensorRT.
TensorRT requires a serialized TensorRT engine to run the inference. One can generate such an engine by first converting an existing PyTorch model to ONNX and then converting the ONNX model to a TensorRT engine. A serialized TensorRT engine is only valid for the specific GPU and the maximum batch size it was given at creation time. Since ONNX and TensorRT model generation is a time-consuming operation, we avoid doing it every time by first checking whether those files already exist (most likely from a previous run of this sample). If so, we simply use the existing models rather than generating new ones.
Finally, we take care of setting up the I/O bindings. We allocate the output tensors in advance for TensorRT. Helper methods such as convert_onnx_to_tensorrt and setup_tensort_bindings are defined in the helper script samples/common/python/trt_utils.py.
class SegmentationTensorRT:
    def __init__(
        self,
        output_dir,
        seg_class_name,
        batch_size,
        image_size,
        device_id,
        cvcuda_perf,
    ):
        self.logger = logging.getLogger(__name__)
        self.output_dir = output_dir
        self.device_id = device_id
        self.cvcuda_perf = cvcuda_perf
        # For TensorRT, the process is the following:
        # We check if there already exists a TensorRT engine generated
        # previously. If not, we check if there exists an ONNX model generated
        # previously. If not, we will generate both of them one by one
        # and then use those.
        # The underlying PyTorch model that we use in case of TensorRT
        # inference is the FCN model from torchvision. It is only used during
        # the conversion process and not during the inference.
        onnx_file_path = os.path.join(
            self.output_dir,
            "model.%d.%d.%d.onnx"
            % (
                batch_size,
                image_size[1],
                image_size[0],
            ),
        )
        trt_engine_file_path = os.path.join(
            self.output_dir,
            "model.%d.%d.%d.trtmodel"
            % (
                batch_size,
                image_size[1],
                image_size[0],
            ),
        )

        with torch.cuda.stream(
            torch.cuda.ExternalStream(cvcuda.Stream.current.handle)
        ):
            torch_model = segmentation_models.fcn_resnet101
            weights = segmentation_models.FCN_ResNet101_Weights.DEFAULT

            try:
                self.class_index = weights.meta["categories"].index(seg_class_name)
            except ValueError:
                raise ValueError(
                    "Requested segmentation class '%s' is not supported by the "
                    "fcn_resnet101 model. All supported class names are: %s"
                    % (seg_class_name, ", ".join(weights.meta["categories"]))
                )

            # Check if we have a previously generated model.
            if not os.path.isfile(trt_engine_file_path):
                if not os.path.isfile(onnx_file_path):
                    # First we use PyTorch to create a segmentation model.
                    with torch.no_grad():
                        fcn_base = torch_model(weights=weights)

                        class FCN_Softmax(torch.nn.Module):
                            def __init__(self, fcn):
                                super(FCN_Softmax, self).__init__()
                                self.fcn = fcn

                            def forward(self, x):
                                infer_output = self.fcn(x)["out"]
                                return torch.nn.functional.softmax(infer_output, dim=1)

                        fcn_base.eval()
                        pyt_model = FCN_Softmax(fcn_base)
                        pyt_model.cuda(self.device_id)
                        pyt_model.eval()

                        # Allocate a dummy input to help generate an ONNX model.
                        dummy_x_in = torch.randn(
                            batch_size,
                            3,
                            image_size[1],
                            image_size[0],
                            requires_grad=False,
                        ).cuda(self.device_id)

                        # Generate an ONNX model using PyTorch's ONNX export.
                        torch.onnx.export(
                            pyt_model,
                            args=dummy_x_in,
                            f=onnx_file_path,
                            export_params=True,
                            opset_version=15,
                            do_constant_folding=True,
                            input_names=["input"],
                            output_names=["output"],
                            dynamic_axes={
                                "input": {0: "batch_size"},
                                "output": {0: "batch_size"},
                            },
                        )

                        # Remove the tensors and model after this.
                        del pyt_model
                        del dummy_x_in
                        torch.cuda.empty_cache()

                # Now that we have an ONNX model, we will continue generating a
                # serialized TensorRT engine from it.
                convert_onnx_to_tensorrt(
                    onnx_file_path,
                    trt_engine_file_path,
                    max_batch_size=batch_size,
                    max_workspace_size=1,
                )

            # Once the TensorRT engine generation is all done, we load it.
            trt_logger = trt.Logger(trt.Logger.ERROR)
            with open(trt_engine_file_path, "rb") as f, trt.Runtime(
                trt_logger
            ) as runtime:
                trt_model = runtime.deserialize_cuda_engine(f.read())

            # Create execution context.
            self.model = trt_model.create_execution_context()

            # Allocate the output bindings.
            self.output_tensors, self.output_idx = setup_tensort_bindings(
                trt_model,
                batch_size,
                self.device_id,
                self.logger,
            )

            self.logger.info("Using TensorRT as the inference engine.")
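The convert_onnx_to_tensorrt helper called above turns the ONNX file into a serialized engine; its actual implementation lives in samples/common/python/trt_utils.py. As a rough illustration only, a minimal version of such a conversion with the TensorRT Python API could look like the sketch below. The signature mirrors the call in __init__, but everything else here is an assumption rather than the sample's code.

import tensorrt as trt


def convert_onnx_to_tensorrt(
    onnx_file_path, trt_engine_file_path, max_batch_size, max_workspace_size
):
    # Hypothetical sketch only: parse the ONNX file and build a serialized engine.
    trt_logger = trt.Logger(trt.Logger.ERROR)
    builder = trt.Builder(trt_logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, trt_logger)

    with open(onnx_file_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            return False

    config = builder.create_builder_config()
    # The workspace limit is taken to be in GiB in this sketch.
    config.set_memory_pool_limit(
        trt.MemoryPoolType.WORKSPACE, max_workspace_size * (1 << 30)
    )

    # The ONNX model was exported with a dynamic batch axis, so an optimization
    # profile must state the minimum, optimal, and maximum input shapes.
    model_input = network.get_input(0)
    c, h, w = model_input.shape[1], model_input.shape[2], model_input.shape[3]
    profile = builder.create_optimization_profile()
    profile.set_shape(
        model_input.name,
        (1, c, h, w),
        (max_batch_size, c, h, w),
        (max_batch_size, c, h, w),
    )
    config.add_optimization_profile(profile)

    serialized_engine = builder.build_serialized_network(network, config)
    if serialized_engine is None:
        return False

    with open(trt_engine_file_path, "wb") as f:
        f.write(serialized_engine)
    return True

Because the ONNX model carries a dynamic batch axis, it is the optimization profile that ties the resulting engine to the maximum batch size mentioned earlier.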
To run the inference, the __call__ method is used. It sets up the correct I/O bindings and makes sure the forward inference pass runs on the current CUDA stream. When passing the inputs, we pass the data from the CVCUDA tensor directly, without further conversions. The API to do so involves accessing the tensor's __cuda_array_interface__ member, as shown in the code below.
def __call__(self, tensor):
    self.cvcuda_perf.push_range("inference.tensorrt")

    # Grab the data directly from the pre-allocated tensor.
    input_bindings = [tensor.cuda().__cuda_array_interface__["data"][0]]
    output_bindings = []
    for t in self.output_tensors:
        output_bindings.append(t.data_ptr())
    io_bindings = input_bindings + output_bindings

    # Must set the dynamic input binding's shape before running the inference.
    binding_i = self.model.engine.get_binding_index("input")
    assert self.model.set_binding_shape(binding_i, tensor.shape)

    self.model.execute_async_v2(
        bindings=io_bindings, stream_handle=cvcuda.Stream.current.handle
    )

    segmented = self.output_tensors[self.output_idx]

    self.cvcuda_perf.pop_range()
    return segmented
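The output_tensors and output_idx used above come from the setup_tensort_bindings helper called in __init__, which pre-allocates one device buffer per output binding of the engine and records which binding holds the result we care about. Its actual implementation is in samples/common/python/trt_utils.py; the version below is only a minimal sketch, assuming PyTorch tensors are used as the output buffers and that the outputs (softmax scores) are float32.

import torch


def setup_tensort_bindings(trt_engine, batch_size, device_id, logger):
    # Hypothetical sketch only: allocate one PyTorch tensor per output binding.
    output_tensors = []
    output_idx = 0

    for i in range(trt_engine.num_bindings):
        # Input bindings are bound directly to the CVCUDA tensor's memory in
        # __call__, so only the outputs need pre-allocated buffers here.
        if trt_engine.binding_is_input(i):
            continue

        # Resolve the output shape, filling in the dynamic batch dimension.
        shape = list(trt_engine.get_binding_shape(i))
        if shape[0] == -1:
            shape[0] = batch_size

        logger.info(
            "Allocating output binding '%s' with shape %s"
            % (trt_engine.get_binding_name(i), shape)
        )
        # The exported model ends in a softmax, so float32 outputs are assumed.
        output_tensors.append(
            torch.zeros(shape, dtype=torch.float32, device="cuda:%d" % device_id)
        )

    # With a single output binding, index 0 holds the segmentation scores.
    return output_tensors, output_idx

Pre-allocating the outputs once in the constructor keeps their device pointers stable, so __call__ can rebuild the same bindings list on every batch without allocating any memory in the inference path.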