OnQ Blog

Developers/Robotics: Using image segmentation to drive the Turtlebot 3 Burger

Aug 20, 2020

Qualcomm products mentioned within this post are offered by Qualcomm Technologies, Inc. and/or its subsidiaries.

Image segmentation is a process of subdividing an image into constituent parts or objects for detailed analysis. The resulting high-level information is useful in computer vision applications, like the self-driving robot described in the QDN learning resource Qualcomm Robotics RB3 Drives Turtlebot 3 Burger.

Types of image segmentation

The goal of semantic segmentation is to determine whether each pixel in an image belongs to a given object or not. Semantic segmentation is commonly used in self-driving cars for obtaining the position of roads, cars and other obstacles. It is also used in camera applications to differentiate the pixels in the foreground from those in the background of an image. From there, a blurring or monochrome effect can be applied to background pixels, as shown below.

Instance segmentation, on the other hand, identifies each instance of an object at the pixel level. Rather than simply marking whether a pixel belongs to a particular object class, it distinguishes which instance of that class the pixel belongs to. An example is the multiple instances of cubes in the image below.

This article focuses on semantic segmentation, a computer vision problem that can be solved by algorithms like image thresholding, k-means clustering and deep learning architectures. The deep learning techniques of downsampling and upsampling are especially useful.


A downsampling path consists of multiple blocks of convolution and max pooling operations to extract feature maps.

A downsampling block consists of convolution operations with a set of filters, followed by a max pooling operation with a kernel of shape 2×2. The number of filters for the convolution operations doubles after each downsampling block. (See the QDN learning resource Deep Learning and Convolutional Neural Networks for Computer Vision for an overview of terms used.)
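As a quick sanity check on that block structure, the shape progression can be traced in plain Python. This is my own sketch, not from the original post; it assumes SAME-padded convolutions (so the convolutions preserve width and height), 2×2 max pooling with stride 2, and 32 filters in the first block — the filter count used later in this post.

```python
def downsample_shapes(w, h, filters, blocks):
    """Trace (width, height, channels) through successive downsampling blocks."""
    shapes = []
    for _ in range(blocks):
        shapes.append((w, h, filters))  # after the block's convolutions
        w, h = w // 2, h // 2           # after 2x2 max pooling with stride 2
        filters *= 2                    # filter count doubles for the next block
    return shapes

# A 320x240 input with 32 filters in the first block, traced over 3 blocks:
print(downsample_shapes(320, 240, 32, 3))
# [(320, 240, 32), (160, 120, 64), (80, 60, 128)]
```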

A representation of the convolution operation using a 3×3 kernel with SAME padding.


At the first block of the downsampling path, the input image is fed into the convolution layer. Each filter is then run across the image to produce a corresponding feature map. For a kernel of size 3×3, the convolution operation takes the dot product between the weights of the filter and the corresponding pixel values in the input image. After the convolution operation, the width and height are reduced if VALID padding is applied, and the depth increases to the number of filters applied.
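The dot-product view of convolution can be made concrete with a minimal NumPy sketch (single channel, single filter, VALID padding; a naive loop for illustration — a real pipeline would use a framework op such as TensorFlow's tf.nn.conv2d):

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over the image; each output pixel is the dot
    # product of the kernel weights and the image patch beneath it.
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1), dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.ones((5, 5), dtype=np.float32)
k3 = np.ones((3, 3), dtype=np.float32)
print(conv2d_valid(img, k3).shape)  # (3, 3) -- width and height shrink by k - 1
```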

We can find the height or width after the convolution operation by applying the following formula:

Nout = ((Nin + 2p − k) / s) + 1

where:

Nout = size after convolution,
Nin = height or width of the input matrix,
p = padding,
k = kernel size and
s = stride.

So, we calculate the height and width of the feature map obtained after the convolution operations in the first block. Our input image has a resolution of 320×240×3, and we know p = 0 (VALID padding), k = 3 and s = 1, so each 3×3 convolution reduces the width and height by 2. If the number of filters equals 32, then after the block's two convolution operations the output will have dimensions 316×236×32. If the padding is SAME, it will be 320×240×32.
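That arithmetic can be checked with a small helper (my own sketch, not from the original post):

```python
def conv_out(n_in, p, k, s):
    # Nout = floor((Nin + 2p - k) / s) + 1
    return (n_in + 2 * p - k) // s + 1

# VALID padding (p = 0), 3x3 kernel, stride 1, applied twice in the block:
w = conv_out(conv_out(320, 0, 3, 1), 0, 3, 1)  # 320 -> 318 -> 316
h = conv_out(conv_out(240, 0, 3, 1), 0, 3, 1)  # 240 -> 238 -> 236
print(w, h)  # 316 236

# The same formula covers 2x2 max pooling with stride 2:
print(conv_out(320, 0, 2, 2))  # 160
```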

Max pooling

Next, a max pooling operation is used to extract the most important features from the output of the convolution operation.

Thus, to downsample the feature map, a 2×2 window runs across it with stride = 2, taking the maximum value in each window to form a pooled feature map. For an input of shape 320×240×32, 2×2 max pooling with stride 2 yields an output of shape 160×120×32. A downsampling block can be defined with any number of filters for its convolution operations. In each block, the output of the convolution operations is saved as a variable for concatenation in the later stages of upsampling.
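The pooling step can be sketched in a few lines of NumPy (shapes are illustrative; a real pipeline would use a framework op such as TensorFlow's tf.nn.max_pool):

```python
import numpy as np

def max_pool_2x2(x):
    # x: (H, W, C) feature map; non-overlapping 2x2 windows, stride 2.
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

fmap = np.arange(16, dtype=np.float32).reshape(4, 4, 1)
pooled = max_pool_2x2(fmap)
print(pooled[..., 0])
# [[ 5.  7.]
#  [13. 15.]]
```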

Once downsampling is complete, the feature maps are fed to the upsampling layers, which upscale them to the resolution of the input.

TensorFlow APIs (v1.x) used in the downsampling process are as follows:


The result of downsampling is information about whether an object is present in the image or not; however, information about the object’s position is lost in the process. Upsampling takes the output from downsampling and upscales it to the same resolution as that of the input image.

An upsampling block consists of convolution operations followed by an upscaling operation using transposed convolution and then a concatenation operation.

The convolution operations are similar to the ones performed in the downsampling process above.

Transposed convolution reverses the spatial reduction of an ordinary convolution: instead of shrinking the feature map, it expands it. Here, the transposed convolution operation uses the same number of filters as the convolution operations of the last downsampling block, with a filter of shape 2×2. The height and width then change according to the shape of the filter and the stride. Because upsampling must restore the size of the corresponding downsampling block, the output of the transposed convolution should be double the size of its input.
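For the case used here — a 2×2 filter with stride 2 — each input pixel contributes a 2×2 patch to the output, the patches do not overlap, and the spatial size exactly doubles. A minimal single-channel NumPy sketch (illustrative only; TensorFlow 1.x provides this operation as tf.nn.conv2d_transpose):

```python
import numpy as np

def conv2d_transpose_2x2_s2(x, kernel):
    # x: (H, W); kernel: (2, 2); stride 2 => output (2H, 2W).
    # Each input pixel "paints" a 2x2 patch of the output with
    # pixel_value * kernel; with stride 2 the patches never overlap.
    h, w = x.shape
    out = np.zeros((2 * h, 2 * w), dtype=np.float32)
    for i in range(h):
        for j in range(w):
            out[2 * i:2 * i + 2, 2 * j:2 * j + 2] += x[i, j] * kernel
    return out

x = np.array([[1., 2.], [3., 4.]])
k = np.ones((2, 2), dtype=np.float32)
print(conv2d_transpose_2x2_s2(x, k).shape)  # (4, 4) -- spatial size doubled
```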

Better prediction comes from implementing a skip connection: concatenating the output of the convolution operations of each downsampling block with the output of the transposed convolution. This occurs at the end of the first upsampling block, and the same operations are repeated until the output shape matches the input shape.
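The concatenation itself is a merge along the channel axis. A shape-level sketch with illustrative dimensions (TensorFlow 1.x would use tf.concat for this step):

```python
import numpy as np

# Skip connection: concatenate the feature map saved during downsampling
# with the upsampled feature map of matching spatial size, channel-wise.
saved = np.zeros((160, 120, 64))      # saved output of a downsampling block
upsampled = np.zeros((160, 120, 64))  # output of the transposed convolution
merged = np.concatenate([saved, upsampled], axis=-1)
print(merged.shape)  # (160, 120, 128)
```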

Next is a convolution operation with N_CLASS (number of classes) filters of shape 3×3. After that, a softmax activation function obtains the probability of each class for each pixel. That is the result of segmentation of the input image.
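A per-pixel softmax over the class channel, followed by an argmax, yields the final segmentation map. A NumPy sketch with assumed dimensions (a 240-row by 320-column image and 3 classes, chosen for illustration):

```python
import numpy as np

N_CLASS = 3
logits = np.random.randn(240, 320, N_CLASS)  # per-pixel class scores

# Softmax over the class axis gives a probability per class for each pixel
# (subtracting the max is the usual numerical-stability trick).
e = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = e / e.sum(axis=-1, keepdims=True)

# The segmentation map assigns each pixel its most probable class.
seg_map = probs.argmax(axis=-1)
print(seg_map.shape)  # (240, 320)
```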

TensorFlow APIs (v1.x) used in the upsampling process are as follows:

A representation of transposed convolution.

Next step

Have a look at how the Qualcomm Robotics RB3 drives the Turtlebot 3 Burger.

Qualcomm Robotics RB3 is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.

Opinions expressed in the content posted here are the personal opinions of the original authors, and do not necessarily reflect those of Qualcomm Incorporated or its subsidiaries ("Qualcomm"). Qualcomm products mentioned within this post are offered by Qualcomm Technologies, Inc. and/or its subsidiaries. The content is provided for informational purposes only and is not meant to be an endorsement or representation by Qualcomm or any other party. This site may also provide links or references to non-Qualcomm sites and resources. Qualcomm makes no representations, warranties, or other commitments whatsoever about any non-Qualcomm sites or third-party resources that may be referenced, accessible from, or linked to this site.

Ramya Kanthi Polisetti

Applications Engineer, Qualcomm Technologies

©2021 Qualcomm Technologies, Inc. and/or its affiliated companies.

References to "Qualcomm" may mean Qualcomm Incorporated, or subsidiaries or business units within the Qualcomm corporate structure, as applicable.

Qualcomm Incorporated includes Qualcomm's licensing business, QTL, and the vast majority of its patent portfolio. Qualcomm Technologies, Inc., a wholly-owned subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of Qualcomm's engineering, research and development functions, and substantially all of its products and services businesses. Qualcomm products referenced on this page are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

Materials that are as of a specific date, including but not limited to press releases, presentations, blog posts and webcasts, may have been superseded by subsequent events or disclosures.

Nothing in these materials is an offer to sell any of the components or devices referenced herein.