Aug 20, 2020
Qualcomm products mentioned within this post are offered by Qualcomm Technologies, Inc. and/or its subsidiaries.
Image segmentation is a process of subdividing an image into constituent parts or objects for detailed analysis. The resulting high-level information is useful in computer vision applications, like the self-driving robot described in the QDN learning resource Qualcomm Robotics RB3 Drives Turtlebot 3 Burger.
Types of image segmentation
The goal of semantic segmentation is to determine whether each pixel in an image belongs to a given object or not. Semantic segmentation is commonly used in self-driving cars for obtaining the position of roads, cars and other obstacles. It is also used in camera applications to differentiate the pixels in the foreground from those in the background of an image. From there, a blurring or monochrome effect can be applied to background pixels, as shown below.
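The background effect described above can be sketched in a few lines. This is a minimal illustration that assumes the segmentation mask is already available; the function name `monochrome_background`, the nested-list image layout and the simple channel average used for grey are all assumptions made for the example, not part of any camera pipeline.

```python
def monochrome_background(image, mask):
    """Keep foreground pixels in colour and turn background pixels grey.

    image: rows of (r, g, b) tuples.
    mask:  rows of 0/1 values from a segmentation step, 1 = foreground.
    """
    result = []
    for image_row, mask_row in zip(image, mask):
        out_row = []
        for (r, g, b), is_foreground in zip(image_row, mask_row):
            if is_foreground:
                out_row.append((r, g, b))
            else:
                # Plain channel average; an illustrative choice of grey conversion.
                grey = (r + g + b) // 3
                out_row.append((grey, grey, grey))
        result.append(out_row)
    return result


image = [[(255, 0, 0), (0, 90, 30)],
         [(10, 20, 30), (200, 200, 50)]]
mask = [[1, 0],
        [0, 1]]
styled = monochrome_background(image, mask)
# styled == [[(255, 0, 0), (40, 40, 40)], [(20, 20, 20), (200, 200, 50)]]
```

The mask is the only thing the effect needs from segmentation, which is why a per-pixel foreground/background decision is enough for this use case.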
Instance segmentation, on the other hand, distinguishes each individual instance of an object at the pixel level. Instead of only indicating whether a pixel belongs to a particular object class, it also identifies which instance of that class the pixel belongs to. An example is the multiple instances of cubes in the image below.
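To make the difference in output concrete, the sketch below turns a binary semantic mask into per-instance labels. It uses connected-component labeling as a stand-in: real instance segmentation relies on learned models and can separate touching objects, which this simple approach cannot. The function name `label_instances` and the 4-neighbour rule are assumptions chosen for illustration.

```python
from collections import deque


def label_instances(mask):
    """Give every 4-connected group of foreground pixels its own label."""
    height, width = len(mask), len(mask[0])
    labels = [[0] * width for _ in range(height)]
    next_label = 0
    for y in range(height):
        for x in range(width):
            if mask[y][x] == 1 and labels[y][x] == 0:
                # Found an unlabeled foreground pixel: start a new instance.
                next_label += 1
                labels[y][x] = next_label
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    neighbours = ((cy - 1, cx), (cy + 1, cx),
                                  (cy, cx - 1), (cy, cx + 1))
                    for ny, nx in neighbours:
                        inside = 0 <= ny < height and 0 <= nx < width
                        if inside and mask[ny][nx] == 1 and labels[ny][nx] == 0:
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
    return labels, next_label


semantic_mask = [[1, 1, 0, 0],
                 [1, 1, 0, 1],
                 [0, 0, 0, 1]]
instance_labels, count = label_instances(semantic_mask)
# instance_labels == [[1, 1, 0, 0], [1, 1, 0, 2], [0, 0, 0, 2]], count == 2
```

The semantic mask only says "object or not", while the labeled result says which object each pixel belongs to.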
This article focuses on semantic segmentation, a computer vision problem that can be solved by algorithms like image thresholding, k-means clustering and deep learning architectures. The deep learning techniques of downsampling and upsampling are especially useful.
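Of the algorithms listed above, image thresholding is the simplest to show. The following is a minimal sketch on a grayscale image stored as nested lists; the function name `threshold_segment` and the threshold value of 128 are assumptions for the example, and a real application would choose the threshold from the image data.

```python
def threshold_segment(gray, threshold):
    """Label each pixel 1 if it is brighter than the threshold, else 0."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in gray]


gray = [[12, 200, 190],
        [15, 30, 220]]
mask = threshold_segment(gray, 128)
# mask == [[0, 1, 1], [0, 0, 1]]
```

Thresholding looks at each pixel in isolation, which is one reason learned approaches that use surrounding context tend to handle complex scenes better.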
A downsampling path consists of multiple blocks of convolution and max pooling operations to extract feature maps.
A downsampling block consists of convolution operations by a filter, and then a max pooling operation by a kernel of shape 2×2. The number of filters for the convolution operation doubles after each block of downsampling. (See the QDN learning resource Deep Learning and Convolutional Neural Networks for Computer Vision for an overview of terms used.)
At the first block of the downsampling path, the input image is fed into the convolution layer. Each filter is then run across the image to get a corresponding feature map. For a kernel of size 3×3, the convolution operation computes, at each position, the dot product between the weights of the filter and the corresponding 3×3 patch of pixel values in the input image. After the convolution operation, the width and height are reduced if VALID padding (no padding) is used, and the depth becomes the number of filters applied.
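The sliding dot product can be sketched for a single channel and a single filter as follows. This is an illustrative pure-Python version with VALID padding and stride 1; the function name `conv2d_valid` and the sample kernel are assumptions, and like most deep learning frameworks it does not flip the kernel.

```python
def conv2d_valid(image, kernel):
    """Slide a square kernel over a 2-D image with no padding and stride 1."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            # Dot product between the kernel and the k x k patch at (y, x).
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[y + i][x + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [1, 0, 1, 3]]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]
feature_map = conv2d_valid(image, kernel)
# feature_map == [[-6, 12], [-4, 7]]
```

A 4×4 input and a 3×3 kernel give a 2×2 feature map, which shows the size reduction that VALID padding causes.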
We can find the height or width after the convolution operation by applying the following formula:
Nout = ⌊(Nin + 2p − k) / s⌋ + 1, where
Nout = size after convolution,
Nin = height or width of input matrix,
p = padding,
k = kernel size and
s = stride.
So, we calculate the height and width of the feature map obtained after the convolution operations in the first block. Our input image is of resolution 320×240×3. We also know p = 0 (VALID padding), k = 3 (the 3×3 kernel) and s = 1. With 32 filters, a single convolution gives an output of dimension 318×238×32, and the 316×236×32 result corresponds to two successive 3×3 convolutions in the block. If padding is SAME, the output remains 320×240×32.
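The arithmetic can be checked with a small helper that applies the standard output-size relation Nout = floor((Nin + 2p - k) / s) + 1. The function name `conv_output_size` is an assumption, and the example assumes a 3×3 kernel with stride 1 applied twice in the block, where SAME padding for that kernel corresponds to p = 1.

```python
def conv_output_size(n_in, k, p=0, s=1):
    """Height or width after a convolution: floor((n_in + 2p - k) / s) + 1."""
    return (n_in + 2 * p - k) // s + 1


width, height = 320, 240

# VALID padding (p = 0), 3x3 kernel, stride 1.
after_first = (conv_output_size(width, k=3), conv_output_size(height, k=3))
after_second = (conv_output_size(after_first[0], k=3),
                conv_output_size(after_first[1], k=3))

# SAME padding for a 3x3 kernel with stride 1 means p = 1.
same_width = conv_output_size(width, k=3, p=1)

print(after_first)   # (318, 238)
print(after_second)  # (316, 236)
print(same_width)    # 320
```

The same relation also covers pooling: a 2×2 window with stride 2 on a width of 320 gives `conv_output_size(320, k=2, s=2)`, which is 160.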
Next, a max pooling operation is used to extract the most important feature from the output of the convolution operation.
Thus, to downsample the feature map, a 2×2 filter will run across it with stride = 2, obtaining the maximum value and forming a pooled feature map. A 2×2 max pooling will yield an output shape of 160×120×32 with stride 2, for an input shape 320×240×32. The downsampling block can be defined with any number of filters for convolution operations. In each block, the output from convolution operations is saved as a variable for concatenation in later stages of upsampling.
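A minimal sketch of 2×2 max pooling with stride 2 on a single-channel feature map is shown below. The function name `max_pool_2x2` is an assumption for illustration, and any trailing odd row or column is simply dropped.

```python
def max_pool_2x2(feature_map):
    """Take the maximum of each non-overlapping 2x2 window (stride 2)."""
    pooled = []
    for y in range(0, len(feature_map) - 1, 2):
        row = []
        for x in range(0, len(feature_map[0]) - 1, 2):
            window = (feature_map[y][x], feature_map[y][x + 1],
                      feature_map[y + 1][x], feature_map[y + 1][x + 1])
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [[1, 3, 2, 0],
               [4, 6, 5, 1],
               [7, 2, 9, 8],
               [0, 1, 3, 4]]
pooled = max_pool_2x2(feature_map)
# pooled == [[6, 5], [7, 9]]
```

Each output value summarizes four input values, so the height and width are halved while the number of channels stays the same.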
Once downsampling is complete, the feature maps are fed to the upsampling path, which upscales them back to the resolution of the input image.
TensorFlow APIs (v1.x) used in the downsampling process are as follows:
Downsampling yields information about whether an object is present in the image, but much of the information about the object's position is lost in the process. Upsampling takes the output of the downsampling path and upscales it to the same resolution as the input image.
An upsampling block consists of convolution operations followed by an upscaling operation using transposed convolution and then a concatenation operation.
The convolution operations are similar to the ones performed in the downsampling process above.
Transposed convolution works in the opposite direction to ordinary convolution: instead of reducing the height and width of a feature map, it enlarges them. Here it uses the same number of filters as the convolution operations of the last downsampling block, with a filter shape of 2×2. The resulting height and width depend on the shape of the filter and the stride. Upsampling must match the size of the corresponding downsampling block; thus, the output of the transposed convolution should be double the size of its input, which a 2×2 filter with stride 2 provides.
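The doubling can be illustrated with a single-channel sketch of a 2×2 transposed convolution with stride 2 and no padding, for which the output size follows Nout = (Nin - 1)·s - 2p + k. The function names `transposed_conv_output_size` and `conv2d_transpose_2x2`, and the sample kernel, are assumptions for the example; in a trained network the kernel weights are learned.

```python
def transposed_conv_output_size(n_in, k, s, p=0):
    """Height or width after a transposed convolution."""
    return (n_in - 1) * s - 2 * p + k


def conv2d_transpose_2x2(feature_map, kernel):
    """Each input value scales the 2x2 kernel and writes it to its own 2x2 output tile."""
    height, width = len(feature_map), len(feature_map[0])
    output = [[0] * (2 * width) for _ in range(2 * height)]
    for y in range(height):
        for x in range(width):
            for i in range(2):
                for j in range(2):
                    output[2 * y + i][2 * x + j] += feature_map[y][x] * kernel[i][j]
    return output


small = [[1, 2],
         [3, 4]]
kernel = [[1, 0],
          [0, 1]]
upscaled = conv2d_transpose_2x2(small, kernel)
# upscaled == [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 4, 0], [0, 3, 0, 4]]

print(transposed_conv_output_size(160, k=2, s=2))  # 320
```

Because the stride equals the kernel size, the output tiles do not overlap and the height and width are exactly doubled.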
Better prediction comes from implementing a skip connection, by concatenating the output of convolution operations of each downsampling block with the output of the transposed convolution. That occurs at the end of the first upsampling block and the same operations will be repeated until the output shape is the same as the input shape.
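The concatenation step of a skip connection can be sketched as joining two feature maps along the channel axis. The layout (height × width × channels as nested lists) and the function name `concat_channels` are assumptions for illustration; the check reflects the requirement that both maps have the same height and width.

```python
def concat_channels(encoder_map, decoder_map):
    """Concatenate two H x W x C feature maps along the channel axis."""
    same_height = len(encoder_map) == len(decoder_map)
    same_width = same_height and len(encoder_map[0]) == len(decoder_map[0])
    if not same_width:
        raise ValueError("height and width must match before concatenation")
    return [
        [enc_pixel + dec_pixel for enc_pixel, dec_pixel in zip(enc_row, dec_row)]
        for enc_row, dec_row in zip(encoder_map, decoder_map)
    ]


# 1 x 2 feature maps: two channels saved from the downsampling block,
# one channel produced by the transposed convolution.
encoder_map = [[[1, 2], [3, 4]]]
decoder_map = [[[9], [8]]]
merged = concat_channels(encoder_map, decoder_map)
# merged == [[[1, 2, 9], [3, 4, 8]]]
```

After concatenation each pixel carries both the upsampled features and the higher-resolution features saved during downsampling, so the channel counts add up.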
Next is a convolution operation with N_CLASS (number of classes) filters of shape 3×3. After that, a softmax activation function gives the probability of each class for each pixel. These per-pixel class probabilities are the result of segmentation of the input image.
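The final step can be sketched as a softmax over the class scores of each pixel, followed by picking the most probable class. The function names `softmax` and `predict_classes` and the sample scores are assumptions for illustration; the scores stand in for the output of the last convolution with N_CLASS = 3.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    largest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def predict_classes(score_map):
    """Return the index of the most probable class for every pixel."""
    class_map = []
    for row in score_map:
        class_row = []
        for pixel_scores in row:
            probabilities = softmax(pixel_scores)
            class_row.append(probabilities.index(max(probabilities)))
        class_map.append(class_row)
    return class_map


score_map = [[[2.0, 0.5, 0.1], [0.1, 0.2, 3.0]],
             [[0.0, 1.5, 0.3], [1.0, 1.0, 1.0]]]
class_map = predict_classes(score_map)
# class_map == [[0, 2], [1, 0]]
```

When several classes tie, as in the last pixel, this sketch picks the lowest class index; a real pipeline may handle ties differently.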
TensorFlow APIs (v1.x) used in the upsampling process are as follows: