
The Path to SelecSLS Net

A peek into how the SelecSLS (Selective Short and Long Range Skip) convolutional neural network architecture came to be. The architecture is fast, accurate, memory efficient, and prunable, and a PyTorch implementation is available.


For extending our work on real-time single-person and non-real-time multi-person 3D body pose estimation at MPII to real-time multi-person 3D pose estimation, I was investigating a whole bunch of techniques for speeding up CNN inference on GPUs.

At the time (Summer 2018) we found the existing CNN architectures inadequate for our task: either too slow for real-time inference on 512x320 images, or trading away significant accuracy for speed. After some misguided attempts at utilizing filter pruning (to be talked about in detail in some later post perhaps), and a subsequent CVPR paper on some surprising observations on implicit filter pruning in CNNs, I decided to try to design an architecture from scratch.

ESPNet as the Starting Point

The ESPNet [1] architecture was proposed at ECCV 2018 for faster inference on image segmentation tasks. It utilizes dilated convolutions with different levels of dilation, and to ameliorate the block artefacts arising from the dilated convolutions, it uses hierarchical feature summation, as shown in Fig. 1.
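To make the flat layout concrete, here is a minimal PyTorch sketch of an ESP-style block. This is my simplification for illustration, not the reference implementation; the channel counts and dilation rates are placeholders.

```python
import torch
import torch.nn as nn

class ESPStyleBlock(nn.Module):
    """Sketch of an ESP-style block: a 1x1 reduction, parallel dilated 3x3
    convolutions (flat transformation), and hierarchical feature summation
    to suppress the gridding artefacts of dilation."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 4, 8)):
        super().__init__()
        branch_ch = out_ch // len(dilations)
        self.reduce = nn.Conv2d(in_ch, branch_ch, 1, bias=False)
        self.branches = nn.ModuleList(
            nn.Conv2d(branch_ch, branch_ch, 3, padding=d, dilation=d, bias=False)
            for d in dilations)

    def forward(self, x):
        x = self.reduce(x)
        outs = [branch(x) for branch in self.branches]  # all branches see the same input
        for i in range(1, len(outs)):                   # hierarchical summation
            outs[i] = outs[i] + outs[i - 1]
        return torch.cat(outs, dim=1)
```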

My observation was that the convolution layers, instead of being laid out flat, could be interleaved with the summation layers, making the network deeper at no additional computational cost (sketched in code after Fig. 1). There is a caveat here: ESPNet would still be faster if a dataflow-aware scheduler were to parallelize the convolution layers on the GPU. We were using Caffe without any modifications, where layers run sequentially even when they could run in parallel, so the proposed hierarchical reorganization carried no speed penalty for us. What was problematic, however, was that stacking dilated convolutions hierarchically exacerbated the block artefacts they produce.

[Figure: ESPNet building block]
Fig 1: The building block of ESPNet uses hierarchical aggregation, but a flat transformation. My initial idea was to rearrange the convolution and summation layers such that the transformation and aggregation are both hierarchical, increasing the depth of the network while retaining the same computational load.
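Concretely, the rearrangement amounts to feeding each convolution the running sum rather than the shared input. Reusing the illustrative block above (again my reading of the idea, not code from either paper):

```python
import torch
import torch.nn as nn

class InterleavedESPBlock(nn.Module):
    """The same four dilated convolutions, but each one now consumes the
    running sum of everything before it, so transformation and aggregation
    are both hierarchical: the FLOP count is unchanged, while the effective
    depth grows with the number of branches."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 4, 8)):
        super().__init__()
        branch_ch = out_ch // len(dilations)
        self.reduce = nn.Conv2d(in_ch, branch_ch, 1, bias=False)
        self.branches = nn.ModuleList(
            nn.Conv2d(branch_ch, branch_ch, 3, padding=d, dilation=d, bias=False)
            for d in dilations)

    def forward(self, x):
        s = self.reduce(x)
        outs = []
        for branch in self.branches:
            s = s + branch(s)   # convolve the running sum, then fold it back in
            outs.append(s)
        return torch.cat(outs, dim=1)
```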
[Figure: Early variants of SelecSLS]
Fig 2: Mid-development variants of SelecSLS from my notes. By this point (Autumn 2018), I had tried about 15 variations on the idea and dropped dilated convolutions from consideration. You can see hints of the design of the Res2Net architecture (Fig. 3), which came out in Spring 2019.

Final Stretch and a Name Change

Thus, I went back to using regular convolutions. To reduce the computational cost, I adopted pointwise (1x1) convolutions to modulate the number of feature channels before the 3x3 convolutions, and evaluated some tens of manually designed combinations of feature transformation and aggregation. Some of the intermediate variants can be seen in Fig. 2 above. The network in the middle bears a close resemblance to the Res2Net [2] architecture, and the network on the right is approaching the final design of SelecSLS Net [3]. The final variants of SelecSLS Net (Fig. 4) were obtained from the design on the right by removing the hierarchical feature aggregation and keeping only concatenative skip connections; additional inter-module skip connectivity was also introduced. See the paper for details and evaluations of the final variants. The VoVNet architecture [4], which also came out in Spring 2019, shares some similarities with the building block of SelecSLS Net, but does not use inter-module long-range skip connectivity, and is neither as fast nor as GPU-memory efficient as SelecSLS Net.

The network architecture was called DLNAS Net till very late in the paper writing process (Winter 2018/19). The name was promptly changed when my supervisor learned that it stood for “Dushyant’s Late Night Architecture Search” :) .

[Figure: Res2Net building block]
Fig 3: The building block of Res2Net [2] shares the same philosophy as my early experimentation, using a hierarchy of aggregation and transformation.
[Figure: SelecSLS building blocks]
Fig 4: The building blocks, and some of the connectivity considerations, for SelecSLS Net [3]. Sparse, short-range concatenative skip connections within the module and long-range inter-module connectivity promote information flow through the network.
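To make Fig. 4 concrete, here is a rough PyTorch sketch of the module's design direction. The actual modules in the repository also handle strided blocks and the exact channel schedules, so treat the names and sizes here as illustrative only:

```python
import torch
import torch.nn as nn

def conv_bn(cin, cout, k):
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True))

class SelecSLSStyleModule(nn.Module):
    """1x1 convolutions modulate channel counts before the 3x3 convolutions,
    intermediate features are joined by concatenation rather than addition,
    and an optional long-range feature from an earlier module is concatenated
    in before the final 1x1 fusion."""
    def __init__(self, in_ch, mid_ch, out_ch, skip_ch=0):
        super().__init__()
        self.c1 = conv_bn(in_ch, mid_ch, 3)
        self.c2 = conv_bn(mid_ch, mid_ch, 1)  # cheap channel modulation
        self.c3 = conv_bn(mid_ch, mid_ch, 3)
        self.c4 = conv_bn(mid_ch, mid_ch, 1)
        self.c5 = conv_bn(mid_ch, mid_ch, 3)
        self.fuse = conv_bn(3 * mid_ch + skip_ch, out_ch, 1)

    def forward(self, x, long_skip=None):
        d1 = self.c1(x)
        d2 = self.c3(self.c2(d1))
        d3 = self.c5(self.c4(d2))
        feats = [d1, d2, d3] + ([long_skip] if long_skip is not None else [])
        return self.fuse(torch.cat(feats, dim=1))  # concatenative, not additive
```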

Salient Features and Performance

As shown in the paper, SelecSLS Net is about 1.3 times faster than ResNet-50 on GPUs, while retaining the same level of accuracy on various human pose estimation tasks. On CPUs, the speedup increases to 1.8 times.

The source of the improved speed is not a reduction in FLOPs, but the much smaller memory footprint of SelecSLS Net compared to ResNet-50. At a mini-batch size of 1 and small image sizes, SelecSLS Net occupies only about 70-80% of the VRAM that ResNet-50 does. With larger mini-batch sizes or larger image sizes, its memory occupancy drops to only about 50% of ResNet-50's! This means you can fit much larger mini-batches, or much larger images, into memory for training and inference.
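If you want to sanity-check numbers like these on your own hardware, a quick probe of forward latency and peak VRAM along the following lines works. The model names are those registered in timm; adjust sizes and iteration counts to taste:

```python
import torch
import timm  # model names as registered in timm, e.g. 'resnet50', 'selecsls60b'

def profile(name, size=(1, 3, 320, 512), iters=50):
    """Report mean forward latency and peak VRAM for one model."""
    model = timm.create_model(name, pretrained=False).cuda().eval()
    x = torch.randn(size, device='cuda')
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        for _ in range(10):        # warm-up
            model(x)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            model(x)
        end.record()
        torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters
    mib = torch.cuda.max_memory_allocated() / 2**20
    print(f'{name}: {ms:.1f} ms/forward, {mib:.0f} MiB peak')

for name in ('resnet50', 'selecsls60b'):
    profile(name)
```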

In addition to the faster speed and lower memory occupancy at no cost in accuracy, a key advantage is that, because only concatenative skip connections are used, feature pruning in SelecSLS Net is not as fussy as in ResNets. ResNets, due to their additive skip connectivity, require careful machinations [5] to achieve usable feature pruning. The reference implementation includes a pruning example.
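The gist of such a selection step, assuming BatchNorm scales as the sparsity signal (see the repository for the exact criterion used there):

```python
import torch

def prunable_channels(bn, thresh=1e-3):
    """Indices of feature channels whose BatchNorm scale has collapsed toward
    zero. With purely concatenative skips these can be dropped layer by layer;
    an additive residual would instead force coordinated pruning across both
    of the summed paths."""
    return (bn.weight.detach().abs() < thresh).nonzero().flatten()
```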

Show Me The Code!

A PyTorch reference implementation with ImageNet weights is available at github.com/mehtadushy/SelecSLS-Pytorch, and the architecture is also included in Ross Wightman's excellent collection of PyTorch image models (timm). If you port it to other frameworks, I can link to your repository. The repository also contains code for pruning SelecSLS Net, which, as an example, prunes based on the implicit sparsity emerging in CNNs.
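For instance, via timm (check timm.list_models('selecsls*') for the variants currently registered):

```python
import torch
import timm

# Load a pretrained SelecSLS variant by its timm name.
model = timm.create_model('selecsls60b', pretrained=True).eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```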

References

  1. ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation
  2. Res2Net: A New Multi-scale Backbone Architecture
  3. XNect: Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera
  4. An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection
  5. Gate Decorator: Global Filter Pruning Method for Accelerating Deep Convolutional Neural Networks