
Abstract

Video action recognition is a complex task dependent on modeling spatial and temporal context. Standard approaches rely on 2D or 3D convolutions to process such context, resulting in expensive operations with millions of parameters. Recent efficient architectures leverage a channel-wise shift-based primitive as a replacement for temporal convolutions, but remain bottlenecked by a fixed-shift scheme and by the spatial convolution operations they need to maintain strong accuracy. Naively extending such developments to a 3D setting is a difficult, intractable goal.

To this end, we introduce RubiksNet, a new efficient architecture for video action recognition based on a proposed learnable 3D spatiotemporal shift operation (RubiksShift). We analyze the suitability of our new primitive for video action recognition and explore several novel variations of our approach to enable stronger representational flexibility while maintaining an efficient design. We benchmark our approach on several standard video recognition datasets, and observe that our method achieves comparable or better accuracy than prior work on efficient video action recognition at a fraction of the computational cost, with 2.9-5.9x fewer parameters and 2.1-3.7x fewer FLOPs. We also perform a series of controlled ablation studies to verify that the significant boost in the efficiency-accuracy tradeoff curve is rooted in the core contributions of our RubiksNet architecture.
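To make the idea of a channel-wise spatiotemporal shift concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the RubiksShift implementation: the offsets are fixed integers rather than learned values, out-of-range positions are zero-filled, and the names `shift_channel` and `channelwise_shift_3d` are hypothetical. Each channel of a `[C][T][H][W]` video tensor is displaced by its own `(dt, dh, dw)` offset, which costs no multiplications and no weights.

```python
def shift_channel(channel, dt, dh, dw):
    """Shift one [T][H][W] channel by integer offsets, zero-filling gaps.

    Output position (t, h, w) reads from input position (t-dt, h-dh, w-dw).
    Assumption: integer offsets and zero padding, chosen for illustration.
    """
    T, H, W = len(channel), len(channel[0]), len(channel[0][0])

    def source(t, h, w):
        ts, hs, ws = t - dt, h - dh, w - dw
        if 0 <= ts < T and 0 <= hs < H and 0 <= ws < W:
            return channel[ts][hs][ws]
        return 0.0  # position shifted in from outside the tensor

    return [[[source(t, h, w) for w in range(W)]
             for h in range(H)]
            for t in range(T)]


def channelwise_shift_3d(video, offsets):
    """Apply a separate (dt, dh, dw) shift to each channel of [C][T][H][W]."""
    if len(video) != len(offsets):
        raise ValueError("need exactly one (dt, dh, dw) offset per channel")
    return [shift_channel(ch, dt, dh, dw)
            for ch, (dt, dh, dw) in zip(video, offsets)]
```

For example, an offset of `(1, 0, 0)` moves a channel one frame forward in time, while `(0, 0, -1)` moves it one pixel to the left. In a shift-based block, such a shift would typically be followed by a pointwise operation that mixes the displaced channels.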

Video Overview

(No audio; use the Settings > Playback speed option in the YouTube player to adjust video to your reading speed)

Citation

@inproceedings{fanbuch2020rubiks,
  title={RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition},
  author={Linxi Fan* and Shyamal Buch* and Guanzhi Wang and Ryan Cao and Yuke Zhu and Juan Carlos Niebles and Li Fei-Fei},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2020}
}

Funding Acknowledgements


This research was sponsored in part by grants from Toyota Research Institute (TRI). Some computational support for experiments was provided by Google Cloud and NVIDIA. The authors also acknowledge fellowship support. Please refer to the paper for full acknowledgements. Thank you!