Generalized Sparse Convolution

Sparse Tensor

In traditional speech, text, or image data, features are extracted densely, so the most common representations of these data are vectors, matrices, and dense tensors. However, for 3-dimensional scans or even higher-dimensional spaces, such dense representations are inefficient because most of the space is empty. Instead, we can save only the non-empty part of the space as its coordinates and the associated features. This representation is an N-dimensional extension of a sparse matrix and is thus known as a sparse tensor.

In the Minkowski Engine, we adopt this sparse tensor as the basic data representation; the class is provided as MinkowskiEngine.SparseTensor. We use the COOrdinate (COO) format to save a sparse tensor [1]. This representation stores the coordinates in a matrix \(\mathbf{C}\) and the associated features in a matrix \(\mathbf{F}\):

\[\begin{split}\mathbf{C} = \begin{bmatrix} x_1^1 & x_1^2 & \cdots & x_1^D & b_1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_N^1 & x_N^2 & \cdots & x_N^D & b_N \end{bmatrix}, \; \mathbf{F} = \begin{bmatrix} \mathbf{f}_1^T\\ \vdots\\ \mathbf{f}_N^T \end{bmatrix}\end{split}\]

In the above equation, we use a \(D\)-dimensional space and \(N\) points, each with the coordinate \((x_i^1, x_i^2, \cdots, x_i^D)\) and the associated feature \(\mathbf{f}_i\). \(b_i\) denotes the mini-batch index, which separates points that belong to different instances within the same batch. Internally, we handle the batch index as an additional spatial dimension.
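As a minimal sketch, a sparse tensor can be constructed directly from a coordinate matrix and a feature matrix. This assumes a recent MinkowskiEngine version (v0.5-style API), in which the batch index occupies the first column of the coordinate matrix rather than the last column as in the notation above.

```python
# Minimal sketch: building a SparseTensor from COO coordinates and features.
# Assumes a recent MinkowskiEngine (v0.5-style API); the batch index b_i is
# stored in the first column of the coordinate matrix here.
import torch
import MinkowskiEngine as ME

# Two points in a D = 3 space, each with a 2-channel feature f_i.
coordinates = torch.IntTensor([
    [0, 1, 2, 3],  # [b_1, x_1^1, x_1^2, x_1^3]
    [0, 4, 5, 6],  # [b_2, x_2^1, x_2^2, x_2^3]
])
features = torch.FloatTensor([
    [0.5, 1.0],
    [2.0, 3.0],
])

stensor = ME.SparseTensor(features=features, coordinates=coordinates)
print(stensor.C)  # the coordinate matrix C
print(stensor.F)  # the feature matrix F
```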

Generalized Sparse Convolution

Convolution is a fundamental operation in many fields. In image perception, the 2D convolution has achieved state-of-the-art performance in many tasks and has proven to be one of the most crucial operations in AI and computer vision research. In this work, we adopt the sparse convolution [2] and propose the generalized sparse convolution. We apply the generalized sparse convolution not only to the 3D spatial axes, but also to the temporal axis, which has proven more effective than recurrent neural networks (RNNs) in some applications.

Specifically, we generalize the sparse convolution to generic input and output coordinates and to arbitrary kernel shapes. The generalized sparse convolution encompasses not only all sparse convolutions but also the conventional dense convolutions. Let \(\mathbf{x}^{\text{in}}_\mathbf{u} \in \mathbb{R}^{N^\text{in}}\) be an \(N^\text{in}\)-dimensional input feature vector in a \(D\)-dimensional space at the coordinate \(\mathbf{u} \in \mathbb{R}^D\), and let the convolution kernel weights be \(\mathbf{W} \in \mathbb{R}^{K^D \times N^\text{out} \times N^\text{in}}\). We break the weights down into \(K^D\) spatial weight matrices \(W_\mathbf{i}\) of size \(N^\text{out} \times N^\text{in}\), one for each kernel offset \(\mathbf{i}\). Then, the conventional dense convolution in \(D\) dimensions is

\[\mathbf{x}^{\text{out}}_\mathbf{u} = \sum_{\mathbf{i} \in \mathcal{V}^D(K)} W_\mathbf{i} \mathbf{x}^{\text{in}}_{\mathbf{u} + \mathbf{i}} \text{ for } \mathbf{u} \in \mathbb{Z}^D,\]

where \(\mathcal{V}^D(K)\) is the set of offsets in the \(D\)-dimensional hypercube of size \(K\) centered at the origin, e.g., \(\mathcal{V}^1(3)=\{-1, 0, 1\}\). The generalized sparse convolution in the following equation relaxes the above equation.
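To make \(\mathcal{V}^D(K)\) concrete, the short sketch below enumerates these hypercube offsets in pure Python; the function name hypercube_offsets is illustrative, and only the standard library is used.

```python
# Enumerate the K^D offsets of V^D(K), the D-dimensional hypercube
# kernel centered at the origin (K is assumed to be odd).
from itertools import product

def hypercube_offsets(D, K):
    half = K // 2
    return list(product(range(-half, half + 1), repeat=D))

print(hypercube_offsets(1, 3))       # [(-1,), (0,), (1,)], i.e., V^1(3)
print(len(hypercube_offsets(3, 3)))  # 27 = 3^3 offsets in V^3(3)
```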

\[\mathbf{x}^{\text{out}}_\mathbf{u} = \sum_{\mathbf{i} \in \mathcal{N}^D(\mathbf{u}, \mathcal{C}^{\text{in}})} W_\mathbf{i} \mathbf{x}^{\text{in}}_{\mathbf{u} + \mathbf{i}} \text{ for } \mathbf{u} \in \mathcal{C}^{\text{out}}\]

where \(\mathcal{N}^D\) is the set of offsets that defines the shape of the kernel, and \(\mathcal{N}^D(\mathbf{u}, \mathcal{C}^\text{in}) = \{\mathbf{i} \mid \mathbf{u} + \mathbf{i} \in \mathcal{C}^\text{in}, \mathbf{i} \in \mathcal{N}^D \}\) is the subset of offsets from the current center \(\mathbf{u}\) that land on coordinates in \(\mathcal{C}^\text{in}\). \(\mathcal{C}^\text{in}\) and \(\mathcal{C}^\text{out}\) are predefined input and output coordinates of the sparse tensors. First, note that the input coordinates and output coordinates are not necessarily the same. Second, the shape of the convolution kernel is defined arbitrarily by \(\mathcal{N}^D\).

This generalization encompasses many special cases, such as the dilated convolution and typical hypercubic kernels. Another interesting special case is the sparse submanifold convolution, obtained by setting \(\mathcal{C}^\text{out} = \mathcal{C}^\text{in}\) and \(\mathcal{N}^D = \mathcal{V}^D(K)\). If we set \(\mathcal{C}^\text{in} = \mathcal{C}^\text{out} = \mathbb{Z}^D\) and \(\mathcal{N}^D = \mathcal{V}^D(K)\), the generalized sparse convolution becomes the conventional dense convolution. If we define \(\mathcal{C}^\text{in}\) and \(\mathcal{C}^\text{out}\) as multiples of a natural number (a strided grid) and set \(\mathcal{N}^D = \mathcal{V}^D(K)\), we have a strided dense convolution.
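To illustrate the equation above, the following is a reference sketch of the generalized sparse convolution written in plain NumPy. All names here are illustrative; the Minkowski Engine itself exposes this operation, implemented far more efficiently, through classes such as MinkowskiEngine.MinkowskiConvolution.

```python
# Reference sketch of the generalized sparse convolution (illustrative only).
import numpy as np

def generalized_sparse_conv(C_in, F_in, C_out, N_D, W):
    """Compute x_out_u = sum_{i in N^D(u, C_in)} W_i x_in_{u+i} for u in C_out.

    C_in:  (N, D) integer array of input coordinates.
    F_in:  (N, N_in) float array of input features.
    C_out: (M, D) integer array of output coordinates.
    N_D:   list of D-dimensional offset tuples defining the kernel shape.
    W:     dict mapping each offset to an (N_out, N_in) weight matrix W_i.
    """
    index = {tuple(map(int, c)): row for row, c in enumerate(C_in)}
    N_out = next(iter(W.values())).shape[0]
    F_out = np.zeros((len(C_out), N_out))
    for row, u in enumerate(C_out):
        for i in N_D:
            neighbor = tuple(int(a + b) for a, b in zip(u, i))
            if neighbor in index:  # i is in N^D(u, C_in)
                F_out[row] += W[i] @ F_in[index[neighbor]]
    return F_out

# Example: a 1D kernel N^1 = V^1(3) = {-1, 0, 1} on a sparse input,
# with C_out = C_in (the sparse submanifold convolution case above).
C_in = np.array([[0], [1], [3]])
F_in = np.array([[1.0], [2.0], [4.0]])
W = {(-1,): np.array([[1.0]]),
     (0,):  np.array([[10.0]]),
     (1,):  np.array([[100.0]])}
print(generalized_sparse_conv(C_in, F_in, C_in, list(W), W))
# -> [[210.], [21.], [40.]]
```

Note how the missing coordinate at 2 simply drops out of the sum: the offsets that do not land on \(\mathcal{C}^\text{in}\) contribute nothing, which is exactly the relaxation that distinguishes the generalized sparse convolution from its dense counterpart.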