Rethinking Efficacy of Softmax For Lightweight Non-Local Neural Networks
Yooshin Cho, Youngsoo Kim, Hanbyel Cho, Jaesung Ahn, Hyeong Gwon Hong, Junmo Kim
SPS
Multi-view 3D reconstruction is the basis for many other applications in computer vision. Video provides multi-view images and temporal information, which can help us better achieve the reconstruction goal. Handling redundant information in video, and extracting and fusing multi-view features, become the key issues in shape prior extraction for reconstruction. In this paper, inspired by the recent great success of Transformer models, we propose a transformer-based 3D reconstruction network. We formulate multi-view 3D reconstruction in three parts: a frame encoder, a fusion module, and a shape decoder. We introduce several specially designed tokens and perform fusion progressively in the encoder phase, which we call the patch-level progressive fusion module. These tokens describe which part of the object each frame should focus on and capture local structural detail progressively. We then further design a transformer fusion module to aggregate the structural information. Finally, multi-head attention is used to build the transformer-based decoder, which reuses the shallow features from the encoder. In experiments, our method not only achieves competitive performance but also has low model complexity and computation cost.
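The fusion step described above — aggregating patch tokens from multiple views with multi-head attention — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, tensor shapes, and the use of a single random query token standing in for a learnable "shape token" are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax for the attention weights.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, num_heads):
    # q: (Lq, d) queries; k, v: (Lk, d) keys/values; d divisible by num_heads.
    # Projection matrices are omitted for brevity; each head attends over
    # its own slice of the feature dimension.
    Lq, d = q.shape
    dh = d // num_heads
    out = np.zeros((Lq, d))
    for h in range(num_heads):
        s = slice(h * dh, (h + 1) * dh)
        scores = q[:, s] @ k[:, s].T / np.sqrt(dh)   # scaled dot-product
        out[:, s] = softmax(scores, axis=-1) @ v[:, s]
    return out

def fuse_views(view_feats, num_heads=4):
    # view_feats: (V, L, d) patch tokens from V encoded frames (hypothetical
    # shapes). Tokens from all views are concatenated along the sequence axis
    # so attention can mix information across views.
    V, L, d = view_feats.shape
    tokens = view_feats.reshape(V * L, d)
    # A learnable shape token would attend over all view tokens; a fixed
    # random query is used here purely as a stand-in.
    rng = np.random.default_rng(0)
    shape_token = rng.standard_normal((1, d))
    fused = multi_head_attention(shape_token, tokens, tokens, num_heads)
    return fused  # (1, d) fused shape representation for the decoder
```

The key design point is that attention weights are computed jointly over tokens from every view, so redundant frames contribute little while informative patches dominate the fused representation.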