ICCL: SELF-SUPERVISED INTRA- AND CROSS-MODAL CONTRASTIVE LEARNING WITH 2D-3D PAIRS FOR 3D SCENE UNDERSTANDING
Kyota Higa, Masahiro Yamaguchi, Toshinori Hosoi
This paper proposes self-supervised intra- and cross-modal contrastive learning (ICCL) with 2D-3D pairs for 3D scene understanding. Learning from multiple modalities has yielded strong results in self-supervised learning. Our method learns a highly transferable model by minimizing contrastive losses over 2D, 3D, and 2D-3D features. Whereas the conventional approach minimizes only the 3D and 2D-3D contrastive losses, our method additionally minimizes a 2D contrastive loss, which leads to a better feature representation. We evaluate transferability on three downstream tasks: 3D object classification, few-shot object classification, and part segmentation. On 3D object classification, our approach achieves accuracies of 91.7% and 85.4%, which are 0.5 and 3.7 points higher than the conventional method. On few-shot object classification and part segmentation, our accuracy is equal to or higher than that of conventional methods. With better feature representations for 2D images and 3D point clouds, transfer learning becomes more accessible, enabling a wide range of applications across many fields.
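The objective described above combines three contrastive terms: an intra-modal 2D loss, an intra-modal 3D loss, and a cross-modal 2D-3D loss. The sketch below illustrates one plausible instantiation using the InfoNCE formulation common in contrastive learning; the function names, the loss weights, and the choice of InfoNCE with a fixed temperature are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def info_nce(feat_a, feat_b, temperature=0.07):
    """InfoNCE loss: row i of feat_a and feat_b form a positive pair;
    all other rows in the batch act as negatives. (Illustrative sketch.)"""
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                 # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # positives lie on the diagonal

def iccl_loss(f2d_a, f2d_b, f3d_a, f3d_b, w2d=1.0, w3d=1.0, wx=1.0):
    """Hypothetical combined objective: intra-modal 2D, intra-modal 3D,
    and cross-modal 2D-3D InfoNCE terms. w2d/w3d/wx are assumed weights."""
    loss_2d = info_nce(f2d_a, f2d_b)   # two augmented views of the same image
    loss_3d = info_nce(f3d_a, f3d_b)   # two augmented views of the same point cloud
    loss_x  = info_nce(f2d_a, f3d_a)   # a paired 2D image and 3D point cloud
    return w2d * loss_2d + w3d * loss_3d + wx * loss_x
```

In this sketch, perfectly aligned feature pairs drive all three terms toward zero, while mismatched pairs incur a penalty on the order of log N for a batch of size N.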