The costly process of obtaining semantic segmentation labels has driven research toward weakly supervised semantic segmentation (WSSS) methods, which use only image-level labels for training. The lack of a dense semantic scene representation forces methods to increase complexity to obtain additional semantic information (i.e., object/stuff extent and boundaries) about the scene. This is often done through increased model complexity and sophisticated multi-stage training/refinement procedures. However, a single image lacks the 3D geometric structure of the scene, which limits what these efforts can achieve beyond a certain point. In this work, we propose to harness (inverse) depth maps estimated from a single image via a monocular depth estimation model to integrate the 3D geometric structure of the scene into the segmentation model. Building on this proposal, we develop an end-to-end segmentation-based network model and a self-supervised training process that learns semantic masks from only image-level annotations in a single stage. We show that despite its simplicity, our one-stage method achieves competitive mIoU scores (val: 64.32, test: 64.91) on Pascal VOC compared with significantly more complex pipelines, and outperforms SOTA single-stage methods.
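One simple way to picture feeding estimated (inverse) depth into a segmentation model is to attach it to the image as an extra input channel. The sketch below illustrates this idea only; the function name, the normalization, and the channel-concatenation scheme are assumptions for illustration, not the paper's actual architecture or fusion strategy:

```python
import numpy as np

def fuse_rgb_with_inverse_depth(rgb, depth, eps=1e-6):
    """Append a normalized inverse depth map to an RGB image as a 4th channel.

    rgb:   (H, W, 3) float array in [0, 1]
    depth: (H, W) positive depth estimates, e.g. from a monocular depth model
    """
    inv_depth = 1.0 / np.maximum(depth, eps)   # inverse depth: near = large
    inv_depth = inv_depth / inv_depth.max()    # normalize to [0, 1]
    # concatenate along the channel axis -> (H, W, 4) network input
    return np.concatenate([rgb, inv_depth[..., None]], axis=-1)

# toy example: 4x4 image with a synthetic depth ramp from 1 m to 10 m
rgb = np.random.rand(4, 4, 3)
depth = np.linspace(1.0, 10.0, 16).reshape(4, 4)
fused = fuse_rgb_with_inverse_depth(rgb, depth)
print(fused.shape)  # (4, 4, 4)
```

A segmentation backbone would then take the 4-channel tensor instead of the usual 3-channel image, so depth discontinuities (which often coincide with object boundaries) are available to the model from the first layer.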