OCVOS: Object-Centric Representation for Video Object Segmentation
Junho Jo, Dongyoon Wee, Nam Ik Cho
SPS
Semi-supervised video object segmentation (VOS) methods aim to segment target objects given pixel-level annotations in the first frame. Many methods employ Transformer-based attention modules to propagate these first-frame annotations to the most similar patches or pixels in subsequent frames. Although they have shown impressive results, such methods remain prone to errors in challenging scenes with multiple overlapping objects. To tackle this problem, we propose an object-centric VOS (OCVOS) method that exploits query-based Transformer decoder blocks. After aggregating target-object information with a typical matching-based approach, the Transformer decoder blocks extract object-wise information by interacting with object queries. In this way, the proposed method considers not only global and contextual information but also object-centric representations. We validate its effectiveness in inducing object-wise information, compared to existing methods, on the DAVIS and YouTube-VOS benchmarks.
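The query-based decoding idea described above can be sketched as follows. This is a hypothetical minimal illustration, not the authors' implementation: learnable object queries (a name and design assumed here, following common query-based decoders) cross-attend to pixel features produced by the matching stage, yielding one embedding per object, from which mask logits are read out by a dot product with the pixel features. All module names, dimensions, and the readout are illustrative assumptions.

```python
# Hypothetical sketch of object-query decoding for VOS (not the OCVOS code).
import torch
import torch.nn as nn


class ObjectQueryDecoder(nn.Module):
    def __init__(self, num_queries=8, dim=256, num_layers=3):
        super().__init__()
        # One learnable query vector per potential target object (assumed design).
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, pixel_feats):
        # pixel_feats: (B, H*W, dim) features from the matching-based stage.
        b = pixel_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries cross-attend to pixel features -> object-centric embeddings.
        obj_embed = self.decoder(q, pixel_feats)  # (B, num_queries, dim)
        # Illustrative mask readout: similarity between objects and pixels.
        mask_logits = torch.einsum("bqd,bpd->bqp", obj_embed, pixel_feats)
        return obj_embed, mask_logits


decoder = ObjectQueryDecoder()
feats = torch.randn(2, 64, 256)          # e.g. an 8x8 feature map, flattened
emb, masks = decoder(feats)              # emb: (2, 8, 256), masks: (2, 8, 64)
```

Each query thus specializes to one object, which is what lets the decoder produce per-object rather than purely per-pixel representations.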