Skip to main content

Exploring Visual-Audio Composition Alignment Network For Quality Fashion Retrieval In Video

Yanhao Zhang, Jianmin Wu, Xiong Xiong, Dangwei Li, Chenwei Xie, Yun Zheng, Pan Pan, Yinghui Xu

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 00:06:12
09 Jun 2021

Fashion retrieval in video suffers from the issues of imperfect visual representation and low quality of search results under the E-commercial circumstance. Previous works generally focus on searching the identical images from visual perspective only, but lack of leveraging multi-modal information for high quality commodities. As a cross-domain problem, instructional or exhibiting audio reveals rich semantic information to facilite the video-to-shop task. In this paper, we present a novel Visual-Audio Composition Alignment Network (VACANet) to deal with quality fashion retrieval in video. Firstly, we introduce the visual-audio composition module in VACANet aiming to distinguish attentive and residual entities by learning semantic embedding from both visual and audio streams. Secondly, a quality alignment training scheme is then designed by quality-aware triplet mining and domain alignment constraint for video-to-image adaptation. Finally, extensive experiments conducted on challenging video datasets demonstrate the scalable effectiveness of our model in alleviating quality fashion retrieval.

Chairs:
Sicheng Zhao

Value-Added Bundle(s) Including this Product

More Like This

  • SPS
    Members: Free
    IEEE Members: $25.00
    Non-members: $40.00
  • SPS
    Members: Free
    IEEE Members: Free
    Non-members: Free