Visual onoma-to-wave: environmental sound synthesis from visual onomatopoeias and sound-source images
Hien Ohnaka (National Institute of Technology, Tokuyama College); Shinnosuke Takamichi (The University of Tokyo); Keisuke Imoto (Doshisha University); Yuki Okamoto (Ritsumeikan University); Kazuki Fujii (The University of Tokyo); Hiroshi Saruwatari (The University of Tokyo)
We propose a method for synthesizing environmental sounds from visually represented onomatopoeias and sound sources. An onomatopoeia is a word that imitates the structure of a sound, i.e., a textual representation of sound. From this perspective, onoma-to-wave was previously proposed to synthesize environmental sounds from desired onomatopoeia texts. Onomatopoeias also have another form: visual-text representations of sounds, as seen in comics, advertisements, and virtual reality. A visual onomatopoeia (the visual text of an onomatopoeia) carries rich information absent from plain text, such as the long or short duration suggested by the image, so using this representation is expected to enable the synthesis of more diverse sounds. We therefore propose visual onoma-to-wave, a method for environmental sound synthesis from visual onomatopoeias. The method can transfer visual concepts of the visual text and the sound-source image to the synthesized sound. We also propose a data augmentation method that exploits the repetition of onomatopoeias to enhance the performance of our method. Experimental evaluations show that these methods can synthesize diverse environmental sounds from visual texts and sound-source images.
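To make the repetition-based augmentation concrete, here is a minimal sketch of one plausible reading of it: given a training pair consisting of a single sound event's waveform and its visual-onomatopoeia image, the event is repeated in the audio while the visual text is tiled the same number of times in the image. The function name, the silent-gap parameter, and the tiling scheme are illustrative assumptions, not the paper's actual procedure.

```python
# Hypothetical sketch of repetition-based data augmentation. Assumes each
# training pair is (waveform of one sound event, visual-onomatopoeia image);
# names and parameters are illustrative, not taken from the paper.
import numpy as np
from PIL import Image

def augment_by_repetition(wave: np.ndarray, image: Image.Image,
                          n_repeats: int, sr: int = 22050,
                          gap_sec: float = 0.05) -> tuple[np.ndarray, Image.Image]:
    """Repeat the sound event n_repeats times with short silent gaps and
    tile its visual onomatopoeia horizontally the same number of times."""
    gap = np.zeros(int(gap_sec * sr), dtype=wave.dtype)
    # Concatenate: wave, gap, wave, gap, ..., wave (no leading gap).
    segments = [wave]
    for _ in range(n_repeats - 1):
        segments.extend((gap, wave))
    wave_rep = np.concatenate(segments)
    # Paste the onomatopoeia image side by side so the visual text repeats too.
    w, h = image.size
    canvas = Image.new(image.mode, (w * n_repeats, h))
    for i in range(n_repeats):
        canvas.paste(image, (i * w, 0))
    return wave_rep, canvas

if __name__ == "__main__":
    dummy_wave = np.random.randn(22050).astype(np.float32)  # 1 s of noise
    dummy_img = Image.new("L", (128, 64), color=255)        # blank visual text
    aug_wave, aug_img = augment_by_repetition(dummy_wave, dummy_img, n_repeats=3)
    print(aug_wave.shape, aug_img.size)                     # (68205,) (384, 64)
```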