Hybrid sub-word segmentation for handling long tail in morphologically rich low resource languages

Sreeja Manghat, Sreeram Manghat, Tanja Schultz

Length: 00:15:20

08 May 2022

Handling out-of-vocabulary (OOV) words, i.e. words unseen during training, is one of the main challenges for machine translation (MT) and automatic speech recognition (ASR) systems. Morphologically rich languages have a high type-token ratio, so their OOV percentage is also high. Sub-word segmentation is one of the major approaches to dealing with OOVs. In this paper we present a hybrid sub-word segmentation algorithm for handling OOVs, together with a methodology for evaluating sub-word segmentation. We also compare our segmentation approach with some popular sub-word segmentation algorithms. Malayalam is a morphologically rich, low-resource Indic language with a very high type-token ratio. All experiments are carried out on a conversational code-switched Malayalam-English corpus.
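The two quantities the abstract relies on, type-token ratio and OOV rate, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the toy English word lists, the function names, and in particular `naive_chunks`, which is a fixed-length character splitter standing in for a real segmenter. It is not the hybrid algorithm presented in the paper, whose details are not given in this abstract.

```python
def type_token_ratio(tokens):
    """Number of distinct types divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens."""
    if not test_tokens:
        return 0.0
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


def naive_chunks(word, n=4):
    """Hypothetical stand-in segmenter: split a word into chunks of n characters.

    This is NOT the paper's hybrid method; it only shows how sub-word
    units can lower the OOV rate.
    """
    return [word[i:i + n] for i in range(0, len(word), n)]


def segment_all(words, n=4):
    """Flatten the chunks of every word into one list of sub-word units."""
    return [unit for word in words for unit in naive_chunks(word, n)]


# Toy data (assumed, for illustration only).
train_words = "walk walked walking talk talked".split()
test_words = "talking walks".split()

word_level_oov = oov_rate(train_words, test_words)
subword_level_oov = oov_rate(segment_all(train_words), segment_all(test_words))

print(f"train TTR: {type_token_ratio(train_words):.2f}")
print(f"word-level OOV rate: {word_level_oov:.2f}")
print(f"sub-word OOV rate: {subword_level_oov:.2f}")
```

On this toy data both test words are unseen at the word level, while most of their chunks already occur in the training units, so the sub-word OOV rate is lower. A real evaluation would swap in an actual segmenter and the corpus described in the abstract.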

Tags:

oov

sub-word segmentation

language modelling

low resource languages

malayalam

code-switching

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
