Real-Time Acoustic Scene Classification For Hearing Aids
Kamil Adiloğlu, Andreas Hüwel, Jörg-Hendrik Bach
Acoustic scene classification is a popular topic combining the fields of audio signal processing and machine learning. In particular, the annual Detection and Classification of Acoustic Scenes and Events (DCASE) challenge has increased researchers' interest in this topic. However, the definitions of the acoustic scenes and the corresponding databases for training the classifiers do not account for the requirements of a hearing aid application. Furthermore, the methods proposed for classifying these databases consider neither the computational nor the timing restrictions of a hearing aid. For these reasons, we recorded typical scenes, so-called listening situations, using two different types of binaural hearing aid shells for hearing aid scene classification applications. The first is an in-ear hearing aid with one microphone on each side; the second is a behind-the-ear hearing aid with two microphones on each side. We performed long recording sessions in different listening situations, annotated the recordings, cut them into 10 s snippets, and mixed them with speech, compiling a database with 14 different classes.

We computed the LogMel features of the snippets and trained a convolutional neural network (CNN) on these features using PyTorch. Finally, we implemented a real-time capable version of this network on our real-time signal processing platform, the master hearing aid (MHA), in C++. The MHA loads the pre-trained CNN using the LibTorch library and performs one forward pass through the network given the input features.

The system we would like to demonstrate at ICASSP in the Show & Tell session captures the binaural input signal using hearing aid shells worn by an artificial head. We will play a sequence of test sounds on a loudspeaker (or through headphones, if loudspeakers are not allowed), for which the true labels of the listening situations are known. The binaural signal is sent through an external sound card to a mini-PC running the MHA. For each 10 s chunk, the LogMel features are computed and one forward pass of the CNN is executed. The output layer of the CNN is a softmax layer, whose maximum determines the predicted listening situation out of the 14 possible listening situations. The predicted listening situation is shown to the user on a GUI. The whole chain runs in real time with a total delay of 10 s, as the prediction is performed on 10 s input signals.
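As a rough sketch of the feature extraction stage, the following Python/torchaudio snippet computes LogMel features for one 10 s binaural snippet. The sample rate, number of mel bands, and STFT parameters are assumptions chosen for illustration; the actual settings are not specified above.

```python
# Illustrative LogMel feature extraction for a 10 s binaural snippet.
# SAMPLE_RATE, n_fft, hop_length, and n_mels are assumed values.
import torch
import torchaudio

SAMPLE_RATE = 16000          # assumed sample rate
SNIPPET_SECONDS = 10         # snippet length used for classification

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=1024,
    hop_length=512,
    n_mels=64,
)

def logmel_features(snippet: torch.Tensor) -> torch.Tensor:
    """snippet: (channels, samples) binaural waveform of one 10 s chunk."""
    spec = mel(snippet)                # (channels, n_mels, frames)
    return torch.log(spec + 1e-6)      # LogMel features

# Example: a two-channel (binaural) 10 s snippet of silence.
waveform = torch.zeros(2, SAMPLE_RATE * SNIPPET_SECONDS)
features = logmel_features(waveform)
print(features.shape)                  # roughly (2, 64, 313) with these settings
```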
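The CNN architecture itself is not described above. As an illustrative placeholder only, a small PyTorch CNN with a 14-way output could look as follows; the predicted listening situation is the argmax of the softmax output, as described in the demo chain.

```python
# Placeholder CNN for 14-way listening-situation classification on LogMel input.
# Layer sizes are illustrative and do not reflect the actual network.
import torch
import torch.nn as nn

class SceneCNN(nn.Module):
    def __init__(self, n_channels: int = 2, n_classes: int = 14):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, n_mels, frames) LogMel features
        h = self.features(x).flatten(1)
        return self.classifier(h)          # raw class scores (logits)

model = SceneCNN()
logits = model(torch.randn(1, 2, 64, 313))
# Softmax over the 14 classes; the argmax is the predicted listening situation.
predicted = torch.softmax(logits, dim=1).argmax(dim=1)
print(predicted.item())
```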
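One plausible way to hand a PyTorch-trained network to a C++ process that uses LibTorch, as the MHA plugin does, is TorchScript serialization. The sketch below uses a trivial stand-in model and a hypothetical file name; it is not the actual export path of the described system.

```python
# Export a trained model as TorchScript so a C++ LibTorch host can load it
# (e.g. via torch::jit::load). Model and file name are illustrative stand-ins.
import torch
import torch.nn as nn

# Stand-in for the trained scene classifier (see the CNN sketch above).
model = nn.Sequential(nn.Flatten(), nn.Linear(2 * 64 * 313, 14)).eval()

example_input = torch.randn(1, 2, 64, 313)     # one LogMel snippet
scripted = torch.jit.trace(model, example_input)
scripted.save("scene_cnn.pt")                  # loadable from the C++ side
```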