Face mask recognition from audio: The MASC database and an overview on the mask challenge

Authors
  • Mohamed, Mostafa M. (1, 2)
  • Nessiem, Mina A. (1, 2)
  • Batliner, Anton (1)
  • Bergler, Christian (3)
  • Hantke, Simone (4)
  • Schmitt, Maximilian (1)
  • Baird, Alice (1)
  • Mallol-Ragolta, Adria (1)
  • Karas, Vincent (1)
  • Amiriparian, Shahin (1, 4)
  • Schuller, Björn W. (1, 4, 5)
Affiliations
  • (1) Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Augsburg, Germany
  • (2) AI R&D Team SyncPilot GmbH, Augsburg, Germany
  • (3) Pattern Recognition Lab, FAU Erlangen-Nuremberg, Germany
  • (4) audEERING GmbH, Gilching, Germany
  • (5) GLAM – Group on Language, Audio & Music, Imperial College London, UK
Type
Published Article
Journal
Pattern Recognition
Publisher
Elsevier
Publication Date
Oct 04, 2021
Volume
122
Pages
108361–108361
Identifiers
DOI: 10.1016/j.patcog.2021.108361
PMID: 34629550
PMCID: PMC8489285
Source
PubMed Central

Abstract

The sudden outbreak of COVID-19 has posed serious challenges for the field of biometrics, both because the virus spreads via physical contact and because regulations require wearing face masks. Given these constraints, voice biometrics can offer a suitable contactless solution, and they can benefit from models that classify whether a speaker is wearing a mask. This article reviews the Mask Sub-Challenge (MSC) of the INTERSPEECH 2020 COMputational PARalinguistics challengE (ComParE), which posed the following classification task: given an audio chunk of a speaker, classify whether the speaker is wearing a mask. First, we report the collection of the Mask Augsburg Speech Corpus (MASC) and the baseline approaches used to solve the problem, which achieve an Unweighted Average Recall (UAR) of 71.8%. We then summarise the methodologies explored in the submitted and accepted papers, which mainly followed two common patterns: (i) phonetic-based audio features, or (ii) spectrogram representations of audio combined with Convolutional Neural Networks (CNNs) of the kind typically used in image processing. Most approaches enhance their models by using ensembles of different models and by attempting to increase the size of the training data with various techniques. We review and discuss the results of the participants in this sub-challenge, whose winner scored a UAR of 80.1%. Moreover, we present the results of fusing the approaches, which leads to a UAR of 82.6%. Finally, we present a smartphone app that serves as a proof-of-concept demonstration for detecting in real time whether users are wearing a face mask; we also benchmark the run-time of the best models.
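
The abstract reports all results as Unweighted Average Recall (UAR), i.e. the mean of the per-class recalls, so that each class counts equally regardless of how many samples it has. The following is a minimal sketch of that metric, not the challenge's official scoring code; the function name `unweighted_average_recall`, the label strings `"mask"` and `"clear"`, and the toy predictions are assumptions made purely for illustration.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (each class weighted equally)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    # Recall of class c = correct[c] / total[c]; UAR averages these values.
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy data: four "mask" chunks and two "clear" (no mask) chunks.
y_true = ["mask", "mask", "mask", "mask", "clear", "clear"]
y_pred = ["mask", "mask", "mask", "clear", "clear", "mask"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.3f}, accuracy = {accuracy:.3f}")
```

In this toy case the recall is 3/4 for "mask" and 1/2 for "clear", giving a UAR of 0.625, while plain accuracy is 4/6. A classifier that always predicts the same class reaches a UAR of 0.5 on a two-class task, which is why UAR is a common choice when class sizes may differ.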
