CSLNSpeech: Solving the extended speech separation problem with the help of Chinese sign language
- Publication Date
- Nov 01, 2024
- Identifiers
- DOI: 10.1016/j.specom.2024.103131
- OAI: oai:HAL:hal-04719302v1
- Source
- HAL-Rennes 1
- Language
- English
- License
- Unknown
Abstract
Previous audio-visual speech separation methods synchronize the speaker's facial movements with the speech in the video to self-supervise speech separation. In this paper, we propose a model that solves the speech separation problem with the help of both face and sign language, which we call the extended speech separation problem. We design a general deep learning network that learns to combine three modalities, audio, face, and sign language, to solve the speech separation problem better. To train the model, we introduce a large-scale dataset named the Chinese Sign Language News Speech (CSLNSpeech) dataset, in which the three modalities coexist: audio, face, and sign language. Experimental results show that the proposed model performs better and is more robust than the usual audio-visual system. In addition, the sign language modality can also be used alone to supervise speech separation, and introducing sign language helps hearing-impaired people learn and communicate. Finally, our model is a general speech separation framework and achieves very competitive separation performance on two open-source audio-visual datasets. The code is available at https://github.com/iveveive/SLNSpeech.
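To make the three-modality idea concrete, the PyTorch sketch below illustrates one way audio, face, and sign-language streams could be fused to predict a time-frequency separation mask. This is not the paper's architecture: the layer sizes, the concatenation-plus-BiLSTM fusion, the mask-based output, and the assumption that the three streams are already frame-aligned are all illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of a three-stream
# audio / face / sign-language fusion network for speech separation.
import torch
import torch.nn as nn


class TriModalSeparator(nn.Module):
    def __init__(self, n_freq=257, vis_dim=512, emb_dim=256):
        super().__init__()
        # Audio stream: encode the mixture spectrogram frame by frame.
        self.audio_enc = nn.Sequential(
            nn.Linear(n_freq, emb_dim), nn.ReLU(), nn.Linear(emb_dim, emb_dim)
        )
        # Visual streams: project per-frame face and sign-language embeddings
        # (e.g. from pretrained visual encoders) into a shared space.
        self.face_enc = nn.Linear(vis_dim, emb_dim)
        self.sign_enc = nn.Linear(vis_dim, emb_dim)
        # Fuse the concatenated streams over time, then predict a mask per frame.
        self.fusion = nn.LSTM(emb_dim * 3, emb_dim, batch_first=True, bidirectional=True)
        self.mask_head = nn.Sequential(nn.Linear(emb_dim * 2, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, face_feat, sign_feat):
        # mix_spec:  (B, T, n_freq)  magnitude spectrogram of the mixture
        # face_feat: (B, T, vis_dim) per-frame face embeddings
        # sign_feat: (B, T, vis_dim) per-frame sign-language embeddings
        a = self.audio_enc(mix_spec)
        f = self.face_enc(face_feat)
        s = self.sign_enc(sign_feat)
        fused, _ = self.fusion(torch.cat([a, f, s], dim=-1))
        mask = self.mask_head(fused)   # values in [0, 1]
        return mask * mix_spec         # masked spectrogram of the target speaker


if __name__ == "__main__":
    model = TriModalSeparator()
    out = model(torch.rand(2, 100, 257), torch.rand(2, 100, 512), torch.rand(2, 100, 512))
    print(out.shape)  # torch.Size([2, 100, 257])
```

Because the sign-language stream enters the network the same way as the face stream, such a design also allows either visual modality to be used on its own, which matches the abstract's observation that sign language alone can supervise the separation task.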