

Poster

Speech Self-Supervised Learning Using Diffusion Model Synthetic Data

Heting Gao · Kaizhi Qian · Junrui Ni · Chuang Gan · Mark Hasegawa-Johnson · Shiyu Chang · Yang Zhang


Abstract:

While self-supervised learning (SSL) has greatly reduced the reliance of speech processing systems on annotated corpora, its success still hinges on the availability of a large-scale unannotated corpus, which is often impractical for low-resource languages or in privacy-sensitive settings. In this paper, we investigate whether existing SSL methods underutilize the information in their pretraining data and explore ways to improve their information efficiency. Motivated by the recent success of diffusion models in capturing the rich information in data, we propose DiffS4L, a synthetic-speech SSL algorithm based on diffusion models. DiffS4L introduces a diffusion model that learns from a given small pretraining dataset and expands it into a much larger synthetic dataset with different levels of variation. The synthetic dataset is then used to pretrain SSL models. Our experiments show that DiffS4L significantly improves the performance of SSL models, for example reducing the WER of the pretrained HuBERT model by 6.26 percentage points on the English ASR task. Notably, even the nonsensical babbles generated by the diffusion model account for a significant portion of the performance improvement, indicating that diffusion models capture coherent information in speech that existing SSL methods have overlooked.
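For readers unfamiliar with the setup, the sketch below illustrates the three-stage pipeline the abstract describes: fit a diffusion model on a small real corpus, sample it to expand the corpus with different levels of variation, and pretrain an SSL model on the combined data. All function names and model internals here are hypothetical placeholders, not the paper's implementation.

```python
"""Minimal, hypothetical sketch of the DiffS4L pipeline described above.

Assumptions: train_diffusion_model, expand_dataset, and pretrain_ssl
are illustrative stand-ins, not the authors' released API, and the toy
model internals do not perform real diffusion training.
"""
from typing import Callable, List

import torch


def train_diffusion_model(
    real_corpus: List[torch.Tensor],
) -> Callable[[torch.Tensor], torch.Tensor]:
    # Placeholder: a real implementation would fit a denoising
    # diffusion model on the waveforms/features in real_corpus.
    def sample(seed: torch.Tensor) -> torch.Tensor:
        # Stand-in for reverse diffusion from a noised seed utterance.
        return seed + 0.05 * torch.randn_like(seed)

    return sample


def expand_dataset(
    sample: Callable[[torch.Tensor], torch.Tensor],
    real_corpus: List[torch.Tensor],
    factor: int = 5,
) -> List[torch.Tensor]:
    # Expand the small real corpus into a much larger synthetic one;
    # a larger noise level on the seed yields samples that deviate
    # further from real speech, up to fully "babble"-like output.
    synthetic = []
    for utt in real_corpus:
        for level in range(factor):
            noised = utt + (0.1 * level) * torch.randn_like(utt)
            synthetic.append(sample(noised))
    return synthetic


def pretrain_ssl(corpus: List[torch.Tensor]) -> None:
    # Placeholder for SSL pretraining (e.g., HuBERT-style masked
    # prediction) on the combined real + synthetic corpus.
    print(f"pretraining on {len(corpus)} utterances")


if __name__ == "__main__":
    real = [torch.randn(16000) for _ in range(8)]  # toy 1-second utterances
    sampler = train_diffusion_model(real)
    synthetic = expand_dataset(sampler, real)
    pretrain_ssl(real + synthetic)
```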
