Publication

Synthesizing Audio from Tongue Motion During Speech Using Tagged MRI Via Transformer

Researchers at CMITT are investigating how machine learning can predict speech sounds from tongue movements. This work matters because it could deepen our understanding of how speech is produced and support new treatments for speech disorders.

  • Background: Speech is produced by coordinated movements of the tongue and other muscles of the vocal tract. Scientists want to understand how these movements relate to the sounds we produce.
  • Objective: This study aimed to develop a method for synthesizing speech audio from tongue motion measured with tagged MRI.
  • Methods: The researchers used a machine learning technique called “encoder-decoder translation”, in which an encoder and a decoder jointly learn the mapping between two different types of data (here, sequences of tongue motion fields and speech audio); a minimal code sketch follows this list.
  • Results: The researchers trained their model on a dataset of 63 paired motion fields and speech waveforms, and found that the synthesized audio closely matched the recorded speech.
  • Conclusion: The study’s findings suggest that encoder-decoder translation is a promising method for synthesizing speech from tongue motion, which could support new treatments for speech disorders.
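For concreteness, here is a minimal PyTorch sketch of such a framework: a two-stage encoder (a 3D CNN followed by a transformer), a 2D convolutional decoder that emits a spectrogram-like output, and a discriminator. All layer sizes and tensor shapes are illustrative assumptions, and a standard nn.TransformerEncoder stands in for the paper’s Longformer; this is not the authors’ implementation.

```python
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Two-stage encoder: a 3D CNN embeds each motion-field frame, then a
    transformer models dependencies across the time axis (the paper uses a
    Longformer; nn.TransformerEncoder is used here for simplicity)."""
    def __init__(self, in_ch=3, dim=256, heads=4, layers=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv3d(in_ch, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),              # one feature vector per frame
        )
        self.proj = nn.Linear(64, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, x):                         # x: (batch, time, ch, D, H, W)
        b, t = x.shape[:2]
        f = self.cnn(x.flatten(0, 1)).flatten(1)  # (batch*time, 64)
        f = self.proj(f).view(b, t, -1)           # (batch, time, dim)
        return self.transformer(f)

class SpectrogramDecoder(nn.Module):
    """2D convolutional decoder mapping the encoded sequence to an
    80-bin spectrogram-like image (one column per input frame)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(dim, 128, (4, 3), stride=(4, 1), padding=(0, 1)), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, (4, 3), stride=(4, 1), padding=(0, 1)), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, (5, 3), stride=(5, 1), padding=(0, 1)),
        )

    def forward(self, z):                         # z: (batch, time, dim)
        return self.net(z.permute(0, 2, 1).unsqueeze(2))  # (batch, 1, 80, time)

class Discriminator(nn.Module):
    """Convolutional discriminator judging whether a spectrogram is real."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 3, padding=1),       # patch-level real/fake logits
        )

    def forward(self, spec):
        return self.net(spec)

enc, dec, dis = MotionEncoder(), SpectrogramDecoder(), Discriminator()
motion = torch.randn(2, 10, 3, 16, 16, 16)        # toy motion-field clips
spec = dec(enc(motion))                           # (2, 1, 80, 10)
```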

Illustration of the framework for synthesizing audio waveforms from a sequence of motion fields, which consists of a two-stage encoder (with 3D CNN and Longformer), a 2D convolutional decoder (Dec), and a discriminator (Dis).
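The discriminator (Dis) in the figure drives adversarial training. Below is a hedged sketch of one training step, assuming an L1 spectrogram reconstruction term plus a least-squares GAN objective with an arbitrary 0.1 weight; the paper’s exact losses and weighting are not reproduced here.

```python
import torch
import torch.nn.functional as F

def train_step(enc, dec, dis, opt_g, opt_d, motion, real_spec):
    """One adversarial update: opt_g optimizes the encoder and decoder
    jointly; opt_d optimizes the discriminator."""
    fake_spec = dec(enc(motion))

    # Discriminator update: push real spectrograms toward 1, fakes toward 0.
    opt_d.zero_grad()
    d_real = dis(real_spec)
    d_fake = dis(fake_spec.detach())
    loss_d = (F.mse_loss(d_real, torch.ones_like(d_real))
              + F.mse_loss(d_fake, torch.zeros_like(d_fake)))
    loss_d.backward()
    opt_d.step()

    # Generator update: match the target spectrogram and fool the discriminator.
    opt_g.zero_grad()
    d_fake = dis(fake_spec)
    loss_g = (F.l1_loss(fake_spec, real_spec)
              + 0.1 * F.mse_loss(d_fake, torch.ones_like(d_fake)))  # weight assumed
    loss_g.backward()
    opt_g.step()
    return loss_g.item(), loss_d.item()
```

A spectrogram produced this way would then be inverted to an audio waveform, for example with the Griffin-Lim algorithm or a neural vocoder.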

Link to arXiv paper: https://arxiv.org/abs/2302.07203

Xiaofeng Liu, Fangxu Xing, Jerry L. Prince, Maureen Stone, Georges El Fakhri, and Jonghye Woo, “Synthesizing Audio from Tongue Motion During Speech Using Tagged MRI Via Transformer”, arXiv:2302.07203, https://arxiv.org/abs/2302.07203
Presented as a Deep Dive Oral at SPIE Medical Imaging.