A New Model for More Natural Synthesized Speech

Researchers from the University of Bremen and SUPSI have introduced Diff-ETS, an innovative electromyography-to-speech (ETS) conversion model aimed at producing more natural synthesized speech. This advancement could significantly benefit individuals unable to speak, such as those recovering from laryngectomy surgeries.

Traditional ETS models typically consist of two main components: an EMG encoder that translates electrical signals from the articulatory muscles into speech features, and a vocoder that converts these features into audible speech. However, the synthesized speech often lacks naturalness due to limited training data and noise in the recorded EMG signals.

The Diff-ETS model introduces a third component: a score-based diffusion probabilistic model. This new addition enhances the acoustic features predicted by the EMG encoder, leading to improved speech quality.

In their approach, the researchers trained the EMG encoder to predict log Mel spectrograms (compact time-frequency representations of audio) and phoneme targets from EMG signals. The diffusion model was then trained to refine these spectrograms, and a pre-trained vocoder synthesized the final speech output.
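The pipeline above can be sketched as a toy example. Everything here is a stand-in: the shapes, the random projection "encoder," and the smoothing-based "refinement" loop are illustrative assumptions, not the paper's actual networks, which are learned models. The sketch only shows how coarse spectrogram features from an encoder would pass through an iterative denoising stage before vocoding.

```python
import numpy as np

# Hypothetical dimensions (common choices, not taken from the paper).
N_MELS = 80     # log Mel spectrogram bins
T_FRAMES = 100  # number of spectrogram frames

def emg_encoder(emg: np.ndarray) -> np.ndarray:
    """Stand-in for the EMG encoder: maps an EMG recording
    (channels x samples) to a coarse log Mel spectrogram.
    A real encoder is a trained neural network."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((emg.shape[0], N_MELS))
    # Average the signal into frames, then project each frame
    # from the channel dimension to the Mel dimension.
    usable = T_FRAMES * (emg.shape[1] // T_FRAMES)
    frames = emg[:, :usable].reshape(emg.shape[0], T_FRAMES, -1).mean(axis=2)
    return frames.T @ proj  # shape: (T_FRAMES, N_MELS)

def diffusion_refine(mel: np.ndarray, n_steps: int = 10) -> np.ndarray:
    """Toy iterative refinement. In a score-based diffusion model the
    update direction comes from a learned score network; here a fixed
    local-smoothing step merely illustrates the denoising loop."""
    x = mel + 0.1 * np.random.default_rng(1).standard_normal(mel.shape)
    for _ in range(n_steps):
        # Pull each frame toward a locally smoothed estimate,
        # mimicking one denoising step.
        smoothed = (np.roll(x, 1, axis=0) + x + np.roll(x, -1, axis=0)) / 3
        x = x + 0.5 * (smoothed - x)
    return x

# Simulated 8-channel EMG recording; a vocoder (not sketched here)
# would turn the refined spectrogram into a waveform.
emg = np.random.default_rng(42).standard_normal((8, 2000))
coarse = emg_encoder(emg)
refined = diffusion_refine(coarse)
print(coarse.shape, refined.shape)  # both (100, 80)
```

The design point the sketch captures is that refinement operates purely in the spectrogram domain, so the encoder and vocoder can be trained or swapped independently of the diffusion stage.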

In experiments, the Diff-ETS model outperformed traditional ETS techniques, resulting in more human-like speech. The researchers conducted objective assessments and listening tests, confirming that Diff-ETS significantly improved speech naturalness.

Looking ahead, the team envisions further enhancements, including model compression and real-time speech generation capabilities. They also aim to train the diffusion model alongside the encoder and vocoder to further elevate speech quality.

This breakthrough could pave the way for more effective communication technologies for those with speech impairments, enabling them to express their thoughts and engage with others more easily.
