d563: DeepSpeech

DeepSpeech – Project DeepSpeech is an open source Speech-To-Text engine. It uses a model trained by machine learning techniques, based on Baidu’s Deep Speech research paper. Project DeepSpeech uses Google’s TensorFlow project to make the implementation easier.


The Machine Learning Group at Mozilla is tackling speech recognition and voice synthesis as its first project. Speech is powerful. It brings a human dimension to our smartphones, computers and devices like Amazon Echo, Google Home and Apple HomePod. Speech interfaces enable hands-free operation and can assist users who are visually or physically impaired.

Mozilla's DeepSpeech GitHub link: https://github.com/mozilla/DeepSpeech


DeepSpeech documentation: http://deepspeech.readthedocs.io/en/latest/

DeepSpeech is based on a state-of-the-art speech recognition system developed using end-to-end deep learning. In the words of the abstract of Baidu's Deep Speech paper: “Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a ‘phoneme.’ Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5’00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.”
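To make the "16.0% error" figure more concrete: speech recognition results of this kind are usually reported as word error rate (WER), the number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of reference words. The text above does not name the metric, so reading it as WER is an assumption here. The function name `word_error_rate` and the example sentences are made up for illustration and are not part of DeepSpeech. A minimal sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper only (hypothetical name, not a DeepSpeech API).
    Words are split on whitespace and compared exactly, with no
    normalization of case or punctuation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One inserted word against a two-word reference.
    print(word_error_rate("hello world", "hello there world"))
```

Under this reading, a 16.0% error rate would mean roughly 16 word-level edits per 100 reference words. Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.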