The Montreal-based artificial intelligence startup Lyrebird today unveils its voice imitation algorithm.
Montréal, April 24, 2017 – In a world first, Montreal-based startup Lyrebird today unveiled a voice-imitation algorithm that can mimic a person's voice and read any text aloud with a given emotion, based on the analysis of just a few dozen seconds of recorded audio.
With this innovation, Lyrebird takes a step further in the development of AI applications by offering companies and developers new speech-synthesis solutions. Users will be able to generate entire dialogues in the voice of their choice, or design entirely new, unique voices tailored to their needs.
On the website lyrebird.ai, samples using the voices of Donald Trump, Barack Obama and Hillary Clinton illustrate the accuracy and effectiveness of the technology. Suited to a wide range of applications, it can be used for personal assistants, reading audiobooks in famous voices, connected devices of any kind, speech synthesis for people with disabilities, animated films, or video games.
Lyrebird relies on deep learning models developed at the MILA lab of the University of Montréal, where its three founders are currently PhD students: Alexandre de Brébisson, Jose Sotelo and Kundan Kumar. The startup is advised by three of the most prolific professors in the field: Pascal Vincent, Aaron Courville and Yoshua Bengio. The latter, director of the MILA and an AI pioneer, wants to make Montréal a world capital of artificial intelligence, and this new startup is part of that vision.
Lyrebird will offer an API to copy the voice of anyone. It will need as little as one minute of recorded audio from a speaker to compute a unique key defining her/his voice. This key will then allow users to generate any speech in the corresponding voice. The API will be robust enough to learn from noisy recordings. The following samples illustrate this feature; they are not cherry-picked.
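The flow described above (a short recording distilled into a compact voice key that later conditions synthesis) can be sketched as follows. This is a purely illustrative toy, not Lyrebird's API: the function names, the key computation, and the placeholder synthesis step are all invented for clarity; a real system would use a learned neural speaker encoder and a neural vocoder.

```python
# Hypothetical sketch of the voice-cloning flow: recording -> voice key -> speech.
# All names and logic here are illustrative assumptions, not Lyrebird's actual API.

import hashlib

def compute_voice_key(audio_samples: list, dim: int = 8) -> list:
    """Toy stand-in for a learned speaker embedding: summarize the
    recording into a fixed-size vector, regardless of its length."""
    n = max(1, len(audio_samples) // dim)
    key = []
    for i in range(dim):
        chunk = audio_samples[i * n:(i + 1) * n] or [0.0]
        key.append(sum(chunk) / len(chunk))
    return key

def generate_speech(text: str, voice_key: list) -> str:
    """Placeholder for synthesis conditioned on the voice key; it just
    returns a tag so the conditioning is visible and deterministic."""
    tag = hashlib.sha1(repr(voice_key).encode()).hexdigest()[:8]
    return f"<audio text={text!r} voice={tag}>"

# One minute of 16 kHz audio would be 960,000 samples; a short toy signal:
recording = [0.1 * ((i % 7) - 3) for i in range(1000)]
key = compute_voice_key(recording)
print(len(key))                        # fixed-size key
print(generate_speech("Hello", key))   # same key -> same voice tag
```

The point of the sketch is the separation of concerns: the key is computed once per speaker, and every subsequent generation call reuses it.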
Please note that these are artificial voices and do not convey the opinions of Donald Trump, Barack Obama and Hillary Clinton.
In this paper we propose a novel model for unconditional audio generation that produces one audio sample at a time. We show that our model, which profits from combining memory-less modules, namely autoregressive multilayer perceptrons, with stateful recurrent neural networks in a hierarchical structure, is able to capture underlying sources of variation in temporal sequences over very long time spans, on three datasets of different nature. Human evaluation of the generated samples indicates that our model is preferred over competing models. We also show how each component of the model contributes to the exhibited performance.
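The hierarchy described in the abstract can be sketched in miniature: a stateful recurrent module summarizes long-range context one frame at a time, while a memory-less autoregressive module predicts individual samples within each frame from the current hidden state and a few recent samples. The update rules, sizes, and randomness below are toy assumptions for illustration, not the paper's actual architecture or training procedure.

```python
# Toy sketch of hierarchical sample-by-sample generation:
# a stateful RNN tier over frames + a memory-less per-sample tier.
# All numbers and update rules are illustrative assumptions.

import math
import random

random.seed(0)
FRAME = 4      # samples per frame
HIDDEN = 3     # hidden-state size of the stateful tier

def rnn_step(state, frame):
    """Stateful tier: fold the previous frame into the hidden state,
    so information can persist across many frames."""
    s = sum(frame) / len(frame)
    return [math.tanh(0.5 * h + 0.5 * s) for h in state]

def sample_step(state, recent):
    """Memory-less tier: predict the next sample from the hidden state
    and the last few samples only; it keeps no state of its own."""
    return math.tanh(sum(state) + sum(recent)) + random.gauss(0, 0.01)

def generate(n_frames):
    state = [0.0] * HIDDEN
    audio = [0.0] * FRAME          # seed frame
    for _ in range(n_frames):
        state = rnn_step(state, audio[-FRAME:])   # update once per frame
        for _ in range(FRAME):                    # then emit samples one by one
            audio.append(sample_step(state, audio[-FRAME:]))
    return audio[FRAME:]           # drop the seed frame

samples = generate(5)
print(len(samples))   # n_frames * FRAME samples
```

The design choice being illustrated is the two timescales: the expensive stateful update runs once per frame, while the cheap memory-less predictor runs once per sample, which is what lets such models cover very long spans sample by sample.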