Re: [orca-list] Should Orca have a separate "reading" voice? (was: audio demo of Neospeech and Svox-Pico with Orca/speakup)



Joanmarie Diggs <joanmarie diggs gmail com> wrote:
 
> Willem, thanks for doing this! Ignoring the issues you found, the
> Neospeech voice does sound awfully nice.

They're probably using concatenative synthesis, i.e., techniques which combine
pre-recorded speech segments such as diphones, triphones, etc., then "smooth"
the result. This approach requires large databases, which is why such
synthesizers tend to consume substantial memory, disk space, or both.
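
To make the combine-then-smooth idea concrete, here is a toy sketch in Python. It is only an illustration under simplifying assumptions: the segments are plain lists of samples, the "smoothing" is a linear crossfade at each join, and the function name crossfade_concat is made up for this example. Real synthesizers select units from a large database and smooth the joins with far more sophisticated signal processing.

```python
def crossfade_concat(units, overlap):
    """Join waveform segments, linearly crossfading `overlap` samples at each seam.

    `units` is a list of segments, each a list of float samples
    (a stand-in for pre-recorded diphones or triphones).
    """
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        # Never overlap more samples than either side actually has.
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming segment, rising across the seam
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        # Append whatever part of the new segment was not blended into the seam.
        out.extend(unit[n:])
    return out


# Hypothetical usage: two tiny "segments" joined with a two-sample crossfade.
joined = crossfade_concat([[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]], overlap=2)
```

The memory cost mentioned above comes from the segment database itself, not from the joining step, which is cheap.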

In contrast, SVOX Pico uses a relatively new technology in speech synthesis,
based on hidden Markov models, where the voice is entirely synthetic rather
than pre-recorded, but the synthesizer parameters are obtained from a
statistical model trained on real speech data. Machine learning techniques are
used in the textual analysis phase as well. As a result, it is very small in
terms of memory use and hence suitable for embedded devices such as mobile
phones, as well as installation media and almost any other context in which
one would want a small, efficient synthesizer with quality output. There are
limitations, of course, and it has its share of problems, but technologically
it's the result of very serious research by specialists in signal processing
and computational linguistics, and a significant contribution to free and
open-source software.

It would be nice if someone would fix the segfault under x86-64 that manifests
itself in the internal memory allocator, though.

Disclaimer: I am absolutely not qualified to discuss speech synthesis in
depth, an undergraduate course in phonetics notwithstanding.

> This raises a question in my mind: If we're potentially going to have
> access to speech synthesizers which are more human-sounding but perhaps
> less performant, should Orca have a separate reading voice or SayAll
> voice or some such thing? In other words, when you're typing, navigating
> in menus and dialogs, etc., Orca would use one voice. When you're
> reading text (and/or doing a SayAll), Orca would use another voice.

There may be synthesizers that require this. SVOX Pico shouldn't be one of
them, since it is designed to remain highly responsive in embedded
environments where CPU and memory resources are highly constrained.

Someone ought to write an OpenTTS module for it.

I don't know what the performance characteristics of concatenative
synthesizers are; they certainly take up more memory than eSpeak, SVOX Pico
and other small synthesizers. Bell Labs' TTS contains hundreds of megabytes of
speech data just for English, for example, which is one of the reasons for its
high-quality speech: there are many pre-recorded segments for the selection
algorithm to choose from.

Aside from performance, users may, however, want different synthesizers under
different circumstances, so I'm not suggesting this is a bad idea at all.
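
For what it's worth, the per-context voice idea could be as simple as a lookup with a fallback. The sketch below is purely hypothetical: SpeechContext, VOICE_BY_CONTEXT, pick_voice and the voice names are invented for illustration and are not part of Orca's actual settings or speech code.

```python
from enum import Enum


class SpeechContext(Enum):
    """Hypothetical situations in which the screen reader speaks."""
    TYPING = "typing"
    NAVIGATION = "navigation"
    READING = "reading"
    SAY_ALL = "say_all"


# Assumed configuration: a fast, responsive voice by default, and a more
# natural-sounding (possibly slower) voice only for continuous reading.
DEFAULT_VOICE = "responsive-voice"
VOICE_BY_CONTEXT = {
    SpeechContext.READING: "natural-voice",
    SpeechContext.SAY_ALL: "natural-voice",
}


def pick_voice(context, overrides=VOICE_BY_CONTEXT, default=DEFAULT_VOICE):
    """Return the voice configured for `context`, or the default voice."""
    return overrides.get(context, default)
```

With this layout, users who want a single voice everywhere would just leave the override table empty.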



