Re: [orca-list] Punctuation, capital letters, exchange of characters and strings, generally error in the design of Orca



Thanks Milan.

So...I have a dilemma. The imperfect gnome-speech-based solution in Orca exists and generally works. The emergence of PulseAudio is also helping to address one of the major issues (audio device contention). While not perfect, the current solution provides emulation for missing TTS features in a very expedient and controllable way, giving users what they want *today*. We can also quickly make adjustments to gnome-speech to provide support for features enabled by the speech engine (e.g., verbalized punctuation, capitalization, etc.), and we can quickly adjust Orca to pass things on to the speech engine rather than emulate them at the Orca layer. In addition, all of this is encapsulated in GNOME, making it easy to manage from the release and packaging standpoints.

What I'm getting from Brailcom is a proposed solution that, when implemented, seems like it could address a number of problems. It will eliminate the need for Orca to do emulation of missing features. It will provide features that are on the Orca requirements list, but which are not currently implemented (e.g., verbalized capitalization, audio icons, etc.). It will also act as a system service that many apps can use, which will run on a large number of platforms, and which does not require a desktop to be running.

That's great. As a result of this promise, I permitted the Speech Dispatcher code into Orca as a means to provide a proving ground. It is still interesting to me, but it does not come without issues: it is incomplete, it is not an accepted dependency for GNOME, there is a dogmatic pursuit of purism, etc.

What I didn't expect was inflexible opposition from Brailcom to the practical solutions provided by Orca, such as the notion of the user specifying pronunciation definitions at a higher level. Until the unsophisticated user has a convenient mechanism for doing things such as tweaking pronunciations, Orca is going to provide a means to do this. Until verbalized punctuation is guaranteed to be supported by the lower layers, Orca is going to provide a means to emulate this. As the Orca project lead, this is my decision, and it is based upon user requirements. I hear Brailcom loud and clear - you don't like this. Please, let's agree to disagree and let's focus on SpeechDispatcher.
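
For those unfamiliar with what this emulation amounts to in practice, it is essentially a substitution pass applied before text is handed to the engine. A minimal sketch, with hypothetical names (this is not Orca's actual code):

    import re

    # Hypothetical user-specified pronunciation definitions, e.g. as
    # entered in a settings dialog.
    pronunciations = {
        "ASAP": "as soon as possible",
        "GNOME": "guh nome",
    }

    def apply_pronunciations(text):
        """Replace whole-word matches with their user-defined spoken form."""
        for written, spoken in pronunciations.items():
            text = re.sub(r'\b%s\b' % re.escape(written), spoken, text)
        return text

    # The substituted text is then handed to the speech layer as usual, e.g.:
    # speech.speak(apply_pronunciations(line))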

Until it is complete, stable, and we're sure it helps us meet the user requirements, I cannot make SpeechDispatcher a supported part of Orca. We have at least gotten to the point where we've identified the API that will be exposed to Orca, which is the speechd Python bindings. With the exception of a few things, it seems like a viable API, though I need to dig into it a little deeper.
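
For the curious, my current understanding of the basic shape of those bindings is roughly the following; the exact method names and constants should be double-checked against the speechd module itself:

    import speechd

    client = speechd.SSIPClient('orca')        # connect and identify the client
    client.set_output_module('espeak')         # choose a synthesizer, if desired
    client.set_punctuation(speechd.PunctuationMode.SOME)
    client.set_rate(50)                        # SSIP rate range is -100..100
    client.speak('Hello from Orca')            # queue text for speaking
    client.close()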

Assuming the API is workable as is, do you have an estimate for the amount of work (cost and timeframe) needed to complete the implementation and provide complete support for at least eSpeak, Festival, Cepstral, DECtalk, and IBMTTS? What is your support model and release schedule going to be once the implementation is done? What is your community model going to be (e.g., can others outside Brailcom contribute patches/enhancements to SpeechDispatcher)?

Will

Milan Zamazal wrote:
"WW" == Willie Walker <William Walker Sun COM> writes:

    WW> One of the questions I have right now is the ability for a
    WW> client to programmatically configure various things in
    WW> SpeechDispatcher, such as pronunciations for words.  In looking
    WW> at the existing API, I'm not sure I see a way to do this.  Nor
    WW> am I sure if this is something that a speech dispatcher user
    WW> needs to do on an engine-by-engine basis or if there is a
    WW> pronunciation dictionary that speech dispatcher provides for all
    WW> output modules to use.

SSIP supports SSML, so in theory it is possible to pass pronunciation
etc. through it.  In practice, SSML is probably only marginally
supported, if at all, in most TTS systems, so it wouldn't work.  But
that's not a fault of Speech Dispatcher; it's a missing feature
elsewhere -- preferably it should be present in the TTS systems, or at
least in some frontend to them.  I think the new Speech Dispatcher TTS
driver library should provide a means of parsing SSML, and the drivers
should handle it in some way when the corresponding TTS system can't.
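
To illustrate, passing a special pronunciation through SSIP as SSML might look roughly like this on the client side (I am assuming the Python bindings expose an SSML mode switch; the exact call may differ or not exist yet):

    import speechd

    client = speechd.SSIPClient('ssml-demo')
    # Assumption: a data-mode switch telling Speech Dispatcher the text is SSML.
    client.set_data_mode(speechd.DataMode.SSML)
    # Standard SSML <sub> markup supplying a spoken form for an abbreviation;
    # the TTS driver would interpret or strip the markup if the engine cannot.
    client.speak('<speak>Meeting at <sub alias="headquarters">HQ</sub> today.</speak>')
    client.close()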

Just one remark on pronunciation, as a typical representative of some
of these problems: it is important to distinguish between special
pronunciation and regular pronunciation.  In the first case, e.g. when
a word should be pronounced in a non-standard way for some reason, it
is completely valid to pass pronunciation information from the client
to the engine.  But in the latter case, e.g. when an engine
mispronounces some words, the client should in no way attempt to "fix"
it; this would only make the situation worse.  The proper solution is
to fix the pronunciation in the engine.  TTS drivers may attempt to
work around it when fixing the engine is not possible, but that should
be considered an extreme approach, applied only when really nothing
else works.  As for common pronunciation dictionaries, I doubt they
can be handled at a common level, because different synthesizers use
different phoneme sets and representations.

On the other hand, it seems reasonable to handle some other features,
such as signalling capitalization, punctuation, sound icons, etc., on
a common basis in the TTS drivers.  But beware: this may require
language-dependent text analysis and may interfere with the TTS
processing of some synthesizers.  So it shouldn't be applied
universally, and each of the TTS drivers must have a free choice in
how to handle such things -- whether to leave it to the synthesizer or
(when the synthesizer is unable to handle the requirements) to use the
TTS driver's own means.  When one thinks about it more, it becomes
clear that it would be very useful to have a single common text
analysis frontend for free speech synthesizers, so that the individual
synthesizers start their own work only after the phonetic
transcription of the input is available.  But that is another issue.
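
To make the idea concrete, a driver-level fallback for capital letters might look something like the following sketch (all names here are hypothetical, and real output modules are not written like this; it only shows where the decision is made):

    class ExampleDriver:
        """Hypothetical TTS driver that emulates capital-letter signalling
        only when the underlying synthesizer cannot do it itself."""

        def __init__(self, engine, cap_mode='icon'):
            self.engine = engine        # wrapper around some real synthesizer
            self.cap_mode = cap_mode    # 'none', 'spell' or 'icon'

        def say_character(self, ch):
            if ch.isupper() and not self.engine.supports('capital_signalling'):
                if self.cap_mode == 'icon':
                    self.engine.play_sound('capital')   # audio icon fallback
                elif self.cap_mode == 'spell':
                    self.engine.speak('capital')        # spoken prefix fallback
            self.engine.speak(ch)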

[...]

    WW> I wasn't sure how to interpret "No", but my interpretation was
    WW> that emulation was NOT done, and this seems to match my
    WW> interpretation of "Right" above.  But, maybe "No" meant
    WW> something like "No, speech dispatcher itself doesn't do
    WW> emulation, but that can be done at a lower layer in the speech
    WW> dispatcher internals."  If that's the case, from the client's
    WW> point of view, it's still speech dispatcher, and the client can
    WW> now depend upon speech dispatcher to do the emulation.

Yes, I think there is some terminology confusion here.  The new
Speech Dispatcher contains a TTS API and drivers as part of it, while
the current implementation is focused basically just on message
dispatching.  I'd suggest naming the parts explicitly in the
discussion (dispatching, interface, output modules, TTS API, TTS
drivers, configuration) to avoid confusion.

In my opinion, it's basically as you write above.  Neither clients
nor any of the Speech Dispatcher parts, with the exception of the TTS
drivers, should care about emulation of missing TTS features.  They
should perform their own jobs and rely on the TTS systems and their
TTS drivers to ensure proper speech output.  The presence of a common
TTS API should guarantee that the emulation work is done only once, in
a single place behind the TTS API, i.e. in the speech synthesizers
(preferably) or in the TTS drivers (when doing it in the TTS system is
not possible).  The possible creation of the common TTS processing
frontend to speech synthesizers mentioned above comes into play here,
but considering the current state of things, it would be premature to
get too distracted by that idea.

    WW> Let me try to rephrase this question: from Orca's point of view,
    WW> if text is handed off to speech dispatcher via speechd, will we
                                                       ^^^^^^^
                                                       SSIP?
    WW> be guaranteed that the appropriate emulation will be provided
    WW> for features that are not supported by a speech engine?  For
    WW> example, if an audio cue is desired for capital letters, will
    WW> the Orca user be guaranteed that something in Speech Dispatcher
    WW> will play an audio icon for capitalization if the engine doesn't
    WW> support this directly?  Or, if verbalized punctuation is not
    WW> supported by the engine, will the Orca user be guaranteed that
    WW> something in Speech Dispatcher will emulate the support if the
    WW> engine does not support this directly?

My simple answer is Yes (the detailed answer is above).
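
In other words, from the client side the intended usage reduces to requesting the behaviour and relying on the layers below the TTS API to provide or emulate it, roughly like this (method names as I understand the speechd bindings; please verify against the actual module):

    import speechd

    client = speechd.SSIPClient('orca')
    client.set_cap_let_recogn('icon')                    # audio cue for capitals
    client.set_punctuation(speechd.PunctuationMode.ALL)  # verbalized punctuation
    client.speak('Read THIS, please!')   # any needed emulation happens below the TTS API
    client.close()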

I'm not sure I'd agree with everyone here on particular details, but I
hope the basic ideas and explanations outlined above might be acceptable
to all members of the Speech Dispatcher team, as well as to Orca and
other client development teams.

Thanks for your questions, which are helping to clarify things!

Regards,

Milan Zamazal