Re: Concerning Keyboard Status Menu



On 24/11/12 06:40 PM, Debarshi Ray wrote:
The thing about Chinese input methods is that few of them do a good job.
Chinese writing styles, dialects, and modern idioms all *vary*.
Even the big commercial input methods fail to do a good job on
every aspect mentioned above. That's why you see several commercial
input methods installed even on a single-user desktop. This is why input
methods tend to be inconsistent.

The default pinyin input method GNOME whitelisted is ibus-pinyin. It's a
very basic input engine that does a relatively poor job on almost every
aspect I mentioned above. And I don't mean to offend those developers;
Sunpinyin is no better.

Developing a Chinese IME is *extremely* hard, and it has commercial
barriers. Big search engine companies have much more complete training
datasets than any open source organization; commercial dictionaries from
Chinese internet media companies cover every aspect of Chinese
culture: ancient poetry, modern words, idioms... Companies like Microsoft
and Google have much more sophisticated machine learning research
groups than any open source organization...

The question is, if it is so hard to develop a Chinese IME, then why not
join together to improve it instead of having lots of half-finished ones?
If we are so low on resources then we should try to avoid fragmentation,
shouldn't we?

Good question! As someone who has been a native Chinese speaker for 20 years, I would say that's impossible. This has never happened in the commercial input method world, and it is never going to happen in the open source world either.

The situation of the Chinese language, and of its input methods, is extremely complex. The workload of a complete universal input engine would be incredibly huge!

First, no one really knows how to speak "Chinese". There are too many dialects. For instance, my girlfriend is from Zhejiang, and there you find a new dialect every 10 km. Yes, these are distinct dialects; people who speak different dialects *cannot* understand each other. Some of these dialects have their own characters, say Cantonese, and some cannot even be fully expressed in Han characters. (That's why the Han character standard has been extended several times.)

Second, the ways of inputting Chinese are very different. Pinyin is one; it basically encodes the way Chinese is read. Besides Pinyin, even I (who always failed my Chinese exams) know of at least Wubi, Shuangpin, Erbi, and Zhengma. Each of them is complex enough to need an individual engine.

Third, just take Mandarin Pinyin as an example: because Han characters are not letter based, the problem of an input method is basically the same as speech recognition. Several sub-problems of this topic are wide open, for instance natural language segmentation, dictionary mining, context inference... These problems are so open that no engine developer is sure which way is the best way. In fact, we all encourage each other to try new approaches, because the current UX of open source input methods is still way behind the commercial ones that we use on Windows.
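
To make the segmentation problem concrete, here is a minimal sketch (not taken from any of the engines discussed in this thread) of why even splitting raw pinyin keystrokes into syllables is ambiguous. The syllable set is a tiny, hypothetical subset chosen only for illustration; a real engine would use the full pinyin syllable table and then still has to rank the results.

```python
# Tiny, hypothetical syllable inventory for illustration only.
# A real engine would load the complete pinyin syllable table.
SYLLABLES = {"xi", "an", "xian", "fan", "fang", "gan"}


def segmentations(s, syllables=SYLLABLES):
    """Return every way to split the string s into known syllables."""
    if not s:
        return [[]]  # one way to split the empty string: no syllables
    results = []
    for i in range(1, len(s) + 1):
        head = s[:i]
        if head in syllables:
            # Keep this prefix and recursively split whatever is left.
            for rest in segmentations(s[i:], syllables):
                results.append([head] + rest)
    return results


print(segmentations("xian"))    # one syllable, or "xi" + "an"
print(segmentations("fangan"))  # "fan" + "gan", or "fang" + "an"
```

Each split leads to a different set of candidate characters, so the engine already needs context just to decide which syllables the user typed.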

Fourth, patent issues. As I mentioned in the first email, patents discourage open source input methods from using commercial dictionaries, because these dictionaries are either collected manually or mined with sophisticated machine learning techniques from massive datasets that we don't have.



As a result, there is no "universal" input engine for Chinese. But each of the engines has its own strengths. Take Mandarin Pinyin as an example:

* ibus-pinyin tends to be simple to hack, but provides poor UX since it does not consider language context. It's under the GPL license.

* sunpinyin is more sophisticated; it uses 3-grams to overcome the non-Markov property of Chinese. But the dictionaries and the datasets are still a problem. And the LGPL license and its origins at Sun Microsystems scared a lot of package maintainers away.

* libpinyin is considered the successor of sunpinyin, but it is under heavy development. It's still considered unstable now.

* rime sounds different; they seem to target people who really appreciate the beauty of ancient Chinese. (Correct me if I'm wrong, of course.)
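
To illustrate the difference between ignoring context and using 3-grams, here is a rough sketch. It is not how ibus-pinyin or sunpinyin are actually implemented; the counts are made up, and the function names are hypothetical. The syllable "ta" can be 他 (he), 她 (she), or 它 (it), and only the preceding words tell you which one is meant.

```python
from collections import Counter

# Made-up counts for illustration only; not from any real corpus or engine.
UNIGRAM = Counter({"他": 50, "她": 30, "它": 20})
TRIGRAM = Counter({
    ("我", "妹妹", "她"): 8,  # "my younger sister, she"
    ("我", "妹妹", "他"): 1,
    ("这", "只猫", "它"): 6,  # "this cat, it"
})


def pick_without_context(candidates):
    """Always pick the most frequent character, whatever came before."""
    return max(candidates, key=lambda c: UNIGRAM[c])


def pick_with_trigram(prev2, prev1, candidates):
    """Prefer the candidate seen most often after the two previous words.

    Falls back to the unigram count as a tie-breaker, which also covers
    contexts that never appear in the trigram table.
    """
    def score(c):
        return (TRIGRAM[(prev2, prev1, c)], UNIGRAM[c])
    return max(candidates, key=score)


candidates = ["他", "她", "它"]
print(pick_without_context(candidates))
print(pick_with_trigram("我", "妹妹", candidates))
print(pick_with_trigram("这", "只猫", candidates))
```

The hard part is not this lookup but getting counts that are any good, which is exactly where the dictionaries and datasets come in.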

As I said, each approach is a complete approach. They're *not fragmented*. We're not sure which one is the good idea; we're still trying to see which one is better. It feels pretty much like research: we all know every current approach sucks, and we're exploring different ways to make it better. If you focus on just one of them, we lose the opportunity to find a better one.


Cheers,
Debarshi


--

Thanks
Mike

