Re: Faster UTF-8 decoding in GLib
- From: Behdad Esfahbod <behdad esfahbod gmail com>
- To: Daniel Elstner <daniel kitta googlemail com>
- Cc: gtk-devel-list gnome org
- Subject: Re: Faster UTF-8 decoding in GLib
- Date: Fri, 26 Mar 2010 13:26:45 -0400
Final note: Please file separate bugs for any individual optimization you
think is worth performing (or is an obvious improvement).
Thanks,
behdad
On 03/26/2010 01:25 PM, Behdad Esfahbod wrote:
> Sorry for replying so late. I saw a few replies implying that the developer
> time to implement a (to me, unmeasurably) useful feature has been spent
> already so I should go ahead and commit it. There are various flaws with that
> argument:
>
> - It ignores the fact that writing a patch is only a small part of the time
> spent on a change. Maintainer review time and future maintenance make up the
> rest. If you think I should commit without spending significant time
> on it, well, there's a reason you're not the maintainer :P. In short, it's
> the maintainer that is taking the risk, not you or the patch author. Guess
> why I'm replying this late? Because reading 18 messages and 20 patches takes
> time. Time I could instead spend fixing a bug that at least has a measurable
> impact.
>
> - It also assumes that the patch is ready, and useful. The original patch
> series had various flaws. A few I list:
>
> * Introduced 256 new relocations!
>
> * Inlined a public function, but just to make an indirect function call
> instead. What's the point of inlining then?!
>
> * Had unknown impacts on systems with higher function call overhead.
>
> * Was not tested in real-life situations. Perf tests are not realistic.
> Calling g_utf8_next_char a million times in a loop is nothing like real-life.
> In real life strings that are processed are really short. Memory cache
> effects make any micro-optimization you make look like noise.
>
> * Changed the semantics of the glib UTF-8 functions. Dealing with UTF-8
> coming from the outside world is a very sensitive matter security-wise.
> There's also backward compatibility. Can't just decide to return a different
> value from now on.
>
> * The construct borrowed from glibmm, as beautiful as it is, is WRONG for
> 6-byte-long UTF-8. It just doesn't work. We historically support those
> sequences.
>
>
> That said, I'm not being unfair to anyone here. I personally am a UTF-8
> micro-optimizing geek myself. See for example this blogpost:
>
> http://mces.blogspot.com/2008/04/utf-8-bit-manipulation.html
>
> So I'm not even willing to commit my own optimization to that code without
> seeing real-world numbers first.
>
> Another idea, now that people are measuring: What about this:
>
> static const int utf8_mask_data[7] = {
>   0, 0x7f, 0x1f, 0x0f, 0x07, 0x03, 0x01
> };
>
> #define UTF8_COMPUTE(Char, Mask, Len) \
>   G_STMT_BEGIN { \
>     Len = utf8_skip_data[(guchar)(Char)]; \
>     Mask = utf8_mask_data[Len]; \
>     if (G_UNLIKELY ((guchar)(Char) >= 0xfe)) \
>       Len = -1; \
>   } G_STMT_END
>
>
>
> behdad
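
For readers who want to try the table-driven idea outside of GLib, here is a
minimal standalone sketch of how such a macro might be used. It is not the
GLib code: the skip table below is an assumption modeled on the usual layout
(length by lead byte, up to the historical 6-byte forms), and the names
utf8_compute() and decode_one() are hypothetical helpers invented for this
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed skip table: sequence length indexed by lead byte.
 * 0x00-0xbf -> 1, 0xc0-0xdf -> 2, 0xe0-0xef -> 3, 0xf0-0xf7 -> 4,
 * 0xf8-0xfb -> 5, 0xfc-0xfd -> 6, 0xfe-0xff -> 1. */
static const unsigned char utf8_skip_data[256] = {
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
  3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3, 4,4,4,4,4,4,4,4,5,5,5,5,6,6,1,1
};

/* Payload-bit mask of the lead byte, indexed by sequence length. */
static const int utf8_mask_data[7] = {
  0, 0x7f, 0x1f, 0x0f, 0x07, 0x03, 0x01
};

/* Function form of the proposed macro (hypothetical name). */
static void utf8_compute(unsigned char c, int *mask, int *len)
{
  *len = utf8_skip_data[c];
  *mask = utf8_mask_data[*len];
  if (c >= 0xfe)            /* 0xfe and 0xff are never valid lead bytes */
    *len = -1;
}

/* Decode one sequence starting at p; returns the code point, or -1 on a
 * bad lead byte or a missing continuation byte. With the table assumed
 * above, a stray continuation byte (0x80-0xbf) in lead position is NOT
 * rejected here: it gets length 1 and mask 0x7f. A caller that needs
 * validation would have to check for that case separately. */
static long decode_one(const unsigned char *p)
{
  int mask, len, i;
  long wc;

  assert(p != NULL);
  utf8_compute(p[0], &mask, &len);
  if (len < 0)
    return -1;

  wc = p[0] & mask;
  for (i = 1; i < len; i++) {
    if ((p[i] & 0xc0) != 0x80)    /* must look like 10xxxxxx */
      return -1;
    wc = (wc << 6) | (p[i] & 0x3f);
  }
  return wc;
}
```

Calling decode_one() on the bytes E2 82 AC yields 0x20AC, and a 6-byte
sequence starting with 0xFC goes through the same loop with mask 0x01. This
says nothing about speed; as noted above, that would need real-world numbers.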