Re: Faster UTF-8 decoding in GLib
Final note: Please file separate bugs for any individual optimization you
think is worth performing (or is an obvious improvement).

Thanks,
behdad

On 03/26/2010 01:25 PM, Behdad Esfahbod wrote:
> Sorry for replying so late.  I saw a few replies implying that, since the
> developer time to implement a (to me, unmeasurably) useful feature has
> already been spent, I should just go ahead and commit it.  There are
> various flaws in that argument:
> 
>   - It ignores the fact that writing a patch is a small part of the time
> spent on a change, leaving out maintainer review time as well as future
> maintenance.  If you think I should commit without spending significant
> time on it, well, there's a reason you're not the maintainer :P.  In short,
> it's the maintainer who is taking the risk, not you or the patch author.
> Guess why I'm replying this late?  Because reading 18 messages and 20
> patches takes time, time I could have spent fixing a bug that has a
> measurable impact, at least.
> 
>   - It also assumes that the patch is ready and useful.  The original
> patch series had various flaws.  To list a few:
> 
>     * Introduced 256 new relocations!  (A 256-entry table of pointers
> means one load-time relocation per entry and startup-dirtied, unshareable
> pages; see the first sketch after this list.)
> 
>     * Inlined a public function, but just to make an indirect function call
> instead.  What's the point of inlining then?!
> 
>     * Had unknown impacts on systems with higher function call overhead.
> 
>     * Was not tested in real-life situations.  Synthetic perf tests are
> not realistic: calling g_utf8_next_char a million times in a loop is
> nothing like real life.  In real life, the strings being processed are
> really short, and memory-cache effects make any micro-optimization look
> like noise.
> 
>     * Changed the semantics of the glib UTF-8 functions.  Dealing with
> UTF-8 coming from the outside world is a very sensitive matter
> security-wise.  There's backward compatibility to consider too; we can't
> just decide to return a different value from now on.
> 
>     * The construct borrowed from glibmm, as beautiful as it is, is WRONG
> for 6-byte-long UTF-8.  It just doesn't work, and we historically support
> those sequences (see the second sketch after this list).
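> 
> To illustrate the relocations point, here's a sketch of the general
> pattern (not the actual patch): in a shared library built with -fPIC, a
> table of pointers has to be patched by the dynamic linker at load time,
> one relocation per address-holding entry, while a table of plain bytes
> is mapped read-only and shared as-is:
> 
>   /* reloc-demo.c -- build and inspect:
>    *   gcc -O2 -fPIC -shared reloc-demo.c -o reloc-demo.so
>    *   readelf -r reloc-demo.so
>    * Each initialized entry of the pointer table shows up as an
>    * R_*_RELATIVE relocation; the byte table produces none. */
> 
>   static const char name_ascii[] = "ascii";
>   static const char name_multi[] = "multibyte";
> 
>   /* pointers: one load-time relocation per initialized entry, and the
>    * pages holding the table get written (hence unshared) at startup */
>   const char * const utf8_lead_names[256] = {
>     [0x41] = name_ascii, [0xc3] = name_multi /* ... */
>   };
> 
>   /* plain bytes: pure .rodata, no relocations, shared across processes */
>   const unsigned char utf8_lead_len[256] = {
>     [0x41] = 1, [0xc3] = 2 /* ... */
>   };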
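> 
> And to make the 6-byte point concrete, here's a classifier sketch (NOT
> the glibmm construct, just the shapes any such trick has to get right)
> covering the historical RFC 2279 forms that glib accepts:
> 
>   /* Leading-byte classification for the 1..6-byte UTF-8 forms of
>    * RFC 2279; RFC 3629 later restricted UTF-8 to four bytes. */
>   static int
>   utf8_len_and_mask (unsigned char c, unsigned char *mask)
>   {
>     if (c < 0x80)           { *mask = 0x7f; return 1; }  /* 0xxxxxxx */
>     if ((c & 0xe0) == 0xc0) { *mask = 0x1f; return 2; }  /* 110xxxxx */
>     if ((c & 0xf0) == 0xe0) { *mask = 0x0f; return 3; }  /* 1110xxxx */
>     if ((c & 0xf8) == 0xf0) { *mask = 0x07; return 4; }  /* 11110xxx */
>     if ((c & 0xfc) == 0xf8) { *mask = 0x03; return 5; }  /* 111110xx */
>     if ((c & 0xfe) == 0xfc) { *mask = 0x01; return 6; }  /* 1111110x */
>     return -1;  /* continuation byte or 0xfe/0xff: not a valid lead */
>   }
> 
> A trick whose arithmetic only holds for the four RFC 3629 lengths
> silently miscomputes the 5- and 6-byte cases above.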
> 
> 
> That said, I'm not being unfair to anyone here.  I personally am a UTF-8
> micro-optimizing geek myself.  See for example this blog post:
> 
>   http://mces.blogspot.com/2008/04/utf-8-bit-manipulation.html
> 
> So I'm not even willing to commit my own optimization to that code without
> seeing real-world numbers first.
> 
> Another idea, now that people are measuring: What about this:
> 
> static const int utf8_mask_data[7] = {
>   0, 0x7f, 0x1f, 0x0f, 0x07, 0x03, 0x01
> };
> 
> #define UTF8_COMPUTE(Char, Mask, Len) \
>   G_STMT_BEGIN { \
>     Len = utf8_skip_data[(guchar)(Char)]; \
>     Mask = utf8_mask_data[Len]; \
>     if (G_UNLIKELY ((guchar)(Char) >= 0xfe)) \
>       Len = -1; \
>   } G_STMT_END
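> 
> For anyone who wants to poke at that outside glib, a minimal harness;
> the stand-ins below are assumptions for the sketch (utf8_skip_data
> mimics the table of the same name in gutf8.c), and it needs the
> utf8_mask_data table and UTF8_COMPUTE macro above pasted in:
> 
>   /* build: gcc utf8-demo.c  (range initializers are a GNU extension) */
>   #include <stdio.h>
> 
>   /* minimal stand-ins so this compiles without glib */
>   typedef unsigned char guchar;
>   #define G_STMT_BEGIN do
>   #define G_STMT_END   while (0)
>   #define G_UNLIKELY(expr) (expr)
> 
>   /* stand-in mirroring gutf8.c's utf8_skip_data */
>   static const char utf8_skip_data[256] = {
>     [0x00 ... 0xbf] = 1, [0xc0 ... 0xdf] = 2, [0xe0 ... 0xef] = 3,
>     [0xf0 ... 0xf7] = 4, [0xf8 ... 0xfb] = 5, [0xfc ... 0xfd] = 6,
>     [0xfe ... 0xff] = 1
>   };
> 
>   /* ... utf8_mask_data and UTF8_COMPUTE exactly as above ... */
> 
>   int
>   main (void)
>   {
>     static const guchar leads[] = { 0x41, 0xc3, 0xe2, 0xf0, 0xfc, 0xfe };
>     unsigned int i;
>     for (i = 0; i < sizeof leads; i++)
>       {
>         int mask, len;
>         UTF8_COMPUTE (leads[i], mask, len);
>         printf ("lead 0x%02x -> len %d, mask 0x%02x\n",
>                 leads[i], len, mask);
>       }
>     return 0;
>   }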
> 
> 
> 
> behdad

