Re: Faster UTF-8 decoding in GLib



Hi,

2010/3/26 Behdad Esfahbod <behdad behdad org>:
> Another idea, now that people are measuring: What about this:
>
> static const int utf8_mask_data[7] = {
>  0, 0x7f, 0x1f, 0x0f, 0x07, 0x03, 0x01
> };
>
> #define UTF8_COMPUTE(Char, Mask, Len) \
>  G_STMT_BEGIN { \
>    Len = utf8_skip_data[(guchar)(Char)]; \
>    Mask = utf8_mask_data[Len]; \
>    if (G_UNLIKELY ((guchar)(Char) >= 0xfe)) \
>      Len = -1; \
>  } G_STMT_END

I have tried this and, contrary to my expectations as well, the result
on Core 2 was worse than mainline.

There are now two more changes on this branch:
http://git.collabora.co.uk/?p=user/zabaluev/glib.git;a=shortlog;h=refs/heads/fast-utf8-elstner

The mask variables now have the explicit type guint32, rather than
gunichar. I think this ensures that a cumulative left shift past 32
bits results in zero, terminating the loop; if this is not enough,
a mask with 0xFFFFFFFF could be thrown in, which hopefully will be
optimized away on 32-bit targets.
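
To illustrate the shift property in isolation, here is a minimal sketch.
It is not the loop from the branch (which is not reproduced in this
message); the function names are hypothetical and uint32_t/uint64_t stand
in for guint32 and a wider type. It only shows that repeated 6-bit left
shifts of a 32-bit value reach zero, and how an explicit 0xFFFFFFFF mask
gives the same behaviour when the variable is wider.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: count 6-bit left shifts until a 32-bit mask is zero.
 * Bits shifted past bit 31 are discarded, so the loop must terminate. */
static int shifts_until_zero(uint32_t mask)
{
    int n = 0;
    while (mask != 0) {
        mask <<= 6;
        n++;
        assert(n <= 6);   /* 6 * 6 = 36 > 32: at most six iterations */
    }
    return n;
}

/* Same loop on a 64-bit variable; the explicit mask emulates 32-bit
 * truncation (the "mask with 0xFFFFFFFF" fallback mentioned above). */
static int shifts_until_zero_masked(uint64_t mask)
{
    int n = 0;
    mask &= UINT64_C(0xFFFFFFFF);
    while (mask != 0) {
        mask = (mask << 6) & UINT64_C(0xFFFFFFFF);
        n++;
        assert(n <= 6);
    }
    return n;
}
```

Without the mask, the 64-bit variant would need more iterations before
the bits fall off the top, which is the case the extra AND guards against.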

g_utf8_get_char() is back to its previous implementation, in the name
of quirk compatibility. So, now there are three "gears" for UTF-8
iteration:

3. g_utf8_iterate() is the fastest, with almost no validation;
2. g_utf8_get_char() is slower and performs (as yet undocumented) checks
for structurally correct UTF-8-ish sequences;
1. g_utf8_get_char_validated() is the slowest and performs thorough UTF-8
validation.
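
The trade-off between the two ends of that range can be sketched in plain
C. This is not the GLib code: decode_fast(), decode_checked() and
DECODE_ERROR are hypothetical names, and the middle gear is not modelled.
The first function trusts its input completely; the second rejects bad
lead bytes, truncated input, bad continuation bytes, overlong forms,
surrogates and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DECODE_ERROR ((uint32_t)-1)   /* hypothetical error marker */

/* Fast path: assumes p points at the start of a valid sequence. */
static uint32_t decode_fast(const unsigned char *p, int *len_out)
{
    uint32_t c = p[0];
    int len;

    assert(len_out != NULL);
    if (c < 0x80)      { len = 1; }
    else if (c < 0xe0) { len = 2; c &= 0x1f; }
    else if (c < 0xf0) { len = 3; c &= 0x0f; }
    else               { len = 4; c &= 0x07; }   /* no range check at all */

    for (int i = 1; i < len; i++)
        c = (c << 6) | (p[i] & 0x3f);            /* continuation bits, unchecked */

    *len_out = len;
    return c;
}

/* Checked path: returns DECODE_ERROR for anything that is not well formed. */
static uint32_t decode_checked(const unsigned char *p, size_t avail, int *len_out)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t c;
    int len;

    assert(len_out != NULL);
    if (avail == 0)
        return DECODE_ERROR;

    c = p[0];
    if (c < 0x80)                   { len = 1; }
    else if (c >= 0xc2 && c < 0xe0) { len = 2; c &= 0x1f; }
    else if (c >= 0xe0 && c < 0xf0) { len = 3; c &= 0x0f; }
    else if (c >= 0xf0 && c < 0xf5) { len = 4; c &= 0x07; }
    else
        return DECODE_ERROR;        /* stray continuation or invalid lead byte */

    if ((size_t)len > avail)
        return DECODE_ERROR;        /* truncated sequence */

    for (int i = 1; i < len; i++) {
        if ((p[i] & 0xc0) != 0x80)
            return DECODE_ERROR;    /* not a 10xxxxxx continuation byte */
        c = (c << 6) | (p[i] & 0x3f);
    }

    if (c < min_value[len])           return DECODE_ERROR;  /* overlong form */
    if (c >= 0xd800 && c <= 0xdfff)   return DECODE_ERROR;  /* surrogate */
    if (c > 0x10ffff)                 return DECODE_ERROR;  /* out of range */

    *len_out = len;
    return c;
}
```

For example, the overlong pair 0xC0 0x80 is rejected by decode_checked()
but silently decoded as 0 by decode_fast(), which is the kind of input
the faster gears leave to the caller.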

Best regards,
  Mikhail

