Re: Faster UTF-8 decoding in GLib
- From: Mikhail Zabaluev <mikhail zabaluev gmail com>
- To: Behdad Esfahbod <behdad behdad org>
- Cc: Daniel Elstner <daniel kitta googlemail com>, gtk-devel-list gnome org
- Subject: Re: Faster UTF-8 decoding in GLib
- Date: Tue, 30 Mar 2010 00:37:38 +0300
Hi,
2010/3/26 Behdad Esfahbod <behdad behdad org>:
> Another idea, now that people are measuring: What about this:
>
> static const int utf8_mask_data[7] = {
>   0, 0x7f, 0x1f, 0x0f, 0x07, 0x03, 0x01
> };
>
> #define UTF8_COMPUTE(Char, Mask, Len) \
>   G_STMT_BEGIN { \
>     Len = utf8_skip_data[(guchar)(Char)]; \
>     Mask = utf8_mask_data[Len]; \
>     if (G_UNLIKELY ((guchar)(Char) >= 0xfe)) \
>       Len = -1; \
>   } G_STMT_END
I have tried this and, contrary to my expectations as well, the result
on Core 2 was worse than mainline.
There are now two more changes on this branch:
http://git.collabora.co.uk/?p=user/zabaluev/glib.git;a=shortlog;h=refs/heads/fast-utf8-elstner
The mask variables now have the explicit type guint32, rather than
gunichar. I think this ensures that a cumulative left shift of more
than 32 bits results in zero, terminating the loop; if this is not
enough, a masking operation with 0xFFFFFFFF could be thrown in, which
will hopefully be optimized away on 32-bit targets.
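
To illustrate the point about the shift (this is only a sketch, not the
loop from the branch; the helper names are hypothetical and fixed-width
types from <stdint.h> stand in for guint32): with a 32-bit unsigned
variable, repeated small shifts eventually push every bit out and the
value becomes zero. With a wider variable, the same effect needs an
explicit mask with 0xFFFFFFFF.

```c
#include <stdint.h>

/* Hypothetical helper: count how many left shifts by `step` bits it takes
 * for a 32-bit mask to become zero. `step` must be in the range 1..31, so
 * each individual shift is well-defined; the bits are lost cumulatively. */
static int shifts_until_zero_u32(uint32_t mask, unsigned step)
{
    int n = 0;

    while (mask != 0) {
        mask <<= step;   /* high bits fall off the 32-bit variable */
        n++;
    }
    return n;
}

/* Same loop with a 64-bit variable: the bits would survive above bit 31,
 * so an explicit mask is needed to get the same termination behaviour. */
static int shifts_until_zero_u64_masked(uint64_t mask, unsigned step)
{
    int n = 0;

    while (mask != 0) {
        mask = (mask << step) & UINT64_C(0xFFFFFFFF);
        n++;
    }
    return n;
}
```

For example, a mask of 0x80 shifted by 6 bits at a time reaches zero
after five shifts in both variants.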
g_utf8_get_char() is back to its previous implementation, in the name
of quirk compatibility. So, now there are three "gears" for UTF-8
iteration:
3. g_utf8_iterate() is the fastest, with almost no validation;
2. g_utf8_get_char() is slower, performs (as yet undocumented) checks
for structurally correct UTF-8-ish sequences;
1. g_utf8_get_char_validated() is the slowest, performs thorough UTF-8
validation.
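
As a rough illustration of the trade-off between the fastest and the
slowest gear, here is a self-contained sketch. These are not the GLib
implementations: decode_unchecked() and decode_validated() are
hypothetical names, and the middle gear is not modelled. The first
function trusts its input completely; the second rejects truncated,
overlong, surrogate and out-of-range sequences.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "fast gear": assumes valid UTF-8, checks nothing. */
static uint32_t decode_unchecked(const unsigned char *p)
{
    uint32_t c = p[0];

    if (c < 0x80)
        return c;
    if (c < 0xe0)
        return ((c & 0x1f) << 6) | (p[1] & 0x3f);
    if (c < 0xf0)
        return ((c & 0x0f) << 12) | ((uint32_t)(p[1] & 0x3f) << 6)
             | (p[2] & 0x3f);
    return ((c & 0x07) << 18) | ((uint32_t)(p[1] & 0x3f) << 12)
         | ((uint32_t)(p[2] & 0x3f) << 6) | (p[3] & 0x3f);
}

/* Hypothetical "slow gear": returns the code point, or -1 if the first
 * sequence in p[0..len) is not valid UTF-8. */
static int32_t decode_validated(const unsigned char *p, size_t len)
{
    /* Smallest code point that legitimately needs a sequence of n bytes. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t c;
    size_t n, i;

    if (len == 0)
        return -1;
    c = p[0];
    if (c < 0x80)
        return (int32_t)c;
    if (c >= 0xc0 && c < 0xe0)      { n = 2; c &= 0x1f; }
    else if (c >= 0xe0 && c < 0xf0) { n = 3; c &= 0x0f; }
    else if (c >= 0xf0 && c < 0xf8) { n = 4; c &= 0x07; }
    else
        return -1;                  /* stray continuation or bad lead byte */
    if (len < n)
        return -1;                  /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((p[i] & 0xc0) != 0x80)
            return -1;              /* not a continuation byte */
        c = (c << 6) | (p[i] & 0x3f);
    }
    if (c < min_for_len[n])
        return -1;                  /* overlong encoding */
    if (c >= 0xd800 && c <= 0xdfff)
        return -1;                  /* UTF-16 surrogate */
    if (c > 0x10ffff)
        return -1;                  /* beyond the Unicode range */
    return (int32_t)c;
}
```

On valid input both functions agree; on the overlong sequence
"\xc0\x80" the unchecked decoder silently yields 0, while the validating
one reports an error.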
Best regards,
Mikhail