Re: Faster UTF-8 decoding in GLib

Sorry for replying so late.  I saw a few replies implying that, since the
developer time to implement this (to me, unmeasurably useful) feature has
already been spent, I should go ahead and commit it.  There are several flaws
in that argument:

  - It ignores the fact that writing a patch is only a small part of the time
spent on a change, and it leaves out the maintainer's review time as well as
future maintenance.  If you think I should commit without spending
significant time on it, well, there's a reason you're not the maintainer :P.
In short, it's the maintainer who takes the risk, not you or the patch
author.  Guess why I'm replying this late?  Because reading 18 messages and
20 patches takes time.  Time I could otherwise spend fixing a bug with a
measurable impact.

  - It also assumes that the patch is ready and useful.  The original patch
series had various flaws.  Here are a few:

    * Introduced 256 new relocations!  (See the sketch after this list for
why that matters.)

    * Inlined a public function, only to make an indirect function call
instead.  What's the point of inlining then?!

    * Had unknown impacts on systems with higher function call overhead.

    * Was not tested in real-life situations.  Perf tests are not realistic:
calling g_utf8_next_char a million times in a loop is nothing like real life,
where the strings being processed are really short and memory-cache effects
make any micro-optimization look like noise.

    * Changed the semantics of the GLib UTF-8 functions.  Dealing with UTF-8
coming from the outside world is a very sensitive matter security-wise, and
there's backward compatibility to consider: we can't just decide to return a
different value from now on.

    * The construct borrowed from glibmm, as beautiful as it is, is WRONG for
6-byte-long UTF-8 sequences.  It just doesn't work, and we historically
support those sequences.
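
For readers unfamiliar with the relocation point, the difference is between a
table of pointers, which the dynamic linker must patch entry by entry at load
time, and a flat byte array, which is pure position-independent data.  A
minimal, hypothetical illustration (the names are made up, not from the
patch):

#include <stdio.h>

/* Pointer table: each entry is an absolute address, so the linker
 * emits one relocation per entry and the page holding the table gets
 * dirtied in every process that maps the library. */
static const char *const mask_name_ptrs[4] = {
  "none", "ascii", "two-byte", "three-byte"
};

/* Flat byte array: no addresses stored, hence no relocations; the
 * data can stay in the read-only, shared segment. */
static const char mask_name_flat[4][11] = {
  "none", "ascii", "two-byte", "three-byte"
};

int
main (void)
{
  printf ("%s / %s\n", mask_name_ptrs[2], mask_name_flat[2]);
  return 0;
}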


That said, I'm not being unfair to anyone here.  I personally am a UTF-8
micro-optimizing geek myself.  See for example this blog post:

  http://mces.blogspot.com/2008/04/utf-8-bit-manipulation.html

So I'm not even willing to commit my own optimization to that code without
seeing real-world numbers first.

Another idea, now that people are measuring.  What about this:

/* Mask of the payload bits in the lead byte, indexed by sequence
 * length (index 0 is unused). */
static const int utf8_mask_data[7] = {
  0, 0x7f, 0x1f, 0x0f, 0x07, 0x03, 0x01
};

#define UTF8_COMPUTE(Char, Mask, Len) \
  G_STMT_BEGIN { \
    /* utf8_skip_data is the existing per-lead-byte length table. */ \
    Len = utf8_skip_data[(guchar)(Char)]; \
    Mask = utf8_mask_data[Len]; \
    /* 0xfe and 0xff can never start a sequence. */ \
    if (G_UNLIKELY ((guchar)(Char) >= 0xfe)) \
      Len = -1; \
  } G_STMT_END
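
For context, here is how the macro would slot into a decode loop, in the
style of the existing UTF8_GET code in gutf8.c.  A sketch only, assuming the
table and macro above; decode_one is a made-up name, not an actual GLib
function:

static gunichar
decode_one (const gchar *p)
{
  gunichar wc;
  int i, mask, len;

  UTF8_COMPUTE (*p, mask, len);
  if (len == -1)
    return (gunichar) -1;                 /* invalid lead byte */

  wc = (guchar) *p & mask;                /* payload bits of the lead byte */
  for (i = 1; i < len; i++)
    {
      if (((guchar) p[i] & 0xc0) != 0x80) /* expect a continuation byte */
        return (gunichar) -1;
      wc = (wc << 6) | ((guchar) p[i] & 0x3f);
    }
  return wc;
}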

behdad

