Re: Faster UTF-8 decoding in GLib



Hi,

On Tuesday, 16.03.2010 at 22:52 +0200, Mikhail Zabaluev wrote:

> I already made some minor changes to restrict what it produces (like,
> c & 0x3f is safer than c - 0x80),

No -- this was on purpose!  Using addition and subtraction here instead
of bitwise-and and bitwise-or allows the two operations to be fused into
one LEA instruction on AMD64 and i386.  It is a bit faster if I remember
correctly.
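
To illustrate the point, here is a minimal sketch and not the actual GLib
code. The helper names append_add, append_mask and
variants_agree_on_valid_input are made up for this example. For a valid
continuation byte (0x80..0xBF) both variants give the same result; they
differ only on invalid input. In the additive form, a compiler targeting
x86 may combine the add and the subtract into a single LEA after the
shift, though that depends on the compiler and its flags.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not taken from GLib. */

/* Additive form: after the shift, "acc + c - 0x80" is a
 * base + index + displacement expression, which is the shape an x86
 * LEA instruction can compute in one step. */
static uint32_t append_add(uint32_t acc, unsigned char c)
{
    return (acc << 6) + c - 0x80;
}

/* Bitwise form: mask the payload bits, then OR them in. */
static uint32_t append_mask(uint32_t acc, unsigned char c)
{
    return (acc << 6) | (c & 0x3f);
}

/* Returns 1 if both forms agree for every valid continuation byte. */
static int variants_agree_on_valid_input(void)
{
    unsigned int c;

    for (c = 0x80; c <= 0xBF; c++) {
        if (append_add(3, (unsigned char) c) !=
            append_mask(3, (unsigned char) c))
            return 0;
    }
    return 1;
}
```

For example, "é" is encoded as 0xC3 0xA9. The lead byte contributes
0xC3 & 0x1f = 3, and appending 0xA9 with either helper gives U+00E9.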

Note that I fine-tuned the code to produce optimal assembly output.
Please verify any changes to my little baby in the disassembler dump. :)

>  and it should pass the test suite
> which has a lot of cases for invalid input with some expected output.

The test suite tests undefined behavior?  I think it is perfectly fine
to basically return anything here, as long as it does not end up in an
infinite loop or something.  There are dedicated functions for parsing
possibly invalid UTF-8.

> My understanding is that unvalidated decoding should also accept
> various software's misconstructions of UTF-8 and produce some
> meaningful output.

Meaningful in what sense?  And what kind of misconstructions would that
be, for example?  Actually, the Unicode specification demands that UTF-8
decoders not accept any invalid UTF-8 sequences.  For security reasons,
this includes even overlong sequences that would otherwise decode just
fine.  In GLib, this is achieved by validating all input first, or by
using one of the safe extraction functions.
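
As a rough sketch of what "overlong" means (the helpers
decode2_unchecked and is_overlong are hypothetical names, not GLib
API): an unchecked decoder will happily turn the two bytes 0xC0 0xAF
into U+002F ('/'), although that code point must be encoded as the
single byte 0x2F.  A validating decoder rejects the sequence because
the decoded value is below the minimum for its encoded length.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not taken from GLib. */

/* Decode a two-byte sequence without any validation. */
static uint32_t decode2_unchecked(unsigned char b0, unsigned char b1)
{
    return ((uint32_t) (b0 & 0x1f) << 6) | (uint32_t) (b1 & 0x3f);
}

/* Returns 1 if code point cp could have been encoded in fewer than
 * len bytes (an overlong form), or if len is outside 1..4. */
static int is_overlong(uint32_t cp, int len)
{
    switch (len) {
    case 1: return 0;               /* single bytes are never overlong */
    case 2: return cp < 0x80;       /* would fit in one byte */
    case 3: return cp < 0x800;      /* would fit in two bytes */
    case 4: return cp < 0x10000;    /* would fit in three bytes */
    default: return 1;              /* not a valid UTF-8 length */
    }
}
```

Here decode2_unchecked(0xC0, 0xAF) yields 0x2F, and is_overlong(0x2F, 2)
flags it, whereas the well-formed 0xC3 0xA9 decodes to 0xE9 and passes.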

> But the optimizer and
> the CPU make a better job at loops and branches of more traditional
> implementations when they have freedom to use them inline.

There may also be an opportunity for some constant-folding and
elimination of dead branches if the code is inlined.  Also, since the
function was not explicitly declared as inline, it could be that it was
inlined in one case but not the other.

--Daniel



