Re: Faster UTF-8 decoding in GLib

Hi,

2010/3/16 Daniel Elstner <daniel kitta googlemail com>:
>> I already made some minor changes to restrict what it produces (like,
>> c & 0x3f is safer than c - 0x80),
>
> No -- this was on purpose!  Using addition and subtraction here instead
> of bitwise-and and bitwise-or allows the two operations to be fused into
> one LEA instruction on AMD64 and i386.  It is a bit faster if I remember
> correctly.
>
> Note that I fine-tuned the code to produce the optimum assembler output.
> Please verify any changes to my little baby in the disassembler dump. :)

I'm afraid "optimum" assembler output != optimal microcode. Only
measurement will decide.

>>  and it should pass the test suite
>> which has a lot of cases for invalid input with some expected output.
>
> The test suite tests undefined behavior?

The API documentation says it's undefined, but the test suite makes it
clear what the actual expectations are. See tests/utf8.txt.

> I think it is perfectly fine
> to basically return anything here, as long as it does not end up in an
> infinite loop or something.

Yes, though we are already in buffer-overflow territory with all the
implementations of g_utf8_get_char considered so far.

>> My understanding is that unvalidated decoding should also accept
>> various software's misconstructions of UTF-8 and produce some
>> meaningful output.
>
> Meaningful in what sense?  And what kind of misconstructions would that
> be, for example?

Wikipedia describes a couple:
http://en.wikipedia.org/wiki/UTF-8#UTF-8_derivations

I think it's useful to have functions loose enough to interoperate
with these too, as long as one uses the validating routines for any
untrusted input.

-- 
  Mikhail
