Re: Faster UTF-8 decoding in GLib
- From: Mikhail Zabaluev <mikhail zabaluev gmail com>
- To: Daniel Elstner <daniel kitta googlemail com>
- Cc: gtk-devel-list gnome org
- Subject: Re: Faster UTF-8 decoding in GLib
- Date: Wed, 17 Mar 2010 00:17:08 +0200
Hi,
2010/3/16 Daniel Elstner <daniel kitta googlemail com>:
>> I already made some minor changes to restrict what it produces (like,
>> c & 0x3f is safer than c - 0x80),
>
> No -- this was on purpose! Using addition and subtraction here instead
> of bitwise-and and bitwise-or allows the two operations to be fused into
> one LEA instruction on AMD64 and i386. It is a bit faster if I remember
> correctly.
>
> Note that I fine-tuned the code to produce the optimum assembler output.
> Please verify any changes to my little baby in the disassembler dump. :)
I'm afraid "optimum" assembler output does not guarantee optimal
execution at the microcode level. Only measurement will decide.
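To make the two variants under discussion concrete, here is a minimal
sketch in plain C. The helper names are hypothetical and this is not the
code from either patch. For a valid continuation byte (0x80..0xBF) both
forms give the same result, because the low six bits of the shifted
accumulator are zero. They only diverge on invalid input, which is where
the mask keeps the contribution confined to six bits.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not GLib API. */

/* Mask variant: only the low six bits of c can reach the result. */
static uint32_t accumulate_mask(uint32_t acc, unsigned char c)
{
    return (acc << 6) | (c & 0x3fu);
}

/* Arithmetic variant: shift-and-add, which a compiler may be able to
 * fold into a single address-computation instruction. For c outside
 * 0x80..0xBF the subtraction leaks extra bits (or wraps, which is
 * well defined for unsigned arithmetic). */
static uint32_t accumulate_sub(uint32_t acc, unsigned char c)
{
    return (acc << 6) + ((uint32_t)c - 0x80u);
}
```

Whether the arithmetic form is actually faster on a given CPU is exactly
the thing that has to be measured.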
>> and it should pass the test suite
>> which has a lot of cases for invalid input with some expected output.
>
> The test suite tests undefined behavior?
The API documentation says it's undefined, but the test suite makes it
clear what the actual expectations are. See tests/utf8.txt.
> I think it is perfectly fine
> to basically return anything here, as long as it does not end up in an
> infinite loop or something.
Yes, though we are already in buffer overflow territory with all
implementations of g_utf8_get_char considered so far.
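A rough illustration of the overflow concern, as a sketch with
hypothetical helper names: an unvalidated decoder trusts the lead byte to
say how many bytes follow, so a truncated sequence at the end of a buffer
makes it read past the end. A caller that knows where the buffer ends can
check that first (g_utf8_get_char_validated takes a max_len argument for
this purpose).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not GLib API. */

/* Sequence length claimed by the lead byte, or 0 if the byte cannot
 * start a sequence (a continuation byte or 0xF8 and above). */
static size_t claimed_length(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead < 0xc0) return 0;
    if (lead < 0xe0) return 2;
    if (lead < 0xf0) return 3;
    if (lead < 0xf8) return 4;
    return 0;
}

/* Returns 1 if a decoder that trusts the lead byte would stay inside
 * [p, end), 0 otherwise. */
static int sequence_fits(const unsigned char *p, const unsigned char *end)
{
    size_t n;

    if (p >= end)
        return 0;
    n = claimed_length(*p);
    return n != 0 && (size_t)(end - p) >= n;
}
```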
>> My understanding is that unvalidated decoding should also accept
>> various software's misconstructions of UTF-8 and produce some
>> meaningful output.
>
> Meaningful in what sense? And what kind of misconstructions would that
> be, for example?
Wikipedia describes a couple:
http://en.wikipedia.org/wiki/UTF-8#UTF-8_derivations
I think it's useful to have functions loose enough to interoperate
with these too, as long as one uses the validating routines for any
untrusted input.
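As a sketch of what "loose" means here, assuming the derivations in
question are Modified UTF-8 (U+0000 encoded as the overlong pair C0 80)
and CESU-8 (surrogates encoded as three-byte sequences): a lenient
decoder simply extracts the payload bits, while a strict check rejects
the same input. The function names are hypothetical, and four-byte
sequences are left out to keep it short.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not GLib API. */

/* Lenient decode of a 1- to 3-byte sequence: no overlong or surrogate
 * checks, and no bounds checks, so trusted input only. */
static uint32_t loose_decode(const unsigned char *p)
{
    unsigned char c = p[0];

    if (c < 0x80)
        return c;
    if ((c & 0xe0) == 0xc0)
        return ((uint32_t)(c & 0x1f) << 6) | (p[1] & 0x3fu);
    if ((c & 0xf0) == 0xe0)
        return ((uint32_t)(c & 0x0f) << 12)
             | ((uint32_t)(p[1] & 0x3f) << 6)
             | (p[2] & 0x3fu);
    return 0xfffd; /* longer sequences are out of scope for this sketch */
}

/* Strict acceptance test for a code point decoded from len bytes:
 * rejects overlong forms and surrogates. */
static int strict_ok(uint32_t cp, int len)
{
    static const uint32_t min_for_len[] = { 0, 0, 0x80, 0x800 };

    if (len < 1 || len > 3)
        return 0;
    if (cp < min_for_len[len])
        return 0; /* overlong, e.g. C0 80 */
    if (cp >= 0xd800 && cp <= 0xdfff)
        return 0; /* surrogate, as produced by CESU-8 */
    return 1;
}
```

With this, C0 80 decodes loosely to 0 and ED A0 80 to 0xD800, and both
are rejected by the strict check.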
--
Mikhail