Re: possible deadlock on invalid UTF-8 data



Jon Trowbridge <trow ximian com> writes:

> On Tue, 2001-11-27 at 14:54, Havoc Pennington wrote:
> >
> > On the other hand, the advantage of the endless loop (vs. reading
> > invalid memory) is that the bug is immediately evident, and pretty
> > easy to track down.
> 
> Wouldn't it be even more immediately evident and even easier to track
> down if it returned NULL or g_assert-ed or g_error-ed or something.
> 
> 
> It seems pathological for a library to signal an error by deadlocking.

#define g_utf8_next_char(p) (char *)((p) + g_utf8_skip[*(guchar *)(p)])

g_utf8_next_char() turns out to be a very time-critical operation;
strings often get iterated over again and again, and checking for
valid UTF-8 on each pass is a heavy penalty. You really need to
validate strings on input, not every time you process them.

I don't really have a strong preference on the deadlock versus
continue-incorrectly issue; note that the g_utf8_skip array is
currently inconsistent on this point - it has 1 for the 0x80-0xA0
range, which isn't valid for an initial character, but 0 for 0xfe
and 0xff.

The tradeoff here is basically:

 - Easy to debug

vs.

 - If encountered, hopefully continue working "well enough"
   to be minimally useful for the user.

If I recall correctly, I originally had it 0 for the 0x80-0xA0 range
as well and changed it to 1 on the theory that while a lockup
is easier to debug for a developer, it can be _very_ confusing
to a user, worse than simply continuing incorrectly.

Strings are validated at enough places that the chance of invalid
UTF-8 not getting caught at all is low.

So, on balance I think it's worth making the 0xfe and 0xff entries
correspond.

Regards,
                                        Owen


