On Thursday 07 September 2006 at 20:03 +0100, Jamie McCracken wrote:
> Laurent Aguerreche wrote:
>> On Thursday 07 September 2006 at 17:05 +0100, Jamie McCracken wrote:
>>> Jamie McCracken wrote:
>>>> Laurent Aguerreche wrote:
>>>>> I wonder whether the use of strlen() on UTF-8 is correct; it
>>>>> shouldn't be... If I remember correctly, Unicode can use arrays
>>>>> filled this way:
>>>>>   '\0' 'H' '\0' 'E' '\0' 'L' '\0' 'L' '\0' 'O'   ("HELLO")
>>>>> where a '\0' can be replaced by a value in order to store
>>>>> characters on 2 bytes. But I don't remember whether that happens
>>>>> with UTF-8. I'll have to check what happens with strlen() and
>>>>> funky characters.
>>>>
>>>> UTF-8 is not Unicode. In UTF-8, ASCII is always 1 byte per
>>>> character and is indistinguishable from plain text/ASCII.
>>>> Non-ASCII is always 2-4 bytes per character (mostly 2 bytes,
>>>> though).
>>>
>>> Also, a non-ASCII character cannot contain an ASCII byte within its
>>> multibyte sequence (every byte of a multibyte UTF-8 character has
>>> its most significant bit set to 1, whereas ASCII is always less
>>> than 128 and so has an MSB of 0).
>>>
>>> For reference: http://en.wikipedia.org/wiki/UTF-8
>>
>> Ok, thank you. So I introduced a bug in tracker-utils.c during my
>> work on UTF-8. :-)
>>
>> In is_text_file(), I wrote:
>>
>>   if (data_read) {
>>       char *s;
>>       s = g_locale_to_utf8 (buffer, 65565, NULL, NULL, NULL);
>>
>> I propose this replacement:
>>
>>   if (data_read) {
>>       char *s;
>>       s = g_locale_to_utf8 (buffer, -1, NULL, NULL, NULL);
>
> Yes, thanks - it seems I missed that one when reviewing your work.
>
> The buffer would need to be at most 4x the size of the input string
> to be fully UTF-8 safe. It might be worth checking whether that's the
> case elsewhere in tracker.
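
The UTF-8 properties discussed in the quoted messages can be checked with a small C sketch. It is only an illustration, not code from tracker: the helper names utf8_char_count(), utf8_extra_bytes() and utf8_is_ascii() are made up for this example, and the input is assumed to be valid, NUL-terminated UTF-8. It shows that strlen() stops at the right place (no byte of a multibyte sequence is '\0') but returns a byte count rather than a character count.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the characters (code points) in a NUL-terminated UTF-8 string.
 * Continuation bytes look like 10xxxxxx, so every byte that does NOT
 * match that pattern starts a new character.
 * Assumption: the input is valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Number of bytes strlen() reports beyond the character count.
 * This is 0 for pure ASCII and positive as soon as a multibyte
 * character is present. */
size_t utf8_extra_bytes(const char *s)
{
    return strlen(s) - utf8_char_count(s);
}

/* Return 1 if every byte is ASCII (most significant bit 0), else 0. */
int utf8_is_ascii(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if ((unsigned char)*s & 0x80)
            return 0;
    }
    return 1;
}
```

For the string "h\xC3\xA9llo" (the word "héllo" encoded as UTF-8), strlen() reports 6 bytes while utf8_char_count() reports 5 characters, so strlen() is safe for sizing buffers in bytes but not for counting characters.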
There is another bug now: tracker_db_save_file_contents() is called with a directory as file_name, so of course fgets() blocks on it. I think I've found the reason: in tracker_metadata_get_text_file(), text_filter_file was sometimes wrongly left with a non-NULL value because it was never initialised. I have attached a patch.

Laurent.
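
As a rough illustration of the two defensive habits involved here (this is not the attached patch, whose contents are not shown in this message), the sketch below initialises its pointer to NULL before use and refuses to read anything that is not a regular file. The helpers is_regular_file() and count_lines() are hypothetical names invented for the example, and the code assumes a POSIX system providing stat().

```c
#include <stdio.h>
#include <sys/stat.h>

/* Return 1 if path names a regular file, 0 otherwise
 * (directories, missing paths, NULL and stat() errors all give 0). */
int is_regular_file(const char *path)
{
    struct stat st;

    if (path == NULL || stat(path, &st) != 0)
        return 0;
    return S_ISREG(st.st_mode) ? 1 : 0;
}

/* Hypothetical reader: returns the number of successful fgets() calls
 * (one per line, as long as lines are shorter than the buffer), or -1
 * if path is not a regular file or cannot be opened. */
long count_lines(const char *path)
{
    FILE *f = NULL;      /* initialised, so it is never indeterminate */
    char line[4096];
    long n = 0;

    if (!is_regular_file(path))
        return -1;       /* never hand a directory to fgets() */

    f = fopen(path, "r");
    if (f == NULL)
        return -1;

    while (fgets(line, sizeof line, f) != NULL)
        n++;

    fclose(f);
    return n;
}
```

With this shape, a caller that is accidentally handed a directory name gets -1 back instead of reaching fgets() at all.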
Attachment:
correct-tracker_metadata_get_text_file+variables.diff
Description: Text Data