Re: UTF-8
- From: Sven Neumann <sven gimp org>
- To: damien donlon sun com
- Cc: GNOME i18n list <gnome-i18n gnome org>
- Subject: Re: UTF-8
- Date: 10 Jul 2002 19:03:50 +0200
Hi,
Damien Donlon - Sun Ireland - Solaris Software - Localisation Engineer <damien.donlon@sun.com> writes:
> [2] Create a tool that can check whether a file is UTF-8 encoded.
> The tool should not be dependent on simply reading a charset field
> within the file to see whether it says UTF-8 but by analysing the
> byte stream. Does such a tool exist already within the community?
>
> I think it may be impossible to distinguish between UTF-8 and 8859-1
> if no character is outside the 0-127 range. Can anyone confirm? Is
> this a big problem in identifying UTF-8 encoded files?
That is correct. The 7-bit ASCII encoding, which occupies the 0-127
range of ISO-8859-1 (and of the other ISO-8859 encodings), is a subset
of UTF-8. But I don't see any problem here, since an ISO-8859-1 encoded
file that uses nothing but characters from the 0-127 range is at the
same time a valid UTF-8 encoded file.
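To illustrate the point, here is a minimal Python sketch. The helper
name is_pure_ascii and the sample string are made up for this example;
the only claim is that bytes in the 0-127 range decode to the same text
under both encodings.

```python
def is_pure_ascii(data: bytes) -> bool:
    """Return True if every byte is in the 0-127 range."""
    return all(b < 0x80 for b in data)


# Hypothetical sample data; any 7-bit content behaves the same way.
sample = b"Hello, GNOME i18n"

# For pure 7-bit input, both decoders produce identical text, so the
# question "is it ISO-8859-1 or UTF-8?" has no practical consequence.
same_text = sample.decode("iso-8859-1") == sample.decode("utf-8")
```

Once a byte of 128 or above appears, the two encodings diverge and the
distinction starts to matter.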
The standard 'file' utility seems to do a decent job of detecting
UTF-8 encoded files. It fails to distinguish some other encodings
correctly, but some quick tests I did showed no false positives or
negatives for UTF-8.
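If a dedicated check is wanted, analysing the byte stream can be
sketched in a few lines of Python. This is not how 'file' works
internally; it simply relies on Python's strict UTF-8 decoder, and the
function name classify_bytes and its three return labels are
assumptions made for this example.

```python
def classify_bytes(data: bytes) -> str:
    """Classify a byte string by inspecting the bytes themselves.

    Returns "ascii" if all bytes are in the 0-127 range, "utf-8" if
    the data contains high bytes and decodes as strict UTF-8, and
    "not-utf-8" otherwise.
    """
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        # The strict decoder rejects malformed sequences.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf-8"
    return "utf-8"


# To check a file, read it in binary mode and pass the bytes in, e.g.
#   with open(path, "rb") as f: print(classify_bytes(f.read()))
```

One caveat: the "utf-8" answer means the bytes are valid UTF-8, not
that the author intended UTF-8. The byte pair 0xC3 0xA9, for example,
is valid in both UTF-8 and ISO-8859-1, although such coincidences
become unlikely in longer non-ASCII text.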
Salut, Sven