Re: UTF-8

From: Sven Neumann <sven gimp org>
To: damien donlon sun com
Cc: GNOME i18n list <gnome-i18n gnome org>
Subject: Re: UTF-8
Date: 10 Jul 2002 19:03:50 +0200

Hi,

Damien Donlon - Sun Ireland - Solaris Software - Localisation Engineer  <damien.donlon@sun.com> writes:

> [2] Create a tool that can check whether a file is UTF-8 encoded.
>     The tool should not be dependent on simply reading a charset field
>     within the file to see whether it says UTF-8 but by analysing the
>     byte stream. Does such a tool exist already within the community?
> 
>     I think it may be impossible to distinguish between UTF-8 and 8859-1
>     if no character is outside the 0-127 range. Can anyone confirm? Is
>     this a big problem in identifying UTF-8 encoded files?

This is correct. The 7-bit ASCII encoding, which occupies the 0-127
range of ISO-8859-1 (and of the other ISO-8859 encodings), is a subset
of UTF-8. But I don't see any problem here, since an ISO-8859-1 encoded
file that uses nothing but characters from the 0-127 range is at
the same time a valid UTF-8 encoded file.
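
To make the subset argument concrete, here is a small Python sketch (not
part of any existing tool; the variable names are only illustrative). It
decodes every byte in the 0-127 range under both encodings and compares
the results, then shows where the two encodings start to differ.

```python
# Every byte value in the 0-127 range, as one byte string.
ascii_bytes = bytes(range(128))

# Decoding under either encoding yields the same text, so a file made
# only of these bytes is valid ISO-8859-1 and valid UTF-8 at once.
same_text = ascii_bytes.decode("iso-8859-1") == ascii_bytes.decode("utf-8")

# Above 127 the encodings diverge: 0xE9 is "é" in ISO-8859-1, but a
# lone 0xE9 byte is not a well-formed UTF-8 sequence.
latin1_e_acute = b"\xe9"
try:
    latin1_e_acute.decode("utf-8")
    lone_byte_is_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_utf8 = False

print(same_text, lone_byte_is_utf8)
```

So the ambiguity only exists for files where it does not matter; as soon
as a byte above 127 appears, the byte stream itself carries evidence.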

The standard 'file' utility seems to do a decent job at detecting
UTF-8 encoded files. It fails to distinguish some other encodings
correctly, but the quick tests I did showed no false positives or
negatives for UTF-8.
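
For anyone who wants a check that looks only at the byte stream, as
Damien asked, a minimal sketch in Python might look like the following.
This is not how 'file' does it; the function name classify() and its
return values are made up for illustration, and it simply relies on
Python's strict UTF-8 decoder.

```python
def classify(data: bytes) -> str:
    """Classify a byte string by content alone (hypothetical helper).

    Returns "ascii" if every byte is in the 0-127 range, "utf-8" if the
    data contains bytes above 127 and is well-formed UTF-8, and
    "not-utf-8" otherwise.
    """
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        data.decode("utf-8")  # strict decoding; raises on malformed input
    except UnicodeDecodeError:
        return "not-utf-8"
    return "utf-8"


if __name__ == "__main__":
    import sys

    # Usage (assumed invocation): python check_utf8.py FILE...
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            print(path, classify(fh.read()))
```

One caveat: a file in another 8-bit encoding can occasionally happen to
be well-formed UTF-8 (for example the two ISO-8859-1 bytes 0xC3 0xA9),
so a "utf-8" result is strong evidence rather than proof.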

Salut, Sven
