Re: IOChannel and dodgy encodings

From: Ron Steinke <rsteinke w-link net>
To: jlbec evilplan org, rsteinke w-link net
Cc: gtk-devel-list gnome org
Subject: Re: IOChannel and dodgy encodings
Date: Wed, 01 Aug 2001 23:26:46 -0700

> From: Joel Becker <jlbec evilplan org>
>
> > More sophisticated error handling is somewhat in flux. I've considered
> > only returning G_CONVERT_ERROR_ILLEGAL_SEQUENCE if the bad character is
> > first in the buffer. This would allow you to do a seek with G_SEEK_CUR
> > to avoid the character, as the buffer which interferes with G_SEEK_CUR
> > in some encodings would be empty. The current implementation doesn't
> > do this, however. If you would be interested in such a thing, please
> > forward your reply to gtk-devel-list gnome org with comments.
>
> 	This may be interesting, but I do not know that it would do
> everything.  In my error case, I'm pretty sure that it is in the middle
> of the buffer somewheres.

It doesn't matter where in the file the error you were talking about
occurs. What I was proposing is that g_io_channel_read_chars() return
an ILLEGAL_SEQUENCE error only when the illegal character is the next
one to be read, rather than whenever one sits somewhere in the
GIOChannel's internal buffer. That would let you seek past it, or
switch the channel to another encoding and read it in.

The complication is that, while you can't normally use G_SEEK_CUR
on an encoded channel, you would be able to in this case (needs to
go into g_io_channel_seek_position() docs if implemented). The
reason for this has to do with the way the internal buffers are
implemented.

<excessive technical detail alert>

For channels with a non-NULL encoding, there are actually two read
buffers. Data read directly from the file (or whatever) is placed in
read_buf, and after it has been converted to UTF-8 (or validated, if
the encoding is set to UTF-8) it is placed in encoded_read_buf.
Schematically, it looks something like this:

|<- encoded_read_buf ->|<- read_buf ->|<- data in file ...
^                      ^              ^
|                      |              \ real file pointer
|                      \ encoder
\ apparent file pointer

g_io_channel_seek_position() does all seeks relative to the
apparent file pointer. To convert this to calls to the
backend function io_seek(), we need to shift the offset for
G_SEEK_CUR to be relative to the real file pointer.
If we are reading data in an encoding other than
UTF-8, the length of the data in encoded_read_buf is
generally not the same as the length of the corresponding data in the
file, so the shift cannot be computed unless encoded_read_buf is empty.
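
The offset shift described above might be sketched roughly as follows.
This is plain C, not GLib code: ChannelModel, seek_cur_allowed() and
shift_seek_cur() are hypothetical names made up for illustration, and
only read_buf and encoded_read_buf come from the description. It
assumes the shift is a simple subtraction of the buffered byte counts,
which only holds when encoded_read_buf is empty or holds data whose
length matches the file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the buffering state; not the real GIOChannel. */
typedef struct {
    bool   do_encode;            /* converting from a non-UTF-8 encoding */
    size_t read_buf_len;         /* raw bytes read but not yet converted */
    size_t encoded_read_buf_len; /* converted bytes not yet handed out   */
} ChannelModel;

/* A relative seek can only be translated if the converted data's length
 * is known to match the file, i.e. no re-encoded data is pending. */
static bool seek_cur_allowed(const ChannelModel *ch)
{
    return !(ch->do_encode && ch->encoded_read_buf_len > 0);
}

/* Turn an offset relative to the apparent file pointer into one
 * relative to the real file pointer, which is ahead by the buffered data. */
static long shift_seek_cur(const ChannelModel *ch, long apparent_offset)
{
    assert(seek_cur_allowed(ch));
    return apparent_offset
           - (long)ch->read_buf_len
           - (long)ch->encoded_read_buf_len;
}
```

With 10 raw bytes pending and nothing in the encoded buffer, a seek of
+1 from the apparent position becomes a seek of -9 from the real one.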

Generally speaking, all data in read_buf which can be converted and
placed in encoded_read_buf is converted. Conversion halts at an illegal
sequence, so once one is detected it is always the first character left
in read_buf. If we were to return all characters prior to the illegal
sequence before returning an error, we could guarantee that
encoded_read_buf is empty whenever the error is reported. That would
allow seeking with G_SEEK_CUR or changing the encoding (which is also
forbidden if encoded_read_buf is nonempty) to deal with bad data.
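
A toy sketch of that read policy, under loud assumptions: this is not
the GLib implementation, ToyChannel, fill_encoded() and
toy_read_chars() are hypothetical names, and "conversion" here just
copies bytes below 0x80 and treats anything else as an illegal
sequence. The point is only the ordering: good data is returned first,
and the error is reported once the encoded buffer is empty.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { READ_NORMAL, READ_EOF, READ_ILLEGAL_SEQUENCE } ReadStatus;

/* Hypothetical stand-in for a channel; not the real GIOChannel. */
typedef struct {
    const unsigned char *raw;   /* stands in for read_buf             */
    size_t raw_len;
    size_t raw_pos;             /* where the "encoder" currently sits */
    unsigned char encoded[64];  /* stands in for encoded_read_buf     */
    size_t encoded_len;
} ToyChannel;

/* Convert as much raw data as possible, halting at an illegal byte. */
static void fill_encoded(ToyChannel *ch)
{
    while (ch->raw_pos < ch->raw_len &&
           ch->encoded_len < sizeof ch->encoded &&
           ch->raw[ch->raw_pos] < 0x80) {
        ch->encoded[ch->encoded_len++] = ch->raw[ch->raw_pos++];
    }
}

/* Hand out converted data first; report the illegal sequence only when
 * nothing converted is left, so the encoded buffer is empty on error. */
static ReadStatus toy_read_chars(ToyChannel *ch, unsigned char *out,
                                 size_t count, size_t *n_read)
{
    assert(count > 0);
    fill_encoded(ch);
    if (ch->encoded_len > 0) {
        size_t n = ch->encoded_len < count ? ch->encoded_len : count;
        memcpy(out, ch->encoded, n);
        memmove(ch->encoded, ch->encoded + n, ch->encoded_len - n);
        ch->encoded_len -= n;
        *n_read = n;
        return READ_NORMAL;
    }
    *n_read = 0;
    if (ch->raw_pos == ch->raw_len)
        return READ_EOF;
    return READ_ILLEGAL_SEQUENCE;
}
```

In this model a caller that gets READ_ILLEGAL_SEQUENCE could advance
raw_pos by one byte and read again, which plays the role that a
G_SEEK_CUR seek would play on a real channel.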

</excessive technical detail alert>

Basically, the GIOChannel API was only nailed down at the feature
freeze a month ago, and we need to spend some time thinking about the
error handling.
In particular, I'm waiting to hear some responses (particularly from Owen)
on my comments on G_IO_STATUS_AGAIN and partial writes.
(see http://mail.gnome.org/archives/gtk-devel-list/2001-July/msg00398.html)

Ron Steinke
