Re: Fix for "Need encoding conversion for GIOChannel"
- From: Owen Taylor <otaylor redhat com>
- To: HideToshi Tajima <tajima happy eng sun com>
- Cc: gtk-devel-list gnome org
- Subject: Re: Fix for "Need encoding conversion for GIOChannel"
- Date: 26 Apr 2001 19:34:01 -0400
HideToshi Tajima <tajima happy eng sun com> writes:
> I've posted a patch for one of the remaining 2.0 API bugs:
> http://bugzilla.gnome.org/show_bug.cgi?id=52811, whose short summary
> is "Need encoding conversion for GIOChannel".
>
> The patch should give GIOChannel class the features listed below:
>
> * charset code conversion at read/write, read/write by chars
> * buffering
> * blocking vs non-blocking (controlled by *set_flags/get_flags)
> * Reading a file a line at a time
> * Reading a whole file at once
> * error handling with more error feedbacks
>
> A simple testcase is attached to the above bugzilla report. I've run
> the test with this testcase on two Unix platforms - Red Hat Linux 6.2 Intel
> and Solaris 8 SPARC, and the patch seems to work well on both. I have to
> say that the win32 backend code is untested (more precisely, unimplemented) - so
> I wish to get someone who knows win32 to have a look, test, and complete the
> code.
Reading over this patch, it's clear to me that this is a much
harder problem than I realized when I first wrote down the API.
There are various issues about the exact semantics of the functions
that seem minor at first glance, but have fairly large consequences
for the internal implementation.
So, most of this mail is just listing some of places where your
code did different things than I expected, or I wasn't sure
what I expected.
I think once we agree on exactly what the behavior should be for
these cases, making the code match that shouldn't be hard, even
if in a few cases it may be a little more complicated than I
expected initially.
So, my apologies if the following has more questions in it than
answers.
Regards,
Owen
* One issue that came to mind reading your code and thinking about
the problems is the handling of partial characters:
- If you call read() with a buffer of length 100, the first 60
characters take 99 bytes, and the 61st takes 2 bytes, does it put
half of the last character in the result buffer?
- If you call write() with a buffer of length 100, with 66
complete characters taking 99 bytes and 1 incomplete character,
does it return a length written of 99, or a length written
of 100 and store the partial character internally?
If the answer is that conversion should take into account partial
characters, then the question is, what should happen when the
encoding for the stream is UTF-8? What should happen when it is
NULL?
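For the UTF-8 case the read side at least has a cheap answer: trim the
buffer back to the last complete character and stash the remainder until
the next read. A minimal sketch (the helper name is mine, not proposed
API):

```c
#include <stddef.h>

/* Sketch (not proposed API): number of leading bytes of buf that form
 * complete UTF-8 characters; a trailing partial sequence would be
 * stashed in the channel until the next read completes it. */
static size_t
complete_utf8_prefix (const unsigned char *buf, size_t len)
{
  size_t end = len;

  /* Walk back over trailing continuation bytes (10xxxxxx). */
  while (end > 0 && (buf[end - 1] & 0xC0) == 0x80 && len - end < 3)
    end--;
  if (end == 0)
    return len;  /* only continuation bytes; let the converter complain */

  unsigned char lead = buf[end - 1];
  size_t need = (lead & 0x80) == 0x00 ? 1
              : (lead & 0xE0) == 0xC0 ? 2
              : (lead & 0xF0) == 0xE0 ? 3
              : (lead & 0xF8) == 0xF0 ? 4 : 1;

  /* Keep the final sequence if it is complete, else cut it off. */
  return (len - (end - 1) >= need) ? len : end - 1;
}
```

The encoding == NULL case has no such easy out, since we can't know
where character boundaries are at all.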
* I think this code probably should be using iconv() directly
instead of going through g_convert(). The advantage of using
iconv() is that we can convert the number of characters that
fit into a fixed size output buffer, and we don't have to malloc
intermediate buffers; also using one iconv_t converter for
the entire stream is potentially a fairly big efficiency win.
A final reason for this is that if you use iconv() directly,
then an encoding like UTF-16 will handle the BOM correctly,
since it will only reset the converter once for the stream,
not for every line.
This obviously does pose a problem for doing 'fallbacks'
in the style of g_convert_with_fallback().
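As a rough illustration of the calling pattern I mean (convert_chunk is
an invented name, not proposed API): with a long-lived iconv_t, E2BIG
and EINVAL are not errors at all, just "output buffer full" and
"partial input character".

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Sketch: convert one chunk into a fixed-size output buffer using a
 * converter kept open for the whole stream.  E2BIG (output full) and
 * EINVAL (partial input character) just mean "convert what fits and
 * keep the rest for the next call" -- no intermediate malloc. */
static size_t
convert_chunk (iconv_t cd, const char *in, size_t inlen,
               char *out, size_t outsize)
{
  char *inp = (char *) in;
  char *outp = out;
  size_t inleft = inlen, outleft = outsize;

  if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1
      && errno != E2BIG && errno != EINVAL)
    return (size_t) -1;          /* a real conversion error */

  return (size_t) (outp - out);  /* bytes of output produced */
}
```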
* Changing the line terminator depending on the platform is a little
dubious - files don't nicely stick to their own platforms.
I think what I'd do is have an enum:

typedef enum
{
  G_IO_LINE_AUTO,
  G_IO_LINE_DOS,       /* \r\n */
  G_IO_LINE_UNIX,      /* \n   */
  G_IO_LINE_MACINTOSH  /* \r   */
} GIOLineTerminator;
And have the default be AUTO - which means, separating on
any of \r, \n, or \r\n. Though this requires one character
read-ahead, hmmm, which means that using \r termination
with AUTO is unsuitable for network protocol type applications
where responses to complete lines are needed.
Related to this is the fact that we can't assume the input
encoding will have literal \n, \r characters in it; I'd
certainly like to be able to handle UTF-16 as a file encoding.
This means that conversion needs to be done _before_ looking
for terminators, not after.
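A sketch of AUTO termination applied to already-converted text (the
helper name is mine) shows exactly where the read-ahead problem bites:
a '\r' at the very end of the buffer can't be classified as Macintosh
"\r" or DOS "\r\n" without one more character.

```c
#include <stddef.h>

/* Sketch: find the first AUTO line terminator in converted text.
 * Returns the line length excluding the terminator; *term_len is the
 * terminator's length, or 0 when none was seen yet or when a trailing
 * '\r' needs one character of read-ahead to decide. */
static size_t
find_line_auto (const char *buf, size_t len, size_t *term_len)
{
  size_t i;

  for (i = 0; i < len; i++)
    {
      if (buf[i] == '\n')
        {
          *term_len = 1;
          return i;
        }
      if (buf[i] == '\r')
        {
          if (i + 1 < len)
            {
              *term_len = (buf[i + 1] == '\n') ? 2 : 1;
              return i;
            }
          *term_len = 0;   /* need read-ahead past the buffer */
          return i;
        }
    }

  *term_len = 0;
  return len;
}
```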
* I'm not sure that converting the virtual functions in GIOFunc
to take GError * arguments is the right approach. The problem
is that GError * should only be used for "real" errors - it
is fairly expensive to duplicate a string, allocate memory
for the GError and so forth.
And read() and write() have two non-error conditions that are
signalled by an "errno" - EAGAIN and EINTR. With the
"io_channel_read_chars()" type signature, there is a clear
mapping for EAGAIN that doesn't involve setting the GError * -
return 0 bytes read with err == NULL.
But there is no corresponding mapping for EINTR. One approach
to handling EINTR would be to make g_io_channel_read()/write()
simply always restart on EINTR; that does affect people
trying to do, say, read/write timeouts with SIGALRM. But
I don't know if anybody really does that with GIOChannel.
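The always-restart option would amount to a loop like this inside the
Unix backend (a sketch, not the actual patch code):

```c
#include <errno.h>
#include <unistd.h>

/* Sketch: hide EINTR from callers by restarting the system call.
 * EAGAIN still propagates, so non-blocking reads keep their clean
 * "0 bytes read, err == NULL" mapping. */
static ssize_t
read_restarting (int fd, void *buf, size_t count)
{
  ssize_t n;

  do
    n = read (fd, buf, count);
  while (n < 0 && errno == EINTR);

  return n;
}
```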
* When a converter is involved, "no buffering" may not make sense.
This is certainly true when reading. If a partial character
is read from the stream, I can't convert it and return
it to the user, so I have to store it somewhere.
On the read side, in fact, buffering is basically invisible
to the caller and is an implementation detail.
On the write side, we have the option of returning a short
write when partial characters are involved. (See the above
question.) Though this may not be particularly convenient
for the caller.
On the write side buffering is not invisible to the caller;
the timing of buffer flushing has important consequences.
Callers will have somewhat different expectations for what
happens on a non-buffered stream when they call write()
depending on whether the stream is blocking or non-blocking.
The naive expectation is:
For a blocking stream, all complete characters will be written
out to the network before the write() operation returns.
For a non-blocking stream, a short write will occur, with
the amount written being the amount of data that fits
into the network buffers.
But the second expectation isn't really implementable with
encoding conversion, since we have to convert the data
before we try writing it.
So, I think, instead the correct behavior for a "non-buffered"
non-blocking stream is that as much data as possible will
be written to the network before returning, but with the
potential for having more written to the internal GIOChannel buffer.
After all, the caller can't really tell if the data is
being buffered in the GIOChannel or the system's network
layer.
With these ideas in mind, I'm not sure that my original idea
of indicating a non-buffered stream with a buffer size of
0 is correct. Instead, I think the correct interface is
probably:
/* A hint as to how much buffering is useful.  GIOChannel
 * may not use exactly this size.
 */
void g_io_channel_set_buffer_size (GIOChannel *channel,
                                   gint        size);

void g_io_channel_set_auto_flush  (GIOChannel *channel,
                                   gboolean    auto_flush);
Where auto_flush is the non-buffered behavior above. When
a conversion is in effect, GLib would make the buffer at
least big enough to store partial characters.
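To make the "caller can't really tell" point concrete, here is a toy
version of that write path (the ToyChannel struct and names are
invented for illustration; no conversion, fixed buffer): data always
lands in the channel buffer first, the auto-flush pushes out whatever
the fd will take, and the caller sees the full count either way.

```c
#include <string.h>
#include <unistd.h>

/* Toy channel for illustration only. */
typedef struct
{
  int    fd;
  char   data[256];
  size_t used;
} ToyChannel;

static ssize_t
toy_channel_write (ToyChannel *ch, const char *buf, size_t len)
{
  ssize_t n;

  if (len > sizeof ch->data - ch->used)
    return -1;                    /* sketch: no buffer growth */
  memcpy (ch->data + ch->used, buf, len);
  ch->used += len;

  /* Auto-flush: push what the fd will take; anything left just stays
   * buffered, indistinguishable (to the caller) from data sitting in
   * the kernel's own network buffers. */
  n = write (ch->fd, ch->data, ch->used);
  if (n > 0)
    {
      memmove (ch->data, ch->data + n, ch->used - n);
      ch->used -= n;
    }

  return (ssize_t) len;           /* the full count, always */
}
```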
* When I proposed g_io_channel_read_line_string(), I actually
meant GString *buffer, not GString **buffer.
The idea is, by using a single GString as the buffer while
reading a series of lines, constant reallocation of the line
buffer is avoided.
g_io_channel_read_line() can then be implemented as:
GString *buffer = g_string_new (NULL);

g_io_channel_read_line_string (channel, buffer);

if (length)
  *length = buffer->len;

if (string)
  *string = g_string_free (buffer, FALSE);
else
  g_string_free (buffer, TRUE);
* g_io_channel_read_line() should probably handle lines longer than
the buffer size. For blocking streams, this is not a problem;
you just keep reading, converting and writing to the output
buffer.
For non-blocking streams, it is a little more tricky, since
you might want to write something like:
gboolean
my_watch (GIOChannel *channel, GIOCondition cond, gpointer data)
{
  char *string;
  int len;
  GError *err = NULL;

  while (g_io_channel_read_line (channel, &string, &len, &err))
    {
      printf ("Received line: %s\n", string);
      g_free (string);
    }

  if (err)
    [...]
}
But this involves keeping an arbitrary-sized input buffer.
And if you can mix calls to g_io_channel_read_line()
and g_io_channel_read_chars() [ though I can't really see why
you would want to do this ], this arbitrary-sized input buffer
needs to be accessible by all the read_* functions.
* GIOChannel is an interface for bidirectional channels, so it
is probably necessary to keep separate read and write buffers;
by only allocating the buffers on demand, the overhead of
this for GIOChannels being used unidirectionally should
be minimal.
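A sketch of that on-demand allocation (struct and names invented here):

```c
#include <stdlib.h>

/* Sketch: both directions get a buffer slot, but memory is only
 * allocated the first time each direction is actually used, so a
 * unidirectional channel pays for just one buffer. */
typedef struct
{
  char  *read_buf;    /* NULL until the first buffered read  */
  char  *write_buf;   /* NULL until the first buffered write */
  size_t buf_size;
} ChannelBuffers;

static char *
ensure_read_buf (ChannelBuffers *cb)
{
  if (cb->read_buf == NULL)
    cb->read_buf = malloc (cb->buf_size);
  return cb->read_buf;
}
```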
* A few formatting issues
- GLib/GTK+ style is 'function (' not 'function('; 'if (', not 'if('
- GLib/GTK+ has the '{' on the next line GNU style, indented one
level.
if (tmperr)
  {
    g_propagate_error (err, tmperr);
    return 0;
  }
* Inline doc comments will be needed before the patch goes
into GTK+.
* There needs to be a call like g_io_channel_pending() to check
if there are any characters buffered, and the IO channel
sources need to check this in their prepare() and check() methods
before checking the underlying stream.
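The shape of that check (names invented here; g_io_channel_pending()
itself is only a proposal above):

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch: the source is ready if characters are already buffered in
 * the channel, even when poll() reported nothing on the fd. */
typedef struct
{
  size_t buffered_chars;  /* what a g_io_channel_pending() would report */
  bool   fd_ready;        /* poll() result for the underlying stream */
} ToySource;

static bool
toy_source_check (const ToySource *s)
{
  return s->buffered_chars > 0 || s->fd_ready;
}
```

Without this, buffered characters would sit unread until the next byte
arrived on the underlying stream and woke the main loop.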