Re: Fix for "Need encoding conversion for GIOChannel"



HideToshi Tajima <tajima@happy.eng.sun.com> writes:

> I've posted a patch for one of the remaining 2.0 API bugs:
> http://bugzilla.gnome.org/show_bug.cgi?id=52811, whose short summary
> is "Need encoding conversion for GIOChannel".
> 
> The patch should give GIOChannel class the features listed below:
> 
>        * charset code conversion at read/write, read/write by chars
>        * buffering 
>        * blocking vs non-blocking (controlled by *set_flags/get_flags)
>        * Reading a file a line at a time
>        * Reading a whole file at once
>        * error handling with more error feedback
> 
> A simple testcase is attached to the above bugzilla report. I've run
> the test with this testcase on two Unix platforms - Red Hat Linux 6.2
> on Intel and Solaris 8 on SPARC - and the patch seems to work well
> on both. I have to say that the win32 backend code is untested (more
> precisely, unimplemented), so I would like to get someone who knows
> win32 to have a look, test, and complete the code.

Reading over your patch, it's clear to me that this is a much
harder problem than I realized when I first wrote down the API.

There are various issues about the exact semantics of the functions
that seem minor at first glance, but have fairly large consequences
for the internal implementation.

So, most of this mail is just a list of places where your code
did something different from what I expected, or where I wasn't
sure what I expected.

I think once we agree on exactly what the behavior should be for
these cases, making the code match that shouldn't be hard, even
if in a few cases it may be a little more complicated than I
expected initially.

So, my apologies if the following has more questions in it than
answers.

Regards,
                                        Owen

 * One issue that came to mind reading your code and thinking about
   the problems is the handling of partial characters:

    - If you call read() with a buffer of length 100, and the first
      60 characters take 99 bytes and the 61st takes 2 bytes, does it
      put half of the last character in the result buffer?

    - If you call write() with a buffer of length 100, with 66
      complete characters taking 99 bytes and 1 incomplete character, 
      does it return a length written of 99, or a length written
      of 100 and store the partial character internally?

   If the answer is that conversion should take into account partial
   characters, then the question is, what should happen when the
   encoding for the stream is UTF-8? What should happen when it is
   NULL?
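
   To make the read() case concrete, here is a minimal sketch (for
   the UTF-8 case) of finding the complete-character prefix of a
   buffer. The helper name is made up, and it doesn't distinguish a
   truncated trailing character from genuinely invalid data, which a
   real implementation would need to do:

    /* How many leading bytes of 'buf' form complete UTF-8
     * characters?  The remainder (at most one truncated character)
     * would be stashed in the channel until more bytes arrive.
     */
    static gsize
    complete_prefix_len (const gchar *buf, gsize len)
    {
      const gchar *end;

      if (g_utf8_validate (buf, len, &end))
        return len;

      return end - buf;
    }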

 * I think this code probably should be using iconv() directly
   instead of going through g_convert(). The advantage of using
   iconv() is that we can convert the number of characters that
   fit into a fixed size output buffer, and we don't have to malloc
   intermediate buffers; also, using one iconv_t converter for
   the entire stream is potentially a fairly big efficiency win.
  
   A final reason for this is that if you use iconv() directly,
   then an encoding like UTF-16 will handle the BOM correctly,
   since it will only reset the converter once for the stream,
   not for every line.

   This obviously does pose a problem for doing 'fallbacks'
   in the style of g_convert_with_fallback().
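
   As a rough sketch of the kind of conversion step I have in mind -
   names are illustrative, and error reporting is elided - note that
   E2BIG and EINVAL are normal streaming conditions for iconv(), not
   errors:

    #include <iconv.h>
    #include <errno.h>

    /* One persistent iconv_t per channel; convert as much input as
     * fits into the caller's output buffer, with no intermediate
     * allocations.
     */
    static size_t
    convert_step (iconv_t cd,
                  char **inbuf,  size_t *inleft,
                  char **outbuf, size_t *outleft)
    {
      size_t result = iconv (cd, inbuf, inleft, outbuf, outleft);

      if (result == (size_t) -1)
        {
          if (errno == E2BIG)
            ;  /* output buffer full: return what we have */
          else if (errno == EINVAL)
            ;  /* partial character at end of input: stash it */
          else if (errno == EILSEQ)
            ;  /* invalid sequence: a real conversion error */
        }

      return result;
    }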

 * Changing the line terminator depending on the platform is a little
   dubious - files don't nicely stick to their own platforms.

   I think what I'd do is have an enum:

    typedef enum
    {
      G_IO_LINE_AUTO,
      G_IO_LINE_DOS,       /* \r\n */
      G_IO_LINE_UNIX,      /* \n */
      G_IO_LINE_MACINTOSH  /* \r */
    } GIOLineTerminator;

   And have the default be AUTO - which means separating on
   any of \r, \n, or \r\n. Though this requires one character of
   read-ahead, hmmm, which means that using \r termination
   with AUTO is unsuitable for network-protocol-type applications
   where responses to complete lines are needed.

   Related to this is the fact that we can't assume the input
   encoding will have literal \n and \r bytes in it; I'd
   certainly like to be able to handle UTF-16 as a file encoding.
   This means that conversion needs to be done _before_ looking
   for terminators, not after.
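
   A sketch of what AUTO termination might look like, scanning the
   already-converted UTF-8 data; the one-character read-ahead problem
   shows up in the \r-at-end-of-buffer case:

    /* Length of the line terminator starting at buf[i], or 0 if
     * buf[i] doesn't start one.  If \r is the last byte we have
     * buffered, we can't yet tell \r from \r\n and must read ahead.
     */
    static gsize
    terminator_length (const gchar *buf, gsize len, gsize i)
    {
      if (buf[i] == '\n')
        return 1;

      if (buf[i] == '\r')
        {
          if (i + 1 == len)
            return 0;  /* need one more character to decide */
          return (buf[i + 1] == '\n') ? 2 : 1;
        }

      return 0;
    }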
   
 * I'm not sure that converting the virtual functions in GIOFunc
   to take GError * arguments is the right approach. The problem
   is that GError * should only be used for "real" errors - it
   is fairly expensive to duplicate a string, allocate memory
   for the GError and so forth.
   
   And read() and write() have two non-error conditions that are
   signalled by an "errno" - EAGAIN and EINTR. With the
   "io_channel_read_chars()" type signature, there is a clear
   mapping for EAGAIN that doesn't involve setting the GError * - 
   return 0 bytes read with err == NULL.

   But there is no corresponding mapping for EINTR. One approach
   to handling EINTR would be to make g_io_channel_read()/write()
   simply always restart on EINTR; that does affect people
   trying to do, say, read/write timeouts with SIGALRM. But
   I don't know if anybody really does that with GIOChannel.   

 * When a converter is involved, "no buffering" may not make sense.
   This is certainly true when reading. If a partial character
   is read from the stream, I can't convert it and return
   it to the user, so I have to store it somewhere.

   On the read side, in fact, buffering is basically invisible
   to the caller and is an implementation detail.

   On the write side, we have the option of returning a short
   write when partial characters are involved. (See the above
   question.) Though this may not be particularly convenient
   for the caller.

   On the write side buffering is not invisible to the caller;
   the timing of buffer flushing has important consequences.

   Callers will have somewhat different expectations for what
   happens on a non-buffered stream when they call write()
   depending on whether the stream is blocking or non-blocking.
   The naive expectation is:

    For a blocking stream, all complete characters will be written
    out to the network before the write() operation returns.

    For a non-blocking stream, a short write will occur, with
    the amount written being the amount of data that fits
    into the network buffers.

   But the second expectation isn't really implementable with
   encoding conversion, since we have to convert the data
   before we try writing it.

   So, I think, instead the correct behavior for a "non-buffered" 
   non-blocking stream is that as much data as possible will
   be written to the network before returning, but with the
   potential for having more written to the internal GIOChannel buffer.
   After all, the caller can't really tell if the data is
   being buffered in the GIOChannel or the system's network
   layer.

   With these ideas in mind, I'm not sure that my original idea
   of indicating a non-buffered stream with a buffer size of
   0 is correct. Instead, I think the correct interface is 
   probably:

    /* A hint as to how much buffering is useful. GIOChannel
     * may not use exactly this size.
     */
    void g_io_channel_set_buffer_size (GIOChannel *channel,
                                       gint        size);

    void g_io_channel_set_auto_flush  (GIOChannel *channel,
                                       gboolean    auto_flush);

   Where auto_flush is the non-buffered behavior above. When a
   conversion is in effect, GLib would make the buffer at least
   big enough to store partial characters.
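
   So, hypothetically, the "non-buffered" non-blocking behavior above
   would be spelled something like:

    g_io_channel_set_buffer_size (channel, 0);    /* "as small as works" */
    g_io_channel_set_auto_flush  (channel, TRUE); /* flush on each write */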

 * When I proposed g_io_channel_read_line_string(), I actually
   meant GString *buffer, not GString **buffer. 

   The idea is, by using a single GString as the buffer while
   reading a series of lines, constant reallocation of the line
   buffer is avoided.

   g_io_channel_read_line() can then be implemented as:

    GString *buffer = g_string_new (NULL);
    g_io_channel_read_line_string (channel, buffer);

    if (length)
      *length = buffer->len;
    if (string)
      *string = g_string_free (buffer, FALSE);
    else
      g_string_free (buffer, TRUE);
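
   And the pattern that motivates read_line_string() - one GString
   reused across a whole file - would look something like this,
   assuming a FALSE return at EOF (process_line() is just a
   stand-in):

    GString *buffer = g_string_new (NULL);

    while (g_io_channel_read_line_string (channel, buffer))
      process_line (buffer->str, buffer->len);

    g_string_free (buffer, TRUE);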

 * g_io_channel_read_line() should probably handle lines longer than
   the buffer size. For blocking streams, this is not a problem;
   you just keep reading, converting and writing to the output 
   buffer.

   For non-blocking streams, it is a little more tricky, since 
   you might want to write something like:

    gboolean
    my_watch (GIOChannel *channel, GIOCondition cond, gpointer data)
    {
      gchar *string;
      gsize len;
      GError *err = NULL;

      while (g_io_channel_read_line (channel, &string, &len, &err))
        {
          printf ("Received line: %s\n", string);
          g_free (string);
        }

      if (err)
        [...]

      return TRUE;
    }

   But this involves keeping an arbitrary-sized input buffer.
 
   And if you can mix calls to g_io_channel_read_line()
   and g_io_channel_read_chars() [ though I can't really see why
   you would want to do this ], this arbitrary-sized input buffer
   needs to be accessible by all the read_* functions.

 * GIOChannel is an interface for bidirectional channels, so it 
   is probably necessary to keep separate read and write buffers;
   by only allocating the buffers on demand, the overhead of 
   this for GIOChannels being used unidirectionally should
   be minimal.
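
   Roughly, with illustrative field names:

    struct _GIOChannel
    {
      /* ... existing fields ... */
      GString *read_buf;   /* created on first buffered read */
      GString *write_buf;  /* created on first buffered write */
    };

    /* in the read path: */
    if (!channel->read_buf)
      channel->read_buf = g_string_sized_new (channel->buf_size);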

 * A few formatting issues 

   - GLib/GTK+ style is 'function (' not 'function('; 'if (' not 'if('

   - GLib/GTK+ has the '{' on the next line, GNU style, indented one
     level:
   
    if (tmperr)
      {
        g_propagate_error (err, tmperr);
        return 0;
      }

 * Inline doc comments will be needed before the patch goes
   into GTK+.

 * There needs to be a call like g_io_channel_pending() to check
   if there are any characters buffered, and the IO channel
   sources need to check this in their prepare() and check() methods
   before checking the underlying stream.
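
   Sketched below; GIOWatch and its fields are made-up names for
   whatever the channel source's per-source data ends up being:

    static gboolean
    g_io_watch_check (GSource *source)
    {
      GIOWatch *watch = (GIOWatch *) source;

      /* Buffered characters make the channel readable even if
       * the underlying fd doesn't poll readable.
       */
      if ((watch->condition & G_IO_IN) &&
          g_io_channel_pending (watch->channel))
        return TRUE;

      return (watch->pollfd.revents & watch->condition) != 0;
    }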



