API proposal for charset code conversion at I/O



This is a forward of some notes from a discussion with HideToshi
Tajima from Sun about what would be necessary to add a useful
streaming charset conversion API to GIOChannel.

The initial idea was that we could just add
g_io_channel_read_and_convert() type functions similar to
g_io_channel_read(), but that didn't quite work out:

====

 * GIOChannel has no buffering, no idea of "putting back"
   characters, so I'm not sure how a read or write of a partial 
   character would be handled.

   And since there is no idea how much data you are getting,
   g_io_channel_read_locale_to_utf8() could end up getting more 
   data than fits in the output buffer - what do you 
   do with the excess?

   Also, there is no concept of blocking vs. non-blocking
   at the GIOChannel level, so when the first read that
   g_io_channel_read_locale_to_utf8() attempts doesn't
   provide count bytes, do you try again or not?
   Either can be wrong.

 * For simple programs, reading some number of bytes at a time
   into a buffer is a fairly uncommon operation. Common
   operations would be:

    - Reading a file a line at a time (I've written this
      function 100 times...)

    - Reading a whole file at once.

 * GIOError is not really good enough error information for this
   purpose. Problems include:

    - There is no method of getting a string for display
      to the user like with GError. (GIOError predates
      GError.)

    - How do you deal with a single mis-encoded character in
      a large file? Simply refusing to load the file is
      not useful to the user. You really want to:

       - provide some fallback [ g_convert_with_fallback()
         does this for string conversion, though not
         beautifully ]
 
       - Provide some error indication so that the app
         can display a dialog to the user indicating 
         that the file couldn't be read/written completely.
====
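
To make the partial-character problem concrete, here is a minimal
sketch of the bookkeeping a buffered channel would need after each raw
read. It is only an illustration: complete_prefix_len() is a made-up
helper, not part of GLib, and it uses UTF-8 as a stand-in for whatever
multibyte encoding the file is in. It looks only at lead bytes and
does no validation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a GLib function.
 *
 * Return how many of the first len bytes of buf form whole UTF-8
 * characters. Any bytes past that point are the start of a character
 * whose remainder has not arrived yet; the channel has to keep them
 * in its buffer until the next read.
 */
size_t
complete_prefix_len (const unsigned char *buf, size_t len)
{
  size_t i = 0;

  assert (buf != NULL || len == 0);

  while (i < len)
    {
      unsigned char c = buf[i];
      size_t need;

      if (c < 0x80)
        need = 1;               /* ASCII */
      else if ((c & 0xE0) == 0xC0)
        need = 2;               /* two-byte sequence */
      else if ((c & 0xF0) == 0xE0)
        need = 3;               /* three-byte sequence */
      else if ((c & 0xF8) == 0xF0)
        need = 4;               /* four-byte sequence */
      else
        need = 1;               /* invalid lead byte: pass it on and
                                 * let the converter report it */

      if (len - i < need)
        break;                  /* character is cut off by the read */

      i += need;
    }

  return i;
}
```

Without a buffer in GIOChannel there is nowhere to keep the bytes
after the returned offset, which is the first problem in the notes
above.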

Now, long-term I'd like to see GIOChannel converted into something
much nicer and more full-featured, along the lines of the
Java character/byte streams or Qt's QTextStream.

I spent some time this weekend working on making two such
abstractions in C -- the CVS source code's 'struct buffer' and
OpenSSL's 'BIO' (don't ask) -- work together so I have some definite
ideas about how that should work. But I'm sure that a full-featured
rework of GIOChannel is not feasible for GLib-2.0.

Though I'm worried that piecemeal additions to GIOChannel will compromise
our ability to fix it right later, here is a rough proposal for what I
think is about the minimum set of changes to get charset conversion
working and useful.

They aren't small, though I think they would be pretty easy/fast to
implement.


GIOChannel needs to have buffering, and a representation
of blocking/non-blocking IO.

   /* Set the buffer size. 0 == unbuffered. -1 == pick a good
    * size.
    */
   void g_io_channel_set_buffer_size (GIOChannel *channel,
				      gint        size);
   gint g_io_channel_get_buffer_size (GIOChannel *channel);

   void     g_io_channel_set_blocking (GIOChannel *channel,
				       gboolean    blocking);
   gboolean g_io_channel_get_blocking (GIOChannel *channel);

   void g_io_channel_flush (GIOChannel *channel);


Buffering can all be done in the generic GIOChannel code,
but blocking/non-blocking requires an additional change to
GIOChannel.

   struct _GIOFuncs
   {
     [...]

     void      (*io_flush)        (GIOChannel   *channel);

     /* Return value indicates whether operation was
      * successful. 
      */
     gboolean  (*io_set_blocking) (GIOChannel   *channel,
				   gboolean      blocking);
   };
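
As an illustration of what a backend might put behind io_set_blocking,
a Unix file-descriptor channel could toggle O_NONBLOCK with fcntl().
This is only a sketch under assumptions: unix_fd_set_blocking() is a
made-up name, and it takes a raw file descriptor and plain ints rather
than a GIOChannel and gboolean, so that it stands alone without GLib.

```c
#include <assert.h>
#include <fcntl.h>

/* Hypothetical backend for the proposed io_set_blocking slot, for a
 * channel that wraps a Unix file descriptor.
 *
 * Returns 1 if the mode was changed successfully, 0 otherwise,
 * mirroring the gboolean return in the proposal.
 */
int
unix_fd_set_blocking (int fd, int blocking)
{
  int flags;

  assert (fd >= 0);

  /* Fetch the current file status flags */
  flags = fcntl (fd, F_GETFL);
  if (flags == -1)
    return 0;

  if (blocking)
    flags &= ~O_NONBLOCK;
  else
    flags |= O_NONBLOCK;

  /* Write the updated flags back */
  return fcntl (fd, F_SETFL, flags) != -1;
}
```

Other channel types may have no way to change the mode at all, which
is why the slot reports success or failure.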

Then you need the functions to set the encoding. I don't think generic
conversions [between arbitrary encodings, not just encoding <=> UTF-8]
are going to be used enough to be worth adding initially.

   /* Set the encoding for the input/output of the channel.
    * The internal encoding is always UTF-8. The channel
    * must be buffered.
    */
   gboolean g_io_channel_set_encoding (GIOChannel  *channel,
				       const char  *encoding,
				       GError     **error);

   /* Sets whether fallback should be done as in g_convert_with_fallback
    */
   gboolean g_io_channel_set_use_fallback (GIOChannel  *channel,
				           gboolean     use_fallback,
				           gchar       *fallback);

Then, you need read-line and read-contents functions. [ There is
some question about what line terminators should consist of.
\n? \n \r\n \r? pango_find_paragraph_boundary()? ]

   /* Read a line, including the terminating character(s)
    * from a GIOChannel into a newly allocated string.
    * FALSE return indicates Error, EOF or (for non-blocking channel)
    * no data available. In case of error, error will be set.
    * g_io_channel_eof() can be used to distinguish no-data
    * from EOF. *str_return will point to newly allocated memory
    * iff the return is TRUE.
    */
   gboolean g_io_channel_read_line (GIOChannel   *channel,
				    gchar       **str_return,
				    gint         *length,
				    GError      **error);

   /* Read a line from a GIOChannel, using a GString as a buffer
    */
   gboolean g_io_channel_read_line_string (GIOChannel   *channel,
					   GString      *buffer,
					   GError      **error);

   /* Read all the remaining data from the file. Parameters as
    * for g_io_channel_read_line.
    */
   gboolean g_io_channel_read_to_end (GIOChannel   *channel,
				      gchar       **str_return,
				      gint         *length,
				      GError      **error);
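
On the line terminator question, here is a minimal sketch of one
possible answer: accept "\n", "\r\n" and a lone "\r". The function
name find_line_end() and its at_eof parameter are made up for
illustration, and the sketch does not attempt what
pango_find_paragraph_boundary() does for Unicode paragraph separators.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a GLib function.
 *
 * Look for the first complete line in the len bytes at buf and
 * return its length including the terminator. Returns 0 if no
 * complete line is buffered yet.
 *
 * at_eof says whether more data can still arrive. It matters for a
 * trailing '\r', which might be the first half of "\r\n", and for a
 * final line with no terminator.
 */
size_t
find_line_end (const char *buf, size_t len, int at_eof)
{
  size_t i;

  assert (buf != NULL || len == 0);

  for (i = 0; i < len; i++)
    {
      if (buf[i] == '\n')
        return i + 1;

      if (buf[i] == '\r')
        {
          if (i + 1 < len)
            return buf[i + 1] == '\n' ? i + 2 : i + 1;

          /* '\r' is the last buffered byte: only final at EOF */
          return at_eof ? i + 1 : 0;
        }
    }

  /* No terminator found: at EOF the rest is the last line */
  return at_eof ? len : 0;
}
```

The trailing '\r' case is another place where buffering and the
blocking mode interact: on a non-blocking channel the caller has to
come back later rather than guess.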

You probably want replacements for read/write that conform
to the above conventions. 

   /* Replacements for g_io_channel_read/write to match the
    * above API. Return the number of bytes read; 0 on
    * EOF or no data; -1 on error.
    */
   gint g_io_channel_read_chars (GIOChannel    *channel, 
				 gchar         *buf, 
				 guint          count,
				 GError      **error);
   gint g_io_channel_write_chars (GIOChannel    *channel, 
				  gchar         *buf, 
				  guint          count,
				  GError      **error);

And finally, to make it convenient, you probably want
to be able to create a GIOChannel from a file directly.

   GIOChannel *g_io_channel_new_file (const gchar *filename,
				      const gchar *mode,
				      GError     **error);

So, reading a file line-by-line would look like:

============
gboolean
process_file (const char *filename)
{
  GError *error = NULL;
  GIOChannel *in;
  GString *buffer = g_string_new (NULL);

  in = g_io_channel_new_file (filename, "r", &error);
  if (error)
    goto out;

  if (!g_io_channel_set_encoding (in, "EUC-JP", &error))
    goto out;

  /* Returns FALSE on EOF as well as on error; error is set only
   * in the error case.
   */
  while (g_io_channel_read_line_string (in, buffer, &error))
    {
      /* [ process line; the UTF-8 text is in buffer->str ] */
    }

 out:
  g_string_free (buffer, TRUE);

  if (in)
    g_io_channel_close (in);

  if (error)
    {
      g_message ("Error reading '%s': %s", filename, error->message);
      g_clear_error (&error);

      return FALSE;
    }

  return TRUE;
}
==========

Which, I think, is fairly nice, except for the length of g_io_channel_
to type.




