API proposal for charset code conversion at I/O
- From: Owen Taylor <otaylor redhat com>
- To: gtk-devel-list gtk org
- Subject: API proposal for charset code conversion at I/O
- Date: 14 Mar 2001 12:14:32 -0500
This is a forward of some notes from a discussion with HideToshi
Tajima from Sun about what would be necessary to add a useful
streaming charset conversion API to GIOChannel.
The initial idea was that we could just add
g_io_channel_read_and_convert() type functions similar to
g_io_channel_read(), but that didn't quite work out:
====
* GIOChannel has no buffering, no idea of "putting back"
characters, so I'm not sure how a read or write of a partial
character would be handled.
And since there is no idea how much data you are getting,
g_io_channel_read_locale_to_utf8() could end up getting more
data than fits in the output buffer - what do you
do with the excess?
Also, there is no concept of blocking vs. non-blocking
at the GIOChannel level, so when the first read that
g_io_channel_read_locale_to_utf8() attempts doesn't
provide count bytes, do you try again or not?
Either can be wrong.
* For simple programs, reading some number of bytes at a time
into a buffer is a fairly uncommon operation. Common
operations would be:
- Reading a file a line at a time (I've written this
function 100 times...)
- Reading a whole file at once.
* GIOError is not really good enough error information for this
purpose. Problems include:
- There is no method of getting a string for display
to the user like with GError. (GIOError predates
GError.)
- How do you deal with a single mis-encoded character in
a large file? Simply refusing to load the file is
not useful to the user. You really want to:
- provide some fallback [ g_convert_with_fallback()
does this for string conversion, though not
beautifully ]
- Provide some error indication so that the app
can display a dialog to the user indicating
that the file couldn't be read/written completely.
====
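To make the partial-character problem in the first point concrete, here is a minimal sketch in plain C. It uses UTF-8 as a stand-in for the locale encoding, and incomplete_utf8_tail() is a hypothetical helper, not an existing or proposed GLib function. It reports how many bytes at the end of a chunk would have to be held back until the next read, which is exactly what an unbuffered channel has nowhere to store.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not GLib API): return how many bytes at the end
 * of buf[0..len) belong to a UTF-8 character whose remaining bytes have
 * not arrived yet. A buffered channel would keep these bytes and prepend
 * them to the next read. */
size_t
incomplete_utf8_tail (const unsigned char *buf, size_t len)
{
  size_t back, need;

  assert (buf != NULL || len == 0);

  /* Walk backwards over continuation bytes (10xxxxxx) to the lead byte. */
  for (back = 1; back <= 4 && back <= len; back++)
    {
      unsigned char c = buf[len - back];

      if ((c & 0xC0) == 0x80)
        continue;                 /* continuation byte, keep looking */

      if (c < 0x80)
        return 0;                 /* ASCII, nothing pending */
      else if ((c & 0xE0) == 0xC0)
        need = 2;
      else if ((c & 0xF0) == 0xE0)
        need = 3;
      else if ((c & 0xF8) == 0xF0)
        need = 4;
      else
        return 0;                 /* invalid lead byte, left for the converter */

      /* The character is incomplete if fewer bytes arrived than it needs. */
      return back < need ? back : 0;
    }

  return 0;                       /* only continuation bytes seen: invalid */
}
```

A read that ends in the first two bytes of a three-byte character would report 2 here, and those two bytes are the ones that need "putting back".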
Now, long-term I'd like to see GIOChannel converted into something
much nicer and more full-featured, along the lines of the
Java character/byte streams, or Qt's QTextStream.
I spent some time this weekend working on making two such
abstractions in C -- the CVS source code's 'struct buffer' and
OpenSSL's 'BIO' (don't ask) -- work together, so I have some definite
ideas about how that should work. But I'm sure that a full-featured
rework of GIOChannel is not feasible for GLib-2.0.
Though I'm worried that piecemeal additions to GIOChannel will compromise
our ability to fix it right later, here is a rough proposal for what I
think is about the minimum set of changes to get charset conversion
working and useful.
They aren't small, though I think they would be pretty easy/fast to
implement.
GIOChannel needs to have buffering, and a representation
of blocking/non-blocking IO.
/* Set the buffer size. 0 == unbuffered. -1 == pick a good
* size.
*/
void g_io_channel_set_buffer_size (GIOChannel *channel,
gint size);
gint g_io_channel_get_buffer_size (GIOChannel *channel);
void g_io_channel_set_blocking (GIOChannel *channel,
gboolean blocking);
gboolean g_io_channel_get_blocking (GIOChannel *channel);
void g_io_channel_flush (GIOChannel *channel);
Buffering can all be done in the generic GIOChannel code,
but blocking/non-blocking requires an additional change to
GIOChannel.
struct _GIOFuncs
{
[...]
void (*io_flush) (GIOChannel *channel);
/* Return value indicates whether operation was
* successful.
*/
gboolean (*io_set_blocking) (GIOChannel *channel,
gboolean blocking);
};
Then you need the functions to set the encoding. I don't think generic
conversions [between arbitrary encodings, not just encoding <=> UTF-8]
are going to be used enough to be worth adding initially.
/* Set the encoding for the input/output of the channel.
* The internal encoding is always UTF-8. The channel
* must be buffered.
*/
gboolean g_io_channel_set_encoding (GIOChannel *channel,
const char *encoding,
GError **error);
/* Sets whether fallback should be done as in g_convert_with_fallback
*/
gboolean g_io_channel_set_use_fallback (GIOChannel *channel,
gboolean use_fallback,
gchar *fallback);
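As a rough picture of what the fallback setting is meant to buy, here is a self-contained sketch for the write direction only. It assumes a UTF-8 source and a plain ASCII target, and to_ascii_with_fallback() is a hypothetical name, not part of the proposal. Each character that cannot be represented is replaced by the fallback string, and the count of replacements is returned so the application can tell the user the data was not written faithfully.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not GLib API): copy the UTF-8 string `in` to `out`
 * as ASCII, substituting `fallback` for every non-ASCII character.
 * Returns the number of substitutions made, or -1 if `out` is too small. */
int
to_ascii_with_fallback (const char *in, char *out, size_t out_size,
                        const char *fallback)
{
  size_t o = 0, flen;
  int replaced = 0;

  assert (in != NULL && out != NULL && fallback != NULL);
  flen = strlen (fallback);

  while (*in)
    {
      unsigned char c = (unsigned char) *in++;

      if (c < 0x80)
        {
          if (o + 1 >= out_size)
            return -1;            /* no room for the byte plus the NUL */
          out[o++] = (char) c;
        }
      else
        {
          /* Skip the continuation bytes of this multi-byte character. */
          while (((unsigned char) *in & 0xC0) == 0x80)
            in++;
          if (o + flen >= out_size)
            return -1;
          memcpy (out + o, fallback, flen);
          o += flen;
          replaced++;
        }
    }

  if (out_size == 0)
    return -1;
  out[o] = '\0';
  return replaced;
}
```

A non-zero return is the kind of "some characters were lost" signal that a single error code cannot carry.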
Then, you need read-line and read-contents functions. [ There is
some question about what line terminators should consist of.
\n? \n \r\n \r? pango_find_paragraph_boundary()? ]
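To show what is at stake in the terminator question, here is a small plain C sketch of one possible policy: accept \n, \r\n and \r. The function find_line_end() is hypothetical and only scans a byte buffer; it is not a statement of what GIOChannel should settle on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not GLib API): find the end of the first line in
 * buf[0..len), accepting "\n", "\r\n" or a lone "\r" as terminator.
 * Returns the offset just past the terminator and stores the terminator
 * length in *term_len, or returns 0 if no complete line is available. */
size_t
find_line_end (const char *buf, size_t len, size_t *term_len)
{
  size_t i;

  assert (buf != NULL && term_len != NULL);

  for (i = 0; i < len; i++)
    {
      if (buf[i] == '\n')
        {
          *term_len = 1;
          return i + 1;
        }
      if (buf[i] == '\r')
        {
          /* A '\r' at the very end of the buffer is ambiguous: the
           * matching '\n' may arrive with the next read. */
          if (i + 1 == len)
            break;
          *term_len = (buf[i + 1] == '\n') ? 2 : 1;
          return i + *term_len;
        }
    }

  *term_len = 0;
  return 0;
}
```

Note that accepting a lone \r means a buffer ending in \r cannot be split yet, which is one more place where the channel needs buffering.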
/* Read a line, including the terminating character(s)
* from a GIOChannel into a newly allocated string.
* FALSE return indicates Error, EOF or (for non-blocking channel)
* no data available. In case of error, error will be set.
* g_io_channel_eof() can be used to distinguish no-data
* from EOF. *str_return will contain allocated memory iff
* the return is TRUE.
*/
gboolean g_io_channel_read_line (GIOChannel *channel,
gchar **str_return,
gint *length,
GError **error);
/* Read a line from a GIOChannel, using a GString as a buffer
*/
gboolean g_io_channel_read_line_string (GIOChannel *channel,
GString *buffer,
GError **error);
/* Read all the remaining data from the file. Parameters as
* for g_io_channel_read_line.
*/
gboolean g_io_channel_read_to_end (GIOChannel *channel,
gchar **str_return,
gint *length,
GError **error);
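For comparison, this is roughly what the read-to-end convention looks like when written against plain stdio rather than GIOChannel. read_to_end() here is a hypothetical stand-alone function with no charset conversion and no GError; it only illustrates the "allocated result only on success" rule described above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stdio analogue of the proposed call (not GLib API): read
 * everything remaining in `fp` into a newly allocated, NUL-terminated
 * buffer. Returns 1 on success and 0 on error; *str_return and *length
 * are only set when the return value is 1. */
int
read_to_end (FILE *fp, char **str_return, size_t *length)
{
  size_t cap = 4096, used = 0, n;
  char *buf, *tmp;

  assert (fp != NULL && str_return != NULL && length != NULL);

  buf = malloc (cap);
  if (!buf)
    return 0;

  /* Always leave one byte free for the terminating NUL. */
  while ((n = fread (buf + used, 1, cap - used - 1, fp)) > 0)
    {
      used += n;
      if (cap - used - 1 == 0)
        {
          /* Buffer is full: double it before the next read. */
          tmp = realloc (buf, cap * 2);
          if (!tmp)
            {
              free (buf);
              return 0;
            }
          buf = tmp;
          cap *= 2;
        }
    }

  if (ferror (fp))
    {
      free (buf);
      return 0;
    }

  buf[used] = '\0';
  *str_return = buf;
  *length = used;
  return 1;
}
```

The caller owns the returned buffer and releases it with free().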
You probably want replacements for read/write that conform
to the above conventions.
/* Replacements for g_io_channel_read/write to match the
* above API. Return the number of bytes read; 0 on
* EOF or no data; -1 on error.
*/
gint g_io_channel_read_chars (GIOChannel *channel,
gchar *buf,
guint count,
GError **error);
gint g_io_channel_write_chars (GIOChannel *channel,
gchar *buf,
guint count,
GError **error);
And finally, to make it convenient, you probably want
to be able to create a GIOChannel from a file directly.
GIOChannel *g_io_channel_new_file (const gchar *filename,
const gchar *mode,
GError **error);
So, reading a file line-by-line would look like:
============
gboolean
process_file (const char *filename)
{
  GError *error = NULL;
  GIOChannel *in = NULL;
  GString *buffer = g_string_new (NULL);
  in = g_io_channel_new_file (filename, "r", &error);
  if (error)
    goto out;
  if (!g_io_channel_set_encoding (in, "EUC-JP", &error))
    goto out;
  while (g_io_channel_read_line_string (in, buffer, &error))
    {
      [ process line ]
    }
 out:
  g_string_free (buffer, TRUE);
  if (in)
    g_io_channel_close (in);
  if (error)
    {
      /* Report the failure with the GError message, then release it. */
      g_message ("Error reading '%s': %s", filename, error->message);
      g_clear_error (&error);
      return FALSE;
    }
  return TRUE;
}
==========
Which, I think, is fairly nice, except for the length of g_io_channel_
to type.