Re: Notes when applying IO channel patch



Ron Steinke <rsteinke w-link net> writes:

> Just a few notes on your comments. I'll try to get a patch on the
> API/urgent-de-break-ifying stuff ASAP.
> 
> 
> > * Empty strings are not the same as NULL - if string->len == 0,
> >   made g_io_channel_read_line / g_io_channel_read_to_end()
> >   return an empty string,
> 
> This is correct for g_io_channel_read_line. In general,
> g_io_channel_read_to_end does not return a string
> (as it can be used to read binary data with the default
> encoding), so it should return NULL when 0 bytes are read.

Generally, it's a nice courtesy to nul-terminate all buffers,
even if they can also include embedded nuls. This means
that people who aren't properly handling embedded nuls
at least don't segfault. For this reason, a 0-byte
read-to-end should return a 1-byte allocated buffer.
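
A minimal sketch of what I mean (the out-parameter names here are
illustrative, not necessarily the patch's):

 /* Even a zero-length read-to-end hands back an allocated,
  * nul-terminated buffer, so strlen() etc. won't crash.
  * 'str_return' and 'length' are hypothetical out-parameters. */
 *str_return = g_malloc (length + 1);
 memcpy (*str_return, buf, length);
 (*str_return)[length] = '\0';  /* a 1-byte buffer when length == 0 */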

It's actually an interesting question as to whether
g_io_channel_read_line[_string]() should return embedded
nuls or not:

 For: the GNU coding standards require this, and it is
      nice to gracefully handle files with a bit of 
      junk in them.
 Against: Most GLib/GTK+/Pango functions taking a length
          interpret the length of the string as
          MIN (strlen (string), len), so programs will
          typically be broken for files including embedded
          nuls anyway; and the way you are doing it now,
          at least they get an excess line break at that
          location rather than dropping the rest of the line.
 
I suppose we can later add an IO channel flag to control handling
of embedded nuls and leave things as is for now.
 
> > * Removed G_IO_CHANNEL_DEFAULT_LINE_TERM ... in my opinion,
> >   files with the wrong termination on the wrong platform
> >   are common enough that it is better to make a single
> >   cross-platform choice and encourage people to autodetect
> >   if they don't have a definite idea.
> 
> This isn't for reading, it's for writing, for example:
> 
> g_io_channel_write_chars (channel, string, -1, &bytes, NULL);
> g_io_channel_write_chars (channel, G_IO_CHANNEL_DEFAULT_LINE_TERM, -1, &bytes, NULL);
> 
> will write a string and add a sensible choice for the line terminator.
> I'd like to reinstate this, but if you don't feel it's appropriate
> that's okay.

This relates to the O_BINARY question below -- if files on win32
are not opened in binary mode, then "\n" will actually result
in "\r\n" in the output file. (And vice versa.)

Even if we go with binary mode, unless we honestly expect people
to do:

 g_io_channel_write_chars (channel, "Hi there" G_IO_CHANNEL_DEFAULT_LINE_TERM, -1, &bytes, NULL);

Instead of "Hi there\n", then I don't see the point in adding such a
#define. And I really don't think they are going to do that --
though if we think they are, we should be a bit kinder and make it
G_DEFAULT_LINE_TERM. I don't think adding a constant just to make
people feel guilty every time they write "\n" is a good idea.
 
> > * I'm not happy with the mode specification for g_io_channel_new_file().
> >   I'm thinking that it would be better to either have:
> >
> >    g_io_channel_new_file (fname, "w+")
> 
> This is the way Toshi was doing it originally. I'll change it back.

After having to write G_IO_FILE_MODE_READ once today, I'm
already quite convinced that short and sweet is good. 
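
For instance, the common case then reads naturally (a sketch,
assuming the GError out-parameter the rest of the new API uses):

 GError *error = NULL;
 GIOChannel *channel;

 channel = g_io_channel_new_file ("notes.txt", "w+", &error);
 if (!channel)
   {
     g_printerr ("opening notes.txt failed: %s\n", error->message);
     g_error_free (error);
   }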
 
> > * What should be the handling of EPIPE: is it really an
> >   error not a status? (should EOF be recycled?). Would
> >   it make sense to use send() instead of write to 
> >   avoid SIGPIPE? Should that be optional?
> 
> Maybe define G_IO_STATUS_ILLEGAL_SEEK? It could be defined to
> be the same as G_IO_STATUS_EOF, but that seems unnecessary.

Keeping the number of members in GIOStatus small seems worthwhile
to me, so I'd rather not add G_IO_STATUS_ILLEGAL_SEEK and instead
report an illegal seek as an error.
 
> > * The set of errors in GIOChannelError seems a bit random - 
> >   contains obscure errors like EPERM, ENOLCK, EDEADLK, but certainly
> >   not nearly every error that could occur from open().
> 
> I think when I wrote this it was only being used by
> read/write/seek, and I added it to g_io_channel_new_file() later.
> I'll need to go through open and add all the errors. Are there
> any particular errors you wanted removed?

Why don't we just use GFileError for g_io_channel_new_file()?
That already has all the errnos we need for opening files,
and there isn't much point in duplicating them
in GIOChannelError.

Because every OS has different errnos, we can't really
expect to be comprehensive. We should just try to provide
enumeration values for the ones you might want to check
for most commonly.

The Unix specification mentions, for read:

 EAGAIN
 EBADF
 EBADMSG
 EINTR
 EINVAL
 EIO
 EISDIR
 EOVERFLOW
 ENXIO

Additionally for write:

 EFBIG
 ENOSPC
 EPIPE
 ERANGE

Additionally for seek:

 ESPIPE

If we remove ERANGE and EBADMSG as being STREAMS things, say that
we warn and don't return EBADF and EFAULT, and handle EAGAIN
and EINTR internally, then the set is:
  
 EINVAL
 EIO
 EISDIR
 EOVERFLOW
 ENXIO
 EFBIG
 ENOSPC
 EPIPE
 ESPIPE  

As compared to what we have currently, that removes:

 EACCES, EDEADLK, EMFILE, ENOLCK, EPERM

And adds:

 EOVERFLOW, ENXIO, EFBIG  
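
A sketch of the mapping that set would imply (the enumeration
spellings are my guesses, patterned on the existing
G_IO_CHANNEL_ERROR_* names, and the helper is hypothetical):

 GIOChannelError
 io_channel_error_from_errno (gint en)
 {
   switch (en)
     {
     case EFBIG:     return G_IO_CHANNEL_ERROR_FBIG;
     case EINVAL:    return G_IO_CHANNEL_ERROR_INVAL;
     case EIO:       return G_IO_CHANNEL_ERROR_IO;
     case EISDIR:    return G_IO_CHANNEL_ERROR_ISDIR;
     case ENOSPC:    return G_IO_CHANNEL_ERROR_NOSPC;
     case ENXIO:     return G_IO_CHANNEL_ERROR_NXIO;
     case EOVERFLOW: return G_IO_CHANNEL_ERROR_OVERFLOW;
     case EPIPE:     return G_IO_CHANNEL_ERROR_PIPE;
     case ESPIPE:    return G_IO_CHANNEL_ERROR_SPIPE;
     default:        return G_IO_CHANNEL_ERROR_FAILED;
     }
 }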

I'd also like to get rid of G_IO_CHANNEL_ERROR_ENCODE_RW since
it is a programming error, and should just be handled with
a g_warning().

> > * Now that G_IO_STATUS_INTR is gone, is it needed or useful
> >   to add a timed-write facility?
> 
> Could you explain more? I don't know what a timed-write facility is.

Well, one possible reason why people might _want_ EINTR is
the old practice of:

 alarm (5);
 count = write (...);

There are all sorts of reasons why this isn't a great
idea, but the notion of reading or writing only for a 
certain amount of time can be useful. To do this with
GIOChannel would involve:

 * Create a new GMainContext
 * Add a watch for the IO channel to the context with G_IO_OUT
 * Add a timeout source to the context
 * Set the channel to non-blocking
 * if write gives EAGAIN, g_main_context_iteration (context)
   until one of the sources fires, and if it was the
   watch, retry the write.
 * Destroy the context

By which time, you figure that just waiting until someone
hits control-c wouldn't be bad...
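
To make that concrete, here's roughly what the dance looks like
(helper names are mine, error handling is mostly elided, and I'm
assuming a would-block write reports G_IO_STATUS_AGAIN):

 typedef struct {
   gboolean writable;
   gboolean timed_out;
 } TimedWriteState;

 static gboolean
 writable_cb (GIOChannel *channel, GIOCondition cond, gpointer data)
 {
   ((TimedWriteState *) data)->writable = TRUE;
   return FALSE;  /* one-shot: remove the watch */
 }

 static gboolean
 timeout_cb (gpointer data)
 {
   ((TimedWriteState *) data)->timed_out = TRUE;
   return FALSE;  /* one-shot: remove the timeout */
 }

 /* Hypothetical: write nul-terminated 'str', giving up after 'msec'. */
 static GIOStatus
 timed_write (GIOChannel *channel, const gchar *str, guint msec)
 {
   GMainContext *context = g_main_context_new ();
   TimedWriteState state = { FALSE, FALSE };
   GSource *timeout = g_timeout_source_new (msec);
   GIOStatus status;
   gsize written;

   g_io_channel_set_flags (channel,
                           g_io_channel_get_flags (channel) | G_IO_FLAG_NONBLOCK,
                           NULL);

   g_source_set_callback (timeout, timeout_cb, &state, NULL);
   g_source_attach (timeout, context);

   while ((status = g_io_channel_write_chars (channel, str, -1,
                                              &written, NULL)) == G_IO_STATUS_AGAIN
          && !state.timed_out)
     {
       GSource *watch = g_io_create_watch (channel, G_IO_OUT);

       state.writable = FALSE;
       g_source_set_callback (watch, (GSourceFunc) writable_cb, &state, NULL);
       g_source_attach (watch, context);

       /* iterate until either the watch or the timeout fires */
       while (!state.writable && !state.timed_out)
         g_main_context_iteration (context, TRUE);

       g_source_destroy (watch);
       g_source_unref (watch);

       str += written;  /* retry with whatever wasn't written */
     }

   g_source_destroy (timeout);
   g_source_unref (timeout);
   g_main_context_unref (context);

   return status;  /* still G_IO_STATUS_AGAIN if we timed out */
 }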

It can be done a _lot_ more efficiently at the IO channel
implementation layer by (on Unix) just calling g_poll() directly,
if we added a timeout (which could be -1) to
GIOFuncs->io_read/io_write and also added
g_io_channel_write_chars_timed() or something.

The counter-argument would be that if you expect your reads
and writes to be slow, you should do things in an asynchronous
manner driven off of a GMainContext to begin with, so
adding a timed-write facility may just be encouraging
bad programming practice.

Given time constraints, we probably need to skip this issue.

> > * Need to make a read-only-returns-whole-characters guarantee
> >   for reading for UTF-8, and validate incoming data.
> 
> What is a good way for finding the ends of wide characters
> in a UTF-8 string?

What you really need is something like the g_utf8_get_char_extended()
function in gutf8.c. Since I already had another request to
make this function public, we probably should just do that.
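
The underlying trick is that the first byte of a UTF-8 sequence
fully determines its length, so you can always tell whether a
buffer ends mid-character. A minimal sketch (not the gutf8.c
code; this accepts the old 5- and 6-byte forms):

 static gint
 utf8_char_len (guchar lead)
 {
   if (lead < 0x80)       return 1;   /* ASCII */
   else if (lead < 0xc0)  return -1;  /* continuation byte, not a start */
   else if (lead < 0xe0)  return 2;
   else if (lead < 0xf0)  return 3;
   else if (lead < 0xf8)  return 4;
   else if (lead < 0xfc)  return 5;
   else if (lead < 0xfe)  return 6;
   else                   return -1;  /* 0xfe, 0xff are invalid */
 }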
 
> > * I think we need to hide partial chars when writing from
> >   the user - the total possible partial char size is known
> >   - 5 bytes - only because it is UTF-8. (4 if we assume
> >   that it is UTF-8 encoding valid Unicode characters)
> 
> If I recall correctly, g_io_channel_write_chars() only returns
> G_IO_STATUS_PARTIAL_CHARS if the first character in the buffer
> is partial. Otherwise, it writes as many full characters
> as it can and returns normally.

I'd like to have it work so g_io_channel_write_chars(), when
called on a block stream, either writes all the characters
or fails with G_IO_STATUS_ERROR. People shouldn't have
to worry about partial writes. We can get this behavior
by keeping a small fixed-size buffer of unconverted partial
characters when writing.

In fact, I'd really like to get rid of G_IO_STATUS_PARTIAL_CHARS.
I think:

 - When writing, we should hide it
 - When reading, it indicates an error, and should be reported
   as an error.
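
The fixed-size buffer idea might look roughly like this (the
names are illustrative, not the patch's actual fields):

 #define MAX_CHAR_SIZE 6  /* longest possible UTF-8 sequence */

 typedef struct {
   gchar partial[MAX_CHAR_SIZE]; /* bytes of one incomplete character */
   gsize partial_len;            /* how many of them are held over    */
 } PartialChar;

 /* On each write: first top up 'partial' from the new data until it
  * holds a complete character, convert and flush it, then convert
  * the rest, holding back any trailing incomplete character in
  * 'partial' for the next call. */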
 
> > * g_io_channel_set_encoding() leaves IO channel in inconsistent
> >   state on error. Would be better to do nothing if converters
> >   couldn't be opened.
> 
> I don't see this. As near as I can tell, the current
> implementation fails cleanly if it doesn't create one of the converters.

Looking it over, it looks like I misread the code and it
is fine as it is.
 
> >
> > * seek position doesn't seem to properly handle offset < 0
> >   from SEEK_CUR, or seeks of less than the buffer length.
> 
> Once again, I don't see this. All that happens for SEEK_CUR is
> that the offset is shifted by the length of the buffer, so
> the seek is relative to the current position of the file pointer
> instead of the head of the buffer.

On second read, this looks OK too.
 
> > * Set a flag to catch reads/writes on closed channels cleanly?
> 
> This is currently being done by a call to g_io_channel_get_flags()
> in g_io_channel_write_chars(). It is unnecessary for reads as
> long as the buffer is flushed when the channel is closed,
> as the first thing that will happen in a read is an
> attempt to fill the buffer that will fail with EBADF.

Not necessarily - another file descriptor could have been
allocated the same ID. Using a freed file descriptor is like
using freed memory. Since we have the opportunity to catch
this programming error reliably with one bit, it might
be nice to do so. 
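
Something as simple as this at the top of each entry point would
do it (the 'closed' field is illustrative):

 if (channel->closed)  /* one bit, set by g_io_channel_close() */
   {
     g_warning ("g_io_channel_read_chars: channel has already been closed");
     return G_IO_STATUS_ERROR;
   }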

(Related issue: we should probably remove BADF and FAULT from 
GIOChannelError and issue a g_warning() instead. These will
never be generated for properly written programs, so there
is no reason why a program should want to catch them, or
report them nicely to the user.)
 
> If we cache the read-only channel flags, as you recommend later,
> that will also work. Those flags will need to be changed when
> the user calls one of:
> 
> close()		as long as they use the API version instead of the native system
> 		call we're fine
>
> shutdown()	terminate read and/or write for sockets, we probably need to
> 		catch this in any watch or something

As soon as close() is called, getting the flags is an invalid operation
and we _must_ not do it. If the user calls close() or shutdown()
while a channel is open on a file descriptor, without going
through GIOChannel, they're taking their life into their own
hands. 

> > * Calling g_io_channel_seek() in g_io_channel_write_chars()
> >   is rather odd.
> 
> This has to do with mixing read/write on a buffered channel.
> 
> If a write is followed by a read, the contents of the write buffer
> need to be flushed to disk before the read, so that the read starts
> after the last character written to the buffer.
> 
> Similarly, if a read is followed by a write, the read buffer
> needs to be "flushed" so that the write starts immediately
> after the last character read from the buffer. This
> is done by calling seek() to reposition the file pointer
> at the head of the buffer. This is also why mixing
> reading and writing doesn't work for encoded channels
> (which can't use SEEK_CUR).

But there is no reason to actually make a seek() syscall, right?
You are just reusing the code in g_io_channel_seek_position()? 
That's what seemed a bit hackish to me. (But probably perfectly
OK.)
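
For the read-then-write direction, the repositioning amounts to
something like this (the buffer field name is mine):

 /* Discard the read-ahead: step the file pointer back over the
  * bytes we buffered but never handed to the caller. */
 if (channel->read_buf_len > 0)
   {
     g_io_channel_seek_position (channel, - (gint64) channel->read_buf_len,
                                 G_SEEK_CUR, NULL);
     channel->read_buf_len = 0;
   }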
 
> > * There is no reason (IMO) to expand the write buffer arbitrarily
> >   when writing characters - just flush out each chunk as 
> >   it is converted. 
> 
> With the current implementation, the buffer does not grow
> arbitrarily, it is flushed whenever it is larger than
> channel->buf_size. Do you mean that the buffer should be
> flushed whenever it contains _any_ characters? This
> would make g_io_channel_flush() sort of pointless.

What I mean is that, when doing encoding conversion, you convert
the entire write buffer into the target encoding before
ever flushing. There is no reason to do this rather than
simply converting as much as there is free space for in your
buffer and then flushing that.
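
In sketch form (field and helper names hypothetical, g_iconv()
error handling elided):

 /* Convert only as much input as fits in the buffer's free space,
  * flush, and repeat -- never grow the buffer to hold the whole
  * converted result. */
 while (input_left > 0)
   {
     gchar *out = channel->write_buf + channel->write_buf_used;
     gsize out_left = channel->buf_size - channel->write_buf_used;

     g_iconv (channel->write_cd, &input, &input_left, &out, &out_left);
     channel->write_buf_used = channel->buf_size - out_left;

     flush_buffer (channel);  /* hypothetical: writes out the buffer */
     channel->write_buf_used = 0;
   }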

> > * If the previous write flushed the entire buffer, get_flags() is 
> >   called again by g_io_channel_write_chars(), this could be
> >   expensive? If someone is writing to a non-writeable buffer,
> >   this is a programming error, so we don't need to try
> >   and catch it cleanly.
> 
> See comments above about checking for closed files.

Just to reiterate, there is _no_ way to check for a closed
file through the Unix API.
 
> > * Add recognition of unicode paragraph separator to
> >   autodetect routine?
> 
> What is the byte value of this character?

It's g_unichar_to_utf8 (0x2029), or "\xe2\x80\xa9".
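
If you want to double-check it (g_unichar_to_utf8() returns the
length of the sequence it writes):

 gchar buf[7];
 gint len = g_unichar_to_utf8 (0x2029, buf);  /* len == 3 */
 buf[len] = '\0';                             /* buf is "\xe2\x80\xa9" */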
 
> > * O_BINARY handling on win32... should it be automatic?
> 
> I don't understand the question.
 
[ Warning, the below contains quite a bit of me talking about
  things I know only second-hand as if I had certain knowledge ]

This relates to some nasty breakage in the way the C library works on
platforms inheriting from DOS -- If you don't open a file in binary
mode (O_BINARY for open, "b" flag for read/write) then there is 
autoconversion on read and write between \r\n and \n. 

Now for stdio, I can sort of understand the point - making
people write "\r\n" would have made C quite incompatible between
DOS and Unix. But the fact that this is done not in stdio,
but at the "syscall" layer is inexcusable: open() should
give you the bytes in the file, damnit.

So, if we want to make g_io_channel_new_file() usable on
Windows, we have two options:

 1) Allow "b" in the flags, so that if you want unmunged data,
    you do new_file (filename, "rb", &error). 

 2) Simply always open files in binary mode.

The advantage of the first is that text files written through
GIOChannel will typically have the platform-native line endings
without porting intervention.

The advantage of the second is that it is much more likely people will
get things basically working for Win32 when writing on Unix. Also,
note that if your target encoding is the Windows-standard UTF-16LE
encoding, then being in text mode is disastrous.

I'd really appreciate some guidance from the Win32 people on
this issue.

One other issue that I discovered when writing a little
program using the new API today:

 - I think the append-to-what's-there behavior of 
   g_io_channel_read_line_string() is annoying, and would
   be inclined to get rid of it. 

   While I have, on occasion, when text munging in Perl,
   used the "accumulate lines until I hit something,
   then regex the accumulated lines as a whole" technique,
   I think it's relatively rare, and it would be better
   to make that inefficient than to make the use of the
   string for g_io_channel_read_line_string() as a simple
   buffer inconvenient.

   If you don't make that change, I believe I introduced
   a bug in the code - the *line_term result is returned
   relative to the line rather than the buffer.

Regards,
                                        Owen






