IO and file path abstraction proposal



Recently people have been proposing various abstractions for
IO and path names.  I will share some work I've already done
since I think it is appropriate for glib.  Basically I would
like to see a complete abstraction: simply using "int fd" or
"FILE*" on UNIX is not good enough because I cannot provide
an implementation for those abstractions.  I might, for example,
like to access a string file, a file that's in a database,
or anything else.   Essentially, this proposal is a C-ified
version of the C++ iostream abstraction (which is a bit bulky
for my taste, though I've used it extensively).

I'll append the interface I'm using now.  It needs to be improved
a bit before it is completely general, so I'll also sketch the
changes it needs.  I am using this interface in
xdelta right now, so you can fetch xdelta 1.0 and see it in
action.  Xdelta takes several input files and produces 2 output
files.  The two output files are a delta.  The xdelta application
(which provides "delta" and "patch" functions) actually
combines those two files into one output patch with a header;
this is made possible by using a general interface.  In PRCS2,
where the interface is being extensively used, the two files
output by xdelta are stored separately for various reasons.
PRCS2 sends the output of xdelta into a database, the xdelta
application sends the output into a single file.  This is why
we need to be able to implement the IO abstraction, not just 
abstract for portability reasons.

I have an opaque type called "FileHandle", and a function table
like so:

struct _HandleFuncTable
{
  gssize            (* table_handle_length)       (FileHandle *fh);
  gssize            (* table_handle_pages)        (FileHandle *fh);
  gssize            (* table_handle_pagesize)     (FileHandle *fh);
  gssize            (* table_handle_map_page)     (FileHandle *fh, guint pgno, const guint8** mem);
  gboolean          (* table_handle_unmap_page)   (FileHandle *fh, guint pgno, const guint8** mem);
  const guint8*     (* table_handle_checksum_md5) (FileHandle *fh);
  gboolean          (* table_handle_close)        (FileHandle *fh, gint flags);
  gboolean          (* table_handle_write)        (FileHandle *fh, const guint8 *buf, gsize nbyte);
  gboolean          (* table_handle_copy)         (FileHandle *from, FileHandle *to, guint off, guint len);
  gboolean          (* table_handle_getui)        (FileHandle *fh, guint32* i);
  gboolean          (* table_handle_putui)        (FileHandle *fh, guint32 i);
  gssize            (* table_handle_read)         (FileHandle *fh, guint8 *buf, gsize nbyte);
};

Now this interface is a bit specialized for exactly what xdelta needs;
I only did as much abstraction as necessary.  The presence of getui
and putui is somewhat arbitrary.  Those functions could also be supplied
outside the table, but I could implement huffman compression on ints
this way, if I wanted to.  The big missing piece of this interface is
the interface to select(), which we don't have a platform-independent
version of (could we?  I'd like that).  I have a lower-level shared
implementation of this that uses only page-in and page-out which makes
implementing page based access easy, but does not work for non-seekable
or character based files.  Implementing it for file descriptors, FILE*s,
pipes, and strings is easy enough.  Also missing are functions to determine
other properties of the stream (open mode, seekable? etc.); xdelta
didn't need these and I have them implemented in a fairly
difficult-to-add-to-glib (app-specific) way in PRCS2.  Note that the
inclusion of the md5 checksum stream function is sort of random, but
very useful to xdelta because xdelta needs to know the md5 checksum of
the stream it just read, so by moving that into the interface it can be
computed as the file is read, or in the case of PRCS2, the checksum
can be precomputed and saved.

Note that there is no abstract open here; that would be a virtual
constructor.  Each abstract implementation must provide its own open
function, which is of course going to be different for each different
stream.
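
As a rough sketch of what such a per-implementation open might look like,
here is a string-backed handle.  The names StringHandle and
string_handle_open are made up for illustration, the table is cut down to
three of the entries listed above, and plain C types stand in for the glib
ones so that the fragment compiles on its own:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct FileHandle FileHandle;

/* Trimmed-down table: only three of the operations from the full table. */
typedef struct {
  long (*table_handle_length)(FileHandle *fh);
  long (*table_handle_read)(FileHandle *fh, unsigned char *buf, size_t nbyte);
  int  (*table_handle_close)(FileHandle *fh, int flags);
} HandleFuncTable;

struct FileHandle {
  const HandleFuncTable *table;
};

/* Hypothetical implementation that reads from an in-memory string. */
typedef struct {
  FileHandle base;            /* first member, so FileHandle* casts back */
  const unsigned char *data;  /* borrowed, not copied */
  size_t len;
  size_t pos;
} StringHandle;

static long string_length(FileHandle *fh) {
  return (long)((StringHandle *)fh)->len;
}

/* Copies up to nbyte bytes and returns the count; 0 means end of data. */
static long string_read(FileHandle *fh, unsigned char *buf, size_t nbyte) {
  StringHandle *sh = (StringHandle *)fh;
  size_t left = sh->len - sh->pos;
  if (nbyte > left)
    nbyte = left;
  memcpy(buf, sh->data + sh->pos, nbyte);
  sh->pos += nbyte;
  return (long)nbyte;
}

static int string_close(FileHandle *fh, int flags) {
  (void)flags;
  free(fh);
  return 1;
}

static const HandleFuncTable string_table = {
  string_length, string_read, string_close
};

/* The implementation-specific "open": it takes whatever arguments this
   kind of stream needs, which is why it cannot live in the table. */
FileHandle *string_handle_open(const char *s) {
  StringHandle *sh;
  assert(s != NULL);
  sh = malloc(sizeof *sh);
  if (sh == NULL)
    return NULL;
  sh->base.table = &string_table;
  sh->data = (const unsigned char *)s;
  sh->len = strlen(s);
  sh->pos = 0;
  return &sh->base;
}
```

A caller only ever sees the FileHandle pointer and goes through the table,
e.g. fh->table->table_handle_read(fh, buf, n); a database-backed or
fd-backed handle would supply a different open and a different table.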

The way to do this in glib is:

typedef struct _GIOChannel GIOChannel;
typedef struct _GIOChannelFunctionTable GIOChannelFunctionTable;

struct _GIOChannel {
  GIOChannelFunctionTable* table;
};

The table is similar to the one above, suitably extended according
to the notes above.  Then a number of macros are provided to make these
functions easier to call, for example:

#define gio_write(io,buf,len) ((io)->table->table_handle_write((io),(buf),(len)))
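
To make the dispatch concrete, here is a small self-contained sketch of a
channel whose write entry is reached through the table member by such a
macro.  MemChannel, mem_write and mem_channel_init are hypothetical names,
the table holds only the one entry, and plain C types replace the glib ones
so it builds without glib:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct _GIOChannel GIOChannel;
typedef struct _GIOChannelFunctionTable GIOChannelFunctionTable;

/* Only the write entry is shown; the real table would have more. */
struct _GIOChannelFunctionTable {
  int (*table_handle_write)(GIOChannel *io, const unsigned char *buf,
                            size_t nbyte);
};

struct _GIOChannel {
  const GIOChannelFunctionTable *table;
};

/* Dispatch through the table.  Note that io is evaluated twice. */
#define gio_write(io, buf, len) \
  ((io)->table->table_handle_write((io), (buf), (len)))

/* Hypothetical channel that appends into a fixed-size buffer. */
typedef struct {
  GIOChannel channel;       /* first member, so GIOChannel* casts back */
  unsigned char bytes[64];
  size_t used;
} MemChannel;

/* Returns 1 on success, 0 if the data would not fit. */
static int mem_write(GIOChannel *io, const unsigned char *buf, size_t nbyte) {
  MemChannel *mc = (MemChannel *)io;
  if (nbyte > sizeof mc->bytes - mc->used)
    return 0;
  memcpy(mc->bytes + mc->used, buf, nbyte);
  mc->used += nbyte;
  return 1;
}

static const GIOChannelFunctionTable mem_table = { mem_write };

void mem_channel_init(MemChannel *mc) {
  assert(mc != NULL);
  mc->channel.table = &mem_table;
  mc->used = 0;
}
```

Because the macro mentions io twice, an argument with side effects (such as
next_channel()) would be evaluated twice; an inline function wrapper would
avoid that at the cost of a little more code.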

Tim mentioned another advantage: he can extend the function table with
new functions for setting characteristics of the sound devices he is
working on:

struct _GSoundIOChannelFunctionTable {
  GIOChannelFunctionTable table;
  /* more functions here */
};

and he can then make more #defines as above, and then he has extended
the iochannel interface with functions for setting the characteristics
of sound devices.
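
A minimal sketch of that kind of extension follows.  The set-rate entry,
the gsound_set_rate macro and the FakeSoundChannel type are invented for
illustration (I do not know what Tim's real functions look like), and the
base table is reduced to a single close entry:

```c
#include <assert.h>
#include <stddef.h>

typedef struct _GIOChannel GIOChannel;
typedef struct _GIOChannelFunctionTable GIOChannelFunctionTable;
typedef struct _GSoundIOChannelFunctionTable GSoundIOChannelFunctionTable;

struct _GIOChannelFunctionTable {
  int (*table_handle_close)(GIOChannel *io, int flags);
};

struct _GIOChannel {
  const GIOChannelFunctionTable *table;
};

struct _GSoundIOChannelFunctionTable {
  GIOChannelFunctionTable table;  /* base table must come first */
  /* hypothetical extra entry for a sound device */
  int (*table_sound_set_rate)(GIOChannel *io, unsigned hz);
};

#define gio_close(io, flags) \
  ((io)->table->table_handle_close((io), (flags)))

/* Only valid on channels whose table really is the extended one. */
#define gsound_set_rate(io, hz) \
  (((const GSoundIOChannelFunctionTable *)(io)->table) \
     ->table_sound_set_rate((io), (hz)))

typedef struct {
  GIOChannel channel;
  unsigned rate_hz;
  int is_open;
} FakeSoundChannel;

static int fake_close(GIOChannel *io, int flags) {
  (void)flags;
  ((FakeSoundChannel *)io)->is_open = 0;
  return 1;
}

static int fake_set_rate(GIOChannel *io, unsigned hz) {
  if (hz == 0)
    return 0;                     /* reject a meaningless rate */
  ((FakeSoundChannel *)io)->rate_hz = hz;
  return 1;
}

static const GSoundIOChannelFunctionTable fake_sound_table = {
  { fake_close }, fake_set_rate
};

void fake_sound_init(FakeSoundChannel *sc) {
  assert(sc != NULL);
  sc->channel.table = &fake_sound_table.table;
  sc->rate_hz = 0;
  sc->is_open = 1;
}
```

The generic macros keep working on the sound channel because the base
table is the first member of the extended one.  The cast in
gsound_set_rate is unchecked, so calling it on a plain channel is an error
the compiler will not catch.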

As another note, I've invested a fair amount of time in a serialization
code generator which uses this stream abstraction.  I write an elisp
description of a data structure and it generates code to serialize and
unserialize these objects from streams.  XDelta uses this to implement
the file format as well as cache various indices.  There is
no hand-written code in xdelta for reading and writing file formats, and
I think it is very nice.  For these, I use two more interfaces called
SerialSource and SerialSink.  These interfaces sit slightly above the
IO abstraction and source and sink these serialized objects to and
from the io channels.  If glib moves in this direction the serialization
code I have can be used by everyone for implementing file formats and
network communication (I don't want to hear replies saying how this can
be done in other or better ways, because I am perfectly happy with it,
and it is very lightweight and portable.)

All of this can be found in the file serializeio.h in the xdelta 1.0
distribution.  The serialization stuff is in reposer.el.  The serialize
input file is in xd.ser.

Finally, the issue of paths, which will be necessary for some of the
"open file" IO stream constructors.  I like pieces of the previous
proposal but suggest that a new type be used for representing paths,
since it makes many things convenient.  Passing strings around is difficult,
and the split() function that was proposed makes manipulation easier
but memory management more difficult.  From experience, I can say that
using an abstract, opaque path type is easier.  In PRCS2, I have an opaque 
type called "Path" and a "canonicalize" constructor, "dirname" and 
"basename" selectors, a "path_root()" function which returns the root 
path, and a function to convert paths to native strings.  This allows paths
to be compared for pointer equality, and deals with the issue of weird
paths that contain multiple adjacent "/" characters and ".." components.
Note that I take the stand that /P1/P2/../P3 == /P1/P3, which is not
necessarily true (symlinks), but the UNIX API is so weak in this area that
I do not care; that's my policy (note that Emacs does the same).
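
The purely lexical policy described here can be sketched on plain strings
as below.  This is not the PRCS2 code: path_canonicalize is a hypothetical
name, it handles absolute paths only, and it returns a fresh string rather
than an interned opaque Path, so results must be compared with strcmp()
rather than by pointer:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a malloc'd canonical copy of an absolute path, or NULL for a
   relative path or on allocation failure.  Symlinks are never consulted. */
char *path_canonicalize(const char *path) {
  size_t out = 0;
  char *res;
  const char *p;

  assert(path != NULL);
  if (path[0] != '/')
    return NULL;

  /* The result is never longer than the input. */
  res = malloc(strlen(path) + 2);
  if (res == NULL)
    return NULL;

  p = path;
  while (*p != '\0') {
    const char *start;
    size_t len;

    while (*p == '/')                 /* collapse runs of separators */
      p++;
    start = p;
    while (*p != '\0' && *p != '/')   /* scan one component */
      p++;
    len = (size_t)(p - start);

    if (len == 0 || (len == 1 && start[0] == '.'))
      continue;                       /* nothing left, or "." */

    if (len == 2 && start[0] == '.' && start[1] == '.') {
      /* Drop the last component; the root is its own parent. */
      while (out > 0 && res[--out] != '/') {
      }
      continue;
    }

    res[out++] = '/';
    memcpy(res + out, start, len);
    out += len;
  }

  if (out == 0)
    res[out++] = '/';                 /* everything cancelled: root */
  res[out] = '\0';
  return res;
}
```

With this, "/P1/P2/../P3" and "//P1///./P3/" both come out as "/P1/P3".
An opaque Path type would additionally intern the canonical string in a
table so that equal paths share one pointer.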

-josh


