Re: XML libs (was Re: gconf backend)



On Sun, Sep 28, 2003 at 01:29:21PM -0400, Havoc Pennington wrote:
> On Sun, 2003-09-28 at 06:08, Daniel Veillard wrote:
> >   libxml2 is designed to be able to report multiple errors when parsing
> > a resource. And your API style does not allow this. It's critical for
> > a lot of work to be able to know that you have different problems
> > lines 100, 120 and 134. I understand your viewpoint and will try to
> > carry it on the list.
> 
> That makes sense, in a context where someone is human-editing the XML
> and wants to see all the errors in the document at once.
> 
> Rather than "exceptions" the other thing that would work would be to
> reliably _always_ call the error callback and set an error code on the
> context if a function fails (returns NULL or whatever). Right now the

This is the case for all pure parsing errors as far as I can tell.
Most memory errors should also behave that way, but there are some internal
APIs where the context is not passed down; fixing that would imply duplicating
the tree/DOM API with an extra argument for the context.

> function can fail without the callback having been called. This is
> perhaps a more realistic change to libxml.

It's not about what's realistic, it's about what's needed.
  
> If you _always_ call the error callback on error, then it's possible in
> a wrapper or convenience library to convert the error callbacks into
> exceptions (in fact config-loader-libxml.c in dbus tries to do this
> already).

The problem is that there are 2 (even 3 with warnings) callbacks: one global,
which is actually per thread (the context is duplicated per thread, and
global variables like the global error callback and the global error
context argument can be set per thread), and one among the SAX
callbacks, used when the parser context is available.

> Introducing exceptions to the current API at this stage is basically a
> bad idea, since you have too many old functions that don't use them and
> you don't want to double the API. So perhaps the always-call-callback
> approach is right.

  I can introduce error information for the new APIs being rolled out and
add it to the xmlReader interface too.

> The xmlTextReader error callback API is good, as long as the provided
> error callback with xmlTextReaderSetErrorHandler() is _always_ called if
> a function fails.

  You seem to have a per-function approach. Parsing errors for XML
are codified, and are actually part of the spec, so except for memory
allocation errors, you won't get a "per function" error but an error
as defined in the spec once the condition is recognized.

> Perhaps the interesting thing to do is develop a tinyxml alternate lib
> _or_ a wrapper API. If you or someone does that though, again, please,
> do not ABI freeze it as soon as you implement it. It needs to be used in
> real life by several apps and iterated through rounds of improvement
> based on that.

  Discussed in a separate post: since the tiny XML lib would need an API
separate from that of libxml2 itself, I don't see the point of going through
this.

> I think this may be wrong though and xmlTextReader may be the API to go
> with. It's the one I started using in config-loader-libxml.c and it
> looks essentially reasonable.

  That's my point too. It's a bit slower than SAX even in the upcoming
2.6.0 but it's nearly standard (C# ECMA, with only slight deviations),
bullet-proof for "common" use cases, while still being very flexible.

> I was on the libxml mailing list for a long time, btw. I just wasn't
> able to keep up with the mail volume.

 Okay

> > [1] http://mail.gnome.org/archives/xml/2003-September/msg00146.html
> 
> Most of the APIs in this mail essentially would not be used in my use
> cases, because I don't want to load an xmlDocPtr and want to do my own
> I/O. I would want to feed libxml the already-loaded bytes. The way
> provided in this mail is xmlReadMemory(), but that has the limitation
> that you have to load the whole file at once.

  There is a push interface to libxml2 parser too.

> What I really want is:
> 
>  context = context_new ();
>  context_add_bytes (context, buffer, len);

Actual code, cut and pasted, for the handling of the --push option in
xmllint.c
--------------------------
    int res, size = 1024;
    char chars[1024];
    xmlParserCtxtPtr ctxt;

    /* if (repeat) size = 1024; */
    /* read the first 4 bytes so the parser can detect the encoding */
    res = fread(chars, 1, 4, f);
    if (res > 0) {
        ctxt = xmlCreatePushParserCtxt(NULL, NULL,
                    chars, res, filename);
        /* feed the rest of the file in chunks */
        while ((res = fread(chars, 1, size, f)) > 0) {
            xmlParseChunk(ctxt, chars, res, 0);
        }
        /* terminate = 1 signals the end of the document */
        xmlParseChunk(ctxt, chars, 0, 1);
        doc = ctxt->myDoc;
        ret = ctxt->wellFormed;
        xmlFreeParserCtxt(ctxt);
        if (!ret) {
            xmlFreeDoc(doc);
            doc = NULL;
        }
    }
--------------------------
  The first 2 arguments of xmlCreatePushParserCtxt are a SAX block and
the associated context, for use if you don't want to build a tree.
  
> Where you can provide the document in incremental chunks, so I could
> call context_add_bytes() repeatedly appending more bytes until the
> document was complete. At the end you call context_finished() or
> something and the parser complains if the document isn't complete.

  Cf. the xmllint code above: you can check ctxt->wellFormed and ctxt->errNo
at each chunk, or catch the synchronous error callbacks.

> > [2] http://xmlsoft.org/xmlreader.html#Walking
> 
> I like the reader API. So here are the nodes I know what to do with:
> 
>     XML_READER_TYPE_ELEMENT = 1,
>     XML_READER_TYPE_ATTRIBUTE = 2,
>     XML_READER_TYPE_TEXT = 3,
>     XML_READER_TYPE_COMMENT = 8,
>     XML_READER_TYPE_DOCUMENT_TYPE = 10,
>     XML_READER_TYPE_END_ELEMENT = 15,
> 
> Here are the nodes that if I wrote code I would just skip them:
> 
>     XML_READER_TYPE_NONE = 0,
>     XML_READER_TYPE_CDATA = 4,

  CDATA can be handled as TEXT, it's just escaped text.

>     XML_READER_TYPE_ENTITY_REFERENCE = 5,
>     XML_READER_TYPE_ENTITY = 6,
>     XML_READER_TYPE_PROCESSING_INSTRUCTION = 7,
>     XML_READER_TYPE_DOCUMENT = 9,
>     XML_READER_TYPE_DOCUMENT_FRAGMENT = 11,
>     XML_READER_TYPE_NOTATION = 12,
>     XML_READER_TYPE_WHITESPACE = 13,

  That's whitespace text that you may or may not ignore depending on the
XML vocabulary you use. It is application dependent.

>     XML_READER_TYPE_SIGNIFICANT_WHITESPACE = 14,
>     XML_READER_TYPE_END_ENTITY = 16,
>     XML_READER_TYPE_XML_DECLARATION = 17
> 
> Is my resulting application going to be compliant, assuming I asked for
> entity substitution? Or will my app fall over?
 
  Handling CDATA should be done. And XML doesn't define compliance for
an application but for a parser. What the parser provides back to the
application, and what errors are raised and when, is part of the spec; what
the application does with the returned data is not. That's something you
seem to misunderstand about XML compliance. What I warned about was
that using a non-compliant parser may lose data (silently) or build
expectations of broken behaviour into application code.

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


