Re: XML libs (was Re: gconf backend)



On Sat, Sep 27, 2003 at 04:28:11PM -0400, Havoc Pennington wrote:
> >   libxml2 will guarantee UTF-8 at the API level independently of the
> > input encoding. If you don't want to see entities, ask for them to be
> > substituted. If you see DTDs, comments or PIs, simply ignore them. Still,
> > that's no reason to break conformance.
> 
> Having the app ignore those things is not different from having the XML
> lib ignore them. Either way they are ignored. What I'm worried about is

  No, and that is a big mistake you're making. Example: you don't have to
know about entities at the user level for those apps, you just let the
parser do the work for you. You won't even know that they were there.
But if you ignore them at the lib level, you lose the data.
  You are thinking in terms where you control the input and output. The
error is that your next big client is gonna use an Oracle back-end for
your XML data, and suddenly you don't control the production anymore.
If you use a non-conformant parser, you have made a promise that you just
can't keep, and that kind of thing has serious long-term costs.

> bugs in apps where they crash when they get an unusual XML construct
> they weren't expecting. That's why I don't think these things should be
> in the API when it's avoidable.

  That is the reason to use an API like the xmlReader, which is pretty much
fail-proof for users and offers a simple programming model.

> The fact is that whatever these features are supposed to do, your
> typical desktop developer will not understand or want. And so they won't
> handle it properly. So either we need the API to force you to handle it
> properly, or the API in practice may as well not even have the feature.

  No, the library handles most of them under the hood.
See my other post: there are 6 kinds of items you may face (entities
should be asked for substitution, unlike for Malcolm, who needs
to see them because he's writing an editor). 3 of them are the basics,
1 is namespace plumbing, which you may or may not use for checking,
and the last 2 are bits you can usually ignore.
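
  A rough sketch of that triage on the application side. All names here
are hypothetical, and the exact split into six kinds is an assumption (the
real list is in the other post); this is not the xmlReader API itself:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical item kinds; this 3 + 1 + 2 split is an assumption. */
typedef enum {
    ITEM_ELEMENT_START,   /* the 3 basics */
    ITEM_ELEMENT_END,
    ITEM_TEXT,
    ITEM_NAMESPACE_DECL,  /* namespace plumbing */
    ITEM_COMMENT,         /* the 2 usually ignorable kinds */
    ITEM_PI
} ItemKind;

typedef struct {
    ItemKind kind;
    const char *value;    /* namespace URI when kind is ITEM_NAMESPACE_DECL */
} Item;

typedef enum { ACTION_HANDLE, ACTION_SKIP, ACTION_FAIL } Action;

/* Decide what the application does with one item delivered by the parser. */
Action classify_item(const Item *item, const char *known_ns)
{
    assert(item != NULL && known_ns != NULL);

    switch (item->kind) {
    case ITEM_ELEMENT_START:
    case ITEM_ELEMENT_END:
    case ITEM_TEXT:
        return ACTION_HANDLE;
    case ITEM_NAMESPACE_DECL:
        /* Use the declaration as a check: fail on a namespace we don't know. */
        if (item->value != NULL && strcmp(item->value, known_ns) == 0)
            return ACTION_SKIP;
        return ACTION_FAIL;
    case ITEM_COMMENT:
    case ITEM_PI:
        return ACTION_SKIP;
    }
    return ACTION_FAIL;   /* unknown kind: refuse rather than guess */
}
```

  Entities do not show up here at all, since the parser is asked to
substitute them before the application sees anything.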

> >   Well if you see a namespace declaration and ignore it, that probably means
> > that your receiving side code is not ready to understand what it is receiving
> > and IMHO you should certainly not ignore it but fail immediately to avoid
> > misinterpreting data.
> 
> That's fine, but how is it different from having the library simply not
> support namespaces and return an error if it sees one? This is exactly
> my point. The _app_ is what has to be compliant, not the XML library.

  Hey, no: if your library is not namespace compliant you won't even see
the namespace, you will see attributes, and you will very likely
misinterpret them.
  But as I pointed out, a non-compliant lib is likely to break even
what you consider very simple processing. I'm pretty sure I can find
a very simple document for which gmarkup will behave differently from
expat and libxml2.

> And making desktop apps handle arbitrary XML documents seems pretty much
> impossible, because it's too complicated for non-XML-experts to
> understand.

  No, you have only 6 kinds of items to support, and 2 are easily
ignorable.

> For web developers, it's different. Those guys focus on XML
> as a primary part of their expertise.

  This has nothing to do with web development versus data-oriented
development. You have a spec, either you're compliant or not. It's a
contract. And all it costs you to comply with that contract is mostly
to correctly reuse a compliant library instead of trying to roll your
own.

> Some of the XML-focused apps like Conglomerate or perhaps Gnumeric no
> doubt handle these things, but the rest of the desktop doesn't.
> libxml-based gconf didn't handle namespaces any more than the gmarkup
> one does, as far as I know. I didn't do anything special to add such
> handling.

Right, you don't have to. On top of libxml2 you should just halt processing
if you see a namespace you don't understand.

> >   You can't remap something like namespace, DTD, PI or comments to 
> > something which would be XML without them. Like asking a kernel to
> > remap the network layer on top of the disk driver because you don't
> > have a network card :-)
> 
> But the XML usage in GNOME isn't handling DTD, PI, namespaces, etc.
> Even when using libxml, the apps are just ignoring those features,
> except when libxml automates handling them. Apps are just assuming the
> gmarkup-like subset and ignoring everything else.

  Precisely: in one case the library handles them underneath, with
gconf it would break.

> > I have asked on libxml2 list for feedback on error handling, but since
> > you're not subscribed I assume you will not provide any suggestion.
> 
> I have two suggestions; the first is to copy GError (read the extensive
> explanatory docs on it at 
> http://developer.gnome.org/doc/API/2.0/glib/glib-Error-Reporting.html).
> The second is to make one modification to GError which is to use 
> a statically-allocated struct instead of a malloc'd struct, as in 
> CORBA_environment and DBusError. This lets you report out-of-memory
> errors.
> 
> The rules for using GError are the important thing, rather than the
> detailed API. Always handle or propagate the error, for example; don't
> pile up errors; fail atomically; etc.

  Well, libxml2 uses callbacks for errors, that's the model everybody
uses and I'm not sure it was ever questioned by the relatively large
user base. Since your model seems to impose an asynchronous processing
I think this will need some discussion on the mailing-list. I cannot
change radically to a new model without at least a bit of explanation.
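
  To make that discussion concrete, a rough sketch of the caller-allocated
error struct being suggested, as opposed to an error callback. All names
are hypothetical; this is neither GLib's nor libxml2's API, only an
illustration of the idea that reporting an error needs no allocation:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { PARSE_OK = 0, PARSE_ERR_NOMEM = 1, PARSE_ERR_SYNTAX = 2 };

/* Lives on the caller's stack, so even out-of-memory can be reported. */
typedef struct {
    int code;             /* PARSE_OK means "no error set" */
    char message[128];    /* fixed buffer, no malloc involved */
} ParseError;

void parse_error_init(ParseError *err)
{
    err->code = PARSE_OK;
    err->message[0] = '\0';
}

/* Record an error once; setting an already-set error is a caller bug. */
void parse_error_set(ParseError *err, int code, const char *msg)
{
    if (err == NULL)
        return;           /* caller is not interested in details */
    assert(err->code == PARSE_OK);
    err->code = code;
    strncpy(err->message, msg, sizeof err->message - 1);
    err->message[sizeof err->message - 1] = '\0';
}

/* Toy operation standing in for a parser entry point. */
int toy_parse(const char *doc, ParseError *err)
{
    if (doc[0] != '<') {
        parse_error_set(err, PARSE_ERR_SYNTAX, "document must start with '<'");
        return -1;
    }
    return 0;
}
```

  The caller checks the return value and either handles the error or
passes the same struct up to its own caller, which is the "handle or
propagate" rule from the quoted text.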

> > > handling right. The total library size would be in the 200K range, or
> > > perhaps less if it used GLib functionality for portability/unicode. The
> > 
> >   So you're complaining about 6-700K of shared code?
> 
> That is a very significant amount of code. The GTK+ stack is 3-4M total.
> Paging binary code off disk is a large part of our application startup
> time and bootup time. GLib is only 400K core plus 200K GObject.

  And? If you use a fraction of libxml2 you will load only that. Where
is the problem? You're complaining that libxml2 has features you don't
use. How is this a problem? It is loaded already.

> Sure I can live with 6-700K if I have to, but it's not ideal. I'm
> describing the ideal library here.

  Well, we had megabytes of user parser modified data pages for more than
one year and you're just suggesting to change that. There is some unfairness
there. Do some profiling, then pinpoint where the problem really is.
Sorry, but it's not very scientific to say that the library looks large
so it must slow GNOME startup. Run oprofile on a startup session, and see
where the time is spent exactly. Reading the large number of small XML
files might be one point, but I doubt it's really library dependent. And
the time to demand-load the 600KB into memory is peanuts. On the other
hand the decision to fragment gconf files into a number of small files
might be thrashing the disk, which needs to access a number
of non-sequential blocks. The libxml2 reader parses a 20MB file in 4 seconds,
so the cost of loading 600K from disk is nearly negligible, really!
It's not like activating that code instantiates megabytes of data.
  
  My point is that I don't see where this is actually in any way ideal.
You can get a 10kb parser written in lisp using a lisp engine. Size of
code is not directly correlated to speed. Again the only problem would be
PDAs, and libxml2 can be trimmed down in those environments (there is a
WinCE port and people use it on Psion...)

> >   More precisely that library would not be XML compliant at all, like
> > gmarkup. And even in the small subset of "features" that you support
> > I wonder how much is correctly done, i.e. CR/LF remapping, attribute content
> > processing. Do you correctly process
> >   <doc attr="this attribute value content
> > should be delivered to the application without a new line"/>
> > and
> >   <doc attr="this attribute value content &#10;
> > should be delivered to the application with one new line"/>
> > 
> >    I mean that even with a very basic subset the risk of diverging
> > from the standard is really high, and if you work on a subset you can't
> > test against the regression suite. The risk then is to generate data
> > and code which then just break when fed to a compliant library.
> 
> Yes, that's true. That's why I specified that my ideal library would
> handle these things.
> 
> Without having the ideal library though we have to balance the pros and
> cons of the various libraries we do have.

  I think not handling attribute content properly is a serious risk.
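
  For reference, a minimal sketch of what the two examples quoted above
require, loosely following attribute-value normalization in XML 1.0
section 3.3.3. The function names are hypothetical, and the sketch only
handles ASCII character references and the five predefined entities, so
it is an illustration and not a substitute for a conformant parser:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map the body of a reference (between '&' and ';') to one ASCII char.
 * Returns 0 if the reference is malformed or outside this sketch's scope. */
static char resolve_reference(const char *name, size_t len)
{
    if (len >= 2 && name[0] == '#') {
        unsigned long value = 0;
        unsigned base = 10;
        size_t i = 1;

        if (name[1] == 'x') {
            base = 16;
            i = 2;
        }
        if (i == len)
            return 0;
        for (; i < len; i++) {
            unsigned digit;
            char c = name[i];

            if (c >= '0' && c <= '9')
                digit = (unsigned)(c - '0');
            else if (base == 16 && c >= 'a' && c <= 'f')
                digit = (unsigned)(c - 'a') + 10;
            else if (base == 16 && c >= 'A' && c <= 'F')
                digit = (unsigned)(c - 'A') + 10;
            else
                return 0;
            value = value * base + digit;
            if (value > 0x7F)
                return 0;          /* non-ASCII: out of scope here */
        }
        return (char)value;        /* 0 (invalid &#0;) reads as failure */
    }
    if (len == 3 && memcmp(name, "amp", 3) == 0) return '&';
    if (len == 2 && memcmp(name, "lt", 2) == 0) return '<';
    if (len == 2 && memcmp(name, "gt", 2) == 0) return '>';
    if (len == 4 && memcmp(name, "quot", 4) == 0) return '"';
    if (len == 4 && memcmp(name, "apos", 4) == 0) return '\'';
    return 0;
}

/* Normalize a raw CDATA attribute value into out.
 * Returns 0 on success, -1 on malformed input or a too-small buffer. */
int normalize_attr_value(const char *raw, char *out, size_t out_size)
{
    size_t n = 0;

    assert(raw != NULL && out != NULL);
    if (out_size == 0)
        return -1;

    while (*raw != '\0') {
        char c;

        if (*raw == '&') {
            /* A reference yields the referenced character itself. */
            const char *end = strchr(raw, ';');

            if (end == NULL)
                return -1;
            c = resolve_reference(raw + 1, (size_t)(end - raw - 1));
            if (c == '\0')
                return -1;
            raw = end + 1;
        } else if (raw[0] == '\r' && raw[1] == '\n') {
            c = ' ';               /* CR LF is one line end, so one space */
            raw += 2;
        } else if (*raw == ' ' || *raw == '\t' || *raw == '\n' || *raw == '\r') {
            c = ' ';               /* literal whitespace becomes a space */
            raw++;
        } else {
            c = *raw++;
        }
        if (n + 1 >= out_size)
            return -1;
        out[n++] = c;
    }
    out[n] = '\0';
    return 0;
}
```

  A literal line break in the value comes out as a space, while &#10;
comes out as a real newline, which is exactly the difference between the
two test documents.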

> This is why if someone says they don't like metacity I don't get angry,
> as long as they don't get personal about it. Nobody is making them use
> metacity so they don't have any reason to yell if they don't like it,
> just don't use it. I am a big fan of people having other WMs to use so I
> can keep mine simple.

  The BIG difference is that metacity implements the specs related to
the desktop behaviour (well, I assume so, it's your domain) but gmarkup
does not adhere to the specs which drive XML parsing. You just can't
compare non-compliant and compliant code bases.

> > >  - application can "throw" an error itself if it doesn't like the
> > >    elements/content it sees
> > 
> >   I really don't see why. One of the nice things about your pseudo API
> > that everybody would love to use is that you didn't specify if it
> > was push or pull (i.e. who keeps control of the I/O flow, and I know
> > people will want both).
> 
> If there's no DTD validation, the app is doing its own error checking.
> Even if you had DTD validation, some things can't be expressed via DTD
> so the app has to do some of the checking anyway. e.g. the DTD can't
> express the possible values of the color attributes in metacity themes,
> that can be "#RRGGBB" or "gtk:fg[NORMAL]" or some other stuff. So the
> app needs to be able to throw an error like "invalid value for attribute
> foo on line 23"

  Yes, those are application-level errors. Still, if the input is broken at
the XML level the parser must report that and stop. That's the spec, and I
doubt we disagree on that.
  BTW in Relax-NG your attribute content could be checked trivially
with a regexp. Not that I suggest doing this, but it's possible.

> >    Well, since you do only SAX and the reader there is nothing to save.
> > Considering escaping of a string, I'm sorry to tell you that saving
> > element content and attribute content should use different escaping
> > routines, unless you're okay with losing data.
> 
> That's unfortunate, since we're lucky if app developers remember to
> escape at all. But if you had a single escape routine that had a
> mandatory argument like:
> 
>  escaped = xml_escape (text, len, XML_ESCAPE_MODE_ATTRIBUTE);
> 
> or:
> 
>  escaped = xml_escape (text, len, XML_ESCAPE_MODE_ELEMENT);
>  
> Then you could force people to handle this. This is what I mean when I
> say that anything you want people to reliably think about, you have to
> force them to think about. The app, not the library, has to be
> XML-compliant.

  That's saving. If you were using the library for saving you wouldn't have
to put the logic in the app. But apparently you don't want to. So you
have to put the logic in the app, nothing to argue about except maybe
your initial decision.
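
  For what it's worth, a minimal sketch of the mode-taking routine
suggested above. xml_escape and the XML_ESCAPE_MODE_* names come from the
quoted proposal and are not an existing libxml2 API; the return type and
the choice of what to escape in each mode are assumptions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef enum {
    XML_ESCAPE_MODE_ELEMENT,
    XML_ESCAPE_MODE_ATTRIBUTE
} XmlEscapeMode;

/* Replacement for byte c in the given mode, or NULL to copy it as-is. */
static const char *escape_for(char c, XmlEscapeMode mode)
{
    switch (c) {
    case '&':  return "&amp;";
    case '<':  return "&lt;";
    case '>':  return "&gt;";
    case '\r': return "&#13;";  /* a raw CR would be normalized away on read */
    default:   break;
    }
    if (mode == XML_ESCAPE_MODE_ATTRIBUTE) {
        switch (c) {
        case '"':  return "&quot;";
        /* Raw tabs and newlines in attributes are read back as spaces. */
        case '\t': return "&#9;";
        case '\n': return "&#10;";
        default:   break;
        }
    }
    return NULL;
}

/* Returns a malloc'd, NUL-terminated escaped copy of text[0..len),
 * or NULL on allocation failure. The caller frees the result. */
char *xml_escape(const char *text, size_t len, XmlEscapeMode mode)
{
    size_t i, n = 0;
    char *out;

    assert(text != NULL);
    /* The longest replacement ("&quot;") is 6 bytes. */
    if (len > ((size_t)-1 - 1) / 6)
        return NULL;
    out = malloc(len * 6 + 1);
    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        const char *rep = escape_for(text[i], mode);

        if (rep != NULL) {
            size_t rep_len = strlen(rep);

            memcpy(out + n, rep, rep_len);
            n += rep_len;
        } else {
            out[n++] = text[i];
        }
    }
    out[n] = '\0';
    return out;
}
```

  The attribute mode has to protect quotes, tabs and newlines that the
element mode can leave alone, which is why a single escaping routine
without a mode loses data in one context or the other.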

> > >  - is fairly fast, but it doesn't have to be the fastest ever
> >   And for the people who really want a fast parser? If you can't
> > compromise on 600KB on disk, you probably can't compromise on
> > CPU cycles, why one and not the other?
> 
> Because it's easy to profile and optimize, but hard to remove features.
> So one is more fixable over time than the other.

  Adding code to the library doesn't cost the app much (linking) at
run time if you don't use it. And if you use it, well, you probably had
a good reason to have those bytes.

> >   I think you have been induced into thinking that there is one magical
> > subset of XML, but I don't think it exists.
> 
> It exists for the desktop use-cases that I've implemented. That's all I
> can say or am saying.

  And you're ready to state that this looks like XML but is not XML, and
you're ready to maintain the special code for it? Either you handle it fully
(for well-formedness) or not. And a substitute, in these times of cost
reduction and standards-based strategies, sounds like a costly and
hard-to-justify alternative.

> >  Then you also seem to think
> > that the extraneous parts could be forgotten or remapped onto that subset
> > which is clearly not possible, while staying compliant.
> 
> Well, those parts aren't handled properly now; stuff breaks if you try
> to use them, no matter what XML lib you're using. Apps just don't expect
> XML to be more than a doctype, elements, attributes, content, and the
> simple entities, and if the XML lib feeds them other stuff they just get
> confused or ignore it.

  Yes, it matters what XML lib you use. Conformant libs will just do the
same processing and deliver the same output. Non-conformant subset-based
ones will catch fire and burn, or more viciously corrupt data and generate
wrong logic buried in code. As soon as the code on top starts relying on a
non-conformant deviation, the data and the code are toast.
  Regression tests for the whole code are nice, but they don't prevent
the code from implementing a behaviour violating the spec, and if you
have a subset you can't even test reliably.

Daniel

-- 
Daniel Veillard      | Red Hat Network https://rhn.redhat.com/
veillard redhat com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/


