Hi Daniel, All,
I'm working with large XML documents (over 500K elements). Their
structure (written as XPath) looks like
/v/body[1]/word/valency_frames[1]/frame/frame_elements[1]/element/form
and below form there is usually a rather shallow structure with only a
few elements named "node". I'm interested in selecting form elements. I
use xmlXPathOrderDocElems to increase performance. If I write the
XPath as above, I get the results in a second or so. But if I write it
e.g. as
/v/body[1]/word/valency_frames[1]/frame/frame_elements[1]/descendant::form
(note that the subtree of frame_elements[1] is usually very small), it
takes several minutes. I used gprof to find out why. Here is the
interesting part:
index  % time     self  children    called     name
                                                   <spontaneous>
[1]      98.8   432.52      0.00               xmlXPathNodeSetMerge [1]
Examining xpath.c revealed that while AXIS_DESCENDANT uses
xmlXPathNodeSetMerge, AXIS_CHILD uses xmlXPathNodeSetMergeUnique,
which makes it a lot faster. The doc-comments for the two functions are
identical. I have a few questions.
First, what is the difference between these two? The code suggests
that the latter assumes that the node-sets are disjoint, right? Second, can
xmlXPathNodeSetMergeUnique be used for AXIS_DESCENDANT and
AXIS_DESCENDANT_OR_SELF as well? (I tested whether it would speed up
descendant::form and it did, but maybe the change is semantically
incorrect.) Alternatively, is there some room for optimization in
xmlXPathNodeSetMerge?
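To make the question concrete, here is a minimal sketch of what I
understand the difference to be. It is a simplified model, not the actual
code from xpath.c: NodeSet, merge_checked, merge_unique and demo_overlap
are hypothetical names, and plain integers stand in for node pointers.
The checked merge scans the destination set for every incoming node,
which is quadratic when many small sets are merged into one large result;
the "unique" merge simply appends and is only correct if the two sets are
known to be disjoint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a node set: a growable array of node ids. */
typedef struct {
    int *items;
    size_t len;
    size_t cap;
} NodeSet;

/* Append one item, growing the array as needed. Returns 0 on success. */
static int nodeset_push(NodeSet *s, int v) {
    if (s->len == s->cap) {
        size_t ncap = s->cap ? s->cap * 2 : 8;
        int *p = realloc(s->items, ncap * sizeof *p);
        if (p == NULL)
            return -1;
        s->items = p;
        s->cap = ncap;
    }
    s->items[s->len++] = v;
    return 0;
}

static void nodeset_free(NodeSet *s) {
    free(s->items);
    s->items = NULL;
    s->len = s->cap = 0;
}

/* Merge with duplicate check: each incoming item is compared against
 * everything already in dst, so the cost is O(len(dst) * len(src)). */
static int merge_checked(NodeSet *dst, const NodeSet *src) {
    for (size_t i = 0; i < src->len; i++) {
        int dup = 0;
        for (size_t j = 0; j < dst->len; j++) {
            if (dst->items[j] == src->items[i]) {
                dup = 1;
                break;
            }
        }
        if (!dup && nodeset_push(dst, src->items[i]) != 0)
            return -1;
    }
    return 0;
}

/* Merge without duplicate check: O(len(src)), but only correct when
 * the caller knows that dst and src have no node in common. */
static int merge_unique(NodeSet *dst, const NodeSet *src) {
    for (size_t i = 0; i < src->len; i++) {
        if (nodeset_push(dst, src->items[i]) != 0)
            return -1;
    }
    return 0;
}

/* Merge {3,4} into {1,2,3} with both variants and report the sizes. */
int demo_overlap(size_t *checked_len, size_t *unique_len) {
    NodeSet a1 = {0}, a2 = {0}, b = {0};
    int rc = 0;
    for (int v = 1; v <= 3; v++) {
        rc |= nodeset_push(&a1, v);
        rc |= nodeset_push(&a2, v);
    }
    rc |= nodeset_push(&b, 3);
    rc |= nodeset_push(&b, 4);
    rc |= merge_checked(&a1, &b);  /* a1 becomes {1,2,3,4}   */
    rc |= merge_unique(&a2, &b);   /* a2 becomes {1,2,3,3,4} */
    assert(rc == 0);
    *checked_len = a1.len;
    *unique_len = a2.len;
    nodeset_free(&a1);
    nodeset_free(&a2);
    nodeset_free(&b);
    return rc;
}
```

In this model the two variants give the same result whenever the inputs
are disjoint, and differ only when they overlap (here the node 3 ends up
twice in the unchecked result). So the question boils down to whether
the per-context results collected for the descendant axis can ever
overlap.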
Thanks,
-- Petr