[xml] Repost: Determining the correct start and end position of a tag



A thousand apologies for my last incomplete post; I have a new
computer with an awkwardly placed trackpad. Anyway, the post in full
is:

Hi all,

I'm in the process of developing an XML plugin for Geany, a
lightweight Linux IDE. Part of the plugin is a custom GTK Tree Model
which displays the parsed tree of a document without having to load
every row into the tree view. I'm pretty happy with that part and it
all seems to be working rather well. Not content with that, I've been
working on enhancing the plugin in order to give Geany some of the
same features that you find in commercial XML editing software such as
XPath searching and XSL transformations. Again, so far, so good.

Where I've really come to grief is in trying to tie the model and the
Scintilla editor widget together. I am trying to implement a feature
which lets the user click on a row in the tree view and have the
cursor go to that position and vice versa. In order to do this I
needed to determine the start and end position of each tag and compare
it to the position returned from the mouse click in the editor window.

I was able to get the start and end position for each node in the tree
by first creating a Parser Context with xmlCreateURLParserCtxt and
then passing that to xmlParseDocument. I then copy the
xmlParserNodeInfoSeq node_seq from the Parser Context into a linked
list to enable a binary search for the position returned by the
Scintilla editor. So far so good. I load a document, click in the
editor, and it moves the tree view selection to the right node. Huzzah!
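
For illustration, here is a minimal sketch of the position lookup
described above. It is not the plugin's actual code: the NodeSpan
struct, its row field and the find_innermost name are hypothetical,
the end offset is assumed to be one past the last character of the
node, and the spans are assumed to be sorted in document order and
properly nested. The sketch uses a plain array rather than a linked
list, because binary search needs indexed access.

```c
#include <stddef.h>

/* Hypothetical record holding the begin/end offsets copied out of the
 * parser's node info; `row` stands in for whatever identifies the
 * matching row in the tree model. */
typedef struct {
    unsigned long begin_pos; /* offset of the node's first character */
    unsigned long end_pos;   /* one past its last character (assumed) */
    int row;
} NodeSpan;

/* Return the index of the innermost span containing `pos`, or -1.
 * Assumes `spans` is sorted by begin_pos and properly nested. */
long find_innermost(const NodeSpan *spans, size_t n, unsigned long pos)
{
    size_t lo = 0, hi = n;

    /* Binary search: lo ends up as the number of spans that start at
     * or before pos. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (spans[mid].begin_pos <= pos)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* Walk back past siblings that have already closed; the first span
     * still open at pos is the innermost one. */
    while (lo > 0) {
        if (pos < spans[lo - 1].end_pos)
            return (long)(lo - 1);
        lo--;
    }
    return -1;
}
```

The backward walk is linear in the worst case (many closed siblings
before the click position), which is probably acceptable for a mouse
click but is a trade-off of this particular sketch.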

Overcome by my mastery of C and libxml, I continued testing, only to
find that I get unexpected results with non-UTF-8 documents,
specifically ISO-8859-1. After much testing, I determined that the
positions returned by the xmlParserNodeInfo for ISO-8859-1 documents
are exactly 41 characters less than those from the Scintilla widget.
After hacking about in the libxml source code, it appears to me that
this has something to do with the way the documents are parsed
according to their encoding and that this could account for the
variation. I am assuming it has to do with the position of the input
buffer _after_ the encoding declaration has been parsed.
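
To make that assumption concrete, here is a minimal sketch that
measures how many bytes of a document lie before the end of the
encoding declaration, so the offset could be computed rather than
hard-coded. It is only an illustration of the hypothesis above, not a
statement about what libxml does internally. The encoding_decl_end
name is hypothetical, and the document is assumed to be a
NUL-terminated buffer that starts with the XML declaration.

```c
#include <stddef.h>
#include <string.h>

/* Skip XML whitespace, never moving past `end`. */
static const char *skip_ws(const char *p, const char *end)
{
    while (p < end && (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n'))
        p++;
    return p;
}

/* Return the byte offset just past the closing quote of the encoding
 * value in the XML declaration, or 0 if there is no such declaration.
 * `doc` must be NUL-terminated. */
size_t encoding_decl_end(const char *doc)
{
    const char *end, *p;
    char quote;

    /* Must start with "<?xml" followed by whitespace; this rejects
     * other processing instructions such as "<?xml-stylesheet". */
    if (strncmp(doc, "<?xml", 5) != 0 || skip_ws(doc + 5, doc + 6) == doc + 5)
        return 0;

    end = strstr(doc, "?>");            /* end of the declaration */
    if (end == NULL)
        return 0;

    p = strstr(doc + 5, "encoding");    /* the pseudo-attribute name */
    if (p == NULL || p >= end)
        return 0;

    p = skip_ws(p + 8, end);
    if (p >= end || *p != '=')
        return 0;
    p = skip_ws(p + 1, end);
    if (p >= end || (*p != '"' && *p != '\''))
        return 0;

    quote = *p++;                       /* remember which quote opened */
    while (p < end && *p != quote)
        p++;
    if (p >= end)
        return 0;

    return (size_t)(p - doc) + 1;       /* one past the closing quote */
}
```

For a document beginning with
<?xml version="1.0" encoding="ISO-8859-1"?> this returns 41, which
matches the offset observed above. That declaration is an assumed
example, so the match is suggestive rather than proof.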

For now, I have a dirty, dirty little hack in place which determines
if the encoding is ISO-8859-1 and if so, it subtracts 41 from the
position passed. This is not good(tm) imho and I'm looking for a
better way, especially since looking through the source code has made
me aware of a far greater variety of document encodings out there
than I had hitherto realised. So I guess it's time to phrase my
questions:

* Is there an easier way of determining the correct offset for the
start position of a non-UTF-8 document other than a ghastly switch
statement with all of the potential offsets?

* Is it likely that access to the xmlParserNodeInfoSeq via
xmlParseDocument will be deprecated in the future and my code will
break on future versions?

I'm sure I'll have more questions as I proceed, but I would appreciate
some insight into the above from someone more familiar with the
internals of libxml than I.

Many thanks
Chris Daley

--
--------------------------------------
Chris Daley
Sydney, New South Wales
(EDT - UTC/GMT+11)

e: chebizarro gmail com
m: +61 437 031 214
s: chebizarro
tw: chebizarro
"There is no way to peace — peace is the way" - A.J. Muste


