Re: [xml] Parsing tag-soup HTML
- From: Nick Kew <nick webthing com>
- To: xml gnome org
- Subject: Re: [xml] Parsing tag-soup HTML
- Date: Mon, 18 Jun 2007 14:02:39 +0100
On Mon, 18 Jun 2007 08:14:01 -0400
Daniel Veillard <veillard redhat com> wrote:
> Out of context. I wonder why you think the reader would be that
> much slower. I did only XML tests but the cost was within 20% of the
> SAX parsing speed.
Because it lacks a ParseChunk API, which means it can't work with
Apache's pipelined filter architecture.  Unless you've added
such an API since I last looked.
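
To illustrate what a chunk-fed (push) interface has to cope with, here is a minimal sketch. It uses Python's standard-library html.parser purely as a stand-in and says nothing about libxml2's own API; TagLogger and parse_in_chunks are hypothetical names. The point is that input arrives in arbitrary pieces, so the parser has to buffer a tag that is split across two of them.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records start and end tags in the order they are seen."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(("start", tag))

    def handle_endtag(self, tag):
        self.tags.append(("end", tag))


def parse_in_chunks(chunks):
    """Feed the parser one chunk at a time, as a streaming filter would."""
    parser = TagLogger()
    for chunk in chunks:
        # An incomplete tag at the end of a chunk is held back until
        # the next feed() call completes it.
        parser.feed(chunk)
    parser.close()
    return parser.tags


if __name__ == "__main__":
    # Chunk boundaries deliberately fall in the middle of tags.
    print(parse_in_chunks(["<body><h", "1>Hello, Wor", "ld</h1></bo", "dy>"]))
```

Text content may be delivered in several pieces under this scheme, which is why the sketch compares tag events only.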
>> So in terms of a first-iteration draft wishlist, tag-soup mode
>> should:
>>   - avoid inserting any implied tags in a SAX parse
>
> That would be contrary to what Tag Soup actually means for most
> people as I pointed out.
OK, consider the example referenced from my blog in my first post,
coming from a Microsoft SharePoint backend, which inserted a bogus
<meta> at the top.
Try running the following through "xmllint --html":
<meta http-equiv="content-type" content="text/html;charset=ascii" />
<html lang="en">
<head><title>foo</title></head>
<body><h1>Hello, World</h1></body>
</html>
and it becomes:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><meta http-equiv="content-type"
content="text/html;charset=ascii"></head>
<body>
<p> lang="en">
</p>
<title>foo</title>
<h1>Hello, World</h1>
</body>
</html>
From the point of view of the user, that's worse than the original,
because real-life browsers will render that first bogus paragraph.
It's because of examples like that that I want to make it a
configurable option NOT to insert any inferred tags.
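
A minimal sketch of an event stream with no inferred tags, for the sample input above. It assumes Python's standard-library html.parser as a stand-in (a different parser from the one behind "xmllint --html"); TagRecorder and tag_events are hypothetical names. The parser reports only the tags present in the input and synthesizes none, so no stray <p> appears.

```python
from html.parser import HTMLParser

SAMPLE = """<meta http-equiv="content-type" content="text/html;charset=ascii" />
<html lang="en">
<head><title>foo</title></head>
<body><h1>Hello, World</h1></body>
</html>
"""


class TagRecorder(HTMLParser):
    """Records tag events as they appear in the input; infers nothing."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for XML-style empty tags such as <meta ... />.
        self.events.append(("empty", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(html):
    """Return the list of (kind, tag) events for an HTML string."""
    recorder = TagRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder.events


if __name__ == "__main__":
    for event in tag_events(SAMPLE):
        print(event)
```

The misplaced <meta> is still reported first, ahead of <html>, so a filter working on these events sees the document as it was written.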
 
>>   - treat contents of <script></script> and <style></style> as raw
>>     CDATA, and don't parse it.
>
> You need *some* parsing just to detect the end of tag, and now
> you're back to the origin, what criteria will you keep
>     </
>     </sc
>     </script
>     </script>
>     </SCRIPT
>     </ScRIpT
>     </SCRIPT >
> ?
Case-insensitive "</script" is the token to look for.
Having found it, we then look for ">" preceded by zero or
more whitespace chars.
Yes, that'll still screw up on document.write('</script>').
Needs more thought.  But at least it will leave things like
<script>
    document.write('<p>Something</p>');
</script>
intact.
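
The end-tag rule described above can be sketched as a single regular expression. This is an illustration in Python, not code from libxml2 or any Apache module; SCRIPT_END and split_script_body are hypothetical names, and the rule is read as case-insensitive "</script", then optional whitespace, then ">".

```python
import re

# Case-insensitive "</script", zero or more whitespace chars, then ">".
SCRIPT_END = re.compile(r"</script\s*>", re.IGNORECASE)


def split_script_body(text):
    """Split text that follows an opening <script> tag.

    Returns (raw_body, rest), where raw_body is everything before the
    first end tag and rest is everything after it, or None when no
    complete end tag has been seen yet.
    """
    match = SCRIPT_END.search(text)
    if match is None:
        return None
    return text[:match.start()], text[match.end():]


if __name__ == "__main__":
    body, rest = split_script_body(
        "\n    document.write('<p>Something</p>');\n</script>\n<p>after</p>"
    )
    print(repr(body))
    print(repr(rest))
```

Of the candidates listed earlier, only "</script>" and "</SCRIPT >" match; the truncated forms do not. As already noted, the rule still ends the script early on document.write('</script>').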
>> Sounds like he's using "tag soup" to mean something that cleans it
>> up, in the tradition of Tidy or AccessValet.  I'm contemplating the
>> exact opposite: something that leaves it intact!
>
> And I think as an API you just can't! You will break apps if you
> deliver <em> aaa <b> bbb </em> ccc </b>
> as two opening tags and then two closing tags, but inverted.
Cases like that don't seem to hit my inbox.  I guess that's because
even FrontPage weenies don't produce code like that (or if they do,
they can see what's wrong for themselves).
> Seems what you want is textual transformation only, and in that case
> a parser doesn't sound like the best tool to implement this. But
> maybe I misunderstand.
Yes, you could be right.  That's the other option.
I already have a simple sed-like filter (mod_line_edit), which
offers a fallback to users with hopelessly broken markup they
can't do anything about.  But that loses the point and the power
of a markup-aware parser generating a stream of events.
-- 
Nick Kew
Application Development with Apache - the Apache Modules Book
http://www.apachetutor.org/