Re: [xml] Parsing tag-soup HTML

From: Nick Kew <nick webthing com>
To: xml gnome org
Subject: Re: [xml] Parsing tag-soup HTML
Date: Mon, 18 Jun 2007 14:02:39 +0100

On Mon, 18 Jun 2007 08:14:01 -0400
Daniel Veillard <veillard redhat com> wrote:

  Out of context. I wonder why you think the reader would be that
much slower. I did only XML tests but the cost was within 20% of the
SAX parsing speed.


Because it lacks a ParseChunk API, which means it can't work with
Apache's pipelined filter architecture.  Unless you've added
such an API since I last looked.
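
To make the requirement concrete: a ParseChunk-style (push) API lets the caller hand over whatever data has arrived so far, even if a chunk ends in the middle of a tag, and events come out as soon as they are complete. Below is a minimal sketch of that calling pattern. It uses Python's standard html.parser purely as a stand-in; it is not libxml2 and not the reader API under discussion, and the names EventCollector and parse_in_chunks are invented for illustration.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records parse events in the order they fire."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Text may arrive split across chunks, so merge adjacent pieces.
        if self.events and self.events[-1][0] == "data":
            self.events[-1] = ("data", self.events[-1][1] + data)
        else:
            self.events.append(("data", data))


def parse_in_chunks(document, chunk_size):
    """Feed the document piece by piece, as a pipeline filter would."""
    parser = EventCollector()
    for i in range(0, len(document), chunk_size):
        parser.feed(document[i:i + chunk_size])  # a chunk may end mid-tag
    parser.close()  # flush anything still buffered
    return parser.events
```

The caller never needs the whole document in memory: each call to feed() corresponds to one chunk passing through the filter, and the resulting events do not depend on where the chunk boundaries fall.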

So in terms of a first-iteration draft wishlist, tag-soup mode
should:
  - avoid inserting any implied tags in a SAX parse


  That would be contrary to what Tag Soup actually means for most
people as I pointed out.


OK, consider the example referenced from my blog in my first post,
coming from a Microsoft SharePoint backend, which inserted a bogus
<meta> at the top.

Try running the following through "xmllint --html":

<meta http-equiv="content-type" content="text/html;charset=ascii" />
<html lang="en">
<head><title>foo</title></head>
<body><h1>Hello, World</h1></body>
</html>

and it becomes:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><meta http-equiv="content-type"
content="text/html;charset=ascii"></head>
<body>
<p> lang="en"&gt;
</p>
<title>foo</title>
<h1>Hello, World</h1>
</body>
</html>

From the point of view of the user, that's worse than the original,
because real-life browsers will render that first bogus paragraph.
It's because of examples like that that I want to make it a
configurable option NOT to insert any inferred tags.
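
As an illustration of the behaviour being asked for (not of anything libxml2 currently offers), here is a sketch that runs the same sample through a parser that reports only the tags actually present in the input. It again uses Python's standard html.parser as a stand-in, and the names TagLogger and start_tags are made up for this example.

```python
from html.parser import HTMLParser

# The sample document from above, bogus leading <meta> included.
SOURCE = """<meta http-equiv="content-type" content="text/html;charset=ascii" />
<html lang="en">
<head><title>foo</title></head>
<body><h1>Hello, World</h1></body>
</html>
"""


class TagLogger(HTMLParser):
    """Hypothetical helper: logs each start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        self.tags.append((tag, dict(attrs)))


def start_tags(markup):
    """Return the (tag, attributes) pairs seen, in document order."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.tags
```

With no inferred <html>, <head> or <p> being inserted, the stray <meta> is just one more event ahead of <html lang="en">, and the lang attribute survives as an attribute rather than turning into paragraph text.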

  - treat contents of <script></script> and <style></style> as raw
    CDATA, and don't parse it.


  You need *some* parsing just to detect the end of tag, and now
you're back to the origin, what criteria will you keep

    </
    </sc
    </script
    </script>
    </SCRIPT
    </ScRIpT
    </SCRIPT >
 
 ?


Case-insensitive "</script" is the token to look for.
Having found it, we then look for ">" preceded by zero or
more whitespace chars.

Yes, that'll still screw up on document.write('</script>').
Needs more thought.  But at least it will leave things like
<script>
    document.write('<p>Something</p>');
</script>
intact.
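
The rule described above (case-insensitive "</script", then optional whitespace, then ">") can be sketched as a single regular expression. This is only a sketch of the proposal, not existing libxml2 behaviour; SCRIPT_END and split_script are names made up for the example.

```python
import re

# "</script" in any case, then zero or more whitespace chars, then ">".
SCRIPT_END = re.compile(r"</script\s*>", re.IGNORECASE)


def split_script(text):
    """Split text that begins just after an opening <script> tag.

    Returns (raw_script_body, remainder_after_end_tag), or None when no
    complete end tag has been seen yet (for example, more input is needed).
    """
    match = SCRIPT_END.search(text)
    if match is None:
        return None
    return text[:match.start()], text[match.end():]
```

Of the candidates listed earlier, only "</script>" and "</SCRIPT >" terminate the element; "</", "</sc", "</script", "</SCRIPT" and "</ScRIpT" do not, because the closing ">" is missing. As noted, the body is still cut short at the first "</script>" inside a string literal such as document.write('</script>').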

Sounds like he's using "tag soup" to mean something that cleans it
up, in the tradition of Tidy or AccessValet.  I'm contemplating the
exact opposite: something that leaves it intact!


  And I think as an API you just can't ! You will break apps if you
deliver <em> aaa <b> bbb </em> ccc </b>
 as 2 opening tag and then 2 closing tag but inverted.


Cases like that don't seem to hit my inbox.  I guess that's because
even frontpage-weenies don't produce code like that (or if they do,
they can see what's wrong for themselves).

Seems what you want is textual transformation only, and in that case
a parser doesn't sound like the best tool to implement this. But
maybe I misunderstand.


Yes, you could be right.  That's the other option.

I already have a simple sed-like filter (mod_line_edit), which
offers a fallback to users with hopelessly broken markup they
can't do anything about.  But that loses the point and the power
of a markup-aware parser generating a stream of events.

-- 
Nick Kew

Application Development with Apache - the Apache Modules Book
http://www.apachetutor.org/
