Re: [xml] Encodings precedence

From: Daniel Veillard <veillard redhat com>
To: Extra Fu <extrafu gmail com>
Cc: xml gnome org
Subject: Re: [xml] Encodings precedence
Date: Mon, 16 May 2011 16:24:46 +0800

On Fri, May 13, 2011 at 09:09:12AM -0400, Extra Fu wrote:

Hello,

I'm using libxml2 (2.7.6) and I have a question regarding encoding
precedence.

I have an array of bytes (UTF-8 HTML data) and I invoke
htmlCreatePushParserCtxt() with the encoding set to XML_CHAR_ENCODING_UTF8.
When I walk in the document's nodes, everything is fine unless the HTML file
was poorly generated, such as:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
...

The charset specified here is wrong as the HTML data is truly UTF-8 (I know
for sure). Nonetheless, the charset specified by the meta tag seems to take
precedence over the encoding specified in the htmlCreatePushParserCtxt() call.

That is, when walking in the document's nodes using that wrong charset, it
seems that the xmlNodePtr's content isn't in UTF-8 - messing up my handler
as it expects UTF-8 data.

How can I best handle this? I could of course strip the charset parameter
from the meta tag prior to calling htmlCreatePushParserCtxt() but I would
rather "force" libxml to trust me and use UTF-8 on that poorly generated
content.


  Yes, that's a problem: you have hit a libxml2 deficiency, namely that
there is no way to force it to ignore the encoding defined in the document.
In your case the encoding you provide is UTF-8, which is the internal
one, and as a result libxml2 behaves as if no hint had been given on
context creation.
  For XML, the way to handle encodings is described in Appendix F
   http://www.w3.org/TR/REC-xml/#sec-guessing
where an encoding supplied by the "environment" normally preempts any
internally declared one.
  Still, I think the simplest fix is to actually provide a way to force
ignoring internal encodings when necessary, e.g. when the framework
automatically transcodes the document encoding. The attached patch does
this, and adds a new --noenc option to xmllint which enables it:

paphio:~/XML -> cat tst.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=foo">
</head>
<body>
  some content
</body>
</html>
paphio:~/XML -> xmllint --html --noout tst.html
tst.html:2: HTML parser error : htmlCheckEncoding: unknown encoding foo
<meta http-equiv="Content-Type" content="text/html; charset=foo">
                                                                ^
paphio:~/XML -> xmllint --html --noout --noenc tst.html
paphio:~/XML ->

  I also modified the output code so that an unknown internal encoding no
longer results in a silently dropped document with no error:

paphio:~/XML -> xmllint --html --noenc tst.html
output error : unknown encoding foo
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=foo"></head>
<body>
  some content
</body>
</html>
paphio:~/XML ->

Works for XML too:

paphio:~/XML -> xmllint enc.xml
enc.xml:1: parser error : Unsupported encoding foo
<?xml version="1.0" encoding="foo"?>
                                  ^
paphio:~/XML -> xmllint --noenc enc.xml
<?xml version="1.0"?>
<tst/>
paphio:~/XML ->

In that case the encoding is completely dropped from the output (which
differentiates this processing from the case where the encoding is just
passed to the parser, in which case the encoding= is preserved).

This may not be a good option for you if you are stuck with a released
version, but it's better to fix libxml2 there, and as you say right now
you will have to preprocess the input...
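
For the preprocessing workaround, a minimal sketch in plain C might look
like the following. It is only an illustration, not part of libxml2 or of
the attached patch: the helper names find_ci() and blank_meta_charset() are
made up for this example. It assumes the whole <meta ...> tag is present in
the buffer, and it overwrites the charset declaration with spaces in place,
so the buffer length and all byte offsets stay unchanged.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: case-insensitive search for a lowercase `needle`
 * in buf[from, to). Returns the match index, or `to` if not found. */
static size_t find_ci(const char *buf, size_t from, size_t to,
                      const char *needle)
{
    size_t n = strlen(needle);
    for (size_t i = from; n > 0 && i + n <= to; i++) {
        size_t k = 0;
        while (k < n && tolower((unsigned char)buf[i + k]) == needle[k])
            k++;
        if (k == n)
            return i;
    }
    return to;
}

/* Hypothetical helper: blank out the first "charset=..." found inside
 * each <meta ...> tag, in place. Returns the number of declarations
 * blanked. Simplification: a tag is taken to end at the first '>'. */
static int blank_meta_charset(char *buf, size_t len)
{
    int blanked = 0;
    size_t pos = 0;

    assert(buf != NULL);
    while ((pos = find_ci(buf, pos, len, "<meta")) < len) {
        size_t end = pos;
        while (end < len && buf[end] != '>')
            end++;

        size_t cs = find_ci(buf, pos, end, "charset=");
        if (cs < end) {
            size_t stop = cs + strlen("charset=");
            if (stop < end && (buf[stop] == '"' || buf[stop] == '\'')) {
                /* Quoted value, e.g. <meta charset="foo">: blank through
                 * the closing quote. */
                char quote = buf[stop++];
                while (stop < end && buf[stop] != quote)
                    stop++;
                if (stop < end)
                    stop++;
            } else {
                /* Bare value inside content="...": stop at the delimiter. */
                while (stop < end && !isspace((unsigned char)buf[stop]) &&
                       buf[stop] != '"' && buf[stop] != '\'' &&
                       buf[stop] != ';')
                    stop++;
            }
            memset(buf + cs, ' ', stop - cs);
            blanked++;
        }
        pos = end;
    }
    return blanked;
}
```

The idea would be to call blank_meta_charset() on the data before handing
it to the push parser, so the parser finds no charset in the meta tag and
has nothing to override the UTF-8 hint with. If the input arrives in
chunks, a meta tag split across two chunks would be missed by this sketch,
so the head of the document should be buffered first.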

Daniel
-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/

Attachment: noenc.patch
Description: Text document
