On Fri, May 13, 2011 at 09:09:12AM -0400, Extra Fu wrote:
Hello,

I'm using libxml2 (2.7.6) and I have a question regarding encoding precedence. I have an array of bytes (UTF-8 HTML data) and I invoke htmlCreatePushParserCtxt() with the encoding set to XML_CHAR_ENCODING_UTF8. When I walk the document's nodes, everything is fine unless the HTML file was poorly generated, such as:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
...

The charset specified here is wrong, as the HTML data is truly UTF-8 (I know for sure). Nonetheless, the charset specified by the meta tag seems to take precedence over the encoding specified in the htmlCreatePushParserCtxt() call. That is, when walking the document's nodes with that wrong charset in place, the xmlNodePtr content doesn't seem to be UTF-8, which messes up my handler as it expects UTF-8 data.

How can I best handle this? I could of course strip the charset parameter of the meta tag before calling htmlCreatePushParserCtxt(), but I would rather "force" libxml to trust me and use UTF-8 on that poorly generated content.

Yes, that's a problem; you ended up hitting a libxml2 deficiency: there is no way to force the parser to ignore the encoding defined in the document. In your case the encoding you provide is UTF-8, which is the internal one, and as a result libxml2 behaves as if no hint had been given on context creation.

For XML, the way to deal with encodings is defined in appendix F
  http://www.w3.org/TR/REC-xml/#sec-guessing
where the "environment" encoding given normally preempts any internally defined one. Still, I think the simplest is to actually provide a way to force ignoring internal encodings when necessary, e.g. when the framework automatically transcodes the document encoding.

The attached patch does this; it includes a new option --noenc for xmllint:

paphio:~/XML -> cat tst.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=foo">
</head>
<body>
some content
</body>
</html>
paphio:~/XML -> xmllint --html --noout tst.html
tst.html:2: HTML parser error : htmlCheckEncoding: unknown encoding foo
<meta http-equiv="Content-Type" content="text/html; charset=foo">
                                                                ^
paphio:~/XML -> xmllint --html --noout --noenc tst.html
paphio:~/XML ->

I also modified the output code so that an unknown internal encoding no longer results in a silently dropped document with no error:

paphio:~/XML -> xmllint --html --noenc tst.html
output error : unknown encoding foo
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=foo"></head>
<body>
some content
</body>
</html>
paphio:~/XML ->

It works for XML too:

paphio:~/XML -> xmllint enc.xml
enc.xml:1: parser error : Unsupported encoding foo
<?xml version="1.0" encoding="foo"?>
                                  ^
paphio:~/XML -> xmllint --noenc enc.xml
<?xml version="1.0"?>
<tst/>
paphio:~/XML ->

In that case the encoding is completely dropped from the output (which differentiates the processing from the case where the encoding is just passed to the parser; then the encoding= is preserved).

This may not be a good option for you if you are stuck with a released version, but it's better to fix libxml2 there, and as you say, right now you will have to preprocess the input...

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel veillard com  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/
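
For anyone stuck on a released version, the preprocessing workaround discussed in this thread (stripping the charset parameter from the meta tag before the bytes reach the parser) could be sketched roughly as follows. This is only an illustration, not part of the patch: the helper names find_ci() and blank_meta_charset() are hypothetical, and the sketch assumes the whole <meta> tag sits in a single buffer (so with a push parser, the first chunk must contain the document head), that the tag uses the http-equiv/content form shown above, and that there is no whitespace around the '=' sign. It overwrites the parameter with spaces so the buffer length does not change.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive search for an ASCII needle in buf[0..len).
 * Hypothetical helper, not a libxml2 function. */
static char *find_ci(char *buf, size_t len, const char *needle)
{
    size_t n = strlen(needle);

    if (n == 0 || len < n)
        return NULL;
    for (size_t i = 0; i + n <= len; i++) {
        size_t j = 0;
        while (j < n &&
               tolower((unsigned char)buf[i + j]) ==
               tolower((unsigned char)needle[j]))
            j++;
        if (j == n)
            return buf + i;
    }
    return NULL;
}

/* Overwrite any "charset=value" found inside a <meta ...> tag with spaces.
 * The buffer is modified in place and keeps its length.
 * Returns the number of charset parameters blanked.
 * Hypothetical helper, not a libxml2 function. */
size_t blank_meta_charset(char *buf, size_t len)
{
    size_t count = 0;
    char *p = buf;
    char *end = buf + len;

    while ((p = find_ci(p, (size_t)(end - p), "<meta")) != NULL) {
        char *tag_end = memchr(p, '>', (size_t)(end - p));
        if (tag_end == NULL)
            tag_end = end;      /* unterminated tag: scan to end of buffer */

        char *cs = find_ci(p, (size_t)(tag_end - p), "charset");
        if (cs != NULL) {
            /* Blank "charset", the '=', and the value, stopping at the
             * first quote, semicolon, or whitespace character. */
            while (cs < tag_end && *cs != '"' && *cs != '\'' &&
                   *cs != ';' && !isspace((unsigned char)*cs))
                *cs++ = ' ';
            count++;
        }
        p = tag_end;
    }
    return count;
}
```

The idea would be to call blank_meta_charset() on the byte array before handing it to htmlCreatePushParserCtxt(), so that the parser finds no charset declaration in the document and has nothing to switch to. Text outside <meta> tags is left untouched.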

Attachment: noenc.patch
Description: Text document