<html><body><p><a href="">
Notice that the parsed output contains a script tag that was not a separate node in the input. For comparison, here is a small Java/Xerces program:
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.*;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
public class XmlTester {
    public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException, TransformerException {
        String text = "<html><body><p><a href="">
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        // text contains the XML content
        Document doc = builder.parse(new InputSource(new StringReader(text)));
        System.out.println(getStringFromDocument(doc));
    }

    // Serialize the DOM back to a string with an identity transform.
    public static String getStringFromDocument(Document doc) throws TransformerException {
        DOMSource domSource = new DOMSource(doc);
        StringWriter writer = new StringWriter();
        StreamResult result = new StreamResult(writer);
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer transformer = tf.newTransformer();
        transformer.transform(domSource, result);
        return writer.toString();
    }
}
This Java code, with the same input, results in the following output:
<html>
<body>
<p>
<a href="">
</p>
</body>
</html>
Here the attribute contents are escaped on output, so they cannot break out of the attribute when the result is parsed again. The libxml2 behavior doesn't apply to all attributes: if we change the href to a class attribute, there is no issue. This likely makes sense, since the above-mentioned commit specifically refers to not URI-escaping.
-> libxml2 git:(bc5a5d65) ✗ cat test/HTML/ssiquote.html
<html><body><p><a class='<!--"><script>alert(1)</script>-->'>test1</a></p></body></html>
-> libxml2 git:(bc5a5d65) ✗ make testHTML
-> libxml2 git:(bc5a5d65) ✗ ./testHTML test/HTML/ssiquote.html
<html><body><p><a class='<!--"><script>alert(1)</script>-->'>test1</a></p></body></html>
So, I guess the question is, what do people think? I believe the argument from Daniel was roughly that this is expected behavior for server-side includes. However, it conflicts with the Xerces behavior, and it also gives a trivial way to introduce new, unexpected nodes into the tree simply by parsing the document.
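
The Xerces side of that comparison can be checked with a minimal sketch like the one below. It is only an illustration, not the test used above: the class name `AttrEscapeCheck` and its helper methods are hypothetical, the payload is copied from the class-attribute test case, and it assumes the JDK's default JAXP parser and identity transformer. It builds the element through the DOM API, serializes it, parses the result again, and counts script elements.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class AttrEscapeCheck {
    // Payload copied from the class-attribute test case in this post.
    static final String PAYLOAD = "<!--\"><script>alert(1)</script>-->";

    // Build <a href=PAYLOAD>test1</a> via the DOM API, so the payload is plain attribute data.
    static Document buildDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element a = doc.createElement("a");
            a.setAttribute("href", PAYLOAD);
            a.setTextContent("test1");
            doc.appendChild(a);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Serialize with an identity transform, as getStringFromDocument does above.
    static String serialize(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Parse a string back into a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String out = serialize(buildDocument());
        Document reparsed = parse(out);
        System.out.println(out);
        // Number of script elements that appear after the round trip.
        System.out.println(reparsed.getElementsByTagName("script").getLength());
        // Whether the attribute value survived unchanged.
        System.out.println(PAYLOAD.equals(reparsed.getDocumentElement().getAttribute("href")));
    }
}
```

If the serializer escapes the markup characters in the attribute value, as the Java output earlier suggests, the round trip should introduce no script element and the href value should come back unchanged.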