[libxml2] Different approach to fix quadratic behavior in HTML push parser
- From: Nick Wellnhofer <nwellnhof src gnome org>
- To: commits-list gnome org
- Cc:
- Subject: [libxml2] Different approach to fix quadratic behavior in HTML push parser
- Date: Mon, 10 Jan 2022 14:21:22 +0000 (UTC)
commit 798bdf13f6964a650b9a0b7b4b3a769f6f1d509a
Author: Nick Wellnhofer <wellnhofer aevum de>
Date: Mon Jan 10 14:50:20 2022 +0100
Different approach to fix quadratic behavior in HTML push parser
The old approach introduced a regression, see issue #312 and the
previous commit. Disable code that tries to recover from invalid start
tags. This only affects "recovery" mode.
Add a comment outlining a better fix in accordance with the HTML5 spec.
HTMLparser.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
---
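The comment added by this patch describes a dispatch rule for the character that follows '<'. Below is a minimal sketch of that rule only; it is not libxml2 code. The names `lt_kind`, `is_ascii_alpha` and `classify_lt` are hypothetical, and the sketch assumes the lookahead character is already in the buffer (a push parser would first have to wait for more data if '<' is the last byte received).

```c
#include <stddef.h>

/* Hypothetical classification of the input that starts with '<'. */
typedef enum {
    LT_START_TAG,  /* '<' followed by an ASCII letter: parse a start tag */
    LT_MARKUP,     /* '<!', '</' or '<?': handled by other parser paths */
    LT_TEXT        /* anything else: the '<' is plain character data */
} lt_kind;

/* ASCII-only test, independent of the current locale. */
static int is_ascii_alpha(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/*
 * s must point at a NUL-terminated buffer. If s[0] is '<', then s[1]
 * is the lookahead character (possibly the terminating NUL).
 */
static lt_kind classify_lt(const char *s) {
    char next;

    if (s == NULL || s[0] != '<')
        return LT_TEXT;

    next = s[1];
    if (is_ascii_alpha(next))
        return LT_START_TAG;
    if (next == '!' || next == '/' || next == '?')
        return LT_MARKUP;
    return LT_TEXT;
}
```

With a check of this kind made before the start-tag parser is entered, inputs such as "<1>", "< b" or "<=" would be treated as text up front, so the start-tag parser would never see an invalid element name and would not need the recovery branch that is disabled here.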
diff --git a/HTMLparser.c b/HTMLparser.c
index d9d8d00d..9769ad5b 100644
--- a/HTMLparser.c
+++ b/HTMLparser.c
@@ -3958,13 +3958,25 @@ htmlParseStartTag(htmlParserCtxtPtr ctxt) {
htmlParseErr(ctxt, XML_ERR_NAME_REQUIRED,
"htmlParseStartTag: invalid element name\n",
NULL, NULL);
+ /*
+ * The recovery code is disabled for now as it can result in
+ * quadratic behavior with the push parser. htmlParseStartTag
+ * must consume all content up to the final '>' in order to avoid
+ * rescanning for this terminator.
+ *
+ * For a proper fix in line with HTML5, htmlParseStartTag and
+ * htmlParseElement should only be called when there's an ASCII
+ * alpha character following the initial '<'. Otherwise, the '<'
+ * should be emitted as text (unless followed by '!', '/' or '?').
+ */
+#if 0
/* if recover preserve text on classic misconstructs */
if ((ctxt->recovery) && ((IS_BLANK_CH(CUR)) || (CUR == '<') ||
(CUR == '=') || (CUR == '>') || (((CUR >= '0') && (CUR <= '9'))))) {
htmlParseCharDataInternal(ctxt, '<');
return(-1);
}
-
+#endif
/* Dump the bogus tag like browsers do */
while ((CUR != 0) && (CUR != '>') &&