[libxml2] Different approach to fix quadratic behavior in HTML push parser

From: Nick Wellnhofer <nwellnhof src gnome org>
To: commits-list gnome org
Cc:
Subject: [libxml2] Different approach to fix quadratic behavior in HTML push parser
Date: Mon, 10 Jan 2022 14:21:22 +0000 (UTC)

commit 798bdf13f6964a650b9a0b7b4b3a769f6f1d509a
Author: Nick Wellnhofer <wellnhofer aevum de>
Date:   Mon Jan 10 14:50:20 2022 +0100

    Different approach to fix quadratic behavior in HTML push parser
    
    The old approach introduced a regression, see issue #312 and the
    previous commit. Disable code that tries to recover from invalid start
    tags. This only affects "recovery" mode.
    
    Add a comment outlining a better fix in accordance with the HTML5 spec.

 HTMLparser.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)
---
diff --git a/HTMLparser.c b/HTMLparser.c
index d9d8d00d..9769ad5b 100644
--- a/HTMLparser.c
+++ b/HTMLparser.c
@@ -3958,13 +3958,25 @@ htmlParseStartTag(htmlParserCtxtPtr ctxt) {
        htmlParseErr(ctxt, XML_ERR_NAME_REQUIRED,
                     "htmlParseStartTag: invalid element name\n",
                     NULL, NULL);
+        /*
+         * The recovery code is disabled for now as it can result in
+         * quadratic behavior with the push parser. htmlParseStartTag
+         * must consume all content up to the final '>' in order to avoid
+         * rescanning for this terminator.
+         *
+         * For a proper fix in line with HTML5, htmlParseStartTag and
+         * htmlParseElement should only be called when there's an ASCII
+         * alpha character following the initial '<'. Otherwise, the '<'
+         * should be emitted as text (unless followed by '!', '/' or '?').
+         */
+#if 0
        /* if recover preserve text on classic misconstructs */
        if ((ctxt->recovery) && ((IS_BLANK_CH(CUR)) || (CUR == '<') ||
            (CUR == '=') || (CUR == '>') || (((CUR >= '0') && (CUR <= '9'))))) {
            htmlParseCharDataInternal(ctxt, '<');
            return(-1);
        }
-
+#endif
 
        /* Dump the bogus tag like browsers do */
        while ((CUR != 0) && (CUR != '>') &&
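
The HTML5-style fix outlined in the comment above could be sketched roughly as follows. This is only an illustration of the dispatch rule, not code from libxml2: the names lt_kind, is_ascii_alpha and classify_lt are hypothetical, and the sketch assumes the character after '<' is already available in the buffer (a push parser would first have to wait for more input at the end of a chunk).

```c
/*
 * Hypothetical sketch, not part of libxml2: decide how to treat a '<'
 * based on the single character that follows it.
 */
typedef enum {
    LT_START_TAG,   /* ASCII alpha follows: parse as a start tag */
    LT_MARKUP,      /* '!', '/' or '?' follows: comment, doctype, end tag, PI */
    LT_TEXT         /* anything else: emit the '<' as character data */
} lt_kind;

/* Locale-independent ASCII letter check. */
static int is_ascii_alpha(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/*
 * p must point at a '<' inside a NUL-terminated buffer. A NUL right
 * after the '<' is classified as text here; a real push parser would
 * instead request more data before deciding.
 */
static lt_kind classify_lt(const char *p) {
    unsigned char next = (unsigned char) p[1];

    if (is_ascii_alpha(next))
        return LT_START_TAG;
    if (next == '!' || next == '/' || next == '?')
        return LT_MARKUP;
    return LT_TEXT;
}
```

With a check like this in the caller, a start-tag parser would presumably never see input such as "< 5", "<=" or "<3", so it would not need a recovery path for invalid element names and could always consume up to the closing '>'.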
