Hello, authors of libxml2.
I'm using libxml2 to parse HTML and it sometimes produces the wrong result. Under some unusual circumstances, when the parser sees "</script>" it doesn't close the script tag. Instead, it adds the literal text "</script>" to the text node and continues parsing the rest of the input verbatim as if it were script content.
After a lot of debugging, I determined the problem is in libxml2 and not the other libraries in my stack, and that it only seems to happen on version 2.9.8. But I don't see any related changes in news.html for 2.9.9, nor in the diff between them, so I am still worried: I don't know if the bug is really fixed, or just dormant. I hope you can find the root cause, and maybe add a regression test if you do.
## How to reproduce:
I've uploaded an example HTML file that exhibits this problem, and a Python program to show it. It uses libxml2 via the lxml library. The commands below download the manylinux binary build of lxml 4.2.5, which is statically linked to libxml2 2.9.8.
$ virtualenv -p python3 ./venv
$ ./venv/bin/pip install --upgrade pip
$ ./venv/bin/pip install lxml==4.2.5
$ ./venv/bin/python test.py
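For readers without the attachment, the symptom can be checked with something like the following. This is a minimal sketch, not the uploaded test.py: the function names and the file name `example.html` are placeholders chosen for illustration. It flags any script element whose text still contains a literal closing tag.

```python
def find_leaked_close_tags(root):
    """Return <script> elements whose text contains a literal '</script'.

    Works on any tree exposing the ElementTree API (lxml or the stdlib).
    """
    leaked = []
    for el in root.iter("script"):
        text = el.text or ""
        if "</script" in text.lower():
            leaked.append(el)
    return leaked


def check_file(path):
    """Parse an HTML file with lxml and report leaked closing tags.

    The path is supplied by the caller; 'example.html' in the usage note
    is a hypothetical name, not the name of the uploaded file.
    """
    # Imported here so the helper above stays usable without lxml.
    import lxml.html

    with open(path, "rb") as f:
        root = lxml.html.fromstring(f.read())
    return find_leaked_close_tags(root)
```

Calling `check_file("example.html")` inside the virtualenv should return a non-empty list on an affected build and an empty list otherwise.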
I couldn't shorten the file very much, because if I delete even a single character, the bug stops triggering. (Could it be some buffer boundary issue?) Instead, I replaced most unimportant tags and text nodes with dummy text.
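One way to probe the buffer-boundary guess is to generate synthetic documents in which "</script>" starts at each byte offset in some range and see which ones lose the content after the script. This is only a sketch under assumptions: the helper names are made up, the offset range in the usage note is arbitrary, and a synthetic input this simple may not trigger the bug at all, since the real file seems to depend on its exact contents.

```python
def make_document(close_tag_offset):
    """Build an HTML document whose '</script>' starts at the given byte offset."""
    head = b"<html><body><script>"
    tail = b"</script><p>after</p></body></html>"
    padding = close_tag_offset - len(head)
    if padding < 0:
        raise ValueError("offset lies inside the document prefix")
    # Filler script content pushes the closing tag to the requested offset.
    return head + b"x" * padding + tail


def probe_offsets(parse, offsets):
    """Return the offsets for which the parsed tree has no <p> element.

    A missing <p> means the parser swallowed the rest of the input into
    the script text. `parse` is any callable mapping bytes to a root
    element, for example lxml.html.fromstring.
    """
    bad = []
    for offset in offsets:
        root = parse(make_document(offset))
        if root.find(".//p") is None:
            bad.append(offset)
    return bad
```

For example, `probe_offsets(lxml.html.fromstring, range(3000, 9000))` would list the offsets at which the closing tag is missed; an empty list means this simple shape does not trigger it.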
## Affected versions:
- lxml 4.1.1, which contains libxml2 2.9.7, is not affected.
- lxml 4.2.5, which contains libxml2 2.9.8, IS affected.
- lxml 4.3.0, which contains libxml2 2.9.9, is not affected (at least for this particular HTML file).
I also built my own lxml 4.2.5 with libxml2 2.9.9 and it was not affected. So I believe this is a bug in libxml2 2.9.8 specifically, and not in a particular version of lxml. I used this command:
STATIC_DEPS=true LDFLAGS='-flto -fPIC' CFLAGS='-O3 -g1 -pipe -fPIC -flto' LIBXML2_VERSION=2.9.9 ./venv/bin/pip install --no-binary :all: -vvv lxml==4.2.5
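To confirm which libxml2 a given lxml build actually uses, lxml exposes the version tuples `etree.LIBXML_VERSION` (the library in use at runtime) and `etree.LIBXML_COMPILED_VERSION` (the one it was compiled against). The sketch below wraps them; the helper names are illustrative, and the "affected" check simply encodes the observations listed above, not a confirmed root cause.

```python
def format_version(version):
    """Turn a version tuple such as (2, 9, 8) into the string '2.9.8'."""
    return ".".join(str(part) for part in version)


def is_reported_affected(version):
    """True only for 2.9.8, the one version observed to misparse the test file."""
    return tuple(version) == (2, 9, 8)


def libxml2_versions():
    """Return the libxml2 versions lxml runs with and was compiled against."""
    # Imported here so the pure helpers above work without lxml installed.
    from lxml import etree

    return {
        "runtime": etree.LIBXML_VERSION,
        "compiled": etree.LIBXML_COMPILED_VERSION,
    }
```

Running `format_version(libxml2_versions()["runtime"])` in each virtualenv shows whether the rebuilt lxml 4.2.5 really picked up 2.9.9.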
My particular test case doesn't trigger the bug in 2.9.9, but I don't know whether that's because the bug is really fixed, or because some constants/offsets have changed and it now triggers on other HTML files.
I hope you can solve the mystery. Please let me know if I can be of any help. And thanks for reading!
Tomi