[xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8



Hello, authors of libxml2.

I'm using libxml2 to parse HTML and it sometimes produces the wrong result. In some weird circumstances, when the parser sees "</script>" it doesn't close the script tag. Instead, it adds the literal text "</script>" to the text node and continues parsing the rest of the input verbatim, as if it were script content.

After a lot of debugging, I determined the problem is in libxml2 and not the other libraries in my stack, and that it only seems to happen on version 2.9.8. But I don't see any related changes in news.html for 2.9.9, nor in the diff between them, so I am still worried: I don't know if the bug is really fixed, or just dormant. I hope you can find the root cause, and maybe add a regression test if you do.

## How to reproduce:

Test case: https://gist.github.com/TomiBelan/c9949b6519d115500b742393da61b188

I've uploaded an example HTML file that exhibits this problem, and a Python program to show it. It uses libxml2 via the lxml library. The commands below download the manylinux binary build of lxml 4.2.5, which is statically linked to libxml2 2.9.8, and run the test:

$ virtualenv -p python3 ./venv
$ ./venv/bin/pip install --upgrade pip
$ ./venv/bin/pip install lxml==4.2.5
$ ./venv/bin/python test.py

I couldn't shorten the file very much, because if I delete even a single character, the bug stops triggering. (Could it be some buffer boundary issue?) Instead, I replaced most unimportant tags and text nodes with dummy text.
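
For readers who can't fetch the gist, the symptom can be checked roughly as follows. This is not the test.py from the gist, only a minimal sketch of the same idea: parse the file and look for a script element whose text still contains a literal "</script>". The function names and the file name in the usage note are made up for illustration.

```python
def scripts_with_leaked_end_tag(root):
    """Return the <script> elements whose text contains a literal end tag.

    `root` can be any ElementTree-style element (lxml or xml.etree).
    In a correctly parsed document the list is empty, because "</script>"
    always terminates the element and never appears inside its text.
    """
    leaked = []
    for script in root.iter("script"):
        text = script.text or ""
        if "</script" in text.lower():
            leaked.append(script)
    return leaked


def check_file(path):
    """Parse an HTML file with lxml and report leaked end tags.

    Hypothetical helper: returns True if the file parsed cleanly.
    lxml is imported lazily so the function above stays usable without it.
    """
    import lxml.html  # third-party; uses the linked libxml2 HTML parser

    root = lxml.html.parse(path).getroot()
    if root is None:  # empty document, nothing to check
        return True
    leaked = scripts_with_leaked_end_tag(root)
    for script in leaked:
        print("literal </script> inside script text, line", script.sourceline)
    return not leaked
```

Calling `check_file("example.html")` (a placeholder name) inside the virtualenv above should return False on an affected build if the file triggers the bug, and True otherwise.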

## Affected versions:

- lxml 4.1.1, which contains libxml2 2.9.7, is not affected.
- lxml 4.2.5, which contains libxml2 2.9.8, IS affected.
- lxml 4.3.0, which contains libxml2 2.9.9, is not affected (at least for this particular html file).

I also built my own lxml 4.2.5 with libxml2 2.9.9 and it was not affected. So I believe this is a bug in libxml2 2.9.8 specifically, and not in a particular version of lxml. I used this command:

STATIC_DEPS=true LDFLAGS='-flto -fPIC' CFLAGS='-O3 -g1 -pipe -fPIC -flto' LIBXML2_VERSION=2.9.9 ./venv/bin/pip install --no-binary :all: -vvv lxml==4.2.5
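
To confirm which libxml2 a given lxml build actually uses, lxml exposes the version as tuples in `lxml.etree.LIBXML_VERSION` (the library in use at run time) and `lxml.etree.LIBXML_COMPILED_VERSION` (the one it was compiled against). A small sketch, where the helper names are invented for illustration and "affected" only means "equal to 2.9.8, the version reported above":

```python
AFFECTED_VERSION = (2, 9, 8)  # the only version reported as affected above


def format_version(version):
    """Turn a version tuple such as (2, 9, 8) into the string '2.9.8'."""
    return ".".join(str(part) for part in version)


def describe(runtime, compiled):
    """Summarize the libxml2 versions and compare against 2.9.8."""
    if tuple(runtime[:3]) == AFFECTED_VERSION:
        verdict = "matches the affected version"
    else:
        verdict = "does not match the affected version"
    return "libxml2 runtime {} (compiled against {}): {}".format(
        format_version(runtime), format_version(compiled), verdict
    )


def describe_lxml():
    """Report the libxml2 versions of the installed lxml (imported lazily)."""
    from lxml import etree  # third-party

    return describe(etree.LIBXML_VERSION, etree.LIBXML_COMPILED_VERSION)
```

Running `print(describe_lxml())` in each virtualenv shows whether the build picked up the intended libxml2.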

My particular test case doesn't trigger the bug in 2.9.9, but I don't know whether that's because the bug is really fixed, or because some constants/offsets have changed and it now triggers on other HTML files.

I hope you can solve the mystery. Please let me know if I can be of any help. And thanks for reading!

Tomi

