Skip UTF-8 BOM before parsing by bcmeireles · Pull Request #1590 · lark-parser/lark

bcmeireles · 2026-04-28T17:46:36Z

Skip an initial UTF-8 BOM before parsing text input and grammar files. Fixes #407

MegaIng · 2026-04-30T21:10:21Z

IMO this isn't Lark's job to do:

A UTF-8 BOM is very Windows-specific and generally not recommended, to the point where one can say the input is malformed if it contains a BOM.
It's part of the file encoding, which is not the abstraction layer Lark operates on. Open files with the utf-8-sig encoding and the BOM will be removed by the encoding layer built into Python (see also: https://discuss.python.org/t/utf-8-bom-not-being-consumed-when-opening-file/74870).
We especially shouldn't do this for arbitrary bytes input, where we don't even know if it's intended to be text.
(By using slicing we are copying the data, which is something we are explicitly trying to avoid. This also falsifies indices. This could be fixed, but I still don't think we should, for the reasons above.)
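
For illustration, a minimal sketch of the utf-8-sig suggestion above, using only the Python standard library. The grammar text, the file name `grammar.lark`, and the helper `read_text_without_bom` are made up for this example and are not part of Lark or of this PR:

```python
import codecs
import os
import tempfile

# Hypothetical grammar text, used only as sample content.
GRAMMAR = 'start: "a"+'

# Bytes as a BOM-writing editor might save them: EF BB BF followed by the text.
raw = codecs.BOM_UTF8 + GRAMMAR.encode("utf-8")

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
plain = raw.decode("utf-8")

# "utf-8-sig" consumes a leading BOM if present and otherwise acts like "utf-8".
stripped = raw.decode("utf-8-sig")


def read_text_without_bom(path):
    """Read a text file, letting the codec drop a leading UTF-8 BOM."""
    with open(path, encoding="utf-8-sig") as f:
        return f.read()


# Round trip through a real file to show the same behaviour with open().
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "grammar.lark")
    with open(path, "wb") as f:
        f.write(raw)
    from_file = read_text_without_bom(path)
```

With this approach the caller hands already-decoded text to the parser, so the parser never sees the BOM and does not need to slice its input.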

Skip UTF-8 BOM before parsing

57f4e7e
