
Skip UTF-8 BOM before parsing #1590

Open
bcmeireles wants to merge 1 commit into lark-parser:master from bcmeireles:skip-utf8-bom

Conversation

@bcmeireles

Skip an initial UTF-8 BOM before parsing text input and grammar files. Fixes #407

@MegaIng
Member

MegaIng commented Apr 30, 2026

IMO this isn't Lark's job to do:

  • A UTF-8 BOM is very Windows-specific and generally not recommended, to the point where one can say input is malformed if it contains a BOM.
  • It's part of the file encoding, which is not the abstraction layer Lark operates on. Open files with the utf-8-sig encoding and the BOM will be removed by the encoding layer built into Python (see also: https://discuss.python.org/t/utf-8-bom-not-being-consumed-when-opening-file/74870).
  • We especially shouldn't do this for random bytes input, where we don't even know if it's intended to be text.
  • (By using slicing we are copying the data, which is something we are explicitly trying to avoid. This also shifts indices relative to the original input. This could be fixed, but I still don't think we should, for the reasons above.)
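
For illustration, here is a minimal sketch of the utf-8-sig approach mentioned above, using only the standard library. The grammar text and the temporary file are made up for the example and are not taken from the PR.

```python
import codecs
import os
import tempfile

# Hypothetical grammar text, used only for illustration.
grammar_text = 'start: "a"+\n'

# Write the file the way a BOM-emitting editor would: BOM, then UTF-8 bytes.
with tempfile.NamedTemporaryFile(delete=False, suffix=".lark") as f:
    f.write(codecs.BOM_UTF8 + grammar_text.encode("utf-8"))
    path = f.name

try:
    # Plain utf-8 keeps the BOM as a leading U+FEFF character.
    with open(path, encoding="utf-8") as f:
        with_bom = f.read()

    # utf-8-sig strips a leading BOM if present and changes nothing otherwise.
    with open(path, encoding="utf-8-sig") as f:
        without_bom = f.read()
finally:
    os.remove(path)

print(with_bom.startswith("\ufeff"))  # True
print(without_bom == grammar_text)    # True
```

Because the BOM is dropped while decoding, the string handed to the parser never contains it, so no slicing is needed and positions are counted from the first real character.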


Development

Successfully merging this pull request may close these issues.

Skip BOM (Byte order mark)