Merged
Document preprocessor
thomas-zahner committed Nov 5, 2025
commit bb165b14b6878ef73d7dc8b38f53e6155804dc9d
1 change: 1 addition & 0 deletions astro.config.mjs
@@ -42,6 +42,7 @@ export default defineConfig({
"guides/config",
"guides/cli",
"guides/output",
"guides/preprocessing",
],
},
{
21 changes: 4 additions & 17 deletions src/content/docs/guides/getting-started.mdx
@@ -206,24 +206,11 @@ In this command, we ignore the case when globbing, so it matches
- `~/projects/rust_game_/README`
- `~/projects/python_script_/Readme.markdown`

### Check Links From Epub File
### Check other file formats

If you have [atool](https://www.nongnu.org/atool) installed, you can check links inside `.epub` files as well!

```bash
acat -F zip {file.epub} "_.xhtml" "_.html" | lychee -
```

:::caution[Attention]
lychee parses other file formats as plaintext and extracts links using [linkify](https://github.com/robinst/linkify).
This generally works well if there are no format- or encoding
specifics, but in case you need dedicated support for a new file format, please
consider [creating an issue](https://github.com/lycheeverse/lychee/issues).
:::

[atool]: https://www.nongnu.org/atool
[linkify]: https://github.com/robinst/linkify
[issue]: https://github.com/lycheeverse/lychee/issues
By preprocessing files, you can check links in
file formats that lychee does not support natively.
See [file preprocessing](preprocessing).

## GitHub Action

69 changes: 69 additions & 0 deletions src/content/docs/guides/preprocessing.md
@@ -0,0 +1,69 @@
---
title: File preprocessing
---

Out of the box, lychee supports HTML, Markdown, and plain text formats.
HTML files are parsed as HTML5 using the [html5ever] parser.
Markdown files are parsed as [CommonMark] using [pulldown-cmark].

For any other file format, lychee falls back to a "plain text" mode.
In this mode, [linkify] attempts to extract URLs on a best-effort basis.
If invalid UTF-8 sequences are encountered, the input file is skipped,
because lychee assumes the file is in a binary format it cannot understand.
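
To predict which files would be skipped in plain text mode, you can test them for valid UTF-8 yourself.
The following is a minimal sketch, assuming `iconv` is installed.
The helper name `is_valid_utf8` is hypothetical, and the check only approximates lychee's behavior rather than reproducing its implementation.

```bash
#!/usr/bin/env bash

# Hypothetical helper: succeeds if the file decodes cleanly as UTF-8.
# iconv exits with a non-zero status when it hits an invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Report every file passed as an argument that fails the check.
for file in "$@"; do
    if ! is_valid_utf8 "$file"; then
        echo "not valid UTF-8, consider preprocessing: $file"
    fi
done
```

Files reported by this check are candidates for a preprocessor that converts them to HTML or plain text.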

lychee supports file preprocessing with the `--preprocess` flag.
For each input file, the command specified with `--preprocess` is invoked instead of reading the input file directly.
The examples below show how to preprocess common file formats.
In most cases it is necessary to create a helper script for preprocessing,
because no parameters can be passed to the command from the CLI directly.

```bash
lychee files/* --preprocess ./preprocess.sh
```

The referenced `preprocess.sh` script could look like this:

```bash
#!/usr/bin/env bash

case "$1" in
*.pdf)
    exec pdftohtml -i -s -stdout "$1"
    # Alternatives:
    # exec pdftotext "$1" -
    # exec pdftk "$1" output - uncompress | grep -aPo '/URI *\(\K[^)]*'
    ;;
*.odt|*.docx|*.epub|*.ipynb)
    exec pandoc "$1" --to=html --wrap=none --markdown-headings=atx
    ;;
*.odp|*.pptx|*.ods|*.xlsx)
    # libreoffice can't print to stdout unfortunately
    libreoffice --headless --convert-to html "$1" --outdir /tmp
    file=$(basename "$1")
    file="/tmp/${file%.*}.html"
    sed '/<body/,$!d' "$file" # discard content before body which contains libreoffice URLs
    rm "$file"
    ;;
*.adoc|*.asciidoc)
    asciidoctor -a stylesheet! "$1" -o -
    ;;
*.csv)
    # specify --delimiter if values not delimited by ","
    exec csvtk csv2json "$1"
    ;;
*)
    # identity function, output the input file without changes
    exec cat "$1"
    ;;
esac
```
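
Before handing the script to lychee, it can help to run it by hand and inspect what it prints.
The sketch below assumes, as the script does, that the preprocessor receives the file path as its first argument and writes the converted content to standard output.
The helper name `preview_preprocessed` is hypothetical and not part of lychee.

```bash
#!/usr/bin/env bash

# Hypothetical helper: run a preprocessor command on each file and
# print the first lines of its output for inspection.
# Usage: preview_preprocessed COMMAND FILE...
preview_preprocessed() {
    local cmd=$1
    shift
    local file
    for file in "$@"; do
        echo "== $file"
        # Invoke the command with the file path as its single argument.
        "$cmd" "$file" | head -n 5
    done
}

# Example: preview_preprocessed ./preprocess.sh files/*
```

If the preview shows no URLs or only binary noise for a given file, lychee is unlikely to find links in it either.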

For more examples and information, take a look at [lychee-all],
a repository dedicated to collecting use cases for file preprocessing.
Feel free to open an issue if you are missing a specific file format or have questions.

[linkify]: https://github.com/robinst/linkify
[html5ever]: https://github.com/servo/html5ever
[CommonMark]: https://commonmark.org/
[pulldown-cmark]: https://github.com/pulldown-cmark/pulldown-cmark/
[lychee-all]: https://github.com/thomas-zahner/lychee-all/