Conversation

@thomas-zahner (Member) commented Oct 29, 2025

Addresses #1672

I've created a new repository for testing and documenting compatibility with other file formats: https://github.com/thomas-zahner/lychee-all
We might want to merge this information into the official docs later.

As mentioned in the issue, this PR is heavily inspired by ripgrep's preprocessor.

@thomas-zahner thomas-zahner requested a review from mre October 29, 2025 11:15
@katrinafyi (Contributor) left a comment

Looks good! The most significant comment is the one about the Skip concept. I think a different collection of values would make more sense.

The rest of this comment is just my own commentary. Feel free to ignore it, and I certainly don't expect anything in this PR to change because of it.

When reading this PR, I noticed that the new preprocess value has to be handled in many places and passed repeatedly before it reaches the point where it's actually used. I think this is due to a deeply hierarchical architecture. Conceptually, it looks something like this:

inputs
|> collector(basic_auth, skip, include_verbatim, client, preprocess, ...)

Here, collector calls other helper functions and has to amalgamate all their arguments. It is responsible for a lot of functionality, from resolving inputs all the way to link extraction and request building.

If the architecture were more like a flat pipeline, it would reduce the need for this argument injection. Instead of one big "collector", it might look like this:

inputs
|> resolve_inputs(skip, glob_ignore_case)
|> preprocess_inputs(pre_cmd)
|> get_input_contents(basic_auth, retries, max_redirect)
|> extract_links(root_dir, base_url)

Hopefully, you can see how this reduces the parameters needed: each step only needs the parameters for its own functionality. A clear pipeline also makes it much easier to implement features like --dump or --dump-inputs, which amount to stopping at certain points in the pipeline (I started thinking about this because of the dumping issues). It also makes testing easier.
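To make the idea concrete, here is a minimal Rust sketch of the flat pipeline. All names, types, and behaviors here are hypothetical illustrations of the shape, not lychee's actual API: each stage takes only the parameters it needs, so nothing has to be threaded through one big collector.

```rust
// Hypothetical flat-pipeline sketch. `Input`, `resolve_inputs`,
// `preprocess_inputs`, and `extract_links` are illustrative stand-ins,
// not real lychee types or functions.
#[derive(Debug, Clone, PartialEq)]
struct Input(String);

// Stage 1: resolving inputs only needs resolution-related flags.
fn resolve_inputs(inputs: Vec<Input>, skip_hidden: bool) -> Vec<Input> {
    inputs
        .into_iter()
        .filter(|i| !(skip_hidden && i.0.starts_with('.')))
        .collect()
}

// Stage 2: preprocessing only needs the preprocessor command.
// The real feature would run an external command per file; here we
// just tag the name to show where the stage sits in the pipeline.
fn preprocess_inputs(inputs: Vec<Input>, pre_cmd: Option<&str>) -> Vec<Input> {
    match pre_cmd {
        Some(cmd) => inputs
            .into_iter()
            .map(|i| Input(format!("{}!{}", cmd, i.0)))
            .collect(),
        None => inputs,
    }
}

// Stage 3: link extraction needs no knowledge of the earlier stages.
fn extract_links(inputs: Vec<Input>) -> Vec<String> {
    inputs.into_iter().map(|i| format!("link-from:{}", i.0)).collect()
}

fn main() {
    let inputs = vec![Input(".hidden.md".into()), Input("README.md".into())];
    // Each stage receives only its own parameters; --dump-inputs would
    // simply stop after resolve_inputs, --dump after extract_links.
    let links = extract_links(preprocess_inputs(
        resolve_inputs(inputs, true),
        Some("pandoc"),
    ));
    println!("{:?}", links);
}
```

Note how dumping features fall out naturally: they just return the intermediate value of whichever stage they stop at, and each stage can be unit-tested in isolation.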

Anyway, this is all theoretical at the moment. I don't know if this is possible or how hard it would be. There is Chain in the codebase, but it's limited to homogeneous pipeline functions. As I said, though, nothing that needs to affect this PR right now.
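As a side note on the homogeneity limitation: a chain whose stages all map T -> T cannot express a pipeline whose stages change the type (e.g. inputs to contents to links). The sketch below illustrates the limitation in general terms; it is not lychee's Chain implementation.

```rust
// A homogeneous chain: every stage must have the same input and
// output type, so stages can be stored in one collection and folded.
fn chain<T>(stages: Vec<Box<dyn Fn(T) -> T>>, init: T) -> T {
    stages.into_iter().fold(init, |acc, f| f(acc))
}

fn main() {
    // Fine while every stage maps String -> String...
    let stages: Vec<Box<dyn Fn(String) -> String>> = vec![
        Box::new(|s| s.to_uppercase()),
        Box::new(|s| format!("{s}!")),
    ];
    let out = chain(stages, "hi".to_string());
    println!("{out}");
    // ...but a stage `String -> Vec<String>` (like link extraction)
    // could not be added to `stages`, because it would break the
    // `Fn(T) -> T` shape. A heterogeneous pipeline needs plain
    // function composition or a trait with distinct In/Out types.
}
```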

@thomas-zahner thomas-zahner changed the title File preprocessing feat: file preprocessing Oct 31, 2025
@thomas-zahner (Member, Author) commented Oct 31, 2025

@katrinafyi Thanks for your thoughts.

If the architecture were more like a flat pipeline, it would reduce the need for this argument injection. Instead of one big "collector", it might look like this

I really do like this idea and I totally agree. It would probably simplify things quite a lot. IMO we could open an issue to tackle that separately.

Edit: opened up #1898

@thomas-zahner thomas-zahner merged commit 8011ef0 into lycheeverse:master Nov 4, 2025
7 checks passed
@mre mre mentioned this pull request Nov 4, 2025