feat: file preprocessing #1891
Conversation
katrinafyi left a comment:
Looks good! The most significant comment is the one about `Skip`'s concept. I think a different collection of values would make more sense.
The rest of this comment is just my own commentary. Feel free to ignore it, and I certainly don't expect anything in this PR to change because of it.
When reading this PR, I had the thought that the new preprocess value has to be handled a lot and passed repeatedly to get to where it needs to be used. I think this is due to an architecture where it's all very hierarchical. It looks something like this (conceptually):
```
inputs
|> collector(basic_auth, skip, include_verbatim, client, preprocess, ...)
```

Here, `collector` calls other helper functions and has to amalgamate all their arguments. It is responsible for a lot of functionality, from resolving inputs all the way to link extraction and request building.
If the architecture was more like a flat pipeline, it would reduce the need for this argument injection. Instead, of one big "collector", it might look like this:
```
inputs
|> resolve_inputs(skip, glob_ignore_case)
|> preprocess_inputs(pre_cmd)
|> get_input_contents(basic_auth, retries, max_redirect)
|> extract_links(root_dir, base_url)
```

Hopefully you can see how this reduces the parameters needed: each step only needs the parameters for its own functionality. A clear pipeline makes it much easier to implement features like `--dump` or `--dump-inputs`, which just stop at certain points in the pipeline (I started thinking about this because of the dumping issues). It also makes testing easier.
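To make that concrete, here is a minimal Rust sketch of the flat-pipeline idea. Everything here is hypothetical — the stage names, option structs, and signatures are invented for illustration and are not lychee's actual API; the point is only that each stage owns its own parameters:

```rust
// Hypothetical stage signatures: each stage receives only the options
// relevant to its own step, so nothing is threaded through the whole chain.
struct SkipOpts {
    skip: Vec<String>,
    glob_ignore_case: bool,
}

fn resolve_inputs(inputs: Vec<String>, opts: &SkipOpts) -> Vec<String> {
    // Drop inputs matching the skip list (a stand-in for real glob logic).
    inputs
        .into_iter()
        .filter(|input| {
            !opts.skip.iter().any(|s| {
                if opts.glob_ignore_case {
                    input.eq_ignore_ascii_case(s)
                } else {
                    input == s
                }
            })
        })
        .collect()
}

fn preprocess_inputs(inputs: Vec<String>, pre_cmd: Option<&str>) -> Vec<String> {
    // Stand-in: tag each input with the preprocessor that would run on it.
    match pre_cmd {
        Some(cmd) => inputs
            .into_iter()
            .map(|input| format!("{cmd}({input})"))
            .collect(),
        None => inputs,
    }
}

fn main() {
    let opts = SkipOpts {
        skip: vec!["CHANGELOG.md".to_string()],
        glob_ignore_case: false,
    };
    let inputs = vec!["README.md".to_string(), "CHANGELOG.md".to_string()];

    // `--dump-inputs` would simply stop the pipeline after the first stage.
    let resolved = resolve_inputs(inputs, &opts);
    let preprocessed = preprocess_inputs(resolved, Some("pre.sh"));
    println!("{preprocessed:?}");
}
```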
Anyway, this is all theoretical at the moment. I don't know if this is possible or how hard it would be. There is `Chain` in the codebase, but it's limited to homogeneous pipeline functions. As I said, nothing here needs to affect this PR right now.
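For what it's worth, a heterogeneous pipeline is expressible in Rust with a tiny combinator. This is only a sketch of the idea, not a proposal to replace `Chain`:

```rust
// A pipeline whose stages may change the item type, unlike a chain
// that is fixed to a single type throughout.
struct Pipeline<T>(T);

impl<T> Pipeline<T> {
    fn then<U>(self, stage: impl FnOnce(T) -> U) -> Pipeline<U> {
        Pipeline(stage(self.0))
    }

    fn finish(self) -> T {
        self.0
    }
}

fn main() {
    // The type changes at each step: Vec<&str> -> usize -> String.
    let summary = Pipeline(vec!["a.md", "b.md"])
        .then(|inputs| inputs.len())
        .then(|count| format!("{count} inputs"))
        .finish();
    assert_eq!(summary, "2 inputs");
}
```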
@katrinafyi Thanks for your thoughts.
I really do like this idea and I totally agree. It would probably simplify quite a lot. IMO we could open an issue to tackle that separately. Edit: opened #1898
Addresses #1672
I've created a new repository for testing and documenting compatibility with other file formats: https://github.com/thomas-zahner/lychee-all
We might want to merge this information into the official docs later.
As mentioned in the issue, this PR is heavily inspired by ripgrep's preprocessor (`--pre`).
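For readers unfamiliar with the concept: ripgrep's `--pre` flag runs every file through an external command and searches that command's stdout instead of the raw file contents. A minimal Rust sketch of the same idea (not this PR's actual implementation; the script name and path are hypothetical) could look like this:

```rust
// Sketch of a ripgrep-style preprocessor: invoke an external command with
// the file path as its argument and use its stdout as the file's contents.
use std::process::Command;

fn preprocess(pre_cmd: &str, path: &str) -> std::io::Result<String> {
    let output = Command::new(pre_cmd).arg(path).output()?;
    if !output.status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            format!("preprocessor {pre_cmd} failed on {path}"),
        ));
    }
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}

fn main() -> std::io::Result<()> {
    // Hypothetical example: a `pdftotext`-style wrapper script that
    // writes plain text to stdout, which is then scanned for links.
    let text = preprocess("./pre.sh", "docs/manual.pdf")?;
    println!("{text}");
    Ok(())
}
```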