Migrate mapper attachments plugin to main repository #14605
Conflicts: README.md
(default) works but the latest release (2.2.1) won't as we don't set the assembly id
Conflicts: pom.xml
Update maven assembly plugin version to 2.3
…from an attachment.
…tika instance. I have extended the Tika class so that the amount of text to extract from a document can be set on a per-call basis.
….attachment.indexed_chars setting to globally change it (per index)
Reduce surface area of mapper attachments:
This patch adds a zip of about 200 files from tika's test suite, and we assert some content comes back from each. This is a good exercise of the various formats. I removed any huge files to try to keep size reasonable, but we want a bit of a variety so we know stuff is working. I fixed issues with the parser config by running this.
Add test documents from tika test suite
There have been security issues with tika's parsers in the past... let's take away the network, filesystem, everything we can. In some ways, parsing these docs is a lot like executing untrusted code. I know it's not pretty, but I think it's worth it.
I'll rewrite the doc as asciidoc in the plugins docs in a later commit.
Thank you Robert for doing this. It will help a lot in keeping this plugin in good shape until we propose something else. I'll take care of the documentation rewrite to asciidoc after this is merged. Do you think we could also merge it in 2.1?
I set 2.2 because the issues I found were dealt with in ways that only 2.2+ can support.
Can we backport it to 2.1 without the fixes you wrote? I mean, the external repo won't have those fixes, so we would be in the same situation, but at least the plugin would have the exact same lifecycle as elasticsearch itself?
I'm not backporting bugs. This is a 3.0-only change now.
LGTM
Just to re-explain my reasoning for going to master only here. This PR is a three-pronged approach to fix security and insanity concerns:
So if gradle goes back to 2.2, then it's safe to backport this there too. But without these 3 things it's an incomplete solution.
Migrate mapper attachments plugin to main repository
We got this plugin synced up to master, improved tests, fixed securitymanager issues, added over 200 test documents from tika test suite, trimmed the parsers to a more contained list (e.g. word, excel, openoffice, powerpoint, pdf, html, rtf, txt, etc), and added special security restrictions to try to better contain the parsers.
I think it's in pretty good shape and we should merge it into the main repository. The fact is, people use it, and even if we don't like the idea of parsing things server side, it's currently something functional. Otherwise we have to maintain it in a separate repository, and that has proved difficult. If we have a better solution in the future for dealing with these documents, we can still probably reuse the testing and tika handling and so on.
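
For readers unfamiliar with the two containment ideas mentioned in this PR (a trimmed parser list, and a limit on how much text is extracted that can be set per call or per index), here is a rough, language-neutral sketch in Python. It is not the plugin's code: `ALLOWED_FORMATS`, `DEFAULT_INDEXED_CHARS`, `resolve_limit`, and `extract` are hypothetical names, the default value is made up for the example, and treating a negative limit as "no cap" is an assumption.

```python
# Illustrative only: names and defaults are assumptions, not the plugin's code.
ALLOWED_FORMATS = {
    "word", "excel", "openoffice", "powerpoint", "pdf", "html", "rtf", "txt",
}
DEFAULT_INDEXED_CHARS = 100_000  # assumed default, chosen for the example


def resolve_limit(per_call=None, per_index=None):
    """Pick the character limit: per-call value, then per-index, then default."""
    if per_call is not None:
        return per_call
    if per_index is not None:
        return per_index
    return DEFAULT_INDEXED_CHARS


def extract(fmt, text, per_call=None, per_index=None):
    """Refuse formats outside the allowlist, then cap the extracted text."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"no parser enabled for format: {fmt}")
    limit = resolve_limit(per_call, per_index)
    # Assumption: a negative limit means "no cap".
    return text if limit < 0 else text[:limit]
```

The point of the allowlist is that an unknown format fails fast instead of reaching a parser nobody has tested, and the limit lookup lets a single request override the index-wide value without changing it for everyone else.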