Skip to content

Conversation

rmuir
Copy link
Contributor

@rmuir rmuir commented Nov 8, 2015

We got this plugin synced up to master, improved tests, fixed securitymanager issues, added over 200 test documents from tika test suite, trimmed the parsers to a more contained list (e.g. word, excel, openoffice, powerpoint, pdf, html, rtf, txt, etc), and added special security restrictions to try to better contain the parsers.

I think its in pretty good shape and we should merge it into the main repository. The fact is, people use it, and even if we don't like the idea of parsing things server side, its currently something functional. Otherwise we have to maintain it in a separate repository and that has proved difficult. If we have a better solution in the future for dealing with these documents, we can still probably reuse the testing and tika handling and so on.

kimchy and others added 30 commits December 5, 2011 14:05
(default) works but the latest release (2.2.1) won't as we don't set the
assembly id
Update maven assembly plugin version to 2.3
…tika instance.

I have extended the Tika class to allow for setting of how much text to
extract from a document to be on a per call basis.
….attachment.indexed_chars setting to globally change it (per index)
rmuir and others added 9 commits November 7, 2015 22:09
Reduce surface area of mapper attachments:
This patch adds a zip of about 200 files from tika's test suite,
and we assert some content comes back from each. This is a good exercise
of the various formats.

I removed any huge files to try to keep size reasonable, but we want
a bit of a variety so we know stuff is working.

I fixed issues with the parser config by running this.
Add test documents from tika test suite
There have been security issues with tika's parsers in the past...
let's take away the network, filesystem, everything we can.

In some way, parsing these docs is a lot like executing untrusted code.

I know its not pretty, but I think its worth it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll rewrite the doc in another commit later as a asciidoc in plugins doc.

@dadoonet
Copy link
Contributor

dadoonet commented Nov 8, 2015

Thank you Robert for doing this. It will help a lot for maintain this plugin in a good shape until we propose something else.
I'm sure that found team will appreciate it as well! :)

I'll take care of the documentation rewrite to asciidoc after this will be merged.

Do you think we could also merge it in 2.1?

@rmuir
Copy link
Contributor Author

rmuir commented Nov 8, 2015

I set 2.2 because the issues i found were dealt with in ways that only 2.2+ can support.

@dadoonet
Copy link
Contributor

dadoonet commented Nov 8, 2015

Can we backport it to 2.1 without the fixes you wrote? I mean that the external repo won't have those fixes so we will have the same situation but at least the plugin will have the exact same lifecycle as elasticsearch itself?

@rmuir rmuir removed the v2.2.0 label Nov 8, 2015
@rmuir
Copy link
Contributor Author

rmuir commented Nov 8, 2015

I'm not backporting bugs. This is a 3.0-only change now.

@rjernst
Copy link
Member

rjernst commented Nov 8, 2015

LGTM

@rmuir
Copy link
Contributor Author

rmuir commented Nov 9, 2015

Just to re-explain my reasoning for going to master only here. This PR is a three-pronged approach to fix security and insanity concerns:

  1. whitelist parsers (requires master)
    This just simply cannot be done in maven, since it has no transitive=false. In gradle its easy:
    https://github.com/rmuir/elasticsearch/blob/migrate-mapper-attachments/plugins/mapper-attachments/build.gradle#L28-L51. I know what the maven equivalent would be: unmaintainable.
  2. sandbox parsers (requires 2.2+)
    We need security improvements in 2.2+ to further restrict the parsers so we have less concerns about bugs in them (especially stupid stuff like xml flaws).
  3. improve tests

So if gradle goes back to 2.2, then its safe to backport this there too. But without these 3 things its an incomplete solution.

rmuir added a commit that referenced this pull request Nov 9, 2015
Migrate mapper attachments plugin to main repository
@rmuir rmuir merged commit 102e254 into elastic:master Nov 9, 2015
dadoonet added a commit to dadoonet/elasticsearch that referenced this pull request Nov 9, 2015
dadoonet added a commit to elastic/elasticsearch-mapper-attachments that referenced this pull request Nov 9, 2015
dadoonet added a commit that referenced this pull request Nov 23, 2015
@clintongormley clintongormley added :Search Foundations/Mapping Index mappings, including merging and defining field types and removed :Plugin Mapper Attachment labels Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :Search Foundations/Mapping Index mappings, including merging and defining field types v5.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.