Tika base64-to-extracted-text char filter (and: some troubles with Git)

I've made a small feature which I've been too curious to leave unimplemented.

(The implementation is very simplistic and crude; I haven't done any serious development in Java in years. I hope the point gets through nonetheless.)

It's a char filter called `attachments_test` (up for renaming). It's quite useful for getting acquainted with Tika, as well as for troubleshooting "Why isn't query X giving a hit for attachment Y" tickets coming from clients.

So, a request like the following:

```
POST /_analyze?tokenizer=keyword&char_filters=attachments_test&text=e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0%3D
```

Would yield the following result:

```
{
   "tokens": [
      {
         "token": "Lorem ipsum dolor sit amet\n\n",
         "start_offset": 0,
         "end_offset": 0,
         "type": "word",
         "position": 1
      }
   ]
}
```
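For reference, the `text` parameter in the request above is just a tiny RTF document, base64-encoded and then URL-escaped (`%3D` is the trailing `=`). Here is a minimal sketch for peeking at such a payload before Tika gets involved; the class and method names are made up for illustration and are not part of this pull request:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64Peek {
    // Decode a padded base64 string into text.
    // Assumption: the payload is ASCII-compatible, which holds for this small RTF sample.
    static String decode(String base64) {
        byte[] raw = Base64.getDecoder().decode(base64);
        return new String(raw, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Same payload as in the request, with %3D already unescaped to '='.
        String text = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=";
        // Prints the raw RTF source, i.e. what Tika receives as input.
        System.out.println(decode(text));
    }
}
```

This only shows the raw RTF markup; turning that into the plain "Lorem ipsum dolor sit amet" token is the part Tika does.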

Of course, this is not something that should be used in actual analyzers (hence the stigmatizing suffix of `_test`).

Also, it will give an error for unpadded base64 strings, indicating how many equal signs were missing. (This would be a good micro-feature in the actual indexing logic as well.)
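The padding check can be sketched roughly as follows. This is not the code from the commit, just an illustration of the arithmetic, and the names are hypothetical: base64 encodes in groups of four characters, so the number of missing `=` signs follows from the length modulo 4.

```java
public class Base64Padding {
    // Returns how many '=' characters are missing from the end of the string:
    // 0, 1 or 2. Returns -1 when the length modulo 4 is 1, because no amount
    // of padding can make such a string valid base64.
    static int missingPadding(String base64) {
        int remainder = base64.length() % 4;
        if (remainder == 0) {
            return 0;
        }
        if (remainder == 1) {
            return -1;
        }
        return 4 - remainder;
    }

    public static void main(String[] args) {
        // The payload from the example request with its trailing '=' stripped.
        String unpadded = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0";
        System.out.println("Missing equal signs: " + missingPadding(unpadded));
    }
}
```

Note that this only looks at the length; it does not check that the characters themselves are valid base64.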

I hope you agree that this little feature can be regarded as quite useful! Without something similar, Tika is a mysterious little black box doing stuff that we don't understand, and people dream about copying the extracted text into their own properties or using Luke to introspect the Lucene index (there are lots of questions about that on the net).

---

Also, there's a problem here that I hope you can help me with: my commit was automatically merged into [my previous pull request](https://github.com/elastic/elasticsearch-mapper-attachments/pull/155), which still seems to be open. Is that correct, given that it was tagged for inclusion in 3.1.0?

