[ADR] JMAP: Avoid ElasticSearch on critical reads #259

Closed
chibenwa wants to merge 13 commits into apache:master from
chibenwa:adr-cassandra-emailqueryview

Conversation

@chibenwa
Contributor

No description provided.

@chibenwa
Contributor Author

https://issues.apache.org/jira/browse/JAMES-3440 is the JIRA entry for this...

Member

@mbaechler mbaechler left a comment

Lots of small comments, but I agree with this feature.
However, please ensure that paging is actually working with these new projections.

So, ElasticSearch is queried on every JMAP interaction. Administrators thus need to enforce availability and good performance
for this component.

Relying on more software for every read also harms our resiliency as ElasticSearch outages have major impacts.

Member

What do you expect? If you lose any service you lose James availability: S3, Cassandra, RabbitMQ, ElasticSearch.
Why would we want to support unavailability of highly available services in the first place?

Contributor Author

If I lose ES, given that ADR content, I only lose advanced search.

My customers will complain way less about "not having search" than about "not being able to read their emails".

> Why would we want to support unavailability of highly available services in the first place?

The people I work with and I are human, we write software, and some of those services will be unavailable at times.

The question now is how we deal with it.

Also we should mention our ElasticSearch implementation in Distributed James suffers the following flaws:
- Updates of flags lead to updates of the whole Email object, leading to sparse segments
- We currently rely on scrolling for JMAP (in order to ensure messageId uniqueness in the response while respecting limit & position)
- We noticed some very slow traces against ElasticSearch, even for simple queries.

Member

any clue why?

Contributor Author

And clue why.

But ElasticSearch's slow performance would likely require its own ADR. That's a lengthy topic.

Paging is one cause, and there are many others. I described scrolling & data mutability above.

- We noticed some very slow traces against ElasticSearch, even for simple queries.

Regarding Distributed James data-stores responsibilities:
- Cassandra is the source of truth for metadata, its storage needs to be adapted to known access patterns.

Member

I don't understand this sentence

Contributor Author

@chibenwa chibenwa Nov 13, 2020

> Cassandra is the source of truth for metadata

-> I think you have no problem understanding this.

> its storage needs to be adapted to known access patterns.

-> This comes from Cassandra's storage constraints. You need to plan your reads ahead (or use `ALLOW FILTERING` and kill your cluster).

It seems pretty clear to me as it is, but please do not hesitate to suggest enhancements.

Member

Oh yes, I now understand. The ambiguity comes from the fact that I expect responsibilities in this list, not details about how Cassandra works.

Contributor Author

Its responsibility is to handle known, common data access patterns; the two are not mutually exclusive.

Member

@rouazana rouazana left a comment

The problem of position and limit is hard, and it could have consequences for Cassandra.

ElasticSearch is meant to have native position & limit capabilities. I know we don't use them, essentially because of rights management, but maybe we are making a mistake here by expecting Cassandra to behave better in this use case.

Anyway, I'm OK with experimenting with it, but we should really take care of overall performance in this case.

Note that to handle position & limit, we need to fetch `position + limit` ordered items and then remove the first `position` items.
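
The position & limit handling in the note above can be sketched as follows. This is only an illustration in Python, not the James implementation: `fetch_ordered` and `page` are hypothetical names, and an in-memory list stands in for an ordered Cassandra projection.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def fetch_ordered(rows: Iterable[str], max_rows: int) -> Iterator[str]:
    """Hypothetical stand-in for a read that returns the first `max_rows`
    rows of an already ordered listing."""
    return islice(rows, max_rows)


def page(rows: Iterable[str], position: int, limit: int) -> List[str]:
    # Read position + limit ordered items...
    fetched = fetch_ordered(rows, position + limit)
    # ...then drop the first `position` of them.
    return list(islice(fetched, position, None))


message_ids = ["m1", "m2", "m3", "m4", "m5"]
second_page = page(message_ids, position=2, limit=2)
```

The cost of a page therefore grows with `position`, since every row before the requested window is read and discarded.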

Member

So if I scroll quickly n times, I will generate 1+2+...+n = n*(n+1)/2 Cassandra requests, i.e. O(n²).

That's pretty bad, no? Couldn't it be a cause of ElasticSearch slowness? Could it slow down Cassandra?
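
As a rough check of that arithmetic, here is a small Python sketch that counts the rows read while scrolling. It assumes a fixed page size and that each page reads `position + limit` rows; `rows_read` is a hypothetical helper, not something from the code base.

```python
def rows_read(pages: int, limit: int) -> int:
    """Total rows read when scrolling through `pages` consecutive pages."""
    # Page k (0-based) starts at position k * limit and reads position + limit rows.
    return sum(k * limit + limit for k in range(pages))


# With limit = 1 this is exactly 1 + 2 + ... + n = n * (n + 1) / 2.
total_for_ten_pages = rows_read(pages=10, limit=1)
```

For a page size of `limit`, the total is `limit * n * (n + 1) / 2`, so the quadratic growth holds whatever the page size.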

Contributor Author

True for Cassandra.

True for ElasticSearch.

JMAP includes some limits (concurrent calls, rate limiting) that can help mitigate these concerns in the future.

> Couldn't it be a cause of ElasticSearch slowness?

Maybe for some.

I also managed to clearly link some to reindexing, thanks to @tuanlc.

chibenwa and others added 3 commits November 13, 2020 17:22
Co-authored-by: Matthieu Baechler <matthieu.baechler@gmail.com>
Co-authored-by: Matthieu Baechler <matthieu.baechler@gmail.com>
Co-authored-by: Matthieu Baechler <matthieu.baechler@gmail.com>
@chibenwa
Contributor Author

> The problem of position and limit is hard, and it could have consequences for Cassandra.

We return a full list of metadata on every IMAP synchronisation (which does a full fetch because we do not support QRESYNC). Clients trigger this every 15 minutes or so, and it gets executed (with extra metadata on mutable data) in 1-2 seconds for mailboxes of around 200,000 mails.

This is a VERY rare operation in JMAP.

I'm not scared ;-)

If you are (or other people are), it can be turned off.

If users run into issues on a production platform, they can disable this.

Of course, if that turns out to be a bad idea, it can be removed from the code base and this ADR abandoned. But let's give this experimental feature a chance first, because I really believe it is the best decision we can take about ElasticSearch.

@mbaechler
Member

To solve the scrolling issue, we can design a (git-like) DAG to store entries and associate a DAG node with a scrolling state by using the state feature of JMAP.
By the way, we'll need to implement state at some point.

@chibenwa
Contributor Author

What is a DAG?

Or we can wait until scrolling becomes a problem before over-engineering it.

So far, we just don't know if the current proposal is good enough or not.

@mbaechler
Member

The problem is, if you don't include the needed complexity from the start, you won't know how it will behave once you include the complexity and thus you may lose your time.

A DAG is a directed acyclic graph, like git.

Whatever the implementation (a DAG may not be the best idea), the idea is to have a "persistent structure" (every change creates a new immutable state) so that a scroll is bound to a given structure. RDBMSs usually implement that using MVCC. JMAP state maps to this concept.

I don't know what the best implementation for that in Cassandra is, to be honest.
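
To make the "persistent structure" idea concrete, here is a minimal in-memory Python sketch. It is purely illustrative: `SnapshotStore` and its methods are hypothetical, it says nothing about how this would be modelled in Cassandra, and it copies the whole listing on each change where a real persistent structure would share data between versions.

```python
from typing import Dict, List, Tuple


class SnapshotStore:
    """Hypothetical store: every change produces a new immutable snapshot."""

    def __init__(self) -> None:
        self._snapshots: Dict[int, Tuple[str, ...]] = {0: ()}
        self._current = 0

    @property
    def state(self) -> int:
        return self._current

    def add(self, message_id: str) -> int:
        # Build a new tuple instead of mutating the previous snapshot.
        previous = self._snapshots[self._current]
        self._current += 1
        self._snapshots[self._current] = (message_id,) + previous
        return self._current

    def page(self, state: int, position: int, limit: int) -> List[str]:
        # A scroll keeps passing the state it started with, so later
        # writes do not shift its pages.
        return list(self._snapshots[state][position:position + limit])


store = SnapshotStore()
for message_id in ("m1", "m2", "m3"):
    store.add(message_id)

scroll_state = store.state
first_page = store.page(scroll_state, position=0, limit=2)
store.add("m4")  # a new mail arrives while the client is scrolling
second_page = store.page(scroll_state, position=2, limit=2)
```

Because the second page is read from the snapshot the scroll started on, the newly arrived `m4` neither duplicates nor hides an entry.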

@chibenwa
Contributor Author

> The problem is, if you don't include the needed complexity from the start, you won't know how it will behave once you include the complexity and thus you may lose your time.

I take the risk.

This proposal is a small implementation effort. Discarding it if needed won't be a problem.

> A DAG is a directed acyclic graph, like git.

Thanks for the explanation.

@mbaechler
Member

> This proposal is a small implementation effort. Discarding it if needed won't be a problem.

That is exactly why writing an ADR before doing a PoC may not be the best idea.

@rouazana
Member

> That is exactly why writing an ADR before doing a PoC may not be the best idea.

And doing a PoC without a proper ADR is often misunderstood.

Here we have some kind of feature flag, so it can be easily tried and removed if not conclusive. The ADR is interesting because without it the first question I would have asked would have been "why do you want to do this?", and the second one "how do you handle pagination?". That would have led to long debates, which are really better explained here.

@chibenwa
Contributor Author

> JMAP state maps to this concept. (DAG)

@mbaechler I would be curious to know why you think that. Can you elaborate a bit?

I think that, before starting complicated developments, a flat, ordered list of changes served from oldest to newest is way easier to implement than the "from newest to oldest using some intermediate temporary states" approach documented as an optimization by the spec.

Is that what you refer to as a DAG?
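
For reference, a flat, ordered list of changes served from oldest to newest could look roughly like the Python sketch below. It is an assumption-laden illustration, not the JMAP wire format: `ChangeLog`, `record` and `changes_since` are hypothetical names, and the state is simply the number of changes recorded so far.

```python
from typing import List, Tuple


class ChangeLog:
    """Hypothetical append-only log of changed message ids."""

    def __init__(self) -> None:
        self._changes: List[str] = []

    def record(self, message_id: str) -> int:
        # Appending keeps the log ordered from oldest to newest.
        self._changes.append(message_id)
        return len(self._changes)

    def changes_since(self, since_state: int, max_changes: int) -> Tuple[List[str], int, bool]:
        # Serve the oldest unseen changes first; the returned state is
        # where the client should resume from.
        end = min(since_state + max_changes, len(self._changes))
        batch = self._changes[since_state:end]
        has_more = end < len(self._changes)
        return batch, end, has_more


log = ChangeLog()
for message_id in ("a", "b", "c"):
    log.record(message_id)

first_batch = log.changes_since(since_state=0, max_changes=2)
second_batch = log.changes_since(since_state=first_batch[1], max_changes=2)
```

A client keeps calling `changes_since` with the state it last received until the "has more" flag is false, which needs no intermediate temporary states.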

@chibenwa
Contributor Author

Merged

@chibenwa chibenwa closed this Nov 18, 2020