*: Use exponential buckets for histogram metrics #1545
brancz merged 5 commits into thanos-io:master
@kakkoyun how is this PR going?

@GiedriusS I had to park this one for a while. But I haven't abandoned it, I'll have another look at it soon. I have also discovered similar issues with Store GW histograms, I may include those improvements in this PR as well.
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
- grpc_prometheus.WithHistogramBuckets([]float64{
-     0.001, 0.01, 0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4,
- }),
+ grpc_prometheus.WithHistogramBuckets(prometheus.ExponentialBuckets(0.001, 2, 15)),
Before:
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.001"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.01"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.05"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.1"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.2"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.4"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.8"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="1.6"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="3.2"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="6.4"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="+Inf"} 0
After:
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.001"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.002"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.004"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.008"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.016"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.032"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.064"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.128"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.256"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="0.512"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="1.024"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="2.048"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="4.096"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="8.192"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="16.384"} 0
grpc_server_handling_seconds_bucket{grpc_method="Series",grpc_service="thanos.Store",grpc_type="server_stream",le="+Inf"} 0
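The "After" boundaries follow directly from the arguments `(0.001, 2, 15)`: the first bound is 0.001 and each later bound doubles the previous one. The sketch below reproduces them with a local helper, `exponentialBuckets`, which is a hypothetical stand-in written to mirror the documented behavior of `prometheus.ExponentialBuckets` so that the example runs without external dependencies. It is not the library's source.

```go
package main

import "fmt"

// exponentialBuckets is a local stand-in (an assumption for this example)
// that mirrors the documented behavior of prometheus.ExponentialBuckets:
// it returns count upper bounds, the first equal to start and each
// following one equal to the previous bound multiplied by factor.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// Print the finite bucket bounds used in this PR; the +Inf bucket
	// is added by the client library and is not part of this list.
	for _, le := range exponentialBuckets(0.001, 2, 15) {
		fmt.Println(le)
	}
}
```

The list stops at 16.384 seconds, so any request slower than that is only counted in the `+Inf` bucket.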
An example distribution for the existing buckets, from a real-life system:
sum(grpc_server_handling_seconds_bucket{job=~"thanos-store.*", grpc_type="server_stream"}) by (le)
{le="0.001"} | 0
{le="0.01"} | 0
{le="0.05"} | 2
{le="0.1"} | 5
{le="0.2"} | 13
{le="0.4"} | 34
{le="0.8"} | 62
{le="1.6"} | 103
{le="3.2"} | 133
{le="6.4"} | 158
{le="+Inf"} | 187
pkg/store/gate.go
    },
    Name:    "gate_duration_seconds",
    Help:    "How many seconds it took for queries to wait at the gate.",
    Buckets: prometheus.ExponentialBuckets(0.001, 2, 15),
An example distribution for the existing buckets, from a real-life system:
sum(thanos_bucket_store_series_gate_duration_seconds_bucket{job="thanos-store"}) by (le)
{le="0.01"} | 0
{le="0.05"} | 0
{le="0.1"} | 0
{le="0.25"} | 0
{le="0.6"} | 0
{le="1"} | 0
{le="2"} | 0
{le="3.5"} | 0
{le="5"} | 0
{le="10"} | 0
{le="+Inf"} | 187
I’m expecting that we will need even higher buckets, but this is better than what we have and will clarify the need for more, so lgtm.

I think the higher buckets have to depend on the query timeout, so we probably need higher ones. But do we need so many low-level buckets? Do we really care whether a request takes 0.001 (seconds!) or 0.128 seconds? :thinking_face:

I'm happy to re-address all of these issues once we know more about the distribution. What we have now does not tell us much; I can do another iteration to tune the buckets.
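The trade-off discussed in these comments can be made concrete with a small sketch that counts how many exponential buckets are needed before the highest finite bound reaches a given limit. The helper name `bucketsToCover` and the 120-second limit are assumptions chosen for illustration; the thread does not state an actual query timeout.

```go
package main

import "fmt"

// bucketsToCover returns how many exponential buckets are needed so that
// the highest finite bound is at least limit, when the first bound is
// start and each following bound is the previous one times factor.
// It returns 0 for arguments that would never reach the limit.
func bucketsToCover(start, factor, limit float64) int {
	if start <= 0 || factor <= 1 {
		return 0
	}
	count := 1
	for upper := start; upper < limit; upper *= factor {
		count++
	}
	return count
}

func main() {
	// Hypothetical 120 s limit, starting at 1 ms as in this PR.
	fmt.Println(bucketsToCover(0.001, 2, 120))
	// Same limit, but dropping the buckets below 128 ms.
	fmt.Println(bucketsToCover(0.128, 2, 120))
}
```

Under these assumptions, starting at 0.001 needs 18 buckets to reach a bound of 131.072 seconds, while starting at 0.128 reaches the same bound with 11. This is one way to weigh low-end resolution against the extra series.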
* Use exponential buckets for compactor histogram metrics
  Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Update buckets
  Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Adjust histogram buckets
  Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Adjust store gate bucket
  Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Adjust http duration buckets
  Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
Signed-off-by: suntianyuan <suntianyuan@baidu.com>

* Use exponential buckets for compactor histogram metrics
  Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Update buckets
  Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Adjust histogram buckets
  Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Adjust store gate bucket
  Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
* Adjust http duration buckets
  Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
Signed-off-by: Aleksey Sin <asin@ozon.ru>
This PR changes the existing bucket configurations to fix issues that were observed with latency graphs.
For example, the graphs for the following metrics show large differences between mean and P50 latencies:
* thanos_compact_garbage_collection_duration_seconds_bucket
* thanos_compact_sync_meta_duration_seconds_bucket

This increases the number of buckets for most of the histograms. For certain metrics, it significantly affects cardinality. However, it is needed to properly instrument the components.
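The cardinality cost can be estimated with simple arithmetic: for each label combination, a classic Prometheus histogram exposes one `_bucket` series per finite bound, one for `le="+Inf"`, plus a `_sum` and a `_count` series. The sketch below applies that to the move from 10 to 15 finite buckets; the helper names and the figure of 20 label combinations are made up for illustration and are not measurements from this PR.

```go
package main

import "fmt"

// seriesPerLabelSet returns the number of time series one classic
// histogram exposes for a single label combination: one _bucket series
// per finite bound, one for le="+Inf", plus _sum and _count.
func seriesPerLabelSet(finiteBuckets int) int {
	return finiteBuckets + 3
}

// totalSeries scales that by the number of distinct label combinations.
func totalSeries(finiteBuckets, labelSets int) int {
	return seriesPerLabelSet(finiteBuckets) * labelSets
}

func main() {
	const labelSets = 20 // hypothetical number of label combinations
	fmt.Println("before:", totalSeries(10, labelSets))
	fmt.Println("after: ", totalSeries(15, labelSets))
}
```

Per label combination this is 13 series before and 18 after, so the increase grows linearly with the number of label combinations a histogram exposes.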
Changes
Uses exponential buckets to provide a more even distribution (number of buckets, before and after):
* grpc_server_handling_seconds_bucket: 10 -> 15 (+exposes multiple labels)
* http_request_duration_seconds_bucket: 11 -> 17 (+exposes 3 labels: code, method, handler)
* thanos_compact_sync_meta_duration_seconds_bucket: 14 -> 15
* thanos_compact_garbage_collection_duration_seconds_bucket: 14 -> 15
* thanos_objstore_bucket_operation_duration_seconds_bucket: 15 -> 17
* thanos_bucket_store_series_get_all_duration_seconds_bucket: 14 -> 15
* thanos_bucket_store_series_gate_duration_seconds_bucket: 14 -> 15
* thanos_bucket_store_series_merge_duration_seconds_bucket: 10 -> 15

Verification
* make test
* MINIO_ENABLED=1 ./scripts/quickstart.sh and curl to /metrics.