keps/sig-instrumentation/20200415-cardinality-enforcement.md
Lines changed: 42 additions & 61 deletions
@@ -6,12 +6,12 @@ owning-sig: sig-instrumentation
participating-sigs:
  - sig-instrumentation
reviewers:
-  - todo
+  - sig-instrumentation
approvers:
-  - todo
+  - sig-instrumentation
editor: todo
creation-date: 2020-04-15
-last-updated: 2020-04-15
+last-updated: 2020-05-19
status: provisional
---
@@ -33,33 +33,21 @@ status: provisional
## Summary

-TLDR; metrics with unbounded dimensions can cause memory issues in the components they instrument.
-
-The simple solution to this problem is to say "don't do that". SIG instrumentation has already explicitly stated this in our instrumentation guidelines: which says that ["one should know a comprehensive list of all possible values for a label at instrumentation time."](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/instrumentation.md#dimensionality--cardinality).
-
-The problem is more complicated. First, SIG Instrumentation doesn't have a way to validate adherence to SIG instrumentation guidelines outside of reviewing PRs with instrumentation changes. Not only is this highly manual, and error-prone, we do not have a terrific existing procedure for ensuring SIG Instrumentation is tagged on relevant PRs. Even if we do have such a mechanism, it isn't a fully sufficient solution because:
-
-1. metrics changes can be seemingly innocous, even to the most diligent of code reviewers (i.e these issues are hard to catch)
-2. metrics cardinality issues exist latently; in other words, they're all over the place in Kubernetes already (so even if we could prevent 100% of these occurrence from **now on**, that wouldn't guarantee Kubernetes is free of these classes of issues).
-
+Background for this KEP is that metrics with unbounded dimensions can cause memory issues in the components they instrument. Our proposed solution is to be able to *dynamically configure* an allowlist of label values for a metric *at runtime*.

## Motivation

-TLDR; Having metrics turn into memory leaks sucks and it sucks even more when we can't fix these issues without re-releasing the entire Kubernetes binary.
-
-**Q:** *How have we approached these issues historically?*
-
-**A:** __Unfortunately, not consistently.__
+Having metrics turn into memory leaks is a problem, but an even bigger problem is that we can't fix these issues without re-releasing the entire Kubernetes binary.

-Sometimes, a [metric label dimension is intended to be bound to some known sets of values but coding mistakes cause IDs to be thrown in as a label value](https://github.com/kubernetes/kubernetes/issues/53485).
+Historically, we have approached these issues in inconsistent ways. A few examples:

-In anther case, [we opt to delete the entire metric](https://github.com/kubernetes/kubernetes/pull/74636) because it basically can't be used in a meaningfully correct way.
+- Sometimes, a [metric label dimension is intended to be bound to some known set of values but coding mistakes cause IDs to be thrown in as a label value](https://github.com/kubernetes/kubernetes/issues/53485).

-Recently we've opted to both (1) [wholesale delete a metric label](https://github.com/kubernetes/kubernetes/pull/87669) and (2) [retroactively introduce a set of acceptable values for a metric label](https://github.com/kubernetes/kubernetes/pull/87913).
+- In another case, [we opted to delete the entire metric](https://github.com/kubernetes/kubernetes/pull/74636) because it basically couldn't be used in a meaningfully correct way.

-Fixing these issues is a currently a manual process, both laborious and time-consuming.
+- Recently, we've opted to both (1) [wholesale delete a metric label](https://github.com/kubernetes/kubernetes/pull/87669) and (2) [retroactively introduce a set of acceptable values for a metric label](https://github.com/kubernetes/kubernetes/pull/87913).

-We don't have a standard prescription for resolving this class of issue. This is especially bad when you consider that this class of issue is so totally predictable (in general).
+Fixing these issues is currently a manual process, both laborious and time-consuming. We don't have a standard prescription for resolving this class of issue.

### Goals
@@ -71,19 +59,22 @@ We will expose the machinery and tools to bind a metric's labels to a discrete s
It is *not a goal* to implement and plumb this solution for each Kubernetes component (there are many SIGs and a number of verticals, which may have their own preferred ways of doing things). As such, it will be up to component owners to leverage the functionality we provide, by feeding configuration data through whatever mechanism is deemed appropriate (i.e. command line flags or reading from a file).

+These flags are really only meant to be used as escape hatches, and should not be used to build extremely customized Kubernetes setups where our existing dashboards and alerting rule definitions no longer apply generally.
+

## Proposal

-TLDR; we want to be able to *dynamically configure* a whitelist of label values for a metric.
+The simple solution to this problem would be to keep unbounded dimensions in mind for each metric added and prevent the problem from happening. SIG instrumentation has already explicitly stated this in our instrumentation guidelines, which say that ["one should know a comprehensive list of all possible values for a label at instrumentation time."](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/instrumentation.md#dimensionality--cardinality) The problem is more complicated, though. First, SIG Instrumentation doesn't have a way to validate adherence to SIG instrumentation guidelines outside of reviewing PRs with instrumentation changes. Not only is this highly manual and error-prone, we also don't have a great existing procedure for ensuring SIG Instrumentation is tagged on relevant PRs. Even if we did have such a mechanism, it wouldn't be a fully sufficient solution because:

-By *dynamically configure*, we mean configure a whitelist *at runtime* rather than during build/compile step (this is so ugh).
+1. metrics changes can be seemingly innocuous, even to the most diligent of code reviewers (i.e. these issues are hard to catch)
+2. metrics cardinality issues exist latently; in other words, they're all over the place in Kubernetes already (so even if we could prevent 100% of these occurrences from **now on**, that wouldn't guarantee Kubernetes is free of this class of issues).

-And by *at runtime*, we mean, more specifically, during the boot sequence for a Kubernetes component (and we mean daemons here, not cli tools like kubectl).
+Instead, the proposed solution is to be able to *dynamically configure* an allowlist of label values for a metric. By *dynamically configure*, we mean configure an allowlist *at runtime* rather than during the build/compile step. And by *at runtime*, we mean, more specifically, during the boot sequence for a Kubernetes component (and we mean daemons here, not CLI tools like kubectl).

-Brief aside: a Kubernetes component (which is a daemon) is an executable, which you can launch from the command line manually if you so desired. Components take a number of start-up configuration flags, which are passed into the component to modify execution paths (if curious, you can check out the [zillion flags we have on the kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)). It is also possible to read configuration data from files (like yamls) during the component boot sequence. This KEP does not have an opinion on the specific mechanism used to load config data into a Kubernetes binary during the boot sequence. What we *actually* care about, is just the fact that it is possible.
+Brief aside: a Kubernetes component (which is a daemon) is an executable, which can be launched from the command line manually if desired. Components take a number of start-up configuration flags, which are passed into the component to modify execution paths (if curious, check out the [large number of flags we have on the kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)). It is also possible to read configuration data from files (e.g. YAML) during the component boot sequence. This KEP does not have an opinion on the specific mechanism used to load config data into a Kubernetes binary during the boot sequence. What we *actually* care about is just the fact that it is possible.

-Our design, is thus config-ingestion agnostic.
+Our design is thus config-ingestion agnostic.

-Our design is also based on the premise that metrics can be uniquely identified (i.e. by their metric descriptor). Fortunately for us, this is actually a built in constraint of prometheus clients (which is how we instrument the Kubernetes components). This metric ID is resolvable to a unique string (this is pretty evident when you look at a raw prometheus client endpoint).
+Our design is also based on the premise that metrics can be uniquely identified (i.e. by their metric descriptor). Fortunately for us, this is actually a built-in constraint of Prometheus clients (which is how we instrument the Kubernetes components). This metric ID is resolvable to a unique string (this is pretty evident when looking at a raw Prometheus client endpoint).

We want to provide a data structure which basically maps a unique metric ID and one of its labels to a bounded set of values.
@@ -119,7 +110,7 @@ And we want to express a set of expected values:
{
  "metric-id": "some_metric",
  "label": "label_too_many_values",
-  "labelValueWhitelist": [
+  "labelValueAllowlist": [
    "1",
    "2",
    "3"
@@ -129,13 +120,13 @@ And we want to express a set of expected values:
```
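
To make the shape of this mapping concrete, here is a minimal Go sketch of the structure the JSON above describes (the type and field names are illustrative, not defined by the KEP):

```golang
package main

import (
	"encoding/json"
	"fmt"
)

// MetricLabelAllowlist pairs a uniquely identifying metric ID and one of its
// labels with the bounded set of values we are willing to accept for it.
type MetricLabelAllowlist struct {
	MetricID            string   `json:"metric-id"`
	Label               string   `json:"label"`
	LabelValueAllowlist []string `json:"labelValueAllowlist"`
}

func main() {
	raw := `{"metric-id": "some_metric", "label": "label_too_many_values", "labelValueAllowlist": ["1", "2", "3"]}`
	var rule MetricLabelAllowlist
	if err := json.Unmarshal([]byte(raw), &rule); err != nil {
		panic(err)
	}
	// Prints: {MetricID:some_metric Label:label_too_many_values LabelValueAllowlist:[1 2 3]}
	fmt.Printf("%+v\n", rule)
}
```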
-Since we already have an interception layer built into our Kubernetes monitoring stack (from the metrics stability effort), we can leverage existing wrappers to provide a global entrypoint for intercepting and enforcing rules for individual metrics.
+Since we already have an interception layer built into our Kubernetes monitoring stack (from the metrics stability effort), we can leverage existing wrappers to provide a global entrypoint for intercepting and enforcing rules for individual metrics.
-**Q:** *But won't this invalidate our metric data?*
+As a result, metric data will not be invalid; the data will just be more coarse when the metric type is a counter. In the case of a gauge-type metric, the data will be invalid, but the number of series will be bounded.

-**A:** __It shouldn't invalidate metric data. But metric data may be more coarse.__
+Following is an example to demonstrate the proposed solution.
-It's easier to demonstrate our strategy than explain stuff out in written word, so we're going to just do that. Let's say we have some client code somewhere which instantiates our metric from above:
+Let's say we have some client code somewhere which instantiates our metric from above:
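
The instantiation itself is unchanged context in this diff, so it isn't shown; a representative sketch using the plain Prometheus Go client (the Kubernetes wrappers mirror this shape; the metric matches the example above) might be:

```golang
package main

import "github.com/prometheus/client_golang/prometheus"

// someMetric is a counter vector with a label whose value set we intend to
// bound via the allowlist machinery described in this KEP.
var someMetric = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "some_metric",
		Help: "An example metric with a label prone to unbounded values.",
	},
	[]string{"label_too_many_values"},
)

func init() {
	// Register with the default registry so the metric shows up on /metrics.
	prometheus.MustRegister(someMetric)
}
```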
-This would not change. What would change is if we encounter label values outside our explicit whitelist of values. So if where to encounter this malicious piece of code:
+This would not change. What would change is if we encounter label values outside our explicit allowlist of values. So if we were to encounter this malicious piece of code:
```golang
@@ -180,7 +171,7 @@ This would not change. What would change is if we encounter label values outside
}
```
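
The body of this block is elided by the diff; a sketch of the kind of loop that triggers the problem (assuming the `someMetric` counter vector from the sketch above) would be:

```golang
package main

import "fmt"

// explode demonstrates the failure mode: every distinct label value mints a
// brand-new time series, so looping over unique values manufactures a million
// series from the single someMetric vector defined above.
func explode() {
	for i := 0; i < 1000000; i++ {
		someMetric.WithLabelValues(fmt.Sprintf("%d", i)).Inc()
	}
}
```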
-Then in existing Kubernetes components, we would have a terrible memory leak, since we would effectively create a million metrics (one per unique label value). If we curled our prometheus endpoint it would thus look something like this:
+Then in existing Kubernetes components, we would have a terrible memory leak, since we would effectively create a million metric series (one per unique label value). If we curled our metrics endpoint, it would thus look something like this:
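
The sample output is elided by the diff; abbreviated, the scrape would look roughly like this (values illustrative, standard Prometheus exposition format):

```
# HELP some_metric An example metric with a label prone to unbounded values.
# TYPE some_metric counter
some_metric{label_too_many_values="0"} 1
some_metric{label_too_many_values="1"} 1
some_metric{label_too_many_values="2"} 1
...
some_metric{label_too_many_values="999999"} 1
```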
-We can effect this change by registering a whitelist to our prom registries during the boot-up sequence. Since we intercept metric registration events and have a wrapper around each of the primitive prometheus metric types, we can effectively add some code that looks like this (disclaimer: the below is pseudocode for demonstration):
+We can effect this change by registering an allowlist with our Prometheus registries during the boot-up sequence. Since we intercept metric registration events and have a wrapper around each of the primitive Prometheus metric types, we can effectively add some code that looks like this (disclaimer: the below is pseudocode for demonstration):
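
The KEP's pseudocode is elided by the diff; a minimal sketch of the interception idea, wrapping the Prometheus counter vector type (all names here, including the `unexpected` overflow value, are illustrative assumptions rather than decisions made by this KEP):

```golang
package metrics

import "github.com/prometheus/client_golang/prometheus"

// allowlistedCounterVec wraps a *prometheus.CounterVec and constrains one of
// its labels to a configured allowlist before any series is created.
type allowlistedCounterVec struct {
	inner     *prometheus.CounterVec
	labelIdx  int             // position of the constrained label
	allowlist map[string]bool // permitted values for that label
}

// WithLabelValues intercepts label values on their way to the underlying
// vector. Values outside the allowlist are coerced into a single overflow
// bucket, so the number of series stays bounded no matter what callers pass.
func (v *allowlistedCounterVec) WithLabelValues(values ...string) prometheus.Counter {
	if v.labelIdx < len(values) && !v.allowlist[values[v.labelIdx]] {
		values[v.labelIdx] = "unexpected"
	}
	return v.inner.WithLabelValues(values...)
}
```

With the rule from the JSON example registered, `WithLabelValues("4")` would be recorded under `label_too_many_values="unexpected"`, collapsing all out-of-allowlist traffic into one series.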
-This design allows us to optionally adopt @lilic's excellent idea about simplifying the interface for component owners, who can then opt to just specify a metric and label pair *without* having to specify a whitelist. Personally, I like that idea since it simplifies how a component owner can implement our cardinality enforcing helpers without having to necessary plumb through complicated maps. This would make it considerably easier to feed this data in through the command line since you could do something like this:
+This design allows us to optionally adopt the idea of simplifying the interface for component owners, who can then opt to just specify a metric and label pair *without* having to specify an allowlist. The benefit of this idea is that it simplifies how a component owner can implement our cardinality-enforcing helpers without necessarily having to plumb through complicated maps. This would make it considerably easier to feed this data in through the command line, since it could be done like this:
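
The flag example itself is elided by the diff; purely as an illustration (the flag name and syntax here are hypothetical, not settled by this KEP), specifying just a metric/label pair might look like:

```
--bounded-metric-labels=some_metric/label_too_many_values
```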
-..which would then be interpreted by our machinery as this:
+This would then be interpreted by our machinery as this:
```json
[
  {
    "metric-id": "some_metric",
    "label": "label_too_many_values",
-    "labelValueWhitelist": []
+    "labelValueAllowlist": []
  }
]
```
## Open-Question
-_(Discussion Points which need to be resolved prior to merge)_
-
-- @dashpole
-
-> Should have labels with a specific set of values, should we start enforcing that all metrics have a whitelist at compile-time?
-
-- @x13n
-
-> ... instead of getting label_too_many_values right away, [enforcing the cardinality limit directly] would still work until certain label cardinality limit is reached. Whitelisting would guarantee a value will not be dropped, but other values wouldn't be dropped either unless there is too many of them. Cluster admin can configure the per metric and per label limits once and can get alerted on "some metric labels are dropped" instead of "your metrics storage is getting blown up".
-
-- @brancz/@lilic
-
-> Potentially we would want to treat buckets completely separately (as in a separate flag just for bucket configuration of histograms). @bboreham opened the original PR for apiserver request duration bucket reduction, maybe he has some input as well.
->
-> My biggest concern I think with all of this is, it's going to be super easy to have extremely customized kubernetes setups where our existing dashboards and alerting rule definitions are just not going to apply generally anymore. I'd like to make sure we emphasize that these flags are really only meant to be used as escape hatches, and we must always strive to truly fix the root of the issue.