docs/modeling/metadata-model.md
Lines changed: 206 additions & 1 deletion
@@ -331,4 +331,209 @@ to see an example of a timeseries aspect.
Because timeseries aspects are updated frequently, ingests of these aspects go straight to Elasticsearch
(instead of being stored in the local DB).

You can retrieve timeseries aspects using the "aspects?action=getTimeseriesAspectValues" endpoint.

#### Aggregatable Timeseries aspects
Performing SQL-like *group by + aggregate* operations on timeseries aspects is a natural use case for
this kind of data (dataset profiles, usage statistics, etc.). This section describes how to define, ingest, and perform an
aggregation query against a timeseries aspect.

##### Defining a new aggregatable Timeseries aspect.

*@TimeseriesField* and *@TimeseriesFieldCollection* are two annotations that can be attached to a field of
a *Timeseries aspect* to make it part of an aggregatable query. The kinds of aggregations allowed on these
annotated fields depend on the type of the field, as
described [here](#performing-an-aggregation-on-a-timeseries-aspect).

* `@TimeseriesField = {}` - this annotation can be used with any non-collection field of the aspect, such as
  primitive types and records (see the *stat*, *strStat* and *strArray* fields
  of [TestEntityProfile.pdl](https://github.com/linkedin/datahub/blob/master/test-models/src/main/pegasus/com/datahub/test/TestEntityProfile.pdl)).

* The `@TimeseriesFieldCollection {"key":"<name of the key field of collection item type>"}` annotation allows for
  aggregation support on the items of a collection type (currently supported only for array collections), where the
  value of `"key"` is the name of the field in the collection item type that will be used to specify the group-by clause
  (see the *userCounts* and *fieldCounts* fields of [DatasetUsageStatistics.pdl](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/dataset/DatasetUsageStatistics.pdl)).

In addition to defining the new aspect with the appropriate Timeseries annotations,
the [entity-registry.yml](https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/resources/entity-registry.yml)
file needs to be updated as well. Add the new aspect name to the list of aspects of the appropriate entity, such as `datasetUsageStatistics` for the aspect DatasetUsageStatistics, as shown below.
```yaml
entities:
  - name: dataset
    keyAspect: datasetKey
    aspects:
      - datasetProfile
      - datasetUsageStatistics
```

##### Ingesting a Timeseries aspect
Timeseries aspects can be ingested via the GMS REST endpoint `/aspects?action=ingestProposal` or via the Python API.

Example 1: Via the GMS REST API using curl.

```shell
curl --location --request POST 'http://localhost:8080/aspects?action=ingestProposal' \
```

##### Performing an aggregation on a Timeseries aspect.

Aggregations on timeseries aspects can be performed through the GMS REST endpoint `/analytics?action=getTimeseriesStats`, which
accepts the following params.
* `entityName` - The name of the entity the aspect is associated with.
* `aspectName` - The name of the aspect.
* `filter` - Any pre-filtering criteria to apply before grouping and aggregations are performed.
* `metrics` - A list of aggregation specifications. The `fieldPath` member of an aggregation specification refers to the
  field name against which the aggregation needs to be performed, and the `aggregationType` specifies the kind of aggregation.
* `buckets` - A list of grouping bucket specifications. Each grouping bucket has a `key` field that refers to the field
  to use for grouping. The `type` field specifies the kind of grouping bucket.
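
As a rough illustration of how these params fit together, the sketch below assembles a request body in Python. The helper name `build_timeseries_stats_request` and its argument layout are hypothetical conveniences, not part of any DataHub client; the field names and the `{"criteria": []}` filter shape are taken from the sample query in this section.

```python
import json


def build_timeseries_stats_request(entity_name, aspect_name, metrics, buckets, criteria=None):
    """Assemble a JSON-serializable body for getTimeseriesStats (hypothetical helper).

    `metrics` is a list of (fieldPath, aggregationType) pairs.
    `buckets` is a list of dicts already shaped like grouping bucket specifications.
    """
    return {
        "entityName": entity_name,
        "aspectName": aspect_name,
        # An empty criteria list means no pre-filtering.
        "filter": {"criteria": list(criteria or [])},
        "metrics": [
            {"fieldPath": field_path, "aggregationType": aggregation_type}
            for field_path, aggregation_type in metrics
        ],
        "buckets": list(buckets),
    }


body = build_timeseries_stats_request(
    entity_name="dataset",
    aspect_name="datasetUsageStatistics",
    metrics=[("uniqueUserCount", "LATEST")],
    buckets=[
        {
            "key": "timestampMillis",
            "type": "DATE_GROUPING_BUCKET",
            "timeWindowSize": {"multiple": 1, "unit": "DAY"},
        }
    ],
)
print(json.dumps(body, indent=2))
```

Keeping the body as a plain dictionary makes it easy to add further metrics or buckets before serializing it.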

Three kinds of aggregations can be specified in an aggregation query on the Timeseries-annotated fields.
The values that `aggregationType` can take are:

* `LATEST`: The latest value of the field in each bucket. Supported for any type of field.
* `SUM`: The cumulative sum of the field in each bucket. Supported only for integral types.
* `CARDINALITY`: The number of unique values, or the cardinality of the set, in each bucket. Supported for string and
  record types.
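
To make the semantics of the three aggregation types concrete, here is a small in-memory sketch over the rows of a single bucket. It only mirrors the definitions above; it is not how the service computes the results, and the `count` and `user` field names are made up for the illustration.

```python
def aggregate(rows, field, aggregation_type):
    """Apply one aggregation to the rows of a single bucket (illustrative only).

    Each row is a dict holding `timestampMillis` plus the field being aggregated.
    """
    # Order by event time so that "latest" means the most recent value.
    ordered = sorted(rows, key=lambda row: row["timestampMillis"])
    values = [row[field] for row in ordered]
    if aggregation_type == "LATEST":
        return values[-1]
    if aggregation_type == "SUM":
        return sum(values)
    if aggregation_type == "CARDINALITY":
        return len(set(values))
    raise ValueError(f"unsupported aggregation type: {aggregation_type}")


# Hypothetical rows that all fall into the same bucket.
bucket = [
    {"timestampMillis": 3000, "count": 5, "user": "alice"},
    {"timestampMillis": 1000, "count": 2, "user": "bob"},
    {"timestampMillis": 2000, "count": 4, "user": "alice"},
]

print(aggregate(bucket, "count", "LATEST"))      # 5
print(aggregate(bucket, "count", "SUM"))         # 11
print(aggregate(bucket, "user", "CARDINALITY"))  # 2
```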

Two types of grouping are supported for defining the buckets to perform aggregations against.

* `DATE_GROUPING_BUCKET`: Creates time-based buckets such as by second, minute, hour, day, week, month,
  quarter, year, etc. Should be used in conjunction with a timestamp field whose value is in milliseconds since the *epoch*.
  The `timeWindowSize` param specifies the date histogram bucket width.
* `STRING_GROUPING_BUCKET`: Creates buckets grouped by the unique values of a field. Should always be used in
  conjunction with a string type field.
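
The two bucket types can be pictured with a short sketch. It assumes fixed-width buckets aligned to the epoch in UTC, which is only an approximation of a date histogram and does not cover calendar units such as week, month, quarter or year. The unit names other than `DAY` are assumed by analogy with the sample query, and both helper names are hypothetical.

```python
from collections import defaultdict

# Fixed-width units only; names other than DAY are assumptions.
MILLIS_PER_UNIT = {
    "SECOND": 1_000,
    "MINUTE": 60_000,
    "HOUR": 3_600_000,
    "DAY": 86_400_000,
}


def date_bucket(timestamp_millis, multiple=1, unit="DAY"):
    """Return the start (in epoch millis) of the bucket containing the timestamp."""
    width = multiple * MILLIS_PER_UNIT[unit]
    return (timestamp_millis // width) * width


def group_by_string(rows, key):
    """Group rows by the distinct values of a string field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


# 05:00 UTC on the day that starts at 1631491200000 falls into that day's bucket.
print(date_bucket(1631491200000 + 5 * 3_600_000))  # 1631491200000

rows = [{"user": "alice"}, {"user": "bob"}, {"user": "alice"}]
print(sorted(group_by_string(rows, "user")))  # ['alice', 'bob']
```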

The API returns a generic SQL-like table as the `table` member of the output, which contains the results of
the `group-by/aggregate` query, in addition to echoing the input params.

* `columnNames`: the names of the table columns. The group-by `key` names appear in the same order as they are specified
  in the request. Aggregation specifications follow the grouping fields in the same order as specified in the request,
  and will be named `<agg_name>_<fieldPath>`.
* `columnTypes`: the data types of the columns.
* `rows`: the data values, each row corresponding to the respective bucket(s).
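
A client will usually want to turn this table into records keyed by column name. The sketch below does that for a table shaped like the sample response in this section, where cell values arrive as strings. The helper names are hypothetical, only the `long` and `int` column types seen in that sample are converted, and treating `<agg_name>` as the lower-cased aggregation type is an assumption based on the `latest_uniqueUserCount` column in the same sample.

```python
# Converters for the column types seen in the sample response; anything else stays a string.
CASTS = {"long": int, "int": int}


def expected_column_names(buckets, metrics):
    """Grouping keys first, then one `<agg_name>_<fieldPath>` column per metric."""
    keys = [bucket["key"] for bucket in buckets]
    aggregations = [
        f'{metric["aggregationType"].lower()}_{metric["fieldPath"]}' for metric in metrics
    ]
    return keys + aggregations


def table_to_records(table):
    """Convert the `table` member into a list of dicts with typed values."""
    casts = [CASTS.get(column_type, str) for column_type in table["columnTypes"]]
    return [
        {name: cast(value) for name, cast, value in zip(table["columnNames"], casts, row)}
        for row in table["rows"]
    ]


table = {
    "columnNames": ["timestampMillis", "latest_uniqueUserCount"],
    "rows": [["1631491200000", "1"]],
    "columnTypes": ["long", "int"],
}
records = table_to_records(table)
print(records)
```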

Example 1: Latest unique user count for each day.
```shell
# QUERY
curl --location --request POST 'http://localhost:8080/analytics?action=getTimeseriesStats' \
--header 'X-RestLi-Protocol-Version: 2.0.0' \
--header 'Content-Type: application/json' \
--data-raw '{
    "entityName": "dataset",
    "aspectName": "datasetUsageStatistics",
    "filter": {
        "criteria": []
    },
    "metrics": [
        {
            "fieldPath": "uniqueUserCount",
            "aggregationType": "LATEST"
        }
    ],
    "buckets": [
        {
            "key": "timestampMillis",
            "type": "DATE_GROUPING_BUCKET",
            "timeWindowSize": {
                "multiple": 1,
                "unit": "DAY"
            }
        }
    ]
}'

# SAMPLE RESPONSE
{
    "value": {
        "filter": {
            "criteria": []
        },
        "aspectName": "datasetUsageStatistics",
        "entityName": "dataset",
        "groupingBuckets": [
            {
                "type": "DATE_GROUPING_BUCKET",
                "timeWindowSize": {
                    "multiple": 1,
                    "unit": "DAY"
                },
                "key": "timestampMillis"
            }
        ],
        "aggregationSpecs": [
            {
                "fieldPath": "uniqueUserCount",
                "aggregationType": "LATEST"
            }
        ],
        "table": {
            "columnNames": [
                "timestampMillis",
                "latest_uniqueUserCount"
            ],
            "rows": [
                [
                    "1631491200000",
                    "1"
                ]
            ],
            "columnTypes": [
                "long",
                "int"
            ]
        }
    }
}
```
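
The same query can also be issued from Python. This is a minimal sketch using only the standard library, assuming the local endpoint and headers from the curl command above; `make_request` and `fetch_table` are hypothetical helper names, and `fetch_table` needs a running server, so it is defined here but not called.

```python
import json
import urllib.request

# Same endpoint as the curl example; adjust host and port for your deployment.
STATS_URL = "http://localhost:8080/analytics?action=getTimeseriesStats"


def make_request(payload, url=STATS_URL):
    """Build (but do not send) the POST request for an aggregation query."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "X-RestLi-Protocol-Version": "2.0.0",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_table(payload, url=STATS_URL):
    """Send the query and return the `table` member of the response."""
    with urllib.request.urlopen(make_request(payload, url)) as response:
        return json.load(response)["value"]["table"]


payload = {
    "entityName": "dataset",
    "aspectName": "datasetUsageStatistics",
    "filter": {"criteria": []},
    "metrics": [{"fieldPath": "uniqueUserCount", "aggregationType": "LATEST"}],
    "buckets": [
        {
            "key": "timestampMillis",
            "type": "DATE_GROUPING_BUCKET",
            "timeWindowSize": {"multiple": 1, "unit": "DAY"},
        }
    ],
}
request = make_request(payload)
print(request.get_method(), request.full_url)
```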

For more examples of complex group-by/aggregations, refer to the tests in the group `getAggregatedStats` of [ElasticSearchTimeseriesAspectServiceTest.java](https://github.com/linkedin/datahub/blob/master/metadata-io/src/test/java/com/linkedin/metadata/timeseries/elastic/ElasticSearchTimeseriesAspectServiceTest.java).