Skip to content

Conversation

@dexter-mh-lee
Copy link
Contributor

Add retention to Local DB. To do this:

  1. Added RetentionService that is coupled with EntityService
  • Manages retention policies (get, set, delete policies for entity, aspect combo - both optional to set defaults)
  • Apply retention policy to a given urn, aspect
  • Apply retention policies in batch
  1. Created a DataHubRetention entity to store the retention policies. This lets us use all the useful capabilities that an entity has within entity service
  2. Setup triggering logic for applying retention.
    a) Each call to setRetention triggers a batchApplyRetention for the affected rows
    a) bootstrap step: on start-up, if there are no retention ingested, it sets retention for default based on env variable, which inturn calls batchApplyRetention for all rows

Checklist

  • [v] The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • [v] Tests for the changes have been added/updated (if applicable)
  • [v] Docs related to the changes have been added/updated (if applicable)

@github-actions
Copy link
Contributor

github-actions bot commented Dec 10, 2021

Unit Test Results

     45 files  +  1       45 suites  +1   43m 40s ⏱️ + 5m 46s
   635 tests +12     577 ✔️ +  6  58 💤 +6  0 ±0 
1 419 runs  +32  1 345 ✔️ +26  74 💤 +6  0 ±0 

Results for commit fb83735. ± Comparison against base commit 36cf835.

♻️ This comment has been updated with latest results.

@@ -0,0 +1,79 @@
# Configuring Database Retention

## Goal
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great doc!

private Boolean _emitAspectSpecificAuditEvent = false;
@Getter
@Setter
private RetentionService retentionService;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should never be null, right?


@Nonnull
private UpdateAspectResult ingestAspectToLocalDB(@Nonnull final Urn urn, @Nonnull final String aspectName,
protected UpdateAspectResult ingestAspectToLocalDB(@Nonnull final Urn urn, @Nonnull final String aspectName,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad we are factoring this out.


// Reset retention policies
_retentionService.setRetention(null, null, new DataHubRetentionInfo().setRetentionPolicies(new RetentionArray(
ImmutableList.of(new Retention().setVersion(new VersionBasedRetention().setMaxVersions(1))))));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if max version is 10?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will just not do anything. It doesn't revive deleted aspects!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but if you ingest more after setting it to 10, it will ingest without deleting until it reaches 10 versions

@Qualifier("retentionService")
private RetentionService _retentionService;

@Value("${entityService.retention.disable}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the other configs use "enabled" as the flags. can we also do that here to keep things consistent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah makes sense. on it

// 0. Execute preflight check to see whether we need to ingest policies
log.info("Ingesting default retention...");

// Whether we are at clean boot or not.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this check if we are at clean boot or if retention has been disabled by a flag?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to update comment


// 4. Set the specified retention policies
log.info("Setting {} policies", retentionPolicyMap.size());
boolean hasUpdate = false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just shortcircuit and call _retentionservice.batcha... inside the if directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah because, I want to batch apply after setting all retention policies. Each set retention sets a single policy and returns whether that set caused an actual update. If there is at least one update, we apply retention to all

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see

return parseYamlRetentionConfig(retentionFileOrDir);
}

private Map<DataHubRetentionKey, DataHubRetentionInfo> parseYamlRetentionConfig(File retentionConfigFile)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is great ! is there a comment about what the expected structure of a given yaml file should be?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me add

entityRegistry:
path: ${ENTITY_REGISTRY_PLUGIN_PATH:/etc/datahub/plugins/models}
retention:
path: ${RETENTION_PLUGIN_PATH:/etc/datahub/plugins/retention}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awsomeeee


entityService:
retention:
disable: ${ENTITY_SERVICE_DISABLE_RETENTION:true}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above -- enabled / enable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


pluginEntityRegistry:
path: ${ENTITY_REGISTRY_PLUGIN_PATH:$HOME/.datahub/plugins/models}
path: ${ENTITY_REGISTRY_PLUGIN_PATH:/etc/datahub/plugins/models}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops. removing

@@ -0,0 +1,14 @@
- entity: "*"
aspect: "*"
retention:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit: do we need the extra "retention" field?

could it be

  • entity
    aspect
    retentionPollicies

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. that's ideal, but I wanted to simply convert the section into DataHubRetentionInfo object without adding more logic there, which requires this field

for (RecordDataSchema.Field field : keyAspect.schema().getFields()) {
Object value = keyAspect.data().get(field.getName());
String valueString = value.toString();
String valueString = value == null ? "" : value.toString();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!!!

.or();
boolean retentionApplied = false;
for (Retention retention : retentionPolicies) {
if (retention.hasVersion() && applyVersionBasedRetention(urn, aspectName, deleteQuery, retention.getVersion(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code initially confused me because I didn't realize that deleteQuery was being mutated inside of applyVersionBasedRetention as "querySoFar"... what if that method simply returned a query and we inserted it into the expression list here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was what I wanted to do, but didn't think there was a way. Looking again, there might be a way. let me see

return _server.find(EbeanAspectV2.class)
.select(String.format("%s, %s, %s", EbeanAspectV2.URN_COLUMN, EbeanAspectV2.ASPECT_COLUMN,
EbeanAspectV2.METADATA_COLUMN))
.where()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hopefully this query will not be super expensive. i was planning to migrate policies away from using a query like this to using elastic directly.. i guess if its once a boot up maybe not too bad

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better than for policies

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. From tracing, I saw that these queries are sub-10ms queries, since we start with urn, aspect.

private void validateRetentionInfo(DataHubRetentionInfo retentionInfo) {
Set<String> retentionsSoFar = new HashSet<>();
for (Retention retention : retentionInfo.getRetentionPolicies()) {
if (retention.data().size() != 1) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup so we support it as list of retention. Inside a single Retention object, we should have just one set, kind of emulating a union behavior. We could instead have a single Retention object instead of a list of retention objects as policy and set as many policies as needed. WDYT?

* Base class that encapsulates different retention policies.
* Only one of the fields should be set
*/
record Retention {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the other option is to have a type and arguments:

type: enum Retentiontype: {
indefinite,
version,
time
}
// set if type is version
maxVersion: optional long

// set if type is time
maxAgeInseconds

this structure just makes reading the object in code a bit more complex.. is there a reason you went with the approach of just setting a field and letting the reader check the presence of each?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmn since there is no shared fields, I think this approach is much easier to read. Otherwise, each type needs to be mapped to the field that needs to be filled out in code instead of in pdl.

private void validateRetentionInfo(DataHubRetentionInfo retentionInfo) {
Set<String> retentionsSoFar = new HashSet<>();
for (Retention retention : retentionInfo.getRetentionPolicies()) {
if (retention.data().size() != 1) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this use of the raw data map scares me.. all the key validation that's required

@Nonnull
@Override
public ExecutionMode getExecutionMode() {
return ExecutionMode.ASYNC;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any concern with allowing datahub to finish booting up before running retention? like any race conditions with ingestion or other bootstrap steps we need to be concerned about?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't expect any since version 0 is never deleted

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay great

log.info("Finished applying retention to all records");
}

private Map<String, DataHubRetentionInfo> getRetentionPolicies() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's call this getAllRetentionPolicies

@Qualifier("retentionService")
private RetentionService _retentionService;

@Value("${entityService.retention.enable}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to nit on this again - I think the others (analytics, auth) use "enabled"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(policies)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


entityService:
retention:
enable: ${ENTITY_SERVICE_ENABLE_RETENTION:false}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enabled to be consistent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -0,0 +1,14 @@
- entity: "*"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great!

Copy link
Collaborator

@jjoyce0510 jjoyce0510 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great - just one final nit on the enabled config.

Copy link
Contributor

@shirshanka shirshanka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@shirshanka shirshanka merged commit 8e4769f into datahub-project:master Dec 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants