-
Notifications
You must be signed in to change notification settings - Fork 3.3k
feat(retention): Add retention to Local DB #3715
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(retention): Add retention to Local DB #3715
Conversation
| @@ -0,0 +1,79 @@ | |||
| # Configuring Database Retention | |||
|
|
|||
| ## Goal | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great doc!
| private Boolean _emitAspectSpecificAuditEvent = false; | ||
| @Getter | ||
| @Setter | ||
| private RetentionService retentionService; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should never be null, right?
|
|
||
| @Nonnull | ||
| private UpdateAspectResult ingestAspectToLocalDB(@Nonnull final Urn urn, @Nonnull final String aspectName, | ||
| protected UpdateAspectResult ingestAspectToLocalDB(@Nonnull final Urn urn, @Nonnull final String aspectName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad we are factoring this out.
|
|
||
| // Reset retention policies | ||
| _retentionService.setRetention(null, null, new DataHubRetentionInfo().setRetentionPolicies(new RetentionArray( | ||
| ImmutableList.of(new Retention().setVersion(new VersionBasedRetention().setMaxVersions(1)))))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if max version is 10?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will just not do anything. It doesn't revive deleted aspects!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but if you ingest more after setting it to 10, it will ingest without deleting until it reaches 10 versions
| @Qualifier("retentionService") | ||
| private RetentionService _retentionService; | ||
|
|
||
| @Value("${entityService.retention.disable}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the other configs use "enabled" as the flags. can we also do that here to keep things consistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah makes sense. on it
| // 0. Execute preflight check to see whether we need to ingest policies | ||
| log.info("Ingesting default retention..."); | ||
|
|
||
| // Whether we are at clean boot or not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this check if we are at clean boot or if retention has been disabled by a flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to update comment
|
|
||
| // 4. Set the specified retention policies | ||
| log.info("Setting {} policies", retentionPolicyMap.size()); | ||
| boolean hasUpdate = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just shortcircuit and call _retentionservice.batcha... inside the if directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah because, I want to batch apply after setting all retention policies. Each set retention sets a single policy and returns whether that set caused an actual update. If there is at least one update, we apply retention to all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see
| return parseYamlRetentionConfig(retentionFileOrDir); | ||
| } | ||
|
|
||
| private Map<DataHubRetentionKey, DataHubRetentionInfo> parseYamlRetentionConfig(File retentionConfigFile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is great ! is there a comment about what the expected structure of a given yaml file should be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me add
| entityRegistry: | ||
| path: ${ENTITY_REGISTRY_PLUGIN_PATH:/etc/datahub/plugins/models} | ||
| retention: | ||
| path: ${RETENTION_PLUGIN_PATH:/etc/datahub/plugins/retention} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awsomeeee
|
|
||
| entityService: | ||
| retention: | ||
| disable: ${ENTITY_SERVICE_DISABLE_RETENTION:true} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above -- enabled / enable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
|
||
| pluginEntityRegistry: | ||
| path: ${ENTITY_REGISTRY_PLUGIN_PATH:$HOME/.datahub/plugins/models} | ||
| path: ${ENTITY_REGISTRY_PLUGIN_PATH:/etc/datahub/plugins/models} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this still necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops. removing
| @@ -0,0 +1,14 @@ | |||
| - entity: "*" | |||
| aspect: "*" | |||
| retention: | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit: do we need the extra "retention" field?
could it be
- entity
aspect
retentionPollicies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah. that's ideal, but I wanted to simply convert the section into DataHubRetentionInfo object without adding more logic there, which requires this field
| for (RecordDataSchema.Field field : keyAspect.schema().getFields()) { | ||
| Object value = keyAspect.data().get(field.getName()); | ||
| String valueString = value.toString(); | ||
| String valueString = value == null ? "" : value.toString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!!!
| .or(); | ||
| boolean retentionApplied = false; | ||
| for (Retention retention : retentionPolicies) { | ||
| if (retention.hasVersion() && applyVersionBasedRetention(urn, aspectName, deleteQuery, retention.getVersion(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code initially confused me because I didn't realize that deleteQuery was being mutated inside of applyVersionBasedRetention as "querySoFar"... what if that method simply returned a query and we inserted it into the expression list here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was what I wanted to do, but didn't think there was a way. Looking again, there might be a way. let me see
| return _server.find(EbeanAspectV2.class) | ||
| .select(String.format("%s, %s, %s", EbeanAspectV2.URN_COLUMN, EbeanAspectV2.ASPECT_COLUMN, | ||
| EbeanAspectV2.METADATA_COLUMN)) | ||
| .where() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hopefully this query will not be super expensive. i was planning to migrate policies away from using a query like this to using elastic directly.. i guess if its once a boot up maybe not too bad
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better than for policies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. From tracing, I saw that these queries are sub-10ms queries, since we start with urn, aspect.
| private void validateRetentionInfo(DataHubRetentionInfo retentionInfo) { | ||
| Set<String> retentionsSoFar = new HashSet<>(); | ||
| for (Retention retention : retentionInfo.getRetentionPolicies()) { | ||
| if (retention.data().size() != 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup so we support it as list of retention. Inside a single Retention object, we should have just one set, kind of emulating a union behavior. We could instead have a single Retention object instead of a list of retention objects as policy and set as many policies as needed. WDYT?
| * Base class that encapsulates different retention policies. | ||
| * Only one of the fields should be set | ||
| */ | ||
| record Retention { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the other option is to have a type and arguments:
type: enum Retentiontype: {
indefinite,
version,
time
}
// set if type is version
maxVersion: optional long
// set if type is time
maxAgeInseconds
this structure just makes reading the object in code a bit more complex.. is there a reason you went with the approach of just setting a field and letting the reader check the presence of each?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmn since there is no shared fields, I think this approach is much easier to read. Otherwise, each type needs to be mapped to the field that needs to be filled out in code instead of in pdl.
| private void validateRetentionInfo(DataHubRetentionInfo retentionInfo) { | ||
| Set<String> retentionsSoFar = new HashSet<>(); | ||
| for (Retention retention : retentionInfo.getRetentionPolicies()) { | ||
| if (retention.data().size() != 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this use of the raw data map scares me.. all the key validation that's required
| @Nonnull | ||
| @Override | ||
| public ExecutionMode getExecutionMode() { | ||
| return ExecutionMode.ASYNC; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any concern with allowing datahub to finish booting up before running retention? like any race conditions with ingestion or other bootstrap steps we need to be concerned about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't expect any since version 0 is never deleted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay great
| log.info("Finished applying retention to all records"); | ||
| } | ||
|
|
||
| private Map<String, DataHubRetentionInfo> getRetentionPolicies() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's call this getAllRetentionPolicies
| @Qualifier("retentionService") | ||
| private RetentionService _retentionService; | ||
|
|
||
| @Value("${entityService.retention.enable}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to nit on this again - I think the others (analytics, auth) use "enabled"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(policies)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
|
||
| entityService: | ||
| retention: | ||
| enable: ${ENTITY_SERVICE_ENABLE_RETENTION:false} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enabled to be consistent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| @@ -0,0 +1,14 @@ | |||
| - entity: "*" | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great!
jjoyce0510
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks great - just one final nit on the enabled config.
shirshanka
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Add retention to Local DB. To do this:
a) Each call to setRetention triggers a batchApplyRetention for the affected rows
a) bootstrap step: on start-up, if there are no retention ingested, it sets retention for default based on env variable, which inturn calls batchApplyRetention for all rows
Checklist