-
Notifications
You must be signed in to change notification settings - Fork 3
Add option to automatically update collection extents and summaries based on ingested items #106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
fmigneault
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look fine.
Comments mostly for considering extended use cases that could be addressed or not, or added at a later time depending on criticality.
STACpopulator/cli.py
Outdated
| "--update-collection", | ||
| choices=get_args(inspect.signature(STACpopulatorBase.__init__).parameters["update_collection"].annotation), | ||
| default="none", | ||
| help="Update collection information based on new items created or updated by this populator. " | ||
| "Only applies if --update is also set.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine, but just out of curiosity, would it make sense to allow updating only the collection, such as recomputing the extent/summaries without overriding other Items metadata that could have been pushed at another time (not within the same populator call)?
If --update is used, does it imply --update-collection as well by default if it is not provided explicitly? I would expect summaries/extents to be computed automatically anyway without having to specify the option. If such, what more does the option do since --update would be required for it to take effect. If not, wouldn't that potentially introduce erroneous summaries/extents?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think it would make sense to have a "recompute summaries and extents" option but I'm feeling like that's should be it's own standalone thing since that would involve crawling through the existing items within that catalog.
If --update is used, does it imply --update-collection as well by default if it is not provided explicitly?
No, but maybe it should. I kept the semantics of --update as it was before since I wanted to ensure backwards compatibility.
If such, what more does the option do
--update-collection updates the summaries and extents based on any new/updated items in the collection. --update actually tells the populator to update the collection on the STAC-API if it already exists.
wouldn't that potentially introduce erroneous summaries/extents?
It definitely would if the user didn't update those summaries/extents by some other means (which was the assumption before). Again, I did this to ensure backwards compatibility with the previous processes but I'm happy to break things if everyone is happy with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I agree about keeping them separate options to allow flexibility.
Do you think it would be more common that --update --update-collection all is desired or --update --update-collection none?
My guess would be that, unless we explicitly want to handle a special case where --update-collection all would introduce invalid metadata from partial Item references, we would always want everything to be synced. So, "all" would be a more appropriate default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I agree. That would break backwards compatibility though... are you ok if we do that? If so I can make some additional changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Since we are already accumulating a few changes and new features, we might as well do it with the most common expectation.
STACpopulator/populator_base.py
Outdated
| else: | ||
| item_interval = [stac_item["properties"][prop] for prop in ("start_datetime", "end_datetime")] | ||
| collection_intervals = self._collection["extent"]["temporal"]["interval"] | ||
| if collection_intervals: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If none was already provided, the ones found with items are not applied?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's supposed to be an else in there but I don't know if you can have a valid collection without at least one interval value. I'll add in the else condition just in case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so too, but could it be possible that the Collection data is purposely partially defined in this case have it being defined entirely from the computed STAC Items values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that we support that right now since we're calling publish_stac_collection before ingest which means that the STAC collection must exist on the STAC API before the data is ingested and the API should reject an invalid/partial collection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before ingest which means that the STAC collection must exist on the STAC API before
This assumes the remote API is used, but can't the ingestion be with the directory loader?
Since that is probably the only case where a partial collection could be defined, I think it would be fair to consider this feature as unsupported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes but the directory loader still send the data to the API once it's loaded. Right now we don't support "uploading" STAC data to a catalog that's just a bunch of json files on disk.
I think it would be fair to consider this feature as unsupported.
Yeah I'd say it's unsupported as well
| "temporal": { | ||
| "interval": [ | ||
| ["1850-01-01T00:00:00Z", null] | ||
| ["2020-01-01T00:00:00Z", null] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I like the use of null for "unknown" end datetime, but I think STAC technically disallows it. There are some open issues about it in stac-spec, so might need to revalidate for this test.
On a side note, if null would be allowed, is that property handled everywhere that datetime comparaisons are performed. For example, if Collection used null explicitly and an Item provided an end value, would the null be considered "greater / open-ended" compared to the Item's end datetime? In a sense, the ˋnull` could be "higher priority" as collection summary to indicate that Items are still being added and that the Collection is continuously updated with new entries.
Similarly, would a null start datetime in the Collection be considered "lower" than a specific Item's start datetime? Maybe less common case than the open-ended datetime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null values are allowed: https://github.com/radiantearth/stac-spec/blob/master/collection-spec/collection-spec.md#temporal-extent-object
There are some open issues about it in stac-spec
Are you talking about this one? Because I think that one is related to a different part of the spec and discusses adding support for null values: radiantearth/stac-spec#1161
On a side note, if null would be allowed, is that property handled everywhere that datetime comparaisons are performed
Yes it's supported
For example, if Collection used null explicitly and an Item provided an end value, would the null be considered "greater / open-ended" compared to the Item's end datetime?
Similarly, would a null start datetime in the Collection be considered "lower" than a specific Item's start datetime?
Yes for both
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is exactly this issue I was referring to. I am aware null is permitted in extents. My "disallowed" mentions are about the reference datetimes of the STAC Items specifically, and which are used as source metadata to compare and merge into the collection extent.
Technically, there would never be a valid STAC Item with start_datetime: "..." + end_datetime: null. Therefore, the strategy to combine them into the Collections temporal extent that could take advantage of open-ended null might be ambiguous.
If any STAC Item specifies end_datetime: null, it would be refused by the API, and therefore the corresponding collection update would not happen. But if the extent is explicitly provided with a [..., null] and that takes precedence, then all is good.
This reverts commit 4424d8d.
|
I've added a separate CLI option to update an existing collection based on all of the items in the collection. |
|
@henriaidasso this is the implementation that @fmigneault mentioned in https://github.com/crim-ca/stac-populator/pull/114/files#r2418104894 |
fmigneault
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just minor typing adjustment to simplify things.
STACpopulator/cli.py
Outdated
| dest="update_collection", | ||
| choices=get_args(inspect.signature(STACpopulatorBase.__init__).parameters["update_collection"].annotation), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be updated using UpdateModes directly (also including "none" in it)?
It would be easier to interpret and there wouldn't be partial subset of possibles values across the code.
Adds a new CLI option to automatically update a collection's extents and/or summaries based on the ingested items that are added.
Note that this only applies if the
--updateoption is also set.Also adds a warning if an item's bounding box is outside of the bounds set by the STAC spec (-180->180, -90->90)