Skip to content

Conversation

@mishaschwartz
Copy link
Contributor

Adds a new CLI option to automatically update a collection's extents and/or summaries based on the ingested items that are added.

Note that this only applies if the --update option is also set.

Also adds a warning if an item's bounding box is outside of the bounds set by the STAC spec (-180->180, -90->90)

Copy link
Collaborator

@fmigneault fmigneault left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look fine.

Comments mostly for considering extended use cases that could be addressed or not, or added at a later time depending on criticality.

Comment on lines 50 to 54
"--update-collection",
choices=get_args(inspect.signature(STACpopulatorBase.__init__).parameters["update_collection"].annotation),
default="none",
help="Update collection information based on new items created or updated by this populator. "
"Only applies if --update is also set.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, but just out of curiosity, would it make sense to allow updating only the collection, such as recomputing the extent/summaries without overriding other Items metadata that could have been pushed at another time (not within the same populator call)?

If --update is used, does it imply --update-collection as well by default if it is not provided explicitly? I would expect summaries/extents to be computed automatically anyway without having to specify the option. If such, what more does the option do since --update would be required for it to take effect. If not, wouldn't that potentially introduce erroneous summaries/extents?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think it would make sense to have a "recompute summaries and extents" option but I'm feeling like that's should be it's own standalone thing since that would involve crawling through the existing items within that catalog.

If --update is used, does it imply --update-collection as well by default if it is not provided explicitly?

No, but maybe it should. I kept the semantics of --update as it was before since I wanted to ensure backwards compatibility.

If such, what more does the option do

--update-collection updates the summaries and extents based on any new/updated items in the collection. --update actually tells the populator to update the collection on the STAC-API if it already exists.

wouldn't that potentially introduce erroneous summaries/extents?

It definitely would if the user didn't update those summaries/extents by some other means (which was the assumption before). Again, I did this to ensure backwards compatibility with the previous processes but I'm happy to break things if everyone is happy with that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I agree about keeping them separate options to allow flexibility.

Do you think it would be more common that --update --update-collection all is desired or --update --update-collection none?

My guess would be that, unless we explicitly want to handle a special case where --update-collection all would introduce invalid metadata from partial Item references, we would always want everything to be synced. So, "all" would be a more appropriate default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree. That would break backwards compatibility though... are you ok if we do that? If so I can make some additional changes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Since we are already accumulating a few changes and new features, we might as well do it with the most common expectation.

else:
item_interval = [stac_item["properties"][prop] for prop in ("start_datetime", "end_datetime")]
collection_intervals = self._collection["extent"]["temporal"]["interval"]
if collection_intervals:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If none was already provided, the ones found with items are not applied?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's supposed to be an else in there but I don't know if you can have a valid collection without at least one interval value. I'll add in the else condition just in case.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so too, but could it be possible that the Collection data is purposely partially defined in this case have it being defined entirely from the computed STAC Items values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we support that right now since we're calling publish_stac_collection before ingest which means that the STAC collection must exist on the STAC API before the data is ingested and the API should reject an invalid/partial collection.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before ingest which means that the STAC collection must exist on the STAC API before

This assumes the remote API is used, but can't the ingestion be with the directory loader?
Since that is probably the only case where a partial collection could be defined, I think it would be fair to consider this feature as unsupported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but the directory loader still send the data to the API once it's loaded. Right now we don't support "uploading" STAC data to a catalog that's just a bunch of json files on disk.

I think it would be fair to consider this feature as unsupported.

Yeah I'd say it's unsupported as well

"temporal": {
"interval": [
["1850-01-01T00:00:00Z", null]
["2020-01-01T00:00:00Z", null]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I like the use of null for "unknown" end datetime, but I think STAC technically disallows it. There are some open issues about it in stac-spec, so might need to revalidate for this test.

On a side note, if null would be allowed, is that property handled everywhere that datetime comparaisons are performed. For example, if Collection used null explicitly and an Item provided an end value, would the null be considered "greater / open-ended" compared to the Item's end datetime? In a sense, the ˋnull` could be "higher priority" as collection summary to indicate that Items are still being added and that the Collection is continuously updated with new entries.

Similarly, would a null start datetime in the Collection be considered "lower" than a specific Item's start datetime? Maybe less common case than the open-ended datetime.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null values are allowed: https://github.com/radiantearth/stac-spec/blob/master/collection-spec/collection-spec.md#temporal-extent-object

There are some open issues about it in stac-spec

Are you talking about this one? Because I think that one is related to a different part of the spec and discusses adding support for null values: radiantearth/stac-spec#1161

On a side note, if null would be allowed, is that property handled everywhere that datetime comparaisons are performed

Yes it's supported

For example, if Collection used null explicitly and an Item provided an end value, would the null be considered "greater / open-ended" compared to the Item's end datetime?
Similarly, would a null start datetime in the Collection be considered "lower" than a specific Item's start datetime?

Yes for both

Copy link
Collaborator

@fmigneault fmigneault Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is exactly this issue I was referring to. I am aware null is permitted in extents. My "disallowed" mentions are about the reference datetimes of the STAC Items specifically, and which are used as source metadata to compare and merge into the collection extent.

Technically, there would never be a valid STAC Item with start_datetime: "..." + end_datetime: null. Therefore, the strategy to combine them into the Collections temporal extent that could take advantage of open-ended null might be ambiguous.

If any STAC Item specifies end_datetime: null, it would be refused by the API, and therefore the corresponding collection update would not happen. But if the extent is explicitly provided with a [..., null] and that takes precedence, then all is good.

@mishaschwartz
Copy link
Contributor Author

I've added a separate CLI option to update an existing collection based on all of the items in the collection.
All of the code is pretty much the same as before except that I moved it into a separate file instead of keeping it under the STACpopulatorBase class so that it could be used outside of a populator run.

@mishaschwartz
Copy link
Contributor Author

Copy link
Collaborator

@fmigneault fmigneault left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just minor typing adjustment to simplify things.

Comment on lines 52 to 53
dest="update_collection",
choices=get_args(inspect.signature(STACpopulatorBase.__init__).parameters["update_collection"].annotation),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be updated using UpdateModes directly (also including "none" in it)?
It would be easier to interpret and there wouldn't be partial subset of possibles values across the code.

@mishaschwartz mishaschwartz merged commit 9bd6cf1 into master Nov 12, 2025
12 checks passed
@mishaschwartz mishaschwartz deleted the collection-auto-update branch November 12, 2025 16:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants