Skip to content
Merged
Show file tree
Hide file tree
Changes from 36 commits
Commits
Show all changes
38 commits
Select commit Hold shift + click to select a range
63ff6c9
fix .dtype property
TomNicholas Jul 15, 2025
e765e13
use zarr data types in create_array_v3_metadata
TomNicholas Jul 15, 2025
c60795a
change expected repr
TomNicholas Jul 15, 2025
b2d9549
fix test for checking dtypes are the same
TomNicholas Jul 15, 2025
a05d9e8
fix icechunk tests
TomNicholas Jul 15, 2025
f172930
fix kerchunk tests
TomNicholas Jul 15, 2025
02dfa56
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 15, 2025
7dfeabe
Merge branch 'develop' into zarr-data-types-refactor-compat2
TomNicholas Jul 15, 2025
b42022f
fix combine test
TomNicholas Jul 15, 2025
7bb6e6f
fix conversion of v3 to v2 metadata
TomNicholas Jul 16, 2025
744fd49
write function to normalize kerchunk references into true json
TomNicholas Jul 16, 2025
35f94b5
use the new function in our tests
TomNicholas Jul 16, 2025
9663d05
remove outdated imports
TomNicholas Jul 16, 2025
e1e54b3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 16, 2025
915e4c3
Merge branch 'develop' into kerchunk_to_true_json
TomNicholas Jul 16, 2025
ae3f871
Merge branch 'kerchunk_to_true_json' into zarr-data-types-refactor-co…
TomNicholas Jul 16, 2025
8548086
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 16, 2025
01baf83
missed a coersion
TomNicholas Jul 16, 2025
ad0be5c
Merge branch 'kerchunk_to_true_json' of https://github.com/TomNichola…
TomNicholas Jul 16, 2025
de9c8fd
Merge branch 'kerchunk_to_true_json' into zarr-data-types-refactor-co…
TomNicholas Jul 16, 2025
35cfddf
fix dmrpp test
TomNicholas Jul 16, 2025
813ee16
Merge branch 'zarr-data-types-refactor-compat2' of https://github.com…
TomNicholas Jul 16, 2025
f0739c1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 16, 2025
77493a6
Merge branch 'develop' into zarr-data-types-refactor-compat2
TomNicholas Jul 16, 2025
31941fb
require latest zarr
TomNicholas Jul 16, 2025
64a513b
Merge branch 'zarr-data-types-refactor-compat2' of https://github.com…
TomNicholas Jul 16, 2025
a17884d
update minimum version in pixi
TomNicholas Jul 16, 2025
93f7758
fix array return type and add test
TomNicholas Jul 17, 2025
407f208
change expected group repr dtype str
TomNicholas Jul 17, 2025
09d801d
fix metadata comparison by just using updated version of ArrayV3Metad…
TomNicholas Jul 17, 2025
9117116
don't need to add .nbytes to ManifestArray
TomNicholas Jul 17, 2025
3ebb731
remove rogue print statements
TomNicholas Jul 17, 2025
bf104cb
try fixing zarr data type error in icechunk writer in CI
TomNicholas Jul 17, 2025
b1961b6
remove xfail now that datetime dtypes are supported in zarr
TomNicholas Jul 17, 2025
b0bddf2
add xfail for one test case
TomNicholas Jul 17, 2025
fc1bd34
add note about now supporting big-endian
TomNicholas Jul 17, 2025
925ffec
remove unnecessary conversion of dtype
TomNicholas Jul 17, 2025
0d7ad60
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 17, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/releases.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@
[Chuck Daniels](https://github.com/chuckwondo).
- Now throws a warning if you attempt to write an entirely non-virtual dataset to a virtual references format ([#657](https://github.com/zarr-developers/VirtualiZarr/pull/657)).
By [Tom Nicholas](https://github.com/TomNicholas).
- Support big-endian data via zarr-python 3.0.9 and zarr v3's new data types system ([#618](https://github.com/zarr-developers/VirtualiZarr/issues/618), [#677](https://github.com/zarr-developers/VirtualiZarr/issues/677)) By [Max Jones](https://github.com/maxrjones) and [Tom Nicholas](https://github.com/TomNicholas).

### Breaking changes

Expand Down
4 changes: 2 additions & 2 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ dependencies = [
"numcodecs>=0.15.1",
"ujson",
"packaging",
"zarr>=3.0.8,!=3.1.0",
"zarr>=3.1.0",
"obstore>=0.5.1",
]

Expand Down Expand Up @@ -169,7 +169,7 @@ rust = "*"
xarray = "==2025.3.0"
numpy = "==2.0.0"
numcodecs = "==0.15.1"
zarr = "==3.0.8"
zarr = "==3.1.0"
obstore = "==0.5.1"

# Define commands to run within the test environments
Expand Down
14 changes: 9 additions & 5 deletions virtualizarr/manifests/array.py
Original file line number Diff line number Diff line change
Expand Up @@ -87,8 +87,10 @@ def chunks(self) -> tuple[int, ...]:

@property
def dtype(self) -> np.dtype:
dtype_str = self.metadata.data_type
return dtype_str.to_numpy()
"""The native dtype of the data (typically a numpy dtype)"""
zdtype = self.metadata.data_type
dtype = zdtype.to_native_dtype()
return dtype

@property
def shape(self) -> tuple[int, ...]:
Expand All @@ -111,9 +113,11 @@ def __repr__(self) -> str:
@property
def nbytes_virtual(self) -> int:
"""
Size required to hold these references in memory in bytes.
The total number of bytes required to hold these virtual references in memory in bytes.

Note this is not the size of the referenced array if it were actually loaded into memory,
Notes
-----
This is not the size of the referenced array if it were actually loaded into memory (use `.nbytes`),
this is only the size of the pointers to the chunk locations.
If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.
"""
Expand Down Expand Up @@ -170,7 +174,7 @@ def __eq__( # type: ignore[override]
if self.shape != other.shape:
raise NotImplementedError("Unsure how to handle broadcasting like this")

if not utils.metadata_identical(self.metadata, other.metadata):
if not self.metadata.to_dict() == other.metadata.to_dict():
Copy link
Member Author

@TomNicholas TomNicholas Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment explains why were are now able to simplify this: zarr-developers/zarr-python#2930 (comment)

return np.full(shape=self.shape, fill_value=False, dtype=np.dtype(bool))
else:
if self.manifest == other.manifest:
Expand Down
24 changes: 4 additions & 20 deletions virtualizarr/manifests/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
parse_dimension_names,
parse_shapelike,
)
from zarr.dtype import parse_data_type

from virtualizarr.codecs import convert_to_codec_pipeline, get_codecs

Expand Down Expand Up @@ -75,15 +76,16 @@ def create_v3_array_metadata(
ArrayV3Metadata
A configured ArrayV3Metadata instance with standard defaults
"""
zdtype = parse_data_type(data_type, zarr_format=3)
return ArrayV3Metadata(
shape=shape,
data_type=data_type.name if hasattr(data_type, "name") else data_type,
data_type=zdtype,
chunk_grid={
"name": "regular",
"configuration": {"chunk_shape": chunk_shape},
},
chunk_key_encoding=chunk_key_encoding,
fill_value=fill_value,
fill_value=zdtype.default_scalar() if fill_value is None else fill_value,
codecs=convert_to_codec_pipeline(
codecs=codecs or [],
dtype=data_type,
Expand Down Expand Up @@ -155,24 +157,6 @@ def check_same_shapes(shapes: list[tuple[int, ...]]) -> None:
)


# TODO remove this once https://github.com/zarr-developers/zarr-python/issues/2929 is solved upstream
def metadata_identical(metadata1: ArrayV3Metadata, metadata2: ArrayV3Metadata) -> bool:
"""Checks the metadata of two zarr arrays are identical, including special treatment for NaN fill_values."""
metadata_dict1 = metadata1.to_dict()
metadata_dict2 = metadata2.to_dict()

# fill_value is a special case because numpy NaNs cannot be compared using __eq__, see https://stackoverflow.com/a/10059796
fill_value1 = metadata_dict1.pop("fill_value")
fill_value2 = metadata_dict2.pop("fill_value")
if np.isnan(fill_value1) and np.isnan(fill_value2): # type: ignore[arg-type]
fill_values_equal = fill_value1.dtype == fill_value2.dtype # type: ignore[union-attr]
else:
fill_values_equal = fill_value1 == fill_value2

# everything else in ArrayV3Metadata is a string, Enum, or Dataclass
return fill_values_equal and metadata_dict1 == metadata_dict2


def _remove_element_at_position(t: tuple[int, ...], pos: int) -> tuple[int, ...]:
new_l = list(t)
new_l.pop(pos)
Expand Down
6 changes: 3 additions & 3 deletions virtualizarr/tests/test_integration.py
Original file line number Diff line number Diff line change
Expand Up @@ -343,12 +343,12 @@ def test_non_dimension_coordinates(
for coord in ds.coords:
assert ds.coords[coord].attrs == roundtrip.coords[coord].attrs

@pytest.mark.xfail(
reason="Datetime and timedelta data types not yet supported by zarr-python 3.0" # https://github.com/zarr-developers/zarr-python/issues/2616
)
def test_datetime64_dtype_fill_value(
self, tmpdir, roundtrip_func, array_v3_metadata
):
if roundtrip_func == roundtrip_as_in_memory_icechunk:
pytest.xfail(reason="xarray can't decode the ns datetime fill_value")

Comment on lines +349 to +351
Copy link
Member Author

@TomNicholas TomNicholas Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As zarr-developers/zarr-python#2616 was closed, these xfails became xpasses!

Except for this one case - roundtripping via Icechunk, which fails with an error inside xarray's zarr decoders... I don't really understand why this matters, but it should be treated as a separable issue from this PR.

cc @sharkinsspatial

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raised #686 to track this

chunks_dict = {
"0.0.0": {"path": "/foo.nc", "offset": 100, "length": 100},
}
Expand Down
6 changes: 3 additions & 3 deletions virtualizarr/tests/test_manifests/test_array.py
Original file line number Diff line number Diff line change
Expand Up @@ -394,9 +394,9 @@ def test_refuse_combine(array_v3_metadata):
with pytest.raises(NotImplementedError, match="different codecs"):
func([marr1, marr2], axis=0)

metadata_copy = metadata_common.to_dict().copy()
metadata_copy["data_type"] = np.dtype("int64")
metadata_wrong_dtype = ArrayV3Metadata.from_dict(metadata_copy)
metadata_wrong_dtype = array_v3_metadata(
shape=shape, chunks=chunks, data_type=np.dtype("int64")
)
marr2 = ManifestArray(metadata=metadata_wrong_dtype, chunkmanifest=chunkmanifest2)
for func in [np.concatenate, np.stack]:
with pytest.raises(ValueError, match="inconsistent dtypes"):
Expand Down
2 changes: 1 addition & 1 deletion virtualizarr/tests/test_parsers/test_dmrpp.py
Original file line number Diff line number Diff line change
Expand Up @@ -354,7 +354,7 @@ def test_parse_variable(tmp_path):
basic_dmrpp = dmrparser(DMRPP_XML_STRINGS["basic"], tmp_path=tmp_path)

var = basic_dmrpp._parse_variable(basic_dmrpp.find_node_fqn("/data"))
assert var.metadata.dtype == "float32"
assert var.metadata.dtype.to_native_dtype() == "float32"
assert var.metadata.dimension_names == ("x", "y")
assert var.shape == (720, 1440)
assert var.chunks == (360, 720)
Expand Down
2 changes: 1 addition & 1 deletion virtualizarr/tests/test_writers/test_kerchunk.py
Original file line number Diff line number Diff line change
Expand Up @@ -211,7 +211,7 @@ def testconvert_v3_to_v2_metadata(array_v3_metadata):

assert isinstance(v2_metadata, ArrayV2Metadata)
assert v2_metadata.shape == shape
assert v2_metadata.dtype == np.dtype("int32")
assert v2_metadata.dtype.to_native_dtype() == np.dtype("int32")
assert v2_metadata.chunks == chunks
assert v2_metadata.fill_value == 0
compressor_config = v2_metadata.compressor.get_config()
Expand Down
7 changes: 6 additions & 1 deletion virtualizarr/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@

from zarr.abc.codec import ArrayArrayCodec, BytesBytesCodec
from zarr.core.metadata import ArrayV2Metadata, ArrayV3Metadata
from zarr.dtype import parse_data_type

from virtualizarr.codecs import extract_codecs, get_codec_config
from virtualizarr.types.kerchunk import KerchunkStoreRefs
Expand Down Expand Up @@ -155,9 +156,12 @@ def convert_v3_to_v2_metadata(
# Handle filter configurations
filter_configs = [get_codec_config(filter_) for filter_ in array_filters]

native_dtype = v3_metadata.data_type.to_native_dtype()
v2_compatible_data_type = parse_data_type(native_dtype, zarr_format=2)

v2_metadata = ArrayV2Metadata(
shape=v3_metadata.shape,
dtype=v3_metadata.data_type.to_numpy(),
dtype=v2_compatible_data_type,
chunks=v3_metadata.chunks,
fill_value=fill_value or v3_metadata.fill_value,
compressor=compressor_config,
Copy link
Member Author

@TomNicholas TomNicholas Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@d-v-b here is where I'm potentially unecessarily converting back and forth

Copy link
Member Author

@TomNicholas TomNicholas Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in 925ffec

Expand All @@ -175,6 +179,7 @@ def kerchunk_refs_as_json(refs: KerchunkStoreRefs) -> JSON:

See https://github.com/zarr-developers/VirtualiZarr/issues/679 for context as to why this is needed.
"""

normalized_result: dict[str, JSON] = copy.deepcopy(refs)
v0_refs: dict[str, JSON] = refs["refs"]

Expand Down
2 changes: 1 addition & 1 deletion virtualizarr/writers/icechunk.py
Original file line number Diff line number Diff line change
Expand Up @@ -344,7 +344,7 @@ def write_virtual_variable_to_icechunk(
name=name,
shape=metadata.shape,
chunks=metadata.chunks,
dtype=metadata.data_type.to_numpy(),
dtype=metadata.data_type.to_native_dtype(),
filters=filters,
compressors=compressors,
dimension_names=var.dims,
Expand Down
5 changes: 4 additions & 1 deletion virtualizarr/writers/kerchunk.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@
from xarray.conventions import encode_dataset_coordinates
from zarr.core.common import JSON
from zarr.core.metadata.v2 import ArrayV2Metadata
from zarr.dtype import parse_data_type

from virtualizarr.manifests import ManifestArray
from virtualizarr.manifests.manifest import join
Expand Down Expand Up @@ -161,7 +162,9 @@ def variable_to_kerchunk_arr_refs(var: Variable, var_name: str) -> KerchunkArrRe
array_v2_metadata = ArrayV2Metadata(
chunks=np_arr.shape,
shape=np_arr.shape,
dtype=np_arr.dtype,
dtype=parse_data_type(
np_arr.dtype, zarr_format=2
), # needed unless zarr-python fixes https://github.com/zarr-developers/zarr-python/issues/3253
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

order="C",
fill_value=None,
)
Expand Down