Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
91 commits
Select commit Hold shift + click to select a range
7ff26e7
Templates and TemplateRegistry
dmitriyrepin Jul 17, 2025
5ae7258
Merge remote-tracking branch 'upstream/v1' into v1
dmitriyrepin Jul 17, 2025
e8b95dc
Fix pre-commit issues
dmitriyrepin Jul 17, 2025
be772c7
Rever dev container changes
dmitriyrepin Jul 18, 2025
e43e4da
PR Review: address issues
dmitriyrepin Jul 18, 2025
80dd234
PR Review: register default templates at registry initialization
dmitriyrepin Jul 18, 2025
7caca55
Merge remote-tracking branch 'upstream/v1' into v1
dmitriyrepin Jul 22, 2025
2b2acf2
Dockerfile.dev
dmitriyrepin Jul 22, 2025
73d827a
segy_to_mdio_v1
dmitriyrepin Jul 22, 2025
befeca0
Clean up
dmitriyrepin Jul 23, 2025
ec057e6
Prototype review notes
Jul 23, 2025
59d12e9
Add dev comment
Jul 23, 2025
2f3eabf
Add notes that will be deleted later
dmitriyrepin Jul 23, 2025
0477bed
segy_to_mdio_v1 pass 1
dmitriyrepin Jul 26, 2025
5f7b135
indexing_v1 and blocked_io_v1
dmitriyrepin Jul 26, 2025
d6a4a35
Remove DEV notes
dmitriyrepin Jul 26, 2025
0b4c29f
Clean up
dmitriyrepin Jul 26, 2025
340f78a
Document bug location
dmitriyrepin Jul 26, 2025
2c34d34
Work around IndexError
dmitriyrepin Jul 27, 2025
93c4b30
Clean temporary code
dmitriyrepin Jul 27, 2025
324879a
More clean up
dmitriyrepin Jul 27, 2025
5debcf9
Remove *_1 infrastructure files
dmitriyrepin Jul 28, 2025
7b96b29
Restore accidently removed dask.array
dmitriyrepin Jul 29, 2025
3115299
Created an issue reproducer
dmitriyrepin Jul 29, 2025
4b1ae8f
Make the required template properties public
dmitriyrepin Jul 29, 2025
21e9e04
Simplify type converter
dmitriyrepin Jul 30, 2025
c5f9a63
Improve templates
dmitriyrepin Jul 30, 2025
b73cc68
Move test_type_converter.py
dmitriyrepin Jul 30, 2025
55315aa
Move test_type_converter.py
dmitriyrepin Jul 30, 2025
3e64fba
Revert to use the original grid
dmitriyrepin Jul 30, 2025
1f4687f
Integrate segy_to_mdio_v1_customized, fix indexing
dmitriyrepin Jul 30, 2025
e39d8c6
Add dimension coordinates in tem,plates
dmitriyrepin Jul 30, 2025
972a05d
Write statistics to Zarr
dmitriyrepin Jul 30, 2025
84ceb57
Delete factory_v1.py
dmitriyrepin Jul 30, 2025
4ff62bc
Complete integrationtest. Fix coordinates
dmitriyrepin Jul 31, 2025
543e886
Fir pre-commit errors
dmitriyrepin Jul 31, 2025
8017d98
PR review: fix trace_worker docstring
dmitriyrepin Aug 1, 2025
90754d3
Review: address some of the issue
dmitriyrepin Aug 1, 2025
f0a1c28
Fix bug
dmitriyrepin Aug 1, 2025
5d07ea4
dding todo for sum squares calculation
tasansal Aug 1, 2025
b503069
Refactor ChunkIterator
dmitriyrepin Aug 1, 2025
15febc9
Merge branch 'segy_to_mdio_v1'
dmitriyrepin Aug 2, 2025
5980ec9
Refactor ChunkIterator into ChunkIteratorV1
dmitriyrepin Aug 2, 2025
8e5f7a0
Remove segy_to_mdio_v1_customized, dataset_serializer.to_zarr
dmitriyrepin Aug 2, 2025
f0f42f3
Add support for trace headers without using _FillValue
dmitriyrepin Aug 4, 2025
a441db8
Use StorageLocation in trace_worker_v1
dmitriyrepin Aug 4, 2025
d574a47
Fix statistics attribute name
dmitriyrepin Aug 4, 2025
cf90b7e
PR review changes
dmitriyrepin Aug 4, 2025
a5ae874
PR Improvements: do a single write
dmitriyrepin Aug 4, 2025
ab08ef4
TODO: chunked write for non-dimensional coordinates and trace_mask
dmitriyrepin Aug 4, 2025
b970d74
Update StorageLocation to use fsspec
dmitriyrepin Aug 4, 2025
2f37c19
Reformat with pre-commit
dmitriyrepin Aug 4, 2025
4f30d95
Use domain name in get_grid_plan
dmitriyrepin Aug 4, 2025
71dcd0d
Fix non-dim coords and chunk_samples=False
dmitriyrepin Aug 5, 2025
1771491
Convert test_3d_import_v1 to V1
dmitriyrepin Aug 5, 2025
1f820a4
Merge-in latest 'upstream v1'
dmitriyrepin Aug 6, 2025
b52f534
Fix test_meta_dataset_read
dmitriyrepin Aug 6, 2025
7c6a38f
Merge branch 'v1' into segy_to_mdio_v1
tasansal Aug 7, 2025
ba3307f
remove whitespace
tasansal Aug 7, 2025
5e8a1c5
clean up comments
tasansal Aug 7, 2025
d03e460
update deps in lockfile
tasansal Aug 7, 2025
c8f7cff
simplify dim and non-dim coordinate handling after scan
tasansal Aug 7, 2025
047ea45
remove compatibility tests
tasansal Aug 8, 2025
08c1e70
add filtering capability to header worker
tasansal Aug 8, 2025
81af582
accept subset filter to pass to workers
tasansal Aug 8, 2025
f2d59a9
make v1 grid planner awesome
tasansal Aug 8, 2025
18726ed
double to single underscores in test names
tasansal Aug 8, 2025
75a0915
fix broken test harnesses due to incorrect Sequence import
tasansal Aug 8, 2025
174c8fd
clean up dev comment
tasansal Aug 8, 2025
63737a6
clean up whitespace
tasansal Aug 8, 2025
c55c080
use new module name
tasansal Aug 8, 2025
406a6b3
clean up segy_to_mdio_v1
tasansal Aug 8, 2025
73073e7
fix whitespace and remove unnecessary list call
tasansal Aug 8, 2025
29bbb70
these are defined as float64 in template
tasansal Aug 8, 2025
b13a57c
fix missing dimension coordinate for vertical axis
tasansal Aug 8, 2025
4d1dc8f
fix incorrect dtype comparison for time variable
tasansal Aug 8, 2025
0d410d3
simplify and fix critical bugs
tasansal Aug 8, 2025
e7ceced
rename v1 out of things and get rid of old code
tasansal Aug 8, 2025
fafe8ab
remove fixed todo
tasansal Aug 8, 2025
2b98486
remove more v1 from names
tasansal Aug 8, 2025
0517a57
rename chunk iterator
tasansal Aug 8, 2025
eb8bac7
fix dimensionality in tests due to new (missing) vertical dimension c…
tasansal Aug 8, 2025
22c4613
add todo for numpy ingestion
tasansal Aug 8, 2025
19812e9
fix references to non-v1 naming
tasansal Aug 8, 2025
00ef757
extract grid operations to its own function
tasansal Aug 8, 2025
528acb1
fix typo
tasansal Aug 8, 2025
d7b9013
add todo for simplifying storage location
tasansal Aug 8, 2025
792286c
Remove no_fill_var_names, add domain var to Seismic3DPreStackShotTemp…
dmitriyrepin Aug 8, 2025
c31bc45
Part 2 of the previous commit
dmitriyrepin Aug 8, 2025
bdde865
pre-commit formatting
dmitriyrepin Aug 8, 2025
e1405ec
remove dev mount
tasansal Aug 8, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 9 additions & 2 deletions src/mdio/api/convenience.py
Original file line number Diff line number Diff line change
Expand Up @@ -79,7 +79,11 @@ def copy_mdio( # noqa: PLR0913

writer.live_mask[:] = reader.live_mask[:]

iterator = ChunkIterator(reader._traces, chunk_samples=False)
shape = reader._traces.shape
chunks = reader._traces.chunks
chunks = chunks[:-1] + (shape[-1],) # don't chunk samples

iterator = ChunkIterator(shape=shape, chunks=chunks)
progress = tqdm(iterator, unit="block")
progress.set_description(desc=f"Copying data for '{access_pattern=}'")
for slice_ in progress:
Expand Down Expand Up @@ -177,7 +181,10 @@ def create_rechunk_plan(

n_dimension = len(data_array.shape)
dummy_array = zarr.empty(shape=data_array.shape, chunks=(MAX_BUFFER,) * n_dimension)
iterator = ChunkIterator(dummy_array)

shape = dummy_array.shape
chunks = dummy_array.chunks
iterator = ChunkIterator(shape=shape, chunks=chunks)

return metadata_arrs, data_arrs, live_mask, iterator

Expand Down
6 changes: 5 additions & 1 deletion src/mdio/converters/numpy.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,6 @@
import numpy as np

from mdio.api.accessor import MDIOWriter
from mdio.converters.segy import get_compressor
from mdio.core.dimension import Dimension
from mdio.core.factory import MDIOCreateConfig
from mdio.core.factory import MDIOVariableConfig
Expand Down Expand Up @@ -137,6 +136,11 @@ def numpy_to_mdio( # noqa: PLR0913
suffix = [str(idx) for idx, value in enumerate(suffix) if value is not None]
suffix = "".join(suffix)

# TODO(Dmitrit Repin): Implement Numpy converted in MDIO v1
# https://github.com/TGSAI/mdio-python/issues/596
def get_compressor(lossless: bool, tolerance: float) -> list[str]:
pass

compressors = get_compressor(lossless, compression_tolerance)
mdio_var = MDIOVariableConfig(
name=f"chunked_{suffix}",
Expand Down
607 changes: 256 additions & 351 deletions src/mdio/converters/segy.py

Large diffs are not rendered by default.

85 changes: 85 additions & 0 deletions src/mdio/converters/type_converter.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
"""A module for converting numpy dtypes to MDIO scalar and structured types."""

from numpy import dtype as np_dtype

from mdio.schemas.dtype import ScalarType
from mdio.schemas.dtype import StructuredField
from mdio.schemas.dtype import StructuredType


def to_scalar_type(data_type: np_dtype) -> ScalarType:
"""Convert numpy dtype to MDIO ScalarType.

Out of the 24 built-in numpy scalar type objects
(see https://numpy.org/doc/stable/reference/arrays.dtypes.html)
this function supports only a limited subset:
ScalarType.INT8 <-> int8
ScalarType.INT16 <-> int16
ScalarType.INT32 <-> int32
ScalarType.INT64 <-> int64
ScalarType.UINT8 <-> uint8
ScalarType.UINT16 <-> uint16
ScalarType.UINT32 <-> uint32
ScalarType.UINT64 <-> uint64
ScalarType.FLOAT32 <-> float32
ScalarType.FLOAT64 <-> float64
ScalarType.COMPLEX64 <-> complex64
ScalarType.COMPLEX128 <-> complex128
ScalarType.BOOL <-> bool

Args:
data_type: numpy dtype to convert

Returns:
ScalarType: corresponding MDIO scalar type

Raises:
ValueError: if dtype is not supported
"""
try:
return ScalarType(data_type.name)
except ValueError as exc:
err = f"Unsupported numpy dtype '{data_type.name}' for conversion to ScalarType."
raise ValueError(err) from exc


def to_structured_type(data_type: np_dtype) -> StructuredType:
"""Convert numpy dtype to MDIO StructuredType.

This function supports only a limited subset of structured types.
In particular:
It does not support nested structured types.
It supports fields of only 13 out of 24 built-in numpy scalar types.
(see `to_scalar_type` for details).

Args:
data_type: numpy dtype to convert

Returns:
StructuredType: corresponding MDIO structured type

Raises:
ValueError: if dtype is not structured or has no fields

"""
if data_type is None or len(data_type.names or []) == 0:
err = "None or empty dtype provided, cannot convert to StructuredType."
raise ValueError(err)

fields = []
for field_name in data_type.names:
field_dtype = data_type.fields[field_name][0]
scalar_type = to_scalar_type(field_dtype)
structured_field = StructuredField(name=field_name, format=scalar_type)
fields.append(structured_field)
return StructuredType(fields=fields)


def to_numpy_dtype(data_type: ScalarType | StructuredType) -> np_dtype:
"""Get the numpy dtype for a variable."""
if isinstance(data_type, ScalarType):
return np_dtype(data_type.value)
if isinstance(data_type, StructuredType):
return np_dtype([(f.name, f.format.value) for f in data_type.fields])
msg = f"Expected ScalarType or StructuredType, got '{type(data_type).__name__}'"
raise ValueError(msg)
117 changes: 66 additions & 51 deletions src/mdio/core/indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,78 +4,83 @@
from math import ceil

import numpy as np
from zarr import Array


class ChunkIterator:
"""Iterator for traversing a Zarr array in chunks.
"""Chunk iterator for multi-dimensional arrays.

This iterator yields tuples of slices corresponding to the chunk boundaries of a Zarr array.
It supports chunking all dimensions or taking the full extent of the last dimension.
This iterator takes an array shape and chunks and every time it is iterated, it returns
a dictionary (if dimensions are provided) or a tuple of slices that align with
chunk boundaries. When dimensions are provided, they are used as the dictionary keys.

Args:
array: The Zarr array to iterate, providing shape and chunk sizes.
chunk_samples: If True, chunks all dimensions. If False, takes the full extent of the
last dimension. Defaults to True.


Example:
>>> import zarr
>>> arr = zarr.array(np.zeros((10, 20)), chunks=(3, 4))
>>> it = ChunkIterator(arr)
>>> for slices in it:
... print(slices)
(slice(0, 3, None), slice(0, 4, None))
(slice(0, 3, None), slice(4, 8, None))
...
>>> it = ChunkIterator(arr, chunk_samples=False)
>>> for slices in it:
... print(slices)
(slice(0, 3, None), slice(0, 20, None))
(slice(3, 6, None), slice(0, 20, None))
...
shape: The shape of the array.
chunks: The chunk sizes for each dimension.
dim_names: The names of the array dimensions, to be used with DataArray.isel().
If the dim_names are not provided, a tuple of the slices will be returned.

Attributes: # noqa: DOC602
arr_shape: Shape of the array.
len_chunks: Length of chunks in each dimension.
dim_chunks: Number of chunks in each dimension.
num_chunks: Total number of chunks.

Examples:
>> chunks = (3, 4, 5)
>> shape = (5, 11, 19)
>> dims = ["inline", "crossline", "depth"]
>>
>> iter = ChunkIterator(shape=shape, chunks=chunks, dim_names=dims)
>> for i in range(13):
>> region = iter.__next__()
>> print(region)
{ "inline": slice(3,6, None), "crossline": slice(0,4, None), "depth": slice(0,5, None) }

>> iter = ChunkIterator(shape=shape, chunks=chunks, dim_names=None)
>> for i in range(13):
>> region = iter.__next__()
>> print(region)
(slice(3,6,None), slice(0,4,None), slice(0,5,None))
"""

def __init__(self, array: Array, chunk_samples: bool = True):
self.arr_shape = array.shape
self.len_chunks = array.chunks

# If chunk_samples is False, set the last dimension's chunk size to its full extent
if not chunk_samples:
self.len_chunks = self.len_chunks[:-1] + (self.arr_shape[-1],)

# Calculate the number of chunks per dimension
self.dim_chunks = [
ceil(len_dim / chunk)
for len_dim, chunk in zip(self.arr_shape, self.len_chunks, strict=True)
]
def __init__(
self, shape: tuple[int, ...], chunks: tuple[int, ...], dim_names: tuple[str, ...] = None
):
self.arr_shape = tuple(shape) # Deep copy to ensure immutability
self.len_chunks = tuple(chunks) # Deep copy to ensure immutability
self.dims = dim_names

# Compute number of chunks per dimension, and total number of chunks
self.dim_chunks = tuple(
[
ceil(len_dim / chunk)
for len_dim, chunk in zip(self.arr_shape, self.len_chunks, strict=True)
]
)
self.num_chunks = np.prod(self.dim_chunks)

# Set up chunk index combinations using ranges for each dimension
# Under the hood stuff for the iterator. This generates C-ordered
# permutation of chunk indices.
dim_ranges = [range(dim_len) for dim_len in self.dim_chunks]
self._ranges = itertools.product(*dim_ranges)
self._idx = 0

def __iter__(self) -> "ChunkIterator":
"""Return the iterator object itself."""
"""Iteration context."""
return self

def __len__(self) -> int:
"""Return the total number of chunks."""
"""Get total number of chunks."""
return self.num_chunks

def __next__(self) -> tuple[slice, ...]:
"""Yield the next set of chunk slices.

Returns:
A tuple of slice objects for each dimension.

Raises:
StopIteration: When all chunks have been iterated over.
"""
if self._idx < self.num_chunks:
def __next__(self) -> dict[str, slice]:
"""Iteration logic."""
if self._idx <= self.num_chunks:
# We build slices here. It is dimension agnostic
current_start = next(self._ranges)

# TODO (Dmitriy Repin): Enhance ChunkIterator to make the last slice, if needed, smaller
# https://github.com/TGSAI/mdio-python/issues/586
start_indices = tuple(
dim * chunk for dim, chunk in zip(current_start, self.len_chunks, strict=True)
)
Expand All @@ -88,7 +93,17 @@ def __next__(self) -> tuple[slice, ...]:
slice(start, stop) for start, stop in zip(start_indices, stop_indices, strict=True)
)

if self.dims: # noqa SIM108
# Example
# {"inline":slice(3,6,None), "crossline":slice(0,4,None), "depth":slice(0,5,None)}
result = dict(zip(self.dims, slices, strict=False))
else:
# Example
# (slice(3,6,None), slice(0,4,None), slice(0,5,None))
result = slices

self._idx += 1
return slices

return result

raise StopIteration
87 changes: 87 additions & 0 deletions src/mdio/core/storage_location.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
"""StorageLocation class for managing local and cloud storage locations."""

from pathlib import Path
from typing import Any

import fsspec


# TODO(Dmitriy Repin): Reuse fsspec functions for some methods we implemented here
# https://github.com/TGSAI/mdio-python/issues/597
class StorageLocation:
"""A class to represent a local or cloud storage location for SEG-Y or MDIO files.

This class abstracts the storage location, allowing for both local file paths and
cloud storage URIs (e.g., S3, GCS). It uses fsspec to check existence and manage options.
Note, we do not want to make it a dataclass because we want the uri and the options to
be read-only immutable properties.

uri: The URI of the storage location (e.g., '/path/to/file', 'file:///path/to/file',
's3://bucket/path', 'gs://bucket/path').
options: Optional dictionary of options for the cloud, such as credentials.

"""

def __init__(self, uri: str = "", options: dict[str, Any] = None):
self._uri = uri
self._options = options or {}
self._fs = None

if uri.startswith(("s3://", "gs://")):
return

if uri.startswith("file://"):
self._uri = self._uri.removeprefix("file://")
# For local paths, ensure they are absolute and resolved
self._uri = str(Path(self._uri).resolve())
return

@property
def uri(self) -> str:
"""Get the URI (read-only)."""
return self._uri

@property
def options(self) -> dict[str, Any]:
"""Get the options (read-only)."""
# Return a copy to prevent external modification
return self._options.copy()

@property
def _filesystem(self) -> fsspec.AbstractFileSystem:
"""Get the fsspec filesystem instance for this storage location."""
if self._fs is None:
self._fs = fsspec.filesystem(self._protocol, **self._options)
return self._fs

@property
def _path(self) -> str:
"""Extract the path portion from the URI."""
if "://" in self._uri:
return self._uri.split("://", 1)[1]
return self._uri # For local paths without file:// prefix

@property
def _protocol(self) -> str:
"""Extract the protocol/scheme from the URI."""
if "://" in self._uri:
return self._uri.split("://", 1)[0]
return "file" # Default to file protocol

def exists(self) -> bool:
"""Check if the storage location exists using fsspec."""
try:
return self._filesystem.exists(self._path)
except Exception as e:
# Log the error and return False for safety
# In a production environment, you might want to use proper logging
print(f"Error checking existence of {self._uri}: {e}")
return False

def __str__(self) -> str:
"""String representation of the storage location."""
return self._uri

def __repr__(self) -> str:
"""Developer representation of the storage location."""
return f"StorageLocation(uri='{self._uri}', options={self._options})"
Loading
Loading