Merged
Changes from 1 commit
Commits
88 commits
8eeddea
update project metadata and deps
tasansal Apr 29, 2025
293a714
update project metadata and deps
tasansal Apr 29, 2025
faacbaa
add schemas
tasansal Apr 29, 2025
1a1e743
Relocate quickstart notebook to tutorials directory
tasansal Apr 29, 2025
3c8ea4d
update docs dependencies
tasansal Apr 29, 2025
53ecaa9
add new docs
tasansal Apr 29, 2025
da09c4b
remove incorrect exclude
tasansal Apr 29, 2025
0d59c95
remove duplicate doc directive
tasansal Apr 29, 2025
793f3f1
fix creation notebook
tasansal Apr 29, 2025
0e64e09
Add basic unit test for v1 dataset schema validation
tasansal Apr 29, 2025
a1a312e
update lockfile
tasansal Apr 30, 2025
dfe7751
fix broken creation nb
tasansal May 7, 2025
f6b2c34
update lockfile
tasansal May 7, 2025
dbec58e
lint v1 files
tasansal May 7, 2025
ecb69d6
update lock file
tasansal May 27, 2025
9dd9fbc
schema_v1-dataset_builder-add_dimension
dmitriyrepin Jun 24, 2025
5816f83
V1 schema review (#553)
BrianMichell Jun 26, 2025
f88531e
Merge remote-tracking branch 'upstream/v1' into v1
dmitriyrepin Jun 26, 2025
1358f95
First take on add_dimension(), add_coordinate(), add_variable()
dmitriyrepin Jun 27, 2025
e5261cb
Finished add_dimension, add_coordinate, add_variable
dmitriyrepin Jun 28, 2025
95c01d8
Work on build
dmitriyrepin Jun 30, 2025
f391e23
Merge branch 'main' into v1
tasansal Jul 1, 2025
46f82f0
Generalize _to_dictionary()
dmitriyrepin Jul 1, 2025
0dc7cc8
build
dmitriyrepin Jul 1, 2025
fe4af2b
[v1] Update dependencies to latest (#567)
tasansal Jul 2, 2025
79863ac
Dataset Build - pass one
dmitriyrepin Jul 2, 2025
ec480f1
Merge the latest TGSAI/mdio-python:v1 branch
dmitriyrepin Jul 2, 2025
4062a77
unpin zarr because breaking bug fixed (#569)
tasansal Jul 7, 2025
fa81ea2
Merge branch 'v1' into v1
tasansal Jul 7, 2025
4b2b163
Revert .container changes
dmitriyrepin Jul 7, 2025
c532c3b
PR review: remove DEVELOPER_NOTES.md
dmitriyrepin Jul 7, 2025
08798cd
PR Review: add_coordinate() should accept only data_type: ScalarType
dmitriyrepin Jul 7, 2025
e8febe4
PR review: add_variable() data_type remove default
dmitriyrepin Jul 7, 2025
0a4be3f
PR review: do not add dimension variable
dmitriyrepin Jul 8, 2025
7b25d6b
PR Review: get api version from the package version
dmitriyrepin Jul 8, 2025
7ca3ed8
PR Review: remove add_dimension_coordinate
dmitriyrepin Jul 9, 2025
4d1ec9c
PR Review: add_coordinate() remove data_type default value
dmitriyrepin Jul 9, 2025
99fcf43
PR Review: improve unit tests by extracting common functionality in v…
dmitriyrepin Jul 9, 2025
0778fdd
Remove the Dockerfile changes. They are not supposed to be a part of …
dmitriyrepin Jul 9, 2025
7e74567
PR Review: run ruff
dmitriyrepin Jul 9, 2025
0aaa5f6
PR Review: fix pre-commit errors
dmitriyrepin Jul 10, 2025
1904dee
remove some noqa overrides
tasansal Jul 10, 2025
90d31a1
Implement MDIO Dataset builder to create in-memory instance of schema…
dmitriyrepin Jul 10, 2025
4c7c833
Writing XArray / Zarr
dmitriyrepin Jul 10, 2025
4b39ffa
gitignore
dmitriyrepin Jul 10, 2025
e772a4f
Merge remote-tracking branch 'upstream/v1' into v1
dmitriyrepin Jul 11, 2025
cea7308
to_zarr() fix compression
dmitriyrepin Jul 11, 2025
850135e
Fix precommit issues
dmitriyrepin Jul 11, 2025
82f1960
Use only make_campos_3d_acceptance_dataset
dmitriyrepin Jul 11, 2025
b5ee31e
PR Review: address the review comments
dmitriyrepin Jul 14, 2025
7b3ba70
Update _get_fill_value for StructuredType
dmitriyrepin Jul 14, 2025
a4ff4a9
Fix fill type issue for the Structured Types
dmitriyrepin Jul 16, 2025
81bfa76
Improve code coverage
dmitriyrepin Jul 16, 2025
0447659
Fix spelling
dmitriyrepin Jul 17, 2025
d08e2c4
Revert "Fix spelling"
dmitriyrepin Jul 17, 2025
657d2cf
extend per-file ignores for PLR2004 and remove noqa overrides in spec…
tasansal Jul 17, 2025
bfab1d7
Refactor tests: clarify Zarr-related test names, fix type hints, and …
tasansal Jul 17, 2025
9a033de
merge main into v1
tasansal Jul 17, 2025
5878e97
MDIO v1 Templates and Template Registry (#573)
dmitriyrepin Jul 22, 2025
aaf8fc6
Merge branch 'main' into v1
tasansal Aug 6, 2025
bc35bdf
update deps
tasansal Aug 6, 2025
3b62d8f
Merge branch 'main' into v1
tasansal Aug 6, 2025
aae5f51
Merge branch 'main' into v1
tasansal Aug 6, 2025
d21278c
address issues with VS Code dev containers (see issue 559) (#576)
dmitriyrepin Aug 7, 2025
d3a7da2
segy_to_mdio_v1 (#577)
dmitriyrepin Aug 8, 2025
21f64c0
Make some integration tests work with new `segy_to_mdio` (#599)
dmitriyrepin Aug 12, 2025
4762b7d
remove developer tests
tasansal Aug 12, 2025
db0d564
Serialize text and binary headers (#600)
dmitriyrepin Aug 12, 2025
f5ee136
shot_point (#602)
dmitriyrepin Aug 13, 2025
a8b12f3
Add template: Offset + Azimuth binned CDP gathers (COCA) (#605)
tasansal Aug 15, 2025
0ceccc1
Eager memory allocation fix (#609)
BrianMichell Aug 25, 2025
4116473
Fix memory and core utilization regressions
BrianMichell Aug 26, 2025
fcffad8
Merge pull request #615 from BrianMichell/memory_regression
BrianMichell Aug 26, 2025
faeb616
Export functionality for MDIO v1 ingested files (#611)
dmitriyrepin Sep 3, 2025
db33c2a
v1 implementation of AutoChannelWrap grid override (#632)
dmitriyrepin Sep 5, 2025
379c2f5
Move to Zarr v3 as default for on disk storage format (#630)
tasansal Sep 6, 2025
b8dbb82
fix cloud i/o issue (#637)
tasansal Sep 6, 2025
22bed5c
snake-case to camelCase (#638)
tasansal Sep 6, 2025
9d38e1f
Fix output URI handling for remote stores (#639)
tasansal Sep 6, 2025
4365892
allow legacy v2 support (#640)
tasansal Sep 7, 2025
6562594
Reorganize code and simplify schemas and logic everywhere (#642)
tasansal Sep 7, 2025
642721c
First pass review and alignment of templates (#643)
tasansal Sep 7, 2025
0abe5ac
Fix ingestion of coordinates without full dimensions (#644)
tasansal Sep 8, 2025
b85cc35
Disable unimplemented tests (#647)
tasansal Sep 8, 2025
8931573
Merge branch 'main' into v1
tasansal Sep 8, 2025
e16640b
remove todo, it has correct behaviour, also rename .build_dataset `he…
tasansal Sep 8, 2025
219d354
set version to 1.0.0
tasansal Sep 8, 2025
bef28f8
unpin hardcoded version from tests
tasansal Sep 8, 2025
v1 implementation of AutoChannelWrap grid override (#632)
* AutoChannelWrap over updated-v1

* Fix test

* rename function for new behaviour and improve type hint for grid_overrides

* simplify metadata handling

* lint

* gridOverride is not required

* remove unnecessary byte order change, handled upstream.

* remove rtol adds, tests pass.

* remove expected behaviour comment

* clean up tests

* use grouped assignments to fix PLR915

* add comments to clarify

---------

Co-authored-by: Altay Sansal <tasansal@users.noreply.github.com>
dmitriyrepin and tasansal authored Sep 5, 2025
commit db33c2aa6bc08efeb1bdb1534a28baa516f31945
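
A minimal sketch of how a caller might thread the new `grid_overrides` argument into `segy_to_mdio`. The import path and keyword names follow the diff below; the `"AutoChannelWrap"` key spelling, the wrapper function, and the placeholder parameters are illustrative assumptions rather than part of this commit.

```python
from __future__ import annotations

from typing import Any


def ingest_with_channel_wrap(
    segy_spec: Any,        # a segy.schema.SegySpec
    mdio_template: Any,    # an AbstractDatasetTemplate instance
    input_location: Any,   # StorageLocation of the input SEG-Y file
    output_location: Any,  # StorageLocation for the MDIO v1 output
) -> None:
    """Run a v1 SEG-Y ingestion with a grid override enabled (illustrative)."""
    from mdio.converters.segy import segy_to_mdio

    # The override key spelling is an assumption for illustration only.
    overrides: dict[str, Any] = {"AutoChannelWrap": True}
    segy_to_mdio(
        segy_spec=segy_spec,
        mdio_template=mdio_template,
        input_location=input_location,
        output_location=output_location,
        overwrite=True,
        grid_overrides=overrides,
    )
```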
55 changes: 33 additions & 22 deletions src/mdio/converters/segy.py
@@ -26,6 +26,8 @@
from mdio.segy.utilities import get_grid_plan

if TYPE_CHECKING:
from typing import Any

from segy.arrays import HeaderArray as SegyHeaderArray
from segy.schema import SegySpec
from xarray import Dataset as xr_Dataset
@@ -113,25 +115,30 @@ def grid_density_qc(grid: Grid, num_traces: int) -> None:


def _scan_for_headers(
segy_file: SegyFile, template: AbstractDatasetTemplate
segy_file: SegyFile,
template: AbstractDatasetTemplate,
grid_overrides: dict[str, Any] | None = None,
) -> tuple[list[Dimension], SegyHeaderArray]:
"""Extract trace dimensions and index headers from the SEG-Y file.

This is an expensive operation.
It scans the SEG-Y file in chunks by using ProcessPoolExecutor
"""
# TODO(Dmitriy): implement grid overrides
# https://github.com/TGSAI/mdio-python/issues/585
# The 'grid_chunksize' is used only for grid_overrides
# While we do not support grid override, we can set it to None
grid_chunksize = None
segy_dimensions, chunksize, segy_headers = get_grid_plan(
full_chunk_size = template.full_chunk_size
segy_dimensions, chunk_size, segy_headers = get_grid_plan(
segy_file=segy_file,
return_headers=True,
template=template,
chunksize=grid_chunksize,
grid_overrides=None,
chunksize=full_chunk_size,
grid_overrides=grid_overrides,
)
if full_chunk_size != chunk_size:
# TODO(Dmitriy): implement grid overrides
# https://github.com/TGSAI/mdio-python/issues/585
# The returned 'chunksize' is used only for grid_overrides. We will need to use it when full
# support for grid overrides is implemented
err = "Support for changing full_chunk_size in grid overrides is not yet implemented"
raise NotImplementedError(err)
return segy_dimensions, segy_headers


@@ -278,7 +285,7 @@ def _populate_coordinates(
return dataset, drop_vars_delayed


def _add_text_binary_headers(dataset: Dataset, segy_file: SegyFile) -> None:
def _add_segy_ingest_attributes(dataset: Dataset, segy_file: SegyFile, grid_overrides: dict[str, Any] | None) -> None:
text_header = segy_file.text_header.splitlines()
# Validate:
# text_header this should be a 40-items array of strings with width of 80 characters.
@@ -301,23 +308,27 @@ def _add_text_binary_headers(dataset: Dataset, segy_file: SegyFile) -> None:

# Handle case where it may not have any metadata yet
if dataset.metadata.attributes is None:
dataset.attrs["attributes"] = {}
dataset.metadata.attributes = {}

segy_attributes = {
"textHeader": text_header,
"binaryHeader": segy_file.binary_header.to_dict(),
}

if grid_overrides is not None:
segy_attributes["gridOverrides"] = grid_overrides

# Update the attributes with the text and binary headers.
dataset.metadata.attributes.update(
{
"textHeader": text_header,
"binaryHeader": segy_file.binary_header.to_dict(),
}
)
dataset.metadata.attributes.update(segy_attributes)


def segy_to_mdio(
def segy_to_mdio( # noqa PLR0913
segy_spec: SegySpec,
mdio_template: AbstractDatasetTemplate,
input_location: StorageLocation,
output_location: StorageLocation,
overwrite: bool = False,
grid_overrides: dict[str, Any] | None = None,
) -> None:
"""A function that converts a SEG-Y file to an MDIO v1 file.

@@ -329,6 +340,7 @@ def segy_to_mdio(
input_location: The storage location of the input SEG-Y file.
output_location: The storage location for the output MDIO v1 file.
overwrite: Whether to overwrite the output file if it already exists. Defaults to False.
grid_overrides: Option to add grid overrides.

Raises:
FileExistsError: If the output location already exists and overwrite is False.
@@ -340,12 +352,11 @@ def segy_to_mdio(
segy_settings = SegySettings(storage_options=input_location.options)
segy_file = SegyFile(url=input_location.uri, spec=segy_spec, settings=segy_settings)

# Scan the SEG-Y file for headers
segy_dimensions, segy_headers = _scan_for_headers(segy_file, mdio_template)
segy_dimensions, segy_headers = _scan_for_headers(segy_file, mdio_template, grid_overrides)

grid = _build_and_check_grid(segy_dimensions, segy_file, segy_headers)

dimensions, non_dim_coords = _get_coordinates(grid, segy_headers, mdio_template)
_, non_dim_coords = _get_coordinates(grid, segy_headers, mdio_template)
# TODO(Altay): Turn this dtype into packed representation
# https://github.com/TGSAI/mdio-python/issues/601
headers = to_structured_type(segy_spec.trace.header.dtype)
@@ -358,7 +369,7 @@ def segy_to_mdio(
headers=headers,
)

_add_text_binary_headers(dataset=mdio_ds, segy_file=segy_file)
_add_segy_ingest_attributes(dataset=mdio_ds, segy_file=segy_file, grid_overrides=grid_overrides)

xr_dataset: xr_Dataset = to_xarray_dataset(mdio_ds=mdio_ds)

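The segy.py changes above record the SEG-Y text header, binary header, and any grid overrides on the dataset metadata. A self-contained sketch of that attribute plumbing, with a plain dict standing in for `dataset.metadata.attributes` and fabricated header values:

```python
# A plain dict stands in for dataset.metadata.attributes; the text-header cards
# and binary-header keys below are fabricated for illustration.
attributes: dict = {}

grid_overrides = {"AutoChannelWrap": True}  # assumed key spelling

segy_attributes = {
    "textHeader": [f"C{row + 1:02d}".ljust(80) for row in range(40)],  # 40 cards, 80 chars each
    "binaryHeader": {"sample_interval": 1000, "samples_per_trace": 1501},
}
if grid_overrides is not None:
    segy_attributes["gridOverrides"] = grid_overrides

attributes.update(segy_attributes)

assert len(attributes["textHeader"]) == 40
assert all(len(card) == 80 for card in attributes["textHeader"])
assert "gridOverrides" in attributes
```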
5 changes: 5 additions & 0 deletions src/mdio/schemas/v1/templates/abstract_dataset_template.py
@@ -120,6 +120,11 @@ def coordinate_names(self) -> list[str]:
"""Returns the names of the coordinates."""
return copy.deepcopy(self._coord_names)

@property
def full_chunk_size(self) -> list[int]:
"""Returns the chunk size for the variables."""
return copy.deepcopy(self._var_chunk_shape)

@property
@abstractmethod
def _name(self) -> str:
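The new `full_chunk_size` property gives `_scan_for_headers` a template-defined chunk shape to compare against the one `get_grid_plan` returns. A toy sketch of the same pattern, with an invented template class and chunk shape:

```python
import copy


class ToyTemplate:
    """Invented template showing the read-only chunk-size property pattern."""

    def __init__(self) -> None:
        self._var_chunk_shape = [8, 8, 512]  # made-up chunk shape

    @property
    def full_chunk_size(self) -> list[int]:
        """Returns the chunk size for the variables."""
        return copy.deepcopy(self._var_chunk_shape)


template = ToyTemplate()
requested = template.full_chunk_size
returned_by_grid_plan = [8, 8, 512]  # stands in for the chunk size get_grid_plan() returns

# Mirrors the guard added to _scan_for_headers: overrides that change the
# chunk size are not supported yet.
if requested != returned_by_grid_plan:
    err = "Support for changing full_chunk_size in grid overrides is not yet implemented"
    raise NotImplementedError(err)
```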
63 changes: 41 additions & 22 deletions tests/integration/conftest.py
@@ -19,8 +19,8 @@
from segy.schema import SegySpec


def _segy_spec_mock_4d() -> SegySpec:
"""Create a mock SEG-Y spec for 4D data."""
def get_segy_mock_4d_spec() -> SegySpec:
"""Create a mock 4D SEG-Y specification."""
trace_header_fields = [
HeaderField(name="field_rec_no", byte=9, format="int32"),
HeaderField(name="channel", byte=13, format="int32"),
@@ -31,6 +31,13 @@ def _segy_spec_mock_4d() -> SegySpec:
HeaderField(name="shot_line", byte=133, format="int16"),
HeaderField(name="cable", byte=137, format="int16"),
HeaderField(name="gun", byte=171, format="int16"),
HeaderField(name="coordinate_scalar", byte=71, format="int16"),
HeaderField(name="source_coord_x", byte=73, format="int32"),
HeaderField(name="source_coord_y", byte=77, format="int32"),
HeaderField(name="group_coord_x", byte=81, format="int32"),
HeaderField(name="group_coord_y", byte=85, format="int32"),
HeaderField(name="cdp_x", byte=181, format="int32"),
HeaderField(name="cdp_y", byte=185, format="int32"),
]
rev1_spec = get_segy_standard(1.0)
spec = rev1_spec.customize(trace_header_fields=trace_header_fields)
@@ -83,33 +90,45 @@ def create_segy_mock_4d( # noqa: PLR0913
channel_headers = np.tile(channel_headers, shot_count)

factory = SegyFactory(
spec=_segy_spec_mock_4d(),
spec=get_segy_mock_4d_spec(),
sample_interval=1000,
samples_per_trace=num_samples,
)

headers = factory.create_trace_header_template(trace_count)
samples = factory.create_trace_sample_template(trace_count)

for trc_idx in range(trace_count):
shot = shot_headers[trc_idx]
gun = gun_headers[trc_idx]
cable = cable_headers[trc_idx]
channel = channel_headers[trc_idx]
shot_line = 1
offset = 0

if index_receivers is False:
channel, gun, shot_line = 0, 0, 0

header_data = (shot, channel, shot, offset, shot_line, cable, gun)

fields = list(headers.dtype.names)
fields.remove("samples_per_trace")
fields.remove("sample_interval")

headers[fields][trc_idx] = header_data
samples[trc_idx] = np.linspace(start=shot, stop=shot + 1, num=num_samples)
start_x = 700000
start_y = 4000000
step_x = 100
step_y = 100

for trc_shot_idx in range(shot_count):
for trc_chan_idx in range(total_chan):
trc_idx = trc_shot_idx * total_chan + trc_chan_idx

shot = shot_headers[trc_idx]
gun = gun_headers[trc_idx]
cable = cable_headers[trc_idx]
channel = channel_headers[trc_idx]
shot_line = 1
offset = 0

if index_receivers is False:
channel, gun, shot_line = 0, 0, 0

# Assign dimension coordinate fields with calculated mock data
header_fields = ["field_rec_no", "channel", "shot_point", "offset", "shot_line", "cable", "gun"]
headers[header_fields][trc_idx] = (shot, channel, shot, offset, shot_line, cable, gun)

# Assign coordinate fields with mock data
x = start_x + step_x * trc_shot_idx
y = start_y + step_y * trc_chan_idx
headers["coordinate_scalar"][trc_idx] = -100
coord_fields = ["source_coord_x", "source_coord_y", "group_coord_x", "group_coord_y", "cdp_x", "cdp_y"]
headers[coord_fields][trc_idx] = (x, y) * 3

samples[trc_idx] = np.linspace(start=shot, stop=shot + 1, num=num_samples)

with segy_path.open(mode="wb") as fp:
fp.write(factory.create_textual_header())
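The fixture above derives mock source, group, and CDP coordinates from a fixed origin and step, and writes a `coordinate_scalar` of -100. A small sketch of that arithmetic with an illustrative 2 x 3 grid, including how a reader would apply the negative scalar (divide the stored integer by its absolute value, per the SEG-Y convention):

```python
# Origin, step, and scalar match the fixture above; the grid shape is illustrative.
start_x, start_y = 700_000, 4_000_000
step_x, step_y = 100, 100
coordinate_scalar = -100  # negative scalar: stored integer / |scalar| gives real-world units

shot_count, total_chan = 2, 3
for trc_shot_idx in range(shot_count):
    for trc_chan_idx in range(total_chan):
        raw_x = start_x + step_x * trc_shot_idx  # written to source/group/cdp x fields
        raw_y = start_y + step_y * trc_chan_idx  # written to source/group/cdp y fields
        real_x = raw_x / abs(coordinate_scalar)  # e.g. 700000 -> 7000.0
        real_y = raw_y / abs(coordinate_scalar)  # e.g. 4000000 -> 40000.0
        print((trc_shot_idx, trc_chan_idx), (raw_x, raw_y), (real_x, real_y))
```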