open_virtual_dataset doesn't resolve Azure storage path #771

@ecamossi

Hello everyone,

I'm trying to use VirtualiZarr 2.0.1 to virtualize access to some netCDF data files stored in an Azure storage container, but creating the virtual datasets fails when the remote URLs are resolved.

The code below is a minimal example that creates a virtual dataset for a single netCDF file and raises the same error.
The remote file is accessible in the Azure storage container, and its URL is resolved correctly when registry.resolve is called directly, outside open_virtual_dataset. When open_virtual_dataset is used to virtualize the same file, the URL is instead mapped to a path on the local storage of the compute instance where the code runs. That path does not exist, so the resolve call fails.

Code snippet with results (details removed):

import os
import sys
import fsspec
import glob
import adlfs

import obstore as obs

from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.parsers.hdf import HDFParser
from virtualizarr.registry import ObjectStoreRegistry

# Container and account names are read from environment variables.
bucket = "abfs://" + os.environ["AZURE_STORAGE_CONTAINER"]
store = obs.store.from_url(
    bucket,
    account_name=os.environ["AZURE_STORAGE_ACCOUNT"],
    skip_signature=True,
)

parser = HDFParser()
registry = ObjectStoreRegistry({bucket: store})

f_url = "abfs://<my_azure_storage_container>/<remote_path_to_netcdf_file>"
registry.resolve(url=f_url)

The call above resolves the remote URL to the file correctly:

AzureStore(container_name="<my_azure_storage_container>", account_name="<my_azure_storage_account>"),
 '<remote_path_to_netcdf_file>'

but the same URL is not resolved inside open_virtual_dataset:

vds = open_virtual_dataset(
  url=f_url,
  parser=parser,
  registry=registry,
  loadable_variables=[],
)

which raises this error

--------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[...], line 1
----> 1 vds = open_virtual_dataset(
      2   url=_url,
      3   parser=parser,
      4   registry=registry,
      5   loadable_variables=[],
      6 )

File [...]/lib/python3.12/site-packages/virtualizarr/xarray.py:87, in open_virtual_dataset(url, registry, parser, drop_variables, loadable_variables, decode_times)
     45 """
     46 Open an archival data source as an [xarray.Dataset][] wrapping virtualized zarr arrays.
     47 
   (...)     83     in `loadable_variables` and normal lazily indexed arrays for each variable in `loadable_variables`.
     84 """
     85 filepath = validate_and_normalize_path_to_uri(url, fs_root=Path.cwd().as_uri())
---> 87 manifest_store = parser(
     88     url=filepath,
     89     registry=registry,
     90 )
     92 ds = manifest_store.to_virtual_dataset(
     93     loadable_variables=loadable_variables,
     94     decode_times=decode_times,
     95 )
     96 return ds.drop_vars(list(drop_variables or ()))

File [...]lib/python3.12/site-packages/virtualizarr/parsers/hdf/hdf.py:168, in HDFParser.__call__(self, url, registry)
    147 def __call__(
    148     self,
    149     url: str,
    150     registry: ObjectStoreRegistry,
    151 ) -> ManifestStore:
    152     """
    153     Parse the metadata and byte offsets from a given HDF5/NetCDF4 file to produce a VirtualiZarr
    154     [ManifestStore][virtualizarr.manifests.ManifestStore].
   (...)    166         A [ManifestStore][virtualizarr.manifests.ManifestStore] which provides a Zarr representation of the parsed file.
    167     """
--> 168     store, path_in_store = registry.resolve(url)
    169     reader = ObstoreReader(store=store, path=path_in_store)
    170     manifest_group = _construct_manifest_group(
    171         filepath=url,
    172         reader=reader,
    173         group=self.group,
    174         drop_variables=self.drop_variables,
    175     )

File [...]lib/python3.12/site-packages/virtualizarr/registry.py:264, in ObjectStoreRegistry.resolve(self, url)
    262             path_after_prefix = path.lstrip("/")
    263         return store, path_after_prefix
--> 264 raise ValueError(f"Could not find an ObjectStore matching the url `{url}`")

ValueError: Could not find an ObjectStore matching the url `file:///mnt/batch/tasks/shared/LS_root/mounts/clusters/<path-to-local-storage-of-compute-instance>/abfs%3A/<my_azure_storage_container>/<remote_path_to_netcdf_file>`
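
For reference, the shape of the URL in the error message (`abfs%3A/` with a single slash, appended to a local directory) can be reproduced with the standard library alone. The sketch below is only an illustration of the symptom, not VirtualiZarr's actual normalization code: it assumes the `abfs://` string is treated as a relative local path and joined to the working directory before being turned into a `file://` URI, which is what the traceback's `fs_root=Path.cwd().as_uri()` argument suggests. The helper name `join_to_cwd_as_uri`, the directory `/mnt/work`, and the container and file names are hypothetical placeholders.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def join_to_cwd_as_uri(url: str, cwd: str) -> str:
    """Hypothetical helper: treat `url` as a relative path under `cwd`.

    Joining with pathlib collapses the "//" after the scheme into "/",
    and percent-encoding the result turns ":" into "%3A".
    """
    joined = PurePosixPath(cwd) / url
    return "file://" + quote(str(joined))


# Placeholder values, chosen only to mirror the structure of the real error.
mangled = join_to_cwd_as_uri("abfs://container/data/file.nc", "/mnt/work")
print(mangled)  # file:///mnt/work/abfs%3A/container/data/file.nc
```

If this is indeed what happens, the registry never sees an `abfs://` URL at all, which would explain why the store registered under the `abfs://` prefix is not matched.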

Any comment or suggestion is much appreciated.
Thank you!!

Labels: bug, obstore