UDF: dimensions order and shape of Xarray DataArray

margot.verhulst · 7 December 2023 11:48

Hi,

I noticed a difference in the dimension order and the shape of the xarray DataArray of the datacube that goes into a UDF (as an XarrayDataCube) when using execute_local_udf versus applying the UDF with reduce_dimension.

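For completeness, minimal code for the datacube:

import datetime
# Assumption: compute_index is the helper from openeo.extra.spectral_indices
from openeo.extra.spectral_indices import compute_index

# `connection` is an already authenticated openEO connection
cube = connection.load_collection(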
    collection_id="TERRASCOPE_S2_TOC_V2",
    spatial_extent={"west": 4.40, "south": 50.75, "east": 4.43, "north": 50.78},
    temporal_extent=[datetime.datetime(2018, 1, 1), datetime.datetime(2018, 12, 31)],
    bands=["B04", "B08", "SCL"])

cube_vi = compute_index(cube, "NDVI")

cube_vi_agg = cube_vi.aggregate_temporal_period(period="month", reducer="median")

In the case of execute_local_udf:

# The first lines of the udf
def apply_datacube(cube: XarrayDataCube, context: dict) -> XarrayDataCube:
    xr_dataarray: xr.DataArray = cube.get_array()
    print(f"{xr_dataarray.dims}")
    print(f"{xr_dataarray.shape}")

The 1st print statement returns ('t', 'bands', 'x', 'y')
The 2nd print statement returns (12, 1, 339, 219)

In the case of applying udf with reduce_dimension via batch job:

# The first lines of the udf
def apply_datacube(cube: XarrayDataCube, context: dict) -> XarrayDataCube:
    xr_dataarray: xr.DataArray = cube.get_array()
    inspect(message=f"{xr_dataarray.dims}")
    inspect(message=f"{xr_dataarray.shape}")

The 1st inspect statement returns ('t', 'bands', 'y', 'x')
The 2nd inspect statement returns (12, 1, 256, 256)

So there are two differences:

The ‘x’ and ‘y’ dimensions seem to be swapped.
The displayed number of pixels is different (339 and 219 should be the correct numbers for this bbox).

Even though my UDFs are working properly now, these differences confused me a lot during development, so I’d still like to understand this better:

Is this behaviour expected, or am I doing something wrong?
If it is expected, is there a way to avoid these differences? My understanding is that execute_local_udf can be used for developing and debugging a UDF. In my case, however, I had to adapt the UDF anyway when moving from local execution to execution on the openEO backend. Is that normal?
Is there a specific reason for the shape being (256, 256) instead of (339, 219)?

Kind regards,
Margot

jeroen.verstraelen · 7 December 2023 16:07

Hi Margot,

We also noticed this discrepancy and it was fixed with this pull request:

The dimension ordering was changed in version 0.24.0 of the openeo-python-client. You can install that version or a newer one via pip using:

pip install 'openeo>=0.24.0'
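Independent of the client version, a UDF is less fragile when it looks up axes by name instead of by position. The sketch below uses plain tuples with the dims and shapes reported above to show the idea; the helper names sizes_by_name and reorder are made up for this illustration. On a real xarray DataArray the equivalents would be the sizes mapping (for example xr_dataarray.sizes["x"]) and xr_dataarray.transpose("t", "bands", "y", "x").

```python
def sizes_by_name(dims, shape):
    """Map each dimension name to its length (similar to DataArray.sizes)."""
    return dict(zip(dims, shape))


def reorder(dims, shape, target):
    """Return the shape rearranged into the target dimension order."""
    sizes = sizes_by_name(dims, shape)
    return tuple(sizes[name] for name in target)


# Dims and shapes as reported in this thread
local_dims, local_shape = ("t", "bands", "x", "y"), (12, 1, 339, 219)
backend_dims, backend_shape = ("t", "bands", "y", "x"), (12, 1, 256, 256)

local = sizes_by_name(local_dims, local_shape)
backend = sizes_by_name(backend_dims, backend_shape)

# Looking up by name gives the right axis in both layouts
print(local["x"], local["y"])      # 339 219
print(backend["x"], backend["y"])  # 256 256

# Bring the local layout into the backend's order before positional indexing
print(reorder(local_dims, local_shape, backend_dims))  # (12, 1, 219, 339)
```

A UDF written this way should not need changes when the order of 'x' and 'y' differs between local and backend execution.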

Hope that helps! If the issue persists, feel free to let me know.

Kind regards,
Jeroen

jeroen.verstraelen · 7 December 2023 16:13

The CDSE and Terrascope backends run UDFs on tiles of 256x256 pixels. This is driven by specific implementation considerations.

The execute_local_udf function runs the UDF once on the entire datacube, because a local datacube is usually small. On the backend the datacube may extend to, for instance, a hundred thousand pixels. That’s why it is first split into 256x256 tiles and the UDF is then executed once for every tile.
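As a rough illustration of what that means for the number of UDF calls, here is a small sketch. It assumes a regular, non-overlapping tile grid, which is a simplification and not the backend's actual code; how edge tiles are padded or aligned is not covered here, and tile_grid is a made-up helper name.

```python
import math


def tile_grid(width, height, tile_size=256):
    """Number of tile columns and rows needed to cover width x height pixels.

    Assumes non-overlapping tiles; the real backend tiling may differ.
    """
    cols = math.ceil(width / tile_size)
    rows = math.ceil(height / tile_size)
    return cols, rows


# Spatial extent from this thread: 339 x 219 pixels
cols, rows = tile_grid(339, 219)
udf_calls = cols * rows

print(cols, rows)   # 2 1
print(udf_calls)    # 2
```

So with these assumptions the UDF would run more than once for this bbox, each time on one tile rather than on the full 339 x 219 cube, while execute_local_udf runs it a single time on everything. A UDF that hard-codes the spatial shape therefore behaves differently in the two settings.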

margot.verhulst · 7 December 2023 16:17

Great, thanks for the clarification! I will try again soon with the updated openeo version.

Kind regards,
Margot