Machine learning UDF on datacubes

Hi all,

I am having a little trouble understanding the UDF procedure on datacubes. My intention is to apply an ML algorithm to each pixel’s spectrum (21 bands for S3 OLCI), with hyperparameters supplied by an external .py file. I have written the algorithm as plain matrix calculations, so no specific ML library is needed; it all runs on numpy. In numpy, I feed the function a 3D array (bands, lat, lon) and the hyperparameters of the ML algorithm from the model file, then I run a double nested for loop over the lat-lon dimensions and apply the matrix multiplications per pixel to obtain a single value for the corresponding pixel, which is a biophysical variable (e.g. LAI).

My main questions are:

  1. How does openEO apply the UDF to the datacube? I was looking at the GitHub source code and examples, and it seems like I don’t need to run the double for loop to create the map; is the UDF applied to all lat-lon pixels automatically?

  2. How would I get the spectrum (e.g. a list of numbers for each pixel) to use in my calculations?

  3. How do I supply the external model.py file, and what path do I provide to sys.path?

I provide an MRE of the code below. Thanks in advance!

bandlist = ['B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B09', 'B10', 'B11', 'B12', 'B13', 'B14', 'B15', 'B16', 'B17', 'B18', 'B19', 'B20', 'B21']

bbox = [
          -0.7316519637662964,
          39.043183874892605,
          0.0723422016996551,
          39.80699041715417
        ]

s3_cube = connection.load_collection(
    "SENTINEL3_OLCI_L1B",
    spatial_extent={"west": bbox[0], "south": bbox[1], "east": bbox[2], "north": bbox[3]},
    temporal_extent=["2022-03-01", "2022-03-31"],
    bands = bandlist
).reduce_temporal("mean")

udf = openeo.UDF("""

import sys
import numpy as np

sys.path.append('dir of model.py')  # placeholder path; see question 3

import model

def GPR_mapping(S3_scene,          # 3D numpy array (bands, latitude, longitude)
                hyperparameter_1,  # variable 1 from the model.py file (list, size of bands)
                hyperparameter_2,  # variable 2 from the model.py file (list, size of bands)
                ):

    bands, ydim, xdim = S3_scene.shape
    variable_map = np.empty((ydim, xdim))

    for v in range(xdim):  # Do I need a double nested for loop to create a map with UDFs?
        for f in range(ydim):

            pixel_spectra = S3_scene[:, f, v]  # How do I get a list of band values from the datacube?

            # Do something mathematical with the hyperparameters
            intermediate_result = (pixel_spectra - hyperparameter_1) / hyperparameter_2
            final_result = pow(intermediate_result, hyperparameter_1)
            pixel_value = final_result.sum() / 2  # reduce the spectrum to a single value

            variable_map[f, v] = pixel_value  # Fill the variable map

    return variable_map
""")

# Pass UDF
LAI_map = s3_cube.apply(process=udf)

LAI_map.download("udf.nc")

Hi @david.kovacs,

I recently did something similar: I trained a Random Forest model with scikit-learn, converted it to ONNX, and used it with an openEO UDF to perform inference. I was supported by a developer of the VITO team through NoR sponsoring (ESA sponsors your research project with funds): https://portfolio.nor-discover.org/

You can have a look here: https://github.com/clausmichele/openEO_photovoltaic/blob/main/udf_inference/openeo_pv_farms_inference_udf.ipynb

@jeroen.dries @jeroen.verstraelen @stefaan.lippens can a model be loaded from a public GitHub repo, or does it have to be uploaded to the Artifactory? Could you explain the procedure here or point to some documentation?

Hi Michele,

Thanks for providing some support. I have taken a look at your code, but I still could not find how you apply the model to each spatial pixel of the datacube. Is that handled by the UDF by default?

I would like to get the bands' values for each pixel, as my model works on the spectral information provided as a "list" of values. Since Sentinel-3 OLCI has 21 bands, I'd like to get, for each lat-lon pixel, a list of length 21 containing the radiance for each corresponding band.

Dávid

Hi David,
for what you describe, you normally need an `apply_dimension` process, with `dimension="bands"`.

The UDF then receives an xarray with all bands and no time dimension.
It will also have the spatial dimensions, which you should retain for performance reasons.
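To make that concrete, here is a minimal sketch of such a UDF body. The `model_vector` values are placeholders standing in for your real hyperparameters, and I am assuming the cube arrives with dims ("bands", "y", "x"):

```python
import numpy as np
import xarray

def apply_datacube(cube: xarray.DataArray, context: dict) -> xarray.DataArray:
    # cube arrives with dims ("bands", "y", "x"): the full spectrum per pixel,
    # no time dimension (it was reduced earlier in the process graph).
    # Placeholder per-band coefficients; in practice these come from your model file.
    model_vector = xarray.DataArray(
        np.linspace(0.1, 2.1, cube.sizes["bands"]), dims=["bands"]
    )

    # xarray broadcasts along the "bands" dimension, so every pixel's spectrum
    # is multiplied element-wise by the coefficients and summed over bands --
    # no explicit loop over y/x is needed.
    result = (cube * model_vector).sum(dim="bands")

    # Keep a length-1 bands dimension so the datacube structure stays intact.
    return result.expand_dims(dim="bands")
```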

Hope that helps!

Hi Jeroen,

Thanks for your help. I managed to use apply_dimension with the bands dimension; however, I am still struggling to handle the bands within the UDF.

My main concern is that whatever I do within the UDF, it gives me an array of shape (21, 256, 256) (I guess because of the chunking the back-end performs). I have read through the documentation, but I still can't get it to work.

import openeo
from openeo.udf import XarrayDataCube

udf = openeo.UDF("""

import numpy as np
import xarray

def apply_datacube(cube: xarray.DataArray, context: dict) -> xarray.DataArray:

    array_model = np.array([band for band in range(1, 22)])  # 21 values from my model

    # cube.values has shape (bands, y, x); reshape the model array to
    # (21, 1, 1) so it broadcasts over the spatial dimensions
    result_map = cube.values * array_model[:, None, None]

    return xarray.DataArray(result_map, dims=cube.dims, coords=cube.coords)
""")

rescaled = s3_cube.apply_dimension(process=udf, dimension="bands")
rescaled.download("udf.nc")

The main question is: how would I do mathematical array operations on the 21 bands using a 21-element array from my model? I would want to write these results into my new array (result_map).
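To illustrate what I mean, outside of openEO I would just broadcast the model array over the spatial dimensions in plain numpy (dummy values here, not my real model):

```python
import numpy as np

# Dummy stand-ins: a (bands, y, x) chunk like the one the UDF receives,
# and a 21-element coefficient array from the model file.
cube_values = np.ones((21, 256, 256))
array_model = np.arange(1, 22, dtype=float)

# Reshaping to (21, 1, 1) lets numpy broadcast the per-band coefficients
# over every spatial pixel at once, instead of a double for loop.
scaled = cube_values * array_model[:, None, None]

# Collapse the band dimension per pixel, e.g. a sum -> a (256, 256) map
result_map = scaled.sum(axis=0)
```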

Thanks in advance!
Dávid