Machine learning UDF on datacubes

david.kovacs · 8 April 2024 09:27

Hi all,

I am having a little trouble understanding the UDF procedures on datacubes. My intention is to apply a ML algorithm on each pixel’s spectrum (21 bands for S3 OLCI) with hyperparameters supplied by an external .py file. I have written the algorithms by matrix calculations, thus no specific ML library is needed, it all runs on numpy. In numpy, I feed the function a 3D array (bands,lat,lon) and the hyperparameters of the ML algorithm from the model file, then I run a double nested for loop over the lat-lon dimensions, and apply per-pixel the matrix multiplications to achieve a single value for the corresponding pixel, which is a biophysical variable (e.g. LAI).

My Main questions are:

How does openEO apply to UDF on the datacube? I was looking in the github source code, and examples, and it seems like I don’t need to run the double for loop to create the map, but the udf is subsequently applied on all lat-lon pixels?
How would I be able to get the spectrum (e.g. a list on numbers for each pixel) which I can use in my calculations?
How do I supply the external model.py file and what path do I provide for sys?

I provide below a MRE of the code. Thanks in advance

bandlist = ['B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B09', 'B10', 'B11', 'B12', 'B13', 'B14', 'B15', 'B16', 'B17', 'B18', 'B19', 'B20', 'B21']

bbox = [
          -0.7316519637662964,
          39.043183874892605,
          0.0723422016996551,
          39.80699041715417
        ]

s3_cube = connection.load_collection(
    "SENTINEL3_OLCI_L1B",
    spatial_extent={"west": bbox[0], "south": bbox[1], "east": bbox[2], "north": bbox[3]},
    temporal_extent=["2022-03-01", "2022-03-31"],
    bands = bandlist
).reduce_temporal("mean")

udf = openeo.UDF("""

import numpy as np

sys.path.append('dir of model.py')

import model

def GPR_mapping(S3_scene, # 3D numpy array (bands,latitude,longitude)
                hyperparameter_1, # variable 1 from model.py file (list size of bands)
                hyperparameter_2, # variable 2 from model.py file (list size of bands)
                ):

    bands, ydim,xdim = S3_scene.shape
    variable_map = np.empty((ydim,xdim))

    for v in range(0,xdim): # Do I need a double nested for loop to create a map with UDFs?
        for f in range(0,ydim):

            pixel_spectra = S3_scene[:,f,v] # How do I get a list of band values from datacube?

            intermediate_result  = ((pixel_spectra - hyperparameter_1) / hyperparameter_2) # Do soomething mathematical with hyperparameter
            final_result = pow(intermediate_result,hyperparameter_1) # Do soomething mathematical with hyperparameter
            pixel_value = final_result/2

            variable_map[f,v] = pixel_value # Append variable map np.array to create map
 
    return variable_map
""")

# Pass UDF
LAI_map= s3_cube.apply(process=udf)

LAI_map.download("udf.nc")

michele.claus · 17 April 2024 09:09

Hi @david.kovacs,

I recently did something similar. I trained a Random Forest model with sci-kit learn, converted to ONNX and used it with an openEO UDF to perform inference. I was supported by a developer of the VITO team through the NoR sponsoring (ESA sponsors you research project with funds) https://portfolio.nor-discover.org/

You can have a look here: openEO_photovoltaic/udf_inference/openeo_pv_farms_inference_udf.ipynb at main · clausmichele/openEO_photovoltaic · GitHub

@jeroen.dries @jeroen.verstraelen @stefaan.lippens can a model be loaded from a public github repo or does it have to be uploaded in the artifactory? Could you explain here the procedure or point to some documentation?

david.kovacs · 22 April 2024 13:17

Hi Michele,

Thanks for providing some support. Actually, I have taken a look at your codes and I still could not find how you could apply the model on each spatial pixel on the datacube. Is it handled by the udf by default?

I would like to get the bands’ value for each pixel as my model works on the spectral information provided as a “list” of values. Suppose Sentinel 3 OLCI has 21 bands, thus I’d like to get for each lat-lon pixel a list of length 21, containing the radiance for each corresponding band

Dávid

jeroen.dries · 22 April 2024 14:55

Hi David,
for what you describe, you normally need an ‘apply_dimension’ process, with dimension=“bands”.

The udf then receives an xarray with all bands and no time dimension.
It will have spatial dimensions, which you should retain, for performance reasons.

Hope that helps!

david.kovacs · 23 April 2024 12:43

Hi Jeroen,

Thanks for your help, I managed to use apply_dimension with the bands dimension, however I am still struggling to handle the bands within the udf.

My main concern is that when I do anythin within the udf, it gives me an array of (21,256,256) shape, (I guess because of the chunking process taking place back-end). I have read everywhere in the documentation, but I still cant get it to work.

import openeo
from openeo.udf import XarrayDataCube

udf = openeo.UDF("""

def apply_datacube(cube: xarray.DataArray, context: dict) -> xarray.DataArray:

    from openeo.udf import XarrayDataCube
    import numpy as np
    import xarray as xr

    array_model = np.array([band for band in range(1,22)])
    result_map = xr.DataArray() # or np.array, needed for resulting map

    result_map = cube.values * array_model #I'd need cube.values to be an array lenght of 21

    return result_map
""")

rescaled = s3_cube.apply_dimension(process=udf,dimension="bands")
rescaled.download("udf.nc")

The main question is, how I would do mathematical array operations on the 21 bands with a 21 long array from my model? I would want to append these results to my new array (result_map)

thanks in advance!
Dávid