Passing user-defined parameters to a UDF

I have been trying to pass a user-defined parameter to my UDF, but without success.
I tried to simply pass a dictionary as context. I read the collection "FRACTIONAL_SNOW_COVER" and I want to binarize the dataset based on a threshold of snow cover fraction.

This is my UDF (I know it could probably be replaced by an openEO process, but I want to understand how it works for future developments),

from openeo.udf import XarrayDataCube
from openeo.udf import OpenEoUdfException
import numpy as np
import xarray as xr


def apply_datacube(cube: XarrayDataCube,
                   context: dict) -> XarrayDataCube:

    snowT = context['snowT']

    array = cube.get_array()
    array = xr.where((array >= snowT) & (array <= 100), array, 0)
    array = xr.where(array < snowT, array, 100)
    array = array.astype(np.int16)

    return XarrayDataCube(array)

saved in a script called udf-binarize.py. I tried to pass the context like this:

import openeo
from pathlib import Path

binarize = openeo.UDF(Path('udf-binarize.py').read_text())
binarize_dict = {'snowT': 20}
scf_test = scf.apply(process=binarize, context=binarize_dict)

But I get this error:

OpenEO batch job failed: UDF Exception during Spark execution:
  File "/opt/venv/lib64/python3.8/site-packages/openeo/udf/run_code.py", line 180, in run_udf_code
    result_cube = func(cube=data.get_datacube_list()[0], context=data.user_context)
  File "<string>", line 38, in apply_datacube
openeo.udf.OpenEoUdfException: Missing snowT in dictionary
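(Side note: the "Missing snowT in dictionary" text comes from a small guard in my full script, roughly along these lines - the helper shown here is just for illustration - so the parameter really does not seem to arrive in the context:)

from openeo.udf import OpenEoUdfException

def check_context(context: dict):
    # sketch of the guard: fail with a clear message when the expected
    # parameter is not forwarded by the back-end
    if not context or "snowT" not in context:
        raise OpenEoUdfException("Missing snowT in dictionary")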

I don't understand how to pass that parameter "snowT". I had a look online and the only solution I found is what another user did:

that is, fixing the parameter in the UDF source (e.g. C=0) and then replacing it in the text later on. But is there a more elegant solution?

Also, according to the documentation here

https://open-eo.github.io/openeo-python-client/api.html#openeo.udf.run_code.execute_local_udf

if I want to run the UDF locally using execute_local_udf, I cannot pass the context to that function, is that right?
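For reference, this is roughly how I would call it locally ("test_cube.nc" is just a placeholder for a small sample downloaded beforehand):

from pathlib import Path
from openeo.udf.run_code import execute_local_udf

# there is no argument here to pass a context dictionary
result = execute_local_udf(Path("udf-binarize.py").read_text(), "test_cube.nc", fmt="netcdf")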

Last question: if I put a print() in my UDF for debugging, how can I visualize the output in the Web Editor?

Thanks in advance
Valentina

Hi Valentina,

It’s currently a badly documented feature, but the trick is to also make your UDF context-aware, like this:

binarize = openeo.UDF(..., context={"from_parameter": "context"})

I hope this already solves your problem. However, some more notes:
If you load your UDF from a file, it’s typically a lot easier to use UDF.from_file() instead of doing Path().read_text():

binarize = openeo.UDF.from_file("udf-binarize.py", context={"from_parameter": "context"})

I also think there is a bug in your UDF at the moment:

array = xr.where((array >= snowT) & (array <= 100), array, 0)
array = xr.where(array < snowT, array, 100)

The first statement will map all values outside of the range snowT-100 to 0, and the second statement will map this 0 value to 100. As this can be done with a single where, I guess there is something wrong here.
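For example, if the intent is "100 inside the snowT-100 range, 0 everywhere else", a single where should be enough (just a sketch of what I would expect, so correct me if the intent is different):

array = xr.where((array >= snowT) & (array <= 100), 100, 0).astype(np.int16)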


Thank you Stefaan for your quick and helpful reply!

So finally, it works this way:

binarize = openeo.UDF.from_file('udf-binarize.py', context={"from_parameter": "context"})
scf_test = scf.apply(process=binarize, context={"snowT": 20})

However, it still does not work if I try to run the UDF locally, since execute_local_udf does not accept a context as input. Is there any way to pass it?

Regarding the UDF, it is actually doing what I want: according to the documentation of xarray.where, within the range snowT-100 the original array is kept and values are set to 0 where the condition is False, while the second where then replaces all remaining values greater than or equal to the threshold with 100. BTW, I agree there should be a nicer way to do it, but it was just to understand the UDF syntax :wink:
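For example, with a toy array and snowT = 20:

import numpy as np
import xarray as xr

snowT = 20
a = xr.DataArray(np.array([10, 50, 150]))      # below, inside and above the snowT-100 range
a = xr.where((a >= snowT) & (a <= 100), a, 0)  # keeps 50, maps 10 and 150 to 0 -> [0, 50, 0]
a = xr.where(a < snowT, a, 100)                # keeps the 0s, maps 50 to 100   -> [0, 100, 0]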

Thanks a lot!

execute_local_udf indeed does not support a user-specified context. I made a feature request here: Add context support to `execute_local_udf` · Issue #514 · Open-EO/openeo-python-client · GitHub
Possible workaround (assuming that you are using execute_local_udf just for local debugging purposes): give the context argument in your UDF function a default value (e.g. def apply_datacube(..., context: dict = {"snowT": 20})), which allows you to use the same UDF code locally as with a real openEO service. Customizing the snowT value will however not be possible locally.
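A minimal sketch of that idea, applied to the UDF from this thread (the default of 20 is just an example; the fallback also covers the case where an empty context dictionary is passed in):

from openeo.udf import XarrayDataCube
import numpy as np
import xarray as xr


def apply_datacube(cube: XarrayDataCube, context: dict = None) -> XarrayDataCube:
    # Fall back on a hard-coded default when no (or an empty) context is forwarded,
    # e.g. when running locally via execute_local_udf; on the back-end the real
    # context passed to apply() takes precedence.
    context = context or {"snowT": 20}
    snowT = context["snowT"]

    array = cube.get_array()
    array = xr.where((array >= snowT) & (array <= 100), array, 0)
    array = xr.where(array < snowT, array, 100)
    return XarrayDataCube(array.astype(np.int16))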


I’m still a bit confused, but is this the result you want:

  • input value below snowT → output 0
  • input value between snowT and 100 → output 100
  • input value above 100 → output 0

If this is correct, you don’t need a UDF for a thresholding operation like this; you should be able to use the “band math” feature. For example, something along the lines of:

cover = cube.band("FSCTOC")
scf_test = 100.0 * (cover >= 20) * (cover <= 100)

Thank you very much for this helpful suggestion!
Yes, exactly, except that I want to keep the cloud value (205) as it is - sorry, I did not go into the details of the function since, as I said, it was only a dummy example.

So, finally, the solution would be:

scf_test = 100.0 * (cover >= 20) * (cover <= 100) + 205.0 * (cover == 205)

However, it returns -51 instead of 205, while the value is correct when running it locally.

This looks like a case of signed/unsigned confusion: unsigned byte 205 is the same as signed byte -51.
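You can verify the relation quickly with numpy:

import numpy as np

# the bit pattern of unsigned byte 205 read as a signed byte is -51 (205 - 256 = -51)
print(np.array([205], dtype=np.uint8).view(np.int8))   # [-51]
print(np.array([-51], dtype=np.int8).view(np.uint8))   # [205]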

Can you share your full Python script or workflow to get some more relevant details (openEO connection URL, connection id, spatiotemporal extent, …)?

Yes sure! Here is the code


import openeo

# authentication
eoconn = openeo.connect("https://openeo-dev.vito.be")
eoconn.authenticate_oidc()
eoconn.describe_account()

# load the Copernicus fractional snow cover collection
scf = eoconn.load_collection(
    "FRACTIONAL_SNOW_COVER",
    spatial_extent  = {'west':10.728539,'east':11.039333,'south':46.647281,'north':46.796379, 'crs':4326},
    temporal_extent=['2023-08-02','2023-08-07'],
    bands=["FSCTOC"]
)

scf_test = 100.0 * (scf >= 20) * (scf <= 100) + 205.0 * (scf == 205)

And how do you download scf_test? Synchronously or as a batch job, and in which file format?

I tried both ways

scf_test.download(base_path + os.sep + 'scf_test.nc')

or through a batch job

scf_test = scf_test.save_result(format='netCDF')
job = scf_test.create_job(title='scf_binary')
job.start_job()
results = job.get_results()
results.download_files('scf_bin.nc')

Anyway, I always get the same result.

Whereas by using the UDF - even though maybe not that convenient - I get the expected result:

import numpy as np
import xarray as xr
from openeo.udf import XarrayDataCube


def apply_datacube(cube: XarrayDataCube,
                   context: dict) -> XarrayDataCube:
    """
    If a pixel value is greater than or equal to a threshold, it is set to
    100. If smaller, it is set to 0.

    FSCTOC (Copernicus) cloud values are set as 205 -> this value is kept
    0 (no snow) is set as no data
    """

    snowT = context['snowT']

    array = cube.get_array()

    # valid pixel, no cloud, SCF between snowT and 100: set to 100
    condition1 = array.notnull() & (array >= snowT) & (array <= 100) & (array != 205)
    array = xr.where(condition1, 100, array)

    # valid pixel, no cloud, SCF between 0 and snowT: set to 0
    condition2 = array.isnull() | ((array >= 0) & (array < snowT) & (array != 205))
    array = xr.where(condition2, 0, array)

    array = array.astype(np.int16)

    return XarrayDataCube(array)

I spent quite some time trying to figure out how to escape the signed byte range that converts 205 to -51, but I didn’t find a solution yet. I filed a bug report here: FRACTIONAL_SNOW_COVER: how to escape from signed bytes? · Issue #601 · Open-EO/openeo-geopyspark-driver · GitHub

For the time being you have these workarounds:

  • use the UDF solution
  • map the 205 value to something below 127 (but above 100), e.g. 120, which will be preserved properly (see the sketch after this list for mapping it back afterwards):
    fsctoc = scf.band("FSCTOC")
    scf_test = 100 * (fsctoc >= 20) * (fsctoc <= 100) + 120 * (fsctoc == 205)
    
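If you need the original 205 value back afterwards, you could map the placeholder in the downloaded file, e.g. with xarray (a sketch; the "FSCTOC" variable name in the netCDF output is an assumption):

import xarray as xr

# hypothetical post-processing of the downloaded result:
# restore 205 where the placeholder value 120 was written
ds = xr.open_dataset("scf_test.nc")
ds["FSCTOC"] = xr.where(ds["FSCTOC"] == 120, 205, ds["FSCTOC"])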

Thanks a lot for your time :slight_smile:

Hi @stefaan.lippens,

sorry for coming back to this discussion, but I have recently noticed one more issue.
When applying my UDF to a larger time range, I noticed that for the second time step of the datacube there are unexpected no-data values. This is the input collection:

scf = eoconn.load_collection(
    "FRACTIONAL_SNOW_COVER",
    spatial_extent  = {'west':bbox[0],
                       'east':bbox[2],
                       'south':bbox[1],
                       'north':bbox[3],
                       'crs':4326},
    temporal_extent=['2023-08-02','2023-08-15'],
    bands=["FSCTOC"]
)

The other code remains the same.
If I am not mistaken, NaN should be replaced with 0 according to condition2 in the UDF.

BTW, this is the output I get
[screenshot of the output]

There is a square that seems to be 256x256 pixels large, and also a strip on the lower part of the image (not clear in this picture), that are not correctly replaced with 0. Given the size of those areas, and given that this issue doesn’t arise when running the UDF locally, I suspect this is linked to the default chunk size used by apply, isn’t it? What’s the way to deal with this problem?

Thanks a lot
Valentina

The back-end you are using processes the whole requested extent in smaller chunks (e.g. tiles of 256x256 pixels) to make the processing scalable. In this case, it might be that that particular tile is missing in the source data, or the back-end detected that all pixels are no-data at load time and decided to skip it for performance reasons. As a result, your UDF is never applied to that region. When stitching the data back together into a single file, the missing region is filled in with no-data values, like you see in your result, if I understand correctly.

I don’t think there is currently a workaround for this problem.
However, I vaguely remember discussions about specifying a custom fill value for no-data pixels in the save_result process, to be used when stitching a sparse cube into a single file. Maybe @jeroen.dries knows more.

I think I can mostly confirm the analysis, but we don’t have a way to customize the nodata value in the output.
To make it more consistent, I would recommend that the UDF does not do the conversion to int16, and leaves the datatype in floating point, so using NaN for nodata.
Then, after the UDF, add a linear_scale_range process with an input range and output range that are the same and that fit into the int16 range. This should result in an int16 output file with at least a consistent nodata value.
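A minimal sketch of that pipeline (the 0-255 range is just illustrative; the key point is that input and output range are identical and fit in the int16 range):

# the UDF output stays floating point with NaN as nodata;
# the cast to int16 happens afterwards via linear_scale_range
scf_udf = scf.apply(process=binarize, context={"snowT": 20})
scf_int16 = scf_udf.linear_scale_range(0, 255, 0, 255)
result = scf_int16.save_result(format="netCDF")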