ValueError when loading trained model in UDF

Hi,
I am working on a UDF to apply an externally trained scikit-learn model. I work in PyCharm with conda environments. So far, the UDF works when I apply it locally with execute_local_udf, but when I switch to executing the UDF regularly, in my case with reduce_dimension(), the model won't load. I used the inspect logging function to verify that this is the point where the UDF fails.

The model is loaded with the following code:

import pickle
import urllib.request

# Download and unpickle the externally trained model.
clf = pickle.load(urllib.request.urlopen(url_model))

The model is currently hosted on GitHub:
url_model = "https://raw.githubusercontent.com/MargotVerhulst/raw/master/results_bestModel-modelRF-norun1.pickle"

The error I get is (job_id='j-231105a259794a02ab7a3c6e7bf6e12c'):
OpenEO batch job failed: UDF Exception during Spark execution:
  File "/opt/venv/lib64/python3.8/site-packages/openeo/udf/run_code.py", line 180, in run_udf_code
    result_cube = func(cube=data.get_datacube_list()[0], context=data.user_context)
  File "<string>", line 29, in apply_datacube
  File "sklearn/tree/_tree.pyx", line 661, in sklearn.tree._tree.Tree.__setstate__
ValueError: Did not recognise loaded array dytpe

It seems like some kind of dependency issue, where the versions of the relevant libraries do not match? I read some documentation that might be relevant, but I do not fully understand how to apply it. Can you help me with how I should proceed?

Thank you in advance.

Kind regards,
Margot

Hi Margot,
the problem is indeed usually a version mismatch: a model in 'pickle' format is typically tied to a specific Python version and to the versions of the libraries it was built with.

Our Python version is currently 3.8, and we also have:
numpy==1.22.4
xarray~=0.16.2
scikit-learn==0.24.2

So you could try to replicate this environment. (We also have public docker images with that environment, but that’s probably more tedious to use.)
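For example, a quick way to compare your own (training) environment against these pins is a small check like this (just a sketch, to be run locally where you pickle the model):

import sys
import numpy
import sklearn
import xarray

# Print the versions in the environment where the model is trained/pickled,
# to compare against the backend versions listed above.
print("python:", sys.version.split()[0])
print("numpy:", numpy.__version__)
print("xarray:", xarray.__version__)
print("scikit-learn:", sklearn.__version__)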
For scikit-learn, I notice our version has gotten rather old; we could also consider upgrading it on our side.

Hope that helps!
Jeroen

Hi Jeroen,

Thanks a lot, that certainly helps!

I am still wondering about two things:

  • Would it have been possible for me to check these versions myself?
  • If pickle is especially prone to this problem, is there another (recommended) way to save/load (scikit-learn) models?

If possible, it would indeed be nice to have a more recent scikit-learn version.

Kind regards,
Margot

Hi Margot,
I looked around a bit for solutions, and the most promising method for model portability is ONNX. We have already tested this for more complex deep learning models, but it is also claimed to work with scikit-learn:
https://onnx.ai/sklearn-onnx/auto_examples/plot_convert_model.html
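As a rough sketch of what that would look like for a random forest, based on that example (here clf is your fitted model, and X_sample and X_new are hypothetical arrays of training and prediction data):

import numpy
from skl2onnx import to_onnx

# Convert the fitted scikit-learn model to ONNX; a small data sample
# is used to infer the input type and shape.
onx = to_onnx(clf, X_sample.astype(numpy.float32))
with open("model_rf.onnx", "wb") as f:
    f.write(onx.SerializeToString())

# Inference then only needs onnxruntime, not scikit-learn:
import onnxruntime
sess = onnxruntime.InferenceSession("model_rf.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
predictions = sess.run(None, {input_name: X_new.astype(numpy.float32)})[0]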

We are also preparing an example of how to use this in openEO.

The very nice thing is that this makes models independent of scikit-learn/tensorflow/pytorch/…, and we also avoid ending up in this huge dependency management problem, which is often complex to solve properly.
We could, for instance, also upgrade scikit-learn, which would make your model work now, but if our version changes again at some point, your model would break again.

Is that an option to explore further?

Regarding finding our versions: it is possible, but hard for end users, as they are spread around a bit in our open source modules, e.g.
https://raw.githubusercontent.com/Open-EO/openeo-geopyspark-driver/master/setup.py
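One workaround, using the inspect logging helper you already mentioned, is to let the backend report its own versions from inside a small throwaway UDF (just a sketch):

import numpy
import sklearn
from openeo.udf import XarrayDataCube
from openeo.udf.debug import inspect

def apply_datacube(cube: XarrayDataCube, context: dict) -> XarrayDataCube:
    # Log the library versions of the environment the UDF actually runs in;
    # the messages show up in the batch job logs.
    inspect(message=f"numpy {numpy.__version__}, scikit-learn {sklearn.__version__}")
    return cube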

Hi Jeroen,
Thank you for the suggestion, I will look into ONNX.
In the meantime, it would indeed still be interesting if your scikit-learn version could be upgraded (even if it will probably break in the future).
Kind regards,
Margot

Ok, I upgraded scikit-learn; it's still in our pipelines, but it should come through by Monday.

Have a nice weekend,
Jeroen

Can I ask what the new version of scikit-learn is? As of today, I experience the following error when trying to use my UDF:

ValueError: node array from the pickle has an incompatible dtype:
- expected: {'names': ['left_child', 'right_child', 'feature', 'threshold', 'impurity', 'n_node_samples', 'weighted_n_node_samples', 'missing_go_to_left'], 'formats': ['<i8', '<i8', '<i8', '<f8', '<f8', '<i8', '<f8', 'u1'], 'offsets': [0, 8, 16, 24, 32, 40, 48, 56], 'itemsize': 64}
- got     : [('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'), ('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'), ('weighted_n_node_samples', '<f8')]

When googling this error, it seems that mismatched scikit-learn versions could be the problem. As I used this same UDF earlier this month (20th of November) without any problems, I think it might have to do with the recent update of the scikit-learn version.

Kind regards,
Kato

I solved the error by updating the scikit-learn version in my own environment (where I pickle the model) from 1.2.2 to 1.11.4. Hope this helps anyone who runs into the same problem!

Kato

Hi Jeroen,

Posting here because it is related to the same UDF, but I ran into a new problem: suddenly I get an error at an earlier point.

I load and apply the UDF like this:

udf_inference = pathlib.Path('udf_ndvi.py').read_text()

test_cube_vi_masked_inv_int_predictions = cube_vi_masked_inv_int.reduce_dimension(reducer=udf_inference, dimension='t')

It receives a DataArray with (t, bands, x, y) dimensions and returns one with only (bands, x, y) dimensions.

The error is:
OpenEO batch job failed: java.lang.IllegalArgumentException: Unsupported operation: (followed by the full text of the UDF)

The related job ID is: j-2312045b8d5c4bf89d9a14670dad8a7e

Any idea what might cause this?

Kind regards,
Margot

Hi Margot,
this is not the right way to use a UDF. With pathlib.Path(...).read_text() you load your UDF as a plain Python string and pass it directly as the reducer in reduce_dimension. The backend then interprets this whole UDF string as a process id, which of course does not work.

It is recommended to load and use UDFs like this:

import openeo
...
udf_inference = openeo.UDF.from_file("udf_ndvi.py")

... = cube_vi_masked_inv_int.reduce_dimension(reducer=udf_inference, dimension='t')

Hi Stefaan,

Thanks, that makes a lot of sense! This was probably the result of an inattentive copy-paste (I was using pathlib.Path(...).read_text() when executing the UDF locally with execute_local_udf).

Kind regards,
Margot