Issue with batch job

Hi, I am trying to move all my download operations from “execute” to batch processing.
I have encountered an error that I don’t know how to solve. Can you help me with that?

Your batch job 'vito-6471a90f-8fa9-4474-b8d4-7217c854789b' failed.
Logs can be inspected in an openEO (web) editor or with `connection.job('vito-6471a90f-8fa9-4474-b8d4-7217c854789b').logs()`.

Printing logs:
[{'id': 'error', 'level': 'error', 'message': 'Traceback (most recent call last):\n  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/backend.py", line 1690, in get_log_entries\n    with (self.get_job_output_dir(job_id) / "log").open(\'r\') as f:\n  File "/usr/lib64/python3.8/pathlib.py", line 1221, in open\n    return io.open(self, mode, buffering, encoding, errors, newline,\n  File "/usr/lib64/python3.8/pathlib.py", line 1077, in _opener\n    return self._accessor.open(self, flags, mode)\nFileNotFoundError: [Errno 2] No such file or directory: \'/data/projects/OpenEO/6471a90f-8fa9-4474-b8d4-7217c854789b/log\'\n'}]
Traceback (most recent call last):

  Input In [8] in <cell line: 14>
    job.start_and_wait()

  File ~\.conda\envs\openeo\lib\site-packages\openeo\rest\job.py:222 in start_and_wait
    raise JobFailedException("Batch job {i!r} didn't finish successfully. Status: {s} (after {t}).".format(

JobFailedException: Batch job 'vito-6471a90f-8fa9-4474-b8d4-7217c854789b' didn't finish successfully. Status: error (after 0:01:23).

Clearly, If you need it I can give you the full program structure.
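For completeness, fetching the logs with the Python client and filtering out the error entries can be sketched like this. The connection URL and authentication method are assumptions, and `error_entries` is a hypothetical helper, not part of the openeo client; the log-entry shape matches the listing printed above.

```python
# Hypothetical helper for working with the log listing shown above.
def error_entries(log_entries):
    """Keep only the error-level entries from a batch-job log listing."""
    return [e for e in log_entries if e.get("level") == "error"]

# Against a live backend this would be used roughly as (assumption):
#   import openeo
#   connection = openeo.connect("openeo.vito.be").authenticate_oidc()
#   logs = connection.job("vito-6471a90f-8fa9-4474-b8d4-7217c854789b").logs()
#   for entry in error_entries(logs):
#       print(entry["message"])

# Sample data mirroring the entry shape printed in this thread:
sample = [
    {"id": "error", "level": "error", "message": "Traceback (most recent call last): ..."},
    {"id": "1", "level": "info", "message": "batch job started"},
]
print(error_entries(sample))
```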

Hi Paolo,
we noticed this as well in other jobs, the team is working to resolve it asap! Will give an update when it’s resolved.

Jeroen


The problem should be resolved now.

Can you try running your batch job again?

I still receive an error, but it is different this time:

Printing logs:
[{'id': 'error', 'level': 'error', 'message': 'error processing batch job\nTypeError: float() argument must be a string or a number, not \'list\'\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File "batch_job.py", line 319, in main\n    run_driver()\n  File "batch_job.py", line 292, in run_driver\n    run_job(\n  File "/data2/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_13282/container_e5041_1655189542545_13282_01_000001/venv/lib/python3.8/site-packages/openeogeotrellis/utils.py", line 43, in memory_logging_wrapper\n    return function(*args, **kwargs)\n  File "batch_job.py", line 388, in run_job\n    assets_metadata = result.write_assets(str(output_file))\n  File "/data2/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_13282/container_e5041_1655189542545_13282_01_000001/venv/lib/python3.8/site-packages/openeo_driver/save_result.py", line 234, in write_assets\n    self.to_netcdf(filename)\n  File "/data2/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_13282/container_e5041_1655189542545_13282_01_000001/venv/lib/python3.8/site-packages/openeo_driver/save_result.py", line 315, in to_netcdf\n    array.to_netcdf(filename,encoding=encoding)\n  File "/data2/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_13282/container_e5041_1655189542545_13282_01_000001/venv/lib/python3.8/site-packages/xarray/core/dataset.py", line 1644, in to_netcdf\n    return to_netcdf(\n  File "/data2/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_13282/container_e5041_1655189542545_13282_01_000001/venv/lib/python3.8/site-packages/xarray/backends/api.py", line 1111, in to_netcdf\n    dump_to_store(\n  File "/data2/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_13282/container_e5041_1655189542545_13282_01_000001/venv/lib/python3.8/site-packages/xarray/backends/api.py", line 1158, in dump_to_store\n    
store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)\n  File "/data2/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_13282/container_e5041_1655189542545_13282_01_000001/venv/lib/python3.8/site-packages/xarray/backends/common.py", line 250, in store\n    variables, attributes = self.encode(variables, attributes)\n  File "/data2/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_13282/container_e5041_1655189542545_13282_01_000001/venv/lib/python3.8/site-packages/xarray/backends/common.py", line 339, in encode\n    variables, attributes = cf_encoder(variables, attributes)\n  File "/data2/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_13282/container_e5041_1655189542545_13282_01_000001/venv/lib/python3.8/site-packages/xarray/conventions.py", line 773, in cf_encoder\n    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}\n  File "/data2/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_13282/container_e5041_1655189542545_13282_01_000001/venv/lib/python3.8/site-packages/xarray/conventions.py", line 773, in <dictcomp>\n    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}\n  File "/data2/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_13282/container_e5041_1655189542545_13282_01_000001/venv/lib/python3.8/site-packages/xarray/conventions.py", line 258, in encode_cf_variable\n    var = ensure_dtype_not_object(var, name=name)\n  File "/data2/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_13282/container_e5041_1655189542545_13282_01_000001/venv/lib/python3.8/site-packages/xarray/conventions.py", line 216, in ensure_dtype_not_object\n    data = _copy_with_dtype(data, dtype=_infer_dtype(data, name))\n  File 
"/data2/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_13282/container_e5041_1655189542545_13282_01_000001/venv/lib/python3.8/site-packages/xarray/conventions.py", line 174, in _copy_with_dtype\n    result[...] = data\nValueError: setting an array element with a sequence.\n'}]
Traceback (most recent call last):

  Input In [8] in <cell line: 14>
    job.start_and_wait()

  File ~\.conda\envs\openeo\lib\site-packages\openeo\rest\job.py:222 in start_and_wait
    raise JobFailedException("Batch job {i!r} didn't finish successfully. Status: {s} (after {t}).".format(

JobFailedException: Batch job 'vito-294f74ab-2584-41fc-b64a-7f900e6f6ed3' didn't finish successfully. Status: error (after 0:24:05).

Hi Paolo,
this error happens when writing the netCDF, but I have the impression that an older code path is being used.
Which format options do you use for your netCDF output?

thanks,
Jeroen

This is the instruction I use to download the file:

res = C0.save_result(format="netCDF")
job = res.create_job(title="C")
job.start_and_wait()
job.get_results().download_files()

Is there a more direct way?

Paolo

The more direct way would be:

job = C0.execute_batch(out_format="netCDF", title="C")
job.get_results().download_files()

but it does the same thing in the end. Can you run it again and give me the new job id? Then I can inspect the logs a bit better. (The old application got removed already.)

Here it is

Your batch job 'vito-aea46783-1b77-4ce6-b39c-feaa038188ee' failed.
Logs can be inspected in an openEO (web) editor or with `connection.job('vito-aea46783-1b77-4ce6-b39c-feaa038188ee').logs()`.

Printing logs:
[{'id': 'error', 'level': 'error', 'message': 'error processing batch job\nTypeError: float() argument must be a string or a number, not \'list\'\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File "batch_job.py", line 319, in main\n    run_driver()\n  File "batch_job.py", line 292, in run_driver\n    run_job(\n  File "/data1/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_31869/container_e5041_1655189542545_31869_01_000001/venv/lib/python3.8/site-packages/openeogeotrellis/utils.py", line 43, in memory_logging_wrapper\n    return function(*args, **kwargs)\n  File "batch_job.py", line 388, in run_job\n    assets_metadata = result.write_assets(str(output_file))\n  File "/data1/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_31869/container_e5041_1655189542545_31869_01_000001/venv/lib/python3.8/site-packages/openeo_driver/save_result.py", line 234, in write_assets\n    self.to_netcdf(filename)\n  File "/data1/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_31869/container_e5041_1655189542545_31869_01_000001/venv/lib/python3.8/site-packages/openeo_driver/save_result.py", line 315, in to_netcdf\n    array.to_netcdf(filename,encoding=encoding)\n  File "/data1/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_31869/container_e5041_1655189542545_31869_01_000001/venv/lib/python3.8/site-packages/xarray/core/dataset.py", line 1644, in to_netcdf\n    return to_netcdf(\n  File "/data1/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_31869/container_e5041_1655189542545_31869_01_000001/venv/lib/python3.8/site-packages/xarray/backends/api.py", line 1111, in to_netcdf\n    dump_to_store(\n  File "/data1/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_31869/container_e5041_1655189542545_31869_01_000001/venv/lib/python3.8/site-packages/xarray/backends/api.py", line 1158, in dump_to_store\n    
store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)\n  File "/data1/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_31869/container_e5041_1655189542545_31869_01_000001/venv/lib/python3.8/site-packages/xarray/backends/common.py", line 250, in store\n    variables, attributes = self.encode(variables, attributes)\n  File "/data1/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_31869/container_e5041_1655189542545_31869_01_000001/venv/lib/python3.8/site-packages/xarray/backends/common.py", line 339, in encode\n    variables, attributes = cf_encoder(variables, attributes)\n  File "/data1/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_31869/container_e5041_1655189542545_31869_01_000001/venv/lib/python3.8/site-packages/xarray/conventions.py", line 773, in cf_encoder\n    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}\n  File "/data1/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_31869/container_e5041_1655189542545_31869_01_000001/venv/lib/python3.8/site-packages/xarray/conventions.py", line 773, in <dictcomp>\n    new_vars = {k: encode_cf_variable(v, name=k) for k, v in variables.items()}\n  File "/data1/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_31869/container_e5041_1655189542545_31869_01_000001/venv/lib/python3.8/site-packages/xarray/conventions.py", line 258, in encode_cf_variable\n    var = ensure_dtype_not_object(var, name=name)\n  File "/data1/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_31869/container_e5041_1655189542545_31869_01_000001/venv/lib/python3.8/site-packages/xarray/conventions.py", line 216, in ensure_dtype_not_object\n    data = _copy_with_dtype(data, dtype=_infer_dtype(data, name))\n  File 
"/data1/hadoop/yarn/local/usercache/openeo/appcache/application_1655189542545_31869/container_e5041_1655189542545_31869_01_000001/venv/lib/python3.8/site-packages/xarray/conventions.py", line 174, in _copy_with_dtype\n    result[...] = data\nValueError: setting an array element with a sequence.\n'}]
Traceback (most recent call last):

  Input In [7] in <cell line: 12>
    job=C0.execute_batch(out_format="netCDF",title="C")

  File ~\.conda\envs\openeo\lib\site-packages\openeo\rest\datacube.py:1620 in execute_batch
    return job.run_synchronous(

  File ~\.conda\envs\openeo\lib\site-packages\openeo\rest\job.py:139 in run_synchronous
    self.start_and_wait(

  File ~\.conda\envs\openeo\lib\site-packages\openeo\rest\job.py:222 in start_and_wait
    raise JobFailedException("Batch job {i!r} didn't finish successfully. Status: {s} (after {t}).".format(

JobFailedException: Batch job 'vito-aea46783-1b77-4ce6-b39c-feaa038188ee' didn't finish successfully. Status: error (after 0:14:03).

Thank you for the support

Hi Paolo,
OK, I found out you’re writing an aggregated timeseries, so my previous remark about the netCDF code path was incorrect. It does work with CSV output, by the way. I was not yet able to check whether it works for synchronous calls.
Is there any chance you can provide a smaller example that shows the same error? That would allow me to iterate faster to reproduce and resolve the issue.

thanks,
Jeroen

OK, I was now able to find a much smaller case myself and logged an issue:

I think it occurs when there are NaNs in the result; xarray doesn’t seem to deal with that very well.
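The mechanism can be reproduced without the backend at all. This is my reading of the traceback above, not the actual backend code: xarray infers a float dtype from the first element of an object-dtype array (`_infer_dtype`) and then fails in `_copy_with_dtype` while copying list elements into the float array, giving the `ValueError: setting an array element with a sequence` seen in the logs.

```python
import numpy as np

# Mimic the two xarray conventions steps from the traceback (a sketch,
# assuming an object-dtype array whose elements are Python lists).
data = np.empty(2, dtype=object)
data[0] = [1.0, float("nan")]
data[1] = [3.0, 4.0]

# Step 1: dtype inferred from the first element -> float64
inferred = np.asarray(data[0]).dtype

# Step 2: copy into a freshly allocated float array, as _copy_with_dtype does
result = np.empty(data.shape, dtype=inferred)
try:
    result[...] = data  # each element is a list, not a scalar
    copy_failed = False
except ValueError:
    copy_failed = True
print(copy_failed)
```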

Hi Paolo,
a fix for this issue was committed; it will become available on openeo-dev.vito.be by Monday at the latest.

best regards,
Jeroen

Thank you, Jeroen. Still, I switched to CSV to make it quicker. I cannot yet verify whether it works, because the following error is generated:

[404] CollectionNotFound: Collection 'SENTINEL2_L2A_SENTINELHUB' does not exist. (ref: ad7e7e1f-25e7-478e-aefd-70d627532976)

I think it is just a temporary error, so I’ll keep trying!

Thanks again

Paolo
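Retrying manually can be automated with a small helper. This is a generic sketch, not part of the openeo client; catching bare `Exception` and the delay value are simplifying assumptions, and `C0` in the commented usage is the data cube from earlier in the thread.

```python
import time

def retry(fn, attempts=3, delay=0.1):
    """Call fn(), retrying on any exception up to `attempts` times."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)

# Usage sketch for the thread's job (assumption):
#   job = retry(lambda: C0.execute_batch(out_format="CSV", title="C"))

# Demonstration with a function that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary error")
    return "ok"

print(retry(flaky))
```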

Hi Paolo,
indeed, that was temporary; there seems to be another load-related issue at the moment.
I also noticed that a lot of your batch jobs failed. We’re fixing that, so you can try again.

Can you also complete this procedure to ensure we can properly assign resources to your jobs?

Hi Jeroen,
I tried to follow the procedure, but I ran into an error:


Do you perhaps know the reason?

Hi Paolo,
there’s a migration of the authentication system ongoing, and apparently there are some hiccups.
We’re waiting for EGI to help us out and fix this.

thanks,
Jeroen

Hi Jeroen,
I don’t know why, but since yesterday my batch jobs stay in the “queued” status for several hours and then fail without ever actually executing.

This is the ID of the one from this morning (already queued for 2 hours and 17 minutes):

vito-j-17636960ca5f4115a43a199e86e234e3

I wanted to run the full program one last time to get the final results for my report, but in the last month I have never been able to complete a single run of it.

Do you perhaps know the reason?

Paolo

Hi Paolo,
the error seems to occur in the UDF, see below, does that help to find the issue?

/container_e5046_1656207140061_2506_01_000012/venv/lib/python3.8/site-packages/openeo/udf/run_code.py", line 178, in run_udf_code
    result_cube = func(data.get_datacube_list()[0], data.user_context)
  File "<string>", line 40, in apply_datacube
ValueError: operands could not be broadcast together with shapes (73,256,256) (141,256,256)
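The `ValueError` above is NumPy’s standard broadcasting failure: two arrays whose time dimensions differ (73 vs 141 dates in the log) cannot be combined elementwise. A minimal illustration, with smaller spatial dimensions for brevity:

```python
import numpy as np

# Two cubes with mismatched time dimensions, as in the UDF log above
# (8x8 spatial tiles here instead of 256x256, for illustration).
a = np.zeros((73, 8, 8))
b = np.zeros((141, 8, 8))
try:
    a + b  # elementwise combination requires broadcast-compatible shapes
    shapes_compatible = True
except ValueError:
    shapes_compatible = False
print(shapes_compatible)
```

A length-1 leading dimension would broadcast fine (e.g. shape `(1, 8, 8)` against `(73, 8, 8)`); it is only the 73-vs-141 mismatch that NumPy refuses.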

That seems a bit strange to me, since the error is generated well before the UDF input; moreover, the job never starts to run:

0:15:38 Job ‘vito-j-f4863213ae10451596117d5343fd336a’: queued (progress N/A)

Hi Paolo,
is this last job still marked as queued for you? Because it has run and finished successfully after 10 minutes.

thanks,
Jeroen

Hi Paolo, I believe I found an issue on our side. It should be solved now; let me know if jobs still get stuck in ‘queued’ forever.