ESA_WORLDCOVER_10M_2021_V2 download - Job cancelled runaway job cancelled after PT15M

I am trying to download this data:

datacube = session.load_collection(
    'ESA_WORLDCOVER_10M_2021_V2',
    spatial_extent={"west": 15.594850540000039, "south": 68.35111000000018, "east": 31.065246582000157, "north": 71.18811035200008},
    temporal_extent=["2021-01-01", "2021-12-31"],
)

but the datacube.download() call fails with this error:

OpenEoApiError: [500] Internal: Server error: Job 61742 cancelled runaway job 61742 cancelled after PT15M (ref: r-435fe01688ca49878a20cf28462da78a)

Hi,
the area you request is rather large, so it cannot be retrieved within 15 minutes, which is our limit for synchronous requests via ‘download’.
Can you try using execute_batch instead?
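For example, something along these lines (just a sketch; the output format and filename are placeholders you can change):

datacube = datacube.save_result(format="GTiff")
datacube.execute_batch(outputfile="worldcover_2021.tif")  # creates the job, starts it, polls until done, downloads the result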

thanks,
Jeroen

Hi Jeroen,

Do you mean something like this?

datacube = datacube.save_result(format="GTiff")
job = datacube.create_job()

indeed, but don’t forget to also start that job

For example, see Batch Jobs — openEO Python Client 0.14.0a1 documentation for more information on creating, starting and polling batch jobs.
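In short, something like this (a rough sketch following those docs; the output directory name is just an example):

job = datacube.create_job()
job.start_and_wait()  # start the batch job and poll until it finishes
job.get_results().download_files("worldcover_2021")  # download the result assets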

Thanks @stefaan.lippens for the tip; my job had been created but not started:

job.start_job()

The batch job eventually failed also:

Your openEO batch job failed during Spark execution: 'Job aborted due to stage failure: Task 2 in stage 10.0 failed 4 times, most recent failure: Lost task 2.3 in stage 10.0 (TID 1741) (epod076.vgt.vito.be executor 54): ExecutorLostFailure (executor...'

and before that I can see a lot of Error communicating with MapOutputTracker errors and:
Connection to epod127.vgt.vito.be/192.168.207.227:46385 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.shuffle.io.connectionTimeout if this is wrong.

How can I increase the timeout?

As it happens, we’re actively working on a related issue:

Luckily, there is a simple workaround: increasing the memory. In the ticket mentioned above, I had to use the settings given below:

cube.execute_batch(job_options={"executor-memory": "7G", "executor-memoryOverhead": "2G"})
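Combined with the download step, that would look roughly like this (sketch; the output filename is a placeholder):

cube.execute_batch(
    outputfile="worldcover_2021.tif",
    job_options={"executor-memory": "7G", "executor-memoryOverhead": "2G"},
)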

More background on job options:

I made the requested area smaller (i.e., a single county instead of two) and errors are already showing up in the job log:

  • several Error communicating with MapOutputTracker entries (each with only an ID)

  • a series of Exception occurred while reverting partial writes to file /data2/hadoop/yarn/local/usercache/openeo/appcache/application_1674538064532_18244/blockmgr-756006f5-18d3-4e63-abc1-e8972780fa37/1e/temp_shuffle_c0d920a7-beca-47cb-a931-c9c8cd923e87, null entries, starting about 3 minutes in

Hi,
I see that you now have a job with increased memory, which does seem to be running without issues for now.
Note that our logging can sometimes show error messages while the job still succeeds; this is because we have some built-in resiliency against failures.
I also advise following this small additional registration procedure:

It helps with getting resources on the cluster.

Note that your job is currently taking some time; this is also due to the issue mentioned above, which is under investigation.

best regards,
Jeroen