Your openEO batch job failed during Spark execution: 'Job aborted due to stage failure: Task 2 in stage 10.0 failed 4 times, most recent failure: Lost task 2.3 in stage 10.0 (TID 1741) (epod076.vgt.vito.be executor 54): ExecutorLostFailure (executor...'
Before that, I can see many `Error communicating with MapOutputTracker` errors, as well as: `Connection to epod127.vgt.vito.be/192.168.207.227:46385 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.shuffle.io.connectionTimeout if this is wrong.`
I made the requested area smaller (a single county instead of two), and errors are already showing up in the job log:
- several `Error communicating with MapOutputTracker` (with only an ID)
- a series of `Exception occurred while reverting partial writes to file /data2/hadoop/yarn/local/usercache/openeo/appcache/application_1674538064532_18244/blockmgr-756006f5-18d3-4e63-abc1-e8972780fa37/1e/temp_shuffle_c0d920a7-beca-47cb-a931-c9c8cd923e87, null`, starting after about 3 min
I see that you now have a job with increased memory, which does seem to be running without issues for now.
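For future runs, memory can also be raised from the client side via batch job options. A minimal sketch, assuming the openeo Python client; the option names (`executor-memory`, `executor-memoryOverhead`) are VITO/Terrascope-style backend options and may differ on other backends:

```python
# Sketch: requesting more executor memory for an openEO batch job.
# The option names below are backend-specific (VITO/Terrascope-style)
# assumptions; check your backend's documentation for the exact keys.
job_options = {
    "executor-memory": "4G",          # JVM heap per Spark executor
    "executor-memoryOverhead": "2G",  # off-heap headroom, e.g. for shuffle buffers
}

# Usage (requires a live connection and data cube, shown as comments):
# import openeo
# connection = openeo.connect("openeo.vito.be").authenticate_oidc()
# cube = connection.load_collection("SENTINEL2_L2A", ...)
# job = cube.execute_batch(job_options=job_options)
```

Shuffle-heavy failures like the `temp_shuffle_*` errors above are often eased more by extra `memoryOverhead` than by extra heap, since shuffle buffers live off-heap.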
Note that our logging can sometimes show error messages even though the job still succeeds. This is because we have some built-in resiliency against transient failures.
I also advise following this small additional registration procedure:
it helps with getting resources on the cluster.
Note that your job is currently taking some time; this is also due to the issue mentioned above, which is under investigation.