ESA_WORLDCOVER_10M_2021_V2 download - Job cancelled runaway job cancelled after PT15M

I am trying to download this data:

datacube = session.load_collection(
    'ESA_WORLDCOVER_10M_2021_V2',
    spatial_extent={"west": 15.594850540000039, "south": 68.35111000000018, "east": 31.065246582000157, "north": 71.18811035200008},
    temporal_extent=["2021-01-01", "2021-12-31"],
)

but the datacube.download() call fails with this error:

OpenEoApiError: [500] Internal: Server error: Job 61742 cancelled runaway job 61742 cancelled after PT15M (ref: r-435fe01688ca49878a20cf28462da78a)

Hi,
the area you request is rather large, so it cannot be retrieved within 15 minutes, which is our limit for synchronous requests via ‘download’.
Can you try using execute_batch instead?
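For example, something along these lines (just a sketch; the output format and filename are placeholders you can change):

datacube = datacube.save_result(format="GTiff")
datacube.execute_batch(outputfile="worldcover_2021.tif")  # creates the job, starts it, polls until done, downloads the result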

thanks,
Jeroen

Hi Jeroen,

Do you mean something like this?

datacube = datacube.save_result(format="GTiff")
job = datacube.create_job()

indeed, but don’t forget to also start that job

For example, see Batch Jobs — openEO Python Client 0.14.0a1 documentation for more information on creating, starting and polling batch jobs.
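In short, something like this (a rough sketch following those docs; the output directory name is just an example):

job = datacube.create_job()
job.start_and_wait()  # start the batch job and poll until it finishes
job.get_results().download_files("worldcover_2021")  # download the result assets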

Thanks @stefaan.lippens for the tip; my job had been created but not started:

job.start_job()

The batch job eventually failed also:

Your openEO batch job failed during Spark execution: 'Job aborted due to stage failure: Task 2 in stage 10.0 failed 4 times, most recent failure: Lost task 2.3 in stage 10.0 (TID 1741) (epod076.vgt.vito.be executor 54): ExecutorLostFailure (executor...'

and before that I can see a lot of Error communicating with MapOutputTracker errors and:
Connection to epod127.vgt.vito.be/192.168.207.227:46385 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.shuffle.io.connectionTimeout if this is wrong.

How can I increase the timeout?

As it happens, we’re actively working on a related issue:

Luckily, there is a simple workaround: increasing the memory. In the ticket mentioned above, I had to use the settings given below:

cube.execute_batch(job_options={"executor-memory": "7G", "executor-memoryOverhead": "2G"})
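Combined with the download step, that would look roughly like this (sketch; the output filename is a placeholder):

cube.execute_batch(
    outputfile="worldcover_2021.tif",
    job_options={"executor-memory": "7G", "executor-memoryOverhead": "2G"},
)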

More background on job options:

I made the requested area smaller (i.e., a single county instead of two) and errors are already showing up in the job log:

  • several Error communicating with MapOutputTracker entries (each with only an ID)

  • a series of Exception occurred while reverting partial writes to file /data2/hadoop/yarn/local/usercache/openeo/appcache/application_1674538064532_18244/blockmgr-756006f5-18d3-4e63-abc1-e8972780fa37/1e/temp_shuffle_c0d920a7-beca-47cb-a931-c9c8cd923e87, null entries, starting about 3 minutes in

Hi,
I see that you now have a job with increased memory, which does seem to be running without issues for now.
Note that our logging can sometimes show error messages while the job still succeeds; this is because we have some built-in resiliency against failures.
I also advise following this small additional registration procedure:

It helps with getting resources on the cluster.

Note that your job is currently taking some time; this is also due to the issue mentioned above, which is under investigation.

best regards,
Jeroen