Error during Spark execution: Java heap space (AGERA5)

Hi all,
I am trying to download one year of AGERA5 data for 30 polygons of size 1280 m × 1280 m. With my script I am already able to download Sentinel-2 and Sentinel-1 data. In the log file of the AGERA5 job, I get the following error: “OpenEO batch job failed: Exception during Spark execution: Java heap space”. From what I understand, this error can be solved by customizing the batch job resources through job_options, and it seems that I should try to increase "driver-memoryOverhead" or "driver-memory". However, I find it difficult to assess how to set these job_options correctly. I had a similar issue with S2, and setting the job_options eventually worked:

    job_options = {
        "executor-memory": "3G",
        "executor-memoryOverhead": "10G", #default 2G
        "executor-cores": 2,
        "task-cpus": 1,
        "executor-request-cores": "400m",
        "max-executors": "100",
        "driver-memory": "12G",
        "driver-memoryOverhead": "10G",
        "driver-cores": 5,
        "udf-dependency-archives":[],
        "logging-threshold": "info"
    }
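
For reference, I pass these options when starting the batch job, roughly like this (the cube variable and output filename are just illustrations from my script):

    # job_options are forwarded when the batch job is started
    job = datacube_s2.execute_batch(
        "s2_output.nc",  # illustrative output filename
        out_format="netCDF",
        job_options=job_options,
    )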

I would like to get a better understanding of how to set these job_options. Has someone already run a batch job with AGERA5 over a large period of data? Is this a problem that I can solve by modifying the job options?

Thanks in advance,

Iris

Hi Iris,
the job options are indeed not the most trivial topic, so it’s a good idea to ask here!
Since the error mentions ‘Java heap space’, I would try increasing executor-memory rather than executor-memoryOverhead. Be careful not to increase it too much, as that affects the overall resources you can acquire, and thus performance.
You can also set executor-cores to 1, which gives each (parallel) task more memory without increasing the overall memory.
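
For example, something along these lines (the exact values are illustrative, you will need to experiment):

    job_options = {
        "executor-memory": "8G",  # the JVM heap, which is what "Java heap space" refers to
        "executor-cores": 1,      # one task per executor, so each task gets the full heap
        # keep your other options as before
    }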

For AGERA5, I recently extracted a year of data for 11,000 polygons covering all of Europe, so in general it should work. I can however be more specific if you can share a batch job id or your script.

best regards,
Jeroen

Thank you very much for your answer. I will try your suggestions. In the meantime, here is a job id: “vito-j-b0f222979ae246cbad4f39450c2a8ca7”
Best regards,
Iris

Eventually, I was able to download the AGERA5 data with the following configuration:

    job_options = {
        "executor-memory": "10G",
        "executor-memoryOverhead": "20G",  # default 2G
        "executor-cores": 1,
        "task-cpus": 1,
        "executor-request-cores": "400m",
        "max-executors": "100",
        "driver-memory": "12G",
        "driver-memoryOverhead": "10G",
        "driver-cores": 5,
        "udf-dependency-archives": [],
        "logging-threshold": "info"
    }

Thank you for your help

Thanks for the info!
This is of course a lot of memory, so I started investigating this issue.

Typical short-term workarounds for this kind of thing are extracting the different bands separately, or using bytes or shorts as the datatype instead of float32.
Requesting the patches in separate jobs that each write a single netCDF may also help, as that code path is a bit more memory-efficient.
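
In sketch form, assuming the openEO Python client (the collection id and band names here are assumptions, check them against your backend’s catalogue):

    # Sketch: one batch job per AGERA5 band instead of one big job
    bands = ["temperature-mean", "precipitation-flux"]  # hypothetical band names
    for band in bands:
        cube = connection.load_collection(
            "AGERA5",  # assumed collection id
            temporal_extent=["2021-01-01", "2021-12-31"],
            bands=[band],
        )
        cube = cube.filter_spatial(polygons)  # the 30 polygon features
        cube.execute_batch(
            f"agera5_{band}.nc",
            out_format="netCDF",
            job_options=job_options,
        )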

Thank you very much !

I was able to reduce the high memory usage:

  • As you suggested, I downloaded the AGERA5 data per band.
  • Then, in my code I was applying resample_cube_spatial to the entire S2 data collection. Now I first apply datacube_s2.filter_spatial before using it in resample_cube_spatial (sketched below).
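
In sketch form (variable names are illustrative; here AGERA5 is resampled onto the Sentinel-2 grid):

    # Before: resampling against the full S2 collection
    # agera5_cube = agera5_cube.resample_cube_spatial(datacube_s2)

    # After: restrict S2 to the polygons first, then resample against the smaller cube
    datacube_s2 = datacube_s2.filter_spatial(polygons)
    agera5_cube = agera5_cube.resample_cube_spatial(datacube_s2)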

Hi,
that’s good news. To solve the original problem, I would have to make the ‘sample_by_features’-to-netCDF functionality a bit more efficient, and that would take some time.
Do let me know if this is somehow high priority, or if you can manage with the workaround.

Another important aspect is that AGERA5 is very low resolution, so you’re probably extracting chunks with a constant value per day. In such cases we often just extract the average value at that location, and construct the full cube aligned to Sentinel-2 only when we actually need it (see the sketch below).
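
In sketch form, that per-location average can be requested with aggregate_spatial (variable names are illustrative):

    # Per-polygon mean timeseries instead of a full resampled cube
    agera5_means = agera5_cube.aggregate_spatial(
        geometries=polygons,  # the 30 polygon features
        reducer="mean",
    )
    # then download or run as a batch job; output format support depends on the backend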

best regards,
Jeroen

Thank you very much for your feedback. I can currently manage with the workaround, so I do not think this is high priority.
Best regards,
Iris