Error during Spark execution: Java heap space (AGERA5)

Hi all,
I am trying to download one year of AGERA5 data for 30 polygons of size 1280 m × 1280 m. With my script I am already able to download Sentinel-2 and Sentinel-1 data. In the log file of the AGERA5 job, I get the following error: “OpenEO batch job failed: Exception during Spark execution: Java heap space”. From what I understand, this error can be solved by customizing the batch job resources through job_options, and it seems that I should try to increase "driver-memoryOverhead" or "driver-memory". However, I find it difficult to assess how to set these job_options correctly. I had a similar issue with S2, and setting the job_options eventually worked:

    job_options = {
        "executor-memory": "3G",
        "executor-memoryOverhead": "10G", #default 2G
        "executor-cores": 2,
        "task-cpus": 1,
        "executor-request-cores": "400m",
        "max-executors": "100",
        "driver-memory": "12G",
        "driver-memoryOverhead": "10G",
        "driver-cores": 5,
        "udf-dependency-archives":[],
        "logging-threshold": "info"
    }
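
For reference, I pass these options when starting the batch job, roughly like this (the cube variable and output filename are just illustrations from my script):

    # job_options are forwarded when the batch job is started
    job = datacube_s2.execute_batch(
        "s2_output.nc",  # illustrative output filename
        out_format="netCDF",
        job_options=job_options,
    )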

I would like to get a better understanding of how to set these job_options. Has someone already run a batch job with AGERA5 over a large period of data? Is this a problem that I can solve by modifying the job options?

Thanks in advance,

Iris

Hi Iris,
the job options are indeed not the most trivial topic, so it’s a good idea to ask here!
Since the error mentions ‘Java heap space’, I would try increasing executor-memory rather than executor-memoryOverhead. Be careful not to increase it too much, as that affects the overall resources you can acquire, and thus performance.
You can also set executor-cores to 1, which gives each (parallel) task more memory without increasing the overall memory.
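
For example, something along these lines (the exact values are illustrative, you will need to experiment):

    job_options = {
        "executor-memory": "8G",  # the JVM heap, which is what "Java heap space" refers to
        "executor-cores": 1,      # one task per executor, so each task gets the full heap
        # keep your other options as before
    }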

For AGERA5, I recently extracted a year of data for 11,000 polygons covering all of Europe, so in general it should work. I can however be more specific if you can share a batch job id or your script.

best regards,
Jeroen

Thank you very much for your answer. I will try your suggestions. In the meantime, here is a job id: “vito-j-b0f222979ae246cbad4f39450c2a8ca7”
Best regards,
Iris

Eventually, I was able to download the AGERA5 data with the following configuration:

    job_options = {
        "executor-memory": "10G",
        "executor-memoryOverhead": "20G",  # default 2G
        "executor-cores": 1,
        "task-cpus": 1,
        "executor-request-cores": "400m",
        "max-executors": "100",
        "driver-memory": "12G",
        "driver-memoryOverhead": "10G",
        "driver-cores": 5,
        "udf-dependency-archives": [],
        "logging-threshold": "info"
    }

Thank you for your help

Thanks for the info!
This is of course a lot of memory, so I started investigating this issue.

Typical short-term workarounds for this kind of thing are extracting the different bands separately, or using bytes or shorts as the datatype instead of float32.
Requesting the patches in separate jobs that each write a single netCDF may also help, as that code path is a bit more memory-efficient.
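
In sketch form, assuming the openEO Python client (the collection id and band names here are assumptions, check them against your backend’s catalogue):

    # Sketch: one batch job per AGERA5 band instead of one big job
    bands = ["temperature-mean", "precipitation-flux"]  # hypothetical band names
    for band in bands:
        cube = connection.load_collection(
            "AGERA5",  # assumed collection id
            temporal_extent=["2021-01-01", "2021-12-31"],
            bands=[band],
        )
        cube = cube.filter_spatial(polygons)  # the 30 polygon features
        cube.execute_batch(
            f"agera5_{band}.nc",
            out_format="netCDF",
            job_options=job_options,
        )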

Thank you very much !

I was able to reduce the high memory usage:

  • As you suggested, I downloaded the AGERA5 data per band.
  • Then, in my code I was applying resample_cube_spatial to the entire S2 data collection. Now I first apply datacube_s2.filter_spatial before using it in resample_cube_spatial (sketched below).
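
In sketch form (variable names are illustrative; here AGERA5 is resampled onto the Sentinel-2 grid):

    # Before: resampling against the full S2 collection
    # agera5_cube = agera5_cube.resample_cube_spatial(datacube_s2)

    # After: restrict S2 to the polygons first, then resample against the smaller cube
    datacube_s2 = datacube_s2.filter_spatial(polygons)
    agera5_cube = agera5_cube.resample_cube_spatial(datacube_s2)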

Hi,
that’s good news. To solve the original problem, I would have to make the ‘sample_by_features’-to-netCDF functionality a bit more efficient, and that would take some time.
Do let me know if this is somehow high priority, or if you can manage with the workaround.

Another important aspect is that AGERA5 is very low resolution, so you’re probably extracting chunks with a constant value per day. In such cases we often just extract the average value at that location, and construct the full cube aligned to Sentinel-2 only when we actually need it (see the sketch below).
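
In sketch form, that per-location average can be requested with aggregate_spatial (variable names are illustrative):

    # Per-polygon mean timeseries instead of a full resampled cube
    agera5_means = agera5_cube.aggregate_spatial(
        geometries=polygons,  # the 30 polygon features
        reducer="mean",
    )
    # then download or run as a batch job; output format support depends on the backend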

best regards,
Jeroen

Thank you very much for your feedback. I can currently manage with the workaround, so I do not think this is high priority.
Best regards,
Iris