Memory - overhead problem

sulova.andrea · 1 July 2022 15:13

Hey all,

Here are example IDs of batch jobs that failed multiple times:

j-03fb0ef385fd4e0abd809c740416db7c
j-fd9d5d47599541999e73456928f4e353

The local error message: **JobFailedException**: Batch job 'j-03fb0ef385fd4e0abd809c740416db7c' didn't finish successfully. Status: error (after 3:11:54).

and the Web editor shows this error information: Your batch job failed because workers used too much Python memory. The same task was attempted multiple times. Consider increasing executor-memoryOverhead or contact the developers to investigate.

So I believe this is purely a memory problem. The workflow was tested successfully using synchronous download for a smaller spatial and temporal extent. Could you please confirm that the failure is related only to memory?
It is worth noting that I ran exactly the same batch job (j-ef42883d631f4784812fc9116b5e6a86) without one extra step and it finished without problems. The extra step added at the end of the failing workflow is: cube_threshold = s2_cube.mask(s2_cube.apply(lambda x: lt(x,0.75))). Is this process problematic?

jeroen.dries · 4 July 2022 06:09

Hi Andrea,

there’s indeed a more efficient approach to filter on values within the same datacube. The idea is to use the ‘if-else’ openeo process, and a callback on the bands of the datacube. The snippet below tries to illustrate it:

from openeo.processes import if_
s2_cube_a = s2_cube.apply_dimension(dimension="bands", process=lambda x: if_(x < 0.75, x))
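
To make the effect of the callback concrete, here is a plain-Python sketch of what the openEO `if` process does for a single value: it returns the accepted value when the condition holds and no-data (null) otherwise. The names `if_else` and `keep_below` are made up for this illustration and are not part of the openeo client. Note that `if_(x < 0.75, x)` keeps the values below 0.75, so the comparison should be flipped if the opposite selection is wanted.

```python
def if_else(condition, accept, reject=None):
    """Plain-Python analogue of the openEO `if` process for one value.

    Returns `accept` when the condition is true, otherwise `reject`
    (None stands in for openEO's null / no-data).
    """
    return accept if condition else reject


def keep_below(values, threshold=0.75):
    """Keep values strictly below the threshold, replace the rest with None."""
    return [if_else(v < threshold, v) for v in values]


pixels = [0.10, 0.74, 0.75, 0.90]
print(keep_below(pixels))
```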

The other option is to configure memory, as your job is indeed reaching a level of complexity where the default settings may not be sufficient. The example below shows how to do that. Your job was failing on too little “executor-memory”, so that would be the setting to increase gradually. Of course, increasing memory also increases the cost of the job.

job_options = {
        "executor-memory": "3G",
        "executor-memoryOverhead": "4G",
        "executor-cores": "2"
}
cube.execute_batch(out_format="GTiff", job_options=job_options)

sulova.andrea · 4 July 2022 08:13

Thanks Jeroen, it works fine with ‘if-else’.

sulova.andrea · 5 July 2022 08:39

Hey again,
I can see the same memory problem in other jobs, for example the calculation of NDVI and NDWI for a couple of scenes. Is there a more efficient way to calculate this? The current approach:

s2_cube = append_index(s2_cube,"NDWI") ## index 8
s2_cube = append_index(s2_cube,"NDVI") ## index 9

You can check jobs based on these IDs:

j-f134e2de601942cba7ace33ceccb3689
j-ad7c818280b34b5b9b5e7fb7c1372e8d

jeroen.dries · 6 July 2022 07:03

Hi Andrea,

it’s already quite efficient. One trick we can add is to change the data type to something smaller. I’m guessing you now get float32 output, and we can scale this for instance to shorts. If your current range is between 0 and 1, you could do:
cube.linear_scale_range(0.0,1.0,0,10000)

It is however not guaranteed to work, so simply increasing memory is also an option. You can try the settings below to add 1GB to each worker:

job_options = {
        "executor-memory": "3G",
        "executor-memoryOverhead": "3G",
        "executor-cores": "2"
}
cube.execute_batch(out_format="GTiff", job_options=job_options)
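
For the rescaling idea mentioned above, this is a small sketch of the arithmetic behind the openEO `linear_scale_range` process, written in plain Python for one value. The function name `scale_value` is made up for this illustration. As far as the process definition goes, inputs are first clipped to the input range, so with an input range of 0 to 1 any negative index values would end up as 0.

```python
def scale_value(x, input_min, input_max, output_min=0.0, output_max=1.0):
    """Linearly map x from [input_min, input_max] to [output_min, output_max].

    Values outside the input range are clipped to it first, so the result
    never leaves the output range.
    """
    clipped = min(max(x, input_min), input_max)
    fraction = (clipped - input_min) / (input_max - input_min)
    return fraction * (output_max - output_min) + output_min


# Reflectance-like values in [0, 1] mapped to 0..10000, which fits in a 16-bit integer.
for value in (-0.2, 0.0, 0.25, 1.0, 1.3):
    print(value, "->", scale_value(value, 0.0, 1.0, 0, 10000))
```

Storing the result as 16-bit integers instead of float32 halves the bytes per value, which is where the memory saving would come from.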

sulova.andrea · 13 July 2022 09:23

Hey Jeroen,

Could you please advise me whether job_options is applied correctly in send_job:

job_options = {
        "executor-memory": "3G",
        "executor-memoryOverhead": "3G",
        "executor-cores": "2"}

s2_cube_save = s2_cube_swf.save_result(format='netCDF') #GTiff #netCDF
my_job  = s2_cube_save.send_job(title="s2_cube", job_options=job_options)
results = my_job.start_and_wait().get_results()
results.download_files("s2_cube")

jeroen.dries · 14 July 2022 10:54

Hi Andrea,
yes, this looks correct. You can increase memory values if jobs still give a memory related error.
If you have a job id, I can always check deeper if the settings are really picked up.

sulova.andrea · 15 August 2022 09:09

I have tried to scale it up to a bigger spatial extent, but it does not work:

from datetime import datetime
from dateutil.relativedelta import relativedelta

start_date           = '2021-06-01'
spatial_extent  = {'west': -74, 'east': -73, 'south': 4, 'north': 5, 'crs': 'epsg:4326'} #colombia

## Get the Sentinel-2 data for a 3 month window.
start_date_dt_object = datetime.strptime(start_date, '%Y-%m-%d')
end_date             = (start_date_dt_object + relativedelta(months = +1)).date() ## End date, 1 month later (1 July 2021)
start_date_exclusion = (start_date_dt_object + relativedelta(months = -2)).date() ## Exclusion date, 2 months earlier (1 April 2021), to give a 3 month window.

bands                = ['B02', 'B03', 'B04', 'B08', 'CLP', 'SCL', 'sunAzimuthAngles', 'sunZenithAngles']

s2_cube_scale = connection.load_collection(
    'SENTINEL2_L2A_SENTINELHUB',
    spatial_extent  = spatial_extent,
    temporal_extent = [start_date_exclusion, end_date],
    bands           = bands)

job_options = {
        "executor-memory": "5G",
        "executor-memoryOverhead": "5G",
        "executor-cores": "3"}

s2_cube_scale_save = s2_cube_scale.save_result(format='netCDF') #GTiff #netCDF
my_job  = s2_cube_scale_save.send_job(title="s2_cube_scale", job_options=job_options)
results = my_job.start_and_wait().get_results()
results.download_files("s2_cube_scale")

This was the error message:
Traceback (most recent call last):
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/backend.py", line 1727, in get_log_entries
    with (self.get_job_output_dir(job_id) / "log").open('r') as f:
  File "/usr/lib64/python3.8/pathlib.py", line 1221, in open
    return io.open(self, mode, buffering, encoding, errors, newline,
  File "/usr/lib64/python3.8/pathlib.py", line 1077, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/projects/OpenEO/j-1bf5a22fd5674494b1e295c3a9654d16/log'

sulova.andrea · 15 August 2022 09:27

I have also tried a somewhat smaller area with more memory, but without success. Can you advise me on what should be improved?

from datetime import datetime
from dateutil.relativedelta import relativedelta

start_date           = '2021-06-01'
spatial_extent  = {'west': -74, 'east': -73.5, 'south': 4.5, 'north': 5, 'crs': 'epsg:4326'} #colombia

## Get the Sentinel-2 data for a 3 month window.
start_date_dt_object = datetime.strptime(start_date, '%Y-%m-%d')
end_date             = (start_date_dt_object + relativedelta(months = +1)).date() ## End date, 1 month later (1 July 2021)
start_date_exclusion = (start_date_dt_object + relativedelta(months = -2)).date() ## Exclusion date, 2 months earlier (1 April 2021), to give a 3 month window.

bands                = ['B02', 'B03', 'B04', 'B08', 'CLP', 'SCL', 'sunAzimuthAngles', 'sunZenithAngles']

s2_cube_scale = connection.load_collection(
    'SENTINEL2_L2A_SENTINELHUB',
    spatial_extent  = spatial_extent,
    temporal_extent = [start_date_exclusion, end_date],
    bands           = bands)

job_options = {
        "executor-memory": "100G",
        "executor-memoryOverhead": "100G",
        "executor-cores": "4"}

s2_cube_scale_save = s2_cube_scale.save_result(format='netCDF') #GTiff #netCDF
my_job  = s2_cube_scale_save.send_job(title="s2_cube_scale", job_options=job_options)
results = my_job.start_and_wait().get_results()
results.download_files("s2_cube_scale")

sulova.andrea · 15 August 2022 09:51

I have also tried adding s2_cube_scale = s2_cube_scale.linear_scale_range(0.0,1.0,0,10000), but still without success.

Printing logs:
[{'id': 'error', 'level': 'error', 'message': 'Traceback (most recent call last):\n  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/backend.py", line 1727, in get_log_entries\n    with (self.get_job_output_dir(job_id) / "log").open(\'r\') as f:\n  File "/usr/lib64/python3.8/pathlib.py", line 1221, in open\n    return io.open(self, mode, buffering, encoding, errors, newline,\n  File "/usr/lib64/python3.8/pathlib.py", line 1077, in _opener\n    return self._accessor.open(self, flags, mode)\nFileNotFoundError: [Errno 2] No such file or directory: \'/data/projects/OpenEO/j-3c44cd000ba648a8b6350a0765d8e48c/log\'\n'}]

sulova.andrea · 18 August 2022 12:15

@stefaan.lippens if you have time, could you please have a look at this?

stefaan.lippens · 18 August 2022 13:23

From our internal logs I see this happened on Mon 15 Aug around 11/12 AM. I remember we had “full disk” problems earlier this week. Do you still get the same error if you rerun now, or something else?
Can you also share the job id of your latest attempt?

sulova.andrea · 18 August 2022 18:39

The same error as before:

FileNotFoundError: [Errno 2] No such file or directory: '/data/projects/OpenEO/j-9d9a934535844b4eace8926e06307e18/log'

stefaan.lippens · 18 August 2022 20:48

for j-9d9a934535844b4eace8926e06307e18, I find this in the logs:

Exception in thread "main" java.lang.IllegalArgumentException:
Required executor memory (102400 MB), offHeap memory (0) MB, overhead (102400 MB), and PySpark memory (0 MB)
is above the max threshold (52224 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.

apparently these job options are way too high

We should improve the error message for this (see issue #203, "Better error for user when requesting too much batch job memory or cpu", in Open-EO/openeo-geopyspark-driver on GitHub).
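
Until that error message is improved, a rough client-side sanity check is possible. The sketch below adds up "executor-memory" and "executor-memoryOverhead" and compares the sum against the 52224 MB threshold quoted in the log above. The helper names are made up for this illustration, the threshold is specific to that cluster and may change, and treating "G" as 1024 MB is an assumption based on 100G showing up as 102400 MB in the log. Off-heap and PySpark memory were 0 in that log and are ignored here.

```python
import re

# Assumption: "G" means 1024 MB, matching 100G -> 102400 MB in the log above.
_UNIT_TO_MB = {"M": 1, "G": 1024}

# Threshold taken from the log above; specific to that cluster at that time.
MAX_CONTAINER_MB = 52224


def to_mb(size):
    """Convert a size string such as '3G' or '512M' to megabytes."""
    match = re.fullmatch(r"(\d+)\s*([MG])B?", size.strip(), flags=re.IGNORECASE)
    if not match:
        raise ValueError(f"Unsupported memory size: {size!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_TO_MB[unit.upper()]


def requested_memory_mb(job_options):
    """Sum of executor memory and memory overhead, in MB."""
    return to_mb(job_options["executor-memory"]) + to_mb(job_options["executor-memoryOverhead"])


def fits_cluster(job_options, limit_mb=MAX_CONTAINER_MB):
    """True when the per-executor request stays within the threshold."""
    return requested_memory_mb(job_options) <= limit_mb


too_big = {"executor-memory": "100G", "executor-memoryOverhead": "100G", "executor-cores": "4"}
modest = {"executor-memory": "5G", "executor-memoryOverhead": "5G", "executor-cores": "3"}
print(requested_memory_mb(too_big), fits_cluster(too_big))
print(requested_memory_mb(modest), fits_cluster(modest))
```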

sulova.andrea · 19 August 2022 09:07

Will try that
A general question: I was running a few scripts simultaneously yesterday, but now I am getting this message: **OpenEoApiError**: [429] unknown: max connections reached: 1. Does it mean that I can run only one script at a time?

sulova.andrea · 22 August 2022 07:04

The job options were changed to smaller values and the job processed the data after ca. 6h.

I am wondering what the most appropriate values in job_options would be if I want to run a 1 deg x 1 deg extent:
spatial_extent = {'west': -74, 'east': -73, 'south': 4, 'north': 5, 'crs': 'epsg:4326'}?

sulova.andrea · 22 August 2022 08:10

Additionally, I would like to obtain satellite data for a country using a vector layer (.shp, .geojson) and then process that spatial extent further. Is it possible to pass a GeoJSON/shapefile as spatial_extent, or does it always need to be in this format: spatial_extent = {'west': bbox[0], 'east': bbox[2], 'south': bbox[1], 'north': bbox[3], 'crs': 4326}?
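
In case it helps others with the same question, here is a minimal sketch of one way to derive such a bounding-box dictionary from GeoJSON using only the standard library. The function names are made up for this illustration. It assumes the GeoJSON uses longitude/latitude in WGS84, as the GeoJSON standard prescribes, and it only yields the enclosing box, not the exact country outline; whether a geometry can be passed directly depends on the client and back-end.

```python
def _geometries(obj):
    """Yield every geometry inside a GeoJSON object."""
    kind = obj["type"]
    if kind == "FeatureCollection":
        for feature in obj["features"]:
            yield from _geometries(feature)
    elif kind == "Feature":
        yield from _geometries(obj["geometry"])
    elif kind == "GeometryCollection":
        for geometry in obj["geometries"]:
            yield from _geometries(geometry)
    else:
        yield obj


def _positions(coords):
    """Yield (lon, lat) pairs from arbitrarily nested coordinate arrays."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from _positions(part)


def geojson_to_spatial_extent(obj):
    """Bounding box of a GeoJSON object as a west/east/south/north dict."""
    lons, lats = [], []
    for geometry in _geometries(obj):
        for lon, lat in _positions(geometry["coordinates"]):
            lons.append(lon)
            lats.append(lat)
    return {"west": min(lons), "east": max(lons),
            "south": min(lats), "north": max(lats), "crs": "epsg:4326"}


# Example feature: the 1 deg x 1 deg test box used earlier in this thread.
feature = {
    "type": "Feature",
    "properties": {},
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[-74, 4], [-73, 4], [-73, 5], [-74, 5], [-74, 4]]],
    },
}
print(geojson_to_spatial_extent(feature))
```

A file on disk would first be read with `json.load`; a shapefile would need converting to GeoJSON beforehand.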

sulova.andrea · 22 August 2022 11:10

Could you please confirm that there is a size limitation?

This message was returned after requesting S2 scenes for the Colombia site:

OpenEoApiError: [400] Internal: Requested area 48653514887301.67 m² for collection SENTINEL2_L2A_SENTINELHUB exceeds maximum of 1000000000000.0 m². (ref: r-6bd4565ad8c14288a33570bcd8d3dc87)
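
A quick way to see whether a bounding box is likely to stay under the limit quoted in this message is to estimate its area before submitting. The sketch below uses a spherical-Earth approximation; the back-end may compute the area differently, so treat the result as indicative only. The function name is made up for this illustration, and the limit constant is simply copied from the error message.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation

# Limit copied from the error message above.
MAX_AREA_M2 = 1_000_000_000_000.0


def bbox_area_m2(west, south, east, north):
    """Approximate area of a lon/lat bounding box on a sphere, in square metres."""
    lon_span = math.radians(east - west)
    lat_term = math.sin(math.radians(north)) - math.sin(math.radians(south))
    return EARTH_RADIUS_M ** 2 * lon_span * lat_term


# The 1 deg x 1 deg test box used earlier in this thread.
area = bbox_area_m2(west=-74, south=4, east=-73, north=5)
print(f"{area:.3e} m2, within limit: {area <= MAX_AREA_M2}")
```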

stefaan.lippens · 22 August 2022 12:00

No, it should be possible to run multiple scripts/notebooks in parallel.
However, depending on your usage patterns you might hit a “rate limiting” threshold like that.
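
If the 429 response turns out to be transient, one common client-side pattern is to retry with an increasing delay. The sketch below is generic: `RateLimited` and `call_with_backoff` are made up for this illustration and are not part of the openeo client. In a real script one would catch the client's `OpenEoApiError` and check that the status is 429 before retrying. If the message reflects a hard cap on simultaneous connections, retrying only helps once another connection has been released.

```python
import time


class RateLimited(Exception):
    """Stand-in for an HTTP 429 response (hypothetical, for illustration)."""


def call_with_backoff(func, retries=5, base_delay=2.0, sleep=time.sleep):
    """Call func(), retrying on RateLimited with exponentially growing delays.

    Delays are base_delay, 2 * base_delay, 4 * base_delay, ... seconds.
    The last failure is re-raised once the retries are used up.
    """
    for attempt in range(retries):
        try:
            return func()
        except RateLimited:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists only so the waiting can be replaced in tests; normal use would leave it at its default.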

stefaan.lippens · 22 August 2022 12:02

The answer to that is “it depends” unfortunately.
It depends on how many collections and bands you load in the AOI, whether you use some heavier operations (e.g. UDFs), etc.
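
To get a feeling for the orders of magnitude involved, here is a back-of-the-envelope sketch of the uncompressed data volume of a cube. It is not how the back-end sizes its workers. The function name and all input numbers are assumptions for illustration: roughly 111 km per degree near the equator, a 10 m pixel size for every band, 8 bands as in the band list used earlier in this thread, and a guessed 18 observations over three months.

```python
def estimate_cube_bytes(area_m2, resolution_m, bands, timesteps, bytes_per_value):
    """Rough uncompressed size of a data cube in bytes."""
    pixels = area_m2 / (resolution_m ** 2)
    return pixels * bands * timesteps * bytes_per_value


# Assumed inputs: a 1 deg x 1 deg box near the equator, about 111 km x 111 km.
area_m2 = 111_000 * 111_000
as_float32 = estimate_cube_bytes(area_m2, resolution_m=10, bands=8, timesteps=18, bytes_per_value=4)
as_int16 = estimate_cube_bytes(area_m2, resolution_m=10, bands=8, timesteps=18, bytes_per_value=2)
print(f"float32: {as_float32 / 1024**3:.1f} GiB, int16: {as_int16 / 1024**3:.1f} GiB")
```

The total is spread over many executors and never has to fit in one worker at once, but it shows why the number of bands, the time range and the data type all matter.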