Job splitting

There is a size limitation of 1 million sq km for SentinelHub-powered collections. Therefore, I would like to split up my bigger spatial extent.

Could you please guide me on how to run a bigger job by splitting the datacube processing into UTM-based tiles of 50 km by 50 km?

from datetime import datetime
from dateutil.relativedelta import relativedelta

# 'connection' is assumed to be an already authenticated openEO connection.
start_date           = '2021-06-01'
spatial_extent       = {'west': -74.5, 'east': -73, 'south': 4.5, 'north': 5, 'crs': 'epsg:4326'}  # Colombia

## Get the Sentinel-2 data for a 3 month window.
start_date_dt_object = datetime.strptime(start_date, '%Y-%m-%d')
end_date             = (start_date_dt_object + relativedelta(months=+1)).date()  ## End date, 1 month later (1 July 2021)
start_date_exclusion = (start_date_dt_object + relativedelta(months=-1)).date()  ## Exclusion date, to give a 3 month window.

bands                = ['B02', 'B03', 'B04', 'B08', 'CLP', 'SCL', 'sunAzimuthAngles', 'sunZenithAngles']

s2_cube_scale = connection.load_collection(
    'SENTINEL2_L2A_SENTINELHUB',
    spatial_extent  = spatial_extent,
    temporal_extent = [start_date_exclusion, end_date],
    bands           = bands)


job_options = {
        "tile_grid": "utm-50km",
        "executor-memory": "5G",
        "executor-memoryOverhead": "6G",
        "executor-cores": "4"}

s2_cube_scale_save = s2_cube_scale.save_result(format='netCDF')  # or 'GTiff'
my_job  = s2_cube_scale_save.send_job(title="s2_cube_scale", job_options=job_options)
results = my_job.start_and_wait().get_results()
results.download_files("s2_cube_scale")

Unfortunately, this is not working. I would like to run this for Colombia. Is it possible?

What back-end connection are you using?

The tile_grid job option feature is currently only available on openEO Platform (connection URL openeo.cloud).

More info at "managed job splitting" in the openEO Platform documentation.
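
For reference, a minimal sketch of how such a connection can be set up (assuming you have openEO Platform credentials; the variable name connection matches the snippet above):

import openeo

# Connect to the openEO Platform aggregator and authenticate interactively.
connection = openeo.connect("openeo.cloud").authenticate_oidc()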

It actually works with openeo-dev.vito.be; however, the output seems wrong.

For example:
S2, 2021-05-03, {'west': -74.5, 'east': -73, 'south': 4.5, 'north': 5, 'crs': 'epsg:4326'}
Online searching shows that S2 data should only be present on the left side. The right side (yellow mark) should not have any data for that day (2021-05-03); the S2 satellite did not pass over that area.

The openEO output shows data on the right side (green mark), which I believe is not correct.
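
One way to independently check which Sentinel-2 scenes exist for that date and bounding box is to query a public STAC catalogue (a sketch using pystac-client against the Earth Search endpoint; that endpoint and collection id are examples, not part of openEO):

from pystac_client import Client

# List Sentinel-2 L2A scenes acquired on 2021-05-03 over the Colombian bbox used above.
catalog = Client.open("https://earth-search.aws.element84.com/v1")
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-74.5, 4.5, -73, 5],
    datetime="2021-05-03",
)
for item in search.items():
    print(item.id)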

Please review the notebook; the output is shown there.

Your job indeed finished, but it was not split up internally into smaller sub-jobs based on the utm-50km tile_grid you specified in the job options.

Thanks for letting me know.

Now I can see sub-tasks running on openeo.cloud.

The question is: how can we use your implementations from openeo-dev.vito.be on openeo.cloud when we scale up our job?

I tried your code with a smaller bbox (to save on execution time), but so far I cannot reproduce that problem: data is properly missing in my resulting netCDF for the areas that are not visited.
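
For completeness, this is roughly how such a check can be done on a downloaded result (a sketch; the file name openEO.nc and the band B04 are assumptions, adjust them to your output):

import xarray as xr

# Count valid (non-NaN) pixels per acquisition date for one band;
# dates on which the area was not overpassed should report 0.
ds = xr.open_dataset("s2_cube_scale/openEO.nc")
print(ds["B04"].notnull().sum(dim=["x", "y"]))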

Features available on openeo-dev.vito.be become available on openeo.cloud eventually. What feature are you concerned about in particular?

First of all, the outputs are multiple netCDF files, one per grid tile. Is it possible to composite all files as part of postprocessing when all sub-jobs are finished? It would be better for a user to get the final product as a single netCDF file.
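In the meantime, a possible client-side workaround is to stitch the per-tile netCDFs yourself (a sketch, assuming the tiles share the same CRS and grid and were downloaded to the s2_cube_scale folder; the glob pattern is an assumption):

import glob
import xarray as xr

# Open every per-tile netCDF and merge them on their shared x/y/time coordinates.
tiles = [xr.open_dataset(path) for path in sorted(glob.glob("s2_cube_scale/*.nc"))]
merged = xr.combine_by_coords(tiles, combine_attrs="override")
merged.to_netcdf("s2_cube_scale_merged.nc")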

Secondly, I am getting the same gap problem as mentioned before. Can you try to run the same spatial and temporal extent?
This is the output for 2021-05-03 when all sub-files are loaded:

That's indeed a feature that we are considering. However, there are reasons to prefer separate files per sub-job:

  • files are smaller, allowing you to download a subset for inspection
  • you can inspect partial results of finished sub-jobs while other sub-jobs are still running
  • there is less confusion about the final result when some sub-jobs fail and the end result is incomplete

Those are really good reasons. However, if a user does not need any inspection because they are confident about the correctness of the results, it would be nice to have this option. So what about leaving this decision to the user?

Maybe it would be reasonable to have the default option as "False", but if the user's aim is postprocessing, then it could be set to "True" as part of the job_options specification.

For example:

job_options = {
        "tile_grid": "utm-50km",
        "postprocessing": "True"}

Indeed, something like that.
The feature request ticket is here: Option to stitch results of partitioned jobs · Issue #75 · Open-EO/openeo-aggregator · GitHub

Super!
What about my second question?