Job splitting

There is a size limitation of 1 million sq km for SentinelHub-powered collections. Therefore, I would like to split up my bigger spatial extent.

Could you please guide me on how to run a bigger job by splitting the datacube processing into UTM-based tiles of 50 km by 50 km?

from datetime import datetime
from dateutil.relativedelta import relativedelta

# 'connection' is assumed to be an already authenticated openEO connection.
start_date           = '2021-06-01'
spatial_extent       = {'west': -74.5, 'east': -73, 'south': 4.5, 'north': 5, 'crs': 'epsg:4326'}  # Colombia

## Get the Sentinel-2 data for a 3 month window.
start_date_dt_object = datetime.strptime(start_date, '%Y-%m-%d')
end_date             = (start_date_dt_object + relativedelta(months=+1)).date()  ## End date, 1 month later (1 July 2021)
start_date_exclusion = (start_date_dt_object + relativedelta(months=-1)).date()  ## Exclusion date, to give a 3 month window.

bands                = ['B02', 'B03', 'B04', 'B08', 'CLP', 'SCL', 'sunAzimuthAngles', 'sunZenithAngles']

s2_cube_scale = connection.load_collection(
    'SENTINEL2_L2A_SENTINELHUB',
    spatial_extent  = spatial_extent,
    temporal_extent = [start_date_exclusion, end_date],
    bands           = bands)


job_options = {
        "tile_grid": "utm-50km",
        "executor-memory": "5G",
        "executor-memoryOverhead": "6G",
        "executor-cores": "4"}

s2_cube_scale_save = s2_cube_scale.save_result(format='netCDF')  # or 'GTiff'
my_job  = s2_cube_scale_save.send_job(title="s2_cube_scale", job_options=job_options)
results = my_job.start_and_wait().get_results()
results.download_files("s2_cube_scale")

Unfortunately, this is not working. I would like to run this for Colombia. Is it possible?

What back-end connection are you using?

The tile_grid job option feature is currently only available on openEO Platform (connection URL openeo.cloud).

More info at "managed job splitting" in the openEO Platform documentation.
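
For reference, a minimal sketch of how such a connection can be set up (assuming you have openEO Platform credentials; the variable name connection matches the snippet above):

import openeo

# Connect to the openEO Platform aggregator and authenticate interactively.
connection = openeo.connect("openeo.cloud").authenticate_oidc()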

It actually works with openeo-dev.vito.be; however, the output seems wrong.

For example:
S2, 2021-05-03, {'west': -74.5, 'east': -73, 'south': 4.5, 'north': 5, 'crs': 'epsg:4326'}
Online searching shows that S2 data should only be present on the left side. The right side (yellow mark) should not have any data for that day (2021-05-03); the S2 satellite did not pass over that area.

The openEO output shows data on the right side (green mark), which I believe is not correct.
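
One way to independently check which Sentinel-2 scenes exist for that date and bounding box is to query a public STAC catalogue (a sketch using pystac-client against the Earth Search endpoint; that endpoint and collection id are examples, not part of openEO):

from pystac_client import Client

# List Sentinel-2 L2A scenes acquired on 2021-05-03 over the Colombian bbox used above.
catalog = Client.open("https://earth-search.aws.element84.com/v1")
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-74.5, 4.5, -73, 5],
    datetime="2021-05-03",
)
for item in search.items():
    print(item.id)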

Please review the notebook; the output is shown there.

Your job indeed finished, but it was not split up internally into smaller sub-jobs based on the utm-50km tile_grid you specified in the job options.

Thanks for letting me know.

Now I can see sub-tasks running on openeo.cloud.

The question is: how can we use your implementations from openeo-dev.vito.be on openeo.cloud when we scale up our job?

I tried your code with a smaller bbox (to save on execution time), but so far I cannot reproduce that problem: data is properly missing in my resulting netCDF for the areas that are not visited.
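
For completeness, this is roughly how such a check can be done on a downloaded result (a sketch; the file name openEO.nc and the band B04 are assumptions, adjust them to your output):

import xarray as xr

# Count valid (non-NaN) pixels per acquisition date for one band;
# dates on which the area was not overpassed should report 0.
ds = xr.open_dataset("s2_cube_scale/openEO.nc")
print(ds["B04"].notnull().sum(dim=["x", "y"]))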

Features available on openeo-dev.vito.be become available on openeo.cloud eventually. What feature are you concerned about in particular?

First of all, the outputs are multiple netCDF files, one per grid tile. Is it possible to composite all files as part of postprocessing when all sub-jobs are finished? It would be better for a user to get the final product as a single netCDF file.
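In the meantime, a possible client-side workaround is to stitch the per-tile netCDFs yourself (a sketch, assuming the tiles share the same CRS and grid and were downloaded to the s2_cube_scale folder; the glob pattern is an assumption):

import glob
import xarray as xr

# Open every per-tile netCDF and merge them on their shared x/y/time coordinates.
tiles = [xr.open_dataset(path) for path in sorted(glob.glob("s2_cube_scale/*.nc"))]
merged = xr.combine_by_coords(tiles, combine_attrs="override")
merged.to_netcdf("s2_cube_scale_merged.nc")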

Secondly, I am getting the same gap problem as mentioned before. Can you try to run the same spatial and temporal extent?
This is the output for 2021-05-03 when all sub-files are loaded:

That's indeed a feature that we are considering. However, there are reasons to prefer separate files per sub-job:

  • files are smaller, allowing you to download a subset for inspection
  • you can inspect partial results of finished sub-jobs while other sub-jobs are still running
  • there is less confusion about the final result when some sub-jobs fail and the end result is incomplete

Those are really good reasons. However, if a user does not need any inspection because they are confident about the correctness of the results, it would be nice to have this option. So what about leaving this decision to the user?

Maybe it would be reasonable to have the default option as "False", but if the user's aim is postprocessing, then it could be set to "True" as part of the job_options specification.

For example:

job_options = {
        "tile_grid": "utm-50km",
        "postprocessing": "True"}

Indeed, something like that.
The feature request ticket is here: Option to stitch results of partitioned jobs · Issue #75 · Open-EO/openeo-aggregator · GitHub

Super!
What about my second question?