215 credits to sample 10 points over a collection. Am I doing it wrong?

Hello!

I’ve just registered on the platform, and I’m running my first processes to test out the system. It works nicely both in the web editor and via the Python API, but I was worried by the large number of credits required for a simple point-sampling task.

Here’s my trivial test process: get the time series of Vegetation Indices for N pixels (point geometries) for the year 2018.

This is the equivalent Python code:

import json
from openeo.processes import process

# `connection` is an authenticated openeo.Connection, `gdf` is a GeoDataFrame of point geometries
dc_collection = connection.load_collection(
    collection_id="VEGETATION_INDICES",
    spatial_extent={"west": 8.058428808185612, "east": 9.90150590839975, "south": 38.82259352166011, "north": 41.291280668691115},
    temporal_extent=["2018-01-01T00:00:00Z", "2018-12-31T23:59:59Z"],
)

def reducer_first(data, context=None):
    return process("first", data=data)

# sample N points (here N = 10), reproject to WGS84 and pass them as GeoJSON
gdf_json = json.loads(gdf.to_crs(4326).sample(10).to_json())

vc_aggregation = dc_collection.aggregate_spatial(geometries=gdf_json, reducer=reducer_first)
vc_save = vc_aggregation.save_result(format="NETCDF")
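
For completeness, I ran it as a batch job roughly like this (title and output folder are just my own bookkeeping):

job = vc_save.create_job(title="VI point sampling test")
job.start_and_wait()
job.get_results().download_files("output/")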

if N == 1 it requires 17 credits
if N == 10 it requires 215 credits :thinking:


In both cases the result is correct.

Considering that the 10 points span just 2 Sentinel tiles, I was expecting the cost to roughly double and then not increase much further, but instead I got a 12.5x increase in credits.

Considering that I will need to sample 50k points in the same area, and that GEE does this quite easily, I think I’m doing something wrong: at this rate it would be 1,075,000 credits, and that would be quite a lot of money :worried:

Could you please tell me whether my algorithm is correct? I’m bounding the load_collection both in space and time, and I also tried using filter_spatial instead of aggregate_spatial, but that gave false positives (the process finished successfully, but downloading the result returns an HTTP 404, and the credits are still consumed).
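
For reference, the filter_spatial variant I tried looked roughly like this (same cube and geometries as above):

# keep only the pixels intersecting the point geometries, no reducer involved
vc_filtered = dc_collection.filter_spatial(geometries=gdf_json)
vc_filtered_save = vc_filtered.save_result(format="NETCDF")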

Thanks

Hi Giacomo,

Can you share a job id so we can investigate what is going on?
Maybe the form of the GeoJSON is demanding.
VEGETATION_INDICES also refers to external data in Sentinel Hub, which makes execution heavier.

Emile

I’ll share the job IDs as soon as EGI Check-in lets me back in. I keep getting “Sorry, the authentication server is not available right now.”

Side question: how do I know where the data is hosted and where it is more convenient to process it? Can I work that out by browsing https://hub.openeo.org/?
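
For example, I could imagine checking each back-end listed there with something like this (the two URLs are just the ones I know of), but I don’t know if that’s the intended way:

import openeo

# check which back-ends expose a collection with "VEGETATION_INDICES" in its id
for url in ["openeo.cloud", "openeo.dataspace.copernicus.eu"]:
    ids = openeo.connect(url).list_collection_ids()
    print(url, [cid for cid in ids if "VEGETATION_INDICES" in cid])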

Here are the job IDs, executed on openeo.cloud:

sample 1 point
vito-j-2408215a0f254cd9b84a8e68c4cf3834: 17 credits

sample 10 points
vito-j-240821f0ce034b33ac40596af5c25581: 215 credits

As suggested, I executed the same process (the 10-point version) on openeo.dataspace.copernicus.eu against the equivalent collection COPERNICUS_VEGETATION_INDICES, and it succeeded, spending only 10 credits.
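
Concretely, the only changes on my side were the connection URL and the collection id, roughly:

# same process, pointed at the Copernicus Data Space back-end
connection = openeo.connect("openeo.dataspace.copernicus.eu").authenticate_oidc()
dc_collection = connection.load_collection(
    collection_id="COPERNICUS_VEGETATION_INDICES",
    spatial_extent={"west": 8.058428808185612, "east": 9.90150590839975, "south": 38.82259352166011, "north": 41.291280668691115},
    temporal_extent=["2018-01-01T00:00:00Z", "2018-12-31T23:59:59Z"],
)
# reducer_first, aggregate_spatial and save_result stay the same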

How could I have avoided this? How do I know where it is more convenient to run a process when the collection is available on multiple instances but at very different processing costs?

Hi Giacomo,
there’s not really a way to tell this up front. Basically, you were right to spot the relatively high usage and to raise the issue on the forum.
This is actually one of the advantages of a standardized processing API: you can easily try the same workflow on different platforms, so the whole comparison can be made relatively quickly.

I do agree that it is really convenient to use the same API to talk to different geoprocessing servers, yet I’m quite puzzled that there’s no concept of “distance” or “cost” involved in linking collections to providers, especially when there’s such a large cost gap for a simple 1 vs 10 point sample on a datacube of shape (144, 7).

The comparison is not so fast in practice, as running even a very simple process can easily consume all the available credits. I know it is possible to make a job stop automatically when a cost threshold is reached, but I don’t know the expected cost for the “happy path”.

For example, now that it costs 10 credits instead of 215 to sample 10 points on openeo.dataspace.copernicus.eu, what would be the expected cost to sample 50k points on the same tiles? Should I set the cost threshold to 50, 100, 200, or 2000?
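
I’m assuming the budget parameter on the batch job is the right mechanism for this kind of threshold, i.e. something like:

# cap the job at a maximum number of credits (the value here is just a guess)
job = vc_save.create_job(title="50k point sampling", budget=200)
job.start_and_wait()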

Thanks

I tried running the same sampling algorithm on COPERNICUS_VEGETATION_INDICES on openeo.dataspace.copernicus.eu, this time for 53,236 points instead of 10.

It failed with an error (I guess a timeout, but the first error line in the log says “Fail to tell driver that we are starting decommissioning”) :dotted_line_face:

Job ID: j-240822544c1143598c803557e716230e

It says it cost 59 credits, but I don’t see the count going down on https://portal.terrascope.be/billing like it did when experimenting with openeo.cloud. Where can I see the credit count when operating on openeo.dataspace.copernicus.eu?

Thanks

Hi Giacomo,
for questions on other platforms, could you ask them on the appropriate forum:

Or refer to the documentation:
https://documentation.dataspace.copernicus.eu/Applications/PlazaDetails/Reporting.html

The job seems to have encountered an issue when converting the output to netcdf. I will try to send an intermediate csv, but you could also try using csv to avoid the issue if you want to run other jobs.
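
On your side that is just a one-line change in the process, e.g.:

vc_save = vc_aggregation.save_result(format="CSV")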

Thanks for the redirection. I was not fully aware of how the openEO community is split across the different instances.

What do you mean by “I will try to send an intermediate csv”? Why exactly did NetCDF fail in this context? CSV tends to grow to a very large size and is generally the least efficient option; is it the go-to format to avoid problems when working with openEO? What about GeoParquet (which I also used as the input geometry format)?

thanks

Hi Giacomo,
you are right about CSV, but as it happens, this is indeed the intermediate format that currently still gets used by the back-end you were using. This is not a property of openEO as a whole, and is thus subject to change.

Trying GeoParquet as an alternative is definitely also a good idea, and the problem might in any case go away by increasing the driver-memory job option.
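
For example, something along these lines when creating the batch job (the exact value depends on the workload):

# request more memory for the driver that assembles the output
job = vc_save.create_job(job_options={"driver-memory": "8g"})
job.start_and_wait()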

Thanks for the hint. I was not aware of the job options.

In the meantime, the job that failed with NETCDF output succeeded with CSV!

j-240823a9ee4746aa9a28591a2afb2009


It would be nice to learn how I could have tweaked the job_options to make this process run better/faster.

Thanks for all the insights.