Filtering images and reducing spatially

Hi all,

I would like to calculate some statistics on pixels within my datacube, in this case the 75th percentile of the green band. I would then like to save this result per image. Later on in my algorithm, I want to filter entire images based on this percentile.

I have two problems:

  1. The percentile operation cannot be computed in two separate passes (reducing over x and then over y does not give the same result as reducing over both at once). Can I reduce all pixels in one go, i.e. not first reduce_dimension("x") and then reduce_dimension("y"), but both at the same time, with all the green band values in one vector?
  2. How can I filter entire images based on the result of a calculation? I cannot find a way to add metadata per image, and filter_temporal only seems to work on dates. Is there a generic filter method of some kind?

Thanks,
Jaap

Hi @jaaplangemeijer !

  1. I think what you are looking for is the process reduce_spatial: openEO API Processes. Unfortunately, it is not available yet in openEO Platform. A workaround could be another process called flatten_dimensions, which is currently being worked on, but also still needs a bit of time until it's available.
  2. openEO doesn't work on images; it works on data cubes instead. I don't think you'll be able to filter images, but maybe one of the back-ends knows a workaround?

Best,
Matthias

Thanks for answering @m.mohr !

For point 1: do you have a time estimate for me, so I can think about workarounds if needed? Perhaps using UDFs: two reduce steps, saving the intermediate data as an array (a Python list in the UDF?).

I hope there is a solution for point 2; filtering images for quality is key for the algorithm to perform well! Let me think of some workarounds…

Jaap

Hi Jaap,

could you expand on what you mean by "filter entire images based on the result of a calculation"?

From what I understand, it could be something like:

  1. Computing the average over the datacube, reducing the spatial dimensions (the image would be the data of each timestep in the datacube)
  2. If the average is above a threshold keep it otherwise discard it.

Basically, something similar to what a back-end is doing for filtering out the dates with too many clouds: Filter less cloudy image - #5 by michele.claus

I will see if it is currently possible with the available processes; maybe reduce_spatial could be replaced with two reduce_dimension calls in a row, first over x and then over y, or vice versa.

Hi @michele.claus,

Your understanding is correct: it would be reducing the spatial dimensions using the percentile reducer, then thresholding above a certain value.

The problem with two reduce_dimension calls is that applying two percentile reducers in a row will not give the same value as taking all pixel values in the image at once, as the snippet below illustrates. Perhaps not a major issue for now, though. I can also discuss whether we can find a filtering method that works "stepwise" over x and y.
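
A quick numpy illustration of the discrepancy (toy values, not openEO code):

import numpy as np

image = np.array([
    [0, 0, 0, 10],
    [0, 0, 0, 10],
    [0, 0, 0, 10],
    [10, 10, 10, 10],
])

# one pass over all pixels at once
p_all = np.percentile(image, 75)  # 10.0

# two passes: first reduce y (per column), then reduce x over the results
p_two = np.percentile(np.percentile(image, 75, axis=0), 75)  # 4.375

print(p_all, p_two)  # the two results differ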

Thanks for your example; filtering on cloud cover is super useful! Is there a method to view this image metadata (or these properties) in the datacube?

Jaap

Maybe @jeroen.dries or @bart.driessen can tell you more about the available properties of the collection.

The collection metadata gives you hints about the available properties, and you can filter on several of them in load_collection. But after you have loaded the imagery into a data cube, you'll no longer be able to filter on metadata: you then have a data cube, not separate images.
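
As a rough sketch with the openeo Python client (the property name "eo:cloud_cover" and the 20% threshold are assumptions and depend on the collection), filtering at load time could look like this:

import openeo

connection = openeo.connect("openeo.cloud").authenticate_oidc()

# keep only scenes with at most 20% cloud cover; filtering on such
# properties only works at load time, before the data cube is built
cube = connection.load_collection(
    "SENTINEL2_L2A",
    spatial_extent={"west": 11.82, "south": 46.42, "east": 11.91, "north": 46.47},
    temporal_extent=["2021-01-01", "2021-03-01"],
    bands=["B03"],
    properties={"eo:cloud_cover": lambda v: v <= 20},
)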

Hi Jaap,
at VITO, we've done similar things using either apply_neighborhood or chunk_polygon. The best choice depends on what you define as an 'image'. Can it be arbitrarily large, e.g. would you take the 75th percentile of a whole continent? Or can we also define it as a 512x512 pixel region, for instance? With chunk_polygon, an 'image' can also be defined by a bounding polygon.

A different approach, which would take the whole data cube into account, would be to split your workflow into two requests: a first request that reduces your full datacube to percentiles per date, and then a second step that loads a new datacube with only the dates you selected based on the percentile.
However, I believe that for this second step we would need filter_labels, which is not fully supported yet.

What would work for you?

Hi Jeroen!

Great suggestions. In this case I look at all the images in a certain time range (let's say 2-3 years of imagery) that overlap a water reservoir, and then take this percentile of the green band of the whole image (or of the whole collection of images). So, depending on how efficient chunk_polygon or apply_neighborhood is, both would work. What would you suggest?

Hi Jaap,
for this case, I would certainly suggest chunk_polygon, because it will give you all the pixels within the reservoir. You can then use xarray to compute percentiles and simply drop the dates you don't want, or set the pixels to nodata.

Thanks for the suggestion! I will try this out

If I understand correctly, you could then also use aggregate_spatial with a geometry corresponding to this water reservoir. aggregate_spatial by design reduces along the x and y dimensions at the same time (like the reduce_spatial mentioned above).
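
A rough sketch with the openeo Python client (reservoir_geojson stands in for your reservoir polygon; the 75th percentile mirrors the use case above):

from openeo.processes import quantiles

# aggregate_spatial reduces x and y together, yielding one value per
# date (and band) for the given geometry
timeseries = cube.aggregate_spatial(
    geometries=reservoir_geojson,  # your reservoir polygon (GeoJSON dict)
    reducer=lambda data: quantiles(data, probabilities=[0.75]),
)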

@jeroen.dries I switched to the VITO backend, as it supports chunk_polygon.
I guess I now need to call chunk_polygon(my_geometry, ?udf?), or how would I apply the percentile reducer here?
I tried: dc.band(quality_band).chunk_polygon(geometry, lambda data: processes.quantiles(data, probabilities=[score_percentile]))
which ended up in a Java class cast error (see job ec0d892d-fd46-4083-8ada-3ebc3c4300bf in the VITO backend). I don't think I'm fully grasping your suggestion yet.

Hi @jaaplangemeijer,
yes, this would need to be fully UDF based. I don't have many examples in open source yet, except for this unit test:

Basically, you can work on multiple geometries at the same time, and the UDF will be invoked once per polygon.
The UDF will receive the full timeseries stack, so you can first use xarray to do the percentile reduction and then apply your threshold filtering. You can simply set pixels to nodata if you want to leave them out, or drop specific labels entirely.
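
A minimal sketch of such a UDF, assuming the cube's spatial dimensions are named x and y and using a placeholder threshold of 0.3:

# udf_percentile_filter.py
from openeo.udf import XarrayDataCube

def apply_datacube(cube: XarrayDataCube, context: dict) -> XarrayDataCube:
    array = cube.get_array()  # typically dims (t, bands, y, x)
    # 75th percentile over all pixels in the chunk, per date and band
    p75 = array.quantile(0.75, dim=["x", "y"])
    # keep observations whose percentile exceeds the threshold,
    # set everything else to nodata (NaN)
    return XarrayDataCube(array.where(p75 > 0.3))

It could then be applied with something like dc.chunk_polygon(chunks=geometry, process=openeo.UDF.from_file("udf_percentile_filter.py")), though the exact invocation may differ per client version.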

I just faced almost the same problem.
I have a computed layer with x and y dimensions, and I want to calculate a global statistic (e.g. the 35th percentile) over the whole extent of my area.
Up to now I don't consider the temporal dimension, but in the future I will.
The calculated value(s) will then be used to mask another layer.

My problem also connects well with the percentile/quantile topic: How to use quantiles on a 2d data cube (x,y) - openEO Platform - EODC Discussions

I tried both aggregate_spatial, using a polygon as big as the image, and reduce_spatial (which, as pointed out in the posts above, can only be approximated one dimension at a time, which is not quite correct).

Are there any updates on implementing reduce_spatial over both dimensions at the same time?

Any updates on filter_labels?
@stefaan.lippens @jeroen.dries

To be sure what you mean here: you plan to calculate a spatially global statistic, so you get a 1D timeseries, which you want to use for picking or dropping observations from another band/collection. E.g. the global statistic is 42 on Feb 4 → use the observations of Feb 4; the global statistic is 7 on Feb 14 → drop the Feb 14 observations; and so on?

And did that work, or what error do you get?

If it doesn't work, have you already tried to resample your data down heavily before aggregating? If you just want quantiles (or statistics in general), downsampling is probably not going to change the results much.
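
For example (a rough sketch; the 240 m resolution and the "average" method are illustrative choices, and the resolution units depend on the cube's CRS):

# coarsen the cube before the expensive spatial aggregation
coarse = cube.resample_spatial(resolution=240, method="average")
timeseries = coarse.aggregate_spatial(geometries=polygon, reducer="median")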

I don't think we have short-term implementation plans for "reduce_spatial".

Also note that a generic reduce_spatial is far from trivial in a big EO data system, where you want to split the work into separate chunks. Simple reducers like sum, count and mean are easily doable because they trivially allow splitting the calculation; quantiles (as requested here) are a lot harder to compute in a splittable way. Long story short: we would probably have to re-implement (where even possible) various reducer processes for use in reduce_spatial, so it's a big task to tackle.
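
A tiny numpy illustration (toy values) of why some reducers split and others don't:

import numpy as np

values = np.array([1, 2, 3, 100, 101, 102, 103, 104])
chunks = [values[:4], values[4:]]

# sum is trivially splittable: combining partial sums is exact
assert sum(np.sum(c) for c in chunks) == np.sum(values)

# a quantile is not: the median of partial medians is far off
partial = [np.median(c) for c in chunks]      # [2.5, 102.5]
print(np.median(partial), np.median(values))  # 52.5 vs 100.5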

Hi Stefaan, thanks for the quick answer.

My plan is to calculate a spatially global statistic for each date (a 1D timeseries), in order to mask, on the spatial dimensions, the pixels above this statistic (and in the future to expand this process to a timeseries instead of the single date I'm using now for testing).
Example: the global statistic for 4 Feb is 0.3654, so I mask all pixels on this date above (or below) this threshold; for 5 Feb it is 0.590, so I mask all pixels on this date above it, and so on.

No; here are the details: j-2403196ce1834f56a02c73f7f239fb2a. But thank you for the suggestion, I'm trying it now and will let you know.

Hi,
Thanks @nicola.ciapponi for the explanation of the use case. Thanks @stefaan.lippens for digging into this!
Here are two tests I have conducted, more on the technical side.

Question: How to use the result of aggregate_spatial as a threshold in the following processes?

Example: directly use the result of aggregate_spatial and connect it to the context of apply, with a gt callback that will be used for masking.

Process Graph:

{
  "process_graph": {
    "aggregate2": {
      "arguments": {
        "data": {
          "from_node": "load1"
        },
        "geometries": {
          "coordinates": [
            [
              [
                11.887589773041645,
                46.429362477197685
              ],
              [
                11.837292446283431,
                46.429362477197685
              ],
              [
                11.837292446283431,
                46.449001496027336
              ],
              [
                11.887589773041645,
                46.449001496027336
              ],
              [
                11.887589773041645,
                46.429362477197685
              ]
            ]
          ],
          "type": "Polygon"
        },
        "reducer": {
          "process_graph": {
            "mean1": {
              "arguments": {
                "data": {
                  "from_parameter": "data"
                }
              },
              "process_id": "mean",
              "result": true
            }
          }
        }
      },
      "process_id": "aggregate_spatial"
    },
    "apply6": {
      "arguments": {
        "context": {
          "from_node": "aggregate2"
        },
        "data": {
          "from_node": "load1"
        },
        "process": {
          "process_graph": {
            "gt2": {
              "arguments": {
                "x": {
                  "from_parameter": "x"
                },
                "y": {
                  "from_parameter": "context"
                }
              },
              "process_id": "gt",
              "result": true
            }
          }
        }
      },
      "process_id": "apply"
    },
    "load1": {
      "arguments": {
        "bands": [
          "B01"
        ],
        "id": "SENTINEL2_L2A",
        "spatial_extent": {
          "east": 11.907770486313042,
          "north": 46.46882281034837,
          "south": 46.422061000608664,
          "west": 11.817009058680437
        },
        "temporal_extent": [
          "2016-12-01T00:00:00Z",
          "2016-12-10T00:00:00Z"
        ]
      },
      "process_id": "load_collection"
    },
    "mask7": {
      "arguments": {
        "data": {
          "from_node": "load1"
        },
        "mask": {
          "from_node": "apply6"
        }
      },
      "process_id": "mask"
    },
    "save8": {
      "arguments": {
        "data": {
          "from_node": "mask7"
        },
        "format": "GTIFF"
      },
      "process_id": "save_result",
      "result": true
    }
  }
}

ID: j-2403181c3b004908a799c68699c942c9
Error:

OpenEO batch job failed: AttributeError("'AggregateSpatialResultCSV' object has no attribute '_get_object_id'")
+46s 430ms ERROR
ID: [1710756252951, 707133]

I have also tried to pipe the result of aggregate_spatial into array_create + first. ID: j-240318d6291347088a04cfe9a290aae4
Error:

OpenEO batch job failed: ProcessParameterInvalidException(status_code=400, code='ProcessParameterInvalid', message="The value passed for parameter 'data' in process 'array_create' is invalid: Expected <class 'list'> but got <class 'openeo_driver.save_result.AggregatePolygonResult'>.", id='no-request')
+15s 128ms ERROR
ID: [1710756645019, 169994]

I think I'm working on something similar.
The approach attempted above indeed won't work yet, but what I'm doing is a vector_to_raster to convert the aggregated value back into pixels, so that thresholding can then be performed in raster space.
This also won't work yet, but we're solving that as we speak.

@jeroen.dries, thanks for the update! Let us know as soon as you have a prototype or snippet or whatever :smiley: