Issue with masking procedure

Hi everyone,

I need to apply a kernel to a Sentinel-2 dataset and to mask the clouds. I applied all the operation, but it seems that there is an issue on masking the dataset after the kernel application.

This is my code:

# Import required packages
import xskillscore
import matplotlib.pyplot as plt
import openeo
from openeo.processes import is_nan, not_, sd, lte,  mean, array_create, quantiles, sum, add, multiply,gte 
import numpy as np
from netCDF4 import Dataset
import time
from datetime import datetime,timedelta
import xarray
from openeo.udf.xarraydatacube import XarrayDataCube
import os
import pandas
from openeo.rest.datacube import DataCube
from scipy.stats import rankdata


# %% Initialization
pc = os.getcwd()
pc=str.replace(pc,'\\','/')
n=str.find(pc,'Dropbox')
pc=pc[0:n]

band=["B08","B04"]
maskband=["CLD","CLP"]
maskthr=50
filt= (np.ones([5,5])/25).tolist()

# Input variables
t0= ["2016-10-01","2017-10-01","2018-10-01","2019-10-01","2020-10-01"]
s0=["2018-09-30","2019-09-30","2020-09-30","2021-09-30","2022-09-30"]

namestat='Polag'
coord=[11.577,44.871,11.617,44.911]

i=0
temp_ext=[t0[i]+' 00:00:00',s0[i]+' 00:00:00']
print('initialization '+t0[i])
spat_extent={"west": coord[0], "south": coord[1], "east": coord[2], "north": coord[3]}
rect={
  "type":"FeatureCollection",
  "features":[
    {
      "type":"Feature",
      "geometry": {
        "type":"Polygon",
        "coordinates":[[[coord[0],coord[1]],[coord[0],coord[3]],[coord[2],coord[3]],[coord[2],coord[1]],[coord[0],coord[1]]]]
      },
      "properties":{"name":"area1"}
    }
  ]
}

#connection = openeo.connect("https://openeo.cloud")
connection = openeo.connect("https://openeo-dev.vito.be")
connection.authenticate_oidc()

# S2 datacube
datacube = connection.load_collection(collection_id = "SENTINEL2_L1C_SENTINELHUB", spatial_extent = spat_extent, temporal_extent = temp_ext, bands = band[0],properties={"eo:cloud_cover": lambda x: lte(x, 70)}).linear_scale_range(0, 10000, 0, 1)
datacubeNDVI = connection.load_collection(collection_id = "SENTINEL2_L1C_SENTINELHUB", spatial_extent = spat_extent, temporal_extent = temp_ext, bands = band,properties={"eo:cloud_cover": lambda x: lte(x, 70)}).linear_scale_range(0, 10000, 0, 1)

# mask datacube
mask0 = connection.load_collection(collection_id = "SENTINEL2_L2A_SENTINELHUB", spatial_extent = spat_extent, temporal_extent = temp_ext, bands = maskband,properties={"eo:cloud_cover": lambda x: lte(x, 70)})
mask0 = mask0 >= maskthr
mask0 = mask0.band(maskband[0]).logical_or(mask0.band(maskband[1]))
mask0 = mask0.resample_cube_spatial(datacube)

# water mask
wat = connection.load_collection(collection_id = "GLOBAL_SURFACE_WATER", spatial_extent = spat_extent, temporal_extent = ["1980-01-01", "2020-12-31"], bands = ['extent']).max_time()
wat = wat.apply(lambda val: not_(is_nan(x=val)))
wat = wat.apply_kernel(filt).resample_cube_spatial(datacube) == 0 #.apply(lambda val: eq(x=val,y=0))
land = wat == 0 #.apply(lambda val: eq(x=val,y=0))

datacube2=datacube.apply_kernel(filt)

connection = openeo.connect("https://openeo-dev.vito.be")
connection.authenticate_oidc()

datacube.download("data.nc", format="netcdf")
datacube2.download("data2.nc", format="netcdf")



cube=xarray.open_dataset('data.nc')
cube=XarrayDataCube(cube.to_array(dim='bands'))
dat=cube.get_array()
print(np.nanmax(dat.values[1,:,:,:]))

cube=xarray.open_dataset('data2.nc')
cube=XarrayDataCube(cube.to_array(dim='bands'))
dat=cube.get_array()
print(np.nanmax(dat.values[1,:,:,:]))

# cloud masking
datacube=datacube.mask(mask0)
datacube2=datacube2.mask(mask0)

                        


connection = openeo.connect("https://openeo-dev.vito.be")
connection.authenticate_oidc()

datacube.download("data.nc", format="netcdf")
datacube2.download("data2.nc", format="netcdf")



cube=xarray.open_dataset('data.nc')
cube=XarrayDataCube(cube.to_array(dim='bands'))
dat=cube.get_array()
print(np.nanmax(dat.values[1,:,:,:]))

cube=xarray.open_dataset('data2.nc')
cube=XarrayDataCube(cube.to_array(dim='bands'))
dat=cube.get_array()
print(np.nanmax(dat.values[1,:,:,:]))

and here there is the output, with the max values of the netcdf without and with the kernel application that I downloaded prior and after the cloud masking.

initialization 2016-10-01
Authenticated using refresh token.
C:\Users\paolo\miniconda3\envs\openeo\lib\site-packages\openeo\metadata.py:279: UserWarning: Band name mismatch: ['change', 'extent', 'occurrence', 'recurrence', 'seasonality', 'transitions'] != ['change', 'extent', 'occurrence', 'recurrence', 'seasonality', 'transitions', 'dataMask']
  complain("Band name mismatch: {a} != {b}".format(a=cube_dimension_band_names, b=eo_band_names))
Authenticated using refresh token.
1.0
1.0000000000000002
Authenticated using refresh token.
1.0
9.969209968386869e+36

As you can see, the max value of the dataset with the kernel application and the cloud masking is in the order of 10^36.

How can I solve this?

Hi Paolo,
in cases like this, it helps to actually see the incorrect output, and also the inputs. Now you have a sequence of operations, and it’s hard for me to tell where it starts going wrong.
If you run a batch job, you can always copy paste the url to the result, or give us a job id. Otherwise, just adding a screenshot might also help.

thanks,
Jeroen

Hi @jeroen.dries, I’m sorry that it was not clear.

This is the incorrect input: I did not use any job or batch, until now I have just used “execute” or “download” to extract the output.

to sum this up the issue is:

  1. I take a Sentinel-2 collection
  2. I apply a kernel 5*5 to the collection
  3. I download the netcdf relative to the two collection and I calculate the max value (1 and 1.0000000000000002)
  4. I apply a mask based on the cloud presence to both products download the resulting
  5. I download the netcdf relative to the two collection and I calculate the max value (1 and 9.969209968386869e+36).

I think that “where it starts to go wrong” is a combination of the kernel application and the masking procedure:
At first, after the kernel application, the max values are reasonable (around 1), while after the application of the mask, the product with kernel show impossible values.

Is it clearer this way?

Hi Paolo,
if you just use download, can you share a screenshot of inputs and outputs? (I don’t think files can be attached to posts.)
mask and apply_kernel don’t have a known relationship, so I need to figure out if the inputs are correct or not. Now you provide the maximum value, but this removes a lot of the context. Are all pixels set to this high value, or is it just a few?
I know you provide the script, but it’s quite time consuming to understand, run and debug all scripts provided by users, especially if they are more complex, so it really helps if we get more insights.

Running as a batch job by replacing download() with execute_batch() is also convenient for debugging (even though it’s slower), in the sense that this gives us a clear job id, so we can inspect logs and outputs more easily. We made it really easy to switch between these types of jobs for this purpose.

regards,
Jeroen

The correct dataset of sentinel-2 with nor kernel or mask application was downloaded with this job id:
7175c9f2-4271-467c-a3e3-c86ab2489132

The correct dataset of sentinel-2 with kernel application but without mask application has this job id:
e4537937-8580-4362-9b4e-4ce232ab2d40

The correct dataset of sentinel-2 with mask application but without kernel application has this job id:
2f85d53f-fa39-4e83-a55e-6fe7de279582

Finally, the dataset which show the bad values due to masking and kernel application has this job id:
63b568c2-532b-4cab-9af2-a9195be62591

The issue is related to just few images (in this case 8, 9, 32, 36, 41, 42, 56, 59), and it cause some areas to assume impossible values, as shown in the image (n.32).

It can be easily solved by setting to NaN all the values above a certain threshold, but I wanted let you know this