Xarray local dataset implementation

david.kovacs · 12 February 2024 11:10

Hi all

I am currently using the FuseTS package within openEO to create gap-filled, “sensor-fused” maps. I was wondering if there is an implementation of local xarray/netcdf4 datasets into openEO and how that could be done?

The FuseTS toolbox uses sensor fusion to create gap-filled maps. My objective is to combine e.g. Copernicus LAI-300m data (from VITO) with my own netcdf4/xarray datasets to obtain such gap-filled maps. I would highly appreciate any ideas that point me towards a way of using local data in openEO processes.

Dávid

emile.sonneveld · 13 February 2024 16:59

Hi Dávid,

If your images are available online as a stac catalog, you can use load_stac to access it from openEO.
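
As a rough illustration of the load_stac route, here is a minimal sketch. The helper name `load_cube_from_stac`, the example.com URL and the extents are placeholders chosen for illustration and do not point at any real catalog; only `load_stac` itself (with its `url`, `spatial_extent` and `temporal_extent` arguments) comes from the openeo Python client.

```python
def load_cube_from_stac(connection, stac_url, spatial_extent, temporal_extent):
    """Build a data cube from a STAC collection or item URL.

    `connection` is expected to be an authenticated openeo connection.
    """
    return connection.load_stac(
        url=stac_url,
        spatial_extent=spatial_extent,
        temporal_extent=temporal_extent,
    )


def main():
    # Imported here so the helper above can be reused without the client installed.
    import openeo

    connection = openeo.connect("https://openeo.cloud").authenticate_oidc()
    cube = load_cube_from_stac(
        connection,
        # Hypothetical URL: replace with the address of your own STAC collection.
        "https://example.com/stac/collection.json",
        {"west": 27, "south": -27, "east": 30, "north": -26},
        ["2019-01-01", "2019-12-31"],
    )
    return cube
```

Calling `main()` only builds the cube on the client side; nothing is processed on the backend until a download or batch job is started.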

Otherwise, if you are working on openeo.cloud or openeo.vito.be, you can upload files to the backend through a free terrascope VM:

If your data is a collection of tiffs, without stac metadata, you could use an unsupported function:

datacube = connection.load_disk_collection(
    format="GTiff",
    glob_pattern="/data/users/Public/<user-name>/...*.tif",
    options=dict(date_regex=r".*_(\d{4})-(\d{2})-(\d{2}).\.tif"),
)

Does that solve the question?

Emile

david.kovacs · 13 February 2024 17:54

Hi Emile

Thanks for your response! I would like to upload my data to Terrascope and access it via openEO; to me that looks like the most feasible option.
I have used the Terrascope VM before, but I would need some help with uploading my data to it so that I can then access it via openEO.

Thanks in advance
Dávid

emile.sonneveld · 14 February 2024 08:11

Hi David,

This is probably the folder you have write access to: /data/users/Public/david.kovacs/. I saw some files in it and made an example script that loads data from here:

import openeo
# Keep the authenticated connection: it is used below to load the data
connection = openeo.connect("https://openeo.cloud").authenticate_oidc()

spatial_extent={  # Johannesburg
    "west": 27,
    "south": -27,
    "east": 30,
    "north": -26,
}

# load_disk_collection is legacy. Use load_stac when metadata is available
datacube = connection.load_disk_collection(
    format="GTiff",
    glob_pattern="/data/users/Public/david.kovacs/tifs_david/LAI/*.tif",
    options=dict(date_regex=r".*(\d{4})(\d{2})(\d{2}).tif"),
)

datacube = datacube.filter_bbox(spatial_extent)
datacube = datacube.filter_temporal("2019-01-01", "2019-12-31")
datacube.download("david_k_LAI.tif")
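
To check a date_regex locally before submitting anything, a small sketch like the one below may help. It assumes the backend reads the three capture groups as year, month and day, which is what the pattern above suggests; the file name `LAI_20190110.tif` and the helper `date_from_filename` are made up for illustration.

```python
import re
from datetime import date

# Same pattern as passed in options=dict(date_regex=...) above
DATE_REGEX = r".*(\d{4})(\d{2})(\d{2}).tif"


def date_from_filename(path):
    """Return the date encoded in a file name, or None if the pattern does not match."""
    match = re.match(DATE_REGEX, path)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    return date(year, month, day)


# Hypothetical file name following a <name>_YYYYMMDD.tif convention
example = "/data/users/Public/david.kovacs/tifs_david/LAI/LAI_20190110.tif"
parsed = date_from_filename(example)
```

Files for which the helper returns None would presumably not get a timestamp from the backend either, so it is worth renaming them or adjusting the pattern first.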

Emile

david.kovacs · 15 February 2024 11:12

Dear Emile,

Thank you for your help, it makes sense now!

Actually, I want to process time series with their inherent temporal metadata, so I uploaded a “.nc” file to the VM. What's the function to access it? I looked at load_stac and load_disk_collection, but netCDF was not listed as a format.

When I modify the code you provided with: format="netCDF" it gives me the following error message:

OpenEoApiError: [500] Internal: Server error: NotImplementedError('The format is not supported by the backend: netCDF') (ref: r-24021597483244358da73b11a3d0ec3e)

Thanks
Dávid

emile.sonneveld · 15 February 2024 15:35

Hi David,

I would recommend splitting the netCDF file up into multiple tiffs.
netCDF is not supported by load_disk_collection for the moment.
This script worked for that on my machine:

#!/bin/bash

cd /data/users/Public/david.kovacs/tifs_david/S3GPR/ || exit

infile=spain.nc
band=1
# sudo apt-get install -y cdo
for idate in $(cdo showdate $infile); do
  date="${idate:0:10}"
  # filter out non-date entries:
  if [[ ${date} =~ [0-9]{4}-[0-9]{2}-[0-9]{2} ]]; then
    echo "date: $date"
    y="${idate:0:4}"
    m="${idate:5:2}"
    d="${idate:8:2}"
    mkdir -p tiff_collection/$y/$m/$d
    # apt-get install -y build-essential proj-bin proj-data gdal-bin gdal-data libgdal-dev
    gdal_translate -co COMPRESS=DEFLATE -unscale -a_srs EPSG:32630 -ot Float32 NETCDF:$infile -b $band "tiff_collection/$y/$m/$d/${date}_S3GPR.tif"
    ((band++))
  fi
done
echo "All done"
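
For anyone who prefers Python over bash, the same loop can be sketched roughly as follows. This is only a sketch under the same assumptions as the script above: band N of the netCDF holds the Nth valid date, the dates come from `cdo showdate` (or from the time coordinate of the file), and gdal_translate is installed. The helper names `build_translate_commands` and `run_commands` are made up for illustration.

```python
import re
import subprocess
from pathlib import Path, PurePosixPath

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def build_translate_commands(infile, dates, out_root="tiff_collection",
                             suffix="S3GPR", srs="EPSG:32630"):
    """Return one gdal_translate command (as an argument list) per valid date."""
    commands = []
    band = 1
    for entry in dates:
        day = entry[:10]
        # Skip non-date entries without advancing the band counter
        if not DATE_RE.match(day):
            continue
        y, m, d = day.split("-")
        out = PurePosixPath(out_root) / y / m / d / f"{day}_{suffix}.tif"
        commands.append([
            "gdal_translate", "-co", "COMPRESS=DEFLATE", "-unscale",
            "-a_srs", srs, "-ot", "Float32", "-b", str(band),
            f"NETCDF:{infile}", str(out),
        ])
        band += 1
    return commands


def run_commands(commands):
    """Create the output folders and run each command; requires GDAL on the PATH."""
    for cmd in commands:
        Path(cmd[-1]).parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(cmd, check=True)
```

Building the commands separately from running them makes it possible to print and inspect them first, for example with `build_translate_commands("spain.nc", ["2019-01-01", "2019-01-11"])`, before passing the result to `run_commands`.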

david.kovacs · 15 February 2024 17:00

Thanks Emile for your reply.

Is there a way to access the netCDFs directly? I am not familiar with bash, and since I can store my data in netCDFs, it would be much easier to access them as they are, with their temporal metadata assigned. I will process several netCDFs, and this approach would require “slicing” each of them into geotiffs every time, which I would like to avoid if possible.

Thanks
Dávid

emile.sonneveld · 16 February 2024 13:30

Hey,

I packaged the script in a docker image so that it can run on terrascope.
These commands should run it for you:

cd /data/users/Public/emile.sonneveld/for_david/
sudo docker build -t convert_netcdf_to_gtif .
sudo docker run -it --privileged --mount type=bind,source=/data/users/Public/david.kovacs/,target=/data/users/Public/david.kovacs/ convert_netcdf_to_gtif

I can’t run it myself, as I can’t write to your user folder.
The first run will be very slow, but after that it should only take a few seconds.

Emile

david.kovacs · 19 February 2024 11:24

Dear Emile,

Thank you very much for your help. It works perfectly!
Btw, I am not able to delete the folders/files that the script creates. They are owned by root (root) and I have no permission to delete them.

How could I fix it?
Dávid

emile.sonneveld · 19 February 2024 15:18

Hi David,

Does ‘sudo rm -rf path/to/delete’ work?

david.kovacs · 19 February 2024 15:50

It did! Thanks for all your help

have a nice day

david.kovacs · 26 March 2024 11:33

Dear Emile,

With your help I managed to create some really nice maps! Thanks for all the help.

Right now, I am using exactly the same code with which I successfully processed areas a few weeks ago, but for some reason there is an issue when I try to run the MOGPR batch job.

The following job IDs produced massive error outputs, which I cannot make sense of. Please help me with this:

j-2403258c73b244a4a61242aa349c340b
j-2403213b46b5416b85f58a13c4dec53d
j-240321f96afd4609aefcfd4b7fd8ac09
j-240314dcec3146bcbd1f87ccbd2923a8

jeroen.dries · 27 March 2024 14:14

FYI, this one also got asked and answered here, jobs will need to be retried: