Error occurring when calculating S2 mosaic for some periods, others work fine

Hello!

I’m generating weekly image mosaics from the S2 NDVI images that are available in each week for the territory of Flanders.

For some weeks this works fine, but other weeks always end with an error, even after multiple retries.

I’m not sure what the best way is to get you the details of what is happening, but I’ll start with the following:

  • I connect to the following backend: https://openeo.vito.be
  • some jobs that end in an error:
    • title: 'S2_mosaic_weekly_2022-09-05_2022-09-12_ndvi.tif', id: j-2e516491359545198e27d43573fa7de9
    • title: 'S2_mosaic_weekly_2022-10-10_2022-10-17_ndvi.tif', id: j-89f4d05547744e5785484c90e887953e
    • title: 'S2_mosaic_weekly_2022-10-17_2022-10-24_ndvi.tif', id: j-756b622ac91541b29f1d975c12286659
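For context, here is a small sketch of how the weekly periods and job titles above could be generated. Only the title pattern is taken from the jobs listed above; the start date and number of weeks are just example values:

```python
from datetime import date, timedelta

def weekly_windows(start: date, n_weeks: int):
    """Yield (start, end) date pairs for consecutive 7-day windows."""
    for i in range(n_weeks):
        window_start = start + timedelta(weeks=i)
        yield window_start, window_start + timedelta(weeks=1)

# Job titles following the pattern of the failing jobs listed above
titles = [
    f"S2_mosaic_weekly_{s.isoformat()}_{e.isoformat()}_ndvi.tif"
    for s, e in weekly_windows(date(2022, 9, 5), 2)
]
# titles[0] == "S2_mosaic_weekly_2022-09-05_2022-09-12_ndvi.tif"
```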

This is an example of an error message:

Error processing batch job: Py4JJavaError('An error occurred while calling z:org.openeo.geotrellis.geotiff.package.saveRDD.\n', JavaObject id=o930)
Traceback (most recent call last):
  File "batch_job.py", line 323, in main
    run_driver()
  File "batch_job.py", line 296, in run_driver
    run_job(
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/utils.py", line 48, in memory_logging_wrapper
    return function(*args, **kwargs)
  File "batch_job.py", line 398, in run_job
    assets_metadata = result.write_assets(str(output_file))
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/save_result.py", line 111, in write_assets
    return self.cube.write_assets(filename=directory, format=self.format, format_options=self.options)
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/geopysparkdatacube.py", line 1620, in write_assets
    self._get_jvm().org.openeo.geotrellis.geotiff.package.saveRDD(max_level.srdd.rdd(),band_count,str(filePath),zlevel,self._get_jvm().scala.Option.apply(crop_extent),gtiff_options)
  File "/opt/spark3_2_0/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1309, in __call__
    return_value = get_return_value(
  File "/opt/spark3_2_0/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.openeo.geotrellis.geotiff.package.saveRDD.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 135 in stage 10.0 failed 4 times, most recent failure: Lost task 135.2 in stage 10.0 (TID 325) (epod083.vgt.vito.be executor 84): ExecutorLostFailure (executor 84 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 159030 ms
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2403)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2352)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2351)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2351)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1109)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1109)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1109)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2591)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2533)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2522)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:898)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2235)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2254)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2279)
    at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
    at org.apache.spark.rdd.PairRDDFunctions.$anonfun$collectAsMap$1(PairRDDFunctions.scala:737)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
    at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:736)
    at org.openeo.geotrellis.geotiff.package$.getCompressedTiles(package.scala:290)
    at org.openeo.geotrellis.geotiff.package$.saveRDDGeneric(package.scala:214)
    at org.openeo.geotrellis.geotiff.package$.saveRDD(package.scala:153)
    at org.openeo.geotrellis.geotiff.package.saveRDD(package.scala)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:829)

Hi Pieter,

These (admittedly cryptic) error messages are caused by the workers running out of memory. You can solve this by increasing the memory settings, as we set the default limits relatively low.

This example shows how to increase the memory, could you give that a try?

job_options = {"executor-memory": "4G", "executor-memoryOverhead": "2G", "executor-cores": "2"}
cube.execute_batch(out_format="GTiff", job_options=job_options)

Hey Jeroen,

This indeed solved my problem.

Thanks!

Hi,

I have a follow-up question, since I suddenly started getting the same kind of error, so I hope it’s okay if I post my question here.

job_options = {"executor-memory": "4G", "executor-memoryOverhead": "2G", "executor-cores": "2"}

These job_options settings also worked for one of my scripts, but seemed insufficient for another one.
I then tried the following settings, and those seemed sufficient.

job_options = {"executor-memory": "8G", "executor-memoryOverhead": "4G", "executor-cores": "2"}

But my question is: what are the best practices for the job_options arguments? What are the possible values of the different arguments, are there any limitations?

Thanks in advance.

Hi Margot,

increasing these settings has two effects:

  • Your job cost increases, as memory and CPU are charged separately.
  • These ‘executor’ settings determine the size of a worker; as that size increases, it becomes harder to find a free spot for the worker on the cluster.

The overall limit is currently set at about 32GB for the sum of executor-memory and executor-memoryOverhead, but you want to avoid getting close to that.

One way to reduce the memory needed per worker is to set executor-cores to 1 instead of 2, while also halving your memory settings. Setting executor-cores to 2 means that two tasks will run in parallel in the same worker, which increases its memory needs.
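To make the sizing concrete, here is a rough back-of-the-envelope sketch of how much memory a single worker requests under the two configurations discussed in this thread (the parsing helper is an assumption that only handles whole-gigabyte values like "4G"; the ~32GB cap is the limit mentioned above):

```python
def parse_gb(value: str) -> int:
    """Parse values like '4G' into an integer number of gigabytes (assumed format)."""
    assert value.endswith("G"), "this sketch only handles whole-gigabyte values"
    return int(value[:-1])

def per_worker_gb(job_options: dict) -> int:
    """Memory requested per worker: executor-memory + executor-memoryOverhead."""
    return (parse_gb(job_options["executor-memory"])
            + parse_gb(job_options["executor-memoryOverhead"]))

small = {"executor-memory": "4G", "executor-memoryOverhead": "2G", "executor-cores": "2"}
large = {"executor-memory": "8G", "executor-memoryOverhead": "4G", "executor-cores": "2"}

print(per_worker_gb(small))  # 6 (GB per worker)
print(per_worker_gb(large))  # 12 (GB per worker, still well below the ~32GB cap)
```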

Hope that helps a bit, I’ll see if I can find a spot in our documentation for this!

best regards,
Jeroen

Hi Jeroen,

Thank you for the clarification, it definitely helps.
Just one more small follow-up question: what are the default job_options settings?

Kind regards,
Margot

Hi Margot,

Apologies for the delay, but I have now also documented these settings with their defaults here:

Please have a look and let me know if something could be improved!

best regards,
Jeroen
