Unexpected error interrupting batch job execution

Hi,

I have been running a script to extract Sentinel-2 time series for various years from the TERRASCOPE_S2_TOC_V2 collection. The script ends with an aggregate_spatial step, with a URL passed to the geometries argument (the URL points to a publicly hosted GeoJSON on GitHub containing a large number of circular polygons).
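Simplified, the final part of the script looks roughly like this (a sketch, not the exact code: the backend URL, band list and GeoJSON URL below are placeholders):

    import openeo

    # Connect to the Terrascope/VITO openEO backend (URL is a placeholder)
    connection = openeo.connect("https://openeo.vito.be").authenticate_oidc()

    # Load one year of Sentinel-2 TOC data (band names are placeholders)
    cube = connection.load_collection(
        "TERRASCOPE_S2_TOC_V2",
        temporal_extent=["2020-01-01", "2020-12-31"],
        bands=["B02", "B03", "B04", "B08"],
    )

    # Publicly hosted GeoJSON with the circular polygons (placeholder URL)
    geometries_url = "https://raw.githubusercontent.com/<user>/<repo>/main/polygons.geojson"

    # Final step: spatial aggregation over all polygons
    timeseries = cube.aggregate_spatial(geometries=geometries_url, reducer="mean")

    job = timeseries.execute_batch("timeseries.csv")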

I had no problem getting the data for the years 2018 and 2019, but when I change it to 2020 or 2021, the batch job is aborted at some point. An example follows:

Logs for batch job with id=j-0763a926672248d3a1d0e4891cc0d5c5

    error processing batch job
    Traceback (most recent call last):
      File "batch_job.py", line 328, in main
        run_driver()
      File "batch_job.py", line 301, in run_driver
        run_job(
      File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/utils.py", line 43, in memory_logging_wrapper
        return function(*args, **kwargs)
      File "batch_job.py", line 360, in run_job
        result = ProcessGraphDeserializer.evaluate(process_graph, env=env, do_dry_run=tracer)
      File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 320, in evaluate
        return convert_node(result_node, env=env)
      File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 326, in convert_node
        return apply_process(
      File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1380, in apply_process
        args = {name: convert_node(expr, env=env) for (name, expr) in sorted(args.items())}
      File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1380, in <dictcomp>
        args = {name: convert_node(expr, env=env) for (name, expr) in sorted(args.items())}
      File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 331, in convert_node
        return convert_node(processGraph['node'], env=env)
      File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 326, in convert_node
        return apply_process(
      File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1475, in apply_process
        return process_function(args=args, env=env)
      File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/ProcessGraphDeserializer.py", line 1050, in aggregate_spatial
        return cube.aggregate_spatial(geometries=geoms, reducer=reduce_pg, target_dimension=target_dimension)
      File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/geopysparkdatacube.py", line 1254, in aggregate_spatial
        return self.zonal_statistics(geometries,single_process)
      File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/geopysparkdatacube.py", line 1332, in zonal_statistics
        self._compute_stats_geotrellis().compute_average_timeseries_from_datacube(
      File "/opt/spark3_2_0/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1309, in __call__
        return_value = get_return_value(
      File "/opt/spark3_2_0/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py", line 326, in get_return_value
        raise Py4JJavaError(
    py4j.protocol.Py4JJavaError: An error occurred while calling o1307.compute_average_timeseries_from_datacube.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 19436 in stage 23.0 failed 4 times, most recent failure: Lost task 19436.3 in stage 23.0 (TID 36625) (epod058.vgt.vito.be executor 137): ExecutorLostFailure (executor 137 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 168268 ms
    Driver stacktrace:
      at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2403)
      at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2352)
      at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2351)
      at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
      at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
      at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2351)
      at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1109)
      at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1109)
      at scala.Option.foreach(Option.scala:407)
      at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1109)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2591)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2533)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2522)
      at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
      at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:898)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:2235)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:2254)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:2279)
      at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
      at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
      at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
      at org.apache.spark.rdd.PairRDDFunctions.$anonfun$collectAsMap$1(PairRDDFunctions.scala:737)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
      at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
      at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
      at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:736)
      at org.openeo.geotrellis.aggregate_polygon.AggregatePolygonProcess.computeMultibandCollectionTimeSeries(AggregatePolygonProcess.scala:355)
      at org.openeo.geotrellis.aggregate_polygon.AggregatePolygonProcess.computeAverageTimeSeries(AggregatePolygonProcess.scala:55)
      at org.openeo.geotrellis.ComputeStatsGeotrellisAdapter.compute_average_timeseries_from_datacube(ComputeStatsGeotrellisAdapter.scala:85)
      at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.base/java.lang.reflect.Method.invoke(Method.java:566)
      at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
      at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
      at py4j.Gateway.invoke(Gateway.java:282)
      at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
      at py4j.commands.CallCommand.execute(CallCommand.java:79)
      at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
      at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
      at java.base/java.lang.Thread.run(Thread.java:829)

(First time posting, so hoping that the categorization and layout are okay.)

Have you tried rerunning and do you consistently get that error?
We had some subsystem failures this morning, so it could be just a temporary glitch.
A quick glance at that stack trace (an executor lost after a heartbeat timeout) seems consistent with that.

Yes, I have tried running it several times on three different days and I consistently get this error.

Different IDs of batch jobs that generated this error:
j-a4018388c9d24ace8f05650aac8e1223
j-e5a7b61ca96f4daa961fe9ace3987680
j-0763a926672248d3a1d0e4891cc0d5c5
j-89abd10ec2e049d9bcea862136e68c76

Should I try again now?

(To check whether it was an issue of data size, I tried splitting the GeoJSON in two, but the same error occurred. However, the error did not occur with only 5 polygons.)

No, you apparently get that consistently, so it's not a temporary glitch.

I suspect out-of-memory issues. Have you tried breaking up the spatiotemporal window into smaller temporal (or spatial) ranges?
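For example, something along these lines (a rough sketch; it assumes the connection object and geometries_url from your script, and runs one batch job per quarter instead of one job for the whole year):

    # Split the year into quarters and run one batch job per quarter
    year = 2020
    quarters = [
        (f"{year}-01-01", f"{year}-04-01"),
        (f"{year}-04-01", f"{year}-07-01"),
        (f"{year}-07-01", f"{year}-10-01"),
        (f"{year}-10-01", f"{year + 1}-01-01"),
    ]

    for start, end in quarters:
        cube = connection.load_collection(
            "TERRASCOPE_S2_TOC_V2",
            temporal_extent=[start, end],
        )
        timeseries = cube.aggregate_spatial(geometries=geometries_url, reducer="mean")
        timeseries.execute_batch(f"timeseries_{start}_{end}.csv")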

I have tried:

  • spatially: a smaller number of polygons (covering a smaller area)
    Batch job ID: j-89abd10ec2e049d9bcea862136e68c76 (16/08)
    This gives the same error as above.
  • temporally: filtering the datacube with a narrower temporal range
    Batch job ID: j-319bfb32040740a9acb5cbef179d564f (18/08, just now)
    This gives a new error (below), which seems very similar to the most recent error in this post:

    Traceback (most recent call last):
      File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/backend.py", line 1727, in get_log_entries
        with (self.get_job_output_dir(job_id) / "log").open('r') as f:
      File "/usr/lib64/python3.8/pathlib.py", line 1221, in open
        return io.open(self, mode, buffering, encoding, errors, newline,
      File "/usr/lib64/python3.8/pathlib.py", line 1077, in _opener
        return self._accessor.open(self, flags, mode)
    FileNotFoundError: [Errno 2] No such file or directory: '/data/projects/OpenEO/j-319bfb32040740a9acb5cbef179d564f/log'

I’m seeing various errors when inspecting your batch job logs.

e.g. j-319bfb32040740a9acb5cbef179d564f failed to even start, while j-89abd10ec2e049d9bcea862136e68c76 did start, but got stuck somewhere.

Could you share your code (Python code, or a process graph dump of your algorithm, e.g. cube.to_json(), with cube being the object you call .download() or .execute_batch() on)? Send it by email if you don't want to share it publicly.
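For example, something like this (assuming timeseries is the cube you run as a batch job):

    # Dump the process graph as JSON so we can inspect it
    with open("process_graph.json", "w") as f:
        f.write(timeseries.to_json())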

j-319bfb32040740a9acb5cbef179d564f indeed failed to start. In fact, from that point on yesterday, all of my batch jobs failed to start and consistently generated this error. That problem now seems to have resolved itself… I suppose this was a different issue.

So I have retried limiting the temporal range (ID: j-70de53311385432db869f2d8cff3c321).
Now the job does start but doesn't finish successfully; the error looks similar to the initial error from j-0763a926672248d3a1d0e4891cc0d5c5.

I just sent the script by e-mail as well.

I think I just managed to get a working job, based on your notebook, by using a small subset of (closely located) polygons.
Your polygon set covers the whole of Flanders, if I'm not mistaken. I would suggest experimenting with smaller subsets (e.g. per province) first before scaling up to the whole of Flanders.
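For example, something along these lines (just a sketch; the URL is a placeholder and cube is your loaded collection):

    import json
    import urllib.request

    # Download the full GeoJSON and keep only a subset of the polygons
    url = "https://raw.githubusercontent.com/<user>/<repo>/main/polygons.geojson"
    with urllib.request.urlopen(url) as resp:
        full = json.load(resp)

    subset = {
        "type": "FeatureCollection",
        "features": full["features"][:100],  # e.g. only the first 100 polygons
    }

    timeseries = cube.aggregate_spatial(geometries=subset, reducer="mean")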

In the past we did manage to process “whole of Flanders” aggregation jobs successfully (with a simpler algorithm). It would be interesting to see where the breaking point of your algorithm lies.

I also noticed that each polygon is a 64-point approximation of a circle. It might also help with performance and resource usage to use a coarser approximation (e.g. 8 or even 6 points).
For an aggregation operation, 64 points is a bit overkill, I’d think.
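If you generate the circles with something like shapely's buffer, the resolution parameter controls this (a sketch; centre and radius values are placeholders):

    from shapely.geometry import Point, mapping

    center = Point(5.05, 51.15)  # lon/lat of one circle centre (placeholder)
    radius = 0.001               # buffer radius in degrees (placeholder)

    # The default resolution is 16 segments per quarter circle (a 64-point circle);
    # resolution=2 gives an 8-point approximation instead.
    coarse_circle = center.buffer(radius, resolution=2)
    print(len(coarse_circle.exterior.coords))  # 9: 8 points plus the repeated first point
    print(mapping(coarse_circle))              # GeoJSON-style dict of the polygon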

The script is written so that I only have to set the year at the beginning of the script to get the time series for that particular year. The peculiar thing is that the “whole of Flanders” aggregation does work if I set the year to 2018 or 2019, but I get the error if I set it to 2020 or 2021.

I have now also tried extracting only NDVI instead of 9 spectral bands + NDVI for 2020 and 2021; those cases also give this error. So it seems that the breaking point is still reached when limiting either the temporal or the spectral dimension.
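For reference, the NDVI-only variant essentially boils down to this (simplified; connection and geometries_url as before, and the band names are the standard Sentinel-2 red/NIR bands, which my script may name slightly differently):

    # Load only the red and NIR bands and reduce to a single NDVI band
    cube = connection.load_collection(
        "TERRASCOPE_S2_TOC_V2",
        temporal_extent=["2020-01-01", "2020-12-31"],
        bands=["B04", "B08"],
    )
    ndvi = cube.ndvi(red="B04", nir="B08")
    timeseries = ndvi.aggregate_spatial(geometries=geometries_url, reducer="mean")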

I will now try with a different subset of closely located polygons.

Thanks for the hint on the circle approximation; I will try changing that.