Processing aborted repeatedly due to "Authorization token expired"

I get this error message while waiting for a batch job to start running:
OpenEoApiError: [500] unknown: [403] TokenInvalid: Authorization token has expired or is invalid. Please authenticate again.

After re-authenticating, the same error comes up again (UC3 - Crop type feature engineering (rule-based).ipynb).

Could we simply increase the time limit after which a token expires?

0:00:00 Job ‘vito-ec7c5c49-a54f-4221-b659-e3e18ecb1fbd’: send ‘start’
0:00:39 Job ‘vito-ec7c5c49-a54f-4221-b659-e3e18ecb1fbd’: queued (progress N/A)
0:00:45 Job ‘vito-ec7c5c49-a54f-4221-b659-e3e18ecb1fbd’: queued (progress N/A)
0:00:52 Job ‘vito-ec7c5c49-a54f-4221-b659-e3e18ecb1fbd’: queued (progress N/A)
0:01:01 Job ‘vito-ec7c5c49-a54f-4221-b659-e3e18ecb1fbd’: queued (progress N/A)
0:01:11 Job ‘vito-ec7c5c49-a54f-4221-b659-e3e18ecb1fbd’: queued (progress N/A)
0:01:24 Job ‘vito-ec7c5c49-a54f-4221-b659-e3e18ecb1fbd’: queued (progress N/A)

That job took only 12 minutes, which is within the time limit, so it could also be something else.
I’ll need to check with Stefaan on Monday!

@stefaan.lippens indicates that the actual cause of this is probably some downtime or a timeout on EGI Check-in itself. Hence we propose to either add some retry logic or improve the error message.
We’ve also contacted EGI support about the downtime itself, as we expect an authentication service to be highly available.
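
As a user-side stopgap, that retry logic could look roughly like the following minimal sketch. Note that with_retries, its attempts/delay parameters and the retried call are my own hypothetical illustration, not part of the openeo client:

import time
from openeo.rest import OpenEoApiError

def with_retries(call, attempts=5, delay=30):
    """Retry a callable a few times when the backend raises an API error."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except OpenEoApiError:
            # Intermittent EGI Check-in downtime surfaces as an API error here.
            # A more careful version would inspect the status code before retrying.
            if attempt == attempts:
                raise
            time.sleep(delay)

# For example, poll the job status without failing on the first glitch
# (job being the batch job handle from connection.job(...) as below):
status = with_retries(lambda: job.status())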

I think this issue occurs only when using execute_batch, right?
Maybe using a different syntax for starting the job would be enough:

job = datacube.send_job(title="job title")  # create the batch job without starting it
job_id = job.job_id  # keep the id so the job can be looked up again later
job.start_job()  # start it and poll/inspect the status separately

Not entirely, in the sense that you’ll still get authentication issues when EGI Check-in is down.
execute_batch indeed sometimes fails due to authentication or some other server unavailability, but in that case the job itself can still succeed, and it can be inspected with:

job = connection.job(job_id)  # re-attach to the existing batch job by its id
job.status()  # e.g. "queued", "running", "finished" or "error"

(which will still give you an authentication exception when EGI is down)
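
Once EGI Check-in is back and the job reports “finished”, the results can still be downloaded from that re-attached job object. A minimal sketch, assuming the aggregator URL (use whatever backend you originally connected to) and an illustrative output folder:

import openeo

connection = openeo.connect("openeo.cloud").authenticate_oidc()  # reconnect and re-authenticate
job = connection.job(job_id)  # re-attach to the batch job by the id saved earlier
if job.status() == "finished":
    job.get_results().download_files("results")  # download all result assets to ./results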

Anyway, for a more technical discussion and solution proposals, see this issue:

Some improvements that have been implemented for this error situation:

  • on the backend side (aggregator and Terrascope drivers): the error will now have a 503 status (“Service Unavailable”), openEO error code OidcProviderUnavailable and message “OIDC Provider is unavailable”, which is a lot more informative than “Authorization token has expired or is invalid”. These improvements still have to be deployed in production.
  • on the Python client side: execute_batch will ignore these 503 errors (assuming they indicate a temporary issue), and the polling loop should no longer fail due to a minor connection glitch. If there are too many of these “soft” errors (more than 10), the loop will fail, to avoid ending up in an endless loop (see the sketch after this list). This fix is not released yet and will be part of the upcoming 0.9.1 release.
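
For illustration, the client-side behavior described above boils down to something like the following sketch. This is a simplification, not the actual client code: poll_until_done and MAX_SOFT_ERRORS are made-up names, and it assumes OpenEoApiError exposes the HTTP status as http_status_code:

import time
import requests
from openeo.rest import OpenEoApiError

MAX_SOFT_ERRORS = 10  # threshold described above: more soft errors than this aborts the loop

def poll_until_done(job, interval=60):
    """Poll a batch job's status, tolerating a limited number of transient errors."""
    soft_errors = 0
    while True:
        try:
            status = job.status()
        except (requests.exceptions.ConnectionError, OpenEoApiError) as e:
            if isinstance(e, OpenEoApiError) and e.http_status_code != 503:
                raise  # other API errors are treated as real failures
            soft_errors += 1  # a 503 or a connection glitch counts as a "soft" error
            if soft_errors > MAX_SOFT_ERRORS:
                raise  # too many soft errors: fail instead of looping forever
        else:
            if status in ("finished", "error", "canceled"):
                return status
        time.sleep(interval)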