Processing aborted repeatedly due to "Authorization token expired"

patrick.griffiths · 11 November 2021 15:53

I get this error message while waiting for a batch job to start running:
OpenEoApiError: [500] unknown: [403] TokenInvalid: Authorization token has expired or is invalid. Please authenticate again.

After re-authenticating, the same error comes up again (notebook: UC3 - Crop type feature engineering (rule-based).ipynb).

Could we simply increase the time limit after which a token expires?

0:00:00 Job ‘vito-ec7c5c49-a54f-4221-b659-e3e18ecb1fbd’: send ‘start’
0:00:39 Job ‘vito-ec7c5c49-a54f-4221-b659-e3e18ecb1fbd’: queued (progress N/A)
0:00:45 Job ‘vito-ec7c5c49-a54f-4221-b659-e3e18ecb1fbd’: queued (progress N/A)
0:00:52 Job ‘vito-ec7c5c49-a54f-4221-b659-e3e18ecb1fbd’: queued (progress N/A)
0:01:01 Job ‘vito-ec7c5c49-a54f-4221-b659-e3e18ecb1fbd’: queued (progress N/A)
0:01:11 Job ‘vito-ec7c5c49-a54f-4221-b659-e3e18ecb1fbd’: queued (progress N/A)
0:01:24 Job ‘vito-ec7c5c49-a54f-4221-b659-e3e18ecb1fbd’: queued (progress N/A)

jeroen.dries · 12 November 2021 07:23

That job took only 12 minutes, which is within the time limit, so it could also be something else.
I’ll need to check with Stefaan on Monday!

jeroen.dries · 15 November 2021 09:37

@stefaan.lippens indicates that the actual cause is probably some downtime or a timeout on EGI Check-in itself. Hence we propose to either add some retry logic or improve the error message.
We’ve also contacted EGI support about the downtime itself, as we expect an authentication service to be highly available.

michele.claus · 15 November 2021 11:28

I think this issue occurs only when using execute_batch, right?
Maybe using a different syntax for starting the job would be enough:

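# Create the batch job without entering the blocking status-polling loop
job = datacube.send_job(title="job title")
job_id = job.job_id  # keep the id so the job can be looked up again later
job.start_job()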

jeroen.dries · 15 November 2021 11:53

Not entirely, in the sense that you’ll still get authentication issues when EGI is down.
execute_batch does indeed sometimes fail due to authentication or other server unavailability, but in that case the job itself can still succeed and be inspected with:

job = connection.job(job_id)
job.status()

(which will still give you an authentication exception when EGI is down)

Anyway, for a more technical discussion and solution proposals, see this issue:

github.com/Open-EO/openeo-python-driver

Better handling of HTTP issues/timeouts when resolving OIDC access tokens

opened 10:33AM - 15 Nov 21 UTC

soxofaan

Issue raised in openEO Platform forums:

> 0:01:01 Job ‘vito-ec7c5c49-a54f-422…1-b659-e3e18ecb1fbd’: queued (progress N/A)
> 0:01:11 Job ‘vito-ec7c5c49-a54f-4221-b659-e3e18ecb1fbd’: queued (progress N/A)
> 0:01:24 Job ‘vito-ec7c5c49-a54f-4221-b659-e3e18ecb1fbd’: queued (progress N/A)
> OpenEoApiError: [500] unknown: [403] TokenInvalid: Authorization token has expired or is invalid. Please authenticate again.

In application logs I found around time of that job:

> [2021-11-11 16:55:37,660] 9 WARNING in openeo_driver.users.auth: Failed to resolve OIDC access token
> ...
> requests.exceptions.ConnectionError: HTTPSConnectionPool(host='aai.egi.eu', port=443): Max retries exceeded with url: /oidc/.well-known/openid-configuration (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f90df136910>: Failed to establish a new connection: [Errno 110] Connection timed out'))

If aai.egi.eu is (partially) down, resolving the access token fails.

What could be improved:

- openeo-python-driver: at least make error clearer that the problem is with the identity provider, not the access token itself
- openeo-python-driver: add a bit of retry logic to cover temporary glitches
- openeo-python-client: don't stop the batch job status poll loop when such a temp glitch happens
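
As a rough illustration of the "bit of retry logic" idea from the issue, here is a minimal sketch of a generic retry helper with exponential backoff. It is not the code used in openeo-python-driver; the name `call_with_retries` and all of its parameters are hypothetical and chosen only for this example.

```python
import time


def call_with_retries(func, attempts=3, base_delay=0.5,
                      retry_on=(ConnectionError,), sleep=time.sleep):
    """Call func() and retry on the given exception types (hypothetical helper).

    The wait between attempts doubles each time: base_delay, 2 * base_delay, ...
    The last failure is re-raised unchanged so the caller sees the real error.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                # Out of attempts: propagate the original exception.
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

To cover the failure in the log above, one would pass `retry_on=(requests.exceptions.ConnectionError,)`, because that exception class from the requests library does not derive from Python's built-in `ConnectionError`. A small number of attempts keeps a real outage from stalling every request for long.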

stefaan.lippens · 15 November 2021 16:12

Some improvements that are implemented for this error situation:

- On the backend side (aggregator and Terrascope drivers): the error will have a 503 status (“Service unavailable”), the openEO error code OidcProviderUnavailable and the message “OIDC Provider is unavailable”, which is a lot better than “Authorization token has expired or is invalid”. These improvements still have to be deployed in production.
- On the Python client side: execute_batch will ignore these 503 errors (assuming it is a temporary issue), so the polling loop should no longer fail due to a minor connection glitch. If there are too many of these “soft” errors (more than 10), the loop will fail, so that you don't end up in an endless loop. This fix is not released yet and will be part of the next release, 0.9.1.
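
The client-side behaviour described above can be sketched roughly as follows. This is an illustration only, not the actual openeo-python-client code: `ServiceUnavailable`, `poll_until_done` and the `get_status` callable are hypothetical stand-ins, the set of final statuses is assumed, and the sketch assumes soft errors are counted in total rather than consecutively.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for a 503 / OidcProviderUnavailable API error."""


def poll_until_done(get_status, max_soft_errors=10, poll_interval=10.0,
                    sleep=time.sleep):
    """Poll get_status() until the job reaches a final state.

    Temporary 503-style failures are counted as "soft" errors and skipped.
    Once more than max_soft_errors have occurred, the loop gives up instead
    of polling forever.
    """
    soft_errors = 0
    while True:
        try:
            status = get_status()
        except ServiceUnavailable as exc:
            soft_errors += 1
            if soft_errors > max_soft_errors:
                raise RuntimeError(
                    f"Giving up after {soft_errors} soft errors"
                ) from exc
        else:
            # Assumed set of final job states for this sketch.
            if status in {"finished", "error", "canceled"}:
                return status
        sleep(poll_interval)
```

With the real client, `get_status` would be something like `job.status` for a job obtained via `connection.job(job_id)`, wrapped so that a 503 response is translated into `ServiceUnavailable`.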