There is currently some inconsistent behavior with how /batc

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Some design thoughts for this approach: On the service, <code

We need to consider this case: A user submits some tasks <code

Handling batch submit/status failure for many tasks about funcx-web-service HOT 7 OPEN

knagaitsev commented on July 29, 2024

Handling batch submit/status failure for many tasks

from funcx-web-service.

Comments (7)

yadudoc commented on July 29, 2024

I agree with 200 on the full batch going through. On a partial failure, we could do 207: multi-status error code, more details here: https://evertpot.com/http/207-multi-status. In this situation, you'd try each item in the batch, and return a status code for each item that can be iterated on.

So we'd have:
200 -> All OK
207 -> Some or all tasks failed to submit, must iterate on the response to figure out what failed.
4XX/5XX -> Errors that block task submission entirely. No tasks could've been submitted.

from funcx-web-service.

knagaitsev commented on July 29, 2024

@yadudoc This seems reasonable, my only concern is about this case: user does fxc.run() for a single task, the task fails due to something that would normally cause a 207 (such as EndpointNotFound for the particular task they sent), how should the failure be displayed in a user friendly way?

If the user had called batch_run, it would be reasonable for them to expect a multi-status response. But should we raise an Exception on the client sdk if we get a 207? And if so, should we try to display it in a generic way that works for both single and multi-task requests like Submission failed for the following tasks: {failed_task_ids}. Task submission failed for the following reason(s): {list of stringified exceptions from the error responses}. Maybe Exception names like BatchSubmitFailed and BatchStatusFailed, which can then contain all the response data.

Alternatively, maybe when we get a 207 back, we just want to start iterating through it and raise the first error that we come across in the list of responses. Like if a user had multiple EndpointNotFound errors for a batch run, it would just show them the first one, they would have to fix that, then it would show them the next one if they tried again. But I'm not sure that is a great approach unless there is some way to extract the response data out of the batch_run python API call so that the user can see all the results.

from funcx-web-service.

yadudoc commented on July 29, 2024

Handling partial failures is easier in fxc.run() for a single task because anything other than status_code:200 implies that the task submission failed, and the SDK ought to inspect the response to the appropriate exception. With a single task, you can immediately raise the exception with additional info placed into the traceback.

# This raises say, EndpointAccessForbidden which is contained in the 207 response 
fxc.run(args, endpoint_id='BAD', function_id=fn_id)

batch_run is a bit more problematic because it isn't quite clear at what point we'd raise the exception. If we raise an exception at batch_run and some tasks were launched, we have to be careful not to lose the task_ids. I'm inclined to just return task_ids for the one's that failed, with the exception added to the internal table so that, and we raise it when the task status is queried.

# We raise an exception for 4xx and 5xx errors.
task_uuids = fxc.batch_run(batch)

# For 207 multi-status, we set the results for the failed tasks with their corresponding exceptions
for tid in task_uuids:
     fxc.get_result(tid)   # This might raise EndpointAccessForbidden for tasks which failed to launch in the batch.

from funcx-web-service.

knagaitsev commented on July 29, 2024

Some design thoughts for this approach:

On the service, auth_and_launch will need to save the task in redis with the Task constructor even if the task fails to launch, and needs to indicate the exact exception that made this task fail to launch (this must be different from the current Task.exception property which refers to an exception that occurred while running the task, from my understanding).
The funcx sdk error_handling_client will need to be modified to allow HTTP status 207 to pass through, where it can then be handled differently based on if fxc.run or fxc.batch_run is called.
fxc.run will need to be modified to raise an exception if there is a 207
fxc.batch_run will need to be modified to collect the task_ids from the collection of responses
the /tasks/<task_id> and /batch_status routes will be modified to understand the failed to launch Task data from above

the returned format for /submit will change from

{'status': 'Success',
'task_uuids': [],
'task_uuid': ""}

to:

[
  {
    'status': 'Success',
    'task_uuid': "",
    'http_status_code': 200
  },
  {
    'status': 'Failed',
    'code': 1,
    'task_uuid': "",
    'http_status_code': 4XX/5XX,
    ...
  },
  ...
]

from funcx-web-service.

knagaitsev commented on July 29, 2024

We need to consider this case:

A user submits some tasks a and b in a script, but doesn't bother checking the tasks later with fxc.get_result(tid) in that same script. Task b fails because an invalid endpoint was provided for that task, so a 207 is sent back and fxc.batch_run returns the list of task ids (this response contains the EndpointNotFound for task b)
The user runs a new script that checks the results with fxc.get_result, but gets a TaskNotFound for task b instead of an EndpointNotFound, because the launch failure info was not saved on the service-side

This behavior is consistent with the current model that once the client gets a state once, that state is no longer saved on our end. If the user wanted to know that b failed with EndpointNotFound, they should've checked the results with fxc.get_result after submitting the task. This behavior may not be ideal, but it seems to be the best for scale and maintaining a model where state is not stored after it has been made known to the client. We should make it clear in documentation and examples that a user should always check the results of tasks after submitting using the same funcx client.

from funcx-web-service.

knagaitsev commented on July 29, 2024

We also need to do more thinking about standardizing the format of task status objects. The current schema for task info from the service is pretty nice:

{
            'task_id': task_id,
            'status': task_status,
            'result': task_result,
            'completion_t': task_completion_t,
            'exception': task_exception
        }

though we should switch status to task_status I think. The messy part I feel is that on the sdk side, just a subset of these fields are saved in the sdk task data structure. I think it would be best to keep everything uniform and have an identical task status object on the sdk side when it is available.

Also, we need to think about if it is a bad idea to mix these with error objects, like when we are returning status info.

from funcx-web-service.

knagaitsev commented on July 29, 2024

FuncX SDK Changes

Non-async mode for single task run (default)

run() returns a single task_id for a 200 response, raises the exception if an error is sent back during submission

async mode

run() returns a single future for a 200 response (need to discuss - should it raise exception if an error is sent back during submission, or should it add the exception to the returned future with set_exception in certain cases?)

batch_run() returns a list of futures corresponding to each submitted batch item for a 200/207 response. For a 4XX/5XX response, the exception is raised, as this indicates the entirety of the submission failed. If a task in the list fails to submit, the exception can be immediately attached to that future with set_exception and we don't need to add it to the async queue.

from funcx-web-service.

Handling batch submit/status failure for many tasks about funcx-web-service HOT 7 OPEN

Comments (7)

FuncX SDK Changes

Non-async mode for single task run (default)

async mode

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent