Comments (7)
I agree with 200 on the full batch going through. On a partial failure, we could return the 207 Multi-Status code; more details here: https://evertpot.com/http/207-multi-status. In this situation, you'd try each item in the batch and return a status code for each item, which the client can then iterate over.
So we'd have:
200 -> All OK
207 -> Some or all tasks failed to submit, must iterate on the response to figure out what failed.
4XX/5XX -> Errors that block task submission entirely. No tasks could've been submitted.
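As a rough sketch of how a client might branch on these three cases (the names SubmitOutcome and classify_submit_status are hypothetical and not part of the funcx SDK):

```python
from enum import Enum


class SubmitOutcome(Enum):
    """Hypothetical labels for the three response classes listed above."""
    ALL_OK = "all_ok"    # 200: every task in the batch was accepted
    PARTIAL = "partial"  # 207: iterate over the items to see what failed
    BLOCKED = "blocked"  # 4XX/5XX: no task could have been submitted


def classify_submit_status(status_code: int) -> SubmitOutcome:
    """Map an HTTP status code from a submit request to an outcome."""
    if status_code == 200:
        return SubmitOutcome.ALL_OK
    if status_code == 207:
        return SubmitOutcome.PARTIAL
    if 400 <= status_code < 600:
        return SubmitOutcome.BLOCKED
    # Anything else is outside the scheme described in this comment.
    raise ValueError(f"unexpected status code: {status_code}")
```

Only the 207 branch requires looking inside the response body; the other two can be handled from the status code alone.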
from funcx-web-service.
@yadudoc This seems reasonable. My only concern is this case: a user calls fxc.run() for a single task, and the task fails due to something that would normally cause a 207 (such as EndpointNotFound for the particular task they sent). How should the failure be displayed in a user-friendly way?
If the user had called batch_run, it would be reasonable for them to expect a multi-status response. But should we raise an Exception on the client sdk if we get a 207? And if so, should we try to display it in a generic way that works for both single and multi-task requests, like "Submission failed for the following tasks: {failed_task_ids}. Task submission failed for the following reason(s): {list of stringified exceptions from the error responses}"? Maybe with Exception names like BatchSubmitFailed and BatchStatusFailed, which can then contain all the response data.
Alternatively, maybe when we get a 207 back, we just want to start iterating through it and raise the first error that we come across in the list of responses. For example, if a user had multiple EndpointNotFound errors for a batch run, it would just show them the first one; they would have to fix that, and then it would show them the next one if they tried again. But I'm not sure that is a great approach unless there is some way to extract the response data out of the batch_run python API call so that the user can see all the results.
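One possible shape for the aggregate exception idea, as a minimal sketch. It assumes each response item is a dict with 'status' and 'task_uuid' keys, as in the per-item format proposed later in this thread; the 'reason' key is a hypothetical field used here only for illustration.

```python
class BatchSubmitFailed(Exception):
    """Sketch of an exception that carries the whole multi-status response.

    Assumes each item is a dict with 'status' and 'task_uuid';
    'reason' is a hypothetical field holding a stringified error.
    """

    def __init__(self, responses):
        self.responses = list(responses)
        # Keep the failed items so callers can inspect all of them,
        # not just the first one.
        self.failed = [r for r in self.responses if r.get("status") != "Success"]
        failed_task_ids = [r.get("task_uuid") for r in self.failed]
        reasons = [str(r.get("reason", "unknown")) for r in self.failed]
        super().__init__(
            f"Submission failed for the following tasks: {failed_task_ids}. "
            f"Task submission failed for the following reason(s): {reasons}"
        )
```

Because the full response list stays attached to the exception, a caller that catches it can still see every per-task result, which addresses the concern about only surfacing the first error.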
Handling partial failures is easier in fxc.run() for a single task because anything other than status_code:200 implies that the task submission failed, and the SDK ought to inspect the response to raise the appropriate exception. With a single task, you can immediately raise the exception with additional info placed into the traceback.
# This raises say, EndpointAccessForbidden which is contained in the 207 response
fxc.run(args, endpoint_id='BAD', function_id=fn_id)
batch_run is a bit more problematic because it isn't quite clear at what point we'd raise the exception. If we raise an exception at batch_run and some tasks were launched, we have to be careful not to lose the task_ids. I'm inclined to just return task_ids for the ones that failed as well, with the exception added to the internal table so that we raise it when the task status is queried.
# We raise an exception for 4xx and 5xx errors.
task_uuids = fxc.batch_run(batch)
# For 207 multi-status, we set the results for the failed tasks with their corresponding exceptions
for tid in task_uuids:
    fxc.get_result(tid)  # This might raise EndpointAccessForbidden for tasks which failed to launch in the batch.
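A minimal sketch of the "internal table" idea, independent of the real SDK. DeferredFailureTable and the fetch callback are hypothetical names, and EndpointAccessForbidden is defined here only as a stand-in for the real exception class.

```python
class EndpointAccessForbidden(Exception):
    """Stand-in for the real exception type, defined only for this sketch."""


class DeferredFailureTable:
    """Remembers launch failures so they are raised when a result is queried."""

    def __init__(self):
        self._failures = {}  # task_id -> exception recorded at submit time

    def record(self, task_id, exc):
        self._failures[task_id] = exc

    def get_result(self, task_id, fetch):
        """Raise the stored launch failure, else defer to fetch(task_id)."""
        exc = self._failures.get(task_id)
        if exc is not None:
            raise exc
        return fetch(task_id)
```

With this shape, batch_run can return every task_id without raising, and the failure surfaces at the same place a runtime failure would.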
Some design thoughts for this approach:
- On the service, auth_and_launch will need to save the task in redis with the Task constructor even if the task fails to launch, and needs to indicate the exact exception that made this task fail to launch (this must be different from the current Task.exception property, which refers to an exception that occurred while running the task, from my understanding).
- The funcx sdk error_handling_client will need to be modified to allow HTTP status 207 to pass through, where it can then be handled differently based on whether fxc.run or fxc.batch_run is called.
- fxc.run will need to be modified to raise an exception if there is a 207
- fxc.batch_run will need to be modified to collect the task_ids from the collection of responses
- the /tasks/<task_id> and /batch_status routes will be modified to understand the failed to launch Task data from above
- the returned format for /submit will change from
{'status': 'Success',
'task_uuids': [],
'task_uuid': ""}
to:
[
{
'status': 'Success',
'task_uuid': "",
'http_status_code': 200
},
{
'status': 'Failed',
'code': 1,
'task_uuid': "",
'http_status_code': 4XX/5XX,
...
},
...
]
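To illustrate how the service side might derive the overall status from such a list, here is a small sketch. The function name summarize_submit_items is hypothetical, and the rule "200 only if every item is 200, otherwise 207" is an assumption based on the status scheme discussed earlier in the thread.

```python
def summarize_submit_items(items):
    """Return (overall_http_status, task_uuids) for a list of per-task items.

    Assumes each item is a dict with 'task_uuid' and 'http_status_code',
    as in the proposed format above.
    """
    task_uuids = [item["task_uuid"] for item in items]
    all_ok = all(item["http_status_code"] == 200 for item in items)
    # 207 signals that the caller must inspect the individual items.
    return (200 if all_ok else 207), task_uuids
```

Request-level errors (4XX/5XX that block the whole submission) would be raised before any per-task list is built, so they are not handled here.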
We need to consider this case:
- A user submits some tasks a and b in a script, but doesn't bother checking the tasks later with fxc.get_result(tid) in that same script. Task b fails because an invalid endpoint was provided for that task, so a 207 is sent back and fxc.batch_run returns the list of task ids (this response contains the EndpointNotFound for task b)
- The user runs a new script that checks the results with fxc.get_result, but gets a TaskNotFound for task b instead of an EndpointNotFound, because the launch failure info was not saved on the service-side
This behavior is consistent with the current model in which, once the client has retrieved a state, that state is no longer saved on our end. If the user wanted to know that b failed with EndpointNotFound, they should've checked the results with fxc.get_result after submitting the task. This behavior may not be ideal, but it seems to be the best for scale and for maintaining a model where state is not stored after it has been made known to the client. We should make it clear in documentation and examples that a user should always check the results of tasks after submitting, using the same funcx client.
We also need to do more thinking about standardizing the format of task status objects. The current schema for task info from the service is pretty nice:
{
'task_id': task_id,
'status': task_status,
'result': task_result,
'completion_t': task_completion_t,
'exception': task_exception
}
though we should switch status to task_status I think. The messy part, I feel, is that on the sdk side only a subset of these fields is saved in the sdk task data structure. I think it would be best to keep everything uniform and have an identical task status object on the sdk side when it is available.
Also, we need to think about whether it is a bad idea to mix these with error objects, like when we are returning status info.
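As a sketch of what an identical task status object on the sdk side could look like, mirroring the service schema above field for field. The class name TaskStatus, the from_service helper, and the field types are assumptions; the key is kept as status here since the rename to task_status is still only a suggestion.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TaskStatus:
    """SDK-side mirror of the service's task info schema (types are guesses)."""
    task_id: str
    status: str
    result: Any = None
    completion_t: Optional[str] = None
    exception: Optional[str] = None

    @classmethod
    def from_service(cls, payload: dict) -> "TaskStatus":
        """Build a TaskStatus from the dict returned by the service."""
        return cls(
            task_id=payload["task_id"],
            status=payload["status"],
            result=payload.get("result"),
            completion_t=payload.get("completion_t"),
            exception=payload.get("exception"),
        )
```

Keeping every field, even when it is None, avoids the situation where the sdk silently drops information that the service provided.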
FuncX SDK Changes
Non-async mode for single task run (default)
run() returns a single task_id for a 200 response, raises the exception if an error is sent back during submission
async mode
run() returns a single future for a 200 response (need to discuss - should it raise exception if an error is sent back during submission, or should it add the exception to the returned future with set_exception in certain cases?)
batch_run() returns a list of futures corresponding to each submitted batch item for a 200/207 response. For a 4XX/5XX response, the exception is raised, as this indicates the entirety of the submission failed. If a task in the list fails to submit, the exception can be immediately attached to that future with set_exception and we don't need to add it to the async queue.
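The set_exception behavior described for batch_run() in async mode could look roughly like this. The sketch uses the standard library's concurrent.futures.Future rather than whatever future type the SDK actually uses; futures_from_submit_items and make_exception are hypothetical names, and the item shape follows the per-task format proposed earlier in the thread.

```python
from concurrent.futures import Future


def futures_from_submit_items(items, make_exception):
    """Return one Future per submitted item, in submission order.

    Items that failed to submit get their exception attached right away;
    the rest are left pending (the SDK would add those to its async queue).
    make_exception is a hypothetical callback that turns a failed item
    into an exception instance.
    """
    futures = []
    for item in items:
        fut = Future()
        if item["status"] != "Success":
            # Resolved immediately, so it never needs to be polled.
            fut.set_exception(make_exception(item))
        futures.append(fut)
    return futures
```

A caller that later invokes result() on a failed future gets the submission error raised at that point, while successful tasks behave as they do today.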