illinois-cs241 / broadway-api

This is the old repo for the Broadway API. Please see the new repo for the latest version of Broadway: https://github.com/illinois-cs241/broadway

License: Other

Language: Python 100.00%
Topics: autograder, broadway, python

broadway-api's People

Contributors

andyclee, ayushr2, bhuvy2, nmagerko, rod-lin, zhengyao-lin

Forkers

bunnybrewery

broadway-api's Issues

Depfu Error: No dependency files found

Hello,

We've tried to activate or update your repository on Depfu and couldn't find any supported dependency files. If we were to guess, we would say that this is not actually a project Depfu supports and it has probably been activated in error.

Monorepos

Please note that Depfu currently only searches for your dependency files in the root folder. We do support monorepos and non-root files, but don't auto-detect them. If that's the case with this repo, please send us a quick email with the folder you want Depfu to work on and we'll set it up right away!

How to deactivate the project

  • Go to the Settings page of either your own account or the organization you've used
  • Go to "Installed Integrations"
  • Click the "Configure" button on the Depfu integration
  • Remove this repo (illinois-cs241/broadway-api) from the list of accessible repos.

Please note that using the "All Repositories" setting doesn't make a lot of sense with Depfu.

If you think that this is a mistake

Please let us know by sending an email to [email protected].


This is an automated issue by Depfu. You're getting it because someone configured Depfu to automatically update dependencies on this project.

Nice to Have: No more locking

Who uses mutexes? (Joke.) Tornado's design would really rather have us use callback-style yielding -- akin to the Node.js callback style -- instead of locking.

https://stackoverflow.com/questions/47279937/in-tornado-is-threadpoolexecutor-thread-safe

I'm pretty sure there is an easier way to do this without making the code too ugly with yielding. Ideally we would have an object that keeps track of these variables instead of them living inside the service handler calls themselves.

https://stackoverflow.com/questions/25949173/how-does-yield-work-in-tornado-when-making-an-asynchronous-call
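A minimal sketch of the direction, assuming a module-level state object owned by the IOLoop thread (all names here are hypothetical, not the repo's actual code); Tornado coroutines only interleave at explicit yield points, so plain attribute updates need no mutex:

from tornado import gen

class QueueState(object):
    # Shared counters; only ever touched from the IOLoop thread.
    def __init__(self):
        self.jobs_queued = 0

state = QueueState()

@gen.coroutine
def enqueue_job(job):
    # No lock needed: this read-modify-write cannot be preempted,
    # because control is only ceded at the yield below.
    state.jobs_queued += 1
    yield save_job_to_db(job)  # hypothetical async DB call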

On Demand Grading?

May be nice to have on-demand grading, either here or in a different service. We would need to ask the other course staff whether they would want this grade-any-time infrastructure.

Switch to Tornado-Json Extension

We should probably switch to this extension: tornado-json
It enables us to define schemas, which leads to better error handling and simplifies the logic overall. It makes JSON requests easier, circumvents Tornado's annoying habit of providing all arguments as byte strings (which does not play well with Python 3) and mangling lists, and means we no longer have to urlencode and decode on both sides.

Grader Endpoint include Id

Change the grader endpoints to include the worker id as part of the URL rather than as part of the Auth header. This will make it easier to parse Tornado logs and identify which grader made each call.

For instance:
/api/v1/heartbeat becomes /api/v1/heartbeat/worker_id
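A sketch of the routing change (the handler body and pattern name are assumptions, not the repo's actual code):

import tornado.web

class HeartBeatHandler(tornado.web.RequestHandler):
    def post(self, worker_id):
        # worker_id now arrives via the URL instead of the Auth
        # header, so it appears in Tornado's access log line.
        self.write({"worker_id": worker_id})

app = tornado.web.Application([
    (r"/api/v1/heartbeat/(\w+)", HeartBeatHandler),
])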

Handle False Failure Detections

In an asynchronous system it is almost impossible to have both safety and liveness in failure detection, which can lead to live nodes being misclassified as dead.

We currently mark a node as dead if it does not send a heartbeat within 20 seconds, but a machine can hang for longer than that and then continue executing. So in the heartbeat handler, the grading job handler, and the grading result handler, we should check whether the request is coming from a node marked dead; if so, mark it alive again.
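A sketch of that resurrection check, assuming a worker DAO with lookup and update methods (names hypothetical):

def resurrect_if_marked_dead(worker_dao, worker_id):
    # Called at the top of the heartbeat, grading job, and grading
    # result handlers: a request from a "dead" node proves it alive.
    worker = worker_dao.find_by_id(worker_id)  # hypothetical DAO call
    if worker is not None and not worker.alive:
        worker.alive = True
        worker_dao.update(worker)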

Do Not Share Roster

We will be removing the feature of providing the roster to the pre- and post-processing jobs, so we should remove it from the scheduling logic.

Grading Machine Hostname

For debugging purposes, it would be nice to know which job was run by which grading machine. Save this as a new field on the job object stored in the DB.

Client Side Authentication

We want some form of authentication for the clients who can submit pipelines and schedule grading runs. Clients would be the courses using this service; and if we ever expose this to students to schedule their own AG runs, we will have to authenticate them as well.

Environment Variables Structure

In the new design of Broadway, we pre-upload the grading run configs. The issue is that pre-uploading the config takes away the courses' flexibility to add environment variables on a per-run basis: they can only change the student pipeline's environment variables per run (via the student env vars), not those of the pre/post-processing pipelines. So it would be best to redefine how we set environment variables as follows:

  • Global env vars - global across the run. Set in the config.
  • Stage-specific env vars - specific to one stage. Set in the config.
  • Run-specific env vars - previously these were the student-specific env vars, exposed only to the student pipeline on a per-run basis. We now expose them to all pipelines. They change from run to run and are meant for values like net_id and due dates.

We POST the run-specific env vars, along with the correct auth, to kick off an AG run.
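A sketch of the resulting precedence when a container's environment is assembled (function and key names are assumptions): run-specific values layer over stage-specific, which layer over global:

def build_environment(global_env, stage_env, run_env):
    # Later dicts win on key collisions.
    env = dict(global_env)
    env.update(stage_env)
    env.update(run_env)
    return env

# e.g. build_environment({"SEMESTER": "fa18"},
#                        {"STAGE": "grade"},
#                        {"NET_ID": "student1", "DUE_DATE": "2018-12-01"})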

Logging

Currently the logging does not produce log files that are easy to parse, because Tornado generates a lot of log output of its own. We should probably write separate log files for the different kinds of information: the Tornado access log is essentially a timeline of all calls to the API, so we want that in its own file.

Add logging and good error messages for every possible situation so that it is easier to debug.
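A sketch of that split using the standard logging module; "tornado.access" and "tornado.application" are the logger names Tornado actually uses, while the file names are assumptions:

import logging

def configure_logging():
    # Route request lines to their own file so the API timeline
    # stays separate from application-level messages.
    access = logging.getLogger("tornado.access")
    access.propagate = False
    access.addHandler(logging.FileHandler("access.log"))

    app_log = logging.getLogger("tornado.application")
    app_log.addHandler(logging.FileHandler("application.log"))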

Shutdown System

We need a clean way to shut down the entire system: notify all graders and clean up resources.

Extend API to allow for Packaged Containers

Some classes will not open source their autograders and don't want to.
If we don't have a private Docker registry set up for the university, could we potentially take the Coursera approach and install a tar.gz'd Docker image instead?

Instead of changing the required fields, could we also allow the image to be an http(s) link that is fetched with wget?

'image': 'http://path.to/docker/container.tar.gz',

and each of the graders would fetch the tar.gz. Of course there is a problem if that URL is leaked, but that would be a course instructor error, not a Broadway error.

Graders would wget the file and load the container image that way.

Thoughts?
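A sketch of that grader-side flow; note that `docker load` accepts gzipped tarballs produced by `docker save` (URL handling and file names here are illustrative):

import subprocess
import urllib.request

def fetch_and_load_image(image_url):
    # Download the exported image and hand it to the local Docker
    # daemon; equivalent to `wget URL && docker load -i image.tar.gz`.
    path, _ = urllib.request.urlretrieve(image_url, "image.tar.gz")
    subprocess.check_call(["docker", "load", "-i", path])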

Log API Endpoint

We should have an API endpoint, with authentication, to get the logs for a grading run id, something like (not sure if this follows the format):

/log/<id>
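A minimal sketch of such a handler, assuming header-based token auth and a log lookup helper (all names hypothetical):

import tornado.web

class GradingRunLogHandler(tornado.web.RequestHandler):
    def get(self, run_id):
        # Reject callers without the expected client token.
        if self.request.headers.get("Authorization") != expected_token():
            raise tornado.web.HTTPError(401)
        self.write({"log": fetch_logs(run_id)})  # hypothetical lookup

# routed as (r"/api/v1/log/(\w+)", GradingRunLogHandler)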

Cluster Token

Set the cluster token via an environment variable so it does not have to change every time we restart the API and @redsn0w422 does not have a bad time updating it.
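A minimal sketch, assuming a variable name like BROADWAY_CLUSTER_TOKEN:

import os

# Read the token from the environment at startup instead of baking
# it into the config; the variable name is an assumption.
CLUSTER_TOKEN = os.environ.get("BROADWAY_CLUSTER_TOKEN")
if CLUSTER_TOKEN is None:
    raise RuntimeError("BROADWAY_CLUSTER_TOKEN is not set")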

Start Mongo Daemon with API

Possibly set up the mongo daemon from within the API start-up code and kill the process when the API shuts down. The Mongo daemon is annoyingly coupled with the MongoClient: the API behaves weirdly when the daemon is not running and only reports a failure after a long time. Starting it ourselves would make the API easier to run.
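A sketch of that coupling, assuming mongod is on PATH and a local data directory; separately, pymongo's serverSelectionTimeoutMS option can be lowered so a missing daemon fails fast instead of "after a long time":

import atexit
import subprocess

def start_mongod(dbpath="./data"):
    # Launch mongod alongside the API and make sure it dies with us.
    proc = subprocess.Popen(["mongod", "--dbpath", dbpath])
    atexit.register(proc.terminate)
    return proc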

Improve flags and configuration

Right now the fixed config.py doesn't work very well with Docker and is also hard to generate from templates, since it's not a standard data format.

We could potentially switch to YAML for configuration and add a set of environment variables for overriding flags.
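A sketch of that scheme (the file name, key shapes, and BROADWAY_ prefix are assumptions):

import os
import yaml

def load_config(path="config.yaml"):
    with open(path) as f:
        config = yaml.safe_load(f)
    # Environment variables override file values, e.g.
    # BROADWAY_BIND_PORT=8080 overrides the bind_port key.
    for key in config:
        override = os.environ.get("BROADWAY_" + key.upper())
        if override is not None:
            config[key] = override
    return config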

Course Worker Nodes Endpoint

Define a new client endpoint GET /api/v1/worker/[course_id]/all which shows the worker nodes for a specific course. Currently all worker nodes are available to all courses, but later we could have a quota or some policy governing how much of the grading cluster a course has access to.

This should return:

{
     "worker_nodes": [
             {
                  "hostname": <hostname of the machine on which the grader is running>,
                  "jobs_processed": <number of jobs processed since joining the cluster>,
                  "busy": <True/False>,
                  "alive": <True/False>
             }, ...
       ]
}

Pre-processing failures

We should stop the entire AG run if the pre-processing stage fails. We will need to figure out a way for the API to know that a job failed.

Revert to non blocking poll

Holding connections open from graders while we poll the queue does not work well when we scale. With more graders in production, we noticed that some of the connections would time out, which would crash the graders. With more than about 5 graders, holding connections should not make a difference or provide any performance gain anyway.

Worker Node Display erroneous after restart of a worker node

Currently, when a new worker node joins, we create a new entry for the node in the DB as:

{
     _id: abcd
     hostname: host1
     alive: true, ....
}

Then if that worker dies and is restarted, the DB now looks like:

{
     _id: abcd
     hostname: host1
     alive: false, ....
},
{
     _id: efgh
     hostname: host1
     alive: true, ....
}

So when a course checks the system health using the /api/v1/worker/<course>/all endpoint, it returns the DB contents, which now has two entries for the same host.

We want to keep the dead workers in the DB, since grading jobs have a worker_id field containing the id of the worker node that executed the job. This is for debugging purposes, so we can see whether a worker node is malfunctioning (e.g. if all grading jobs from that worker node have unexpected results).

One possible way to solve this is to make the worker node's _id its hostname. This enforces that a hostname can only have one worker node (which is usually what we want anyway, since workers should have as much CPU access as possible).
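A sketch of that keyed-by-hostname registration as a pymongo upsert (collection and field names are assumptions):

def register_worker(workers, hostname):
    # One document per hostname: re-registration flips the existing
    # record back to alive instead of inserting a duplicate entry.
    workers.update_one(
        {"_id": hostname},
        {"$set": {"alive": True, "busy": False}},
        upsert=True,
    )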

Startup script

Create a startup script that starts mongo and then the API, and then, if desired, shuts mongo down afterwards.

Nasty Race Condition

Just realized that once this is truly distributed, if two jobs finish at the same time on two different graders and both send a job update message, there may be a race condition in updating the grading run object in the database. We could either lock this or use atomic Mongo update operators such as "$inc" to do the bookkeeping.
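A sketch of the operator-based option; "$inc" is applied atomically on the server, so concurrent updates cannot lose writes the way a read-modify-write can (the counter field name is an assumption):

from pymongo import ReturnDocument

def mark_job_finished(grading_runs, run_id):
    # Atomically decrement the outstanding-job counter and return
    # the updated document in one round trip.
    return grading_runs.find_one_and_update(
        {"_id": run_id},
        {"$inc": {"student_jobs_left": -1}},
        return_document=ReturnDocument.AFTER,
    )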

"Noisy" containers lead to uncaught exception

I've been having one of our pipeline stages print the stdout of ModelSim so that our nohup files keep a recording of the exact output of the simulation run (completely unnecessary, so I'll remove it), but this seems to be causing an unhandled Mongo error.

Relevant Logs:
ERROR:tornado.application:Uncaught exception POST /api/v1/grading_job/grader2
HTTPServerRequest(protocol='http', host='xxx.xxx.xxx.xxx:1470', method='POST', uri='/api/v1/grading_job/grader2', version='HTTP/1.1', remote_ip='xxx.xxx.xxx.xxx')
Traceback (most recent call last):
 File "/home/nbleier3/.local/lib/python3.5/site-packages/tornado/web.py", line 1592, in _execute
   result = yield result
 File "/home/nbleier3/.local/lib/python3.5/site-packages/tornado/gen.py", line 1133, in run
   value = future.result()
 File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
   raise self._exception
 File "/home/nbleier3/.local/lib/python3.5/site-packages/tornado/gen.py", line 326, in wrapper
   yielded = next(result)
 File "/home/nbleier3/.local/lib/python3.5/site-packages/tornado_json/schema.py", line 160, in _wrapper
   output = rh_method(self, *args, **kwargs)
 File "/home/nbleier3/broadway-api/broadway_api/handlers/worker.py", line 178, in post
   job_log_dao.insert(job_log)
 File "/home/nbleier3/broadway-api/broadway_api/daos/grading_job_log.py", line 23, in insert
   return self._collection.insert_one(document)
 File "/home/nbleier3/.local/lib/python3.5/site-packages/pymongo/collection.py", line 693, in insert_one
   session=session),
 File "/home/nbleier3/.local/lib/python3.5/site-packages/pymongo/collection.py", line 607, in _insert
   bypass_doc_val, session)
 File "/home/nbleier3/.local/lib/python3.5/site-packages/pymongo/collection.py", line 595, in _insert_one
   acknowledged, _insert_command, session)
 File "/home/nbleier3/.local/lib/python3.5/site-packages/pymongo/mongo_client.py", line 1248, in _retryable_write
   return self._retry_with_session(retryable, func, s, None)
 File "/home/nbleier3/.local/lib/python3.5/site-packages/pymongo/mongo_client.py", line 1201, in _retry_with_session
   return func(session, sock_info, retryable)
 File "/home/nbleier3/.local/lib/python3.5/site-packages/pymongo/collection.py", line 590, in _insert_command
   retryable_write=retryable_write)
 File "/home/nbleier3/.local/lib/python3.5/site-packages/pymongo/pool.py", line 584, in command
   self._raise_connection_failure(error)
 File "/home/nbleier3/.local/lib/python3.5/site-packages/pymongo/pool.py", line 745, in _raise_connection_failure
   raise error
 File "/home/nbleier3/.local/lib/python3.5/site-packages/pymongo/pool.py", line 579, in command
   unacknowledged=unacknowledged)
 File "/home/nbleier3/.local/lib/python3.5/site-packages/pymongo/network.py", line 128, in command
   name, size, max_bson_size + message._COMMAND_OVERHEAD)
 File "/home/nbleier3/.local/lib/python3.5/site-packages/pymongo/message.py", line 961, in _raise_document_too_large
   " bytes." % (doc_size, max_size))
pymongo.errors.DocumentTooLarge: BSON document too large (42553799 bytes) - the connected server supports BSON document sizes up to 16793598 bytes.

This isn't crashing the API, nor is it crashing Mongo, but it does appear to terminate the entire grading run.

Suggested fixes include adding a config option to disable logging to MongoDB, and, when logging is enabled but the output is too large, catching the exception, printing some kind of warning, and continuing the grading run.
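A sketch of the second fix around the failing insert; pymongo raises pymongo.errors.DocumentTooLarge as shown in the traceback (the DAO and logger here are stand-ins):

from pymongo.errors import DocumentTooLarge

def insert_job_log(job_log_dao, job_log, logger):
    try:
        job_log_dao.insert(job_log)
    except DocumentTooLarge:
        # Oversized stdout/stderr should not kill the grading run;
        # warn and carry on instead of letting the exception escape.
        logger.warning("job log exceeds BSON size limit; dropping it")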

Add --cap-add to stage config

Hi, I recently added unshare support in https://github.com/illinois-cs241/docker-tools/pull/8 to containerize each tester, but that requires Docker to be run with --cap-add SYS_ADMIN --cap-add NET_ADMIN -e USE_UNSHARE=true.

Is it possible to allow the stage config to change the capabilities of the container? E.g. add an entry to the stage config like this: capability: [ SYS_ADMIN, ... ]
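A sketch of how a grader might translate such an entry into docker run arguments (the key name follows the proposal above; nothing here is the repo's actual schema):

def docker_capability_args(stage_config):
    # e.g. {"capability": ["SYS_ADMIN", "NET_ADMIN"]} becomes
    # ["--cap-add", "SYS_ADMIN", "--cap-add", "NET_ADMIN"]
    args = []
    for cap in stage_config.get("capability", []):
        args += ["--cap-add", cap]
    return args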

Documentation

  • README: general folder-level info, with the top level being a tl;dr of the entire project
  • Docstrings: document functions, classes, and what the APIs look like for each API function
  • Wiki: how to contribute, a more in-depth getting started guide, release notes, and descriptions of the endpoints and how to use them

Too many connections

The grader keeps polling the queue on the server every two seconds or so when the server does not give it a job, which might not be the best design. These graders would ideally be up all day, and such requests fill our logs with useless entries. See if one connection can be held open and waited on until a job arrives, something like epoll.

This also becomes an issue when hosting the server for testing behind ngrok or the like, because the number of connections is restricted to something like 20/minute.
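A sketch of a long-poll version using tornado.locks.Condition (a real Tornado primitive; the queue helper is hypothetical). Producers would call job_available.notify() whenever they enqueue a job:

import datetime
from tornado import gen, locks, web

job_available = locks.Condition()

class PollHandler(web.RequestHandler):
    @gen.coroutine
    def get(self):
        job = try_dequeue_job()  # hypothetical queue helper
        if job is None:
            # Park the connection for up to 30s instead of making
            # the grader retry every two seconds.
            yield job_available.wait(
                timeout=datetime.timedelta(seconds=30))
            job = try_dequeue_job()
        self.write({"job": job})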

Refactor Database Interaction

Currently the database interactions are "baked into" the API -- meaning that the API makes database requests in the middle of its request processing.

As a bow to code cleanliness, we should keep these interactions entirely in a separate module -- especially if we ever need to switch to another database provider (Mongo is generally more performant than other databases in a vacuum, but if a deployer already has a campus cluster of FERPA-compliant MySQL instances, I doubt they'll spend engineering effort conjuring up and maintaining a new Mongo cluster).

Stage timeout configurable

We should be able to specify the timeout for any stage we define.

The following fields should be configurable for any stage (i.e. its corresponding container); see the sketch after this list:

  • image
  • entry point
  • timeout
  • environment
  • networking enabled
  • hostname
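A sketch of a stage entry carrying every field above (values are illustrative, not the repo's actual schema):

stage_config = {
    "image": "cs241/assignment1-grader",
    "entrypoint": ["/grade.sh"],
    "timeout": 600,                  # seconds; the field proposed here
    "environment": {"STAGE": "grade"},
    "networking": False,
    "hostname": "grader",
}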

Possible Post Processing Failure

If a grader is working on one of the last few student jobs and it dies, there is a chance that when the lost-grader handler decrements the number of student jobs left, that was the last student job (i.e. it decrements the count to 0). We do not check whether the count reached zero in that handler, so if this occurs, the post-processing job will never be triggered.
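A sketch of the missing check, reusing the atomic decrement idea from the race-condition issue above (field and helper names are assumptions):

from pymongo import ReturnDocument

def handle_lost_grader_job(grading_runs, run_id):
    run = grading_runs.find_one_and_update(
        {"_id": run_id},
        {"$inc": {"student_jobs_left": -1}},
        return_document=ReturnDocument.AFTER,
    )
    # The lost-grader path must also notice when it just retired
    # the final student job, exactly like the normal-result path.
    if run["student_jobs_left"] == 0:
        schedule_postprocessing_job(run_id)  # hypothetical trigger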

Log job output in DB

We need to save the stderr and stdout from every job so that it is easier to see what went wrong on a per-job basis. This should be saved in a separate DB (not the one that keeps the history of all AG runs and jobs). The additional DB can be purged regularly via a cron job, since it will grow a lot across AG runs.

This follows the issue illinois-cs241/broadway-grader#8
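As an alternative to a cron job, a MongoDB TTL index could expire old log documents automatically; a sketch (collection and field names are assumptions, and each log document would need a datetime in that field):

def ensure_log_expiry(job_logs):
    # Mongo removes documents ~30 days after their "created_at" time.
    job_logs.create_index("created_at", expireAfterSeconds=30 * 24 * 3600)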
