illinois-cs241 / broadway-grader
This is the old repo for Broadway grader. Please see the new repo for newest version of Broadway https://github.com/illinois-cs241/broadway
License: Other
The worker routine sometimes crashes unexpectedly. The heartbeat routine keeps running, which makes the grader look alive even though it is not and the job has failed.
We need to kill the grader if the worker routine fails. Otherwise it will hold on to the job forever.
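One way to tie the two routines together is sketched below. It assumes the worker runs in a thread pool (the traceback further down suggests `concurrent.futures` is in use); `run_grader`, `send_heartbeat` and `heartbeat_interval` are hypothetical names, not the grader's actual API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_grader(worker_routine, send_heartbeat, heartbeat_interval):
    """Heartbeat only while the worker is running.

    Returns the exception that ended the worker, or None if it
    returned normally. All names here are illustrative.
    """
    stop = threading.Event()
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(worker_routine)
        # Wake the heartbeat loop as soon as the worker finishes or raises.
        future.add_done_callback(lambda _future: stop.set())
        # Event.wait returns False on timeout, True once the event is set.
        while not stop.wait(heartbeat_interval):
            send_heartbeat()
        return future.exception()
```

The caller could then exit with a non-zero status when the returned value is not `None`, so the heartbeat never outlives a crashed worker.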
Having to go inside the application logic to edit a constant feels wrong. We should have a config.py
(or equivalent) that gets imported into the application instead.
A good example is how the grade distributor (Warden) does it.
After a SIGINT is sent to a worker node, the node can sleep for up to HEARTBEAT_INTERVAL before shutting down. I was thinking we could replace time.sleep with a conditional wait (Event.wait) so that the SIGINT handler can interrupt the sleep.
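A minimal sketch of that idea, assuming the heartbeat loop runs in its own thread. The interval value and the function names are placeholders; only `threading.Event.wait` and `signal.signal` are standard library calls.

```python
import signal
import threading

HEARTBEAT_INTERVAL = 10  # seconds; placeholder value, not the real setting

shutdown = threading.Event()


def handle_sigint(signum, frame):
    # Setting the event wakes any thread blocked in shutdown.wait().
    shutdown.set()


def install_handler():
    # Must be called from the main thread.
    signal.signal(signal.SIGINT, handle_sigint)


def heartbeat_routine(send_heartbeat, interval=HEARTBEAT_INTERVAL):
    while not shutdown.is_set():
        send_heartbeat()
        # Unlike time.sleep, this returns immediately once shutdown is set.
        shutdown.wait(interval)
```

With this shape the worst-case shutdown delay is the time of one `send_heartbeat` call rather than a full interval.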
Currently there are four endpoints for workers:
POST /api/v1/worker/<worker_id>: Worker registration
GET /api/v1/grading_job/<worker_id>: Request grading job
POST /api/v1/grading_job/<worker_id>: Uploading result
POST /api/v1/heartbeat/<worker_id>: Heartbeat
Since there is a lot of waiting and polling going on on both sides to synchronize state, I was wondering if we could replace these four endpoints with a single WebSocket endpoint, say /api/v1/worker-ws/<worker_id>, with the following behavior (all messages in JSON):
open: Register/update worker_id and save the connection to a global map
on_message: (worker sending back a result) validate and save the result
on_close: Mark worker worker_id as dead
We would also need a new thread for job assignment and load balancing on the API side (or just a function called when a job is finished (on_message) and a new job has arrived).
The main benefit is that we can (ideally) improve throughput by reducing the time spent in waiting and re-polling jobs, and reduce the number of threads.
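The state that such an endpoint would keep can be sketched independently of any WebSocket library. The class below is only an illustration: the class name, the `job_id` and `result` message fields, and the in-memory storage are all assumptions, and a real handler would call these methods from the library's own open/message/close hooks.

```python
import json


class WorkerConnections:
    """Tracks worker connections for a hypothetical single WebSocket endpoint."""

    def __init__(self):
        self.connections = {}  # worker_id -> connection object (the "global map")
        self.results = {}      # job_id -> result payload
        self.dead = set()      # worker_ids whose connection has closed

    def on_open(self, worker_id, connection):
        # Register or update the worker and remember its connection.
        self.dead.discard(worker_id)
        self.connections[worker_id] = connection

    def on_message(self, worker_id, raw):
        # Worker is sending back a result: validate, then save it.
        message = json.loads(raw)
        if "job_id" not in message or "result" not in message:
            raise ValueError("malformed result message")
        self.results[message["job_id"]] = message["result"]

    def on_close(self, worker_id):
        # Connection dropped: mark the worker as dead.
        self.connections.pop(worker_id, None)
        self.dead.add(worker_id)
```

A closed connection replaces the missed-heartbeat check here, which is where the reduction in polling would come from.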
Probably rewrite the Docker runner using docker-py to reduce dependencies, make it easier to test, and simplify things. Most importantly, get rid of Node.
Chainlink uses a non-async function to pull images which would block the "keep-alive" coroutine in the websocket library.
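A common way around a blocking call inside an asyncio program is to hand it to an executor thread. The sketch below uses a stand-in function in place of the real pull so that it runs without Docker; `pull_image_blocking` and `pull_image` are hypothetical names.

```python
import asyncio
import time


def pull_image_blocking(image):
    # Stand-in for a blocking call such as docker-py's client.images.pull(image).
    time.sleep(0.2)
    return image


async def pull_image(image):
    loop = asyncio.get_running_loop()
    # Run the blocking call in the default thread pool so the event loop,
    # and with it the keep-alive coroutine, keeps being scheduled.
    return await loop.run_in_executor(None, pull_image_blocking, image)
```

While `pull_image` is awaited, other coroutines on the same loop continue to run.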
Instead of using /tmp to hold the temp directory for a job, use the cwd.
If the job temp directory is in /tmp, it could be on a tmpfs, and filling it up would block all incoming ssh connections until reboot.
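With the standard library this is a one-argument change. The helper name and the `job-` prefix below are illustrative only.

```python
import os
import tempfile


def make_job_dir(base=None):
    """Create a per-job temp directory under base (default: the cwd), not /tmp."""
    base = base or os.getcwd()
    # TemporaryDirectory removes the directory when used as a context manager.
    return tempfile.TemporaryDirectory(prefix="job-", dir=base)
```

Usage would look like `with make_job_dir() as job_dir: ...`, with the directory removed when the block exits.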
Record and save the following for each stage (respective container):
This should be added to the INFO array returned to the API.
Cleanly disconnect grader from the system and clean up resources. Probably will need to add another endpoint to the API.
As we scale and our single-threaded API gets cluttered with excessive calls, there is a high chance that at times the job poll requests to the API (which currently block for 15 seconds) will time out (20 seconds).
We have to guard against timeouts and possibly request a longer request timeout.
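A rough sketch of such a guard is shown below. It assumes the HTTP call is wrapped in a `fetch(timeout)` function that raises the built-in `TimeoutError` on timeout (a real client library may raise its own exception type that would need translating). The retry count and backoff are made-up values; the 20 second timeout is the one quoted above.

```python
import time

CLIENT_TIMEOUT_SECONDS = 20  # client-side timeout quoted in the issue


def poll_job(fetch, timeout=CLIENT_TIMEOUT_SECONDS, max_attempts=3,
             backoff=1.0, sleep=time.sleep):
    """Call fetch(timeout), retrying on TimeoutError with linear backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(timeout)
        except TimeoutError:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            sleep(backoff * attempt)
```

Retrying keeps a single slow poll from killing the worker, while the final re-raise still surfaces an API that is genuinely unreachable.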
We are currently pulling the Docker images every time a student job runs. We might not want to pull each time, because the number of pulls on Docker Hub keeps growing with each AG run.
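One possible shape for this, sketched under the assumption that pulling once per grader process is acceptable: remember which images have already been pulled and skip the rest. `ImagePuller` is a hypothetical helper, and the `pull` callable stands in for whatever actually talks to Docker.

```python
import threading


class ImagePuller:
    """Pulls each image at most once per process (illustrative only)."""

    def __init__(self, pull):
        self._pull = pull        # callable that performs the real pull
        self._pulled = set()     # image names already pulled
        self._lock = threading.Lock()

    def ensure(self, image):
        """Pull image if not pulled yet. Returns True if a pull happened."""
        with self._lock:
            if image in self._pulled:
                return False
            self._pull(image)    # a failed pull raises and is not recorded
            self._pulled.add(image)
            return True
```

The trade-off is staleness: an image updated on Docker Hub under the same tag would not be picked up until the grader restarts. Holding the lock during the pull also serializes pulls, which keeps the sketch simple.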
When the chainlink instance crashes, a logging error occurs in the main broadway-grader process. Ultimately there is no large impact, since the grader can recover and process new jobs. However, the log generated by the chainlink exception is mixed in with the logging error.
Found in the process of investigating illinois-cs241/chainlink#20
INFO:root:Starting job 5d0560b9c3e0992fd5ef770b
Exception in thread Thread-3:
Traceback (most recent call last):
File "/home/jason/Documents/broadway-grader/env/lib/python3.7/site-packages/docker/api/client.py", line 256, in _raise_for_status
response.raise_for_status()
File "/usr/lib/python3/dist-packages/requests/models.py", line 940, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.35/images/create?tag=0.2.1&fromImage=jasonbilas%2Fquartus_build
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "/home/jason/Documents/chainlink/chainlink/__init__.py", line 69, in _pull_image
client.images.pull(image)
File "/home/jason/Documents/broadway-grader/env/lib/python3.7/site-packages/docker/models/images.py", line 441, in pull
repository, tag=tag, stream=True, **kwargs
File "/home/jason/Documents/broadway-grader/env/lib/python3.7/site-packages/docker/api/image.py", line 400, in pull
self._raise_for_status(response)
File "/home/jason/Documents/broadway-grader/env/lib/python3.7/site-packages/docker/api/client.py", line 258, in _raise_for_status
raise create_api_error_from_http_exception(e)
File "/home/jason/Documents/broadway-grader/env/lib/python3.7/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
raise cls(e, response=response, explanation=explanation)
docker.errors.NotFound: 404 Client Error: Not Found ("manifest for jasonbilas/quartus_build:0.2.1 not found: manifest unknown: manifest unknown")
--- Logging error ---
Traceback (most recent call last):
File "run.py", line 95, in worker_routine
chain = Chainlink(job[api_keys.STAGES], workdir=os.getcwd())
File "/home/jason/Documents/chainlink/chainlink/__init__.py", line 27, in __init__
self._pull_images()
File "/home/jason/Documents/chainlink/chainlink/__init__.py", line 64, in _pull_images
raise ValueError("Failed to pull all images")
ValueError: Failed to pull all images
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.7/logging/__init__.py", line 1034, in emit
msg = self.format(record)
File "/usr/lib/python3.7/logging/__init__.py", line 880, in format
return fmt.format(record)
File "/usr/lib/python3.7/logging/__init__.py", line 619, in format
record.message = record.getMessage()
File "/usr/lib/python3.7/logging/__init__.py", line 380, in getMessage
msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
File "/usr/lib/python3.7/threading.py", line 885, in _bootstrap
self._bootstrap_inner()
File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.7/concurrent/futures/thread.py", line 80, in _worker
work_item.run()
File "/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "run.py", line 98, in worker_routine
logger.critical("Grading job failed with exception:\n{}", ex)
Message: 'Grading job failed with exception:\n{}'
Arguments: (ValueError('Failed to pull all images'),)
--- Logging error ---
Traceback (most recent call last):
File "run.py", line 95, in worker_routine
chain = Chainlink(job[api_keys.STAGES], workdir=os.getcwd())
File "/home/jason/Documents/chainlink/chainlink/__init__.py", line 27, in __init__
self._pull_images()
File "/home/jason/Documents/chainlink/chainlink/__init__.py", line 64, in _pull_images
raise ValueError("Failed to pull all images")
ValueError: Failed to pull all images
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.7/logging/__init__.py", line 1034, in emit
msg = self.format(record)
File "/usr/lib/python3.7/logging/__init__.py", line 880, in format
return fmt.format(record)
File "/usr/lib/python3.7/logging/__init__.py", line 619, in format
record.message = record.getMessage()
File "/usr/lib/python3.7/logging/__init__.py", line 380, in getMessage
msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
File "/usr/lib/python3.7/threading.py", line 885, in _bootstrap
self._bootstrap_inner()
File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.7/concurrent/futures/thread.py", line 80, in _worker
work_item.run()
File "/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "run.py", line 98, in worker_routine
logger.critical("Grading job failed with exception:\n{}", ex)
Message: 'Grading job failed with exception:\n{}'
Arguments: (ValueError('Failed to pull all images'),)
INFO:root:Finished job 5d0560b9c3e0992fd5ef770b
INFO:root:Job stdout:
The container crashed
INFO:root:Job stderr:
Failed to pull all images
INFO:root:Sending job results
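The "Logging error" blocks in the trace come from the call at run.py line 98: the message uses a `{}` placeholder, but the logging module formats with `msg % self.args`, so the extra argument cannot be consumed and a TypeError is raised. A likely fix is to use a `%s` placeholder. The sketch below shows the corrected call in isolation; `report_failure` is a hypothetical wrapper, not a function in the grader.

```python
import logging

logger = logging.getLogger(__name__)


def report_failure(ex):
    # logging uses %-style formatting for lazily supplied arguments,
    # so the placeholder must be %s rather than {}.
    logger.critical("Grading job failed with exception:\n%s", ex)
```

Inside an `except` block, `logger.exception(...)` could be used instead to include the traceback in the log record.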