
illinois-cs241 / broadway-grader


This is the old repo for the Broadway grader. For the newest version of Broadway, see the new repo: https://github.com/illinois-cs241/broadway

License: Other

Python 100.00%
autograder broadway nodejs python

broadway-grader's People

Contributors: ayushr2, bhuvy2, nmagerko, rod-lin, zhengyao-lin

broadway-grader's Issues

Worker Routine Crash

The worker routine sometimes crashes at random. The heartbeat routine keeps running, which makes the grader look alive even though it is not and the job has failed.
We need to kill the grader if the worker routine fails. Otherwise it will sit on the job forever.
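
One way to tie the heartbeat to the worker's health is sketched below. It is a minimal illustration, not the grader's actual code: `run_grader`, `worker`, and `heartbeat` are hypothetical names, and the worker is assumed to run in its own thread.

```python
import threading


def run_grader(worker, heartbeat, interval=10.0):
    """Run `worker` in a thread and send heartbeats only while it is alive.

    Returns the exception that killed the worker, or None on a clean exit.
    """
    failure = []

    def guarded_worker():
        try:
            worker()
        except Exception as ex:  # remember the crash instead of losing it
            failure.append(ex)

    thread = threading.Thread(target=guarded_worker, daemon=True)
    thread.start()

    # The heartbeat stops as soon as the worker thread has ended.
    while thread.is_alive():
        heartbeat()
        thread.join(timeout=interval)

    return failure[0] if failure else None
```

The caller can then exit with a non-zero status when `run_grader` returns an exception, so that the process dies instead of holding the job.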

Separate configuration from application logic

Having to go inside the application logic to edit a constant feels wrong. We should have a config.py (or equivalent) that gets imported into the application instead.

A good example is how the grade distributor (Warden) does it.
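
A possible shape for such a config.py is sketched below. The field names, default values, and environment variable names are all assumptions made for illustration. They are not taken from the grader or from Warden.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class GraderConfig:
    # All names and defaults below are hypothetical.
    api_host: str = "http://localhost:8080"
    heartbeat_interval: int = 10      # seconds between heartbeats
    job_request_timeout: int = 20     # seconds before a job poll is abandoned


def load_config(env=None):
    """Build the config from defaults, overridden by environment variables."""
    env = os.environ if env is None else env
    defaults = GraderConfig()
    return GraderConfig(
        api_host=env.get("GRADER_API_HOST", defaults.api_host),
        heartbeat_interval=int(
            env.get("GRADER_HEARTBEAT_INTERVAL", defaults.heartbeat_interval)
        ),
        job_request_timeout=int(
            env.get("GRADER_JOB_REQUEST_TIMEOUT", defaults.job_request_timeout)
        ),
    )
```

The application would then import `load_config` and read values from the returned object instead of defining constants next to the logic.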

Replacing worker endpoints with a single websocket endpoint

Currently there are four endpoints for workers:
POST /api/v1/worker/<worker_id>: Worker registration
GET /api/v1/grading_job/<worker_id>: Request grading job
POST /api/v1/grading_job/<worker_id>: Uploading result
POST /api/v1/heartbeat/<worker_id>: Heartbeat

Since both sides do a lot of waiting and polling to synchronize state,
I was wondering if we could replace these four endpoints with a single WebSocket endpoint, say /api/v1/worker-ws/<worker_id>, with the following behavior (all messages in JSON):

  • open: Register/update worker_id, save the connection to a global map
  • on_message: (worker sending back result) validate and save the result
  • on_close: Mark worker worker_id as dead

We would also need a new thread for job assignment and load balancing on the API side (or just a function called when a job finishes (on_message) and when a new job arrives).

The main benefit is that we can (ideally) improve throughput by reducing the time spent in waiting and re-polling jobs, and reduce the number of threads.
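
The proposed behavior can be sketched without committing to a WebSocket library. In the sketch below, `WorkerHub`, the `job_id` and `result` message fields, and the connection object with a `send(text)` method are assumptions for illustration. A real handler would call these methods from the library's open, message, and close callbacks.

```python
import json
from collections import deque


class WorkerHub:
    def __init__(self):
        self.connections = {}   # worker_id -> connection (the global map)
        self.idle = deque()     # ids of workers waiting for a job
        self.pending = deque()  # jobs waiting for a worker
        self.results = {}       # job_id -> result

    def on_open(self, worker_id, conn):
        self.connections[worker_id] = conn
        self._mark_idle(worker_id)

    def on_message(self, worker_id, message):
        data = json.loads(message)
        if "job_id" not in data or "result" not in data:
            raise ValueError("malformed result message")
        self.results[data["job_id"]] = data["result"]
        self._mark_idle(worker_id)

    def on_close(self, worker_id):
        # Mark the worker as dead: forget its connection and idle slot.
        self.connections.pop(worker_id, None)
        if worker_id in self.idle:
            self.idle.remove(worker_id)

    def submit_job(self, job):
        self.pending.append(job)
        self._dispatch()

    def _mark_idle(self, worker_id):
        if worker_id not in self.idle:
            self.idle.append(worker_id)
        self._dispatch()

    def _dispatch(self):
        # Called when a job finishes and when a job arrives; no polling thread.
        while self.idle and self.pending:
            worker_id = self.idle.popleft()
            self.connections[worker_id].send(json.dumps(self.pending.popleft()))
```

This sketch does not requeue a job whose worker disconnects mid-run. A real implementation would also need to track which job each worker currently holds.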

Docker-py

We should probably rewrite the Docker runner using docker-py to reduce dependencies, make it easier to test, simplify things, and most importantly get rid of Node.

Do not use /tmp

Instead of using /tmp for holding the temp directory for a job, use the cwd.

If the job temp directory is in /tmp, there is a chance that /tmp is a tmpfs, and filling it up would block all incoming SSH connections until reboot.
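
A minimal sketch of creating the job directory under the cwd with the standard library follows. The helper name `job_workdir` and the `job-` prefix are hypothetical.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def job_workdir(base=None):
    """Create a per-job scratch directory under `base` (default: the cwd)."""
    base = os.getcwd() if base is None else base
    path = tempfile.mkdtemp(prefix="job-", dir=base)
    try:
        yield path
    finally:
        # Always clean up, even if the job raised.
        shutil.rmtree(path, ignore_errors=True)
```

Used as `with job_workdir() as path: ...`, the directory lives next to the grader and is removed when the job ends.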

Log more results

Record and save the following for each stage (respective container):

  • running time
  • stdout
  • stderr

This should be added to the INFO array returned to the API.
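
As a rough illustration of the three fields, the sketch below times a stage and captures its output. It uses a plain subprocess as a stand-in for a container, which is not how the grader runs stages. `run_stage` and the dictionary keys are hypothetical names.

```python
import subprocess
import sys
import time


def run_stage(command, timeout=None):
    """Run one stage and record its running time, stdout, and stderr."""
    start = time.monotonic()
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return {
        "running_time": time.monotonic() - start,  # seconds
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


# Example stage: a child Python process standing in for a container.
EXAMPLE_STAGE = [sys.executable, "-c", "print('graded')"]
```

One such dictionary per stage could be appended to the INFO array before the results are sent back.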

Shutdown Grader

Cleanly disconnect grader from the system and clean up resources. Probably will need to add another endpoint to the API.

Job Poll Timeout

As we scale and our single-threaded API gets cluttered with excessive calls, there is a high chance that the job poll requests to the API (which currently block for 15 seconds) will at times hit the request timeout (20 seconds).

We have to guard against timeouts and possibly ask for a longer request timeout.
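
One way to guard the poll is to retry with a growing delay when a request times out. The sketch below is library-agnostic: `poll_job`, `fetch`, and the parameter names are hypothetical, and `timeout_errors` must be set to whatever exception the HTTP client actually raises.

```python
import time


def poll_job(fetch, attempts=3, backoff=1.0,
             timeout_errors=(TimeoutError,), sleep=time.sleep):
    """Call `fetch()` and retry with exponential backoff when it times out."""
    for attempt in range(attempts):
        try:
            return fetch()
        except timeout_errors:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            sleep(backoff * 2 ** attempt)
```

With the requests library, for example, `fetch` could wrap `requests.get(url, timeout=30)` and `timeout_errors` could be `(requests.exceptions.Timeout,)`. The client-side timeout should stay comfortably above the time the API blocks.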

Docker image pulls

We currently pull the Docker images every time a student job runs, so the number of pulls on Docker Hub grows with each AG run. We might not want to pull each time.
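
A minimal sketch of pulling only when the image is missing locally, assuming a docker-py style client is in use. `ensure_image` is a hypothetical helper. The client is passed in so the logic can be exercised without a Docker daemon.

```python
def ensure_image(client, image):
    """Pull `image` only if the local daemon does not already have it.

    `client` is expected to behave like docker.from_env() from docker-py.
    Returns True when a pull was performed.
    """
    if client.images.list(name=image):
        return False  # already present locally, skip the pull
    client.images.pull(image)
    return True
```

The trade-off is that an image re-pushed under the same tag is not picked up until the local copy is removed, so a periodic or explicit refresh may still be needed.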

Logging Error when Chainlink Crashes

When the chainlink instance crashes, a logging error occurs in the main broadway-grader process. The impact is small, since the grader recovers and processes new jobs, but the log generated by the chainlink exception is mixed in with the logging error.

Found in the process of investigating illinois-cs241/chainlink#20

Example Log
INFO:root:Starting job 5d0560b9c3e0992fd5ef770b
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/jason/Documents/broadway-grader/env/lib/python3.7/site-packages/docker/api/client.py", line 256, in _raise_for_status
    response.raise_for_status()
  File "/usr/lib/python3/dist-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.35/images/create?tag=0.2.1&fromImage=jasonbilas%2Fquartus_build

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jason/Documents/chainlink/chainlink/__init__.py", line 69, in _pull_image
    client.images.pull(image)
  File "/home/jason/Documents/broadway-grader/env/lib/python3.7/site-packages/docker/models/images.py", line 441, in pull
    repository, tag=tag, stream=True, **kwargs
  File "/home/jason/Documents/broadway-grader/env/lib/python3.7/site-packages/docker/api/image.py", line 400, in pull
    self._raise_for_status(response)
  File "/home/jason/Documents/broadway-grader/env/lib/python3.7/site-packages/docker/api/client.py", line 258, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/home/jason/Documents/broadway-grader/env/lib/python3.7/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.NotFound: 404 Client Error: Not Found ("manifest for jasonbilas/quartus_build:0.2.1 not found: manifest unknown: manifest unknown")

--- Logging error ---
Traceback (most recent call last):
  File "run.py", line 95, in worker_routine
    chain = Chainlink(job[api_keys.STAGES], workdir=os.getcwd())
  File "/home/jason/Documents/chainlink/chainlink/__init__.py", line 27, in __init__
    self._pull_images()
  File "/home/jason/Documents/chainlink/chainlink/__init__.py", line 64, in _pull_images
    raise ValueError("Failed to pull all images")
ValueError: Failed to pull all images

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/logging/__init__.py", line 1034, in emit
    msg = self.format(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 880, in format
    return fmt.format(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 619, in format
    record.message = record.getMessage()
  File "/usr/lib/python3.7/logging/__init__.py", line 380, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/usr/lib/python3.7/threading.py", line 885, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.7/concurrent/futures/thread.py", line 80, in _worker
    work_item.run()
  File "/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "run.py", line 98, in worker_routine
    logger.critical("Grading job failed with exception:\n{}", ex)
Message: 'Grading job failed with exception:\n{}'
Arguments: (ValueError('Failed to pull all images'),)
--- Logging error ---
Traceback (most recent call last):
  File "run.py", line 95, in worker_routine
    chain = Chainlink(job[api_keys.STAGES], workdir=os.getcwd())
  File "/home/jason/Documents/chainlink/chainlink/__init__.py", line 27, in __init__
    self._pull_images()
  File "/home/jason/Documents/chainlink/chainlink/__init__.py", line 64, in _pull_images
    raise ValueError("Failed to pull all images")
ValueError: Failed to pull all images

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/logging/__init__.py", line 1034, in emit
    msg = self.format(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 880, in format
    return fmt.format(record)
  File "/usr/lib/python3.7/logging/__init__.py", line 619, in format
    record.message = record.getMessage()
  File "/usr/lib/python3.7/logging/__init__.py", line 380, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/usr/lib/python3.7/threading.py", line 885, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.7/concurrent/futures/thread.py", line 80, in _worker
    work_item.run()
  File "/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "run.py", line 98, in worker_routine
    logger.critical("Grading job failed with exception:\n{}", ex)
Message: 'Grading job failed with exception:\n{}'
Arguments: (ValueError('Failed to pull all images'),)
INFO:root:Finished job 5d0560b9c3e0992fd5ef770b
INFO:root:Job stdout:
The container crashed
INFO:root:Job stderr:
Failed to pull all images
INFO:root:Sending job results
Steps to reproduce:
  1. Cause chainlink to crash
  2. Watch output for the logging error to be generated
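
The traceback points at the call on line 98 of run.py, which passes a `{}` placeholder to `logger.critical`. The logging module interpolates arguments with %-style formatting, so the `{}` is never filled and the TypeError above is raised. A likely fix is sketched below. The logger name and the wrapper function `report_failure` are hypothetical.

```python
import logging

logger = logging.getLogger("grader-example")


def report_failure(ex):
    # Use %s, not {}: logging applies `msg % args` when the record is emitted.
    logger.critical("Grading job failed with exception:\n%s", ex)
```

Passing `exc_info=True` to the same call would also include the traceback in the log record.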
