msr-fiddle / blox
License: MIT License
We record attained service from the perspective of the scheduler.
This does not account for the time that scheduler operations take. Modify this metric with whatever additional information is needed to reflect the real time the job has used on the cluster.
Traceback (most recent call last):
  File "blox_new_flow_multi_run.py", line 196, in <module>
    main(args)
  File "blox_new_flow_multi_run.py", line 25, in main
    blox_mgr = BloxManager(args)
  File "/home/wxh/blox/blox/blox_manager.py", line 41, in __init__
    node_manager_port=args.node_manager_port
AttributeError: 'Namespace' object has no attribute 'node_manager_port'
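This error usually means the argument parser in the launching script never defined the option that the manager later reads. A minimal sketch of the likely fix, assuming the script uses `argparse`; the flag name `--node-manager-port` and the default value 50052 are hypothetical and not taken from the Blox source.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="hypothetical launcher arguments")
    # Hypothetical flag: argparse maps "--node-manager-port" to the
    # attribute name "node_manager_port" on the parsed Namespace.
    parser.add_argument("--node-manager-port", type=int, default=50052)
    return parser


args = build_parser().parse_args([])
# getattr with a default avoids the AttributeError when a flag is missing.
port = getattr(args, "node_manager_port", None)
print(port)
```

If the script builds its parser elsewhere, the same `add_argument` call would go wherever the other manager options are declared.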
How can I evaluate the synergy algorithm or reproduce it in Blox?
Is it possible to checkpoint a real-cluster experiment? For example, suppose we need to run 100 rounds of scheduling. The scheduler has already finished 21 rounds, but we hit an issue in round 22. Can we save a checkpoint and restart the real-cluster experiment from round 21? Thanks!
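One generic way to approach this is to persist the round index and scheduler state after every round and reload them on startup. The sketch below is not an existing Blox feature; `save_checkpoint`, `load_checkpoint`, and the shape of `state` are hypothetical, and it assumes the state is JSON-serializable. It also does not capture jobs that are still running on the cluster.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, round_idx: int, state: dict) -> None:
    """Atomically write the round index and scheduler state as JSON."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        json.dump({"round": round_idx, "state": state}, f)
    # Rename into place so a crash never leaves a partially written file.
    os.replace(tmp_path, path)


def load_checkpoint(path: str):
    """Return (next_round, state); start from round 0 if no checkpoint exists."""
    if not os.path.exists(path):
        return 0, {}
    with open(path) as f:
        data = json.load(f)
    return data["round"] + 1, data["state"]
```

A scheduling loop would call `load_checkpoint` once before the first round and `save_checkpoint` at the end of each completed round, so a failure in round 22 resumes from the state saved after round 21.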
In an earlier version of Blox, we checked whether the GPU was free before launching a job. We disabled this check in order to support packing.
I want to add a new check that makes sure the previous job has freed the GPU before new jobs are launched.
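A possible shape for such a check, assuming the node can run `nvidia-smi --query-compute-apps=gpu_uuid,pid --format=csv,noheader` and that a GPU counts as free when no compute process is listed for its UUID. The function names are hypothetical; packed jobs would need to bypass or relax this check.

```python
import subprocess


def busy_gpu_uuids(compute_apps_csv: str) -> set:
    """Parse 'gpu_uuid, pid' CSV rows and return UUIDs that still host a process."""
    busy = set()
    for line in compute_apps_csv.splitlines():
        line = line.strip()
        if not line:
            continue
        # The first CSV field is the GPU UUID; the second is the PID.
        busy.add(line.split(",")[0].strip())
    return busy


def query_compute_apps() -> str:
    """Ask nvidia-smi for running compute processes (requires an NVIDIA driver)."""
    return subprocess.run(
        ["nvidia-smi", "--query-compute-apps=gpu_uuid,pid", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout


def gpu_is_free(uuid: str, compute_apps_csv: str) -> bool:
    """True when no compute process is reported on the given GPU."""
    return uuid not in busy_gpu_uuids(compute_apps_csv)
```

On a real node the launcher would call `gpu_is_free(uuid, query_compute_apps())` and retry or wait when it returns False.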
nvidia-smi -L
In shell script:
UUID_list=(`nvidia-smi -L | awk '{print $NF}' | tr -d '[)]'`)
This can be used to produce a list of GPU UUIDs on a node with nvidia-smi. Requesting that the UUID be added to the cluster state so that it can be used to update gpu_df with the appropriate attributes.
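The same extraction can be sketched in Python by parsing the text that `nvidia-smi -L` prints. The sample listing and its UUIDs below are made up for illustration, and `parse_gpu_uuids` is a hypothetical helper rather than part of Blox.

```python
import re

# Matches the "(UUID: GPU-...)" suffix that nvidia-smi -L prints per device.
UUID_PATTERN = re.compile(r"\(UUID:\s*([^)\s]+)\)")


def parse_gpu_uuids(listing: str) -> list:
    """Return the GPU UUIDs found in `nvidia-smi -L` output, in listing order."""
    return UUID_PATTERN.findall(listing)


# Made-up sample output; real UUIDs will differ.
sample = (
    "GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-11111111-2222-3333-4444-555555555555)\n"
    "GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)\n"
)
uuids = parse_gpu_uuids(sample)
print(uuids)
```

Because the list preserves listing order, the index of each UUID matches the local GPU index, which may help when filling the corresponding rows of gpu_df.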
In the current setup, users typically have to edit the job launch parameters by hand, depending on how they want the job to be launched: https://github.com/msr-fiddle/blox/blob/main/blox/deployment/grpc_client_rm.py#L28
This is ugly. Here are a few alternatives, in order of preference based on some internal discussion:
(i) Set up environment variables that applications can read. We need to test whether the environment variables are overwritten by the subprocess environment.
(ii) Set the environment in Redis on the node manager.
(iii) Provide a function that users can pass to Blox to massage the data into the form they want.
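Option (i) can be sketched with the standard `subprocess` module. The concern about overwriting comes from the `env=` argument replacing the child environment entirely, so the sketch copies the parent environment first. `launch_with_env` and the variable name `BLOX_JOB_ID` are hypothetical and not part of the current Blox code.

```python
import os
import subprocess
import sys


def launch_with_env(cmd: list, job_params: dict) -> subprocess.CompletedProcess:
    """Run cmd with job parameters exposed as environment variables."""
    # Copy the parent environment so PATH and similar variables survive;
    # passing env= without the copy would replace the environment entirely.
    env = os.environ.copy()
    env.update({key: str(value) for key, value in job_params.items()})
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# The child process stands in for a training job that reads its parameters.
child = [sys.executable, "-c", "import os; print(os.environ['BLOX_JOB_ID'])"]
result = launch_with_env(child, {"BLOX_JOB_ID": 7})
print(result.stdout.strip())
```

With this approach the application only needs to read `os.environ`, and the launch code no longer has to be edited per job.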