hu-macsy / simexpal
Simplifying Experimental Algorithmics
Home Page: https://simexpal.readthedocs.io
License: MIT License
Pandas dataframes are quite useful to inspect/analyze/aggregate experimental data. It would be nice to have a simexpal function that exports experiments as a Pandas dataframe where the columns are the instance, the experiment name, and variations, while rows are the single experiments.
We should add a command like
simex builds clean [--all | --build <BUILD> --revision <REVISION>] -f
to delete the clone, compile, and install directories of builds. Similarly for simex develop clean.
Instead of forcing the user to use callbacks, collect_successful_results() should provide an iterable interface.
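A minimal sketch of the difference between the two styles, assuming the current function calls a callback once per successful run. The names collect_with_callback and iter_successful_results and the dict-shaped runs are hypothetical stand-ins, not simexpal's actual API.

```python
def collect_with_callback(runs, callback):
    # Hypothetical stand-in for a callback-style API: call `callback`
    # once for every run that finished successfully.
    for run in runs:
        if run["status"] == "finished":
            callback(run)


def iter_successful_results(runs):
    # Iterable counterpart: yield the successful runs lazily so that
    # callers can use a plain for loop or a comprehension.
    for run in runs:
        if run["status"] == "finished":
            yield run


runs = [
    {"name": "a", "status": "finished"},
    {"name": "b", "status": "failed"},
    {"name": "c", "status": "finished"},
]
names = [run["name"] for run in iter_successful_results(runs)]
```

With a generator, the caller decides how to consume the results (loop, filter, early exit) instead of packing that logic into a callback.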
As the title says, the queue launcher does not create an .err file if an error occurs in simexpal.
For example, using @INSTANCE@ in experiments with fileless instances causes RuntimeError: The instance 'foo' is fileless (assuming the fileless instance is called foo).
We should redirect such errors to an .err file in the aux/_queue/ directory, just as simexpal does for experiments launched with Slurm.
(Implement this while fixing #108)
The queue launcher is currently based on a custom event loop that is used only in a single place. It needs to be revamped. The code can probably be simplified by using Python's selectors module directly. To actually run the experiments, it is probably easier to launch simex internal-invoke in a different process.
- Use selectors directly instead of the custom event loop (we already do that in launch/common.py).
- Instead of calling launch.common.invoke_run, use a subprocess that executes simex internal-invoke on a manifest file.
It would be nice if the queue launcher could get some quality-of-life improvements:
- Add a simex queue status command that asks the daemon whether it is running and/or busy.
- Add a simex queue show command that lists all pending jobs (and maybe the number of completed jobs?).
- Change simex queue stop to only stop the queue after all pending jobs have finished. Add a simex queue kill command to immediately kill the daemon.
It would be neat to tell our users that we can actually already use autocompletion via python-argcomplete.
Simply install argcomplete and put the following in your ~/.bashrc:
source ~/.bash_completion.d/python-argcomplete
eval "$(register-python-argcomplete simex)"
We should possibly add an extra_paths variable for the experiments entries in experiments.yml to add further paths to the PATH environment variable, as noted in #36.
Launching experiments with variants yields the error:
AttributeError: 'dict' object has no attribute 'append'
which is triggered here:
simexpal/simexpal/launch/common.py
Line 173 in 0f656e7
variants_yml is a dict.
Also, if I drop the variants I get the error:
AttributeError: 'RunManifest' object has no attribute 'time'
triggered here:
simexpal/simexpal/launch/common.py
Line 280 in 0f656e7
Are those bugs or am I doing something wrong?
Right now, we only send SIGXCPU. However, a process can catch SIGXCPU and continue running. After some grace period (maybe 30s by default?), we should kill the process with SIGKILL to ensure that the time limit is respected.
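The escalation could look roughly like the sketch below. It assumes a POSIX system and a child started with subprocess.Popen; the helper name enforce_time_limit is made up for illustration, and the 30 s default is the one floated above.

```python
import signal
import subprocess


def enforce_time_limit(proc, timeout, grace=30.0):
    """Wait for `proc`; on timeout send SIGXCPU, then SIGKILL after `grace` seconds."""
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        pass
    # Polite request first: the process may catch SIGXCPU to clean up.
    proc.send_signal(signal.SIGXCPU)
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        # The process ignored SIGXCPU; SIGKILL cannot be caught or ignored.
        proc.kill()
        return proc.wait()
```

A return code of -signal.SIGKILL then tells the caller that the grace period expired as well.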
It would be nice to be able to manage experiments depending on their variations, e.g., I want to purge all experiments that ran with var1 set to x and var2 set to y with something like
simex experiments purge --var1=x --var2=y
Would that be possible?
Right now, simexpal does not support experiments that get their input from stdin; e.g., experiments with the command cat <instance> | <executable> do not work. This is because we launch experiments via subprocess.Popen() with shell=False.
We could add that functionality by opening such instances as stdin or by offering a way to enable shell=True.
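A small sketch of the first option (opening the instance as stdin while keeping shell=False). The function name run_with_instance_on_stdin is hypothetical; only subprocess.Popen and its stdin parameter are taken from the standard library.

```python
import subprocess


def run_with_instance_on_stdin(argv, instance_path):
    # Open the instance file and hand it to the child as its stdin.
    # This keeps shell=False and has the same effect as
    # `cat <instance> | <executable>` without spawning a shell.
    with open(instance_path, "rb") as instance:
        proc = subprocess.Popen(argv, stdin=instance, stdout=subprocess.PIPE)
        stdout, _ = proc.communicate()
    return proc.returncode, stdout
```

Compared to shell=True, this avoids quoting problems with instance paths and does not expose the command line to shell interpretation.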
Right now there is no (supported) way to access the prefix-dirs of other builds during the build process. In some cases this is necessary to make a working build.
To achieve this, we could add a new @-variable like @PREFIX_DIR:<name_of_other_build>@ in the build arguments that substitutes the prefix-dir of another build in the same revision.
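One way the substitution could work is a regular-expression pass over each build argument. This is only a sketch: the function name, the prefix_dirs mapping (build name to prefix-dir of the same revision), the example path, and the set of characters allowed in build names are all assumptions.

```python
import re

# Matches @PREFIX_DIR:<build_name>@; the allowed characters of a
# build name are an assumption made for this sketch.
PREFIX_DIR_RE = re.compile(r"@PREFIX_DIR:([A-Za-z0-9_.+-]+)@")


def substitute_prefix_dirs(arg, prefix_dirs):
    def lookup(match):
        name = match.group(1)
        if name not in prefix_dirs:
            raise KeyError(f"unknown build '{name}' in {match.group(0)}")
        return prefix_dirs[name]

    return PREFIX_DIR_RE.sub(lookup, arg)


# Hypothetical mapping and argument, for illustration only.
prefix_dirs = {"libfoo": "/work/builds/libfoo@main/install"}
arg = substitute_prefix_dirs("-DFOO_ROOT=@PREFIX_DIR:libfoo@", prefix_dirs)
```

Raising on unknown build names surfaces typos early instead of passing an unexpanded @-variable to the build system.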
Add a simex e compare command to compare experiments with each other. For example, simex e compare <base> would compare experiment <base> with all others (with one chunk of output per <other> experiment).
Useful info to display would be:
- [...] <base> but not by <other>.
- [...] <base> than by <other>.
- [...] <base> and <other>.
The Slurm launcher used to group jobs for the same experiment/revision/variation tuple (e.g. jobs that only differ in the instance) into an array job. This does not work anymore:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
100996 core bghkkn ~ grintena PD 0:00 1 (Resources)
100997 core bghkkn ~ grintena PD 0:00 1 (Priority)
100998 core bghkkn ~ grintena PD 0:00 1 (Priority)
100999 core bghkkn ~ grintena PD 0:00 1 (Priority)
101000 core bghkkn ~ grintena PD 0:00 1 (Priority)
101001 core bghkkn ~ grintena PD 0:00 1 (Priority)
101002 core bghkkn ~ grintena PD 0:00 1 (Priority)
101003 core bghkkn ~ grintena PD 0:00 1 (Priority)
101004 core bghkkn ~ grintena PD 0:00 1 (Priority)
101005 core bghkkn ~ grintena PD 0:00 1 (Priority)
101006 core bghkkn ~ grintena PD 0:00 1 (Priority)
101007 core bghkkn ~ grintena PD 0:00 1 (Priority)
101008 core bghkkn ~ grintena PD 0:00 1 (Priority)
101009 core bghkkn ~ grintena PD 0:00 1 (Priority)
Tested with commit 8d3bf03.
simexpal needs to decompress some instances, e.g., graphs originating from the KONECT repository. This should be optimized, e.g., by using an external tar implementation if the Python one turns out to be slow.
It would be nice to support slurm_args also for variations.
Context: I needed to pass the '--exclusive' parameter to Slurm only for a specific variant that used fewer than half of the cores of a node; without it, Slurm assigned 2 jobs to one node.
AFAICS simexpal variations have a name and several values, which need to be written manually, as well as extra_args. It would be nice to have cases where a variation has a specific type (i.e., int, bool) and a list (or even better a range) of values that the variation can have.
For example, at the moment, if I want to execute an experiment with 1, 2, ..., x threads I need to create a single variation thread1, thread2, ... for each number of threads. Wouldn't it be better to just have a variation number of threads that takes values in [1, 2, ..., x]?
This would simplify the writing of the experiments.yml file, and also the creation of an output dataframe (see #57).
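To illustrate the idea, a range could be expanded into the name / extra_args entries that are written by hand today. The function expand_range_variation and the --threads={} argument template are hypothetical; only the name and extra_args keys come from the existing variant format.

```python
def expand_range_variation(axis, first, last, arg_template):
    # Generate one variant per integer in [first, last]; each variant has
    # the `name` / `extra_args` shape that hand-written variants use today.
    return [
        {"name": f"{axis}{value}", "extra_args": [arg_template.format(value)]}
        for value in range(first, last + 1)
    ]


variants = expand_range_variation("thread", 1, 4, "--threads={}")
```

Keeping the integer value next to the generated name would also make it easy to emit a numeric column when exporting results.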
It would be nice to have a command such as simex run <build name> <binary> [args].... This would allow much easier testing of compiled binaries using simex develop. Right now the only way to test a binary is to define some test experiment, which is quite cumbersome if multiple command line variations need to be tested. Directly invoking the binary is also quite difficult, since simexpal sets up some environment variables the binary might depend on.
So, my proposal is to have a simex run command that allows you to run a binary with the correct environment variables on the command line.
For experiments.yml files with many experiments and/or instances, the most critical performance bottleneck for operations like simex e list is checking the status of experiments.
I see two ways to speed up this process:
- Scan directories (e.g., with os.scandir()) to check whether files exist. Right now, we check for the presence of individual files. However, when we expect many files to exist, it would be faster to read directories instead. Note that this is not an improvement for a single get_status call but only for bulk operations.
- Cache the status of experiments. We would still need to stat (os.stat()) the individual files and compare their modification timestamps to the timestamp of the cache to avoid races. We can use os.rename() to replace the cache atomically (after creating it as a temporary file first).
Specifying launchers on the command line can be error-prone. Add some functionality to determine the launcher from experiments.yml.
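For the two speed-ups proposed above for bulk status checks (directory scans and an atomically replaced cache), a rough sketch follows. The function names and the JSON cache format are assumptions; os.replace is used in place of os.rename because it also overwrites an existing file on Windows.

```python
import json
import os
import tempfile


def existing_files(directory):
    # One directory scan instead of one existence check per file.
    with os.scandir(directory) as entries:
        return {entry.name for entry in entries if entry.is_file()}


def write_cache_atomically(cache_path, statuses):
    # Write to a temporary file in the same directory, then rename it
    # over the old cache so readers never see a half-written file.
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(statuses, tmp)
        os.replace(tmp_path, cache_path)
    except BaseException:
        os.unlink(tmp_path)
        raise


def cache_is_fresh(cache_path, directory, names):
    # The cache is stale if any listed file was modified after the cache
    # was written. Ties count as fresh here; a stricter check could use `<`.
    cache_mtime = os.stat(cache_path).st_mtime_ns
    return all(
        os.stat(os.path.join(directory, name)).st_mtime_ns <= cache_mtime
        for name in names
    )
```

The temporary file has to live in the same directory (or at least on the same file system) as the cache, since a rename is only atomic within one file system.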
Currently simexpal supports the Konect and SNAP network repositories, which do not include many street networks. From Geofabrik it is possible to download geographic data of several countries and world areas that can be easily converted to (directed/weighted) networks (see here for an example). I think that adding Geofabrik to the supported repositories would be a very useful feature for simexpal.
Another source for undirected (weighted) networks is the 9th DIMACS implementation challenge.
Konect seems to be back online (http://konect.cc/networks/).
If simexpal cannot execute an experiment, e.g., because the program is missing, we should still create a .status file.
When starting experiments through the queue launcher, all statuses are set to submitted as they are added to the requests queue. When the queue launcher is closed prematurely (SIGKILL, CTRL + C, ...), the experiments remain submitted.
It would be better if the experiments' status were failed, similarly to #92 and #101.
As an implementation strategy, we could use the approach mentioned in #92, i.e., store some kind of job ID and query the queue launcher with it to verify the status.
It would be nice to add a progress bar for downloads. This is especially useful when downloading large files.
It would be nice to let simexpal set the OpenMP environment variable OMP_NUM_THREADS according to num_threads.
Since one sometimes wants to allocate a whole node for an experiment while using only some threads, it would also be nice if the allocated number of threads and OMP_NUM_THREADS could be set independently. Proposal: add a num_reserved_threads variable that simexpal uses over num_threads for allocation if the former is set (with a fallback to num_threads if it isn't).
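The proposed fallback could be sketched as follows. The helper names and the default of one thread when num_threads is missing are assumptions; num_threads, num_reserved_threads, and OMP_NUM_THREADS are the names used in the proposal.

```python
import os


def threads_for_run(settings):
    # Returns (threads to allocate, value for OMP_NUM_THREADS).
    num_threads = settings.get("num_threads", 1)
    # `num_reserved_threads` is the key proposed above; fall back to
    # `num_threads` when it is not set.
    reserved = settings.get("num_reserved_threads", num_threads)
    return reserved, num_threads


def environment_for_run(settings, base_env=None):
    # Copy the environment and set OMP_NUM_THREADS from `num_threads`.
    env = dict(os.environ if base_env is None else base_env)
    _, omp_threads = threads_for_run(settings)
    env["OMP_NUM_THREADS"] = str(omp_threads)
    return env
```

For example, {"num_threads": 4, "num_reserved_threads": 32} would reserve 32 threads on the node while the program itself runs with OMP_NUM_THREADS=4.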
It would be nice if large experiments.yml files could be broken up. We probably need some caching to make this efficient.
setup.py does not properly specify requests as a requirement. This is a bug.
On the other hand, the functionality of the requests module could also be replaced by the urllib.request module of the Python 3 standard library. Maybe we should prefer this option.
I noticed that just running simex e on an empty file takes 300ms on my (decently fast) laptop. This seems to be mainly due to the import of external packages: base.py imports instances.py and that imports requests, which takes quite a bit of time. Additionally, importing jsonschema is not free.
- Importing packages such as requests only when we need them should be easy.
When dealing with many variations, the full list of experiments printed by simex e might be too long to read. It would be nice to have a "compact view" of experiments that shows for each experiment the status of each variation, e.g.:
$ simex e --compact
experiment 1: 0/100 finished, 50/100 submitted, 50/100 started
experiment 2: ...
...
Does this sound reasonable?
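The aggregation behind such a compact view is small. The sketch below assumes the runs are available as (experiment name, status) pairs, which is a simplification made for illustration; the status names follow the example output above.

```python
from collections import Counter


def compact_view(runs):
    # `runs` is a list of (experiment_name, status) pairs; this flat shape
    # is an assumption made for the sketch.
    per_experiment = {}
    for experiment, status in runs:
        per_experiment.setdefault(experiment, Counter())[status] += 1

    lines = []
    for experiment, counts in per_experiment.items():
        total = sum(counts.values())
        parts = ", ".join(
            f"{counts[s]}/{total} {s}" for s in ("finished", "submitted", "started")
        )
        lines.append(f"{experiment}: {parts}")
    return lines


runs = [("experiment 1", "submitted")] * 2 + [("experiment 1", "started")] * 2
lines = compact_view(runs)
# lines == ["experiment 1: 0/4 finished, 2/4 submitted, 2/4 started"]
```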
Add unit tests and run them as a GitHub workflow.
Allow extra_args: in repo: blocks. For example, this would be useful to allow users to specify file formats on a per-repo basis.
simexpal currently creates lock files with 000 permissions; this is very annoying when moving directories (among other things) since mv complains that it cannot open these files. There is no reason not to create these files either as r--r--r-- or rw-rw-r-- instead.
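A sketch of lock creation with readable permissions, assuming the lock relies on exclusive creation via os.open (as the lock_run traceback elsewhere on this page suggests). The function name create_lock_file is hypothetical, and the effective mode is still subject to the process umask.

```python
import os


def create_lock_file(path, mode=0o664):
    # O_CREAT | O_EXCL makes creation fail if the lock already exists,
    # so the exclusivity does not depend on the permission bits.
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, mode)
    except FileExistsError:
        return False
    os.close(fd)
    return True
```

The function returns False when another process already holds the lock, and the resulting file can be read (and therefore moved or copied) by ordinary tools.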
As konect.cc is now only accessible via a paid subscription, we probably cannot offer the possibility to download konect instances anymore. Trying to download konect instances currently causes simexpal to crash with a ConnectionError.
We should add the possibility of fileless instances, as an instance might be defined through input parameters, e.g., algorithms that generate their data themselves and only need an input parameter like --seed=X.
This might be done by adding an args key to instances and a new at-variable @INSTANCE_ARGS@ that resolves to the respective instance args.
This would be useful for small helper programs that reside in the same repository as an experiments.yml file.
Support builds that do not use git. Note that since revisions need some kind of commit identifier, we can only support git-less builds for dev revisions.
Maybe it would be nice to add more run selection options based on the status. Currently it is only possible to select the statuses --failed and --unfinished (which consists of [NOT_SUBMITTED, IN_SUBMISSION, SUBMITTED, STARTED]).
When running a considerable number of experiments, the script crashed because of "too many open files". Probably, a call to close() is missing somewhere.
Full error output here:
File "/home/angriman/.local/bin/simex", line 7, in <module>
exec(compile(f.read(), __file__, 'exec'))
File "/home/angriman/projects/simexpal/scripts/simex", line 447, in <module>
do_main()
File "/home/angriman/projects/simexpal/scripts/simex", line 47, in do_main
do_experiments(args)
File "/home/angriman/projects/simexpal/scripts/simex", line 199, in do_experiments
do_experiments_launch(args)
File "/home/angriman/projects/simexpal/scripts/simex", line 285, in do_experiments_launch
submit_to_launcher(cfg, sel)
File "/home/angriman/projects/simexpal/scripts/simex", line 283, in submit_to_launcher
launcher.submit(config, run)
File "/home/angriman/projects/simexpal/simexpal/launch/fork.py", line 12, in submit
common.invoke_run(run)
File "/home/angriman/projects/simexpal/simexpal/launch/common.py", line 144, in invoke_run
stdout=stdout, stderr=subprocess.PIPE)
File "/usr/lib/python3.6/subprocess.py", line 709, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.6/subprocess.py", line 1234, in _execute_child
errpipe_read, errpipe_write = os.pipe()
OSError: [Errno 24] Too many open files
While working on issue #81, I noticed that experiments started by simexpal keep running even after simexpal is terminated (prematurely).
We should consider catching those cases to terminate all child processes spawned by simexpal.
If no launchers are specified, the command simex e launch warns that no launchers are specified, even when the --launch-through=fork flag is enabled. Would it be better if simex e launch did not complain about the launchers and launched the experiment on the local machine by default?
Simexpal complains about variants without extra_args even though they are perfectly reasonable (e.g., if they set environment variables):
simexpal: Validation error in experiments.yml at [variants][0][items][0]:
{'environ': 'OMP_NUM_THREADS':1, 'name': 't01'}
'extra_args' is a required property
The documentation on https://simexpal.readthedocs.io/ is inconsistent with the latest version of simexpal that is on PyPI. Fields like files for instances cause errors when running simex instances. It would be nice to either update the version on PyPI or link to a correct version of the documentation.
Non-exhaustive list of missing documentation:
CLI usage examples for...
- builds:
- develop:
  - [...] (<list_of_builds>)
  - [...] (--recheckout, --checkout, reregenerate, regenerate, ...)
- experiments:
- instances:
- queue:
Currently, simexpal downloads entire files to RAM (in instances.py) and afterwards writes them to disk. Rewrite this to iteratively write smaller chunks to disk, or use something like the sendfile() system call.
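A chunked download can be sketched with the standard library alone. This is an illustration using urllib.request rather than the requests-based code currently in instances.py; the function name and the 1 MiB default chunk size are assumptions.

```python
import shutil
import urllib.request


def download_to_file(url, destination, chunk_size=1024 * 1024):
    # Stream the response to disk in fixed-size chunks so that memory use
    # stays bounded by `chunk_size` regardless of the file size.
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
```

Writing to a temporary name first and renaming on success would additionally avoid leaving truncated instance files behind after an interrupted download.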
Right now, we can use Slurm to launch experiments but we cannot monitor and/or kill experiments through Slurm.
- In simex e, query the batch scheduler to determine whether jobs are still alive.
As an implementation strategy, we could store the job IDs of experiments in some file and use that to invoke squeue and scancel.
If the name of an instance contains a slash ("/"), the error message is not very helpful.
experiments.yml
instances:
  - repo: local
    items:
      - name: 'trees'
      - name: 'random-trees/random'
instdir: "data"
[...]
Error message:
Traceback (most recent call last):
File "/home/kulagins/anaconda3/bin/simex", line 7, in <module>
exec(compile(f.read(), __file__, 'exec'))
File "/home/kulagins/workspace/simexpal/simexpal/scripts/simex", line 971, in <module>
main_args.cmd(main_args)
File "/home/kulagins/workspace/simexpal/simexpal/scripts/simex", line 702, in do_experiments_launch
submit_to_launcher(cfg, launcher_runs)
File "/home/kulagins/workspace/simexpal/simexpal/scripts/simex", line 700, in submit_to_launcher
launcher.submit(cfg, run)
File "/home/kulagins/workspace/simexpal/simexpal/simexpal/launch/fork.py", line 6, in submit
if not common.lock_run(run):
File "/home/kulagins/workspace/simexpal/simexpal/simexpal/launch/common.py", line 27, in lock_run
lockfd = os.open(run.aux_file_path('lock'),
FileNotFoundError: [Errno 2] No such file or directory: '/home/kulagins/workspace/tree-scheduler-refactored/aux/CP-EX-CP/random_trees/20_children.lock'
It would be more helpful if the ".lock" file were not mentioned and an explanation of the characters that are not allowed in instance names were printed instead.
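An early check along these lines could produce a clearer message. It is a sketch only: the function name is made up, and it assumes that path separators are the characters to reject because instance names end up in file names such as the .lock file.

```python
import os


def check_instance_name(name):
    # Instance names end up in file names (e.g. "<instance>.lock"), so a
    # path separator would point into a directory that does not exist.
    forbidden = {os.sep, "/"}
    if os.altsep:
        forbidden.add(os.altsep)
    bad = sorted(ch for ch in forbidden if ch in name)
    if bad:
        raise ValueError(
            f"instance name {name!r} contains forbidden character(s): "
            + ", ".join(repr(ch) for ch in bad)
        )
    return name
```

Running this while experiments.yml is being validated would report the problem before any run is locked or launched.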