hu-macsy / simexpal

Simplifying Experimental Algorithmics

Home Page: https://simexpal.readthedocs.io

License: MIT License

Python 100.00%

algorithms benchmarking experiments research utility

simexpal's People

Contributors

Stargazers

Watchers

Forkers

angriman avdgrinten duylethanh andrechinazzo pokhrels tukl-msd cdeschryver ironps mart1nro fabratu charjon lorax66 c-bebop friedagerharz bernlu nizomovs

simexpal's Issues

Experiments to dataframe

Pandas dataframes are quite useful to inspect/analyze/aggregate experimental data. It would be nice to have a simexpal function that exports experiments as a Pandas dataframe where the columns are the instance, the experiment name, and variations, while rows are the single experiments.
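
A minimal sketch of the row layout described above, assuming each run can be reduced to an experiment name, an instance name, and a mapping from variation axis to the chosen variant. The `Run` class and `runs_to_rows` are hypothetical names used for illustration only; they are not part of simexpal.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    # Hypothetical stand-in for a single simexpal run.
    experiment: str
    instance: str
    variation: dict = field(default_factory=dict)  # axis name -> variant name


def runs_to_rows(runs):
    """Return one flat dict per run; each variation axis becomes a column."""
    rows = []
    for run in runs:
        row = {"experiment": run.experiment, "instance": run.instance}
        row.update(run.variation)
        rows.append(row)
    return rows


runs = [
    Run("bfs", "graph-a", {"threads": "t01"}),
    Run("bfs", "graph-b", {"threads": "t02"}),
]
rows = runs_to_rows(runs)
# With pandas installed, pandas.DataFrame(rows) gives one row per run.
```

A list of flat dicts keeps the sketch free of dependencies while matching the input that the `pandas.DataFrame` constructor accepts.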

Command to delete build directories

We should add a command like

simex builds clean [--all | --build <BUILD> --revision <REVISION>] -f

to delete the clone, compile and install directory of builds. Similarly for simex develop clean.

Make collect_successful_results() more pythonic

Instead of forcing the user to use callbacks, collect_successful_results() should provide an iterable interface.
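
One possible shape for such an interface is a generator. The sketch below assumes runs are plain dicts with a `status` and an `output` field, and that the caller supplies a parser; these field names and `iter_successful_results` are hypothetical and do not reflect simexpal's actual data model.

```python
def iter_successful_results(runs, parse_fn):
    """Yield (run, parsed_output) for every run that finished successfully.

    `runs` is assumed to be an iterable of dicts with the hypothetical
    keys 'status' and 'output'.
    """
    for run in runs:
        if run["status"] != "finished":
            continue  # skip failed or unfinished runs
        yield run, parse_fn(run["output"])
```

Callers could then write `for run, result in iter_successful_results(runs, parse): ...` instead of passing a callback.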

queue launcher does not create an .err file if an error occurs

As the title says, the queue launcher does not create an .err file if an error occurs in simexpal.
For example, using @INSTANCE@ in experiments with fileless instances raises RuntimeError: The instance 'foo' is fileless (assuming the fileless instance is called foo).

We should redirect such errors to an .err file in the aux/_queue/ directory just as simexpal does it for experiments launched with Slurm.

(Implement this while fixing #108)

Overhaul queue launcher, remove evloop.py

The queue launcher is currently based on a custom event loop that is used only in a single place. It needs to be revamped. The code can probably be simplified by using Python's selectors module directly. To actually run the experiments, it is probably easier to launch simex internal-invoke in a different process.

Remove evloop.py and use selectors directly instead (we already do that in launch/common.py).
Instead of calling into launch.common.invoke_run, use a subprocess that executes simex internal-invoke on a manifest file.
Integrate that subprocess into the event loop such that we can take additional requests from the socket while the process is running.
Add functionality to launch the queue daemon automatically. Details need to be discussed.

KONECT is publicly available again

It seems that the KONECT repository is publicly available again.
We should probably undo PR #86 and reopen Issue #12.

Thoughts?

QoL improvements for the queue launcher

It would be nice if the queue launcher could get some quality-of-life improvements:

Add a simex queue status command that asks the daemon whether it is running and/or busy or not.
Add a simex queue show command that lists all pending jobs (and maybe the number of completed jobs?).
Change the semantics of simex queue stop to only stop the queue after all pending jobs have been finished. Add a simex queue kill command to immediately kill the daemon.
Document how to use the queue launcher.

Autocomplete Documentation

It would be neat to tell our users that autocompletion already works via python-argcomplete.
Simply install argcomplete and put the following in your ~/.bashrc:

source ~/.bash_completion.d/python-argcomplete
eval "$(register-python-argcomplete simex)"

Extra_paths variable to add additional paths to the PATH environment variable

We should possibly add an extra_paths variable for the experiments entries in the experiments.yml to add further paths to the PATH environment variable, as noted in #36.

Broken "experiments launch" command

Launching experiments with variants yields the error:

AttributeError: 'dict' object has no attribute 'append'

which is triggered here:

simexpal/simexpal/launch/common.py

Line 173 in 0f656e7

variants_yml.append({

probably because variants_yml is a dict.

Also, if I drop the variants I get the error:

AttributeError: 'RunManifest' object has no attribute 'time'

triggered here:

simexpal/simexpal/launch/common.py

Line 280 in 0f656e7

if manifest.timeout is not None and elapsed > manifest.time:

Are those bugs or am I doing something wrong?

simexpal should send SIGXCPU, then SIGKILL on timeout

Right now, we only send SIGXCPU. However, a process can catch SIGXCPU and continue running. After some grace period (maybe 30s by default?), we should kill the process with SIGKILL to ensure that the time limit is respected.
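
The escalation could look roughly like the following sketch. It assumes a POSIX system and a child started with `subprocess.Popen`; `run_with_timeout` and its parameters are hypothetical names, not simexpal's actual launcher code.

```python
import signal
import subprocess


def run_with_timeout(argv, timeout, grace=30.0):
    """Run argv; on timeout send SIGXCPU, then SIGKILL after `grace` seconds.

    Returns the child's return code (negative if killed by a signal).
    """
    proc = subprocess.Popen(argv)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        pass

    # Polite request first: the process may catch SIGXCPU to clean up.
    proc.send_signal(signal.SIGXCPU)
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        # The process ignored SIGXCPU; SIGKILL cannot be caught.
        proc.kill()
        return proc.wait()
```

The grace period is a parameter so that it can default to the 30 seconds suggested above while remaining configurable.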

Manage experiments depending on variation

It would be nice to be able to manage experiments depending on their variations, e.g., I want to purge all experiments that ran with var1 set to x and var2 set to y with something like

simex experiments purge --var1=x --var2=y

Would that be possible?

Missing support of experiments reading from stdin

Right now, simexpal does not support experiments that read their input from stdin, e.g., experiments with the command cat <instance> | <executable> do not work. This is because we launch experiments via subprocess.Popen() with shell=False.

We could add that functionality by opening such instances as stdin or by offering a way to enable shell=True.
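
The first option can be sketched without a shell by passing the opened instance file as the child's stdin. `run_with_instance_on_stdin` is a hypothetical helper for illustration; how simexpal would decide which instances to feed through stdin is left open.

```python
import subprocess


def run_with_instance_on_stdin(argv, instance_path):
    """Run argv with the instance file connected to the child's stdin.

    Equivalent in effect to `<executable> < <instance>`, but without
    requiring shell=True.
    """
    with open(instance_path, "rb") as instance:
        return subprocess.run(
            argv, stdin=instance, stdout=subprocess.PIPE, check=True
        )
```

This avoids the quoting and injection pitfalls that come with shell=True.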

@-variable to discover the prefix-dir of other builds in the same revision

Right now there is no (supported) way to access the prefix-dirs of other builds during the build process. In some cases this is necessary to make a working build.

To achieve this, we could add a new @-variable like @PREFIX_DIR:<name_of_other_build>@ in the build arguments that substitutes the prefix-dir of another build in the same revision.
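
A rough sketch of the substitution, assuming the prefix directories of the other builds are already known as a name-to-path mapping. `substitute_prefix_dirs` and the `prefix_dirs` mapping are hypothetical and only illustrate the proposed syntax.

```python
import re

# Matches the proposed @PREFIX_DIR:<name_of_other_build>@ syntax.
PREFIX_DIR_RE = re.compile(r"@PREFIX_DIR:([^@]+)@")


def substitute_prefix_dirs(arg, prefix_dirs):
    """Replace every @PREFIX_DIR:<name>@ in arg with prefix_dirs[name]."""

    def replace(match):
        name = match.group(1)
        if name not in prefix_dirs:
            raise RuntimeError(f"Unknown build '{name}' in '{match.group(0)}'")
        return prefix_dirs[name]

    return PREFIX_DIR_RE.sub(replace, arg)
```

For example, `substitute_prefix_dirs("-DFOO_ROOT=@PREFIX_DIR:foo@", {"foo": "/opt/foo"})` would produce a build argument pointing at the other build's install location.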

cli: Add command to compare results between revisions / variants

Add a simex e compare command to compare experiments with each other. For example, simex e compare <base> would compare experiment <base> with all others (with one chunk of output per <other> experiment).

Useful info to display would be:

Instances (and # of instances) finished successfully by <base> but not by <other>.
Instances (and # of instances) finished faster by <base> than by <other>.
Instances where the failure condition differs between <base> and <other>.
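
The first two items reduce to simple set logic. The sketch below assumes each experiment's results are available as a dict mapping instance name to running time in seconds, with None marking a failed run; this representation and the `compare` function are hypothetical. Failure conditions are not modeled here.

```python
def compare(base, other):
    """Compare two result dicts (instance -> seconds, or None on failure)."""
    # Instances that succeeded in base but failed or are missing in other.
    only_base = sorted(
        inst for inst, t in base.items()
        if t is not None and other.get(inst) is None
    )
    # Instances that succeeded in both and were faster in base.
    faster_in_base = sorted(
        inst for inst, t in base.items()
        if t is not None and other.get(inst) is not None and t < other[inst]
    )
    return {"only_base": only_base, "faster_in_base": faster_in_base}
```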

Grouping of experiments by Slurm launcher is broken

The Slurm launcher used to group jobs for the same experiment/revision/variation tuple (e.g. jobs that only differ in the instance) into an array job. This does not work anymore:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            100996      core bghkkn ~ grintena PD       0:00      1 (Resources)
            100997      core bghkkn ~ grintena PD       0:00      1 (Priority)
            100998      core bghkkn ~ grintena PD       0:00      1 (Priority)
            100999      core bghkkn ~ grintena PD       0:00      1 (Priority)
            101000      core bghkkn ~ grintena PD       0:00      1 (Priority)
            101001      core bghkkn ~ grintena PD       0:00      1 (Priority)
            101002      core bghkkn ~ grintena PD       0:00      1 (Priority)
            101003      core bghkkn ~ grintena PD       0:00      1 (Priority)
            101004      core bghkkn ~ grintena PD       0:00      1 (Priority)
            101005      core bghkkn ~ grintena PD       0:00      1 (Priority)
            101006      core bghkkn ~ grintena PD       0:00      1 (Priority)
            101007      core bghkkn ~ grintena PD       0:00      1 (Priority)
            101008      core bghkkn ~ grintena PD       0:00      1 (Priority)
            101009      core bghkkn ~ grintena PD       0:00      1 (Priority)

Tested with commit 8d3bf03.

Untar and decompression of large instances is slow

simexpal needs to decompress some instances, e.g., graphs originating from the KONECT repository. This should be optimized, e.g., by using an external tar implementation if the Python implementation turns out to be slow.

slurm_args for variation

It would be nice to support slurm_args also for variations.

Context: I needed to pass the '--exclusive' parameter to Slurm only for a specific variant that used less than half of the cores of a node; without it, Slurm assigned two jobs to one node.

Formatting of download bar is bugged for really small instances

instances:
  - method: url
    url: 'https://raw.githubusercontent.com/hu-macsy/simexpal/master/simexpal/schemes/@INSTANCE_FILENAME@'
    items:
      - 'experiments.json'

leads to a misformatted download bar.

Variation types

AFAICS simexpal variations have a name and several values, which need to be written manually, as well as extra_args. It would be nice to have cases where a variation has a specific type (e.g., int, bool) and a list (or even better, a range) of values that the variation can take.
For example, at the moment, if I want to execute an experiment with 1, 2, ..., x threads, I need to create a separate variant thread1, thread2, ... for each number of threads. Wouldn't it be better to just have a variation "number of threads" that takes values in [1, 2, ..., x]?
This would simplify the writing of the experiments.yml file, and also the creation of an output dataframe (see #57).
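
As a sketch of what such a typed variation could expand to, the helper below generates the variant entries that currently have to be written by hand. It assumes variants are dicts with `name` and `extra_args` keys; `expand_int_variation` and its argument template are hypothetical.

```python
def expand_int_variation(name, start, stop, arg_template):
    """Expand an integer range into one variant dict per value.

    `stop` is inclusive, matching the [1, 2, ..., x] notation above.
    """
    return [
        {"name": f"{name}{value}", "extra_args": [arg_template.format(value)]}
        for value in range(start, stop + 1)
    ]
```

For instance, `expand_int_variation("thread", 1, 3, "--threads={}")` would stand in for three manually written variants thread1, thread2, and thread3.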

Implement command-line local execution

It would be nice to have a command such as simex run <build name> <binary> [args].... This would allow much easier testing of compiled binaries using simex develop. Right now the only way to test a binary is to define some test experiment, which is quite cumbersome if multiple command line variations need to be tested. Directly invoking the binary is also quite difficult, since simexpal sets up some environment variables the binary might depend on.

So, my proposal is to have a simex run command that allows you to run a binary with the correct environment variables on the command line.

Faster bulk check of experiment status

For experiments.yml files with many experiments and/or instances, the most critical performance bottleneck for operations like simex e list is checking the status of experiments.

I see two ways to speed up this process:

Read directories (os.scandir()) to check whether files exist. Right now, we check for the presence of individual files. However, when we expect many files to exist, it would be faster to read directories instead. Note that this is not an improvement for a single get_status call but only for bulk operations.
Caching. Cache the state of experiments in some extra file and only re-read the status files if the cache is out-of-date. We probably still need to stat (os.stat()) the individual files and compare their modification timestamps to the timestamp of the cache to avoid races. We can use os.rename() to replace the cache atomically (after creating it as a temporary file first).
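
Both ideas can be sketched with the standard library. The example assumes status files end in `.status` and that the cache is a JSON file; the function names and the cache format are hypothetical. It uses `os.replace()` rather than `os.rename()` because the former also overwrites an existing target on Windows.

```python
import json
import os
import tempfile


def list_status_files(aux_dir):
    """Return the names of all '.status' files using one directory read."""
    with os.scandir(aux_dir) as entries:
        return {
            e.name for e in entries
            if e.is_file() and e.name.endswith(".status")
        }


def write_cache_atomically(cache_path, data):
    """Write data as JSON to a temporary file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(cache_path))
    # The temporary file must live on the same filesystem as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(data, tmp)
        os.replace(tmp_path, cache_path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Readers of the cache then see either the old or the new file, never a partially written one.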

Allow specification of machine environment (e.g., launcher) in experiments.yml

Specifying launchers on the command line can be error-prone. Add some functionality to determine the launcher from experiments.yml.

New repositories: Geofabrik and DIMACS

Currently simexpal supports the Konect and SNAP network repositories, which do not include many street networks. From Geofabrik it is possible to download geographic data of several countries and world areas that can be easily converted to (directed/weighted) networks (see here for an example). I think that adding Geofabrik to the supported repositories would be a very useful feature for simexpal.

Another source for undirected (weighted) networks is the 9th DIMACS implementation challenge.

Enable downloads from Konect

Konect seems to be back online (http://konect.cc/networks/).

Handle execve() errors gracefully

If simexpal cannot execute an experiment, e.g., because the program is missing, we should still create a .status file.
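
A minimal sketch of the idea: `subprocess.Popen` raises an `OSError` subclass (for example `FileNotFoundError`) when the program cannot be executed, so the launcher can catch it and record the failure. The `launch` helper and the content written to the status file are hypothetical; simexpal's real status format is not shown here.

```python
import subprocess


def launch(argv, status_path):
    """Start argv; if it cannot be executed, record that in status_path.

    Returns the Popen object, or None if execution failed.
    """
    try:
        return subprocess.Popen(argv)
    except OSError as e:
        # Hypothetical status format, for illustration only.
        with open(status_path, "w") as status:
            status.write(f"failed to execute: {e.strerror}\n")
        return None
```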

Misleading status of experiments when queue launcher is closed prematurely

When starting experiments through the queue launcher all statuses are set to submitted as they are added into the requests queue. When the queue launcher is closed prematurely (SIGKILL, CTRL + C, ...) the experiments remain submitted.

It would be better if the experiments' status were set to failed, similarly to #92 and #101.
As an implementation strategy, we could use the approach mentioned in #92, i.e., store some kind of job ID and query the queue launcher with it to verify the status.

Add progress bar for downloads

It would be nice to add a progress bar for downloads. This is especially useful when downloading large files.

Explicit support for OpenMP

It would be nice to let simexpal set the OpenMP environment variable OMP_NUM_THREADS according to num_threads.

Since one sometimes wants to allocate a whole node for an experiment, while using only some threads, it would also be nice if the allocated number of threads and OMP_NUM_THREADS could be set independently. Proposal: add a num_reserved_threads variable that simexpal uses over num_threads for allocation if the former is set (with a fallback to num_threads if it isn't).
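
The proposed fallback can be sketched as follows. `build_environ` and `threads_to_allocate` are hypothetical helper names; only `OMP_NUM_THREADS`, `num_threads`, and `num_reserved_threads` come from the proposal itself.

```python
import os


def build_environ(num_threads, base=None):
    """Return a copy of the environment with OMP_NUM_THREADS set."""
    env = dict(os.environ if base is None else base)
    env["OMP_NUM_THREADS"] = str(num_threads)
    return env


def threads_to_allocate(num_threads, num_reserved_threads=None):
    """Prefer num_reserved_threads for allocation; fall back to num_threads."""
    if num_reserved_threads is not None:
        return num_reserved_threads
    return num_threads
```

The returned mapping could be passed as the `env` argument of `subprocess.Popen`.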

Allow breaking up experiments.yml into multiple files

It would be nice if large experiments.yml files could be broken up. We probably need some caching to make this efficient.

Dependency on Python package 'requests' is not set up correctly

setup.py does not properly specify requests as a requirement. This is a bug.

On the other hand, the functionality of the requests module could also be replaced by the urllib.request module of the Python 3 standard library. Maybe we should prefer this option.

Performance of no-op simex runs

I noticed that just running simex e on an empty file takes 300 ms on my (decently fast) laptop. This seems to be mainly due to the import of external packages: base.py imports instances.py, which in turn imports requests, and that takes quite a bit of time. Additionally, importing jsonschema is not free.

Importing instances.py and requests only when we need them should be easy.
For jsonschema, maybe we can cache validation results (however, that would also require us to cache validation errors or, alternatively, to disable the caching when there are validation errors).

Compact experiments list

When dealing with many variations the full list of experiments printed by simex e might be too long to visualize. It would be nice to have a "compact view" of experiments that shows for each experiment the status of each variation e.g.:

$ simex e --compact
experiment 1: 0/100 finished, 50/100 submitted, 50/100 started
experiment 2: ...
...

Does this sound reasonable?

Add testing and CI

Add unit tests and run them as a GitHub workflow.

Allow extra_args for instance repositories

Allow extra_args: in repo: blocks. For example, this would be useful to allow users to specify file formats on a per-repo basis.

.lock files have 000 permissions

simexpal currently creates lock files with 000 permissions; this is very annoying when moving directories (among other things) since mv complains that it cannot open these files. There is no reason not to create these files either as r--r--r-- or rw-rw-r-- instead.
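
For illustration, an exclusive lock file can be created with an explicit mode via `os.open`. The sketch assumes a POSIX system; `create_lock` is a hypothetical helper, not simexpal's existing lock_run function. The effective permissions are the requested mode with the process umask applied.

```python
import os


def create_lock(path):
    """Create the lock file exclusively; return False if it already exists."""
    try:
        # 0o644 requests rw-r--r-- (further restricted by the umask).
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True
```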

Download of konect instances no longer possible.

As konect.cc is now only accessible via a paid subscription, we probably cannot offer the possibility to download konect instances anymore. Trying to download konect instances currently causes simexpal to crash with a ConnectionError.

Fileless Instances

We should add the possibility of fileless instances, as an instance might be defined through input parameters, e.g., for algorithms that generate their data themselves and only need an input parameter like --seed=X.

This might be done by adding an args key to instances and a new at-variable @INSTANCE_ARGS@ that resolves to the respective instance args.

builds: Allow (dev-)builds from subdirectories of the repository containing experiments.yml

This would be useful for small helper programs that reside in the same repository as an experiments.yml file.

Support builds without VCS

Support builds that do not use git. Note that since revisions need some kind of commit identifier, we can only support git-less builds for dev revisions.

More (granular) run selection options based on status

Maybe it would be nice to add more run selection options based on the status. Currently, it is only possible to select by status with --failed and --unfinished (which consists of [NOT_SUBMITTED, IN_SUBMISSION, SUBMITTED, STARTED]).

OSError: [Errno 24] Too many open files

When running a considerable amount of experiments, the script crashed because of "too many open files". Probably, a call to close() is missing somewhere.

Full error output here:

  File "/home/angriman/.local/bin/simex", line 7, in <module>                          
    exec(compile(f.read(), __file__, 'exec')) 
  File "/home/angriman/projects/simexpal/scripts/simex", line 447, in <module>
    do_main()                                                                                       
  File "/home/angriman/projects/simexpal/scripts/simex", line 47, in do_main
    do_experiments(args)                                                                               
  File "/home/angriman/projects/simexpal/scripts/simex", line 199, in do_experiments 
    do_experiments_launch(args)                                                                        
  File "/home/angriman/projects/simexpal/scripts/simex", line 285, in do_experiments_launch 
    submit_to_launcher(cfg, sel)                                                                    
  File "/home/angriman/projects/simexpal/scripts/simex", line 283, in submit_to_launcher               
    launcher.submit(config, run)                                                                    
  File "/home/angriman/projects/simexpal/simexpal/launch/fork.py", line 12, in submit                  
    common.invoke_run(run)                                                                             
  File "/home/angriman/projects/simexpal/simexpal/launch/common.py", line 144, in invoke_run
    stdout=stdout, stderr=subprocess.PIPE)                                                   
  File "/usr/lib/python3.6/subprocess.py", line 709, in __init__                       
    restore_signals, start_new_session)                                                           
  File "/usr/lib/python3.6/subprocess.py", line 1234, in _execute_child        
    errpipe_read, errpipe_write = os.pipe()                                                      
OSError: [Errno 24] Too many open files

Child processes of simexpal keep running even when simexpal is terminated

While working on issue #81, I noticed that experiments started by simexpal keep running even after simexpal is terminated (prematurely).

We should consider catching those cases to terminate all child processes spawned by simexpal.

Launch experiments on local machine

If no launchers are specified, the command simex e launch warns that no launchers are specified, even when the --launch-through=fork flag is given. Would it be better if simex e launch did not complain about the launchers and launched the experiment on the local machine by default?

Allow variants without extra_args

Simexpal complains about variants without extra_args even though they are perfectly reasonable (e.g., if they set environment variables):

simexpal: Validation error in experiments.yml at [variants][0][items][0]:
{'environ': 'OMP_NUM_THREADS':1, 'name': 't01'}
'extra_args' is a required property

Documentation inconsistent with latest version on PyPI

The documentation on https://simexpal.readthedocs.io/ is inconsistent with the latest version of simexpal that is on PyPI. Fields like files for instances cause errors when running simex instances. It would be nice to either update the version on PyPI or link to a correct version of the documentation.

Simexpal Documentation

Non-exhaustive list of missing documentation:

CLI usage examples for...

archive

builds:

make
purge
remake

develop:

positional arguments (<list_of_builds>)
optional arguments (--recheckout, --checkout, reregenerate, regenerate,...)

experiments:

instances:

list
install
process
run-transform

queue:

Download of large instances needs optimization

Currently, simexpal downloads entire files to RAM (in instances.py) and afterwards writes them to disk. Rewrite this to iteratively write smaller chunks to disk, or use something like the sendfile() system call.
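
A chunked download can be sketched with the standard library alone, here using `urllib.request` and `shutil.copyfileobj`. The `download` function and its default chunk size are assumptions for illustration, not the code in instances.py.

```python
import shutil
import urllib.request


def download(url, dest_path, chunk_size=1024 * 1024):
    """Stream url to dest_path in chunks instead of buffering it in RAM."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        # copyfileobj reads and writes at most chunk_size bytes at a time.
        shutil.copyfileobj(response, out, chunk_size)
```

Memory use then stays bounded by the chunk size regardless of the instance size.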

Tighter integration with batch schedulers

Right now, we can use Slurm to launch experiments but we cannot monitor and/or kill experiments through Slurm.

In simex e, query the batch scheduler to determine whether jobs are still alive.
Add a command to kill currently running jobs.

As an implementation strategy, we could store the job IDs of experiments in some file and use that to invoke squeue and scancel.

Improve error message for slashes in instance names

If an instance name contains a slash ("/"), the error message is not very helpful.

Example:

experiments.yml

instances:
  - repo: local
    items:
     - name: 'trees'
     - name: 'random-trees/random'
instdir: "data"

[...]

Error message:

Traceback (most recent call last):
  File "/home/kulagins/anaconda3/bin/simex", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/home/kulagins/workspace/simexpal/simexpal/scripts/simex", line 971, in <module>
    main_args.cmd(main_args)
  File "/home/kulagins/workspace/simexpal/simexpal/scripts/simex", line 702, in do_experiments_launch
    submit_to_launcher(cfg, launcher_runs)
  File "/home/kulagins/workspace/simexpal/simexpal/scripts/simex", line 700, in submit_to_launcher
    launcher.submit(cfg, run)
  File "/home/kulagins/workspace/simexpal/simexpal/simexpal/launch/fork.py", line 6, in submit
    if not common.lock_run(run):
  File "/home/kulagins/workspace/simexpal/simexpal/simexpal/launch/common.py", line 27, in lock_run
    lockfd = os.open(run.aux_file_path('lock'),
FileNotFoundError: [Errno 2] No such file or directory: '/home/kulagins/workspace/tree-scheduler-refactored/aux/CP-EX-CP/random_trees/20_children.lock'

It would be more helpful if the error did not mention the ".lock" file and instead explained which characters are not allowed in instance names.
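
An early check could produce such a message before any file is touched. The sketch assumes the slash is the only forbidden character; `check_instance_name` and the wording of the message are hypothetical.

```python
def check_instance_name(name):
    """Raise a descriptive error if an instance name contains a slash."""
    if "/" in name:
        raise ValueError(
            f"Invalid instance name '{name}': "
            "instance names must not contain '/'"
        )
    return name
```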