materialsproject / fireworks

The Fireworks Workflow Management Repo.

Home Page: https://materialsproject.github.io/fireworks

License: Other

Python 45.68% CSS 10.76% HTML 1.23% Shell 0.25% JavaScript 36.01% Makefile 0.38% Less 2.83% SCSS 2.87%

fireworks's People

Contributors: ardunn, aykol, computron, dangunter, dependabot-preview[bot], dotsdl, dwinston, dyllamt, gpetretto, ikondov, jakirkham, janosh, jmmshn, jotelha, lastephey, matk86, mbkumar, mkhorton, montoyjh, richardjgowers, shyamd, shyuep, sivonxay, tirkarthi, tschaume, utf, wscullin, xhqu1981, zhubonan, zulissi


fireworks's Issues

display_wflows (previously called get_links()) is weird

Try the following:

  1. Start with clean FW database
  2. Add the workflow in fw_tutorials/workflow/org_wf.yaml
  3. Type "display_wflows -i 1". The output is strange.

Note: I renamed this method from "get_links" recently and made some changes due to weird output in the original function. You can revert to a version of FW from before today if you want to see what the function was doing before my changes.

Frequent spurious failures from `test_rerun_timed_fws`

I have seen this test fail often and suspect there is something wrong with the test itself (particularly as it is a timed test). Here is an example of it failing when I had only made changes to docs and docstrings ( https://travis-ci.org/materialsproject/fireworks/builds/60772474 ).

======================================================================
FAIL: test_rerun_timed_fws (fireworks.core.tests.test_launchpad.WorkflowFireworkStatesTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/materialsproject/fireworks/fireworks/core/tests/test_launchpad.py", line 710, in test_rerun_timed_fws
    self.assertEqual(fw_state, fw_cache_state)
nose.proxy.AssertionError: 'RUNNING' != 'READY'
- RUNNING
+ READY

Mechanism for using config directory for python defaults?

Is there a mechanism for automatically using the configurations one might have in ~/.fireworks when doing things like

import fireworks

launchpad = fireworks.LaunchPad()

instead of having to put in the hostname, database name, username, and password for a non-local MongoDB? I know that using lpad from the command line automatically uses these configurations, but if there's a reasonable way of telling fireworks.LaunchPad to do the same thing, that would be awesome.
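A minimal sketch of one existing route, assuming a standard config setup: LaunchPad.auto_load() reads my_launchpad.yaml from the configured location, i.e. the same files the lpad command uses.

from fireworks import LaunchPad

# auto_load() picks up my_launchpad.yaml from the configured directory,
# so no hostname or credentials need to appear in code
launchpad = LaunchPad.auto_load()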

Storing `launcher_dir`

It would be really helpful to have access to the launcher_dir used, so that if data is stored in that directory it can be explored programmatically.

Resource-context settings

This is a major feature request. After struggling to make a particular workflow work across multiple resources, I am of the view that fireworks badly needs a resource-context settings module. The principle that resources are to be kept separate from tasks is a good concept in theory, but in practice it is somewhat difficult. For example, a particular command may have alternatives or completely different names on different resources. Or some resources may require special settings (you are allowed to use 1 process on one resource, but 8 on another). matlab can be named "matlab" on one resource and "matlab2009" on another. Yes, people can write if-else statements to cope with these, but it rapidly becomes unmanageable. E.g., if I need to use mpirun on one resource and aprun on another, I basically have to populate if-else statements in all FireTasks that use mpi/aprun. Not to mention that this is extremely unscalable: if I add a new resource, I have to remember to modify every single if-else statement somewhere in my potentially thousands of FireTasks.

My proposal is to encapsulate resources settings. For example, something basic would be like

class ResourceA(dict):

    def __init__(self):
        self["matlab"] = "matlab2009"
        self["vasp"] = ["mpirun", "-np", "8", "vasp"]

    def at_resource(self):
        # Some test for being at the resource, e.g. a hostname check.
        raise NotImplementedError

This will be supported by a decorator for FireTasks. E.g.,

@supported_resources([ResourceA])
class FireTaskA(FireTaskBase):
    # __init__ needs to check that it is being run at a supported resource.
    # It will then set some default _fw_vars based on the resource.
    # Within FireTaskA, resource-specific settings can then be looked up as:
    #     matlab_cmd = self["_fw_vars"]["matlab"]
    ...

Advantages

  1. If I need to ever add new resources, I simply create a new class ResourceB to specify the commands I need.
  2. No more ugly, repetitive if-else statements for every single FireTask that has a resource-context-specific command.

Note: This is somewhat inspired by Fabric's env settings module. I am not saying it is the best implementation out there, and certainly the above is just a skeleton of how such a system would work. The actual implementation of course needs more strategic thinking.

subprocess stdout and stderr bytes handling in python 3.*

The subprocess module in Python 3 returns a byte sequence for stdout and stderr. The problem is that the functions _parse_jobid and _parse_njobs in the common_adapter.py module treat the output as a str, because that is the standard for Python 2; on Python 3 these functions fail when they receive bytes as input:

 return len(output_str.split('\n'))-1
TypeError: a bytes-like object is required, not 'str'
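A minimal sketch of the usual fix, assuming UTF-8 output from the queue command: decode the bytes before any string operations, so the same code runs on Python 2 and 3.

import subprocess

p = subprocess.Popen(["qstat"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output_str = p.stdout.read()
if isinstance(output_str, bytes):
    # Python 3 gives bytes; decode so split() and friends work
    output_str = output_str.decode("utf-8")
njobs = len(output_str.split("\n")) - 1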

Launches

For the same firework id, there may be many different launches. How can one trace the throwaway results in the workflow through the topological graph?

Provide a `kwargs` option for `PyTask`

In some cases, we may not have the arguments in order, or positional arguments may not be possible at all. In these cases, it would be nice to have a kwargs argument that allows a dictionary to be passed through and unpacked on the other side.
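A hypothetical sketch of the requested interface (the kwargs parameter is the proposal here, not necessarily the released API at the time):

from fireworks.user_objects.firetasks.script_task import PyTask

# args are passed positionally; kwargs would be unpacked as **kwargs
# when the target function is called on the other side
task = PyTask(func="json.dumps",
              args=[{"a": 1}],
              kwargs={"indent": 2, "sort_keys": True})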

FireWorks de-serializes even PyTask input objects that are themselves JSON-serialized

Consider this situation:

a user is using the PyTask to carry out an operation and, as a part of that, transfers a python object that has been serialized in JSON format (with json or jsonpickle) as an input to that PyTask. Then (it seems, based on my own testing), upon starting the job, FireWorks will also de-serialize the input object to the PyTask, resulting in errors. Perhaps FireWorks should make an effort not to alter the format of inputs to PyTasks?

Actually, after looking at my problem more, it doesn't seem like this is correct, and I am likely doing something else wrong.

Need queue ids

I have a serious need for one particular feature.
As far as I can see, there is no way for me to figure out the correspondence between a particular firework and its job id on the queue. There is a really roundabout way of looking at launch_dirs and doing qstat -f to look at the directories, but it really shouldn't be this hard.

Knowing the specific queue id is very useful. For example, I just had two FW randomly die on me on the queue and I want to check what FW they correspond to so that I can restart them. The problem is that all the FWs are flagged as RUNNING, even though some of them have died.

Looking at the code, it seems that the reservation id is set only in reserve mode. Is there any reason not to do it for both reserve and non-reserve mode?

P.S. Calling the private _set_reservation_id is probably not a good idea. If that is supposed to be called outside of the Launchpad module, you'd probably want to make that a public method.

Functionality to pause workflows and better doc for states

I need a way to pause a workflow. For example, my workflows run for weeks at a time. But say I encounter a scheduled maintenance on a resource in the middle of a week. I need a way to tell the workflow to pause after the current FW is done, i.e. any new additions are automatically set to a state called PAUSED until rerun_fws is called.

I am trying to see if current fireworks already allows this, but I am a bit unclear on the meanings behind all the states. Some are obvious (e.g., COMPLETED and RUNNING), but what's the difference between WAITING and READY? A glossary of all these states should be in the docs (and the code).

Performance bug with update on large workflows in rlaunch rapidfire

Running "rlaunch rapidfire" on a single node uncovered a long delay (8-10 sec.) between each task, for a large workflow of 10,000 items (5,000 sequences of 2 items). The delay had 2 sources: (1) a hostname lookup, which wasn't being cached -- this was immediately fixed, and accounted for ~5 sec. of the delay (2) a performance problem with the update after the job was launched.

To reproduce, use the script from this gist https://gist.github.com/dangunter/9939755 as build_wf.py and run

mkdir abcd
python build_wf.py --output abcd --type sequence --tasks 10000
lpad add abcd/fw_sequence_10000.yaml
rlaunch rapidfire

and note the pause between tasks.

Possible Bug using Offline Mode

Using rlaunch singleshot --offline from a qlaunch -r singleshot yields the following error:

Traceback (most recent call last):
  File "${HOME}/.local/lib/python2.7/site-packages/fireworks/core/rocket.py", line 172, in run
    lp.log_message(logging.INFO, "Task started: %s." % t.fw_name)
AttributeError: 'NoneType' object has no attribute 'log_message'

From lines 69-70 of fireworks/scripts/rlaunch_run.py:

if args.command == 'singleshot' and args.offline:
    launchpad = None

This makes sense; it should be reading from the FW.json file. However, on lines 172 and 193 of fireworks/core/rocket.py, it looks for a launchpad anyway for logging, which raises the error:

lp.log_message(logging.INFO, "Task started: %s." % t.fw_name)

I was able to correct this by adding "if lp:" to the logging calls, but I don't know how to write a .patch file to submit the correction to the developers.
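For reference, a minimal sketch of that guard, assuming lp may be None in offline mode:

# in fireworks/core/rocket.py, wrap each logging call:
if lp:
    lp.log_message(logging.INFO, "Task started: %s." % t.fw_name)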

Absolute paths in SOURCES.txt

There are absolute paths in SOURCES.txt that cause pip install to fail:

/Users/ajain/Documents/code_matgen/fireworks/scripts/lpad
/Users/ajain/Documents/code_matgen/fireworks/scripts/mlaunch
/Users/ajain/Documents/code_matgen/fireworks/scripts/qlaunch
/Users/ajain/Documents/code_matgen/fireworks/scripts/rlaunch

If you change them to relative paths (i.e. remove the string /Users/ajain/Documents/code_matgen/ from the lines), it installs fine.

Arch Linux PKGBUILDs

Hi, I created some PKGBUILDs

https://gist.github.com/rmorgans/76c8285832744ec66325

All tests pass with these versions, some newer than those in requirements.txt in the git repo:
python2 2.7.9-1
python2-yaml 3.11-2
python2-pymongo 2.8-3
python2-jinja 2.7.3-1
python2-six 1.9.0-1
python2-monty 0.6.4-1
python2-dateutil 2.4.1-1

Hope these are useful.

Command line options need to be more consistent.

In general, I find the command line option naming very inconsistent.

For example, a lot of the firework- or workflow-specific commands have _fws or _wfs suffixes, but reignite and archive don't, and neither does get_qid.

In general, I think the list of commands is getting unwieldy (not to mention that the lpad_run.py script is getting super long). I think a reorg might be in order, e.g., splitting up the admin stuff, the fireworks stuff, and the workflow stuff into three separate scripts.

FW state prematurely set

When running in reservation mode, even if the q submit script fails for whatever reason (e.g. a bad qadapter), the FW is still set to RESERVED. And correcting the qadapter and redoing qlaunch does not result in the FW running.

This really should not be how it works. The state should be set to RESERVED only upon success of the queue submission.

FW should track rerun history

Right now, FW does not track rerun history. If a particular FW has fizzled and that fw is rerun with lpad rerun_fws, the launch history of the past fizzled attempts is not stored. This should be tracked so that people can easily debug errors (e.g., by checking whether the failures are consistent across rerun attempts).

Accessing WF details from run_task

I need detailed information about the whole workflow during the execution of a Task. In my case this is limited to simple tasks that clean up or store data from all the FWs launched during the WF, but I think it could be generally useful to have access to detailed information about the workflow at the run_task level.
At the moment, it is possible to do this by reading the FW id from the FW.json file and instantiating a LaunchPad with the autoload function to call the DB. However, there is a potential flaw in this approach, since the launchpad can be passed as an option to the q/m/rlaunch commands and there is no way to know that inside the Task.
There could probably be some workaround, but I would prefer to add a cleaner way of accessing those data. I can think of two different ways of doing that and would like feedback.

  1. One way would be to make an instance of the launchpad available, along with the current FW and launch ids. This would be enough to extract all sorts of information from the database, which can be accessed lazily. This gives the user control to get exactly what they need, but it will not be available in offline mode.
  2. Alternatively, it would be possible to get the information about the workflow when starting the FW and copy it into the fw_spec. This option could be turned on at the FW level with a keyword in the spec or with a general configuration parameter. The keyword could be used to specify the level of detail of the information that will be put in the spec (i.e. just the data of the workflow, or all the details of the FWs, or even all the launches).

Thanks

error in "Launch a single job through a queue" tutorial

I get an error in following command:

$ qlaunch singleshot
2016-09-14 16:03:28,156 INFO moving to launch_dir /home/python/queue_tests
2016-09-14 16:03:28,162 INFO submitting queue script
2016-09-14 16:03:28,246 ERROR ----|vvv|----
2016-09-14 16:03:28,246 ERROR Could not parse job id following sbatch due to error a bytes-like object is required, not 'str'...
2016-09-14 16:03:28,249 ERROR Traceback (most recent call last):
  File "/home/python/sf_box/fireworks/fireworks/user_objects/queue_adapters/common_adapter.py", line 200, in submit_to_queue
    job_id = self._parse_jobid(p.stdout.read())
  File "/home/python/sf_box/fireworks/fireworks/user_objects/queue_adapters/common_adapter.py", line 71, in _parse_jobid
    for l in output_str.split("\n"):
TypeError: a bytes-like object is required, not 'str'

2016-09-14 16:03:28,250 ERROR ----|^^^|----
2016-09-14 16:03:28,250 ERROR ----|vvv|----
2016-09-14 16:03:28,250 ERROR Error writing/submitting queue script!
2016-09-14 16:03:28,256 ERROR Traceback (most recent call last):
  File "/home/python/sf_box/fireworks/fireworks/queue/queue_launcher.py", line 133, in launch_rocket_to_queue
    launchpad.set_reservation_id(launch_id, reservation_id)
  File "/home/python/sf_install/python/3.5.2/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/python/sf_install/python/3.5.2/lib/python3.5/site-packages/monty-0.9.1-py3.5.egg/monty/os/__init__.py", line 32, in cd
    yield
  File "/home/python/sf_box/fireworks/fireworks/queue/queue_launcher.py", line 130, in launch_rocket_to_queue
    raise RuntimeError('queue script could not be submitted, check queue '
RuntimeError: queue script could not be submitted, check queue script/queue adapter/queue server status!

2016-09-14 16:03:28,256 ERROR ----|^^^|----

But I get the correct result:

$ls
FW_job-2315017.error  FW.json           fw_test.yaml  logging          my_launchpad.yaml
FW_job-2315017.out    FW_submit.script  howdy.txt     my_fworker.yaml  my_qadapter.yaml

$ cat howdy.txt
howdy, your job launched successfully!

$ cat FW_job-2315017.out 
2016-09-14 16:03:23,198 INFO Hostname/IP lookup (this will take a few seconds)
2016-09-14 16:03:23,201 INFO Launching Rocket
2016-09-14 16:03:24,258 INFO RUNNING fw_id: 1 in directory: /home/queue_tests
2016-09-14 16:03:24,265 INFO Task started: ScriptTask.
2016-09-14 16:03:24,283 INFO Task completed: ScriptTask 
2016-09-14 16:03:24,590 INFO Rocket finished

$ cat my_qadapter.yaml 
_fw_name: CommonAdapter
_fw_q_type: SLURM
rocket_launch: rlaunch -w /home/python/queue_tests/my_fworker.yaml -l /home/python/queue_tests/my_launchpad.yaml singleshot
ntasks: 1
cpus_per_task: 1
ntasks_per_node: 1
walltime: '00:02:00'
queue: null
account: null
job_name: null
logdir: /home/python/queue_tests/logging
pre_rocket: null
post_rocket: null

#You can override commands by uncommenting and changing the following lines:
#_q_commands_override:
#submit_cmd: my_qsubmit
#status_cmd: my_qstatus

#You can also supply your own template by uncommenting and changing the following line:
#template_file: /full/path/to/template

Am I making a mistake somewhere?

ImportError: No module named flask.ext.paginate

I have gone through the quick start and had a couple of problems I thought I would pass along. The first one is solved, but the second I am still confused about.

  1. I installed mongodb via yum on CentOS 7 without trouble, but I was unable to start the database until I realized that the default location of the database is in /data/db (at least on CentOS 7). Making the directory and giving the appropriate permissions fixed this problem and the rest of the "five minute quick start" went forward without issue. It might be a useful thing to mention the permission issue in the tutorial.
  2. I was unable to start the webgui, presumably due to some missing python libraries. When I attempt to start the webgui, I encounter the error: "ImportError: No module named flask.ext.paginate". I have installed flask via python as well as flask-mongoengine, but the webgui still doesn't start. Are there other libraries that need to be installed as well?

Improved serialization of arrays

At present, serialization of NumPy arrays proceeds through conversion to strings. As a way to serialize objects that have no other strategy, this is an OK fallback; however, it is far from optimal for NumPy arrays. A much better strategy might be to cannibalize mongowrapper (https://github.com/dattalab/mongowrapper).
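A minimal sketch of a lossless alternative, assuming arrays are stored as raw bytes plus dtype/shape metadata in the document (the function names are illustrative):

import numpy as np

def encode_array(arr):
    # keep the raw bytes plus the metadata needed to reconstruct the array
    return {"dtype": str(arr.dtype), "shape": list(arr.shape), "data": arr.tobytes()}

def decode_array(doc):
    return np.frombuffer(doc["data"], dtype=doc["dtype"]).reshape(doc["shape"])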

lpad get_wfs needs to be a lot more efficient

lpad get_wfs is extremely slow, regardless of what kind of display mode is used (other than ids). The main reason is that the method reconstitutes the entire Workflow/Firework from the database before doing a dump to dict. For the purposes of getting info, this is very inefficient, especially for large workflows.

For example, I have a workflow that comprises > 150 Fireworks (each firework spawned a few others, and this continued over many iterations). Doing a simple lpad get_wfs -d more -i 2 (single workflow only) took 14 secs, with several tens of MB transferred (each launch is pretty big given that the launch contains new FWs with specs). The result is the same with "-d less". 14 secs is an eternity to wait for a single command to check status.

I have optimized the "-d less" mode to query only for the info required, and that reduced the query time from 14 secs to <0.5 secs. I am sure the "-d more" display mode can be similarly optimized.

In general, I would suggest distinguishing the query for information from the query to reconstitute whole fireworks objects. The former can be limited to only the info needed, and the latter is rarely used (e.g., why would you want to reconstitute a whole Workflow object from the database, except in the rare case where you want to modify it?). The LaunchPad should have additional methods such as get_wf_summary(list of fw_ids, mode) optimized for info access instead of object access.
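A minimal sketch of the info-only query style, given a LaunchPad instance lp with pymongo access to the workflows collection (the projected fields and fw_id below are illustrative):

# fetch only the summary fields instead of reconstituting the whole Workflow
fw_id = 2
projection = {"name": 1, "state": 1, "updated_on": 1, "fw_states": 1}
doc = lp.workflows.find_one({"nodes": fw_id}, projection)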

AttributeError with command qlaunch

python: 2.7.11
fireworks: 1.2.9 (develop model)

For the command qlaunch -r rapidfire --nlaunches infinite -m 1 --sleep 100 -b 10000, I got:

my environment
usage: qlaunch [-h] [-rh [REMOTE_HOST [REMOTE_HOST ...]]]
               [-rc REMOTE_CONFIG_DIR [REMOTE_CONFIG_DIR ...]]
               [-ru REMOTE_USER] [-rp REMOTE_PASSWORD] [-rs] [-d DAEMON]
               [--launch_dir LAUNCH_DIR] [--logdir LOGDIR] [--loglvl LOGLVL]
               [-s] [-r] [-l LAUNCHPAD_FILE] [-w FWORKER_FILE]
               [-q QUEUEADAPTER_FILE] [-c CONFIG_DIR]
               {singleshot,rapidfire} ...
qlaunch: error: unrecognized arguments: -r rapidfire

Then I removed -r rapidfire and typed qlaunch --nlaunches infinite -m 1 --sleep 100 -b 10000, and got:

my environment
Traceback (most recent call last):
  File "/lustre/home/umjzhh-1/kl_me2/virtenv_kl_me2/bin/qlaunch", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/scripts/qlaunch", line 6, in <module>
    qlaunch()
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/scripts/qlaunch_run.py", line 173, in qlaunch
    do_launch(args)
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/scripts/qlaunch_run.py", line 52, in do_launch
    reserve=args.reserve, strm_lvl=args.loglvl, timeout=args.timeout)
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/queue/queue_launcher.py", line 174, in rapidfire
    start_time = datetime.now()
AttributeError: 'module' object has no attribute 'now'

I found two solutions; in fireworks/queue/queue_launcher.py, either:

  1. replace import datetime with from datetime import datetime, or
  2. replace datetime.now() with datetime.datetime.now()

AttributeError: 'dict' object has no attribute 'as_dict'

  • FireWorks version: master
  • OS version: Red Hat Enterprise Linux Server release 6.3 (Santiago)

Error command

lpad get_fws, lpad rerun_fws -s RESERVED, and similar commands for fireworks

Error message

error message of lpad get_fws

kl_me2 environment
/lustre/home/umjzhh-1/kl_me2/codes/pymatgen/pymatgen/io/vasp/sets_deprecated.py:460: DeprecationWarning: __init__ is deprecated
All vasp input sets have been replaced by equivalents pymatgen.io.sets. Will be removed in pmg 4.0.
  return DictVaspInputSet(name, loadfn(filename), **kwargs)
Traceback (most recent call last):
  File "/lustre/home/umjzhh-1/kl_me2/virtenv_kl_me2/bin/lpad", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/scripts/lpad", line 6, in <module>
    lpad()
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/scripts/lpad_run.py", line 863, in lpad
    args.func(args)
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/scripts/lpad_run.py", line 208, in get_fws
    fw = lp.get_fw_by_id(id)
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/core/launchpad.py", line 316, in get_fw_by_id
    return Firework.from_dict(self.get_fw_dict_by_id(fw_id))
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/utilities/fw_serializers.py", line 148, in _decorator
    m_dict = func(self, *new_args, **kwargs)
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/core/firework.py", line 325, in from_dict
    state, created_on, fw_id, updated_on=updated_on)
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/core/firework.py", line 218, in __init__
    tasks]  # put tasks in a special location of the spec
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/utilities/fw_serializers.py", line 161, in _decorator
    m_dict = func(self, *args, **kwargs)
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/utilities/fw_serializers.py", line 133, in _decorator
    m_dict = recursive_dict(m_dict)
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/utilities/fw_serializers.py", line 76, in recursive_dict
    return {recursive_dict(k, preserve_unicode): recursive_dict(v, preserve_unicode) for k, v in obj.items()}
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/utilities/fw_serializers.py", line 76, in <dictcomp>
    return {recursive_dict(k, preserve_unicode): recursive_dict(v, preserve_unicode) for k, v in obj.items()}
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/utilities/fw_serializers.py", line 79, in recursive_dict
    return [recursive_dict(v, preserve_unicode) for v in obj]
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/utilities/fw_serializers.py", line 70, in recursive_dict
    return recursive_dict(obj.as_dict(), preserve_unicode)
  File "/lustre/home/umjzhh-1/kl_me2/codes/custodian/custodian/vasp/jobs.py", line 347, in as_dict
    default_vasp_input_set=self.default_vis.as_dict(),
AttributeError: 'dict' object has no attribute 'as_dict'

error message of lpad rerun_fws -s RESERVED

l_me2 environment
Are you sure? This will modify 40 entries. (Y/N)Y
/lustre/home/umjzhh-1/kl_me2/codes/pymatgen/pymatgen/io/vasp/sets_deprecated.py:460: DeprecationWarning: __init__ is deprecated
All vasp input sets have been replaced by equivalents pymatgen.io.sets. Will be removed in pmg 4.0.
  return DictVaspInputSet(name, loadfn(filename), **kwargs)
2016-06-03 20:31:30,242 DEBUG Processed fw_id: ...
...
2016-06-03 20:31:31,194 INFO Also rerunning duplicate fw_id: 11131
2016-06-03 20:31:31,233 DEBUG Processed fw_id: 3940
2016-06-03 20:31:31,276 INFO Also rerunning duplicate fw_id: 11110
2016-06-03 20:31:31,323 DEBUG Processed fw_id: 3947
2016-06-03 20:31:31,375 INFO Also rerunning duplicate fw_id: 11103
2016-06-03 20:31:31,416 DEBUG Processed fw_id: 3968
Traceback (most recent call last):
  File "/lustre/home/umjzhh-1/kl_me2/virtenv_kl_me2/bin/lpad", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/scripts/lpad", line 6, in <module>
    lpad()
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/scripts/lpad_run.py", line 863, in lpad
    args.func(args)
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/scripts/lpad_run.py", line 387, in rerun_fws
    lp.rerun_fw(int(f))
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/core/launchpad.py", line 966, in rerun_fw
    updated_ids = wf.rerun_fw(fw_id)
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/core/firework.py", line 836, in rerun_fw
    self.rerun_fw(child_id, updated_ids))
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/core/firework.py", line 830, in rerun_fw
    m_fw._rerun()
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/core/launchpad.py", line 1219, in _rerun
    self.full_fw._rerun()
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/core/launchpad.py", line 1309, in full_fw
    self._get_launch_data(launch_field)
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/core/launchpad.py", line 1320, in _get_launch_data
    fw = self.partial_fw  # assure stage 1
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/core/launchpad.py", line 1302, in partial_fw
    self._fw = Firework.from_dict(data)
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/utilities/fw_serializers.py", line 148, in _decorator
    m_dict = func(self, *new_args, **kwargs)
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/core/firework.py", line 325, in from_dict
    state, created_on, fw_id, updated_on=updated_on)
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/core/firework.py", line 218, in __init__
    tasks]  # put tasks in a special location of the spec
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/utilities/fw_serializers.py", line 161, in _decorator
    m_dict = func(self, *args, **kwargs)
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/utilities/fw_serializers.py", line 133, in _decorator
    m_dict = recursive_dict(m_dict)
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/utilities/fw_serializers.py", line 76, in recursive_dict
    return {recursive_dict(k, preserve_unicode): recursive_dict(v, preserve_unicode) for k, v in obj.items()}
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/utilities/fw_serializers.py", line 76, in <dictcomp>
    return {recursive_dict(k, preserve_unicode): recursive_dict(v, preserve_unicode) for k, v in obj.items()}
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/utilities/fw_serializers.py", line 79, in recursive_dict
    return [recursive_dict(v, preserve_unicode) for v in obj]
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/utilities/fw_serializers.py", line 70, in recursive_dict
    return recursive_dict(obj.as_dict(), preserve_unicode)
  File "/lustre/home/umjzhh-1/kl_me2/codes/custodian/custodian/vasp/jobs.py", line 347, in as_dict
    default_vasp_input_set=self.default_vis.as_dict(),
AttributeError: 'dict' object has no attribute 'as_dict'

.pyc files with wrong "magic number"

There are many .pyc files in the released tarball (tar.gz) that were created with Python 3.3 and cause errors when running on other versions. If we delete all the .pyc files, they are recreated and everything works fine.

webgui fails to start in Django 1.7

The webgui fails to start with the following error:

C:\Users\kpoman>python c:\Python34\Lib\site-packages\fireworks\scripts\lpad_run.py webgui
Process Process-1:
Traceback (most recent call last):
File "C:\Python34\lib\site-packages\django\utils\translation\trans_real.py", line 186, in _fetch
app_configs = reversed(list(apps.get_app_configs()))
File "C:\Python34\lib\site-packages\django\apps\registry.py", line 137, in get_app_configs
self.check_apps_ready()
File "C:\Python34\lib\site-packages\django\apps\registry.py", line 124, in check_apps_ready
raise AppRegistryNotReady("Apps aren't loaded yet.")
django.core.exceptions.AppRegistryNotReady: Apps aren't loaded yet.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Python34\lib\multiprocessing\process.py", line 254, in _bootstrap
self.run()
File "C:\Python34\lib\multiprocessing\process.py", line 93, in run
self._target(_self._args, *self.kwargs)
File "C:\Python34\lib\site-packages\django\core\management__init
.py", line 115, in call_command
return klass.execute(args, *defaults)
File "C:\Python34\lib\site-packages\django\core\management\base.py", line 331, in execute
translation.activate('en-us')
File "C:\Python34\lib\site-packages\django\utils\translation__init
.py", line 145, in activate
return _trans.activate(language)
File "C:\Python34\lib\site-packages\django\utils\translation\trans_real.py", line 225, in activate
_active.value = translation(language)
File "C:\Python34\lib\site-packages\django\utils\translation\trans_real.py", line 209, in translation
default_translation = _fetch(settings.LANGUAGE_CODE)
File "C:\Python34\lib\site-packages\django\utils\translation\trans_real.py", line 189, in _fetch
"The translation infrastructure cannot be initialized before the "
django.core.exceptions.AppRegistryNotReady: The translation infrastructure cannot be initialized before the apps registry is ready. Check that you don't make no
n-lazy gettext calls at import time.


This apparently is caused by some incompatibility in the way Django starts, introduced in Django >= 1.7.

Understanding how to launch jobs to a queue from Python

Up to this point, I have been able to interact nicely with the code in IPython by constructing the tasks and establishing their dependencies. Finally, I launch them and can inspect the results.

Now that I am moving to using a Sun Grid Engine queue, I am a little confused as to how I can do this. The documentation discusses how to do this with YAML files. However, I like the interactivity afforded me by IPython and would like to continue writing code in Python.

I see that there is a rapidfire function for the queue, but I am unclear on how to get it to work. In particular, how can I set up the CommonAdapter so that it will launch this job from Python without going through a yaml file directly?
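A minimal sketch of that setup, assuming SGE and default template handling (the exact CommonAdapter and queue_launcher.rapidfire signatures may differ slightly between FireWorks versions):

from fireworks import LaunchPad, FWorker
from fireworks.user_objects.queue_adapters.common_adapter import CommonAdapter
from fireworks.queue.queue_launcher import rapidfire

launchpad = LaunchPad.auto_load()

# the qadapter fields mirror what would normally live in my_qadapter.yaml;
# "all.q" is an assumed queue name
qadapter = CommonAdapter(q_type="SGE",
                         queue="all.q",
                         rocket_launch="rlaunch singleshot")

rapidfire(launchpad, FWorker(), qadapter, nlaunches=1)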

Data transfer between tasks

I want a way to transfer data between FireTasks. Right now, there is a way to do so between Fireworks, but even with mod_spec or add_spec, those changes are not applied to the spec until all tasks are done.

Task level recovery

I really need task-level recovery for Fireworks. For example, let's say I have a Firework that runs:

Task 1: Setup input
Task 2: Run VASP [Really expensive]
Task 3: Transfer Calc (takes < 1min)
Task 4: Setup new firework (takes seconds)

Let's say that for some reason Transfer Calc fails (e.g., because there was a temporary IO outage). When I rerun the firework, I think the Firework should automatically know that Task 1 and Task 2 have completed successfully and proceed directly to Task 3.

Note that of course I know I can separate every single task into its own Firework, but it is just too dumb to wait in the queue just to do a transfer that takes < 1 min.
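A hypothetical sketch of what task-level recovery could look like: persist the index of the last completed task alongside the launch, and skip already-completed tasks on rerun (the names here are illustrative, not FireWorks API):

import json
import os

CHECKPOINT = "fw_task_checkpoint.json"

def run_tasks(tasks):
    last_done = -1
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            last_done = json.load(f)["last_completed"]
    for i, task in enumerate(tasks):
        if i <= last_done:
            continue  # completed in a previous launch; skip on rerun
        task()
        with open(CHECKPOINT, "w") as f:
            json.dump({"last_completed": i}, f)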

make style dependencies

How would you go about extending the workflow spec to allow make-style dependencies? It would be nice to have a general way for a job to be executed only if its inputs have a modification date later than the output files.

This could also be extended to the way you define links in the specs section.

One way would be to write a task-specific Firetask, but a general mechanism in the spec would be more useful.

Is this easy to do?

q submit error messages should be more informative

Right now, if a q submission fails, I only get the following:

RuntimeError: queue script could not be submitted, check queue adapter and queue server status!

which is rather unhelpful and impossible to debug. At a minimum, fireworks should spit out the actual error message from the queue submission command.
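A minimal sketch of the improvement, assuming the submit command runs through subprocess (the command name is illustrative):

import subprocess

p = subprocess.Popen(["sbatch", "FW_submit.script"],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = p.communicate()
if p.returncode != 0:
    # surface the queue's own complaint instead of a generic message
    raise RuntimeError("queue script could not be submitted: %s" % err.decode())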

PyTask won't return FWActions

Looking at the documentation for PyTask (https://pythonhosted.org/FireWorks/pytask.html), it looks like it's intended that PyTask should be able to return a FWAction for message-passing etc., but I've been unable to make this work. Looking at script_task.py, it looks like it will never return the output.

Inserting something like

elif isinstance(output, FWAction):
    return output

at the end of the PyTask definition makes it function as I understand it should.

Am I missing something in how I should be using this?

For `lpad webgui`, try another port if the first one is in use

Port 5000 is a common one to use. There should be a more graceful fallback than just crashing (e.g. bump the port number and try again).

 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
Process Process-1:
Traceback (most recent call last):
  File "/zopt/conda/envs/nanshenv/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/zopt/conda/envs/nanshenv/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/zopt/conda/envs/nanshenv/lib/python2.7/site-packages/flask/app.py", line 772, in run
    run_simple(host, port, self, **options)
  File "/zopt/conda/envs/nanshenv/lib/python2.7/site-packages/werkzeug/serving.py", line 625, in run_simple
    inner()
  File "/zopt/conda/envs/nanshenv/lib/python2.7/site-packages/werkzeug/serving.py", line 603, in inner
    passthrough_errors, ssl_context).serve_forever()
  File "/zopt/conda/envs/nanshenv/lib/python2.7/site-packages/werkzeug/serving.py", line 512, in make_server
    passthrough_errors, ssl_context)
  File "/zopt/conda/envs/nanshenv/lib/python2.7/site-packages/werkzeug/serving.py", line 440, in __init__
    HTTPServer.__init__(self, (host, int(port)), handler)
  File "/zopt/conda/envs/nanshenv/lib/python2.7/SocketServer.py", line 420, in __init__
    self.server_bind()
  File "/zopt/conda/envs/nanshenv/lib/python2.7/BaseHTTPServer.py", line 108, in server_bind
    SocketServer.TCPServer.server_bind(self)
  File "/zopt/conda/envs/nanshenv/lib/python2.7/SocketServer.py", line 434, in server_bind
    self.socket.bind(self.server_address)
  File "/zopt/conda/envs/nanshenv/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 48] Address already in use
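A minimal sketch of the requested fallback, probing for the first free port before handing it to the web server (an illustrative helper, not the webgui's actual startup code):

import socket

def first_free_port(start=5000, tries=20):
    for port in range(start, start + tries):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("127.0.0.1", port))
            return port  # bind succeeded, so the port is free
        except socket.error:
            continue  # in use; bump the port number and try the next one
        finally:
            s.close()
    raise RuntimeError("no free port found in range")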

Monitoring launches

Instead of a fixed tracker, I would like a base class for monitoring launches (à la the Custodian handler's monitor function). There is no limitation on what a monitor can do; tracking the last few lines is just a special case.

setting memory for PBS jobs

Hi,
I just noticed that the PBS script template doesn't have a variable defined for setting the required process memory. The line for setting the process memory in PBS looks something like this:

#PBS -l pmem=1000mb

It would be great if you guys could add a 'pmem' variable to the PBS template (a sketch of the idea follows below).

regards

Kiran Mathew
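A sketch of what the requested template addition might look like, assuming the $${var} placeholder convention FireWorks uses in its queue templates ('pmem' is the proposed variable name):

#PBS -l pmem=$${pmem}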

The 'QUEUEADAPTER_LOC' is 'None' in Fireworks.

I work with Fireworks 1.2.7 with MPenv 0.1 (sjtu branch) and Python 2.7.11.
When I type the following code in Python:

from fireworks.fw_config import QUEUEADAPTER_LOC, CONFIG_FILE_DIR, FWORKER_LOC, LAUNCHPAD_LOC
print QUEUEADAPTER_LOC,'\n', CONFIG_FILE_DIR,'\n', FWORKER_LOC,'\n', LAUNCHPAD_LOC

I saw these:

None
$HOME/<ENV_NAME>/config/config_SjtuPi
$HOME/<ENV_NAME>/config/config_SjtuPi/my_fworker.yaml
$HOME/<ENV_NAME>/config/config_SjtuPi/my_launchpad.yaml

I found the relevant code in fireworks/fireworks/scripts/qlaunch_run.py:
lines 35-38 source the queueadapter file as my_qadapter.yaml,
but line 106 sets the default QUEUEADAPTER_LOC to my_queueadapter.yaml (which doesn't exist).

What I mean is in lines 143-159 of fireworks/fireworks/fw_config.py.

I don't know whether my_qadapter.yaml will work for the qlaunch command. For now, I just made a soft link my_queueadapter.yaml -> my_qadapter.yaml, which makes QUEUEADAPTER_LOC print normally.

Running `rapidfire` in Python results in multiply nested directories

It isn't obvious to me from the spec that this is intended. I would expect all directories to be at the same level, as if one had run launch_rocket multiple times. Instead, the directories become deeply nested. This seems to be caused by multiple calls to os.chdir. If this is indeed intended and expected behavior, I would expect at the minimum that we would change back to the original directory after a run.
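A minimal sketch of the suggested cleanup, restoring the original working directory after each launch so repeated launches don't nest (an illustrative wrapper, not the rapidfire internals):

import os

def launch_in_dir(launch_fn, launch_dir):
    prev_dir = os.getcwd()
    try:
        os.chdir(launch_dir)
        launch_fn()
    finally:
        os.chdir(prev_dir)  # always return to the starting directory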

Serious bug in Workflow links validation

There is a serious bug in the links validation. Or to be more exact, there isn't any links validation as far as I can see.

For example, starting from a blank launchpad, the following code will generate a workflow that can be inserted without error but is actually invalid, and get_wfs raises an error.

from fireworks import FireWork, Workflow
from fireworks.user_objects.firetasks.script_task import ScriptTask

fws = []
for i in xrange(5):
    fw = FireWork([ScriptTask(script="echo %d" % i)], fw_id=i)
    fws.append(fw)
wf = Workflow(fws, links_dict={0: [1, 2, 3], 1: [4], 2: [100]})
wf.to_file("testwf.yaml")

Even worse, if you replace the final 2: [100] with 2: [5], there is no error in workflow creation and addition, and get_wfs shows that 2 is now linked to fw_id 5. The correct mapping should result in 2: [6], and 6 is non-existent.
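A hypothetical sketch of the missing check, run at Workflow construction time (the names are illustrative):

def validate_links(fws, links_dict):
    known_ids = {fw.fw_id for fw in fws}
    for parent, children in links_dict.items():
        for fw_id in [parent] + list(children):
            if fw_id not in known_ids:
                raise ValueError("links_dict references unknown fw_id: %s" % fw_id)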

Error Loading Workflow from File

Hi,

First off, great library!

There is only one thing I would like to do that I cannot get to work, and that is to load a workflow from a file and execute it as part of a python script.

For example I would like to do something like:

from fireworks import Firework, LaunchPad, ScriptTask
from fireworks.core.rocket_launcher import launch_rocket


# set up the LaunchPad and reset it
launchpad = LaunchPad()
launchpad.reset('', require_password=False)

fw = Firework.from_file("/home/chris/dev/fireworks/fw_tutorials/org_wf.yaml")

# store workflow and launch it locally
launchpad.add_wf(fw)
launch_rocket(launchpad)    

However I get the following error:
Traceback (most recent call last):
  File "WorkflowEngine.py", line 12, in <module>
    fw = Firework.from_file("/levaux/tests/fireworkstest/org_wf.yaml")
  File "/usr/local/lib/python2.7/dist-packages/fireworks/utilities/fw_serializers.py", line 253, in from_file
    return cls.from_format(f.read(), f_format=f_format)
  File "/usr/local/lib/python2.7/dist-packages/fireworks/utilities/fw_serializers.py", line 228, in from_format
    yaml.load(f_str, Loader=Loader)))
  File "/usr/local/lib/python2.7/dist-packages/fireworks/utilities/fw_serializers.py", line 145, in _decorator
    m_dict = func(self, *new_args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/fireworks/core/firework.py", line 311, in from_dict
    tasks = m_dict['spec']['_tasks']
KeyError: u'spec'

I can, however, load and execute simpler single fireworks like fw_test.yaml.

Thanks in advance.

Cheers
Chris

spec is not copied when creating new FireWork

The following code does not work as expected:

from fireworks import FireWork, Workflow, FWorker, LaunchPad
from fireworks.core.rocket_launcher import rapidfire
from fireworks.user_objects.firetasks.script_task import ScriptTask

# define four individual FireWorks used in the Workflow
task1 = ScriptTask.from_str('echo "Task 1"')
task2 = ScriptTask.from_str('echo "Task 2"')

spec = {'_category': 'cluster1'}

fw1 = FireWork(task1, fw_id=1, name='Task 1', spec=spec)
fw2 = FireWork(task2, fw_id=2, name='Task 2', spec=spec)

# assemble Workflow from FireWorks and their connections by id
workflow = Workflow([fw1, fw2])

# store workflow and launch it locally
launchpad = LaunchPad.auto_load()
launchpad.add_wf(workflow)

lpad get_fws -i 1 -d all would show:

[
    {
        "fw_id": 1,
        "state": "READY",
        "name": "Task 1",
        "created_on": "2014-03-21T17:35:57.614476",
        "spec": {
            "_tasks": [
                {
                    "use_shell": true,
                    "_fw_name": "ScriptTask",
                    "script": [
                        "echo \"Task 2\""
                    ]
                }
            ],
            "_category": "cluster1"
        }
    },
    {
        "fw_id": 2,
        "state": "READY",
        "name": "Task 2",
        "created_on": "2014-03-21T17:35:57.614507",
        "spec": {
            "_tasks": [
                {
                    "use_shell": true,
                    "_fw_name": "ScriptTask",
                    "script": [
                        "echo \"Task 2\""
                    ]
                }
            ],
            "_category": "cluster1"
        }
    }
]

Note how the tasks in FW 1 got changed to the tasks in FW 2. This is because the spec of FW 1 is mutated during FW 2's initialization. I suggest copying spec in FireWork.__init__().
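A minimal sketch of the suggested fix (a hypothetical, simplified constructor, not the real one): copy the caller's dict instead of storing and mutating it.

from copy import deepcopy

class FireWork(object):
    def __init__(self, tasks, spec=None, fw_id=-1, name=None):
        self.tasks = tasks if isinstance(tasks, list) else [tasks]
        # deepcopy prevents two FireWorks from sharing one mutable spec dict
        self.spec = deepcopy(spec) if spec else {}
        self.spec['_tasks'] = [t.to_dict() for t in self.tasks]
        self.fw_id = fw_id
        self.name = name or 'Unnamed FW'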

qlaunch (non reservation mode) can create too many empty directories

As mentioned by Shyue, qlaunch rapidfire will create many empty directories that can greatly exceed the number of fireworks in the database. If fireworks are not added to the DB by the time the jobs start running in the queue, you will get empty directories all over your file system.
