julianhess / canine

This project forked from getzlab/canine

A high-performance computing solution to run jobs using SLURM

Home Page: https://broadinstitute.github.io/canine/

License: BSD 3-Clause "New" or "Revised" License

Python 99.72% Dockerfile 0.24% Shell 0.04%

canine's People

Contributors

agraubert, julianhess, marlin-na

canine's Issues

Directory finalization crashes

The cause of this bug is unknown.

Finalizing directory structure. This may take a while...
---------------------------------------------------------------------------
FileExistsError                           Traceback (most recent call last)
<ipython-input-84-43eff7be28bf> in <module>
----> 1 R = orch.run_pipeline(output_dir = "ds_results2")

/mnt/j/local/miniconda3/lib/python3.7/site-packages/canine/orchestrator.py in run_pipeline(self, output_dir, dry_run)
    173                     job_spec,
    174                     self.raw_outputs,
--> 175                     self.localizer_overrides
    176                 )
    177                 print("Job staged on SLURM controller in:", abs_staging_dir)

/mnt/j/local/miniconda3/lib/python3.7/site-packages/canine/localization/local.py in localize(self, inputs, patterns, overrides)
    121                 transport
    122             )
--> 123             staging_dir = self.finalize_staging_dir(inputs.keys(), transport=transport)
    124             for src, dest, context in self.queued_gs:
    125                 self.gs_copy(src, dest, context)

/mnt/j/local/miniconda3/lib/python3.7/site-packages/canine/localization/base.py in finalize_staging_dir(self, jobs, transport)
    411                         transport.makedirs(os.path.join(controller_env['CANINE_JOBS'], jobId, 'workspace'))
    412                     if not transport.isdir(os.path.join(controller_env['CANINE_JOBS'], jobId, 'inputs')):
--> 413                         transport.makedirs(os.path.join(controller_env['CANINE_JOBS'], jobId, 'inputs'))
    414                 if not transport.isdir(controller_env['CANINE_OUTPUT']):
    415                     transport.mkdir(controller_env['CANINE_OUTPUT'])

/mnt/j/local/miniconda3/lib/python3.7/site-packages/canine/backends/base.py in makedirs(self, path)
    185         if not (dirname == '' or os.path.exists(dirname)):
    186             self.makedirs(dirname)
--> 187         self.mkdir(path)
    188
    189     def walk(self, path: str) -> typing.Generator[typing.Tuple[str, typing.List[str], typing.List[str]], None, None]:

/mnt/j/local/miniconda3/lib/python3.7/site-packages/canine/backends/local.py in mkdir(self, path)
     43         Creates the requested directory
     44         """
---> 45         return os.mkdir(path)
     46
     47     def stat(self, path: str) -> typing.Any:

FileExistsError: [Errno 17] File exists: '94b98133-0747-46dc-8b8e-d0b485aa5390/jobs/21381/inputs'
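
In the traceback above, `transport.isdir(...)` evidently returned False just before `mkdir` raised `FileExistsError`. One possible explanation (only a guess, since the cause is unknown) is that something other than a directory, such as a regular file or a dangling symlink, already occupies the path. A minimal standard-library sketch for checking this; `describe_path` is a hypothetical debugging helper and not part of canine:

```python
import os


def describe_path(path: str) -> str:
    """Classify what currently occupies `path` (hypothetical debugging helper)."""
    if os.path.islink(path):
        # os.path.exists follows symlinks, so it is False for a dangling link
        return "symlink" if os.path.exists(path) else "dangling symlink"
    if os.path.isdir(path):
        return "directory"
    if os.path.isfile(path):
        return "file"
    if os.path.lexists(path):
        # Something exists but is neither file, directory, nor link (e.g. a FIFO)
        return "other"
    return "missing"
```

Running it on the path from the error message, from the same working directory as the failing call, would show whether the `isdir` check and `mkdir` disagree because of what is actually on disk.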

If API is still busy, we should block

Otherwise, this happens:

In [21]: R = orch.run_pipeline(output_dir = "ds_results")
Preparing pipeline of 192 jobs
Connecting to backend...
Checking for running Slurm controller ... done
Checking for preexisting cluster nodes ... ERROR: Could not initialize cluster; attempting to tear down.
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-21-0752d45456c3> in <module>
----> 1 R = orch.run_pipeline(output_dir = "ds_results")

/mnt/j/local/miniconda3/lib/python3.7/site-packages/canine/orchestrator.py in run_pipeline(self, output_dir, dry_run)
    166         if isinstance(self.backend, RemoteSlurmBackend):
    167             self.backend.load_config_args()
--> 168         with self.backend:
    169             print("Initializing pipeline workspace")
    170             with self._localizer_type(self.backend, **self.localizer_args) as localizer:

/mnt/j/local/miniconda3/lib/python3.7/site-packages/canine/backends/imageTransient.py in __enter__(self)
    246
    247             self.stop()
--> 248             raise e
    249
    250     def __exit__(self, *args):

/mnt/j/local/miniconda3/lib/python3.7/site-packages/canine/backends/imageTransient.py in __enter__(self)
    144
    145             # check which worker nodes exist outside of Canine
--> 146             instances = self.list_instances_all_zones()
    147             k9_inst_idx = instances["tags"].apply(lambda x : "caninetransientimage" in x)
    148

/mnt/j/local/miniconda3/lib/python3.7/site-packages/canine/backends/imageTransient.py in list_instances_all_zones(self)
    279         return pd.concat([
    280           list_instances(zone = x["name"], project = self.config["project"])
--> 281           for x in zone_dict["items"]
    282         ], axis = 0).reset_index(drop = True)
    283

KeyError: 'items'

In [22]: %debug
> /mnt/j/local/miniconda3/lib/python3.7/site-packages/canine/backends/imageTransient.py(281)list_instances_all_zones()
    279         return pd.concat([
    280           list_instances(zone = x["name"], project = self.config["project"])
--> 281           for x in zone_dict["items"]
    282         ], axis = 0).reset_index(drop = True)
    283

ipdb> zone_dict
{'id': '2552540594253164385', 'name': 'operation-1570034061883-593f00a8b373b-dcab8340-078f8a17', 'zone': 'https://www.googleapis.com/compute/v1/projects/broad-cga-jhess-pcawg/zones/us-east1-d', 'operationType': 'stop', 'targetLink': 'https://www.googleapis.com/compute/v1/projects/broad-cga-jhess-pcawg/zones/us-east1-d/instances/gce-worker97', 'targetId': '2544207614669582807', 'status': 'RUNNING', 'user': 'jhess@broadinstitute.org', 'progress': 0, 'insertTime': '2019-10-02T09:34:22.270-07:00', 'startTime': '2019-10-02T09:34:22.314-07:00', 'selfLink': 'https://www.googleapis.com/compute/v1/projects/broad-cga-jhess-pcawg/zones/us-east1-d/operations/operation-1570034061883-593f00a8b373b-dcab8340-078f8a17', 'kind': 'compute#operation'}
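
In the debugger output above, `zone_dict` has `'kind': 'compute#operation'` with `'status': 'RUNNING'` and no `items` key, hence the `KeyError`. A minimal sketch of the proposed blocking behaviour: poll until a response that actually contains `items` arrives, or give up after a timeout. The `fetch` callable, the function name `wait_for_items`, and the default timeout and interval are all assumptions for illustration, not canine's API:

```python
import time
from typing import Any, Callable, Dict


def wait_for_items(
    fetch: Callable[[], Dict[str, Any]],
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call `fetch` repeatedly until its result has an 'items' key.

    `fetch` is a hypothetical stand-in for whatever call produced `zone_dict`.
    """
    deadline = time.monotonic() + timeout
    while True:
        response = fetch()
        if "items" in response:
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "no 'items' in response; last status: %r" % response.get("status")
            )
        # Response looks like an in-progress operation; wait and retry
        sleep(interval)
```

Raising `TimeoutError` with the last observed status keeps the failure explicit instead of surfacing as a bare `KeyError` deep inside `list_instances_all_zones`.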

Use API to launch nodes

As referenced in b3ec9c5, launching nodes with gcloud makes for much cleaner code, but the command returns a nonzero exit status if even a single node fails to launch, e.g. when the user hits project guardrails such as core limits. One failed node should not stall the launch of the whole cluster.
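
A minimal sketch of the desired behaviour, assuming nodes can be launched one at a time: each launch is attempted independently, and failures are collected rather than aborting the whole batch. `launch_nodes` and the `launch_one` callable are hypothetical; `launch_one` stands in for whatever per-instance API request would replace the gcloud call:

```python
from typing import Callable, Dict, Iterable, List, Tuple


def launch_nodes(
    names: Iterable[str],
    launch_one: Callable[[str], None],
) -> Tuple[List[str], Dict[str, Exception]]:
    """Try to launch every node; return (launched names, failures by name)."""
    launched: List[str] = []
    failed: Dict[str, Exception] = {}
    for name in names:
        try:
            launch_one(name)
        except Exception as exc:
            # Record the failure (e.g. a quota error) and keep going
            failed[name] = exc
        else:
            launched.append(name)
    return launched, failed
```

The caller can then decide whether a partially launched cluster is acceptable, for example by proceeding when at least one node came up and reporting the contents of `failed`.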
