Missing tmp files (sos issue, closed)

pgcudahy commented on August 30, 2024
Missing tmp files

from sos.

Comments (11)

pgcudahy commented on August 30, 2024

I changed my hosts.yml to

hosts:
  r209u16n01:
    address: [email protected]
    paths:
        home: /home/pgc29
        project: /gpfs/gibbs/project/cudahy/pgc29
    sos: /home/pgc29/project/conda_envs/sos/bin/sos
  mccleary_scavenge:
    description: McCleary day / scavenge queue
    address: [email protected]
    paths:
        home: /home/pgc29
        project: /gpfs/gibbs/project/cudahy/pgc29
    sos: /home/pgc29/project/conda_envs/sos/bin/sos
...

But sos remote test still complains that

DEBUG: Path /gpfs/gibbs/project/cudahy/pgc29 is not under any specified paths of localhost and is mapped to /gpfs/gibbs/project/cudahy/pgc29 on remote host.
Alias:       r209u16n01
Address:     [email protected]
Queue Type:  process
ssh:         OK
scp:         OK
sos:         OK
paths:       No path_map between local and remote host.
shared:      shared directory / not in path_map

And real jobs still fail in the same way. My notebook is running in a directory within /gpfs/gibbs/project/cudahy/pgc29 but I think it's somehow failing to write the .tmp_script.sos file?

BoPeng commented on August 30, 2024

The first thing to check is whether the head node and all computing nodes have access to /gpfs/gibbs/project/cudahy/pgc29.
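
A quick way to run that check is a small script executed once on the head node and once inside a job on a compute node. The sketch below is only an illustration: `check_shared_dir` is a hypothetical helper, not part of sos, and it only reports what the node it runs on can see.

```python
import os
import socket
import tempfile


def check_shared_dir(path):
    """Return (exists, writable) for a directory as seen from the current node."""
    exists = os.path.isdir(path)
    writable = False
    if exists:
        try:
            # Creating a scratch file tests write access directly; the file is
            # removed automatically when the with-block ends.
            with tempfile.NamedTemporaryFile(dir=path, prefix=".access_check_"):
                writable = True
        except OSError:
            writable = False
    return exists, writable


# Report what this node sees for the current working directory.
print(socket.gethostname(), check_shared_dir(os.getcwd()))
```

If the head node and a compute node print different results for the same directory, the directory is not shared in the way the workflow needs.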

BoPeng commented on August 30, 2024

@pgcudahy Just to clarify, did you

  1. submit a single-node job to a working node, then let the working node submit more jobs using the task mechanism, or
  2. submit a multi-node job and let sos distribute jobs to all nodes?

The trouble with option 1 is that the working nodes need ssh access to the head node, which is not always available.

pgcudahy commented on August 30, 2024

Thanks for your help with this. I checked and the directory is available to all nodes. I'm submitting a single-node job with

%sosrun test -v4 \
-c /home/pgc29/.sos/hosts.yml \
-q mccleary_scavenge \
-r mccleary_scavenge \
mem="4GB" cores=1 walltime="24:00:00" nodes=1

[test]
output: f'/home/pgc29/test.out'
task: walltime='00:02:00', mem='1G', cores=1, nodes=1, 
    workdir='/home/pgc29'

run: expand=True
    touch {_output}

Fails with

$ cat .sos/workflows/wc47a7467aab75a37.err 
DEBUG: Failed to report to monitor process: cannot access local variable 'm' where it is not associated with a value
  File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/__main__.py", line 582, in cmd_run
    script = SoS_Script(filename=args.script)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Traceback (most recent call last):
  File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/__main__.py", line 582, in cmd_run
    script = SoS_Script(filename=args.script)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/parser.py", line 862, in __init__
    content, self.sos_script = locate_script(filename, start=".")
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/utils.py", line 917, in locate_script
    raise ValueError(f"Failed to locate {filename}")
ValueError: Failed to locate /gpfs/gibbs/project/cudahy/pgc29/helen_mixed_infection/notebooks/.tmp_script_qfbs3foy.sos
  File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/__main__.py", line 2847, in main
    args.func(args, workflow_args)
Traceback (most recent call last):
  File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/__main__.py", line 582, in cmd_run
    script = SoS_Script(filename=args.script)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/parser.py", line 862, in __init__
    content, self.sos_script = locate_script(filename, start=".")
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/utils.py", line 917, in locate_script
    raise ValueError(f"Failed to locate {filename}")
ValueError: Failed to locate /gpfs/gibbs/project/cudahy/pgc29/helen_mixed_infection/notebooks/.tmp_script_qfbs3foy.sos

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/__main__.py", line 2847, in main
    args.func(args, workflow_args)
  File "/gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/lib/python3.11/site-packages/sos/__main__.py", line 725, in cmd_run
    env.logger.error(str(e))
                         ^
UnboundLocalError: cannot access local variable 'e' where it is not associated with a value
ERROR: cannot access local variable 'e' where it is not associated with a value
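
Note that the final UnboundLocalError masks the real failure, which is the earlier ValueError about the missing script. In Python 3, a name bound with `except ... as e` is deleted when the except block ends, so using `e` afterwards raises exactly this kind of error. The sketch below shows the general pitfall with hypothetical function names; it is not the sos code at the line in the traceback, which is not shown in this thread.

```python
def report_failure_buggy():
    try:
        raise ValueError("Failed to locate script")
    except ValueError as e:
        pass  # e is implicitly deleted when this except block ends
    # e is a local name that is no longer bound here, so this line raises
    # UnboundLocalError instead of returning the original message.
    return str(e)


def report_failure_fixed():
    message = None
    try:
        raise ValueError("Failed to locate script")
    except ValueError as e:
        # Copy what is needed while e is still bound.
        message = str(e)
    return message
```

Calling `report_failure_buggy()` raises UnboundLocalError, while `report_failure_fixed()` returns the original error text.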

BoPeng commented on August 30, 2024

Let me try to reproduce this, but as I said above, there are two ways to run a multi-node job on the cluster:

  1. (pseudo command) qsub cores=1 sos run -q cluster. This requires the head node to be accessible from computing nodes, which is not the case on my cluster.
  2. qsub cores=10 sos run -q none -j 10. This will allocate 10 nodes and let sos distribute jobs directly to computing nodes. This is supposed to be faster for a large number of smaller tasks since there is no overhead of creating and monitoring a large number of tasks.

Let me see if I can make your example work with both options.

pgcudahy commented on August 30, 2024

One thing I realized I had wrong is that I should be using shared: rather than paths: in my hosts.yml, so I changed the localhost entry to

hosts:
  r209u16n01:
    address: pgc29@localhost
    shared:
      home: /vast/palmer/home.mccleary/pgc29/
      project: /gpfs/gibbs/project/cudahy/pgc29/
      scratch60: /vast/palmer/scratch/cudahy/pgc29/
    sos: /gpfs/gibbs/project/cudahy/pgc29/conda_envs/sos/bin/sos

But when I run !sos remote -v4 -c /home/pgc29/.sos/hosts.yml test r209u16n01 I still get

DEBUG: Path /gpfs/gibbs/project/cudahy/pgc29 is not under any specified paths of localhost and is mapped to /gpfs/gibbs/project/cudahy/pgc29 on remote host.
DEBUG: Path /gpfs/gibbs/project/cudahy/pgc29 is not under any specified paths of localhost and is mapped to /gpfs/gibbs/project/cudahy/pgc29 on remote host.
Alias:       r209u02n01
Address:     pgc29@localhost
Queue Type:  process
ssh:         OK
scp:         OK
sos:         OK
paths:       No path_map between local and remote host.
shared:      shared directory / not in path_map

I did notice that this is run from a notebook within /gpfs/gibbs/project/cudahy/pgc29, and when I move the notebook to /vast/palmer/home.mccleary/pgc29/ !sos remote -v4 -c /home/pgc29/.sos/hosts.yml test r209u16n01 complains with

DEBUG: Path /vast/palmer/home.mccleary/pgc29 is not under any specified paths of localhost and is mapped to /vast/palmer/home.mccleary/pgc29 on remote host.
DEBUG: Path /vast/palmer/home.mccleary/pgc29 is not under any specified paths of localhost and is mapped to /vast/palmer/home.mccleary/pgc29 on remote host.
Alias:       r209u02n01
Address:     pgc29@localhost
Queue Type:  process
ssh:         OK
scp:         OK
sos:         OK
paths:       No path_map between local and remote host.
shared:      shared directory / not in path_map

Even with these changes, I still get the same Failed to locate /gpfs/gibbs/project/cudahy/pgc29/helen_mixed_infection/notebooks/.tmp_script_qfbs3foy.sos error when submitting real jobs.
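
For anyone debugging the "not under any specified paths" message by hand, a rough sanity check is to test whether the working directory falls under one of the configured directories after resolving symlinks. The thread uses both /home/pgc29 and /vast/palmer/home.mccleary/pgc29 for the home directory, so one may be a symlink to the other (an assumption). The helper below, `find_containing_entry`, is hypothetical and is not how sos computes its path map.

```python
from pathlib import Path


def find_containing_entry(workdir, named_paths):
    """Return the name of the first configured directory containing workdir, or None."""
    resolved = Path(workdir).resolve()  # follows symlinks where the path exists
    for name, root in named_paths.items():
        root_resolved = Path(root).resolve()
        if resolved == root_resolved or root_resolved in resolved.parents:
            return name
    return None


# Directory names copied from the shared: block in the hosts.yml above.
shared = {
    "home": "/vast/palmer/home.mccleary/pgc29/",
    "project": "/gpfs/gibbs/project/cudahy/pgc29/",
}
notebook_dir = "/gpfs/gibbs/project/cudahy/pgc29/helen_mixed_infection/notebooks"
print(find_containing_entry(notebook_dir, shared))
```

A result of None for the notebook directory would point at the configuration. A match, as expected here, suggests the problem lies elsewhere.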

BoPeng commented on August 30, 2024

@pgcudahy Could you try to execute the same workflow from the command line in Jupyter? Basically, could you please create a test.sos file with the [test] workflow you have, open a terminal in JupyterLab, and execute the workflow with sos run -r -q ... test.sos? Right now I suspect that sos notebook removes the temporary script before the remote host reads and executes it.
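
The suspected failure mode is a file-lifetime race: the temporary script is deleted before the process that needs it has read it. A minimal sketch of the safe ordering follows, using a local Python child process as a stand-in for the remote sos run. The function name `run_temp_script` and the `.tmp_script_` prefix are illustrative, and this is not the sos-notebook code or its patch.

```python
import os
import subprocess
import sys
import tempfile


def run_temp_script(source, workdir=None):
    """Write source to a temporary script, run it, and delete it only after the child exits."""
    fd, script_path = tempfile.mkstemp(
        prefix=".tmp_script_", suffix=".py", dir=workdir, text=True
    )
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(source)
        # subprocess.run blocks until the child has finished, so the script
        # still exists for as long as the child might read it.
        result = subprocess.run(
            [sys.executable, script_path],
            capture_output=True,
            text=True,
            check=False,
        )
        return result.stdout
    finally:
        # Cleanup happens only after the consumer is done.
        os.remove(script_path)
```

If the job is instead handed to a queue and the submitting side returns immediately, the cleanup has to wait until the job itself has started and read the file, which may be the situation with `-q` submissions from a notebook.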

pgcudahy commented on August 30, 2024

Yup, submitting that way works just fine

BoPeng commented on August 30, 2024

OK, then this is a problem with sos-notebook. I am patching sos-notebook now.

BoPeng commented on August 30, 2024

@pgcudahy Please let me know if the problem has been addressed with sos-notebook 0.24.1.

pgcudahy commented on August 30, 2024

Works well! Thanks so much for helping me with these edge cases.
