cooperative-computing-lab / cctools

The Cooperative Computing Tools (cctools) enable large scale distributed computations to harness hundreds to thousands of machines from clusters, clouds, and grids.

Home Page: http://ccl.cse.nd.edu

License: Other

Makefile 0.72% C 60.37% Shell 4.86% Python 21.24% Perl 2.64% C++ 9.86% Tcl 0.02% Go 0.02% Jupyter Notebook 0.13% SWIG 0.10% Gnuplot 0.04% Dockerfile 0.01%

cctools's Introduction

The Cooperative Computing Tools

About

The Cooperative Computing Tools (cctools) is a software package for enabling large scale distributed computing on clusters, clouds, and grids. It is used primarily for attacking large scale problems in science and engineering.

You can read more about this software at ReadTheDocs. It is developed by members of the Cooperative Computing Lab at the University of Notre Dame, led by Prof. Douglas Thain. The file CREDITS lists the many people who have contributed to the software over the years.

Quick Install Via Miniconda

The easiest way to install the binaries is via Miniconda:

conda install -y -c conda-forge ndcctools

Build From Source

To build from source and install in your home directory:

git clone https://github.com/cooperative-computing-lab/cctools.git cctools-src
cd cctools-src
unset PYTHONPATH
conda env create -y -f environment.yml
conda activate cctools-dev   # environment name as defined in environment.yml
./configure --with-base-dir $CONDA_PREFIX --prefix $CONDA_PREFIX
make
make install

Then run the executables out of the source tree like this:

export PATH=$HOME/cctools-src/bin:$PATH
makeflow -v
vine_status

Copyright and License Notices

------------------------------------------------------------
This software package is
Copyright (c) 2003-2004 Douglas Thain
Copyright (c) 2005-2022 The University of Notre Dame
This software is distributed under the GNU General Public License.
See the file COPYING for details.
------------------------------------------------------------
This product includes software developed by and/or derived
from the Globus Project (http://www.globus.org/)
to which the U.S. Government retains certain rights.
------------------------------------------------------------
This product includes code derived from the RSA Data
Security, Inc. MD5 Message-Digest Algorithm.
------------------------------------------------------------
This product includes public domain code for the
SHA1 algorithm written by Peter Gutmann, David Ireland,
and A. M. Kutchman.
------------------------------------------------------------
This product includes the source code for the MT19937-64
Mersenne Twister pseudorandom number generator, written by 
Makoto Matsumoto and Takuji Nishimura.

cctools's People

Contributors

badmutex barryslydelgado batrick benjaminlyons btovar bwiseman77 charleszheng44 colinthomas-z80 cshinaver david-simonetti-nd dpandiar dthain epyifany haiyanmeng jblomer jinzhou5042 jmrundle josephduggan7 lyu2 neilb879 nhazekam nicklocascio45 nkremerh pbui rampantmonkey reneme sghuang19 storelord tphung3 tshaffe1


cctools's Issues

Work Queue Input From URL

Add a capability to Work Queue that allows the caller to specify a URL as the source of an input file. The API should be roughly work_queue_task_specify_url(task,url,filename,flags). This should result in a new message type of "url" from the master to the worker, which causes the worker to invoke wget, curl, or something similar to retrieve the file.
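
A sketch of the intended caller-side usage (work_queue_task_specify_url and its flags argument are the proposal above, not an existing call; the surrounding calls are the standard Work Queue C API, and q is assumed to be an existing struct work_queue *):

struct work_queue_task *t = work_queue_task_create("tar xzf data.tar.gz && ./analyze");

/* Proposed call: the worker fetches the URL itself and places the result
   in the task sandbox as data.tar.gz; WORK_QUEUE_CACHE asks the worker to
   keep a copy for later tasks. */
work_queue_task_specify_url(t, "http://example.com/data.tar.gz",
                            "data.tar.gz", WORK_QUEUE_INPUT | WORK_QUEUE_CACHE);

work_queue_submit(q, t);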

Send peak info to catalog

--- dthain submitted bug at Wed, 13 Mar 2013 15:52:33 -0400 ---

The work queue master should keep track of the past peak values of workers
connected, tasks running, etc. and communicate those to the catalog.

This will preserve useful information even if some updates are lost or the
catalog information is stale. It is also useful for the WQ visualization that
kwern is working on.
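
A minimal sketch of the bookkeeping, with hypothetical stats fields: update the peaks wherever the live counts change, then transmit them alongside the current values in each catalog update.

/* Hypothetical fields on the queue's stats structure. */
if(s->total_workers_connected > s->peak_workers_connected)
	s->peak_workers_connected = s->total_workers_connected;
if(s->tasks_running > s->peak_tasks_running)
	s->peak_tasks_running = s->tasks_running;
/* ...and include both peak_* fields in the next catalog update. */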

Work Queue Status Resources

Add to work_queue_status a -R option that shows the resources (CPU, RAM, disk, etc.) aggregated by each master and foreman.

Since this is only reported by the latest code, you may need to start some masters/foremen/workers by hand and look at work_queue_status -l to see what is reported.

Verify that the proper information is being reported, and think about what is most useful to report and how to present it.

parrot - cvmfs - mathematica

--- btovar commented at Mon, 25 Mar 2013 08:13:49 -0400 ---
I'm having trouble running the benchmark even without CVMFS. With parrot_run
(both 3.7.1 and master), parrot gets stuck with a series of:

2013/03/25 08:12:43.52 [12992] parrot_run: poll: waking pid 13267 because time
expired
2013/03/25 08:12:43.52 [12992] parrot_run: process: pid 13267 woken from wait
state
2013/03/25 08:12:43.52 [13267] parrot_run: syscall: poll
2013/03/25 08:12:43.52 [13267] parrot_run: libcall: poll 0x12eb0b30 1 3000
2013/03/25 08:12:43.52 [13267] parrot_run: poll: fd 4 want r--
2013/03/25 08:12:43.52 [13267] parrot_run: poll: fd 4 ready --- rpipe
2013/03/25 08:12:43.52 [13267] parrot_run: poll: select time expired
2013/03/25 08:12:43.52 [13267] parrot_run: libcall: = 0
2013/03/25 08:12:43.52 [13267] parrot_run: syscall: = 0
2013/03/25 08:12:43.52 [13267] parrot_run: syscall: poll
2013/03/25 08:12:43.52 [13267] parrot_run: libcall: poll 0x12eb0b30 1 3000
2013/03/25 08:12:43.52 [13267] parrot_run: poll: fd 4 want r--
2013/03/25 08:12:43.52 [13267] parrot_run: poll: fd 4 ready --- rpipe
2013/03/25 08:12:43.52 [13267] parrot_run: poll: select time remaining
3.000000
2013/03/25 08:12:43.52 [13267] parrot_run: poll: wake in time 3.000000
2013/03/25 08:12:43.52 [13267] parrot_run: poll: wake on fd 31 flags r--
2013/03/25 08:12:43.52 [13267] parrot_run: libcall: = -1 Resource temporarily
unavailable
2013/03/25 08:12:43.54 [13277] parrot_run: syscall: =
2013/03/25 08:12:43.54 [13277] parrot_run: syscall: futex
2013/03/25 08:12:43.54 [13277] parrot_run: syscall: =
2013/03/25 08:12:43.54 [13277] parrot_run: syscall: clock_gettime

2013/03/25 08:10:28.67 [12948] parrot: syscall: futex
2013/03/25 08:10:28.87 [12948] parrot: syscall: =
2013/03/25 08:10:28.87 [12948] parrot: syscall: futex
2013/03/25 08:10:28.87 [12948] parrot: syscall: =
2013/03/25 08:10:28.87 [12948] parrot: syscall: clock_gettime
2013/03/25 08:10:28.87 [12948] parrot: syscall: =
2013/03/25 08:10:28.87 [12948] parrot: syscall: futex
2013/03/25 08:10:29.07 [12948] parrot: syscall: =
2013/03/25 08:10:29.07 [12948] parrot: syscall: futex
2013/03/25 08:10:29.07 [12948] parrot: syscall: =
2013/03/25 08:10:29.07 [12948] parrot: syscall: clock_gettime
...

--- btovar took bug at Fri, 22 Mar 2013 07:26:28 -0400 ---

--- btovar submitted bug at Fri, 22 Mar 2013 07:26:18 -0400 ---
As reported by Suchandra Thapa:

I'm running into a problem running Mathematica under Parrot and CVMFS. With
the latest version (3.7.1) of Parrot, while trying to run the built-in
Mathematica benchmark, I'm seeing Mathematica hang. When I try the same thing
using the same repository mounted through the CVMFS FUSE module, the
benchmark works.

update monitor visualizer to recent changes

In particular:

  • All memory units are now in megabytes.
  • Swap memory and total number of processes are now also reported. Swap memory in both summary and time series; total number of processes only in the summary.
  • Number of files and disk footprint may be missing from both the summary and the time series (--without-disk-footprint option).
  • Ordering of columns in time series changed.
  • Disk footprint in the summary changed from float to int.

list repositories in cvmfs

It would be nice if getdir("/cvmfs") (e.g. 'ls /cvmfs') listed the repositories declared with -r on the command line, rather than failing with "No such file or directory".

Rework Work Queue Scheduling

--- dthain commented at Tue, 29 Jan 2013 13:30:11 -0500 ---
Mike will fix this accidentally.

--- dthain submitted bug at Wed, 13 Jun 2012 15:11:23 -0400 ---
We have received many different requests for how to schedule tasks to workers
in Work Queue, and the options keep piling up: schedule to the fastest
workers, prefer workers with a certain property, prefer workers with the most
data, and so on. We also have the question of which task to schedule first.
On top of that, there are two scheduling points in the code: one where we have
a task and must find a worker, the other where we have a worker and must find
a task.

Find a way to collect all of this into a more coherent framework. Perhaps
provide the user with some kind of hook so that they can specify their own
function for sorting workers/tasks?
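
One possible shape for such a hook, with hypothetical names: the caller registers a comparator and both scheduling points use it to rank the candidates, so the built-in policies become default comparators rather than special cases.

/* Hypothetical hook: return <0 if worker a is preferred over worker b
   for task t, 0 if equivalent, >0 if b is preferred. */
typedef int (*wq_compare_workers_t)(struct work_queue_task *t,
                                    struct work_queue_worker *a,
                                    struct work_queue_worker *b);

/* Hypothetical registration call; not part of the current API. */
void work_queue_specify_worker_comparator(struct work_queue *q,
                                          wq_compare_workers_t compare);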

utimensat for parrot

From Dan Bradley:

2013/06/01 20:10:01.86 [13014] parrot_run: notice: warning: system call
280 (utimensat) not supported for program /bin/touch

Worker: Check for Low Disk Space

--- dpandiar commented at Tue, 02 Apr 2013 15:37:05 -0400 ---
Commit d42ef8d modifies the disk space checks to use the size of the incoming
file in determining if the disk space will go below the given threshold.
--- lyu2 commented at Thu, 14 Feb 2013 15:46:10 -0500 ---
Just a suggestion: since the worker already reports available disk space, we
can let the master reject a worker when the total size of a task's inputs,
including the executables, exceeds the worker's available disk space.

Having the master report file lengths is still good and needed, because only
the worker knows the most up-to-date available disk space.

Another note I want to add is that we need to be careful about cached input
files when doing the math.
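
A sketch of the arithmetic lyu2 describes, with hypothetical helper names (the field names only loosely follow the internal structures); note that inputs already cached at the worker must not be counted again:

INT64_T needed = 0;
struct work_queue_file *f;

list_first_item(t->input_files);
while((f = list_next_item(t->input_files))) {
	if(!worker_file_is_cached(w, f))   /* hypothetical: skip cached inputs */
		needed += f->length;           /* length as reported by the master */
}

if(needed > worker_disk_available(w))  /* hypothetical accessor */
	return 0;                          /* reject: the task would not fit */
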
--- dpandiar reopened at Thu, 14 Feb 2013 12:31:48 -0500 ---
Reopening bug to have the checks use the length of the incoming file to ensure
we won't exceed the threshold when transferring a single (large) file.
--- dpandiar resolved at Tue, 28 Feb 2012 13:44:37 -0500 ---
This issue happened (again) because the feature to check for available disk
space is turned off by default. Currently, the user needs to set the available
space threshold (with the -z option) to turn it on and have the worker monitor
available disk space and clean up when it is reached.

Rev 1800 checks in the fix that sets the default threshold to 100MB. Marking
this resolved now.

--- dthain commented at Mon, 13 Feb 2012 08:41:59 -0500 ---
After your proposal, of course.

--- dthain reopened at Mon, 13 Feb 2012 08:41:50 -0500 ---
Looks like this problem recently happened again on compbio.cse.nd.edu with jobs
from wzhang7. Please discuss the details with Li and see what happened.

--- dpandiar resolved at Wed, 14 Sep 2011 11:18:01 -0400 ---
Rev 1451 checks in the (corrected) code for this feature. Marking this
resolved.
--- dthain commented at Mon, 15 Aug 2011 15:09:29 -0400 ---
Dinesh - The code that is currently checked in causes the worker to exit.
Would be better for it to jump to the cleanup code at the bottom which
disconnects, cleans up the disk and starts over.

--- dpandiar took bug at Fri, 03 Jun 2011 16:22:44 -0400 ---

--- dthain submitted bug at Fri, 28 Jan 2011 09:50:21 -0500 ---
We are seeing a number of cases where an aggressive user can fill up the disk
at a worker machine by sending lots and lots of data. Make the worker more
cautious at the execution site by checking for available disk space before
accepting an incoming file. If the disk has less than 10 percent or 100 MB
free (configurable), then abort and clean up the disk.

fast_abort v2.0 for hierarchical WQ

--- malbrec2 submitted bug at Tue, 12 Mar 2013 17:09:24 -0400 ---
The semantics of fast_abort are unclear when work_queue_tasks may sit in worker
or foreman queues for a variable length of time. For example, if half of the
workers are connected to a foreman and half are connected directly to the
master, there may be a (substantial) difference in the effective runtime the
master observes.

We may want to consider moving to an implementation where the master specifies
the allowed runtime when transmitting the task and allows the worker to handle
killing it, rather than having the master explicitly manage task cancelling
itself.

Hierarchical Work Queue Options

I suspect that the new options controlling the name and port for the foreman and multi-slot worker are going to be confusing for new users. Please add more verbose long options and use those in the documentation. Suggest something like --port and --name to control the incoming side, --master-port and --master-name to handle the outgoing side.
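
With getopt_long this is straightforward; a sketch of the worker/foreman option table (the long names are the suggestion above, and the short letters are placeholders):

#include <getopt.h>

static const struct option long_options[] = {
	{"port",        required_argument, 0, 'p'},  /* incoming: port this foreman listens on */
	{"name",        required_argument, 0, 'N'},  /* incoming: project name this foreman advertises */
	{"master-port", required_argument, 0, 'P'},  /* outgoing: port of the master to connect to */
	{"master-name", required_argument, 0, 'M'},  /* outgoing: project name of the master */
	{0, 0, 0, 0}
};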

Autobuild Should Update Database

Please fix the autobuild to update the database with the outcome of each revision/build, as it worked previously. I believe Ben has already updated the database schema to accommodate the long commit messages.

bash parrot bash infinite loop

--- btovar commented at Tue, 23 Apr 2013 12:58:22 -0400 ---
Not only bash, but sh, csh, and zsh as well.
--- dthain submitted bug at Tue, 26 Mar 2013 15:27:27 -0400 ---
Parrot can get into an infinite loop like this:

bash% cat test.sh
#!/bin/sh
parrot_run bash

bash% ./test.sh
bash% bash% bash% bash% ...

parrot_run -d all gives this:

2013/03/26 15:26:11.83 [14963] parrot_run: libcall: ioctl 0 0x5403 0x4fa9bab0
2013/03/26 15:26:11.83 [14963] parrot_run: local: ioctl 5 0x5403 0x4fa9bab0
2013/03/26 15:26:11.83 [14963] parrot_run: local: = -1 Interrupted system call
2013/03/26 15:26:11.83 [14963] parrot_run: libcall: = -1 Interrupted system
call
2013/03/26 15:26:11.83 [14963] parrot_run: libcall: ioctl 0 0x5403 0x4fa9bab0
2013/03/26 15:26:11.83 [14963] parrot_run: local: ioctl 5 0x5403 0x4fa9bab0
2013/03/26 15:26:11.83 [14963] parrot_run: local: = -1 Interrupted system call
2013/03/26 15:26:11.83 [14963] parrot_run: libcall: = -1 Interrupted system
call
2013/03/26 15:26:11.83 [14963] parrot_run: libcall: ioctl 0 0x5403 0x4fa9bab0
2013/03/26 15:26:11.83 [14963] parrot_run: local: ioctl 5 0x5403 0x4fa9bab0
2013/03/26 15:26:11.83 [14963] parrot_run: local: = -1 Interrupted system call
2013/03/26 15:26:11.83 [14963] parrot_run: libcall: = -1 Interrupted system
call
2013/03/26 15:26:11.83 [14963] parrot_run: libcall: ioctl 0 0x5403 0x4fa9bab0
2013/03/26 15:26:11.83 [14963] parrot_run: local: ioctl 5 0x5403 0x4fa9bab0
2013/03/26 15:26:11.83 [14963] parrot_run: local: = -1 Interrupted system call
2013/03/26 15:26:11.83 [14963] parrot_run: libcall: = -1 Interrupted system
call

Strangely, the problem does not happen with this command:
bash -c "parrot_run bash"

use ptrace flags to follow threads

Parrot currently follows child processes and threads by manually modifying the
flags of clone(). Anecdotally, there are cases where this results in some
system calls being missed in the new threads, before ptrace catches up. Newer
versions of ptrace have some flags for automatically following children
(PTRACE_SETOPTIONS). Investigate whether this mechanism is more reliable than
the current one.
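
For reference, the mechanism in question looks roughly like this (a sketch, not Parrot's code): set the options once on the traced process, and the kernel then stops every new thread or child before it executes its first instruction.

#include <sys/ptrace.h>

/* After the initial attach, ask the kernel to trace new threads/children. */
ptrace(PTRACE_SETOPTIONS, pid, 0,
       PTRACE_O_TRACECLONE |   /* threads created with clone() */
       PTRACE_O_TRACEFORK  |   /* children created with fork() */
       PTRACE_O_TRACEVFORK);   /* children created with vfork() */

/* When a clone/fork event stop is reported, fetch the new task's id: */
unsigned long new_tid;
ptrace(PTRACE_GETEVENTMSG, pid, 0, &new_tid);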

Multi Slot Worker Pauses

5 second pauses occur when the multi-slot worker is run with the example work queue application. To reproduce:

./work_queue_example *.c &
./work_queue_worker --cores 4 -d all localhost 9123

Strangely, the pauses do not occur when the worker has one core:

./work_queue_example *.c &
./work_queue_worker --cores 1 -d all localhost 9123

Storage Allocations not working with HDFS

Right now Chirp cowardly refuses to enforce storage allocations with HDFS. This is probably unnecessary and should be fixed. Or, the reason for the deficiency should be better explained in comments and in the manual.

Parrot Search Tests Fails

% ./configure
% make
% make test
...
testing ./TR_chirp_ops.sh ... ok
testing ./TR_parrot_search_chirp.sh ... fail
testing ./TR_parrot_search.sh ... fail

A brief look suggests that the tests themselves are succeeding, but somehow that result isn't being propagated back to the top level.

Multi-Slot Work Queue Default

The new multi-slot worker will, by default, use all cores available on the machine. This is likely to cause overload problems when applied by the unwitting user, or when the worker is deployed on a batch node where it has not been allocated all of the cores.

To avoid these cases, modify the new multi-slot work_queue_worker to assume it has one core by default, unless overridden by --cores N or --cores all (which uses all cores on the machine).
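
A sketch of that default (the option parsing is paraphrased, and the worker's real core-counting helper may differ from sysconf):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Inside the worker's option-parsing loop, for the --cores case: */
int cores = 1;  /* proposed default: assume a single core */
if(optarg) {
	if(!strcmp(optarg, "all")) {
		cores = (int) sysconf(_SC_NPROCESSORS_ONLN);  /* all cores on this machine */
	} else {
		cores = atoi(optarg);  /* --cores N */
	}
}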

Then, modify sge_submit_workers, so that the user can select the number of cores per batch job, which should in turn be passed to qsub and the --cores option to work_queue_worker. This will make it easy to robustly deploy M workers with N cores each.

Make command-line options consistent across CCTools executables

--- dpandiar submitted bug at Fri, 21 Sep 2012 12:47:00 -0400 ---
Currently, the command-line options that set or modify similar features are not
consistently named. For example, to set the catalog server to query/report:
work_queue_worker requires the -C option,
chirp_status requires the -c option,
and chirp_server requires the -u option.

declare --long-options

--- dthain commented at Tue, 26 Feb 2013 13:48:22 -0500 ---
Argue about this with Patrick.

--- btovar submitted bug at Thu, 14 Feb 2013 11:46:36 -0500 ---
For 4.0 it might be a good idea to switch to --long-options for the least used
options, since we are running out of letters and good mnemonics. For example,
in makeflow we use -k to check syntax. It is easier to think that -k keeps the
symbolic links (which is what -K currently does), and to use --syntax for the
check.

Along the same lines, in -Z port-file, the -Z does not give any hint of what
the option does. I would like to change it to --port-file.

parrot does not support syscall 135 (personality)

--- btovar submitted bug at Fri, 22 Mar 2013 09:28:19 -0400 ---
On opteron.crc.nd.edu:

parrot sh
sh-3.2$ echo uname -m
2013/03/22 09:27:19.83 [20341] parrot_run: notice: warning: system call 135
(personality) not supported for program /usr/bin/x86_64
sh: /bin/uname: Bad address

Makeflow Bundler Memory Leak

--- crobins9 submitted bug at Wed, 03 Apr 2013 12:03:13 -0400 ---
The bundler name translation leaks memory: filename is never freed after it is
allocated by string_format.
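
string_format allocates its result on the heap, so the fix amounts to releasing the string once the translated name has been used (a sketch; the surrounding bundler names are paraphrased):

/* string_format() returns memory allocated with malloc(). */
char *filename = string_format("%s/%s", bundle_dir, original_name);

/* ... use the translated filename ... */

free(filename);
filename = NULL;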

Black List Bad Workers in WQ

--- dthain un-assigned bug at Fri, 09 Nov 2012 10:29:56 -0500 ---

--- dthain commented at Tue, 28 Aug 2012 13:20:55 -0400 ---
Add option to make the worker BAD.

--- dthain assigned bug to jvallejo at Tue, 28 Aug 2012 13:19:23 -0400 ---

--- dthain submitted bug at Wed, 13 Jun 2012 15:08:18 -0400 ---

Track the success/failure and/or performance of workers in the Work Queue.
For those that are persistent problems, blacklist them and don't send tasks
there.

This is a bit more tricky than it may sound, because you don't want occasional
failures or a brief systematic outage to cause everything to be blacklisted.

parrot - cvmfs - atlas interaction

--- dthain commented at Mon, 01 Apr 2013 14:11:30 -0400 ---
Diagnosis:

When a process running in 32-bit mode attempts to exec() a 64-bit program, the
exec works correctly, except that the final event of exiting from the system
call is misinterpreted. The process is in 64-bit mode, but the system call
number held over from the event has the old value of SYSCALL32_execve, which
corresponds to SYSCALL64_munmap. Parrot then completes the system call
according to munmap instead of execve, and everything gets confused.

The solution will have to keep track of whether the process is completing an
exec system call, and then route the system call appropriately.
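
A sketch of that bookkeeping, with hypothetical field and helper names (the SYSCALL32_/SYSCALL64_ constants follow the naming used in the diagnosis above):

/* On syscall entry: remember that this process is inside execve. */
if(p->syscall == SYSCALL32_execve || p->syscall == SYSCALL64_execve)
	p->exec_in_progress = 1;

/* On the next syscall-exit event: route by the flag, not by the syscall
   number, which may now be interpreted in the wrong ABI. */
if(p->exec_in_progress) {
	complete_execve(p);      /* hypothetical */
	p->exec_in_progress = 0;
} else {
	complete_syscall(p);     /* hypothetical */
}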

--- dthain commented at Mon, 01 Apr 2013 14:09:33 -0400 ---

Simpler test case:

gcc -m32 test.c -o test
./parrot_run -d all ./test

test.c:

#include <unistd.h>

int main()
{
	execl("/bin/sh", "sh", "-c", "/bin/echo hello", (char *) 0);
	return 0;
}

--- dthain commented at Mon, 01 Apr 2013 13:24:51 -0400 ---
Narrowed the problem down to the 32-bit version of execve.

Works:

gcc test.c -o test.64
./parrot_run ./test.64

Doesn't work:

gcc -m32 test.c -o test.32
./parrot_run ./test.32

test.c:

#include <unistd.h>

int main()
{
	execl("/bin/sh", "sh", "-c",
	      "( svn checkout svn+ssh://svn.cern.ch/reps/atlasoff/PhysicsAnalysis/AnalysisCommon/UserAnalysis/tags/UserAnalysis-00-15-06 PhysicsAnalysis/AnalysisCommon/UserAnalysis ) 2>/tmp/tmpQwLS4m",
	      (char *) 0);
	return 0;
}

--- dthain took bug at Mon, 01 Apr 2013 12:26:35 -0400 ---

--- dthain submitted bug at Mon, 01 Apr 2013 12:26:27 -0400 ---
Stefan Kluth reports the following problem in parrot-cvmfs with atlas
software.
To reproduce, build the latest parrot with cvmfs enabled on a RHEL5 machine
like cclweb01. Then, run the following script:

parrot.atlas.sh:

#!/bin/sh

export PARROT_ALLOW_SWITCHING_CVMFS_REPOSITORIES="yes"

DEBUG="-d process"

parrot_run $DEBUG -p cache01.hep.wisc.edu:3128 -r
'*.cern.ch:pubkey=,url=http://cvmfs-stratum-one.cern.ch/opt/*;http://cernvmfs.gridpp.rl.ac.uk/opt/*;http://cvmfs.racf.bnl.gov/opt/*
atlas-nightlies.cern.ch:pubkey=,url=http://cvmfs-atlas-nightlies.cern.ch/cvmfs/atlas-nightlies.cern.ch'
./parrot.atlas2.sh

parrot.atlas2.sh:

#!/bin/sh

export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
alias setupATLAS='source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh'
source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh
mkdir -p athena/17.2.4
cd athena/17.2.4
asetup 17.2.4,here
cmt co -r UserAnalysis-00-15-06 PhysicsAnalysis/AnalysisCommon/UserAnalysis

The final command, cmt, fails with a mysterious error code.
With the process debug flag on, we learn that a particular exec call results in
an immediate segfault:

2013/04/01 11:45:32.89 [18014] parrot_run: process: execve: /bin/sh is an
ordinary executable
2013/04/01 11:45:32.89 [18014] parrot_run: process: execve: /bin/sh attempting
2013/04/01 11:45:32.89 [18014] parrot_run: process: execve: argv[0] == "sh"
2013/04/01 11:45:32.89 [18014] parrot_run: process: execve: argv[1] == "-c"
2013/04/01 11:45:32.89 [18014] parrot_run: process: execve: argv[2] == "( svn
checkout
svn+ssh://svn.cern.ch/reps/atlasoff/PhysicsAnalysis/AnalysisCommon/UserAnalysis/tags/UserAnalysis-00-15-06
PhysicsAnalysis/AnalysisCommon/UserAnalysis ) 2>/tmp/tmptyhsZ_"
2013/04/01 11:45:32.89 [8303] parrot_run: process: pid 18014 received signal 11
(Segmentation fault) (state 0)
2013/04/01 11:45:32.89 [8303] parrot_run: process: pid 18014 exited abnormally
with signal 11 (Segmentation fault)
2013/04/01 11:45:32.89 [18013] parrot_run: process: 18013 created pid 18015
2013/04/01 11:45:32.89 [8303] parrot_run: process: pid 18015 received signal 19
(Stopped (signal)) (state 1)
2013/04/01 11:45:32.89 [8303] parrot_run: process: pid 18015 received signal 18
(Continued) (state 1)

Still investigating...

Change makeflow's ppm encoding

Currently the generated ppm uses a whole integer to represent a single bit. This makes the files several orders of magnitude bigger than they need to be. We could either pack several bits into an integer or, better yet, change the color representation to triplets of chars, increasing the number of colors available while still reducing the size of the generated files.
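
For reference, a char triplet per pixel is exactly what binary PPM ("P6") stores; a self-contained sketch of such a writer (not makeflow's code):

#include <stdio.h>

/* Write width x height pixels, 3 bytes (R,G,B) each, as binary PPM. */
static void ppm_write(FILE *f, const unsigned char *rgb, int width, int height)
{
	fprintf(f, "P6\n%d %d\n255\n", width, height);
	fwrite(rgb, 3, (size_t) width * height, f);
}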

Update man pages with long options

--- crobins9 commented at Wed, 20 Mar 2013 12:47:45 -0400 ---
For options with a long and a short form but no parameter, use
OPTION_ITEM(`-s --something').

For options with a long form, a short form, and a parameter, use
OPTION_TRIPLET(-s, something, param).
--- btovar submitted bug at Tue, 19 Mar 2013 12:40:08 -0400 ---

Makeflow Environment Handling

--- dthain submitted bug at Tue, 29 Mar 2011 16:27:09 -0400 ---

Environment variables are not consistently set up across the various
implementations of Makeflow. For example, if you define SHELL in your
makeflow, you will see the right definition in the local implementation, but
not in the work queue. Look across the implementations of batch_job, and
figure out how to either pass the entire environment on every invocation, or
just pass along a few specific definitions.
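
A sketch of the "pass along a few specific definitions" approach, using cctools' string_format (the variable and field names are paraphrased): wrap the command so the chosen variables are defined at the execution site regardless of the batch system.

#include <stdlib.h>
#include "stringtools.h"

/* Prepend explicit definitions so every batch_job back-end sees them. */
char *wrapped = string_format("env SHELL='%s' %s",
                              getenv("SHELL"), node->command);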

Work Queue Name Check

Easy bug fix to learn the commit process:

In work_queue_worker, if the user specifies the same project name with both -N and -M, emit an explanatory error message and abort.
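
A sketch of the check itself, with hypothetical variable names for the parsed option values:

/* After option parsing: project_name holds -N, master_name holds -M. */
if(project_name && master_name && !strcmp(project_name, master_name)) {
	fprintf(stderr, "work_queue_worker: -N and -M both specify project "
	                "\"%s\"; use only one of them.\n", project_name);
	exit(EXIT_FAILURE);
}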

Documentation for using cluster batch system in Makeflow

--- dpandiar assigned bug to malbrec2 at Tue, 04 Dec 2012 17:57:24 -0500 ---
Mike has already taken up fixing the documentation on the use of this option.
--- dpandiar submitted bug at Tue, 04 Dec 2012 17:52:52 -0500 ---
We don't have documentation in the user manual for using the "cluster"
back-end for job execution in Makeflow (that is, makeflow -T cluster).

Dynamic Makeflows

--- dthain commented at Tue, 29 Jan 2013 13:29:20 -0500 ---
Wait until refactoring complete.

--- dthain un-assigned bug at Tue, 18 Sep 2012 12:07:44 -0400 ---

--- dthain assigned bug to jfetsch at Tue, 28 Aug 2012 13:15:58 -0400 ---

--- dthain submitted bug at Wed, 13 Jun 2012 15:24:03 -0400 ---

Modify Makeflow to periodically examine the input file to see if the DAG has
gotten larger. If so, add the new elements to the DAG and keep going.
Consider some boundary cases, such as what happens if the added rule is
invalid, or has missing dependencies.

Discuss this feature with Andrey Tovchigrechko [email protected] to get a
better sense of what he is trying to accomplish.

Work Queue needs specify directory option

--- dthain assigned bug to malbrec2 at Tue, 26 Feb 2013 13:50:34 -0500 ---

Add a new API entry for specify_directory with an option for recursiveness.
Deprecate specify_file for directories.
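
One plausible signature for the new entry, mirroring the existing specify_file call (hypothetical until implemented):

/* Hypothetical API entry, modeled on work_queue_task_specify_file. */
int work_queue_task_specify_directory(struct work_queue_task *t,
                                      const char *local_name,
                                      const char *remote_name,
                                      int type,       /* WORK_QUEUE_INPUT or WORK_QUEUE_OUTPUT */
                                      int flags,
                                      int recursive); /* 0 = create the directory only */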

--- dthain commented at Fri, 15 Feb 2013 11:52:43 -0500 ---
Let's talk about this in detail on Tuesday, since it mixes up some issues in
both WQ and Makeflow.

--- malbrec2 submitted bug at Thu, 14 Feb 2013 15:39:36 -0500 ---
One of the problems Lauren and I ran into when diagnosing her weaver/makeflow
issue was that there is no method of specifying just a directory hierarchy as
an input requirement. Instead, if a directory is specified then Work Queue
assumes you want both the directory and any children.

This was a problem because the way weaver constructed the makeflow, it was
specifying the task's output file somewhere within a hierarchy that existed on
the master but was not being created on the worker. The worker's task didn't
know it had to create the hierarchy, so when it tried to write its output it
would fail with a "permission denied" error.

Example:
out_dir/out1: /bin/cat in_dir/in1
/bin/cat in_dir/in1 > out_dir/out1

would fail because out_dir is never created on the worker, even if it already
exists on the master. Adding out_dir as an input requirement won't work,
because that would also transfer anything inside out_dir (which is particularly
bad if that happens to be out_dir/1TB_FILE)

So we need a way of telling the work_queue that a directory hierarchy is
necessary. After discussing the problem with Dinesh & Li, we came up with a
few options:

  1. Add "environment" input options. This could also include environment
    variables (and maybe common input files?)
  2. Allow the worker to specify "directory only" through a flag or special
    character.
  3. Change our semantics so specifying a directory only ever creates that
    directory. The user would then have to specify each file in the directory that
    they needed. Optionally add another API command to parse a directory and add
    each file in it to the task. This, however, would likely break some existing
    user code and thus isn't optimal.
  4. Deprecate the current API commands and replace them with new ones that
    behave as described in (3). Have the deprecated commands maintain current
    behavior. This is what we did with "work_queue_specify_input_file" and
    friends.
  5. Hack around the problem by having work queue examine the output filenames
    and compare any directory hierarchy there to what exists on the master. If the
    directories referenced in the output already exist on the master, create them on
    the worker. So in the example, if out_dir existed on the master, it would be
    created on the worker. If it didn't already exist on the master, then work
    queue would assume the task creates it.

Work Queue: Separate Cache and Job Directories

If the user has files which are plain numbers, their names in the cache directory conflict with the directories created per task.

To avoid these conflicts, separate the worker namespace into $ROOT/cache/$FILE and $ROOT/task/$TASK/$FILE
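
A sketch of the path construction using cctools' string_format (variable names hypothetical): once cached files and per-task sandboxes live in disjoint subtrees, a cached file named "2" can no longer collide with task 2's directory.

#include "stringtools.h"

char *cache_path = string_format("%s/cache/%s", workspace, filename);
char *task_path  = string_format("%s/task/%d/%s", workspace, taskid, filename);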

For example, this property causes the TR_allpairs_composite.sh test to fail.
The worker log looks like this:

2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: --> put A.2 2 777 2 0
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: Putting file A.2 into workspace
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: --> put B.2 2 777 2 0
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: Putting file B.2 into workspace
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: --> put 3 2 0644 2 1
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: Putting file 3 into workspace
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: --> task 2
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: --> cmd 45
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: --> ./allpairs_multicore -e "" A B ./divisible.sh
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: --> cores 1
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: --> memory 0
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: --> disk 0
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: --> infile divisible.sh divisible.sh 1
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: --> infile allpairs_multicore allpairs_multicore 1
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: --> infile A.2 A 0
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: --> infile B.2 B 0
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: --> infile 3 3 1
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: --> infile 2 2 1
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: --> end
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: linking file divisible.sh into workspace ./2 with ./2/divisible.sh
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: Could not create directory - ./2 (Not a directory)
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: linking file allpairs_multicore into workspace ./2 with ./2/allpairs_multicore
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: Could not create directory - ./2 (Not a directory)
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: linking file A.2 into workspace ./2 with ./2/A
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: Could not create directory - ./2 (Not a directory)
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: linking file B.2 into workspace ./2 with ./2/B
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: Could not create directory - ./2 (Not a directory)
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: linking file 3 into workspace ./2 with ./2/3
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: Could not create directory - ./2 (Not a directory)
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: linking file 2 into workspace ./2 with ./2/2
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: Could not create directory - ./2 (Not a directory)
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: started process 19731: ./allpairs_multicore -e "" A B ./divisible.sh
2013/05/31 16:26:27.60 [19727] work_queue_worker: wq: --> kill 1

parrot and sshd

--- pdonnel3 commented at Mon, 25 Mar 2013 12:11:31 -0400 ---
After making a public/private key pair, I was able to reproduce this bug
using:

/usr/bin/ssh -o PreferredAuthentications=publickey -o
UserKnownHostsFile=/dev/null -i
/afs/nd.edu/user31/pdonnel3/cctools/git/cctools/foo/id_rsa -v -o
ProxyCommand="/afs/nd.edu/user31/pdonnel3/cctools/git/cctools/parrot/src/parrot_run
/usr/sbin/sshd -e -i -h
/afs/nd.edu/user31/pdonnel3/cctools/git/cctools/foo/id_rsa -o UsePAM=no -o
PidFile=/dev/null -o
AuthorizedKeysFile=/afs/nd.edu/user31/pdonnel3/cctools/git/cctools/foo/id_rsa.pub
-f /dev/null" localhost

I get this output:

Accepted publickey for pdonnel3 from UNKNOWN port 65535 ssh2
debug1: Authentication succeeded (publickey).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
Attempt to write login records by non-root user (aborting)
ioctl(TIOCSCTTY): Operation not permitted
debug1: channel 0: free: client-session, nchannels 1
Killed by signal 2.

The thread where this bug is introduced is here [1]. The lcg-ls bug [2] is
unrelated.

[1] https://listserv.nd.edu/cgi-bin/wa?A2=ind1202&L=CCTOOLS&F=&S=&P=3620
[2] https://listserv.nd.edu/cgi-bin/wa?A2=ind1302&L=CCTOOLS&F=&S=&P=4034
--- dthain commented at Fri, 22 Mar 2013 15:36:55 -0400 ---
Do a quick check to see if this is fixed by signal handling.

--- dthain assigned bug to pdonnel3 at Tue, 26 Feb 2013 13:53:25 -0500 ---

--- dthain changed system to cctools at Thu, 24 Jan 2013 12:31:22 -0500 ---

--- dthain submitted bug at Thu, 05 Jul 2012 14:24:57 -0400 ---
Dan Bradley reported a problem when using sshd inside of Parrot, which actually
does happen when using the condor_ssh_to_job feature.

It can be reproduced as follows:

ssh -i /path/to/id_rsa -v -o ProxyCommand="/path/to/parrot_run /usr/sbin/sshd
-e -i -h /path/to/id_rsa -o UsePAM=no -o PidFile=/dev/null -o
AuthorizedKeysFile=/path/to/id_rsa.pub" localhost


parrot_run hangs on Linux 3.2

--- dthain changed severity to low at Wed, 27 Feb 2013 08:34:12 -0500 ---

--- dthain un-took bug at Wed, 23 Jan 2013 15:35:27 -0500 ---

--- dthain took bug at Wed, 27 Jun 2012 16:15:26 -0400 ---
Just fixed some problems relating to Linux 3.2 on Debian/Ubuntu in r2155. Can
you check to see if the problem is still there?

--- pbui commented at Thu, 01 Mar 2012 14:59:03 -0500 ---
Arch Linux (my laptop). I can't repeat on RHEL6. I'll see if Debian testing
has the problem in a bit.
--- dthain commented at Thu, 01 Mar 2012 14:55:39 -0500 ---
What distribution?

--- pbui submitted bug at Thu, 01 Mar 2012 14:36:23 -0500 ---
Running the latest parrot_run on a Linux 3.2.8 kernel hangs on what appears to
be lseek.

$ parrot_run -d all stat /etc/hosts
...
2012/03/01 14:34:23.96 [22230] parrot_run: libcall: lseek 3 -1 0
2012/03/01 14:34:23.96 [22230] parrot_run: libcall: = -1 Interrupted system
call
2012/03/01 14:34:23.96 [22230] parrot_run: libcall: lseek 3 -1 0
2012/03/01 14:34:23.96 [22230] parrot_run: libcall: = -1 Interrupted system
call
2012/03/01 14:34:23.96 [22230] parrot_run: libcall: lseek 3 -1 0
2012/03/01 14:34:23.96 [22230] parrot_run: libcall: = -1 Interrupted system
call
2012/03/01 14:34:23.96 [22230] parrot_run: libcall: lseek 3 -1 0
2012/03/01 14:34:23.96 [22230] parrot_run: libcall: = -1 Interrupted system
call
2012/03/01 14:34:23.96 [22230] parrot_run: libcall: lseek 3 -1 0

resource_monitor issues with interactive applications

--- btovar took bug at Fri, 26 Apr 2013 15:32:21 -0400 ---
On SIGTTIN the monitor now terminates the application. In the future a flag
will be added to allow interactive applications, with a warning about weird
interactions with exit() and signals.
--- btovar submitted bug at Thu, 25 Apr 2013 17:27:38 -0400 ---
doing: resource_monitor bash

bash is stopped by SIGTTIN, which means it is waiting for input in the
background.

Resource Control Interface

--- btovar commented at Wed, 06 Mar 2013 08:34:40 -0500 ---
Each task in the workflow will have a minimum and maximum rank, according to
where it appears in the dag. If the ranks of two tasks overlap, there is a
schedule in which the two tasks can run concurrently. If the minimum rank of
one task is bigger than the maximum rank of another, then the latter task
must always run before the former.

Once we have the tasks divided into ranks, it will be easier for makeflow to
report the range of resources, using the profile statistics from the logs
generated by resource_monitor.
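
A minimal sketch of the rank computation (node fields hypothetical): with the nodes in topological order, the minimum rank is the longest path from any source, and the maximum rank follows from the same relaxation over the reversed edges.

/* nodes[] is assumed to be in topological order; min_rank starts at 0. */
for(int i = 0; i < n; i++) {
	for(int j = 0; j < nodes[i].nchildren; j++) {
		struct dag_node *c = nodes[i].children[j];
		if(c->min_rank < nodes[i].min_rank + 1)
			c->min_rank = nodes[i].min_rank + 1;
	}
}
/* max_rank: run the same pass over the reversed DAG, measured from the
   sinks, to get the latest rank at which each task can run. */
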
--- btovar took bug at Tue, 29 Jan 2013 09:06:16 -0500 ---

--- dthain changed type to feature at Thu, 30 Aug 2012 13:41:19 -0400 ---

--- dthain submitted bug at Wed, 13 Jun 2012 15:02:35 -0400 ---
Develop a common method for a program to manage the resources of a (possibly
parallel) subprogram. It should be possible to ask the subprogram how many
resources it could use, and also to tell it how many resources it may use. The
mechanism should support hierarchy cleanly.

For example, it should be possible to ask a makeflow how many cores it needs.
The makeflow would examine the dag and report back that it could use between 1
and 38 cores. The parent looks at the possibilities, then decides to assign 10
cores to the makeflow, which then lives within that limit.

chirp_server crashes when used with the -Q option

I have been doing some work with Chirp (I am one of Chris Moretti's students at Princeton), and I found that when Chirp is started with the -Q option ("enforce root quota"), it crashes. The problem appears to be on line 601 of chirp_server.c: chirp_root_path is uninitialized when passed to chirp_alloc_init. I have been able to fix this by adding chirp_root_path = cfs->init(chirp_root_url); just before that call. Hopefully this is a proper solution; if not, what should I do?

parrot segfaults with cvmfs wildcards

--- btovar commented at Tue, 14 May 2013 10:53:10 -0400 ---
Related segfault:

cvmfs: repository name does not match (found atlas-nightlies.cern.ch, expected
atlas.cern.ch)
cvmfs: Failed to initialize root file catalog
bash: segmentation fault

--- btovar took bug at Tue, 14 May 2013 10:46:27 -0400 ---

--- btovar submitted bug at Tue, 14 May 2013 10:46:20 -0400 ---
If a repository name matches a cvmfs repository wildcard given at the command
line, but there is no such repository, parrot segfaults.

Example with: -r '*.cern.ch ....

ls /cvmfs/at.net
ls: /cvmfs/at.net: No such file or directory

ls /cvmfs/nothing.cern.ch
bash: segmentation fault
