vatlab / sos
SoS workflow system for daily data analysis
Home Page: http://vatlab.github.io/sos-docs
License: BSD 3-Clause "New" or "Revised" License
It is very useful to save the input and output of a step to properly named variables so that later steps can refer to them by name. There are three possible implementations.
Option 1:

[step_index: input_alias=name, output_alias=name]

Pros:
Cons:

Option 2:

[step_index]
name = step_input
name1 = step_output
stuff = step_output[0]
other = step_output[1:]

Pros: later steps can use step_input or step_output directly; for example, if there are two types of output files, they can be assigned to different names.
Cons: exposes step_index and step_output, which are more or less internal.

Option 3:

input: alias=name
output: alias=name

Pros: hides step_input.
Cons: step_input and name should be filtered even for group_by input files.

When I use env.logger
from pysos.utils, the logging messages always appear twice. I remember I must have run into this before but I cannot recall the way out. I had to add if not self._logger.handlers to get rid of the duplicates, based on a random stackoverflow.com post. I am not sure if the problem occurs in the current SoS -- since I only used the RuntimeEnvironment class in my application, something might be different, and the patch below is obviously not good because there are two types of handlers, cout and ch, and I am not dealing with them properly. Just want to know if this is an issue before taking any other actions.
- cout = logging.StreamHandler()
- levels = {
- '0': logging.WARNING,
- '1': logging.INFO,
- '2': logging.DEBUG,
- '3': logging.TRACE,
- None: logging.INFO
- }
- #
- cout.setLevel(levels[self._verbosity])
- cout.setFormatter(ColoredFormatter('%(color_levelname)s: %(color_msg)s'))
- self._logger.addHandler(cout)
+ if not self._logger.handlers:
+ cout = logging.StreamHandler()
+ levels = {
+ '0': logging.WARNING,
+ '1': logging.INFO,
+ '2': logging.DEBUG,
+ '3': logging.TRACE,
+ None: logging.INFO
+ }
+ #
+ cout.setLevel(levels[self._verbosity])
+ cout.setFormatter(ColoredFormatter('%(color_levelname)s: %(color_msg)s'))
+ self._logger.addHandler(cout)
The action name check_output conflicts with subprocess.check_output. A better name is needed.
Software version matching is important for pipeline reproducibility. For single-executable software it is the user's responsibility to provide and install it, so what SoS can do is perhaps record the version somewhere, e.g., in sos_session_info.txt. But SoS automatically installs R libraries, which may cause version problems. A simple way out would be to use the syntax package_host/package_name/version for R libraries, for example cran/ggplot2/0.99, bioc/ggbio/xxx, and even username@github/package for GitHub R packages (for GitHub perhaps there are no versions but commits/tags instead?).
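A minimal sketch of how such a specification could be parsed (parse_r_package is a hypothetical helper, not part of SoS):

def parse_r_package(spec):
    # 'cran/ggplot2/0.99' -> ('cran', 'ggplot2', '0.99')
    # 'bioc/ggbio'        -> ('bioc', 'ggbio', None)
    parts = spec.split('/')
    host, name = parts[0], parts[1]
    version = parts[2] if len(parts) > 2 else None
    if '@' in host:
        # 'username@github/package' -> version is a commit or tag, if any
        user, host = host.split('@')
        name = '{}/{}'.format(user, name)
    return host, name, version

print(parse_r_package('cran/ggplot2/0.99'))        # ('cran', 'ggplot2', '0.99')
print(parse_r_package('username@github/package'))  # ('github', 'username/package', None)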
Not sure what is involved/required to support Docker.
SoS is now registered on PyPI and can be installed with the command pip3 install sos.
SoS needs psutil for proper cleanup when the process is killed (keyboard interrupt). I added the following lines to setup.py
install_requires=[
'psutil',
],
so that psutil can be automatically installed with sos:
$ pip3 install sos
Collecting sos
Downloading sos-0.5.0.tar.gz (42kB)
100% |████████████████████████████████| 51kB 2.1MB/s
Collecting psutil (from sos)
Using cached psutil-4.1.0.tar.gz
Installing collected packages: psutil, sos
Running setup.py install for psutil ... done
Running setup.py install for sos ... done
Successfully installed psutil-4.1.0 sos-0.5.0
although
python3 setup.py install
generates a warning. This ticket will remain open for all installation issues.
I looked at the wiki but I did not see a benchmarking feature for SoS steps. Perhaps it would be something good to have?
After updating to the current SoS my previous code does not work:
def get_md5(value):
import sys, hashlib
base, ext = value.rsplit('.', 1)
res = hashlib.md5(base.encode('utf-8')).hexdigest() if sys.version_info[0] == 3 else hashlib.md5(base).hexdigest()
return '{}.{}'.format(res, ext)
[simulate_1: alias = 'simulate']
n = [1000]
true_mean = [0, 1]
output_suffix = 'rds'
input: for_each = ['n', 'true_mean']
output: pattern = 'exec=rnorm.R::n=${_n}::true_mean=${_true_mean}.${output_suffix}'
_output = [get_md5(x) for x in _output]
run: ...
The error message is:
ERROR: Failed to execute subworkflow: Failed to assign [get_md5(x) for x in _output]
to variable _output: name 'get_md5' is not defined
What is the proper way to use customized Python functions with current SoS?
Nested workflows might have multiple [parameters] steps. Right now, when option -h is passed to an ArgumentParser, it prints the help message for the first [parameters] step and quits. It is therefore difficult to print options from all [parameters] steps. A solution is to hijack -h, turn on a help mode, and process all [parameters] steps, but there is no need to worry about this corner case for now.
Snakemake has some magic that allows input to be accessed both as a list and as an object with attributes. For example, it allows

input:
    reference='seq.fasta',
    bamfiles=['a.bam', 'b.bam']

and expressions such as input.reference, input.bamfiles, and input. This can be easily achieved with some Python magic, but I am not sure if I should follow it and make SoS more difficult to understand.
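For illustration, a minimal sketch of the kind of magic involved (InputFiles is hypothetical, neither SoS nor Snakemake code):

class InputFiles(list):
    # a list of all input files that also exposes named groups as attributes
    def __init__(self, *args, **kwargs):
        files = list(args)
        for name, value in kwargs.items():
            group = [value] if isinstance(value, str) else list(value)
            setattr(self, name, group if len(group) > 1 else group[0])
            files.extend(group)
        super().__init__(files)

inp = InputFiles(reference='seq.fasta', bamfiles=['a.bam', 'b.bam'])
print(list(inp))       # ['seq.fasta', 'a.bam', 'b.bam']
print(inp.reference)   # seq.fasta
print(inp.bamfiles)    # ['a.bam', 'b.bam']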
Pros:
Cons: unclear interaction with group_by.

The alternative is to define

reference='seq.fasta'
bamfiles=['a.bam', 'b.bam']
input: reference, bamfiles

and use input, reference, and bamfiles separately. Here reference more likely belongs to depends.
For simplicity, and mostly for the last reason, I am inclined not to use a complicated data structure here.
VPT has some support for running on a cluster; I am wondering if it is mature enough to be ported to SoS?
In the present VPT implementation, actions take care of runtime signatures, and a step collects output files from one or more actions. When an action needs to be executed multiple times with different outputs (the group_by option), runtime signatures are saved for each execution.
SoS currently adopts step-wise output specification, so it does not care how many times an action is executed or how many actions are executed, and thus loses fine control of action-level runtime signatures.
Solution: SoS takes a multi-signature approach. If output is defined for each input, each repeat of the action will have its own signature.
Limiting or monitoring the RAM and CPU (cores) the step uses?
In utils.py the logger's lowest verbosity level is '0', which still displays warning messages. I am not sure all warning messages in SoS are worth keeping, but in my interaction with the library I sometimes want to hide warning messages. I am thinking of adding {'-1': logging.ERROR} so that verbosity -1 will do the trick. Does this make sense, or is there another way out? I can send a pull request that also changes the sos interface, if it makes sense ...
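The change would be a one-line addition to the levels mapping in utils.py (a sketch based on the handler code shown above):

import logging

levels = {
    '-1': logging.ERROR,   # proposed: hide warning messages
    '0': logging.WARNING,
    '1': logging.INFO,
    '2': logging.DEBUG,    # SoS also maps '3' to its custom TRACE level
    None: logging.INFO,
}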
I have this script:
input_files = ('1.txt', '2.txt', '3.txt', '4.txt', '5.txt', '6.txt', '7.txt', '8.txt')
[1]
for item in input_files:
    run('touch ${item}')
[2]
output_suffix = 'out'
input: input_files, group_by = 'pairs', pattern = '{base}.{ext}'
output: pattern = '{_base}.${output_suffix}'
run('''
echo ${_input}
echo ${_output}
''')
and I got this error message:
ERROR: Failed to process directive input: 'NoneType' object is not iterable
Please pull and reinstall sos for a minor bug fix before trying to reproduce the error. Thank you!
Right now all SoS steps are executed in separate processes, and the only return value is the step info specified by the alias option. There are two problems:
Proposal:
Add a shared step option to specify which variables a step will use (if available) and return. Steps that share the same variables will form a dependency relationship.
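A sketch of how the proposed option might look in a script (the exact syntax is tentative):

[10: shared='seed']
seed = 42

[20]
# step 20 depends on step 10 because it uses the shared variable seed
run('simulate --seed ${seed}')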
We try to handle ~ in filenames automatically, but it is hard to ensure that os.path.expanduser is called whenever a path is needed. Because we will also perform magic operations such as temporary() and dynamic() on filenames, we should encapsulate filenames in a dedicated class.
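A minimal sketch of such a class (FileTarget is a hypothetical name):

import os

class FileTarget(str):
    # a string subclass that always stores the user-expanded path;
    # flags such as temporary() and dynamic() could be attached here
    def __new__(cls, path):
        return super().__new__(cls, os.path.expanduser(path))

print(FileTarget('~/data/a.txt'))   # /home/<user>/data/a.txt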
There are some options to allow concurrent execution of step actions:
- A nonconcurrent section option, because the actions should be safe to run concurrently by default.
- for_each as concurrent and nc_for_each for a nonconcurrent for each. Perhaps for_all as nonconcurrent? The problem is that group_by would also generate nc_group_by.
(Decided to use the section option nonconcurrent.)
There is a need to save the versions of programs or R packages ...
There is a need to modify _output, and a proposal to implement something like a filter=func option, because the outcome of pattern needs to be post-processed. However, because pattern is simple to simulate for this special case, perhaps

[func('${x} ${y}') for x, y in zip(_A, _B)]

is enough for

pattern='{_A} {_B}', filter=func

so we do not need a separate output option.
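A runnable illustration of the comprehension-based alternative (func, _A, and _B are placeholders):

def func(name):
    # stand-in for whatever post-processing is needed, e.g. get_md5 above
    return name.replace(' ', '_') + '.out'

_A = ['x1', 'x2']
_B = ['y1', 'y2']
print([func('{} {}'.format(a, b)) for a, b in zip(_A, _B)])
# ['x1_y1.out', 'x2_y2.out']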
The real trouble behind changing _output directly is that we would have to monitor each statement for changes of _output, and even then a change of _output might not affect _step.output, because statements after process might be executed in a separate process. Even worse, we might evaluate statements multiple times (e.g. dryrun before run), so _output = process(_output) can fail.
If the output of a step is dynamically determined, for example by running glob.glob on the output directory, the output might be empty or wrong at the planning stage. For example,
[100]
output: '*.bam'
[200]
run('samtools sort')
Step 200 might get an empty list at the planning stage, be treated as a leaf/starting step with no input, and cause an error.
Solution:
- A dynamic section option to tell SoS that the output of step 100 is dynamic?
- A dynamic property attached to related variables.
- A dynamic input, output, and depends option.

A dynamic option to the input
and output
directives seems to be best, but dynamic('*.txt') is also acceptable. The problem with the latter is that dynamic(func(args)) looks a bit strange: essentially, dynamic delays the evaluation of a string ('*.txt'), so the form dynamic(func(args)) can be a bit confusing.
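A minimal sketch of the delayed-evaluation idea behind dynamic() (the implementation is assumed):

import glob

class dynamic:
    # wraps a pattern so that expansion happens at execution time
    # rather than at the planning stage
    def __init__(self, pattern):
        self.pattern = pattern

    def resolve(self):
        return sorted(glob.glob(self.pattern))

out = dynamic('*.bam')
# ... planning happens here, out.pattern is still unexpanded ...
files = out.resolve()   # expanded only when the step actually runs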
It would be nice if we could isolate user groups to allow, for example, write permission only to the working directory and specified directories. This is difficult to achieve, so this thread is created for the purpose of gathering ideas.
CWL has the requirement keyword, which handles all sorts of requirements. SoS solves this with:
- the depends: item in a pipeline step, which lists all dependent files that will go into the step signature.
- the fail_if() action, which stops the workflow if the requirements are not satisfied.

The regular for loop is allowed in a step action, but it is difficult to run the iterations in parallel because a step action can be very complicated. The for_each parameter is designed to repeat the action with different values in parallel.
That is to say
input: for_each='method'
run('command1 ${_method}')
is more or less equivalent to
for _method in method:
    run('command1 {}'.format(_method))
String interpolation cannot be used because SoS does not understand the loop.
I am not sure the current design is intuitive, though. It requires the variables to be defined beforehand and uses a derived variable (_name for a variable name), and the usual for_each loop does not fit into the existing parameter structure.
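For illustration, a minimal sketch of the expansion for_each performs (expand_for_each is a hypothetical helper):

import itertools

def expand_for_each(variables):
    # nested loop over all listed variables, exposing each value as _<name>
    names = list(variables)
    for combo in itertools.product(*(variables[n] for n in names)):
        yield dict(zip(('_' + n for n in names), combo))

for env in expand_for_each({'method': ['m1', 'm2']}):
    print(env)   # {'_method': 'm1'}, then {'_method': 'm2'}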
Another concern is that we are using input
for both regular and looped cases. It might
make sense to use _input
for the latter.
We would like to encourage users to run a script in dryrun mode to check the integrity of the script and the running environment (availability of commands, etc.) before actually running it. Currently the dryrun mode is invoked as
sos run -d script [workflow] [options]
We use -d instead of --dryrun because SoS allows arbitrary command line arguments to be defined by a SoS script, and using --dryrun might cause a conflict. Since -d is not commonly used for dryrun mode, perhaps a dedicated dryrun subcommand is better?
sos dryrun script [workflow] [option]
We can even define both, in case neither method is clearly better than the other.
A SoS script can be easier to understand if we make most SoS variables readonly. That is to say, a SoS variable cannot be changed after it has been initialized. Exceptions to this rule can be system variables and temporary looping variables. It is also possible to use _name for all non-readonly variables.
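For illustration, a minimal sketch of how such a readonly namespace could be enforced (ReadOnlyNamespace is hypothetical):

class ReadOnlyNamespace(dict):
    # rejects re-assignment, except for _-prefixed (system/looping) variables
    def __setitem__(self, key, value):
        if key in self and not key.startswith('_'):
            raise RuntimeError('SoS variable {} is readonly'.format(key))
        dict.__setitem__(self, key, value)

ns = ReadOnlyNamespace()
ns['bam_files'] = ['a.bam']
ns['_index'] = 0
ns['_index'] = 1             # allowed: looping variable
ns['bam_files'] = ['b.bam']  # raises RuntimeError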
Pros:
Cons: some variables do need to be changed, e.g. bam_files in the option skip example.

It might be worth compiling a table of terms in the SoS wiki that briefly explains jargon such as "step variables" and "step derivatives", after we are somewhat more settled later.
Our step currently has the format
input:
output:
depends:
runtime:
with assignments allowed before/between these keywords, and any statement after the last keyword considered part of the step process. This is clear enough in most cases, but it might be clearer to have a separate process keyword that begins the step process. For example, a problem an advanced user might encounter is that

input:
_output = some_user_function()

will not work, because the assignment to _output is part of the step process, so changing _output will not affect the step output (unless another keyword is added after the statement). It would be clearer to write this particular example as
input:
_output = some_user_function()
process:
start of process
SoS can certainly define a step with

run('sos-runner anotherworkflow ${input}')

to execute another workflow, but it would be good to be able to define a workflow from existing ones, or to maintain a library of workflows.
I can see it might be helpful to allow the definition of functions or classes in [parameters]. For example,
. For example,
[parameters]
def get_variable_names(val):
    ...
    return val_names
Then
[1]
input: get_variable_names(${val})
This may handle something more complicated than what a lambda function can do.
I'm guessing I should post some of my problems writing SoS scripts and, after we fix them, add them to the Examples page.
Here is one:
1.sos.txt
In the attached mock pipeline I try to automatically generate files based on the files generated by the previous step. It is a test to match file names. I cannot get this pipeline to work because output: does not loop over input parameters; rather, it treats the input parameter list as a string. What is the proper syntax for this use case?
I have a SoS script like this
[a_1]
[a_2]
[b_1]
[b_2]
[A_1=a_1+b_1]
[A_2=a_1+b_2]
[A_3=a_2+b_1]
[A_4=a_2+b_2]
I'm hoping to run workflow A so that all 4 workflows will be executed. However, my hope was that, for example, b_1 would automatically take the output of a_1 as its input. That does not seem to be the case in the current implementation. The INFO messages show empty input, and I am guessing this is because each element in the combined workflow still looks for its upstream step by index, rather than by its new position in the combined workflow. Is that the case, and how do I achieve my goal?
The use of special syntax in SoS might be troublesome. For users who would prefer an authentic Python interpreter, we might be able to provide something like

from pysos import no_raw_tripple_string, no_string_interpolation

to disable these SoS-added features.
Libraries would be Python modules with defined SoS actions, but how to maintain and import these modules requires further investigation. Furthermore, extensive use of libraries somewhat defeats the purpose of SoS (readability), because libraries hide the details of actions.
Limiting the files or directories that a step action can write to?
The differences between _step.input, _input, _output, _step.output, loop_var, and _loop_var can be confusing. I am wondering if there are ways to completely remove loops from steps. For example, we could move the loop to the section option level (but that place is crowded as well).
At the very least, we need a good figure to explain these variables well.
Does it make sense to enforce some naming convention so that users immediately know the type and nature of variables? This might make the script a bit more readable. For example:
- uppercase names for constants (e.g. RESOURCE_DIR='/path/to/resource')
- a leading underscore for loop variables (_loop_variable, _label, _input; right now input is used)
I would not introduce $ and @ symbols, though.
By 'enforce', I mean SoS can give a warning or even an error if a variable's usage does not match its naming convention. I tend to think we should let users use the style they prefer.
These can be determined automatically if input and output files are specified, right? These options are therefore not needed. In addition, adding the option terminal does not prevent other steps from depending on the output files of a step, leading to potential errors.
Decided to use no special option and rely on the input and output specifications of steps.
I am wondering if we can allow other types of dependencies, such as:
- defined('variable')
- hascommand('tophat')
Existing files can be kept or converted to exist('file'). The problem here is that we need to differentiate between conditions that will never be met (e.g. a missing command) and those that can be met at runtime (existence of a file).
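Minimal sketches of such checks (the names follow the proposal; the implementations are assumptions):

import shutil

def hascommand(cmd):
    # True if the command can be found on PATH
    return shutil.which(cmd) is not None

def defined(name):
    # True if the variable exists in the current namespace
    # (SoS would check its own namespace instead of globals())
    return name in globals()

print(hascommand('tophat'))
print(defined('input_files'))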
Right now, at least in test scripts, step processes are executed in the SoS local namespace and can, for example, run

executed = []
[10]
executed.append(_step.name)

This is OK for now, but it will significantly complicate the implementation of parallel execution of scripts, because step processes would then all rely on the same central global environment, causing a lot of inter-process trouble. It would make sense to copy runtime variables to step processes and let each process execute independently. That is to say, changes to the local namespace will not affect SoS.
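A minimal sketch of the copy-then-execute idea (execute_step is hypothetical):

import copy

def execute_step(statements, sos_vars):
    # run the step on a copy so its assignments cannot leak back into
    # the central SoS namespace; SoS decides what (if anything) to merge
    local_vars = copy.deepcopy(sos_vars)
    exec(statements, local_vars)
    return local_vars

ns = {'executed': []}
execute_step("executed.append('step_10')", ns)
print(ns['executed'])   # [] -- the original namespace is unchanged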
Changes that have been made: the process keyword now replaces runtime; basically, runtime options are now process runtime options. That is to say, the above example will continue to work because everything is still in the SoS namespace.
Would it be clearer to require explicit input files? Right now, a step's _step.input is the output of the previous step. The problem with having no default input is that step aliases would almost always be required, which makes the script a bit cumbersome.
Pros and cons of using default input:
pro:
input100
, but input: output200
is notcons:
This is a thread devoted to the parallel execution feature of SoS. Basically, we need to write or find an existing implementation of a DAG. Adage (https://pypi.python.org/pypi/adage/0.1.7.1) looks like a close match, but we will need to investigate this package further.
Celery is something we can use. It would allow us to expand to the server environment, because it would become the user's responsibility to set up the Celery environment for SoS to run across nodes.
Python 3.6 introduced formatted strings, which use {}. They also accept expressions (with the addition of the conversion ! and format specification :), so I am very tempted to declare that all SoS string literals are Python format strings. However, the use of {} forces the doubling of every literal { and }, which is disastrous for scripts such as shell and R.
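For example (using str.format, which follows the same brace rules), embedding even a one-line R function already requires doubling:

# every literal brace in the embedded R code must be doubled
r_code = 'f <- function(x) {{ mean(x) + {offset} }}'.format(offset=1)
print(r_code)   # f <- function(x) { mean(x) + 1 }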
I guess I will have to specify that SoS strings are different, with configurable sigils for different languages.
Current SoS steps look more or less like this
[*_31]
# get the number of mapped reads
input: sortbam.output, group_by='single'
output: '${_input}.bai', '${_input}.flagstat'
process: concurrent=False
run('''
samtools index ${_input}
samtools flagstat ${_input} > ${_input}.flagstat
''')
and I am wondering if we can write it as
[*_31]
# get the number of mapped reads
input: sortbam.output, group_by='single'
output: '${_input}.bai', '${_input}.flagstat'
run: concurrent=False
    samtools index ${_input}
    samtools flagstat ${_input} > ${_input}.flagstat
where run etc. can be any SoS action that accepts a single string input, and we can add multiple sections if needed (e.g. run:, python:).
Pros: [XX] etc. in script.
Cons: the semantics that process will start a new process, and that run() etc. can be inline (executed with SoS) or external (a separate process), need to be changed. We can keep the run() form inline, though.

There can be a need for dependency rules. For example, if a bam index file is needed (depended upon) and
the bam file exists, then samtools index would be called automatically. This does not sound like a good idea, because samtools index can always be put in as a regular part of the workflow. On the other hand, such rules can help if these common steps are not always needed, or are needed by multiple steps of the pipeline.
I am not sure how useful such magic is. Implementation-wise it might not be too difficult; syntax like the following could be used
# section name does not have index so it would be called if and only if the
# step is needed. pattern of output is defined as section option.
[index_bam: output='*.bam.bai']
# input is defined from output
input = output[0][:-4]
# the action is defined as usual, but output does not need to be defined
# again.
run('samtools index ${input}')
This could potentially work: it allows gnumake-style definition of pipelines that can be used to construct complete workflows. It seems to contradict the design of SoS, though.
Decision: use auxiliary steps.
In cases where output needs to be composed from input, there can be a need to process input or _input before the step output is specified.
Right now a SoS step can have:
1. assignments
2. step input/output/depends
3. statements

The proposed layout would be:
1. a mixture of assignments and step input/output/depends
2. statements

Note that step input could introduce a loop and run the assignments, step output, and depends repeatedly.
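A sketch of what the proposed layout could look like (syntax tentative):

[10]
# mixture of assignments and step input/output/depends
alpha = 2
input: 'data.txt', group_by='single'
output: '${_input}.out'
# statements (the step process)
run('process --alpha ${alpha} ${_input} > ${_output}')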
This ticket stems from a discussion in a closed issue; for better organization I am making it a separate issue here. The problem is that when group_by and pattern are both used in input:, in conjunction with pattern in output:, the output file names are based on the input file names but do not honor the group_by rule from input:, creating a name mismatch in subsequent actions.
Runtime signatures are usually not portable, because filenames and paths change when a project is moved to another location. However, because a project may be archived and later restored for further analysis, saving the runtime signatures with the project might be useful. In addition, if runtime signatures include the standard and error output of commands, keeping them for later reference can be useful. The interface can be
sos admin --pack_runtime runtime.db
for packing runtime information to a file,
sos admin --unpack_runtime runtime.db
for unpacking runtime information of the current project.
The sos view
or sos edit
commands might be used to show standard and error output of commands.
Python 3 contains modules such as concurrent.futures and asyncio, which can be useful for SoS. These modules are not available in Python 2.7, so it makes sense to support only Python 3.x.
We have addressed this problem with the !q and !r format converters.