vatlab / sos
SoS workflow system for daily data analysis
Home Page: http://vatlab.github.io/sos-docs
License: BSD 3-Clause "New" or "Revised" License
It is very useful to save the input and output of a step to properly named variables so that later steps can refer to them by name. There are three possible implementations.
Option 1:

[step_index: input_alias=name, output_alias=name]

Pros:
Cons:

Option 2:

[step_index]
name = step_input
name1 = step_output
stuff = step_output[0]
other = step_output[1:]

Pros: later steps can use step_input or step_output directly; for example, if there are two types of output files, they can be assigned to different names.
Cons: exposes step_index and step_output, which are more or less internal.

Option 3:

input: alias=name
output: alias=name

Pros: hides step_input.
Cons: step_input and name should be filtered even for group_by input files.

When I use env.logger
from pysos.utils, the logging messages always appear twice. I remember I must have run into this before but I cannot recall the way out. I had to add if not self._logger.handlers to get rid of the duplicates, based on a random stackoverflow.com post. I am not sure if the problem occurs in the current SoS -- since I only used the RuntimeEnvironment class in my application, something might be different, and the patch below is obviously not good because there are two types of handlers, cout and ch, and I am not dealing with them properly. Just want to know if this is an issue before taking any other actions.
- cout = logging.StreamHandler()
- levels = {
- '0': logging.WARNING,
- '1': logging.INFO,
- '2': logging.DEBUG,
- '3': logging.TRACE,
- None: logging.INFO
- }
- #
- cout.setLevel(levels[self._verbosity])
- cout.setFormatter(ColoredFormatter('%(color_levelname)s: %(color_msg)s'))
- self._logger.addHandler(cout)
+ if not self._logger.handlers:
+ cout = logging.StreamHandler()
+ levels = {
+ '0': logging.WARNING,
+ '1': logging.INFO,
+ '2': logging.DEBUG,
+ '3': logging.TRACE,
+ None: logging.INFO
+ }
+ #
+ cout.setLevel(levels[self._verbosity])
+ cout.setFormatter(ColoredFormatter('%(color_levelname)s: %(color_msg)s'))
+ self._logger.addHandler(cout)
The action name check_output conflicts with subprocess.check_output. A better name is needed.
Software version matching is important for pipeline reproducibility. For single-executable software it is the user's responsibility to provide and install it, so what SoS can do is perhaps record the version somewhere, e.g., in sos_session_info.txt. But SoS automatically installs R libraries, which may cause version problems. A simple way out would be to use the syntax package_host/package_name/version for R libraries, for example cran/ggplot2/0.99, bioc/ggbio/xxx, and even username@github/package for GitHub R packages (for GitHub perhaps there are no versions but commits/tags instead?).
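A minimal sketch of how such a specification could be parsed (parse_r_package is a hypothetical helper, not part of SoS):

def parse_r_package(spec):
    # 'cran/ggplot2/0.99' -> ('cran', 'ggplot2', '0.99')
    # 'bioc/ggbio'        -> ('bioc', 'ggbio', None)
    parts = spec.split('/')
    host, name = parts[0], parts[1]
    version = parts[2] if len(parts) > 2 else None
    if '@' in host:
        # 'username@github/package' -> version is a commit or tag, if any
        user, host = host.split('@')
        name = '{}/{}'.format(user, name)
    return host, name, version

print(parse_r_package('cran/ggplot2/0.99'))        # ('cran', 'ggplot2', '0.99')
print(parse_r_package('username@github/package'))  # ('github', 'username/package', None)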
Not sure what is involved/required to support Docker.
SoS is now registered on PyPI and can be installed with the command pip3 install sos.
SoS needs psutil for proper cleanup when the process is killed (keyboard interrupt). I added the following lines to setup.py
install_requires=[
'psutil',
],
so that psutil can be automatically installed with sos:
$ pip3 install sos
Collecting sos
Downloading sos-0.5.0.tar.gz (42kB)
100% |████████████████████████████████| 51kB 2.1MB/s
Collecting psutil (from sos)
Using cached psutil-4.1.0.tar.gz
Installing collected packages: psutil, sos
Running setup.py install for psutil ... done
Running setup.py install for sos ... done
Successfully installed psutil-4.1.0 sos-0.5.0
although
python3 setup.py install
generates a warning. This ticket will remain open for all installation issues.
I looked at the wiki but I did not see a benchmarking feature for SoS steps. Perhaps it would be something good to have?
After updating to the current SoS my previous code does not work:
def get_md5(value):
import sys, hashlib
base, ext = value.rsplit('.', 1)
res = hashlib.md5(base.encode('utf-8')).hexdigest() if sys.version_info[0] == 3 else hashlib.md5(base).hexdigest()
return '{}.{}'.format(res, ext)
[simulate_1: alias = 'simulate']
n = [1000]
true_mean = [0, 1]
output_suffix = 'rds'
input: for_each = ['n', 'true_mean']
output: pattern = 'exec=rnorm.R::n=${_n}::true_mean=${_true_mean}.${output_suffix}'
_output = [get_md5(x) for x in _output]
run: ...
The error message is:
ERROR: Failed to execute subworkflow: Failed to assign [get_md5(x) for x in _output]
to variable _output: name 'get_md5' is not defined
What is the proper way to use customized Python functions with current SoS?
Nested workflows might have multiple [parameters] steps. Right now, when option -h is passed to an ArgumentParser, it prints the help message for the first [parameters] step and quits. It is therefore difficult to print options from all [parameters] steps. A solution is to hijack -h, turn on a help mode, and process all [parameters] steps, but there is no need to worry about this corner case for now.
Snakemake has some magic that allows input to be accessed both as a list and as an object with attributes. For example, it allows

input:
    reference='seq.fasta',
    bamfiles=['a.bam', 'b.bam']

and expressions such as input.reference, input.bamfiles, and input. This can be easily achieved with some Python magic, but I am not sure if I should follow it and make SoS more difficult to understand.
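For illustration, a minimal sketch of the kind of magic involved (InputFiles is hypothetical, neither SoS nor Snakemake code):

class InputFiles(list):
    # a list of all input files that also exposes named groups as attributes
    def __init__(self, *args, **kwargs):
        files = list(args)
        for name, value in kwargs.items():
            group = [value] if isinstance(value, str) else list(value)
            setattr(self, name, group if len(group) > 1 else group[0])
            files.extend(group)
        super().__init__(files)

inp = InputFiles(reference='seq.fasta', bamfiles=['a.bam', 'b.bam'])
print(list(inp))       # ['seq.fasta', 'a.bam', 'b.bam']
print(inp.reference)   # seq.fasta
print(inp.bamfiles)    # ['a.bam', 'b.bam']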
Pros:
Cons: unclear interaction with group_by.

The alternative is to define

reference='seq.fasta'
bamfiles=['a.bam', 'b.bam']
input: reference, bamfiles

and use input, reference, and bamfiles separately. Here reference more likely belongs to depends.
For simplicity, and mostly for the last reason, I am inclined not to use a complicated data structure here.
VPT has some support for running on a cluster; I am wondering if it is mature enough to be ported to SoS?
In the present VPT implementation, actions take care of runtime signatures, and a step collects output files from one or more actions. When an action needs to be executed multiple times with different outputs (the group_by option), runtime signatures are saved for each execution.
SoS currently adopts step-wise output specification, so it does not care how many times an action is executed or how many actions are executed, and thus loses fine control of action-level runtime signatures.
Solution: SoS takes a multi-signature approach. If output is defined for each input, each repeat of the action will have its own signature.
Limiting or monitoring the RAM and CPU (cores) the step uses?
In utils.py the logger's lowest verbosity level is '0', which still displays warning messages. I am not sure all warning messages in SoS are worth keeping, but in my interaction with the library I sometimes want to hide warning messages. I am thinking of adding {'-1': logging.ERROR} so that verbosity -1 will do the trick. Does this make sense, or is there another way out? I can send a pull request that also changes the sos interface, if it makes sense ...
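The change would be a one-line addition to the levels mapping in utils.py (a sketch based on the handler code shown above):

import logging

levels = {
    '-1': logging.ERROR,   # proposed: hide warning messages
    '0': logging.WARNING,
    '1': logging.INFO,
    '2': logging.DEBUG,    # SoS also maps '3' to its custom TRACE level
    None: logging.INFO,
}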
I have this script:
input_files = ('1.txt', '2.txt', '3.txt', '4.txt', '5.txt', '6.txt', '7.txt', '8.txt')
[1]
for item in input_files:
    run('touch ${item}')
[2]
output_suffix = 'out'
input: input_files, group_by = 'pairs', pattern = '{base}.{ext}'
output: pattern = '{_base}.${output_suffix}'
run('''
echo ${_input}
echo ${_output}
''')
and I got this error message:
ERROR: Failed to process directive input: 'NoneType' object is not iterable
Please pull and reinstall sos for a minor bug fix before trying to reproduce the error. Thank you!
Right now all SoS steps are executed in separate processes, and the only return value is the step info specified by the alias option. There are two problems:
Proposal:
Add a shared step option to specify which variables a step will use (if available) and return. Steps that share the same variables will form a dependency relationship.
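A sketch of how the proposed option might look in a script (the exact syntax is tentative):

[10: shared='seed']
seed = 42

[20]
# step 20 depends on step 10 because it uses the shared variable seed
run('simulate --seed ${seed}')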
We try to handle ~ in filenames automatically, but it is hard to ensure that os.path.expanduser is called whenever a path is needed. Because we will also perform magic operations such as temporary() and dynamic() on filenames, we should encapsulate filenames in a dedicated class.
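A minimal sketch of such a class (FileTarget is a hypothetical name):

import os

class FileTarget(str):
    # a string subclass that always stores the user-expanded path;
    # flags such as temporary() and dynamic() could be attached here
    def __new__(cls, path):
        return super().__new__(cls, os.path.expanduser(path))

print(FileTarget('~/data/a.txt'))   # /home/<user>/data/a.txt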
There are some options to allow concurrent execution of step actions:
- A nonconcurrent section option, because the actions should be safe to run concurrently by default.
- for_each as concurrent and nc_for_each for a nonconcurrent for each. Perhaps for_all as nonconcurrent? The problem is that group_by would also generate nc_group_by.
(Decided to use the section option nonconcurrent.)
There is a need to save the versions of programs or R packages ...
There is a need to modify _output, and a proposal to implement something like a filter=func option, because the outcome of pattern needs to be post-processed. However, because pattern is simple to simulate for this special case, perhaps

[func('${x} ${y}') for x, y in zip(_A, _B)]

is enough for

pattern='{_A} {_B}', filter=func

so we do not need a separate output option.
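A runnable illustration of the comprehension-based alternative (func, _A, and _B are placeholders):

def func(name):
    # stand-in for whatever post-processing is needed, e.g. get_md5 above
    return name.replace(' ', '_') + '.out'

_A = ['x1', 'x2']
_B = ['y1', 'y2']
print([func('{} {}'.format(a, b)) for a, b in zip(_A, _B)])
# ['x1_y1.out', 'x2_y2.out']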
The real trouble behind changing _output directly is that we would have to monitor each statement for changes of _output, and even then a change of _output might not affect _step.output, because statements after process might be executed in a separate process. Even worse, we might evaluate statements multiple times (e.g. dryrun before run), so _output = process(_output) can fail.
If the output of a step is dynamically determined, for example by running glob.glob on the output directory, the output might be empty or wrong at the planning stage. For example,
[100]
output: '*.bam'
[200]
run('samtools sort')
Step 200 might get an empty list at the planning stage, be treated as a leaf/starting step with no input, and cause an error.
Solution:
- A dynamic section option to tell SoS that the output of step 100 is dynamic?
- A dynamic property attached to related variables.
- A dynamic input, output, and depends option.

A dynamic option to the input
and output
directives seems to be best, but dynamic('*.txt') is also acceptable. The problem with the latter is that dynamic(func(args)) looks a bit strange: essentially, dynamic delays the evaluation of a string ('*.txt'), so the form dynamic(func(args)) can be a bit confusing.
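A minimal sketch of the delayed-evaluation idea behind dynamic() (the implementation is assumed):

import glob

class dynamic:
    # wraps a pattern so that expansion happens at execution time
    # rather than at the planning stage
    def __init__(self, pattern):
        self.pattern = pattern

    def resolve(self):
        return sorted(glob.glob(self.pattern))

out = dynamic('*.bam')
# ... planning happens here, out.pattern is still unexpanded ...
files = out.resolve()   # expanded only when the step actually runs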
It would be nice if we could isolate user groups to allow, for example, write permission only to the working directory and specified directories. This is difficult to achieve, so this thread is created for the purpose of gathering ideas.
CWL has the requirement keyword, which handles all sorts of requirements. SoS solves this with:
- the depends: item in a pipeline step, which lists all dependent files that will go into the step signature.
- the fail_if() action, which stops the workflow if the requirements are not satisfied.

The regular for loop is allowed in a step action, but it is difficult to run the iterations in parallel because a step action can be very complicated. The for_each parameter is designed to repeat the action with different values in parallel.
That is to say
input: for_each='method'
run('command1 ${_method}')
is more or less equivalent to
for _method in method:
    run('command1 {}'.format(_method))
String interpolation cannot be used because SoS does not understand the loop.
I am not sure the current design is intuitive, though. It requires the variables to be defined beforehand and uses a derived variable (_name for a variable name), and the usual for_each loop does not fit into the existing parameter structure.
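For illustration, a minimal sketch of the expansion for_each performs (expand_for_each is a hypothetical helper):

import itertools

def expand_for_each(variables):
    # nested loop over all listed variables, exposing each value as _<name>
    names = list(variables)
    for combo in itertools.product(*(variables[n] for n in names)):
        yield dict(zip(('_' + n for n in names), combo))

for env in expand_for_each({'method': ['m1', 'm2']}):
    print(env)   # {'_method': 'm1'}, then {'_method': 'm2'}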
Another concern is that we are using input
for both regular and looped cases. It might
make sense to use _input
for the latter.
We would like to encourage users to run a script in dryrun mode to check the integrity of the script and the running environment (availability of commands, etc.) before actually running it. Currently the dryrun mode is invoked as
sos run -d script [workflow] [options]
We use -d instead of --dryrun because SoS allows arbitrary command line arguments to be defined by a SoS script, and using --dryrun might cause a conflict. Since -d is not commonly used for dryrun mode, perhaps a dedicated dryrun subcommand is better?
sos dryrun script [workflow] [option]
We can even define both, in case neither method is clearly better than the other.
A SoS script can be easier to understand if we make most SoS variables readonly. That is to say, a SoS variable cannot be changed after it has been initialized. Exceptions to this rule can be system variables and temporary looping variables. It is also possible to use _name for all non-readonly variables.
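For illustration, a minimal sketch of how such a readonly namespace could be enforced (ReadOnlyNamespace is hypothetical):

class ReadOnlyNamespace(dict):
    # rejects re-assignment, except for _-prefixed (system/looping) variables
    def __setitem__(self, key, value):
        if key in self and not key.startswith('_'):
            raise RuntimeError('SoS variable {} is readonly'.format(key))
        dict.__setitem__(self, key, value)

ns = ReadOnlyNamespace()
ns['bam_files'] = ['a.bam']
ns['_index'] = 0
ns['_index'] = 1             # allowed: looping variable
ns['bam_files'] = ['b.bam']  # raises RuntimeError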
Pros:
Cons: some variables do need to be changed, e.g. bam_files in the option skip example.

It might be worth compiling a table of terms in the SoS wiki that briefly explains jargon such as "step variables" and "step derivatives", after we are somewhat more settled later.
Our step currently has the format
input:
output:
depends:
runtime:
with assignments allowed before/between these keywords, and any statement after the last keyword considered part of the step process. This is clear enough in most cases, but it might be clearer to have a separate process keyword that begins the step process. For example, a problem an advanced user might encounter is that

input:
_output = some_user_function()

will not work, because the assignment to _output is part of the step process, so changing _output will not affect the step output (unless another keyword is added after the statement). It would be clearer to write this particular example as
input:
_output = some_user_function()
process:
start of process
SoS can certainly define a step with

run('sos-runner anotherworkflow ${input}')

to execute another workflow, but it would be good to be able to define a workflow from existing ones, or to maintain a library of workflows.
I can see it might be helpful to allow the definition of functions or classes in [parameters]. For example,
. For example,
[parameters]
def get_variable_names(val):
    ...
    return val_names
Then
[1]
input: get_variable_names(${val})
This may handle something more complicated than what a lambda function can do.
I'm guessing I should post some of my problems writing SoS scripts and, after we fix them, add them to the Examples page.
Here is one:
1.sos.txt
In the attached mock pipeline I try to automatically generate files based on the files generated by the previous step. It is a test to match file names. I cannot get this pipeline to work because output: does not loop over input parameters; rather, it treats the input parameter list as a string. What is the proper syntax for this use case?
I have a SoS script like this
[a_1]
[a_2]
[b_1]
[b_2]
[A_1=a_1+b_1]
[A_2=a_1+b_2]
[A_3=a_2+b_1]
[A_4=a_2+b_2]
I'm hoping to run workflow A so that all 4 workflows will be executed. However, my hope was that, for example, b_1 would automatically take the output of a_1 as its input. That does not seem to be the case in the current implementation. The INFO messages show empty input, and I am guessing this is because each element in the combined workflow still looks for its upstream step by index, rather than by its new position in the combined workflow. Is that the case, and how do I achieve my goal?
The use of special syntax in SoS might be troublesome. For users who would prefer an authentic Python interpreter, we might be able to provide something like

from pysos import no_raw_tripple_string, no_string_interpolation

to disable these SoS-added features.
Libraries would be Python modules with defined SoS actions, but how to maintain and import these modules requires further investigation. Furthermore, extensive use of libraries somewhat defeats the purpose of SoS (readability), because libraries hide the details of actions.
Limiting the files or directories that a step action can write to?
The differences between _step.input, _input, _output, _step.output, loop_var, and _loop_var can be confusing. I am wondering if there are ways to completely remove loops from steps. For example, we could move the loop to the section option level (but that place is crowded as well).
At the very least, we need a good figure to explain these variables well.
Does it make sense to enforce some naming convention so that users immediately know the type and nature of variables? This might make the script a bit more readable. For example:
- uppercase names for constants (e.g. RESOURCE_DIR='/path/to/resource')
- a leading underscore for loop variables (_loop_variable, _label, _input; right now input is used)
I would not introduce $ and @ symbols, though.
By 'enforce', I mean SoS can give a warning or even an error if a variable's usage does not match its naming convention. I tend to think we should let users use the style they prefer.
These can be determined automatically if input and output files are specified, right? These options are therefore not needed. In addition, adding the option terminal does not prevent other steps from depending on the output files of a step, leading to potential errors.
Decided to use no special option and rely on the input and output specifications of steps.
I am wondering if we can allow other types of dependencies, such as:
- defined('variable')
- hascommand('tophat')
Existing files can be kept or converted to exist('file'). The problem here is that we need to differentiate between conditions that will never be met (e.g. a missing command) and those that can be met at runtime (existence of a file).
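Minimal sketches of such checks (the names follow the proposal; the implementations are assumptions):

import shutil

def hascommand(cmd):
    # True if the command can be found on PATH
    return shutil.which(cmd) is not None

def defined(name):
    # True if the variable exists in the current namespace
    # (SoS would check its own namespace instead of globals())
    return name in globals()

print(hascommand('tophat'))
print(defined('input_files'))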
Right now, at least in test scripts, step processes are executed in the SoS local namespace and can, for example, run

executed = []
[10]
executed.append(_step.name)

This is OK for now, but it will significantly complicate the implementation of parallel execution of scripts, because step processes would then all rely on the same central global environment, causing a lot of inter-process trouble. It would make sense to copy runtime variables to step processes and let each process execute independently. That is to say, changes to the local namespace will not affect SoS.
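A minimal sketch of the copy-then-execute idea (execute_step is hypothetical):

import copy

def execute_step(statements, sos_vars):
    # run the step on a copy so its assignments cannot leak back into
    # the central SoS namespace; SoS decides what (if anything) to merge
    local_vars = copy.deepcopy(sos_vars)
    exec(statements, local_vars)
    return local_vars

ns = {'executed': []}
execute_step("executed.append('step_10')", ns)
print(ns['executed'])   # [] -- the original namespace is unchanged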
Changes that have been made: the process keyword now replaces runtime; basically, runtime options are now process runtime options. That is to say, the above example will continue to work because everything is still in the SoS namespace.
Would it be clearer to require explicit input files? Right now, a step's _step.input is the output of the previous step. The problem with having no default input is that step aliases would almost always be required, which makes the script a bit cumbersome.
Pros and cons of using default input:
pro:
input100
, but input: output200
is notcons:
This is a thread devoted to the parallel execution feature of SoS. Basically, we need to write or find an existing implementation of a DAG. Adage (https://pypi.python.org/pypi/adage/0.1.7.1) looks like a close match, but we will need to investigate this package further.
Celery is something we can use. It would allow us to expand to the server environment, because it would become the user's responsibility to set up the Celery environment for SoS to run across nodes.
Python 3.6 introduced formatted strings, which use {}. They also accept expressions (with the addition of the conversion ! and format specification :), so I am very tempted to declare that all SoS string literals are Python format strings. However, the use of {} forces the doubling of every literal { and }, which is disastrous for scripts such as shell and R.
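For example (using str.format, which follows the same brace rules), embedding even a one-line R function already requires doubling:

# every literal brace in the embedded R code must be doubled
r_code = 'f <- function(x) {{ mean(x) + {offset} }}'.format(offset=1)
print(r_code)   # f <- function(x) { mean(x) + 1 }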
I guess I will have to specify that SoS strings are different, with configurable sigils for different languages.
Current SoS steps look more or less like this
[*_31]
# get the number of mapped reads
input: sortbam.output, group_by='single'
output: '${_input}.bai', '${_input}.flagstat'
process: concurrent=False
run('''
samtools index ${_input}
samtools flagstat ${_input} > ${_input}.flagstat
''')
and I am wondering if we can write it as
[*_31]
# get the number of mapped reads
input: sortbam.output, group_by='single'
output: '${_input}.bai', '${_input}.flagstat'
run: concurrent=False
    samtools index ${_input}
    samtools flagstat ${_input} > ${_input}.flagstat
where run etc. can be any SoS action that accepts a single string input, and we can add multiple sections if needed (e.g. run:, python:).
Pros: [XX] etc. in script.
Cons: the semantics that process will start a new process, and that run() etc. can be inline (executed with SoS) or external (a separate process), need to be changed. We can keep the run() form inline, though.

There can be a need for dependency rules. For example, if a bam index file is needed (depended upon) and
the bam file exists, then samtools index would be called automatically. This does not sound like a good idea, because samtools index can always be put in as a regular part of the workflow. On the other hand, such rules can help if these common steps are not always needed, or are needed by multiple steps of the pipeline.
I am not sure how useful such magic is. Implementation-wise it might not be too difficult; syntax like the following could be used
# section name does not have index so it would be called if and only if the
# step is needed. pattern of output is defined as section option.
[index_bam: output='*.bam.bai']
# input is defined from output
input = output[0][:-4]
# the action is defined as usual, but output does not need to be defined
# again.
run('samtools index ${input}')
This could potentially work: it allows gnumake-style definition of pipelines that can be used to construct complete workflows. It seems to contradict the design of SoS, though.
Decision: use auxiliary steps.
In cases where output needs to be composed from input, there can be a need to process input or _input before the step output is specified.
Right now a SoS step can have:
1. assignments
2. step input/output/depends
3. statements

The proposed layout would be:
1. a mixture of assignments and step input/output/depends
2. statements

Note that step input could introduce a loop and run the assignments, step output, and depends repeatedly.
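A sketch of what the proposed layout could look like (syntax tentative):

[10]
# mixture of assignments and step input/output/depends
alpha = 2
input: 'data.txt', group_by='single'
output: '${_input}.out'
# statements (the step process)
run('process --alpha ${alpha} ${_input} > ${_output}')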
This ticket stems from a discussion in a closed issue; for better organization I am making it a separate issue here. The problem is that when group_by and pattern are both used in input:, in conjunction with pattern in output:, the output file names are based on the input file names but do not honor the group_by rule from input:, creating a name mismatch in subsequent actions.
Runtime signatures are usually not portable, because filenames and paths change when a project is moved to another location. However, because a project may be archived and later restored for further analysis, saving the runtime signatures with the project might be useful. In addition, if runtime signatures include the standard and error output of commands, keeping them for later reference can be useful. The interface can be
sos admin --pack_runtime runtime.db
for packing runtime information to a file,
sos admin --unpack_runtime runtime.db
for unpacking runtime information of the current project.
The sos view
or sos edit
commands might be used to show standard and error output of commands.
Python 3 contains modules such as concurrent.futures and asyncio, which can be useful for SoS. These modules are not available in Python 2.7, so it makes sense to support only Python 3.x.
We have addressed this problem with the !q and !r format converters.