openwdl / wdl
Workflow Description Language - Specification and Implementations
Home Page: https://www.openwdl.org/
License: BSD 3-Clause "New" or "Revised" License
Dynamically sizing HDD & SSD volumes is a great feature of WDL + Cromwell that gives you cost savings in certain cloud environments. A current limitation is that you cannot call `size` on an `Array[File]` type or `Array[Array[...[File]...]]` types, which forces you to either guess the required disk size or build logic into your workflow upstream for determining file sizes.
Some practical considerations: other compound types containing `File`, i.e. `Map[String,File]`, `Pair[File,File]`, etc.
In the spec, a command part option named `quote` is listed: https://github.com/broadinstitute/wdl/blob/develop/SPEC.md#command-part-options
However, although the spec goes on to have a subsection explaining the other four options (`sep`, `true`, `false`, and `default`), there appears to be no documentation of what is meant by the `quote` option.
It would be nice to get methods to iterate over maps with scatter. It would also be good to have access to `map.keySet` and `map.valueSet` to be able to process the keys or values of a map separately.
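The requested accessors map directly onto dict views; a minimal Python sketch of the desired semantics (the names `keySet` and `valueSet` come from the proposal above, not from existing WDL):

```python
# Hypothetical semantics of map.keySet / map.valueSet, sketched in Python.
sample_files = {"tumor": "t.bam", "normal": "n.bam"}

keys = sorted(sample_files.keys())      # roughly what map.keySet would return
values = sorted(sample_files.values())  # roughly what map.valueSet would return

# "scatter over a map" would then be equivalent to iterating key/value pairs:
pairs = [(k, v) for k, v in sorted(sample_files.items())]
```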
To make life easier for authors doing string checks like in https://gatkforums.broadinstitute.org/wdl/discussion/10354/multiple-backends-for-cromwell#latest
Eg:
File f
Boolean a = endsWith(basename(f), ".suffix")
Boolean b = startsWith(basename(f), "prefix.")
Boolean c = contains(basename(f), ".middle.")
Draft implementation: https://github.com/openwdl/wdl/commits/133-string-find/
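For reference, the three proposed checks have direct Python string-method analogues, which is roughly the behavior being asked for (a sketch; the WDL names above are the proposal, not current spec, and the path is made up):

```python
import os.path

f = "/data/prefix.middle.suffix"   # hypothetical input path
name = os.path.basename(f)

a = name.endswith(".suffix")    # WDL: endsWith(basename(f), ".suffix")
b = name.startswith("prefix.")  # WDL: startsWith(basename(f), "prefix.")
c = ".middle." in name          # WDL: contains(basename(f), ".middle.")
```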
Broad's GDAC often deals with sparsely and inconsistently populated data, and many analyses support a range of acceptable inputs. For example, our methylation preprocessor will correlate methylation data to expression data if it is available for a sample set. However, this means that in addition to optional inputs (already supported by WDL), output files are also optional, or at least conditional upon inputs. Currently cromwell raises an error when an expected output is not present, and I don't believe there is syntax to support it in WDL.
It would be nice to be able to specify that an output is optional. One solution could be to mirror to the optional input syntax, and I could specify optional outputs like this:
task my_task {
... inputs, commands, etc...
output {
File required_output = "file1"
File optional_output? = "file2"
}
}
Having this would eliminate the need to write additional logic to create/handle blank files.
There's currently no literal for this concept, inviting a variety of `if`-based hackery around the problem.
Please add the ability to do nested scatter loops. The current single sample workflow runs HaplotypeCaller in parallel by scattering over interval. I would like to add a scattering by sample step so it can handle multiple samples.
(@kcibul - another Jeff Special)
The status of our spec(s) is super confusing.
There's "Draft 1 (closed)" which predates our team. There's "Draft 2 (open)". There's the master branch. There's the develop branch. The "Draft 2" on master doesn't match what's in develop all the time. There's also only one real implementation of the spec, and it's not even complete.
I recognize the desire on our past selves' part to be official about everything, but we could simplify this a lot by just having a spec that's allowed to change at will until if/when that actually becomes a problem.
Hi
Thanks for releasing this WDL script. I noticed the bwa parameters are: bwa_commandline="bwa mem -K 100000000 -p -v 3 -t 16 $bash_ref_fasta"
What is the `-K` flag for? It is not defined in the bwa manual. Also, is `-M` missing?
Thanks
Matt
There is barely a mention of how workflow outputs work in the SPEC.md file.
The only mention of it is here: https://github.com/broadinstitute/wdl/blob/develop/SPEC.md#outputs
Acceptance criteria: write a better explanation of how workflow outputs work
It says:
"Tasks define all their outputs as declarations at the top of the task definition."
It should say:
"Tasks define all their inputs as declarations at the top of the task definition."
We are converting the WDL scripts to CWL, and there seem to be some discrepancies regarding the correct way to format arguments within the command line with boolean values.
While working on converting the WDL script to CWL using a wdl2cwl converter (https://github.com/common-workflow-language/wdl2cwl), we found that if a boolean value exists for a particular argument, errors were being raised. There is an issue regarding the interpolation of the argument within the command line.
According to the WDL HaplotypeCaller_3.6 script, if an argument takes in a boolean and its default value is `false`, then the command line includes both the argument and the value `false` at the end.
For example, examining this piece of code:
-allowNonUniqueKmersInRef ${default="false" allowNonUniqueKmersInRef}
With a default value of false, the script would return
-allowNonUniqueKmersInRef false
if given no value (using the default), when in reality it should return nothing, as only a true value actually causes anything to appear on the command line. An "invalid argument value false" error is returned in this case. The WDL script does not seem to reflect that. We suspect that the interpolation should instead look like
${true='-allowNonUniqueKmersInRef' false='' Boolean bool}
as shown in the documentation for WDL (https://github.com/broadinstitute/wdl/blob/develop/SPEC.md#integer-lengtharrayx). When converted to CWL, the command line output does not run if done the current way. If false, the argument should not exist on the command line, and if true, the command line should contain
-allowNonUniqueKmersInRef
The `true` and `false` values should not appear on the command line.
There also seem to be problems regarding arguments with array inputs. According to the documentation (https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--input_prior) for something such as `kmerSize`, with an array of `[10, 25]`, the command line should look like
-kmerSize 10 -kmerSize 25
However, the output is `-kmerSize [10,25]`, which is not understood by the program as it is not formatted correctly. According to the WDL documentation, the argument should use `Array[String] prefix(String, Array[X])` so that the prefix is repeated for each value. In conversion from WDL to CWL, these problems cause errors that make the program fail.
The format of the WDL seems to be incorrect, if we are understanding the code correctly. Is it possible that there exists a different version of the WDL code for HaplotypeCaller?
Perhaps we aren't understanding the code correctly. Any clarification is appreciated! Thank you!
Until #25 is fixed, the gatk wrappers should not be using optional arrays, such as the intervals files. Instead they should use this bash workaround, with something similar to:
Array[String]? foo
command {
  if [ -n "${sep=',' foo}" ]; then
    FLAG="--prefix=${sep=',' foo}"
  else
    FLAG=''
  fi
  echo $FLAG
}
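The intent of the workaround above — emit the flag only when the array is non-empty — can be stated as a plain function (Python purely for illustration; `--prefix` is just the example flag from the snippet):

```python
def build_flag(foo):
    """Mimic the bash workaround: ${sep=',' foo} joins elements with commas,
    and the flag is emitted only when the joined string is non-empty."""
    joined = ",".join(foo or [])
    return "--prefix=" + joined if joined else ""
```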
@danbills commented on Thu Jul 06 2017
It would be nice to restrict types to specific values so that users can only specify one of a limited number of a values.
Per this forum post
@cjllanwarne commented on Mon Jul 10 2017
Could be a nice bring-along with broadinstitute/cromwell#2283
@patmagee commented on Tue Aug 29 2017
Just wondering if there has been any discussion regarding this.
@katevoss commented on Tue Aug 29 2017
@patmagee not to my knowledge but I can start the ball rolling.
@geoffjentry & @mcovarr, any opinions on supporting enums?
@geoffjentry commented on Tue Aug 29 2017
I'm leery but not necessarily against it. I'm also generally the one with the most conservative opinion in terms of adding WDL syntax, so view that take as a lower bound of acceptance :)
@patmagee what were you thinking in terms of the syntax?
Pinging @vdauwera so she's abreast of this convo.
@vdauwera commented on Wed Aug 30 2017
FWIW Enum support would definitely be very valuable to us in GATK-world, and is likely to be useful in general.
My one caveat would be that they should be easier to work with than they were in Queue (friendly elbow poke at @kshakir).
@patmagee commented on Fri Sep 01 2017
@geoffjentry I share your concern about adding any sort of syntax to the spec, and my first inclination is to fake enumeration support. I'm not exactly sure how to do that, though, nor am I sure whether it's possible.
The cleanest way I can think of implementing would be maybe something like a Java style enum:
enum MyEnum {
"A","B","C"
}
workflow wf {
MyEnum thisIsMyEnum
}
Another way we could do it would be to define an Enum type in a workflow like so:
workflow wf {
  #This would get overridden at run time, but the value would need to be validated
  Enum greeting = ["HELLO","GOODBYE"]
}
or
workflow wf {
  #Don't override anything, but validate it
  Enum["HELLO","GOODBYE"] greeting
}
@katevoss commented on Thu Sep 28 2017
@geoffjentry sounds like this is more of a WDL feature request, shall I move it to the WDL repo?
@geoffjentry commented on Thu Sep 28 2017
https://github.com/broadinstitute/wdl/blob/develop/SPEC.md#true-and-false
The docs indicate that you do not need a "true" and a "false", but in fact both are required.
I would like to be able to get a list of 3 kinds of files in a directory
*.png
*.txt
*.html
and store this list into a single output variable. I know the glob() function can be used to find each type, but that means 3 glob() invocations in the output{} section of my WDL, and there is no way to paste them together.
I suppose I could do `ls *.png *.txt *.html >> outputs.lst` in the command{} section, and then in my output{} section do `Array[File] outputs = read_lines("outputs.lst")`.
But it would be cleaner to be able to do something like
Array[File] outputs = glob("*.png") + glob("*.txt") + glob("*.html")
or
Array[File] outputs = concat(glob("*.png"), glob("*.txt"), glob("*.html"))
or variants.
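The desired concatenation semantics are easy to state with Python's glob module (a sketch of the requested behavior, not WDL syntax; the scratch directory and file names are made up):

```python
import glob
import os
import tempfile

# Set up a scratch directory holding the three kinds of files plus a decoy.
d = tempfile.mkdtemp()
for name in ["a.png", "b.txt", "c.html", "d.log"]:
    open(os.path.join(d, name), "w").close()

# The request: concatenate three glob results into one output list.
outputs = sorted(
    glob.glob(os.path.join(d, "*.png"))
    + glob.glob(os.path.join(d, "*.txt"))
    + glob.glob(os.path.join(d, "*.html"))
)
names = [os.path.basename(p) for p in outputs]
```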
Needs Refinement
We want to be able to define types for the values of objects. One suggestion was something like the following (note: `struct` is used here as a possible replacement for `object`; see below):
struct MyType {
o_f: File
x: Array[String]
}
MyType foo = read_object(...)
It will coerce to the types it expects and if it can't that's a failure.
Open questions:
- whether to add this as a new type (`struct` above), or replace `object` entirely
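The proposed coercion rule ("coerce to the expected types, fail if you can't") can be sketched like this (Python for illustration; the field names come from the `MyType` example above, and `coerce_struct` is a hypothetical helper):

```python
def coerce_struct(raw, schema):
    """Coerce each field of a parsed object to its declared type;
    raise if a field is missing or cannot be converted."""
    out = {}
    for field, typ in schema.items():
        if field not in raw:
            raise ValueError("missing field: " + field)
        out[field] = typ(raw[field])  # attempt the coercion, e.g. str -> int
    return out

# Mirrors: struct MyType { o_f: File, x: Array[String] }
schema = {"o_f": str, "x": list}
record = coerce_struct({"o_f": "data.bin", "x": ["a", "b"]}, schema)
```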
For Mutect we use a lot of sub-workflows, and it's tedious to pass a large set of parameters repeatedly. For example:
import "single_sample.wdl" as SingleSample
workflow MultiSample {
#these are the parameters of the single-sample subworkflow
File param1
File param2
. . .
Int param20
#other parameters
scatter (sample in samples) {
call SingleSample.single_sample {
input:
param1 = param1,
param2 = param2,
. . .
param20 = param20
}
}
}
In this case, the same list of parameters is copied 5 times: the single-sample task inputs, the single-sample workflow inputs, the single-sample workflow's call to the task, the multi-sample workflow inputs, and the multi-sample workflow's call to the single-sample workflow. It would be really nice to be able to encapsulate all these parameters, eg
params MyParams {
File param1
File param2
. . .
Int param20
}
task MyTask {
MyParams params
command {
java -jar ${params.gatk} -R ${params.reference} -L ${params.intervals} . . .
}
}
workflow MyWorkflow {
MyParams params
call MyTask { input: params = params }
}
Consider a WDL command section with 2 or more Unix statements. Cromwell currently reports that such commands "succeed" even when one or more of its Unix statements fail (i.e. return a non-zero status code), as long as the final statement returns a zero status code (e.g. echo "Done").
This is fragile, and I think it's far more common to want Cromwell to report a failure in such cases than to silently ignore them.
One workaround is to chain all statements together into one effective line, using && to separate each statement, but this is less flexible and readable (uglier). To make the common case more robust and readable, it would be helpful to have the generated script file begin with `set -eo pipefail` (or perhaps even `set -euo pipefail`).
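The difference is easy to demonstrate by running the same two-statement script with and without the proposed prologue (Python driving bash here purely for illustration; assumes bash is on PATH):

```python
import subprocess

body = 'false\necho "Done"\n'  # first statement fails, last one succeeds

# Without the prologue, bash reports the exit code of the LAST statement,
# so the overall script "succeeds" despite the failure.
plain = subprocess.run(["bash", "-c", body], capture_output=True)

# With `set -eo pipefail`, the first failing statement aborts the script.
strict = subprocess.run(["bash", "-c", "set -eo pipefail\n" + body],
                        capture_output=True)
```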
I am following the instructions here:
https://github.com/broadinstitute/wdl#getting-started-with-wdl
I have downloaded cromwell-0.19.jar.
I created the example "hello.wdl".
The README indicates that there is an "inputs" command to see the inputs to the wdl.
When I run "java -jar cromwell.jar", it indicates that there are just two commands "run" and "server".
It appears that the "inputs" command has moved to "wdltool".
I downloaded wdltool from https://github.com/broadinstitute/wdltool/releases
$ java -jar wdltool.jar inputs hello.wdl
{
"test.hello.name": "String"
}
The documentation should be updated accordingly.
Also at the bottom of this first example, it indicates:
Since the hello task returns a File, the result is a file that contains the string "hello world!" in it.
It would be worth adding a demonstration of that:
$ cat /home/user/test/c1d15098-bb57-4a0e-bc52-3a8887f7b439/call-hello/stdout8818073565713629828.tmp
hello world!
We need to decide (and then treat appropriately) the scope of what we're supporting in the WDL universe. For example, are we supporting, and thus maintaining, pywdl?
It looks like task outputs work in a different way depending on whether you reference them inside the scatter block (like with dec, where I get an integer from inc.increment) or outside of it (in which case I get an array from inc.increment):
workflow wf {
Array[Int] integers = [1,2,3,4,5]
scatter (i in integers) {
call inc {input: i=i}
call dec {input: i=inc.increment} #inc.increment is integer
}
call sum {input: ints = inc.increment} #inc.increment is an array
}
This is very confusing. Maybe it is possible to clarify this behaviour in the docs?
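What is being described is WDL's implicit gathering: inside the scatter block each `inc.increment` is a single value, while outside the block the same name denotes the array of all shards. A Python sketch of what the engine effectively does (illustration only, not a proposal):

```python
integers = [1, 2, 3, 4, 5]

inc_increment = []               # the gathered outputs of `call inc`
dec_results = []
for i in integers:               # scatter (i in integers)
    inc = i + 1                  # inside the block: a single Int
    inc_increment.append(inc)
    dec_results.append(inc - 1)  # call dec {input: i = inc.increment}

# Outside the block, the same name is the gathered Array[Int]:
total = sum(inc_increment)       # call sum {input: ints = inc.increment}
```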
We have a bunch of awesome tools for WDL! Unfortunately I miss the roundup all the time.
It'd be nice to make the link more prominent, perhaps as one of the highlighted pieces?
I understand this is a little more noisy; I was thinking that perhaps listing these things vertically may help.
Anyway I crafted this professional drawing to be more specific
It appears that the command part options example for true and false given in the SPEC (https://github.com/broadinstitute/wdl/blob/develop/SPEC.md#true-and-false) is broken.
First of all, the parser does not seem to allow commas between command options (the `list` macro is invoked without the `:comma` second argument; https://github.com/broadinstitute/wdl/blob/develop/parsers/grammar.hgr#L388).
Secondly, it does not appear to be possible to assert a type of a command part by prefixing the expression with the type as is done in this example:
For example, ${true='--enable-foo', false='--disable-foo' Boolean yes_or_no} would evaluate to either --enable-foo or --disable-foo based on the value of yes_or_no.
Because the grammar does not seem to accept a type specification here, it is not clear to me when and how the `true`/`false` key values should be interpreted. In this example it would be clear that the type of `yes_or_no` is `Boolean` because the expression is just a simple identifier, but if the expression were more complex (such as a function call) it would be harder to determine its type without understanding the execution environment.
We have come across this while trying to improve the wdl2cwl converter (https://github.com/common-workflow-language/wdl2cwl) to handle more complex command expressions, and in order to add the `true` and `false` handling we will need to understand when they are meant to be invoked.
After reading the docs and some of the example scripts (such as the WDL for HaplotypeCaller), it seems possible that we are meant to use the `true` and `false` options (each with a default value of `""`) whenever the type of the expression is `Boolean` - but that they do not apply when the type is `Boolean?`, as in that case the `default` option is used instead and the `true`/`false` values are simply stringified, as in this example (from https://github.com/broadinstitute/wdl/blob/develop/scripts/wrappers/gatk/WDLTasks_3.6/HaplotypeCaller_3.6.wdl#L141):
-allelesTrigger ${default="false" useAllelesTrigger}
which appears to evaluate to `-allelesTrigger false` if `useAllelesTrigger` is `false` or unset/null, and `-allelesTrigger true` if `useAllelesTrigger` is `true`.
If the `true` and `false` options apply to both `Boolean` and `Boolean?`, and we are meant to apply the default (`""`) values of the `true` and `false` options as the docs suggest -- in other words, if the above line were equivalent to:
-allelesTrigger ${true="" false="" default="false" useAllelesTrigger}
then I would expect this to evaluate to `-allelesTrigger false` only when `useAllelesTrigger` is unset/null, and to `-allelesTrigger` when `useAllelesTrigger` is `true` or `false`.
Clarification of the documentation would be helpful!
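Our reading of the spec, stated as an explicit rule (this is the interpretation we are asking to have confirmed, not documented behavior):

```python
def interpolate(value, true_s=None, false_s=None, default=None):
    """Suspected ${...} substitution rule for Boolean expressions:
    - if the value is unset (None), substitute `default`, stringified;
    - else if true=/false= options are given, substitute the matching one;
    - else stringify the boolean itself ("true"/"false")."""
    if value is None:
        return str(default).lower() if default is not None else ""
    if true_s is not None or false_s is not None:
        return (true_s or "") if value else (false_s or "")
    return "true" if value else "false"
```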
The grammar currently only allows `meta` in tasks. We'd love to be able to explicitly track versioning & provenance of workflows themselves by embedding some identifiers in a workflow-level meta block, something like the following:
...
workflow finishQuickly {
meta {
revision: "1.0.0"
url: "https://path/to/my/repo/finishQuickly.wdl"
}
call Usain
}
I have a `Map[String,File]` from which I want to extract the values into an `Array[File]`, equivalent to e.g. the Java expression `List<File> list = map.values()`.
Justification: I need to provide multiple files to a task and I don't want to have to hardcode separate arguments for each. I can't just use an Array at the workflow inputs level because I need to be able to call on one of the files specifically in other tasks.
Worked out use case:
{
"DoStuffWithKnownSitesWf.known_sites_VCFs_map": { "dbsnp": "dbsnp_138.vcf", "mills": "mills_indels.vcf", "other": "other_sites.vcf" },
"DoStuffWithKnownSitesWf.known_sites_indices_map": { "dbsnp": "dbsnp_138.vcf.idx", "mills": "mills_indels.vcf.idx", "other": "other_sites.vcf.idx" }
}
task SomeTool {
Array[File] known_sites_VCFs
Array[File] known_sites_indices
command {
doSomething -knownSites ${sep=" -knownSites " known_sites_VCFs}
}
}
task SomeOtherTool {
File dbSNP_VCF
File dbSNP_index
command {
doSomethingElse --dbsnp ${dbSNP_VCF}
}
}
workflow DoStuffWithKnownSitesWf {
Map[String, File] known_sites_VCFs_map
Map[String, File] known_sites_indices_map
call SomeTool {
input:
known_sites_VCFs = known_sites_VCFs_map.values,
known_sites_indices = known_sites_indices_map.values
}
call SomeOtherTool {
input:
dbSNP_VCF = known_sites_VCFs_map["dbsnp"],
dbSNP_index = known_sites_indices_map["dbsnp"],
}
}
The expression `dbSNP_VCF = known_sites_VCFs_map["dbsnp"]` already works perfectly. But there's currently no way to do a straightforward `known_sites_VCFs = known_sites_VCFs_map.values`. This is the feature request. The actual syntax can of course be different.
Bonus points for making the keys available as well, though I don't have an immediate use case in mind.
Draft implementation: https://github.com/openwdl/wdl/tree/43-map-values
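The requested `map.values` reads naturally as the dict analogue (Python sketch of the desired semantics; the `.values` accessor syntax is the proposal above):

```python
known_sites_VCFs_map = {
    "dbsnp": "dbsnp_138.vcf",
    "mills": "mills_indels.vcf",
    "other": "other_sites.vcf",
}

# known_sites_VCFs = known_sites_VCFs_map.values   (the feature request)
known_sites_VCFs = list(known_sites_VCFs_map.values())

# dbSNP_VCF = known_sites_VCFs_map["dbsnp"]        (already works in WDL)
dbSNP_VCF = known_sites_VCFs_map["dbsnp"]
```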
As mentioned a couple of times, pair dereferencing is not intuitive (you have to look it up in the docs to find it). Perhaps we could additionally support array-style dereferencing?
Pair[Int, Int] p = (100, 22)
Int one_hundred = p[0]
Int twenty_two = p[1]
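For comparison, Python tuples behave exactly as the proposal suggests (illustration only):

```python
p = (100, 22)        # Pair[Int, Int] p = (100, 22)
one_hundred = p[0]   # proposed p[0], instead of p.left
twenty_two = p[1]    # proposed p[1], instead of p.right
```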
As the definition of globbing can vary from shell to shell, and language to language, the WDL spec should really clearly define its version of globbing rather than just giving examples.
Obviously, there are the most basic patterns `*`, `?`, and `[...]`, though the only examples given use `*`.
However, as this seems to be the only form of pattern matching available in WDL, the addition of brace expansion, character classes, and bash-style extended globbing would be excellent.
Allow for WDL like this:
task x {
command {...}
output {
String output = "foobar"
}
}
The output is called `output`, which currently does not work: the WDL parser thinks it is a keyword.
I would like there to be support for the following type of scenario:
task foo {
Array[String] bar
Array[String]? baz
command {
something.py --input ${sep=' ' bar} ${"--optionalInput" + ${sep=' ' baz}}
}
...
As far as I can tell there isn't a way to do this now.
I have the following toy workflow, which I invoke with `java -jar cromwell-25.jar run example.wdl empty_inputs.json`. It reads a tsv that has either one column or two, and scatters a task over each row of the tsv. The task prints the second column if it is present. When the input `fake.tsv` has one column, everything is fine. However, when it has two columns, e.g.
1</TAB>1
2</TAB>2
it fails with "Could not construct array of type WdlMaybeEmptyArrayType(WdlOptionalType(WdlIntegerType)) with this value: List(WdlInteger(1), WdlInteger(2))". (Side question: why is it trying to make a list out of values in two different scattered rows?) Using `Int?` in the conditional instead of `Int` does not make a difference.
Another bizarre twist: if instead of reading in from a file I hardcode the array, the error persists when each row of the array has the same number of columns, but goes away when some rows have two columns and some do not. That is: `Array[Array[Int]] table = [[1,1,1], [2,2]]` works, but `Array[Array[Int]] table = [[1,1], [2,2]]` gives the same error as above.
task printInt {
Int? int
command { echo "${int}" > out.txt }
output { File out = "out.txt" }
}
workflow optional {
Array[Array[Int]] table = read_tsv("fake.tsv")
scatter (row in table) {
if (length(row) == 2) {
Int int = row[1]
}
call printInt {input: int=int }
}
}
Very often I need to slice an array. For instance, in RNA-Seq experiments I have tsv files with the first column as a condition and all subsequent columns as GSM ids for samples. It would be useful to get all elements of an array except the first. I would love to have something either like Scala's head/tail or like Python's slice access for arrays.
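The requested head/tail and slice operations, in Python terms (a sketch of the desired semantics; the row contents are made up):

```python
row = ["condition", "GSM101", "GSM102", "GSM103"]

head = row[0]      # the condition column
tail = row[1:]     # all GSM ids: everything except the first element
middle = row[1:3]  # an arbitrary slice
```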
@jsotobroad commented on Fri Sep 22 2017
It would be nice to have a way of summing up a list of floats. If you have a scatter task whose outputs you then want to gather, it's not straightforward to dynamically size that gather task. This is what we currently do (and if it's stupid, let us know!):
scatter (bam in list_of_bams) {
  call MakeBam {
    input:
      bam = bam
  }
  Float mapped_bam_size = size(MakeBam.output_bam, "GB")
}
call SumFloats {
  input:
    sizes = mapped_bam_size
}
.....
task SumFloats {
Array[Float] sizes
command <<<
python -c "print ${sep="+" sizes}"
>>>
output {
Float total_size = read_float(stdout())
}
runtime {
docker: "python:2.7"
preemptible: preemptible_tries
}
}
@geoffjentry commented on Fri Sep 22 2017
@jsotobroad Please file this in the wdl repo, not the cromwell repo
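The `SumFloats` trick above works because `${sep="+" sizes}` expands to a literal arithmetic expression that the embedded Python then evaluates. The expansion can be sketched as (Python; the sizes values are made up):

```python
sizes = [1.5, 2.25, 3.0]  # hypothetical mapped_bam_size values

# ${sep="+" sizes} expands to the string "1.5+2.25+3.0", and
# `python -c "print 1.5+2.25+3.0"` then evaluates that expression.
expression = "+".join(str(s) for s in sizes)
total_size = eval(expression)  # what read_float(stdout()) would see
```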
The WDL spec specifies that the values in the KV pairs of meta and parameter_meta blocks are strings, but the grammar allows for expressions. This leads to a confusing use of RuntimeAttribute as part of the AST, as evidenced by the discussion around the comment in wdl4s #53.
There doesn't appear to be a reason to use an expression in these blocks (paging @kcibul - do you disagree?), so let's remove that capability, which will also allow us to clean up wdl4s.
This ticket requires both updates to the spec/grammar here as well as cleaning up wdl4s.
It is useful to have files created by tasks/workflows reflect the name of the thing that created them, like
task Foo {
command {
echo "blah blah blah" > Foo_output.txt
...
}
...
}
But right now this name has to be hardcoded into the WDL. It would be cleaner and more robust (e.g. to changes in the workflow or task name) if it could instead be referenced by introspection, such as ${__name__} or something, rather than duplicating the task name as a hardcode in multiple places within the various sections of the WDL body.
I would like to run the `PublicPairedSingleSampleWf_160927` workflow, but I cannot find where to look for the sample data referenced in the `input.json`. Could it be mentioned in the README?
runtime should be treated the same as "command" or "output"
Example:
runtime { docker: "broadinstitute/picard"}
Thanks.
The timestamp was incremented to follow the newest version, but the doc wasn't actually updated. The changes are minimal but should be noted, and the supported versions should be updated.
PublicPairedSingleSampleWf_160927.options.json only includes 2 of the 4 us-central1
zones:
"zones": "us-central1-b us-central1-c"
I'm not sure if there is a rationale for limiting the zones that Cromwell can use to run tasks. The 4 Compute Engine zones in `us-central1` are:
(Tagging @kcibul as I don't know if folks actually look in here regularly)
`Object` is a WDL type, although as far as I can tell it is not defined in the WDL spec at all. One can kind of infer what it is from mentions, but I didn't see any concrete explanation.
This WDL forum post presents a use-case for why a contains style function on arrays could be very useful. It also reflects on the downsides of current workarounds in WDL.
It would be helpful to have a function called "contains" or "exists" which returns a Boolean depending on whether a given value exists/is contained within a given array.
Draft implementation: https://github.com/openwdl/wdl/tree/117-contains
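The requested function is plain membership testing; in Python terms (a sketch; `contains` is the proposed WDL name, and the sample values are made up):

```python
def contains(array, value):
    """Proposed WDL contains(Array[X], X) -> Boolean."""
    return value in array

samples = ["tumor", "normal"]
has_tumor = contains(samples, "tumor")      # expected True
has_control = contains(samples, "control")  # expected False
```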
Truncated from @cjllanwarne
I believe the order of precedence should be:
- Inputs provided by the inputs JSON
- Inputs specified explicitly by the call (i.e. in the workflow doing `call foo {input: inputInt = 6}`)
- The default value in the task
Right now the order appears to be:
- Inputs specified explicitly by the call (i.e. in the workflow doing `call foo {input: inputInt = 6}`)
- The default value in the task
- Inputs provided by the inputs JSON
Full context: see the conversation from the Cromwell repo
It seems like this should be possible but it currently isn't (from what I can tell).
At the Mint meeting I heard these will all be public in the HCA repo (the Skylab repo) and available for anyone to take. These relate to single-cell RNA-Seq. Contacts are the Mint team.
E.g. 10X pipeline WDL
At the least, point users to the repo in our README.
In issue #89 I mentioned a use case of wanting to get a list of 3 kinds of files in a directory
*.png
*.txt
*.html
and store this list into a single output variable. The glob() function is not expressive enough to generate such a list with a single call, but a regular expression-based function would be.
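A regular-expression version of the request, sketched in Python (an illustration of the desired expressiveness, not proposed WDL syntax; the scratch directory is made up):

```python
import os
import re
import tempfile

d = tempfile.mkdtemp()
for name in ["a.png", "b.txt", "c.html", "d.log"]:
    open(os.path.join(d, name), "w").close()

# One pattern matching all three extensions at once, where glob() needs three calls.
pattern = re.compile(r".*\.(png|txt|html)$")
outputs = sorted(n for n in os.listdir(d) if pattern.match(n))
```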
I just used the attached wdl file (saved as .txt so github would attach it) in IntelliJ, and the highlighter seems to find an error with the first '?' in the file. The error pops up as: "WdlTokenType.IDENTIFIER or WdlTokenType.LSQUARE expected" for the '?'.
Any subsequent '?' don't show an error.
The tutorial scripts featured on the website should be mirrored here. They can live in `scripts/tutorials/` (`scripts/` will be created by https://github.com/broadinstitute/wdl/pull/36 when it's merged).