I think we would do the usability of BioJulia a lot of favours if we made it easier to use the file format packages to process HTS files à la seqtk, samtools, etc.
It's currently not as convenient to do, for example, this (compared to seqtk):

```julia
using FASTX
transcribe(open(FASTQ.Reader, "x.fq"), open(FASTA.Writer, "x.fa"))
```

as it is to just do `seqtk seq -a in.fq.gz > out.fa`,
or
```julia
using FASTX
names = Set(...) # the set of record names to keep
open(FASTQ.Reader, "x.fq") do x
    open(FASTQ.Writer, "y.fq") do y
        r = FASTQ.Record()
        while !eof(x)
            read!(x, r)
            if FASTQ.identifier(r) ∈ names
                write(y, r)
            end
        end
    end
end
```

as it is to do `seqtk subseq in.fq name.lst > out.fq`.
One of Julia's strengths, even when you're not using the parts of BioJulia that provide efficient core data structures and algorithms, is that it makes a pretty great glue language. So the ability to very tersely express high-level processing of HTS files is important.
There are a few proposals and options I'm considering to improve this area and wanted to get feedback or other ideas.
1. File handles / string literals

The `open(FASTQ.Reader, ...)` / `open(FASTA.Reader, ...)` pattern gets irritating after a while.
I'd like to propose a file-handle kind of string literal that, when used in BioJulia ecosystem functions, does that for you, e.g.

```julia
using FASTX
transcribe(rdr"x.fq", wtr"x.fa")
```
It could even detect FASTA or FASTQ at macro-expansion time by inspecting the string. These literals could generate the code to open a stream, or just create a filehandle type that causes dispatch to a `transcribe` method that does the appropriate opens.
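To make the idea concrete, here is a minimal sketch of such string literals, assuming deferred opening via lightweight handle types. `ReaderHandle`, `WriterHandle`, and the macro names are hypothetical; none of this is existing FASTX API.

```julia
# Hypothetical handle types: they carry only the path, and the consuming
# function (e.g. `transcribe`) dispatches on them and performs the `open`s.
struct ReaderHandle
    path::String
end

struct WriterHandle
    path::String
end

# Non-standard string literals. Because the path is a literal string at
# macro-expansion time, the macro could even inspect the extension here to
# pick FASTA vs FASTQ.
macro rdr_str(path)
    :(ReaderHandle($path))
end

macro wtr_str(path)
    :(WriterHandle($path))
end

h = rdr"x.fq"   # h.path == "x.fq", h isa ReaderHandle
```

A `transcribe(rdr::ReaderHandle, wtr::WriterHandle)` method would then do the opens internally, keeping user code terse.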
2. A way of tersely expressing stream processing / common HTS file processing
A lot of file processing involves opening a source, doing some work, and writing to a sink. I wonder if we should implement that as a function (or functions), and then express e.g. `transcribe` using that instead.
With 1, this could look something like:

```julia
process(rdr"x.fq", wtr"y.fq") do record
    # ...some stuff
end
```
We could also have a few of these methods, e.g. a `filter`, a `transform`, etc., that make various assumptions about the provided functions.
Or we could just have one more general function: if the provided function returns a record, write it; if it returns `nothing`, don't. That way a user can decide about dropping records in the same function that may also transform them.
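A minimal sketch of that "one general function" idea, with the reader and writer mocked as plain collections (the real version would iterate a FASTX reader and `write` to a FASTX writer; `process` and its return-count are assumptions, not existing API):

```julia
# The user function returns a record to keep (possibly transformed),
# or `nothing` to drop it. Returns the number of records written.
function process(f, reader, writer)
    n = 0
    for record in reader
        result = f(record)
        result === nothing && continue
        push!(writer, result)   # stand-in for `write(writer, result)`
        n += 1
    end
    return n
end

sink = String[]
kept = process(["ACGT", "acgt", "GGCC"], sink) do seq
    # Drop lowercase records, pass the rest through unchanged.
    all(isuppercase, seq) ? seq : nothing
end
# → kept == 2, sink == ["ACGT", "GGCC"]
```

The same function covers filtering, transformation, and both at once, which is what makes it more general than separate `filter`/`transform` methods.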
Another alternative to this would be to do similar to what I did with the DSL in Pseudoseq: https://github.com/bioinfologics/Pseudoseq.jl/blob/master/src/sequencing/DSL.jl
And provide a series of `Iterators`-like types that wrap either a reader or another iterator, and can be plugged into a writer.
So more like:
```julia
# Example of if `write` was extended to have methods that consumed a reader
# or an iterator...

# To achieve transcribe:
write(wtr"out.fa", rdr"in.fq")

# To achieve a filter:
write(wtr"out.fa", Filter(x -> rand(Bool), rdr"in.fa"))

# We could overload e.g. `|>` to make the above a bit nicer, or even:
rdr"in.fa" |> Filter(x -> rand(Bool)) |> wtr"out.fa"

# which, by overloading certain argument combinations of `|>`, could be
# equivalent to:
write(wtr"out.fa", Filter(x -> rand(Bool), rdr"in.fa"))
```
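Such a `Filter` wrapper is cheap to sketch with Julia's iteration protocol. One nicety: if `Filter(pred)` (without an iterator) returns a curried closure, the pipe syntax falls out of Base's existing `|>` with no overloading at all. This `Filter` is hypothetical, shown here over plain integers rather than records:

```julia
# Lazily yields only the elements of `inner` for which `pred` holds.
struct Filter{F,I}
    pred::F
    inner::I
end

# Curried form, so `itr |> Filter(pred)` works with Base's ordinary `|>`.
Filter(pred) = itr -> Filter(pred, itr)

function Base.iterate(f::Filter, state...)
    it = iterate(f.inner, state...)
    while it !== nothing
        item, st = it
        f.pred(item) && return (item, st)
        it = iterate(f.inner, st)
    end
    return nothing
end

Base.IteratorSize(::Type{<:Filter}) = Base.SizeUnknown()

result = collect([1, 2, 3, 4, 5, 6] |> Filter(iseven))
# → [2, 4, 6]
```

An extended `write(writer, itr)` method would then consume any such iterator chain, reader included.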
Basically I'm angling for a kind of API that approaches a DSL on top of the readers and writers.
One such iterator could modify the iteration behaviour of an unwrapped reader to do in-place record reading, rather than the default that allocates a new record every time, e.g.

```julia
Inplace(rdr"in.fa") |> Filter(x -> rand(Bool)) |> wtr"out.fa"
# or perhaps
rdr"in.fa" |> Inplace |> Filter(x -> rand(Bool)) |> wtr"out.fa"
```
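For illustration, a sketch of what `Inplace` could look like: it holds one mutable buffer record and overwrites it on every iteration instead of allocating. Here the reader is mocked as a vector of sequence strings and `MockRecord` stands in for `FASTQ.Record`; the real version would call `read!(reader, buffer)`.

```julia
# Stand-in for a mutable FASTX record.
mutable struct MockRecord
    seq::String
end

# Wraps a reader and reuses a single buffer record across iterations.
struct Inplace{R}
    reader::R
    buffer::MockRecord
end

Inplace(reader) = Inplace(reader, MockRecord(""))

function Base.iterate(ip::Inplace, state=1)
    state > length(ip.reader) && return nothing
    ip.buffer.seq = ip.reader[state]   # `read!(ip.reader, ip.buffer)` in the real thing
    return (ip.buffer, state + 1)
end

Base.IteratorSize(::Type{<:Inplace}) = Base.SizeUnknown()

seqs = String[]
for r in Inplace(["AC", "GT"])
    push!(seqs, r.seq)   # every `r` is the same buffer object
end
# → seqs == ["AC", "GT"]
```

The caveat, of course, is that downstream stages must not retain references to the yielded record, since it is mutated on the next iteration.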
Or, as an alternative to all this, we just try to provide as many individual functions like `filter`, `trim`, `transcode`, etc. as possible, and hope we cover everyone's common needs.
pinging @jakobnissen @CiaranOMara @kescobo