Coder Social home page Coder Social logo

biojulia / fastx.jl Goto Github PK

View Code? Open in Web Editor NEW
61.0 9.0 19.0 1.04 MB

Parse and process FASTA and FASTQ formatted files of biological sequences.

Home Page: https://biojulia.dev

License: MIT License

Julia 100.00%
fasta bio biojulia julia parsing file-format fastq fastq-format fastq-files fasta-format

fastx.jl's Introduction

FASTX

Latest Release MIT license DOI Stable documentation Latest documentation Pkg Status

Description

FASTX provides I/O and utilities for manipulating FASTA and FASTQ, formatted sequence data files.

Installation

You can install FASTX from the julia REPL. Press ] to enter pkg mode, and enter the following:

add FASTX

If you are interested in the cutting edge of the development, please check out the master branch to try new features before release.

Testing

FASTX is tested against Julia 1.X on Linux, OS X, and Windows.

Latest build status:

Unit tests Documentation

Contributing

We appreciate contributions from users including reporting bugs, fixing issues, improving performance and adding new features.

Take a look at the contributing files detailed contributor and maintainer guidelines, and code of conduct.

Backers & Sponsors

Thank you to all our backers and sponsors!

backers

Questions?

If you have a question about contributing or using BioJulia software, come on over and chat to us on the Julia Slack workspace, or you can try the Bio category of the Julia discourse site.

fastx.jl's People

Contributors

bioturbonick avatar blaiseli avatar ciaranomara avatar dehann avatar github-actions[bot] avatar harryscholes avatar jakobnissen avatar kescobo avatar timholy avatar tp2750 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

fastx.jl's Issues

Why export only a few names?

I was teaching using BioJulia code, and someone was confused when doing using FASTX and being able to do sequence(record) and identifier(record), but not hasidentifier.

What's the logic behind what is exported or not exported?

I think we should export all user-facing functions.

Reading fastq from TranscodingStream

I would like to be able to parse gzipped fastq files through a TranscodingStream, but for a start, I'm trying on a non-gzipped one, using the NoopStream.

Based on examples, I came up with the following approach:

julia> using FASTX

julia> using TranscodingStreams

julia> fq_filename = "/tmp/test_qaf_demux_proc/Undetermined.fq"
"/tmp/test_qaf_demux_proc/Undetermined.fq"

julia> open(NoopStream, fq_filename) do stream
           reader = FASTQ.Reader(stream)
       end
ERROR: UndefVarError: stream not defined
Stacktrace:
 [1] #Reader#5(::Nothing, ::Type{FASTX.FASTQ.Reader}, ::TranscodingStream{Noop,IOStream}) at /home/bli/.julia/packages/FASTX/PilAI/src/fastq/reader.jl:27
 [2] FASTX.FASTQ.Reader(::TranscodingStream{Noop,IOStream}) at /home/bli/.julia/packages/FASTX/PilAI/src/fastq/reader.jl:19
 [3] (::getfield(Main, Symbol("##9#10")))(::TranscodingStream{Noop,IOStream}) at ./REPL[11]:2
 [4] open(::getfield(Main, Symbol("##9#10")), ::Type{TranscodingStream{Noop,S} where S<:IO}, ::String) at /home/bli/.julia/packages/TranscodingStreams/MsN8d/src/stream.jl:161
 [5] top-level scope at REPL[11]:1

The following works:

julia> open(NoopStream, fq_filename) do stream
           for line in eachline(stream)
               println(line)
           end
       end

So I don't get why I have ERROR: UndefVarError: stream not defined if instead of reading from the stream, I just pass it to a FASTQ.Reader.

I also tried as follows:

julia> stream = open(NoopStream, fq_filename)
       reader = FASTQ.Reader(stream)
ERROR: MethodError: no method matching open(::Type{TranscodingStream{Noop,S} where S<:IO}, ::String)
Closest candidates are:
  open(::AbstractString, ::AbstractString) at iostream.jl:345
  open(::Function, ::Any...; kwargs...) at iostream.jl:373
  open(::Base.AbstractCmd, ::AbstractString) at process.jl:626
  ...
Stacktrace:
 [1] top-level scope at REPL[12]:1

I'm beginning in Julia, so I'm not really sure I interpret the error messages and source code correctly, but isn't FASTQ.Reader supposed to accept a TranscodingStream ?

function Reader(input::IO; fill_ambiguous = nothing)
    if fill_ambiguous === nothing
        seq_transform = nothing
    else
        seq_transform = generate_fill_ambiguous(fill_ambiguous)
    end
    if !(input isa TranscodingStream)
        stream = TranscodingStreams.NoopStream(input)
    end
    return Reader(State(stream, 1, 1, false), seq_transform)
end

And also, isn't my first error due to stream not being defined in the above code when input isa TranscodingStream is true ?

Your Environment

  • Package Version used: freshly installed as of today with Pkg.add("FASTX")
  • Julia Version used: 1.2.0
  • Operating System and version (desktop or mobile): Xubuntu Linux 16.04

Consider making `.data` fields of records const

Julia 1.8 gives us the ability to declare the data field of records const. Doing this could unlock some optimizations. However, there is currently no way of doing this in a way compatible with old Julia versions.
Consider doing this in a future version of Julia.

Use ReTest or similar niceness?

FASTX is composed of the FASTA and FASTQ modules, and a little bit of common code.
After the test refactoring for v2 (see #37 , implemented in #68 ), it should be relatively straightforward to use shiny new test toys like ReTest so that new features could be selectively tested.

CC @kescobo , who recently showed interest in applying ReTest to Automa, IIRC :)

Is FASTX.jl really only compatible with julia 1.1?

I'm trying to install a Julia application using FASTX in a singularity container. I had a working recipe to make this happen with a Julia 1.2 based container. Now I would like to do this in a Julia 1.0 based contained, but I got into Unsatisfiable requirements detected for package FASTX (see https://discourse.julialang.org/t/solving-restricted-by-julia-compatibility-requirements-to-versions-uninstalled-no-versions-left/29068).

I see that the [compat] section of the Project.toml file has precise version number indications. In particular, I see the following:

julia = "1.1"

What are the effects of these statements? This does not prevent me from using FASTX with Julia 1.2. Could this however have an effect for lower version numbers ?

ERROR: Error when parsing FASTX file. Saw unexpected byte 'C' on line 1

I am using FASTX.jl for the first time. I try to read a FASTA file "Test_txt.txt", which contains

>Acc1
GATC
>Acc2
TGAC

and get the error message mentioned in the title.
The problem is clearly in my file. However, it is readable by other softwares (BioEdit, APE...) able to manage FASTA files. I got the same error message with FASTA files downloaded from the NCBI and with test files saved as MS-DOS or Unicode files. I had a look at the throw_parser_error function in the FASTX.jl source file but it didn't help.
I will appreciate if you could let me know what the "byte 'C' on line 1" might be and how to get rid of it?
Best wishes,
Laurent

julia> using FASTX

julia> validate_fasta(IOBuffer(">Acc1\nGATC\n>Acc2\nTGAC")) === nothing
true

julia> reader = FASTAReader(IOBuffer(">Acc1\nGATC\n>Acc2\nTGAC"))
FASTX.FASTA.Reader{TranscodingStreams.NoopStream{IOBuffer}}(TranscodingStreams.NoopStream{IOBuffer}(<mode=idle>), 1, 1, nothing, FASTX.FASTA.Record:
  description: ""
     sequence: "", true)

julia> collect(reader)
2-element Vector{FASTX.FASTA.Record}:
 FASTX.FASTA.Record:
  description: "Acc1"
     sequence: "GATC"
 FASTX.FASTA.Record:
  description: "Acc2"
     sequence: "TGAC"

julia> validate_fasta(IOBuffer("C:/Users/Laurent/Desktop/Test_txt.txt")) === nothing
false

julia> reader = FASTAReader(IOBuffer("C:/Users/Laurent/Desktop/Test_txt.txt"))
FASTX.FASTA.Reader{TranscodingStreams.NoopStream{IOBuffer}}(TranscodingStreams.NoopStream{IOBuffer}(<mode=idle>), 1, 1, nothing, FASTX.FASTA.Record:
  description: ""
     sequence: "", true)

julia> collect(reader)
ERROR: Error when parsing FASTX file. Saw unexpected byte 'C' on line 1
Stacktrace:
  [1] error(s::String)
    @ Base .\error.jl:35
  [2] throw_parser_error(data::Vector{UInt8}, p::Int64, line::Int64)
    @ FASTX C:\Users\Laurent\.julia\packages\FASTX\9Dngy\src\FASTX.jl:124
  [3] macro expansion
    @ C:\Users\Laurent\.julia\packages\FASTX\9Dngy\src\fasta\readrecord.jl:102 [inlined]
  [4] readrecord!(stream::TranscodingStreams.NoopStream{IOBuffer}, record::FASTX.FASTA.Record, state::Tuple{Int64, Int64})
    @ FASTX.FASTA C:\Users\Laurent\.julia\packages\Automa\5enCH\src\Stream.jl:124
  [5] _read!
    @ C:\Users\Laurent\.julia\packages\FASTX\9Dngy\src\fasta\reader.jl:104 [inlined]
  [6] iterate(rdr::FASTX.FASTA.Reader{TranscodingStreams.NoopStream{IOBuffer}}, state::Nothing)
    @ FASTX.FASTA C:\Users\Laurent\.julia\packages\FASTX\9Dngy\src\fasta\reader.jl:79
  [7] iterate
    @ C:\Users\Laurent\.julia\packages\FASTX\9Dngy\src\fasta\reader.jl:79 [inlined]
  [8] _collect(cont::UnitRange{Int64}, itr::FASTX.FASTA.Reader{TranscodingStreams.NoopStream{IOBuffer}}, #unused#::Base.HasEltype, isz::Base.SizeUnknown)
    @ Base .\array.jl:718
  [9] collect(itr::FASTX.FASTA.Reader{TranscodingStreams.NoopStream{IOBuffer}})
    @ Base .\array.jl:707
 [10] top-level scope
    @ REPL[14]:1

Your Environment

  • Package Version used: 2.1.2
  • Julia Version used: 1.9.0
  • Operating System and version (desktop or mobile): Windows 10 (64-bit)

Release version 1.1

@BioJuliaBot register

Release notes:

[1.1.0] - 2019-08-07

Added

  • Base.copyto! methods for copying record data to LongSequences.
  • FASTA.seqlen & FASTQ.seqlen for getting the length of a sequence in a record.

Changed

  • Use BioSequence.jl v2.0 or higher.
  • Use TranscodingStreams v0.9.5.

More conservative autodetection of sequence type

DNA/RNA sequences with just a few ambiguous characters will be detected as amino acid sequences.
Perhaps a better scheme is to check if all characters are translatable in first DNA, then RNA, then AA.

Documenter clearer that writers/readers take ownership of underlying IO

Hi,

I am trying to write out fastq files generated from a Nanopore MINION machine. There seems to be issues with sequences not being written correctly, sometimes. There definitely appears to be an issue encoding/decoding the quality from these files. Nanopore claims they use the Sanger standard PHRED encoding, which seems to be supported by the FASTX package.

Actually upon checking, there appears to be issues with the FASTA writer as well. The demo below fails to write the FASTA file. However the converted sequences println fine??

CODE:

using FASTX
using BioSequences
using ArgParse


@doc """
Take a fastq file and convert U's to T's because nanopore annoying outputs U's

input -> {inp -> path to fastq ; out -> path to output file} 
output <- new fastq file at path "out"
""" -> 
function uToT(in::IO,out::IO)
    reader,writer = FASTQ.Reader(in),FASTQ.Writer(out)
    for record ∈ reader
        seq = sequence(LongRNA{2},record)
        seq = convert(LongDNA{2},seq)
        write(writer,FASTQ.Record(identifier(record),description(record),seq,
        quality(record,:sanger))) # must specify this for nanopore! 

    end 
    return 
end 

s = ArgParseSettings()
@add_arg_table s begin
    "-i"
        help = "path to input fastq"
        arg_type = String
    "-o"
        help = "output fastq path"
        arg_type = String

end 

args = parse_args(s)

inp = open(args["i"])
out = open(args["o"],"w")
uToT(inp,out)
close(inp)
close(out)

Short demo file. When you run the code on this, nothing will save.

@56bcbdb8-4aa0-45b0-ac5e-34e8635c6d0e runid=a83ffa912f48cb1dcfd0618644311df0055deded sampleid=1 read=11254 ch=35 start_time=2022-02-19T00:36:07Z
CCGGUAGAGGAACCACACCAAUACCCCGAACUGGUGGUAAACUCUACUGCGGUGACGGCAUUGUAGGAGGUCCUGCGGAAAAAAAUAGCUCGACAGGAUAAAAAAAAU
+
&,(%&*3/31:840(0&'*-%%%'+.65>>*1/.-,)./123-*&%.>=642854,*'**&7:2738A447.187<64=>=:7--06<79748-0++358:85-,+'&
@296e2952-5ee9-4ac0-bb90-da39e2e2261e runid=a83ffa912f48cb1dcfd0618644311df0055deded sampleid=1 read=10784 ch=167 start_time=2022-02-19T00:31:52Z
AAGGCGUAAGAGGAACCACUCAAUCCAUCCGAACACUGGUGGUUAAACUCUACUGCGGUGACGAUACUGUGAGGGAGGUCCUGCGGAAAAUAGCUCGACGCCAGGAAAAAAGC
+
'9;;.%'(/4/0(25/;3:-3668157.164/%-#'4<9=,6A1<<<5'$'((0/,**-3*15<5(/.)-)+())&,0$,'45897>B=8318.-678+0131+&--*(&%$$
@65602197-98f5-4d51-a4eb-656be3af2d05 runid=a83ffa912f48cb1dcfd0618644311df0055deded sampleid=1 read=11565 ch=25 start_time=2022-02-19T00:39:33Z
CAGGCGUAGAGGAACCACACCAAUCCAUCCCGAACUUGGUGGUUAAACUCUACUGCGGUGACGAUACUGUAGGGAGGUCCUGCGGAAAAAUAGCUUGUGCCAGGAAAAAAU
+
#334-)&293558:0/'>8<:;9540.(+998??:;?;9<53084*)&&%/2;:78/-/98-37646225&0=A?7766,567,):469'8:951'(&%--+))'58/&&&

Different slicing methods for FASTX.FASTA.sequences produces inconsistent behavior

Expected Behavior

sequence(record::Record, part::UnitRange{Int}]) is equivalent to sequence(record::Record)[part::UnitRange{Int}]

Current Behavior

# behaves as expected
a = sequence(record::Record, part::UnitRange{Int}]
reverse_complement(reverse_complement(a)) == a

# seems unsafe and taking the reverse complement mutates the sequence irreversibly 
b = sequence(record::Record)[part::UnitRange{Int}]
reverse_complement(reverse_complement(b)) != b

Possible Solution / Implementation

I'm really not sure what's going on here. I think it may be an issue of pass by reference vs pass by copy?

Steps to Reproduce (for bugs)

import Pkg
pkgs = [
"BioSequences",
"FASTX",
]
Pkg.add(pkgs)
for pkg in pkgs
    eval(Meta.parse("import $pkg"))
end
Pkg.status(pkgs)
test = FASTX.FASTA.Record("test", BioSequences.randdnaseq(rand(UInt8)))
R = UnitRange(sort([rand(1:FASTX.FASTA.seqlen(test)) for i in 1:2])...)
a = FASTX.FASTA.sequence(test, R)
b = FASTX.FASTA.sequence(test)[R]

# true
@assert a == b

# true
@assert typeof(a) == typeof(b)

# true
@assert BioSequences.reverse_complement(BioSequences.reverse_complement(a)) == a

# false but should be true
@assert BioSequences.reverse_complement(BioSequences.reverse_complement(b)) == b

Context

Taking the reverse complement of the subsequences was mutating the underlying sequence and introducing erroneous results into my analyses

Your Environment

  • Package Version used:
  [7e6ae17a] BioSequences v2.0.4
  [c2308a5c] FASTX v1.1.2
  • Julia Version used: 1.2
  • Operating System and version (desktop or mobile): Ubuntu 18.04 LTS
  • Link to your project: N/A

This may actually be an issue with BioSequences but I'm not sure. Any ideas with what is going on here?

Next release of TranscodingStreams will likely break this package

FASTX.jl/src/fasta/index.jl

Lines 230 to 234 in 868d702

# Disturbingly, there is no API to get the absolute position of
# an Automa machine operating on a stream. We ought to fix this.
# This workaround works ONLY for a NoopStream,
# and relies on abusing the internals.
buffer_offset = buffer.transcoded - buffer.marginpos + 1

Why does this code use the internals of TranscodingStreams?

I will soon remove the transcoded field from the TranscodingStreams.Buffer type, so this package will be broken.

Have accesor functions return views?

FASTQ.Record and FASTA.Record stores its data as Vector{UInt8}. Accessing the data, like identifier, quality etc, however, copy the data and return new heap-allocation structs.

I think it would be nicer if all these functions returned views into the original data by default:

  • It's faster, and that really does matter for FASTX data
  • I think it makes mores sense semtantically: "This is the identifier of the record", not a copy of the identifier.

For strings, we can use the StringView package. For quality and sequences, we can have lazy iterator functions.

We might also want to do this for XAM.jl.

Subsetting a FASTQ record

Expected Behavior

It would be great to be able to subset a fastq record using normal string subsetting syntax like this:

s1[3:6]

Current Behavior

We can subset sequnce and quality separately like this:

julia> s1 = first(open(FASTQ.Reader, "/tmp/s1.fq"))
julia> sequence(s1, 3:6)
4nt DNA Sequence:
AGTT
julia> quality(s1, 33, 3:6)
4-element Array{UInt8,1}:
 0x02
 0x24
 0x24
 0x24

But using normal string subsetting syntax does not work:

julia> s1[3:6]
ERROR: MethodError: no method matching getindex(::FASTX.FASTQ.Record, ::UnitRange{Int64})

It would be useful if this returned a new FASTQ-record with the sequnce and quality of the range.

Possible Solution / Implementation

I'll like to try and add a getindex method for this.
Will you be willing to accept a pull request?

(@v1.5) pkg> st FASTX
Status `/tmp/d1/Project.toml`
  [c2308a5c] FASTX v1.1.3

read! doesn't work in v2.0.0?

Thank you for building a nice package.

When I migrated from FASTX.jl v1.3.0 to v2.0.0, an error occurred due to the behavior of read!
I couldn't find any docs or tests on read! in v2.0.0, so I'm confused about whether this is a deprecated or a bug.

Expected Behavior

In v1.3.0

julia> open(FASTQ.Reader, "./test/test.fastq") do x
           r = FASTQ.Record()
           while !eof(x)
               read!(x,r)
               @show r
           end
       end

returns

r = FASTX.FASTQ.Record:
   identifier: ABC
  description: DEF
     sequence: TAGC
      quality: UInt8[0x29, 0x29, 0x29, 0x29]

Current Behavior

But, in v2.0.0, the code above returns nothing.

r = FASTQ.Record:
  description: ""
     sequence: ""
      quality: ""

Steps to Reproduce (for bugs)

I got the test file from here

Context

In ./src/fastq/reader.jl, read! seems to do nothing for rec.

function Base.read!(rdr::Reader, rec::Record)
    (cs, f) = _read!(rdr, rdr.record)
    if !f
        cs == 0 && throw(EOFError())
        throw(ArgumentError("malformed FASTQ file"))
    end    
    return rec
end

Is it expected behavior or a bug?
best.

Bug in `Reader` with stream

I did

reader = FASTQ.Reader(GzipDecompressorStream(open(fastq)))

I get UndefVarError: stream not defined. The bug is visible in the relevant code, where stream is only defined inside one branch.

"""
function Reader(input::IO; fill_ambiguous = nothing)
if fill_ambiguous === nothing
seq_transform = nothing
else
seq_transform = generate_fill_ambiguous(fill_ambiguous)
end
if !(input isa TranscodingStream)
stream = TranscodingStreams.NoopStream(input)
end
return Reader(State(stream, 1, 1, false), seq_transform)
end
"""

I think you just want

"""
function Reader(input::IO; fill_ambiguous = nothing)
if fill_ambiguous === nothing
seq_transform = nothing
else
seq_transform = generate_fill_ambiguous(fill_ambiguous)
end
if !(input isa TranscodingStream)
stream = TranscodingStreams.NoopStream(input)
else
stream = input
end
return Reader(State(stream, 1, 1, false), seq_transform)
end
"""

I'd be happy to put this in a PR if that would be helpful.

This template is rather extensive. Fill out all that you can, if are a new contributor or you're unsure about any section, leave it unchanged and a reviewer will help you 😄. This template is simply a tool to help everyone remember the BioJulia guidelines, if you feel anything in this template is not relevant, simply delete it.

Expected Behavior

Current Behavior

Possible Solution / Implementation

Steps to Reproduce (for bugs)

Context

Your Environment

  • Package Version used:
  • Julia Version used:
  • Operating System and version (desktop or mobile):
  • Link to your project:

Make BioSequences an optional dependency

Julia just merged to master the ability to add optional dependencies. We should move BioSequences to an optional dependency, since it's only used for a few functions that not everybody uses.

This will only land in Julia 1.10, so we would need to bump Julia compat. We should wait a few years.

Refactor tests

The tests are fairly thorough, but I think they would be better if they were factored out into more testsets. It would make it easier to add new unit tests. I've though this every time when looking at the tests, so just making the issue here so I don't forget.

When (if) the refactoring ever happens, this would also be a good chance to push the coverage a little higher.

Unexpected behaviour when reading FASTA files with read!(reader, record)

Setup

using FASTX

write("seqs.fa",
"""
>a b
VFQTFCRRTCCFARYDRFIR
>c d
SCHKIKTFTYPTMKRFQGEH
""")

Expected Behavior

Using the read!(reader, record) way of reading FASTA files (https://biojulia.net/FASTX.jl/stable/manual/fasta/)

function readfasta(fasta)
    records = FASTA.Record[]
    open(FASTA.Reader, fasta) do reader
        record = FASTA.Record()
        while !eof(reader)
            read!(reader, record)
            push!(records, record)
        end
    end
    return records
end

I would expect:

julia> readfasta("seqs.fa")
2-element Array{FASTX.FASTA.Record,1}:
 FASTX.FASTA.Record:
   identifier: a
  description: b
     sequence: VFQTFCRRTCCFARYDRFIR
 FASTX.FASTA.Record:
   identifier: c
  description: d
     sequence: SCHKIKTFTYPTMKRFQGEH

Current Behavior

However, all entries in the resulting array are for the final record in the file.

julia> readfasta("seqs.fa")
2-element Array{FASTX.FASTA.Record,1}:
 FASTX.FASTA.Record:
   identifier: c
  description: d
     sequence: SCHKIKTFTYPTMKRFQGEH
 FASTX.FASTA.Record:
   identifier: c
  description: d
     sequence: SCHKIKTFTYPTMKRFQGEH

Possible Solution / Implementation

I gather that this might be the 'correct' behaviour, but it is a massive gotcha. One way I've found to get this to work is to copy the record within the loop:

function readfasta(fasta)
    records = FASTA.Record[]
    open(FASTA.Reader, fasta) do reader
        record = FASTA.Record()
        while !eof(reader)
            read!(reader, record)
            push!(records, copy(record))
        end
    end
    return records
end

If this is the correct behaviour, maybe we could add a note in the docs to show how the overwriting can be avoided.

NB this problem is not encountered if you 'do work' with the record, then push it to an array, e.g.:

function readfasta(fasta)
    records = FASTA.Record[]
    open(FASTA.Reader, fasta) do reader
        record = FASTA.Record()
        while !eof(reader)
            read!(reader, record)
            record_no_desc = FASTA.Record(FASTA.identifier(record), FASTA.sequence(record))
            push!(records, record_no_desc)
        end
    end
    return records
end

Context

Reading through very large FASTA files and selecting records that meet some condition e.g. the identifier is in some set of ids that I want to keep e.g.:

function filterfasta(fasta, ids)
    records = FASTA.Record[]
    open(FASTA.Reader, fasta) do reader
        record = FASTA.Record()
        while !eof(reader)
            read!(reader, record)
            if FASTA.identifier(record) in ids
                push!(records, copy(record))
            end
        end
    end
    return records
end

Your Environment

(FASTARecordReaderBug.jl) pkg> st
    Status `~/Dropbox/Projects/FASTARecordReaderBug.jl/Project.toml`
  [c2308a5c] FASTX v1.1.0 #fd09319 (https://github.com/BioJulia/FASTX.jl.git)

julia> versioninfo()
Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i7-8569U CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = atom  -a
  JULIA_NUM_THREADS = 4

Issue with QualityEncoding for the Solexa format

Helloo!

I think I've discovered an issue with the implementation of the QualityEncoding that could affect the handling of quality scores in the Solexa/Illumina 1.0 format.

QualityEncoding doesn't support negative quality scores:

The Solexa/Illumina 1.0 format uses ASCII 59 through 126 to represent scores between -5 and 62. However, the current QualityEncoding constructor throws an error when the low end of the quality encoding range is less than the offset: This can be seen in src/FASTQ/quality.jl:

elseif low < offset
    error("Low end of in quality encoding range cannot be less than offset")

This may lead to errors when reading Solexa/Illumina 1.0 files with negative quality scores. I suggest removing these two lines, as it's an unnecessary contraint that only applies when creating a QualityEncoding, and actually prevents us from decoding certain formats properly and creating encodings for them.

Incorrect first character in SOLEXA_QUAL_ENCODING:

I also noticed that the first character for Solexa encoding is ASCII 64 '@', when it should be ASCII 59 ';'. This can be seen further down in src/FASTQ/quality.jl:

"Solexa (Solexa+64) quality score encoding"
const SOLEXA_QUAL_ENCODING     = QualityEncoding('@':'~', 64)

"Illumina 1.3 (Phred+64) quality score encoding"
const ILLUMINA13_QUAL_ENCODING = QualityEncoding('@':'~', 64)

If the ';' character was used, the current version would've thrown that low < offset error.

Examples

Current behavior:

julia> FASTQ.decode_quality(QualityEncoding('@':'~', 64), Int(';')) # current built-in Solexa encoding breaks
ERROR: Quality 59 not in encoding range 64:126

julia> FASTQ.decode_quality(QualityEncoding(';':'~', 64), Int(';')) # current constructor doesn't allow creating the correct encoding
ERROR: Low end of in quality encoding range cannot be less than offset

julia> collect(quality_scores(FASTQRecord("","ACGT",";?@A"), :solexa))
ERROR: Quality 59 not in encoding range 64:126

After removing the low < offset check and changing the first character of the Solexa encoding:

julia> FASTQ.decode_quality(QualityEncoding(';':'~', 64), Int(';'))

julia> collect(quality_scores(FASTQRecord("","ACGT",";?@A"), :solexa))
4-element Vector{Int64}:
 -5
 -1
  0
  1

Cheers!

Document `read!(::Reader, ::Record)`

Sorry, I'm a bit confused by the doc (or lack thereof), is that one correct way one has to read a fastq file, say if I want to read N reads (I could use while next !== nothing too) ? I think adding an example in the docs would be good.

reader = FASTQ.Reader(GzipDecompressorStream(open(file)) ; copy=false)

r = first(reader)

for k = 1:N

    # do something here

    next = iterate(reader)
    isnothing(next) && break
    r = next[1]
end

close(reader)

Also the old read!(reader, r) doesn't crash but does nothing, it should be an error now, no ?

Is our FASTA format idiosyncratic?

I usually work with files where the header does not conform to the pattern >{identifier} {description}. For example:
>A/New York/499/2004 | NP | A / H3N2 | 2004-01-15 | EPI_ISL_5131. It's clear the identifier isn't really A/New.

It's annoyed me for a long time that I've had to have a lot of boilerplate code like:

id = identifier(rec)
header = hasdescription(rec) ? id * " " * description(rec) : id
# do stuff with header

I also encountered that a lot in my old job with different files. Could it be just a mistake in the database I got the files from? When browsing the web, I can't actually find an (authoritative) source that states that the FASTA format has a description and the identifier separated. Instead, most of them just refer to the header, or identifier, or description as the whole line after >.

Is it possible that the idea of a separate identifier and description came from databases like Genbank, who incidentally began each header line with the accession number?

Considering that the current separation into identifier and description actually destroys information by not storing the separating whitespace, the original header can't even be reconstructed. Would it make sense to instead store the record like

mutable struct Record
    filled::UnitRange{Int}
    header::UnitRange{Int}
    sequence::UnitRange{Int}
    data::Vector{UInt8}
end

?

package does not precompile

With julia versions 1.9.0 and 1.9.1 the package does not precompile and is not usable under linux - see output below.

I tested it with julia version 1.8.5 under linux and it precompiles and runs.
It also runs on Mac (1.8.5 and 1.9.x).

Thanks for taking care.

test_FASTX) pkg> test FASTX
     Testing FASTX
      Status `/scratch/local/jl_d17NoU/Project.toml`
  [47718e42] BioGenerics v0.1.2
  [7e6ae17a] BioSequences v3.1.4
  [c2308a5c] FASTX v2.1.0 `~/.julia/dev/FASTX`
  [3372ea36] FormatSpecimens v1.1.0
  [e0db7c4e] ReTest v0.3.2
  [354b36f9] StringViews v1.3.1
  [3bb67fe8] TranscodingStreams v0.9.13
  [9a3f8284] Random `@stdlib/Random`
      Status `/scratch/local/jl_d17NoU/Manifest.toml`
  [67c07d97] Automa v0.8.2
  [47718e42] BioGenerics v0.1.2
  [7e6ae17a] BioSequences v3.1.4
  [3c28c6f8] BioSymbols v5.1.3
  [c2308a5c] FASTX v2.1.0 `~/.julia/dev/FASTX`
  [3372ea36] FormatSpecimens v1.1.0
  [bd334432] InlineTest v0.2.0
  [aea7be01] PrecompileTools v1.1.2
  [21216c6a] Preferences v1.4.0
  [e0db7c4e] ReTest v0.3.2
  [fdea26ae] SIMD v3.4.5
  [7b38b023] ScanByte v0.3.3
  [354b36f9] StringViews v1.3.1
  [3bb67fe8] TranscodingStreams v0.9.13
  [7200193e] Twiddle v1.1.2
  [0dad84c5] ArgTools v1.1.1 `@stdlib/ArgTools`
  [56f22d72] Artifacts `@stdlib/Artifacts`
  [2a0f44e3] Base64 `@stdlib/Base64`
  [ade2ca70] Dates `@stdlib/Dates`
  [8ba89e20] Distributed `@stdlib/Distributed`
  [f43a241f] Downloads v1.6.0 `@stdlib/Downloads`
  [7b1f6079] FileWatching `@stdlib/FileWatching`
  [b77e0a4c] InteractiveUtils `@stdlib/InteractiveUtils`
  [b27032c2] LibCURL v0.6.3 `@stdlib/LibCURL`
  [76f85450] LibGit2 `@stdlib/LibGit2`
  [8f399da3] Libdl `@stdlib/Libdl`
  [56ddb016] Logging `@stdlib/Logging`
  [d6f4376e] Markdown `@stdlib/Markdown`
  [ca575930] NetworkOptions v1.2.0 `@stdlib/NetworkOptions`
  [44cfe95a] Pkg v1.9.0 `@stdlib/Pkg`
  [de0858da] Printf `@stdlib/Printf`
  [3fa0cd96] REPL `@stdlib/REPL`
  [9a3f8284] Random `@stdlib/Random`
  [ea8e919c] SHA v0.7.0 `@stdlib/SHA`
  [9e88b42a] Serialization `@stdlib/Serialization`
  [6462fe0b] Sockets `@stdlib/Sockets`
  [fa267f1f] TOML v1.0.3 `@stdlib/TOML`
  [a4e569a6] Tar v1.10.0 `@stdlib/Tar`
  [8dfed614] Test `@stdlib/Test`
  [cf7118a7] UUIDs `@stdlib/UUIDs`
  [4ec0a83e] Unicode `@stdlib/Unicode`
  [deac9b47] LibCURL_jll v7.84.0+0 `@stdlib/LibCURL_jll`
  [29816b5a] LibSSH2_jll v1.10.2+0 `@stdlib/LibSSH2_jll`
  [c8ffd9c3] MbedTLS_jll v2.28.2+0 `@stdlib/MbedTLS_jll`
  [14a3606d] MozillaCACerts_jll v2022.10.11 `@stdlib/MozillaCACerts_jll`
  [83775a58] Zlib_jll v1.2.13+0 `@stdlib/Zlib_jll`
  [8e850ede] nghttp2_jll v1.48.0+0 `@stdlib/nghttp2_jll`
  [3f19e933] p7zip_jll v17.4.0+0 `@stdlib/p7zip_jll`
Precompiling project...
  ✗ FASTX
  ✗ FASTX → BioSequencesExt
  14 dependencies successfully precompiled in 35 seconds. 3 already precompiled.

ERROR: The following 1 direct dependency failed to precompile:

FASTX [c2308a5c-f048-11e8-3e8a-31650f418d12]

Failed to precompile FASTX [c2308a5c-f048-11e8-3e8a-31650f418d12] to "/home/arndt/.julia/compiled/v1.9/FASTX/jl_B443Ke".
LLVM ERROR: Do not know how to split the result of this operator!


[116426] signal (6.-6): Aborted
in expression starting at /home/arndt/.julia/dev/FASTX/src/workload.jl:3
__pthread_kill_implementation at /scratch/local/bee-buczek/glibc/glibc-2.36-1/source/nptl/pthread_kill.c:44
raise at /scratch/local/bee-buczek/glibc/glibc-2.36-1/source/signal/../sysdeps/posix/raise.c:26
abort at /scratch/local/bee-buczek/glibc/glibc-2.36-1/source/stdlib/abort.c:79
_ZN4llvm18report_fatal_errorERKNS_5TwineEb at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm18report_fatal_errorEPKcb at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm16DAGTypeLegalizer17SplitVectorResultEPNS_6SDNodeEj at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm16DAGTypeLegalizer3runEv at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm12SelectionDAG13LegalizeTypesEv at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm16SelectionDAGISel17CodeGenAndEmitDAGEv at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm16SelectionDAGISel20SelectAllBasicBlocksERKNS_8FunctionE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm16SelectionDAGISel20runOnMachineFunctionERNS_15MachineFunctionE.part.902 at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN12_GLOBAL__N_115X86DAGToDAGISel20runOnMachineFunctionERN4llvm15MachineFunctionE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm19MachineFunctionPass13runOnFunctionERNS_8FunctionE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm13FPPassManager13runOnFunctionERNS_8FunctionE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm13FPPassManager11runOnModuleERNS_6ModuleE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm6legacy15PassManagerImpl3runERNS_6ModuleE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc14SimpleCompilerclERNS_6ModuleE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
operator() at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/jitlayers.cpp:1206
_ZN4llvm3orc14IRCompileLayer4emitESt10unique_ptrINS0_29MaterializationResponsibilityESt14default_deleteIS3_EENS0_16ThreadSafeModuleE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc16IRTransformLayer4emitESt10unique_ptrINS0_29MaterializationResponsibilityESt14default_deleteIS3_EENS0_16ThreadSafeModuleE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
emit at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/jitlayers.cpp:631
_ZN4llvm3orc31BasicIRLayerMaterializationUnit11materializeESt10unique_ptrINS0_29MaterializationResponsibilityESt14default_deleteIS3_EE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc19MaterializationTask3runEv at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm6detail18UniqueFunctionBaseIvJSt10unique_ptrINS_3orc4TaskESt14default_deleteIS4_EEEE8CallImplIPFvS7_EEEvPvRS7_ at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc16ExecutionSession22dispatchOutstandingMUsEv at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc16ExecutionSession17OL_completeLookupESt10unique_ptrINS0_21InProgressLookupStateESt14default_deleteIS3_EESt10shared_ptrINS0_23AsynchronousSymbolQueryEESt8functionIFvRKNS_8DenseMapIPNS0_8JITDylibENS_8DenseSetINS0_15SymbolStringPtrENS_12DenseMapInfoISF_vEEEENSG_ISD_vEENS_6detail12DenseMapPairISD_SI_EEEEEE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc25InProgressFullLookupState8completeESt10unique_ptrINS0_21InProgressLookupStateESt14default_deleteIS3_EE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc16ExecutionSession19OL_applyQueryPhase1ESt10unique_ptrINS0_21InProgressLookupStateESt14default_deleteIS3_EENS_5ErrorE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc16ExecutionSession6lookupENS0_10LookupKindERKSt6vectorISt4pairIPNS0_8JITDylibENS0_19JITDylibLookupFlagsEESaIS8_EENS0_15SymbolLookupSetENS0_11SymbolStateENS_15unique_functionIFvNS_8ExpectedINS_8DenseMapINS0_15SymbolStringPtrENS_18JITEvaluatedSymbolENS_12DenseMapInfoISI_vEENS_6detail12DenseMapPairISI_SJ_EEEEEEEEESt8functionIFvRKNSH_IS6_NS_8DenseSetISI_SL_EENSK_IS6_vEENSN_IS6_SV_EEEEEE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc16ExecutionSession6lookupERKSt6vectorISt4pairIPNS0_8JITDylibENS0_19JITDylibLookupFlagsEESaIS7_EERKNS0_15SymbolLookupSetENS0_10LookupKindENS0_11SymbolStateESt8functionIFvRKNS_8DenseMapIS5_NS_8DenseSetINS0_15SymbolStringPtrENS_12DenseMapInfoISK_vEEEENSL_IS5_vEENS_6detail12DenseMapPairIS5_SN_EEEEEE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc16ExecutionSession6lookupERKSt6vectorISt4pairIPNS0_8JITDylibENS0_19JITDylibLookupFlagsEESaIS7_EENS0_15SymbolStringPtrENS0_11SymbolStateE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc16ExecutionSession6lookupENS_8ArrayRefIPNS0_8JITDylibEEENS0_15SymbolStringPtrENS0_11SymbolStateE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc16ExecutionSession6lookupENS_8ArrayRefIPNS0_8JITDylibEEENS_9StringRefENS0_11SymbolStateE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
addModule at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/jitlayers.cpp:1420
jl_add_to_ee at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/jitlayers.cpp:1815
_jl_compile_codeinst at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/jitlayers.cpp:241
jl_generate_fptr_impl at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/jitlayers.cpp:460
jl_compile_method_internal at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2348 [inlined]
jl_compile_method_internal at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2237
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2750 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2940
macro expansion at /home/arndt/.julia/dev/FASTX/src/workload.jl:37 [inlined]
macro expansion at /home/arndt/.julia/packages/PrecompileTools/0yi7r/src/workloads.jl:74 [inlined]
macro expansion at /home/arndt/.julia/dev/FASTX/src/workload.jl:9 [inlined]
top-level scope at /home/arndt/.julia/packages/PrecompileTools/0yi7r/src/workloads.jl:136
jl_toplevel_eval_flex at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:903
jl_toplevel_eval_flex at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:856
ijl_toplevel_eval_in at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:971
eval at ./boot.jl:370 [inlined]
include_string at ./loading.jl:1899
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2940
_include at ./loading.jl:1959
include at ./Base.jl:457
jfptr_include_31033 at /project/evogen-computing/julia/julia-1.9.1/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
jl_f__call_latest at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/builtins.c:774
include at /home/arndt/.julia/dev/FASTX/src/FASTX.jl:1
unknown function (ip: 0x7fa0aa703582)
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
do_call at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/interpreter.c:126
eval_value at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/interpreter.c:226
eval_stmt_value at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/interpreter.c:177 [inlined]
eval_body at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/interpreter.c:624
jl_interpret_toplevel_thunk at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/interpreter.c:762
jl_toplevel_eval_flex at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:912
jl_eval_module_expr at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:203 [inlined]
jl_toplevel_eval_flex at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:715
jl_toplevel_eval_flex at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:856
ijl_toplevel_eval_in at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:971
eval at ./boot.jl:370 [inlined]
include_string at ./loading.jl:1899
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2940
_include at ./loading.jl:1959
include at ./Base.jl:457 [inlined]
include_package_for_output at ./loading.jl:2045
jfptr_include_package_for_output_32423 at /project/evogen-computing/julia/julia-1.9.1/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
do_call at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/interpreter.c:126
eval_value at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/interpreter.c:226
eval_stmt_value at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/interpreter.c:177 [inlined]
eval_body at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/interpreter.c:624
jl_interpret_toplevel_thunk at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/interpreter.c:762
jl_toplevel_eval_flex at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:912
jl_toplevel_eval_flex at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:856
ijl_toplevel_eval_in at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:971
eval at ./boot.jl:370 [inlined]
include_string at ./loading.jl:1899
include_string at ./loading.jl:1909 [inlined]
exec_options at ./client.jl:305
_start at ./client.jl:522
jfptr__start_29455 at /project/evogen-computing/julia/julia-1.9.1/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
true_main at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/jlapi.c:573
jl_repl_entrypoint at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/jlapi.c:717
main at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/cli/loader_exe.c:59
__libc_start_call_main at /scratch/local/bee-buczek/glibc/glibc-2.36-1/source/csu/../sysdeps/nptl/libc_start_call_main.h:58
__libc_start_main_impl at /scratch/local/bee-buczek/glibc/glibc-2.36-1/source/csu/../csu/libc-start.c:381
unknown function (ip: 0x401098)
Allocations: 9101736 (Pool: 9092749; Big: 8987); GC: 10
Stacktrace:
 [1] pkgerror(msg::String)
   @ Pkg.Types /project/evogen-computing/julia/julia-1.9.1/share/julia/stdlib/v1.9/Pkg/src/Types.jl:69
 [2] precompile(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; internal_call::Bool, strict::Bool, warn_loaded::Bool, already_instantiated::Bool, timing::Bool, kwargs::Base.Pairs{Symbol, Base.TTY, Tuple{Symbol}, NamedTuple{(:io,), Tuple{Base.TTY}}})
   @ Pkg.API /project/evogen-computing/julia/julia-1.9.1/share/julia/stdlib/v1.9/Pkg/src/API.jl:1581
 [3] precompile(pkgs::Vector{Pkg.Types.PackageSpec}; io::Base.TTY, kwargs::Base.Pairs{Symbol, Bool, Tuple{Symbol}, NamedTuple{(:warn_loaded,), Tuple{Bool}}})
   @ Pkg.API /project/evogen-computing/julia/julia-1.9.1/share/julia/stdlib/v1.9/Pkg/src/API.jl:156
 [4] precompile(; name::Nothing, uuid::Nothing, version::Nothing, url::Nothing, rev::Nothing, path::Nothing, mode::Pkg.Types.PackageMode, subdir::Nothing, kwargs::Base.Pairs{Symbol, Bool, Tuple{Symbol}, NamedTuple{(:warn_loaded,), Tuple{Bool}}})
   @ Pkg.API /project/evogen-computing/julia/julia-1.9.1/share/julia/stdlib/v1.9/Pkg/src/API.jl:171
 [5] top-level scope
   @ none:6
     Testing Precompilation of test environment failed. Continuing to tests
     Testing Running tests...
LLVM ERROR: Do not know how to split the result of this operator!


[116507] signal (6.-6): Aborted
in expression starting at /home/arndt/.julia/dev/FASTX/src/workload.jl:3
__pthread_kill_implementation at /scratch/local/bee-buczek/glibc/glibc-2.36-1/source/nptl/pthread_kill.c:44
raise at /scratch/local/bee-buczek/glibc/glibc-2.36-1/source/signal/../sysdeps/posix/raise.c:26
abort at /scratch/local/bee-buczek/glibc/glibc-2.36-1/source/stdlib/abort.c:79
_ZN4llvm18report_fatal_errorERKNS_5TwineEb at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm18report_fatal_errorEPKcb at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm16DAGTypeLegalizer17SplitVectorResultEPNS_6SDNodeEj at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm16DAGTypeLegalizer3runEv at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm12SelectionDAG13LegalizeTypesEv at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm16SelectionDAGISel17CodeGenAndEmitDAGEv at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm16SelectionDAGISel20SelectAllBasicBlocksERKNS_8FunctionE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm16SelectionDAGISel20runOnMachineFunctionERNS_15MachineFunctionE.part.902 at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN12_GLOBAL__N_115X86DAGToDAGISel20runOnMachineFunctionERN4llvm15MachineFunctionE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm19MachineFunctionPass13runOnFunctionERNS_8FunctionE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm13FPPassManager13runOnFunctionERNS_8FunctionE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm13FPPassManager11runOnModuleERNS_6ModuleE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm6legacy15PassManagerImpl3runERNS_6ModuleE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc14SimpleCompilerclERNS_6ModuleE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
operator() at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/jitlayers.cpp:1206
_ZN4llvm3orc14IRCompileLayer4emitESt10unique_ptrINS0_29MaterializationResponsibilityESt14default_deleteIS3_EENS0_16ThreadSafeModuleE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc16IRTransformLayer4emitESt10unique_ptrINS0_29MaterializationResponsibilityESt14default_deleteIS3_EENS0_16ThreadSafeModuleE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
emit at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/jitlayers.cpp:631
_ZN4llvm3orc31BasicIRLayerMaterializationUnit11materializeESt10unique_ptrINS0_29MaterializationResponsibilityESt14default_deleteIS3_EE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc19MaterializationTask3runEv at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm6detail18UniqueFunctionBaseIvJSt10unique_ptrINS_3orc4TaskESt14default_deleteIS4_EEEE8CallImplIPFvS7_EEEvPvRS7_ at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc16ExecutionSession22dispatchOutstandingMUsEv at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc16ExecutionSession17OL_completeLookupESt10unique_ptrINS0_21InProgressLookupStateESt14default_deleteIS3_EESt10shared_ptrINS0_23AsynchronousSymbolQueryEESt8functionIFvRKNS_8DenseMapIPNS0_8JITDylibENS_8DenseSetINS0_15SymbolStringPtrENS_12DenseMapInfoISF_vEEEENSG_ISD_vEENS_6detail12DenseMapPairISD_SI_EEEEEE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc25InProgressFullLookupState8completeESt10unique_ptrINS0_21InProgressLookupStateESt14default_deleteIS3_EE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc16ExecutionSession19OL_applyQueryPhase1ESt10unique_ptrINS0_21InProgressLookupStateESt14default_deleteIS3_EENS_5ErrorE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc16ExecutionSession6lookupENS0_10LookupKindERKSt6vectorISt4pairIPNS0_8JITDylibENS0_19JITDylibLookupFlagsEESaIS8_EENS0_15SymbolLookupSetENS0_11SymbolStateENS_15unique_functionIFvNS_8ExpectedINS_8DenseMapINS0_15SymbolStringPtrENS_18JITEvaluatedSymbolENS_12DenseMapInfoISI_vEENS_6detail12DenseMapPairISI_SJ_EEEEEEEEESt8functionIFvRKNSH_IS6_NS_8DenseSetISI_SL_EENSK_IS6_vEENSN_IS6_SV_EEEEEE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc16ExecutionSession6lookupERKSt6vectorISt4pairIPNS0_8JITDylibENS0_19JITDylibLookupFlagsEESaIS7_EERKNS0_15SymbolLookupSetENS0_10LookupKindENS0_11SymbolStateESt8functionIFvRKNS_8DenseMapIS5_NS_8DenseSetINS0_15SymbolStringPtrENS_12DenseMapInfoISK_vEEEENSL_IS5_vEENS_6detail12DenseMapPairIS5_SN_EEEEEE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc16ExecutionSession6lookupERKSt6vectorISt4pairIPNS0_8JITDylibENS0_19JITDylibLookupFlagsEESaIS7_EENS0_15SymbolStringPtrENS0_11SymbolStateE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc16ExecutionSession6lookupENS_8ArrayRefIPNS0_8JITDylibEEENS0_15SymbolStringPtrENS0_11SymbolStateE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
_ZN4llvm3orc16ExecutionSession6lookupENS_8ArrayRefIPNS0_8JITDylibEEENS_9StringRefENS0_11SymbolStateE at /project/evogen-computing/julia/julia-1.9.1/bin/../lib/julia/libLLVM-14jl.so (unknown line)
addModule at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/jitlayers.cpp:1420
jl_add_to_ee at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/jitlayers.cpp:1815
_jl_compile_codeinst at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/jitlayers.cpp:241
jl_generate_fptr_impl at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/jitlayers.cpp:460
jl_compile_method_internal at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2348 [inlined]
jl_compile_method_internal at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2237
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2750 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2940
macro expansion at /home/arndt/.julia/dev/FASTX/src/workload.jl:37 [inlined]
macro expansion at /home/arndt/.julia/packages/PrecompileTools/0yi7r/src/workloads.jl:74 [inlined]
macro expansion at /home/arndt/.julia/dev/FASTX/src/workload.jl:9 [inlined]
top-level scope at /home/arndt/.julia/packages/PrecompileTools/0yi7r/src/workloads.jl:136
jl_toplevel_eval_flex at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:903
jl_toplevel_eval_flex at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:856
ijl_toplevel_eval_in at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:971
eval at ./boot.jl:370 [inlined]
include_string at ./loading.jl:1899
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2940
_include at ./loading.jl:1959
include at ./Base.jl:457
jfptr_include_31033 at /project/evogen-computing/julia/julia-1.9.1/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
jl_f__call_latest at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/builtins.c:774
include at /home/arndt/.julia/dev/FASTX/src/FASTX.jl:1
unknown function (ip: 0x7efdb69033b2)
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
do_call at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/interpreter.c:126
eval_value at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/interpreter.c:226
eval_stmt_value at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/interpreter.c:177 [inlined]
eval_body at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/interpreter.c:624
jl_interpret_toplevel_thunk at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/interpreter.c:762
jl_toplevel_eval_flex at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:912
jl_eval_module_expr at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:203 [inlined]
jl_toplevel_eval_flex at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:715
jl_toplevel_eval_flex at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:856
ijl_toplevel_eval_in at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:971
eval at ./boot.jl:370 [inlined]
include_string at ./loading.jl:1899
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2940
_include at ./loading.jl:1959
include at ./Base.jl:457 [inlined]
include_package_for_output at ./loading.jl:2045
jfptr_include_package_for_output_32443 at /project/evogen-computing/julia/julia-1.9.1/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
do_call at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/interpreter.c:126
eval_value at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/interpreter.c:226
eval_stmt_value at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/interpreter.c:177 [inlined]
eval_body at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/interpreter.c:624
jl_interpret_toplevel_thunk at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/interpreter.c:762
jl_toplevel_eval_flex at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:912
jl_toplevel_eval_flex at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:856
ijl_toplevel_eval_in at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/toplevel.c:971
eval at ./boot.jl:370 [inlined]
include_string at ./loading.jl:1899
include_string at ./loading.jl:1909 [inlined]
exec_options at ./client.jl:305
_start at ./client.jl:522
jfptr__start_29455 at /project/evogen-computing/julia/julia-1.9.1/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
true_main at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/jlapi.c:573
jl_repl_entrypoint at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/jlapi.c:717
main at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/cli/loader_exe.c:59
__libc_start_call_main at /scratch/local/bee-buczek/glibc/glibc-2.36-1/source/csu/../sysdeps/nptl/libc_start_call_main.h:58
__libc_start_main_impl at /scratch/local/bee-buczek/glibc/glibc-2.36-1/source/csu/../csu/libc-start.c:381
unknown function (ip: 0x401098)
Allocations: 9102979 (Pool: 9093983; Big: 8996); GC: 14
ERROR: LoadError: Failed to precompile FASTX [c2308a5c-f048-11e8-3e8a-31650f418d12] to "/home/arndt/.julia/compiled/v1.9/FASTX/jl_bNz5in".
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, keep_loaded_modules::Bool)
    @ Base ./loading.jl:2296
  [3] compilecache
    @ ./loading.jl:2163 [inlined]
  [4] _require(pkg::Base.PkgId, env::String)
    @ Base ./loading.jl:1805
  [5] _require_prelocked(uuidkey::Base.PkgId, env::String)
    @ Base ./loading.jl:1660
  [6] macro expansion
    @ ./loading.jl:1648 [inlined]
  [7] macro expansion
    @ ./lock.jl:267 [inlined]
  [8] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:1611
  [9] include(fname::String)
    @ Base.MainInclude ./client.jl:478
 [10] top-level scope
    @ ~/.julia/dev/FASTX/test/runtests.jl:8
 [11] include(fname::String)
    @ Base.MainInclude ./client.jl:478
 [12] top-level scope
    @ none:6
in expression starting at /home/arndt/.julia/dev/FASTX/test/FASTXTests.jl:1
in expression starting at /home/arndt/.julia/dev/FASTX/test/runtests.jl:8
ERROR: Package FASTX errored during testing

(test_FASTX) pkg> 

Migrate to new Automa release

Just a reminder to upgrade to the new Automa release when it is eventually done (may not be until the end of 2022).

Things to remember to include, if possible:

  • Check new improved error messages work with various kinds of malformed FASTX files/records
  • Is it possible to parse a FASTX from a string without using IO objects as intermediate (if possible, no big deal, see #80 )
  • Benchmark that it does not cause significant speed regressions

malformed FASTQ file

Hi,

I started using FASTX today and ran into the following issue.

Expected Behavior

I unzipped sra_data.fastq.gz and wanted to read it record by record via

r = FASTX.FASTQ.Reader(open("sra_data.fastq"))
for record in r
<do stuff>
end

Current Behavior

However, I'm getting the following error message: ArgumentError: malformed FASTQ file at line 367849. I looked at the corresponding line but could not spot any abvious problems.

Context

Among other things, I will be converting the FASTQ records into FASTA records eventually.

Your Environment

  • Package Version used: v1.1.1
  • Julia Version used: 1.6
  • Operating System and version (desktop or mobile): Ubuntu (20.04.1)

Deal better with missingness

To make an old discussion [1] more concrete:

FASTX.jl has an issue with missingness. Accessing a missing record throws an error, accessing a missing sequence return an empty sequence, and accessing a missing identifier returns nothing. We should strive to unify these. But unify them to what?

It is clear to me that a missing sequence returning an empty sequence is plain wrong. Not only is it wrong, it's also obnoxiously dangerous - it presents an annoying edge case. So agrees @SabrinaJaye. That leaves two options: Throwing an error, or returning a sentinel value like nothing.

The argument in favor of error-throwing is that you are guaranteed to only get either the kind of object or expect.
The argument in favor of nothing is that it's much easier to recover from (check for nothing) than from an error (use try catch).

However, quite often, perhaps most of the time, the user is put in a situation where they don't really know whether e.g. the header is actually filled in or not. In those cases, the argument that the user is guaranteed to always get an AbstractString feels very academic, because their program could instead crash. Worse, this cannot easily be checked for statically.

The exception is when attempting to do something that is actually not allowed, like attempting to construct an invalid Record.

So what I propose is:

  • We remove the concept of a "filled" record. For backwards compatibility, we just let isfilled always return true. Attempting to construct an empty record throws an error
  • Accessing any optional field return nothing if they are not present. Only: FASTA.description and FASTQ.description may be missing.

FASTA

  • FASTA files always have an identifier, but it may be the empty string.
  • FASTA files may or may not have a description. It has a description if there is whitespace AND non-whitespace after the identifier. If there is whitespace immediately after the > symbol, then non-whitespace, the identifier is empty and there is a description.

Checklist

  • - FASTA/Q header and identifier always return a StringView
  • - FASTA/Q descritption returns nothing or StringView
  • - FASTA/Q sequence is always non-empty and always return a sequence
  • - FASTQ quality is always non-empty and returns a an AbstractVector
  • - Constructors check for non-empty mandatory fields. Make isfilled always true.
  • - Remove checkfilled and has* functions

I'll make a PR with these changes soon.

1: https://github.com/orgs/BioJulia/teams/developers/discussions/2

ERROR: EOFError: read end of file

Hi! I got an error when I tried to folloe the content of fastq.md to read a fastq file

Expected Behavior

I expect that I can read my fastq successfully when I use the sample codes from the doc ()

julia> reader = open(FASTQ.Reader, "mydata.fastq")
       record = FASTQ.Record()
       while !eof(reader)
           read!(reader, record)
       end

Current Behavior

I got an error

julia> open(FASTQ.Reader, "mydata.fastq") do reader
               record = FASTQ.Record()
               while !eof(reader)
                   read!(reader, record)
               end
         end

ERROR: EOFError: read end of file
Stacktrace:
 [1] read!(::FASTX.FASTQ.Reader{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}, ::FASTX.FASTQ.Record) at /home/godkin/.julia/packages/FASTX/Uoya3/src/fastq/reader.jl:46
 [2] (::var"#7#8")(::FASTX.FASTQ.Reader{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}) at ./REPL[42]:4
 [3] open(::var"#7#8", ::Type{FASTX.FASTQ.Reader}, ::String; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/godkin/.julia/packages/BioGenerics/cCuGr/src/IO.jl:48
 [4] open(::Function, ::Type{FASTX.FASTQ.Reader}, ::String) at /home/godkin/.julia/packages/BioGenerics/cCuGr/src/IO.jl:46
 [5] top-level scope at REPL[42]:1

Possible Solution / Implementation

I tried to use for-loop to read the fastq, and it works.

julia> reader = open(FASTQ.Reader, "mydata.fastq")
julia> records = FASTQ.Record[]
0-element Array{FASTX.FASTQ.Record,1}

julia> for rec in reader
           push!(records, copy(rec))
       end

Your Environment

  • Package Version used: FASTX v1.1.3
  • Julia Version used: v1.4.2
  • Operating System and version (desktop or mobile): Pop!_OS 20.04

Installed Packages

(nanojulia) pkg> status
Status `~/Projects/nanojulia/Project.toml`
  [c7e460c6] ArgParse v1.1.0
  [336ed68f] CSV v0.7.6
  [a93c6f00] DataFrames v0.21.5
  [c2308a5c] FASTX v1.1.3
  [b98c9c47] Pipe v1.3.0

High level file processing

I think we would do the usability of BioJulia a lot of favours if we made it easier to use the file format packages to do the processing of HTS files a la seqtk, samtools etc.

It's currently not as convenient to do, for example - comparing to seqtk:

using FASTX
transcribe(open(FASTQ.Reader, "x.fq"), open(FASTA.Reader, "x.fa"))

As it is to just do seqtk seq -a in.fq.gz > out.fa

or

using FASTX

names = Set(...names names names...)

open(FASTQ.Reader, "x.fq") do x
    open(FASTQ.Writer, "y.fq") do y
        r = FASTQ.Record()
        while !eof(x)
        read!(x, r)
        if FASTQ.identifier(r)  names
            write(y, FASTA.Record(r))
        end
    end   
end

As it is to do seqtk subseq in.fq name.lst > out.fq

One of julias strengths - if you're not using the part of BioJulia that provides efficient core data structures and algorithms, is the aspects that make it a pretty great glue language. So the ability to very tersely express the processing of HTS files at a high level is important.

There are a few proposals and options I'm considering to improve this area and wanted to get feedback or other ideas.

  1. File handles / string literals
    The open(FASTQ/A.Reader pattern gets irritating after a while.
    I'd like to propose a filehandle kind of string literal that, when used in BioJulia ecosystem functions, does that for you. e.g.
using FASTX
transcribe(rdr"x.fq", wtr"x.fa")

It could even detect FASTA or FASTQ at the metaprogramming stage by inspecting the string. These could generate the code to open a stream, or just create a filehandle type that causes dispatch to use a transcribe function that does the appropriate opens.

  1. A way of tersely expressing stream processing / common HTS file processing

A lot of file processing involves opening a source, doing some stuff and writing to a sink. I wonder if we should implement that as a function(s), and then express e.g. transcribe, using that instead.

With 1, this could look something like:

process(rdr"x.fq", wtr"y.fq") do record
    ...some stuff
end

We could also have a few of these methods e.g. a filter, transform, etc that make various assumptions about the functions.
Or we could just have a more general function that, for e.g. if the provided function returns a record, write it, if it returns nothing, don't. That way a user can make a decision about dropping records in the same function as it may transform records.

Another alternative to this would be to do similar to what I did with the DSL in Pseudoseq: https://github.com/bioinfologics/Pseudoseq.jl/blob/master/src/sequencing/DSL.jl

And provide a series of Iterators like types, that wrap either a reader, or another iterator, that can be plugged into a writer.

So more like:

# Example of if `write` was extended to have methods that consumed a reader or an iterator...
# To achieve transcribe
write(wtr"out.fa", rdr"in.fq")

# To achieve a filter
write(wtr"out.fa", Filter(x -> rand(Bool), rdr"in.fa"))

# We could overload e.g. `|>` to make the above a bit nicer.
# Or even
rdr"in.fa" |> Filter( x -> rand(Bool) ) |> wtr"out.fa"

# Which overloading certain argument combinations for |> could be the equivalent to:
write(wtr"out.fa", Filter(x -> rand(Bool), rdr"in.fa"))

Basically angling for a kinda API that approaches a DSL on top of the readers and writers.

One such iterator could be one that modifies the iteration behaviour of an unwrapped reader, to do in-place record reading
over the default that allocates a new copy every time. e.g.

Inplace(rdr"in.fa") |> Filter( x -> rand(Bool) ) |> wtr"out.fa"
# or perhaps
rdr"in.fa" |> Inplace |> Filter( x -> rand(Bool) ) |> wtr"out.fa"

Or an alternative to this - we just try to provide as many individual functions like filter, trim, transcode etc. as possible and hope we cover everyone's common needs.

pinging @jakobnissen @CiaranOMara @kescobo

Make constructing records less horribly inefficient

In v2, the way records are constructed (directly, not parsed from a file) is not very nice, which hints at an underlying problem. Luckily this is internal and can be solved in a non-breaking feature release.

The idea is that we only want one source of what is a valid FASTX file, so when a user constructs a record from e.g. a string, we use the Automa parser also. Two issues with this:

First, the Automa parser only works on a TranscodingStream due to the way Automa generates the code. This means that to parse a bytevector like a string, we need to needlessly wrap it in a NoopStream(IOBuffer(data)), which creates way more overhead than needed.
To construct records from raw parts this is even more roundabout, since we first print the parts to an IOBuffer, then convert to a bytearray, then back to an IOBuffer.

The second issue is what happens if the user does parse(FASTARecord, ">A\nG\n>A\nG"). Clearly this should error as there are two records. But the Automa machine simply returns that it found a record. What the code does now is to try to load another record, then throw an error if that succeeds. This is simply bad design. [EDIT: I've changed that to instead use NoopStream's internals. It's better design, but still not great]

A better solution would be to somehow make parsing of a record from a byte buffer the single fundamental operation, then have the IO-based code operate on top of that. This is pretty tricky and requires a rework of Automa (but it would also make the Automa code way easier to reason about, so it should probably happen)

Consider despecializing AbstractFormattedIOs

With Julia 1.9, package image caching significantly reduces latency, as entire function can now be completely cached.
FASTX's readers and writers are parameterized by the underlying IO type. This means different underlying IO types causes the entire Automa-generated code to be recompiled, needlessly. This takes about half a second.

We might consider somehow despecializing the readers and writers (AbstractFormattedIOs, AFIOs) on their underlying IO. This will cause a dynamic dispatch whenever the AFIOs run out of buffer and need to query their underlying IOs, but these operations are already slow, so I suspect impact would be minimal (earlier tests of mine showed insignificant slowdown when removing the type parameter of FASTAReader completely). This will make the code type unstable, but allow precompilation.

subset FASTX.FASTQ.Record

There does not seem to be a convenient operation to subset the record directly!

julia> FASTX.FASTQ.Record("@name","ATC","!!!")[1:2]
ERROR: MethodError: no method matching getindex(::FASTX.FASTQ.Record, ::UnitRange{Int64})

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

FASTX.Records need a single-line representation

A vector of records display like:

3-element Vector{FASTX.FASTA.Record}:
 FASTX.FASTA.Record:
   identifier: A-seg1
  description: <missing>
     sequence: AGCAAAAGCAGGTCAATTATATTCAGTATGGAAAGAATA…
 FASTX.FASTA.Record:
   identifier: A-seg2
  description: <missing>
     sequence: AGCAAAAGCAGGCAAACCATTTGAATGGATGTCAATCCG…
 FASTX.FASTA.Record:
   identifier: A-seg3
  description: <missing>
     sequence: AGCAAAAGCAGGTACTGATTCGAAATGGAAGATTTTGTG…

It would be nicer if it was something like

3-element Vector{FASTX.FASTA.Record}:
 FASTX.FASTA.Record("A-seg1", nothing, AGCAAAAGCAGGTCAATTATATTCAGTATGGAAAGAATA…)
 FASTX.FASTA.Record("A-seg2", nothing, AGCAAAAGCAGGCAAACCATTTGAATGGATGTCAATCCG…)
 FASTX.FASTA.Record("A-seg3", nothing, AGCAAAAGCAGGTACTGATTCGAAATGGAAGATTTTGTG…)

Multiline FASTQ records?

According to FormatSpecimens, multiline FASTQ is a thing (googling around seem to confirm this). According to the philosophy of "be liberal in what you parse, conservative in what you produce", it would be nice to handle these. At the very least it's not optimal to keep a list of valid FASTQ files in FormatSpecimens, but then not actually be able to parse them.

However, it turns out that supporting multiline FASTQ using an FSM based parser is surprisingly hard.

The reason is that @ is allowed in a quality line. So, when encountering \n@ after a quality line, should the FSM transition to the header state, or just treat it as another quality line? Jesus, the guys who designed the FASTX formats...

The only way to get around this is to manually check, when encountering \n@ after a quality line, if the current number of parsed quality symbols is less than the number of parsed sequence symbols. One can use Automa's preconditions to do this (however, these needs to be fixed up first, see BioJulia/Automa.jl#102).

Ideally, the check should only happen once, on the @ byte, and not for every transition within the quality line regex, as preconditions currently does.

Erratum in documentation

I wanted to write a sequence to a file in FASTA format according to the example in the documentation

Current Behavior

ERROR: LoadError: MethodError: no method matching (::var"#7#8")(::FASTX.FASTA.Writer{TranscodingStreams.NoopStream{IOStream}})

Possible Solution / Implementation

The w after the do needs to added:

using BioSequences
x = dna"aaaaatttttcccccggggg"
rec = FASTA.Record("MySeq", x)
open(FASTA.Writer, "my-out.fasta") do w
    write(w, rec)
end

Steps to Reproduce (for bugs)

Run the snippet from the example

Context

Was trying to run the example from the docs

LoadError: UndefVarError: LongSequence not defined during precompilation step

I get an error message at the precompilation step of FASTX when using Julia version 1.4 on macOS. Interestingly enough, I only get the error when typing using FASTX and not when running a Julia script with the same statement. I am relatively new to Julia, so I may be missing something essential.

Steps to Reproduce (for bugs)

Just type the following:

using FASTX

Error Message

[ Info: Precompiling FASTX [c2308a5c-f048-11e8-3e8a-31650f418d12]
ERROR: LoadError: LoadError: LoadError: UndefVarError: LongSequence not defined
Stacktrace:
 [1] getproperty(::Module, ::Symbol) at ./Base.jl:26
 [2] top-level scope at /Users/feli/.julia/packages/FASTX/wcfDB/src/fasta/record.jl:203
 [3] include(::Module, ::String) at ./Base.jl:377
 [4] include(::String) at /Users/feli/.julia/packages/FASTX/wcfDB/src/fasta/fasta.jl:4
 [5] top-level scope at /Users/feli/.julia/packages/FASTX/wcfDB/src/fasta/fasta.jl:18
 [6] include(::Module, ::String) at ./Base.jl:377
 [7] include(::String) at /Users/feli/.julia/packages/FASTX/wcfDB/src/FASTX.jl:1
 [8] top-level scope at /Users/feli/.julia/packages/FASTX/wcfDB/src/FASTX.jl:7
 [9] include(::Module, ::String) at ./Base.jl:377
 [10] top-level scope at none:2
 [11] eval at ./boot.jl:331 [inlined]
 [12] eval(::Expr) at ./client.jl:449
 [13] top-level scope at ./none:3
in expression starting at /Users/feli/.julia/packages/FASTX/wcfDB/src/fasta/record.jl:203
in expression starting at /Users/feli/.julia/packages/FASTX/wcfDB/src/fasta/fasta.jl:18
in expression starting at /Users/feli/.julia/packages/FASTX/wcfDB/src/FASTX.jl:7
ERROR: Failed to precompile FASTX [c2308a5c-f048-11e8-3e8a-31650f418d12] to /Users/feli/.julia/compiled/v1.4/FASTX/ZwJmk_PXjlH.ji.
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1272
 [3] _require(::Base.PkgId) at ./loading.jl:1029
 [4] require(::Base.PkgId) at ./loading.jl:927
 [5] require(::Module, ::Symbol) at ./loading.jl:922

Your Environment

My Julia Environment

Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = "/Applications/Visual Studio Code.app/Contents/Resources/app/bin/code"
  JULIA_NUM_THREADS = 1

Installed Packages

  "Bio"          => v"1.0.0"
  "BioSequences" => v"1.1.0"
  "ArgParse"     => v"1.1.0"

copyto!

We need the ability to conveniently copy sequence from a FASTQ or FASTA record to a BioSequences.jl sequence.

Currently you can make new sequences, but can't copy over to existing sequences.

Non-indexed FASTA: access sequence by genomic region

Hi,

Indexed FASTA have extract to quickly get a sequence. I could not find a way to do the same without an index. Could such a method be added ?

Expected Behavior

For example, with
sequence("chr1", 10, 1000, reader)

Possible Solution / Implementation

My attempt was

function sequence(chrom, x, y, reader)
    r= first(reader)
    seq = nothing
    while iterate(reader) != nothing
        if identifier(r) == chrom
            seq = FASTX.sequence(r)[x:y]
            break
        end
    end
    return seq
end

Your Environment

  • Package Version used: 2.1.2
  • Julia Version used: 1.9.2
  • Operating System and version (desktop or mobile): Gentoo

Thanks,

Performance comparison with Rust

I was wondering how fast this package is, so I made a small comparison with rust's fastq library. On v1.6, for fastq files rust is about 3-4 times faster, while for fastq.gz it's only 1.6x (I guess decompressing becomes more of a bottleneck). I think I've used every tricks to make the Julia code as fast as possible, have I missed something ?

While I think the performances are pretty good (1.6x doesn't matter that much), improvement are always welcome. It seems Julia's version is still allocating a lot of memory (1.07 M allocations: 225.188 MiB, 0.11% gc time, for a 100MB fastq.gz file), to me it seems the rust library is not really validating or converting anything, it just load the data and memory and goes through it. While have nicely converted records is nice, maybe a more "bare-bone" API could be provided when performance is really crucial, what do you think ?

My rust code (ends up being simpler than the Julia code):

use fastq::{parse_path, Record};

extern crate fastq;

fn main() {
    let filename = String::from("test.fastq.gz");

    let path = Some(filename);

    let mut total: u64 = 0;
    let mut tot_qual: u64 = 0; 
    parse_path(path, |parser| {
        parser.each(|record| {

            let s = record.seq();
            let q = record.qual();

            for (is, iq) in s.iter().zip(q.iter()) {
                if *is == b'A'{
                    total += 1;
                    tot_qual += (*iq -0x21) as u64;
                }
            }
            true
        }).expect("Invalid fastq file");
    }).expect("Invalid compression");
    println!("{:.15}", tot_qual as f64 / total as f64);
}

And Julia :

using FASTX, CodecZlib, BioSymbols, BioSequences

function count_reads(fasta)
    #reader = FASTQ.Reader(open(fasta))
    reader = FASTQ.Reader(GzipDecompressorStream(open(fasta)))
    record = FASTQ.Record()
    N, tot_qual = 0, 0
    seq = LongDNASeq(200)
    
    while !eof(reader)
        read!(reader, record)
        q = FASTQ.quality(record)
        copyto!(seq, record)

        @inbounds for i=1:length(q)
            if seq[i] == DNA_A
                N += 1
                tot_qual += q[i]
            end
        end
    end
    close(reader)
    tot_qual/N
end

fasta = "test.fastq.gz"

@time count_reads(fasta)

Converting FASTQ to FASTA

Hi again,

I was looking for the easiest way to convert FASTQ into FASTA (just dropping quality information). Perhaps I missed it, but there seems to be no direct conversion function (neither for whole files nor for individual records, i.e. FASTQ.Record -> FASTA.Record). The way I came up with is

r = FASTX.FASTQ.Reader(open("data.fastq", "r"))
w = FASTX.FASTA.Writer(open("data.fasta", "w"))
for record in r
    seq = FASTQ.sequence(record)
    id = FASTQ.identifier(record)
    record_fasta = FASTX.FASTA.Record(id, seq)
    write(w, record_fasta)
end
close(r)
close(w)

Which seems a bit error-prone and inconvenient compared to e.g. biopython:

using PyCall
SeqIO = pyimport("Bio.SeqIO")
res = SeqIO.convert("data.fastq", "fastq", "data.fasta", "fasta")

My questions are:

  1. Does this functionality exist and I just missed it?
  2. If not, is it within scope of this package and perhaps worthwhile adding?

so slowly of extract sequence by coord

function getseq(chrom,start,stop)
           record=open(FASTA.Reader,"Project/DataBase/hg38.fa",index="Project/DataBase/hg38.fa.fai")
           sequence(record[chrom])[start:stop]
           close(record)
end
@time getseq("chr1",2345,2356)

if use samtools faidx ,it is much faster

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.