noodles's Issues

issues with noodles-bcf

Hi, I think the record size for bcf is off. With this code:

use noodles_bcf as bcf;
use std::fs::File;
use std::io::Write;

fn main() {
    let path = "bug.bcf";

    let mut bcf = File::open(path).map(bcf::Reader::new).expect("couldn't open bcf");
    let mut stderr = std::io::stderr();

    for r in bcf.records() {
        let _record = r.expect("error getting record");
        writeln!(stderr, "made it").expect("couldn't write to stderr");
    }
}

and the attached file (gunzip it first), I get an error. The error is different (invalid record length) if the file is bigger.
bug.bcf.gz

Here is the error with that file:

thread 'main' panicked at 'error getting record: Custom { kind: UnexpectedEof, error: "failed to fill whole buffer" }', src/main.rs:13:26

With a larger file, it reads two records and then shows:

thread 'main' panicked at 'error getting record: Custom { kind: InvalidData, error: "invalid record length: l_shared = 2147484031, l_indiv = 2147484031" }', src/main.rs:13:26

Clarify SAM record position in docs

It might be a good idea to mention in the docs that the supplied position parameter for the SAM record builder must be 1-based. It may also be worthwhile to mention somewhere that the SAM position is automatically converted to 0-based when written out as a BAM record.
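For illustration, the conversion itself is just an offset; this is plain arithmetic, not the noodles API:

fn sam_to_bam_pos(sam_position: i32) -> i32 {
    // SAM positions are 1-based; BAM stores 0-based positions.
    sam_position - 1
}

fn main() {
    assert_eq!(sam_to_bam_pos(1), 0); // first base of the reference sequence
}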

fails to compile

Hi,
I am trying to use this library as a replacement for rust-htslib, but I am getting a compilation error while trying to run the examples. This can be reproduced on both macOS and CentOS.

> git clone https://github.com/zaeleus/noodles
> cd noodles/
> cargo run --release --example bam_write > sample.bam
    Updating crates.io index
  Downloaded byteorder v1.3.4
  Downloaded bit-vec v0.6.2
  Downloaded indexmap v1.5.0
  Downloaded flate2 v1.0.16
  Downloaded hashbrown v0.8.1
  Downloaded miniz_oxide v0.4.0
  Downloaded cfg-if v0.1.10
  Downloaded adler v0.2.3
  Downloaded libc v0.2.73
  Downloaded crc32fast v1.2.0
   Compiling autocfg v1.0.0
   Compiling crc32fast v1.2.0
   Compiling libc v0.2.73
   Compiling cfg-if v0.1.10
   Compiling byteorder v1.3.4
   Compiling adler v0.2.3
   Compiling bitflags v1.2.1
   Compiling bit-vec v0.6.2
   Compiling miniz_oxide v0.4.0
   Compiling hashbrown v0.8.1
   Compiling indexmap v1.5.0
   Compiling flate2 v1.0.16
   Compiling noodles-bgzf v0.1.0 (/home/mayakond/noodles/noodles-bgzf)
error[E0599]: no associated item named `MAX` found for type `u16` in the current scope
 --> noodles-bgzf/src/virtual_position.rs:6:56
  |
6 | pub(crate) const MAX_UNCOMPRESSED_POSITION: u16 = u16::MAX; // 2^16 - 1;
  |                                                        ^^^ associated item not found in `u16`
  |
help: you are looking for the module in `std`, not the primitive type
  |
6 | pub(crate) const MAX_UNCOMPRESSED_POSITION: u16 = std::u16::MAX; // 2^16 - 1;
  |                                                   ^^^^^^^^^^^^^

error: aborting due to previous error

For more information about this error, try `rustc --explain E0599`.
error: could not compile `noodles-bgzf`.
warning: build failed, waiting for other jobs to finish...
error: build failed

Buffered reading of BGZF breaks calculation of virtual positions.

While experimenting, I ran into a quirk in the bgzf module: when working with a reader of type io::BufReader<bgzf::Reader<R>>, calls to reader.get_ref().virtual_position() give incorrect results. A little MWE to illustrate:

use std::io::{self, BufRead, Read, Write};

use noodles_bgzf as bgzf;

fn log(line: &str, virtual_position: &bgzf::VirtualPosition) {
    println!(
        "Line '{}', compressed offset {}; uncompressed offset {}.",
        line, virtual_position.compressed(), virtual_position.uncompressed()
    );
}

fn main() -> io::Result<()> {
    // Setup

    let input = b"line1\nline2\nline3\nline4\nline5\n";

    let mut writer = bgzf::Writer::new(Vec::new());
    writer.write_all(input)?;
    let data = writer.finish()?;

    // Plain reader (this works as expected)

    println!("Plain reader:");

    let mut reader = bgzf::Reader::new(data.as_slice());
    let mut virtual_position = reader.virtual_position();
    let mut buf = [0u8; 6];

    while let Ok(_) = reader.read_exact(&mut buf) {
        let line = std::str::from_utf8(&buf[0..5]).expect("cannot parse UTF8");
        log(line, &virtual_position);
        virtual_position = reader.virtual_position();
    }

    println!();

    // Buffered reader (this does not work as expected)

    println!("Buffered reader:");

    let mut reader = io::BufReader::new(bgzf::Reader::new(data.as_slice()));

    loop {
        let virtual_position = reader.get_ref().virtual_position();
        let mut line = String::new();
        match reader.read_line(&mut line) {
            Ok(0) => break,
            Err(e) => return Err(e),
            _ => { line.pop(); },
        }
        log(&line, &virtual_position);
    }

    Ok(())
}

Running this prints:

Plain reader:
Line 'line1', compressed offset 0; uncompressed offset 0.
Line 'line2', compressed offset 0; uncompressed offset 6.
Line 'line3', compressed offset 0; uncompressed offset 12.
Line 'line4', compressed offset 0; uncompressed offset 18.
Line 'line5', compressed offset 0; uncompressed offset 24.

Buffered reader:
Line 'line1', compressed offset 0; uncompressed offset 0.
Line 'line2', compressed offset 46; uncompressed offset 0.
Line 'line3', compressed offset 46; uncompressed offset 0.
Line 'line4', compressed offset 46; uncompressed offset 0.
Line 'line5', compressed offset 46; uncompressed offset 0.

I guess this is simply a consequence of the virtual position being advanced in BufReader::fill_buf, since that is where the call to the inner bgzf::Reader::read occurs, rather than in BufReader::consume. I therefore thought at first that I was perhaps in unintended territory by going through BufReader::get_ref, but I noticed the same thing occurs in vcf::Reader::virtual_position, so I thought it might be helpful to mention it here.

The only solution I could think of would be to implement io::BufRead on bgzf::Reader directly, to ensure that the position gets updated properly in consume. I think that could be made to work, and in some sense it seems natural, since bgzf::Reader is already buffered in a sense via the calls to read_block. I'll be happy to help draft a PR if you think this is the way forward, but you might have different ideas for how, and whether, this should be supported.
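To make that concrete, here is a rough sketch of the shape such an impl could take; the helpers and field (read_block, remaining_in_block, uncompressed_position) are hypothetical stand-ins for whatever bgzf::Reader actually uses internally:

impl<R: Read> BufRead for Reader<R> {
    fn fill_buf(&mut self) -> io::Result<&[u8]> {
        // Refill from the next BGZF block only when the current one is spent.
        if self.remaining_in_block().is_empty() {
            self.read_block()?; // hypothetical helper
        }
        Ok(self.remaining_in_block())
    }

    fn consume(&mut self, amt: usize) {
        // Advance the uncompressed offset here, so that virtual_position()
        // only reflects bytes the caller has actually consumed.
        self.uncompressed_position += amt as u16;
    }
}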

GTF parser

While trying noodles squab, I noticed that GTF files couldn't be read (only GFF files). Are there plans to support it? From a parsing perspective, only column 9 needs different parsing.
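For reference, the formats differ only in the attribute syntax of column 9: GFF3 uses key=value pairs, while GTF uses key "value"; entries. A rough, illustrative sketch of GTF attribute parsing (not squab/noodles code):

fn parse_gtf_attributes(s: &str) -> Vec<(&str, &str)> {
    // GTF attributes look like: gene_id "g1"; transcript_id "t1";
    s.split(';')
        .filter_map(|entry| {
            let (key, value) = entry.trim().split_once(' ')?;
            Some((key, value.trim_matches('"')))
        })
        .collect()
}

fn main() {
    let attrs = parse_gtf_attributes(r#"gene_id "g1"; transcript_id "t1";"#);
    assert_eq!(attrs, [("gene_id", "g1"), ("transcript_id", "t1")]);
}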

bcf/vcf info setting, getting example

Hi, thanks for the library. It would be useful to have an example of setting and getting a value from the INFO field. I see the tests in the modules, but for a Rust beginner like me, those are hard to translate to real-world use.
Also, it would be useful to have a way to make genotype parsing optional for BCFs with many samples when only the INFO fields are needed.

Error while reading a BCF header

While trying to read the header from this file I got an error. (The uncompressed version can be found in the same directory.)
Code:

  let mut reader = File::open(path.as_ref()).map(bcf::Reader::new)?;

  let _ = reader.read_header().unwrap();

Output:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Custom { kind: UnexpectedEof, error: "failed to fill whole buffer" }'

I'm not sure if this is a noodles bug, but I think so: I created the BCF file by converting this VCF file with bcftools, so the file itself should be okay.

Async io and memory mapped file support?

This is an excellent project which shows off the capabilities of Rust in a bioinformatics environment.

I was wondering whether you plan async or memory-mapped file support.

The two most common use cases are:

  • Data on a local SSD -> memory-mapped files (you can get away with Cursor if performance is not an issue).
  • Data on an object store -> async file I/O.

Async requires use of AsyncRead/AsyncWrite and async functions.

With BGZF-encoded data, you can read ahead with multiple async requests (S3 recommends chunks of around 16 MB) and decode on multiple threads to lower latency and memory usage.
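A rough sketch of that read-ahead pattern with tokio; fetch_chunk is a hypothetical stand-in for an object-store range request, and none of this is noodles code:

use tokio::task::JoinSet;

// Hypothetical stand-in for a ranged GET against an object store.
async fn fetch_chunk(_offset: u64, len: u64) -> Vec<u8> {
    vec![0; len as usize]
}

#[tokio::main]
async fn main() {
    const CHUNK: u64 = 16 * 1024 * 1024; // ~16 MiB, as suggested for S3

    // Issue several range requests concurrently...
    let mut tasks = JoinSet::new();
    for i in 0..4 {
        tasks.spawn(fetch_chunk(i * CHUNK, CHUNK));
    }

    // ...and hand each finished chunk to a worker thread for BGZF decoding.
    while let Some(chunk) = tasks.join_next().await {
        let chunk = chunk.expect("fetch task panicked");
        tokio::task::spawn_blocking(move || {
            let _ = chunk; // decode BGZF blocks here
        });
    }
}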

On writes, you can use multipart upload on S3.

I'd be happy to help if you have not already done some of this.

docs.rs fails to standalone build async crates

Recent versions of the async-enabled crates fail to build standalone on docs.rs.

In the logs, the following can be found:

error[E0432]: unresolved import `tokio::fs`

The bam crate started failing after 0.7.

When built through the wrapper noodles package, the docs seem to build without issues, but they are built without the async feature.

Use FnvHasher on IndexMap

noodles_sam::record::data::Data uses the default IndexMap to store the Tag/Value pairs. The IndexMap docs state:

IndexMap and IndexSet have a default hasher type S = RandomState, just like the standard HashMap and HashSet, which is resistant to HashDoS attacks but not the most performant.

I think there is no plausible way to exploit a HashDoS attack when we are only hashing two bytes, and alternative hash functions can easily be used. According to this page, the FNV hasher performs best when hashing small values. FnvHasher is available as an external dependency.

When I change the IndexMap in noodles_sam::record::data::Data to IndexMap<K, V, FnvBuildHasher>, I see a measurable increase in performance depending on the number of data fields (5% at 5 fields, 10% at 16).

This is a two line change, but IndexMap is used in many more locations (mostly header types, but also vcf-record). When hashing small values I guess these can all be switched? Would you like a PR?
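For concreteness, this is the kind of change being proposed, using the fnv crate's FnvBuildHasher; the key and value types here are simplified stand-ins, not the noodles ones:

use fnv::FnvBuildHasher;
use indexmap::IndexMap;

fn main() {
    // Two-byte tags hash cheaply with FNV; RandomState's HashDoS resistance
    // buys nothing for keys this small.
    let mut data: IndexMap<[u8; 2], String, FnvBuildHasher> = IndexMap::default();
    data.insert(*b"NM", String::from("i:0"));
    assert!(data.contains_key(b"NM"));
}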

BCF header contig IDX

I was having problems using a BCF file produced by noodles in bcftools. After a bit of digging, I believe my issue has to do with the handling of the IDX field in the header's contig records. To illustrate, running

echo -e '##fileformat=VCFv4.3\n##contig=<ID=1>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample' | bcftools view -O b > test.bcf.gz

and then

use std::{error, fs, io};

use noodles_bcf as bcf;
use noodles_vcf as vcf;

fn main() -> Result<(), Box<dyn error::Error>> {
    let mut reader = fs::File::open("test.bcf.gz").map(bcf::Reader::new)?;

    reader.read_file_format()?;
    let raw_header = reader.read_header()?;
    let header: vcf::Header = raw_header.parse()?;

    let mut writer = vcf::Writer::new(io::stdout());

    writer.write_header(&header)?;

    Ok(())
}

prints something like

##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed",IDX=0>
##contig=<ID=1,IDX="0">
##bcftools_viewVersion=1.14-25-gc304cb1+htslib-1.14-9-ge769401
##bcftools_viewCommand=view -O b; Date=Tue Jan  4 13:22:23 2022
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample

Note in particular that the value "0" of the IDX field on the contig record is quoted (##contig=<ID=1,IDX="0">), since it is encoded as a regular entry of an IndexMap<String, String> rather than as a dedicated Option<usize> field as used e.g. for filters.

When writing BCFs with noodles, this causes issues when interfacing e.g. with bcftools. Modifying the above slightly to write a BCF,

Code
use std::{error, fs, io};

use noodles_bcf as bcf;
use noodles_vcf as vcf;

fn main() -> Result<(), Box<dyn error::Error>> {
    let mut reader = fs::File::open("test.bcf.gz").map(bcf::Reader::new)?;

    reader.read_file_format()?;
    let raw_header = reader.read_header()?;
    let header: vcf::Header = raw_header.parse()?;

    let mut writer = bcf::Writer::new(io::stdout());

    writer.write_file_format()?;
    writer.write_header(&header)?;

    Ok(())
}

and running for instance

cargo run | bcftools view

produces

[W::bcf_hdr_register_hrec] Error parsing the IDX tag, skipping
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##bcftools_viewVersion=1.14-25-gc304cb1+htslib-1.14-9-ge769401
##bcftools_viewCommand=view -O b; Date=Tue Jan  4 13:22:23 2022
##bcftools_viewCommand=view; Date=Tue Jan  4 13:38:29 2022
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample

where the contig entry has been skipped due to an error parsing the string IDX tag.

Digging around in the BCF specs sections 6.2.1-6.2.2, it's not clear to me how, if at all, the IDX field should work for contigs. My impression from reading samtools/hts-specs#591, however, is that contig IDXs should somehow be handled, albeit separately from other header records? I don't understand the details of the BCF spec well enough to be sure that I completely follow that thread, however.

Feel free to close if you think this is a problem in bcftools or with the spec, rather than with noodles - I figured either way it might save someone else a bit of time repeating the above investigation if there's a searchable issue describing the problem.

bam/writer/record: `n_cigar_op` is the wrong CIGAR length

When encoding the number of CIGAR operations, it is not correct to divide by the size of u32; the incorrectly encoded CIGAR length results in invalid BAMs.

This occurs in both Writer and AsyncWriter; the offending line is the same in each:

let n_cigar_op = u16::try_from(record.cigar().len() / mem::size_of::<u32>())
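Presumably the fix is simply to drop the division (my reading of the bug, not a confirmed patch):

let n_cigar_op = u16::try_from(record.cigar().len())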

I found it by adding the following test to the writer:

#[test]
fn test_write_record_with_cigar() -> Result<(), Box<dyn std::error::Error>> {
	use crate::record::cigar::Op;
	use crate::record::sequence::Base;
	use noodles_sam::record::cigar::op::Kind;
	use noodles_sam::record::quality_scores::Score;

	let mut buf = Vec::new();
	let mut record = Record::default();

	let cigar = record.cigar_mut();
	cigar.push(Op::new(Kind::Match, 3)?);

	let seq = record.sequence_mut();
	seq.push(Base::A);
	seq.push(Base::A);
	seq.push(Base::A);

	let qual = record.quality_scores_mut();
	qual.push(Score::try_from(40)?);
	qual.push(Score::try_from(40)?);
	qual.push(Score::try_from(40)?);

	write_record(&mut buf, &record)?;

	let expected = [
		0x2B, 0x00, 0x00, 0x00, // block_size = 43
		0xff, 0xff, 0xff, 0xff, // ref_id = -1
		0xff, 0xff, 0xff, 0xff, // pos = -1
		0x02, // l_read_name = 2
		0xff, // mapq = 255
		0x48, 0x12, // bin = 4680
		0x01, 0x00, // n_cigar_op = 1
		0x04, 0x00, // flag = 4
		0x03, 0x00, 0x00, 0x00, // l_seq = 3
		0xff, 0xff, 0xff, 0xff, // next_ref_id = -1
		0xff, 0xff, 0xff, 0xff, // next_pos = -1
		0x00, 0x00, 0x00, 0x00, // tlen = 0
		0x2a, 0x00, // read_name = "*\x00"
		0x30, 0x00, 0x00, 0x00, // cigar = "3M"
		0x11, 0x10, // seq = "AAA"
		0x28, 0x28, 0x28, // qual = 40, 40, 40
	];

	assert_eq!(buf, expected);

	Ok(())
}

Make record attributes hashable

Hi!

I am currently migrating some code from rust-htslib to noodles.
While doing so, it turned out that several bam and sam record attributes are not primitive types (e.g. a record position) and therefore do not support hashing.
As I need to add several attributes to a HashSet, it would be helpful if hashing were generally supported by deriving Hash.
In my case this would concern ReferenceSequenceId and sam::record::Position.
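For what it's worth, the request amounts to extending the existing derives along these lines; the type below is a stand-in, not the noodles source:

use std::collections::HashSet;

// Stand-in newtype mirroring something like sam::record::Position.
#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
struct Position(i32);

fn main() {
    let mut seen = HashSet::new();
    seen.insert(Position(8));
    assert!(seen.contains(&Position(8)));
}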

INFO/MQ field type error

I was getting an error parsing a VCF header. The code may be reduced to the following:

fn main() {
    use noodles_vcf::Header;

    let raw_header = r#"##fileformat=VCFv4.2
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
"#;
    let header = raw_header.parse::<Header>();
    println!("{:?}", header); // Error
}

which produces Err(InvalidInfo(TypeMismatch(Integer, Float))). As the error indicates, the problem seems to arise from the fact that MQ here is Integer, whereas noodles expects Float. As far as I can see, the specs aren't entirely clear about the types of the reserved INFO fields. However:

  1. There is in fact a note, near the end of p. 6, about the MQ field for genotypes, which says:

MQ : RMS mapping quality, similar to the version in the INFO field. (Integer)

  2. The example output files in the bcftools GitHub repository consistently code MQ as an integer field (e.g. here),

which makes me think that MQ should perhaps be Type::Integer in noodles as well.

On a side note, I wonder if something similar may also be true for BQ, though I can find even less information in the specs and the bcftools examples in this case.

Happy to submit a PR if you'd like.

Async Readers and Send

There's an interesting barrier regarding Send with AsyncReader for the bgzf-based formats, and I'm not sure if it's by design. For example, the bam AsyncReader is not Send, making code like in this example fail to compile.

The error complains that the trait std::marker::Send is not implemented for (dyn std::future::Future<Output = std::result::Result<noodles::noodles_bgzf::block::Block, std::io::Error>> + 'static).

Do you think that this is unavoidable?
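The error message hints at the cause: the reader appears to store its in-flight block future as a plain boxed trait object. A sketch of the distinction (a hypothetical reconstruction, not the actual noodles source):

use std::{future::Future, io, pin::Pin};

struct Block; // stand-in for noodles_bgzf::block::Block

// Without `+ Send`, any reader that stores this future is itself not Send:
type BlockFuture = Pin<Box<dyn Future<Output = io::Result<Block>>>>;

// Adding the bound would make the future, and thus the reader, Send:
type SendBlockFuture = Pin<Box<dyn Future<Output = io::Result<Block>> + Send>>;

fn main() {
    fn assert_send<T: Send>() {}
    assert_send::<SendBlockFuture>(); // compiles
    // assert_send::<BlockFuture>(); // would fail: the future is not Send
}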

Mutable access to VCF header

As far as I can tell, it's currently impossible to modify an existing vcf::Header, apart from adding unstructured records with the insert method. This is inconvenient, e.g. when modifying an existing VCF by adding a few INFO or FORMAT fields to the header and records rather than building a header from scratch.

I see that sam::Header already has _mut methods for cases like this. Would it make sense to add something similar for the vcf::Header methods that return IndexMaps? For the non-IndexMap fields (assembly, file_format, pedigree_db), there could be similar _mut methods returning mutable references, or perhaps simply set_ setters. Finally, it might sometimes be ergonomic to create a vcf::header::Builder from an existing header; perhaps impl From<Header> for Builder could be nice?

Happy to make the PR if you'd like.

No way to mutate the flags in a SAM record

The Record::flags function returns Flags, which is Copy, so there is, as far as I can see, no way to mutate the flags.

pub fn flags_mut(&mut self) -> &mut Flags {
    &mut self.flags
}

is probably the way to go, considering similar functions in the API. I can send a PR if you like.

Publish a version of things on crates.io

Is there a timeline or a line in the sand for when you might be interested in pushing a v0.1.0 of all of these to crates.io? It's impossible to publish a crate that depends on any of the noodles family until there is a version of them on crates.io.

If there's a checklist of things to add or do for an initial release, I'm happy to help push it over the finish line. Specifically, I want to release a new version of perbase that uses noodles-bgzf, but I can't publish it to crates.io until noodles is there first.

Bam record should validate sequence and quality length before writing

In the encoded BAM there is a single field that encodes the length of both the sequence and the base qualities. When writing a bam::Record this is not validated, resulting in possibly corrupt BAM files.

According to the BAM spec, when base qualities are omitted, 0xFF should be written instead.

This could be done in the writer code, but it's probably cleaner to add a validate function on the record? Should I open a PR?
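A stand-alone sketch of the check being suggested; this is my own helper, not the noodles API:

use std::io;

// BAM stores a single length (l_seq) for both fields, so they must agree;
// per the spec, an omitted quality string is encoded as l_seq bytes of 0xFF.
fn validate_lengths(sequence: &[u8], quality_scores: &[u8]) -> io::Result<()> {
    if !quality_scores.is_empty() && quality_scores.len() != sequence.len() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "sequence and quality scores differ in length",
        ));
    }
    Ok(())
}

fn main() {
    assert!(validate_lengths(b"ACG", b"III").is_ok());
    assert!(validate_lengths(b"ACG", b"II").is_err());
}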

query causes a panic

When calling query on a ReferenceSequence from a Tabix file (this file specifically), I got a panic.
Code:

  let path = std::env::current_dir()
    .unwrap()
    .join("data")
    .join("vcf")
    .join("sample1-bcbio-cancer.vcf.gz.tbi");

  let index = tabix::read(path).unwrap();
  for ref_seq in index.reference_sequences() {
    ref_seq.query(150, i32::MAX);
  }

Output:

thread 'main' panicked at 'index out of bounds: 37450 >= 37450', /home/daniel/.cargo/registry/src/github.com-1ecc6299db9ec823/bit-vec-0.6.3/src/lib.rs:539:9

It seems to have something to do with the i32::MAX value, but it doesn't happen if the start value is 0.
I also wanted to ask if there is an easy way of using the query functionality with an open end or start, so that it covers everything up to the end, or from the start, of the sequence.

parallelization of bgzf compression

hi,

I used noodles to modify and filter BAM files, with the primary focus of learning Rust. Is it possible to parallelize the compression part when writing to a BAM file?

Great package :)

Kristian

Error while parsing a valid VCF header

If I try to parse the VCF example header from the VCF specification, I get the following error.
Code:

  use noodles_vcf::Header;
  r#"##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003"#
.parse::<Header>().unwrap();

Output:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidContig(InvalidRecord)'

The error disappears when I remove the md5=f126cdf8a6e0c7f379d618ff66beb2da, part in the contig field.

Tag::Other representation

I'm trying noodles-(bam/sam) for a real project now, and so far the experience is more than pleasant. I've come across a few performance issues I'd like to discuss, the Tag::Other(String) representation being a relatively simple change.

Problem

Currently, Tag::Other(String) requires the end user to create a String when querying a tag. This has a few drawbacks:

  • Each tag allocates
  • No constant tags because String cannot be const
  • Because the bam::data::fields iterator parses every tag in the data part, searching for a specific tag can be quite expensive.

Solution

Since the tag is required by the specification to be 2 bytes, change the String to [u8; 2] (there is a LENGTH constant for this in the code).

This can be done with very few API changes and is implemented in a PR.
I ran a benchmark searching for the UMI tag in a real data line:
AS:i:0 ZS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:75 YS:i:0 YT:Z:CP RG:Z:000000000-JM6RV.1 XS:A:+ NH:i:4 RX:Z:CGAATAAA MQ:i:1 MC:Z:75M ms:i:2646

let v = data
	.fields()
	.find_map(|f| {
		let f = f.ok()?;
		if f.tag() == &Tag::UmiSequence {
			Some(f.value().to_owned())
		} else {
			None
		}
	}).unwrap();
assert_eq!(v, Value::String("CGAATAAA".to_string()));

The change to [u8; 2] results in a decent performance gain:

DataValues/classic      time:   [569.66 ns 570.12 ns 570.64 ns]                                
                        change: [-28.071% -27.942% -27.830%] (p = 0.00 < 0.05)

Issues

The issue with this array implementation is the same as with the String version: directly creating the enum variant is allowed, which makes it possible to construct invalid tags. let t = Tag::Other("_a".to_string()); is currently accepted, as is the equivalent in the changed version. This could probably be avoided by wrapping the array (or the String in the current version) in a newtype, but I'm not sure how that would affect the ability to make const tags; see the sketch below.
Let me know if you want to explore this.
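A quick sketch of the newtype direction, showing that const tags stay possible because the byte checks are const fns; this is illustrative, not a finished design:

#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
pub struct Tag([u8; 2]);

impl Tag {
    // The SAM spec requires tags to match [A-Za-z][A-Za-z0-9]; validating in
    // a const fn keeps tag constants possible.
    pub const fn new(bytes: [u8; 2]) -> Option<Self> {
        if bytes[0].is_ascii_alphabetic() && bytes[1].is_ascii_alphanumeric() {
            Some(Self(bytes))
        } else {
            None
        }
    }
}

fn main() {
    assert!(Tag::new(*b"RX").is_some());
    assert!(Tag::new(*b"_a").is_none()); // the invalid tag from above
}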

bcf/header/string_map: Disable type checking when building string map for VCF < 4.3

Hi,
yes, this VCF (converted to BCF): https://github.com/brentp/vcf-bench#testing-data

with this code: https://github.com/brentp/vcf-bench/blob/625482287b6238a5d1e018ac94b3d89558593986/rust-noodles/src/main.rs

Also in that code, I couldn't get the key on line 27 to work; I had to use the predefined Key::TotalAlleleCount. I'm curious what I could change to use it as written on line 27.

Originally posted by @brentp in #41 (comment)

Format error while parsing a ¿valid? VCF header

If I try to parse the VCF header from this file, which I think has a valid VCF header, I get an error.
Code:

  use noodles_vcf::Header;
  r#"##fileformat=VCFv4.1
##fileDate=20100501
##reference=1000GenomesPilot-NCBI36
##assembly=ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/sv/breakpoint_assemblies.fasta
##INFO=<ID=BKPTID,Number=.,Type=String,Description="ID of the assembled alternate allele in the assembly file">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=HOMLEN,Number=.,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints">
##INFO=<ID=HOMSEQ,Number=.,Type=String,Description="Sequence of base pair identical micro-homology at event breakpoints">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
##ALT=<ID=DEL:ME:L1,Description="Deletion of L1 element">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=DUP:TANDEM,Description="Tandem Duplication">
##ALT=<ID=INS,Description="Insertion of novel sequence">
##ALT=<ID=INS:ME:ALU,Description="Insertion of ALU element">
##ALT=<ID=INS:ME:L1,Description="Insertion of L1 element">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=CNV,Description="Copy number variable region">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype quality">
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number genotype for imprecise events">
##FORMAT=<ID=CNQ,Number=1,Type=Float,Description="Copy number genotype quality for imprecise events">
#CHROM POS     ID        REF              ALT          QUAL FILTER INFO                                                               FORMAT       NA00001
"#  
.parse::<Header>().unwrap();

Output:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidFormat(TypeMismatch(Float, Integer))'

It seems to have something to do with the FORMAT lines, but I couldn't isolate the error.

Something similar also happens with the header from this other file.
Code:

  use noodles_vcf::Header;
  r#"##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine=<ID=UnifiedGenotyper,Version=2.6-4-g3e5ff60,Date="Sun Jul 21 00:41:26 EDT 2013",Epoch=1374381686726,CommandLineOptions="analysis_type=UnifiedGenotyper input_file=[/home/chapmanb/bio/bcbio-nextgen/tests/test_automated_output/bamprep/Test1/chrM/Test1-chrM_100_1000-prep.bam] read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[BadCigar, NotPrimaryAlignment] intervals=[/home/chapmanb/bio/bcbio-nextgen/tests/test_automated_output/gatk/chrM/Test1-chrM_100_1000-prep-variants-TestBatch1-raw-regions.bed] excludeIntervals=null interval_set_rule=INTERSECTION interval_merging=ALL interval_padding=0 reference_sequence=/home/chapmanb/bio/bcbio-nextgen/tests/data/automated/tool-data/../../genomes/hg19/seq/hg19.fa nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 allow_bqsr_on_reduced_bams_despite_repeated_warnings=false validation_strictness=SILENT remove_program_records=false keep_program_records=false unsafe=LENIENT_VCF_PROCESSING disable_auto_index_creation_and_locking_when_reading_rods=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false logging_level=INFO log_to_file=null help=false version=false genotype_likelihoods_model=BOTH pcr_error_rate=1.0E-4 computeSLOD=false annotateNDA=false pair_hmm_implementation=LOGLESS_CACHING min_base_quality_score=17 max_deletion_fraction=0.05 allSitePLs=false min_indel_count_for_genotyping=5 min_indel_fraction_per_sample=0.25 indelGapContinuationPenalty=10 indelGapOpenPenalty=45 indelHaplotypeSize=80 indelDebug=false ignoreSNPAlleles=false allReadsSP=false ignoreLaneInfo=false reference_sample_calls=(RodBinding name= source=UNBOUND) reference_sample_name=null sample_ploidy=2 min_quality_score=1 max_quality_score=40 site_quality_prior=20 min_power_threshold_for_calling=0.95 min_reference_depth=100 exclude_filtered_reference_sites=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 genotyping_mode=DISCOVERY output_mode=EMIT_VARIANTS_ONLY standard_min_confidence_threshold_for_calling=4.0 standard_min_confidence_threshold_for_emitting=4.0 alleles=(RodBinding name= source=UNBOUND) max_alternate_alleles=6 input_prior=[] contamination_fraction_to_filter=0.05 contamination_fraction_per_sample_file=null p_nonref_model=EXACT_INDEPENDENT exactcallslog=null dbsnp=(RodBinding name=dbsnp source=/home/chapmanb/bio/bcbio-nextgen/tests/data/automated/tool-data/../../genomes/hg19/variation/dbsnp_132.vcf) comp=[] out=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub no_cmdline_in_header=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub bcf=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub debug_file=null metrics_file=null annotation=[BaseQualityRankSumTest, FisherStrand, GCContent, HaplotypeScore, HomopolymerRun, MappingQualityRankSumTest, MappingQualityZero, QualByDepth, 
ReadPosRankSumTest, RMSMappingQuality, DepthPerAlleleBySample, Coverage] excludeAnnotation=[] filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=GC,Number=1,Type=Integer,Description="GC content around the variant (see docs for window size details)">
##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RPA,Number=.,Type=Integer,Description="Number of times tandem repeat unit is repeated, for each allele (including reference)">
##INFO=<ID=RU,Number=1,Type=String,Description="Tandem repeat unit (bases)">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=STR,Number=0,Type=Flag,Description="Variant is a short tandem repeat">
##contig=<ID=chrM,length=16571,assembly=hg19>
##contig=<ID=chr22,length=20001,assembly=hg19>
##reference=file:///home/chapmanb/bio/bcbio-nextgen/tests/data/automated/tool-data/../../genomes/hg19/seq/hg19.fa
"#  
.parse::<Header>().unwrap();

Output:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidFormat(NumberMismatch(Unknown, R))'

Writing bgzipped VCF

I was trying to use noodles (thanks!) to write a VCF file with bgzip compression, which I thought I'd be able to do with a vcf::Writer<bgzf::Writer<W>>, along the lines of what's happening in the BAM module.

To illustrate, I tried modifying noodles-vcf/examples/vcf_write.rs (changed lines are emphasised):

use std::io;

use noodles_bgzf as bgzf;
use noodles_vcf::{self as vcf, header::Contig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let stdout = io::stdout();
    let handle = stdout.lock();
    let bgzf_handle = bgzf::Writer::new(handle);
    let mut writer = vcf::Writer::new(bgzf_handle);

    let header = vcf::Header::builder()
        .add_contig(Contig::new(String::from("sq0")))
        .build();

    writer.write_header(&header)?;

    let record = vcf::Record::builder()
        .set_chromosome("sq0".parse()?)
        .set_position(1)
        .set_reference_bases("A".parse()?)
        .build()?;

    writer.write_record(&record)?;

    Ok(())
}

Now, running cargo run --release --example vcf_write compiles and runs, but nothing is written to stdout. I tried the same thing with a std::fs::File handle instead of a locked stdout, and that likewise compiles and runs, only to produce an empty file. If I explicitly flush the writes, I get a bgzip header and trailer around an uncompressed VCF, and then I wasn't sure how to proceed. Perhaps I'm missing something simple?
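For what it's worth, the bgzf::Writer in the buffered-reading MWE earlier on this page is finalized with finish(), and that is likely what's missing here: BGZF buffers a whole block before emitting anything. A sketch of the amended tail of the example, where into_inner() on vcf::Writer is my assumption for recovering the inner bgzf writer:

    writer.write_record(&record)?;

    // Finalize the BGZF stream: flush the last block and write the EOF
    // marker. `into_inner` is assumed; `finish` is the same bgzf call used
    // in the buffered-reading MWE above.
    writer.into_inner().finish()?;

    Ok(())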

Error compiling

While compiling an updated version (c235259), I got the following error:

error[E0277]: the type `[u8]` cannot be indexed by `(Bound<usize>, Bound<usize>)`
   --> /home/daniel/.cargo/git/checkouts/noodles-f00ebc122d065317/c235259/noodles-fasta/src/reader.rs:251:36
    |
251 |         Ok(Record::new(definition, sequence[range].to_vec()))
    |                                    ^^^^^^^^^^^^^^^ slice indices are of type `usize` or ranges of `usize`
    |
    = help: the trait `SliceIndex<[u8]>` is not implemented for `(Bound<usize>, Bound<usize>)`
    = note: required because of the requirements on the impl of `std::ops::Index<(Bound<usize>, Bound<usize>)>` for `Vec<u8>`

Previously working tests now fail

The htsget-rs tests fail after some of the latest changes. I have tracked the new errors down to the commits mentioned in umccr/htsget-rs#51. It seems the changes regarding "directly reading the metadata pseudo-bin" are responsible. The errors vary from not being able to read index files at all to some really weird values. It would be of great help if you could look into it 🥇

Using Genotype

I can parse a VCF record to get the FORMAT/GT fields as a noodles_vcf::record::genotype::field::value::genotype::Genotype, which I see is a newtype around Vec<noodles_vcf::record::genotype::field::value::genotype::allele::Allele>. However, I cannot find a way to work with these alleles. From other parts of noodles I might have expected Genotype to implement e.g. Deref<Target = Vec<Allele>> or to provide getters/setters/iterators, but I cannot find any way to access the inner information. Am I missing something, or is such access lacking?

A MWE gets a little lengthy, but I'll include it for completeness:

use std::io;

use noodles_vcf as vcf;

fn main() -> io::Result<()> {
    let data = br#"##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
1	1	.	A	.	999	PASS	.	GT	0/0	./0	1/1"#;

    let mut reader = vcf::Reader::new(&data[..]);

    let _header = reader.read_header()?;

    let mut raw_record = String::new();
    reader.read_record(&mut raw_record)?;

    let record: vcf::Record = raw_record.parse().expect("cannot parse record");

    let genotypes: Vec<vcf::record::genotype::field::value::genotype::Genotype> = record
        .genotypes()
        .iter()
        .map(|gt| {
            gt.get(&vcf::record::genotype::field::key::Key::Genotype)
                .and_then(|field| field.value())
                .map(|value| match value {
                    vcf::record::genotype::field::Value::String(s) => s
                        .parse::<vcf::record::genotype::field::value::genotype::Genotype>()
                        .expect("cannot parse genotype"),
                    _ => panic!("no genotype for sample"),
                })
                .expect("cannot access genotype")
        })
        .collect();

    for genotype in genotypes {
        println!("{:?}", genotype);

        // Do something with alleles in genotype?
    }

    Ok(())
}

On a somewhat related note, would it be in scope for noodles to provide an iterator over noodles_vcf::record::genotype::field::value::genotype::Genotype from a noodles_vcf::record::Record, in addition to the iterator over noodles_vcf::record::genotype::Genotype returned by Record::genotypes? For my own part at least, the GT field is by far the most commonly accessed FORMAT field, so this might be a nice ergonomic improvement over the above.

Again, I'm happy to contribute PRs for either of those suggestions if you'd like.
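To make the first suggestion concrete, one possible shape, written with stand-in types rather than the noodles ones:

struct Allele; // stand-in for ...::genotype::allele::Allele
struct Genotype(Vec<Allele>); // stand-in for the newtype around Vec<Allele>

impl std::ops::Deref for Genotype {
    type Target = [Allele];

    fn deref(&self) -> &Self::Target {
        &self.0
    }
}

fn main() {
    let genotype = Genotype(vec![Allele, Allele]);
    // With Deref<Target = [Allele]>, slice methods come for free:
    assert_eq!(genotype.len(), 2);
    for _allele in genotype.iter() {
        // inspect each allele's position/phasing here
    }
}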

How to access sequence name from reference_sequence_id ?

Hello,
I am currently trying to get the target and mate-target IDs, which I find a bit confusing.
I got the header-like ReferenceSequence map via:

    let sequences : IndexMap<String, sam::header::ReferenceSequence> = match reader.read_reference_sequences() {
        Ok(n) =>  n,
        Err(_) => panic!("ERROR: could not get ref sequence!"),
    };

Now I thought I could easily just look up the index and get the entry, but reference_sequence_id returns something I am not familiar with:

let chr1_index  = read.reference_sequence_id().expect("ERROR: could not obtain read ref ID");

leads to

[src/lib/noodles_based_lib.rs:77] chr1_index = ReferenceSequenceId(
    1434,
)
[src/lib/noodles_based_lib.rs:77] chr1_index = ReferenceSequenceId(
    1434,
)

Stupid question: what is that structure, and how do I get the index number to use as a lookup into the IndexMap? :)

Update:
Okay, I figured it out; I just hope this is the right way to do it:

 let chr1_index  = match read.reference_sequence_id().map(i32::from){
                    Some(n) => n,
                    _ => panic!("ERROR: could not retrieve the id of the reference!"),
                };
let chr1 : &str = match seq_local.get_index(chr1_index as usize){
                    Some(n) => n.0 ,
                    None  => panic!("ERROR: could not assign a reference ID to a reference name!"),
                };

If this seems fine, please feel free to close. I just want to avoid closing it myself only for you to recommend against this approach ;)

BinningIndex query and RangeBounds

I could be missing something, but the BinningIndex query function requires a type that is RangeBounds<i32> and Copy; however, the standard range types such as RangeInclusive do not implement Copy.

Specifically:
let chunks = index.query(ref_seq_id, seq_start..=seq_end);

Outputs error E0277:
the trait bound RangeInclusive<i32>: std::marker::Copy is not satisfied
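One possible workaround while the bound is RangeBounds<i32> + Copy: a tuple of Bounds implements RangeBounds and, unlike RangeInclusive, is Copy. As a fragment in the style of the snippet above:

use std::ops::Bound;

let chunks = index.query(ref_seq_id, (Bound::Included(seq_start), Bound::Included(seq_end)));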

`docs.rs` typo

In many of the Cargo.toml files, the docs URL is doc.rs. It should be docs.rs.

Potential for adding mzml and mgf?

Hi... very excited to see this library! I've developed a couple of tangentially related packages, though noodles is much better developed.

In particular, there's one that handles mgf and mzML. These are used in metabolomics/proteomics for storing mass spec data. Are you accepting format contributions? Would mgf and mzML be a good fit (i.e. a noodles-mgf)?

The goal of this package is to have functions for reserializing to a different format (e.g. mgf to JSONL), so there's a CLI component and a lib component. It's annoying to commingle them, so seeing noodles I thought I'd see what you thought (or at least we can commiserate about bioinformatics formats 😄).

Calling seek changes future virtual_position() results (VCF)

I found a case in which calling seek changes the output. The file I'm using to test it can be found here.
Code:

   let path = std::env::current_dir()
    .unwrap()
    .join("data")
    .join("vcf")
    .join("sample1-bcbio-cancer.vcf.gz");

  let mut reader = File::open(path)
    .map(bgzf::Reader::new)
    .map(vcf::Reader::new)
    .unwrap();

  println!("First round");
  reader.read_header().unwrap();
  let mut s = String::new();
  loop {
    reader.read_record(&mut s).unwrap();
    if s.is_empty() {
      break;
    }
    let vpos = reader.virtual_position();
    println!("{}/{}", vpos.compressed(), vpos.uncompressed());
    s.clear();
  }

  println!("Repeat after performing a seek");
  reader
    .seek(VirtualPosition::try_from((0, 7288)).unwrap())
    .unwrap();
  loop {
    reader.read_record(&mut s).unwrap();
    if s.is_empty() {
      break;
    }
    let vpos = reader.virtual_position();
    println!("{}/{}", vpos.compressed(), vpos.uncompressed());
    s.clear();
  }

Output:

First round
0/7481
0/7687
0/7882
0/8075
0/8269
0/8463
0/8658
0/8852
0/9108
0/9302
0/9496
3395/0
Repeat after performing a seek
0/7481
0/7687
0/7882
0/8075
0/8269
0/8463
0/8658
0/8852
0/9108
0/9302
0/9496
28/0

The results are exactly the same until the end of the block, where the seek version reports an incorrect compressed part. The same results appear when running each version with different readers.

vcf/header: Add generic records accessors

2. The header is still lacking mutable access to the unstructured fields. Perhaps it would make sense to add either get_mut/delete to go with get/insert, and/or simply add records/records_mut for full access?

#66


Add

  • vcf::Header::records -> &IndexMap<String, Vec<Record>> and
  • vcf::Header::records_mut -> &mut IndexMap<String, Vec<Record>>

to access the header map field. This is a map of generic records.

Noodles reports a different offset than the index

Not sure if this is really wrong, but I thought it could use a look.
The index (made with tabix) reports a different block offset than noodles. The files can be found here.

Code:

  let path = std::env::current_dir()
    .unwrap()
    .join("data")
    .join("vcf")
    .join("sample1-bcbio-cancer.vcf.gz");

  let mut reader = File::open(&path)
    .map(bgzf::Reader::new)
    .map(vcf::Reader::new)
    .unwrap();

  println!("From reading the records");
  reader.read_header().unwrap();
  let mut s = String::new();
  loop {
    reader.read_record(&mut s).unwrap();
    if s.is_empty() {
      break;
    }
    let vpos = reader.virtual_position();
    println!("{}/{}", vpos.compressed(), vpos.uncompressed());
    s.clear();
  }

  println!("From the index");
  let index = tabix::read(format!("{}{}", path.to_str().unwrap(), ".tbi")).unwrap();
  index
    .reference_sequences()
    .into_iter()
    .flat_map(|seq| seq.bins())
    .flat_map(|bin| bin.chunks())
    .map(|chunk| chunk.end())
    .for_each(|chunk| println!("{}/{}", chunk.compressed(), chunk.uncompressed()));

Output:

From reading the records
0/7481
0/7687
0/7882
0/8075
0/8269
0/8463
0/8658
0/8852
0/9108
0/9302
0/9496
3395/0
From the index
0/9302
3367/0

The index reports 3367, but noodles reading the records reports 3395.

handle serialization of structs?

Hi -- for my line of work I often have to (de)serialize bioinformatic file formats into more common formats, and I'm curious if there are any recommendations for how to do that with noodles, or if someone else has done it... In an ideal world I could:

use noodles::fasta;

fn main() {
    let d = fasta::record::Definition::new(String::from("seq1"), None);

    let seq = "ATCG".as_bytes().to_vec();
    let s = fasta::record::Record::new(d, seq);

    let s_json = serde_json::to_string(&s).unwrap();
    println!("{}", s_json);
}

I know it's possible to add serialization to structs in external packages, but it's a non-trivial amount of work, so I thought I'd ask a) if there is a good path to take, and b) whether there are any thoughts/plans on supporting serialization à la rust-bio.
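A common interim path, absent upstream Serialize impls, is a local mirror type; this sketch assumes nothing about noodles beyond the example above, and the mirror would be populated via whatever accessors the crate exposes:

use serde::Serialize;

// Local mirror of the parts of fasta::Record we want in JSON.
#[derive(Serialize)]
struct JsonRecord<'a> {
    name: &'a str,
    sequence: &'a str,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let record = JsonRecord { name: "seq1", sequence: "ATCG" };
    println!("{}", serde_json::to_string(&record)?);
    Ok(())
}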

Thanks!

noodles_cram: write_record fails for reads from circular contig

When writing ultralong nanopore reads from circular contigs (in this specific case chrM), noodles_cram::Writer::write_record fails with

thread 'main' panicked at 'index out of bounds: the len is 16569 but the index is 16569', /home/till/.cargo/registry/src/github.com-1ecc6299db9ec823/noodles-cram-0.11.0/src/data_container/compression_header/preservation_map/substitution_matrix/builder.rs:17:28

I assume this is because the read wraps around the contig. I've attached a CRAM file (failing_record.zip) which contains such a record. The reference FASTA is available at https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/chm13.draft_v1.1.fasta.gz (from the Telomere-to-Telomere project).

Scope question: index creation

The crates currently offer read and write functionality for various file formats and their respective index formats. Is it within the scope of this repo to include an Indexer struct that calculates the index from the respective file format?

What is the MSRV?

Hello everyone, thanks for your very nice work.

I didn't see any information about the minimum supported Rust version (MSRV). This is important for people using noodles, and for new contributors, so they know which Rust features they can use.

A nice choice, from my point of view, would be Rust 1.57, as it matches edition 2021, but I don't have many strong arguments for that particular version.

Seek not working properly (on VCF)

The following code fails:

  let path = std::env::current_dir()
    .unwrap()
    .join("data")
    .join("vcf")
    .join("sample1-bcbio-cancer.vcf.gz");

  let mut reader = File::open(path)
    .map(bgzf::Reader::new)
    .map(vcf::Reader::new)
    .unwrap();

  reader.read_header().unwrap();
  loop {
    if reader.read_record(&mut String::new()).unwrap() == 0 {
      break;
    }
  }

  reader
    .seek(VirtualPosition::try_from((0, 9302)).unwrap())
    .unwrap();
  assert_eq!(
    reader.virtual_position(),
    VirtualPosition::try_from((0, 9302)).unwrap()
  );

The file I'm reading is this one (The compressed version can be found in the same folder).
I think it may have something to do with the EOF.

Accessing (potentially missing) tags and reading values

Hey there! I've got a BAM-parsing pipeline that I'd love to re-implement in Rust using noodles, to speed everything up and consolidate many steps.

I've been reading through the code/documentation/examples, which are all great! But I either haven't found or haven't grokked how to a) check for the presence of a tag and b) get its value in an efficient way.

Using pysam the workflow would be something like (for example):

for a in aligned_bam:
   if a.is_secondary or a.is_supplementary or a.mapping_quality < mq:
       continue
   for tag in tags:  # I have a list of tags I'm looking for
       if a.has_tag(tag):
           tag_histograms[tag][a.get_tag(tag)] += 1

I'm not sure how I would do this with noodles. My current guess is:

for result in reader.records() {
    let record = result?;        
    let flags = record.flags();
    let data = record.data();

    if !(flags.is_supplementary() || flags.is_duplicate()) {
        let mapq = u8::from(record.mapping_quality());

        if mapq != 255 && mapq >= 10 {
            // iterate over data and add values to counters?
        }
    }
}

In part I'm questioning whether I need to iterate over the tags or whether there's a faster lookup method. That's about where I decided to open an issue.
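For reference, a single pass over the data fields is the pattern the Tag::Other issue above uses (data.fields() with find_map). Filling in the counter placeholder from the guess above might look like this fragment, where tags and counts are this example's own bookkeeping and Debug formatting of tags/values is assumed:

// Inside the mapq check above: one pass over the record's data fields.
for field in data.fields() {
    let field = field?;
    if tags.contains(field.tag()) {
        *counts
            .entry(format!("{:?}", field.tag()))
            .or_insert_with(HashMap::new)
            .entry(format!("{:?}", field.value()))
            .or_insert(0u64) += 1;
    }
}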

Happy to contribute examples for this workflow, if you can point me in the right direction!

noodles::fastq comment support

As of noodles 0.17.0 / noodles-fastq 0.3.0, the fastq module does not support comments at all. I know these are rarely used; they are, however, part of the file format. Would it be possible to include functionality for reading/getting and writing/setting comments?
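For reference, the comment (sometimes called the description) is the free text after the first whitespace on the @ line:

@SEQ_ID this part is the comment
GATTACA
+
IIIIIII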
