Coder Social home page Coder Social logo

Comments (4)

zaeleus avatar zaeleus commented on August 22, 2024 1

Thanks for using/testing the new BAM implementation and reporting this!

The BAM format is limited to 216-1 CIGAR operations but can overflow to an auxiliary data field if necessary. This is done by setting the record CIGAR operations buffer to a flag equal to kSmN, where k is the base count and m is the reference sequence length.1 The lazy decoder was accidentally consuming the CIGAR operations buffer when doing this check, which is only done for CIGARs with two operations. This is why the first five records in your example return empty operations, and the last one returns the correct value.

The RecordBuf reader fully decodes the record CIGAR operations and then tries to resolve whether it's an overflow flag, which is why it's not affected.

Sorry for the inconvenience. This is fixed in noodles 0.62.1 / noodles-bam 0.54.1.

Footnotes

  1. ยง 4.2.2 "N_CIGAR_OP field" (2023-05-24). โ†ฉ

from noodles.

rob-p avatar rob-p commented on August 22, 2024

The plot thickens (at least for me)!

If I use the buffered (eager) instead of lazy iterator, the fields seem to be correct. That is, if I use

 for result in reader.record_bufs(&header) {

instead of

 for result in reader.records() {

to iterate over the records, the cigar strings and the other fields seem ok --- that is, I get

record = RecordBuf { name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55])), flags: Flags(0x0), reference_sequence_id: Some(98279), alignment_start: Some(Position(527)), mapping_quality: Some(MappingQuality(1)), cigar: Cigar([Op { kind: SoftClip, len: 5 }, Op { kind: Match, len: 111 }]), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, sequence: Sequence([71, 71, 71, 67, 65, 65, 65, 65, 65, 65, 71, 65, 65, 71, 84, 71, 65, 71, 67, 84, 71, 71, 65, 71, 65, 84, 84, 71, 71, 65, 84, 67, 65, 67, 65, 71, 67, 67, 71, 65, 65, 71, 71, 65, 71, 84, 65, 65, 65, 71, 71, 84, 71, 67, 84, 71, 67, 65, 65, 84, 71, 65, 84, 71, 84, 84, 65, 71, 67, 84, 71, 84, 71, 71, 67, 67, 65, 67, 84, 71, 84, 71, 71, 65, 84, 84, 84, 84, 84, 67, 71, 67, 65, 65, 71, 65, 65, 67, 65, 84, 84, 65, 65, 84, 65, 65, 65, 67, 84, 65, 65, 65, 65, 65, 67, 84]), quality_scores: QualityScores([16, 18, 25, 26, 29, 30, 25, 22, 20, 12, 12, 25, 27, 19, 17, 17, 16, 20, 27, 25, 31, 28, 27, 22, 12, 11, 13, 28, 29, 34, 31, 28, 16, 16, 16, 12, 9, 10, 9, 10, 13, 13, 13, 11, 6, 20, 21, 24, 17, 16, 27, 23, 22, 14, 16, 13, 12, 13, 12, 26, 30, 29, 31, 31, 29, 31, 12, 27, 31, 30, 21, 9, 9, 12, 16, 18, 21, 25, 29, 18, 22, 23, 26, 26, 48, 41, 45, 43, 37, 33, 23, 26, 48, 39, 30, 36, 33, 23, 32, 34, 35, 36, 35, 34, 40, 32, 25, 16, 24, 32, 28, 27, 22, 8, 4, 12]), data: Data([(Tag("NM"), UInt8(1)), (Tag("ms"), UInt8(216)), (Tag("AS"), UInt8(216)), (Tag("nn"), UInt8(0)), (Tag("tp"), Character(80)), (Tag("cm"), UInt8(17)), (Tag("s1"), UInt8(97)), (Tag("s2"), UInt8(98)), (Tag("de"), Float(0.009)), (Tag("rl"), UInt8(0))]) }
alignment span = Some(111)


record = RecordBuf { name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55])), flags: Flags(SECONDARY), reference_sequence_id: Some(98286), alignment_start: Some(Position(407)), mapping_quality: Some(MappingQuality(0)), cigar: Cigar([Op { kind: SoftClip, len: 11 }, Op { kind: Match, len: 105 }]), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, sequence: Sequence([]), quality_scores: QualityScores([]), data: Data([(Tag("NM"), UInt8(0)), (Tag("ms"), UInt8(210)), (Tag("AS"), UInt8(210)), (Tag("nn"), UInt8(0)), (Tag("tp"), Character(83)), (Tag("cm"), UInt8(18)), (Tag("s1"), UInt8(98)), (Tag("de"), Float(0.0)), (Tag("rl"), UInt8(0))]) }
alignment span = Some(105)


record = RecordBuf { name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55])), flags: Flags(SECONDARY), reference_sequence_id: Some(98291), alignment_start: Some(Position(605)), mapping_quality: Some(MappingQuality(0)), cigar: Cigar([Op { kind: SoftClip, len: 12 }, Op { kind: Match, len: 104 }]), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, sequence: Sequence([]), quality_scores: QualityScores([]), data: Data([(Tag("NM"), UInt8(0)), (Tag("ms"), UInt8(208)), (Tag("AS"), UInt8(208)), (Tag("nn"), UInt8(0)), (Tag("tp"), Character(83)), (Tag("cm"), UInt8(18)), (Tag("s1"), UInt8(98)), (Tag("de"), Float(0.0)), (Tag("rl"), UInt8(0))]) }
alignment span = Some(104)


record = RecordBuf { name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55])), flags: Flags(SECONDARY), reference_sequence_id: Some(98283), alignment_start: Some(Position(426)), mapping_quality: Some(MappingQuality(0)), cigar: Cigar([Op { kind: SoftClip, len: 34 }, Op { kind: Match, len: 82 }]), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, sequence: Sequence([]), quality_scores: QualityScores([]), data: Data([(Tag("NM"), UInt8(0)), (Tag("ms"), UInt8(164)), (Tag("AS"), UInt8(164)), (Tag("nn"), UInt8(0)), (Tag("tp"), Character(83)), (Tag("cm"), UInt8(12)), (Tag("s1"), UInt8(72)), (Tag("de"), Float(0.0)), (Tag("rl"), UInt8(0))]) }
alignment span = Some(82)


record = RecordBuf { name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55])), flags: Flags(SECONDARY), reference_sequence_id: Some(98289), alignment_start: Some(Position(456)), mapping_quality: Some(MappingQuality(0)), cigar: Cigar([Op { kind: SoftClip, len: 34 }, Op { kind: Match, len: 82 }]), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, sequence: Sequence([]), quality_scores: QualityScores([]), data: Data([(Tag("NM"), UInt8(0)), (Tag("ms"), UInt8(164)), (Tag("AS"), UInt8(164)), (Tag("nn"), UInt8(0)), (Tag("tp"), Character(83)), (Tag("cm"), UInt8(12)), (Tag("s1"), UInt8(72)), (Tag("de"), Float(0.0)), (Tag("rl"), UInt8(0))]) }
alignment span = Some(82)


record = RecordBuf { name: Some(Name([83, 82, 82, 49, 52, 50, 56, 54, 48, 53, 52, 46, 50, 49, 56, 55])), flags: Flags(SECONDARY), reference_sequence_id: Some(98281), alignment_start: Some(Position(465)), mapping_quality: Some(MappingQuality(0)), cigar: Cigar([Op { kind: SoftClip, len: 12 }, Op { kind: Match, len: 80 }, Op { kind: SoftClip, len: 24 }]), mate_reference_sequence_id: None, mate_alignment_start: None, template_length: 0, sequence: Sequence([]), quality_scores: QualityScores([]), data: Data([(Tag("NM"), UInt8(0)), (Tag("ms"), UInt8(160)), (Tag("AS"), UInt8(160)), (Tag("nn"), UInt8(0)), (Tag("tp"), Character(83)), (Tag("cm"), UInt8(13)), (Tag("s1"), UInt8(71)), (Tag("de"), Float(0.0)), (Tag("rl"), UInt8(0))]) }
alignment span = Some(80)

Presumably, the precise method of iteration shouldn't matter, right โ€” or at least one might expect any differences in behavior as critical as this to be loudly documented ;P.

Thanks!
Rob

from noodles.

rob-p avatar rob-p commented on August 22, 2024

Thanks for the super quick-fix and for the explanation. The pointer regarding the overflow field is also very useful. In this case, what would the cigar() function return? I must admit when I see Result<Option<T>> I immediately want to understand the error condition versus the normal empty condition, and where different information will be propagated back to the caller.

Thanks again for noodles. We've been using it across different projects and it's awesome to have a library for this in pure Rust, for so many different formats, and with such a responsive developer!

from noodles.

zaeleus avatar zaeleus commented on August 22, 2024

In this case, what would the cigar() function return?

bam::record::Cigar wraps a raw BAM CIGAR operations buffer. It can be either the cigar field or, if the flags as described above is set, the value of the CG data field.

Take, for example, the regression tests that were added for this issue.

static DATA: &[u8] = &[
0xff, 0xff, 0xff, 0xff, // ref_id = -1
0xff, 0xff, 0xff, 0xff, // pos = -1
0x02, // l_read_name = 2
0xff, // mapq = 255
0x48, 0x12, // bin = 4680
0x01, 0x00, // n_cigar_op = 1
0x04, 0x00, // flag = 4
0x04, 0x00, 0x00, 0x00, // l_seq = 0
0xff, 0xff, 0xff, 0xff, // next_ref_id = -1
0xff, 0xff, 0xff, 0xff, // next_pos = -1
0x00, 0x00, 0x00, 0x00, // tlen = 0
b'*', 0x00, // read_name = "*\x00"
0x40, 0x00, 0x00, 0x00, // cigar = 4M
0x12, 0x48, // sequence = ACGT
b'N', b'D', b'L', b'S', // quality scores
];
#[test]
fn test_cigar() -> io::Result<()> {
let fields = Fields::try_from(Vec::from(DATA))?;
let cigar = fields.cigar();
assert_eq!(cigar.as_ref(), &DATA[34..38]);
Ok(())
}
#[test]
fn test_cigar_with_2_cigar_ops() -> io::Result<()> {
let data = [
0xff, 0xff, 0xff, 0xff, // ref_id = -1
0xff, 0xff, 0xff, 0xff, // pos = -1
0x02, // l_read_name = 2
0xff, // mapq = 255
0x48, 0x12, // bin = 4680
0x02, 0x00, // n_cigar_op = 2
0x04, 0x00, // flag = 4
0x04, 0x00, 0x00, 0x00, // l_seq = 0
0xff, 0xff, 0xff, 0xff, // next_ref_id = -1
0xff, 0xff, 0xff, 0xff, // next_pos = -1
0x00, 0x00, 0x00, 0x00, // tlen = 0
b'*', 0x00, // read_name = "*\x00"
0x20, 0x00, 0x00, 0x00, 0x20, 0x00, 0x00, 0x00, // cigar = 2M2M
0x12, 0x48, // sequence = ACGT
b'N', b'D', b'L', b'S', // quality scores
];
let fields = Fields::try_from(Vec::from(&data))?;
let cigar = fields.cigar();
assert_eq!(cigar.as_ref(), &data[34..42]);
Ok(())
}
#[test]
fn test_cigar_with_overflowing_cigar() -> io::Result<()> {
let data = [
0xff, 0xff, 0xff, 0xff, // ref_id = -1
0xff, 0xff, 0xff, 0xff, // pos = -1
0x02, // l_read_name = 2
0xff, // mapq = 255
0x48, 0x12, // bin = 4680
0x02, 0x00, // n_cigar_op = 2
0x04, 0x00, // flag = 4
0x04, 0x00, 0x00, 0x00, // l_seq = 0
0xff, 0xff, 0xff, 0xff, // next_ref_id = -1
0xff, 0xff, 0xff, 0xff, // next_pos = -1
0x00, 0x00, 0x00, 0x00, // tlen = 0
b'*', 0x00, // read_name = "*\x00"
0x44, 0x00, 0x00, 0x00, 0x23, 0x00, 0x00, 0x00, // cigar = 2S2N
0x12, 0x48, // sequence = ACGT
b'N', b'D', b'L', b'S', // quality scores
b'C', b'G', b'B', b'I', 0x01, 0x00, 0x00, 0x00, 0x40, 0x00, 0x00,
0x00, // data["CG"] = [4M]
];
let fields = Fields::try_from(Vec::from(&data))?;
let cigar = fields.cigar();
assert_eq!(cigar.as_ref(), &data[56..]);
Ok(())
}

test_cigar is the typical case; it wraps the cigar field. test_cigar_with_2_cigar_ops has two CIGAR operations that do not match the overflow flag, and it, again, wraps the cigar field. test_cigar_with_overflowing_cigar has two CIGAR operations that do match the overflow flag, but instead of returning the cigar field, it searches the record auxiliary data for a CG tag and returns that value.

I must admit when I see Result<Option<T>> I immediately want to understand the error condition versus the normal empty condition, and where different information will be propagated back to the caller.

In the case of Record::alignment_span, it's fallible because it requires decoding the CIGAR operations, which can fail (e.g., an invalid operation type; see also Cigar::alignment_span). When an alignment span is calculated to be 0, we consider the record alignment span to be missing: the record could be unmapped (therefore, no CIGAR operations), it has no reference-consuming operations (which may be nonsensical), etc. There could certainly be a counterexample that I'm not aware of, and if that's the case, please let me know!

from noodles.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.