Part of <a class="issue-link js-issue-link" data-error-text="Failed to load title" dat

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

<a class="user-mention notranslate" data-hovercard-type="user" data-hover

Mosquito short read alignment pipeline - test dataset about pipelines HOT 8 CLOSED

alimanfoo commented on September 11, 2024

Mosquito short read alignment pipeline - test dataset

from pipelines.

Comments (8)

alimanfoo commented on September 11, 2024

Here's a proposal for some data to use for testing and validation.

Sequence reads for all samples in Ag1000G phase 1 are in ENA. Here's a sample metadata file for Ag1000G phase 1 which includes ENA sample accessions:

ftp://ngs.sanger.ac.uk/production/ag1000g/phase1/AR3/samples/samples.meta.txt

For example, the sample that we identify as "AN0131-C" happens to have the lowest mean coverage, and has the ENA sample accession "ERS224501". Sequence reads for this sample are available from this ENA sample page:

https://www.ebi.ac.uk/ena/data/view/ERS224501

ENA has the concept of a "run", which is equivalent to what we've been calling a "lanelet", i.e., demultiplexed data from a single sequencing lane. This page shows that there are three Illumina HiSeq 2000 runs of data for this sample. Data from each run are available in either FASTQ or BAM format. From this page you can follow links to download files from FTP. E.g., here are the forward and reverse reads for run ERR317337:

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR317/ERR317337/ERR317337_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR317/ERR317337/ERR317337_2.fastq.gz

I propose we use two samples with relatively low coverage from Ag1000G phase 1 as a minimal test set. E.g., we could use AN0131-C (ERS224501) and AB0252-C (ERS224143).

For all samples in Ag1000G phase 1 we have also previously processed the data through our old vr-pipe implementation of the alignment pipeline and uploaded the analysis-ready BAMs back to ENA. Those are a bit harder to find as they are not linked from the sample page unfortunately. But they are all part of the analysis project "PRJEB18691" available from this page:

https://www.ebi.ac.uk/ena/data/view/PRJEB18691

If you go to that page you will see a list of all analysis records. From there you can download a list of all of the analysis records as a text file. You can then search through that file for the sample you want. E.g., if looking for sample ERS224501, the corresponding alignment analysis is ERZ403847, and the analysis-ready BAM file is here:

ftp://ftp.sra.ebi.ac.uk/vol1/ERZ403/ERZ403847/AN0131_C.bam

I.e., this file provides the expected output from the pipeline for sample AN0131-C (ERS224501).

Here's a proposal for how to bring test data into this repo:

Create a file at pipelines/short-read-alignment-vector/fixtures/ag1000g-phase1-minimal/lanelets.tsv which is a tab-delimited file, one row per sequencing run, with 6 rows providing data for the 2 suggested samples, and the following columns:
- sample_id - our sample ID, e.g., "AN0131-C"
- read1_path - URL to forward reads in fastq, e.g., "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR317/ERR317337/ERR317337_1.fastq.gz"
- read2_path - URL to reverse reads in fastq, e.g., "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR317/ERR317337/ERR317337_2.fastq.gz"
Create a file at pipelines/short-read-alignment-vector/fixtures/ag1000g-phase1-minimal/expected.tsv which is a tab-delimited file, one row per sample, with 2 rows providing data for the 2 suggested samples, and the following columns:
- sample_id - our sample ID, e.g., "AN0131-C"
- path - URL to analysis-ready BAM, e.g., "ftp://ftp.sra.ebi.ac.uk/vol1/ERZ403/ERZ403847/AN0131_C.bam"

@tnguyensanger how does that sound? Would you be able to make a PR?

from pipelines.

alimanfoo commented on September 11, 2024

(Note that I'm implicitly suggesting we develop the pipeline to use FASTQ as input rather than BAM, because I suspect this is the more general case.)

from pipelines.

tnguyensanger commented on September 11, 2024

@alimanfoo

how does that sound? Would you be able to make a PR?

Sounds good. I'm debating to put this in another Sanger specific repo though. In our pipelines, we would be downloading from IRODS as opposed to from FTP. I'll need to document how to query and download these test files from IRODS as well. I can write a markdown document with links to the ENA downloads in this repo, but for end to end pipeline testing, I'll need to write another document with the Sanger specific IRODS paths.

(Note that I'm implicitly suggesting we develop the pipeline to use FASTQ as input rather than BAM, because I suspect this is the more general case.)

To be honest, I've never worked with any of the fastq files from core sequencing; I don't know if there are any caveats to using them. I'll dig into whether we'd lose any important bam/cram annotations. Operating directly from the core fastq files migh speed things up compared to our vrpipe legacy pipelines which always converted from bam/cram to chunked fastq before mapping.

from pipelines.

alimanfoo commented on September 11, 2024

Hi @tnguyensanger

how does that sound? Would you be able to make a PR?

Sounds good. I'm debating to put this in another Sanger specific repo though. In our pipelines, we would be downloading from IRODS as opposed to from FTP. I'll need to document how to query and download these test files from IRODS as well. I can write a markdown document with links to the ENA downloads in this repo, but for end to end pipeline testing, I'll need to write another document with the Sanger specific IRODS paths.

I wouldn't worry about any of that for now. For a reference implementation, it should be fine to work from FASTQ. For testing purposes these could be downloaded manually and run from local files.

Once we have a reference implementation, then as a second step we could figure out the best strategy for pulling data from IRODS within Sanger.

(Note that I'm implicitly suggesting we develop the pipeline to use FASTQ as input rather than BAM, because I suspect this is the more general case.)

To be honest, I've never worked with any of the fastq files from core sequencing; I don't know if there are any caveats to using them. I'll dig into whether we'd lose any important bam/cram annotations. Operating directly from the core fastq files migh speed things up compared to our vrpipe legacy pipelines which always converted from bam/cram to chunked fastq before mapping.

The first steps in the old vr-pipe implementation are all about converting the BAM into FASTQ, and so any BAM annotations would be lost anyway, so I'm confident that we can just start from the FASTQ as inputs and get the same result.

from pipelines.

tnguyensanger commented on September 11, 2024

@alimanfoo

It's not part of the new or old specs, but the vrpipe pipelines would always filter bams before they converted them to fastq for mapping such that:

reads that failed QC checks were removed
non-primary alignments were removed
chimeric (aka supplementary) alignments were removed

I assume that any fastq files released by Core would have failed reads already removed. But keeping in reads that have chimeric portions in the fastq might change the output of the new pipeline compared to previous releases.

I'll also put in a blurb in the actual spec pull request too.

from pipelines.

alimanfoo commented on September 11, 2024

@alimanfoo

It's not part of the new or old specs, but the vrpipe pipelines would always filter bams before they converted them to fastq for mapping such that:

reads that failed QC checks were removed

non-primary alignments were removed

chimeric (aka supplementary) alignments were removed

I assume that any fastq files released by Core would have failed reads already removed. But keeping in reads that have chimeric portions in the fastq might change the output of the new pipeline compared to previous releases.

I'll also put in a blurb in the actual spec pull request too.

Reply here.

from pipelines.

tnguyensanger commented on September 11, 2024

Some information regarding Sanger bam and ENA fastq inputs:

Raw sequencing bams/crams released from Sanger Core Sequencing do not contain any reads that have failed QC.
Sanger Core Sequencing uploads raw sequencing bams/crams to ENA, but never creates any fastq.
ENA converts raw sequencing crams to fastq using https://
github.com/samtools/htsjdk API, and converts raw sequencing bams to SRA then fastq using SRA API

Email correspondance with more details below:

On 09/04/2020, 13:08, "Rasko Leinonen via RT" [email protected] wrote:

Dear Colleague,

ENA users several APIs to convert submitted data into a standard Fastq files.
The generated Fastq files are described here:

https://ena-docs.readthedocs.io/en/latest/faq/archive-generated-files.html

We use different code depending on the submitted data format, but in most
cases including BAM, we convert data first to the NCBI SRA format and then use
NCBI SRA API to generate the Fastq files. In case of CRAMs we use https://
github.com/samtools/htsjdk to generate the Fastq files. We use APIs rather
than command line tools so can't unfortunately provide exact parameters used.
Due to resource limitations we are unable to provide or support access to the
code we have written, however, the document above describes the rules under
which the Fastq files are generated.

Best Regards,
Rasko Leinonen

On 07/04/2020, 20:50, "David Jackson via RT" [email protected] wrote:

> Do reads that fail QC checks from the sequencer ever get written to
> the raw bams/crams released from Core

No.

>   (ie reads with sam flag 512) ?
> 

If a bubble in the fluidics causes an indel inducing spatial artefact,
then reads from clusters in that region can have their QC bit set to
fail. The indels are detected from the <1% phiX spiked into each (non-
MiSeq) run.

> For illumina HiSeq reads, what QC checks would be done?
Per read, that spatial artefact check, is the only "check". Auxtags
will indictate whether any adapter was detected (non-NovaSeq runs).

David

from pipelines.

tnguyensanger commented on September 11, 2024

Some possible gotchas when comparing realigned bams from cromwell/WDL pipelines with legacy realigned bams:

The mosquito legacy pipelines would use the bedtools bam2fastq v1.1.0 binary to convert bam to fastq.
https://github.com/malariagen/legacy_pipelines/blob/master/vector-ops/pipelines/vo_agam/vo_agam_setups.pl

https://github.com/wtsi-team112/vr-pipe/blob/master/modules/VRPipe/Steps/bam_to_fastq.pm

They would also shuffle the alignments in the bams before converting to fastq and realigning.

https://github.com/wtsi-team112/vr-pipe/blob/master/modules/VRPipe/Steps/bam_shuffle_by_name.pm

The original reasons are lost to the departed bioinformaticists, but presumably it was done to remove any bias during remapping while calculating insert size. In older aligners, the earlier paired reads in the fastq would dictate the insert size. See https://gatkforums.broadinstitute.org/gatk/discussion/2908/howto-revert-a-bam-file-to-fastq-format and http://seqanswers.com/forums/archive/index.php/t-38985.html.

I am not sure if bam reshuffling is still necessary, but it is something to watch out for when we are comparing realigned bams from the test pipelines with legacy realigned bams.

from pipelines.

Mosquito short read alignment pipeline - test dataset about pipelines HOT 8 CLOSED

Comments (8)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent