
Comments (5)

alimanfoo commented on September 11, 2024

(Here is the short read alignment v1.1 specification document, which defines the inputs, outputs and steps in the pipeline.)

Now imported into this repo; see spec version 1.1.1.


alimanfoo commented on September 11, 2024

Copying here a few details about the pipeline that were discussed previously in Slack...

We expect the pipeline to be run as a batch for a set of samples, where we know ahead of time how many lanelets of sequencing we have for each sample, and where we also know the locations of the data files containing the reads from each lanelet. (By "lanelet" I mean data from a single library from a single lane of sequencing, i.e., data that has already been demultiplexed. For mosquitoes we always have one library per sample, and aim for three lanelets per sample, although we can have anywhere from 1 to 6 lanelets.)

That's why the spec includes a "sample manifest" as one of the pipeline inputs. It's not really a sample manifest (that's a bad name); rather, it's a manifest of input data files, and it includes information about which data files correspond to which sample, to allow grouping by sample and merging. We have sometimes referred to this file informally as a "file of file names" or "FOFN".

Here is an example of one of these manifests:

AG1000G-AO.fofn.txt

The manifest includes one row per lanelet.

The "path" column is a bit weird, but the basic idea is that it includes some information about where to find the file. I say it's a bit weird because in fact the file is stored in IRODS, and so it would make more sense if the actual IRODS path was given in this column. But you can work out the IRODS path from what is given.

The "sample" field then has the sample identifier, and is used for grouping lanes together when merging to sample BAMs.

The "library" field can be ignored, as can the "study" field and the "ena_run" fields. It's really only the "path" and "sample" fields that matter.

What is not stated in the pipeline spec is that the first steps in the pipeline would actually have to pull the data files down from IRODS onto lustre. That is a deployment issue specific to running at Sanger.
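As a rough illustration of that staging step, assuming the standard iRODS icommands client is installed (`iget` is its download command; the destination layout here is made up):

```python
import subprocess
from pathlib import Path

def stage_from_irods(irods_path, dest_dir):
    """Pull one lanelet data file from IRODS onto local (lustre) disk.

    `iget -K` verifies the checksum after transfer.
    """
    dest = Path(dest_dir) / Path(irods_path).name
    subprocess.run(["iget", "-K", irods_path, str(dest)], check=True)
    return dest
```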

What is not fully explained in the spec, but hopefully clear, is that there is an initial part of the pipeline where you can scatter by lanelet; after that you have to group lanelets by sample and scatter by sample, as sketched below.
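In other words, the overall shape is roughly the following. This is a sketch only; `align_lanelet` and `merge_sample` are hypothetical stand-ins for the real pipeline steps:

```python
from concurrent.futures import ProcessPoolExecutor

def run_batch(by_sample, align_lanelet, merge_sample):
    """Scatter over lanelets, then group by sample and merge."""
    with ProcessPoolExecutor() as pool:
        # phase 1: one alignment task per lanelet, across all samples
        futures = {
            sample: [pool.submit(align_lanelet, path) for path in paths]
            for sample, paths in by_sample.items()
        }
        # phase 2: wait for each sample's lanelet BAMs, then merge them
        return {
            sample: merge_sample(sample, [f.result() for f in fs])
            for sample, fs in futures.items()
        }
```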

Within the lanelet alignment step I imagine you'd also want to split up the input file, do some work in parallel, then merge back together to get a lanelet BAM.
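For example, something along these lines, assuming `bwa` and `samtools` are on the PATH and the FASTQ has already been split into paired chunks (read-group handling and temp-file hygiene omitted):

```python
import subprocess

def align_chunk(ref_fasta, r1_fastq, r2_fastq, out_bam):
    """Align one pair of FASTQ chunks and coordinate-sort the output."""
    bwa = subprocess.Popen(["bwa", "mem", ref_fasta, r1_fastq, r2_fastq],
                           stdout=subprocess.PIPE)
    subprocess.run(["samtools", "sort", "-o", out_bam, "-"],
                   stdin=bwa.stdout, check=True)
    bwa.stdout.close()
    bwa.wait()

def merge_chunks(out_bam, chunk_bams):
    """Merge per-chunk BAMs back into a single lanelet BAM."""
    subprocess.run(["samtools", "merge", "-f", out_bam, *chunk_bams],
                   check=True)
```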

Within the per-sample steps I imagine you'd also want to split by genome region and do some work in parallel before merging back together to get a sample BAM.
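And similarly for the per-sample part, assuming the sample BAM is coordinate-sorted and indexed so that regions can be extracted; `process_region` is a hypothetical per-region step (e.g. duplicate marking) returning a processed region BAM:

```python
import subprocess

def process_sample_by_region(sample_bam, regions, process_region):
    """Split a sample BAM by genome region, process, and merge back."""
    region_bams = []
    for region in regions:  # e.g. ["2L", "2R", "3L", "3R", "X"]
        region_bam = f"{sample_bam}.{region.replace(':', '_')}.bam"
        subprocess.run(["samtools", "view", "-b", "-o", region_bam,
                        sample_bam, region], check=True)
        region_bams.append(process_region(region_bam))
    merged = sample_bam.replace(".bam", ".processed.bam")
    subprocess.run(["samtools", "merge", "-f", merged, *region_bams],
                   check=True)
    return merged
```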


alimanfoo commented on September 11, 2024

Surfacing some discussion here for visibility regarding implementation strategy and how to organise platform-agnostic versus platform-specific code.

Originally we discussed a strategy where a "core" platform-agnostic pipeline would be implemented, and then a "sanger" platform-specific wrapper/adapter pipeline would be implemented that dealt with the platform-specific configuration and called the core pipeline as an inner pipeline.

@JonKeatley112 recently raised the point that this may be hard or impossible to achieve in practice, because steps within the pipeline may need to be aware of platform-specific details, such as submitting to a specific queue or using a specific type of storage. He suggested an alternative approach where we first create a reference implementation of the pipeline, which does not include any platform-specific details, then use that as a basis for creating a platform-specific pipeline implementation.

FWIW it sounds to me like a good strategy to set creating a reference implementation of the pipeline as a first goal. I think this reference implementation should be a working pipeline, but it can make some simplifying assumptions, such as assuming the inputs are FASTQ files which already exist on local disk or can be downloaded from a URL. As long as it's possible to run the reference implementation on a local computer or development environment, that should be sufficient to show it works, and if we run it with a small test dataset (#6) we can validate that it works correctly.
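To illustrate that simplifying assumption, resolving an input in the reference implementation could be as simple as the following sketch (names are made up):

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

def resolve_fastq(source, cache_dir="inputs"):
    """Return a local path for an input that is either a file or a URL."""
    if Path(source).exists():
        return Path(source)
    dest = Path(cache_dir) / Path(urlparse(source).path).name
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(source, str(dest))  # download once
    return dest
```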

Once we have the reference implementation working and validated, we could then think about how to develop a Sanger platform-specific pipeline.

Hope that makes sense, very happy to discuss.


kbergin commented on September 11, 2024

I believe the checklist for this ticket is complete, would we consider this issue done?


alimanfoo commented on September 11, 2024

Tidying up here, done long ago!

