
Comments (5)

alimanfoo commented on September 11, 2024

(Here is the short read alignment v1.1 specification document, which defines the inputs, outputs and steps in the pipeline.)

Now imported into this repo; see spec version 1.1.1.


alimanfoo commented on September 11, 2024

Copying here a few details about the pipeline that were discussed previously in Slack...

We expect the pipeline to be run as a batch for a set of samples, where we know ahead of time how many lanelets of sequencing we have for each sample, and where we also know the locations of the data files containing the reads from each lanelet. (By "lanelet" I mean data from a single library from a single lane of sequencing, i.e., data that has already been demultiplexed. For mosquitoes we always have one library per sample, and aim for three lanelets per sample, although we can have anywhere from 1 to 6 lanelets.)

That's why the spec includes a "sample manifest" as one of the pipeline inputs. It's not really a sample manifest (that's a bad name); rather, it's a manifest of input data files, and it includes information about which data files correspond to which sample, to allow grouping by sample and merging. We have sometimes referred to this file informally as a "file of file names" or "FOFN".

Here is an example of one of these manifests:

AG1000G-AO.fofn.txt

The manifest includes one row per lanelet.

The "path" column is a bit weird, but the basic idea is that it includes some information about where to find the file. I say it's a bit weird because in fact the file is stored in IRODS, and so it would make more sense if the actual IRODS path was given in this column. But you can work out the IRODS path from what is given.

The "sample" field then has the sample identifier, and is used for grouping lanes together when merging to sample BAMs.

The "library" field can be ignored, as can the "study" field and the "ena_run" fields. It's really only the "path" and "sample" fields that matter.

What is not stated in the pipeline spec is that the first steps in the pipeline would actually have to pull the data files down from IRODS onto lustre. That is a deployment issue specific to running at Sanger.
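As a rough illustration of that staging step, assuming the standard iRODS icommands client is installed (`iget` is its download command; the destination layout here is made up):

```python
import subprocess
from pathlib import Path

def stage_from_irods(irods_path, dest_dir):
    """Pull one lanelet data file from IRODS onto local (lustre) disk.

    `iget -K` verifies the checksum after transfer.
    """
    dest = Path(dest_dir) / Path(irods_path).name
    subprocess.run(["iget", "-K", irods_path, str(dest)], check=True)
    return dest
```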

What is not fully explained in the spec, but hopefully clear, is that there is an initial part of the pipeline where you can scatter by lanelet; after that you have to group lanelets by sample and scatter by sample, as sketched below.
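In other words, the overall shape is roughly the following. This is a sketch only; `align_lanelet` and `merge_sample` are hypothetical stand-ins for the real pipeline steps:

```python
from concurrent.futures import ProcessPoolExecutor

def run_batch(by_sample, align_lanelet, merge_sample):
    """Scatter over lanelets, then group by sample and merge."""
    with ProcessPoolExecutor() as pool:
        # phase 1: one alignment task per lanelet, across all samples
        futures = {
            sample: [pool.submit(align_lanelet, path) for path in paths]
            for sample, paths in by_sample.items()
        }
        # phase 2: wait for each sample's lanelet BAMs, then merge them
        return {
            sample: merge_sample(sample, [f.result() for f in fs])
            for sample, fs in futures.items()
        }
```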

Within the lanelet alignment step I imagine you'd also want to split up the input file, do some work in parallel, then merge back together to get a lanelet BAM.
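For example, something along these lines, assuming `bwa` and `samtools` are on the PATH and the FASTQ has already been split into paired chunks (read-group handling and temp-file hygiene omitted):

```python
import subprocess

def align_chunk(ref_fasta, r1_fastq, r2_fastq, out_bam):
    """Align one pair of FASTQ chunks and coordinate-sort the output."""
    bwa = subprocess.Popen(["bwa", "mem", ref_fasta, r1_fastq, r2_fastq],
                           stdout=subprocess.PIPE)
    subprocess.run(["samtools", "sort", "-o", out_bam, "-"],
                   stdin=bwa.stdout, check=True)
    bwa.stdout.close()
    bwa.wait()

def merge_chunks(out_bam, chunk_bams):
    """Merge per-chunk BAMs back into a single lanelet BAM."""
    subprocess.run(["samtools", "merge", "-f", out_bam, *chunk_bams],
                   check=True)
```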

Within the per-sample steps I imagine you'd also want to split by genome region and do some work in parallel before merging back together to get a sample BAM.
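And similarly for the per-sample part, assuming the sample BAM is coordinate-sorted and indexed so that regions can be extracted; `process_region` is a hypothetical per-region step (e.g. duplicate marking) returning a processed region BAM:

```python
import subprocess

def process_sample_by_region(sample_bam, regions, process_region):
    """Split a sample BAM by genome region, process, and merge back."""
    region_bams = []
    for region in regions:  # e.g. ["2L", "2R", "3L", "3R", "X"]
        region_bam = f"{sample_bam}.{region.replace(':', '_')}.bam"
        subprocess.run(["samtools", "view", "-b", "-o", region_bam,
                        sample_bam, region], check=True)
        region_bams.append(process_region(region_bam))
    merged = sample_bam.replace(".bam", ".processed.bam")
    subprocess.run(["samtools", "merge", "-f", merged, *region_bams],
                   check=True)
    return merged
```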


alimanfoo commented on September 11, 2024

Surfacing some discussion here for visibility regarding implementation strategy and how to organise platform-agnostic versus platform-specific code.

Originally we discussed a strategy where a "core" platform-agnostic pipeline would be implemented, and then a "sanger" platform-specific wrapper/adapter pipeline would be implemented that dealt with the platform-specific configuration and called the core pipeline as an inner pipeline.

@JonKeatley112 recently raised the point that this may be hard or impossible to achieve in practice, because steps within the pipeline may need to be aware of platform-specific details, such as submitting to a specific queue or using a specific type of storage. He suggested an alternative approach where we first create a reference implementation of the pipeline, which does not include any platform-specific details, then use that as a basis for creating a platform-specific pipeline implementation.

FWIW it sounds to me like a good strategy to set creating a reference implementation of the pipeline as a first goal. I think this reference implementation should be a working pipeline, but it can make some simplifying assumptions, such as assuming the inputs are FASTQ files which already exist on local disk or can be downloaded from a URL. As long as it's possible to run the reference implementation on a local computer or development environment, that should be sufficient to show it works, and if we run it with a small test dataset (#6) we can validate that it works correctly.
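To illustrate that simplifying assumption, resolving an input in the reference implementation could be as simple as the following sketch (names are made up):

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

def resolve_fastq(source, cache_dir="inputs"):
    """Return a local path for an input that is either a file or a URL."""
    if Path(source).exists():
        return Path(source)
    dest = Path(cache_dir) / Path(urlparse(source).path).name
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(source, str(dest))  # download once
    return dest
```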

Once we have the reference implementation working and validated, we could then think about how to develop a Sanger platform-specific pipeline.

Hope that makes sense, very happy to discuss.


kbergin commented on September 11, 2024

I believe the checklist for this ticket is complete, would we consider this issue done?


alimanfoo commented on September 11, 2024

Tidying up here, done long ago!

