broadinstitute / longbow

Annotation and segmentation of MAS-seq data

Home Page: https://broadinstitute.github.io/longbow/

License: BSD 3-Clause "New" or "Revised" License

Python 53.62% Dockerfile 0.26% WDL 45.95% Makefile 0.17%

longbow's People

Contributors

jamestwebber, jonn-smith, kvg

longbow's Issues

help. For the CD8 datasets

Hello,

Nice work! I have read your paper. Where can I download the CD8 datasets?
I need some large datasets for testing and comparison.

Thank you, all the best!

Splitting the segmented bam file?

Would it be possible to add a read tag for the MAS adapter to each record in the segmented bam file, so that we can split the bam file if needed? For example, if we use 10 MAS adapters for ligation, could we split the segmented bam file into 10 independent bam files?

Also, what is the memory requirement for the "longbow correct" step? It keeps giving memory errors.

Thanks
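
On the splitting question above: a possible workaround sketch (not a longbow feature), assuming the leading adapter name can be parsed from the segmented read names as seen in examples elsewhere in this tracker (e.g. names ending in D-3p_Adapter):

# Hypothetical workaround: split a segmented BAM into one file per
# leading MAS adapter parsed from the read name. The name layout
# ("<movie>/<zmw>/ccs/<range>/<adapter>-<adapter>") is an assumption
# based on reads shown in other issues here.
import pysam

def split_by_adapter(in_path, out_prefix):
    handles = {}
    with pysam.AlignmentFile(in_path, "rb", check_sq=False) as bam:
        for read in bam:
            adapter = read.query_name.rsplit("/", 1)[-1].split("-")[0]
            if adapter not in handles:
                handles[adapter] = pysam.AlignmentFile(
                    f"{out_prefix}.{adapter}.bam", "wb", template=bam
                )
            handles[adapter].write(read)
    for handle in handles.values():
        handle.close()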

Fix integration tests to be more independent of each other

Right now, integration tests run data through successive commands until the command under test has the appropriate input for execution. This makes testing each command dependent on all other commands. Instead, we should write the appropriate files to the test_data directory and test directly on those as input, thus removing the command dependencies.
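
A sketch of the proposed pattern, with a hypothetical fixture path and assuming the command's click entry point can be invoked via CliRunner:

# Illustrative only: test `segment` directly on a pre-annotated fixture
# in test_data/ rather than piping through `annotate` first. The fixture
# name, entry-point import, and -o flag are assumptions.
from click.testing import CliRunner

from longbow.segment.command import main as segment_main

def test_segment_runs_on_static_input(tmp_path):
    out_bam = tmp_path / "segmented.bam"
    result = CliRunner().invoke(
        segment_main, ["-o", str(out_bam), "test_data/annotated.bam"]
    )
    assert result.exit_code == 0
    assert out_bam.exists()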

processes issue

Requested 16 cores, #SBATCH --ntasks=16

cannot find Process-18 (?)

[INFO 2022-02-24 19:20:19 segment] Invoked via: longbow segment -t 16
[INFO 2022-02-24 19:20:19 annotate] Invoked via: longbow annotate -t 16 1395T.ccs.bam
[INFO 2022-02-24 19:20:19 annotate] Running with 16 worker subprocess(es)
[INFO 2022-02-24 19:20:19 segment] Running with 16 worker subprocess(es)
[INFO 2022-02-24 19:20:19 segment] Using simple splitting mode.
[INFO 2022-02-24 19:20:19 extract] Invoked via: longbow extract -o extracted.bam
[INFO 2022-02-24 19:20:19 extract] Writing extracted read segments to: extracted.bam
[INFO 2022-02-24 19:20:19 extract] Including 2 flanking bases.
[INFO 2022-02-24 19:20:19 annotate] Annotating 2492060 reads
Process Process-18:
Traceback (most recent call last):
File "/mnt/projects/CCR-SF/active/Software/tools/Anaconda/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/mnt/projects/CCR-SF/active/Software/tools/Anaconda/3.7/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/mnt/projects/CCR-SF/active/Software/tools/longbow/longbow-0.5.21/src/longbow/annotate/command.py", line 271, in _write_thread_fn
segments = bam_utils.collapse_annotations(ppath)
File "/mnt/projects/CCR-SF/active/Software/tools/longbow/longbow-0.5.21/src/longbow/utils/bam_utils.py", line 292, in collapse_annotations
for i, seg in enumerate(path):
TypeError: 'NoneType' object is not iterable

Fix PyPi failures

My fork of Pomegranate is preventing a proper release to PyPI:

…
Checking dist/maslongbow-0.5.39-py2.py3-none-any.whl: PASSED with warnings
WARNING  `long_description_content_type` missing. defaulting to `text/x-rst`.   
Checking dist/maslongbow-0.5.39.tar.gz: PASSED with warnings
WARNING  `long_description_content_type` missing. defaulting to `text/x-rst`.   
Uploading distributions to https://upload.pypi.org/legacy/
Uploading maslongbow-0.5.39-py2.py3-none-any.whl
  0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/110.2 kB • --:-- • ?
100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 110.2/110.2 kB • 00:00 • 6.3 MB/s
100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 110.2/110.2 kB • 00:00 • 6.3 MB/s
WARNING  Error during upload. Retry with the --verbose option for more details. 
ERROR    HTTPError: 400 Bad Request from https://upload.pypi.org/legacy/        
         Invalid value for requires_dist. Error: Can't have direct dependency:  
         'pomegranate @                                                         
         git+https://github.com/jonn-smith/pomegranate.git@d80ec9304bba4c42ee3943c887754fd78885cd55'

This needs to be fixed either by moving to a proper release of pomegranate (assuming the PR that fixes the heap overflow gets merged), by publishing my fork and depending on that, or by removing pomegranate and using something else.

Annotate runs into dictionary issue when iterating

I tried running annmas in the docker container, and annotate produces the following:

[INFO 2021-02-08 15:01:18] annmas: annotate started
[INFO 2021-02-08 15:01:18] Running with 4 worker subprocess(es)
Traceback (most recent call last):
  File "/annmas/venv/bin/annmas", line 11, in <module>
    load_entry_point('annmas', 'console_scripts', 'annmas')()
  File "/annmas/venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/annmas/venv/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/annmas/venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/annmas/venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/annmas/venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/annmas/src/annmas/annotate/command.py", line 91, in main
    m = build_default_model()
  File "/annmas/src/annmas/utils/model.py", line 272, in build_default_model
    full_model.bake()
  File "pomegranate/hmm.pyx", line 794, in pomegranate.hmm.HiddenMarkovModel.bake
  File "/annmas/venv/lib/python3.8/site-packages/networkx/classes/reportviews.py", line 718, in <genexpr>
    for nbr, dd in nbrs.items()
RuntimeError: dictionary keys changed during iteration
double free or corruption (out)
Aborted (core dumped)

Specifying UMI length

Hello, I ran longbow to split up my reads. The UMI length I was expecting was 12, but the output UMIs are all 10bp long. Do I need to specify an alternate model when I run longbow annotate, or is there another way to specify the UMI length?

Thanks!
Matt

longbow `stats` uses shortcuts for array ligation profiles and can be misleading for some models

Because longbow stats uses the length of a segment name to determine whether that segment is a MAS-seq key segment (single-character segment names are canonically used as keys), it can sometimes produce misleading ligation profiles.

For example, if one were to take mas15 model data and run stats on it using the mas15threeP model, the ligation profiles would look fine, but when it came time to split the reads into individual array elements, longbow could not actually perform any splitting, because the non-key segments end up in the wrong order for mas15threeP (because they are actually mas15).

In practice this should only occur if one created a library with one array order and then annotated / collected stats with a completely different library order / model.

SG tag may have an extra element on the front and back of the array

Right now, the SG tag appears to be annotating an extra region of the state sequence on the front and back of the array. For example, for this read,

m64020_201213_022403/23/ccs/23_807/10x_Adapter-3p_Adapter ... SG:Z:10x_Adapter:0-22,random:23-777,Poly_A:778-807,3p_Adapter:808-832

the state sequence says the first 22 bases should be the 10x_Adapter and the last 24 bases should be the 3p_Adapter. But the read actually contains this:

m64020_201213_022403_23_ccs_23_807_10x_Adapter-3p_Adapter

To compensate for now, I've written the inspect tool to ignore the first and last array elements of the SG tag (https://github.com/broadinstitute/annmas/blob/3ddb711b1bf5fb926c59cdc9f264e89075976e25/src/annmas/inspect/command.py#L180). But it does technically mean that the base ranges specified by the SG tag don't actually match up to the read sequence. We should fix this so that they do indeed match, and then change inspect so that it no longer has to do any trimming on the SG array.
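
For reference, a minimal sketch of parsing the SG tag into (name, start, end) tuples, which makes it easy to check whether the annotated ranges actually cover the read:

# Parse an SG tag value like the one above into (name, start, end) tuples.
def parse_sg(sg_value):
    segments = []
    for entry in sg_value.split(","):
        name, coords = entry.rsplit(":", 1)
        start, end = coords.split("-")
        segments.append((name, int(start), int(end)))
    return segments

# parse_sg("10x_Adapter:0-22,random:23-777,Poly_A:778-807,3p_Adapter:808-832")
# -> [("10x_Adapter", 0, 22), ("random", 23, 777), ...]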

Cannot run in python 3.8

There seem to be some issues in the libraries we use when running in python 3.8.

I think it's pomegranate, but I'm not 100% sure. I was able to get around this by using python 3.7.

See #18 for details.

The `XQ` tag must be truncated during `segment`

Currently the XQ tag is not clipped during segmentation to include only the sections present in the resulting segmented read. This should be fixed with logic similar to how the SG tag is clipped.
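
A minimal sketch of what the clipping could look like, assuming XQ holds one comma-delimited entry per SG section (as in examples elsewhere in this tracker, e.g. XQ:Z:44/46,0/0,...):

# Hypothetical fix: keep only the XQ entries for the SG sections that
# survive into the segmented read. The one-entry-per-section layout and
# the section indexing are assumptions.
def clip_list_tag(tag_value, first_section, last_section):
    entries = tag_value.split(",")
    return ",".join(entries[first_section:last_section + 1])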

Creating an isoseq3 compatible output

Hi guys, just documenting the spec for isoseq3 support

What I did:
As you know, I've posted the data and processing script here

Basically, I processed the reads with longbow and ran the combined extracted data through isoseq3 tag.
When trying to dedup the reads with isoseq3 dedup, it would fail with "terminate called without an active exception. Aborted."
Liz says this is because the read group tag is messed up. We have a script that iterates through our bams and changes the RG:Z: tag so that the read group is constant; we also change the zm:i: tag to a unique counter, since ZMWs are no longer unique. Unfortunately, this is ridiculously slow, and it would be nice to have isoseq3-compatible output from longbow.
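
A hedged sketch of that retagging workaround (the tag names come from the lists below; everything else is illustrative, not the actual script):

# Force a constant read group and replace zm with a unique counter so
# isoseq3 dedup accepts the reads.
import pysam

def retag(in_path, out_path, read_group="constant"):
    with pysam.AlignmentFile(in_path, "rb", check_sq=False) as src, \
         pysam.AlignmentFile(out_path, "wb", template=src) as dst:
        for counter, read in enumerate(src):
            read.set_tag("RG", read_group, value_type="Z")
            read.set_tag("zm", counter, value_type="i")
            dst.write(read)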

Old tags (after isoseq3 tag --design 16B-10U-T):

  • ec:f:9.77338
  • np:i:8
  • rq:f:0.999347
  • sn:B:f,13.13,19.2964,4.94478,8.61356
  • we:i:2756052
  • ws:i:110592
  • zm:i:59572499
  • YS:f:-5961.52
  • RC:i:0
  • XQ:Z:44/46,0/0,56/60,48/50,30/32,46/46,0/0,44/60,48/50,30/32,46/46,0/0,58/60,48/50,32/32
  • YQ:Z:0.9397
  • YN:Z:mas15
  • YV:i:1
  • YG:i:3
  • YK:Z:N
  • SG:Z:10x_Adapter:0-22,random:23-698,Poly_A:699-727,3p_Adapter:728-752
  • RG:Z:9e129d4c
  • XC:Z:TCAGGTGCAGGTCGGA
  • XM:Z:TCCTGCGCAT
  • XA:Z:XC-XM

New tags:

  • XC:Z:TCAGGTGCAGGTCGGA
  • XM:Z:TCCTGCGCAT
  • ic:i:1
  • im:Z:m64013e_211031_055434/1/ccs
  • is:i:1
  • it:Z:TCAGGTGCAGGTCGGATCCTGCGCAT
  • zm:i:60314
  • RG:Z:e4927d21

Lmk if you need more info

New segment cigars need to be refactored

The new cigars that are produced by the model for each segment need to be updated in 2 ways:

  1. The cigar operator and count are currently in the wrong order. The proper order is: <count><op>
  2. The cigar operators are currently the states in the HMM. They should be changed to reflect the real cigar operators, including = and X.

This will disproportionately affect inspect, but will also have ramifications for other tools.
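
An illustration of fix 1 (the state-to-operator mapping for fix 2 is model-specific, so a placeholder dict stands in for it here):

# Render (state, count) pairs from the HMM path as "<count><op>" cigar
# elements. STATE_TO_OP is a hypothetical mapping; the real fix would
# translate HMM states to operators including '=' and 'X'.
STATE_TO_OP = {"M": "M", "I": "I", "D": "D"}

def render_cigar(pairs):
    return "".join(f"{count}{STATE_TO_OP[state]}" for state, count in pairs)

# render_cigar([("M", 10), ("I", 2)]) -> "10M2I"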

bump python requirement to <3.8

python 3.8 is listed in setup.py, so it should be supported. It looks like there are a few test failures as well as syntax warnings:

  • is not "" is now a SyntaxWarning, because literals are not guaranteed a constant identity; it needs to be changed to != "" (see the example after this list)
  • tests are failing due to some change in the language; need to investigate. Edit: this is a pysam issue
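
For concreteness, the warning and its fix:

name = "x"

# Python 3.8 warns here ("SyntaxWarning: 'is not' with a literal"),
# because identity comparison against a literal is unreliable:
if name is not "":
    pass

# The intended comparison is equality:
if name != "":
    pass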

documentation on how to create new models

There will certainly be interest in creating additional models, e.g. with fewer segments to accommodate larger inserts, or with other schemes of information embedded in each segment. A tutorial on how to train new models would be very useful.

Models need more descriptive adapter names.

The new adapter names are based on ancient mythical figures. These names do not describe the actual function or location of the adapters, which makes it harder to remember what each adapter does.

We should rename all adapters to indicate their use / function in the library.

Add automatic bam / pbi indexing when creating files

Since we're writing out all the data anyway, it should be straightforward to write out an index file to go along with the output.

Ideally both a bai and a pbi file should be created, but only when we are outputting to a file rather than stdout.
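
A minimal sketch of the post-write step, assuming output went to a real file: pysam can build the bai directly, while the pbi would need PacBio's external pbindex tool:

# Index a freshly written BAM: .bai via pysam, .pbi via pbindex (from
# PacBio's pbbam package, invoked as a subprocess).
import subprocess

import pysam

def index_output(bam_path):
    pysam.index(bam_path)  # writes bam_path + ".bai"
    subprocess.run(["pbindex", bam_path], check=True)  # writes bam_path + ".pbi"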

`segment` UMI/CBC filtering is currently broken

Segment used to filter out reads that should have a CBC or UMI but didn't actually contain them.

Currently this functionality seems to be broken:

[INFO 2022-08-18 10:42:09  segment] Model has UMI annotation.
[INFO 2022-08-18 10:42:09  segment] Model has Cell Barcode annotation.
[WARNING 2022-08-18 10:42:09  segment] Model does not have Cell Barcode or UMI tags.  All segments will be emitted.

How to select models for anno

Hello:

How does one prepare custom models to suit a custom library type? For example, suppose one has prepared a standard 15 array element library but uses new linker primers.

$ longbow model --list-models
[INFO 2021-11-02 15:27:28 model] Invoked via: longbow model --list-models
Longbow includes the following models:
Name                  Version  Description
mas15                 1.0.0    The standard MAS-seq 15 array element model.
mas15v2               2.0.0    The standard MAS-seq 15 array element model.
mas15threeP           2.0.0    The 3' kit MAS-seq 15 array element model.
mas15BulkWithIndices  1.0.0    A MAS-seq 15 array element model with a 10 base index just before the 3' adapter for bulk sequencing.
mas10                 1.0.0    The MAS-seq 10 array element model.
mas10v2               2.0.0    The MAS-seq 10 array element model.
mas10threeP           2.0.0    The 3' kit MAS-seq 10 array element model.
slide-seq             0.0.1    The Slide-seq 15 array element model.
slide-seqV2           2.0.0    The Slide-seq 15 array element model.
mas8prototype         1.0.0    The prototype MAS-seq 8 array element model.
mas8prototypeV2       2.0.0    The prototype MAS-seq 8 array element model.

Thank you!

Add in a maximum length for reads to run through the HMM

In practice, we only run Longbow on reads of length <= 60kb. This is to prevent pathological cases where the HMM takes a very long time to run and consumes a lot of memory.

We should make this the default behavior for all Longbow tools that run reads through the HMM (e.g. annotate, demultiplex).

We can add an override for the length threshold so users can change it to their liking.
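
A minimal sketch of the guard, with the 60kb figure above as the default (the constant name and function are illustrative):

# Skip reads longer than a configurable threshold before the HMM sees
# them; 60,000 bp mirrors the informal cutoff described above.
MAX_HMM_READ_LENGTH = 60_000

def should_annotate(read, max_length=MAX_HMM_READ_LENGTH):
    seq = read.query_sequence or ""
    return len(seq) <= max_length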

`Sift` tool must be updated to work for arbitrary models / libraries

Currently sift only works for mas_15_sc_10x5p_single_none.

We must expand this tool to work for any input library by removing the "validation model" (10x_sc_10x5p_single_none) and deriving a custom validation model at runtime for the input model.

In addition, we must address the following:

  • sift should be updated to include information about MAS adapters, rather than relying on random segments at the start of each array element
  • model::validate_model must be updated to remove the "skip" for the validation model. Validation models should be valid models, or the logic must be updated to somehow validate them.

Aligned data causes longbow tools to crash

Currently we are using a dummy/empty bam header to re-create read objects in each subprocess (serialization is necessary because of how objects are passed between processes).

Because of how pysam parses serialized reads (pysam.AlignedSegment.tostring and pysam.AlignedSegment.fromstring), aligned data are not properly deserialized back into read objects with this empty header - only the read name and sam flags are properly parsed.

This is a showstopper for running on any data that have been aligned already.

Mitigation of this problem will require refactoring all tools to have their subprocess worker process functions ingest an actual pysam.AlignmentHeader dictionary so it can be used to deserialize the reads (the output process functions should already have a required header, so we should double-check that this header is used to deserialize output reads in the output processes).

Here is a snippet of code that will do what we need for a worker process function:

import multiprocessing as mp

import pysam

with pysam.AlignmentFile(input_bam, "rb", check_sq=False, require_index=False) as bam_file:
    # Start worker sub-processes, handing each one the header as a dict:
    worker_process_pool = []
    for _ in range(num_procs):
        p = mp.Process(
            target=dummy_work_fun,
            args=(process_input_data_queue, results, bam_file.header.to_dict()),
        )
        p.start()
        worker_process_pool.append(p)

...

def dummy_work_fun(in_queue, out_queue, bam_header_dict):
    while True:
        raw_data = in_queue.get()

        # Check for exit sentinel:
        if raw_data is None:
            return

        print("Re-creating read object from string...")
        read = pysam.AlignedSegment.fromstring(
            raw_data, pysam.AlignmentHeader.from_dict(bam_header_dict)
        )
        print("Created.")
        print(f"Read object: {read}")

The `correct` command will soon be refactored to work in this manner (when the jts_correct_symspellpy branch is merged).

New read name hashes have collisions.

The hash algorithm used to generate the new ZMWs in the read names seems to have collisions. This causes critical issues in downstream processing.
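
One collision-free alternative, echoing the unique-counter workaround described in the isoseq3 issue above (a sketch, not the current implementation):

# Assign new ZMW values from a sequential counter keyed by the original
# read name: identical names map to the same value, distinct names can
# never collide.
import itertools

_counter = itertools.count()
_assigned = {}

def zmw_for(read_name):
    if read_name not in _assigned:
        _assigned[read_name] = next(_counter)
    return _assigned[read_name]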

Review docker build steps to remove experimental docker features.

In the latest major update, I added some ssh commands to the docker build; however, these commands require experimental docker features.

We need to investigate whether the added ssh commands are still necessary. Alternatively, we should check whether the required experimental features have been rolled into a stable docker release, and update accordingly.

0.3.0 Regression in model performance

Refactoring in the 0.3.0 release caused a serious regression in model performance. Something about how the models are now being constructed is causing the annotate step to produce incorrect results.

Longbow 0.3.0 should not be used - use 0.2.2 instead.

missing umi and/or cell tag from aligned reads after longbow processing

Hi,

we just received MAS-ISO-seq long-read data for a single-cell 3' test, which I am trying to process following the Longbow pipeline. The bam file from the longbow output gives an error when I try to group the detected UMIs and mark the molecules with the same UMIs using UMI-tools. It throws:
""" Read skipped, missing umi and/or cell tag: 61628044"""

Here are the steps I ran:

1. Annotation, filtering, segmentation, extraction:

longbow annotate -m mas15v2 -t 30 hifi_reads.bam | longbow filter | longbow segment | longbow extract -o filter_passed.bam

2. Primary alignment of the extracted bam to the hg38 genome using minimap2:

samtools fastq filtered_passed.bam | minimap2 -ayYL --MD -x splice:hq Homo_sapiens.GRCh38.dna.primary_assembly.fa - |
samtools sort > align.bam &&
samtools index align.bam

3. Baseline transcript annotation via StringTie 2 to create new transcriptome annotations specific to our test data:

stringtie align.bam -Lv -G gencode.v38.annotation2.gtf -o annotations.gtf -A gene_abund.out

4. From the resulting transcript annotations, we created a new transcript reference FASTA file using gffread:

gffread -w transcriptome.fa -g Homo_sapiens.GRCh38.dna.primary_assembly.fa annotations.gtf

5. Aligned the extracted reads to the novel transcriptome reference using minimap2 v2.17-r941:

samtools fastq filtered_passed.bam | minimap2 -ayYL --MD -x splice:hq transcriptome.fa - |
samtools sort > align_2.bam &&
samtools index align_2.bam

6. Grouped the detected UMIs and marked the molecules with the same UMI using UMI-tools:

umi_tools group --stdin=align_2.bam --buffer-whole-contig --no-sort-output --per-gene --gene-tag XG --extract-umi-method tag --umi-tag ZU --group-out=out.tsv --output-bam

Here it throws the "missing umi and/or cell tag" error for the aligned reads.

Could you please tell me why the UMI is missing from the longbow output files?

Thanks

Inconsistency in adapter retention in first segment of read vs others

Looks like the 10x adapter is being trimmed off the first segment, but retained in all subsequent segments:

m64020_201213_022403_23_ccs_23_807_10x_Adapter-3p_Adapter
m64020_201213_022403_23_ccs_1449_2603_D-3p_Adapter
m64020_201213_022403_23_ccs_2644_3215_E-3p_Adapter

Thing is, I don't actually know which way is correct. If we're going to be feeding these BAM files into the PacBio IsoSeq tools (https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-deduplication.md) at the tag stage, it might be necessary to trim the 10x adapter off of all the reads first.

So, we'll have to figure out what that tool is expecting and make our changes accordingly.

longbow correct consumes very large amount of memory

Hello,

I am using longbow to correct the barcodes, but I keep running out of memory. I am running the sample on a system that has 247G of memory. I reduced the number of threads to 10 and also brought the number of raw reads down to 10K, but the problem persists.

Here is the command I am running.

longbow correct --force --output-bam merged_segmented_corrected.bam --barcode-uncorrectable-bam merged_segmented_uncorrectable.bam --allow-list /var/lib/cwl/stgfd2864f9-b944-4a63-bf01-85757c06fad1/3M-february-2018.txt --model mas10threeP --threads 10 /var/lib/cwl/stg5e0294c1-7b30-4699-ad51-408d4fb594dc/merged_segmented.bam

I am running version 0.5.29 of longbow, since the mas10threeP model was used for the experiment. Also, I am running longbow correct after the segment step. Please let me know if I need to do anything differently.

Thank you,
Sagnik.

Multiple concurrent runs overwriting ._longbow_filter_*.bam?

Hi, I'm running multiple samples through longbow at the same time on our compute cluster. When I do this, it seems that longbow filter writes the ._longbow_filter_*.bam files for every sample to the same path. Is this an issue, and is there a way to specify the names of these temporary files?

Thanks!
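
If the fixed temporary name is indeed the culprit, the standard fix on the longbow side would be per-invocation unique names; a minimal sketch:

# Create the intermediate file with a unique, per-invocation name
# instead of a fixed ._longbow_filter_*.bam pattern.
import os
import tempfile

fd, tmp_bam = tempfile.mkstemp(prefix="._longbow_filter_", suffix=".bam", dir=".")
os.close(fd)  # the BAM writer will reopen the path itself
# ... write filtered reads to tmp_bam, then move or remove it when done ...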
