broadinstitute / longbow

Annotation and segmentation of MAS-seq data

Home Page: https://broadinstitute.github.io/longbow/

License: BSD 3-Clause "New" or "Revised" License

Python 53.62% Dockerfile 0.26% WDL 45.95% Makefile 0.17%

longbow's People

Contributors

jamestwebber, jonn-smith, kvg

longbow's Issues

help. For the CD8 datasets

Hello,

Nice work! I have read your paper. Where can I download the CD8 datasets?
I need some large datasets for testing and comparison.

Thank you, all the best!

Splitting the segmented bam file?

Would it be possible to add a read tag for the MAS adapter to each record in the segmented bam file, so that we can split the bam file if needed? For example, if we use 10 MAS adapters for ligation, could we split the segmented bam file into 10 independent bam files?

Also, what is the memory requirement for the "longbow correct" step? It keeps giving memory errors.

Thanks
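
On the splitting question above: a possible workaround sketch (not a longbow feature), assuming the leading adapter name can be parsed from the segmented read names as seen in examples elsewhere in this tracker (e.g. names ending in D-3p_Adapter):

# Hypothetical workaround: split a segmented BAM into one file per
# leading MAS adapter parsed from the read name. The name layout
# ("<movie>/<zmw>/ccs/<range>/<adapter>-<adapter>") is an assumption
# based on reads shown in other issues here.
import pysam

def split_by_adapter(in_path, out_prefix):
    handles = {}
    with pysam.AlignmentFile(in_path, "rb", check_sq=False) as bam:
        for read in bam:
            adapter = read.query_name.rsplit("/", 1)[-1].split("-")[0]
            if adapter not in handles:
                handles[adapter] = pysam.AlignmentFile(
                    f"{out_prefix}.{adapter}.bam", "wb", template=bam
                )
            handles[adapter].write(read)
    for handle in handles.values():
        handle.close()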

Fix integration tests to be more independent of each other

Right now, integration tests run data through successive commands until the command under test has the appropriate input for execution. This makes testing each command dependent on all other commands. Instead, we should write the appropriate files to the test_data directory and test directly on those as input, thus removing the command dependencies.
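
A sketch of the proposed pattern, with a hypothetical fixture path and assuming the command's click entry point can be invoked via CliRunner:

# Illustrative only: test `segment` directly on a pre-annotated fixture
# in test_data/ rather than piping through `annotate` first. The fixture
# name, entry-point import, and -o flag are assumptions.
from click.testing import CliRunner

from longbow.segment.command import main as segment_main

def test_segment_runs_on_static_input(tmp_path):
    out_bam = tmp_path / "segmented.bam"
    result = CliRunner().invoke(
        segment_main, ["-o", str(out_bam), "test_data/annotated.bam"]
    )
    assert result.exit_code == 0
    assert out_bam.exists()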

processes issue

Requested 16 cores, #SBATCH --ntasks=16

cannot find Process-18 (?)

[INFO 2022-02-24 19:20:19 segment] Invoked via: longbow segment -t 16
[INFO 2022-02-24 19:20:19 annotate] Invoked via: longbow annotate -t 16 1395T.ccs.bam
[INFO 2022-02-24 19:20:19 annotate] Running with 16 worker subprocess(es)
[INFO 2022-02-24 19:20:19 segment] Running with 16 worker subprocess(es)
[INFO 2022-02-24 19:20:19 segment] Using simple splitting mode.
[INFO 2022-02-24 19:20:19 extract] Invoked via: longbow extract -o extracted.bam
[INFO 2022-02-24 19:20:19 extract] Writing extracted read segments to: extracted.bam
[INFO 2022-02-24 19:20:19 extract] Including 2 flanking bases.
[INFO 2022-02-24 19:20:19 annotate] Annotating 2492060 reads
Process Process-18:
Traceback (most recent call last):
File "/mnt/projects/CCR-SF/active/Software/tools/Anaconda/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/mnt/projects/CCR-SF/active/Software/tools/Anaconda/3.7/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/mnt/projects/CCR-SF/active/Software/tools/longbow/longbow-0.5.21/src/longbow/annotate/command.py", line 271, in _write_thread_fn
segments = bam_utils.collapse_annotations(ppath)
File "/mnt/projects/CCR-SF/active/Software/tools/longbow/longbow-0.5.21/src/longbow/utils/bam_utils.py", line 292, in collapse_annotations
for i, seg in enumerate(path):
TypeError: 'NoneType' object is not iterable

Fix PyPi failures

My fork of Pomegranate is preventing a proper release to PyPI:

…
Checking dist/maslongbow-0.5.39-py2.py3-none-any.whl: PASSED with warnings
WARNING  `long_description_content_type` missing. defaulting to `text/x-rst`.   
Checking dist/maslongbow-0.5.39.tar.gz: PASSED with warnings
WARNING  `long_description_content_type` missing. defaulting to `text/x-rst`.   
Uploading distributions to https://upload.pypi.org/legacy/
Uploading maslongbow-0.5.39-py2.py3-none-any.whl
  0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/110.2 kB • --:-- • ?
100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 110.2/110.2 kB • 00:00 • 6.3 MB/s
100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 110.2/110.2 kB • 00:00 • 6.3 MB/s
WARNING  Error during upload. Retry with the --verbose option for more details. 
ERROR    HTTPError: 400 Bad Request from https://upload.pypi.org/legacy/        
         Invalid value for requires_dist. Error: Can't have direct dependency:  
         'pomegranate @                                                         
         git+https://github.com/jonn-smith/pomegranate.git@d80ec9304bba4c42ee3943c887754fd78885cd55'

This needs to be fixed either by moving to a proper release of pomegranate (assuming the PR that fixes the heap overflow gets merged), by publishing my fork and depending on that, or by removing pomegranate and using something else.

Annotate runs into dictionary issue when iterating

I tried running annmas in the docker container, and annotate produces the following:

[INFO 2021-02-08 15:01:18] annmas: annotate started
[INFO 2021-02-08 15:01:18] Running with 4 worker subprocess(es)
Traceback (most recent call last):
  File "/annmas/venv/bin/annmas", line 11, in <module>
    load_entry_point('annmas', 'console_scripts', 'annmas')()
  File "/annmas/venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/annmas/venv/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/annmas/venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/annmas/venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/annmas/venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/annmas/src/annmas/annotate/command.py", line 91, in main
    m = build_default_model()
  File "/annmas/src/annmas/utils/model.py", line 272, in build_default_model
    full_model.bake()
  File "pomegranate/hmm.pyx", line 794, in pomegranate.hmm.HiddenMarkovModel.bake
  File "/annmas/venv/lib/python3.8/site-packages/networkx/classes/reportviews.py", line 718, in <genexpr>
    for nbr, dd in nbrs.items()
RuntimeError: dictionary keys changed during iteration
double free or corruption (out)
Aborted (core dumped)

Specifying UMI length

Hello, I ran longbow to split up my reads. The UMI length I was expecting was 12, but the output UMIs are all 10bp long. Do I need to specify an alternate model when I run longbow annotate, or is there another way to specify the UMI length?

Thanks!
Matt

longbow `stats` uses shortcuts for array ligation profiles and can be misleading for some models

Because longbow stats uses the length of a segment name to determine whether that segment is a MAS-seq key segment (single-character segment names are canonically used as keys), it can sometimes produce misleading ligation profiles.

For example, if one were to take mas15 model data and run stats on it using the mas15threeP model, the ligation profiles would look fine, but when it came time to split the reads into individual array elements, longbow could not actually perform any splitting, because the non-key segments end up in the wrong order for mas15threeP (because they are actually mas15).

In practice this should only occur if one created a library with one array order and then annotated / collected stats with a completely different library order / model.

SG tag may have an extra element on the front and back of the array

Right now, the SG tag appears to be annotating an extra region of the state sequence on the front and back of the array. For example, for this read,

m64020_201213_022403/23/ccs/23_807/10x_Adapter-3p_Adapter ... SG:Z:10x_Adapter:0-22,random:23-777,Poly_A:778-807,3p_Adapter:808-832

the state sequence says the first 22 bases should be the 10x_Adapter and the last 24 bases should be the 3p_Adapter. But the read actually contains this:

m64020_201213_022403_23_ccs_23_807_10x_Adapter-3p_Adapter

To compensate for now, I've written the inspect tool to ignore the first and last array elements of the SG tag (https://github.com/broadinstitute/annmas/blob/3ddb711b1bf5fb926c59cdc9f264e89075976e25/src/annmas/inspect/command.py#L180). But it does technically mean that the base ranges specified by the SG tag don't actually match up to the read sequence. We should fix this so that they do indeed match, and then change inspect so that it no longer has to do any trimming on the SG array.
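
For reference, a minimal sketch of parsing the SG tag into (name, start, end) tuples, which makes it easy to check whether the annotated ranges actually cover the read:

# Parse an SG tag value like the one above into (name, start, end) tuples.
def parse_sg(sg_value):
    segments = []
    for entry in sg_value.split(","):
        name, coords = entry.rsplit(":", 1)
        start, end = coords.split("-")
        segments.append((name, int(start), int(end)))
    return segments

# parse_sg("10x_Adapter:0-22,random:23-777,Poly_A:778-807,3p_Adapter:808-832")
# -> [("10x_Adapter", 0, 22), ("random", 23, 777), ...]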

Cannot run in python 3.8

There seem to be some issues in the libraries we use when running in python 3.8.

I think it's pomegranate, but I'm not 100% sure. I was able to get around this by using python 3.7.

See #18 for details.

The `XQ` tag must be truncated during `segment`

Currently the XQ tag is not clipped during segmentation to include only the sections present in the resulting segmented read. This should be fixed with logic similar to how the SG tag is clipped.
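
A minimal sketch of what the clipping could look like, assuming XQ holds one comma-delimited entry per SG section (as in examples elsewhere in this tracker, e.g. XQ:Z:44/46,0/0,...):

# Hypothetical fix: keep only the XQ entries for the SG sections that
# survive into the segmented read. The one-entry-per-section layout and
# the section indexing are assumptions.
def clip_list_tag(tag_value, first_section, last_section):
    entries = tag_value.split(",")
    return ",".join(entries[first_section:last_section + 1])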

Creating an isoseq3 compatible output

Hi guys, just documenting the spec for isoseq3 support

What I did:
As you know, I've posted the data and processing script here

Basically, I processed the reads with longbow and ran the combined extracted data through isoseq3 tag.
When trying to dedup the reads with isoseq3 dedup, it would fail with "terminate called without an active exception. Aborted."
Liz says this is because the read group tag is messed up. We have a script that iterates through our bams and changes the RG:Z: tag so that the read group is constant; we also change the zm:i: tag to a unique counter, since ZMWs are no longer unique. Unfortunately, this is ridiculously slow, and it would be nice to have isoseq3-compatible output from longbow.
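
A hedged sketch of that retagging workaround (the tag names come from the lists below; everything else is illustrative, not the actual script):

# Force a constant read group and replace zm with a unique counter so
# isoseq3 dedup accepts the reads.
import pysam

def retag(in_path, out_path, read_group="constant"):
    with pysam.AlignmentFile(in_path, "rb", check_sq=False) as src, \
         pysam.AlignmentFile(out_path, "wb", template=src) as dst:
        for counter, read in enumerate(src):
            read.set_tag("RG", read_group, value_type="Z")
            read.set_tag("zm", counter, value_type="i")
            dst.write(read)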

Old tags (after isoseq3 tag --design 16B-10U-T):

  • ec:f:9.77338
  • np:i:8
  • rq:f:0.999347
  • sn:B:f,13.13,19.2964,4.94478,8.61356
  • we:i:2756052
  • ws:i:110592
  • zm:i:59572499
  • YS:f:-5961.52
  • RC:i:0
  • XQ:Z:44/46,0/0,56/60,48/50,30/32,46/46,0/0,44/60,48/50,30/32,46/46,0/0,58/60,48/50,32/32
  • YQ:Z:0.9397
  • YN:Z:mas15
  • YV:i:1
  • YG:i:3
  • YK:Z:N
  • SG:Z:10x_Adapter:0-22,random:23-698,Poly_A:699-727,3p_Adapter:728-752
  • RG:Z:9e129d4c
  • XC:Z:TCAGGTGCAGGTCGGA
  • XM:Z:TCCTGCGCAT
  • XA:Z:XC-XM

New tags:

  • XC:Z:TCAGGTGCAGGTCGGA
  • XM:Z:TCCTGCGCAT
  • ic:i:1
  • im:Z:m64013e_211031_055434/1/ccs
  • is:i:1
  • it:Z:TCAGGTGCAGGTCGGATCCTGCGCAT
  • zm:i:60314
  • RG:Z:e4927d21

Lmk if you need more info

New segment cigars need to be refactored

The new cigars that are produced by the model for each segment need to be updated in 2 ways:

  1. The cigar operator and count are currently in the wrong order. The proper order is: <count><op>
  2. The cigar operators are currently the states in the HMM. They should be changed to reflect the real cigar operators, including = and X.

This will disproportionately affect inspect, but will also have ramifications for other tools.
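
An illustration of fix 1 (the state-to-operator mapping for fix 2 is model-specific, so a placeholder dict stands in for it here):

# Render (state, count) pairs from the HMM path as "<count><op>" cigar
# elements. STATE_TO_OP is a hypothetical mapping; the real fix would
# translate HMM states to operators including '=' and 'X'.
STATE_TO_OP = {"M": "M", "I": "I", "D": "D"}

def render_cigar(pairs):
    return "".join(f"{count}{STATE_TO_OP[state]}" for state, count in pairs)

# render_cigar([("M", 10), ("I", 2)]) -> "10M2I"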

bump python requirement to <3.8

python 3.8 is listed in setup.py, so it should be supported. It looks like there are a few test failures as well as syntax warnings:

  • is not "" is now a SyntaxWarning, because literals are not guaranteed a constant identity; it needs to be changed to != "" (see the example after this list)
  • tests are failing due to some change in the language; need to investigate. Edit: this is a pysam issue
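
For concreteness, the warning and its fix:

name = "x"

# Python 3.8 warns here ("SyntaxWarning: 'is not' with a literal"),
# because identity comparison against a literal is unreliable:
if name is not "":
    pass

# The intended comparison is equality:
if name != "":
    pass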

documentation on how to create new models

There will certainly be interest in creating additional models, e.g. with fewer segments to accommodate larger inserts, or with other schemes of information embedded in each segment. A tutorial on how to train new models would be very useful.

Models need more descriptive adapter names.

The new adapter names are based on ancient mythical figures. These names do not describe the actual function or location of the adapters, which makes it harder to remember what each adapter does.

We should rename all adapters to indicate their use / function in the library.

Add automatic bam / pbi indexing when creating files

Since we're writing out all the data anyway, it should be straightforward to write out an index file to go along with the output.

Ideally both a bai and a pbi file should be created, but only when we are outputting to a file rather than stdout.
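
A minimal sketch of the post-write step, assuming output went to a real file: pysam can build the bai directly, while the pbi would need PacBio's external pbindex tool:

# Index a freshly written BAM: .bai via pysam, .pbi via pbindex (from
# PacBio's pbbam package, invoked as a subprocess).
import subprocess

import pysam

def index_output(bam_path):
    pysam.index(bam_path)  # writes bam_path + ".bai"
    subprocess.run(["pbindex", bam_path], check=True)  # writes bam_path + ".pbi"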

`segment` UMI/CBC filtering is currently broken

Segment used to filter out reads that should have a CBC or UMI but didn't actually contain them.

Currently this functionality seems to be broken:

[INFO 2022-08-18 10:42:09  segment] Model has UMI annotation.
[INFO 2022-08-18 10:42:09  segment] Model has Cell Barcode annotation.
[WARNING 2022-08-18 10:42:09  segment] Model does not have Cell Barcode or UMI tags.  All segments will be emitted.

How to select models for anno

Hello:

How does one prepare custom models to suit a custom library type? For example, suppose one has prepared a standard 15 array element library but uses new linker primers.

$ longbow model --list-models
[INFO 2021-11-02 15:27:28 model] Invoked via: longbow model --list-models
Longbow includes the following models:
Name                  Version  Description
mas15                 1.0.0    The standard MAS-seq 15 array element model.
mas15v2               2.0.0    The standard MAS-seq 15 array element model.
mas15threeP           2.0.0    The 3' kit MAS-seq 15 array element model.
mas15BulkWithIndices  1.0.0    A MAS-seq 15 array element model with a 10 base index just before the 3' adapter for bulk sequencing.
mas10                 1.0.0    The MAS-seq 10 array element model.
mas10v2               2.0.0    The MAS-seq 10 array element model.
mas10threeP           2.0.0    The 3' kit MAS-seq 10 array element model.
slide-seq             0.0.1    The Slide-seq 15 array element model.
slide-seqV2           2.0.0    The Slide-seq 15 array element model.
mas8prototype         1.0.0    The prototype MAS-seq 8 array element model.
mas8prototypeV2       2.0.0    The prototype MAS-seq 8 array element model.

Thank you!

Add in a maximum length for reads to run through the HMM

In practice, we only run Longbow on reads of length <= 60kb. This is to prevent pathological cases where the HMM takes a very long time to run and consumes a lot of memory.

We should make this the default behavior for all Longbow tools that run reads through the HMM (e.g. annotate, demultiplex).

We can add an override for the length threshold so users can change it to their liking.
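
A minimal sketch of the guard, with the 60kb figure above as the default (the constant name and function are illustrative):

# Skip reads longer than a configurable threshold before the HMM sees
# them; 60,000 bp mirrors the informal cutoff described above.
MAX_HMM_READ_LENGTH = 60_000

def should_annotate(read, max_length=MAX_HMM_READ_LENGTH):
    seq = read.query_sequence or ""
    return len(seq) <= max_length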

`Sift` tool must be updated to work for arbitrary models / libraries

Currently sift only works for mas_15_sc_10x5p_single_none.

We must expand this tool to work for any input library by removing the "validation model" (10x_sc_10x5p_single_none) and deriving a custom validation model at runtime for the input model.

In addition, we must address the following:

  • sift should be updated to include information about MAS adapters, rather than relying on random segments at the start of each array element
  • model::validate_model must be updated to remove the "skip" for the validation model. Validation models should be valid models, or the logic must be updated to somehow validate them.

Aligned data causes longbow tools to crash

Currently we are using a dummy/empty bam header to re-create read objects in each subprocess (serialization is necessary because of how objects are passed between processes).

Because of how pysam parses serialized reads (pysam.AlignedSegment.tostring and pysam.AlignedSegment.fromstring), aligned data are not properly deserialized back into read objects with this empty header - only the read name and sam flags are properly parsed.

This is a showstopper for running on any data that have been aligned already.

Mitigation of this problem will require refactoring all tools to have their subprocess worker process functions ingest an actual pysam.AlignmentHeader dictionary so it can be used to deserialize the reads (the output process functions should already have a required header, so we should double-check that this header is used to deserialize output reads in the output processes).

Here is a snippet of code that will do what we need for a worker process function:

import multiprocessing as mp

import pysam

with pysam.AlignmentFile(input_bam, "rb", check_sq=False, require_index=False) as bam_file:
    # Start worker sub-processes, handing each one the header as a dict:
    worker_process_pool = []
    for _ in range(num_procs):
        p = mp.Process(
            target=dummy_work_fun,
            args=(process_input_data_queue, results, bam_file.header.to_dict()),
        )
        p.start()
        worker_process_pool.append(p)

...

def dummy_work_fun(in_queue, out_queue, bam_header_dict):
    while True:
        raw_data = in_queue.get()

        # Check for exit sentinel:
        if raw_data is None:
            return

        print("Re-creating read object from string...")
        read = pysam.AlignedSegment.fromstring(
            raw_data, pysam.AlignmentHeader.from_dict(bam_header_dict)
        )
        print("Created.")
        print(f"Read object: {read}")

The `correct` command will soon be refactored to work in this manner (when the jts_correct_symspellpy branch is merged).

New read name hashes have collisions.

The hash algorithm used to generate the new ZMWs in the read names seems to have collisions. This causes critical issues in downstream processing.
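
One collision-free alternative, echoing the unique-counter workaround described in the isoseq3 issue above (a sketch, not the current implementation):

# Assign new ZMW values from a sequential counter keyed by the original
# read name: identical names map to the same value, distinct names can
# never collide.
import itertools

_counter = itertools.count()
_assigned = {}

def zmw_for(read_name):
    if read_name not in _assigned:
        _assigned[read_name] = next(_counter)
    return _assigned[read_name]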

Review docker build steps to remove experimental docker features.

In the latest major update, I added some ssh commands to the docker build; however, these commands require experimental docker features.

We need to investigate whether the added ssh commands are still necessary. Alternatively, we should check whether the required experimental features have been rolled into a stable docker release, and update accordingly.

0.3.0 Regression in model performance

Refactoring in the 0.3.0 release caused a serious regression in model performance. Something about how the models are now being constructed is causing the annotate step to produce incorrect results.

Longbow 0.3.0 should not be used - use 0.2.2 instead.

missing umi and/or cell tag from aligned reads after longbow processing

Hi,

we just received MAS-ISO-seq long-read data for a single-cell 3' test, which I am trying to process following the Longbow pipeline. The bam file from the longbow output gives an error when I try to group the detected UMIs and mark the molecules with the same UMIs using UMI-tools. It throws:
""" Read skipped, missing umi and/or cell tag: 61628044"""

Here are the steps I ran:

1. Annotation, filtering, segmentation, extraction:

longbow annotate -m mas15v2 -t 30 hifi_reads.bam | longbow filter | longbow segment | longbow extract -o filter_passed.bam

2. Primary alignment of the extracted bam to the hg38 genome using minimap2:

samtools fastq filtered_passed.bam | minimap2 -ayYL --MD -x splice:hq Homo_sapiens.GRCh38.dna.primary_assembly.fa - |
samtools sort > align.bam &&
samtools index align.bam

3. Baseline transcript annotation via StringTie 2 to create new transcriptome annotations specific to our test data:

stringtie align.bam -Lv -G gencode.v38.annotation2.gtf -o annotations.gtf -A gene_abund.out

4. From the resulting transcript annotations, we created a new transcript reference FASTA file using gffread:

gffread -w transcriptome.fa -g Homo_sapiens.GRCh38.dna.primary_assembly.fa annotations.gtf

5. Aligned the extracted reads to the novel transcriptome reference using minimap2 v2.17-r941:

samtools fastq filtered_passed.bam | minimap2 -ayYL --MD -x splice:hq transcriptome.fa - |
samtools sort > align_2.bam &&
samtools index align_2.bam

6. Grouped the detected UMIs and marked the molecules with the same UMI using UMI-tools:

umi_tools group --stdin=align_2.bam --buffer-whole-contig --no-sort-output --per-gene --gene-tag XG --extract-umi-method tag --umi-tag ZU --group-out=out.tsv --output-bam

Here it throws the "missing umi and/or cell tag" error for the aligned reads.

Could you please tell me why the UMI is missing from the longbow output files?

Thanks

Inconsistency in adapter retention in first segment of read vs others

Looks like the 10x adapter is being trimmed off the first segment, but retained in all subsequent segments:

m64020_201213_022403_23_ccs_23_807_10x_Adapter-3p_Adapter
m64020_201213_022403_23_ccs_1449_2603_D-3p_Adapter
m64020_201213_022403_23_ccs_2644_3215_E-3p_Adapter

Thing is, I don't actually know which way is correct. If we're going to be feeding these BAM files into the PacBio IsoSeq tools (https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-deduplication.md) at the tag stage, it might be necessary to trim the 10x adapter off of all the reads first.

So, we'll have to figure out what that tool is expecting and make our changes accordingly.

longbow correct consumes very large amount of memory

Hello,

I am using longbow to correct the barcodes, but I keep running out of memory. I am running the sample on a system that has 247G of memory. I reduced the number of threads to 10 and also brought the number of raw reads down to 10K, but the problem persists.

Here is the command I am running.

longbow correct --force --output-bam merged_segmented_corrected.bam --barcode-uncorrectable-bam merged_segmented_uncorrectable.bam --allow-list /var/lib/cwl/stgfd2864f9-b944-4a63-bf01-85757c06fad1/3M-february-2018.txt --model mas10threeP --threads 10 /var/lib/cwl/stg5e0294c1-7b30-4699-ad51-408d4fb594dc/merged_segmented.bam

I am running version 0.5.29 of longbow, since the mas10threeP model was used for the experiment. Also, I am running longbow correct after the segment step. Please let me know if I need to do anything differently.

Thank you,
Sagnik.

Multiple concurrent runs overwriting ._longbow_filter_*.bam?

Hi, I'm running multiple samples through longbow at the same time on our compute cluster. When I do this, it seems that longbow filter writes the ._longbow_filter_*.bam files for every sample to the same path. Is this an issue, and is there a way to specify the names of these temporary files?

Thanks!
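
If the fixed temporary name is indeed the culprit, the standard fix on the longbow side would be per-invocation unique names; a minimal sketch:

# Create the intermediate file with a unique, per-invocation name
# instead of a fixed ._longbow_filter_*.bam pattern.
import os
import tempfile

fd, tmp_bam = tempfile.mkstemp(prefix="._longbow_filter_", suffix=".bam", dir=".")
os.close(fd)  # the BAM writer will reopen the path itself
# ... write filtered reads to tmp_bam, then move or remove it when done ...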
