broadinstitute / longbow
Annotation and segmentation of MAS-seq data
Home Page: https://broadinstitute.github.io/longbow/
License: BSD 3-Clause "New" or "Revised" License
The new models command must be updated to dump the following:
Hello,
Nice work! I have read your paper. Where can I download the CD8 datasets? I need some large datasets for testing and comparison.
Thank you!
This seems to be a boundary issue when cutting up the string and performing the comparison between annotated known sequences and the truth for those sequences.
git doesn't appreciate when we put binary files in a repo. We need to migrate all tests from bam files to sam files. Some tests do this already.
The current model of the 3' kit uses 10 bases for the UMI. This was accurate for older 10x kits, but not the current ones.
The model should be updated to use a 12bp UMI.
Will it be possible to have a read tag for the MAS adapter used in the segmented bam file, so that we can split the bam file if needed? For example, if we are using 10 MAS adapters for ligation, can we split the segmented bam file into 10 independent bam files?
Also, what is the memory requirement for the "longbow correct" step? It's continuously giving memory errors.
Thanks
Currently tagfix changes the segments tag but not the segment cigar tag. Both should be updated when this tool is run.
Right now, integration tests run data through successive commands until the command under test has the appropriate input for execution. This makes testing each command dependent on all other commands. Instead, we should write the appropriate files to the test_data directory and test directly on those as input, thus removing the command dependencies.
Requested 16 cores, #SBATCH --ntasks=16
cannot find Process-18 (?)
[INFO 2022-02-24 19:20:19 segment] Invoked via: longbow segment -t 16
[INFO 2022-02-24 19:20:19 annotate] Invoked via: longbow annotate -t 16 1395T.ccs.bam
[INFO 2022-02-24 19:20:19 annotate] Running with 16 worker subprocess(es)
[INFO 2022-02-24 19:20:19 segment] Running with 16 worker subprocess(es)
[INFO 2022-02-24 19:20:19 segment] Using simple splitting mode.
[INFO 2022-02-24 19:20:19 extract] Invoked via: longbow extract -o extracted.bam
[INFO 2022-02-24 19:20:19 extract] Writing extracted read segments to: extracted.bam
[INFO 2022-02-24 19:20:19 extract] Including 2 flanking bases.
[INFO 2022-02-24 19:20:19 annotate] Annotating 2492060 reads
Process Process-18:
Traceback (most recent call last):
File "/mnt/projects/CCR-SF/active/Software/tools/Anaconda/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/mnt/projects/CCR-SF/active/Software/tools/Anaconda/3.7/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/mnt/projects/CCR-SF/active/Software/tools/longbow/longbow-0.5.21/src/longbow/annotate/command.py", line 271, in _write_thread_fn
segments = bam_utils.collapse_annotations(ppath)
File "/mnt/projects/CCR-SF/active/Software/tools/longbow/longbow-0.5.21/src/longbow/utils/bam_utils.py", line 292, in collapse_annotations
for i, seg in enumerate(path):
TypeError: 'NoneType' object is not iterable
My fork of Pomegranate is preventing proper release to PyPI:
…
Checking dist/maslongbow-0.5.39-py2.py3-none-any.whl: PASSED with warnings
WARNING `long_description_content_type` missing. defaulting to `text/x-rst`.
Checking dist/maslongbow-0.5.39.tar.gz: PASSED with warnings
WARNING `long_description_content_type` missing. defaulting to `text/x-rst`.
Uploading distributions to https://upload.pypi.org/legacy/
Uploading maslongbow-0.5.39-py2.py3-none-any.whl
0%   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/110.2 kB • --:-- • ?
100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 110.2/110.2 kB • 00:00 • 6.3 MB/s
WARNING  Error during upload. Retry with the --verbose option for more details.
ERROR    HTTPError: 400 Bad Request from https://upload.pypi.org/legacy/
         Invalid value for requires_dist. Error: Can't have direct dependency:
         'pomegranate @ git+https://github.com/jonn-smith/pomegranate.git@d80ec9304bba4c42ee3943c887754fd78885cd55'
This needs to be fixed by either moving to a proper version of pomegranate (assuming the PR gets merged in to fix the heap overflow), by releasing my fork and using that, or by removing pomegranate and using something else.
I tried running annmas in the docker container, and annotate produces the following:
[INFO 2021-02-08 15:01:18] annmas: annotate started
[INFO 2021-02-08 15:01:18] Running with 4 worker subprocess(es)
Traceback (most recent call last):
File "/annmas/venv/bin/annmas", line 11, in <module>
load_entry_point('annmas', 'console_scripts', 'annmas')()
File "/annmas/venv/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/annmas/venv/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/annmas/venv/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/annmas/venv/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/annmas/venv/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/annmas/src/annmas/annotate/command.py", line 91, in main
m = build_default_model()
File "/annmas/src/annmas/utils/model.py", line 272, in build_default_model
full_model.bake()
File "pomegranate/hmm.pyx", line 794, in pomegranate.hmm.HiddenMarkovModel.bake
File "/annmas/venv/lib/python3.8/site-packages/networkx/classes/reportviews.py", line 718, in <genexpr>
for nbr, dd in nbrs.items()
RuntimeError: dictionary keys changed during iteration
double free or corruption (out)
Aborted (core dumped)
We have a lot of other summary stats, but not this one.
Hello, I ran longbow to split up my reads. The UMI length I was expecting was 12; however, the output UMIs are all 10bp long. Do I need to specify an alternate model when I run longbow annotate? Or is there another way to specify the UMI length?
Thanks!
Matt
Because longbow stats uses the length of a segment name to determine whether that segment is a MAS-seq key segment (single-character segment names are canonically used), it can sometimes create misleading ligation profiles.
For example, if one were to take mas15 model data and run stats on it using the mas15threeP model, the ligation profiles would look fine, but when it came time to split into individual array elements, longbow could not actually perform any splitting, because the non-key segments end up in the wrong order for mas15threeP (since they are actually mas15).
In practice this should only occur if one created a library with one order and annotated / collected stats using a completely different library order / model.
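The name-length heuristic at fault can be paraphrased as follows (this is the logic described above, not the actual longbow source):

```python
# Paraphrase of the heuristic: segments with single-character names are
# treated as MAS-seq key (adapter) segments; longer names are not.
def is_key_segment(segment_name):
    return len(segment_name) == 1
```

Nothing here checks which model's key ordering the names follow, which is why a mas15 array can produce a clean-looking ligation profile when summarized under mas15threeP.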
Right now, the SG tag appears to be annotating an extra region of the state sequence on the front and back of the array. For example, for this read,
m64020_201213_022403/23/ccs/23_807/10x_Adapter-3p_Adapter ... SG:Z:10x_Adapter:0-22,random:23-777,Poly_A:778-807,3p_Adapter:808-832
the state sequence says the first 22 bases should be the 10x_Adapter and the last 24 bases should be the 3p_Adapter. But the read actually contains this:
To compensate for now, I've written the inspect tool to ignore the first and last array elements of the SG tag (https://github.com/broadinstitute/annmas/blob/3ddb711b1bf5fb926c59cdc9f264e89075976e25/src/annmas/inspect/command.py#L180). But this technically means that the base ranges specified by the SG tag don't actually match up to the read sequence. We should fix this so that they do match, and then change inspect so that it no longer has to do any trimming on the SG array.
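For reference, the SG tag layout in the example read above can be parsed with a few lines (assuming the comma-separated name:start-end format shown; this is an illustrative helper, not longbow code):

```python
def parse_sg_tag(sg_value):
    """Parse an SG tag value like '10x_Adapter:0-22,random:23-777' into a
    list of (name, start, end) tuples, assuming the comma-separated
    name:start-end layout shown in the example read."""
    segments = []
    for token in sg_value.split(","):
        name, span = token.rsplit(":", 1)  # rsplit: names may not contain ':'
        start, end = span.split("-")
        segments.append((name, int(start), int(end)))
    return segments
```

Once the base ranges are fixed to match the read sequence, a parser like this could be checked against the read length directly, instead of dropping the first and last elements.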
There seem to be some issues in the libraries we use when running in python 3.8. I think it's pomegranate, but I'm not 100% sure. I was able to get around this by using python 3.7.
See #18 for details.
Should key off a bam tag for the barcode and operate like a combination of peek and correct, but without using symspell, because the indices would be too big to keep in memory.
Currently the XQ tag is not clipped during segmentation to include only the sections in the resulting segmented read. This should be fixed with logic similar to how the SG tag is clipped.
Hi guys, just documenting the spec for isoseq3 support.
What I did:
As you know, I've posted the data and processing script here.
Basically, I processed the reads with longbow and ran the combined extracted data through isoseq3 tag.
When trying to dedup the reads with isoseq3 dedup, it would say "terminate called without an active exception. Aborted".
Liz says this is because the read group tag is messed up. We have a script to iterate through our bams and change the RG:Z: tag so that the read group is constant. We also change the zm:i: tag to a unique counter, since ZMWs are no longer unique. Unfortunately, this is ridiculously slow, and it would be nice to have isoseq3-compatible output from longbow.
Old tags (after isoseq3 tag --design 16B-10U-T):
New tags:
Lmk if you need more info
The new cigars that are produced by the model for each segment need to be updated in 2 ways: they should be emitted in <count><op> form, and they should use the = and X operators. This will disproportionately affect inspect, but will also have ramifications in other tools.
python 3.8 is listed in setup.py, so it should be supported. It looks like there are a few test failures as well as syntax warnings: is not "" is now a SyntaxWarning, because literals are not guaranteed a constant identity; it needs to be changed to != "".
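A minimal illustration of the warning and the fix:

```python
# Comparing to a string literal with `is` / `is not` checks object identity,
# which Python does not guarantee for literals (hence the SyntaxWarning in
# Python 3.8+). Compare by value with == / != instead.
s = "".join([])        # an empty string constructed at runtime

non_empty = s != ""    # value comparison: False, as expected
# The old form, `s is not ""`, emits a SyntaxWarning and its result depends
# on string interning, so it must be rewritten as `s != ""`.
```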
sub-commands could be:
etc.
Our current read tags are listed in the FAQ in our documentation.
We need to unify/expand this list with what PacBio tools expect so our reads can seamlessly work in their toolset.
The integration test for demultiplex is failing. It seems to not like data from a pipe for some reason.
There will certainly be interest in creating additional models -- e.g. with fewer segments to accommodate larger inserts, or other schemes of information embedded in each segment. A tutorial on how to train new models would be very useful.
The new adapter names are based on ancient mythical figures. These names do not describe what the adapters actually do or where they sit in the array, which makes it harder to remember their function.
We should rename all adapters to indicate their use / function in the library.
The tools have a lot of repeated code to spin things up (e.g. model selection, input parameter parsing, etc.).
This code should be moved to a utility module and reused across all tools.
Since we're writing out all the data anyway, it should be straightforward to write out an index file to go along with the output.
Ideally both a bai and a pbi file should be created, and only if we are outputting to a file and not stdout.
Segment used to filter out reads that should have a CBC or UMI but didn't actually contain them.
Currently this functionality seems to be broken:
[INFO 2022-08-18 10:42:09 segment] Model has UMI annotation.
[INFO 2022-08-18 10:42:09 segment] Model has Cell Barcode annotation.
[WARNING 2022-08-18 10:42:09 segment] Model does not have Cell Barcode or UMI tags. All segments will be emitted.
Hello,
How can one prepare custom-defined models to suit a custom library type, for example a variant of the standard 15 array element model, or one that uses new linker primers?
$ longbow model --list-models
[INFO 2021-11-02 15:27:28 model] Invoked via: longbow model --list-models
Longbow includes the following models:
Name Version Description
mas15 1.0.0 The standard MAS-seq 15 array element model.
mas15v2 2.0.0 The standard MAS-seq 15 array element model.
mas15threeP 2.0.0 The 3' kit MAS-seq 15 array element model.
mas15BulkWithIndices 1.0.0 A MAS-seq 15 array element model with a 10 base index just before the 3' adapter for bulk sequencing.
mas10 1.0.0 The MAS-seq 10 array element model.
mas10v2 2.0.0 The MAS-seq 10 array element model.
mas10threeP 2.0.0 The 3' kit MAS-seq 10 array element model.
slide-seq 0.0.1 The Slide-seq 15 array element model.
slide-seqV2 2.0.0 The Slide-seq 15 array element model.
mas8prototype 1.0.0 The prototype MAS-seq 8 array element model.
mas8prototypeV2 2.0.0 The prototype MAS-seq 8 array element model.
Thank you!
Currently the pomegranate HMM implementation works well, but is somewhat slow and brittle.
We should investigate pyro and the latest / upcoming version of pomegranate (which is supposedly based on pytorch) to see if these issues can be mitigated or improved.
In practice, we only run Longbow on reads of length <= 60kb. This is to prevent pathological cases where the HMM takes a very long time to run and consumes a lot of memory.
We should make this the default behavior for all Longbow tools that run reads through the HMM (e.g. annotate, demultiplex).
We can add an override for the length threshold so a user can change it to their liking.
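A sketch of what the shared default could look like; the 60 kb figure comes from the text above, while the function and option names are assumptions:

```python
DEFAULT_MAX_READ_LENGTH = 60_000  # the 60 kb cap described above

def passes_length_filter(read_length, max_length=DEFAULT_MAX_READ_LENGTH):
    """Return True if a read is short enough to run through the HMM.

    Each HMM-based command (annotate, demultiplex, ...) would expose
    max_length as a user override (e.g. a hypothetical --max-length option).
    """
    return read_length <= max_length
```

Reads failing the filter would be skipped (or passed through unannotated) rather than sent to the HMM, avoiding the pathological runtime and memory cases.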
Currently sift only works for mas_15_sc_10x5p_single_none.
We must expand this tool to work for any input library by removing the "validation model" (10x_sc_10x5p_single_none) and deriving a custom validation model at runtime for the input model.
In addition, we must address the following:
- sift should be updated to include information about MAS adapters, rather than relying on random segments at the start of each array element.
- model::validate_model must be updated to remove the "skip" for the validation model. Validation models should be valid models, or the logic must be updated to somehow validate them.

Correct uses a lot of RAM for the lookup table.
If a user requests more than 1 thread, warn them that this is probably a bad idea and tell them approximately how much memory the lookup table will take.
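A sketch of the proposed warning; the per-barcode byte count is an illustrative guess, not a measured figure:

```python
def estimate_table_gb(num_barcodes, num_threads, bytes_per_barcode=200):
    # Assumes each thread builds its own copy of the lookup table; the
    # 200 bytes/barcode figure is a placeholder, not a measurement.
    return num_threads * num_barcodes * bytes_per_barcode / 1e9

def maybe_warn(num_threads, num_barcodes):
    """Return a warning string when more than 1 thread is requested, else None."""
    if num_threads <= 1:
        return None
    est = estimate_table_gb(num_barcodes, num_threads)
    return (f"WARNING: {num_threads} threads will each build their own lookup "
            f"table (~{est:.1f} GB total); consider running with 1 thread.")
```

For a full 10x allow-list (~6M barcodes) at 10 threads, even this rough model lands in the tens of gigabytes, which matches the out-of-memory reports.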
(Inspired by a separate issue filed by @sagnikbanerjee15)
Currently we are using a dummy/empty bam header to re-create read objects in each subprocess (serialization is necessary because of how passing objects between processes works).
Because of how pysam parses serialized reads (pysam.AlignedSegment.tostring and pysam.AlignedSegment.fromstring), aligned data are not properly deserialized back into read objects with this empty header; only the read name and sam flags are properly parsed.
This is a showstopper for running on any data that have already been aligned.
Mitigating this problem will require refactoring all tools so that their subprocess worker functions ingest an actual pysam.AlignmentHeader dictionary that can be used to deserialize the reads (the output process functions should already have a required header, so we should double-check that this header is used to deserialize output reads in the output processes).
Here is a snippet of code that will do what we need for a worker process function:
import multiprocessing as mp

import pysam

with pysam.AlignmentFile(input_bam, "rb", check_sq=False, require_index=False) as bam_file:
    # Start worker sub-processes, handing each one the real header as a dict:
    worker_process_pool = []
    for _ in range(num_procs):
        p = mp.Process(
            target=dummy_work_fun,
            args=(process_input_data_queue, results, bam_file.header.to_dict()),
        )
        p.start()
        worker_process_pool.append(p)
...
def dummy_work_fun(in_queue, out_queue, bam_header_dict):
    while True:
        raw_data = in_queue.get()
        # Check for exit sentinel:
        if raw_data is None:
            return
        print("Re-creating read object from string...")
        # Rebuild the header in the subprocess so aligned fields deserialize:
        read = pysam.AlignedSegment.fromstring(
            raw_data, pysam.AlignmentHeader.from_dict(bam_header_dict)
        )
        print("Created.")
        print(f"Read object: {read}")
The correct command will soon be refactored so it works in this manner (when branch jts_correct_symspellpy is merged).
To fix any potential shifts / off-by-one errors, we should add a second pass to the checks to make sure we have the right bases pulled out for these regions.
The hash algorithm being used to generate the new ZMWs in the read names seems to produce collisions. This causes critical issues in downstream processing.
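One possible mitigation, sketched here under assumptions (a SHA-256-derived id truncated to 63 bits for use in the read name; this is not what longbow currently does):

```python
import hashlib

def new_zmw_id(read_name):
    # Hash the full read name and keep 63 bits, so the id is deterministic
    # and pairwise collisions are ~2**-63, i.e. vanishingly unlikely.
    digest = hashlib.sha256(read_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") >> 1
```

Note that if the value must also fit the 32-bit zm:i: BAM tag, no 32-bit hash can avoid birthday collisions at MAS-seq read counts, and a sequential counter is the safer fix there.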
In the latest major update, I added some ssh commands to the docker build; however, these require docker experimental features.
We need to investigate the added ssh commands to see if they are still necessary. Alternatively, we should see whether the required docker experimental features have been rolled into a stable docker release, and update accordingly.
While we rely on the built-in venv commands in python, we should also support conda environments.
Refactoring in the 0.3.0 release caused a serious regression in model performance. Something about how the models are now being constructed is causing the annotate step to produce incorrect results.
Longbow 0.3.0 should not be used - use 0.2.2 instead.
Hi,
We just received MAS-ISO-seq long-read data for a single-cell 3' test, which I am trying to process per the Longbow pipeline. When I try to group the detected UMIs and mark molecules with the same UMI using UMI-tools, the bam file from the longbow output gives an error:
"""Read skipped, missing umi and/or cell tag: 61628044"""
Here are the steps I ran:
1. Annotation, filtering, segmentation, and extraction:
'longbow annotate -m mas15v2 -t 30 hifi_reads.bam | longbow filter | longbow segment | longbow extract -o filter_passed.bam'
2. Primary alignment of the extracted bam against the hg38 genome using minimap2:
'samtools fastq filtered_passed.bam | minimap2 -ayYL --MD -x splice:hq Homo_sapiens.GRCh38.dna.primary_assembly.fa - |
samtools sort > align.bam &&
samtools index align.bam'
3. Baseline transcript annotation via StringTie2, to create new transcriptome annotations specific to our test data:
'stringtie align.bam -Lv -G gencode.v38.annotation2.gtf -o annotations.gtf -A gene_abund.out'
4. Based on the resulting transcript annotations, we created a new transcript reference FASTA file using gffread:
'gffread -w transcriptome.fa -g Homo_sapiens.GRCh38.dna.primary_assembly.fa annotations.gtf'
5. Aligned the extracted reads to the novel transcriptome reference using minimap2 v2.17-r941:
'samtools fastq filtered_passed.bam | minimap2 -ayYL --MD -x splice:hq transcriptome.fa - |
samtools sort > align_2.bam &&
samtools index align_2.bam'
6. Grouped the detected UMIs and marked the molecules with the same UMI using UMI-tools:
'umi_tools group --stdin=align_2.bam --buffer-whole-contig --no-sort-output --per-gene --gene-tag XG --extract-umi-method tag --umi-tag ZU --group-out=out.tsv --output-bam'
Here it throws the "missing umi and/or cell tag" error for the aligned reads.
Could you please tell me why the UMI is missing from the longbow output files?
Thanks
Looks like the 10x adapter is being trimmed off the first segment, but retained in all subsequent segments:
Thing is, I don't actually know which way is correct. If we're going to be feeding these BAM files into the PacBio IsoSeq tools (https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-deduplication.md) at the tag stage, it might be necessary to trim the 10x adapter off of all the reads first.
So, we'll have to figure out what that tool is expecting and make our changes accordingly.
Hello,
I am using longbow to correct the barcodes, but I am constantly running out of memory. I am running the sample on a system that has 247G of memory. I reduced the number of threads to 10 and also brought the number of raw reads down to 10K, but the problem persists.
Here is the command I am running.
longbow correct --force --output-bam merged_segmented_corrected.bam --barcode-uncorrectable-bam merged_segmented_uncorrectable.bam --allow-list /var/lib/cwl/stgfd2864f9-b944-4a63-bf01-85757c06fad1/3M-february-2018.txt --model mas10threeP --threads 10 /var/lib/cwl/stg5e0294c1-7b30-4699-ad51-408d4fb594dc/merged_segmented.bam
I am running version 0.5.29 of longbow, since the mas10threeP model was used for the experiment. Also, I am running longbow correct after the segment step. Please let me know if I need to do anything differently.
Thank you,
Sagnik.
Chunking can be based off the pbi file.
Hi, I'm running multiple samples through longbow at the same time on our compute cluster. When I do this, it seems that longbow filter writes the ._longbow_filter_*.bam files for each separate sample to the same file. Is this an issue, and is there a way to specify the name of these temporary files?
Thanks!