cgatoxford / umi-tools
Tools for handling Unique Molecular Identifiers in NGS data sets
License: MIT License
Unless I'm misunderstanding what is going on in _get_adj_list_directional_adjacency (https://github.com/CGATOxford/UMI-tools/blob/master/umi_tools/dedup.py#L308), the comment and the code do not match. The comment claims that the UMI counts are compared with a greater-than comparison, but the code compares them with greater-than-or-equal. I noticed this because at locations where each UMI is unique, some reads were still getting removed (for example, 1 >= (1*2)-1 is true, but 1 > (1*2)-1 is false). This was unexpected based on my understanding of the blog post about the tool (https://cgatoxford.wordpress.com/2015/08/14/unique-molecular-identifiers-the-problem-the-solution-and-the-proof/) and the command help output.
def _get_adj_list_directional_adjacency(self, umis, counts, threshold):
    ''' identify all umis within the hamming distance threshold
    and where the counts of the first umi is > (2 * second umi counts)-1'''
    return {umi: [umi2 for umi2 in umis if
                  edit_distance(umi.encode('utf-8'),
                                umi2.encode('utf-8')) == 1 and
                  counts[umi] >= (counts[umi2]*2)-1] for umi in umis}
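A minimal, self-contained sketch (hypothetical counts, not the real function) of how the `>=` in the code differs from the `>` described in the comment when every UMI is observed exactly once:

```python
# Hypothetical example: two UMIs one edit apart, each observed once.
counts = {"ATGC": 1, "ATGG": 1}

# Rule described in the comment: strictly greater than (2 * B) - 1
strict = counts["ATGC"] > (counts["ATGG"] * 2) - 1   # 1 > 1  -> False

# Rule actually in the code: greater than or equal
loose = counts["ATGC"] >= (counts["ATGG"] * 2) - 1   # 1 >= 1 -> True

print(strict, loose)  # False True
```

With the `>=`, the two singleton UMIs become connected and one read is removed; with `>`, they stay separate.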
I was opting to get the version that doesn't depend on CGAT, but simply running umi_tools dedup --help produces this error:
Traceback (most recent call last):
  File "/usr/local/bin/umi_tools", line 9, in <module>
    load_entry_point('umi-tools==0.0.1', 'console_scripts', 'umi_tools')()
  File "/usr/local/lib/python2.7/dist-packages/umi_tools-0.0.1-py2.7.egg/umi_tools/umi_tools.py", line 39, in main
    module = imp.load_module(command, file, pathname, description)
  File "/usr/local/lib/python2.7/dist-packages/umi_tools-0.0.1-py2.7.egg/umi_tools/dedup.py", line 182, in <module>
    from umi_tools._dedup_umi import edit_distance
ImportError: No module named _dedup_umi
I get an error when I follow the usage in the documentation for dedup.py.
cat infile.bam | python dedup_umi.py > deduped.bam 2> deduped.log
ValueError: fetch called on bamfile without index
Presumably, this is because pysam uses the filename to obtain the index? If so, is it possible to read from stdin and pipe to dedup_umi.py?
I've removed this usage suggestion from the documentation for the time being.
The latest changes make the code dependent on psutil, which is not part of the standard library. At a minimum, this should be noted in the README; but since it is only used to report memory usage, perhaps we should reconsider whether it is necessary.
What are the requirements for a BAM file for umi_tools to take in? I am encountering an error on a fairly generic BAM: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUwRepliSeq/wgEncodeUwRepliSeqBg02esG1bAlnRep1.bam
dedup -I wgEncodeUwRepliSeqBg02esG1bAlnRep1.bam -S wgEncodeUwRepliSeqBg02esG1bAlnRep1.dedup.bam
  File "/usr/local/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/umi_tools/umi_tools.py", line 42, in main
    module.main(sys.argv)
  File "/usr/local/lib/python2.7/dist-packages/umi_tools/dedup.py", line 924, in main
    options.stats, options.further_stats)
  File "/usr/local/lib/python2.7/dist-packages/umi_tools/dedup.py", line 443, in call
    min(len_umis), max(len(umis))))
TypeError: 'int' object is not iterable
Hi
I've taken the liberty of adding the latest release of your tool to the bioconda repository.
I'm also working on a wrapper to be able to use the tool inside Galaxy
Hope you don't mind!
Cheers
M
Currently, dedup can work on a per-position or per-contig basis. The per-contig method is implemented for single-cell RNA-Seq methods such as CEL-Seq, where the position of the read on a transcript may differ between reads generated from the same initial molecule. However, this requires that the reads be aligned to a single contig, i.e. a representative gene model. Ideally we would like to align to all transcripts, give dedup a file mapping contigs (transcripts) to meta-contigs (genes), and introduce a "per-meta-contig" method.
This is straightforward to implement within UMI-Tools dedup but requires the bam to be sorted by a meta-contig feature which I don't think is possible with standard command line tools. We could of course make a new command umi_tools sort to perform the sort using pysam?
@IanSudbery - thoughts?
Biostars question on custom sam sort tool here
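Setting pysam aside, the core of a meta-contig sort is just swapping the contig in the sort key for the gene it maps to. A hedged sketch with hypothetical transcript and gene names, using (contig, position) pairs to stand in for alignments:

```python
# Hypothetical transcript -> gene mapping (the file dedup would be given)
tx2gene = {"ENST0001": "GENE_A", "ENST0002": "GENE_A", "ENST0003": "GENE_B"}

# (contig, position) pairs standing in for aligned reads
reads = [("ENST0003", 50), ("ENST0001", 200), ("ENST0002", 10)]

# Sort by (gene, position) instead of (transcript, position), so all
# transcripts of a gene become contiguous in the output
meta_sorted = sorted(reads, key=lambda r: (tx2gene[r[0]], r[1]))
print(meta_sorted)
# [('ENST0002', 10), ('ENST0001', 200), ('ENST0003', 50)]
```

A `umi_tools sort` command would apply the same key to real alignments via pysam and write a re-sorted BAM.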
Some UMI-methods add UMIs to both paired ends. We need to enable stripping of both these UMIs from the paired reads and addition of a concatenated UMI to both paired reads.
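As an illustration only (assumed 6 nt UMIs and toy sequences, not the real extract code), the paired stripping and concatenation could look like:

```python
def strip_paired_umis(name, seq1, seq2, umi_len=6):
    """Strip a UMI from the 5' end of each mate and append the
    concatenated UMI to the shared read name."""
    umi = seq1[:umi_len] + seq2[:umi_len]
    new_name = name + "_" + umi
    return new_name, seq1[umi_len:], seq2[umi_len:]

name, r1, r2 = strip_paired_umis("READ1", "AACGTTGGGCCC", "TTTAAACATCAT")
print(name)  # READ1_AACGTTTTTAAA
```

Both mates would get the same concatenated UMI in their names, keeping the pair's names identical for the aligner.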
Not really an issue, more of a question. I am trying to run umi_tools from a python script and, instead of using subprocess, I just imported it:
from umi_tools import extract
extract.main(['-I', '/path/to/input.fastq',
              '-S', '/path/to/output.fastq',
              '-p', 'NNNNNNXXXXXXX',
              '-L', '/path/to/log',
              'read2_in', '/path/to/input_2.fastq',
              'read2_out', '/path/to/output_2.fastq'])
But it only outputs an empty file. Any suggestions on what I am doing wrong? Thank you.
Hi,
I have a short question related to the --multimapping-detection-method option. The help specifies that "Setting this option to NH, X0 or XT will use these tags when selecting the best read amongst reads with the same position and umi".
Does it mean that if several reads map at the same position/strand/UMI, a read with a tag stating it is a secondary alignment will be less preferentially selected?
I am specifically interested in the expression of transposable elements. After this dedup step I will input the resulting BAM file into TEtranscript for subsequent analysis. This algorithm uses the distribution of the multi-mapped reads to estimate the most likely alignment of each repeat, so I must make sure that the dedup from umi-tools is not going to bias my data by arbitrarily eliminating reads because they are secondary alignments.
Thank you,
Camille
Hi,
Sorry to open another issue. The fact that the UMI is removed from the sequence is a problem for downstream analyses. For example, for building a BAM file where reads are filtered out when the UMI contains low-quality bases. Or simply to keep a column with the UMI in the BAM file, which can be more practical than having the UMI in the read name for certain downstream analyses.
Is there no way to avoid the read being truncated?
Thank you,
Camille
Hi,
Is there a way to output the number of reads that support each UMI before and after deduplication? It would be useful for checking the quality of the data.
Thanks!
Hi,
I ran into a couple of problems while using dedup_umi.py to handle a set of single-end scRNA-seq reads (I used the "adjacency" method). I would be very grateful if you could help me with them:
HISEQ1:340:C7K5CACXX:2:1109:1824:92364_GGAATT 16 1 629266 255 3S41M1S * 0 0 AGACTAGTTCGGACCTCCCTGTTCTTATGAATTCGAACAGCATAG JJJJJJJJJJJJHJJJJJJJJJIJJIJJJJJJJJJJJJJJJJHHF NH:i:1 HI:i:1 AS:i:36 nM:i:2
these two reads with the same UMI and same mapping position are both retained in the output...
Could you please help me to figure out these two issues? Thank you very much in advance.
Kaiyang
Hello,
I'm using the latest version of UMI-tools (Master branch) and I noted an unexpected output with the extract.py script.
Here my command line :
python extract.py --bc-pattern=NNNTTGTNN -L extract.log -I input.fastq -S output.fastq
Here the original read (from the input) :
@NS500801:5:H7YMFBGXX:1:11101:12845:1158 1:N:0:0
CAATTGTCGGCCGCCGGTGAAATACCNNNNNNNNNNNNNGGAAGAGCGGTTCAGCAGGAATGGCGAGACCGATCTCGTAT
+
AAAAAF<F<FFFFFF..FFFFFFFFF#############FF<AFFFFFFFFFFF77.F<F7FAFAFF<FAFAF<AAAA.<
Here the output read :
@NS500801:5:H7YMFBGXX:1:11101:12845:1158_CAACG 1:N:0:0
GCCGCCGGTGAAATACCNNNNNNNNNNNNNGGAAGAGCGGTTCAGCAGGAATGGCGAGACCGATCTCGTAT
+
FFFFFF..FFFFFFFFF#############FF<AFFFFFFFFFFF77.F<F7FAFAFF<FAFAF<AAAA.<
We can see that the random bases were extracted from the read and added to the ID, but the remaining bases (which should be TTGT) were not added back to the beginning of the read (i.e. my whole pattern was trimmed from the original read).
Can you explain why this happened and how I can fix it, please?
Best regards,
Alexandra
Hi,
I miss an important point, sorry for that ...
Does one BAM file have all the reads for only one Cell Barcode?
Otherwise, where do you put the cell barcode information in the BAM file?
Thanks a lot,
Iñaki
Currently UMI-tools dedup_umi.py uses an adjacency list with a distance threshold to define connected components. By including a further threshold for relative counts in two UMIs, we can create a directional adjacency list. Each group of connected UMIs will then be considered to have originated from a single parent which has the highest counts.
Proposed solution - Use a threshold of A > 2B - 1 to define the adjacency list. Re-write the required functions to use the directional adjacency list (connected_components, find_best, etc)
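A self-contained sketch of the proposed directional adjacency list, using a plain-Python Hamming distance and hypothetical counts (the real implementation uses the compiled edit_distance and the counts at each position):

```python
def hamming(a, b):
    """Hamming distance between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))

def directional_adj_list(counts):
    """Connect A -> B when one edit apart and count(A) > 2*count(B) - 1."""
    return {u: [v for v in counts
                if u != v and hamming(u, v) == 1
                and counts[u] > 2 * counts[v] - 1]
            for u in counts}

counts = {"ATGC": 10, "ATGG": 2, "TTGG": 1}
print(directional_adj_list(counts))
# {'ATGC': ['ATGG'], 'ATGG': ['TTGG'], 'TTGG': []}
```

The edges are directed: the abundant UMI can "absorb" the rare one but not vice versa, so each connected component has a clear highest-count parent.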
dedup_umi is not very happy if the NH tag is missing from a bam entry:
Traceback (most recent call last):
  File "/ifs/projects/proj028/src/dedup_umi.py", line 413, in <module>
    sys.exit(main(sys.argv))
  File "/ifs/projects/proj028/src/dedup_umi.py", line 390, in main
    soft_clip_threshold=options.soft):
  File "/ifs/projects/proj028/src/dedup_umi.py", line 288, in get_bundles
    if reads_dict[pos][key][umi]["read"].opt("NH") < read.opt("NH"):
  File "pysam/calignmentfile.pyx", line 3571, in pysam.calignmentfile.AlignedSegment.opt (pysam/calignmentfile.c:38787)
  File "pysam/calignmentfile.pyx", line 3286, in pysam.calignmentfile.AlignedSegment.get_tag (pysam/calignmentfile.c:33926)
KeyError: "tag 'NH' not present"
This is an optional BAM tag, and is not added by either bowtie or BWA. On the other hand, selecting which of a set of reads maps to the fewest unique locations is part of the read selection process.
There are three options here. One is to wrap the tag access in try statements and ignore this criterion if the tag is not present.
Views?
I believe the following code snippet from dedup_umi.py shows how a random representative read is selected for a bundle when there is a tie for mapping quality.
read_counts[pos][key][umi] += 1
prob = 1.0/read_counts[pos][key][umi]
if random.random() < prob:
    reads_dict[pos][key][umi]["read"] = read
Yet, if I understand correctly, the read_counts[pos][key][umi] counter is not reset afterwards and continues to accumulate counts when other copies of the same molecule are encountered.
Wouldn't that lead to lower probabilities of replacing the representative read with the current one as the analysis progresses? Am I misunderstanding the logic?
Hi,
I am trying to extract the UMIs from my reads and add them to the read names. The UMIs are on the 3' end, so I used the --3prime option.
The help does not specify anything else to do besides adding this option to the command line. However, the output seems to suggest that some arguments should be added to this option. Which ones are they?
Here is my command:
srun umi_tools extract -I /path/to/file/of/reads1 -S /reads1.UMIextracted.fq --bc-pattern XXXXXXNNNNNNNNNN --read2-in /path to file of reads2 --3prime --read2-out /reads2.UMIextracted.fq -L extractUMI.log
Here is the error message I get:
TypeError: _joiner_3prime() takes exactly 4 arguments (3 given)
Thank you,
Camille
Posted in error. Sorry.
  File "UMI-tools-0.2.5/umi_tools/dedup.py", line 569, in call
    adj_list = self.get_adj_list(umis, counts, threshold)
AttributeError: ClusterAndReducer instance has no attribute 'get_adj_list'
When using positions, the algorithm uses the following code to decide whether to output or not:
if whole_contig:
    do_output = not read.tid == last_chr
else:
    do_output = start > (last_pos+1000) and not read.tid == last_chr
Note that in the second case the condition is that we are 1 kb on AND on a different chr. Surely this should be OR?
This was leading to reads being output in a very strange order. It could also lead to reads from different chrs being combined in the same bundle, and is possibly to blame for some of the high memory usage (it will almost certainly read the whole file into memory this way).
Again, seems like a simple fix, but I feel like we've been here before, and don't want to make the change if I'm missing something.
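A minimal sketch of the proposed fix (hypothetical function and toy arguments, not the real code): flush the buffer once we are more than 1 kb past the last position OR on a new chromosome.

```python
def should_output(start, last_pos, tid, last_chr, whole_contig=False):
    """Decide whether the buffered reads can be output."""
    if whole_contig:
        return tid != last_chr
    # OR, not AND: either condition alone means no earlier read can
    # still receive a mate/duplicate in this bundle
    return start > (last_pos + 1000) or tid != last_chr

print(should_output(5000, 1000, 0, 0))   # True: >1 kb downstream, same chr
print(should_output(100, 1000, 1, 0))    # True: new chromosome
print(should_output(1200, 1000, 0, 0))   # False: still within 1 kb
```

With AND, moving to a new chromosome with a smaller coordinate would never trigger output, which matches the buffering behaviour described above.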
Hi there
Nice work on this tool. Unfortunately, installation on Mac OS X Sierra failed for me due to missing dependencies that were not stated:
$ conda install -c https://conda.anaconda.org/toms umi_tools
Fetching package metadata .........
Solving package specifications: .
PackageNotFoundError: Package not found: '' Dependencies missing in current osx-64 channels:
- umi_tools -> bcftools ==1.3
- umi_tools -> htslib >=1.3
- umi_tools -> pysam ==0.8.4
- umi_tools -> samtools >=1.3
Close matches found; did you mean one of these?
htslib: tblib, html5lib
samtools: apptools, umi_tools
You can search for packages on anaconda.org with
anaconda search -t conda samtools
(and similarly for the other packages)
Perhaps you could list these dependencies in the installation instructions and link to installation instructions for these tools?
Best
Davis
Currently the adjacency method is not available for the group command as I was unable to assign each read to a single group correctly. I'll see to this...
See #54 for the motivation for the group command
I have paired end data (2 fastq files, 77 mil reads) and I have 8 nt UMI bases on the 5' end of the first fq file.
I have a shell script file with this content:
newfilename="${1/_*/}"
cat "$1" | umi_tools extract \
--bc-pattern="NNNNNNNN" \
--read2-in="$2" \
--read2-out="$newfilename-um_2.fq" \
-L "umitools-extract.log" > "$newfilename-um_1.fq"
where $1 is the first fq file (fwd) and $2 is the second fq file (rev), and it is run as script.sh file_1.fq file_2.fq. I get the below error after the script runs for a couple of minutes.
sbatch-umitools-extract.sh: line 42: 32582 Broken pipe cat "$1"
32583 Killed | umi_tools extract --bc-pattern="NNNNNNNN" --read2-in="$2" --read2-out="$newfilename-um_2.fq" -L "umitools-extract.log" > "$newfilename-um_1.fq"
Strangely, it works fine on a small test dataset (paired-end fastq) of 40 lines each. The output fq files show the UMIs have been correctly removed and added to the read name. But it doesn't work for my big dataset. I wonder if it is just a script syntax error or something else.
There is a section of the code that tries to deduce the position of the read, taking into account any soft_clipping. It looks like this:
if read.is_reverse:
    pos = read.aend
    if read.cigar[-1][0] == 4:
        pos = pos + read.cigar[-1][1]
    start = read.pos
    if ('N' in read.cigarstring or
        (read.cigar[0][0] == 4 and
         read.cigar[0][1] > soft_clip_threshold)):
        is_spliced = True
else:
    pos = read.pos
    if read.cigar[0][0] == 4:
        pos = pos - read.cigar[0][1]
    start = pos
    if ('N' in read.cigarstring or
        (read.cigar[-1][0] == 4 and
         read.cigar[-1][1] > soft_clip_threshold)):
        is_spliced = True
Note that if the read is reversed, it takes aend, finds how many bases are added by soft-clipping... and then forgets all that and uses read.pos for start. Whereas if the read is not reversed, it does use the clipping.
Any ideas why this might be? It goes right back to the initial commit. I'm sure there is a reason for it, but I can't for the life of me figure out what it might be.
Hi all,
I have a set of UMI-tagged paired-end target sequencing data, and the UMIs attached to the read1 and read2 are different. I am wondering how UMI-tools handle this kind of data during deduplication.
I tried to use bwa mem for alignment, but got the error "paired reads have different names", as the UMIs of read1 and read2 are different and they were attached to the read names. Do you have suggestions for handling this?
Here is an example of the reads after the UMIs were attached to the names:
Read1:
@NS500683:84:HGMGMAFXX:1:11101:19517:1060_TGGACC 1:N:0:ATCACGAC+NAGGTTCA
CTGCCCCTGTTCCCTCCAACTCATCTTTTCATTTTTGTTGTAACTTGACTTTTCTGCTTTCTTTGTAACCTGCCCTCCCCACAGCGAATTGCGACGATCGTTGCATTAACTCGCGAATCACGAC
+
EEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEE<EEEEEEEEEEEEEEEEEEEEEEE6EEEEEE/EEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEE<EEEEAEEEEEAA
Read2:
@NS500683:84:HGMGMAFXX:1:11101:19517:1060_TGTGGG 2:N:0:ATCACGAC+NAGGTTCA
GAGGGCAGGTTACAAAGAAAGCAGAAAAGTCAAGTTACAACAAAAATGAAAAGATGAGTTGGAGGGAACAGGGGCAGGGTCCA
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEAEEEE
Thank you very much for your help in advance!
Kaiyang
For sRNA-Seq, it's possible for two reads to have the same start position but different read lengths and therefore to have originated from different molecules (assuming no quality trimming has been performed). dedup_umi.py should have an option to consider read length along with the start position.
I propose adding the read length to the reads_dict key (defaulting to no value when read length not required).
Any thoughts before I go ahead?
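A minimal sketch of the proposal (hypothetical key layout, not the actual reads_dict structure): including the read length in the bundle key keeps same-start reads of different lengths in separate bundles, and defaults to None so existing behaviour is unchanged.

```python
def bundle_key(pos, strand, umi, read_length=None):
    """Key for grouping reads into dedup bundles.

    read_length stays None when length-aware dedup is not requested,
    so all lengths at a position collapse into one bundle as before."""
    return (pos, strand, read_length), umi

print(bundle_key(1000, "+", "ATGC", 21))
print(bundle_key(1000, "+", "ATGC", 22))  # lands in a different bundle
```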
Hi,
I was trying to install umi-tools with "pip install umi_tools", but it failed with the error "IOError: [Errno 2] No such file or directory: 'requirements.txt'", do you have an idea how to fix it?
Thanks!
--Kaiyang
I started with a 12 GB BAM (77 million reads) using 3 cores and 24 GB RAM. It did not complete in 2.5 hours or even close; the output BAM after 2.5 hours was 2.3 MB. Then I extracted one chromosome to create a 314 MB BAM. Dedup ran on this small BAM for 1 hour and was still incomplete; a truncated BAM file of 12.4 MB was exported. I ran qualimap bamqc on both files (the one-chromosome BAM before and after dedup).
| file     | before dedup | after dedup |
|----------|--------------|-------------|
| reads    | 5,104,918    | 197,280     |
| dup rate | 38.74%       | 2.27%       |
So, I suppose the deduplication must be working.
Extrapolating from this run time, my 12 GB BAM would take a lot more than 39 hours. Would this be an expected run time? I was wondering if I was doing something wrong or whether some settings/switches are not optimal.
This is my script:
umi_tools dedup \
--method="directional-adjacency" \
--paired \
--output-stats="file-dedup.stats" \
-v 3 \
-E "file-dedup.error" \
-L "file-dedup.log" \
-I "file.bam" \
-S "file-dedup.bam"
Make dedup use more than one CPU.
First attempt at an implementation see: branch IS_parallel.
Need large data file to benchmark.
cc: @TomSmithCGAT
The output from the group and dedup commands is inconsistent between py2 and py3. I thought this was due to the python dictionary hash in py3 - quoting from the documentation '[hash] values of str, bytes and datetime are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python . Hash randomization is intended to provide protection against a denial-of-service'. However, setting the environmental variable PYTHONHASHSEED=0 for both py2 and py3 so they both use the same hash for the dictionaries doesn't ensure consistency between them.
I'm not sure what the difference is between py2 and py3 which is causing this?
See the Py3.5-testing branch for code which is fully Py3.5 compatible.
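One mitigation worth noting (a sketch with hypothetical counts, independent of the hash-seed question): iterating dictionaries in sorted key order removes any dependence on hash order, making output reproducible across interpreter versions.

```python
counts = {"ATGC": 3, "TTGG": 1, "ATGG": 2}

# sorted() fixes the iteration order regardless of the dict's hash order,
# so py2 and py3 emit identical output
for umi in sorted(counts):
    print(umi, counts[umi])
```

Ties elsewhere (e.g. UMIs with equal counts) would still need an explicit deterministic tie-break for full py2/py3 consistency.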
Hi Ian and Tom,
I very much like the theory behind this tool and am trying to use it in my research. I realise I may be conflating two specific issues... if so, I am sorry, but here goes anyway!
We have performed paired-end strand-specific RNA-seq with a kit utilizing UMIs from BIOO scientific. As they state:
"The NEXTflex Rapid Directional qRNA-Seq Kit contains a set of 96 distinct molecular labels on the sequencing adapters. Each label consists of an 8 nucleotide barcode tag. During the ligation reaction, each cDNA fragment independently and randomly ligates to a single label from this pool of 96 adapters to result in a total of 96 x 96 = 9,216 possible combinations across both ends."
I have included the cartoon of the cloning strategy from their website below.
I have been using umi-tools and the 'dedup' step takes a long time ( > 30hrs for 42 million reads). I have followed suggestions from earlier posts and do not output stats.
Reads: ~ 30-40 million, paired-end 75 nt reads
UMI: 8 nt long on both pairs
Reference: human genome and some scaffolds and ERCC sequences
Alignment: star, limiting multi-mappers
version: umi-tools-0.2.3
umi method used: "directional-adjacency"
cpu: one node of compute cluster, 36G mem
I created two bam files using a subset of reads that map to either A) chr19 (104 M bam file) or B) the 92 ERCC RNA spike-ins (34 M bam file).
For chr19, the dedup process took 74 seconds.
2,428,230 before dedup
1,383,172 after dedup
This is a non-trivial level of replication! Ideally I would like to test different umi dedup methods using ERCC counts and see what correlates the best with known ERCC concentrations.
For ERCC, the dedup process had taken > 1 hr when I killed it!
894,572 before dedup
??? after dedup
The reason I tested this is when I cranked up the verbosity in the log I noticed that the dedup process was really getting bogged down once it got to scaffolds (~160 in my assembly) and ERCCs (92 spike-ins).
Would an increase in the number of "chromosomes"/contigs result in such a drop in performance? If so, what would you recommend? I would rather not remove the scaffolds. The ERCC spike-ins are important to help assess the "truth" by comparing how the different approaches (directional/adjacency/cluster/percentile/unique) correlate with the known concentrations, particularly relative to the original alignments containing duplicates.
Do you think it is useful (accuracy and/or speed) for your algorithm to know the identity of the UMI a priori? Particularly considering the following properties:
I am stuck at this step and need to perform the deduplication before moving forward with downstream analysis. Any help would be greatly appreciated. I can provide whatever files you might find useful.
Kind regards,
Neel
Currently Conda installation works well, but installation via pip or downloading the tarball works much less well.
The current problems are:
In order to fix:
In the help content for umi_tools extract, the explanation of the pattern is a bit confusing. If Ns are random bases and Xs are UMIs, then should the Xs be extracted into the name and the Ns reattached to the read? If so, in the example below, GG should be added to the read name.
--bc-pattern
Use this option to specify the format of the UMI/barcode. Use Ns to
represent the random positions and Xs to indicate the bc positions.
Bases with Ns will be extracted and added to the read name. Remaining
bases, marked with an X will be reattached to the read.
E.g. If the pattern is NNXXNN,
Then the read:
@HISEQ:87:00000000 read1
AAGGTTGCTGATTGGATGGGCTAG
+
DA1AEBFGGCG01DFH00B1FF0B
will become:
@HISEQ:87:00000000_AATT read1
GGGCTGATTGGATGGGCTAG
+
1AFGGCG01DFH00B1FF0B
Dear developers.
I'm testing your awesome toolkit, but I don't know what to do with the following problem:
Used parameters
python dedup.py \
-I xxx_sorted.bam \
-S xxx_dedup_chr19.bam \
--paired \
--edit-distance-threshold 1 \
--output-stats xxx_dedup_chr19 \
--chrom chr19 \
-v 5
Traceback
...
2016-10-24 13:54:43,213 DEBUG Outputted 780000
2016-10-24 13:55:03,757 DEBUG Dumping 857871 mates for contig chr19
2016-10-24 13:57:13,608 DEBUG 3997 mates remaining
2016-10-24 13:57:13,608 INFO Searching for mates for 3997 unmatched alignments
Traceback (most recent call last):
  File "./UMI-tools-0.2.3-edited/umi_tools/dedup.py", line 1145, in <module>
    sys.exit(main(sys.argv))
  File "./UMI-tools-0.2.3-edited/umi_tools/dedup.py", line 1050, in main
    outfile.close()
  File "./UMI-tools-0.2.3-edited/umi_tools/dedup.py", line 309, in close
    for read in self.outfile.fetch(start=pos, end=pos+1, tid=chrom):
  File "pysam/calignmentfile.pyx", line 874, in pysam.calignmentfile.AlignmentFile.fetch (pysam/calignmentfile.c:10986)
ValueError: fetch called on bamfile without index
Files in directory
Software versions
UMI-tools v0.2.3
pysam 0.8.4
Tried: adding the until_eof=True flag to the self.outfile.fetch() call (to allow fetching without an index), but then I get an error that the BAM file is truncated.
In this tool, you are assuming that the UMI will be found in the ID, but there can be cases where the UMI is in other places. In our case, the UMI is located in a tag. Specifically in the RX tag (but others use the BC/BX).
Also, when assigning the UMI, the tags are competing (you use FU, Picard uses MI).
I was wondering: how do you merge the duplicate reads, or do you simply discard them? I would be interested in looking at the full sequences and possibly deriving a consensus to correct for PCR/sequencing errors.
@IanSudbery Just had a chat with Nils Koelling here at the WIMM. He's given some very helpful feedback on documentation which I'll feed into the master branch shortly.
He's using UMIs in a variant calling framework and therefore wants to group reads by their UMI and then perform a downstream step to identify the correct base call for each position in the set of reads to correctly resolve sequencing errors. We spoke about a couple of ways to achieve this:
Theoretically, these could be achieved within the dedup command but this is getting away from a pure deduplication problem as such. I think the best approach might be to provide a new umi_tools command "group" which uses the same network methods to identify the set of reads with the same UMI or related UMIs and then outputs either a tagged BAM or csv. We'd need to abstract some of the dedup.py code to a module file and reuse it in group.py.
This might also be a good way to address this issue regarding counting reads supporting each UMI: #45
What do you think?
Traceback (most recent call last):
  File "/home/chovanec/anaconda3/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/home/chovanec/anaconda3/lib/python3.5/site-packages/umi_tools/umi_tools.py", line 42, in main
    module.main(sys.argv)
  File "/home/chovanec/anaconda3/lib/python3.5/site-packages/umi_tools/dedup.py", line 972, in main
    read_gn = random_read_generator(infile.filename, chrom=options.chrom)
  File "/home/chovanec/anaconda3/lib/python3.5/site-packages/umi_tools/dedup.py", line 763, in __init__
    self.fill()
  File "/home/chovanec/anaconda3/lib/python3.5/site-packages/umi_tools/dedup.py", line 782, in fill
    for observed_umi, freq in self.umis_counter.iteritems():
AttributeError: 'Counter' object has no attribute 'iteritems'
Seems like it should be something more along these lines?
# Python 2 and 3: option 3
from future.utils import iteritems
# or
from six import iteritems
for (key, value) in iteritems(heights):
...
It turns out that future, as used by dedup.py, is not installed by default and should be listed as a dependency.
Currently UMI-tools dedup_umi.py only outputs the first pair for paired end BAMs although it uses the template to dedup paired end input.
Proposed solution: Use pysam.AlignmentFile.mate() to return the mate pair.
Hi, I was trying out the dedup_umi script (using adjacency as the method) for a set of bam alignment files. The following message was displayed and I am not sure if this has something to do with the input or something else.
Traceback (most recent call last):
  File "/data/home/nsarode3/git_repos/UMI-tools/dedup_umi.py", line 1046, in <module>
    sys.exit(main(sys.argv))
  File "/data/home/nsarode3/git_repos/UMI-tools/dedup_umi.py", line 858, in main
    read_gn = random_read_generator(infile.filename, chrom=options.chrom)
  File "/data/home/nsarode3/git_repos/UMI-tools/dedup_umi.py", line 665, in __init__
    self.fill()
  File "/data/home/nsarode3/git_repos/UMI-tools/dedup_umi.py", line 679, in fill
    self.observed_umis, freq = zip(*self.umis.iteritems())
ValueError: need more than 0 values to unpack
Thanks,
neha
I have designed a long UMI barcode (20×N). Due to primer synthesis errors and sequencing errors, the barcode may be 19 bp or 21 bp in length. Does UMI-tools take this situation into account?
The inclusion of calls to the random_read_generator function increases memory usage (around 0.8G for a 2.7G infile). Around a third of this is the self.umi list object which stores all the umis observed. Currently, the list is grown as required for random sampling to save some memory initially.
Proposed solution: Read through whole bam once and stores umis detected in a dictionary. Convert keys and values to numpy arrays and use numpy.random.choice to randomly select umi using values array as weights.
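The proposal names numpy.random.choice; the same pattern is sketched here with the stdlib equivalent random.choices, using hypothetical counts, to show the memory saving: we store one entry per distinct UMI plus a weight, rather than one list element per observation.

```python
import random

# Hypothetical UMI counts gathered in a single pass over the BAM
umi_counts = {"ATGC": 70, "ATGG": 20, "TTGG": 10}

umis = list(umi_counts.keys())
weights = list(umi_counts.values())

# Weighted sampling without materialising one list entry per read;
# numpy.random.choice(umis, size=5, p=probs) is the equivalent call.
random.seed(0)
sample = random.choices(umis, weights=weights, k=5)
print(sample)
```

For 100 distinct UMIs this stores 100 keys and 100 weights, regardless of whether the BAM contains a thousand reads or a hundred million.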
We need some tests. Automated travis testing would be brilliant. I'll have a word with Sebastian about this.
Hello,
My UMIs were added to read names using bcl2fastq and aligned with STAR. Therefore, each read has the UMI added to the end of the read name by ":NNNNNNNNNN" instead of "_NNNNNNNNNN". Is there a way to use umi-tools with this configuration?
Thank you!
Best,
Matt
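The ':' separator described above could be rewritten to '_' before running dedup. A hedged sketch (hypothetical read name, assumed 10 nt UMI as the last ':'-delimited field); the real conversion would be applied to every name in the FASTQ or BAM:

```python
def fix_umi_separator(name, umi_len=10):
    """Convert a bcl2fastq-style ':UMI' suffix to the '_UMI' suffix
    that umi_tools expects in read names."""
    head, sep, umi = name.rpartition(":")
    if sep and len(umi) == umi_len and set(umi) <= set("ACGTN"):
        return head + "_" + umi
    return name  # leave names without a trailing UMI untouched

print(fix_umi_separator("NS500:1:FLOW:1:1:100:200:ACGTACGTAC"))
# NS500:1:FLOW:1:1:100:200_ACGTACGTAC
```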
I get the following syntax error when testing out the script. Is this a Python version issue or something else? (I'm using version 2.6.6.)
python dedup_umi.py --help
File "dedup_umi.py", line 296
for umi in umis}
^
SyntaxError: invalid syntax
Thanks,
Neha
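For reference, the failing line is the tail of a dict comprehension, and dict comprehensions were only introduced in Python 2.7, so 2.6.6 raises exactly this SyntaxError. A sketch of the 2.6-compatible spelling (toy data, not the real function):

```python
umis = ["ATGC", "ATGG"]

# Python >= 2.7: {umi: [...] for umi in umis}
# Python 2.6 equivalent: dict() over a generator expression
adj = dict((umi, [u for u in umis if u != umi]) for umi in umis)
print(adj)  # {'ATGC': ['ATGG'], 'ATGG': ['ATGC']}
```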
In a test run with real data, I've got things like:
M01660:68:000000000-AMT35:1:1107:14665:23414_TTTTATGTGGTA chr1 11291633 TTTTATGTGGTA 3 TTTTATGTGGTA 4 231
M01660:68:000000000-AMT35:1:1111:17174:11107_TTTTATGTGGTA chr1 11291633 TTTTATGTGGTA 3 TTTTATGTGGTA 4 231
M01660:68:000000000-AMT35:1:1119:12490:6844_TTTTATGTGGTA chr1 11291633 TTTTATGTGGTA 3 TTTTATGTGGTA 4 231
M01660:68:000000000-AMT35:1:2109:18893:17840_TTTTATTTGGTA chr1 11291633 TTTTATTTGGTA 3 TTTTATGTGGTA 4 231
If I'm not misinterpreting the data (I'm new to this tool), the last line should say that there is 1 UMI that has been joined to group 231 of 4 UMIs in total. But instead of a "1", I see a "3". That's incorrect, right?
Currently, the cluster_and_reduce function calls functions to create an adjacency list, identify connected components, reduce clusters down to a parent UMI, and return the reads to be written out. Actually, there are now two cluster_and_reduce functions, as a separate function was written to allow stats to be collected during the process.
Ultimately, it would be useful if a choice of umi deduping methods were available so we can compare previously published methods (e.g drop UMIs below 1/10th average counts at position) with our own method. This is difficult in the current framework.
Proposed solution: replace the cluster_and_reduce function with a ClusterAndReduce functor. Upon initialization of the functor, the methods for generating the adjacency list, identifying components and reducing clusters will be defined depending on the method supplied. Calls to the ClusterAndReduce functor will generate calls to its methods and the reads to be written out will be returned
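A minimal sketch of the functor idea (illustrative method table and a toy adjacency method, not the real implementation): method resolution happens once at initialisation, so each per-position call just dispatches to the pre-selected functions.

```python
class ClusterAndReducer:
    """Functor whose methods are chosen once, at construction time."""

    def __init__(self, method):
        # Hypothetical method table; the real functor would also select
        # connected-components and cluster-reduction functions here
        self.get_adj_list = {"adjacency": self._adj,
                             "cluster": self._adj}[method]

    def _adj(self, umis, counts, threshold):
        # Toy adjacency: connect UMIs within the Hamming threshold
        return {u: [v for v in umis if u != v and
                    sum(a != b for a, b in zip(u, v)) <= threshold]
                for u in umis}

    def __call__(self, umis, counts, threshold=1):
        return self.get_adj_list(umis, counts, threshold)

cr = ClusterAndReducer("adjacency")
print(cr(["ATGC", "ATGG"], {"ATGC": 5, "ATGG": 1}))
# {'ATGC': ['ATGG'], 'ATGG': ['ATGC']}
```

Comparing published methods then only requires registering them in the table, with no changes to the calling code.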