cgatoxford / umi-tools
Tools for handling Unique Molecular Identifiers in NGS data sets
License: MIT License
Unless I'm misunderstanding what is going on in _get_adj_list_directional_adjacency (https://github.com/CGATOxford/UMI-tools/blob/master/umi_tools/dedup.py#L308), the comment and the code do not match. The comment claims that the UMI counts are compared with a greater-than comparison, but the code compares them with greater-than-or-equal. I noticed this because at locations where each UMI is unique, some reads were still getting removed (for example, 1 >= (1*2)-1 is true, but 1 > (1*2)-1 is false). This was unexpected based on my understanding of the blog post about the tool (https://cgatoxford.wordpress.com/2015/08/14/unique-molecular-identifiers-the-problem-the-solution-and-the-proof/) and the command help output.
def _get_adj_list_directional_adjacency(self, umis, counts, threshold):
    ''' identify all umis within the hamming distance threshold
    and where the counts of the first umi is > (2 * second umi counts)-1'''
    return {umi: [umi2 for umi2 in umis if
                  edit_distance(umi.encode('utf-8'),
                                umi2.encode('utf-8')) == 1 and
                  counts[umi] >= (counts[umi2]*2)-1] for umi in umis}
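A minimal, self-contained sketch (hypothetical counts, not the real function) of how the `>=` in the code differs from the `>` described in the comment when every UMI is observed exactly once:

```python
# Hypothetical example: two UMIs one edit apart, each observed once.
counts = {"ATGC": 1, "ATGG": 1}

# Rule described in the comment: strictly greater than (2 * B) - 1
strict = counts["ATGC"] > (counts["ATGG"] * 2) - 1   # 1 > 1  -> False

# Rule actually in the code: greater than or equal
loose = counts["ATGC"] >= (counts["ATGG"] * 2) - 1   # 1 >= 1 -> True

print(strict, loose)  # False True
```

With the `>=`, the two singleton UMIs become connected and one read is removed; with `>`, they stay separate.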
I was opting to get the version that doesn't depend on CGAT, but simply running umi_tools dedup --help produces this error:
Traceback (most recent call last):
  File "/usr/local/bin/umi_tools", line 9, in <module>
    load_entry_point('umi-tools==0.0.1', 'console_scripts', 'umi_tools')()
  File "/usr/local/lib/python2.7/dist-packages/umi_tools-0.0.1-py2.7.egg/umi_tools/umi_tools.py", line 39, in main
    module = imp.load_module(command, file, pathname, description)
  File "/usr/local/lib/python2.7/dist-packages/umi_tools-0.0.1-py2.7.egg/umi_tools/dedup.py", line 182, in <module>
    from umi_tools._dedup_umi import edit_distance
ImportError: No module named _dedup_umi
I get an error when I follow the usage in the documentation for dedup.py.
cat infile.bam | python dedup_umi.py > deduped.bam 2> deduped.log
ValueError: fetch called on bamfile without index
Presumably, this is because pysam uses the filename to obtain the index? If so, is it possible to read from stdin and pipe to dedup_umi.py?
I've removed this usage suggestion from the documentation for the time being.
The latest changes make the code dependent on psutil, which is not part of the standard library. At a minimum, this should be noted in the README; but since it is only used to report memory usage, perhaps we should reconsider whether it is necessary.
What are the requirements for a BAM file for umi_tools to take in? I am encountering an error on a fairly generic BAM: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeUwRepliSeq/wgEncodeUwRepliSeqBg02esG1bAlnRep1.bam
dedup -I wgEncodeUwRepliSeqBg02esG1bAlnRep1.bam -S wgEncodeUwRepliSeqBg02esG1bAlnRep1.dedup.bam
  File "/usr/local/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/umi_tools/umi_tools.py", line 42, in main
    module.main(sys.argv)
  File "/usr/local/lib/python2.7/dist-packages/umi_tools/dedup.py", line 924, in main
    options.stats, options.further_stats)
  File "/usr/local/lib/python2.7/dist-packages/umi_tools/dedup.py", line 443, in call
    min(len_umis), max(len(umis))))
TypeError: 'int' object is not iterable
Hi
I've taken the liberty of adding the latest release of your tool to the bioconda repository.
I'm also working on a wrapper to be able to use the tool inside Galaxy
Hope you don't mind!
Cheers
M
Currently, dedup can work on a per-position or per-contig basis. The per-contig method is implemented for single-cell RNA-Seq methods such as CEL-Seq, where the position of the read on a transcript may differ between reads generated from the same initial molecule. However, this requires that the reads be aligned to a single contig, i.e. a representative gene model. Ideally we would like to align to all transcripts, give dedup a file mapping contigs (transcripts) to meta-contigs (genes), and introduce a "per-meta-contig" method.
This is straightforward to implement within UMI-Tools dedup but requires the bam to be sorted by a meta-contig feature which I don't think is possible with standard command line tools. We could of course make a new command umi_tools sort to perform the sort using pysam?
@IanSudbery - thoughts?
Biostars question on custom sam sort tool here
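Setting pysam aside, the core of a meta-contig sort is just swapping the contig in the sort key for the gene it maps to. A hedged sketch with hypothetical transcript and gene names, using (contig, position) pairs to stand in for alignments:

```python
# Hypothetical transcript -> gene mapping (the file dedup would be given)
tx2gene = {"ENST0001": "GENE_A", "ENST0002": "GENE_A", "ENST0003": "GENE_B"}

# (contig, position) pairs standing in for aligned reads
reads = [("ENST0003", 50), ("ENST0001", 200), ("ENST0002", 10)]

# Sort by (gene, position) instead of (transcript, position), so all
# transcripts of a gene become contiguous in the output
meta_sorted = sorted(reads, key=lambda r: (tx2gene[r[0]], r[1]))
print(meta_sorted)
# [('ENST0002', 10), ('ENST0001', 200), ('ENST0003', 50)]
```

A `umi_tools sort` command would apply the same key to real alignments via pysam and write a re-sorted BAM.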
Some UMI-methods add UMIs to both paired ends. We need to enable stripping of both these UMIs from the paired reads and addition of a concatenated UMI to both paired reads.
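As an illustration only (assumed 6 nt UMIs and toy sequences, not the real extract code), the paired stripping and concatenation could look like:

```python
def strip_paired_umis(name, seq1, seq2, umi_len=6):
    """Strip a UMI from the 5' end of each mate and append the
    concatenated UMI to the shared read name."""
    umi = seq1[:umi_len] + seq2[:umi_len]
    new_name = name + "_" + umi
    return new_name, seq1[umi_len:], seq2[umi_len:]

name, r1, r2 = strip_paired_umis("READ1", "AACGTTGGGCCC", "TTTAAACATCAT")
print(name)  # READ1_AACGTTTTTAAA
```

Both mates would get the same concatenated UMI in their names, keeping the pair's names identical for the aligner.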
Not really an issue, more of a question. I am trying to run umi_tools from a python script and, instead of using subprocess, I just imported it:
from umi_tools import extract
extract.main(['-I', '/path/to/input.fastq',
              '-S', '/path/to/output.fastq',
              '-p', 'NNNNNNXXXXXXX',
              '-L', '/path/to/log',
              'read2_in', '/path/to/input_2.fastq',
              'read2_out', '/path/to/output_2.fastq'])
But it only outputs an empty file. Any suggestions on what I am doing wrong? Thank you.
Hi,
I have a short question related to the --multimapping-detection-method option. The help specifies that "Setting this option to NH, X0 or XT will use these tags when selecting the best read amongst reads with the same position and umi".
Does it mean that if several reads map at the same position/strand/UMI, a read with a tag stating it is a secondary alignment will be less preferentially selected?
I am specifically interested in the expression of transposable elements. After this dedup step I will input the resulting BAM file into TEtranscript for subsequent analysis. This algorithm uses the distribution of the multi-mapped reads to estimate the most likely alignment of each repeat, so I must make sure that the dedup from umi-tools is not going to bias my data by arbitrarily eliminating reads because they are secondary alignments.
Thank you,
Camille
Hi,
Sorry to open another issue. The fact that the UMI is removed from the sequence is a problem for downstream analyses. For example, for building a BAM file where reads are filtered out when the UMI contains low-quality bases. Or simply to keep a column with the UMI in the BAM file, which can be more practical than having the UMI in the read name for certain downstream analyses.
Is there no way to avoid the read being truncated?
Thank you,
Camille
Hi,
Is there a way to output the number of reads that support each UMI before and after deduplication? It would be useful for checking the quality of the data.
Thanks!
Hi,
I ran into a couple of problems while using dedup_umi.py to handle a set of single-end scRNA-seq reads (I used the "adjacency" method). I would be very grateful if you could help me with them:
HISEQ1:340:C7K5CACXX:2:1109:1824:92364_GGAATT 16 1 629266 255 3S41M1S * 0 0 AGACTAGTTCGGACCTCCCTGTTCTTATGAATTCGAACAGCATAG JJJJJJJJJJJJHJJJJJJJJJIJJIJJJJJJJJJJJJJJJJHHF NH:i:1 HI:i:1 AS:i:36 nM:i:2
these two reads with the same UMI and same mapping position are both retained in the output...
Could you please help me to figure out these two issues? Thank you very much in advance.
Kaiyang
Hello,
I'm using the latest version of UMI-tools (Master branch) and I noted an unexpected output with the extract.py script.
Here my command line :
python extract.py --bc-pattern=NNNTTGTNN -L extract.log -I input.fastq -S output.fastq
Here the original read (from the input) :
@NS500801:5:H7YMFBGXX:1:11101:12845:1158 1:N:0:0
CAATTGTCGGCCGCCGGTGAAATACCNNNNNNNNNNNNNGGAAGAGCGGTTCAGCAGGAATGGCGAGACCGATCTCGTAT
+
AAAAAF<F<FFFFFF..FFFFFFFFF#############FF<AFFFFFFFFFFF77.F<F7FAFAFF<FAFAF<AAAA.<
Here the output read :
@NS500801:5:H7YMFBGXX:1:11101:12845:1158_CAACG 1:N:0:0
GCCGCCGGTGAAATACCNNNNNNNNNNNNNGGAAGAGCGGTTCAGCAGGAATGGCGAGACCGATCTCGTAT
+
FFFFFF..FFFFFFFFF#############FF<AFFFFFFFFFFF77.F<F7FAFAFF<FAFAF<AAAA.<
We can see that the random bases were extracted from the read and added to the ID, but the remaining bases (which should be TTGT) were not added back to the beginning of the read (i.e. my whole pattern was trimmed from the original read).
Can you explain why this happened and how I can fix it, please?
Best regards,
Alexandra
Hi,
I miss an important point, sorry for that ...
Does one BAM file have all the reads for only one Cell Barcode?
Otherwise, where do you put the cell barcode information in the BAM file?
Thanks a lot,
Iñaki
Currently UMI-tools dedup_umi.py uses an adjacency list with a distance threshold to define connected components. By including a further threshold for relative counts in two UMIs, we can create a directional adjacency list. Each group of connected UMIs will then be considered to have originated from a single parent which has the highest counts.
Proposed solution - Use a threshold of A > 2B - 1 to define the adjacency list. Re-write the required functions to use the directional adjacency list (connected_components, find_best, etc)
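A self-contained sketch of the proposed directional adjacency list, using a plain-Python Hamming distance and hypothetical counts (the real implementation uses the compiled edit_distance and the counts at each position):

```python
def hamming(a, b):
    """Hamming distance between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))

def directional_adj_list(counts):
    """Connect A -> B when one edit apart and count(A) > 2*count(B) - 1."""
    return {u: [v for v in counts
                if u != v and hamming(u, v) == 1
                and counts[u] > 2 * counts[v] - 1]
            for u in counts}

counts = {"ATGC": 10, "ATGG": 2, "TTGG": 1}
print(directional_adj_list(counts))
# {'ATGC': ['ATGG'], 'ATGG': ['TTGG'], 'TTGG': []}
```

The edges are directed: the abundant UMI can "absorb" the rare one but not vice versa, so each connected component has a clear highest-count parent.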
dedup_umi is not very happy if the NH tag is missing from a bam entry:
Traceback (most recent call last):
  File "/ifs/projects/proj028/src/dedup_umi.py", line 413, in <module>
    sys.exit(main(sys.argv))
  File "/ifs/projects/proj028/src/dedup_umi.py", line 390, in main
    soft_clip_threshold=options.soft):
  File "/ifs/projects/proj028/src/dedup_umi.py", line 288, in get_bundles
    if reads_dict[pos][key][umi]["read"].opt("NH") < read.opt("NH"):
  File "pysam/calignmentfile.pyx", line 3571, in pysam.calignmentfile.AlignedSegment.opt (pysam/calignmentfile.c:38787)
  File "pysam/calignmentfile.pyx", line 3286, in pysam.calignmentfile.AlignedSegment.get_tag (pysam/calignmentfile.c:33926)
KeyError: "tag 'NH' not present"
This is an optional BAM tag, and is not added by either bowtie or BWA. On the other hand, selecting which of a set of reads maps to the fewest unique locations is part of the read selection process.
There are three options here. One is to wrap the tag access in try statements and ignore this criterion if the tag is not present.
Views?
I believe the following code snippet from dedup_umi.py shows how a random representative read is selected for a bundle when there is a tie for mapping quality.
read_counts[pos][key][umi] += 1
prob = 1.0/read_counts[pos][key][umi]
if random.random() < prob:
    reads_dict[pos][key][umi]["read"] = read
Yet, if I understand correctly, the read_counts[pos][key][umi] counter is not reset afterwards and continues to accumulate counts when other copies of the same molecule are encountered.
Wouldn't that lead to lower probabilities of replacing the representative read with the current one as the analysis progresses? Am I misunderstanding the logic?
Hi,
I am trying to extract the UMIs from my reads and add them to the read names. The UMIs are on the 3' end, so I used the --3prime option.
The help does not specify anything else to do besides adding this option to the command line. However, the output seems to suggest that some arguments should be added to this option. Which ones are they?
Here is my command:
srun umi_tools extract -I /path/to/file/of/reads1 -S /reads1.UMIextracted.fq --bc-pattern XXXXXXNNNNNNNNNN --read2-in /path to file of reads2 --3prime --read2-out /reads2.UMIextracted.fq -L extractUMI.log
Here is the error message I get:
TypeError: _joiner_3prime() takes exactly 4 arguments (3 given)
Thank you,
Camille
Posted in error. Sorry.
  File "UMI-tools-0.2.5/umi_tools/dedup.py", line 569, in call
    adj_list = self.get_adj_list(umis, counts, threshold)
AttributeError: ClusterAndReducer instance has no attribute 'get_adj_list'
When using positions, the algorithm uses the following code to decide whether to output or not:
if whole_contig:
    do_output = not read.tid == last_chr
else:
    do_output = start > (last_pos+1000) and not read.tid == last_chr
Note that in the second case the condition is that we are 1 kb on AND on a different chr. Surely this should be OR?
This was leading to reads being output in a very strange order. It could also lead to reads from different chrs being combined in the same bundle, and is possibly to blame for some of the high memory usage (it will almost certainly read the whole file into memory this way).
Again, seems like a simple fix, but I feel like we've been here before, and don't want to make the change if I'm missing something.
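A minimal sketch of the proposed fix (hypothetical function and toy arguments, not the real code): flush the buffer once we are more than 1 kb past the last position OR on a new chromosome.

```python
def should_output(start, last_pos, tid, last_chr, whole_contig=False):
    """Decide whether the buffered reads can be output."""
    if whole_contig:
        return tid != last_chr
    # OR, not AND: either condition alone means no earlier read can
    # still receive a mate/duplicate in this bundle
    return start > (last_pos + 1000) or tid != last_chr

print(should_output(5000, 1000, 0, 0))   # True: >1 kb downstream, same chr
print(should_output(100, 1000, 1, 0))    # True: new chromosome
print(should_output(1200, 1000, 0, 0))   # False: still within 1 kb
```

With AND, moving to a new chromosome with a smaller coordinate would never trigger output, which matches the buffering behaviour described above.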
Hi there
Nice work on this tool. Unfortunately, installation on Mac OS X Sierra failed for me due to missing dependencies that were not stated:
$ conda install -c https://conda.anaconda.org/toms umi_tools
Fetching package metadata .........
Solving package specifications: .
PackageNotFoundError: Package not found: '' Dependencies missing in current osx-64 channels:
- umi_tools -> bcftools ==1.3
- umi_tools -> htslib >=1.3
- umi_tools -> pysam ==0.8.4
- umi_tools -> samtools >=1.3
Close matches found; did you mean one of these?
htslib: tblib, html5lib
samtools: apptools, umi_tools
You can search for packages on anaconda.org with
anaconda search -t conda samtools
(and similarly for the other packages)
Perhaps you could list these dependencies in the installation instructions and link to installation instructions for these tools?
Best
Davis
Currently the adjacency method is not available for the group command as I was unable to assign each read to a single group correctly. I'll see to this...
See #54 for the motivation for the group command
I have paired end data (2 fastq files, 77 mil reads) and I have 8 nt UMI bases on the 5' end of the first fq file.
I have a shell script file with this content:
newfilename="${1/_*/}"
cat "$1" | umi_tools extract \
--bc-pattern="NNNNNNNN" \
--read2-in="$2" \
--read2-out="$newfilename-um_2.fq" \
-L "umitools-extract.log" > "$newfilename-um_1.fq"
where $1 is the first fq file (fwd) and $2 is the second fq file (rev), and it is run as script.sh file_1.fq file_2.fq. I get the below error after the script runs for a couple of minutes.
sbatch-umitools-extract.sh: line 42: 32582 Broken pipe cat "$1"
32583 Killed | umi_tools extract --bc-pattern="NNNNNNNN" --read2-in="$2" --read2-out="$newfilename-um_2.fq" -L "umitools-extract.log" > "$newfilename-um_1.fq"
Strangely, it works fine on a small test dataset (paired-end fastq) of 40 lines each. The output fq files show the UMIs have been correctly removed and added to the read name. But it doesn't work for my big dataset. I wonder if it is just a script syntax error or something else.
There is a section of the code that tries to deduce the position of the read, taking into account any soft_clipping. It looks like this:
if read.is_reverse:
    pos = read.aend
    if read.cigar[-1][0] == 4:
        pos = pos + read.cigar[-1][1]
    start = read.pos
    if ('N' in read.cigarstring or
        (read.cigar[0][0] == 4 and
         read.cigar[0][1] > soft_clip_threshold)):
        is_spliced = True
else:
    pos = read.pos
    if read.cigar[0][0] == 4:
        pos = pos - read.cigar[0][1]
    start = pos
    if ('N' in read.cigarstring or
        (read.cigar[-1][0] == 4 and
         read.cigar[-1][1] > soft_clip_threshold)):
        is_spliced = True
Note that if the read is reversed, it takes aend, finds how many bases are added by soft-clipping... and then forgets all that and uses read.pos for start. Whereas if the read is not reversed, it does use the clipping.
Any ideas why this might be? It goes right back to the initial commit. I'm sure there is a reason for it, but I can't for the life of me figure out what it might be.
Hi all,
I have a set of UMI-tagged paired-end target sequencing data, and the UMIs attached to the read1 and read2 are different. I am wondering how UMI-tools handle this kind of data during deduplication.
I tried to use bwa mem for alignment, but got the error "paired reads have different names", as the UMIs of read1 and read2 are different and they were attached to the read names. Do you have suggestions for handling this?
Here is an example of the reads after the UMIs were attached to the names:
Read1:
@NS500683:84:HGMGMAFXX:1:11101:19517:1060_TGGACC 1:N:0:ATCACGAC+NAGGTTCA
CTGCCCCTGTTCCCTCCAACTCATCTTTTCATTTTTGTTGTAACTTGACTTTTCTGCTTTCTTTGTAACCTGCCCTCCCCACAGCGAATTGCGACGATCGTTGCATTAACTCGCGAATCACGAC
+
EEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEE<EEEEEEEEEEEEEEEEEEEEEEE6EEEEEE/EEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEE<EEEEAEEEEEAA
Read2:
@NS500683:84:HGMGMAFXX:1:11101:19517:1060_TGTGGG 2:N:0:ATCACGAC+NAGGTTCA
GAGGGCAGGTTACAAAGAAAGCAGAAAAGTCAAGTTACAACAAAAATGAAAAGATGAGTTGGAGGGAACAGGGGCAGGGTCCA
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEAEEEE
Thank you very much for your help in advance!
Kaiyang
For sRNA-Seq, it's possible for two reads to have the same start position but different read lengths and therefore to have originated from different molecules (assuming no quality trimming has been performed). dedup_umi.py should have an option to consider read length along with the start position.
I propose adding the read length to the reads_dict key (defaulting to no value when read length not required).
Any thoughts before I go ahead?
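A minimal sketch of the proposal (hypothetical key layout, not the actual reads_dict structure): including the read length in the bundle key keeps same-start reads of different lengths in separate bundles, and defaults to None so existing behaviour is unchanged.

```python
def bundle_key(pos, strand, umi, read_length=None):
    """Key for grouping reads into dedup bundles.

    read_length stays None when length-aware dedup is not requested,
    so all lengths at a position collapse into one bundle as before."""
    return (pos, strand, read_length), umi

print(bundle_key(1000, "+", "ATGC", 21))
print(bundle_key(1000, "+", "ATGC", 22))  # lands in a different bundle
```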
Hi,
I was trying to install umi-tools with "pip install umi_tools", but it failed with the error "IOError: [Errno 2] No such file or directory: 'requirements.txt'", do you have an idea how to fix it?
Thanks!
--Kaiyang
I started with a 12 GB BAM (77 million reads) using 3 cores and 24 GB RAM. It did not complete in 2.5 hours or even close; the output BAM after 2.5 hours was 2.3 MB. Then I extracted one chromosome to create a 314 MB BAM. Dedup ran on this small BAM for 1 hour and was still incomplete; a truncated BAM file of 12.4 MB was exported. I ran qualimap bamqc on both files (the one-chromosome BAM before and after dedup).
| file     | before dedup | after dedup |
|----------|--------------|-------------|
| reads    | 5,104,918    | 197,280     |
| dup rate | 38.74%       | 2.27%       |
So, I suppose the deduplication must be working.
Extrapolating from this run time, my 12 GB BAM would take a lot more than 39 hours. Would this be an expected run time? I was wondering if I was doing something wrong or whether some settings/switches are not optimal.
This is my script:
umi_tools dedup \
--method="directional-adjacency" \
--paired \
--output-stats="file-dedup.stats" \
-v 3 \
-E "file-dedup.error" \
-L "file-dedup.log" \
-I "file.bam" \
-S "file-dedup.bam"
Make dedup use more than one CPU.
First attempt at an implementation see: branch IS_parallel.
Need large data file to benchmark.
cc: @TomSmithCGAT
The output from the group and dedup commands is inconsistent between py2 and py3. I thought this was due to the python dictionary hash in py3 - quoting from the documentation '[hash] values of str, bytes and datetime are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python . Hash randomization is intended to provide protection against a denial-of-service'. However, setting the environmental variable PYTHONHASHSEED=0 for both py2 and py3 so they both use the same hash for the dictionaries doesn't ensure consistency between them.
I'm not sure what the difference is between py2 and py3 which is causing this?
See the Py3.5-testing branch for code which is fully Py3.5 compatible.
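One mitigation worth noting (a sketch with hypothetical counts, independent of the hash-seed question): iterating dictionaries in sorted key order removes any dependence on hash order, making output reproducible across interpreter versions.

```python
counts = {"ATGC": 3, "TTGG": 1, "ATGG": 2}

# sorted() fixes the iteration order regardless of the dict's hash order,
# so py2 and py3 emit identical output
for umi in sorted(counts):
    print(umi, counts[umi])
```

Ties elsewhere (e.g. UMIs with equal counts) would still need an explicit deterministic tie-break for full py2/py3 consistency.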
Hi Ian and Tom,
I very much like the theory behind this tool and am trying to use it in my research. I realise I may be conflating two specific issues... if so, I am sorry, but here goes anyway!
We have performed paired-end strand-specific RNA-seq with a kit utilizing UMIs from BIOO scientific. As they state:
"The NEXTflex Rapid Directional qRNA-Seq Kit contains a set of 96 distinct molecular labels on the sequencing adapters. Each label consists of an 8 nucleotide barcode tag. During the ligation reaction, each cDNA fragment independently and randomly ligates to a single label from this pool of 96 adapters to result in a total of 96 x 96 = 9,216 possible combinations across both ends."
I have included the cartoon of the cloning strategy from their website below.
I have been using umi-tools and the 'dedup' step takes a long time ( > 30hrs for 42 million reads). I have followed suggestions from earlier posts and do not output stats.
Reads: ~ 30-40 million, paired-end 75 nt reads
UMI: 8 nt long on both pairs
Reference: human genome and some scaffolds and ERCC sequences
Alignment: star, limiting multi-mappers
version: umi-tools-0.2.3
umi method used: "directional-adjacency"
cpu: one node of compute cluster, 36G mem
I created two bam files using a subset of reads that map to either A) chr19 (104 M bam file) or B) the 92 ERCC RNA spike-ins (34 M bam file).
For chr19, the dedup process took 74 seconds.
2,428,230 before dedup
1,383,172 after dedup
This is a non-trivial level of replication! Ideally I would like to test different umi dedup methods using ERCC counts and see what correlates the best with known ERCC concentrations.
For ERCC, the dedup process had taken > 1 hr when I killed it!
894,572 before dedup
??? after dedup
The reason I tested this is when I cranked up the verbosity in the log I noticed that the dedup process was really getting bogged down once it got to scaffolds (~160 in my assembly) and ERCCs (92 spike-ins).
Would an increase in the number of "chromosomes"/contigs result in such a drop in performance? If so, what would you recommend? I would rather not remove the scaffolds. The ERCC spike-ins are important to help assess the "truth" by comparing how the different approaches (directional/adjacency/cluster/percentile/unique) correlate with the known concentrations, particularly relative to the original alignments containing duplicates.
Do you think it is useful (accuracy and/or speed) for your algorithm to know the identity of the UMI a priori? Particularly considering the following properties:
I am stuck at this step and need to perform the deduplication before moving forward with downstream analysis. Any help would be greatly appreciated. I can provide whatever files you might find useful.
Kind regards,
Neel
Currently Conda installation works well, but installation via pip or downloading the tarball works much less well.
The current problems are:
In order to fix:
In the help content for umi_tools extract, the explanation of the pattern is a bit confusing. If Ns are random bases and Xs are UMIs, then should the Xs be extracted into the name and the Ns reattached to the read? If so, in the example below, GG should be added to the read name.
--bc-pattern
Use this option to specify the format of the UMI/barcode. Use Ns to
represent the random positions and Xs to indicate the bc positions.
Bases with Ns will be extracted and added to the read name. Remaining
bases, marked with an X will be reattached to the read.
E.g. If the pattern is NNXXNN,
Then the read:
@HISEQ:87:00000000 read1
AAGGTTGCTGATTGGATGGGCTAG
+
DA1AEBFGGCG01DFH00B1FF0B
will become:
@HISEQ:87:00000000_AATT read1
GGGCTGATTGGATGGGCTAG
+
1AFGGCG01DFH00B1FF0B
Dear developers.
I'm testing your awesome toolkit, but I don't know what to do with the following problem:
Used parameters
python dedup.py \
-I xxx_sorted.bam \
-S xxx_dedup_chr19.bam \
--paired \
--edit-distance-threshold 1 \
--output-stats xxx_dedup_chr19 \
--chrom chr19 \
-v 5
Traceback
...
2016-10-24 13:54:43,213 DEBUG Outputted 780000
2016-10-24 13:55:03,757 DEBUG Dumping 857871 mates for contig chr19
2016-10-24 13:57:13,608 DEBUG 3997 mates remaining
2016-10-24 13:57:13,608 INFO Searching for mates for 3997 unmatched alignments
Traceback (most recent call last):
  File "./UMI-tools-0.2.3-edited/umi_tools/dedup.py", line 1145, in <module>
    sys.exit(main(sys.argv))
  File "./UMI-tools-0.2.3-edited/umi_tools/dedup.py", line 1050, in main
    outfile.close()
  File "./UMI-tools-0.2.3-edited/umi_tools/dedup.py", line 309, in close
    for read in self.outfile.fetch(start=pos, end=pos+1, tid=chrom):
  File "pysam/calignmentfile.pyx", line 874, in pysam.calignmentfile.AlignmentFile.fetch (pysam/calignmentfile.c:10986)
ValueError: fetch called on bamfile without index
Files in directory
Software versions
UMI-tools v0.2.3
pysam 0.8.4
Tried: adding the until_eof=True flag to the self.outfile.fetch() call (to allow fetching without an index), but then I get an error that the BAM file is truncated.
In this tool, you are assuming that the UMI will be found in the ID, but there can be cases where the UMI is in other places. In our case, the UMI is located in a tag. Specifically in the RX tag (but others use the BC/BX).
Also, when assigning the UMI, the tags are competing (you use FU, Picard uses MI).
I was wondering: how do you merge the duplicate reads, or do you simply discard them? I would be interested in looking at the full sequences and possibly deriving a consensus to correct for PCR/sequencing errors.
@IanSudbery Just had a chat with Nils Koelling here at the WIMM. He's given some very helpful feedback on documentation which I'll feed into the master branch shortly.
He's using UMIs in a variant calling framework and therefore wants to group reads by their UMI and then perform a downstream step to identify the correct base call for each position in the set of reads to correctly resolve sequencing errors. We spoke about a couple of ways to achieve this:
Theoretically, these could be achieved within the dedup command but this is getting away from a pure deduplication problem as such. I think the best approach might be to provide a new umi_tools command "group" which uses the same network methods to identify the set of reads with the same UMI or related UMIs and then outputs either a tagged BAM or csv. We'd need to abstract some of the dedup.py code to a module file and reuse it in group.py.
This might also be a good way to address this issue regarding counting reads supporting each UMI: #45
What do you think?
Traceback (most recent call last):
  File "/home/chovanec/anaconda3/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/home/chovanec/anaconda3/lib/python3.5/site-packages/umi_tools/umi_tools.py", line 42, in main
    module.main(sys.argv)
  File "/home/chovanec/anaconda3/lib/python3.5/site-packages/umi_tools/dedup.py", line 972, in main
    read_gn = random_read_generator(infile.filename, chrom=options.chrom)
  File "/home/chovanec/anaconda3/lib/python3.5/site-packages/umi_tools/dedup.py", line 763, in __init__
    self.fill()
  File "/home/chovanec/anaconda3/lib/python3.5/site-packages/umi_tools/dedup.py", line 782, in fill
    for observed_umi, freq in self.umis_counter.iteritems():
AttributeError: 'Counter' object has no attribute 'iteritems'
Seems like it should be something more along these lines?
# Python 2 and 3: option 3
from future.utils import iteritems
# or
from six import iteritems
for (key, value) in iteritems(heights):
...
It turns out that future, as used by dedup.py, is not installed by default and should be listed as a dependency.
Currently UMI-tools dedup_umi.py only outputs the first pair for paired end BAMs although it uses the template to dedup paired end input.
Proposed solution: Use pysam.AlignmentFile.mate() to return the mate pair.
Hi, I was trying out the dedup_umi script (using adjacency as the method) for a set of bam alignment files. The following message was displayed and I am not sure if this has something to do with the input or something else.
Traceback (most recent call last):
  File "/data/home/nsarode3/git_repos/UMI-tools/dedup_umi.py", line 1046, in <module>
    sys.exit(main(sys.argv))
  File "/data/home/nsarode3/git_repos/UMI-tools/dedup_umi.py", line 858, in main
    read_gn = random_read_generator(infile.filename, chrom=options.chrom)
  File "/data/home/nsarode3/git_repos/UMI-tools/dedup_umi.py", line 665, in __init__
    self.fill()
  File "/data/home/nsarode3/git_repos/UMI-tools/dedup_umi.py", line 679, in fill
    self.observed_umis, freq = zip(*self.umis.iteritems())
ValueError: need more than 0 values to unpack
Thanks,
neha
I have designed a long UMI barcode (20×N). Due to primer synthesis errors and sequencing errors, the barcode may be 19 bp or 21 bp in length. Does UMI-tools take this situation into account?
The inclusion of calls to the random_read_generator function increases memory usage (around 0.8G for a 2.7G infile). Around a third of this is the self.umi list object which stores all the umis observed. Currently, the list is grown as required for random sampling to save some memory initially.
Proposed solution: Read through whole bam once and stores umis detected in a dictionary. Convert keys and values to numpy arrays and use numpy.random.choice to randomly select umi using values array as weights.
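The proposal names numpy.random.choice; the same pattern is sketched here with the stdlib equivalent random.choices, using hypothetical counts, to show the memory saving: we store one entry per distinct UMI plus a weight, rather than one list element per observation.

```python
import random

# Hypothetical UMI counts gathered in a single pass over the BAM
umi_counts = {"ATGC": 70, "ATGG": 20, "TTGG": 10}

umis = list(umi_counts.keys())
weights = list(umi_counts.values())

# Weighted sampling without materialising one list entry per read;
# numpy.random.choice(umis, size=5, p=probs) is the equivalent call.
random.seed(0)
sample = random.choices(umis, weights=weights, k=5)
print(sample)
```

For 100 distinct UMIs this stores 100 keys and 100 weights, regardless of whether the BAM contains a thousand reads or a hundred million.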
We need some tests. Automated travis testing would be brilliant. I'll have a word with Sebastian about this.
Hello,
My UMIs were added to read names using bcl2fastq and aligned with STAR. Therefore, each read has the UMI added to the end of the read name by ":NNNNNNNNNN" instead of "_NNNNNNNNNN". Is there a way to use umi-tools with this configuration?
Thank you!
Best,
Matt
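The ':' separator described above could be rewritten to '_' before running dedup. A hedged sketch (hypothetical read name, assumed 10 nt UMI as the last ':'-delimited field); the real conversion would be applied to every name in the FASTQ or BAM:

```python
def fix_umi_separator(name, umi_len=10):
    """Convert a bcl2fastq-style ':UMI' suffix to the '_UMI' suffix
    that umi_tools expects in read names."""
    head, sep, umi = name.rpartition(":")
    if sep and len(umi) == umi_len and set(umi) <= set("ACGTN"):
        return head + "_" + umi
    return name  # leave names without a trailing UMI untouched

print(fix_umi_separator("NS500:1:FLOW:1:1:100:200:ACGTACGTAC"))
# NS500:1:FLOW:1:1:100:200_ACGTACGTAC
```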
I get the following syntax error when testing out the script. Is this a Python version issue or something else? (I'm using version 2.6.6.)
python dedup_umi.py --help
File "dedup_umi.py", line 296
for umi in umis}
^
SyntaxError: invalid syntax
Thanks,
Neha
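For reference, the failing line is the tail of a dict comprehension, and dict comprehensions were only introduced in Python 2.7, so 2.6.6 raises exactly this SyntaxError. A sketch of the 2.6-compatible spelling (toy data, not the real function):

```python
umis = ["ATGC", "ATGG"]

# Python >= 2.7: {umi: [...] for umi in umis}
# Python 2.6 equivalent: dict() over a generator expression
adj = dict((umi, [u for u in umis if u != umi]) for umi in umis)
print(adj)  # {'ATGC': ['ATGG'], 'ATGG': ['ATGC']}
```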
In a test run with real data, I've got things like:
M01660:68:000000000-AMT35:1:1107:14665:23414_TTTTATGTGGTA chr1 11291633 TTTTATGTGGTA 3 TTTTATGTGGTA 4 231
M01660:68:000000000-AMT35:1:1111:17174:11107_TTTTATGTGGTA chr1 11291633 TTTTATGTGGTA 3 TTTTATGTGGTA 4 231
M01660:68:000000000-AMT35:1:1119:12490:6844_TTTTATGTGGTA chr1 11291633 TTTTATGTGGTA 3 TTTTATGTGGTA 4 231
M01660:68:000000000-AMT35:1:2109:18893:17840_TTTTATTTGGTA chr1 11291633 TTTTATTTGGTA 3 TTTTATGTGGTA 4 231
If I'm not misinterpreting the data (I'm new to this tool), the last line should say that there is 1 UMI that has been joined to group 231 of 4 UMIs in total. But instead of a "1", I see a "3". That's incorrect, right?
Currently, the cluster_and_reduce function calls functions to create an adjacency list, identify connected components, reduce clusters down to a parent UMI, and return the reads to be written out. Actually, there are now two cluster_and_reduce functions, as a separate function was written to allow stats to be collected during the process.
Ultimately, it would be useful if a choice of umi deduping methods were available so we can compare previously published methods (e.g drop UMIs below 1/10th average counts at position) with our own method. This is difficult in the current framework.
Proposed solution: replace the cluster_and_reduce function with a ClusterAndReduce functor. Upon initialization of the functor, the methods for generating the adjacency list, identifying components and reducing clusters will be defined depending on the method supplied. Calls to the ClusterAndReduce functor will generate calls to its methods and the reads to be written out will be returned
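A minimal sketch of the functor idea (illustrative method table and a toy adjacency method, not the real implementation): method resolution happens once at initialisation, so each per-position call just dispatches to the pre-selected functions.

```python
class ClusterAndReducer:
    """Functor whose methods are chosen once, at construction time."""

    def __init__(self, method):
        # Hypothetical method table; the real functor would also select
        # connected-components and cluster-reduction functions here
        self.get_adj_list = {"adjacency": self._adj,
                             "cluster": self._adj}[method]

    def _adj(self, umis, counts, threshold):
        # Toy adjacency: connect UMIs within the Hamming threshold
        return {u: [v for v in umis if u != v and
                    sum(a != b for a, b in zip(u, v)) <= threshold]
                for u in umis}

    def __call__(self, umis, counts, threshold=1):
        return self.get_adj_list(umis, counts, threshold)

cr = ClusterAndReducer("adjacency")
print(cr(["ATGC", "ATGG"], {"ATGC": 5, "ATGG": 1}))
# {'ATGC': ['ATGG'], 'ATGG': ['ATGC']}
```

Comparing published methods then only requires registering them in the table, with no changes to the calling code.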