single-cell-genetics / cellsnp Goto Github PK

View Code? Open in Web Editor NEW

71.0 71.0 10.0 229 KB

Pileup biallelic SNPs from single-cell and bulk RNA-seq data

License: Apache License 2.0

Python 94.27% Shell 5.73%

genetic-variants genotyping rna-seq single-cell

cellsnp's People

Stargazers

Watchers

Forkers

yxian9 tuqiang2014 yandgong307 anahit19 cohenp05 mengchengyao hxj5 lichenbiostat alaminzju jeffhsu3

cellsnp's Issues

generating VCF for REGION_VCF from RNA-seq data

Hi,

This is more naive question as I do not have experience in SNPs.

I have 2 cell-lines scRNAseq data, combined sequenced. Now would like to separate using cellSNP (& vireo).
I got bulk RNA-seq data for this 2 cell-lines. So, question is how to make VCF file for REGION_VCF option?

thanks,

Does it work for mouse？

Thanks for your great tool.
Now I want to use it for mouse, does it work? Any suggestions?

Running cellSNP on transcriptomic BAM from salmon alevin

Hi, thanks for your previous responses a while back.

I am now trying to use cellSNP/Vireo with a SAM/BAM output file from salmon alevin instead of Cell Ranger, and wanted to see if this would be feasible somehow.

The main issue seems to be that salmon alevin maps to the transcriptome and creates a transcriptomic BAM, while Cell Ranger maps to the genome and creates a genomic BAM. Since cellSNP expects a genomic BAM, it then gives me errors due to not finding chromosome references in the transcriptomic BAM.

I just checked with the salmon alevin authors on Slack this morning to find out more about this - I believe the chromosome reference is normally stored in the RNAME field in the genomic BAM, but in the transcriptomic BAM this field contains the transcript name. The genomic/transcriptomic coordinates are also different.

Do you know if there would be a way to run cellSNP directly on a transcriptomic BAM, or convert this internally somehow? Alternatively, the salmon alevin authors pointed me towards some tools to directly convert transcriptomic to genomic BAM, which could be a way around this - but I thought I would check here first.

The main reason this has come up is that we tend to have various issues with Cell Ranger (slow runtime, memory problems, multi-mapping reads), so we would prefer to use salmon alevin in our pipelines, and were wondering if this has already been considered. Thanks!

AD in cells

对于之前的提问，感谢您的快速回答，这里还有另外一个问题：我想知道的是，为什么会在单个细胞中鉴定ＳＮＰ会有等位基因频率一说呢，不应该是在一个群体中鉴定ＳＮＰ的时候才存在等位基因频率？　对于一个基因来说，我以为突变了之后，所有的ＲＮＡ转录出来的序列都应该是在该位置有突变。　还是说这个ＡＤ有其他的含义呢？

For the previous question, thank you for your quick answer. Here is another question: I would like to know why there is allele frequency in the identification of SNP in a single cell. It should not be the allele frequency when identifying SNP in a population? For a gene, I think that after mutation, all RNA transcripts should have mutations at this location. Or does this ＡＤ　have other meanings?

Default nan-handling policy is a memory hog

As single-cell datasets are really sparse, it's important to handle missing values in a way that doesn't consume too much memory. Currently, CellSNP labels missing entries with ".:.:.:.:.:."
(11 bits at best). I would strongly suggest using an empty string instead of that stub. I have been processing the output of CellSNP, and when I manually replaced all occurrences of ".:.:.:.:.:." with an empty string, I reduced the file size from 25.6Gb to 2.5Gb. This is dramatic. Not only that this choice of nan-filling value wastes the memory but it also makes the file harder to process using some convenient tools in Python/R.

Run time estimation

Similar to #3 but wondering if things have changed.
Running cellSNP v0.1.7 as

  cellSNP --samFile ${CELLRANGERDIR}/"${SAMPLE}"/outs/possorted_genome_bam.bam \
          --outDir ${OUTDIR} \
          --regionsVCF genome1K.phase3.SNP_AF5e4.chr1toX.hg38.vcf.gz \
          --barcodeFile ${PROJECT_ROOT}/data/emptyDrops/"${SAMPLE}".barcodes.txt \
          --nproc 20 \
          --minMAF 0.1 \
          --minCOUNT 20

with 31,707 barcodes on a 25G BAM file has been going for > 18 days!
It's still writing output, too (as of 2020-02-03 5PM):

% ll -t data/cellSNP/cellSNP.cells.vcf.gz.temp_*
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb  3 17:02 data/cellSNP/cellSNP.cells.vcf.gz.temp_17_
-rw-r----- 1 hickey grpu_mritchie_1 2.4G Feb  3 17:02 data/cellSNP/cellSNP.cells.vcf.gz.temp_3_
-rw-r----- 1 hickey grpu_mritchie_1 2.3G Feb  3 17:02 data/cellSNP/cellSNP.cells.vcf.gz.temp_11_
-rw-r----- 1 hickey grpu_mritchie_1 2.2G Feb  3 17:02 data/cellSNP/cellSNP.cells.vcf.gz.temp_15_
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb  3 17:01 data/cellSNP/cellSNP.cells.vcf.gz.temp_16_
-rw-r----- 1 hickey grpu_mritchie_1 1.1G Feb  3 16:59 data/cellSNP/cellSNP.cells.vcf.gz.temp_19_
-rw-r----- 1 hickey grpu_mritchie_1 1.9G Feb  3 16:56 data/cellSNP/cellSNP.cells.vcf.gz.temp_12_
-rw-r----- 1 hickey grpu_mritchie_1 2.2G Feb  3 16:55 data/cellSNP/cellSNP.cells.vcf.gz.temp_8_
-rw-r----- 1 hickey grpu_mritchie_1 2.4G Feb  3 16:54 data/cellSNP/cellSNP.cells.vcf.gz.temp_6_
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb  3 16:54 data/cellSNP/cellSNP.cells.vcf.gz.temp_9_
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb  3 16:51 data/cellSNP/cellSNP.cells.vcf.gz.temp_10_
-rw-r----- 1 hickey grpu_mritchie_1 2.1G Feb  3 16:50 data/cellSNP/cellSNP.cells.vcf.gz.temp_1_
-rw-r----- 1 hickey grpu_mritchie_1 2.2G Feb  3 16:46 data/cellSNP/cellSNP.cells.vcf.gz.temp_2_
-rw-r----- 1 hickey grpu_mritchie_1 2.3G Feb  3 16:38 data/cellSNP/cellSNP.cells.vcf.gz.temp_14_
-rw-r----- 1 hickey grpu_mritchie_1 2.3G Feb  3 16:37 data/cellSNP/cellSNP.cells.vcf.gz.temp_13_
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb  3 16:32 data/cellSNP/cellSNP.cells.vcf.gz.temp_4_
-rw-r----- 1 hickey grpu_mritchie_1 2.5G Feb  3 16:19 data/cellSNP/cellSNP.cells.vcf.gz.temp_7_
-rw-r----- 1 hickey grpu_mritchie_1 2.1G Feb  3 15:04 data/cellSNP/cellSNP.cells.vcf.gz.temp_18_
-rw-r----- 1 hickey grpu_mritchie_1 1.9G Feb  3 13:44 data/cellSNP/cellSNP.cells.vcf.gz.temp_0_
-rw-r----- 1 hickey grpu_mritchie_1 1.8G Feb  2 23:52 data/cellSNP/cellSNP.cells.vcf.gz.temp_5_

I've run cellSNP before and although it took a few days it certainly didn't take this long.
I'm wondering:

What particular parts of this (e.g., size of BAM, number of barcodes, number of loci in --regionVCF, ...) might be causing this huge runtime?
What might I do to speed cellSNP up for subsequent datasets (I'm anticipating several datasets, many larger than this, over the course of the year)?
How can I estimate how much longer this particular process has to run?

Thanks,
Pete

Empty AD&DP mtx

Hi,
Thanks for this nice tool!

I am trying to use this tool to pile up SNPs on chrX here is my code:
cellSNP -s $BAM_file -b $BARCODE_list -O $OUT_DIR -p 22 --minMAF 0.1 --minCOUNT 100 --chrom chrX

Both AD.matrix and DP.mtx are empty. I suspect that it maybe because my bam file does not have -CB as sam tag but rather in the read name. Please see attached my bam file.
Could you please suggest a workaround or if there was some other underlying problem? Thanks!

bam (1).txt

Adapting cellSNP for gene-of-interest / indels / multiple scRNA-seq samples

Hi Yuanhua,

I have three suggestions/questions about cellSNP and cellsnp-lite:

Would you consider using the "start" and "stop" parameters in pysam.pileup to allow users to achieve faster SNP detection within a region/gene of interest?
One of my gene of interest is known to be frequently disrupted by frameshift mutations / indels, but cellSNP only support single-nucleotide variant detection. Could the pipeline be modified to consider indels?
Is it possible to input two or multiple scRNA-seq samples and identify sites that are different across samples?

Thank you!

Best regards,
Yiyun

Problem with New Version (0.1.3)

I updated cellSNP yesterday and found that data that had previously worked with the tool no longer did. I also tested the data used in your test.sh file and got the same traceback. Here is an example of the output:

[cellSNP] mode 2: pileup 22 whole chromosomes in 4246 single cells.
Traceback (most recent call last):
File "/home/rstudio/.local/bin/cellSNP", line 11, in
sys.exit(main())
File "/home/rstudio/.local/lib/python2.7/site-packages/cellSNP/cellSNP.py", line 198, in main
result = [res.get() if nproc > 1 else res for res in result]
File "/usr/lib/python2.7/multiprocessing/pool.py", line 572, in get
raise self._value
RuntimeWarning: divide by zero encountered in log

I ultimately uninstalled cellSNP and reinstalled version 0.0.8 and found that it works just fine again. I'm still not sure what caused the issue in the newer version but wanted to alert you to it.

Run time estimation

How long would it approximately take to run cellSNP (Mode 1 with region_vcf) on HiSeq data with ~200M reads and 4000 cells?
Do you have some examples of cellSNP runtimes for some number of cells, number of reads, and number of subprocesses (-p option)?

Some SNPs are not in gene

I used cellSNP to identify SNP of scRNAseq. But I found a problem many SNP sites are not in gene. In my opinion, All Reads of scRNAseq should come from gene, So all SNP sites should locate in gene.

Wrong number of chromosomes in output file?

Hello, I used a VCF file made with the mouse MM10 genome and a 10x BAM file aligned to the mouse MM10 genome. This means my output should have Chromosomes 1-19, M, X, and Y like my input files. But my output from CellSNP contains chromosomes 1-22, X, and Y. Any tips on what I can do to troubleshoot this?

output sparse matrix

hi, what do AD, DP and OTH stand for (variant x cell sparse matrix output #6 )
thanks!

No SNPs called in more than 1 barcodes

Hello,

I am running cellSNP in mode 2 for my bam file with the following command:

cellSNP -s <my coordinate sorted bam file> \
-O <output directory> \
-b <barcode file - one barcode on each line> \
-p 10 \
--cellTAG=None \
--UMItag=UB \
--minMAF 0.1 \
--minCOUNT 100 \
--chrom=chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY

The barcode file I am providing is a test file with only 5 barcodes. But, however the output cellSNP.cells.vcf has SNPs piled up from all the chromosomes only for the 1st barcode in the file.

What do you think could be the reason for this ? Please see below a snapshot of my bam file:

Any help would be appreciated.

Thanks,
Somesh

Is it expected to generate a "position x barcode" matrix?

Hi, I wonder if this is the expected result? A VCF with positions as rows and droplet barcode as columns?
Just to confirm if I am using it correctly.

Thanks!

Change the flag filtering default to include PCR duplicates

From my cursory understanding of cellSNP you filter all of the reads that are marked as PCR duplicates by Cell Ranger. However, this would remove a large number of UMI duplicates as noted by the vartrix documentation:

ignore alignments marked as duplicates? Take care when turning this on with scRNA-seq data, as duplicates are marked in that pipeline for every extra read sharing the same UMI/CB pair, which will result in most variant data being lost.

I wouldn't be surprised if this negatively affects the performance of vireo on some datasets.

Chromosome naming in reference VCF files

Hi all,

starting with Cell Ranger reference 2020-A, chromosome naming now follows Gencode convention (chr1-chr23, chrM) instead of ensembl style (1-23, MT). It's probably worth it to update the reference VCF files provided since it'll definitely cause confusion among users.

Cheers

-- Alex

Optimisation suggestions

I am using cellSNP is a prior step to vireo and it works very well. But it is incredibly slow and I think there's a lot of room for speeding it up. I don't program in Python and don't have much experience profiling Python code, but I think a lot of time is wasted in

https://github.com/huangyh09/cellSNP/blob/master/cellSNP/utils/pileup_utils.py#L44-L68

There's overhead to opening a bam file, and overhead to accessing references. This is done once per SNP, so the overhead is dramatic for say the 1kGenome SNPs. A quick fix can be to change

if chrom != None:
    if chrom not in samFile.references:
        if chrom.startswith("chr"):
            chrom = chrom.split("chr")[1]
        else:
            chrom = "chr" + chrom
    if chrom not in samFile.references:
        print("Can't find references %s in samFile" %chrom)
        return None, None

if chrom != None:
    sam_refs = samFile.references
    if chrom not in sam_refs:
        if chrom.startswith("chr"):
            chrom = chrom.split("chr")[1]
        else:
            chrom = "chr" + chrom
    if chrom not in sam_refs:
        print("Can't find references %s in samFile" %chrom)
        return None, None

Which in a small test case I have with 5000 cells, 20000 reads and with 100,000 SNPs resulted in a 45% speedup. Another more dramatic speedup is to avoid re-opening the BAM files over and over, I played around with creating a global that stores the currently open filename, chromosome and file object. Then returning the global if the request is identical

def check_pysam_chrom(samFile, chrom=None):
    """Chech if samFile is a file name or pysam object, and if chrom format. 
    """
    global CHROM_CACHE
    global CHROM_CACHE_ID
    global CHROM_CHACH_SAMFILE

    CHROM_CACHE_SAMFILE = samFile

    if CHROM_CACHE is not None:
        if (samFile == CHROM_CACHE_SAMFILE) and (chrom == CHROM_CACHE_ID):
            return CHROM_CACHE, CHROM_CACHE_ID

    if type(samFile) == str:
        ftype = samFile.split(".")[-1]
        if ftype != "bam" and ftype != "sam" and ftype != "cram" :
            print("Error: file type need suffix of bam, sam or cram.")
            sys.exit(1)
        if ftype == "cram":
            samFile = pysam.AlignmentFile(samFile, "rc")
        elif ftype == "bam":
            samFile = pysam.AlignmentFile(samFile, "rb")
        else:
            samFile = pysam.AlignmentFile(samFile, "r")

    if chrom != None:
        if chrom not in samFile.references:
            if chrom.startswith("chr"):
                chrom = chrom.split("chr")[1]
            else:
                chrom = "chr" + chrom
        if chrom not in samFile.references:
            print("Can't find references %s in samFile" %chrom)
            return None, None
    
    CHROM_CACHE = samFile
    CHROM_CACHE_ID = chrom
    return samFile, chrom

This results in a 3.5x speedup under a single thread, but I don't know how the global is handled under multiple threads so I cannot recommend this solution.

The last thing I tried, was to remove the top level thread pools and push the threading down to the lower level by using

samFile = pysam.AlignmentFile(samFile, "rb", threads = nproc)

however this resulted in negative performance impacts, for me it was slower on all thread counts higher than 1.

Reference Genome

Hello,

Does cellSNP take in a reference genome?

Best,

Tae

Question about running cellSNP on merged BAM files from multiple samples

Hi, I have been running into some difficulties with running cellSNP/Vireo, and thought here would be a good place to ask. This is a follow-up to the questions Stephanie Hicks asked via Twitter a few weeks ago.

As discussed on Twitter, we are considering a multiplexed design for some future 10x scRNA-seq samples, and in anticipation of this, we are trying to do some evaluations on our existing (non-multiplexed) samples, to see how well Vireo would have performed at demultiplexing these if they were multiplexed.

The way we are trying to do this is: (i) combine BAM files from 3 separate scRNA-seq samples (i.e. simply merge them using samtools merge), (ii) run cellSNP to genotype cells, using either mode 1 (using known variants provided from the 1000 Genomes Project), or mode 3 (using our separate matched bulk RNA-seq samples), and then (iii) run Vireo.

However I have been getting stuck when trying to run cellSNP. I think it is either due to the way I have merged the BAM files, or due to the list of cell barcodes I am providing to cellSNP.

Do you have any advice about how you would go about doing this? E.g. maybe simply merging the BAM files is too simple (I have had a look in https://github.com/single-cell-genetics/vireo/tree/master/simulate, and see that you also used samtools merge there, but there is also quite a lot else going on in that script). For the list of cell barcodes, I have tried providing both the entire Cell Ranger barcode whitelist, or a parsed list of barcodes from the Cell Ranger output, but both are giving me errors -- I can't quite figure out from the documentation what I need to provide here; sorry if I missed something. (cellSNP also gives me a message saying it is actually running in mode 2, so maybe I am also providing the inputs incorrectly).

Any advice you could provide would be greatly appreciated. Or if you have a full worked example showing how to run a cellSNP analysis, this would also be very helpful -- I had a look but could only find Vireo examples beginning from the cellSNP output; sorry again if I missed something.

KeyError: '.' in cellSNP/pileup_utils.py causing failure of temp file merging

Hello, I'm running into an error when calling cellSNP in mode 1 on mouse .bam data, specifically on chromosome 11, with the following command:

~/.local/bin/cellSNP -s file.bam -b barcodes.tsv -O cellSNP_output_11 -R mgp.v5.snps_subset_11.vcf.gz -p 36 --minMAF 0.1 --minCOUNT 20

This causes the following error after the pipeline has finished processing all positions:

100.00% positions processed.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/u/local/apps/anaconda3/2020.11/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/u/home/m/margolis/.local/lib/python3.8/site-packages/cellSNP/utils/pileup_utils.py", line 294, in fetch_positions
    vcf_line = get_vcf_line(base_merge, base_cells, qual_cells,
  File "/u/home/m/margolis/.local/lib/python3.8/site-packages/cellSNP/utils/pileup_utils.py", line 366, in get_vcf_line
    ALT_cnt = base_merge[ALT]
KeyError: '.'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/u/home/m/margolis/.local/bin/cellSNP", line 8, in <module>
    sys.exit(main())
  File "/u/home/m/margolis/.local/lib/python3.8/site-packages/cellSNP/cellSNP.py", line 268, in main
    result = [res.get() for res in result]
  File "/u/home/m/margolis/.local/lib/python3.8/site-packages/cellSNP/cellSNP.py", line 268, in <listcomp>
    result = [res.get() for res in result]
  File "/u/local/apps/anaconda3/2020.11/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
KeyError: '.'

Here is the segment of code in the pileup_util.py file throwing an error:

I'm assuming that this means an alternate allele wasn't found in one of the temp files, causing an issue with combining the temp files into the final output file. Any suggestions on how to prevent this in the future? I have found that changing minMAF and minCOUNT (making them more stringent) seems to get rid of the issue, but I would like to keep these values consistent across all my analyses. This issue seems to pop up randomly, though more often with smaller chromosomes. Mouse chromosomes 1-10, 12-XY ran without issue.

RuntimeWarning: divide by zero encountered in log

[cellSNP] mode 2: pileup 22 whole chromosomes in 23111 single cells.
Traceback (most recent call last):
  File "/net/1000g/fanzhang/bin/miniconda2/bin/cellSNP", line 9, in <module>
    load_entry_point('cellSNP==0.1.4', 'console_scripts', 'cellSNP')()
  File "/net/1000g/fanzhang/bin/miniconda2/lib/python2.7/site-packages/cellSNP-0.1.4-py2.7.egg/cellSNP/cellSNP.py", line 196, in main
    max_FLAG, min_LEN, doubletGL, True)
  File "/net/1000g/fanzhang/bin/miniconda2/lib/python2.7/site-packages/cellSNP-0.1.4-py2.7.egg/cellSNP/utils/pileup_regions.py", line 111, in pileup_regions
    cell_list, UMIs_list, barcodes)
  File "/net/1000g/fanzhang/bin/miniconda2/lib/python2.7/site-packages/cellSNP-0.1.4-py2.7.egg/cellSNP/utils/pileup_utils.py", line 309, in map_barcodes
    qual_cells[_idx][BASE_IDX[_base]] += qual_vector(_qual)
  File "/net/1000g/fanzhang/bin/miniconda2/lib/python2.7/site-packages/cellSNP-0.1.4-py2.7.egg/cellSNP/utils/pileup_utils.py", line 101, in qual_vector
    RV = [np.log(1-Q), np.log(3/4 - 2/3*Q), np.log(1/2 - 1/3*Q), np.log(Q)]
RuntimeWarning: divide by zero encountered in log

Hello, I encountered this error, what could be the reason?

single-cell-genetics / cellsnp Goto Github PK

cellsnp's People

Stargazers

Watchers

Forkers

cellsnp's Issues

Recommend Projects

Recommend Topics

Recommend Org