single-cell-genetics / cellsnp Goto Github PK
View Code? Open in Web Editor NEWPileup biallelic SNPs from single-cell and bulk RNA-seq data
License: Apache License 2.0
Pileup biallelic SNPs from single-cell and bulk RNA-seq data
License: Apache License 2.0
Hi,
This is more naive question as I do not have experience in SNPs.
I have 2 cell-lines scRNAseq data, combined sequenced. Now would like to separate using cellSNP (& vireo).
I got bulk RNA-seq data for this 2 cell-lines. So, question is how to make VCF file for REGION_VCF option?
thanks,
Thanks for your great tool.
Now I want to use it for mouse, does it work? Any suggestions?
Hi, thanks for your previous responses a while back.
I am now trying to use cellSNP/Vireo with a SAM/BAM output file from salmon alevin
instead of Cell Ranger, and wanted to see if this would be feasible somehow.
The main issue seems to be that salmon alevin
maps to the transcriptome and creates a transcriptomic BAM, while Cell Ranger maps to the genome and creates a genomic BAM. Since cellSNP expects a genomic BAM, it then gives me errors due to not finding chromosome references in the transcriptomic BAM.
I just checked with the salmon alevin
authors on Slack this morning to find out more about this - I believe the chromosome reference is normally stored in the RNAME
field in the genomic BAM, but in the transcriptomic BAM this field contains the transcript name. The genomic/transcriptomic coordinates are also different.
Do you know if there would be a way to run cellSNP directly on a transcriptomic BAM, or convert this internally somehow? Alternatively, the salmon alevin
authors pointed me towards some tools to directly convert transcriptomic to genomic BAM, which could be a way around this - but I thought I would check here first.
The main reason this has come up is that we tend to have various issues with Cell Ranger (slow runtime, memory problems, multi-mapping reads), so we would prefer to use salmon alevin
in our pipelines, and were wondering if this has already been considered. Thanks!
对于之前的提问,感谢您的快速回答,这里还有另外一个问题:我想知道的是,为什么会在单个细胞中鉴定SNP会有等位基因频率一说呢,不应该是在一个群体中鉴定SNP的时候才存在等位基因频率? 对于一个基因来说,我以为突变了之后,所有的RNA转录出来的序列都应该是在该位置有突变。 还是说这个AD有其他的含义呢?
For the previous question, thank you for your quick answer. Here is another question: I would like to know why there is allele frequency in the identification of SNP in a single cell. It should not be the allele frequency when identifying SNP in a population? For a gene, I think that after mutation, all RNA transcripts should have mutations at this location. Or does this AD have other meanings?
As single-cell datasets are really sparse, it's important to handle missing values in a way that doesn't consume too much memory. Currently, CellSNP labels missing entries with ".:.:.:.:.:."
(11 bits at best). I would strongly suggest using an empty string instead of that stub. I have been processing the output of CellSNP, and when I manually replaced all occurrences of ".:.:.:.:.:." with an empty string, I reduced the file size from 25.6Gb to 2.5Gb. This is dramatic. Not only that this choice of nan-filling value wastes the memory but it also makes the file harder to process using some convenient tools in Python/R.
Similar to #3 but wondering if things have changed.
Running cellSNP v0.1.7 as
cellSNP --samFile ${CELLRANGERDIR}/"${SAMPLE}"/outs/possorted_genome_bam.bam \
--outDir ${OUTDIR} \
--regionsVCF genome1K.phase3.SNP_AF5e4.chr1toX.hg38.vcf.gz \
--barcodeFile ${PROJECT_ROOT}/data/emptyDrops/"${SAMPLE}".barcodes.txt \
--nproc 20 \
--minMAF 0.1 \
--minCOUNT 20
with 31,707 barcodes on a 25G BAM file has been going for > 18 days!
It's still writing output, too (as of 2020-02-03 5PM):
% ll -t data/cellSNP/cellSNP.cells.vcf.gz.temp_*
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb 3 17:02 data/cellSNP/cellSNP.cells.vcf.gz.temp_17_
-rw-r----- 1 hickey grpu_mritchie_1 2.4G Feb 3 17:02 data/cellSNP/cellSNP.cells.vcf.gz.temp_3_
-rw-r----- 1 hickey grpu_mritchie_1 2.3G Feb 3 17:02 data/cellSNP/cellSNP.cells.vcf.gz.temp_11_
-rw-r----- 1 hickey grpu_mritchie_1 2.2G Feb 3 17:02 data/cellSNP/cellSNP.cells.vcf.gz.temp_15_
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb 3 17:01 data/cellSNP/cellSNP.cells.vcf.gz.temp_16_
-rw-r----- 1 hickey grpu_mritchie_1 1.1G Feb 3 16:59 data/cellSNP/cellSNP.cells.vcf.gz.temp_19_
-rw-r----- 1 hickey grpu_mritchie_1 1.9G Feb 3 16:56 data/cellSNP/cellSNP.cells.vcf.gz.temp_12_
-rw-r----- 1 hickey grpu_mritchie_1 2.2G Feb 3 16:55 data/cellSNP/cellSNP.cells.vcf.gz.temp_8_
-rw-r----- 1 hickey grpu_mritchie_1 2.4G Feb 3 16:54 data/cellSNP/cellSNP.cells.vcf.gz.temp_6_
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb 3 16:54 data/cellSNP/cellSNP.cells.vcf.gz.temp_9_
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb 3 16:51 data/cellSNP/cellSNP.cells.vcf.gz.temp_10_
-rw-r----- 1 hickey grpu_mritchie_1 2.1G Feb 3 16:50 data/cellSNP/cellSNP.cells.vcf.gz.temp_1_
-rw-r----- 1 hickey grpu_mritchie_1 2.2G Feb 3 16:46 data/cellSNP/cellSNP.cells.vcf.gz.temp_2_
-rw-r----- 1 hickey grpu_mritchie_1 2.3G Feb 3 16:38 data/cellSNP/cellSNP.cells.vcf.gz.temp_14_
-rw-r----- 1 hickey grpu_mritchie_1 2.3G Feb 3 16:37 data/cellSNP/cellSNP.cells.vcf.gz.temp_13_
-rw-r----- 1 hickey grpu_mritchie_1 2.0G Feb 3 16:32 data/cellSNP/cellSNP.cells.vcf.gz.temp_4_
-rw-r----- 1 hickey grpu_mritchie_1 2.5G Feb 3 16:19 data/cellSNP/cellSNP.cells.vcf.gz.temp_7_
-rw-r----- 1 hickey grpu_mritchie_1 2.1G Feb 3 15:04 data/cellSNP/cellSNP.cells.vcf.gz.temp_18_
-rw-r----- 1 hickey grpu_mritchie_1 1.9G Feb 3 13:44 data/cellSNP/cellSNP.cells.vcf.gz.temp_0_
-rw-r----- 1 hickey grpu_mritchie_1 1.8G Feb 2 23:52 data/cellSNP/cellSNP.cells.vcf.gz.temp_5_
I've run cellSNP before and although it took a few days it certainly didn't take this long.
I'm wondering:
--regionVCF
, ...) might be causing this huge runtime?Thanks,
Pete
Hi,
Thanks for this nice tool!
I am trying to use this tool to pile up SNPs on chrX here is my code:
cellSNP -s $BAM_file -b $BARCODE_list -O $OUT_DIR -p 22 --minMAF 0.1 --minCOUNT 100 --chrom chrX
Both AD.matrix and DP.mtx are empty. I suspect that it maybe because my bam file does not have -CB as sam tag but rather in the read name. Please see attached my bam file.
Could you please suggest a workaround or if there was some other underlying problem? Thanks!
Hi Yuanhua,
I have three suggestions/questions about cellSNP and cellsnp-lite:
Would you consider using the "start" and "stop" parameters in pysam.pileup to allow users to achieve faster SNP detection within a region/gene of interest?
One of my gene of interest is known to be frequently disrupted by frameshift mutations / indels, but cellSNP only support single-nucleotide variant detection. Could the pipeline be modified to consider indels?
Is it possible to input two or multiple scRNA-seq samples and identify sites that are different across samples?
Thank you!
Best regards,
Yiyun
I updated cellSNP yesterday and found that data that had previously worked with the tool no longer did. I also tested the data used in your test.sh file and got the same traceback. Here is an example of the output:
[cellSNP] mode 2: pileup 22 whole chromosomes in 4246 single cells.
Traceback (most recent call last):
File "/home/rstudio/.local/bin/cellSNP", line 11, in
sys.exit(main())
File "/home/rstudio/.local/lib/python2.7/site-packages/cellSNP/cellSNP.py", line 198, in main
result = [res.get() if nproc > 1 else res for res in result]
File "/usr/lib/python2.7/multiprocessing/pool.py", line 572, in get
raise self._value
RuntimeWarning: divide by zero encountered in log
I ultimately uninstalled cellSNP and reinstalled version 0.0.8 and found that it works just fine again. I'm still not sure what caused the issue in the newer version but wanted to alert you to it.
How long would it approximately take to run cellSNP (Mode 1 with region_vcf) on HiSeq data with ~200M reads and 4000 cells?
Do you have some examples of cellSNP runtimes for some number of cells, number of reads, and number of subprocesses (-p option)?
I used cellSNP to identify SNP of scRNAseq. But I found a problem many SNP sites are not in gene. In my opinion, All Reads of scRNAseq should come from gene, So all SNP sites should locate in gene.
Hello, I used a VCF file made with the mouse MM10 genome and a 10x BAM file aligned to the mouse MM10 genome. This means my output should have Chromosomes 1-19, M, X, and Y like my input files. But my output from CellSNP contains chromosomes 1-22, X, and Y. Any tips on what I can do to troubleshoot this?
hi, what do AD, DP and OTH stand for (variant x cell sparse matrix output #6 )
thanks!
Hello,
I am running cellSNP in mode 2 for my bam file with the following command:
cellSNP -s <my coordinate sorted bam file> \
-O <output directory> \
-b <barcode file - one barcode on each line> \
-p 10 \
--cellTAG=None \
--UMItag=UB \
--minMAF 0.1 \
--minCOUNT 100 \
--chrom=chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY
The barcode file I am providing is a test file with only 5 barcodes. But, however the output cellSNP.cells.vcf
has SNPs piled up from all the chromosomes only for the 1st barcode in the file.
What do you think could be the reason for this ? Please see below a snapshot of my bam file:
Any help would be appreciated.
Thanks,
Somesh
Hi, I wonder if this is the expected result? A VCF with positions as rows and droplet barcode as columns?
Just to confirm if I am using it correctly.
Thanks!
From my cursory understanding of cellSNP you filter all of the reads that are marked as PCR duplicates by Cell Ranger. However, this would remove a large number of UMI duplicates as noted by the vartrix documentation:
ignore alignments marked as duplicates? Take care when turning this on with scRNA-seq data, as duplicates are marked in that pipeline for every extra read sharing the same UMI/CB pair, which will result in most variant data being lost.
I wouldn't be surprised if this negatively affects the performance of vireo on some datasets.
Hi all,
starting with Cell Ranger reference 2020-A, chromosome naming now follows Gencode convention (chr1-chr23, chrM) instead of ensembl style (1-23, MT). It's probably worth it to update the reference VCF files provided since it'll definitely cause confusion among users.
Cheers
-- Alex
I am using cellSNP is a prior step to vireo and it works very well. But it is incredibly slow and I think there's a lot of room for speeding it up. I don't program in Python and don't have much experience profiling Python code, but I think a lot of time is wasted in
https://github.com/huangyh09/cellSNP/blob/master/cellSNP/utils/pileup_utils.py#L44-L68
There's overhead to opening a bam file, and overhead to accessing references. This is done once per SNP, so the overhead is dramatic for say the 1kGenome SNPs. A quick fix can be to change
if chrom != None:
if chrom not in samFile.references:
if chrom.startswith("chr"):
chrom = chrom.split("chr")[1]
else:
chrom = "chr" + chrom
if chrom not in samFile.references:
print("Can't find references %s in samFile" %chrom)
return None, None
to
if chrom != None:
sam_refs = samFile.references
if chrom not in sam_refs:
if chrom.startswith("chr"):
chrom = chrom.split("chr")[1]
else:
chrom = "chr" + chrom
if chrom not in sam_refs:
print("Can't find references %s in samFile" %chrom)
return None, None
Which in a small test case I have with 5000 cells, 20000 reads and with 100,000 SNPs resulted in a 45% speedup. Another more dramatic speedup is to avoid re-opening the BAM files over and over, I played around with creating a global that stores the currently open filename, chromosome and file object. Then returning the global if the request is identical
def check_pysam_chrom(samFile, chrom=None):
"""Chech if samFile is a file name or pysam object, and if chrom format.
"""
global CHROM_CACHE
global CHROM_CACHE_ID
global CHROM_CHACH_SAMFILE
CHROM_CACHE_SAMFILE = samFile
if CHROM_CACHE is not None:
if (samFile == CHROM_CACHE_SAMFILE) and (chrom == CHROM_CACHE_ID):
return CHROM_CACHE, CHROM_CACHE_ID
if type(samFile) == str:
ftype = samFile.split(".")[-1]
if ftype != "bam" and ftype != "sam" and ftype != "cram" :
print("Error: file type need suffix of bam, sam or cram.")
sys.exit(1)
if ftype == "cram":
samFile = pysam.AlignmentFile(samFile, "rc")
elif ftype == "bam":
samFile = pysam.AlignmentFile(samFile, "rb")
else:
samFile = pysam.AlignmentFile(samFile, "r")
if chrom != None:
if chrom not in samFile.references:
if chrom.startswith("chr"):
chrom = chrom.split("chr")[1]
else:
chrom = "chr" + chrom
if chrom not in samFile.references:
print("Can't find references %s in samFile" %chrom)
return None, None
CHROM_CACHE = samFile
CHROM_CACHE_ID = chrom
return samFile, chrom
This results in a 3.5x speedup under a single thread, but I don't know how the global is handled under multiple threads so I cannot recommend this solution.
The last thing I tried, was to remove the top level thread pools and push the threading down to the lower level by using
samFile = pysam.AlignmentFile(samFile, "rb", threads = nproc)
however this resulted in negative performance impacts, for me it was slower on all thread counts higher than 1.
Hello,
Does cellSNP take in a reference genome?
Best,
Tae
Hi, I have been running into some difficulties with running cellSNP/Vireo, and thought here would be a good place to ask. This is a follow-up to the questions Stephanie Hicks asked via Twitter a few weeks ago.
As discussed on Twitter, we are considering a multiplexed design for some future 10x scRNA-seq samples, and in anticipation of this, we are trying to do some evaluations on our existing (non-multiplexed) samples, to see how well Vireo would have performed at demultiplexing these if they were multiplexed.
The way we are trying to do this is: (i) combine BAM files from 3 separate scRNA-seq samples (i.e. simply merge them using samtools merge
), (ii) run cellSNP to genotype cells, using either mode 1 (using known variants provided from the 1000 Genomes Project), or mode 3 (using our separate matched bulk RNA-seq samples), and then (iii) run Vireo.
However I have been getting stuck when trying to run cellSNP. I think it is either due to the way I have merged the BAM files, or due to the list of cell barcodes I am providing to cellSNP.
Do you have any advice about how you would go about doing this? E.g. maybe simply merging the BAM files is too simple (I have had a look in https://github.com/single-cell-genetics/vireo/tree/master/simulate, and see that you also used samtools merge
there, but there is also quite a lot else going on in that script). For the list of cell barcodes, I have tried providing both the entire Cell Ranger barcode whitelist, or a parsed list of barcodes from the Cell Ranger output, but both are giving me errors -- I can't quite figure out from the documentation what I need to provide here; sorry if I missed something. (cellSNP also gives me a message saying it is actually running in mode 2, so maybe I am also providing the inputs incorrectly).
Any advice you could provide would be greatly appreciated. Or if you have a full worked example showing how to run a cellSNP analysis, this would also be very helpful -- I had a look but could only find Vireo examples beginning from the cellSNP output; sorry again if I missed something.
Hello, I'm running into an error when calling cellSNP in mode 1 on mouse .bam data, specifically on chromosome 11, with the following command:
~/.local/bin/cellSNP -s file.bam -b barcodes.tsv -O cellSNP_output_11 -R mgp.v5.snps_subset_11.vcf.gz -p 36 --minMAF 0.1 --minCOUNT 20
This causes the following error after the pipeline has finished processing all positions:
100.00% positions processed.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/u/local/apps/anaconda3/2020.11/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/u/home/m/margolis/.local/lib/python3.8/site-packages/cellSNP/utils/pileup_utils.py", line 294, in fetch_positions
vcf_line = get_vcf_line(base_merge, base_cells, qual_cells,
File "/u/home/m/margolis/.local/lib/python3.8/site-packages/cellSNP/utils/pileup_utils.py", line 366, in get_vcf_line
ALT_cnt = base_merge[ALT]
KeyError: '.'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/u/home/m/margolis/.local/bin/cellSNP", line 8, in <module>
sys.exit(main())
File "/u/home/m/margolis/.local/lib/python3.8/site-packages/cellSNP/cellSNP.py", line 268, in main
result = [res.get() for res in result]
File "/u/home/m/margolis/.local/lib/python3.8/site-packages/cellSNP/cellSNP.py", line 268, in <listcomp>
result = [res.get() for res in result]
File "/u/local/apps/anaconda3/2020.11/lib/python3.8/multiprocessing/pool.py", line 771, in get
raise self._value
KeyError: '.'
Here is the segment of code in the pileup_util.py file throwing an error:
I'm assuming that this means an alternate allele wasn't found in one of the temp files, causing an issue with combining the temp files into the final output file. Any suggestions on how to prevent this in the future? I have found that changing minMAF and minCOUNT (making them more stringent) seems to get rid of the issue, but I would like to keep these values consistent across all my analyses. This issue seems to pop up randomly, though more often with smaller chromosomes. Mouse chromosomes 1-10, 12-XY ran without issue.
[cellSNP] mode 2: pileup 22 whole chromosomes in 23111 single cells.
Traceback (most recent call last):
File "/net/1000g/fanzhang/bin/miniconda2/bin/cellSNP", line 9, in <module>
load_entry_point('cellSNP==0.1.4', 'console_scripts', 'cellSNP')()
File "/net/1000g/fanzhang/bin/miniconda2/lib/python2.7/site-packages/cellSNP-0.1.4-py2.7.egg/cellSNP/cellSNP.py", line 196, in main
max_FLAG, min_LEN, doubletGL, True)
File "/net/1000g/fanzhang/bin/miniconda2/lib/python2.7/site-packages/cellSNP-0.1.4-py2.7.egg/cellSNP/utils/pileup_regions.py", line 111, in pileup_regions
cell_list, UMIs_list, barcodes)
File "/net/1000g/fanzhang/bin/miniconda2/lib/python2.7/site-packages/cellSNP-0.1.4-py2.7.egg/cellSNP/utils/pileup_utils.py", line 309, in map_barcodes
qual_cells[_idx][BASE_IDX[_base]] += qual_vector(_qual)
File "/net/1000g/fanzhang/bin/miniconda2/lib/python2.7/site-packages/cellSNP-0.1.4-py2.7.egg/cellSNP/utils/pileup_utils.py", line 101, in qual_vector
RV = [np.log(1-Q), np.log(3/4 - 2/3*Q), np.log(1/2 - 1/3*Q), np.log(Q)]
RuntimeWarning: divide by zero encountered in log
Hello, I encountered this error, what could be the reason?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.