wheaton5 / souporcell

Clustering scRNAseq by genotypes

License: MIT License

Python 50.30% Rust 44.90% Singularity 2.29% Dockerfile 0.23% Roff 2.29%

scrna-seq scrnaseq scrna-seq-analysis bioinformatics computational-biology genomics

souporcell

Preprint manuscript of this method available at https://www.biorxiv.org/content/10.1101/699637v1

souporcell is a method for clustering mixed-genotype scRNAseq experiments by individual.

The inputs are just the possorted_genome_bam.bam and barcodes.tsv as output by cellranger. souporcell comprises 6 steps, with the first 3 using external tools and the final 3 using the code provided here.

1. Remapping (minimap2)
2. Calling candidate variants (freebayes)
3. Cell allele counting (vartrix)
4. Clustering cells by genotype (souporcell.py)
5. Calling doublets (troublet)
6. Calling cluster genotypes and inferring amount of ambient RNA (consensus.py)

Easy Installation (Linux) (recommended)

Download singularity image (1.3gb) (singularity is similar to docker but safe for clusters)

singularity pull --arch amd64 library://wheaton5/souporcell/souporcell:release

If you are running on a scientific cluster, it will likely have singularity installed; contact your sysadmin for more details. If you are running on your own Linux box, you may need to install singularity.

requires singularity >= 3.0

which singularity
singularity --version

You should now be able to run souporcell_pipeline.py through the singularity container. Singularity automatically mounts the current working directory and the directories beneath it; any other directories would need to be mounted manually. Therefore, run it from a directory that is upstream of all of the inputs. Input files are the cellranger bam, the cellranger barcodes file, and a reference fasta. The cellranger bam is located in the cellranger outs directory and is called possorted_genome_bam.bam. The barcodes file is located at outs/filtered_gene_bc_matrices/<ref_name>/barcodes.tsv in the cellranger output. The reference fasta should be of the same species but does not need to be the exact cellranger reference.

The options for using souporcell are:

singularity exec souporcell_latest.sif souporcell_pipeline.py -h
usage: souporcell_pipeline.py [-h] -i BAM -b BARCODES -f FASTA -t THREADS -o
                              OUT_DIR -k CLUSTERS [-p PLOIDY]
                              [--min_alt MIN_ALT] [--min_ref MIN_REF]
                              [--max_loci MAX_LOCI] [--restarts RESTARTS]
                              [--common_variants COMMON_VARIANTS]
                              [--known_genotypes KNOWN_GENOTYPES]
                              [--known_genotypes_sample_names KNOWN_GENOTYPES_SAMPLE_NAMES [KNOWN_GENOTYPES_SAMPLE_NAMES ...]]
                              [--skip_remap SKIP_REMAP] [--ignore IGNORE]

single cell RNAseq mixed genotype clustering using sparse mixture model
clustering with tensorflow.

optional arguments:
  -h, --help            show this help message and exit
  -i BAM, --bam BAM     cellranger bam
  -b BARCODES, --barcodes BARCODES
                        barcodes.tsv from cellranger
  -f FASTA, --fasta FASTA
                        reference fasta file
  -t THREADS, --threads THREADS
                        max threads to use
  -o OUT_DIR, --out_dir OUT_DIR
                        name of directory to place souporcell files
  -k CLUSTERS, --clusters CLUSTERS
                        number cluster, tbd add easy way to run on a range of
                        k
  -p PLOIDY, --ploidy PLOIDY
                        ploidy, must be 1 or 2, default = 2
  --min_alt MIN_ALT     min alt to use locus, default = 10.
  --min_ref MIN_REF     min ref to use locus, default = 10.
  --max_loci MAX_LOCI   max loci per cell, affects speed, default = 2048.
  --restarts RESTARTS   number of restarts in clustering, when there are > 12
                        clusters we recommend increasing this to avoid local
                        minima
  --common_variants COMMON_VARIANTS
                        common variant loci or known variant loci vcf, must be
                        vs same reference fasta
  --known_genotypes KNOWN_GENOTYPES
                        known variants per clone in population vcf mode, must
                        be .vcf right now we dont accept gzip or bcf sorry
  --known_genotypes_sample_names KNOWN_GENOTYPES_SAMPLE_NAMES [KNOWN_GENOTYPES_SAMPLE_NAMES ...]
                        which samples in population vcf from known genotypes
                        option represent the donors in your sample
  --skip_remap SKIP_REMAP
                        don't remap with minimap2 (not recommended unless in
                        conjunction with --common_variants
  --ignore IGNORE       set to True to ignore data error assertions

A typical command looks like

singularity exec /path/to/souporcell_latest.sif souporcell_pipeline.py -i /path/to/possorted_genome_bam.bam -b /path/to/barcodes.tsv -f /path/to/reference.fasta -t num_threads_to_use -o output_dir_name -k num_clusters

The above command will run all six steps of the pipeline and will require up to 24gb of RAM for human data (the minimap2 bam index is the high water mark for memory). Smaller genomes, fewer clusters, and a lower --max_loci will require less memory. Note that souporcell will require roughly 2x the disk space that the input bam file takes up. Expect the run to take several hours on 8 threads, mostly due to read processing, remapping, and variant calling.
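
Since the pipeline needs roughly twice the input bam's size in disk space, a quick pre-flight check can save a failed multi-hour run. The following is a minimal sketch, not part of souporcell; the function name `enough_disk_for_run` and the default factor of 2 (taken from the estimate above) are assumptions for illustration.

```python
import os
import shutil


def enough_disk_for_run(bam_path, out_dir, factor=2.0):
    """Return True if the filesystem holding out_dir has at least
    factor * size(bam_path) bytes free.

    factor=2.0 mirrors the rough "2x the input bam" estimate; it is a
    rule of thumb, not a guarantee. out_dir must already exist.
    """
    needed = os.path.getsize(bam_path) * factor
    free = shutil.disk_usage(out_dir).free
    return free >= needed
```

For example, `enough_disk_for_run("possorted_genome_bam.bam", ".")` checks the current directory before launching the pipeline there.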

If you have a common SNPs file, you may want to use the --common_variants option with or without the --skip_remap option. The --skip_remap option skips conversion to fastq, remapping with minimap2, and reattaching barcodes, and --common_variants removes the freebayes step. Each of these saves a significant amount of time, but --skip_remap isn't recommended without --common_variants.

Common variant files from 1k genomes filtered to variants >= 2% allele frequency in the population and limited to SNPs can be found here for GRCh38

curl ftp://ftp.eng.auburn.edu/pub/whh0027/common_variants_grch38.vcf.gz -o common_variants_grch38.vcf.gz

or for hg19 here

curl ftp://ftp.eng.auburn.edu/pub/whh0027/filtered_2p_1kgenomes_hg19.vcf.gz -o common_variants_hg19.vcf.gz

Practice/Testing data set: Demuxlet paper data

wget https://sra-pub-src-1.s3.amazonaws.com/SRR5398235/A.merged.bam.1 -O A.merged.bam
wget ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2560nnn/GSM2560245/suppl/GSM2560245_barcodes.tsv.gz
gunzip GSM2560245_barcodes.tsv.gz

And if you don't have a human reference sitting around, grab one here

wget http://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-GRCh38-3.0.0.tar.gz
tar -xzvf refdata-cellranger-GRCh38-3.0.0.tar.gz

Now you should be ready to test it out

singularity exec /path/to/souporcell_latest.sif souporcell_pipeline.py -i A.merged.bam -b GSM2560245_barcodes.tsv -f refdata-cellranger-GRCh38-3.0.0/fasta/genome.fa -t 8 -o demux_data_test -k 4

This should require about 20gb of ram mostly because of the minimap2 indexing step. I might soon host an index and reference for human to make this less painful.

The important files are

clusters.tsv
cluster_genotypes.vcf
ambient_rna.txt

clusters.tsv will look like

barcode status  assignment      log_loss_singleton      log_loss_doublet        cluster0        cluster1
AAACCTGAGATCCGAG-1      singlet 0       -152.7778890920112      -190.5463095948822      -43.95302689281067      -101.63377524087669
AAACCTGAGCACCGTC-1      singlet 0       -78.56014177554212      -96.66255440088581      -20.949294849836267     -52.57478083591962
AAACCTGAGTACGATA-1      singlet 0       -216.0188863327174      -281.3888392065457      -63.059016939362536     -159.5450834682198
AAACCTGGTACATGTC-1      singlet 1       -47.189434469216565     -96.30865717225866      -62.652900832546955     -15.284168900754413
AAACCTGTCTACTCAT-1      singlet 0       -129.30104434183454     -167.87811467946756     -41.09158213888751      -106.3201962010145
AAACCTGTCTTGTCAT-1      singlet 0       -85.99781433701455      -110.81701038967158     -24.518165091815554     -60.05279033826837
AAACGGGCACTGTTAG-1      singlet 0       -154.26595878718032     -191.05465308213363     -31.356408693487197     -81.61186496254497
AAACGGGCATCATCCC-1      singlet 1       -46.33205678267174      -80.24152434540565      -50.78221280006256      -14.615983876840312
AAACGGGGTAGGGTAC-1      singlet 0       -240.5237900569412      -302.91575436035924     -71.79370547349878      -154.08594135029728
AAACGGGTCGGCATCG-1      singlet 0       -166.66827966974532     -226.56795157885028     -51.08790637893961      -148.04625123166286

The columns are the cell barcode, singlet/doublet status, cluster assignment, log_loss_singleton, log_loss_doublet, and then the log loss for each cluster.
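
To get a quick overview of a run, it can help to tally cells by status and assignment. The sketch below is not part of souporcell; it assumes clusters.tsv is tab-delimited with the header names shown above, and the function name `summarize_clusters` is made up for this example.

```python
import csv
from collections import Counter


def summarize_clusters(lines):
    """Count cells per (status, assignment) pair.

    `lines` is any iterable of text lines, e.g. an open clusters.tsv
    file handle. Assumes a tab-delimited header containing the
    'status' and 'assignment' columns.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    counts = Counter()
    for row in reader:
        counts[(row["status"], row["assignment"])] += 1
    return counts


# Two rows copied from the example table above.
SAMPLE = [
    "barcode\tstatus\tassignment\tlog_loss_singleton\tlog_loss_doublet\tcluster0\tcluster1",
    "AAACCTGAGATCCGAG-1\tsinglet\t0\t-152.7778890920112\t-190.5463095948822\t-43.95302689281067\t-101.63377524087669",
    "AAACCTGGTACATGTC-1\tsinglet\t1\t-47.189434469216565\t-96.30865717225866\t-62.652900832546955\t-15.284168900754413",
]

for (status, assignment), n in sorted(summarize_clusters(SAMPLE).items()):
    print(status, assignment, n)
```

On a real run, pass an open file instead: `with open("clusters.tsv") as fh: summarize_clusters(fh)`.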

cluster_genotypes.vcf is a vcf with genotypes for each cluster for each variant in the input vcf from freebayes

and

ambient_rna.txt just contains the ambient RNA percentage detected

Hard install

Instead of using singularity you can install everything independently (not recommended, but shouldn't be too bad)

git clone https://github.com/wheaton5/souporcell.git

Put the souporcell directory on your PATH. Requires samtools, bcftools, htslib, python3, freebayes, vartrix, and minimap2, all on your PATH. I suggest you use the conda env I have set up by running the following command if you have conda or miniconda

conda env create -f /path/to/souporcell/souporcell_env.yaml
conda activate souporcell

You will also need Rust and to compile the two rust binaries

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
cd /path/to/souporcell/souporcell && cargo build --release
cd /path/to/souporcell/troublet && cargo build --release

Otherwise, the python packages tensorflow, pyvcf, pystan, pyfaidx, numpy, and scipy are required, but as the versions change, I do recommend using the presetup env.

To run through the pipeline script

souporcell_pipeline.py -i /path/to/possorted_genome_bam.bam -b /path/to/barcodes.tsv -f /path/to/reference.fasta -t num_threads_to_use -o output_dir_name -k num_clusters

To run things step by step not through the pipeline script

1. Remapping

We discuss the need for remapping in our manuscript. We need to keep track of cell barcodes and UMIs, so we first create a fastq with those items encoded in the readname. Requires python 3.0 and the modules pysam and argparse (pip install/conda install depending on environment). It is easiest to first add the souporcell directory to your PATH variable with

export PATH=/path/to/souporcell:$PATH

Then run the renamer.py script to put some of the quality information in the read name. For human data this step will typically take several hours and the output fq file will be somewhat larger than the input bam

python renamer.py --bam possorted_genome_bam.bam --barcodes barcodes.tsv --out fq.fq

Then we must remap these reads using minimap2 (similar results have been seen with hisat2). Requires minimap2; add /path/to/minimap2 to your PATH. For human data the remapping will typically require more than 12 Gb memory and may take a few hours to run.

minimap2 -ax splice -t 8 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no <reference_fasta_file> <fastq_file> > minimap.sam

(Note the -t 8 as the number of threads; change this as needed.) Now we must retag the reads with their cell barcodes and UMIs

python retag.py --sam minimap.sam --out minitagged.bam

Then we must sort and index our bam. Requires samtools

samtools sort -o minitagged_sorted.bam minitagged.bam
samtools index minitagged_sorted.bam

2. Calling candidate variants

You may wish to break this into multiple jobs, such as 1 job per chromosome, and merge afterwards, but the basic command is the following. Requires freebayes; add /path/to/freebayes/bin to your PATH

freebayes -f <reference_fasta> -iXu -C 2 -q 20 -n 3 -E 1 -m 30 --min-coverage 6 --limit-coverage 100000 minitagged_sorted.bam
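
One way to split the call into one job per chromosome is to read the sequence names and lengths from the reference's .fai index and add a freebayes region to each command. This is a sketch under assumptions, not the pipeline's own code: it uses freebayes's `-r <chrom>:<start>-<end>` region option, the `freebayes_<chrom>.vcf` output names are hypothetical, and the function only builds the commands so you can hand them to your scheduler of choice.

```python
def freebayes_commands(fai_lines, reference_fasta, bam):
    """Build one freebayes command per sequence listed in a .fai index.

    Returns a list of (argv, output_vcf_name) pairs. The flags are the
    ones from the basic command above; only the region is added.
    """
    base = [
        "freebayes", "-f", reference_fasta, "-iXu", "-C", "2", "-q", "20",
        "-n", "3", "-E", "1", "-m", "30", "--min-coverage", "6",
        "--limit-coverage", "100000",
    ]
    jobs = []
    for line in fai_lines:
        if not line.strip():
            continue
        # .fai columns: name, length, offset, linebases, linewidth
        chrom, length = line.split("\t")[:2]
        region = f"{chrom}:0-{int(length)}"  # whole sequence
        out_vcf = f"freebayes_{chrom}.vcf"   # hypothetical output name
        jobs.append((base + ["-r", region, bam], out_vcf))
    return jobs
```

Each job can then be run with something like `subprocess.run(argv, stdout=open(out_vcf, "w"), check=True)`, and the per-chromosome VCFs merged afterwards, for example with `bcftools concat`.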

3. Cell allele counting

Requires vartrix; add /path/to/vartrix to your PATH

vartrix --umi --mapq 30 -b <bam file> -c <barcode tsv> --scoring-method coverage --threads 8 --ref-matrix ref.mtx --out-matrix alt.mtx -v <freebayes vcf> --fasta <fasta file used for remapping>

note the --threads argument and use an appropriate number of threads for your system.

4. Clustering cells by genotype

Rust required. To install rust:

curl https://sh.rustup.rs -sSf | sh

and to build souporcell clustering

cd /path/to/souporcell/souporcell
cargo build --release

And add /path/to/souporcell/souporcell/target/release to your PATH. The usage is

souporcell -h
souporcell 2.4
Haynes Heaton <[email protected]>
clustering scRNAseq cells by genotype

USAGE:
    souporcell [OPTIONS] --alt_matrix <alt_matrix> --barcodes <barcodes> --num_clusters <num_clusters> --ref_matrix <ref_matrix>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -a, --alt_matrix <alt_matrix>                                           alt matrix from vartrix
    -b, --barcodes <barcodes>                                               cell barcodes
        --initialization_strategy <initialization_strategy>
            cluster initialization strategy, defaults to kmeans++, valid values are kmeans++, random_uniform,
            middle_variance, random_cell_assignment
        --known_cell_assignments <known_cell_assignments>
            tsv with barcodes and their known assignments

    -g, --known_genotypes <known_genotypes>
            NOT YET IMPLEMENTED population vcf/bcf of known genotypes if available.
            
        --known_genotypes_sample_names <known_genotypes_sample_names>...
            NOT YET IMPLEMENTED sample names, must be samples from the known_genotypes vcf

        --min_alt <min_alt>
            minimum number of cells containing the alt allele for the variant to be used for clustering

        --min_alt_umis <min_alt_umis>                                       min alt umis to use locus for clustering
        --min_ref <min_ref>
            minimum number of cells containing the ref allele for the variant to be used for clustering

        --min_ref_umis <min_ref_umis>                                       min ref umis to use locus for clustering
    -k, --num_clusters <num_clusters>                                       number of clusters
    -r, --ref_matrix <ref_matrix>                                           ref matrix from vartrix
    -r, --restarts <restarts>                                               number of random seedings
        --seed <seed>                                                       optional random seed
    -t, --threads <threads>                                                 number of threads to use

So generally something along the lines of

souporcell -a alt.mtx -r ref.mtx -b barcodes.tsv -k <num_clusters> -t 8 > clusters_tmp.tsv

(note clusters_tmp.tsv output as the doublet caller outputs the final clusters file)

5. Calling doublets

Rust required. Build troublet:

cd /path/to/souporcell/troublet
cargo build --release

And add /path/to/souporcell/troublet/target/release to your PATH. The usage is

troublet -h
troublet 2.4
Haynes Heaton <[email protected]>
Intergenotypic doublet detection given cluster assignments and cell allele counts

USAGE:
    troublet [OPTIONS] --alts <alts> --clusters <clusters>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -a, --alts <alts>                              alt allele counts per cell in sparse matrix format out of vartrix
    -c, --clusters <clusters>                      cluster file output from schism
    -b, --debug <debug>...                         print debug info for index of cells listed
    -d, --doublet_prior <doublet_prior>            prior on doublets. Defaults to 0.5
        --doublet_threshold <doublet_threshold>    doublet posterior threshold, defaults to 0.90
    -r, --refs <refs>                              ref allele counts per cell in sparse matrix format out of vartrix
        --singlet_threshold <singlet_threshold>    singlet posterior threshold, defaults to 0.90

So generally

troublet -a alt.mtx -r ref.mtx --clusters clusters_tmp.tsv > clusters.tsv

6. Genotype and ambient RNA coinference

Python3 required with modules pystan, pyvcf, pickle, math, scipy, gzip (pip install should work for each)

consensus.py -h
usage: consensus.py [-h] -c CLUSTERS -a ALT_MATRIX -r REF_MATRIX [-p PLOIDY]
                    --soup_out SOUP_OUT --vcf_out VCF_OUT --output_dir
                    OUTPUT_DIR -v VCF

consensus genotype calling and ambient RNA estimation

optional arguments:
  -h, --help            show this help message and exit
  -c CLUSTERS, --clusters CLUSTERS
                        tsv cluster file from the troublet output
  -a ALT_MATRIX, --alt_matrix ALT_MATRIX
                        alt matrix file
  -r REF_MATRIX, --ref_matrix REF_MATRIX
                        ref matrix file
  -p PLOIDY, --ploidy PLOIDY
                        ploidy, must be 1 or 2, defaults to 2
  --soup_out SOUP_OUT   soup output
  --vcf_out VCF_OUT     vcf output
  --output_dir OUTPUT_DIR
                        output directory
  -v VCF, --vcf VCF     vcf file from which alt and ref matrix were created

So generally

consensus.py -c clusters.tsv -a alt.mtx -r ref.mtx --soup_out soup.txt -v <freebayes vcf> --vcf_out cluster_genotypes.vcf --output_dir .

souporcell's Issues

Error running clustering

Hello, I just got this error running clustering. I see you just recently committed the off-by-one error fix. If these are related, is the singularity image updated?

Traceback (most recent call last):
  File "/opt/souporcell/souporcell.py", line 89, in <module>
    cell_data[cell-1][index] = float(ref_c)/float(ref_c+alt_c)
IndexError: index 7374 is out of bounds for axis 0 with size 7374

Pipeline warnings (scRNA-seq)

Hello!

I used the souporcell pipeline to demultiplex the genotypes of a scRNA-seq dataset consisting of 3 individuals. The dataset was produced with 10x Genomics (v3).

singularity exec souporcell.sif souporcell_pipeline.py --bam /data/sample1_counts/HNTLNBGXC/outs/possorted_genome_bam.bam --out_dir /sample1_genotype_souporcell/ --fasta /data/refdata-cellranger-GRCh38-3.0.0/fasta/genome.fa --common_variants /data/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf --barcodes /data/barcodes.tsv --threads 8 --clusters 3 --ploidy 2 --skip_remap SKIP_REMAP

I demultiplexed the same dataset using cellSNP (for variant calling) and then vireo. The overlap of cells identified per individual is quite large between the two approaches.

However, I am getting the messages below at the end of the pipeline:

156515 excluded for potential RNA editing
562 doublets excluded from genotype and ambient RNA estimation
6424 not used for soup calculation due to possible RNA edit
/usr/local/envs/py36/lib/python3.6/site-packages/pystan/misc.py:399: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
elif np.issubdtype(np.asarray(v).dtype, float):
Initial log joint probability = -573687
Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes
Exception: log_mix: lambda1 is -nan, but must not be nan! (in 'unknown file name' at line 73)

Exception: log_mix: lambda1 is -nan, but must not be nan! (in 'unknown file name' at line 73)

Exception: log_mix: lambda1 is -nan, but must not be nan! (in 'unknown file name' at line 73)

Exception: log_mix: lambda1 is -nan, but must not be nan! (in 'unknown file name' at line 73)
   6       -144057    0.00346277      0.918425           1           1       27   
Optimization terminated normally:
Convergence detected: relative gradient magnitude is below tolerance
done

So my question is whether I should just ignore this, given that the pipeline finished and gave me results that I can also validate with an alternative approach, or whether something needs to be fixed in order to avoid these messages.

Thank you very much in advance : )

Best, Anna

How to debug large number of unassigned cells?

Hello Haynes,

In one of the samples I have a quite high number of unassigned cells. If I get it right from the code, unassigned means that posterior probability of it belonging to one of the genotypic clusters or of being a doublet of any pair is not high enough.
In this sample we expect 2 or 3 genotypes, so I first ran souporcell with k=2 and got ~2k out of ~7k cells as unassigned. Then I reran it with k=3, and (to my surprise) got ~2.5k cells as unassigned.
Two questions related to that:

How to debug this situation?
Could the problem be in the cluster definition? Would running the mixture model clustering more times help?

Thank you for the great tool,
Best,
Nick

Parameter K?

Can you explain the -k parameter a bit more? That's for clusters, I see. But what does that mean? Is -k the expected number of different donors we have in our sample?

I used the default -k 4 from the code in the README. In my sample, however, I only had 2 human donors. I'm thinking this is a mistake now because I have 0,1,2,3 assignment in my output, which doesn't jive with the 2 donors I had.

Is that correct? -k should be number of expected donors/subjects in the sample?

please use plain python code instead of subprocess unix command

hello,

there are a lot of subprocess.check_call invocations of standard unix commands that can be replaced with plain stdlib python calls or even pure python code

subprocess.check_call(['rm'      => os.remove, os.unlink, os.rmdir, os.removedirs etc
subprocess.check_call(["touch"   => with open(filename, 'w') as f: pass
subprocess.check_call(["awk '    => plain python code
subprocess.check_call(['mv'      => os.rename, os.renames
subprocess.check_call(['bgzip'   => python module
subprocess.check_call(["mkdir",  => os.mkdir

this will enhance the portability of your code.

regards

Eric
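
The replacements suggested above could look roughly like the following. This is only an illustrative sketch with made-up helper names, not code from the souporcell repository; the `first_column` helper stands in for a simple `awk '{print $1}'` call, which may not match what the pipeline's awk commands actually do.

```python
import os
import shutil
from pathlib import Path


def touch(path):
    """Create an empty file if it does not exist (like `touch`)."""
    Path(path).touch()


def remove_if_present(path):
    """Delete a file, ignoring the case where it is already gone (like `rm -f`)."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


def move(src, dst):
    """Move a file, also across filesystems (like `mv`)."""
    shutil.move(src, dst)


def make_dir(path):
    """Create a directory and any missing parents (like `mkdir -p`)."""
    os.makedirs(path, exist_ok=True)


def first_column(tsv_in, out_path):
    """Write the first whitespace-separated field of each non-empty line."""
    with open(tsv_in) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.split()
            if fields:
                dst.write(fields[0] + "\n")
```

One caveat: the stdlib `gzip` module writes ordinary gzip, not the blocked BGZF format that `bgzip` produces, so it is not a drop-in replacement where an indexable file is needed.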

No such file or directory: 'renamer.py': 'renamer.py'

Hi again,
I am trying it out in conda (using the yaml env) since singularity gave me an error and I am not familiar with it.

I ran the following command
/pathTO/souporcell/souporcell_pipeline.py -i $bam -b $.txt -f $fa -t 4 -o /pathTO/outDIR/ -k 5

and got this error:

Traceback (most recent call last):
  File "/home/karjosukarso/souporcell/souporcell_pipeline.py", line 528, in <module>
    (region_fastqs, all_fastqs) = make_fastqs(args)
  File "/home/karjosukarso/souporcell/souporcell_pipeline.py", line 209, in make_fastqs
    "--umi_tag", args.umi_tag, "--cell_tag", args.cell_tag])
  File "/home/karjosukarso/anaconda3/envs/souporcell/lib/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/home/karjosukarso/anaconda3/envs/souporcell/lib/python3.6/subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'renamer.py': 'renamer.py'

I checked the git-cloned souporcell dir and renamer.py is there

input BAM:

NS500173:399:HJ2MCBGX5:3:13509:20731:3554    16 chr1    86310   11      63M     *       0       0       TTGACTTTTGAACATACTTGGACTACATACCATTGCTTGAAAAAATAAAATATCTGCAAAATA <EEEEEE<EAEEEEEEEAEEAE/EAEEEEEEEEAEEAEEEAEEEEEEEEEEEEEEEEEAAAAA        CB:Z:TCCTAGCT_TAATCG    UB:Z:TGTAAGGG   MD:Z:61A1       XG:i:0  NM:i:1  XM:i:1  XN:i:0  XO:i:0  AS:i:-5 XS:i:-15        YT:Z:UU
NS500173:399:HJ2MCBGX5:1:12303:4069:7998     16 chr1    87535   11      63M     *       0       0       TTTTTTATTTTTTAAAAAATTGCTAATTTACAGAACATGGAGATGAGTATGTTTTGAAGGCCC EEEEEEEEEEEEAEAAA<EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA        CB:Z:TGGTTGTC_TACAGC    UB:Z:TGCCTACC   MD:Z:61T0T0     XG:i:0  NM:i:2  XM:i:2  XN:i:0  XO:i:0  AS:i:-10        XS:i:-20        YT:Z:UU
N

Do you know what might have gone wrong?
Thanks,
Dyah

Additional ambient RNA/soup information

Would it be possible to get information about the supposed ambient RNA/soup in our experiment, as reported by the file created with the --soup_out parameter when creating the consensus? For example, the SNPs called in this soup?

We are having some demultiplexing troubles in one of our experiments, and as this tool seems to do more conservative doublet calling, we are interested if and how this 'soup' might be related to the genotype/doublet calling in our experiments/tools

singularity - vcf merging problem

Hello Haynes,

I am having a hard time running the pipeline in singularity (singularity version 3.1.0):

/home/asecene/anaconda3/envs/freebayes/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/asecene/anaconda3/envs/freebayes/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/asecene/anaconda3/envs/freebayes/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/asecene/anaconda3/envs/freebayes/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/asecene/anaconda3/envs/freebayes/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/asecene/anaconda3/envs/freebayes/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Traceback (most recent call last):
File "/home/asecene/souporcell/souporcell_pipeline.py", line 461, in <module>
final_vcf = freebayes(args, bam, fasta)
File "/home/asecene/souporcell/souporcell_pipeline.py", line 362, in freebayes
subprocess.check_call(["bcftools", "concat"] + all_vcfs, stdout = vcfout)
File "/home/asecene/anaconda3/envs/freebayes/lib/python3.7/subprocess.py", line 347, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['bcftools', 'concat', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_0_0.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_1_0.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_2_0.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_3_0.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_4_0.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_5_0.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_6_0.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_7_0.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_8_0.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_9_0.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_2_1.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_5_1.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_6_1.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_3_1.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_1_1.vcf', 
'/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_4_1.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_5_2.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_9_1.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_9_2.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_9_3.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_7_1.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_9_4.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_2_2.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_9_5.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_8_1.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_3_2.vcf', '/fast/CF_Sequencing/work_dir/cfische/SP014/CR3_3/SOZE1-Villi/sozevilli/outs/soup_output/soup_output_singul/souporcell_3_3.vcf']' returned non-zero exit status 1.

Somehow bcftools is having a hard time merging the vcf files?

Thanks,
Kerim

Failing to pick up a genotype in single-genotype sample

Hello, me again, sorry. A mixed donor experiment happened, and I'm trying to run souporcell on it. The biologist's design felt quite elegant - 36 samples with the same two donors mixed, and then in order to recover which donor is which a separate one with just one donor in it. However, when I process the single donor sample with souporcell I get a cluster_genotypes.vcf an order of magnitude shorter than the other samples, and made up of straight ./. calls. This is with an artificial k of 2, after getting a similar result with k set to 1.

Any idea how to proceed here? Should I artificially merge the BAM with another sample's BAM? Or is there something that can be done on a single sample level to rescue this?

cluster assignments using known_genotypes

Hello,
Maybe a simple question. I have a vcf file (merged.vcf) derived from bulk RNA-Seq of 8 individuals. I performed a scRNA-seq experiment on the same individuals and ran souporcell with the following command:

singularity exec ../souporcell.sif souporcell_pipeline.py -i /path/to/possorted_genome_bam.bam -b sc_15/barcodes.tsv -f /path/to/genome.fa -t 16 -o sc_15_genos --known_genotypes /path/to/merged.vcf -k 8

The output file from souporcell clusters.tsv lists the assignment as clusters 0-7, not by the individual names in the merged.vcf. I'm not sure which clusters correspond to which individuals. Does it seem I have made a mistake? I see in the common_variants_covered.vcf file that the individual names are retained, so my first guess is perhaps the clusters are ordered the same as the individuals here.

support for 10x atac-seq

10x atac-seq data does not have UMI. Souporcell gives:

Traceback (most recent call last):
  File "/opt/souporcell/souporcell_pipeline.py", line 104, in <module>
    assert float(num_umi) / float(num_read_test) > 0.5, "Less than 50% of first 100000 reads have UMI tag (UB), turn on --
AssertionError: Less than 50% of first 100000 reads have UMI tag (UB), turn on --ignore True to ignore

When run with --ignore True this leads to

0 loaded 0 counts, is this a problem?
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/libcore/option.rs:378:21
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

A workaround, as you mentioned, is to add dummy UB tags with incremented integer value for each read.
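
A minimal sketch of that workaround, assuming the BAM has first been converted to SAM text (for example with samtools view -h). The function name add_dummy_ub is hypothetical and not part of souporcell, and whether every downstream step accepts integer-valued UB strings is an assumption based on the workaround described above.

```python
def add_dummy_ub(sam_lines):
    """Yield SAM lines, appending a unique UB:Z tag to alignments that lack one.

    Header lines (starting with '@') are passed through unchanged.
    The counter value is a placeholder UMI, not a real molecular identifier.
    """
    counter = 0
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            yield line
            continue
        fields = line.split("\t")
        # Optional tags start at the 12th column of a SAM alignment line.
        has_ub = any(field.startswith("UB:") for field in fields[11:])
        if not has_ub:
            counter += 1
            fields.append(f"UB:Z:{counter}")
        yield "\t".join(fields)
```

The output can be converted back to BAM (for example with samtools view -b) before running the pipeline. Because every read gets its own UMI, no UMI-based deduplication takes place, which may or may not be appropriate for ATAC-seq data.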

souporcell singularity recipe :: may fail regarding singularity version

Hello,

FYI due to https://github.com/sylabs/singularity/issues/3401

Building the container fails with singularity 3.5.2 due to

%post -c /bin/bash

It works with singularity 3.4.0 and 3.5.3; not tested with 3.2*.

regards

Eric

Using bulk RNA-seq to match identity with souporcell calls

Hi again. So souporcell worked nicely to cluster my snRNA-seq data into different donors. But now I need to go back and match the donor with the souporcell call. I've done regular bulk RNA-seq on each individual donor. Is there a good way now to use that bulk RNA-seq data to see which donor corresponds to each cluster?
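
One possible approach, sketched under assumptions: call variants from each donor's bulk RNA-seq into a VCF (not shown here), then count how often each souporcell cluster and each donor have the same genotype at shared sites. The sketch assumes cluster_genotypes.vcf has one sample column per cluster and that both VCFs use the same reference coordinates. The function names read_genotypes and concordance are hypothetical.

```python
def read_genotypes(vcf_lines):
    """Return {sample: {(chrom, pos, ref, alt): sorted_allele_tuple}} from VCF text lines."""
    samples, calls = [], {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            calls = {sample: {} for sample in samples}
            continue
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue
        gt_index = fmt.index("GT")
        key = (fields[0], fields[1], fields[3], fields[4])
        for sample, entry in zip(samples, fields[9:]):
            alleles = entry.split(":")[gt_index].replace("|", "/").split("/")
            if "." in alleles:
                continue  # skip missing calls such as ./.
            calls[sample][key] = tuple(sorted(alleles))
    return calls


def concordance(cluster_calls, donor_calls):
    """Return {(cluster, donor): (matching_sites, shared_sites)}."""
    table = {}
    for cluster, cluster_gt in cluster_calls.items():
        for donor, donor_gt in donor_calls.items():
            shared = cluster_gt.keys() & donor_gt.keys()
            same = sum(1 for site in shared if cluster_gt[site] == donor_gt[site])
            table[(cluster, donor)] = (same, len(shared))
    return table
```

The donor with the highest fraction of matching sites is the candidate match for a cluster. With few shared sites the fractions are noisy, so the count of shared sites is returned alongside the matches.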

Ratio of cells from different individuals

Hi,

May I ask how souporcell handles situations with extremely uneven ratios of cells from different individuals (something like 95% of cells from individual 1 and 5% from individual 2, or even more skewed)? More generally, is there a threshold for these ratios that can be used to assess the confidence of the deconvolution?

Thank you.

Sincerely,
Anna Arutyunyan

Singularity build for v2.0

Is the singularity image download on the main page for v1.0 or has it been updated to v2.0? I'd like to run the latest version, if possible.

Missing cluster_genotypes.vcf

Hi @wheaton5, I have noticed that on some of my souporcell pipeline runs the cluster_genotypes.vcf is missing and when I restart souporcell for that sample, souporcell says that it has finished. However, if I run consensus.py, then it does create the cluster_genotypes.vcf.

I currently have a good workaround in my pipeline, which is to run consensus.py for the samples that are missing cluster_genotypes.vcf, so there is no rush on this. I just wanted to flag it for you to look into whenever you release the next version of souporcell, and in case anyone else comes up against this problem.
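
A sketch of how that workaround could be automated, assuming one souporcell output directory per sample under a common parent directory (a hypothetical layout). The consensus.py flags mirror the command the pipeline itself prints in a traceback elsewhere on this page. The VCF to pass depends on how the run was configured, so it is left as a required argument. Both function names are hypothetical.

```python
from pathlib import Path


def runs_missing_genotypes(parent_dir):
    """Return run directories that have clusters.tsv but no cluster_genotypes.vcf."""
    missing = []
    for run_dir in sorted(Path(parent_dir).iterdir()):
        if not run_dir.is_dir():
            continue
        has_clusters = (run_dir / "clusters.tsv").exists()
        has_genotypes = (run_dir / "cluster_genotypes.vcf").exists()
        if has_clusters and not has_genotypes:
            missing.append(run_dir)
    return missing


def consensus_command(run_dir, vcf, ploidy=2):
    """Build the consensus.py argument list for one run directory."""
    run_dir = Path(run_dir)
    return [
        "consensus.py",
        "-c", str(run_dir / "clusters.tsv"),
        "-a", str(run_dir / "alt.mtx"),
        "-r", str(run_dir / "ref.mtx"),
        "-p", str(ploidy),
        "--soup_out", str(run_dir / "ambient_rna.txt"),
        "--vcf_out", str(run_dir / "cluster_genotypes.vcf"),
        "--vcf", str(vcf),
    ]
```

Each returned list can be passed to subprocess.check_call, inside the container if that is where consensus.py lives.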

Can't find barcode file

Hi there,

I just installed souporcell without any problems and the help command runs smoothly. However, when I try to run with my data, I get the following error:

checking bam for expected tags
Traceback (most recent call last):
  File "/opt/souporcell/souporcell_pipeline.py", line 63, in <module>
    with open(args.barcodes) as barcodes:
FileNotFoundError: [Errno 2] No such file or directory: '/path/to/barcodes.tsv'

However, when I run
head /path/to/barcodes.tsv
with the path copied and pasted directly from the error message, I get the first 10 barcodes, so the file exists.

Here is the command that I am running:

singularity exec /share/ScratchGeneral/drenea/souporcell.sif souporcell_pipeline.py -i $BAM -b $BARCODES -f $FASTA -t $T -o $OUT -k $N

The souporcell.sif is in a parent directory to the barcodes.tsv file and I am providing the entire path to the barcodes.tsv. Do you have any idea why this might be happening?

Thanks,
Drew
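
One thing worth checking, as a guess rather than a confirmed diagnosis: the README notes that singularity automatically mounts the current working directory and the directories below it, and that other directories have to be mounted manually. What matters is the directory the command is launched from, not where souporcell.sif is stored. The sketch below lists input directories that lie outside the launch directory and turns them into --bind arguments for singularity exec. The helper names are hypothetical, and site configuration may already bind some of these paths.

```python
import os
from pathlib import Path


def dirs_needing_bind(paths, cwd):
    """Return parent directories of the given paths that are not under cwd."""
    cwd = Path(os.path.abspath(cwd))
    needed = []
    for path in paths:
        parent = Path(os.path.abspath(path)).parent
        try:
            parent.relative_to(cwd)  # succeeds when parent is cwd or below it
        except ValueError:
            if parent not in needed:
                needed.append(parent)
    return needed


def bind_args(dirs):
    """Turn directories into repeated --bind arguments for singularity exec."""
    args = []
    for directory in dirs:
        args += ["--bind", str(directory)]
    return args
```

The resulting arguments go between "singularity exec" and the image path. Alternatively, launching the command from a directory that is upstream of all inputs avoids the need for them.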

singularity recipe :: proposal

Hello,

One of our clusters still runs CentOS, so we can only execute singularity containers that use glibc <= 2.25.
So this basically means

centos6 based
ubuntu-16.04

So I had to rewrite the singularity.def file in order to accommodate this problem.
I based this recipe on nvcr.io/nvidia/cuda:10.1-devel-ubuntu16.04
in order to take advantage of tensorflow's CUDA capabilities when running the container with singularity run --nv on CUDA nodes.
As far as I know, the conda.miniconda docker image does not embed CUDA support.

here is the singularity recipe I wrote.

regards

NB tested with singularity 3.4.0 and singularity 3.5.3
Does not work with singularity-3.5.2, see #45

souporcell-2.0..recipe.recipe.txt

ulimit / other issues

Thanks for making this great tool!

I'm curious about estimated run times for the pipeline. I ran the following command:

singularity exec ./souporcell.sif souporcell_pipeline.py -i SLNR_L3_possorted_genome_bam.bam -b SLNR_L3_barcodes.tsv -f GRCh38_full_analysis_set_plus_decoy_hla.fa -t 8 -o SLNR_PBMC_L3_souporcell -k 4 --ignore True

I started the workflow 2.5 hrs ago and it is currently at the "generating fastqs with cell barcodes and umis in readname" stage. I'm wondering if the pipeline has stalled out and, if not, what the estimated total run time will be.

Thanks,
Chris

Parameters to consider

Hi,

Thanks for this amazing tool.
I have managed to replicate the placenta dataset from your paper without any problem!
I am also working on single-cell placenta datasets and I have tried running souporcell on them.
Unfortunately, while I was expecting to see only a few maternal cells, souporcell predicted that my dataset (~3000 cells) contained half maternal, half placenta cells (which is not possible, having verified with female specific XIST expression (only a few maternal cells are present)). So, I was wondering what parameters could be involved in producing a result like this.

Thanks,

Kerim

Memory and Time Difference

Hello,

I have run souporcell with both the pipeline script as well as running each step individually. I have noticed that running each step separately takes significantly more time and memory to run. For example, when I run the pipeline script it takes about 3 hours and 30G of memory but when I remap using

python renamer.py --bam possorted_genome_bam.bam --barcodes barcodes.tsv --out fq.fq

minimap2 -ax splice -t 8 -G50k -k 21 -w 11 --sr -A2 -B8 -O12,32 -E2,1 -r200 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g2000 -2K50m --secondary=no <reference_fasta_file> <fastq_file> > minimap.sam

python retag.py --sam minimap.sam --out minitagged.bam

I have already used ~8 hours and ~30G of memory. Do you have recommendations on how to reduce the run time and memory use when running step by step?

Thank you for your help and recommendations!

Error with default pipeline

Hi, I ran into this problem:

Initial log joint probability = -137008
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
Exception: log_mix: lambda1 is -nan, but must not be nan!  (in 'unknown file name' at line 73)

Error evaluating model log probability: Non-finite gradient.

       6      -68531.7   0.000733154     0.0164666           1           1       13
Optimization terminated normally:
  Convergence detected: relative gradient magnitude is below tolerance
/usr/local/envs/py36/lib/python3.6/site-packages/pystan/misc.py:399: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  elif np.issubdtype(np.asarray(v).dtype, float):
111306 excluded for potential RNA editing
444 doublets excluded from genotype and ambient RNA estimation
2717 not used for soup calculation due to possible RNA edit
Traceback (most recent call last):
  File "/opt/souporcell/consensus.py", line 433, in <module>
    with open("tempsouporcell.vcf") as tmp:
FileNotFoundError: [Errno 2] No such file or directory: 'tempsouporcell.vcf'

and then later

Traceback (most recent call last):
  File "/opt/souporcell/souporcell_pipeline.py", line 504, in <module>
    consensus(args, ref_mtx, alt_mtx, doublet_file)
  File "/opt/souporcell/souporcell_pipeline.py", line 454, in consensus
    "--soup_out", args.out_dir + "/ambient_rna.txt", "--vcf_out", args.out_dir + "/cluster_genotypes.vcf", "--vcf", final_vcf])
  File "/usr/local/envs/py36/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['consensus.py', '-c', '/data/genotype/SC76/clusters.tsv', '-a', '/data/genotype/SC76/alt.mtx', '-r', '/data/genotype/SC76/ref.mtx', '-p', '2', '--soup_out', '/data/genotype/SC76/ambient_rna.txt', '--vcf_out', '/data/genotype/SC76/cluster_genotypes.vcf', '--vcf', '/data/genotype/SC76/common_variants_covered.vcf']' returned non-zero exit status 1.

How to debug what went wrong? Other 10x samples I ran in parallel did ok
I was using the version of singularity image from the current README

Easiest way to identify which sample is which individual

This is more an experimental question rather than a software one. Souporcell has done a nice job of clustering the different individuals in my pools, but now I want to go back and match a sample with a cluster. If I were to redo the experiment, I would have used the strategy recommended in the latest edition of the preprint, but that's not an option now.

What is the easiest and cheapest way to match the clusters with a sample? PCR-based genotyping? What's a good way to identify variants from the clusters that would allow me to identify the sample unambiguously?

Troubleshooting results

Hello again!

This is not a usage/implementation question as before but, rather, a request for troubleshooting advice.

So, we recently performed an experiment where we used MULTI-seq to multiplex primary tissue from 13 different donors. The sample barcode data looks decent, but as we sometimes see with difficult-to-dissociate tissue, some cells were unable to be successfully demultiplexed. Moreover, when applying a different sample classification workflow to the data, we get slightly different results. I wanted to get a 'ground-truth' to decide which classification set to proceed with, so I applied souporcell to the data:

This is barcode space, with cells colored according to their demultiplexing results. Notably, while there are some cells that are doublets or cannot be classified (black), clusters are generally 'pure' and I can find all 13 donors with the deMULTIplex and demuxEM results. However, souporcell is erroneously calling three of the clusters as coming from the same donor (we can infer from gene expression analysis that these are indeed different donors). The colors are different because of the mapping issue @achamess was talking about.

Do you have any insight into what could cause this result? What information would be useful for you to help me troubleshoot?

Here's the command I used to run souporcell:

singularity exec souporcell.sif souporcell_pipeline.py -i ./possorted_genome_bam.bam -b ./cellIDs.tsv -f hg19_3.0.0_genome.fa -t 16 -o LIVE_OLD_souporcell -k 13

Chris

Known Genotype Rerun

Hi @wheaton5, I am interested in using the --known_genotypes and --known_genotypes_sample_names options for some souporcell runs that I have already done.

I am wondering when the known genotypes are used so I can just delete the files from that step forwards and I don't have to rerun the entire pipeline. I would think that it would be at the clustering step? So I would just have to remove the following files and rerun with those additional parameters?

clusters.err
clusters_tmp.tsv
clustering.done
doublets.err
clusters.tsv
troublet.done
ambient_rna.txt
consensus.done
cluster_genotypes.vcf

Let me know if this is how you would go about rerunning souporcell with known genotypes without having to start from the beginning. Thanks for your help!

Cheers,
Drew
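
A small sketch of the cleanup step described above, assuming the listed files are indeed the ones the pipeline checks before skipping the clustering, doublet, and consensus stages. That is the asker's guess and is not confirmed here. The function name is hypothetical, and the default is a dry run that deletes nothing.

```python
from pathlib import Path

# File names taken from the list in the question above.
CLUSTERING_ONWARD = [
    "clusters.err",
    "clusters_tmp.tsv",
    "clustering.done",
    "doublets.err",
    "clusters.tsv",
    "troublet.done",
    "ambient_rna.txt",
    "consensus.done",
    "cluster_genotypes.vcf",
]


def clear_from_clustering(out_dir, dry_run=True):
    """Return the listed files found in out_dir, deleting them unless dry_run is set."""
    found = []
    for name in CLUSTERING_ONWARD:
        path = Path(out_dir) / name
        if path.exists():
            found.append(path)
            if not dry_run:
                path.unlink()
    return found
```

Running it once with dry_run=True shows what would be removed. Copying the output directory beforehand keeps the original results available for comparison.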

bedtools permission error in singularity

Hello,

Thank you for the fantastic tool and cohesive setup instructions. I set up singularity and figured I'd take the thing for a spin on the test data:

singularity exec souporcell.sif souporcell_pipeline.py -i A.merged.bam -b GSM2560245_barcodes.tsv -f ~/cellranger/GRCh38/fasta/genome.fa -t 8 -o demux_data_test -k 4 --skip_remap True --common_variants filtered_2p_1kgenomes_GRCh38.vcf

Unfortunately, it seems to snag for whatever reason:

imports done
checking bam for expected tags
checking fasta
using common variants
Traceback (most recent call last):
  File "/opt/souporcell/souporcell_pipeline.py", line 436, in <module>
	final_vcf = freebayes(args, bam, fasta)
  File "/opt/souporcell/souporcell_pipeline.py", line 249, in freebayes
	subprocess.check_call(["bedtools", "merge", "-i", args.out_dir + "/depth.bed"], stdout = bed)
  File "/opt/conda/envs/py36/lib/python3.6/subprocess.py", line 306, in check_call  
	retcode = call(*popenargs, **kwargs)
  File "/opt/conda/envs/py36/lib/python3.6/subprocess.py", line 287, in call
	with Popen(*popenargs, **kwargs) as p:
  File "/opt/conda/envs/py36/lib/python3.6/subprocess.py", line 729, in __init__
	restore_signals, start_new_session)
  File "/opt/conda/envs/py36/lib/python3.6/subprocess.py", line 1364, in _execute_child
	raise child_exception_type(errno_num, err_msg, err_filename)
PermissionError: [Errno 13] Permission denied: 'bedtools'

Some assistance please? Any idea what's going on here?

Error in Pipeline: troublet call

Hi! Excuse my potential for stupidity, but I'm currently running into the same problem across a couple of datasets at the doublet detection stage below:

singularity exec souporcell.sif souporcell_pipeline.py -i ./CONTROL_SS01.possorted_genome_bam.bam -b ./CONTROL_SS01.barcodes.tsv -f ./cellranger_GRCh38.genome.fa -t 40 -k 3 -o ./CONTROL_SS01/ --ploidy 2 --min_alt 10 --min_ref 10 --common_variants ./filtered_2p_1kgenomes_GRCh38.vcf
checking modules
imports done
checking bam for expected tags
checking fasta
restarting pipeline in existing directory ./CONTROL_SS01/
no bam index found, creating
generating fastqs with cell barcodes and umis in readname
remapping with minimap2

cleaning up tmp fastqs
repopulating cell barcode and UMI tags
sorting retagged bam files
merging sorted bams
cleaning up tmp samfiles
using common variants
40
[main_samview] region "1:77493769-77493768" could not be parsed. Continue anyway.
[main_samview] region "1:154987538-77493768" could not be parsed. Continue anyway.
[main_samview] region "2:107786145-77493768" could not be parsed. Continue anyway.
[main_samview] region "3:91088843-77493768" could not be parsed. Continue anyway.
[main_samview] region "5:90047574-77493768" could not be parsed. Continue anyway.
running vartrix
running souporcell clustering
/opt/souporcell/souporcell/target/release/souporcell -k 3 -a ./CONTROL_SS01//alt.mtx -r ./CONTROL_SS01//ref.mtx --restarts 100 -b ./CONTROL_SS01.barcodes.tsv --min_ref 10 --min_alt 10 --threads 40
running souporcell doublet detection
Traceback (most recent call last):
  File "/opt/souporcell/souporcell_pipeline.py", line 535, in <module>
    doublets(args, ref_mtx, alt_mtx, cluster_file)
  File "/opt/souporcell/souporcell_pipeline.py", line 481, in doublets
    subprocess.check_call(["troublet", "--alts", alt_mtx, "--refs", ref_mtx, "--clusters", cluster_file], stdout = dub, stderr = err)
  File "/usr/local/envs/py36/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['troublet', '--alts', './CONTROL_SS01//alt.mtx', '--refs', './CONTROL_SS01//ref.mtx', '--clusters', './CONTROL_SS01//clusters_tmp.tsv']' returned non-zero exit status 101.

I was wondering if you could help me troubleshoot - I've run it on a couple of different machines/datasets but keep bumping into this...

Thanks again!

running souporcell in terminal

hello wheaton5 -

I am trying to use your program to cluster a single cell RNA seq sample that should have cells from two genotypes. I am running it on my institute's supercomputing machine (remotely) using terminal on my mac. I'm getting this error - do you know what could be going on? do I need to run this within python in my terminal?

I used this form after installing singularity -
singularity exec /path/to/souporcell.sif souporcell_pipeline.py -i /path/to/possorted_genome_bam.bam -b /path/to/barcodes.tsv -f /path/to/reference.fasta -t num_threads_to_use -o output_dir_name -k num_clusters

I used num_threads = 8 (how do you choose this?)
num_clusters = 2 (should be two "genotypes")

checking modules
imports done
checking bam for expected tags
Traceback (most recent call last):
  File "/opt/souporcell/souporcell_pipeline.py", line 59, in <module>
    for (index, line) in enumerate(barcodes):
  File "/usr/local/envs/py36/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128)

thanks I would appreciate any help! I'm an immunologist trying to run my own code.
could also email me - [email protected]
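
A possible explanation, offered as a guess: byte 0x8b at position 1 is the second byte of the gzip magic number (0x1f 0x8b), so the barcodes file may still be gzip-compressed (for example a barcodes.tsv.gz passed directly, or renamed without decompressing). The sketch below checks for the magic bytes and reads the barcodes either way. The function names are hypothetical and not part of souporcell.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def read_barcodes(path):
    """Read one barcode per line from a plain or gzip-compressed file."""
    opener = gzip.open if is_gzipped(path) else open
    with opener(path, "rt") as handle:
        return [line.strip() for line in handle if line.strip()]
```

If is_gzipped returns True, decompressing the file (for example with gunzip) and passing the plain-text barcodes.tsv to -b is the simplest thing to try.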

maximum number of K?

Hi,
Thanks for developing the tool! I just have a quick question about what is the max number of K ever tested? What would be the technical limitations for increasing K? I am planning on an experiment and would like to know how many samples I can mix.

error when running consensus.py

I'm trying to run the 'consensus' step in your pipeline (I'm running each step manually), and am encountering an issue.

I'm running the command:

python /exppath/souporcell_v2/software/souporcell/consensus.py -a /exppath/souporcell_v2/count_alleles/matrices/180920_lane1_wes_unremapped_alt.mtx -r /exppath/souporcell_v2/count_alleles/matrices/180920_lane1_wes_unremapped_ref.mtx -c /exppath/souporcell_v2/soupor_cluster/clusters/180920_lane1_wes_unremapped_cluster.tsv -v /exppath/demux_relevant_samples/WES_per_lane_mmaf002/180920_lane1_wes.vcf_mmaf002.vcf --soup_out /exppath/souporcell_v2/soupor_consensus/consensci/180920_lane1_wes_unremapped.soup --vcf_out /exppath/souporcell_v2/soupor_consensus/consensci/180920_lane1_wes_unremapped.vcf

and getting the issue:

28789 excluded for potential RNA editing
WARNING:pystan:No module named 'stanfit4anon_model_c58d6755a445ee1723e096eb7e36ea75_355834653533342947'
WARNING:pystan:Something went wrong while unpickling the StanModel. Consider recompiling.
0 doublets excluded from genotype and ambient RNA estimation
Traceback (most recent call last):
  File "/exppath/souporcell_v2/software/souporcell/consensus.py", line 196, in <module>
    cluster = int(tokens[2])
ValueError: invalid literal for int() with base 10: '-464.86'

I did the clustering with the newer Rust based 'souporcell' step instead of the Python tensorflow one.

Is there a specific version of a library or Python I should use, or is there something wrong in my approach?

Bug when only 1 cluster in data?

Hi,
first: thanks for the singularity-image, a great way to provide scientific tools!

Now to the issue:
We tried the 99-1 ratio mix of two donors (B and C) from 10X / Zhang et al 2017 ("Massively parallel digital transcriptional profiling of single cells") with Souporcell.
Unfortunately, we got the following error:

imports done
checking bam for expected tags
checking fasta
restarting pipeline in existing directory 10X2017data/rundir_99-1
running souporcell clustering
Traceback (most recent call last):
  File "/opt/souporcell/souporcell_pipeline.py", line 446, in <module>
    souporcell(args, ref_mtx, alt_mtx)
  File "/opt/souporcell/souporcell_pipeline.py", line 387, in souporcell
    "-t", str(args.threads), "-l", args.max_loci, "--min_alt", args.min_alt, "--min_ref", args.min_ref,'--out',cluster_file],stdout=log,stderr=log)                              
  File "/opt/conda/envs/py36/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['souporcell.py', '-a', '10X2017data/rundir_99-1/alt.mtx', '-r', '10X2017data/rundir_99-1/ref.mtx', '-b', '10X2017data/99-1/outs/filtered_feature_bc_matrix/barcodes.tsv', '-k', '2', '--restarts', '15', '-t', '8', '-l', '2048', '--min_alt', '4', '--min_ref', '4', '--out', '10X2017data/rundir_99-1/clusters_tmp.tsv']' returned non-zero exit status 1.

The 50-50 and 90-10 mixes from the same paper run smoothly.
As a side note, when we tried "-k 1" on our data (to get the loss value with only one assumed donor), we also got an error (I did not write it down).
Is there maybe some issue if the data or the parameters show souporcell only 1 cluster?

Best
Julian

ModuleNotFoundError: No module named 'pysam'

Hi,
I'm trying to run souporcell from the singularity container with 10x data.

Using AWS Linux

My OS:

NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"

Command

singularity exec souporcell.sif souporcell_pipeline.py \
-i /home/ec2-user/data/hSC_count/hSC_count/outs/possorted_genome_bam.bam \
-b /home/ec2-user/data/hSC_count/hSC_count/outs/filtered_feature_bc_matrix/barcodes.tsv \
-f /home/ec2-user/data/GRCh38-3.0.0_premrna/fasta/genome.fa \
-t 16 -o souporcell \
-k 4

Here is the output

checking modules
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Traceback (most recent call last):
  File "/opt/souporcell/souporcell_pipeline.py", line 31, in <module>
    import pysam
ModuleNotFoundError: No module named 'pysam'

Looks like pysam can't be found?

Consistent assignment between samples that have the same individuals

I have multiple libraries that are technical replicates of each other, each containing a pool of the same 4 individuals. When I do downstream analysis, I combine them, but I'm assuming the assignments to (0,1,2,3, where k=4) are random, and won't be congruent between samples. Is there a good way to make them the same, so that library 1's cluster 0 = library 2's cluster 0?
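
One way this could be approached, sketched under assumptions: compare the cluster genotypes of the two libraries (for instance from each run's cluster_genotypes.vcf) and compute, for every pair of clusters, the fraction of shared variant sites with identical genotype calls. Given that k x k agreement matrix, the relabeling is the one-to-one mapping with the highest total agreement. For small k, trying every permutation is enough. The function name and the matrix values in the usage note are hypothetical.

```python
from itertools import permutations


def best_relabeling(agreement):
    """Return mapping where mapping[i] is the library-2 cluster matched to library-1 cluster i.

    agreement[i][j] is the genotype agreement between cluster i of library 1
    and cluster j of library 2. All k! one-to-one mappings are scored, which
    is only practical for small k.
    """
    k = len(agreement)
    return max(
        permutations(range(k)),
        key=lambda perm: sum(agreement[i][perm[i]] for i in range(k)),
    )
```

For example, with an agreement matrix whose large entries sit at (0,1), (1,0), (2,3) and (3,2), the function returns (1, 0, 3, 2), meaning library 2's cluster 1 should be renamed to 0, and so on. If no mapping stands out clearly, the match should not be trusted.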

Consensus VCF writing fails if done in parallel for multiple files

I am trying to create multiple VCF files by submitting the Souporcell consensus creations as jobs on our cluster. Some of these jobs however run in parallel, and at that point the consensus VCF creation step fails.

I believe this is due to them writing/reading/removing the same temporary file: "tempsouporcell.vcf" (perhaps prepend/append timestamp+generated random string?)

This is not a huge issue as the consensus creation is pretty quick so it can be done in sequence for multiple files, but limits our automation by just a little bit
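
Until the temporary file name is made unique, one workaround is to run each job from its own scratch directory. This assumes consensus.py opens "tempsouporcell.vcf" relative to the current working directory, as a consensus.py traceback elsewhere on this page suggests. The context manager name is hypothetical.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def private_workdir(prefix="consensus_"):
    """Run the enclosed block from a fresh temporary directory, then clean up."""
    original = os.getcwd()
    with tempfile.TemporaryDirectory(prefix=prefix) as scratch:
        os.chdir(scratch)
        try:
            yield scratch
        finally:
            # Leave the scratch directory before it is deleted.
            os.chdir(original)
```

Inside the block, consensus.py can be launched with subprocess. All input and output paths need to be absolute, because relative paths would now resolve against the scratch directory.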

singularity issue

I have a quick query about getting this to work with singularity, as it is throwing an error.

Singularity version is okay.
singularity --version

singularity version 3.1.1

Does not throw an error with lolcow.sif test...

Then I attempted to run the test. (NB: wget https://sra-download.ncbi.nlm.nih.gov/traces/sra47/SRZ/005398/SRR5398235/A.merged.bam is broken,
but wget https://sra-pub-src-1.s3.amazonaws.com/SRR5398235/A.merged.bam.1 does work.)
I get this error:
singularity exec souporcell.sif souporcell_pipeline.py -i A.merged.bam.1 -b GSM2560245_barcodes.tsv -f GRCh38_full_analysis_set_plus_decoy_hla.fa -t 8 -o soup_or_cell_test -k 4

FATAL: container creation failed: mount /proc/self/fd/3->/software/singularity-3.1.1/var/singularity/mnt/session/rootfs error: can't mount image /proc/self/fd/3: failed to mount squashfs filesystem: invalid argument

Exactly the same error with my own data.

Any ideas?

conceptual problems with scATAC-seq application?

Hello, sorry, me again. This time with a general inquiry - there's some scATAC-seq coming in a project I'm working on, and it'll also need donor deconvolution. I assessed the incoming data from a technical angle, and the BAM/barcode list should be acquirable. However, a collaborator raised a concern with regards to souporcell's use of soup in its modelling. There is no soup in scATAC-seq. However, we've been so happy with the output of souporcell on our transcriptomics data that it'd be great to use the algorithm here if possible.

Do you have any thoughts on the matter? Sorry if this is a dumb question, I'm a complete scATAC-seq greenhorn.

AssertionError: retag subprocess ended abnormally with code 1

Hi,

Here is my command (I am running 10 different samples in parallel, changing the number as in 1-RNA*):

singularity exec /projects/ucar-lab/danaco/Software/Soupour/souporcell.sif souporcell_pipeline.py
-i /projects/ucar-lab/danaco/lawlon/PBMC_scRNAseq/BAM_Files/1-RNA_pooled_and_merged_Cellranger_Out_50k/outs/possorted_genome_bam.bam
-b /projects/ucar-lab/danaco/PBMC_scRNAseqv/GroundBarcodes/1-HTO.all.cell.barcodes.for.demuxlet.txt
-f /projects/ucar-lab/danaco/Software/Soupour/GRCh38_full_analysis_set_plus_decoy_hla.fa
-t 8 -o /projects/ucar-lab/danaco/Ground/Soupor/1-RNA -k 4

After running through the pipeline stages for 56 hours, it threw an error at the UMI retagging stage:

checking modules
imports done
checking bam for expected tags
no bam index found, creating
checking fasta
creating chunks
generating fastqs with cell barcodes and umis in readname
remapping with minimap2
cleaning up tmp fastqs
repopulating cell barcode and UMI tags

The error I get is:

AssertionError: retag subprocess ended abnormally with code 1

using souporcell for plate-based scRNA-seq (Celseq2)

Hi,
I am thinking of trying to use the tool to demultiplex sequences from a mixed population of 5 donors. However, our lab uses plate-based scRNA-seq (Celseq2), and therefore we also cannot estimate how many cells of, e.g., donor 1 are sorted into plate 1, etc.
Do you have any recommendation on how to modify souporcell for plate-based data?
Thanks!

Dyah

conda setup failed on ipython

FYI

conda env create -f souporcell_env.yaml

Collecting package metadata (repodata.json): -                                                                  done
Solving environment: failed

ResolvePackageNotFound:
  - ipython==7.9.0=py36h5ca1d4c_0

I removed '=py36h5ca1d4c_0' in the yaml file, then it worked.

Very nice work btw.
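
If more pinned build strings turn out to be unavailable on a given platform, the same fix can be applied to every dependency at once. This is a sketch, assuming dependency lines of the form "- name=version=build" or "- name==version=build". The function name is hypothetical, and dropping build strings lets conda pick any build of the pinned version.

```python
import re

# Matches "- name=version=build" or "- name==version=build"; group 1 keeps name and version.
_PINNED_BUILD = re.compile(r"^(\s*-\s*[^=\s]+={1,2}[^=\s]+)=[^=\s]+\s*$")


def strip_build_strings(yaml_text):
    """Return the environment YAML text with build strings removed from dependencies."""
    cleaned = []
    for line in yaml_text.splitlines():
        match = _PINNED_BUILD.match(line)
        cleaned.append(match.group(1) if match else line)
    return "\n".join(cleaned)
```

Writing the result to a new file and passing that file to conda env create -f keeps the original souporcell_env.yaml untouched.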

Stopped during remapping

Hi again,
I used the new DL of souporcell from the README page.
This is what happened. Not sure what's going on.
Can I restart from what's already been done?
I used the script. Here is my input command:

-i /home/ec2-user/data/hSC_count/hSC_count/outs/possorted_genome_bam.bam \
-b /home/ec2-user/data/hSC_count/hSC_count/outs/filtered_feature_bc_matrix/barcodes.tsv \
-f /home/ec2-user/data/GRCh38-3.0.0_premrna/fasta/genome.fa \
-t 16 -o souporcell_k2 \
-k 2

and the output

/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/opt/conda/envs/py36/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
imports done
checking bam for expected tags
checking fasta
creating chunks
generating fastqs with cell barcodes and umis in readname
remapping with minimap2
Traceback (most recent call last):
  File "/opt/souporcell/souporcell_pipeline.py", line 424, in <module>
    minimap_tmp_files = remap(args, region_fastqs, all_fastqs)
  File "/opt/souporcell/souporcell_pipeline.py", line 162, in remap
    stdout = samfile, stderr = minierr)
  File "/opt/conda/envs/py36/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['minimap2', '-ax', 'splice', '-t', '16', '-G50k', '-k', '21', '-w', '11', '--sr', '-A2', '-B8', '-O12,32', '-E2,1', '-r200', '-p.5', '-N20', '-f1000,5000', '-n2', '-m20', '-s40', '-g2000', '-2K50m', '--secondary=no', '/home/ec2-user/data/GRCh38-3.0.0_premrna/fasta/genome.fa', 'souporcell_k2/tmp.fq']' died with <Signals.SIGKILL: 9>.

TCR/BCR data

Hello,

Can souporcell deconvolute TCR/BCR data (out of cellranger-vdj)?

Thank you.

Sincerely,
Anna Arutyunyan

How to identify which sample is which?

Not really a software issue, but related. So souporcell can identify different samples accurately, I believe. But often I will want to know which of my samples corresponds to a cluster, especially when doing treatments or if there is a disease phenotype. Is there an easy way to match samples with clusters? Part of the appeal of souporcell is not having to have genotype information to cluster, but to identify, you will ultimately need some genotype info, yes? What's a good way to do this? Maybe PCR-based assays rather than having to do an entire genotype array? Thoughts?

Change Barcode Identifier

Hi @wheaton5,

I am wondering if there is a way to specify a barcode tag and UMI tag other than CB and UB. I recently ran into this issue and altered the bam file to replace all of the barcode and UMI tags with CB and UB, but it would be good to have this as an option instead of editing the bam file directly. Let me know if this option already exists and I have missed it.
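The workaround mentioned above (rewriting the tags so that they read CB and UB) can be sketched on SAM text, for example a stream produced by `samtools view -h`. This is an illustration only: `rename_tags` is a hypothetical helper, and the source tags `XC` and `XM` in the usage note are placeholders for whatever tags the bam actually carries.

```python
def rename_tags(sam_line, mapping):
    """Rename optional-field tags in one SAM record; header lines pass through."""
    if sam_line.startswith("@"):
        return sam_line
    newline = "\n" if sam_line.endswith("\n") else ""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    for i in range(11, len(fields)):
        tag = fields[i][:2]
        if tag in mapping:
            fields[i] = mapping[tag] + fields[i][2:]
    return "\t".join(fields) + newline
```

A typical use would be to apply `rename_tags(line, {"XC": "CB", "XM": "UB"})` to each line read from standard input and write the result back out for conversion to bam. Records that already carry a CB or UB tag would end up with duplicate tags, so that case needs checking first.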

typo in check_modules.py

print("check successful"_)
                                        ^

Eric

souporcell_pipeline.py :: run souporcell from path

Hello,

souporcell_pipeline.py runs the Rust-compiled souporcell binary with an absolute path.
Please call souporcell from the user's PATH instead, as is done for troublet.

What is done for the souporcell binary:

souporcell_pipeline.py:            cmd = [directory+"/souporcell/target/release/souporcell", "-k",args.clusters, "-a", alt_mtx, "-r", ref_mtx

versus what is done for troublet binary

souporcell_pipeline.py:        subprocess.check_call(["troublet", "--alts", alt_mtx, "--refs", ref_mtx, "--clusters", cluster_file], stdout = dub)

best regards
Eric
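The request above amounts to resolving the binary through PATH first and only then falling back to a build directory. A minimal sketch using the standard library follows; `resolve_binary` and its `fallback_dir` parameter are hypothetical names, not code from souporcell_pipeline.py.

```python
import os
import shutil


def resolve_binary(name, fallback_dir=None):
    """Return the path to an executable, preferring the user's PATH."""
    found = shutil.which(name)
    if found:
        return found
    # Fall back to an explicit directory, e.g. a local cargo build output.
    if fallback_dir is not None:
        candidate = os.path.join(fallback_dir, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    raise FileNotFoundError(f"{name} not found on PATH or in {fallback_dir}")
```

The resolved path could then be used as the first element of the command list passed to `subprocess.check_call`, which keeps both install styles working.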

IndexError: index 12009 is out of bounds for axis 0 with size 10333

Hi,

Thanks for the great tools.
I am trying to run it for the first time, and the script is killed when running souporcell clustering. The error message is below:
running freebayes
merging vcfs
running vartrix
running souporcell clustering
Traceback (most recent call last):
  File "/opt/souporcell/souporcell_pipeline.py", line 295, in <module>
    "-t",str(args.threads), "-l", args.max_loci, "--min_alt", args.min_alt, "--min_ref", args.min_ref,'--out',cluster_file],stdout=log,stderr=log)
  File "/opt/conda/envs/py36/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['souporcell.py', '-a', 'sample1_souporcell/alt.mtx', '-r', 'sample1_souporcell/ref.mtx', '-b', 'barcodes.tsv', '-k', '2', '-t', '8', '-l', '2048', '--min_alt', '10', '--min_ref', '10', '--out', 'sample1_souporcell/clusters_tmp.tsv']' returned non-zero exit status 1.
The souporcell.log says:
loci being used based on min_alt, min_ref, and max_loci 54
Traceback (most recent call last):
  File "/opt/souporcell/souporcell.py", line 84, in <module>
    cell_data[cell-1][index] = float(ref_c)/float(ref_c+alt_c)
IndexError: index 12009 is out of bounds for axis 0 with size 10333

Any idea what went wrong? Thanks much!
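One possible cause of an error like the one above is that the count matrices were built with a different barcode list than the barcodes.tsv passed to the clustering step, so a cell index in the matrix exceeds the number of barcodes. The sketch below compares the two. It assumes that the matrices are Matrix Market coordinate files and that cells are the second dimension, which matches the `cell_data[cell-1]` indexing in the traceback; the helper names are hypothetical.

```python
def mtx_shape(lines):
    """Return (rows, cols, nnz) from the size line of Matrix Market text."""
    for line in lines:
        # Skip the banner and comment lines, which start with '%'.
        if line.startswith("%") or not line.strip():
            continue
        rows, cols, nnz = (int(tok) for tok in line.split()[:3])
        return rows, cols, nnz
    raise ValueError("no size line found")


def check_cells_match(mtx_lines, barcode_lines):
    """Compare the matrix cell dimension with the number of barcodes."""
    _, n_cells, _ = mtx_shape(mtx_lines)
    n_barcodes = sum(1 for line in barcode_lines if line.strip())
    return n_cells == n_barcodes, n_cells, n_barcodes
```

Both arguments can be open file handles, for example for alt.mtx and barcodes.tsv. If the counts differ, rerunning the allele counting step with the same barcodes file used for clustering would be the first thing to try.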

Reached memory limit (96G) for a low number of cells

LSBATCH: User input

./souporAg.sh

TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with exit code 143.
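The exit code in the report above follows the common shell convention of 128 plus a signal number (143 - 128 = 15, SIGTERM), which is consistent with the scheduler terminating the job at its memory limit. A small sketch for decoding such codes follows; `describe_exit_code` is a hypothetical helper, and the 128 offset is a shell convention rather than something souporcell defines.

```python
import signal


def describe_exit_code(code):
    """Interpret a shell-style exit status (128 + N usually means signal N)."""
    if code > 128:
        try:
            return "terminated by " + signal.Signals(code - 128).name
        except ValueError:
            # No signal with that number exists on this platform.
            return f"exit code {code} (no matching signal)"
    return f"exit code {code}"
```

By the same convention, a job killed with SIGKILL, as in the remapping report earlier on this page, would typically show up as exit code 137.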

CellBender + Souporcell

I want to use the new CellBender tool, which computationally removes ambient RNA from droplets.
https://github.com/broadinstitute/CellBender
It takes the raw (unfiltered) 10x HDF5 file and outputs its own filtered dataset. I ran souporcell using the filtered cell barcode list from CellBender.

I am getting a surprisingly high number of doublets called, around 40-50%. I suspect something is wrong. These are human nuclei data, with about 10k reads per nucleus.

Any thoughts?

Execution error

Hello,

I recently installed souporcell, using the singularity installation. The version should be fine given the following:

singularity --version
singularity version 3.2.1-1.1.el7

However, when running souporcell, I receive the following error, just after the "checking modules":

checking modules
SIGILL: illegal instruction
PC=0x473b8b m=0 sigcode=0

goroutine 1 [running, locked to thread]:
syscall.RawSyscall(0x3e, 0x5a32, 0x4, 0x0, 0xc000299ef0, 0x4907b2, 0x5a32)
        /usr/lib/golang/src/syscall/asm_linux_amd64.s:78 +0x2b fp=0xc000299eb8 sp=0xc000299eb0 pc=0x473b8b
syscall.Kill(0x5a32, 0x4, 0x438a2e, 0xc000299f20)
        /usr/lib/golang/src/syscall/zsyscall_linux_amd64.go:597 +0x4b fp=0xc000299f00 sp=0xc000299eb8 pc=0x47053b
github.com/sylabs/singularity/internal/app/starter.Master.func4()
        internal/app/starter/master_linux.go:158 +0x3e fp=0xc000299f38 sp=0xc000299f00 pc=0x8e27be
github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute.func1()
        internal/pkg/util/mainthread/mainthread.go:20 +0x2f fp=0xc000299f60 sp=0xc000299f38 pc=0x8819cf
main.main()
        cmd/starter/main_linux.go:102 +0x68 fp=0xc000299f98 sp=0xc000299f60 pc=0x8e2ff8
runtime.main()
        /usr/lib/golang/src/runtime/proc.go:201 +0x207 fp=0xc000299fe0 sp=0xc000299f98 pc=0x430cd7
runtime.goexit()
        /usr/lib/golang/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc000299fe8 sp=0xc000299fe0 pc=0x45c7f1

goroutine 5 [syscall]:
os/signal.signal_recv(0xab4480)
        /usr/lib/golang/src/runtime/sigqueue.go:139 +0x9c
os/signal.loop()
        /usr/lib/golang/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
        /usr/lib/golang/src/os/signal/signal_unix.go:29 +0x41

goroutine 7 [chan receive]:
github.com/sylabs/singularity/internal/pkg/util/mainthread.Execute(0xc00015b150)
        internal/pkg/util/mainthread/mainthread.go:23 +0xb4
github.com/sylabs/singularity/internal/app/starter.Master(0x4, 0x5, 0x1d00, 0x5a45, 0xc00000cb00)
        internal/app/starter/master_linux.go:157 +0x44e
main.startup()
        cmd/starter/main_linux.go:73 +0x563
created by main.main
        cmd/starter/main_linux.go:98 +0x3e

rax    0x0
rbx    0x0
rcx    0xffffffffffffffff
rdx    0x0
rdi    0x5a32
rsi    0x4
rbp    0xc000299ef0
rsp    0xc000299eb0
r8     0x0
r9     0x0
r10    0x0
r11    0x206
r12    0xc
r13    0xff
r14    0xaa90fc
r15    0x0
rip    0x473b8b
rflags 0x206
cs     0x33
fs     0x0
gs     0x0

How to re-run clustering with singularity install?

Hello again!

Potentially naive question here -- If I ran SoupOrCell using the suggested singularity install, how can I run specific parts of the pipeline as you describe in the 'hard-install' section? I just applied the pipeline to a new dataset but realized I set the wrong k and want to just re-do the clustering step.

Thanks again for the new tool -- it is extremely useful!!!

Chris
