aristoteleo / dynast-release Goto Github PK

Inclusive and efficient quantification of labeling and splicing RNAs for time-resolved metabolic labeling based scRNA-seq experiments

Home Page: https://dynast-release.readthedocs.io/en/latest/

License: MIT License

Python 99.59% Makefile 0.16% Stan 0.25%

metabolic-labeling nasc-seq rna-velocity sci-fate scnt-seq scrna-seq scslam-seq vector-field

dynast-release's People

Contributors

Stargazers

Watchers

Forkers

yenaled mowonhoo jiayi-fan cnk113

dynast-release's Issues

Recruiting Python Programmers to Develop Predictive Single Cell Spatiotemporal Analysis Toolkit

Recruiting Python Programmers to Develop Predictive Single Cell Spatiotemporal Analysis Toolkit

The Weissman lab (https://weissman.wi.mit.edu/) at Whitehead Institute & MIT is looking for one or more talented programmer (primarily undergraduates/masters but also open to senior computational scientist) with strong python programming and software engineering backgrounds to develop, optimize and maintain next generation predictive tools for the emergent single cell genomics applications. This position is particularly well suited for senior undergraduates or masters seeking for research experience before continuing their education or industrial career. The goal of the projects is to develop a foundational software ecosystem, Aristotle (https://github.com/aristoteleo), that marries machine learning, dynamical systems and other systems biology approaches with advanced single cell genomics to empower biologists to gain mechanistic and predictive insights of cell fate transitions in development and diseases. The representative publications related to this position include: Genome-wide perturb-seq: Replogle, Saunders, et. al, Cell, 2022; Stereo-seq: Chen, et. al, Cell, 2022; Dynamo: Qiu, Zhang, et. al, Cell, 2022.

The ideal candidate will join us to further develop and improve the Aristotle ecosystem, an end-to-end computational framework that provides advanced spatiotemporal modeling of single cell and spatial genomics datasets. The Aristotle ecosystem currently consists of a coherent set of three major python packages on Github: dynast (aristoteleo/dynast-release), dynamo (aristoteleo/dynamo-release) and spateo (aristoteleo/spateo-release), that have functionalities range from raw sequencing data processing to downstream predictive frameworks of biophysical modeling of RNA kinetics, machine learning of the RNA velocity vector field, and spatiotemporal modeling of single cell resolution spatial transcriptomics in 3D space, etc. The ideal candidate will contribute to two major aspects: software engineering and community outreach. The candidate will apply state-of-the-art software engineering practices to improve the modularity, organization, and quality of the existing code base. Additionally, the candidate will contribute to the development of novel algorithms and the analysis of new single cell genomics datasets. The candidate will be encouraged to participate in community activities to foster wider adoption of the Aristotle ecosystem. Papers reporting the packages and methods will be published to promote the ecosystem and the candidate’s career.

The Weissman lab at Whitehead Institute & MIT is a leader in functional genomics including the development of ribosome profiling, CRISPRi, CRISPRa, and CRISPRoff, Perturb-seq, lineage tracing and novel predictive computational tools. The lab is well-known for its highly stimulating and collaborative environment, as well as its strong track record for placing students into top graduate programs and as a training ground for leaders in academia and industry. A competitive salary will be offered, following Whitehead Institute standard based on training level. The candidate will need to primarily work in the lab, located in the vibrant Kendall Square area in Cambridge, which is a hub for science, engineering and biotechnology and is proximal to the MIT campus.

If you are interested in this position, please email your CV and github account ID to BOTH Dr. Jonathan Weissman ([email protected]) AND his post-doc Dr. Xiaojie Qiu ([email protected]) who will provide many hands-on supervision on the proposed projects.

Whitehead Institute is an equal opportunity employer. Diversity and inclusion are core values of the Whitehead Institute and the Institute is committed to non-discrimination in employment.

The full advertisement can be found in the following:
programmer_final.pdf

dynast --list option not suitable

Hello,

I you like to use the dynast pipeline to gerenate the input files for Dynamo. I work with sc-SLAM seq data.
The four supported technologies offerted for dynast aligment are not suitable for my data, because my read1 file started with 10 Barcode bases + 8 UMI bases (inverted compared to scifate). Here an example of a line of my read1 fastq:

TAACTTGGTCGTGAGGGCTTTTTTTT, where TAACTTGGTC is the BC, GTGAGGGC is UMI.
Is it possible to make dynast align suitable for my data structure?

Thanks
Francesca

Align: not enough memory for BAM sorting potentially due to BinsN too high

In the past, changing ulimit -Hn had fixed this. Now running this on Terra and this parameter is much more difficult to change. Changing the way BinsN is chosen or allowing an optional input to set it would be great!

Thanks!

Dynast count BamError: Some paired reads do not have mates

Hi,

First of all, thank you for the cool tool!
I have a some troubles to get the dynast count function to work. I keep getting the following error:

ERROR [main] An exception occurred Traceback (most recent call last): File "/broad/hptmp/jalbers/dynast_env/lib/python3.7/site-packages/dynast/main.py", line 741, in main COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=args.tmp) File "/broad/hptmp/jalbers/dynast_env/lib/python3.7/site-packages/dynast/main.py", line 582, in parse_count velocity=not args.no_splicing, File "/broad/hptmp/jalbers/dynast_env/lib/python3.7/site-packages/ngs_tools/logging.py", line 62, in inner return func(*args, **kwargs) File "/broad/hptmp/jalbers/dynast_env/lib/python3.7/site-packages/dynast/count.py", line 95, in count velocity=velocity File "/broad/hptmp/jalbers/dynast_env/lib/python3.7/site-packages/dynast/preprocessing/bam.py", line 557, in parse_all_reads n_threads=n_threads, File "/broad/hptmp/jalbers/dynast_env/lib/python3.7/site-packages/ngs_tools/bam.py", line 213, in split_bam raise BamError('Some paired reads do not have mates.') ngs_tools.bam.BamError: Some paired reads do not have mates.

The error occurs when running the following code:
dynast count --tmp tmp_count_1\ --keep-tmp\ --verbose\ -t 8\ --barcode-tag RG\ --barcodes align/Solo.out/Gene/filtered/barcodes.tsv\‚ -o COUNT/\ --barcodes ALIGN/Solo.out/Gene/filtered/barcodes.tsv\ -g gencode.v38.annotation.gtf.gz\ --conversion TC ALIGN/Aligned.sortedByCoord.out.bam

Some background for these data: I performed mini-bulk RNAseq with SmartSeq2, performed the library prep using Nextera and sequenced the samples on a Nextseq500. The BCL file that Nextseq generates was demultiplexed and converted to fastq files using bcl2fastq and I used dynast align to generate the BCL file. The BCL file contains the reads from two different samples (bulk).

This is the code I used to run dynast align:
dynast align\ --verbose\ --keep-tmp\ --tmp new_tmp6\ -i $STAR_DIR\ -o ALIGN/\ -t 8\ -x smartseq samples_ls_1.csv\ -w all_barcodes_SS2_whitelist.txt

The file all_barcodes_SS2_whitelist.txt is empty.

I've done the following troubleshooting so far, but did not get it to run:

I inspected the BAM file using Sam-tools and IGV and in general the reads are paired in mates.
I filtered the BAM file for only paired samples: samtools view -f 1 -b -o out.bam in.bam (https://www.biostars.org/p/263573/) but got the same error when using the filtered BAM file as input.

Thanks for your help!
Julian

dynast estimate: behavior for genes for which estimation is skipped and --reads argument

I take the opportunity to ask a couple additional questions on dynast estimate. What is the current behavior for genes for which estimation is skipped?

(WARNING [estimate] Estimation skipped for 19096 cell-gen
es because they have less than 16 reads. Use --cell-gene-threshold to change.)

Are they removed from the anndata object? Or are they kept on it but not corrected?

My other question is whether it would be possible to allow to use more than one option in the --reads argument (e.g. transcriptome and total).

Thanks for this awesome tool!!
Jorge

Bug in 'estimate' when p_e is taken from control

Hi!

I've been playing around with this package and it looks very cool, congrats and thank you!

There seems to be a little bug in the code for estimating new/old when the background conversion probability is estimated from a control sample. Specifically, in lines 151-155 of estimate.py, there is the following code:

if control_p_e:
    logger.info('--p-e provided. No background mutation rate estimation will be done.')
    df_barcodes = df_counts[p_key].drop_duplicates().reset_index(drop=True)
    df_barcodes['p_e'] = control_p_e
    df_barcodes.to_csv(p_e_path, header=[p_key, 'p_e'], index=False)

df_barcodes is a pandas Series object, not a DataFrame. Thus, when you do df_barcodes['p_e'] = control_p_e, this actually creates a new row in the Series of value control_p_e and index 'p_e'. Instead, a DataFrame should be created with a new column for p_e.

So replacing this line with df_barcodes = pd.DataFrame({p_key:df_barcodes,'p_e':control_p_e}) should resolve this issue. Currently when I try to run estimate I encounter the following issue:

[2022-04-23 23:57:18,936]    INFO [estimate] `--p-e` provided. No background mutation rate estimation will be done.
[2022-04-23 23:57:18,976]   ERROR [main] An exception occurred
Traceback (most recent call last):
  File "/camp/home/maizelr/.local/lib/python3.8/site-packages/dynast/main.py", line 742, in main
    COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=args.tmp)
  File "/camp/home/maizelr/.local/lib/python3.8/site-packages/dynast/main.py", line 650, in parse_estimate
    estimate(
  File "/camp/home/maizelr/.local/lib/python3.8/site-packages/ngs_tools/logging.py", line 62, in inner
    return func(*args, **kwargs)
  File "/camp/home/maizelr/.local/lib/python3.8/site-packages/dynast/estimate.py", line 155, in estimate
    df_barcodes.to_csv(p_e_path, header=[p_key, 'p_e'], index=False)
  File "/camp/home/maizelr/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 3466, in to_csv
    return DataFrameRenderer(formatter).to_csv(
  File "/camp/home/maizelr/.local/lib/python3.8/site-packages/pandas/io/formats/format.py", line 1105, in to_csv
    csv_formatter.save()
  File "/camp/home/maizelr/.local/lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 257, in save
    self._save()
  File "/camp/home/maizelr/.local/lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 261, in _save
    self._save_header()
  File "/camp/home/maizelr/.local/lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 266, in _save_header
    self.writer.writerow(self.encoded_labels)
  File "/camp/home/maizelr/.local/lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 228, in encoded_labels
    encoded_labels += list(self.write_cols)
  File "/camp/home/maizelr/.local/lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 209, in write_cols
    raise ValueError(
ValueError: Writing 1 cols but got 2 aliases

Cheers,
Rory

Motivation for using consensus

Hi guys,

This is a bit of a reply to your responses in my previous issue #9, regarding the new consensus step, and might not be totally appropriate for the issues section - but I've read through your technical information update on readthedocs, and I find it very surprising that there are substantially different sequences with the same UMI. I had some questions:

do you know if this occurs across multiple sequencing protocols (i.e. does this also happen in 10x data or sci-RNA-seq data), and how widespread is it across different transcripts?
do you have any idea of the mechanism of how this might be happening?
do you see empirically that using consensus significantly improves the detection of conversion sites compared to just using align?

I'm sure that the answers to these questions are all in the works, I just wondered if you could give some rough ideas of your thinking, so I know whether to use consensus for my current analysis. If you want to email me to discuss feel free, my email is [email protected]

Thanks,
Rory

Exception on dynast estimate - cell barcode names

Hi,

After processing my using dynast counts, I used dynast estimate using the --groups argument.

I obtained a list of the cell barcodes of interest from the anndata object generated by dynast counts, and saved it as a csv with the following format:

AAAAAAAAAAGG,group_1
AAAAAGACCAAT,group_1
AAAACATTGGGC,group_1
AAAACCAAGTGA,group_1
AAAACCAGCGTG,group_1
AAAACCTGAGTA,group_1
AAAACTACTCCC,group_1
AAAACTCCGGGT,group_1
AAAAGCTGCATG,group_1
AAAAGTTTCAGG,group_1

I am currently getting the following error at the last step of dynast estimate, during anndata object generation:

ERROR [main] An exception occurred
Traceback (most recent call last):
  File "/home/jmartinr/anaconda3/envs/dynast_env/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 706, in astype
    casted = self._values.astype(dtype, copy=copy)
ValueError: invalid literal for int() with base 10: 'AAAAAAAAAAGG'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/jmartinr/anaconda3/envs/dynast_env/lib/python3.8/site-packages/dynast/main.py", line 665, in main
    COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=args.tmp)
  File "/home/jmartinr/anaconda3/envs/dynast_env/lib/python3.8/site-packages/dynast/main.py", line 578, in parse_estimate
    estimate(
  File "/home/jmartinr/anaconda3/envs/dynast_env/lib/python3.8/site-packages/dynast/logging.py", line 53, in inner
    return func(*args, **kwargs)
  File "/home/jmartinr/anaconda3/envs/dynast_env/lib/python3.8/site-packages/dynast/estimate.py", line 224, in estimate
    adata.obs['count_dir'] = adata.obs.index.str.split('-').str[-1].astype(int).map({
  File "/home/jmartinr/anaconda3/envs/dynast_env/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 708, in astype
    raise TypeError(
TypeError: Cannot cast Index to dtype int64

Based on the error, it seems the issue is caused by the formatting of the cell barcodes, which expects a "-1" at the end of their names. However, dynast counts does not produce cell barcodes with "-1" at the end.

A single input directory is being used, as well as a single group of cells on the CSV (I want to use dynast estimate on all cells of the file together).

Thanks,
Jorge

dynast count --conversion

We design a experiment which RNA labeled with s4U to inducing T>C conversion on mRNA. I would like to use dynast count to generate the adata object etc. Should I set the --conversion TC or --conversion TC,AG? Does it depend on the strandedness of my library?

I notice that in the scNT-seq/sci-fate tutorial, you just set --conversion TC. However, scNT-seq and sci-fate are unstranded library and there are roughly equal TC and AG mutations in their alignments. Does it means that we should just set the --conversion as the mutation on mRNA induced by experiment and do not consider library strandedness?

Yan

dynast count error：forward_strand = df_counts['GX'].map(lambda gx: gene_infos[gx]['strand']) == '-' KeyError: '-'

I encountered some problems while running dynast. I would appreciate it if the author could help me with them. Thank you very much.

1. My library is generated using the C4 kit from BGI Genomics, and there is no option in dynast --list to support this technology. So, I would like to know if there is a way to directly analyze C4 data.

2. I tried using the smartseq technology option to analyze only the R2_fastq file, but encountered problems during the counting process. Below is my code:
align
STAR --outSAMtype BAM SortedByCoordinate --outSAMattributes NH HI AS NM nM MD GX GN RG
--outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3 --soloType SmartSeq --soloUMIdedup Exact
--soloStrand Unstranded --genomeDir ${dynast}/star --runThreadN 20
--outFileNamePrefix ${dynast_align} --outBAMsortingBinsN 12490 \ ###Due to the limitation of my folder, which can create a maximum of 10,000 files, I removed the "--outBAMsortingBinsN" option.###
--readFilesIn cDNAlib_2.fq --outSAMattrRGline ID:C4
count
dynast count -t 40 -g $gtf --no-splicing -o $my_path/dynast_count/ --barcode-tag RG
--conversion TC ${dynast_align}/Aligned.sortedByCoord.out.bam --tmp ${my_path}/tmp --strand unstranded --verbose --keep-tmp

error：
/home/mingjian/.local/lib/python3.8/site-packages/anndata-0.9.0rc1-py3.8.egg/anndata/experimental/pytorch/_annloader.py:18: UserWarning: Сould not load pytorch.
warnings.warn("Сould not load pytorch.")
[2023-04-14 14:04:10,117] DEBUG [main] Printing verbose output
[2023-04-14 14:04:10,117] DEBUG [main] Input args: Namespace(bam='/home/mingjian/workbase/03_zebra5.5hpf/02_C4_5.5hpf_20230412/dynast_result/dynast_align/Aligned.sortedByCoord.out.bam', barcode_tag='RG', barcodes=None, command='count', control=False, conversion=['TC'], dedup_mode=None, exon_overlap='strict', g='/home/mingjian/workbase/01_zebrafish_embryo/zebrafish_ensenmble_ref/Danio_rerio.GRCz11.108.gtf', gene_names=False, gene_tag='GX', keep_tmp=True, list=False, nasc=False, no_splicing=True, o='/home/mingjian/workbase/03_zebra5.5hpf/02_C4_5.5hpf_20230412/dynast_result/dynast_count/', overwrite=False, quality=27, snp_csv=None, snp_min_coverage=1, snp_threshold=None, strand='unstranded', t=40, tmp='/home/mingjian/workbase/03_zebra5.5hpf/02_C4_5.5hpf_20230412/dynast_result/tmp', umi_tag=None, verbose=True)
[2023-04-14 14:04:10,117] DEBUG [main] Creating /home/mingjian/workbase/03_zebra5.5hpf/02_C4_5.5hpf_20230412/dynast_result/tmp directory
[2023-04-14 14:04:10,117] WARNING [main] --barcodes not provided. All cell barcodes will be processed.
[2023-04-14 14:04:12,629] WARNING [count] BAM contains secondary alignments, which will be ignored. Only primary alignments are considered.
[2023-04-14 14:04:12,916] WARNING [count] Skipped BAM parsing because files already exist. Use --overwrite to re-parse the BAM.
[2023-04-14 14:04:13,788] INFO [count] Counting conversions to /home/mingjian/workbase/03_zebra5.5hpf/02_C4_5.5hpf_20230412/dynast_result/dynast_count/counts_TC.csv
[2023-04-14 14:04:13,788] DEBUG [count] Loading index /home/mingjian/workbase/03_zebra5.5hpf/02_C4_5.5hpf_20230412/dynast_result/dynast_count/conversions.idx
[2023-04-14 14:04:14,784] DEBUG [count] Splitting indices into 40 parts
[2023-04-14 14:04:15,414] DEBUG [count] Spawning 40 processes
counting: 100%|########################################################| 9.64M/9.64M [00:29<00:00, 332kit/s]
[2023-04-14 14:04:48,072] DEBUG [count] Combining intermediate parts to /home/mingjian/workbase/03_zebra5.5hpf/02_C4_5.5hpf_20230412/dynast_result/tmp/tmpcw0f0gh9
[2023-04-14 14:04:49,207] DEBUG [count] Loading combined counts from /home/mingjian/workbase/03_zebra5.5hpf/02_C4_5.5hpf_20230412/dynast_result/tmp/tmpcw0f0gh9
[2023-04-14 14:04:56,822] ERROR [main] An exception occurred
Traceback (most recent call last):
File "/home/mingjian/miniconda3/envs/dynast/lib/python3.8/site-packages/dynast/main.py", line 1052, in main
COMMAND_TO_FUNCTION[args.command](parser, args, temp_dir=args.tmp)
File "/home/mingjian/miniconda3/envs/dynast/lib/python3.8/site-packages/dynast/main.py", line 829, in parse_count
count(
File "/home/mingjian/miniconda3/envs/dynast/lib/python3.8/site-packages/ngs_tools/logging.py", line 62, in inner
return func(*args, **kwargs)
File "/home/mingjian/miniconda3/envs/dynast/lib/python3.8/site-packages/dynast/count.py", line 306, in count
counts_path = preprocessing.count_conversions(
File "/home/mingjian/miniconda3/envs/dynast/lib/python3.8/site-packages/dynast/preprocessing/conversion.py", line 541, in count_conversions
df_counts = complement_counts(read_counts(combined_path), gene_infos)
File "/home/mingjian/miniconda3/envs/dynast/lib/python3.8/site-packages/dynast/preprocessing/conversion.py", line 116, in complement_counts
forward_strand = df_counts['GX'].map(lambda gx: gene_infos[gx]['strand']) == '-'
File "/home/mingjian/miniconda3/envs/dynast/lib/python3.8/site-packages/pandas/core/series.py", line 4539, in map
new_values = self._map_values(arg, na_action=na_action)
File "/home/mingjian/miniconda3/envs/dynast/lib/python3.8/site-packages/pandas/core/base.py", line 890, in _map_values
new_values = map_f(values, mapper)
File "/home/mingjian/miniconda3/envs/dynast/lib/python3.8/site-packages/pandas/core/base.py", line 873, in
map_f = lambda values, f: values.map(f)
File "/home/mingjian/miniconda3/envs/dynast/lib/python3.8/site-packages/pandas/core/arrays/categorical.py", line 1533, in map
new_categories = self.categories.map(mapper)
File "/home/mingjian/miniconda3/envs/dynast/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 6361, in map
new_values = self._map_values(mapper, na_action=na_action)
File "/home/mingjian/miniconda3/envs/dynast/lib/python3.8/site-packages/pandas/core/base.py", line 890, in _map_values
new_values = map_f(values, mapper)
File "pandas/_libs/lib.pyx", line 2924, in pandas._libs.lib.map_infer
File "/home/mingjian/miniconda3/envs/dynast/lib/python3.8/site-packages/dynast/preprocessing/conversion.py", line 116, in
forward_strand = df_counts['GX'].map(lambda gx: gene_infos[gx]['strand']) == '-'
KeyError: '-'

Dynast count error:

Despite using Dynast Count on the same data there seems to be a difference when running Dynast with "TC, GA ", "GA", and "TC". It seems that there's in increase in the conversion you call for. (ie. Higher TC when you look for TC)

210809_dual_labeling.pdf

Consensus.bam lost gene-tag after completion

Hi,

I am trying out your very nice dynast pipeline (version 0.2.0) I ran into an issue with counting after creating a consensus bam file. Somehow the consensus.bam has lost the gene-tag or gene assignment after completion. Mind you the flag for gene-tag differs from your default value "GX", but I specify --gene-tag GE when running the consensus command. I checked the in and output bam files for the presence of my gene-tag (see below), plus here is the command I ran

dynast consensus -t 24 -g gencode.GRCm39.vM29.primary_assembly.annotation.noChr.rRNA.gtf --umi-tag UB --barcode-tag BC --gene-tag GE --strand forward --quality 27 -o dynast_consensus UMIs.bam --barcodes expected_barcodes.txt

samtools view -@ 48 UMIs.bam | grep -c 'GE:Z:ENSMUS'
1997940041

samtools view -@ 48 consensus.bam | grep -c 'GE:Z:ENSMUS'
0

Is this normal and one should disregard the --gene-tag for counting after consensus building, or do you have any good input as to why this might have happened?

Thanks in advance.

Best,
Michael