frisen-lab / trex Goto Github PK

Simultaneous lineage TRacking and EXpression profiling of single cells using RNA-seq

License: MIT License

Jupyter Notebook 30.79% Python 69.21%

trex's Introduction

Overview

TREX is an experimental workflow that enables simultaneous lineage TRacking and EXpression profiling of single cells using RNA-sequencing. The method is described in the paper Clonal relations in the mouse brain revealed by single-cell and spatial transcriptomics.

An essential part of this workflow is presented here: the extraction of genetic barcodes or "cloneIDs" from single-cell or spatial transcriptomes and the reconstruction of related cells/spots.

The tool uses BAM files of one or multiple sequencing libraries as an input for the generation of cloneID count matrices and identifies clonally related cells based on Jaccard similarity between each pair of cloneID+cells.

Currently, TREX is compatible with common RNA-sequencing library preparation methods and data formats provided by 10X Chromium, 10X Visium, Smart-seq2 and 3.

Installation

TREX requires Python 3.7 or newer. We recommend that you install TREX into a "virtual environment", which can be done by running the following commands:

python3 -m venv trex-venv
trex-venv/bin/pip install git+https://github.com/frisen-lab/TREX.git

trex-venv is the name of the directory that will be created and which will contain the virtual environment. You can choose a different name.

Activate the virtual environment by running source trex-venv/bin/activate. You need to repeat this step in every new shell in order to be able to run TREX. Finally, test the installation by running trex --version.

Changelog

See Changelog.

Running TREX on a minimal test dataset

Clone the Git repository or download it as a ZIP file and unpack it. The directory tests/data/ contains a test dataset in the proper format. Run TREX on it:

trex run10x -s 695 -e 724 tests/data/

This will create a folder trex_run/ in the current directory (use the -o option to choose a different folder) with the results.

See the "Runnning TREX" section below for further details.

Running TREX on data from the Nature Neuroscience paper

Here we show how to run TREX on the data we analyzed in our Nature Neuroscience paper (https://doi.org/10.1038/s41593-022-01011-x).

These instructions have been tested with Cell Ranger 6.1.2, but the original analysis was done with Cell Ranger 2.2.0.

Data

The data is available under GEO accession GSE153424. The instructions below analyze sample GSM4644060 because it is relatively small.

GSM4644060 is also called "brain1_str" (str stands for striatum) in the GEO description, and it has the ID "10X_41" in Supplementary Table 4 in the paper.

As can be seen on the overview page for GSM4644060, reads for the dataset are available from SRA experiment accession SRX8627776, which in turn links to two run accessions:

We use the run accessions to retrieve the data.

Prerequisites

Go to the Cell Ranger download page. Download and install Cell Ranger.

Download and extract the mouse reference dataset refdata-gex-mm10-2020-A.tar.gz:

 curl -O https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-mm10-2020-A.tar.gz
 tar xvf refdata-gex-mm10-2020-A.tar.gz

Create a custom reference by adding an extra chrH2B-EGFP-N contig (EGFP-cloneID virus) to the above mouse reference dataset. references/ refers to the directory in this repository. cellranger mkref takes about one hour. You may choose to continue to the next step (downloading reads) while this is running.

cat refdata-gex-mm10-2020-A/fasta/genome.fa references/chrH2B-EGFP-N.fa > mm10_H2B-EGFP-30N_genome.fa
cat refdata-gex-mm10-2020-A/genes/genes.gtf references/chrH2B-EGFP-N.gtf > mm10_H2B-EGFP-30N_genes.gtf
cellranger mkref --genome=mm10_H2B-EGFP-30N --fasta=mm10_H2B-EGFP-30N_genome.fa --genes=mm10_H2B-EGFP-30N_genes.gtf > mkref_mm10_H2B-EGFP-30N.out

Downloads reads

Create a fastq directory and change into it:

mkdir fastq
cd fastq

If you do not have it, install fastq-dump. For example, if you have Conda (with the Bioconda channels activated), run

conda create -n trex sra-tools
conda activate trex

Download the reads:

fastq-dump --gzip --split-3 --defline-qual '+' --defline-seq '@$ac.$sn' SRR12103475 SRR12103476

Rename the files so that Cell Ranger can find them:

mv SRR12103475_1.fastq.gz SRR12103475_S1_L001_R1_001.fastq.gz
mv SRR12103475_2.fastq.gz SRR12103475_S1_L001_R2_001.fastq.gz
mv SRR12103476_1.fastq.gz SRR12103476_S1_L001_R1_001.fastq.gz
mv SRR12103476_2.fastq.gz SRR12103476_S1_L001_R2_001.fastq.gz

cd ..

Run Cell Ranger

Continue with this step only when the above steps have finished. Run cellranger count:

cellranger count --transcriptome=mm10_H2B-EGFP-30N --id=brain1_str --fastqs=fastq/ --sample=SRR12103475 --sample=SRR12103476 --expect-cells=2299

The --expect-cells parameter is set to the number of cells loaded per well in the 10X chip used for droplet generation (4000 in this case, see the "cells recovered for 10X" column in Suppl. Table 4) divided by 1.74.

Run TREX

With trex installed (as described above), run

trex run10x -o trex_brain1_str brain1_str

Results will be written to a new directory named trex_brain1_str.

Running TREX

The input directory for TREX must be a Cell Ranger output directory. In case of Smart-Seq2 / 3 data, one BAM file with all cells or a folder with one BAM file per cell is expected (see zUMIs output) See the contents of the tests/data/ directory to learn which are the minimum files necessary. Cell Ranger/zUMIs must have been configured to map against a reference augmented by an extra chromosome that contains the cloneID. By default, that extra chromosome is assumed to be the last in the BAM file (use --chromosome to choose a different one). The options -s and -e set where on the extra chromosome the cloneID is located (-s gives start and -e gives end in 1-based coordinates).

Please also run trex run10x --help (or trex smartseq2 --help and trex smartseq3 --help respectively ) to see the other available command-line options.

Pipeline steps overview

This is an overview of the steps that the trex run10x/ trex smartseq3 command performs.

Retrieve usable reads from the input BAM file. A usable read fulfills these requirements:
- It aligns to the region specified with the --chr, -s and -e flags or, if the flags are not given, to the region that has been automatically identified to be the region containing the variable cloneID sequence,
- it has both an associated cell ID and UMI (SAM tags CB and UB in case of 10x data , SAM tags BC and UB in case of Smart-seq3 data),
- its cell ID is included in the list of allowed cell IDs (if such a list is provided with --filter-cellid or -f).
Group reads with identical cell ID and UMI into molecules. The cloneID of the molecule is taken to be the consensus of the cloneIDs.
Error-correct cloneIDs of the molecules.
Group molecules by cell ID into cells.
Filter bad cloneIDs in each cell: Rare cloneIDs also found in another cell are considered to be contaminants. CloneIDs supported by only a single read are removed.
Cluster the cells into clones by creating a clone graph. Edges are drawn between cells that appear to belong to the same clone based on the set Jaccard index threshold. The connected components of the graph are considered to be the clones. (A clone is thus simply a set of cells.)
Error-correct the clone graph by removing spurious edges ("bridges").

trex smartseq2 follows a similar pipeline with the following differences:

A useable read does not require a UMI
Reads do not get grouped into molecules and their cloneIDs are not collapsed and error-corrected into consensus sequences

Input parameters

Jaccard index threshold

The Jaccard index measures the similarity of two sample sets, in this case the similarity of two sets of cloneIDs. It is calculated by dividing the number of overlapping, unique cloneIDs between cell A and B by the total number of unique cloneIDs in cell A and B. An index of 0.0 indicates no overlapping cloneIDs and an index of 1.0 a perfect match. The Jaccard threshold is the Jaccard index above which two cells are merged into one clone. It can be set with the --jaccard-threshold flag and is 0.7 by default, meaning cell A and B are merged into one clone if they have more than 70% of cloneIDs in common.

Filter cellids

Tab-separated file of cell IDs to keep in the TREX run. Adding this file via the --filter-cellid or -f option allows to focus the analysis on specific cells and to filter out low quality cells or doublets. Example:

0	CACTCGTGGTACACACTCCG
1	CACTCGTGGTACCACAAGCA

Filtering cloneIDs

Text file with cloneIDs to ignore. The format is one cloneID per line. Adding this file via the --filter-cloneids option allows to ignore cloneIDs that have been identified as overrepresented or artefactual during library characterization. Example file:

GGTCTCCCTATACCAACAGTATCGTCTCAA
GGGTTCTGGGATATTACGTTGACTTGAGAG

Output files

TREX by default writes its results to a newly created directory named trex_run. The name of the output directory can be changed with --output (or -o). The files created in the output directory are described below. They are listed in the order in which the program creates them, see the "Pipeline steps overview" section for more details about what is done in each step.

Many of the files are tables in either tab-separated value (TSV) format or in comma-separated values (CSV) format.

`log.txt`

This file contains a copy of the output that a TREX run prints to the terminal.

`entries.bam`

All usable reads from the input BAM file. (See the pipeline steps overview for a description of "usable reads".)

`reads.txt`

A table listing cell_id, umi and clone_id for each usable input read. Example:

#cell_id          umi         clone_id
TGACGGCGTTACCAGT  AAAAAACTGT  TGTCAATCGTTCGGTTGAGCAAGATCTTAG

The character 0 in a cloneIDs signals a deleted base (CIGAR operation D in the input BAM file).
If the read does not fully cover the cloneID region, the cloneID contains the character - for each missing base.

`molecules.txt`

A table listing cell_id, umi and clone_id for each detected molecule. Example:

#cell_id                umi             clone_id
AAACCTGAGAGGTACC        AGTTAAAGTA      TGTCAATCGTTCGGTTGAGCAAGATCTTAG

`molecules_filtered.txt`

The same molecules.txt, but with those molecules removed that did not pass some filtering criteria (such as low-complexity cloneID filtering).

`molecules_corrected.txt`

A table with molecules where cloneIDs have been error corrected.

The file is the same as molecules.txt with some changes in the clone_id column.

`cells.txt`

A table listing the detected cells. The fields are cell_id, a colon (:) and then pairs of columns clone_id1 and count1 for each cloneID found in that cell. Example:

#cell_id          :  clone_id1                       count1  clone_id2 count2  ...
AAACCTGAGAGGTACC  :  TGTCAATCGTTCGGTTGAGCAAGATCTTAG  1
AAACCTGAGTAGCGGT  :  TGTCAATCGTTCGGTTGAGCAAGATCTTAG  3

`cells_filtered.txt`

The same as cells.txt, but with error-corrected cloneID counts. Cells that end up without any cloneIDs after error correction are removed.

`umi_count_matrix.csv`

A matrix of UMI counts with cells as columns and cloneIDs as rows. This is a different representation of the data written to cells_filtered.txt.

Only created if option --umi-matrix is used.

`read_count_matrix.csv`

Instead of UMI count matrix, running trex smartseq2 produces matrix of read counts with cells as columns and cloneIDs as rows. This is a different representation of the data written to cells_filtered.txt.

Only created if option --read-matrix is used.

`components.txt`

Connected components of the clone graph.

`components_corrected.txt`

Connected components of the clone graph after error correction.

`graph.pdf`, `graph_corrected.pdf`

Plots of the clone graph and the error-corrected clone graph.

These are only created if option --plot is used.

graph_corrected.gv and graph.gv are textual descriptions of the graphs in GraphViz format.

`doublets.txt`

A list of the cell IDs of the cells that were detected to be doublets.

`clones.txt`

A table listing the cell IDs belonging to each clone. The columns are clone_id and cell_id where clone_id is a number that identifies the clone.

clone#   cell_id
1        TGGCGCAAGAATAGGG

`clone_sequences.txt`

A table listing the 30N sequence of each cloneID. The columns are clone# and clone_seq where clone_id is a number that identifies the clone and clone_seq its nucleotide sequence.

clone#  clone_seq
1       ACTAGGAGATTGACGGATCACCTTTGGTCG

`data.loom`

A loom file.

This file is created only if option --loom (or -l) is used.

Creating a quality control report

trex qc --plot-jaccard-matrix --plot-hamming-distance DIRECTORY

qc takes as an input the directory (or directories) of trex output. Plotting the jaccard similarity matrix between cells requires some time as jaccard similarity is calculated pairwise amongst all cells. This can be activated adding the optional flag --plot-jaccard-matrix. Hamming distance between all viral cloneIDs found in the dataset after each step can be plotted by means of the optional flag --plot-hamming-distance.

This will add a PDF file named quality_report.pdf describing the quality of the TREX run inside the same folder with the TREX output.

This report contains:

Overall results

Histogram of clone sizes
Histogram of how many unique cloneIDs can pe found in each clone
Histogram of how many unique cloneIDs can be found in each cell
(Optional) A histogram of the Jaccard similarity values between cells and a matrix of the Jaccard similarity between all cells.
Histogram of how many reads each detected viral cloneID molecule has.

Per step results

Each of these plots has four subplots corresponding to different steps of the TREX pipeline.

Histograms of how many nucleotides have been read in each molecule
(Optional) Histograms of the Hamming distance between all the viral cloneIDs found
Histograms of how many viral cloneID molecules have been found in each cell
Histograms of how many molecules of each unique cloneID have been found in the dataset
Histograms of How many unique cloneIDs per cell have been found

TREX development

It is highly recommended that you develop TREX within a separate virtual environment:

python3 -m venv --prompt trex .venv
source .venv/bin/activate

Install TREX in "editable" mode:

pip install -e .

Install pre-commit and install the pre-commit hooks (these run at git commit time and do some checks on the to-be-committed files):

pip install pre-commit
pre-commit install

trex's People

Contributors

Stargazers

Watchers

Forkers

cnk113 acorbat wylhtydtm leonievb rosestring serjisa

trex's Issues

Filter low-complexity cloneIDs

PR #36 had some code to filter out low-complexity cloneIDs. The suggested method is to filter a cloneID if len(set(clone_id)) <= 1. This discards cloneIDs that consist of only a single repeated nucleotide, such as AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA.

Maybe we can do a bit better and instead also remove cloneIDs like these (these come from real data):

GGAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAACAAAAAAAAA

A typical way to do this would be to compute the Shannon entropy for the frequency distribution of the characters in the string and to then discard cloneIDs for which the entropy is below a threshold.

To get an idea of how the threshold could be, here are some entropies for real data:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  0.0
AAAAAAAAAAAAAAAAAAAACAAAAAAAAA  0.21084230031853213
GGAAAAAAAAAAAAAAAAAAAAAAAAAAAA  0.35335933502142136
TCTAAAAAAAAAAAAAAAAAAAAAAAAAAA  0.5608251769947301
AAAAAAAAAAAAAAAAAAAAAAATTTCATA  0.7703437707962479
AAGGGGGAAGCAAGAAAATGGCCAAGGGAA  1.5376437149543918
TCTTGCGGCCGAGCTTAAAGTGGAATTTCC  1.9838711132181526

It seems that a threshold of 1 would be a pretty safe bet: We would definitely exclude the cloneIDs with a single repeated nucleotide, but also cover a couple of other cases that very much look like artifacts.

Document `--filter-cellids` file format

Moved from #3

Remove and flag doublets

@acorbat pointed out today that they have observed undesirable clustering results arising from doublets. In the clone graph, these should manifest themselves as vertices that are part of two clusters at the same time.

We can possibly find these by looking for cut vertices in the graph. A cut vertex is a vertex whose removal would lead to an increase in the number of connected components. There exist fast algorithms to find cut vertices.

(Note the similarity to bridges: A bridge is an edge whose removal would lead to an increase in the no. of connected components.)

Plotting clone graph has a division by zero problem

I was running TREX with the plotting option on and it encountered a ZeroDivisionError.

Trex 0.4.dev363+g50e8eb1
Command line arguments: run10x --per-cell --jaccard-threshold 0.4 -s 717 -e$Restricting analysis to 8700 allowed cells
36863 CloneIDs will be ignored during the analysis
[W::hts_idx_load3] The index file is older than the data file: /proj/naiss2$Reading cloneIDs from chrTomato-N:717-746 in /proj/naiss2023-6-78/jingyan/E$Found 107824 reads with usable cloneIDs and UMIs. Skipped 11684 without cel$[W::hts_idx_load3] The index file is older than the data file: /proj/naiss2$Read 98818 reads containing (parts of) the cloneID (55127 full cloneIDs, 11$Detected 77846 molecules (45492 full cloneIDs, 11426 unique)
71564 remain after low-complexity filtering
After cloneID correction, 8545 unique cloneIDs remain
After filtering cloneIDs, 2389 unique cloneIDs remain
Detected 2380 cells
Found 2743 single-read cloneIDs
1511 filtered cells remain
Writing UMI matrix
Removing 0 bridges from the graph
Removing 0 doublets from the graph (first round)
Removing 0 bridges from the graph (second round)
Plotting clone graph
Traceback (most recent call last):
File "/crex/proj/naiss2023-6-78/nobackup/shared_trex/bin/trex", line 8, i$ sys.exit(main()) File "/crex/proj/naiss2023-6-78/nobackup/shared_trex/lib/python3.10/site-$ module.main(args) File "/crex/proj/naiss2023-6-78/nobackup/shared_trex/lib/python3.10/site-$ run_trex( File "/crex/proj/naiss2023-6-78/nobackup/shared_trex/lib/python3.10/site-$ clone_graph.plot(output_dir / "graph", highlight_cell_ids, doublets2)
File "/crex/proj/naiss2023-6-78/nobackup/shared_trex/lib/python3.10/site-$ print(self.dot(highlight_cell_ids, highlight_doublets), file=f)
File "/crex/proj/naiss2023-6-78/nobackup/shared_trex/lib/python3.10/site-$ edge_scaling = (max_width - 1) / math.log(
ZeroDivisionError: float division by zero

Automatic detection of clone_id location throws error in some cases

Error message:

File "TREX/src/trex/bam.py", line 218, in detect_clone_id_location
    if counter.most_common()[0][1] / len(bases) > 0.95:
IndexError: list index out of range

Adjust minimum length if --per-cell is used

I’m not sure we talked about this already, but with per-cell cloneID correction, it appears to me that the minimum required length for a cloneID can be a bit lower. For example, there is a cell in the test dataset that has these cloneIDs:
GATGACTATACCATTTATTGACCGGCGTAC
---------------TATTGACCGGCGTAC
-------------------GACCGGCGTAC
The second and third cloneID are very incomplete (15 and 11 bases), but it’s quite obvious that they represent the same as the first. Have you taken this into account already?

I was considering this scenario. In this case, only the last bases are compared and they are all the same, so they should be corrected to the most complete version.

Originally posted by @acorbat in #36 (comment)

clones.txt contains a clone_id column, but it’s not the clone ID

The clone_id column in clones.txt and clone_sequences.txt is just a consecutive number and not what we refer to as clone ID in the rest of the pipeline and the paper.

Running TREX when samples has been divided into several FASTQ files

I was going through how to run TREX when a single sample was separated into several files. Although it seems to identify every bam file in a folder and put the reads together (#27), there is a single file of cell ids that can be used for this. Although it might be unlikely, it could happen that an allowed cell id for one dataset, should not be allowed in another dataset. I believe that two paired lists of paths and allowed cell id files could be provided to avoid this problem. Is there some other workaround? Am I missing something?

Feature: Having Quality Control plots after running TREX

It would useful to have quality control plots after running TREX. The rationale is that changing parameters for different runs yield different results and it would be useful to have an overview of how well the different steps worked and what the end point as well as intermediate points were.

Add enhancing code of the run10x pos. argument to smartseq2/3 options

New enhancements of the code e.g. the filter to remove cloneIDs with low complexity, the quality control run, the --keep_doublets flag, etc., should be available to run10x AND smartseq2/3. Future additions should immediately made available for all 3 positional arguments, it gets messy otherwise

Reconsider the name of the `annotations` folder and its contents

--min-length flag name and documentation is misleading

The flag name --min-length and its documentation implies that the flag allows to define the minimum length of a cloneID for it to be accepted as cloneID. In reality, the flag defines the minimum overlap between two cloneIDs reqired to calculate their Hamming distance and merge them. Therefore, the flag should be renamed to --min-overlap which is also the name of the variable in the code.

A consistently documented pipeline is missing

A consistent and clear step-by-step pipeline is missing in theREADME.md file. Additionally to the information on how to run the small data set and how to run cellranger for the data from the paper, the following steps should be documented in the proposed order:

documentation on how to use cellranger + cellranger commands (is already there, should be in the beginning and more detailed)
Just as comment, not documentation: in this step single cell data should be analysed and low quality cells, especially doublets, removed. Then a list of remaining cells exported in the format:
"","x"
1, "CellID1"
2, "CellID2"
comment: how to combine clonal with single cell data

Document the usage of the jaccard score

The Jaccard score is an important element of the clone calling process and can be accessed with the --jarccard-threshold-flag. The default threshold score in TREX is 0 which results in the merging of all cells of the same graph. The lack of documentation on the Jaccard score and its impact on clone calling has led to misunderstandings and needs to be improved.

The documentation should show all available arguments

The documentation README.md should include a section with:

all available arguments (what is found with the --help flag
required formats for each argument

Tune low-complexity filtering

Low-complexity filtering currently discards molecules for which the entropy of the full cloneID is below 1.0. There are some decisions that could be made in other ways that could potentially lead to better filtering results.

Use a different threshold
Change what is done with deleted/missing bases (0 and -). As suggested by @acorbat, we could compute entropy on the cloneID without them and rescale entropy.

See #41 (comment)_

Update test?

One of the current tests for TREX is assessing that CloneIDs from previous experiments are read and recovered in the same way. I was trying to add a new function to remove CloneIDs with a single base or a single repeated base. This makes the test fail. Should we change the test or add the option for new versions to only be active if a flag is passed?

The test run only gives error on folder structure when -f flag added

It was reported that running the test dataset works smoothly. However, when the -f flag is added to filter out cells, error messages are shown due to the wrong folder structure of input data (not compressed, ...gene_bc_matrices instead of feature_bc_matrix (see issue #4). This issue should always occur, with or without -f flag

correct_clone_ids looks for CloneIDs throughout the whole dataset, should it be done per cell?

Once all the molecules per cell are computed, Hamming distance between all of them is calculated to correct into a single one if it is less than some threshold. In some cases, odd corrections were performed between CloneIDs found in different cells and artifactual bridges were made generating fictitious clones.

One way of avoiding this, is to calculate Hamming distance only inside the same cell. We can also be quite loose on the parameters for comparing these as it is quite unlikely that we will find something similar in the same cell. After trying this in a couple of our datasets, I noticed that I managed to recover more barcodes in the datasets.

Note:

The correct_clone_id function could be refactored to molecule.py
There are some CloneIDs that look like polyA tails and some very short that could be discarded since the beginning (remove_odd_barcode)

Document that --plot requires Graphviz

... which can be installed with mamba install graphviz, for example.

Usage of the smartseq vs. 10x runs commands unclear

The run10x positional argument allows running trex on 10x Chromium data which have UMIs. Smartseq2 data don't have UMIs and for this purpose the smartseq3 positional argument was added (the name of the argument is wrong and should be smartseq2). However, smartseq3 produces data with UMIs and there is currently no way to run trex on this. Therefore the following should be done:

Smartseq3 positional argument renamed to Smartseq2
help section/documentation updated on what it means to run trex with Smartseq2 argument (no UMIs)
#13

Folder structure is inconsistent

The folder structure of this git repository is inconsistent and confusing. For example,

#12
the folder expected can be moved to /tests/expected
the file chrTomato-N.fa in /tests/data/outs/ should be moved to /tests/data/
folder /tests/data/outs/filtered_gene_bc_matrices should be renamed to /tests/data/outs/filtered_feature_bc_matrices
The files barcodes.tsv, genes.tsv and matrix.mtx in /tests/data/outs/filtered_gene_bc_matrices)/hg38_Tomato-N/ are expected to be compressed (e.g. barcodes.tsv.gz by the algorithm and algorithm gives error if not

Raw umi matrix?

Hello,

I've been looking at the umi matrix of the clonal barcodes outputted from trex run10x on your nature neuro paper. I turn off filter cells and turn on keep single reads. However the output matrix is still filtered... Is there an issue with those parameters? I'm looking to get the most raw umi matrix (including non-called cells).

Best,
Chang

Filtering of short cloneIDs

From https://github.com/frisen-lab/TREX/pull/36/files#r1350416257

@acorbat suggests in #36 to add a command-line option ("--min-bases-detected") for filtering out cloneIDs shorter than a specified length. There is also already the --min-length parameter. The latter is currently used in the following ways:

In correct_clone_ids() and correct_clone_ids_per_cell(): Forwarded to is_similar as min_overlap parameter. The overlap is computed as matches + mismatches (that is, all positions where either sequence has a - or 0 are not counted). Sequence pairs for which the overlap is smaller than min_overlap are considered to be "not similar".
In compute_cells(): Only molecules whose cloneID is at least as long as min_length are added to the cell.

The minimum overlap works different than the minimum length, so it would make some sense to keep min_overlap and min_length separate, that is, to allow them to have different values.

Overall, this would give us three thresholds in three subsequent steps:

Remove molecules with fewer than X detected bases.
Correct cloneIDs if they overlap by at least Y bases.
Remove molecules with fewer than Z detected bases.

Do we need all of these thresholds?

Update CHANGE.md with links

simple...

The --help section needs to be updated

Many of the argument descriptions are not informative enough when calling the --help flag. Often an input format is missing

create the positional argument `Smartseq3` to allow running smartseq3 data with UMI

@Leonievb writes

should be a small issue of parsing a different kind of folder structure

Providing an input folder containing one BAM file per cell does not work

When providing a folder containing several BAM files as input directory for trex smartseq2/3, the following error message occurs:

IsADirectoryError: [Errno 21] could not open alignment file "path/to/folder": Is a directory

The test data are incomplete

The test data should allow the testing and demonstration of the -filtered-cell flag.

An example file should be provided
The test run should allow to add this example file to filter out a few cells and compare results to unfiltered

Dealing with deleted bases in the cloneID

Deleted bases in the cloneID are represented as 0. These bases should always be in the beginning of the sequence, but the deletion can actually be anywhere: Since the reference consists of only Ns, the read mapper cannot say where the deletions actually are, so by default it moves them as far left as they will go.

When comparing cloneIDs with deleted bases, it does not make sense to using Hamming distance. We would need to use a full alignment function in order to get results that make sense.

We could, for example, stop using 0 for deleted bases and just store shorter cloneIDs. When comparing them, we would use an alignment function if their lengths differ.

Add a filtering option for excluding certain cloneIDs

As @acorbat reports, some cloneIDs keep recurring across different experiments and are therefore artifacts that should be filtered out.

His feature request: Add an option that lets us provide a list of cloneIDs (complete or incomplete sequences) to be filtered out.

The output of the clone calling algorithm does not match R output in some cases

In some cases the clone calling algorithm produces unreasonably big clones clumping together many cells that are split into separate clones by the R algorithm (with same parameters)

Reproducing the cloneID raw matrix

Hello,

I'm trying to reproduce the clonal barcode matrix from your paper, and I was wondering if the cellranger annotation is specifically modified? I assume you add the chr-Tomato.fa to the end of the mouse genome and I was wondering what was appended to the GTF file?

Best,
Chang

Output table formats are inconsistent

The following outputs are in tab-separated value format (TSV):

cells_filtered.txt
cells.txt
molecules_corrected.txt
molecules.txt
reads.txt
components.txt
components_corrected.txt

Thes files are in comma-separated value format (CSV):

clone_sequences.txt
clones.txt

If there’s no good reason for this, this should be made consistent. If there is a good reason for clones*.txt to be in CSV format, we should document that.

frisen-lab / trex Goto Github PK

trex's Introduction

Overview

Installation

Changelog

Running TREX on a minimal test dataset

Running TREX on data from the Nature Neuroscience paper

Data

Prerequisites

Downloads reads

Run Cell Ranger

Run TREX

Running TREX

Pipeline steps overview

Input parameters

Jaccard index threshold

Filter cellids

Filtering cloneIDs

Output files

log.txt

entries.bam

reads.txt

molecules.txt

molecules_filtered.txt

molecules_corrected.txt

cells.txt

cells_filtered.txt

umi_count_matrix.csv

read_count_matrix.csv

components.txt

components_corrected.txt

graph.pdf, graph_corrected.pdf

doublets.txt

clones.txt

clone_sequences.txt

data.loom