scverse / scirpy
A scanpy extension to analyse single-cell TCR and BCR data.
Home Page: https://scirpy.scverse.org/en/latest/
License: BSD 3-Clause "New" or "Revised" License
In GitLab by @szabogtamas on Feb 6, 2020, 12:02
Maybe this is already available in Scanpy, and I recall something similar is implemented in Seaborn.
I think that a key step towards prettier plots would be to fix the figure size.
For most publication scenarios a 3.44 x 2.58 inch figure is a safe choice (a single text column in most journals); it resolves to roughly the 1050x768 pixel standard, which is also good enough for presentations.
This figure size also works well if individual figures will realistically become subplots of a multi-panel figure later: if the figure looks nice at this size, chances are good that something will still be visible when the area is reduced to a quarter.
Font size and spacing could be optimized for this size.
Later, this could be called a "small image" profile, and a full-page version could also be added.
This could also be something we store in uns: the plotting function could check whether anything was passed explicitly, then whether the value is set in uns, and otherwise fall back to matplotlib defaults.
A separate profile could be set up for colors.
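A minimal sketch of how such a profile lookup could work. The profile names, the `figure_profile` key, and the helper name are all hypothetical; only the precedence (explicit argument > value in uns > matplotlib default) comes from the idea above:

```python
import matplotlib

# Hypothetical size profiles; "small" targets a single journal column.
FIGURE_PROFILES = {
    "small": {"figsize": (3.44, 2.58), "fontsize": 8},
    "full_page": {"figsize": (7.2, 5.4), "fontsize": 10},
}

def resolve_figsize(figsize=None, uns=None):
    """Resolve the figure size with the proposed precedence:
    explicit argument > profile stored in `uns` > matplotlib default."""
    if figsize is not None:
        return tuple(figsize)
    if uns is not None and "figure_profile" in uns:
        return FIGURE_PROFILES[uns["figure_profile"]]["figsize"]
    return tuple(matplotlib.rcParams["figure.figsize"])
```

A plotting function would then call `resolve_figsize(figsize, adata.uns)` before creating the figure.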
In GitLab by @grst on Mar 31, 2020, 13:30
In GitLab by @szabogtamas on Feb 10, 2020, 13:00
When computing convergence, it seems plausible to add the convergence value (let us say log(1/singleton rate) for now) for each single cell, given the grouping, to obs, and not only the singleton/duplicate/triplicate rates to uns.
The case would be similar for alpha diversity scores and maybe also clonal expansion.
If we add it, what rule should we follow for column names?
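A sketch of how the per-cell value could be computed, using the log(1/singleton rate) definition proposed above. The function name and the column names (`group`, `cdr3_aa`, `cdr3_nt`, `chain_convergence`) are hypothetical:

```python
import numpy as np
import pandas as pd

def chain_convergence_per_cell(obs, groupby="group"):
    """Add log(1/singleton_rate) of each cell's group as an obs column.

    singleton_rate: fraction of CDR3 amino-acid sequences in the group
    that are encoded by exactly one nucleotide version.
    """
    def group_convergence(df):
        n_nt_versions = df.groupby("cdr3_aa")["cdr3_nt"].nunique()
        singleton_rate = (n_nt_versions == 1).mean()
        return np.log(1 / singleton_rate)

    conv = obs.groupby(groupby)[["cdr3_aa", "cdr3_nt"]].apply(group_convergence)
    obs = obs.copy()
    obs["chain_convergence"] = obs[groupby].map(conv)
    return obs
```

The group-level rates could still go to uns as before; this only adds the broadcast per-cell column.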
The original issue (Id: 4, Title: "Describe rationale of clonotype definition") could not be created.
This is a dummy issue, replacing the original one. It contains everything but the original issue description. In case the GitLab repository still exists, visit the following link to show the original issue:
TODO
In GitLab by @grst on Jan 30, 2020, 12:59
Tools are functions that work with the data parsed from 10x/tracer and add results to either obs, obsm (e.g. distance matrices), or uns. They are usually required as an additional processing step before running certain plotting functions.
Here's a list of tools we want to implement.
@szabogtamas, feel free to add to/edit the list.

- st.tl.define_clonotypes(adata): assigns clonotypes to cells based on their CDR3 sequences
- st.tl.tcr_dist(adata, chains=["TRA_1", "TRB_1"], combination=np.min): adds TCR dist to obsm (#11)
- st.tl.kidera_dist: adds Kidera distances to obsm
- st.tl.chain_convergence(adata, groupby): adds a column to obs that contains the number of nucleotide versions for each CDR3 AA sequence
- st.tl.alpha_diversity(adata, groupby, diversityforgroup): so far we were only thinking about calculating the diversity of clonotypes in different groups, but the diversity of any group could just as well be calculated
- st.tl.sequence_logos(adata, ?forgroup?): precompute MSAs and sequence logos for plotting with st.pl.sequence_logos
- st.tl.dendrogram(adata, groupby): compute a dendrogram on an arbitrary distance matrix (e.g. from tcr_dist)
- st.tl.create_group(group_membership={Group1: ['barcode1', 'barcode2']}): adds a group membership to each cell by adding a column to obsm, and the name of the grouping to a list in uns (by default, groups based on samples, V gene usage and even clonotypes could be created at the initial run); might call the chain_convergence and alpha_diversity functions to calculate these measures right when creating a group
- tcellmatch (Fischer, Theis et al.)

In GitLab by @szabogtamas on Feb 11, 2020, 16:00
I think we need to think again about how to compute the convergence and how useful it is in the context of single-cell data.
In our example dataset, it appears to occur at negligible rates and doesn't make sense to visualize as a barplot.
Finally, I believe it measures something similar to TCR-dist.
Ideas:
The CDR3 convergence is currently hidden from the public API, but the preliminary code is still in _plotting._cdr_convergence.
In GitLab by @grst on Mar 20, 2020, 18:16
In GitLab by @grst on Mar 18, 2020, 15:42
Even though we cannot get all information from the csv, we should still support it - some of the public datasets do not provide the json files.
In GitLab by @grst on Nov 18, 2019, 16:53
Hi @szabogtamas,
the new pipeline looks very nice and I got to run it on the example data without problems.
I now adjusted main.nf a bit to use yml files, so it is independent of the environment in your home directory.
When I try to run it on the data from the Vanderburg study, I get the following error. Do you have any idea? Otherwise, let's look at it together when you are back!
Best, Gregor
Error executing process > 'callClonotypes (1)'
Caused by:
Process `callClonotypes (1)` terminated with an error exit status (1)
Command executed:
callClonotype.py mergedCDRs.tsv clonotypeTable.tsv additionalCellInfo.tsv chainConvergence.tsv chainMap.tsv chainPairs.tsv chainNet.tsv inToDiv.txt inToDist.txt
Command exit status:
1
Command output:
(empty)
Command error:
Traceback (most recent call last):
File "/home/sturm/projects/2019/singlecell_tcr/bin/callClonotype.py", line 325, in <module>
__main__()
File "/home/sturm/projects/2019/singlecell_tcr/bin/callClonotype.py", line 19, in __main__
callClonotypesAndChainPairs(cdrF, clonInfo, cellInfo, convergenceTab, chainMap, chainPairs, chainNet, seqsToDiv, seqsToDis, distance_method)
File "/home/sturm/projects/2019/singlecell_tcr/bin/callClonotype.py", line 35, in callClonotypesAndChainPairs
chainMapping, cdrMapping, convergenceTable, clonoFreq, cellChainTable, cellInfoTable, chainsForDiversity, chainsForDist = renameChain(cdrF, chainMapping, cdrMapping, convergenceTable, clonoFreq, cellChainTable, cellInfoTable, chainsForDiversity, chainsForDist, distance_method)
File "/home/sturm/projects/2019/singlecell_tcr/bin/callClonotype.py", line 172, in renameChain
cellInfoTable[cellNames] = [clonoSign, cellNames[:cellNames.rindex('_')+1], chainLinks]
ValueError: substring not found
Work dir:
/home/sturm/projects/2019/singlecell_tcr/work/b7/ef5d9c141aeef83d23efc80f827746
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
In GitLab by @grst on Mar 27, 2020, 09:46
Everyone seems to use these scatterplots:
Possible interface
ir.pl.scatter(adata, data_col, x_value, y_value, color)
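A minimal sketch of what the proposed interface could look like. Only the signature comes from the issue; the implementation, and using a plain DataFrame in place of the `adata`/`data_col` lookup, are assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch
import matplotlib.pyplot as plt
import pandas as pd

def scatter(df, x_value, y_value, color=None, ax=None):
    """Scatter two columns of a data frame against each other.

    In scirpy, `df` would be looked up in `adata` via `data_col`;
    here a plain DataFrame stands in for simplicity.
    """
    if ax is None:
        _, ax = plt.subplots()
    c = df[color] if color is not None else None
    ax.scatter(df[x_value], df[y_value], c=c)
    ax.set_xlabel(x_value)
    ax.set_ylabel(y_value)
    return ax
```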
In GitLab by @grst on Mar 21, 2020, 13:32
In GitLab by @grst on Jan 24, 2020, 13:50
Getting the alignments was quite straightforward, thanks to parasail.
We said that we focus on CDR3, at least initially.
@szabogtamas, some points to discuss. Does that make sense?
Edit: prototype works. ToDo for now:
- primary_only or all chains
In GitLab by @grst on Mar 20, 2020, 18:17
Maybe also the STARTRAC indexes could be of interest: https://www.ncbi.nlm.nih.gov/pubmed/30479382
The original issue (Id: 10, Title: "Supported T cell receptor types") could not be created.
This is a dummy issue, replacing the original one. It contains everything but the original issue description. In case the GitLab repository still exists, visit the following link to show the original issue:
TODO
In GitLab by @grst on Mar 30, 2020, 14:53
In GitLab by @grst on Mar 28, 2020, 12:18
In GitLab by @grst on Jan 24, 2020, 14:06
Here's a quite recent re-implementation of sequence logos in Python that looks promising:
https://github.com/jbkinney/logomaker
We will also require an algorithm for multiple sequence alignment in addition to the pairwise one that we have already.
Things to discuss:
In GitLab by @szabogtamas on Mar 30, 2020, 07:05
Add the possibility to analyse the overlap of the samples based on the clonotype network.
What I picture here is a heatmap of pairwise Hamming distances, where the binary string encodes the presence or absence of a given sample in a clonotype cluster.
I could not find a relevant function in Scanpy, but it does not seem very complicated with scipy.spatial.distance.hamming.
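A sketch of the computation described above (the function and column names are assumptions): build a binary sample-by-cluster presence matrix, then let scipy compute the pairwise Hamming distances:

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

def sample_overlap(obs, sample_col="sample", cluster_col="ct_cluster"):
    """Pairwise Hamming distances between samples, based on the
    presence/absence of each sample in each clonotype cluster."""
    # Binary matrix: rows = samples, columns = clonotype clusters
    presence = (pd.crosstab(obs[sample_col], obs[cluster_col]) > 0).astype(int)
    dist = squareform(pdist(presence.values, metric="hamming"))
    return pd.DataFrame(dist, index=presence.index, columns=presence.index)
```

The resulting square DataFrame could be handed directly to a heatmap function (e.g. seaborn's clustermap).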
In GitLab by @grst on Mar 20, 2020, 18:16
The current one, sctcrpy, is hard to pronounce and remember.
Also, it would be nice if the name left the option to expand to BCRs later on.
imm, sc, py, receptor, cr, ... ??
In GitLab by @grst on Mar 13, 2020, 11:56
Just discovered this on GitHub:
It has quite similar concepts (scanpy extension)... on the other hand, it doesn't really seem to be promoted a lot, and it lacks some functionality that we have.
I do like their way of visualizing shared CDR3 sequences and that they allow finding 'public' and 'private' epitopes.
Maybe we can draw some more inspiration from there.
In GitLab by @grst on Mar 21, 2020, 11:45
In GitLab by @grst on Jan 17, 2020, 14:32
In GitLab by @grst on Nov 19, 2019, 14:00
Hi @szabogtamas,
after fixing the sample prefixes, the pipeline ran fine on two samples of the Vanderburg study.
However, when including all samples (50k cells), a certain object appears to become too large for multiprocessing (see the error message below).
This is just to keep track of the error; we can discuss it next week and I can probably fix it myself.
This could help, but there's probably an even better way to deal with this:
https://stackoverflow.com/questions/29704139/pickle-in-python3-doesnt-work-for-large-data-saving
Error executing process > 'kideraDistances (1)'
Caused by:
Process `kideraDistances (1)` terminated with an error exit status (1)
Command executed:
chainBasedCellDistanceCalculations.py kidera chainKideras.tsv kideraDistanceMatrix.h5
chainBasedCellDistanceCalculations.py celldist kideraDistanceMatrix.h5 chainPairs.tsv minimum 8
Command exit status:
1
Command output:
<KeysViewHDF5 ['distances', 'names']>
Command error:
Traceback (most recent call last):
File "/home/sturm/projects/2019/singlecell_tcr/bin/chainBasedCellDistanceCalculations.py", line 258, in <module>
__main__()
File "/home/sturm/projects/2019/singlecell_tcr/bin/chainBasedCellDistanceCalculations.py", line 26, in __main__
cellDistances, cells = calculateCellDistanceFromChains(condensedDistances, chains, chainsOnCells, disambiguation=disambiguation, numCores=numCores)
File "/home/sturm/projects/2019/singlecell_tcr/bin/chainBasedCellDistanceCalculations.py", line 82, in calculateCellDistanceFromChains
M = p.map(checkCellDistance, itertools.zip_longest(itertools.combinations(cellnames, 2), [], fillvalue=(condensedDistances, chainsOnCells, posDict, disambiguation, L)), chunksize=chunkSize)
File "/home/sturm/projects/2019/singlecell_tcr/work/conda/tcrpy3-50638e2f3e2662bd69be51301015f0d3/lib/python3.7/multiprocessing/pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/sturm/projects/2019/singlecell_tcr/work/conda/tcrpy3-50638e2f3e2662bd69be51301015f0d3/lib/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
File "/home/sturm/projects/2019/singlecell_tcr/work/conda/tcrpy3-50638e2f3e2662bd69be51301015f0d3/lib/python3.7/multiprocessing/pool.py", line 431, in _handle_tasks
put(task)
File "/home/sturm/projects/2019/singlecell_tcr/work/conda/tcrpy3-50638e2f3e2662bd69be51301015f0d3/lib/python3.7/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/home/sturm/projects/2019/singlecell_tcr/work/conda/tcrpy3-50638e2f3e2662bd69be51301015f0d3/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB
Work dir:
/home/sturm/projects/2019/singlecell_tcr/work/c0/c9bf36401672277dae2d9edac9d8b4
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
In GitLab by @grst on Oct 3, 2019, 10:35
Python 2 will no longer be supported after the end of the year.
We should port the code to Python 3.7. It should not be too hard.
In GitLab by @grst on Mar 20, 2020, 18:26
In GitLab by @grst on Mar 31, 2020, 10:23
ToDo CI:
In GitLab by @grst on Mar 20, 2020, 18:35
In particular, double-check all docstrings for
At that occasion:
In GitLab by @szabogtamas on Feb 11, 2020, 14:51
In the stacked curve mode of spectratype plotting, the fill areas are shifted because I cannot find a way to set bins for seaborn's kdeplot. Maybe we could move to scipy or scikit.
In GitLab by @szabogtamas on Feb 19, 2020, 14:04
In publications, people commonly refer to CDR3s shared by most individuals in a group (e.g. patients) as a public chain. We should also consider whether we want to offer a function to detect public CDRs, and whether it should just return a list or try to visualize something.
Another consideration is whether or not we should create a metric that tells us how public a CDR3 is (maybe the diversity of the grouping variable among the cells having that specific CDR3).
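One way such a metric could look, using the Shannon entropy of the grouping variable (e.g. patient) among the cells carrying a given CDR3. The function and column names are hypothetical:

```python
import numpy as np
import pandas as pd

def cdr3_publicity(obs, cdr3_col="cdr3", group_col="patient"):
    """Shannon entropy of the group labels among cells with each CDR3.

    0 = fully private (all cells from one group);
    higher values = more public."""
    def entropy(labels):
        p = labels.value_counts(normalize=True).values
        return float(-(p * np.log(p)).sum())

    return obs.groupby(cdr3_col)[group_col].apply(entropy)
```

A simple "public CDR3" detector could then just threshold this score, or count the number of distinct groups per CDR3.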
In GitLab by @grst on Jan 17, 2020, 16:52
In GitLab by @grst on Jan 30, 2020, 11:18
This issue is about implementing basic plotting functions that can be re-used to plot any column in obs by a group.
sc.pl.violin already exists in scanpy.
Example:
sc.pl.violin(adata, ["TRA_1_cdr3_len", "TRA_1_junction_ins"], groupby="leiden")
Do we also want ... ?
In GitLab by @grst on Feb 14, 2020, 16:16
Mirror the scanpy workflow:
- tcr_dist is the equivalent of neighbors

Then, the following tools:
- tl.tcr_umap
- st.tl.dendrogram(adata, groupby): compute a dendrogram on an arbitrary distance matrix (e.g. from tcr_dist)

And the following plots:
- pl.tcr_umap
- pl.dendrogram
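A sketch of how the dendrogram tool could work on a precomputed square distance matrix (e.g. from tcr_dist), using scipy's hierarchical clustering. The wiring to AnnData and the function name are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

def compute_dendrogram(dist_matrix, labels=None, method="average"):
    """Hierarchical clustering on an arbitrary square distance matrix.

    Returns the dendrogram dict without drawing anything, so that
    a separate plotting function can render it later."""
    condensed = squareform(dist_matrix, checks=False)
    Z = linkage(condensed, method=method)
    return dendrogram(Z, labels=labels, no_plot=True)
```

Splitting the computation (`tl`) from the rendering (`pl`) mirrors the scanpy convention discussed above.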
In GitLab by @grst on Feb 13, 2020, 16:03
I have just been using sctcrpy with Sandro to analyse some TCR data.
He could use the library on his own without a lot of help from my side, except for plotting different subsets of adata (all plots looked the same because it re-used the precomputed values). And it is indeed a bit unintuitive...
Maybe we should reconsider the need for a tool for each plotting function, at least for the cases where it can be computed inexpensively.
In GitLab by @szabogtamas on Feb 6, 2020, 11:22
I previously found it a great help if tools/functions also accepted standard Python data structures, not only an object defined by a library that might not otherwise even be used.
Of course, scanpy is already robust and there is no real danger that we would need this backdoor, so in this particular case I can go along with accepting AnnData objects only.
The question is rather whether we want to stick to the scanpy ecosystem with our tool, or try to make some parts work standalone as well.
We can think about it later.
In GitLab by @grst on Jan 29, 2020, 18:07
Add unit tests for the IO functions.
Check values with two sample entries and verify manually.
The original issue (Id: 9, Title: "List of plots") could not be created.
This is a dummy issue, replacing the original one. It contains everything but the original issue description. In case the GitLab repository still exists, visit the following link to show the original issue:
TODO
In GitLab by @szabogtamas on Feb 10, 2020, 12:10
Chain pairing stats can be visualized by the group abundance plotting function. It will not check whether the tool was run, and it also uses a specific set of settings. Does it make sense to create a function for it that is actually only a wrapper?
In GitLab by @szabogtamas on Feb 6, 2020, 11:09
There are lots of cells without a TCR and thus no clonotype assigned. Should we exclude them from the base when calculating clusters, or include them as a fraction? Include them on the plot?
Pro: it is cleaner to show fractions of the categorized cells.
Con: they are there...
The original issue (Id: 24, Title: "VDJ usage as Sankey plot") could not be created.
This is a dummy issue, replacing the original one. It contains everything but the original issue description. In case the GitLab repository still exists, visit the following link to show the original issue:
TODO
In GitLab by @grst on Mar 20, 2020, 18:19
In GitLab by @grst on Mar 17, 2020, 17:29
Should be really straightforward.
https://rawgit.com/ztane/python-Levenshtein/master/docs/Levenshtein.html
On the other hand, it should be equivalent to a global pairwise sequence alignment with an identity matrix and a gap penalty of 1.
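The equivalence noted above can be seen in code: the dynamic-programming recurrence for a global alignment with an identity matrix (mismatch cost 1) and gap penalty 1 is exactly the Levenshtein recurrence. A minimal sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the global-alignment DP: at each cell, take the
    cheapest of a match/substitution (diagonal move) or a gap in either
    sequence (left/up moves), each gap costing 1."""
    prev = list(range(len(b) + 1))  # first row: gaps only
    for i, ca in enumerate(a, 1):
        curr = [i]  # first column: gaps only
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j - 1] + (ca != cb),  # match (0) or mismatch (1)
                prev[j] + 1,               # gap in b
                curr[j - 1] + 1,           # gap in a
            ))
        prev = curr
    return prev[-1]
```

In practice, the python-Levenshtein C implementation linked above should be considerably faster than either this sketch or a general alignment routine.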
In GitLab by @szabogtamas on Feb 11, 2020, 10:36
In the case of spectratypes, I am using pseudocounts to draw nicer KDE curves if fraction is true.
Does a pseudocount make sense for a stripplot or boxplot?
The original issue (Id: 31, Title: "Fix KDE/curve plot") could not be created.
This is a dummy issue, replacing the original one. It contains everything but the original issue description. In case the GitLab repository still exists, visit the following link to show the original issue:
TODO
In GitLab by @grst on Jan 19, 2020, 17:40
adata.obs columns are converted to categorical on plotting. None and NaN might be turned into str(nan) in that case, which breaks pd.isna etc.
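A small demonstration of the problem in pure pandas (that the conversion goes through astype(str) is an assumption about where it happens):

```python
import numpy as np
import pandas as pd

s = pd.Series(["TRA", np.nan, "TRB"])
assert pd.isna(s).sum() == 1  # NaN is still detectable

# Converting to str before making the column categorical turns
# NaN into the literal string "nan" ...
cat = s.astype(str).astype("category")

# ... which pd.isna no longer recognizes as missing:
assert pd.isna(cat).sum() == 0
assert "nan" in cat.cat.categories

# Converting directly to categorical keeps NaN intact:
assert pd.isna(s.astype("category")).sum() == 1
```

So converting directly to "category" (without the intermediate str step) would avoid the breakage.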
In GitLab by @grst on Feb 28, 2020, 12:56
With !10, the basic functionality to compute distances was added.
Next steps:
Mirror the scanpy workflow:
- neighbors to compute adjacency matrices
- leiden to cluster = "define clonotypes"
- umap to visualize the similarity of clonotypes

In GitLab by @grst on Mar 16, 2020, 15:08
Include demo datasets as part of the package (sctcrpy.datasets), similar to how it is done in scanpy.
Nice datasets to include:
This is a prerequisite for making a nice tutorial.
In GitLab by @grst on Mar 27, 2020, 14:14
In GitLab by @grst on Nov 27, 2019, 11:32
Hi @szabogtamas,
if you still find some time, it would be great if you could add documentation to at least the most important functions, explaining the input arguments and return values. This would make it a lot easier for me to maintain the project.
I would recommend sticking to numpydoc. There's a great example here:
def foo(var1, var2, long_var_name='hi'):
"""A one-line summary that does not use variable names.
Several sentences providing an extended description. Refer to
variables using back-ticks, e.g. `var`.
Parameters
----------
var1 : array_like
Array_like means all those objects -- lists, nested lists, etc. --
that can be converted to an array. We can also refer to
variables like `var1`.
var2 : int
The type above can either refer to an actual Python type
(e.g. ``int``), or describe the type of the variable in more
detail, e.g. ``(N,) ndarray`` or ``array_like``.
long_var_name : {'hi', 'ho'}, optional
Choices in brackets, default first when optional.
Returns
-------
type
Explanation of anonymous return value of type ``type``.
describe : type
Explanation of return value named `describe`.
out : type
Explanation of `out`.
type_without_description
Other Parameters
----------------
only_seldom_used_keywords : type
Explanation
common_parameters_listed_above : type
Explanation
Raises
------
BadException
Because you shouldn't have done that.
See Also
--------
numpy.array : Relationship (optional).
numpy.ndarray : Relationship (optional), which could be fairly long, in
which case the line wraps here.
numpy.dot, numpy.linalg.norm, numpy.eye
Notes
-----
Notes about the implementation algorithm (if needed).
This can have multiple paragraphs.
You may include some math:
.. math:: X(e^{j\omega } ) = x(n)e^{ - j\omega n}
And even use a Greek symbol like :math:`\omega` inline.
References
----------
Cite the relevant literature, e.g. [1]_. You may also cite these
references in the notes section above.
.. [1] O. McNoleg, "The integration of GIS, remote sensing,
expert systems and adaptive co-kriging for environmental habitat
modelling of the Highland Haggis using object-oriented, fuzzy-logic
and neural-network techniques," Computers & Geosciences, vol. 22,
pp. 585-588, 1996.
Examples
--------
These are written in doctest format, and should illustrate how to
use the function.
>>> a = [1, 2, 3]
>>> print([x + 3 for x in a])
[4, 5, 6]
>>> print("a\nb")
a
b
"""
# After closing class docstring, there should be one blank line to
# separate following codes (according to PEP257).
# But for function, method and module, there should be no blank lines
# after closing the docstring.
pass
In GitLab by @grst on Mar 20, 2020, 18:27