krishnaswamylab / multiscale_phate

Creating multi-resolution embeddings and clusters from high dimensional data

License: GNU General Public License v3.0

Shell 0.03% Python 6.14% Jupyter Notebook 93.83%

multiscale_phate's Introduction

Multiscale PHATE

Multiscale PHATE is a Python package for multiresolution analysis of high dimensional data. For an in-depth explanation of the algorithm and its applications, please read our manuscript in Nature Biotechnology.

The biomedical community is producing increasingly high dimensional datasets integrated from hundreds of patient samples that current computational techniques are unable to explore across granularities. To visualize, cluster and analyze massive datasets across granularities, we created Multiscale PHATE. The goal of Multiscale PHATE is to learn and visualize abstract cellular features and groupings of the data at all levels of granularity in an efficient manner to identify meaningful biological relationships and mechanisms. Our approach learns a tree of data granularities which can be cut at coarse levels for high level summarizations of data as well as at fine levels for detailed representations on subsets.

Overview of Algorithm:

Our algorithm integrates the dimensionality reduction technique PHATE with the multigranular analysis tool diffusion condensation. First, the non-linear diffusion manifold is calculated using PHATE. Then diffusion condensation takes this manifold-intrinsic diffusion space and slowly condenses data points towards local centers of gravity to form natural, data-driven groupings across multiple granularities. Each of these granularities can then be visualized.
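The condensation step described above can be illustrated with a toy sketch. This is not the package's implementation: it works directly on raw coordinates rather than on a PHATE diffusion space, uses a fixed Gaussian bandwidth, and the names `toy_diffusion_condensation`, `epsilon` and `merge_threshold` are hypothetical choices made for this example.

```python
import numpy as np

def toy_diffusion_condensation(X, epsilon=0.5, merge_threshold=1e-3, n_iter=10):
    """Return one cluster-label array per iteration (index 0 = every point alone)."""
    X = np.asarray(X, dtype=float).copy()
    labels = np.arange(len(X))
    history = [labels.copy()]
    for _ in range(n_iter):
        # Gaussian affinities from pairwise squared distances.
        sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        kernel = np.exp(-sq_dists / epsilon)
        # Row-normalize into a diffusion (Markov) operator, then move every
        # point to the weighted average of its neighbors.
        diffusion_op = kernel / kernel.sum(axis=1, keepdims=True)
        X = diffusion_op @ X
        # Points that have collapsed onto each other now share a label.
        dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
        close_i, close_j = np.where(np.triu(dists < merge_threshold, k=1))
        for i, j in zip(close_i, close_j):
            labels[labels == labels[j]] = labels[i]
        history.append(labels.copy())
    return history

# Two well-separated synthetic blobs of 20 points each (illustrative data only).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=10.0, scale=0.1, size=(20, 2))
history = toy_diffusion_condensation(np.vstack([blob_a, blob_b]))
clusters_per_iteration = [len(np.unique(labels)) for labels in history]
```

Each entry of `history` is one granularity: early entries are fine-grained, later entries are coarse. Keeping the label history rather than only the final labels is what allows the hierarchy to be cut at any level afterwards.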

Using gradient analysis, which looks at shifts in data density during successive iterations of the diffusion condensation process, we can identify stable resolutions of the hierarchical tree for downstream analysis. With this stability information, we can cut the hierarchical tree at multiple resolutions to produce visualizations and clusters across granularities for downstream analysis.
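One simple way to picture the idea of a "stable resolution" is as an iteration where the amount of change between successive condensation steps reaches a local minimum. The sketch below assumes that such a per-iteration change signal is already available as a 1-D array; the function name `stable_levels` and the local-minimum criterion are illustrative assumptions, not the package's actual gradient analysis.

```python
import numpy as np

def stable_levels(change_per_iteration):
    """Return indices of interior local minima of a per-iteration change signal."""
    g = np.asarray(change_per_iteration, dtype=float)
    interior = np.arange(1, len(g) - 1)
    # Strictly lower than the previous step and no higher than the next one,
    # so the first point of a flat valley is reported once.
    is_minimum = (g[interior] < g[interior - 1]) & (g[interior] <= g[interior + 1])
    return interior[is_minimum]

# Hypothetical change signal: large values mean the data is still reorganizing.
change = [5.0, 3.0, 0.5, 2.0, 4.0, 1.0, 0.2, 0.9, 3.0]
levels = stable_levels(change)
```

In this made-up signal, iterations 2 and 6 are the quiet points at which cutting the tree would give clusterings that persist across neighboring iterations.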

By identifying multiple resolutions, Multiscale PHATE enables users to interact with their data and zoom in on cellular subsets of interest to reveal increasingly granular information about cell types and subtypes.

While this may sound computationally inefficient, we show that we are able to perform these calculations as well as visualize and cluster the data significantly faster than “single-scale” visualization techniques like tSNE, UMAP or PHATE, allowing the analysis of millions of cells within minutes. When combined with other computational algorithms for high dimensional data analysis, such as MELD and DREMI, Multiscale PHATE is able to provide deep and detailed insights into biological processes.

Installation

Multiscale PHATE can be installed with pip directly from GitHub. Install by running the following in a terminal:

pip install --user git+https://github.com/KrishnaswamyLab/Multiscale_PHATE

Quick Start

import multiscale_phate
import scprep  # plotting helper used below

# X is your input data matrix
mp_op = multiscale_phate.Multiscale_PHATE()
mp_embedding, mp_clusters, mp_sizes = mp_op.fit_transform(X)

# Plot optimal visualization
scprep.plot.scatter2d(mp_embedding, s=mp_sizes, c=mp_clusters,
                      fontsize=16, ticks=False, label_prefix="Multiscale PHATE", figsize=(16, 12))

Guided Tutorial

For more details on using Multiscale PHATE, see our guided tutorial using 10X's public PBMC4k dataset.

multiscale_phate's Issues

please update pandas dependency

It causes errors with pickle files from pandas 1.4 and later.

Suggestion for parameters

Dear authors,

I tried using this package on CyTOF data of ~3 million cells, but it took a very long time. Do you have any recommendations or rules of thumb for the parameters?

Thanks in advance.

Mikhael

Slow/stalled run of multiscale phate on large datasets using python 3.7.6

Hello,

I am having trouble running multiscale phate using a dataset with size (>200000, 20).

My code:

"
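import pandas as pd
import multiscale_phate

# load_matrix and load_vector are the reporter's own helpers (not shown in the report)
data = load_matrix('data_filename.txt')
stats = load_vector('stations_filename.txt')
unigenes = load_vector('genes_filename.txt')
df = pd.DataFrame(data, columns=stats, index=unigenes)
n_cores = 40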

print(type(df))
print(df.shape)
mp_op = multiscale_phate.Multiscale_PHATE(n_jobs=n_cores, random_state=1)
levels = mp_op.fit(df)
"

Output:
"
<class 'pandas.core.frame.DataFrame'>
(274774, 20)
Calculating Multiscale PHATE tree...
Calculating PCA...
Calculated PCA in 0.33 seconds.
Calculating partitions...
"
PCA finishes quickly, but the algorithm then appears to stall while calculating partitions: I let it run for several days on multiple cores (which I verified were in use), but the job is still at the same point ('Calculating partitions...').

From your examples, multiscale PHATE should run much faster than this on this kind of dataset.

My dataset is:
rows: gene expression (scaled between 0 and 1, with many zeros)
columns: different samples

Let me know if you have some insights on this.

Thank you

Installation Fails due to UnicodeDecodeError when Reading README.md

I am trying to install the Multiscale_PHATE package on my Windows machine using pip, but the installation fails when preparing metadata with a UnicodeDecodeError. Here's the error message I receive:

Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "C:\Users\WKS\Desktop\Alin\GitHub\Multiscale_PHATE\setup.py", line 23, in
readme = open("README.md").read()
File "C:\Users\WKS\anaconda3\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3607: character maps to <undefined>

The error seems to occur when the setup.py script tries to open and read the README.md file, and there's a character at position 3607 that it can't decode using the cp1252 encoding.

I have tried changing my system's default encoding to UTF-8 and even manually saving the README.md file in UTF-8 encoding, but the error persists.

I would appreciate any help or guidance you can provide to resolve this issue.

Thank you,
alciobanu
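The traceback above shows `open("README.md").read()` falling back to the Windows default cp1252 codec. Byte 0x9d is the last byte of a UTF-8 encoded curly closing quote (U+201D), and cp1252 has no mapping for it. A likely remedy, offered here as an assumption rather than a confirmed maintainer fix, is to pass `encoding="utf-8"` explicitly when reading the file. The sketch below reproduces the failure and the workaround on a temporary file; the sample text is made up for the example.

```python
import os
import tempfile

# U+201D encodes to the UTF-8 bytes E2 80 9D; 0x9D is undefined in cp1252.
text = "faster than \u201csingle-scale\u201d visualization techniques"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "README.md")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)

    # Reading with cp1252 (the Windows default in the report) fails.
    try:
        with open(path, encoding="cp1252") as fh:
            fh.read()
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # Reading with an explicit UTF-8 encoding succeeds on any platform.
    with open(path, encoding="utf-8") as fh:
        readme = fh.read()
```

Because the codec is chosen by the `open` call inside setup.py, re-saving the README as UTF-8 does not help on its own, which is consistent with the behavior described in the report.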

Multiscale Phate within PhateR?

Hi,

Is it possible to include Multiscale PHATE in the R library (phateR)?

thank you

subset_data does not converge

Hi, I have a large dataset (>100k samples) that contains a lot of duplicates.
MSPHATE does not converge during the Calculating partitions... step.

I can't share the dataset in question, but I think I replicated the effect with some randomly generated data. See the following code and output:

import numpy as np
from multiscale_phate import compress, diffuse, condense

np.random.seed(42)

# spoof data
data = np.random.uniform(size=(10001, 200))
data = np.vstack([data, data, data, data, data, data, data, data, data, data])  # highly redundant

# spoof MSPHATE compress step
N, features = data.shape
n_pca = 200
partitions = None

# Computing compression features
n_pca, partitions = compress.get_compression_features(
    N, features, n_pca, partitions, landmarks=2000
)

# modified to display np.max(cluster_counts) and np.ceil(N / desired_num_clusters)
_ = compress.subset_data(data, desired_num_clusters=partitions, n_jobs=8, num_cluster=100, random_state=None)

output:

Calculating partitions...
np.max(cluster_counts):  3930
np.ceil(N / desired_num_clusters):  6.0
np.max(cluster_counts):  1120
np.ceil(N / desired_num_clusters):  6.0
np.max(cluster_counts):  70
np.ceil(N / desired_num_clusters):  6.0
np.max(cluster_counts):  10
np.ceil(N / desired_num_clusters):  6.0
np.max(cluster_counts):  10
np.ceil(N / desired_num_clusters):  6.0
np.max(cluster_counts):  10
np.ceil(N / desired_num_clusters):  6.0
np.max(cluster_counts):  10

The output is the same after many iterations.

Note: I am using python 3.8 and installed using pip install git+https://github.com/KrishnaswamyLab/Multiscale_PHATE
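In the output above, the largest cluster stalls at 10 members, which matches the tenfold duplication of each row and stays above the target of 6. Identical rows cannot be separated by further splitting, so one possible workaround is to collapse exact duplicates before partitioning and expand the results afterwards. This is an untested assumption about what would help with `subset_data`, and `deduplicate_rows` is a hypothetical helper, not part of the package.

```python
import numpy as np

def deduplicate_rows(data):
    """Return the unique rows and, for every original row, its index into them."""
    unique_rows, inverse = np.unique(data, axis=0, return_inverse=True)
    # ravel() keeps the inverse 1-D regardless of the NumPy version in use.
    return unique_rows, inverse.ravel()

rng = np.random.default_rng(42)
base = rng.uniform(size=(100, 5))
redundant = np.vstack([base] * 10)  # every row appears ten times

unique_rows, inverse = deduplicate_rows(redundant)

# Any per-row result computed on unique_rows can be mapped back to all rows.
labels_unique = np.arange(len(unique_rows))
labels_full = labels_unique[inverse]
```

The `inverse` array is the key piece: it lets cluster labels or embeddings computed on the reduced matrix be broadcast back to the full dataset without recomputation.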

Does Not Run on Large Datasets when using python 3.7

Hi,
I am not able to use MS PHATE with python 3.7.

Reproducible example:

virtualenv --no-download phate_test_env # change path to env as needed
source phate_test_env/bin/activate # load env
pip install git+https://github.com/KrishnaswamyLab/Multiscale_PHATE
python

import numpy as np
from multiscale_phate import multiscale_phate
data = np.random.uniform(size=(5000000,200)) # make sure it's big
msphate_obj = multiscale_phate.Multiscale_PHATE(n_pca=None, n_jobs=48, knn=200)
msphate_obj.fit(data)

stderr is:

  File "[...]/lib/python3.7/site-packages/multiscale_phate/multiscale_phate.py", line 158, in fit
    self.hash = utils.hash_object(X)
  File "[...]/lib/python3.7/site-packages/multiscale_phate/utils.py", line 18, in hash_object
    return hash(pickle.dumps(X))
OverflowError: cannot serialize a bytes object larger than 4 GiB

edited to make data shape more realistic
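The OverflowError in this report is characteristic of pickle protocol 3, the default in Python 3.7, which cannot serialize a bytes object larger than 4 GiB. Protocol 4 supports larger payloads and became the default in Python 3.8. A minimal sketch of a hashing helper that requests protocol 4 explicitly is shown below; `hash_object_protocol4` is a hypothetical stand-in for the `utils.hash_object` call in the traceback, not the package's code, and the small array is only there to keep the example cheap to run.

```python
import pickle

import numpy as np

def hash_object_protocol4(obj):
    """Hash an object through pickle, forcing protocol 4 (handles payloads over 4 GiB)."""
    return hash(pickle.dumps(obj, protocol=4))

small = np.arange(12, dtype=float).reshape(3, 4)
same_hash = hash_object_protocol4(small) == hash_object_protocol4(small.copy())
different_hash = hash_object_protocol4(small) != hash_object_protocol4(small + 1)
```

Note that `pickle.dumps` still builds a full in-memory copy of the serialized data, so hashing a multi-gigabyte array this way temporarily needs roughly twice the array's memory.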

Incoherent cell positions between different visualization layers ?

Hello,

I am working on scATACseq and I tried MS Phate. It seems that some of my clusters switch positions for different visualization layers.
I know the ground truth labels of my cells. 4 of them are:

intermediate mono
CD4+ memory T
CD8+ naive T
CD4+ naive T

For the upper plots, I use visualization level 0, i.e., each dot represents a cell.
For the lower plots, I use a coarser visualization level, i.e., each dot represents cells that were merged during the condensation phase.
For each pair of plots within the same column, I try to confirm that cells belonging to the same orange cluster are found in the same region of the associated condensed 2D plot.

As can be observed, this is not the case. It is very obvious for CD8+ naive T cells, and intermediate mono cells also seem to have moved position in the 2D plot after condensation.

Is this behavior expected?

Thanks a lot for your help

Meaning of Multiscale PHATE levels

Hi, I am wondering if the levels of Multiscale PHATE correspond to actual data points.

For example, can I visualize an arbitrary level colored by a pre-defined label, or do the levels correspond to condensations of multiple data points (possibly with mixed labels)?

Thank you!
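If a coarse level does group several original data points into one, a pre-defined label can still be summarized per group, for example by majority vote together with a purity score. The sketch below assumes a per-cell array of group assignments at the chosen level is available; how to obtain it from the package is not shown on this page, and `summarize_labels` and the sample values are hypothetical.

```python
from collections import Counter, defaultdict

def summarize_labels(cluster_ids, labels):
    """Map each cluster id to (majority label, fraction of members carrying it)."""
    counts = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, labels):
        counts[cluster_id][label] += 1
    summary = {}
    for cluster_id, counter in counts.items():
        majority_label, n_majority = counter.most_common(1)[0]
        summary[cluster_id] = (majority_label, n_majority / sum(counter.values()))
    return summary

# Hypothetical per-cell assignments at one coarse level, with known cell labels.
cluster_ids = [0, 0, 0, 1, 1, 2]
labels = ["CD4+ naive T", "CD4+ naive T", "CD8+ naive T",
          "intermediate mono", "intermediate mono", "CD4+ memory T"]
summary = summarize_labels(cluster_ids, labels)
```

A purity below 1.0 flags condensed points that mix labels, which is useful to know before coloring a coarse level by a single label.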

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?)

tree = mp_op.build_tree()
Calculating base visualization...
Calculated base visualization in 0.01 seconds.
Calculating tree...
Calculated tree in 0.06 seconds.
Traceback (most recent call last):
File "", line 1, in
File "/data/Home/fabotao/software/ENTER/envs/MultiPhate/lib/python3.11/site-packages/multiscale_phate/multiscale_phate.py", line 261, in build_tree
return visualize.build_condensation_tree(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/Home/fabotao/software/ENTER/envs/MultiPhate/lib/python3.11/site-packages/multiscale_phate/visualize.py", line 144, in build_condensation_tree
tree_phate = Ps[l] @ tree_phate_1
~~~~~~^~~~~~~~~~~~~~
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 77566 is different from 77565)