microsoft / graspologic

Python package for graph statistics

Home Page: https://microsoft.github.io/graspologic/latest

License: MIT License

Python 99.94% Makefile 0.06%
graph data-science networks python machine-learning graph-statistics

graspologic's Introduction

graspologic

graspologic is a package for graph statistical algorithms.

Overview

A graph, or network, provides a mathematically intuitive representation of data in which there is some sort of relationship between items. For example, a social network can be represented as a graph by treating every participant as a node, with an edge between each pair of individuals who are friends with one another. Naively applying traditional statistical techniques to a graph ignores the structure of connections between nodes and so does not use all of the information present in the graph. In this package, we provide utilities and specialized statistical algorithms designed for processing and analyzing graphs.
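
For instance (a minimal sketch, not taken from the project documentation), a four-person friendship network can be written down as an adjacency matrix, the array form that the package's routines typically consume:

import numpy as np

# Nodes: alice, bob, carol, dave. Entry (i, j) is 1 when the pair are friends.
A = np.array([
    [0, 1, 1, 0],   # alice  -- bob, carol
    [1, 0, 1, 0],   # bob    -- alice, carol
    [1, 1, 0, 1],   # carol  -- alice, bob, dave
    [0, 0, 1, 0],   # dave   -- carol
])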

Documentation

The official documentation, with usage examples, is at https://microsoft.github.io/graspologic/latest

Please visit the tutorial section of the official website for more in-depth usage.

System Requirements

Hardware requirements

The graspologic package requires only a standard computer with enough RAM to support in-memory operations.

Software requirements

OS Requirements

graspologic is tested on the following OSes:

  • Linux x64
  • macOS x64
  • Windows 10 x64

And across the following x86_64 versions of Python:

  • 3.8
  • 3.9
  • 3.10

If you use graspologic on a platform other than those listed above and notice any unexpected behavior, please feel free to raise an issue. It's better for us and our users if we have concrete examples of things not working!

Installation Guide

Install from pip

pip install graspologic

Install from GitHub

git clone https://github.com/microsoft/graspologic
cd graspologic
python3 -m venv venv
source venv/bin/activate
python3 setup.py install
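
After either route, a quick import check confirms the installation (a minimal sketch; importlib.metadata is part of the standard library on the supported Python versions):

import graspologic  # should import without errors
from importlib.metadata import version
print(version("graspologic"))  # installed package version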

Contributing

We welcome contributions from anyone. Please see our contribution guidelines before making a pull request. Our issues page is full of places we could use help! If you have an idea for an improvement not listed there, please make an issue first so you can discuss with the developers.

License

This project is covered under the MIT License.

Issues

We appreciate detailed bug reports and feature requests (though we appreciate pull requests even more!). Please visit our issues page if you have questions or ideas.

Citing graspologic

If you find graspologic useful in your work, please cite the package via the GraSPy paper:

Chung, J., Pedigo, B. D., Bridgeford, E. W., Varjavand, B. K., Helm, H. S., & Vogelstein, J. T. (2019). GraSPy: Graph Statistics in Python. Journal of Machine Learning Research, 20(158), 1-7.

graspologic's People

Contributors

aj-hersko, allcontributors[bot], alyakin314, asaadeldin11, bdpedigo, bvarjavand, caseyweiner, daxpryce, dfrancisco1998, diane-lee-01, dokato, dtborders, gkang7, hhelm10, hugwuoke, j1c, jheiko1, jingyan230, jonmclean, kareef928, loftusa, nicaurvi, pauladkisson, pbourke, perifanosprometheus, pssf23, spencer-loggia, tliu68, vmdhhh, zeou1

graspologic's Issues

implement ASE, LSE, PTR, semipar

Based on the input data formats we agree on, make sure these functions work properly and no longer call R scripts.

DoD:

  • tests demonstrating that these functions provide expected output, can take in whatever data format we agree on
  • jupyter notebook demonstrating use
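
A minimal sketch of what such a demonstration notebook might exercise, using estimator and utility names from the released package (AdjacencySpectralEmbed, LaplacianSpectralEmbed, pass_to_ranks) rather than anything specified in this issue:

from graspologic.embed import AdjacencySpectralEmbed, LaplacianSpectralEmbed
from graspologic.simulations import er_np
from graspologic.utils import pass_to_ranks

A = er_np(n=100, p=0.3)                    # simulated input graph
A_ptr = pass_to_ranks(A)                   # PTR: replace edge weights by their ranks
X_ase = AdjacencySpectralEmbed(n_components=2).fit_transform(A)
X_lse = LaplacianSpectralEmbed(n_components=2).fit_transform(A_ptr)
print(X_ase.shape, X_lse.shape)            # (100, 2) (100, 2)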

Implement dimselect, seeded graph matching

Write code, tests, and documentation for the following:

  • dimensionality selection
  • joint graph embedding based on Shangsi's paper

DoD:

  • finished code with documentation, tests for each
  • travis integration for each function added
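
For dimensionality selection, the released package exposes select_dimension (an elbow-finding routine on the singular values); a hedged sketch assuming that API is below. The seeded graph matching piece is omitted here since its eventual API is not described in this issue.

from graspologic.embed import select_dimension
from graspologic.simulations import sbm

A = sbm([50, 50], [[0.5, 0.2], [0.2, 0.5]])
elbows, _ = select_dimension(A, n_elbows=2)
print(elbows)  # candidate embedding dimensions at each elbow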

Standardize IO

Shared IO functions in utils, so there is less inconsistency and fewer chances for breakage. For example, I'd imagine this breaks https://github.com/neurodata/pygraphstats/blob/master/graphstats/ase/ase.py with networkx.

Todo:

  • profile the networkx-to-numpy matrix function (sparse as a function of n, dense as a function of n, from n = 10 to 10k on a log scale; sparse = O(n) edges, dense = O(n^2) edges)
  • embed base class
  • IO functions
  • IO tests
  • travis integration

DoD:

  • figures showing profiling results from above
  • IO functions
  • consistent base class for "embed" methods
  • tests for both of the above, demonstrating that they work
  • Description of tests:
    • test1: show that input accepts both networkx and numpy objects, and correctly returns the same object type for both
  • start travis
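
A sketch of how the shared IO helper could behave, assuming the import_graph utility that the released package provides (networkx and numpy inputs both normalized to a numpy adjacency matrix):

import networkx as nx
import numpy as np
from graspologic.utils import import_graph

g = nx.erdos_renyi_graph(20, 0.2, seed=0)
A_from_nx = import_graph(g)                     # networkx.Graph in, numpy adjacency out
A_from_np = import_graph(nx.to_numpy_array(g))  # numpy array in, numpy array out
print(np.allclose(A_from_nx, A_from_np))        # both paths should agree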

What plots would be useful?

First, I think seaborn is a good choice. ggplot for Python hasn't been developed in over 2 years, so while ggplot is nice, I don't think I'm going to use it. I'm not the biggest fan of plotly. Thoughts appreciated.

So for actual plots:

  1. Function for plotting 1 or more adjacency/similarity/dissimilarity matrices as heatmaps
  2. Pairwise scatter plots. So given X \in R^{n x d}, it plots each pair of dimensions as a scatter plot
  3. A generic scatter plot. Given X \in R^n and Y \in R^n, just makes a scatter plot.
  4. ??

@jovo @ebridge2 @bdpedigo
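
For reference, items 1 and 2 roughly correspond to the heatmap and pairplot functions the plotting module eventually shipped; a hedged sketch assuming that API:

from graspologic.embed import AdjacencySpectralEmbed
from graspologic.plot import heatmap, pairplot
from graspologic.simulations import sbm

A = sbm([50, 50], [[0.5, 0.2], [0.2, 0.5]])
heatmap(A, title="SBM adjacency matrix")             # (1) matrix as a heatmap
X = AdjacencySpectralEmbed(n_components=3).fit_transform(A)
pairplot(X, title="Pairwise embedding dimensions")   # (2) pairwise scatter plots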

Things to do within next week

Pedigo

  • LCC
  • diag aug
  • Anibal's error/is_close(), unless someone already fixed/clarified that
  • Protect laplacian normalization against undirected graphs
  • Change how form kwarg gets passed to LSE
  • Let PTR take kwargs
  • RDPG
    • fix min()
    • actual simulation
    • add tests
  • Laplacian fails when node degree 0
  • Add alpha param for plotting
    • heatmap: use diverging colormap by default if there are both negative and pos numbers

J1C

  • sphinx docs
  • omni
    • docs that it works on tensors
    • docs that say you should get a tuple back if asymmetric graphs
  • mds
    • modify cmds to work on matrices
      • test for matrix
    • Dimselect in MDS
    • Type checking in MDS
  • rewrite SVD to use svd, svds, and randomized_svd
    • Error catching in selectSVD
    • remove lpm and save UD^1/2, VD^1/2, and D
  • rewrite sims to have the following:
    • sampling class that operates on a P matrix
      • SBM, RDPG, Erdős-Rényi (n, p) models
    • sampling class that operates on a number of edges
      • Erdős-Rényi (n, m) model
      • zero-inflated (n, m) models
    • sampling class for edge weights
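
A hedged sketch touching several of the items above (LCC extraction, diagonal augmentation, Laplacian normalization), assuming the utility names in current graspologic releases (largest_connected_component, augment_diagonal, to_laplacian):

from graspologic.simulations import er_np
from graspologic.utils import augment_diagonal, largest_connected_component, to_laplacian

A = er_np(n=100, p=0.05)
A_lcc = largest_connected_component(A)   # keep only the largest connected component
A_aug = augment_diagonal(A_lcc)          # diagonal augmentation before ASE
L = to_laplacian(A_lcc, form="DAD")      # normalized Laplacian used by LSE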

FIRM compliance

Find

  • On github
  • Permanent DOI
  • License

Install

  • Installation guidelines
  • On PyPi
    • this is done automatically

Run

  • Demo, including expected results, data, and runtime
  • Readme with quick start guide
  • Autogen docs

Modify

  • Contrib guidelines
    • Style guidelines
    • Bug report template
    • Pull request template
    • Feature addition template
  • Unit tests
  • Continuous Integration
  • Badges
    • DOI
    • License
    • Stable release version
    • Documentation link
    • Code quality
    • Coverage
    • Build status (for virtual machine)
    • Number of downloads (this is not possible in pypi)

https://bitsandbrains.io/2018/10/21/numerical-packages.html

Analyze another data set based on tools/discoveries in prior sprint

Based on what is discovered in Sprint 2, see if any significant findings can be repeated in another data set. In particular, if disease/environmental phenotype data can be related to graph statistical properties, try to find another data set for that specific phenotype.

DoD:

  • Quantification of graph statistical properties with regard to phenotype data as in Sprint 2
  • Reproducible figures and statistics in Jupyter, ready for publication

sample correlated SBM

Example:

require(igraph)
gg <- rg.sample.SBM.correlated(n = 100, B = matrix(c(0.5,0.5,0.2,0.5), nrow = 2), rho = c(0.4,0.6), sigma = 0.2)

summary(gg$adjacency$A)
IGRAPH c77bf6c U--- 100 2424 --
summary(gg$adjacency$B)
IGRAPH 3ccfb0c U--- 100 2039 --

cor(as.vector(gg$adjacency$A[]), as.vector(gg$adjacency$B[]))
[1] 0.1494246

Correlated ER

# Sample a pair of correlated G(n, p) graphs: A ~ Bernoulli(P), and B drawn
# conditionally on A so that the edgewise correlation is governed by sigma.
rg.sample.correlated.gnp <- function(P, sigma) {
  require(igraph)
  n <- nrow(P)

  # Draw A from the edge-probability matrix P
  U <- matrix(0, nrow = n, ncol = n)
  U[col(U) > row(U)] <- runif(n * (n - 1) / 2)
  U <- U + t(U)
  diag(U) <- runif(n)
  A <- (U < P) + 0
  diag(A) <- 0

  # Flatten the upper triangles and resample each edge of B given A
  avec <- A[col(A) > row(A)]
  pvec <- P[col(P) > row(P)]
  bvec <- numeric(n * (n - 1) / 2)
  uvec <- runif(n * (n - 1) / 2)

  idx1 <- which(avec == 1)
  idx0 <- which(avec == 0)
  bvec[idx1] <- (uvec[idx1] < (sigma + (1 - sigma) * pvec[idx1])) + 0
  bvec[idx0] <- (uvec[idx0] < (1 - sigma) * pvec[idx0]) + 0

  B <- matrix(0, nrow = n, ncol = n)
  B[col(B) > row(B)] <- bvec
  B <- B + t(B)
  diag(B) <- 0

  return(list(A = graph.adjacency(A, "undirected"), B = graph.adjacency(B, "undirected")))
}

non-igraph version of correlated SBM

# Example:
#   gg <- rg.sample.SBM.correlated(n = 100, B = matrix(c(0.5, 0.5, 0.2, 0.5), nrow = 2),
#                                  rho = c(0.4, 0.6), sigma = 0.2)
#   cor(as.vector(gg$adjacency$A[]), as.vector(gg$adjacency$B[]))
rg.sample.SBM.correlated <- function(n, B, rho, sigma, conditional = FALSE) {
  if (!conditional) {
    # Random block memberships drawn with probabilities rho
    tau <- sample(c(1:length(rho)), n, replace = TRUE, prob = rho)
  } else {
    # Fixed block sizes proportional to rho
    tau <- unlist(lapply(1:2, function(k) rep(k, rho[k] * n)))
  }
  # Expand the block-probability matrix B to a vertex-level P and sample correlated graphs
  P <- B[tau, tau]
  return(list(adjacency = rg.sample.correlated.gnp(P, sigma), tau = tau))
}
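
For reference, a hedged Python counterpart: the released graspologic package exposes er_corr and sbm_corr in its simulations module (assumed here), which play the same role as the R helpers above:

import numpy as np
from graspologic.simulations import er_corr, sbm_corr

p = np.array([[0.5, 0.2], [0.2, 0.5]])
A1, A2 = sbm_corr(n=[50, 50], p=p, r=0.2)   # pair of correlated SBM graphs
B1, B2 = er_corr(n=100, p=0.3, r=0.2)       # pair of correlated ER graphs
iu = np.triu_indices(100, k=1)
print(np.corrcoef(A1[iu], A2[iu])[0, 1])    # empirical edge correlation near r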

Add graph simulations

ER, SBM, zi-poisson ER, zi-poisson SBM, weighted ER, weighted SBM simulations.

DoD:

  • simulations subpackage added
  • tests for the simulations subpackage, one per simulation type, validating that sampled graphs satisfy the hyperparameters of that simulation type
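
A hedged sketch of what the subpackage's API might look like, using the er_np and sbm samplers (and their wt/wtargs weight parameters) from the released package:

import numpy as np
from graspologic.simulations import er_np, sbm

A_er = er_np(n=100, p=0.1)                                # Erdos-Renyi (n, p)
A_sbm = sbm(n=[50, 50], p=[[0.5, 0.2], [0.2, 0.5]])       # two-block SBM
A_wsbm = sbm(n=[50, 50], p=[[0.5, 0.2], [0.2, 0.5]],
             wt=np.random.poisson, wtargs=dict(lam=3))    # Poisson-weighted SBM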

Change plot Heatmaps to 1-tone

It's slightly more intuitive to have single-tone heatmaps (i.e., color for large values, white for small values, something in between otherwise). Consider visualizing something like the plot below:

[example heatmap omitted]

The absence of color generally indicates that something is small, whereas the presence of color usually indicates more of something, and here that convention is broken, which is fairly unintuitive. Using three colors also requires readers to check the axes, limits, etc. carefully, which is why we typically use a single tone with white for small values and color for large ones.
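
A hedged illustration of the single-tone suggestion, assuming graspologic.plot.heatmap forwards a matplotlib colormap name through its cmap parameter:

from graspologic.plot import heatmap
from graspologic.simulations import sbm

A = sbm([50, 50], [[0.5, 0.2], [0.2, 0.5]])
heatmap(A, cmap="Purples")  # sequential colormap: white for small values, color for large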

Implement omni, clustering after embedding, plotting functions

Write code, tests, and documentation for the following:

  • Omni
  • Clustering after embedding? GMM/MDS
  • Plotting
  • README.md
  • CONTRIBUTING.md
  • Setup sphinx script for generating docs

DoD:

  • Finished code with documentation, tests
  • Passing Travis
  • Nature compliant repo
  • Pypi package
  • Full documentation hosted by github
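
A hedged sketch of the omni-then-cluster pipeline, using class names from the released package (OmnibusEmbed, GaussianCluster) rather than anything fixed by this issue:

from graspologic.cluster import GaussianCluster
from graspologic.embed import OmnibusEmbed
from graspologic.simulations import sbm

graphs = [sbm([50, 50], [[0.5, 0.2], [0.2, 0.5]]) for _ in range(4)]
Z = OmnibusEmbed(n_components=2).fit_transform(graphs)        # (n_graphs, n_vertices, d) in recent releases
labels = GaussianCluster(max_components=4).fit_predict(Z[0])  # GMM clustering of one graph's embedding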

Embedding regularization investigations

  • see how "adding c" affects embeddings
  • see how the embeddings look when we don't even check for LCC
  • see if one or two graphs are messing everything else up because they have many unconnected nodes
  • figure out how to still run sparse code after adding c
  • see if augmenting diagonal changes anything even when there are unconnected nodes (eg can we force connectivity somehow and just fake it)

networkx

  1. pip install networkx
  2. understand what classes and methods are available in the package

URerf Graph2vec

"concatenated vectors through unsupervised random forest, the features that were most informative would be the ones that are used. then, rather than MDS, we simply do an eigendecomposition"

OmnibusEmbed fit_transform results

Hi developers (cc @jovo),

I am running OmnibusEmbed on several correlation matrices derived from functional magnetic resonance imaging data. Currently, I have 133 subjects, each with time series for 249 brain regions of interest. For each subject, I compute the Pearson correlation matrix, so in the end I have a matrix [133 x 249 x 249] (if you prefer, 133 graphs with 249 vertices).

However, when I run:

embeddings = OmnibusEmbed(k=20).fit_transform(correlations)

embeddings becomes a 2-items tuple, with two matrices [33117 x 20], in which np.allclose(embeddings[0], embeddings[1]) is True. Why is it returning two of them?

Also, is it safe to reshape the matrix [33117 x 20] into [133 x 249 x 20], so that embeddings[0] contains the embeddings of subject 0's regions?

Thank you!
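
If the stacked output really is ordered graph-by-graph (which the omnibus construction suggests, though the maintainers should confirm), recovering per-subject blocks is plain numpy bookkeeping; a sketch with the dimensions from the question:

import numpy as np

stacked = np.zeros((133 * 249, 20))          # stand-in for embeddings[0]
per_subject = stacked.reshape(133, 249, 20)  # per_subject[i] holds subject i's 249 regions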

CONTRIBUTING Guidelines

Write concrete contributing guidelines.

DoD
CONTRIBUTING.md that specifies the following:

  • Coding guide following PEP8
  • Docstring guide following that of numpy/scipy NOT Google
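
A minimal example of the numpy/scipy docstring convention the guidelines would mandate (the function itself is hypothetical):

def degree(adjacency):
    """Compute the degree of every node.

    Parameters
    ----------
    adjacency : ndarray, shape (n, n)
        Adjacency matrix of an undirected, unweighted graph.

    Returns
    -------
    degrees : ndarray, shape (n,)
        Row sums of ``adjacency``.
    """
    return adjacency.sum(axis=1)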

sparse matrix support

Should be easy; most functions should already work on sparse matrices, but we will need to update our type checking in several places and write tests to make sure.

Also, J1C says that one of the SVDs does not work on sparse matrices.

We could possibly also support rank-1 + sparse matrices, where rather than many 0s they have many copies of some constant.

Make a class for a latent position model and adjust embeddings to store the latent position model

Make a basic class containing a structured representation of a latent position model. This consists of an X, a Y, and an optional vtx_names attribute, where X \in \mathbb{R}^{N \times k}, Y is either NULL or in \mathbb{R}^{N \times k}, and vtx_names \in \mathcal{S}^{N}. Correspondingly, make the base embedding class contain an instance of a latent position model to return to users.

DoD:

  • latent position model class built and tests added
  • embedding base method contains an instance of the latent position model, with tests added correspondingly for the base embedding method upon delivery of @bijanv's dimselect method
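
A hedged sketch of the container described above, using the attribute names from this issue (the class the package ultimately shipped may differ):

from dataclasses import dataclass
from typing import Optional, Sequence

import numpy as np


@dataclass
class LatentPositionModel:
    X: np.ndarray                              # shape (N, k) latent positions
    Y: Optional[np.ndarray] = None             # shape (N, k) for directed graphs, else None
    vtx_names: Optional[Sequence[str]] = None  # length-N vertex labels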

Omnibus Embedding

Write a function for omnibus embedding with the following features:

  • Can take any number of matrices
  • Checks for same matrix dimensions

DoD:

  • Code + tests demonstrating that it works.
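
A sketch of the core construction such a function needs: the omnibus matrix whose (i, j) block is (A_i + A_j) / 2, plus the dimension check from the list above. This is written directly in numpy and is not the implementation the package adopted:

import numpy as np

def omnibus_matrix(graphs):
    """Stack m graphs on n vertices into the (m*n) x (m*n) omnibus matrix."""
    shapes = {g.shape for g in graphs}
    if len(shapes) != 1:
        raise ValueError("all input matrices must have the same dimensions")
    m, n = len(graphs), graphs[0].shape[0]
    omni = np.empty((m * n, m * n))
    for i, a in enumerate(graphs):
        for j, b in enumerate(graphs):
            omni[i * n:(i + 1) * n, j * n:(j + 1) * n] = (a + b) / 2
    return omni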

More misc TODO

Pedigo

  • merge ptr changes [#67]
  • constant multiplier kwarg for diag aug
  • figure out ASE/eleganz bug/feature
  • fix LSE is_almost_symmetric behavior
  • write some tutorials and/or show use on real data
  • transformations
    • look into sklearn preprocessing [#52]
    • implement log transform
    • refactor plotting log transform to use above
    • refactor transform funcs out of utils (match docs)
  • run ase with diag aug on eleganz data with several constant multipliers [#59]

Explore spectral and omnibus embedding wrt. phenotype data

Select some phenotype features of interest and explore how the spectrally embedded data look with respect to these features. If it seems reasonable based on this output, try clustering/classification.
DoD:

  • pretty graphs generated from an extensible python/class module
  • use jupyter notebook to tell data narrative
