microsoft / graspologic

Python package for graph statistics

Home Page: https://microsoft.github.io/graspologic/latest

License: MIT License

Python 99.94% Makefile 0.06%
graph data-science networks python machine-learning graph-statistics

graspologic's Introduction

graspologic

graspologic is a package for graph statistical algorithms.

Overview

A graph, or network, provides a mathematically intuitive representation of data in which there is some sort of relationship between items. For example, a social network can be represented as a graph by treating every participant as a node, with an edge between each pair of individuals who are friends with one another. Naively applying traditional statistical techniques to a graph ignores the structure of connections between nodes and so does not use all of the information present in the graph. In this package, we provide utilities and specialized statistical algorithms designed for processing and analyzing graphs.
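
For instance (a minimal sketch, not taken from the project documentation), a four-person friendship network can be written down as an adjacency matrix, the array form that the package's routines typically consume:

import numpy as np

# Nodes: alice, bob, carol, dave. Entry (i, j) is 1 when the pair are friends.
A = np.array([
    [0, 1, 1, 0],   # alice  -- bob, carol
    [1, 0, 1, 0],   # bob    -- alice, carol
    [1, 1, 0, 1],   # carol  -- alice, bob, dave
    [0, 0, 1, 0],   # dave   -- carol
])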

Documentation

The official documentation, with usage examples, is at https://microsoft.github.io/graspologic/latest

Please visit the tutorial section of the official website for more in-depth usage.

System Requirements

Hardware requirements

The graspologic package requires only a standard computer with enough RAM to support in-memory operations.

Software requirements

OS Requirements

graspologic is tested on the following OSes:

  • Linux x64
  • macOS x64
  • Windows 10 x64

And across the following x86_64 versions of Python:

  • 3.8
  • 3.9
  • 3.10

If you use graspologic on a platform other than those listed above and notice any unexpected behavior, please feel free to raise an issue. It's better for us and our users if we have concrete examples of things not working!

Installation Guide

Install from pip

pip install graspologic

Install from GitHub

git clone https://github.com/microsoft/graspologic
cd graspologic
python3 -m venv venv
source venv/bin/activate
python3 setup.py install
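
After either route, a quick import check confirms the installation (a minimal sketch; importlib.metadata is part of the standard library on the supported Python versions):

import graspologic  # should import without errors
from importlib.metadata import version
print(version("graspologic"))  # installed package version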

Contributing

We welcome contributions from anyone. Please see our contribution guidelines before making a pull request. Our issues page is full of places we could use help! If you have an idea for an improvement not listed there, please make an issue first so you can discuss with the developers.

License

This project is covered under the MIT License.

Issues

We appreciate detailed bug reports and feature requests (though we appreciate pull requests even more!). Please visit our issues page if you have questions or ideas.

Citing graspologic

If you find graspologic useful in your work, please cite the package via the GraSPy paper:

Chung, J., Pedigo, B. D., Bridgeford, E. W., Varjavand, B. K., Helm, H. S., & Vogelstein, J. T. (2019). GraSPy: Graph Statistics in Python. Journal of Machine Learning Research, 20(158), 1-7.

graspologic's People

Contributors

aj-hersko, allcontributors[bot], alyakin314, asaadeldin11, bdpedigo, bvarjavand, caseyweiner, daxpryce, dfrancisco1998, diane-lee-01, dokato, dtborders, gkang7, hhelm10, hugwuoke, j1c, jheiko1, jingyan230, jonmclean, kareef928, loftusa, nicaurvi, pauladkisson, pbourke, perifanosprometheus, pssf23, spencer-loggia, tliu68, vmdhhh, zeou1

graspologic's Issues

implement ASE, LSE, PTR, semipar

Based on the input data formats we agree on, make sure these functions work properly and no longer call R scripts.

DoD:

  • tests demonstrating that these functions provide expected output, can take in whatever data format we agree on
  • jupyter notebook demonstrating use
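
A minimal sketch of what such a demonstration notebook might exercise, using estimator and utility names from the released package (AdjacencySpectralEmbed, LaplacianSpectralEmbed, pass_to_ranks) rather than anything specified in this issue:

from graspologic.embed import AdjacencySpectralEmbed, LaplacianSpectralEmbed
from graspologic.simulations import er_np
from graspologic.utils import pass_to_ranks

A = er_np(n=100, p=0.3)                    # simulated input graph
A_ptr = pass_to_ranks(A)                   # PTR: replace edge weights by their ranks
X_ase = AdjacencySpectralEmbed(n_components=2).fit_transform(A)
X_lse = LaplacianSpectralEmbed(n_components=2).fit_transform(A_ptr)
print(X_ase.shape, X_lse.shape)            # (100, 2) (100, 2)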

Implement dimselect, seeded graph matching

Write code, tests, and documentation for the following:

  • dimensionality selection
  • joint graph embedding based on Shangsi's paper

DoD:

  • finished code with documentation, tests for each
  • travis integration for each function added
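
For dimensionality selection, the released package exposes select_dimension (an elbow-finding routine on the singular values); a hedged sketch assuming that API is below. The seeded graph matching piece is omitted here since its eventual API is not described in this issue.

from graspologic.embed import select_dimension
from graspologic.simulations import sbm

A = sbm([50, 50], [[0.5, 0.2], [0.2, 0.5]])
elbows, _ = select_dimension(A, n_elbows=2)
print(elbows)  # candidate embedding dimensions at each elbow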

Standardize IO

Shared IO functions in utils, so there is less inconsistency and fewer chances for breakage. For example, I'd imagine this breaks https://github.com/neurodata/pygraphstats/blob/master/graphstats/ase/ase.py with networkx.

Todo:

  • profile the networkx-to-numpy matrix function (sparse as a function of n, dense as a function of n, from n = 10 to 10k on a log scale; sparse = O(n) edges, dense = O(n^2) edges)
  • embed base class
  • IO functions
  • IO tests
  • travis integration

DoD:

  • figures showing profiling results from above
  • IO functions
  • consistent base class for "embed" methods
  • tests for both of the above, demonstrating that they work
  • Description of tests:
    • test1: show that input accepts both networkx and numpy objects, and correctly returns the same object type for both
  • start travis
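
A sketch of how the shared IO helper could behave, assuming the import_graph utility that the released package provides (networkx and numpy inputs both normalized to a numpy adjacency matrix):

import networkx as nx
import numpy as np
from graspologic.utils import import_graph

g = nx.erdos_renyi_graph(20, 0.2, seed=0)
A_from_nx = import_graph(g)                     # networkx.Graph in, numpy adjacency out
A_from_np = import_graph(nx.to_numpy_array(g))  # numpy array in, numpy array out
print(np.allclose(A_from_nx, A_from_np))        # both paths should agree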

What plots would be useful?

First, I think seaborn is a good choice. ggplot for Python hasn't been developed in over 2 years, so while ggplot is nice, I don't think I'm going to use it. I'm not the biggest fan of plotly. Thoughts appreciated.

So for actual plots:

  1. Function for plotting 1 or more adjacency/similarity/dissimilarity matrices as heatmaps
  2. Pairwise scatter plots. So given X \in R^{n x d}, it plots each pair of dimensions as a scatter plot
  3. A generic scatter plot. Given X \in R^n and Y \in R^n, just makes a scatter plot.
  4. ??

@jovo @ebridge2 @bdpedigo
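
For reference, items 1 and 2 roughly correspond to the heatmap and pairplot functions the plotting module eventually shipped; a hedged sketch assuming that API:

from graspologic.embed import AdjacencySpectralEmbed
from graspologic.plot import heatmap, pairplot
from graspologic.simulations import sbm

A = sbm([50, 50], [[0.5, 0.2], [0.2, 0.5]])
heatmap(A, title="SBM adjacency matrix")             # (1) matrix as a heatmap
X = AdjacencySpectralEmbed(n_components=3).fit_transform(A)
pairplot(X, title="Pairwise embedding dimensions")   # (2) pairwise scatter plots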

Things to do within next week

Pedigo

  • LCC
  • diag aug
  • Anibal's error/is_close(), unless someone already fixed/clarified that
  • Protect laplacian normalization against undirected graphs
  • Change how form kwarg gets passed to LSE
  • Let PTR take kwargs
  • RDPG
    • fix min()
    • actual simulation
    • add tests
  • Laplacian fails when node degree 0
  • Add alpha param for plotting
    • heatmap: use diverging colormap by default if there are both negative and pos numbers

J1C

  • sphinx docs
  • omni
    • docs that it works on tensors
    • docs that say you should get a tuple back if asymmetric graphs
  • mds
    • modify cmds to work on matrices
      • test for matrix
    • Dimselect in MDS
    • Type checking in MDS
  • rewrite SVD to use svd, svds, and randomized_svd
    • Error catching in selectSVD
    • remove lpm and save UD^1/2, VD^1/2, and D
  • rewrite sims to have the following:
    • sampling class that operates on a P matrix
      • SBM, RDPG, Erdős-Rényi (n, p) models
    • sampling class that operates on a number of edges
      • Erdős-Rényi (n, m) model
      • zero-inflated (n, m) models
    • sampling class for edge weights
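
A hedged sketch touching several of the items above (LCC extraction, diagonal augmentation, Laplacian normalization), assuming the utility names in current graspologic releases (largest_connected_component, augment_diagonal, to_laplacian):

from graspologic.simulations import er_np
from graspologic.utils import augment_diagonal, largest_connected_component, to_laplacian

A = er_np(n=100, p=0.05)
A_lcc = largest_connected_component(A)   # keep only the largest connected component
A_aug = augment_diagonal(A_lcc)          # diagonal augmentation before ASE
L = to_laplacian(A_lcc, form="DAD")      # normalized Laplacian used by LSE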

FIRM compliance

Find

  • On github
  • Permanent DOI
  • License

Install

  • Installation guidelines
  • On PyPi
    • this is done automatically

Run

  • Demo, including expected results, data, and runtime
  • Readme with quick start guide
  • Autogen docs

Modify

  • Contrib guidelines
    • Style guidelines
    • Bug report template
    • Pull request template
    • Feature addition template
  • Unit tests
  • Continuous Integration
  • Badges
    • DOI
    • License
    • Stable release version
    • Documentation link
    • Code quality
    • Coverage
    • Build status (for virtual machine)
    • Number of downloads (this is not possible in pypi)

https://bitsandbrains.io/2018/10/21/numerical-packages.html

Analyze another data set based on tools/discoveries in prior sprint

Based on what is discovered in Sprint 2, see if any significant findings can be repeated in another data set. In particular, if disease/environmental phenotype data can be related to graph statistical properties, try to find another data set for that specific phenotype.

DoD:

  • Quantification of graph statistical properties with regard to phenotype data as in Sprint 2
  • Reproducible figures and statistics in Jupyter, ready for publication

sample correlated SBM

Example:

require(igraph)
gg <- rg.sample.SBM.correlated(n = 100, B = matrix(c(0.5,0.5,0.2,0.5), nrow = 2), rho = c(0.4,0.6), sigma = 0.2)

summary(gg$adjacency$A)
IGRAPH c77bf6c U--- 100 2424 --
summary(gg$adjacency$B)
IGRAPH 3ccfb0c U--- 100 2039 --

cor(as.vector(gg$adjacency$A[]), as.vector(gg$adjacency$B[]))
[1] 0.1494246

Correlated ER

# Sample a pair of correlated G(n, p) graphs: A ~ Bernoulli(P), and B drawn
# conditionally on A so that the edgewise correlation is governed by sigma.
rg.sample.correlated.gnp <- function(P, sigma) {
  require(igraph)
  n <- nrow(P)

  # Draw A from the edge-probability matrix P
  U <- matrix(0, nrow = n, ncol = n)
  U[col(U) > row(U)] <- runif(n * (n - 1) / 2)
  U <- U + t(U)
  diag(U) <- runif(n)
  A <- (U < P) + 0
  diag(A) <- 0

  # Flatten the upper triangles and resample each edge of B given A
  avec <- A[col(A) > row(A)]
  pvec <- P[col(P) > row(P)]
  bvec <- numeric(n * (n - 1) / 2)
  uvec <- runif(n * (n - 1) / 2)

  idx1 <- which(avec == 1)
  idx0 <- which(avec == 0)
  bvec[idx1] <- (uvec[idx1] < (sigma + (1 - sigma) * pvec[idx1])) + 0
  bvec[idx0] <- (uvec[idx0] < (1 - sigma) * pvec[idx0]) + 0

  B <- matrix(0, nrow = n, ncol = n)
  B[col(B) > row(B)] <- bvec
  B <- B + t(B)
  diag(B) <- 0

  return(list(A = graph.adjacency(A, "undirected"), B = graph.adjacency(B, "undirected")))
}

non-igraph version of correlated SBM

# Example:
#   gg <- rg.sample.SBM.correlated(n = 100, B = matrix(c(0.5, 0.5, 0.2, 0.5), nrow = 2),
#                                  rho = c(0.4, 0.6), sigma = 0.2)
#   cor(as.vector(gg$adjacency$A[]), as.vector(gg$adjacency$B[]))
rg.sample.SBM.correlated <- function(n, B, rho, sigma, conditional = FALSE) {
  if (!conditional) {
    # Random block memberships drawn with probabilities rho
    tau <- sample(c(1:length(rho)), n, replace = TRUE, prob = rho)
  } else {
    # Fixed block sizes proportional to rho
    tau <- unlist(lapply(1:2, function(k) rep(k, rho[k] * n)))
  }
  # Expand the block-probability matrix B to a vertex-level P and sample correlated graphs
  P <- B[tau, tau]
  return(list(adjacency = rg.sample.correlated.gnp(P, sigma), tau = tau))
}
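
For reference, a hedged Python counterpart: the released graspologic package exposes er_corr and sbm_corr in its simulations module (assumed here), which play the same role as the R helpers above:

import numpy as np
from graspologic.simulations import er_corr, sbm_corr

p = np.array([[0.5, 0.2], [0.2, 0.5]])
A1, A2 = sbm_corr(n=[50, 50], p=p, r=0.2)   # pair of correlated SBM graphs
B1, B2 = er_corr(n=100, p=0.3, r=0.2)       # pair of correlated ER graphs
iu = np.triu_indices(100, k=1)
print(np.corrcoef(A1[iu], A2[iu])[0, 1])    # empirical edge correlation near r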

Add graph simulations

ER, SBM, zi-poisson ER, zi-poisson SBM, weighted ER, weighted SBM simulations.

DoD:

  • simulations subpackage added
  • tests for the simulations subpackage, one per simulation type, validating that sampled graphs satisfy the hyperparameters of that simulation type
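
A hedged sketch of what the subpackage's API might look like, using the er_np and sbm samplers (and their wt/wtargs weight parameters) from the released package:

import numpy as np
from graspologic.simulations import er_np, sbm

A_er = er_np(n=100, p=0.1)                                # Erdos-Renyi (n, p)
A_sbm = sbm(n=[50, 50], p=[[0.5, 0.2], [0.2, 0.5]])       # two-block SBM
A_wsbm = sbm(n=[50, 50], p=[[0.5, 0.2], [0.2, 0.5]],
             wt=np.random.poisson, wtargs=dict(lam=3))    # Poisson-weighted SBM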

Change plot Heatmaps to 1-tone

It's slightly more intuitive to have single-tone heatmaps (i.e., color for large values, white for small values, something in between otherwise). Consider visualizing something like the plot below:

[example heatmap omitted]

The absence of color generally indicates that something is small, whereas the presence of color usually indicates more of something, and here that convention is broken, which is fairly unintuitive. Using three colors also requires readers to check the axes, limits, etc. carefully, which is why we typically use a single tone with white for small values and color for large ones.
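
A hedged illustration of the single-tone suggestion, assuming graspologic.plot.heatmap forwards a matplotlib colormap name through its cmap parameter:

from graspologic.plot import heatmap
from graspologic.simulations import sbm

A = sbm([50, 50], [[0.5, 0.2], [0.2, 0.5]])
heatmap(A, cmap="Purples")  # sequential colormap: white for small values, color for large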

Implement omni, clustering after embedding, plotting functions

Write code, tests, and documentation for the following:

  • Omni
  • Clustering after embedding? GMM/MDS
  • Plotting
  • README.md
  • CONTRIBUTING.md
  • Setup sphinx script for generating docs

DoD:

  • Finished code with documentation, tests
  • Passing Travis
  • Nature compliant repo
  • Pypi package
  • Full documentation hosted by github
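
A hedged sketch of the omni-then-cluster pipeline, using class names from the released package (OmnibusEmbed, GaussianCluster) rather than anything fixed by this issue:

from graspologic.cluster import GaussianCluster
from graspologic.embed import OmnibusEmbed
from graspologic.simulations import sbm

graphs = [sbm([50, 50], [[0.5, 0.2], [0.2, 0.5]]) for _ in range(4)]
Z = OmnibusEmbed(n_components=2).fit_transform(graphs)        # (n_graphs, n_vertices, d) in recent releases
labels = GaussianCluster(max_components=4).fit_predict(Z[0])  # GMM clustering of one graph's embedding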

Embedding regularization investigations

  • see how "adding c" affects embeddings
  • see how the embeddings look when we don't even check for LCC
  • see if one or two graphs are messing everything else up because they have many unconnected nodes
  • figure out how to still run sparse code after adding c
  • see if augmenting diagonal changes anything even when there are unconnected nodes (eg can we force connectivity somehow and just fake it)

networkx

  1. pip install networkx
  2. understand what classes and methods are available in the package

URerf Graph2vec

"concatenated vectors through unsupervised random forest, the features that were most informative would be the ones that are used. then, rather than MDS, we simply do an eigendecomposition"

OmnibusEmbed fit_transform results

Hi developers (cc @jovo),

I am running OmnibusEmbed on several correlation matrices derived from functional magnetic resonance imaging data. Currently, I have 133 subjects, each with time series for 249 brain regions of interest. For each subject, I compute the Pearson correlation matrix, so in the end I have a matrix [133 x 249 x 249] (if you prefer, 133 graphs with 249 vertices).

However, when I run:

embeddings = OmnibusEmbed(k=20).fit_transform(correlations)

embeddings becomes a 2-items tuple, with two matrices [33117 x 20], in which np.allclose(embeddings[0], embeddings[1]) is True. Why is it returning two of them?

Also, is it safe to reshape the matrix [33117 x 20] into [133 x 249 x 20], so that embeddings[0] contains the embeddings of subject 0's regions?

Thank you!
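
If the stacked output really is ordered graph-by-graph (which the omnibus construction suggests, though the maintainers should confirm), recovering per-subject blocks is plain numpy bookkeeping; a sketch with the dimensions from the question:

import numpy as np

stacked = np.zeros((133 * 249, 20))          # stand-in for embeddings[0]
per_subject = stacked.reshape(133, 249, 20)  # per_subject[i] holds subject i's 249 regions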

CONTRIBUTING Guidelines

Write concrete contributing guidelines.

DoD
CONTRIBUTING.md that specifies the following:

  • Coding guide following PEP8
  • Docstring guide following that of numpy/scipy NOT Google
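
A minimal example of the numpy/scipy docstring convention the guidelines would mandate (the function itself is hypothetical):

def degree(adjacency):
    """Compute the degree of every node.

    Parameters
    ----------
    adjacency : ndarray, shape (n, n)
        Adjacency matrix of an undirected, unweighted graph.

    Returns
    -------
    degrees : ndarray, shape (n,)
        Row sums of ``adjacency``.
    """
    return adjacency.sum(axis=1)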

sparse matrix support

Should be easy; most functions should already work on sparse matrices, but we will need to update our type checking in several places and write tests to make sure.

Also, J1C says that one of the SVDs does not work on sparse matrices.

We could possibly also support rank-1 + sparse matrices, where rather than many 0s they have many copies of some constant.

Make a class for a latent position model and adjust embeddings to store the latent position model

Make a basic class containing a structured representation of a latent position model. This consists of an X, a Y, and an optional vtx_names attribute, where X \in \mathbb{R}^{N \times k}, Y is either NULL or in \mathbb{R}^{N \times k}, and vtx_names \in \mathcal{S}^{N}. Correspondingly, make the base embedding class contain an instance of a latent position model to return to users.

DoD:

  • latent position model class built and tests added
  • embedding base method contains an instance of the latent position model, with tests added correspondingly for the base embedding method upon delivery of @bijanv's dimselect method
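
A hedged sketch of the container described above, using the attribute names from this issue (the class the package ultimately shipped may differ):

from dataclasses import dataclass
from typing import Optional, Sequence

import numpy as np


@dataclass
class LatentPositionModel:
    X: np.ndarray                              # shape (N, k) latent positions
    Y: Optional[np.ndarray] = None             # shape (N, k) for directed graphs, else None
    vtx_names: Optional[Sequence[str]] = None  # length-N vertex labels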

Omnibus Embedding

Write a function for omnibus embedding with the following features:

  • Can take any number of matrices
  • Checks for same matrix dimensions

DoD:

  • Code + tests demonstrating that it works.
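
A sketch of the core construction such a function needs: the omnibus matrix whose (i, j) block is (A_i + A_j) / 2, plus the dimension check from the list above. This is written directly in numpy and is not the implementation the package adopted:

import numpy as np

def omnibus_matrix(graphs):
    """Stack m graphs on n vertices into the (m*n) x (m*n) omnibus matrix."""
    shapes = {g.shape for g in graphs}
    if len(shapes) != 1:
        raise ValueError("all input matrices must have the same dimensions")
    m, n = len(graphs), graphs[0].shape[0]
    omni = np.empty((m * n, m * n))
    for i, a in enumerate(graphs):
        for j, b in enumerate(graphs):
            omni[i * n:(i + 1) * n, j * n:(j + 1) * n] = (a + b) / 2
    return omni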

More misc TODO

Pedigo

  • merge ptr changes [#67]
  • constant multiplier kwarg for diag aug
  • figure out ASE/eleganz bug/feature
  • fix LSE is_almost_symmetric behavior
  • write some tutorials and/or show use on real data
  • transformations
    • look into sklearn preprocessing [#52]
    • implement log transform
    • refactor plotting log transform to use above
    • refactor transform funcs out of utils (match docs)
  • run ase with diag aug on eleganz data with several constant multipliers [#59]

Explore spectral and omnibus embedding wrt. phenotype data

Select some phenotype features of interest and explore how the spectrally embedded data look with respect to these features. If it seems reasonable based on this output, try clustering/classification.
DoD:

  • pretty graphs generated from an extensible python/class module
  • use jupyter notebook to tell data narrative
