fraenkel-lab / omicsintegrator2

Prize-Collecting Steiner Forests for Interactomes

Home Page: https://fraenkel-lab.github.io/OmicsIntegrator2

License: BSD 3-Clause "New" or "Revised" License

Python 4.43% Jupyter Notebook 95.57%

omics proteomics prize-collecting steiner-tree

omicsintegrator2's Introduction

OmicsIntegrator2

Background

Omics Integrator is a package designed to integrate proteomic data, gene expression data and/or epigenetic data using a protein-protein interaction network. It identifies high-confidence, relevant subnetworks from the underlying interactome. It comprises two modules, Garnet and Forest. This repository holds the code for Forest in version 2 of Omics Integrator.

Forest first maps your high-throughput data onto the network. Proteins in the network are 'nodes' connected by edges representing physical interactions between two proteins. You should assign prizes to protein nodes from your high-throughput data, e.g. the prize could be the log fold change of that protein in your system. The edges are assigned costs, often proportional to the confidence in that interaction.

Forest then adds a 'dummy node' to the network with edges to all of the nodes you've assigned prizes, called terminals. There are several parameters you can change in Forest. Omega, the cost of the edges between the root dummy node and the terminals, determines the number of pathways in the final solution. Beta, the relative weighting between node prizes and edge costs, determines the size of the final solution. And alpha adds a penalty to edges based on the degrees of the two nodes that the edge connects. This keeps the network from being biased towards "hub nodes", often highly studied and promiscuous proteins that may not be specific to your system.
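As a rough illustration of where the three parameters enter, here is a minimal sketch in plain Python. The function name, the dummy node name and, in particular, the form of the degree penalty (cost plus alpha times the product of the endpoint degrees) are assumptions made for this example, not the formula Forest uses.

```python
from collections import Counter

DUMMY = "__dummy_root__"  # hypothetical name for the dummy node


def prepare_instance(edges, prizes, omega, beta, alpha):
    """edges: {(u, v): cost}; prizes: {node: prize}.

    Returns (scaled_prizes, penalized_costs) with dummy edges added.
    """
    # Degree of each node in the original interactome.
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Hypothetical hub penalty: grows with the degrees of both endpoints.
    costs = {
        (u, v): cost + alpha * degree[u] * degree[v]
        for (u, v), cost in edges.items()
    }

    # Beta rescales prizes relative to edge costs.
    scaled = {node: beta * prize for node, prize in prizes.items()}

    # Connect the dummy root to every terminal at cost omega.
    for terminal in prizes:
        costs[(DUMMY, terminal)] = omega

    return scaled, costs
```

Raising omega makes it more expensive to start a new tree from the dummy node, and raising beta makes prizes worth more relative to edge costs, which matches the roles described above.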

Finally, Forest uses the Prize-Collecting Steiner Forest algorithm to whittle the large interactome down to relevant sub-networks, or pathways. These pathways are likely places to look for important cellular functions altered in your system. They will include some, but not all, of your terminals. They may also include "Steiner nodes", nodes that you did not assign a prize to, but that the algorithm is predicting are important to the pathways altered in your system.

With the output of these sub-networks, Omics Integrator allows researchers to go from huge, often contradictory lists of genes, proteins, and metabolites from multiple -omics data sources to a few important cellular pathways to focus on in follow-up studies of their system.

Changelog:

6/20/18: 2.3.0

Many breaking changes, e.g. method names, data formats
Add tests
New annotation file contains process and function gene annotations on top of subcellular localizations.
HTML visualization improvements
First release with a changelog entry and github release!

omicsintegrator2's Issues

Subcellular localization -- "no location" should be referred to as NaN

Louvain clustering output

OmicsIntegrator2/src/graph.py

Line 366 in 11a24e1

louvain_clustering(augmented_forest)

Getting the following error:

Post Processing: Clustering and Enriching

I've compiled a list of what seems to be the most useful clustering and enrichment methods for the output of OmicsIntegrator. Below are the methods that are most frequently used, and should be implemented in OmicsIntegrator2 (listed in order of priority)

Clustering:

Louvain Clustering
Edge-betweenness Clustering - already implemented by Alex (?)
K-means clustering / TsNE
Consensus Clustering
Pathway clustering

As well as clustering, it would also be helpful to have enrichment methods.

Output cluster membership
Integration with Enrichr API

If there are any others that you think of, please let me know!

Grid Search over parameter space

This is an important feature a lot of the time. How would we like it to be implemented?

Ideally we would want to be able to run each of these in parallel (e.g. with multiprocessing).
There exist a couple "out of the box" solutions for this (e.g. https://github.com/fraenkel-lab/OmicsIntegrator2/blob/dev/src/grid_search.ipynb) which we may be able to use.
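One possible shape for this, as a sketch only: build the grid with itertools.product and keep the mapping function pluggable, so the same code runs serially or through a multiprocessing pool. The names build_grid, grid_search and toy_run are hypothetical, and toy_run merely stands in for a single Forest run.

```python
from itertools import product


def build_grid(ws, bs, gs):
    """Return one parameter dict per (w, b, g) combination."""
    return [{"w": w, "b": b, "g": g} for w, b, g in product(ws, bs, gs)]


def grid_search(run_one, grid, map_fn=map):
    """Apply run_one to every parameter set.

    map_fn defaults to the built-in map (serial); pass pool.map from a
    multiprocessing.Pool to run the parameter sets in parallel.
    """
    return list(zip(grid, map_fn(run_one, grid)))


def toy_run(params):
    # Hypothetical stand-in for a single Forest run.
    return params["w"] * params["b"] + params["g"]
```

To parallelize, open a pool with `multiprocessing.Pool()` and call `grid_search(toy_run, grid, map_fn=pool.map)`. The function passed in has to be defined at module level so it can be pickled.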

What output format would we like?

@iamjli @AmandaKedaigle

Adding negative prizes for very low mRNA expression

@AmandaKedaigle's idea. Max's work. Further details coming.

knockout proteins

Right now, the knockout functionality is not implemented in OmicsIntegrator2.

parser.add_argument("--knockout", dest='knockout', nargs='*', default=[],
	help='Protein(s) you would like to "knock out" of the interactome to simulate a knockout experiment. [default: %(default)s]')
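One simple way the flag could be honored, as a sketch: drop every edge that touches a knocked-out protein before the interactome is handed to the solver. The function name and the (protein_a, protein_b, cost) row format are assumptions for illustration.

```python
def knock_out(edges, knockout):
    """Remove every edge that touches a knocked-out protein.

    edges: iterable of (protein_a, protein_b, cost) tuples.
    knockout: iterable of protein names to remove.
    """
    removed = set(knockout)
    return [
        (a, b, cost)
        for a, b, cost in edges
        if a not in removed and b not in removed
    ]
```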

random terminals: how random?

@AmandaKedaigle

Here are the top couple nodes (left column, their numerical ID) and their edge degrees (right column)

4400      407
9466      409
6347      413
629       416
4309      419
12140     424
430       457
12015     475
75        478
6877      481
10772     516
11477     542
9358      569
4517      597
11171     617
12162     626
6397      647
5146      652
6058      653
4308      655
6433      662
4565      688
7598      746
4767      750
10141     766
6225      836
4310     1113
4770     1239
275      2008
6400     9289

The way random terminals used to work (I think) was that it would look within a fixed window around the current terminal (in the above list, which is sorted by degree) and select a replacement by sampling an offset from a Gaussian.

But let's say node 6400 is your terminal and you get an offset of 10: then you're selecting a node with degree 662 instead of 9289, which is a big change in node degree. The window size was set to 100 up or down (200 total, I think), meaning you could really get a hugely different value.

Is that how we want to keep doing things? If so do we want to shrink the window? Or maybe we should do something else? What about a uniform sampling from the 5 above and 5 below?

Let me know what you think...
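For discussion, the last option (uniform sampling from the 5 above and 5 below) could look roughly like this. It is a sketch with hypothetical names, and it assumes the node list is already sorted by degree and contains at least two nodes.

```python
import random


def sample_similar_degree(ranked_nodes, terminal, window=5, rng=random):
    """Pick a replacement for terminal from nodes of similar degree.

    ranked_nodes: list of node IDs sorted by degree (ascending).
    window: how many neighbours on each side of the terminal to consider.
    """
    i = ranked_nodes.index(terminal)
    lo = max(0, i - window)
    hi = min(len(ranked_nodes), i + window + 1)
    # Candidates are the neighbours in the ranking, excluding the terminal.
    candidates = ranked_nodes[lo:i] + ranked_nodes[i + 1:hi]
    return rng.choice(candidates)
```

Note that at either end of the ranking the window is truncated, so for node 6400 in the list above only the five nodes just below it would be candidates.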

create style code for visualization

Originally from @AmandaKedaigle, copied here.

Specific things create_style_code_for_visualization.py did that we'll need to replace in some manner.

-Attributes for coloring should be named based on what kind of data: proteinChange (for both proteomic and metabolomic data) and geneChange (for expression data)

-Add the "TerminalType" attribute based on what input file you were looking at. For visualization code, these should be "Proteomic", "Metabolite", "mRNA", "TF" (from Garnet)

-It also inferred which one should be the actual "prize" column - for Proteomic or Metabolite nodes, it should be abs(data value), for TF nodes, it should be the data value (which does not change proteinChange or geneChange attr values), and for mRNA data, if prize is not already assigned to a protein node from this mRNA name, it gets prize zero (and geneChange from the data value)
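The prize rules in the last point can be written down compactly. This is only a sketch: the function name is made up, and returning None to mean "leave the existing prize alone" is a design choice for this example.

```python
def infer_prize(terminal_type, value, already_has_prize=False):
    """Return the prize implied by one data value, following the rules above.

    Returns None when an mRNA value maps to a node that already has a prize,
    meaning the existing prize should be left untouched.
    """
    if terminal_type in ("Proteomic", "Metabolite"):
        return abs(value)
    if terminal_type == "TF":
        return value
    if terminal_type == "mRNA":
        return None if already_has_prize else 0.0
    raise ValueError(f"Unknown TerminalType: {terminal_type!r}")
```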

Write informative README

Similarly to GarNet, it would be really nice to have a beautiful readme explaining this project. I no longer believe we'll have OmicsIntegrator2 be a fork of OmicsIntegrator(1), so I think it makes sense to duplicate important documents from OI1 here. I imagine a lot of the overview can be borrowed from OI1 (though it might be nice to add a little). The technical stuff should be easier this time, since installing is a single command.

Network summarization

Provide summary statistics for single run forests, provide warnings when network structure is off (large star structures/many singleton nodes, etc). Write to file.

Ideally also do this for randomizations, but the code structure has to be modified significantly.
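A first pass at the single-run summary might look like the sketch below. The function name, the returned keys and the star_fraction threshold are all hypothetical; what counts as an "off" structure still needs to be agreed on.

```python
from collections import Counter


def summarize_forest(nodes, edges, star_fraction=0.5):
    """Return summary statistics and warnings for one forest.

    nodes: iterable of node names; edges: iterable of (u, v) pairs.
    star_fraction is a hypothetical threshold: warn when one node touches
    more than this fraction of all edges.
    """
    nodes = list(nodes)
    edges = list(edges)
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    singletons = [n for n in nodes if degree[n] == 0]
    hub, hub_degree = max(degree.items(), key=lambda kv: kv[1], default=(None, 0))

    warnings = []
    if singletons:
        warnings.append(f"{len(singletons)} singleton node(s)")
    if edges and hub_degree > star_fraction * len(edges):
        warnings.append(f"possible star structure around {hub}")

    return {
        "n_nodes": len(nodes),
        "n_edges": len(edges),
        "n_singletons": len(singletons),
        "max_degree": hub_degree,
        "warnings": warnings,
    }
```

The returned dict is plain data, so writing it to file (for example with json.dump) is straightforward.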

Ernest: Avoid parameter selection by randomize & merge

Avoiding parameter selection. This is much more speculative, but what if we chose parameter combinations at random, and then created a merged network by taking the average weighted by the probability the network is not random (mean specificity score)? This is similar to Importance sampling, but less rigorous since specificity is not the same as prob_true.
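To make the idea concrete, here is a sketch of the merge step only. It assumes each run is summarized as a set of edges plus a single score (the mean specificity score proposed above), and it returns, for every edge, the score-weighted fraction of runs containing it. The function name and input format are hypothetical.

```python
from collections import defaultdict


def merge_networks(runs):
    """Merge networks from several parameter sets into one weighted network.

    runs: list of (edges, score) pairs, where edges is a set of (u, v) tuples
    and score is the weight given to that run.
    Returns {edge: weighted fraction of runs containing the edge}.
    """
    total = sum(score for _, score in runs)
    if total == 0:
        return {}
    merged = defaultdict(float)
    for edges, score in runs:
        for edge in edges:
            merged[edge] += score / total
    return dict(merged)
```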

Ernest: randomizations heatmap should show node degree information

Generate a version of the heatmaps for parameter selection that simultaneously capture node degree, robustness and specificity. Since degree is fixed for a node, it could be plotted at the margin of the heatmap. Robustness and specificity depend on the parameters. Perhaps we could have two versions of the plot: (A) Color scale for robustness for each matrix element with average specificity of nodes shown in the margin and (B) color scale for specificity for each matrix element with average robustness of nodes shown in the margin.

Bug with main.py

The imports for main.py in line 11 need to be fixed. In particular, get_networkx_graph_as_node_edge_dataframes needs to be replaced with the two separate functions available in graph (and the place where it is called needs to be replaced with two lines as well).

In addition, get_networkx_subgraph_from_randomizations doesn't currently exist in graph.py. It either needs to be implemented, or the right function needs to be imported instead.

Update to networkx 2.0

This was released a couple days ago.

OI web app

Some notes from our convo on Friday, Mandy, as well as a few other things I just thought of.

What is the mRNA input for? Colorization, or penalizing lowly expressed nodes? If the former, we run into some consistency issues since we don’t use our entire proteomics file as our prize file. If the latter, we don’t have this functionality implemented, so maybe we should leave that box out for now.
In the algorithm parameters, “Number of Trees” is a bit misleading, but I can’t think of a concise rewording. Maybe “connectivity”? Even that’s not ideal.
Default for b should probably be in the range of 0-2, since we are no longer penalizing node prizes. Default 1 seems to work pretty well for me.
Degree penalty (a) should now be edge penalty (g). Default 10 works fine for me.
iRef14 also worked well for me, so maybe update that.
Visualization: with new penalization scheme, high degree nodes are included, leading to outputs like the one below. Would it be too difficult to make non-robust edges transparent? Low robust edges could still be visualized, but at a very low visibility.
Output for randomized experiments should auto return the top 400 robust nodes as a subnetwork, or some other heuristic. The former is already implemented in graph.py.

Write docstrings for clustering techniques

The docstrings are the multi-line comments below the function definitions. The functions that need annotations are louvain_clustering, edge_betweenness_clustering, k_clique_clustering.

Incorrect output filename

Small one, but shouldn't it be filename="graph_edgelist.txt" per the docstring?

OmicsIntegrator2/src/graph.py

Line 948 in 71d5c59

    
def output_networkx_graph_as_edgelist(nxgraph, output_dir, filename="graph_json.json"):

Refactor randomizations and output_forest_as_networkx

The randomizations function has gotten a little bloated. Refactor it to make it a little more atomic. Other functions which might be impacted / implicated:

- _aggregate_pcsf
- get_networkx_graph_as_dataframe_of_nodes
- get_networkx_graph_as_dataframe_of_edges

Write cytoscape files

Currently, we have a file called write_cyto_file.py in the repo, but instead of doing this ourselves, we should have networkx do this, I think.

Cytoscape can take in these formats including:

Graph Markup Language (GML or .gml format)

networkx has documentation about reading and writing here, including the functionality of writing GML files.

This seems like a nice way to go. However, we need to make sure we don't run into any problems involving issues like this one: http://stackoverflow.com/questions/5828045/transfer-layout-from-networkx-to-cytoscape

Quick final note, we may want to use graph-tool instead of networkx

Seg fault

This command run from /nfs/latdata/iamjli/packages/OmicsIntegrator2/src produces a segfault:

python __main__.py -e ../PCSF_compare/data/iRefIndex_v13_MIScore_interactome_COSTS.txt -p ../PCSF_compare/data/20170323_ALS_CTR_iMNs_protein_log2FC1_FDR0.01.tsv -o ../output/

PCST runs fine if I use a shortened interactome file (../PCSF_compare/data/iRefIndex_v13_MIScore_interactome_COSTS.short.txt), or one whose scores have not been "1 minused" into costs (../PCSF_compare/data/iRefIndex_v13_MIScore_interactome.txt).

NetworkXError: GraphML writer does not support <class 'list'> as data values.

The otherLocations node attribute is a list of extracellular locations, but networkx cannot write lists to GraphML, apparently.
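A possible workaround, sketched below, is to join list-valued attributes into a single delimited string before writing. The function name and the ";" separator are arbitrary choices for this example.

```python
def stringify_list_attributes(attributes, sep=";"):
    """Return a copy of one node's attribute dict with lists joined into strings.

    The GraphML writer rejects list values, so a list such as otherLocations
    is stored as a single delimited string instead.
    """
    return {
        key: sep.join(map(str, value)) if isinstance(value, (list, tuple)) else value
        for key, value in attributes.items()
    }
```

With a networkx graph, this would be applied to each node's attribute dict before calling the GraphML writer; readers of the file then split the string on the same separator.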

pcst_fast function not working on cluster

Running forest from /nfs/latdata/iamjli/ALS/OmicsIntegrator2/src:

python forest.py -e ../data/interactome/iRefIndex_v13_MIScore_interactome.txt -p ../data/iMNs/proteomics/20170323_ALS_CTR_iMNs_protein_log2FC1_FDR0.01.tsv -o ../output/test/

am getting this error:

terminate called after throwing an instance of 'std::invalid_argument'
  what():  In the rooted case, only one output cluster is supported.
Aborted

Traced error back to pcst_fast function.

Terminals not in results included in final network

Terminals that were not included in results are left in the final network as nodes (though no edges are attached to them). The inner join to get rid of the dummy edges still leaves the unattached terminals.

This is not a huge problem, though these nodes don't get assigned attributes and thus caused an error with the web tool visualization. To be cleaner, they should probably be left out. I wanted to leave a note here in case they cause any other issues in the future.
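Filtering them out could be as small as the sketch below, which works on a plain node list and edge list; the function name is hypothetical.

```python
def drop_unattached_nodes(nodes, edges):
    """Keep only nodes that appear in at least one edge.

    nodes: iterable of node names; edges: collection of (u, v) pairs.
    """
    attached = {node for edge in edges for node in edge}
    return [node for node in nodes if node in attached]
```

On a networkx graph the equivalent is `graph.remove_nodes_from(list(nx.isolates(graph)))`.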

Add feature to remove edges between the 1-hop neighbors of a protein of interest

@brycehwang's idea.

May force the network to choose a greater diversity of paths around your protein of interest.
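A sketch of what this feature might do, assuming an undirected edge list: find the 1-hop neighbors of the protein of interest and drop any edge whose two endpoints are both in that set. The function name is hypothetical.

```python
def remove_neighbor_edges(edges, protein):
    """Drop edges whose two endpoints are both 1-hop neighbors of protein.

    edges: iterable of (u, v) pairs. Edges touching protein itself are kept.
    """
    edges = list(edges)
    neighbors = set()
    for u, v in edges:
        if u == protein:
            neighbors.add(v)
        elif v == protein:
            neighbors.add(u)
    # A self-loop on the protein must not make it its own neighbor.
    neighbors.discard(protein)
    return [
        (u, v) for u, v in edges
        if not (u in neighbors and v in neighbors)
    ]
```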

rewrite _random_terminals

Right now it's just badly written.

Error during randomizations

Using this command:

python /home/nlpm/packages/OmicsIntegrator2/src/parameter_sweep.py -e /home/nlpm/OI2_Networks/interactome/iref14_Recon2_Htt_mzMatchedMet_OI2_interactome.txt -p /home/nlpm/OI2_Networks/terminals/Cypro_all_terminals.txt -o /home/nlpm/OI2_Networks/networks/Cypro/randomizations/ -w 3 6 9 12 -b 0.25 0.5 0.75 1 2 -g 5 10 20 --noisy_edges=100 --random_terminals=100 -noise=0.04567861 --seed=1

I got this error:

10:59:30 - Graph: INFO - {'w': 12.0, 'b': 2.0, 'g': 20.0, 'noise': 0.04567861, 'exclude_terminals': False, 'seed': 1, 'noisy_edges_repetitions': 100, 'random_terminals_repetitions': 100}
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/net/dorsal/apps/python361/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/net/dorsal/apps/python361/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/nlpm/packages/OmicsIntegrator2/src/graph.py", line 525, in _eval_randomizations
    forest, augmented_forest = self.randomizations(params["noisy_edges_repetitions"], params["random_terminals_repetitions"])
  File "/home/nlpm/packages/OmicsIntegrator2/src/graph.py", line 492, in randomizations
    forest, augmented_forest = self.output_forest_as_networkx(vertex_indices.node_index.values, edge_indices.edge_index.values)
  File "/home/nlpm/packages/OmicsIntegrator2/src/graph.py", line 322, in output_forest_as_networkx
    forest.add_nodes_from(list(set(self.nodes[vertex_indices]) - set(forest.nodes())))
  File "/home/nlpm/packages/OmicsIntegrator2/venv/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 1700, in __getitem__
    result = getitem(key)
IndexError: arrays used as indices must be of integer (or boolean) type
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/nlpm/packages/OmicsIntegrator2/src/parameter_sweep.py", line 116, in <module>
    main()
  File "/home/nlpm/packages/OmicsIntegrator2/src/parameter_sweep.py", line 102, in main
    results = graph.grid_search_randomizations(args.prize_file, params)
  File "/home/nlpm/packages/OmicsIntegrator2/src/graph.py", line 561, in grid_search_randomizations
    results = pool.map(self._eval_randomizations, all_params)
  File "/net/dorsal/apps/python361/lib/python3.6/multiprocessing/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/net/dorsal/apps/python361/lib/python3.6/multiprocessing/pool.py", line 608, in get
    raise self._value
IndexError: arrays used as indices must be of integer (or boolean) type

Scaling for parameters

Log transform for "g"
Have "w" scale as a function of prize weight

Negative Prizes for Hubs

@AmandaKedaigle With respect to negative prizes, how are they computed? I imagine it isn't all too complicated, just having to do with a mu parameter...?

Remind me again why we need to keep original, negative and result prizes?

Add mRNA nodes & interactions to forest

The web tool has support for new kinds of interactions we'd like to include eventually:

The output of Garnet should be changed to indicate what genes were used to predict a TF was important (i.e. these genes are predicted targets of that TF). Then, those genes should be added to the network as new nodes (mRNA nodes, not protein nodes), and edges should be drawn:

"pd" or protein-DNA interaction, from TFs to their target mRNA nodes
"tp" or mRNA-protein interactions, from mRNA nodes to their protein products

Variable attributes discussion

I still think some variables would be better as class attributes. For instance, it seems arbitrary that self.edge_penalties is an attribute despite not being used anywhere else in the class. On the other hand, terminals, defined in _prepare_prizes, is redefined in pcsf, and will be again in pcsf_exact.

I think I would be satisfied if prizes, terminals, and (maybe) node_attributes were their own attributes. Especially since prizes and terminals are both arrays that are meaningless to the user outside of the context of the class. So the current format essentially forces users to save prizes/terminals as local vars, just to pass it back into pcsf.

Arguments against?

Directed Edges

@AmandaKedaigle this thread is for discussing whether we need to support directed edges, and if so, how we might go about doing so. I imagine Ludwig will need to be part of these discussions. In the short term, I plan on releasing this without support for directed edges, because I have a heavy bias here of "good is better than perfect", unless you think that's a bad idea.

Ernest: Ranking subnetworks

Ranking subnetworks based on something Anthony introduced in his most recent manuscript:

rank scores for these sub-networks according to their prize densities (sum of prizes multiplied by the fractional size of sub-network)
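One reading of that score, as a sketch: "fractional size" is taken here to mean the sub-network's node count divided by the total node count across all sub-networks. That interpretation is an assumption, as are the function name and input format.

```python
def rank_subnetworks(subnetworks, prizes):
    """Rank sub-networks by prize density, highest first.

    subnetworks: {name: set of nodes}; prizes: {node: prize}.
    Assumes fractional size = sub-network size / total nodes across all
    sub-networks, which is one reading of the description above.
    """
    total = sum(len(nodes) for nodes in subnetworks.values())
    if total == 0:
        return []
    scores = {
        name: sum(prizes.get(n, 0.0) for n in nodes) * len(nodes) / total
        for name, nodes in subnetworks.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```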

_check_validity_of_instance

We need to think deeply about what the potential malformed inputs to OmicsIntegrator might be. This thread is for that. Beyond making sure every function is passed the right type, what else do we need to check? @divyaramamoorthy

Wrapper scripts TODO list

Data Input
- Wrapper to format prize/interactome files
Parameter sweep
- Grid search
  - Input: parameter file with param grid and I/O paths (/nfs/latdata/iamjli/packages/PCSF_analysis/specification_sheet.yaml)
  - Output: parameter summary node matrix (/nfs/latdata/iamjli/ALS/network_analysis/iMNs_ALS_CTR/new_proteomics_051917/param_search/summary/PCSF_JLI_networkNodeMatrix.tsv)
- Visualize node membership as heatmap (DONE)
  - /nfs/latdata/iamjli/packages/PCSF_analysis/bin/heatmapFromNodeFrequencyMatrix.R
Randomizations
- Randomization experiments
  - Input: parameter set
  - Output: file summarizing all runs
    - node_summary.tsv: protein, specificity, robustness, prize, type (/nfs/latdata/iamjli/ALS/network_analysis/iMNs_ALS_CTR/new_proteomics_051917/W_2_BETA_9_D_7_mu_5e-05/summary/summary_nodes.txt)
    - edge_summary.tsv (not urgent)
Post-processing (optional)
- clustering
  - Input: node_summary.tsv, robustness threshold, clustering method
  - Output: networkx object with nodes that have robustness greater than threshold.
    - Edge weights from interactome
    - Node attributes found in node_summary.tsv
    - Cluster assignments

Edge data representation

pcsf returns edge_indices as a 1D np array. It would be nice to return as 2D array of vertex indices (since pcsf_exact uses this format), but I think this breaks a lot of downstream analyses, particularly _aggregate_pcsf and randomizations.

One solution would be to create a copy of the interactome, and assign robustness/specificity as node and edge attributes..

So what's the best way to go from 2D array of vertex indices to 1D np array of edge_indices?
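One straightforward option is a lookup table keyed on the unordered vertex pair. The sketch below assumes the interactome is available as a list of (u, v) pairs whose position is the edge index; the function name is hypothetical.

```python
def edge_pairs_to_indices(interactome_edges, pairs):
    """Map (u, v) vertex-index pairs to positions in the interactome edge list.

    interactome_edges: list of (u, v) pairs; position i is edge index i.
    pairs: the 2D edge representation, as a sequence of (u, v) pairs.
    Edges are treated as undirected, so (u, v) and (v, u) match.
    """
    lookup = {frozenset(edge): i for i, edge in enumerate(interactome_edges)}
    return [lookup[frozenset(pair)] for pair in pairs]
```

A pair that is not in the interactome raises KeyError, and if the interactome contains duplicate edges the last occurrence wins. Building the lookup once and reusing it keeps each conversion linear in the number of pairs.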

Switch all soon-to-be-deprecated .loc's to .reindex's in graph.py

Node attributes output

-Gene symbol
-Prize value
-Robustness
-Specificity
-Protein type (steiner, TF, terminal)

exclude_terminals

Post processing graphs

@AmandaKedaigle What are the post-processing steps for graphs, and which of those do we want to support under which circumstances? In my notes, I see that there's been discussion of betweenness, Louvain clustering, and community clustering. I imagine some of these are supported by networkx; others we might need to implement on our own.

@iamjli would you comment on this for us?

Duplicate edges in edge file

Hi @AmandaKedaigle ,

Will there ever be duplicate edges in the edge file? I.e.

A   B   prize1
B   A   prize1

or worse

A   B   prize1
B   A   prize2

Specifically I'm asking whether I should check for that in forest, and what I should do if it happens...
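Whatever the policy ends up being, detecting both cases is cheap. The sketch below groups rows by their unordered endpoints and reports which edges are repeated and which of those disagree on the third column (written prize1/prize2 above). The function name and row format are assumptions.

```python
from collections import defaultdict


def find_duplicate_edges(rows):
    """Group rows of an undirected edge file by their unordered endpoints.

    rows: iterable of (node_a, node_b, cost) tuples.
    Returns (duplicates, conflicts): duplicates maps each repeated edge to its
    list of costs; conflicts is the subset whose costs disagree.
    """
    costs = defaultdict(list)
    for a, b, cost in rows:
        costs[frozenset((a, b))].append(cost)
    duplicates = {edge: c for edge, c in costs.items() if len(c) > 1}
    conflicts = {edge: c for edge, c in duplicates.items() if len(set(c)) > 1}
    return duplicates, conflicts
```

Exact duplicates could then be collapsed silently, while conflicting ones probably deserve at least a warning.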

Can we name alpha something else?

Tony's implementation of multi-PCSF (which Azim and I are making a modified/new version of for OI2) already uses a parameter called alpha in calculating new prizes.

If there's a good argument for calling the hub edge penalty alpha, we can name ours something different, but to avoid confusion it would be better for similar parameters to keep names similar to those in the published algorithm.

Add ability to penalize or remove nodes

Which allows for #35

Randomizations

@AmandaKedaigle when people go to run OmicsIntegrator2, how do they do randomizations? Do they specify a number of times to randomize using each randomization strategy and merge them all? I'm still not sure I understand merging too well either... but maybe I can read about that.

THE ROADMAP: Everything we want to change for this release:

@divyaramamoorthy @AmandaKedaigle

New in this version

Safe to Remove:

Cross validation code
Shuffle Prizes
Support for Cytoscape 2.8

Web tool not working

@AmandaKedaigle @zfrenchee Can you guys take a look at this soon? Error in browser:

A problem occurred in a Python script.

/var/www/htdocs/omicsintegrator/log/tmpVAf0YI.html contains the description of this error.

Same error in Safari and Firefox. Likely data-dependent error. He can't share the raw data though.

--random_terminals flag doesn't work

python forest.py -e /nfs/latdata/iamjli/ALS/data/interactome/iRefIndex_v13_MIScore_interactome.txt -p /nfs/latdata/iamjli/ALS/data/iMNs/proteomics/20170323_ALS_CTR_iMNs_protein_log2FC1_FDR0.01.tsv -o ../output/ --random_terminals=5

run from /nfs/latdata/iamjli/packages/OmicsIntegrator2/src

Error:

Traceback (most recent call last):
  File "forest.py", line 477, in <module>
    forest, augmented_forest = graph.randomizations(prizes, terminals, args.noisy_edges_repetitions, args.random_terminals_repetitions)
  File "forest.py", line 329, in randomizations
    for random_prizes, terminals in [self._random_terminals(prizes, terminals) for rep in range(random_terminals_reps)]:
  File "forest.py", line 329, in <listcomp>
    for random_prizes, terminals in [self._random_terminals(prizes, terminals) for rep in range(random_terminals_reps)]:
  File "forest.py", line 296, in _random_terminals
    new_prizes[new_terminal] = prizes[old_terminal]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

Low(er) priority: support for TF lists

Web-tool suggestions

run_visualization_for_user_saved_JSON_file.sh creates two redundant folders: ../visualize_results_rundir and ../visualize_saved_<name>
Currently only recognizes Steiner node if TerminalType field is 0. Please change to recognize anything that's not RNA/TF/Proteomic
Node names need to be text searchable
Allow user to switch between viewing Protein and mRNA level change. Both could be represented by node color...looks cluttered as is.

What to do when Prize file contains nodes missing from interactome

@AmandaKedaigle What does the old omicsintegrator do?

running on local machine

Python 3.5.1 (default, Mar  1 2017, 15:20:57)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pcst_fast import pcst_fast
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: dlopen(/Users/jonathanli/Documents/research/packages/OmicsIntegrator2/venv/lib/python3.5/site-packages/pcst_fast.cpython-35m-darwin.so, 2): Symbol not found: __PyThreadState_UncheckedGet
  Referenced from: /Users/jonathanli/Documents/research/packages/OmicsIntegrator2/venv/lib/python3.5/site-packages/pcst_fast.cpython-35m-darwin.so
  Expected in: flat namespace
 in /Users/jonathanli/Documents/research/packages/OmicsIntegrator2/venv/lib/python3.5/site-packages/pcst_fast.cpython-35m-darwin.so

When I try to install pcst_fast:

(venv) jonathanli (master *) OmicsIntegrator2 $ pip install pcst_fast
Requirement already satisfied (use --upgrade to upgrade): pcst-fast in ./venv/lib/python3.5/site-packages
Requirement already satisfied (use --upgrade to upgrade): pybind11>=2.1.0 in ./venv/lib/python3.5/site-packages (from pcst-fast)
You are using pip version 7.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.