Thanks @semihsalihoglu-uw for your questions. For those who don't know, @semihsalihoglu-uw is currently conducting a survey on graph database usage, which I believe anyone is free to take. They have previously explored this topic in a work titled The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing. For some reason, the DOI for that article has not been registered (shame on the publisher).
Anyways, here are my answers to the questions:
What kind of queries and graph computations do you run on hetionet?
The primary computation we run is to compute degree-weighted path counts (DWPCs, initially described here). DWPCs measure the extent of connectivity between two nodes along a given type of path (metapath). They are related to path counts (the number of paths), but have an adjustment for node degree to downweight paths through high degree nodes.
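To make the degree adjustment concrete, here is a minimal sketch that computes a DWPC by enumerating paths on a toy hetnet. It assumes the usual formulation in which each path is weighted by the product of the relationship-specific degrees of its nodes, raised to a negative damping exponent; the value 0.4, the node names, the EDGES layout, and the dwpc function are all assumptions for illustration and are not the hetio API.

```python
from collections import defaultdict

# Toy hetnet: edges grouped by relationship type (all names are made up).
EDGES = {
    "binds": [("c1", "g1"), ("c1", "g2"), ("c2", "g2"), ("c2", "g3")],
    "associates": [("g1", "d1"), ("g2", "d1"), ("g2", "d2"), ("g3", "d2")],
}


def build_neighbors(edges):
    """Map each relationship type to an undirected adjacency dict."""
    neighbors = {}
    for rel_type, pairs in edges.items():
        adjacency = defaultdict(set)
        for u, v in pairs:
            adjacency[u].add(v)
            adjacency[v].add(u)
        neighbors[rel_type] = adjacency
    return neighbors


def dwpc(neighbors, source, target, metapath, damping=0.4):
    """Sum degree-weighted path products over paths following `metapath`."""
    total = 0.0
    # Each stack entry: (nodes visited so far, product of degrees so far).
    stack = [((source,), 1.0)]
    while stack:
        path, degree_product = stack.pop()
        depth = len(path) - 1
        if depth == len(metapath):
            if path[-1] == target:
                total += degree_product ** -damping
            continue
        adjacency = neighbors[metapath[depth]]
        node = path[-1]
        for nxt in adjacency.get(node, ()):
            if nxt in path:  # skip paths that revisit a node
                continue
            # Degrees are specific to the relationship type being traversed.
            step = len(adjacency[node]) * len(adjacency[nxt])
            stack.append((path + (nxt,), degree_product * step))
    return total


NEIGHBORS = build_neighbors(EDGES)
score = dwpc(NEIGHBORS, "c1", "d1", ["binds", "associates"])
print(round(score, 4))
```

Two paths connect c1 and d1 here: c1-g1-d1 and c1-g2-d1. The second runs through the higher-degree gene g2, so it contributes less to the total than the first.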
We also ran some one-of-a-kind queries to investigate specific questions. These usually rely on Cypher queries to Hetionet in Neo4j (see examples).
What kind of software do you use to run these queries?
We have three implementations for computing the DWPC. Here they are in order of both date created and sophistication:
- Using a function in the hetio Python package which takes a hetio.hetnet.Graph object. This requires the whole graph to be read into memory.
- Using a Cypher implementation (background) that computes DWPCs from a Neo4j database. This has the advantages that the graph can be stored on disk and concurrent queries are possible. Generally, we still use functions from the hetio package to template these queries.
- Matrix multiplication approaches we're currently developing for the hetmech project. This approach stores hetnets as matrices (one adjacency matrix for each relationship type). We can achieve massive efficiency gains by computing DWPCs with matrix multiplication. The two downsides are that this method doesn't track which paths connect nodes (just how many) and that excluding duplicate nodes in a path is tricky. We have built considerable Python infrastructure to do this. The hetmech infrastructure still uses parts of the hetio package.
So as you can see, each new implementation builds off the previous ones and often depends on parts of the existing codebases.
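The matrix multiplication idea can be sketched roughly as follows. This is not the hetmech code: the degree_weight and dwpc_matrix functions, the damping value of 0.4, and the toy matrices are assumptions for illustration. Each adjacency matrix is scaled by its row and column degrees raised to the negative damping exponent, and the scaled matrices are then multiplied along the metapath. Every node type in this metapath is distinct, so no path can revisit a node and the duplicate-node problem does not arise.

```python
import numpy as np


def degree_weight(adjacency, damping=0.4):
    """Scale each entry by (row degree * column degree) ** -damping."""
    adjacency = np.asarray(adjacency, dtype=float)
    row_degree = adjacency.sum(axis=1)
    col_degree = adjacency.sum(axis=0)
    # Zero-degree rows/columns hold only zeros, so any finite scale works.
    row_scale = np.where(row_degree > 0, row_degree, 1.0) ** -damping
    col_scale = np.where(col_degree > 0, col_degree, 1.0) ** -damping
    return adjacency * row_scale[:, None] * col_scale[None, :]


def dwpc_matrix(adjacency_matrices, damping=0.4):
    """Multiply degree-weighted matrices along a metapath."""
    result = degree_weight(adjacency_matrices[0], damping)
    for adjacency in adjacency_matrices[1:]:
        result = result @ degree_weight(adjacency, damping)
    return result


# Rows are compounds (c1, c2); columns are genes (g1, g2, g3).
compound_binds_gene = np.array([[1, 1, 0],
                                [0, 1, 1]])
# Rows are genes (g1, g2, g3); columns are diseases (d1, d2).
gene_associates_disease = np.array([[1, 0],
                                    [1, 1],
                                    [0, 1]])

dwpcs = dwpc_matrix([compound_binds_gene, gene_associates_disease])
print(dwpcs.round(4))  # one row per compound, one column per disease
```

The output holds a value for every compound-disease pair at once, but nothing in it records which genes the contributing paths passed through.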
I was curious if you extract simpler, more homogeneous graphs out of hetionet, say of only gene-gene interactions?
We implement a get_subgraph method for hetio.hetnet.Graph objects. However, we have mostly used this to generate sub-hetnets (usually to create testing networks) rather than homogeneous networks. Since I feel that hetnets are underutilized compared to homonets, I don't spend much time working on approaches for homonets.
Somewhat related, in hetmech, we've created a HetMat data structure that stores hetnets on disk. Each adjacency matrix is a different file (exported from numpy or scipy). In this way, users interested in only certain parts of the hetnet don't have to read all relationship matrices.
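As a rough illustration of the one-file-per-matrix idea, the sketch below writes each relationship type's adjacency matrix to its own .npy file and reads back only the one that is needed. The directory layout, file names, and the save_matrices and load_matrix helpers are hypothetical and do not reflect the actual HetMat format.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_matrices(directory, matrices):
    """Write each relationship type's adjacency matrix to its own .npy file."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for rel_type, matrix in matrices.items():
        np.save(directory / f"{rel_type}.npy", matrix)


def load_matrix(directory, rel_type):
    """Read a single relationship type without touching the other files."""
    return np.load(Path(directory) / f"{rel_type}.npy")


# Toy matrices keyed by (made-up) relationship type.
matrices = {
    "binds": np.array([[1, 1, 0], [0, 1, 1]]),
    "associates": np.array([[1, 0], [1, 1], [0, 1]]),
}

with tempfile.TemporaryDirectory() as tmp:
    save_matrices(tmp, matrices)
    stored_files = sorted(path.name for path in Path(tmp).iterdir())
    binds = load_matrix(tmp, "binds")  # only this file is read

print(stored_files)
print(binds)
```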
Were there any features that you think were missing from the software that you were using, or things that were difficult to do?
I think visualization of hetnets is still a pain point, especially visualizing large numbers of nodes and relationships. Of course, visualizing 50 thousand nodes and millions of relationships won't tell you much about specific nodes or relationships, but these views help communicate the network generally. We've used Cytoscape here, but even this became unwieldy and was very manual.
Feel free to follow up with any additional questions. Or if you have nothing else to ask, you can close the issue.
from hetionet.
Thank you very much for the detailed response. The Cypher queries here are especially useful. Two follow-up questions:
- Are you building hetmech for performance reasons only? Or were there computations you thought were simply much easier to express as matrix multiplications?
- There are several well-developed graph libraries that provide adjacency matrix representations and operations of graphs, such as networkx. Would these serve your needs, or did you choose to build hetmech from scratch because there were specific computations they would not satisfy?
And I think VLDB might be registering the DOIs in September when the conference is held. They always do, but I'm not sure of the exact timeline.
from hetionet.
Are you building the hetmech for performance reasons only?
Hetmech and the related HetMat data structure are motivated primarily by performance. Personally, I don't find matrices and their dot products an intuitive data structure for hetnets. To me, it's much more intuitive to use a data structure that more closely resembles a network and that more easily allows nodes/edges to be annotated with properties. However, the performance improvement from calculating path counts via matrix multiplication turns out to be too compelling to ignore. Matrix multiplication is faster than path traversal algorithms in two important ways:
- Computation time scales linearly with path length because matrix multiplication does not track which paths arrive at a certain destination. Path traversal methods blow up with increasing path length.
- The matrix multiplication approaches compute DWPCs for all source-target node pairs. We usually are interested in all pairs, so it's a huge bonus to get all the DWPCs in a single output matrix.
Together, these factors lead to an efficiency improvement of several orders of magnitude.
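A small pure-Python sketch can illustrate the scaling argument. It counts walks (node sequences that may revisit nodes, which is what plain matrix multiplication counts) of a given length between all node pairs in two ways: by visiting every walk, and by repeated matrix products. The graph, function names, and the use of edge visits as a cost measure are assumptions chosen for illustration, not a benchmark of any real implementation.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def walks_by_matrix(adjacency, length):
    """Return (all-pairs walk-count matrix, number of matrix products)."""
    n = len(adjacency)
    result = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(length):
        result = matmul(result, adjacency)
    return result, length


def walks_by_traversal(adjacency, length):
    """Return (all-pairs walk-count matrix, number of edge visits)."""
    n = len(adjacency)
    counts = [[0] * n for _ in range(n)]
    edge_visits = 0
    # Each stack entry: (starting node, current node, steps taken).
    stack = [(source, source, 0) for source in range(n)]
    while stack:
        source, node, depth = stack.pop()
        if depth == length:
            counts[source][node] += 1
            continue
        for nxt in range(n):
            if adjacency[node][nxt]:
                edge_visits += 1
                stack.append((source, nxt, depth + 1))
    return counts, edge_visits


# Complete graph on four nodes, no self-loops.
ADJACENCY = [[int(i != j) for j in range(4)] for i in range(4)]

for length in (2, 4, 6):
    by_matrix, products = walks_by_matrix(ADJACENCY, length)
    by_traversal, edge_visits = walks_by_traversal(ADJACENCY, length)
    assert by_matrix == by_traversal  # both methods agree on the counts
    print(length, products, edge_visits)
```

On this four-node graph the number of matrix products equals the walk length, while the traversal's edge visits grow from 48 at length 2 to 4368 at length 6.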
There are several well-developed graph libraries that provide adjacency matrix representations and operations of graphs, such as networkx. Would these serve your needs, or did you choose to build hetmech from scratch because there were specific computations they would not satisfy?
We are always on the lookout for larger projects that provide functionality for hetnets and did consider alternatives before creating HetMat. networkx isn't a good option as its hetnet support is mediocre: the MultiGraph supports relationship types but not node types and doesn't really provide first-class type support. While networkx can export to adjacency matrices, it doesn't really have the functionality we needed to perform computations on them. Building the HetMat data structure from scratch allowed us to do some really cool things:
- implement an on-disk data structure for hetnets
- enable types of on-disk caching
- enable additional types of in-memory caching and optimizations
Currently, our hetnet stack consists of the following tools:
- hetio.hetnet.MetaGraph objects for storing the hetnet schema (metagraph)
- hetmech.hetmat.HetMat for storing hetnets as matrices
- Neo4j for enabling custom Cypher queries, interactive visualization, and path traversal operations
So as our research has progressed, it seems like we're using more tools, since we're finding where each tool excels and using it for just those applications. For some projects, we do use networkx (see obonet for example), just less so for our hetnet work.
from hetionet.
Great, thank you very much for the detailed answer again.
One final question: When integrating data to the hetnet, or even after that when using it for research, did you ever have to do "graph cleaning"? For example, did you notice that some nodes or edges looked suspicious and then have to remove them? Or was each network dataset that you integrated already clean?
from hetionet.
When integrating data to the hetnet, or even after that when using it for research, did you ever have to do "graph cleaning"?
We released an initial version of Hetionet quite a bit before we released version 1.0, which had several additions, changes, and improvements. First, every relationship type required preprocessing, which is where the cleaning occurred. In general, each resource had its own repository where most of the preprocessing took place. Then we had a single notebook that integrated the data from all of the source repositories.
Graph cleaning was an iterative approach. Neo4j was super helpful here because it provided a visual way for us to quickly explore and sanity check the networks. Of course, you often will notice bugs or possible improvements. For example, certain metadata may be missing, certain things may be misspelled, or additional processing may be required. Cleaning the data as well as mapping everything to common standardized identifiers was a laborious process, as were the legal issues surrounding the data reuse.
See this table with all of the resources we integrated and citations to the related supplementary materials.
from hetionet.
Great, thank you very much! I'm closing the issue.