neo4j-contrib / neo4j-graph-algorithms

763.0 53.0 240.0 42.34 MB

Efficient Graph Algorithms for Neo4j

Home Page: https://github.com/neo4j/graph-data-science/

License: GNU General Public License v3.0

Java 100.00%

graph-algorithms neo4j cypher graph-database graph-analytics

neo4j-graph-algorithms's Introduction

Efficient Graph Algorithms for Neo4j

This library has been deprecated and superseded by the Graph Data Science (GDS) library, now available on our download center or via our GitHub repo at https://github.com/neo4j/graph-data-science/.

GDS is an update to the graph algorithms library, featuring all of your favorite graph algorithms - and some new ones - plus a new, unified and simplified surface, improvements to the graph loaders, better error messaging, and additional features and workflows to support production-scale deployments.

Documentation for the Graph Data Science Library is available here: https://neo4j.com/docs/graph-data-science/current/

Please enter all GitHub issues on the new repo.

Graph Algorithms Book

Amy Hodler and Mark Needham recently finished writing the O’Reilly Graph Algorithms Book. For a limited time only, you can download your free copy from neo4j.com/graph-algorithms-book

neo4j-graph-algorithms's Issues

Configuration of procedure API

The test procedures at https://github.com/neo4j-contrib/neo4j-graph-algorithms/blob/607ec2c03ae57fd0edca75f25da3b6ad4177361d/algo/src/main/java/org/neo4j/graphalgo/impl/TestProcedure.java have unsafe configurability.

- Parallelism is done by a copied procedure of a different name but should instead be a config parameter.
- There is no validation of parameters.
- There are no default parameters.
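
A minimal sketch of how the three points above could be addressed together, assuming the procedure receives its options as a map. The class name `ProcedureConfig`, the keys `concurrency` and `iterations`, and the default values are all hypothetical and not taken from the library.

```java
import java.util.Map;

/** Hypothetical config holder: parallelism is a parameter, values are validated, defaults exist. */
final class ProcedureConfig {
    static final int DEFAULT_CONCURRENCY = 1;  // assumed default, not the library's
    static final int DEFAULT_ITERATIONS = 20;  // assumed default, not the library's

    final int concurrency;
    final int iterations;

    private ProcedureConfig(int concurrency, int iterations) {
        this.concurrency = concurrency;
        this.iterations = iterations;
    }

    /** Builds a validated config from the raw map passed to the procedure. */
    static ProcedureConfig of(Map<String, ?> raw) {
        int concurrency = intParam(raw, "concurrency", DEFAULT_CONCURRENCY);
        int iterations = intParam(raw, "iterations", DEFAULT_ITERATIONS);
        if (concurrency < 1) {
            throw new IllegalArgumentException("concurrency must be >= 1, got " + concurrency);
        }
        if (iterations < 1) {
            throw new IllegalArgumentException("iterations must be >= 1, got " + iterations);
        }
        return new ProcedureConfig(concurrency, iterations);
    }

    /** Reads an integer parameter, falling back to the default when the key is absent. */
    private static int intParam(Map<String, ?> raw, String key, int defaultValue) {
        Object value = raw.get(key);
        if (value == null) {
            return defaultValue;
        }
        if (!(value instanceof Number)) {
            throw new IllegalArgumentException(key + " must be a number, got " + value);
        }
        return ((Number) value).intValue();
    }
}
```

With this shape, a single procedure can serve both the sequential and the parallel case, and invalid input fails before any graph is loaded.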

JMH Benchmarks for Shortest Path

Parallel loading for LightGraph

Fix weight mapping in heavy graph

org.neo4j.graphalgo.core.heavyweight.HeavyGraphTest is currently ignored due to failing tests (weighted iterator / weighted forEach). Fix weight mapping and unignore test.

Custom default weight

The WeightMapping assumes a default weight of 0.0. The number should be configurable per algorithm. Since the defaults aren't stored, this would allow algorithms to reduce memory usage by not storing default weights if they require a different default from 0.0

RelationContainer.Builder doesn't care about the given label-id

Evaluate a manager concept for graph loading

The Manager should decide which graph or config to use if no further configuration is specified in the cypher statement.

Turn WeightMapping into interface with two cases

Currently we have conditional checks on every get to see whether we have a mapping at all or whether we just have to return the default value.
We should turn WeightMapping into an interface with at most two implementations.
One is the current implementation and the second one always returns a default value and never stores anything.
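
A rough sketch of the proposed split, assuming weights are looked up by a `long` id. The implementation names `StoredWeightMapping` and `DefaultWeightMapping` are hypothetical, and the boxed `HashMap` only keeps the example short; a real implementation would presumably use a primitive map.

```java
import java.util.HashMap;
import java.util.Map;

/** The interface the issue asks for: callers no longer branch on "is there a mapping?". */
interface WeightMapping {
    double get(long id);
}

/** Stores only weights that differ from the default. */
final class StoredWeightMapping implements WeightMapping {
    private final Map<Long, Double> weights = new HashMap<>();
    private final double defaultWeight;

    StoredWeightMapping(double defaultWeight) {
        this.defaultWeight = defaultWeight;
    }

    void set(long id, double weight) {
        // Default weights are never stored, which saves memory.
        if (weight != defaultWeight) {
            weights.put(id, weight);
        }
    }

    @Override
    public double get(long id) {
        Double weight = weights.get(id);
        return weight == null ? defaultWeight : weight;
    }

    int size() {
        return weights.size();
    }
}

/** Never stores anything and always answers with the default. */
final class DefaultWeightMapping implements WeightMapping {
    private final double defaultWeight;

    DefaultWeightMapping(double defaultWeight) {
        this.defaultWeight = defaultWeight;
    }

    @Override
    public double get(long id) {
        return defaultWeight;
    }
}
```

Taking the default as a constructor argument also covers the per-algorithm default weight, since nothing in the code assumes 0.0.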

Floyd Warshall algorithm

I suggest adding the Floyd-Warshall algorithm to compute the shortest paths from many sources to many destinations, and making it callable directly through APOC procedures.

Thank you
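
For reference, a compact sketch of the textbook Floyd-Warshall algorithm on a dense weight matrix. The class and method names are illustrative; the sketch assumes `Double.POSITIVE_INFINITY` marks a missing edge and that the graph has no negative cycles. It is not wired to Neo4j or APOC.

```java
/** Textbook all-pairs shortest paths in O(n^3) time and O(n^2) memory. */
final class FloydWarshall {

    /**
     * @param weights weights[i][j] is the edge weight from i to j,
     *                or Double.POSITIVE_INFINITY if there is no edge
     * @return a new matrix with the shortest distance for every pair
     */
    static double[][] shortestPaths(double[][] weights) {
        int n = weights.length;
        double[][] dist = new double[n][];
        for (int i = 0; i < n; i++) {
            dist[i] = weights[i].clone(); // leave the input untouched
            dist[i][i] = 0.0;
        }
        // Allow paths through intermediate node k, one k at a time.
        for (int k = 0; k < n; k++) {
            for (int i = 0; i < n; i++) {
                double ik = dist[i][k];
                if (ik == Double.POSITIVE_INFINITY) {
                    continue; // no path i -> k, nothing to improve
                }
                for (int j = 0; j < n; j++) {
                    double via = ik + dist[k][j];
                    if (via < dist[i][j]) {
                        dist[i][j] = via;
                    }
                }
            }
        }
        return dist;
    }
}
```

The quadratic memory footprint limits this approach to fairly small graphs or subgraphs.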

Allow concurrent access in GraphView

GraphView loads the ReadOperations in the constructor, which binds the graph to the same thread.
We should make it such that it can be used from multiple threads, esp. if we start to implement parallel graph algorithms.

Add single array optimization for IntArray

https://github.com/mknblch/graphtest/commit/33fba7f591fb6e8bf5a1f172aea6d7e304b55717 has removed an optimization of IntArray that uses a single page.
It should be revived and added, but for a larger array size than just a single page.

Remove nondeterminism in HeavyGraphParallelLoadingTest

The HeavyGraphParallelLoadingTest occasionally fails with ArrayIndexOutOfBoundsExceptions (AIOOBEs) thrown by the GraphFactory. Tests should always be deterministic.

Document graph encoding

We should explain how exactly the LightGraph and HeavyGraph encode the graph and how their internals work.

Write results of algorithms to the graph

Some algorithms operate on all nodes (e.g. PageRank) and instead of returning a large list of results, we should write the result back to the graph. Writes can be partitioned, which makes them embarrassingly parallel.

Replace LongToIntFunction with a more domain specific interface in GraphLoaders

Load input for Graphs via Cypher

Instead of allocating a List<Node> from Cypher while calling the procedure, we can accept a Cypher statement and run it ourselves, using the far more efficient PLongIterators.

int[] specialized form of LightGraph

Follows #7

For smaller graphs, the LightGraph could use int[] instead of IntArray for the adjacency list and int[] instead of long[] for the offsets, which reduces memory consumption and removes one indirection.
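
To make the idea concrete, here is a small sketch of an adjacency list held in two plain `int[]` arrays (an offsets array plus a flat targets array). `IntCsrGraph` and its methods are hypothetical names; the actual LightGraph layout may differ.

```java
import java.util.Arrays;
import java.util.function.IntConsumer;

/** Adjacency list in two int[] arrays; only works while node and edge counts fit in an int. */
final class IntCsrGraph {
    private final int[] offsets; // offsets[n]..offsets[n + 1] delimits the targets of node n
    private final int[] targets; // all target ids, grouped by source node

    private IntCsrGraph(int[] offsets, int[] targets) {
        this.offsets = offsets;
        this.targets = targets;
    }

    /** Builds the graph from {source, target} pairs using a counting sort by source. */
    static IntCsrGraph fromEdges(int nodeCount, int[][] edges) {
        int[] offsets = new int[nodeCount + 1];
        for (int[] edge : edges) {
            offsets[edge[0] + 1]++; // count the out-degree of each source
        }
        for (int i = 0; i < nodeCount; i++) {
            offsets[i + 1] += offsets[i]; // prefix sum turns degrees into offsets
        }
        int[] targets = new int[edges.length];
        int[] cursor = Arrays.copyOf(offsets, nodeCount); // next free slot per node
        for (int[] edge : edges) {
            targets[cursor[edge[0]]++] = edge[1];
        }
        return new IntCsrGraph(offsets, targets);
    }

    int degree(int node) {
        return offsets[node + 1] - offsets[node];
    }

    void forEachOutgoing(int node, IntConsumer consumer) {
        for (int i = offsets[node]; i < offsets[node + 1]; i++) {
            consumer.accept(targets[i]);
        }
    }
}
```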

Add SCC and Dijkstra algos from graph test repo

Add Setter in GraphLoader for propertyDefaultWeight

We need another Setter in the GraphLoader for the propertyDefaultWeight which just overwrites the actual defaultWeight but leaves the relation type unchanged.

GraphView ID handling relies on int values

The GraphView currently relies on node IDs smaller than int.max. We should add some kind of lazy IdMapping if we consider supporting graphs (or subsets of a graph) with IDs that exceed the int range.
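
One possible reading of "lazy" is that dense int ids are assigned the first time an original id is seen. The sketch below follows that assumption; `LazyIdMapping` is a hypothetical name, the boxed collections are used only for brevity, and the class is not thread-safe.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Assigns dense int ids (0, 1, 2, ...) to long node ids on first use. */
final class LazyIdMapping {
    private final Map<Long, Integer> toMapped = new HashMap<>();
    private final List<Long> toOriginal = new ArrayList<>();

    /** Returns the mapped id, creating a new one if the original id is unknown. */
    int toMappedId(long originalId) {
        Integer existing = toMapped.get(originalId);
        if (existing != null) {
            return existing;
        }
        int mapped = toOriginal.size();
        toMapped.put(originalId, mapped);
        toOriginal.add(originalId);
        return mapped;
    }

    /** Reverse lookup from the dense int id to the original long id. */
    long toOriginalId(int mappedId) {
        return toOriginal.get(mappedId);
    }

    int size() {
        return toOriginal.size();
    }
}
```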

Allow Graphs to skip loading of certain relationships

Not every algorithm needs every dimension of the graph. For example, PageRank only requires incoming relationships; for outgoing ones, it just needs the degree.
We should find an API that allows us to express those requirements, so that we don't have to load and store all outgoing relationships, just their degree.

Change Consumer into Functions to allow premature termination

Since we might decide to get rid of the Iterator methods (in #29), we should add the possibility to terminate the iteration within a forEach(..) method before all values have been emitted. This could be implemented by changing the Consumers into Functions that return a Boolean, which either stops or continues the current iteration.
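
A short sketch of the pattern, using the standard `IntPredicate` in place of a custom function type. The convention that `true` means "continue" is an assumption, and `TargetList` is a hypothetical stand-in for whatever holds the relationships.

```java
import java.util.function.IntPredicate;

/** Stand-in for a relationship container with an early-terminating forEach. */
final class TargetList {
    private final int[] targets;

    TargetList(int... targets) {
        this.targets = targets;
    }

    /**
     * Calls the visitor for each target until it returns false.
     *
     * @return true if every target was visited, false if the visitor stopped early
     */
    boolean forEach(IntPredicate visitor) {
        for (int target : targets) {
            if (!visitor.test(target)) {
                return false;
            }
        }
        return true;
    }
}
```

A search for a specific target can then stop at the first hit, for example `list.forEach(t -> t != wanted)`.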

Investigate using Hilbert Curves for better cache locality

turn list of graph algorithms into issues with progress tracking

Create a card for each algorithm to track progress in.

also note findings, links, ideas from research / reading in each card
and also limitations of the current implementation / alternatives

Implement the selected algorithms and expose them as a Java API and as Java stored procedures
- PageRank
- Label Propagation
- Louvain
- Betweenness Centrality
- Closeness Centrality
- Degree Centrality
- Single Shortest Path
- Strongly, Weakly connected components
- Parallel BFS / DFS
Evaluation
- Performance Tests (different dataset sizes, regressions, level of concurrency)
- Performance comparison with other implementations (Spark/Flink/Graphlytic) on same dataset and same level of concurrency
- Review existing implementations for their functional correctness and performance:
  - A*
  - Dijkstra
  - Shortest Path

issues created

Simplify Matrix representation

We have a large number of objects and indirections in AdjacencyMatrix. We should reduce the overhead by finding a better adjacency encoding.

Faster Array.fill for large arrays

Some arrays are allocated and then pre-filled with a default value. Array.fill does a naïve linear iteration over all array indices and sets each element, which can become quite inefficient for large arrays. As this happens mostly for arrays that store something per node, we have a linear dependency on the number of nodes that we can strive to eliminate.

- Fill in batches by arraycopying from pre-filled arrays.
- Check if we can maybe rewrite some algorithms to make use of the system default value, instead of having to use a custom default value.
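
One way to read the first point is a fill that seeds a small prefix and then copies the already-filled part of the array onto the rest, doubling the filled region each round. The sketch below shows that variant; the class name and the seed size of 1024 are arbitrary assumptions, and whether this actually beats `Arrays.fill` on a given JVM would need to be measured with a benchmark.

```java
import java.util.Arrays;

/** Fills an array by copying its own already-filled prefix with System.arraycopy. */
final class ArrayFill {
    private static final int SEED_SIZE = 1024; // arbitrary; tune by benchmarking

    static void fill(double[] array, double value) {
        int length = array.length;
        if (length == 0) {
            return;
        }
        // Seed a small prefix with the plain linear fill.
        int seed = Math.min(length, SEED_SIZE);
        Arrays.fill(array, 0, seed, value);
        // Copy the filled prefix onto the unfilled part, doubling each round.
        int filled = seed;
        while (filled < length) {
            int chunk = Math.min(filled, length - filled);
            System.arraycopy(array, 0, array, filled, chunk);
            filled += chunk;
        }
    }
}
```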

Add negative tests for all GraphLoaders

Test that providing invalid/unknown labels/types/properties behaves as expected.

Implement parallel WeightMapping

The HeavyGraph ignores weights when loading happens in parallel but Weights are still required.

Move Fileloaders back to core module

Loading a graph from, and writing it to, a serialized file might be useful for others besides our tests.

- Move every loader to the core module.
- Replace reflective access with package-private access.
- Document accessors and constructors, and why they are package-private.

Remove relation ids from Graph API

Consider IdMapping starting at 1 (0 exclusive)

In our current approach the IdMapping returns integers starting at 0. Yet it is often the case that nodeId arrays have to be initialized with some kind of start value. A 1-based mapping could save us some initialization loops.

Investigate performance differences between Light and Heavy Graph

Just by the looks of it, the HeavyGraph shouldn't be that much faster than the LightGraph.
Let's see if we can figure out why the difference is how it is and whether we can make LightGraph faster.

JavaDocument public Graph API

Graph: #4 (comment)
IdMap: #4 (comment)

Smaller and fixed batch sizes for parallel imports

The current batchSize is nodeCount / nrOfThreads. It would be better to use a fixed batch size like 10k or 100k.

- Better work stealing possibilities, if one batch contains mostly deleted nodes
- More predictable resource usage, temporary arrays could be reused for multiple batches
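
As an illustration, the node id range could be cut into fixed-size half-open intervals like this. `Batches` is a hypothetical helper, not part of the existing loaders.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits the node id range [0, nodeCount) into batches of a fixed size. */
final class Batches {

    /** @return a list of {startInclusive, endExclusive} pairs; the last batch may be shorter */
    static List<int[]> of(int nodeCount, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1, got " + batchSize);
        }
        List<int[]> batches = new ArrayList<>();
        int start = 0;
        while (start < nodeCount) {
            // Computed this way to avoid int overflow near Integer.MAX_VALUE.
            int end = start + Math.min(batchSize, nodeCount - start);
            batches.add(new int[] {start, end});
            start = end;
        }
        return batches;
    }
}
```

With many small batches, a thread that finishes a batch of mostly deleted nodes simply picks up the next one instead of idling.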

JMH recording

Run JMH benchmarks with CSV output or a similar machine-parsable format. Dump results over time to someplace.

Decide for one approach of Iterator vs Consumer vs Lookup in Graph API

The Graph currently provides multiple ways to access the graph data. We should decide on one, with the consumer-based API being the favorite.

#4 (comment)

Autogrowing array in RelationContainer

Currently we have to initialize the RelationContainer.Builder with the degree. Add logic that grows the relationship array on demand. Also check if growing is an option for the parent (data) array too.
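
A minimal sketch of on-demand growth, assuming a growth factor of roughly 1.5. `GrowingIntArray` is a hypothetical class and not the builder's actual storage; capacity overflow for extremely large arrays is not handled here.

```java
import java.util.Arrays;

/** An int buffer that grows when full, so the degree need not be known up front. */
final class GrowingIntArray {
    private int[] data;
    private int size;

    GrowingIntArray(int initialCapacity) {
        this.data = new int[Math.max(1, initialCapacity)];
    }

    void add(int value) {
        if (size == data.length) {
            // Grow by about 50%; the +1 guarantees progress for tiny arrays.
            data = Arrays.copyOf(data, data.length + (data.length >> 1) + 1);
        }
        data[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return data[index];
    }

    int size() {
        return size;
    }

    /** Returns a copy trimmed to the number of stored values. */
    int[] toArray() {
        return Arrays.copyOf(data, size);
    }
}
```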

First clustering graph algorithm

Something that's useful / useable, like label-prop or union-find, I leave the choice up to you.

Reasons:

- We have a practical use-case that would benefit from it.
- We want to exercise the graph-API also from different kinds of algorithms.

One relevant feature would be to consider the "weight" property in a relationship for the "strength" of the connection to a cluster.

A simple solution to start with could be to use "weight" as a filter, e.g. only consider relationships whose weight exceeds a certain threshold.

General-purpose, power procedure

This general purpose procedure allows loading the graph once and then allows multiple, differently configured algorithms on top of it, e.g. also page-rank with different configs or page-rank and clustering

procedure pipelines

We can also consider one algorithm feeding into the next. E.g. the result of a first PageRank run is not (just) persisted into the graph but immediately (with the in-memory computed values) taken into account by the next algorithm (centrality or clustering).

Don't rely on availableProcessors as a default/fallback

The default should be something unrelated to the number of host processors. With #5 and #16 implemented, the impact of the default parallelism should be minimal.

Include all benchmarks from graphtest repository

ThreadPool handling

optionally use Neo4j Thread Pool like in APOC or via dependency resolver of GD-API

Validation of GraphLoader values

Similar to #16, the provided values should be validated before building the Graph.

#4 (comment)

Thread safe WeightMap

The current WeightMap is not threadsafe for write access. Evaluate int->double / int->int backed mapping logic. To implement this we first need some kind of mapping between the long-relationship ids and their inner representation

Revise primitive collections

We should try to use Neo4j's primitive collections where possible, and document and explain when we use a different collection.
Where we settle on a third-party collection, we can think about changing the algorithm to be able to use a Neo4j collection instead, or PR a change to the Neo4j collections.

Write benchmarks for larger graphs

- A couple of million nodes and edges (Wikipedia/DBpedia size)
- Single-shot execution
- Fewer iterations and warmups

document internal workings of UndirectedTree

More descriptive names in GraphLoader interface

The set* methods are uncommon for the fluent/builder-style interface of the GraphLoader; it would be better to use with*.

Restrict GraphView to use only supplied label/relation/property/defaultValue

The GraphView doesn't care about the restrictions given in its constructor. To implement more UnitTests we need a fully working GraphView.

Implement restrictions for label, relation, and property types.

Considerations on weights

So far we assumed edge weights to be double values in [0, 1) and we have a basic mapper logic for turning arbitrary property objects into doubles. We also have to duplicate the relationship-iterator interfaces for the weighted versions. This makes the handling of weights very inflexible. With the api2 approach we could consider switching over to a weight data source instead of having them bound to the iterator.

Since we consider one property per relation and one relation between a pair of nodes we could use the mapped-nodeIds as key-pair.

I'd suggest something like this

weightOf(sourceNodeId:int, targetNodeId:int):double

This would reduce the number of different interfaces and implementations. We could also have different implementations of the weight source, each with its own characteristics.
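
A small sketch of such a weight source, assuming the two mapped int node ids are packed into one long key. The class name `PairWeights`, the `set` method, and the boxed `HashMap` are illustrative choices only.

```java
import java.util.HashMap;
import java.util.Map;

/** Weight lookup keyed by the (source, target) pair of mapped node ids. */
final class PairWeights {
    private final Map<Long, Double> weights = new HashMap<>();
    private final double defaultWeight;

    PairWeights(double defaultWeight) {
        this.defaultWeight = defaultWeight;
    }

    /** Packs source into the upper and target into the lower 32 bits. */
    private static long key(int sourceNodeId, int targetNodeId) {
        return ((long) sourceNodeId << 32) | (targetNodeId & 0xFFFFFFFFL);
    }

    void set(int sourceNodeId, int targetNodeId, double weight) {
        weights.put(key(sourceNodeId, targetNodeId), weight);
    }

    /** Returns the stored weight, or the default if the pair has none. */
    double weightOf(int sourceNodeId, int targetNodeId) {
        return weights.getOrDefault(key(sourceNodeId, targetNodeId), defaultWeight);
    }
}
```

Because the key is ordered, the weight of (a, b) is independent of (b, a), which matches directed relationships.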
