neo4j-contrib / neo4j-mazerunner

Mazerunner extends a Neo4j graph database to run scheduled big data graph compute algorithms at scale with HDFS and Apache Spark.

License: Apache License 2.0

Shell 2.43% Java 66.45% Scala 30.20% Dockerfile 0.92%

neo4j-mazerunner's Issues

Create a Mazerunner administration tool

Create a Mazerunner administration tool that schedules, configures, and monitors agent jobs.

Registering new graph algorithms

Create a graph algorithm registration process, making it available to the Mazerunner admin tool.

Maven issue

I have tried a couple of times, but I keep getting the following error when provisioning. I wonder if it's just me?

==> default: Compiling neo4j-mazerunner extension...
==> default: Nov 10, 2014 8:03:23 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
==> default: INFO: I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request: The server repo.maven.apache.org failed to respond
==> default: Nov 10, 2014 8:03:23 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
==> default: INFO: Retrying request
==> default: [ERROR]
==> default: Failed to execute goal on project extension: Could not resolve dependencies for project org.mazerunner:extension:jar:1.0: The following artifacts could not be resolved: org.neo4j:neo4j-kernel:jar:tests:2.1.5, org.neo4j.app:neo4j-server:jar:2.1.5: Could not transfer artifact org.neo4j:neo4j-kernel:jar:tests:2.1.5 from/to central (http://repo.maven.apache.org/maven2): Read timed out -> [Help 1]
==> default: [ERROR]
==> default: [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
==> default: [ERROR] Re-run Maven using the -X switch to enable full debug logging.
==> default: [ERROR]
==> default: [ERROR] For more information about the errors and possible solutions, please read the following articles:
==> default: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException

Deployment guide for Spark cluster scaling with Hadoop Yarn and Mesos

Create production deployment scenario for a Hadoop Yarn Resource Manager and Mesos for the Spark cluster. See http://spark.apache.org/docs/latest/cluster-overview.html

Use dependencies from local Mazerunner source on Sandbox

Change provisioning to compile from the shared Mazerunner repository and not inside the Vagrant sandbox.

neo4j 2.3.1 support?

I've already upgraded my database to 2.3.1. Any plans to support it soon? Or is it possible to connect to 2.3.1?

Betweenness Centrality Tractability?

Running this algorithm on a directed graph with 30K nodes and 50K edges is extremely slow. The output during the betweennessCentralityMap phase shows roughly 1 vertex completed every 3 minutes. I realize that this can be an O(n^3) problem depending on the algorithm, but this seems prohibitively slow. I am running this on a single VM with 10 GB of memory and 3 cores. Is this behavior normal?
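For a sense of scale: Brandes' algorithm computes exact betweenness centrality on an unweighted graph in O(V·E) time, which for a graph of this size is on the order of 30,000 × 50,000 = 1.5 billion edge visits in total. The following is a minimal single-machine sketch of that algorithm in Python. It is not Mazerunner's GraphX implementation; the function name and the adjacency-dict input format are assumptions made for illustration.

```python
from collections import deque


def betweenness_centrality(adj):
    """Brandes' algorithm for an unweighted directed graph.

    adj maps each node to an iterable of its out-neighbors (illustrative
    input format). Returns unnormalized betweenness scores per node.
    """
    nodes = set(adj)
    for targets in adj.values():
        nodes.update(targets)
    score = {v: 0.0 for v in nodes}

    for source in nodes:
        # Phase 1: breadth-first search from source, counting shortest paths.
        order = []                         # nodes in order of discovery
        preds = {v: [] for v in nodes}     # predecessors on shortest paths
        sigma = {v: 0 for v in nodes}      # number of shortest paths
        dist = {v: -1 for v in nodes}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj.get(v, ()):
                if dist[w] < 0:            # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1: # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)

        # Phase 2: accumulate dependencies in reverse discovery order.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    return score
```

Each source costs one breadth-first search plus one reverse pass, which is where the O(V·E) bound comes from.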

Create a log management service

Create a log management service that listens for logs submitted across all modules of Mazerunner.

Dependency issues

When running kbastani/neo4j-graph-analytics:latest in docker I get an exception during startup:

Exception in thread "main" java.lang.NoSuchFieldError: IBM_JAVA
    at org.apache.hadoop.security.UserGroupInformation.getOSLoginModuleName(UserGroupInformation.java:337)
    at org.apache.hadoop.security.UserGroupInformation.<clinit>(UserGroupInformation.java:382)
    at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
    at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
    at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:238)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
    at org.mazerunner.core.processor.GraphProcessor.initializeSparkContext(GraphProcessor.java:78)
    at org.mazerunner.core.messaging.Worker.doMain(Worker.java:109)
    at org.mazerunner.core.messaging.Worker.main(Worker.java:60)

The reason for this exception is a conflict between hadoop-auth-2.4.1.jar and hadoop-core-1.2.1.jar. Both are copied from docker/bin/lib to the classpath in the docker image and both contain the class org.apache.hadoop.util.PlatformName, but the version in hadoop-core does not declare the field IBM_JAVA.

I managed to resolve this issue by simply deleting hadoop-core-1.2.1.jar and rebuilding the docker image, though I'm not sure if this breaks some other part of the application (so far it has worked fine for me).

I'm not sure about the structure of this image, but it seems like the projects org.mazerunner.extension and org.mazerunner.spark have conflicting dependencies, yet they are still copied onto the same classpath.

If those are two separate programs, maybe you should copy each project and its dependencies into a separate folder. And if one depends on the other, then you should add the other project as a dependency and let Maven resolve the conflicts.
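A conflict like this can be found mechanically by listing the class files that appear in more than one JAR on the classpath. The following Python sketch does that with the standard zipfile module. The function name is hypothetical, and docker/bin/lib is simply the directory mentioned above.

```python
import glob
import zipfile
from collections import defaultdict


def find_duplicate_classes(jar_paths):
    """Return {class entry: [jars containing it]} for classes in 2+ jars."""
    owners = defaultdict(list)
    for jar in jar_paths:
        with zipfile.ZipFile(jar) as archive:
            # set() guards against an entry listed twice inside one archive.
            for name in set(archive.namelist()):
                if name.endswith(".class"):
                    owners[name].append(jar)
    return {name: jars for name, jars in owners.items() if len(jars) > 1}


if __name__ == "__main__":
    # Directory taken from the report above; adjust to the local checkout.
    duplicates = find_duplicate_classes(sorted(glob.glob("docker/bin/lib/*.jar")))
    for name, jars in sorted(duplicates.items()):
        print(name, "->", ", ".join(jars))
```

A duplicate entry is not always harmful, but two different versions of the same class on one classpath is exactly the situation that produces errors such as NoSuchFieldError.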

Test on My Dataset

Kenny,

I have this up and running, and the movie sample works fine.

I have replaced graph.db with my data and attempted to get PageRank to run. I have done the following:

Replaced graph.db
Restarted Neo4j - works fine
Created a new relationship to define the subgraph (u1:User)-[:RETWEETED]->(u2:User)
Commented out the demo graph creation in mazerunner.sh
Changed KNOWS to RETWEETED in spark/src/main/resources/spark.properties
Restarted Mazerunner - starts OK
Refreshed the endpoint and saw Hadoop fire up and run

However, the User nodes are not updated with the weight property.

I appreciate that all this is very hacky on my part. Rather than continue hacking to find out what I am missing to run PageRank on my (u1:User)-[:RETWEETED]->(u2:User) relationship, I wondered if you could let me know whether there is documentation on how to configure this.

regards,
js

integrate simRank algorithm

I was planning to integrate the SimRank algorithm into Mazerunner. My original thought is:

export data into HDFS
run the SimRank algorithm via Spark
import the scores back into Neo4j (i.e., bound to relationships between nodes)

I'm not sure whether that's the right way to go or where to start. Any suggestions would be much appreciated.
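As a reference point for the scoring step in the plan above, here is a naive in-memory SimRank iteration in Python. It is a sketch of the textbook recurrence (the similarity of two distinct nodes is the decayed average similarity of their in-neighbor pairs), not a Spark or GraphX implementation and not part of Mazerunner. The function name, the decay default, and the input format are assumptions.

```python
def simrank(in_neighbors, decay=0.8, iterations=10):
    """Naive SimRank.

    in_neighbors maps every node to a list of its in-neighbors; every
    node that appears in a list must also be a key (illustrative format).
    Returns {(a, b): similarity} for all ordered node pairs.
    """
    nodes = list(in_neighbors)
    # Start from the identity: a node is fully similar only to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}

    for _ in range(iterations):
        updated = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    updated[(a, b)] = 1.0
                    continue
                ins_a, ins_b = in_neighbors[a], in_neighbors[b]
                if not ins_a or not ins_b:
                    # No in-neighbors means no evidence of similarity.
                    updated[(a, b)] = 0.0
                    continue
                total = sum(sim[(i, j)] for i in ins_a for j in ins_b)
                updated[(a, b)] = decay * total / (len(ins_a) * len(ins_b))
        sim = updated
    return sim
```

This version compares every pair of nodes on every iteration, so it is only practical for small graphs. A distributed version would have to partition that pairwise work, which is the part that would run on Spark.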

/cc @kbastani

Error while building neo4j-mazerunner project

I'm trying to run Kenny Bastani's neo4j-mazerunner project, but it gives me an error:

root@salma-SATELLITE-C855-1EQ:~/workspace/neo4j-mazrunner# docker run 1cddd1005556
Starting message broker: rabbitmq-server.


    __  ______ _____   __________  __  ___   ___   ____________  
   /  |/  /   /__  /  / ____/ __ \/ / / / | / / | / / ____/ __ \ 
  / /|_/ / /| | / /  / __/ / /_/ / / / /  |/ /  |/ / __/ / /_/ / 
 / /  / / ___ |/ /__/ /___/ _, _/ /_/ / /|  / /|  / /___/ _, _/  
/_/  /_/_/  |_/____/_____/_/ |_|\____/_/ |_/_/ |_/_____/_/ |_|   

=========================
Mazerunner is running...
=========================
To start a PageRank job, access the Mazerunner PageRank endpoint
Example: curl http://localhost:7474/service/mazerunner/analysis/pagerank/KNOWS
Error: Could not find or load main class org.mazerunner.core.messaging.Worker

Can anyone help me with this?

Mazerunner Problems - HDFS not accessible from neo4j container [Fixed]

Mazerunner does not seem to work for me. Here are some of the steps that I have followed.

I followed the README.md file of the following GitHub repository: https://github.com/kbastani/neo4j-mazerunner

My Neo4j data looks like this:
Ran the curl job
Checked the RabbitMQ job queue

I do not know much about Spark, but I found the following in this file:
https://raw.githubusercontent.com/kbastani/neo4j-mazerunner/af6a93c91b9ba54403422eca7756695911e341f1/src/spark/src/main/resources/spark.properties

# The relationship type to extract the subgraph from
org.mazerunner.job.relationshiptype=KNOWS

So is the relationship type limited to the KNOWS relationship?

Allow relationship type name as a parameter to PageRank endpoint

Allow the relationship type to be configured for each job through a query string parameter named 'relationship'.

Example request:

http://localhost:7474/service/mazerunner/pagerank/FOLLOWS

Request for docker-compose

Since the stack has three containers, it would be much easier to manage them with docker-compose instead of three separate commands. Is there any specific reason that there is no docker-compose file in the repo?

Storing the output of Spark on Neo4j

Presently, Mazerunner provides the ability to perform graph analysis on data already stored in a Neo4j server. However, one important missing feature is the ability to store data streaming out of Spark into Neo4j in real time, and to perform operations on that data.

An example of one such scenario: http://stackoverflow.com/questions/28896898/using-neo4j-with-apache-spark

Parallelize adjacency list writes to HDFS

Export (Job started): Parallelize adjacency list writes to HDFS.

Data source adapter for log storage

Create a configurable data source for log storage to use a variety of different storage solutions for logging.

[doc-enhancement] File a guide on the collaborative filtering algorithm

If I understand correctly, CF is supported.
It would be great if you could document this explicitly.

I would love to volunteer to do this once I figure out how things work, which I haven't as of now.

Expose Neo4j Browser from Sandbox

On the development environment, update Vagrant provisioning to expose Neo4j browser on the development machine.

Loading configurations at compile time

Refactor configurations to be loaded at run time and not at compile time.

Create a workflow pipeline manager

Create a workflow pipeline manager that pipes jobs together in a sequence before applying updates back to Neo4j.

Parallelize updates on nodes in Neo4j from HDFS

Import (Job complete): Parallelize updates on nodes in Neo4j from HDFS.

Improve algorithm test coverage

Improve test coverage on algorithms included in the org.mazerunner.core.algorithms package.

Can't Increase memory size in docker-compose.yml

I'm trying to run the strongly_connected_components analysis on a MacBook Pro (OS X, 16 GB) with a graph of about 60,000 nodes. I have tried the following, but I keep getting out-of-memory exceptions.

hdfs:
  environment:
    - "JAVA_OPTS=-Xmx5g"
  image: sequenceiq/hadoop-docker:2.4.1
  command: /etc/bootstrap.sh -d -bash
  mem_limit: 2048m
mazerunner:
  environment:
    - "JAVA_OPTS=-Xmx5g"
  image: kbastani/neo4j-graph-analytics:latest
  mem_limit: 2048m
  links:
   - hdfs
graphdb:
  environment:
    - "JAVA_OPTS=-Xmx2g"
  image: kbastani/docker-neo4j:latest
  mem_limit: 2048m
  ports:
   - "7474:7474"
   - "1337:1337"
  volumes:
   - /opt/data
  links:
   - mazerunner
   - hdfs
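
One thing worth checking in a file like this is whether each requested JVM heap fits inside its container limit: the hdfs and mazerunner services above ask for -Xmx5g while mem_limit is 2048m. The Python sketch below flags that kind of mismatch. The helper names are hypothetical, and it is an assumption that these images pass JAVA_OPTS through to the JVM at all.

```python
import re

# Binary multipliers for the k/m/g suffixes.
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Convert a size such as '5g' or '2048m' to bytes."""
    match = re.fullmatch(r"(\d+)([kmg])b?", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def heap_exceeds_limit(java_opts, mem_limit):
    """True if the -Xmx value in java_opts is larger than mem_limit."""
    match = re.search(r"-Xmx(\d+[kKmMgG])", java_opts)
    if match is None:
        return False  # no explicit maximum heap requested
    return parse_size(match.group(1)) > parse_size(mem_limit)


if __name__ == "__main__":
    # Values copied from the compose file above.
    print(heap_exceeds_limit("JAVA_OPTS=-Xmx5g", "2048m"))
    print(heap_exceeds_limit("JAVA_OPTS=-Xmx2g", "2048m"))
```

A JVM also needs memory beyond its heap, so a heap equal to the limit (as in the graphdb service) can still be too tight even though this check does not flag it.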

error attaching an existing graph.db

Hi,

I'm trying to attach an existing graph.db to the setup described in Readme.md and I'm getting a couple of errors during Neo4j server startup:
[o.n.s.r.RrdFactory]: Unable to open rrd store, attempting to recreate it
java.io.Exception: Invalid argument
...
[o.n.s.r.RrdFactory]: current RRDB is invalid, renamed it to /opt/data/graph.db/rrd-invalid-14446192...
[o.n.s.r.RrdFactory]: Unable to create new rrd store
java.io.Exception: Invalid argument
...

After these errors, the start-up procedure ends and the server shuts down.

I am able to open the graph.db using the normal install of Neo4j on my machine. My setup uses the latest Docker image for Mazerunner, and I've built the Neo4j image locally in order to use the latest Neo4j community edition, 2.2.5, with my graph.db. I am running all of this on Windows 7.

I am new to the stack used for this sample and any pointers to what may cause this problem and how to get mazerunner working with my graph.db would be very helpful!

Thanks!

Neo4j Mazerunner 1.0.0-M01 Milestone Features

Mazerunner Milestone Features

This is a list of features to be prioritized for the 1.0.0-M01 milestone release of Neo4j Mazerunner. It is a summary view of the milestone, and a place to comment on new feature requests.

Mazerunner Spark Service

Refactor the org.mazerunner.core.scala package to an algorithm package where the GraphX algorithms will be contributed.
Refactor configurations to be loaded at run time and not at compile time.
Set up continuous compilation of Scala sources for GraphX algorithm development.
Create an algorithm routing workflow for scheduling agent jobs in a pipeline.
Create a graph algorithm registration process, making it available to the Mazerunner admin tool.

Development Sandbox

Create production deployment scenario for a Hadoop Yarn Resource Manager for the Spark cluster.
On the development environment, update Vagrant provisioning to expose Neo4j browser on the development machine.
Change provisioning to compile from the shared Mazerunner repository and not inside the Vagrant sandbox.
Create a sample dataset for testing graph algorithms on, which is imported into a fresh Neo4j database when provisioning the sandbox.
Improve test coverage on algorithms included in the org.mazerunner.core.algorithms package.

Mazerunner Admin

Create a Mazerunner administration tool that schedules, configures, and monitors agent jobs.

Job Pipeline

Create a workflow pipeline manager that pipes jobs together in a sequence before applying updates back to Neo4j.

Logging

Create a log management service that listens for logs submitted across all modules of Mazerunner.
Create a configurable data source for log storage to use a variety of different storage solutions for logging.

Mazerunner Neo4j Extension

Import (Job complete): Parallelize updates on nodes in Neo4j from HDFS.
Export (Job started): Parallelize adjacency list writes to HDFS.

Refactor the org.mazerunner.core.scala package

Refactor the org.mazerunner.core.scala package to an algorithm package where the GraphX algorithms will be contributed.

Running analysis on all relationships

I think there should be an option (a '*' wildcard?) to run an analysis on all relationships in the graph.
It's pretty common to work with a graph with heterogeneous relationships that should all contribute to the pagerank (for example), and it's a bit hard to build "duplicate edges" on the fly when you have millions of nodes.

Is this something you'd consider integrating?

Method of programmatically determining when values have been persisted back to Neo4J

I am using NodeJS to trigger the pagerank algorithm. I understand once I end up with a ton of nodes and edges that it could take some time, but I wanted to have an automated way of determining when that process is complete so I can start additional logic that depends on the Neo4J nodes being updated with the values calculated by Mazerunner.

You had mentioned that I could check the log file. I think I can make that work using Node, to simply tail that file and scan the output.

Is there another way? Perhaps the original http call to trigger the job could return a jobId and then I can poll another http endpoint with that jobId for status/progress? I'm probably misusing the project anyway :-), but figured I would ask.

Here is my code:

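var http = require('http');

// CONF and jobTaskInstance are defined elsewhere in the application.
http.get(CONF.neo4J.paths.pageRank, function(response) {
    var body = '';

    // Accumulate the response body as chunks arrive.
    response.on('data', function(d) {
        body += d;
    });

    response.on('end', function() {
        var parsed;
        try {
            parsed = JSON.parse(body);
        } catch (e) {
            // A non-JSON reply is reported as an error instead of throwing.
            jobTaskInstance.error(e.message);
            return;
        }

        if (parsed.result && parsed.result === "success") {
            jobTaskInstance.complete({success: 'awesome'});
        } else {
            jobTaskInstance.error('Unknown error running PageRank');
        }
    });
}).on('error', function(e) {
    jobTaskInstance.error(e.message);
});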

So maybe the response body could look like this?

{"result":"success", "jobId":1234}
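
The jobId idea above could be consumed by a polling loop along these lines. This is only a sketch, written in Python rather than the Node.js used above, of a hypothetical API: no job status endpoint is described in this issue, so fetch_status, the status strings, and the timing parameters are all assumptions.

```python
import time

# Hypothetical status values that end the wait.
TERMINAL_STATES = ("complete", "failed")


def wait_for_job(fetch_status, job_id, interval_seconds=5.0,
                 timeout_seconds=600.0, sleep=time.sleep):
    """Poll fetch_status(job_id) until it returns a terminal state.

    fetch_status is supplied by the caller, for example a function that
    issues an HTTP GET against a status endpoint and returns a string.
    Raises TimeoutError if the job has not finished within the timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status(job_id)
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job {job_id} still '{status}' after {timeout_seconds}s")
        sleep(interval_seconds)
```

Passing the status lookup and the sleep function in as arguments keeps the loop independent of any particular HTTP client and makes it easy to exercise without a running server.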

Love this project!!!

Is this project dead?

Hey,

I read about Mazerunner and was very interested in using it for my Neo4j database; I think it could have potential. However, after installing it, running into some problems, and investigating the GitHub issues, I get the impression that this project is dead: no support, no updates, and no contributions from kbastani or other people since December. So I'm not sure whether I should invest more time in this or just skip it, which would be a shame because I think it would be cool to use. So, the questions:

1. Is this project dead?
2. Is support for newer Neo4j versions on the way? I won't downgrade my production databases just to make Mazerunner work (see below).
3. Are there any alternatives that accomplish roughly the same things?

My current problem with Mazerunner is that linking my folder with Docker does not seem to work. Basically, the option that I use for this step (from the documentation):
docker run -d -P -v /Users/<user>/<neo4j-path>/data:/opt/data --name graphdb --link mazerunner:mazerunner --link hdfs:hdfs kbastani/docker-neo4j:2.2.1
seems to be ignored. Neo4j in the container always uses /opt/data/graph.db, no matter what I enter as the path. I worked around this by copying my graph.db into the container, into this very folder (/opt/data). That was successful, and the folder is now 600 MB, but nothing is shown when I try to access it in Neo4j. If I call one of the Mazerunner endpoints, I only get a JSON reply saying "result: success", which sounds nice but doesn't do anything for me. I can't see any data, neither at the endpoint nor in Cypher. :(

I guess (does anybody know?) the problems come from the fact that my graph.db folder comes from another Neo4j version (2.3.3) while the Neo4j packed into Mazerunner is 2.2.1, and that is why nothing is shown when I query the data with Cypher.

So, that is what's on my mind. As mentioned, I fear that this project is dead, which would be a shame, so please convince me that it is not! I would love to make this work with my data; it could be awesome!

Bye guys!

Using StandAlone Spark Cluster

Hello

I wanted to know if we can use Mazerunner with a standalone Spark cluster. We have a standalone Spark cluster and a standalone Neo4j database.

I was able to find documentation on using a standalone Neo4j database, but not a standalone Spark cluster.

Please help.

Regards
Kalyan

Scala continuous compilation for GraphX

Set up continuous compilation of Scala sources for GraphX algorithm development.

Job pipeline manager

Create an algorithm routing workflow for scheduling agent jobs in a pipeline.

Error on Start

When I run the start command for Mazerunner, I get the following error. I stopped Vagrant and restarted it, but changed nothing else.

Mazerunner is running...

To start a PageRank job, access the Mazerunner PageRank endpoint
Example: curl http://localhost:7474/service/mazerunner/pagerank
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:java (default-cli) on project spark: An exception occured while executing the Java class. null: InvocationTargetException: Connection refused -> [Help 1]

Individual value PR by relationship

Greetings, I'm using Neo4j Mazerunner PageRank to analyze various types of relationships. These are the relationships that I'm using:
FRIEND
FOLLOWER
POST

My problem is that the PageRank value for FRIEND (service/mazerunner/analysis/pagerank/FRIEND), stored in the node's pagerank property, is replaced if I run another job (service/mazerunner/analysis/pagerank/FOLLOWER). I wonder whether the PageRank could be stored per relationship type, in node properties named something like pagerank_FRIEND and pagerank_FOLLOWER, so that the values are not replaced.

I'm creating a personal project for the detection of illegal networks on Twitter, so your answer is very important to me. Thank you.
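
The naming scheme asked about above can be sketched as follows: each result is written under a key derived from the analysis name and the relationship type, so a later job does not overwrite an earlier one. This is an illustration in Python, not Mazerunner's behavior; store_score and the plain dict standing in for a node's properties are hypothetical.

```python
def store_score(node_properties, analysis, relationship_type, value):
    """Return a copy of node_properties with the score stored per type.

    The key has the form '<analysis>_<relationship_type>', for example
    'pagerank_FRIEND', so scores for different types coexist.
    """
    key = f"{analysis}_{relationship_type}"
    updated = dict(node_properties)  # leave the caller's dict untouched
    updated[key] = value
    return updated


if __name__ == "__main__":
    props = store_score({}, "pagerank", "FRIEND", 0.4)
    props = store_score(props, "pagerank", "FOLLOWER", 0.7)
    print(sorted(props))
```

Running the same analysis again for the same relationship type still replaces that one value, which is usually the desired behavior for a refresh.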

Setting up a dev environment

I am wondering what the best way is to set up a test environment for Maven (extension and spark components).

Story: As an engineer I want a way to partition an analysis on a subset of nodes within my graph

As an engineer I want a way to partition an analysis on a subset of nodes within my graph.

Story detail:

(1) Select a set of nodes by label (A)

(2) Describe a relationship that connects (A) to a set of other nodes (B)

(3) Describe the relationship(s) between the (B) nodes

(4) Dispatch jobs to a queue for each (A) and the pattern that is described in (2) and (3)

(5) The result of each analysis should be stored as a relationship connected between (A) and (B) - (R)

(6) (R) should be updated to include the property for the analysis described in (5)

Create a sample Neo4j dataset for testing graph algorithms

Create a sample dataset for testing graph algorithms on, which is imported into a fresh Neo4j database when provisioning the sandbox.