neo4j-contrib / neo4j-mazerunner

Mazerunner extends a Neo4j graph database to run scheduled big data graph compute algorithms at scale with HDFS and Apache Spark.

License: Apache License 2.0

Shell 2.43% Java 66.45% Scala 30.20% Dockerfile 0.92%

neo4j-mazerunner's Issues

Maven issue

I have tried a couple of times, but I keep getting the following error when provisioning. I wonder if it's just me?

==> default: Compiling neo4j-mazerunner extension...
==> default: Nov 10, 2014 8:03:23 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
==> default: INFO: I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request: The server repo.maven.apache.org failed to respond
==> default: Nov 10, 2014 8:03:23 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
==> default: INFO: Retrying request
==> default: [ERROR]
==> default: Failed to execute goal on project extension: Could not resolve dependencies for project org.mazerunner:extension:jar:1.0: The following artifacts could not be resolved: org.neo4j:neo4j-kernel:jar:tests:2.1.5, org.neo4j.app:neo4j-server:jar:2.1.5: Could not transfer artifact org.neo4j:neo4j-kernel:jar:tests:2.1.5 from/to central (http://repo.maven.apache.org/maven2): Read timed out -> [Help 1]
==> default: [ERROR]
==> default: [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
==> default: [ERROR] Re-run Maven using the -X switch to enable full debug logging.
==> default: [ERROR]
==> default: [ERROR] For more information about the errors and possible solutions, please read the following articles:
==> default: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException

neo4j 2.3.1 support?

I've already upgraded my database to 2.3.1. Any plans to support it soon? Or is it possible to connect to 2.3.1?

Betweenness Centrality Tractability?

Running this algorithm on a directed graph with 30K nodes and 50K edges is extremely slow. During the betweennessCentralityMap phase, roughly one vertex completes every 3 minutes. I realize that this can be an O(n^3) problem depending on the algorithm, but this seems prohibitively slow. I am running this on a single VM with 10 GB of memory and 3 cores. Is this behavior normal?
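
A back-of-the-envelope projection of the figures above, assuming the observed rate of one vertex per 3 minutes holds for every vertex. The function name is illustrative and not part of Mazerunner.

```python
def estimated_days(vertex_count: int, minutes_per_vertex: float) -> float:
    """Projected wall-clock days if every vertex takes the same time."""
    total_minutes = vertex_count * minutes_per_vertex
    return total_minutes / (60 * 24)


if __name__ == "__main__":
    # Figures from the report: 30K vertices at roughly 3 minutes each.
    print(f"{estimated_days(30_000, 3):.1f} days")  # prints "62.5 days"
```

At that rate the full run would take about two months. For reference, Brandes' algorithm computes exact betweenness centrality on an unweighted graph in O(VE) time, so the observed rate may point to a bottleneck other than problem size alone.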

Dependency issues

When running kbastani/neo4j-graph-analytics:latest in docker I get an exception during startup:

Exception in thread "main" java.lang.NoSuchFieldError: IBM_JAVA
    at org.apache.hadoop.security.UserGroupInformation.getOSLoginModuleName(UserGroupInformation.java:337)
    at org.apache.hadoop.security.UserGroupInformation.<clinit>(UserGroupInformation.java:382)
    at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
    at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
    at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:238)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:53)
    at org.mazerunner.core.processor.GraphProcessor.initializeSparkContext(GraphProcessor.java:78)
    at org.mazerunner.core.messaging.Worker.doMain(Worker.java:109)
    at org.mazerunner.core.messaging.Worker.main(Worker.java:60)

The reason for this exception is a conflict between hadoop-auth-2.4.1.jar and hadoop-core-1.2.1.jar. Both are copied from the docker/bin/lib to the classpath in the docker image and both contain the class org.apache.hadoop.util.PlatformName, but hadoop-core does not declare the field IBM_JAVA.

I managed to resolve this issue by simply deleting hadoop-core-1.2.1.jar and rebuilding the docker image, though I'm not sure whether this breaks some other part of the application (so far it has worked fine for me).

I'm not sure about the structure of this image, but it seems like the projects org.mazerunner.extension and org.mazerunner.spark have conflicting dependencies yet still they are copied on the same classpath.

If those are two separate programs, maybe you should copy each project and its dependencies into a separate folder. And if one depends on the other, then you should add the other project as a dependency and let Maven resolve the conflicts.
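
A minimal sketch of how such conflicts could be spotted before building the image: it lists every .class entry that appears in more than one jar. The function name is illustrative, and the docker/bin/lib path in the usage lines is taken from the report above; nothing here is part of Mazerunner.

```python
import glob
import zipfile
from collections import defaultdict
from typing import Dict, Iterable, List


def find_duplicate_classes(jar_paths: Iterable[str]) -> Dict[str, List[str]]:
    """Map each .class entry to the jars containing it, keeping only duplicates."""
    owners: Dict[str, List[str]] = defaultdict(list)
    for jar_path in jar_paths:
        with zipfile.ZipFile(jar_path) as jar:
            # A jar is a zip archive; set() guards against repeated entries.
            for entry in sorted(set(jar.namelist())):
                if entry.endswith(".class"):
                    owners[entry].append(jar_path)
    return {entry: jars for entry, jars in owners.items() if len(jars) > 1}


if __name__ == "__main__":
    duplicates = find_duplicate_classes(glob.glob("docker/bin/lib/*.jar"))
    for entry, jars in sorted(duplicates.items()):
        print(entry, "->", ", ".join(jars))
```

If the diagnosis above is right, org/apache/hadoop/util/PlatformName.class should be reported for both hadoop-auth-2.4.1.jar and hadoop-core-1.2.1.jar.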

Test on My Dataset

Kenny,

I have got this up and running and the movie sample working fine.

I have replaced graph.db with my data and attempted to get the Pagerank to run. I have done the following:

  • Replaced graph.db
  • Restarted Neo4j - works fine
  • Created a new relationship to define the subgraph (u1:User)-[:RETWEETED]->(u2:User)
  • commented out the create demo graph in mazerunner.sh
  • changed KNOWS to RETWEETED in spark/src/main/resources/spark.properties
  • restarted Mazerunner - starts ok
  • refreshed endpoint - and seen Hadoop fire up and run

However, the nodes (User) are not updated with the weight property.

I appreciate this is all very hacky on my part. Rather than continue hacking to find out what I am missing when configuring PageRank for my (u1:User)-[:RETWEETED]->(u2:User) relationship, I wondered whether you could point me to documentation for configuring this.

regards,
js

integrate simRank algorithm

I am planning to integrate the SimRank algorithm into this project. My initial plan is:

  1. export data into hdfs
  2. run simRank algo via spark
  3. import the scores back into neo4j (i.e. attached to relationships between nodes)

I'm not sure whether that's the right way to go or where to start. Any suggestions would be much appreciated.
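
For orientation, here is a single-machine sketch of the computation in step 2, using the standard SimRank recurrence: two distinct nodes are similar to the degree that their in-neighbors are similar. It is a naive all-pairs version in plain Python, not a Spark or GraphX implementation. The decay factor of 0.8 and the iteration count are assumptions.

```python
from itertools import product
from typing import Dict, Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable]
Pair = Tuple[Hashable, Hashable]


def simrank(edges: Iterable[Edge], decay: float = 0.8,
            iterations: int = 10) -> Dict[Pair, float]:
    """Naive SimRank over a directed edge list (source, target)."""
    in_neighbors: Dict[Hashable, List[Hashable]] = {}
    for source, target in edges:
        in_neighbors.setdefault(source, [])
        in_neighbors.setdefault(target, []).append(source)
    nodes = list(in_neighbors)

    # Start from the identity: every node is fully similar only to itself.
    scores = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        updated: Dict[Pair, float] = {}
        for a, b in product(nodes, nodes):
            ins_a, ins_b = in_neighbors[a], in_neighbors[b]
            if a == b:
                updated[(a, b)] = 1.0
            elif not ins_a or not ins_b:
                updated[(a, b)] = 0.0  # no in-neighbors means no evidence
            else:
                total = sum(scores[(i, j)] for i in ins_a for j in ins_b)
                updated[(a, b)] = decay * total / (len(ins_a) * len(ins_b))
        scores = updated
    return scores


if __name__ == "__main__":
    # "b" and "c" share their only in-neighbor "a".
    print(simrank([("a", "b"), ("a", "c")])[("b", "c")])
```

Because the result is a score per node pair, writing it back as a property on a relationship between the two nodes (step 3) is a natural fit. The all-pairs loop is quadratic in the number of nodes, so a distributed version would need a different formulation.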

/cc @kbastani

Error while building neo4j-mazerunner project

I'm trying to run Kenny Bastani's neo4j-mazerunner project, but it gives me an error:

root@salma-SATELLITE-C855-1EQ:~/workspace/neo4j-mazrunner# docker run 1cddd1005556
Starting message broker: rabbitmq-server.


    __  ______ _____   __________  __  ___   ___   ____________  
   /  |/  /   /__  /  / ____/ __ \/ / / / | / / | / / ____/ __ \ 
  / /|_/ / /| | / /  / __/ / /_/ / / / /  |/ /  |/ / __/ / /_/ / 
 / /  / / ___ |/ /__/ /___/ _, _/ /_/ / /|  / /|  / /___/ _, _/  
/_/  /_/_/  |_/____/_____/_/ |_|\____/_/ |_/_/ |_/_____/_/ |_|   

=========================
Mazerunner is running...
=========================
To start a PageRank job, access the Mazerunner PageRank endpoint
Example: curl http://localhost:7474/service/mazerunner/analysis/pagerank/KNOWS
Error: Could not find or load main class org.mazerunner.core.messaging.Worker

Can anyone help me with this?

MazeRunner Problems - hdfs not accessible from neo4j container [Fixed]

Mazerunner does not seem to work for me. Here are the steps that I have followed.

[Screenshot: docker_ps]

  • My Neo4j data looks like this: [screenshots: neo4j_data, neo4j_rel]
  • Ran the curl job: [screenshot: curl_job]
  • Checked the RabbitMQ job queue: [screenshot: rabbitmq_jobs]

I do not know much about Spark, but I found the following in this file:
https://raw.githubusercontent.com/kbastani/neo4j-mazerunner/af6a93c91b9ba54403422eca7756695911e341f1/src/spark/src/main/resources/spark.properties

# The relationship type to extract the subgraph from
org.mazerunner.job.relationshiptype=KNOWS

So is the relationship type limited to KNOWS?

Request for docker-compose

Since the stack has three Docker containers, it would be much easier to manage them with a docker-compose file instead of three separate commands. Is there a specific reason why there is no docker-compose file in the repo?

Can't Increase memory size in docker-compose.yml

I'm trying to run the strongly_connected_components analysis on a MacBook Pro (OS X, 16 GB RAM) with a graph of about 60,000 nodes. I have tried the following configuration but keep getting out-of-memory exceptions.

hdfs:
  environment:
    - "JAVA_OPTS=-Xmx5g"
  image: sequenceiq/hadoop-docker:2.4.1
  command: /etc/bootstrap.sh -d -bash
  mem_limit: 2048m
mazerunner:
  environment:
    - "JAVA_OPTS=-Xmx5g"
  image: kbastani/neo4j-graph-analytics:latest
  mem_limit: 2048m
  links:
   - hdfs
graphdb:
  environment:
    - "JAVA_OPTS=-Xmx2g"
  image: kbastani/docker-neo4j:latest
  mem_limit: 2048m
  ports:
   - "7474:7474"
   - "1337:1337"
  volumes:
   - /opt/data
  links:
   - mazerunner
   - hdfs
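
One thing worth checking in the configuration above: the hdfs and mazerunner services request a 5 GB heap (-Xmx5g) inside containers that mem_limit caps at 2048m. A minimal sketch of that comparison follows. It assumes the images actually pass JAVA_OPTS to the JVM, which the report does not confirm, and the helper names are illustrative.

```python
import re
from typing import Optional

_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert a size such as '2048m' or '5g' into bytes."""
    match = re.fullmatch(r"(\d+)([kmg])?b?", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS.get(unit, 1)


def max_heap_bytes(java_opts: str) -> Optional[int]:
    """Return the -Xmx value in bytes, or None if the flag is absent."""
    match = re.search(r"-Xmx(\d+[kmgKMG]?)", java_opts)
    return parse_size(match.group(1)) if match else None


def heap_exceeds_limit(java_opts: str, mem_limit: str) -> bool:
    """True when the requested JVM heap is larger than the container limit."""
    heap = max_heap_bytes(java_opts)
    return heap is not None and heap > parse_size(mem_limit)


if __name__ == "__main__":
    # Values copied from the mazerunner service in the compose file above.
    print(heap_exceeds_limit("-Xmx5g", "2048m"))  # prints "True"
```

If the check returns True, raising mem_limit above the heap size (with headroom for non-heap memory) or lowering -Xmx is a reasonable first experiment.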

error attaching an existing graph.db

Hi,

I'm trying to attach an existing graph.db to the setup described in Readme.md, and I'm getting a couple of errors during Neo4j server startup:
[o.n.s.r.RrdFactory]: Unable to open rrd store, attempting to recreate it
java.io.Exception: Invalid argument
...
[o.n.s.r.RrdFactory]: current RRDB is invalid, renamed it to /opt/data/graph.db/rrd-invalid-14446192...
[o.n.s.r.RrdFactory]: Unable to create new rrd store
java.io.Exception: Invalid argument
...

After these errors, the start-up procedure ends and the server shuts down.

I am able to open the graph.db using the normal install of Neo4j on my machine. My setup uses the latest Docker image for Mazerunner, and I've built the Neo4j image locally in order to use the latest Neo4j community edition, 2.2.5, with my graph.db. I am running all of this on Windows 7.

I am new to the stack used for this sample and any pointers to what may cause this problem and how to get mazerunner working with my graph.db would be very helpful!

Thanks!

Neo4j Mazerunner 1.0.0-M01 Milestone Features

Mazerunner Milestone Features

This is a list of features to be prioritized for the 1.0.0-M01 milestone release of Neo4j Mazerunner. It is a summary view of the milestone and a place to comment on new feature requests.

Mazerunner Spark Service

  • Refactor the org.mazerunner.core.scala package to an algorithm package where the GraphX algorithms will be contributed.
  • Refactor configurations to be loaded at run time and not at compile time.
  • Set up continuous compilation of Scala sources for GraphX algorithm development.
  • Create an algorithm routing workflow for scheduling agent jobs in a pipeline.
  • Create a graph algorithm registration process, making it available to the Mazerunner admin tool.

Development Sandbox

  • Create a production deployment scenario for a Hadoop YARN Resource Manager for the Spark cluster.
  • On the development environment, update Vagrant provisioning to expose Neo4j browser on the development machine.
  • Change provisioning to compile from the shared Mazerunner repository and not inside the Vagrant sandbox.
  • Create a sample dataset for testing graph algorithms on, which is imported into a fresh Neo4j database when provisioning the sandbox.
  • Improve test coverage on algorithms included in the org.mazerunner.core.algorithms package.

Mazerunner Admin

  • Create a Mazerunner administration tool that schedules, configures, and monitors agent jobs.

Job Pipeline

  • Create a workflow pipeline manager that pipes jobs together in a sequence before applying updates back to Neo4j.

Logging

  • Create a log management service that listens for logs submitted across all modules of Mazerunner.
  • Create a configurable data source for log storage to use a variety of different storage solutions for logging.

Mazerunner Neo4j Extension

  • Import (Job complete): Parallelize updates on nodes in Neo4j from HDFS.
  • Export (Job started): Parallelize adjacency list writes to HDFS.

Running analysis on all relationships

I think there should be an option (a '*' wildcard?) to run an analysis on all relationships in the graph.
It's pretty common to work with a graph with heterogeneous relationships that should all contribute to the pagerank (for example), and it's a bit hard to build "duplicate edges" on the fly when you have millions of nodes.

Is this something you'd consider integrating?

Method of programmatically determining when values have been persisted back to Neo4J

I am using NodeJS to trigger the pagerank algorithm. I understand once I end up with a ton of nodes and edges that it could take some time, but I wanted to have an automated way of determining when that process is complete so I can start additional logic that depends on the Neo4J nodes being updated with the values calculated by Mazerunner.

You had mentioned that I could check the log file. I think I can make that work using Node, to simply tail that file and scan the output.

Is there another way? Perhaps the original http call to trigger the job could return a jobId and then I can poll another http endpoint with that jobId for status/progress? I'm probably misusing the project anyway :-), but figured I would ask.

Here is my code:

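var http = require('http');

// CONF and jobTaskInstance are defined elsewhere in my application.
http.get(CONF.neo4J.paths.pageRank, function(response) {
    var body = '';

    // Accumulate the response body as chunks arrive.
    response.on('data', function(d) {
        body += d;
    });

    // Once the response is complete, check the reported result.
    response.on('end', function() {
        var parsed = JSON.parse(body);
        if (parsed.result && parsed.result === "success") {
            jobTaskInstance.complete({success: 'awesome'});
        } else {
            jobTaskInstance.error('Unknown error running PageRank');
        }
    });
}).on('error', function(e) {
    jobTaskInstance.error(e.message);
});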

So maybe the response body could look like this?

{"result":"success", "jobId":1234}
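
To make the proposal concrete, here is a sketch of the client-side polling loop it would enable. Nothing in this thread shows that Mazerunner exposes such an endpoint: the status URL, the "state" field, and its "complete" and "failed" values are all hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable, Dict


def http_status_fetcher(status_url: str) -> Callable[[], Dict]:
    """Build a fetcher for a (hypothetical) job-status URL returning JSON."""
    def fetch() -> Dict:
        with urllib.request.urlopen(status_url) as response:
            return json.load(response)
    return fetch


def wait_for_job(fetch_status: Callable[[], Dict],
                 interval_seconds: float = 5.0,
                 timeout_seconds: float = 600.0,
                 sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Poll until the job reports a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        # "complete" and "failed" are assumed terminal states.
        if status.get("state") in ("complete", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(interval_seconds)
```

With such an endpoint, the caller would trigger the job, read jobId from the response, and pass http_status_fetcher(...) for that job to wait_for_job before starting the dependent logic. Until then, tailing the log file remains the option mentioned above.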

Love this project!!!

Is this project dead?

Hey,

I read about Mazerunner and was very interested in using it with my Neo4j database; I think it could have potential. However, after installing it, running into some problems, and looking through the GitHub issues, I get the impression that this project is dead: no support, no updates, and no contributions from kbastani or other people since December. So I'm not sure whether I should invest more time in this or just skip it, which would be a shame because I think it would be cool to use. So, the questions:

1. Is this project dead?
2. Is there support for newer neo4j versions on the way? I won't downgrade my production databases just to make mazerunner work. (see below)
3. Are there any alternatives that accomplish somewhat the same things?

My current problem with mazerunner is that linking my folder with docker does not seem to work. Basically the option that I use for this step (from documentation):
docker run -d -P -v /Users/<user>/<neo4j-path>/data:/opt/data --name graphdb --link mazerunner:mazerunner --link hdfs:hdfs kbastani/docker-neo4j:2.2.1
seems to be ignored: Neo4j in the container always uses /opt/data/graph.db, no matter what path I enter. I worked around this by copying my graph.db into the container, into that very folder (/opt/data). That was successful, and the folder is now 600 MB, but nothing is shown when I try to access it in Neo4j. If I call one of the Mazerunner endpoints, I only get a JSON reply saying "result: success", which sounds nice but doesn't do anything for me. I can't see any data, neither at the endpoint nor in Cypher. :(

I guess (does anybody know?) the problems come from the fact that my graph.db folder comes from another Neo4j version (2.3.3) while the Neo4j packed with Mazerunner is 2.2.1, and that is why nothing is shown when I query the data with Cypher.

So well, that is what's on my mind. As mentioned, I fear that this project is dead, which would be a shame, so please convince me that it is not if that's the case! I would love to make this work with my data, could be awesome!

Bye guys!

Using StandAlone Spark Cluster

Hello

I wanted to know whether we can use Mazerunner with a standalone Spark cluster. We have a standalone Spark cluster and a standalone Neo4j database.

I was able to find documentation for using a standalone Neo4j database, but not for a standalone Spark cluster.

Please help.

Regards
Kalyan

Job pipeline manager

Create an algorithm routing workflow for scheduling agent jobs in a pipeline.

Error on Start

When I run the start command for Mazerunner, I get the following error. I stopped Vagrant and restarted it but changed nothing else.

Mazerunner is running...

To start a PageRank job, access the Mazerunner PageRank endpoint
Example: curl http://localhost:7474/service/mazerunner/pagerank
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:java (default-cli) on project spark: An exception occured while executing the Java class. null: InvocationTargetException: Connection refused -> [Help 1]

Individual value PR by relationship

Greetings, I'm using Neo4j Mazerunner PageRank (PR) on several relationship types. These are the relationships I'm using:
FRIEND
FOLLOWER
POST

My problem is that the PageRank value computed for FRIEND (service/mazerunner/analysis/pagerank/FRIEND) is stored in the node's pagerank property and gets replaced when I run the analysis for another relationship (service/mazerunner/analysis/pagerank/FOLLOWER). I wonder whether the result could instead be written to a property named after the relationship, something like pagerank_FRIEND and pagerank_FOLLOWER, so that the values are not replaced.

I'm creating a personal project for detecting illegal networks on Twitter, so your answer is very important to me. Thank you.
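
A small sketch of the naming scheme requested above, in which each result is written under a key that combines the analysis name and the relationship type. The helper names are illustrative, the scores are made up for the example, and nothing here reflects how Mazerunner currently writes properties.

```python
from typing import Dict


def property_key(analysis: str, relationship_type: str) -> str:
    """Build a per-relationship property name, e.g. 'pagerank_FRIEND'."""
    return f"{analysis}_{relationship_type}"


def store_result(node_properties: Dict[str, float], analysis: str,
                 relationship_type: str, value: float) -> Dict[str, float]:
    """Return a copy of the node's properties with the new result added."""
    updated = dict(node_properties)
    updated[property_key(analysis, relationship_type)] = value
    return updated


if __name__ == "__main__":
    props = store_result({}, "pagerank", "FRIEND", 0.42)      # example score
    props = store_result(props, "pagerank", "FOLLOWER", 0.17)  # example score
    print(sorted(props))  # both keys survive; neither overwrites the other
```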

Story: As an engineer I want a way to partition an analysis on a subset of nodes within my graph

As an engineer I want a way to partition an analysis on a subset of nodes within my graph.

[Figure: categorical PageRank]

Story detail:

(1) Select a set of nodes by label (A)

(2) Describe a relationship that connects (A) to a set of other nodes (B)

(3) Describe the relationship(s) between the (B) nodes

(4) Dispatch jobs to a queue for each (A) and the pattern that is described in (2) and (3)

(5) The result of each analysis should be stored as a relationship connected between (A) and (B) - (R)

(6) (R) should be updated to include the property for the analysis described in (5)
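
Steps (1) to (4) can be sketched as building one job description per (A) node. This is only an illustration of the story: the PartitionedJob fields and the HAS_CATEGORY and LINKS_TO relationship names are hypothetical, and queue submission is left out.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PartitionedJob:
    analysis: str                # e.g. "pagerank"
    partition_node_id: int       # one (A) node selected in step 1
    partition_relationship: str  # connects (A) to its (B) nodes, step 2
    analysis_relationship: str   # connects (B) nodes to each other, step 3


def build_jobs(analysis: str, a_node_ids: Iterable[int],
               partition_relationship: str,
               analysis_relationship: str) -> List[PartitionedJob]:
    """Step 4: create one job per (A) node, ready to dispatch to a queue."""
    return [
        PartitionedJob(analysis, node_id, partition_relationship,
                       analysis_relationship)
        for node_id in a_node_ids
    ]


if __name__ == "__main__":
    for job in build_jobs("pagerank", [1, 2], "HAS_CATEGORY", "LINKS_TO"):
        print(job)
```

Each job's result would then be written to the (R) relationship between its (A) node and the affected (B) nodes, as described in steps (5) and (6).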
