dbs-leipzig / gradoop

Distributed Temporal Graph Analytics with Apache Flink

Home Page: https://github.com/dbs-leipzig/gradoop

License: Apache License 2.0

Java 99.97% Shell 0.03%
graph property-graph distributed-graph-analytics apache-flink graph-mining pattern-matching temporal-graph

gradoop's Introduction

Gradoop: Distributed Graph Analytics on Hadoop

Gradoop is an open source (ALv2) research framework for scalable graph analytics built on top of Apache Flink. It offers a graph data model which extends the widespread property graph model by the concept of logical graphs, and it provides operators that can be applied to single logical graphs and to collections of logical graphs. Combining these operators allows the flexible, declarative definition of graph analytical workflows. Gradoop can be easily integrated into a workflow which already uses Flink® operators and Flink® libraries (e.g. Gelly, ML and Table).

Gradoop is a work in progress, which means APIs may change. It is currently used as a proof-of-concept implementation and is far from production-ready.

The project's documentation can be found in our Wiki. The Wiki also contains a tutorial to help you get started with Gradoop.

Further Information (articles and talks)

Data Model

In the extended property graph model (EPGM), a database consists of multiple property graphs which are called logical graphs. These graphs describe application-specific subsets of vertices and edges, i.e. a vertex or an edge can be contained in multiple logical graphs. Additionally, not only vertices and edges but also logical graphs have a type label and can have different properties.

Data Model elements (logical graphs, vertices and edges) have a unique identifier, a single label (e.g. User) and a number of key-value properties (e.g. name = Alice). There is no schema involved, meaning each element can have an arbitrary number of properties even if they have the same label.
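To make the element structure concrete, here is a minimal, self-contained Java sketch. The names EpgmSketch and Element are hypothetical and are not Gradoop's classes. The sketch only mirrors the description above: every element has an identifier, a single label and schema-free properties, and a vertex can belong to several logical graphs.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical illustration only; these are not Gradoop classes.
class EpgmSketch {

    /** An element (logical graph, vertex or edge): identifier, one label, schema-free properties. */
    static class Element {
        final UUID id = UUID.randomUUID();
        final String label;
        final Map<String, Object> properties = new LinkedHashMap<>();
        // Identifiers of the logical graphs that contain this element (used for vertices and edges).
        final Set<UUID> graphIds = new HashSet<>();

        Element(String label) {
            this.label = label;
        }
    }

    public static void main(String[] args) {
        // Logical graphs carry a label too.
        Element community = new Element("Community");
        Element forum = new Element("Forum");

        // Two vertices with the same label but a different set of properties.
        Element alice = new Element("User");
        alice.properties.put("name", "Alice");

        Element bob = new Element("User");
        bob.properties.put("name", "Bob");
        bob.properties.put("age", 42);

        // A vertex may be contained in more than one logical graph.
        alice.graphIds.add(community.id);
        alice.graphIds.add(forum.id);
        bob.graphIds.add(community.id);

        System.out.println(alice.label + " is in " + alice.graphIds.size() + " logical graphs");
        System.out.println(bob.properties.keySet());
    }
}
```

Storing graph membership on the element is just one possible design for this sketch; it keeps overlapping logical graphs cheap to represent because elements are never copied.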

Graph operators

The EPGM provides operators for both single logical graphs as well as collections of logical graphs; operators may also return single graphs or graph collections. An overview and detailed descriptions of the implemented operators can be found in the Gradoop Wiki.

Setup

Use gradoop via Maven

  • Add one of the following dependencies to your Maven project

Stable:

<dependency>
    <groupId>org.gradoop</groupId>
    <artifactId>gradoop-flink</artifactId>
    <version>0.6.0</version>
</dependency>

Latest weekly build (additional repository is required):

<repositories>
    <repository>
        <id>oss.sonatype.org-snapshot</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots</url>
        <releases><enabled>false</enabled></releases>
        <snapshots><enabled>true</enabled></snapshots>
    </repository>
</repositories>
<dependency>
    <groupId>org.gradoop</groupId>
    <artifactId>gradoop-flink</artifactId>
    <version>0.7.0-SNAPSHOT</version>
</dependency>

In any case you also need Apache Flink (version 1.9.3):

<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>1.9.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.11</artifactId>
        <version>1.9.3</version>
    </dependency>
</dependencies>

Build gradoop from source

  • Gradoop requires Java 8

  • Clone Gradoop into your local file system

    git clone https://github.com/dbs-leipzig/gradoop.git

  • Build and execute tests

    cd gradoop

    mvn clean install

  • You might want to skip tests for faster builds. Also, some tests fail on Windows due to missing test dependencies

    mvn clean install -DskipTests

Windows

  • Some operators require the Hadoop winutils

Gradoop modules

gradoop-common

The main contents of this module are the EPGM data model and a corresponding POJO implementation which is used in Flink®. The persistent representation of the EPGM is also contained in gradoop-common, together with its mapping to HBase™.

gradoop-data-integration

Provides functionalities to support graph data integration. This includes minimal CSV and JSON importers as well as graph transformation operators (e.g. connect neighbors or conversion of edges to vertices and vice versa).

gradoop-accumulo

Input and output formats for reading graph collections from and writing them to Apache Accumulo®.

gradoop-hbase

Input and output formats for reading graph collections from and writing them to Apache HBase™.

gradoop-flink

This module contains reference implementations of the EPGM operators. The EPGM is mapped to Flink® DataSets while the operators are implemented using DataSet transformations. The module also contains implementations of general graph algorithms (e.g. Label Propagation, Frequent Subgraph Mining) adapted to be used with the EPGM model.

gradoop-temporal

This module contains a reference implementation of the Temporal Property Graph Model (TPGM) and its operators, which are used to perform graph analysis with respect to the additional time dimension in real-world graphs.

gradoop-examples

Contains example pipelines showing use cases for Gradoop.

  • Graph grouping example (build structural aggregates of property graphs)
  • Social network examples (composition of multiple operators to analyze social network graphs)
  • Input/Output examples (usage of DataSource and DataSink implementations)

gradoop-checkstyle

Used to maintain the code style for the whole project.

Related Repositories

Gradoop Tutorial, which was presented at the BOSS'20 workshop of the VLDB 2020 international conference.

This repository contains sets of Gradoop operator benchmarks designed to run on a cluster to measure scalability and speedup of the operators.

Demo application to show the functionalities of the grouping and query operator in an interactive web UI.

Gradoop Temporal Graph Explorer Demo which showcases some operators of the Temporal Property Graph Model.

This repository contains the definition of our Temporal Graph Definition Language (Temporal-GDL).

Version History

See the Changelog at the Wiki pages.

Disclaimer

Apache®, Apache Accumulo®, Apache Flink, Flink®, Apache HBase™ and HBase™ are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

gradoop's People

Contributors

0x002a, 2start, ammazza, chpengzh, chrissi16, chrizzz110, cmoesler, darthmax, dbaumgarten, dependabot[bot], foerster-finsternis, freeclimbing, hr73vexy, jgotoh, lc0197, masterpi227, maxzim21, merando, mschley2, ndhieu1994, p-f, rascat, rostam, s1ck, smee, taucontrib, timo95, venom590, vicolinho, xcorail

gradoop's Issues

CSVReader

if the csv reader is only working with the SNA use case, leave it there and close the ticket

if not, please move it from:

gradoop-examples / org.gradoop.csv.io.reader
to
gradoop-core / org.gradoop.io.reader

thx

make edge identifier serialization less error-prone

At the moment, an edge is serialized by simply building a string that joins the target vertex id, edge label and edge index with a separator ("."). This needs to be more generic, as labels or even keys can contain the separator character.
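One possible way to make this robust is to escape the separator inside each part before joining. The sketch below is an assumption about how that could look, not the fix that was adopted in Gradoop; EdgeKeyCodec, encode and decode are hypothetical names, and the backslash escape character is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; EdgeKeyCodec is not part of Gradoop.
class EdgeKeyCodec {
    private static final char SEPARATOR = '.';
    private static final char ESCAPE = '\\';

    /** Joins the parts with SEPARATOR, escaping SEPARATOR and ESCAPE inside each part. */
    static String encode(String... parts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append(SEPARATOR);
            }
            for (char c : parts[i].toCharArray()) {
                if (c == SEPARATOR || c == ESCAPE) {
                    sb.append(ESCAPE);
                }
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Splits on unescaped separators and drops the escape characters again. */
    static List<String> decode(String encoded) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false;
        for (char c : encoded.toCharArray()) {
            if (escaped) {
                // The previous character was ESCAPE, so take this one literally.
                current.append(c);
                escaped = false;
            } else if (c == ESCAPE) {
                escaped = true;
            } else if (c == SEPARATOR) {
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());
        return parts;
    }

    public static void main(String[] args) {
        // The label "knows.well" contains the separator and still round-trips.
        String key = encode("1", "knows.well", "0");
        System.out.println(key);
        System.out.println(decode(key));
    }
}
```

A length-prefixed encoding would be an alternative; escaping was chosen here only because it keeps the serialized form close to the current one.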

fix EXTENDED_GRAPH in GradoopTest

line 0
"0|A|3 k1 5 v1 k2 5 v2 k3 5 v3|a.1.0 1 k1 5 v1|b.1.0 1 k1 5 v1|1 0"
should be
"0|A|3 k1 5 v1 k2 5 v2 k3 5 v3|a.1.0 1 k1 5 v1|b.1.0 2 k1 5 v1 k2 5 v2|1 0"

update graphs table after BTGComputation

Currently the results are stored only in a vertex-centric way. If a vertex gets assigned to a graph, this change needs to be reflected in the corresponding row of the graphs table.
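The required update amounts to inverting the vertex-centric assignment into one row per graph. The following sketch illustrates that inversion with plain in-memory maps; GraphsTableSketch and toGraphRows are hypothetical names, and the maps merely stand in for the vertex results and the graphs table, whose real storage layout is not described here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch; the maps stand in for the vertex-centric result and the graphs table.
class GraphsTableSketch {

    /** Inverts "vertex id -> graph ids" into "graph id -> vertex ids", i.e. one row per graph. */
    static Map<Long, Set<Long>> toGraphRows(Map<Long, List<Long>> graphsByVertex) {
        Map<Long, Set<Long>> verticesByGraph = new TreeMap<>();
        for (Map.Entry<Long, List<Long>> entry : graphsByVertex.entrySet()) {
            for (Long graphId : entry.getValue()) {
                verticesByGraph
                    .computeIfAbsent(graphId, id -> new TreeSet<>())
                    .add(entry.getKey());
            }
        }
        return verticesByGraph;
    }

    public static void main(String[] args) {
        Map<Long, List<Long>> graphsByVertex = new TreeMap<>();
        graphsByVertex.put(1L, Arrays.asList(10L));
        graphsByVertex.put(2L, Arrays.asList(10L, 11L));
        System.out.println(toGraphRows(graphsByVertex));
    }
}
```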

Add data generator for Frequent Pattern Mining

The data generator should be able to generate scalable collections of scalable graphs showing frequent patterns. The number of graphs should be a power of ten, where a collection scale factor (CSF) is the exponent. Scaling the graph size is a little more complex. Let us have a look at the following illustration:

[Figure: pattern generator base graph]

The graph shows the base graph of vertices labeled with A to K and S as well as edges labeled with gid and 40 to 100. All graphs will have the same number of vertices and edges. While vertex labels will be the same in all graphs, at least every 10th graph will vary in edge labels. All edges labeled with gid in the base graph will have labels based on the graph instance id, thus, will differ for any graph and, in consequence, will not be frequent.

Actually, there will be 10 base graphs, and the illustration shows only base graph 1 of 10. All base graphs will show the edge (A)-100->(B), 9 of 10 the edge (B)-90->(C) and so on. In the remaining base graphs the numeric label will be replaced by gid. As we duplicate those base graphs according to the CSF, we generate patterns of determinable support (i.e., 40%, ..., 100%). Additionally, those patterns will cover special cases like loops, cycles and so on.

Further on, the illustration shows colored pattern groups. Modifying a graph scale factor GSF will determine how often these pattern groups will be duplicated per graph. However, there will always be a single vertex S connecting all pattern groups, i.e., every graph will have 11 x GSF + 1 vertices.
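The size arithmetic described above can be sketched as follows. This is only a reading of the issue text, not generator code: PatternGeneratorMath and its method names are hypothetical, and graphsContainingLabel assumes that an edge with numeric label n (40 to 100 in steps of 10) appears in n percent of all graphs, which requires CSF >= 1 so that all 10 base graphs exist.

```java
// Hypothetical sketch of the scale-factor arithmetic; not part of Gradoop.
class PatternGeneratorMath {

    /** Number of graphs in the collection: 10 to the power of the collection scale factor. */
    static long graphCount(int csf) {
        long count = 1;
        for (int i = 0; i < csf; i++) {
            count *= 10;
        }
        return count;
    }

    /** Vertices per graph: 11 per duplicated pattern group plus the single connecting vertex S. */
    static long vertexCount(int gsf) {
        return 11L * gsf + 1;
    }

    /** Graphs containing the edge with the given numeric label, assuming label == support in percent. */
    static long graphsContainingLabel(int numericLabel, int csf) {
        if (csf < 1) {
            throw new IllegalArgumentException("needs at least the 10 base graphs (csf >= 1)");
        }
        return graphCount(csf) * numericLabel / 100;
    }

    public static void main(String[] args) {
        System.out.println("graphs for CSF 2:    " + graphCount(2));
        System.out.println("vertices for GSF 3:  " + vertexCount(3));
        System.out.println("graphs with edge 90: " + graphsContainingLabel(90, 1));
    }
}
```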

SimpleVertexReader does not write Incoming Edges

v1 -> v2
v1 is created correctly with outgoing edge to v2
v2 is created, but not with an incoming edge

background: ConnectedComponentsComputationTest only needs the IDs to check the components, SimpleVertexReader would be easier to use as input.
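The expected behaviour can be sketched as a reader that records every edge on both endpoints. The code below is an illustration with hypothetical names (AdjacencySketch, Vertex, read); it does not reproduce SimpleVertexReader or its input format, and it takes edges as plain source/target pairs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; this is not the SimpleVertexReader implementation.
class AdjacencySketch {

    static class Vertex {
        final List<String> outgoing = new ArrayList<>();
        final List<String> incoming = new ArrayList<>();
    }

    /** Builds vertices from {source, target} pairs and records each edge on both endpoints. */
    static Map<String, Vertex> read(String[][] edges) {
        Map<String, Vertex> vertices = new LinkedHashMap<>();
        for (String[] edge : edges) {
            Vertex source = vertices.computeIfAbsent(edge[0], id -> new Vertex());
            Vertex target = vertices.computeIfAbsent(edge[1], id -> new Vertex());
            source.outgoing.add(edge[1]);
            // This is the step the issue reports as missing: the target learns about its incoming edge.
            target.incoming.add(edge[0]);
        }
        return vertices;
    }

    public static void main(String[] args) {
        Map<String, Vertex> vertices = read(new String[][] {{"v1", "v2"}});
        System.out.println("v1 outgoing: " + vertices.get("v1").outgoing);
        System.out.println("v2 incoming: " + vertices.get("v2").incoming);
    }
}
```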

remove commented code in tests, write issues for TODOs

There are a lot of commented lines and TODO statements in ConnectedComponentsComputationTest and some TODO statements in ConnectedComponentsComputation. If you need a TODO, please write an issue for it and remove the comment. If code is not needed and is not related to any open issue, please remove it.
