dbs-leipzig / gradoop

Distributed Temporal Graph Analytics with Apache Flink

Home Page: https://github.com/dbs-leipzig/gradoop

License: Apache License 2.0

Java 99.97% Shell 0.03%

graph property-graph distributed-graph-analytics apache-flink graph-mining pattern-matching temporal-graph

gradoop's Introduction

Gradoop: Distributed Graph Analytics on Hadoop

Gradoop is an open source (ALv2) research framework for scalable graph analytics built on top of Apache Flink. It offers a graph data model which extends the widespread property graph model by the concept of logical graphs, and it provides operators that can be applied to single logical graphs and to collections of logical graphs. Combining these operators allows the flexible, declarative definition of graph analytical workflows. Gradoop can easily be integrated into a workflow that already uses Flink® operators and Flink® libraries (e.g. Gelly, ML and Table).

Gradoop is a work in progress, which means APIs may change. It is currently used as a proof-of-concept implementation and is far from production-ready.

The project's documentation can be found in our Wiki. The Wiki also contains a tutorial to help you get started with Gradoop.

Further Information (articles and talks)

Data Model

In the extended property graph model (EPGM), a database consists of multiple property graphs which are called logical graphs. These graphs describe application-specific subsets of vertices and edges, i.e. a vertex or an edge can be contained in multiple logical graphs. Additionally, not only vertices and edges but also logical graphs have a type label and can have different properties.

Data Model elements (logical graphs, vertices and edges) have a unique identifier, a single label (e.g. User) and a number of key-value properties (e.g. name = Alice). There is no schema involved, meaning each element can have an arbitrary number of properties even if they have the same label.
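The element structure described above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: `EpgmElementSketch`, `with` and `memberOf` are hypothetical names, `java.util.UUID` stands in for Gradoop's own identifier type, and a single class is used for graphs, vertices and edges alike. It is not the POJO implementation shipped in gradoop-common.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical, simplified stand-in for an EPGM element (logical graph, vertex or edge).
// Not Gradoop's actual classes.
class EpgmElementSketch {
    // Unique identifier (assumption: UUID is used here only for illustration).
    final UUID id = UUID.randomUUID();
    // Exactly one type label, e.g. "User".
    final String label;
    // Schema-free key-value properties: any element may carry any set of keys.
    final Map<String, Object> properties = new HashMap<>();
    // Identifiers of the logical graphs that contain this vertex or edge.
    final Set<UUID> graphIds = new HashSet<>();

    EpgmElementSketch(String label) {
        this.label = label;
    }

    // Adds or overwrites a property and returns this element for chaining.
    EpgmElementSketch with(String key, Object value) {
        properties.put(key, value);
        return this;
    }

    // Marks this element as a member of the given logical graph.
    EpgmElementSketch memberOf(EpgmElementSketch graph) {
        graphIds.add(graph.id);
        return this;
    }
}
```

For example, `new EpgmElementSketch("User").with("name", "Alice").memberOf(g1).memberOf(g2)` yields a vertex that belongs to two logical graphs, while another `User` vertex may carry a completely different set of property keys.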

Graph operators

The EPGM provides operators for both single logical graphs and collections of logical graphs; an operator may return either a single graph or a graph collection. An overview and detailed descriptions of the implemented operators can be found in the Gradoop Wiki.

Setup

Use gradoop via Maven

Add one of the following dependencies to your Maven project.

Stable:

<dependency>
    <groupId>org.gradoop</groupId>
    <artifactId>gradoop-flink</artifactId>
    <version>0.6.0</version>
</dependency>

Latest weekly build (additional repository is required):

<repositories>
    <repository>
        <id>oss.sonatype.org-snapshot</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots</url>
        <releases><enabled>false</enabled></releases>
        <snapshots><enabled>true</enabled></snapshots>
    </repository>
</repositories>

<dependency>
    <groupId>org.gradoop</groupId>
    <artifactId>gradoop-flink</artifactId>
    <version>0.7.0-SNAPSHOT</version>
</dependency>

In any case you also need Apache Flink (version 1.9.3):

<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>1.9.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.11</artifactId>
        <version>1.9.3</version>
    </dependency>
</dependencies>

Build gradoop from source

Gradoop requires Java 8.

Clone Gradoop into your local file system:

git clone https://github.com/dbs-leipzig/gradoop.git

Build and execute tests:

cd gradoop
mvn clean install

You might want to skip tests for faster builds. Also, some tests fail on Windows due to missing test dependencies:

mvn clean install -DskipTests

Windows

Some operators require the Hadoop winutils

Gradoop modules

gradoop-common

The main contents of this module are the EPGM data model and a corresponding POJO implementation which is used in Flink®. The persistent representation of the EPGM is also contained in gradoop-common, together with its mapping to HBase™.

gradoop-data-integration

Provides functionalities to support graph data integration. This includes minimal CSV and JSON importers as well as graph transformation operators (e.g. connect neighbors or conversion of edges to vertices and vice versa).

gradoop-accumulo

Input and output formats for reading graph collections from and writing them to Apache Accumulo®.

gradoop-hbase

Input and output formats for reading graph collections from and writing them to Apache HBase™.

gradoop-flink

This module contains reference implementations of the EPGM operators. The EPGM is mapped to Flink® DataSets while the operators are implemented using DataSet transformations. The module also contains implementations of general graph algorithms (e.g. Label Propagation, Frequent Subgraph Mining) adapted to be used with the EPGM model.

gradoop-temporal

This module contains a reference implementation of the Temporal Property Graph Model (TPGM) and its operators, which are used to analyze graphs with respect to the additional time dimension found in real-world graphs.

gradoop-examples

Contains example pipelines showing use cases for Gradoop.

Graph grouping example (build structural aggregates of property graphs)
Social network examples (composition of multiple operators to analyze social network graphs)
Input/Output examples (usage of DataSource and DataSink implementations)

gradoop-checkstyle

Used to maintain the code style for the whole project.

Related Repositories

Gradoop Tutorial

The Gradoop tutorial presented at the BOSS'20 workshop of the VLDB 2020 international conference.

Gradoop Benchmarks

This repository contains sets of Gradoop operator benchmarks designed to run on a cluster to measure scalability and speedup of the operators.

Gradoop Demo

Demo application to show the functionalities of the grouping and query operator in an interactive web UI.

Temporal Graph Explorer

Gradoop Temporal Graph Explorer Demo which showcases some operators of the Temporal Property Graph Model.

Gradoop GDL

This repository contains the definition of our Temporal Graph Definition Language (Temporal-GDL).

Version History

See the changelog in the Wiki.

Disclaimer

Apache®, Apache Accumulo®, Apache Flink, Flink®, Apache HBase™ and HBase™ are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.


gradoop's Issues

AdjacencyListReader must also call GraphStore.writeGraph()

Currently, only vertices are inserted into HBase. If the vertex has associated graphs, those should also be stored in HBase.

make physical partition part of the vertex / row identifier

the partition should be part of the row key to use HBase automatic sorting / regions for physical graph partitioning
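A minimal sketch of such a composite row key, assuming an `int` partition id and a `long` vertex id (the issue does not fix either type, and `PartitionedRowKey` is a hypothetical name). HBase sorts rows by the raw bytes of the key, so placing the partition first in big-endian order keeps all vertices of one partition in a contiguous key range, at least for non-negative values.

```java
import java.nio.ByteBuffer;

// Hypothetical helper: builds a row key of the form [partition (4 bytes)][vertex id (8 bytes)].
class PartitionedRowKey {
    // ByteBuffer writes big-endian by default, so byte-wise ordering of the key
    // follows numeric ordering of non-negative partition and vertex ids.
    static byte[] of(int partition, long vertexId) {
        return ByteBuffer.allocate(Integer.BYTES + Long.BYTES)
                .putInt(partition)
                .putLong(vertexId)
                .array();
    }

    // Reads the partition prefix back from a row key.
    static int partitionOf(byte[] rowKey) {
        return ByteBuffer.wrap(rowKey).getInt();
    }

    // Reads the vertex id that follows the partition prefix.
    static long vertexIdOf(byte[] rowKey) {
        return ByteBuffer.wrap(rowKey).getLong(Integer.BYTES);
    }
}
```

A fixed-width prefix is used here so that the key can be split again without a separator character.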

ConnectedComponentsComputation does not resetGraphs() if input data contains graphs

See ConnectedComponentsComputationTest, if test String[] has any graph assignment, test fails.
EPGVertexValueWritable.resetGraphs()

move Identifiable member from MultiLabeledPropertyContainer to MemoryVertex and MemoryGraph

atm Vertex and Graph use Long IDs. As this will change, both should handle their identifier by themselves.

add SNA-Analysis-Driver

rename package 'csv' to 'sna'

in gradoop-examples, the package should be named sna instead of csv. thx

adaptive partitioning algorithm based on label propagation

http://www.few.vu.nl/~cma330/papers/ICDCS14.pdf

CSVReader

if the csv reader is only working with the SNA use case, leave it there and close the ticket

if not, please move it from:

gradoop-examples / org.gradoop.csv.io.reader
to
gradoop-core / org.gradoop.io.reader

thx

update class documentation in kway partitioning

make edge identifier serialization less error-prone

at the moment, the edge is serialized by simply building a string which separates target vertex id, edge label and edge index by a string ("."). this needs to be more generic as labels or even keys can contain separator chars
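One possible direction is a length-prefixed encoding, sketched below. `EdgeKeyCodec` and its methods are hypothetical names, and the field order (target vertex id, label, index) simply follows the description in this issue. Each field is written as `<length>:<value>`, so the value itself may contain dots, colons or any other character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical codec: separator-free edge key using length-prefixed fields.
class EdgeKeyCodec {
    // Encodes every field as "<length>:<value>" and concatenates the results.
    static String encode(String targetId, String label, long index) {
        StringBuilder sb = new StringBuilder();
        for (String field : new String[] {targetId, label, Long.toString(index)}) {
            sb.append(field.length()).append(':').append(field);
        }
        return sb.toString();
    }

    // Decodes a key produced by encode() back into its fields, in order.
    static List<String> decode(String key) {
        List<String> fields = new ArrayList<>();
        int pos = 0;
        while (pos < key.length()) {
            // The first ':' after pos always terminates the length, which contains only digits.
            int colon = key.indexOf(':', pos);
            int length = Integer.parseInt(key.substring(pos, colon));
            int start = colon + 1;
            fields.add(key.substring(start, start + length));
            pos = start + length;
        }
        return fields;
    }
}
```

With this scheme, a label such as `knows.well` round-trips unchanged, whereas a dot-separated string would be split into the wrong number of parts.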

test LabelPropagation with Apache Flink

Implement GSpan as Flink operator

add edge property count in edge property string to decrease memory footprint

atm, if a node has at least one property, a hashmap is created to store them. If the property count is known upfront (like in vertices), the hashmap can be created based on that count
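As a rough sketch of what sizing by a known count could look like: `java.util.HashMap` resizes once the number of entries exceeds capacity times load factor (0.75 by default), so the requested capacity has to be somewhat larger than the expected count. `PropertyMapSketch` and its methods are hypothetical names, not part of Gradoop.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for creating property maps with a known number of entries.
class PropertyMapSketch {
    // Returns an initial capacity large enough to hold expectedEntries
    // without triggering a resize at the default load factor of 0.75.
    static int capacityFor(int expectedEntries) {
        return (int) (expectedEntries / 0.75f) + 1;
    }

    // Creates a property map sized for the given number of entries.
    static Map<String, Object> newPropertyMap(int expectedEntries) {
        return new HashMap<>(capacityFor(expectedEntries));
    }
}
```

Note that `HashMap` rounds the requested capacity up to a power of two internally, so the memory saved depends on how far the count is from the default capacity of 16.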

update class documentation in label propagation

Add a short algorithm description in the class documentation

Bulk Export to dump graph repository into files

export should work using user-defined writers

test gradoop with Java 8

consider a graph predicate for the Select mapper

create factories for vertices, graphs and edges

Add a method to add an (out/in) edge to an existing vertex

fix EXTENDED_GRAPH in GradoopTest

line 0
"0|A|3 k1 5 v1 k2 5 v2 k3 5 v3|a.1.0 1 k1 5 v1|b.1.0 1 k1 5 v1|1 0"
should be
"0|A|3 k1 5 v1 k2 5 v2 k3 5 v3|a.1.0 1 k1 5 v1|b.1.0 2 k1 5 v1 k2 5 v2|1 0"

Go back to single labels on graphs and vertices

No need to support multiple labels anymore

update graphs table after BTGComputation

Currently the results are stored only vertex-centric. If a vertex gets assigned to a graph, this change needs to be reflected in the corresponding row in the graphs table.

convenience method to get edge count of vertex

upgrade HBase to 0.98.11-hadoop2 on dbclu

Please work together with Kevin on this, he will show you how the cluster is organized.
Thx

report edge property issue to Giraph Jira

correct stabilizationround calculation in AdaptiveRepartitioningComputation

LabelPropagation on Flink

Optimize graph initialize in EPGVertexValueWritable

this.graphs = Sets.newHashSet(graphs); (85)
does not limit new hashset size to graphs.size()

SimpleVertexReader does not create vertex having no neighbour

vertex-id neighbour1-id neighbour2-id // working
vertex-id // not working
-> VertexFactory.createDefaultVertexWithID(vertex-id)
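The expected behaviour can be illustrated with a small stand-alone parser. This is a hypothetical sketch, not the code of SimpleVertexReader: it assumes whitespace-separated numeric ids and returns them as an array whose first element is the vertex id, followed by zero or more neighbour ids.

```java
// Hypothetical sketch of parsing one adjacency line: "vertex-id [neighbour-id ...]".
class AdjacencyLineSketch {
    // Returns all ids on the line; index 0 is the vertex itself, the rest are neighbours.
    // A line that contains only the vertex id yields an array of length 1.
    // Throws NumberFormatException for blank or non-numeric input.
    static long[] parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        long[] ids = new long[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            ids[i] = Long.parseLong(tokens[i]);
        }
        return ids;
    }
}
```

A reader built on top of this would create the vertex from `ids[0]` in every case and add edges only for the remaining entries.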

Make BTGComputation read and write graph from and to HBase

make hbase scan for giraph input formats configurable

atm, all CFs are read by the scan. the user should select the CFs needed by the input format

implement HBase batch import for foodbroker

see http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/ for that

migrate Gradoop to Hadoop 2

define generic vertex value for subgraph extraction algorithms

algorithms that detect subgraphs (CC, Partitioning, BTG Extraction) store a value and a list of graphs they belong to. this can be generalized in a common vertex value class.

store edge count and edge index in edges CFs

the count can be used for fast retrieval of the node degree
the index is used to allow parallel edges in the graph (same label, same start and end vertex)

Evaluate Apache Flink for Gradoop Operators

Look into Flink and Gelly (Graph Component of Flink)
Map EPGM to Flink/Gelly datamodel

Make VertexHandler in HBaseVertexInput-/OutputFormat configurable

atm, the VertexHandler is set inside the format classes, this should be done from the outside using the configuration

Rename -gop (graph output path) to sth like -tmp (temp hbase path)

create BIIIG example pipeline

Foodbroker Batch Import -> BTG Computation -> Select -> Aggregate

upgrade HBase dependency to 1.0.0

should result in minor refactoring in the GraphStore implementation and Factory

PairWritable should support more types

currently, only Double is supported for the BTG pipeline. At least the Number types should be supported

Add data generator for Frequent Pattern Mining

The data generator should be able to generate scalable collections of scalable graphs showing frequent patterns. The number of graphs should be a power of ten, where a collection scale factor (CSF) is the exponent. Scaling the graph size is a little more complex. Let us have a look at the following illustration:

The graph shows the base graph of vertices labeled with A to K and S as well as edges labeled with gid and 40 to 100. All graphs will have the same number of vertices and edges. While vertex labels will be the same in all graphs, at least every 10th graph will vary in edge labels. All edges labeled with gid in the base graph will have labels based on the graph instance id, thus, will differ for any graph and, in consequence, will not be frequent.

Actually, there will be 10 base graphs, and the illustration shows only base graph 1 of 10. All base graphs will show the edge (A)-100->(B), 9 of 10 the edge (B)-90->(C), and so on. In the remaining base graphs the numeric label will be replaced by gid. As we duplicate those base graphs according to the CSF, we generate patterns of determinable support (i.e., 40%, ..., 100%). Additionally, those patterns will cover special cases such as loops and cycles.

Further on, the illustration shows colored pattern groups. Modifying a graph scale factor GSF will determine how often these pattern groups will be duplicated per graph. However, there will always be a single vertex S connecting all pattern groups, i.e., every graph will have 11 x GSF + 1 vertices.
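The sizing rules above can be written down as a short sketch. `FpmGeneratorSizes` and its method names are hypothetical, and the rule that an edge labeled with the number n appears in n / 10 of the 10 base graphs is inferred from the "100 in all, 90 in 9 of 10, and so on" description rather than stated explicitly.

```java
// Hypothetical sketch of the size formulas described for the data generator.
class FpmGeneratorSizes {
    // Number of graphs in the collection: 10 to the power of the collection scale factor.
    static long graphCount(int csf) {
        long count = 1;
        for (int i = 0; i < csf; i++) {
            count *= 10;
        }
        return count;
    }

    // Vertices per graph: 11 per duplicated pattern group plus the single connecting vertex S.
    static long verticesPerGraph(int gsf) {
        return 11L * gsf + 1;
    }

    // Assumption: an edge labeled n (40..100) occurs in n / 10 of the 10 base graphs,
    // which gives a support of n percent.
    static int baseGraphsWithLabel(int numericLabel) {
        return numericLabel / 10;
    }
}
```

For example, CSF = 3 and GSF = 2 would describe a collection of 1000 graphs with 23 vertices each.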

add CSV-Reader

SimpleVertexReader does not write Incoming Edges

v1 -> v2
v1 is created correctly with outgoing edge to v2
v2 is created, but not with an incoming edge

background: ConnectedComponentsComputationTest only needs the IDs to check the components, SimpleVertexReader would be easier to use as input.

Implement GraphCollection.select()

remove commented code in tests, write issues for TODOs

There are a lot of commented lines and TODO statements in ConnectedComponentsComputationTest and some TODO statements in ConnectedComponentsComputation. If you need a TODO, please write an issue for it and remove the comment. If code is not needed and is not related to any open issue, please remove it.