Coder Social home page Coder Social logo

yixinglu / graphar Goto Github PK

View Code? Open in Web Editor NEW

This project forked from apache/incubator-graphar

0.0 0.0 0.0 2.69 MB

An open source, standard data file format for graph data storage and retrieval

Home Page: https://alibaba.github.io/GraphAr/

License: Apache License 2.0

C++ 63.86% Java 0.23% Scala 32.80% CMake 3.10%

graphar's Introduction

GraphAr

GraphAr CI Docs CI GraphAr Docs

Welcome to GraphAr (short for "Graph Archive"), an open source, standardized file format for graph data storage and retrieval.

What is GraphAr?

Overview

Graph processing serves as the essential building block for a diverse variety of real-world applications such as social network analytics, data mining, network routing, and scientific computing.

GraphAr is a project that aims to make it easier for diverse applications and systems (in-memory and out-of-core storages, databases, graph computing systems, and interactive graph query frameworks) to build and access graph data conveniently and efficiently.

It can be used for importing/exporting and persistent storage of graph data, thereby reducing the burden on systems when working together. Additionally, it can serve as a direct data source for graph processing applications.

To achieve this, GraphAr provides:

  • The Graph Archive(GAR) file format: a standardized system-independent file format for storing graph data
  • Libraries: a set of libraries for reading, writing and transforming GAR files

By using GraphAr, you can:

  • Store and persist your graph data in a system-independent way with the GAR file format
  • Easily access and generate GAR files using the libraries
  • Utilize Apache Spark to quickly manipulate and transform your GAR files

The GAR File Format

The GAR file format is designed for storing property graphs. It uses metadata to record all the necessary information of a graph, and maintains the actual data in a chunked way.

A property graph consists of vertices and edges, with each vertex contains:

  • A unique identifier (called vertex id or vertex index).
  • A text label that describes the vertex type.
  • A collection of properties, with each property can be represented by a key-value pair.

And each edge contains:

  • A unique identifier (called edge id or edge index).
  • The outgoing vertex (source).
  • The incoming vertex (destination).
  • A text label that describes the relationship between the two vertices.
  • A collection of properties.

The following is an example property graph containing two types of vertices ("person" and "comment") and three types of edges.

property graph

Vertices in GraphAr

Logical table of vertices

Each type of vertices (with the same label) constructs a logical vertex table, with each vertex assigned with a global index (vertex id) starting from 0, corresponding to the row number of the vertex in the logical vertex table. An example layout for a logical table of vertices under the label "person" is provided for reference.

Given a vertex id and the vertex label, a vertex is uniquely identifiable and its respective properties can be accessed from this table. The vertex id is further used to identify the source and destination vertices when maintaining the topology of the graph.

vertex logical table

Physical table of vertices

The logical vertex table will be partitioned into multiple continuous vertex chunks for enhancing the reading/writing efficiency. To maintain the ability of random access, the size of vertex chunks for the same label is fixed. To support to access required properties avoiding reading all properties from the files, and to add properties for vertices without modifying the existing files, the columns of the logical table will be divided into several column groups.

Take the "person" vertex table as an example, if the chunk size is set to be 500, the logical table will be separated into sub-logical-tables of 500 rows with the exception of the last one, which may have less than 500 rows. The columns for maintaining properties will also be divided into distinct groups (e.g., 2 for our example). As a result, a total of 4 physical vertex tables are created for storing the example logical table, which can be seen from the following figure.

vertex physical table

Edges in GraphAr

Logical table of edges

For maintaining a type of edges (that with the same triplet of the source label, edge label, and destination label), a logical edge table is established. And in order to support quickly creating a graph from the graph storage file, the logical edge table could maintain the topology information in a way similar to CSR/CSC (learn more about CSR/CSC), that is, the edges are ordered by the vertex id of either source or destination. In this way, an offset table is required to store the start offset for each vertex's edges, and the edges with the same source/destination will be stored continuously in the logical table.

Take the logical table for "person likes person" edges as an example, the logical edge table looks like:

edge logical table

Physical table of edges

According to the partition strategy and the order of the edges, edges can be one of the four types: ordered_by_source, ordered_by_dest, unordered_by_source or unordered_by_dest. A logical edge table could contain physical tables of three categories:

  • The adjList table (which contains only two columns: the vertex id of the source and the destination).
  • The edge property tables (if there are properties on edges).
  • The offset table (optional, only required for ordered edges).

Since the vertex table are partitioned into multiple chunks, the logical edge table is also partitioned into some sub-logical-tables, with each sub-logical-table contains edges that the source (if the type is ordered_by_source or unordered_by_source) or destination (if the type is ordered_by_dest or unordered_by_dest) vertices are in the same vertex chunk. After that, a sub-logical-table is further divided into edge chunks of a predefined, fixed number of rows (referred to as edge chunk size). Finally, an edge chunk is separated into an adjList table and 0 or more property tables.

Additionally, the partition of the offset table should be in alignment with the partition of the corresponding vertex table. The first row of each offset chunk is always 0, indicating the starting point for the corresponding sub-logical-table for edges.

Take the "person knows person" edges to illustrate. Suppose the vertex chunk size is set to 500 and the edge chunk size is 1024, the edges will be saved in the following physical tables:

edge logical table1 edge logical table2

Libraries

Libraries are provided for reading, writing and transforming files in GraphAr, now the C++ library and the Spark library are available. And we are going to provide libraries for more programming languages.

The C++ Library

See GraphAr C++ Library for details about the building of the C++ library.

The Spark Library

See GraphAr Spark Library for details about the Spark library.

Contributing to GraphAr

See Contribution Guide for details on submitting patches and the contribution workflow.

License

GraphAr is distributed under Apache License 2.0. Please note that third-party libraries may not have the same license as GraphAr.

graphar's People

Contributors

acezen avatar andydiwenzhu avatar lixueclaire avatar sighingnow avatar yecol avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.