alibaba / graphar

An open source, standard data file format for graph data storage and retrieval.

Home Page: https://graphar.apache.org/

License: Apache License 2.0

CMake 2.52% C++ 43.43% Java 21.91% Scala 21.73% Shell 0.83% C 0.08% Makefile 0.12% Python 9.11% Jinja 0.13% HTML 0.13%

big-data graph graph-storage data-orchestration etl graph-analysis pyspark spark

graphar's Issues

[Feat] Implement GraphAr Spark Reader for reading GAR format files into Spark DataFrame

Is your feature request related to a problem? Please describe.
Implement the Spark Reader to provide functions for reading GraphAr files into Spark DataFrames.

Describe the solution you'd like
The reader should include VertexReader and EdgeReader:

VertexReader provides functions to read one type of vertex at a time and assemble the result into a Spark DataFrame.
EdgeReader provides functions to read edge chunks, including adjList, offset and property chunks.

[Feat] Implement `FragmentWriter` with GraphAr in GraphScope to support writing graph to GAR format files

Is your feature request related to a problem? Please describe.
We selected GraphScope as GraphAr's first landing system, to serve as an example of how to use GraphAr.
Implement a Fragment writer in GraphScope with GraphAr to support dumping the in-memory property graph to GraphAr format files.

Describe the solution you'd like
The writer works as follows:

First, the user designs the YAML files that describe the graph to dump, which can be the whole in-memory graph or a subgraph.
FragmentWriter loads the YAML files as Info objects (GraphInfo, VertexInfo and EdgeInfo) and uses the ArrowChunkWriter API of GraphAr to dump the Arrow tables to GraphAr format files.

Here is a prototype implementation of FragmentWriter

[Feat] Integrate GraphAr into GraphScope

Is your feature request related to a problem? Please describe.
Use GraphScope as our first landing system.

Describe the solution you'd like

Implement the writer/builder with GraphAr in vineyard.
Benchmark it (cf, ldbc snb30, ldbc snb100).
Add the related API calls to the GraphScope client to enable writing/building graphs with GraphAr.
Add related documentation and tests.

[Bug]: Offset chunk of spark writer got wrong value and output location

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

The path of the offset chunk file output by the Spark writer is not compatible with the path returned by the edge info:

Run mvn test -Dsuites='com.alibaba.graphar.WriterSuite test edge writer with vertex table and edge table'

The offset0 output path is /tmp/edge/person_knows_person/ordered_by_source/offset/part0
but the [getAdjListOffsetFilePath] method of the edge info returns /tmp/edge/person_knows_person/ordered_by_source/offset0
https://github.com/alibaba/GraphAr/blob/0991064e3f5a5844d453d2743bc2b03dc65fdf14/spark/src/main/scala/com/alibaba/graphar/EdgeInfo.scala#L291

The offset value generated by the Spark writer is the edge count of src/dst, not the real offset.
https://github.com/alibaba/GraphAr/blob/0991064e3f5a5844d453d2743bc2b03dc65fdf14/spark/src/main/scala/com/alibaba/graphar/writer/EdgeWriter.scala#L157-L163

Expected Behavior

Path: the offset output path should be compatible with the edge info.
Offset value: it should be the real offset, not the edge count.
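
The difference between an edge count and a real offset can be shown with a small sketch. It assumes that offsets follow the usual CSR-style layout (an exclusive prefix sum of the per-vertex edge counts, with one trailing entry holding the total); the function name `counts_to_offsets` is hypothetical and is not part of the Spark writer.

```python
from itertools import accumulate


def counts_to_offsets(edge_counts):
    """Turn per-vertex edge counts into CSR-style offsets (exclusive prefix sum).

    offsets[i] is the position of vertex i's first edge in the sorted edge
    list; the final entry equals the total number of edges.
    """
    return [0, *accumulate(edge_counts)]


# Vertices 0..3 have 2, 0, 3 and 1 outgoing edges respectively.
edge_counts = [2, 0, 3, 1]
offsets = counts_to_offsets(edge_counts)
print(offsets)  # [0, 2, 2, 5, 6]
```

With this layout, the edges of vertex i occupy the half-open range offsets[i] to offsets[i + 1], which is what a reader needs and what the raw counts alone do not provide.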

Minimal Reproducible Example

cd spark
mvn test -Dsuites='com.alibaba.graphar.WriterSuite test edge writer with vertex table and edge table'

Environment

Operating system: MacOS
GraphAr version:
v0.1.0

Link to GraphAr Logs

No response

Further Information

No response

[Feat] Provide libraries for other languages

Is your feature request related to a problem? Please describe.
Currently, the GraphAr libraries are available only for C++ and Spark. However, many graph processing systems are implemented in other programming languages (for example, Neo4j in Java). We need to provide libraries for more programming languages.

Describe the solution you'd like
Implement libraries for:

Java
Go
Rust
Python

Fix prefix of GAR files in document

[Feat] Include additional built-in data types for GraphAr libraries

Is your feature request related to a problem? Please describe.
Currently, the GraphAr C++ and Spark libraries support only a few basic data types (BOOL, INT32, INT64, FLOAT, DOUBLE, and STRING). To serve more scenarios, more built-in data types need to be added to the GraphAr libraries.

Describe the solution you'd like
Add more common data types to the GraphAr libraries, such as DATE, TIME, BINARY, STRUCT, MAP, ARRAY, and JSON. Since these types are not always supported by the CSV/ORC/Parquet file types and the C++/Spark standard libraries, each case needs careful handling, e.g., performing the necessary type conversions.

[Feat] Implement `FragmentBuilder` in GraphScope with GraphAr to support building graph from GAR format files

Is your feature request related to a problem? Please describe.
We selected GraphScope as GraphAr's first landing system, to serve as an example of how to use GraphAr.
Implement a Fragment builder in GraphScope with GraphAr to support building the in-memory property graph from GraphAr format files.

Describe the solution you'd like
The builder works as follows:

First, the user designs the YAML files that describe the graph to load, which can be the whole in-memory graph or a subgraph.
FragmentBuilder loads the YAML files as Info objects (GraphInfo, VertexInfo and EdgeInfo), uses the ArrowChunkReader API of GraphAr to load the chunk files as Arrow tables (including the vertex table, edge table and offset table), and uses these tables to construct the fragment.

Here is a prototype implementation of FragmentBuilder

[Feat][Doc] Generate GAR files of whole `ldbc-sample` property graph as an example to demonstrate GAR format

Is your feature request related to a problem? Please describe.
Maybe we need a widely-used property graph to demonstrate the GAR file format. The ldbc dataset seems to be a good choice.

This issue can be a good first issue for a developer.

Describe the solution you'd like
Generate GraphAr-formatted files for a property graph, including:

Design the metadata files (in YAML) for the ldbc graph; the chunk files can use CSV so that they are easy to read.
Generate the GAR data files with the Spark library.
Add tests that use the Info classes to check that the data files match the metadata information.

Additional context
related to issue #37

[Feat] Add tests for GraphAr spark tool and integrate to CI

[Feat] Ensure the metadata information behaves exactly the same across different languages

Is your feature request related to a problem? Please describe.
Use ProtoBuf to ensure that the metadata information of the GAR file format behaves exactly the same across different languages.

[Feat] Support user-defined data type parser in graph info

Is your feature request related to a problem? Please describe.
Currently, the GraphAr Info classes allow users to add custom data types based on the info version (#27). However, the Reader/Writer implementations in our libraries do not support reading or writing data of user-defined types.

Describe the solution you'd like
Extend the GraphAr libraries so that a user-defined parser can be passed to the Reader/Writer to handle custom data types.

[Bug] Fix the image links of README in docs

Currently, the images in the README do not show on alibaba.github.io/GraphAr. Fix the image links in the README.

[Feat] Add ccache to github actions

[Bug]: The `libgar` library built from source exposes its dependencies in its interface

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

When I use the gar library in my project with

target_link_libraries(my_example PUBLIC ${GAR_LIBRARIES})

and build my project, I get the error:

-larrow_static not found

It looks like the linked target has inherited the gar library's dependencies.
https://github.com/alibaba/GraphAr/blob/e8edfe38aa776f091dce24f4480fae06827194f4/CMakeLists.txt#L170-L172

Expected Behavior

The user's project should NOT inherit the interface dependencies of GraphAr.

Minimal Reproducible Example

project(MyExample)

find_package(gar REQUIRED)
include_directories(${GAR_INCLUDE_DIRS})

add_executable(my_example my_example.cc)
target_compile_features(my_example PRIVATE cxx_std_17)
target_link_libraries(my_example PRIVATE ${GAR_LIBRARIES})

Environment

Operating system: Ubuntu 20.04
GraphAr version: commit e8edfe3

Link to GraphAr Logs

No response

Further Information

No response

[Feat] Integrate Neo4j spark connector as input data source

[Feat] Implement GraphAr Spark Tools 0.1

Is your feature request related to a problem? Please describe.
GraphAr Spark tools are needed as a library for easily generating, loading and transforming GAR files with Apache Spark.

Describe the solution you'd like
GraphAr Spark tools consist of the following parts:

Reader: for reading GAR files into Spark DataFrame #29
Writer: for writing Spark DataFrame into GAR files #28
IndexGenerator: for helping to generate the vertex index for vertex/edge DataFrames #36
Info Classes: for constructing and accessing the meta information of GraphAr #32

[Community] Find and add some `good-first` issues for beginners

Is your feature request related to a problem? Please describe.
Add some good-first issues for developers.

[Bug] Inconsistent prefix for vertex property chunks in the test data

In the documentation and metadata, the prefix is "./vertex/person/first_name_last_name_gender", but the file path of the property chunks is "./vertex/person/firstName_lastName_gender".
documentation: https://alibaba.github.io/GraphAr/user-guide/getting-started.html#property-data
metadata: https://github.com/acezen/gar-test/blob/master/ldbc_sample/csv/person.vertex.yml
file path for the chunks: https://github.com/acezen/gar-test/tree/master/ldbc_sample/csv/vertex/person/firstName_lastName_gender

🛣️ Roadmap

Below is a high-level road map view for GraphAr to provide a sense of direction of where the project is going. This can change at any point and does not reflect many features and improvements that will also be included as part of the journey along this road map. For more granular detail of what will be included in upcoming releases you can review the project milestones as defined in our Release Process documentation.

Format Spec
- #231
- Extract property groups property from adjacent list in edge info
- Use the same property group chunks in all adjacent lists, to reduce total file size
- #275
C++
- cross-language schema compatibility for format.
- Support features in research paper of GraphAr
Java
- cross-language schema compatibility for format.
- Provide the ability to integrate with HugeGraph
- Refactor SDK to avoid strong binding to C++/arrow
Spark
- cross-language schema compatibility for format.
- #320
- #324
- #330
Python
- cross-language schema compatibility for format.
- Support Python SDK

Add introduction about GraphAr Spark tools in document

Is your feature request related to a problem? Please describe.
Add a dedicated page to the GraphAr documentation to introduce the Spark tools.

Describe the solution you'd like
The document would include:

the high-level overview of the Spark tools
how to get the tools
how to use them

Remove the `arrow/api.h` include from graph.h and move the related code to graph.cc

Is your feature request related to a problem? Please describe.
To avoid including Arrow's headers in GraphAr headers, we need to remove the arrow/api.h include from graph.h and move the related code to graph.cc.

Improve the document about the file format introduction and use examples

Is your feature request related to a problem? Please describe.
Improve the documentation that introduces the GAR file format to make it clearer. Also, reorganize and improve the examples to help users get started with GraphAr.

Improve the performance of high-level graph iterators of the C++ library

Is your feature request related to a problem? Please describe.
An important use case of GraphAr is to serve out-of-core graph processing scenarios. With the graph data saved as GAR files on disk, GraphAr provides a set of reading interfaces that load part of the graph data into memory when needed, in order to run analytics.
Because disk I/O time usually dominates the overall execution time in out-of-core graph processing, it is critically important that the GraphAr C++ library traverse vertices/edges efficiently through the high-level graph iterators.

[Feat] Improve the performance of Spark writer

[Feat] Implement Info class for GraphAr spark tool to construct and access the meta information of graph

Is your feature request related to a problem? Please describe.
Implement the Info classes for the GraphAr Spark tool. The Info classes include GraphInfo, VertexInfo and EdgeInfo, and align with the classes of the C++ SDK.

Describe the solution you'd like
Here is a proposal for the Info classes API:

class Property () {
  @BeanProperty var name: String = ""
  @BeanProperty var data_type: String = ""
  @BeanProperty var is_primary: Boolean = false
}

//methods of Property:
// -- getName: String
// -- getData_type: String
// -- getData_type_in_gar: GarType.Value
// -- getIs_primary: Boolean

class PropertyGroup () {
  @BeanProperty var prefix: String = ""
  @BeanProperty var file_type: String = ""
  @BeanProperty var properties = new java.util.ArrayList[Property]()
}

//methods of PropertyGroup:
// -- getPrefix: String
// -- getFile_type: String
// -- getFile_type_in_gar: FileType.Value
// -- getProperties:  ArrayList[Property]

class AdjList () {
  @BeanProperty var ordered: Boolean = false
  @BeanProperty var aligned_by: String = "src"
  @BeanProperty var prefix: String = ""
  @BeanProperty var file_type: String = ""
  @BeanProperty var property_groups = new java.util.ArrayList[PropertyGroup]()
}

//methods of AdjList:
// -- getOrdered: Boolean
// -- getAligned_by: String
// -- getPrefix: String
// -- getFile_type: String
// -- getFile_type_in_gar: FileType.Value
// -- getAdjList_type: String
// -- getAdjList_type_in_gar: AdjListType.Value
// -- getProperty_groups: ArrayList[PropertyGroup]

class GraphInfo() {
  @BeanProperty var name: String = ""
  @BeanProperty var prefix: String = ""
  @BeanProperty var vertices = new java.util.ArrayList[String]()
  @BeanProperty var edges = new java.util.ArrayList[String]()
  @BeanProperty var version: String = ""
}

//methods of GraphInfo:
// -- getName: String
// -- getPrefix: String
// -- getVertices: ArrayList[String]
// -- getEdges: ArrayList[String]
// -- getVersion: String

class VertexInfo() {
  @BeanProperty var label: String = ""
  @BeanProperty var chunk_size: Long = 0
  @BeanProperty var prefix: String = ""
  @BeanProperty var property_groups = new java.util.ArrayList[PropertyGroup]()
  @BeanProperty var version: String = ""
}

//methods of VertexInfo:
// -- getLabel: String
// -- getChunk_size: Long
// -- getPrefix: String
// -- getProperty_groups: ArrayList[PropertyGroup]
// -- getVersion: String
// -- containPropertyGroup(property_group: PropertyGroup) : Boolean
// -- containProperty(property_name: String) : Boolean
// -- getPropertyGroup(property_name: String):PropertyGroup
// -- getPropertyType(property_name: String): GarType.Value
// -- isPrimaryKey(property_name: String): Boolean
// -- getPrimaryKey(): String
// -- isValidated(): Boolean
// -- getVerticesNumFilePath(): String
// -- getFilePath(property_group: PropertyGroup, chunk_index: Long): String
// -- getDirPath(property_group: PropertyGroup): String

class EdgeInfo() {
  @BeanProperty var src_label: String = ""
  @BeanProperty var edge_label: String = ""
  @BeanProperty var dst_label: String = ""
  @BeanProperty var chunk_size: Long = 0
  @BeanProperty var src_chunk_size: Long = 0
  @BeanProperty var dst_chunk_size: Long = 0
  @BeanProperty var directed: Boolean = false
  @BeanProperty var prefix: String = ""
  @BeanProperty var adj_lists = new java.util.ArrayList[AdjList]()
  @BeanProperty var version: String = ""
}

//methods of EdgeInfo:
// -- getSrc_label: String
// -- getEdge_label: String
// -- getDst_label: String
// -- getChunk_size: Long
// -- getSrc_chunk_size: Long
// -- getDst_chunk_size: Long
// -- getDirected: Boolean
// -- getPrefix: String
// -- getAdj_lists: ArrayList[AdjList]
// -- containAdjList(adj_list_type: AdjListType.Value): Boolean
// -- getAdjListPrefix(adj_list_type: AdjListType.Value): String
// -- getAdjListFileType(adj_list_type: AdjListType.Value): FileType.Value
// -- containPropertyGroup(property_group: PropertyGroup, adj_list_type: AdjListType.Value) : Boolean
// -- containProperty(property_name: String) : Boolean
// -- getPropertyGroups(adj_list_type: AdjListType.Value): java.util.ArrayList[PropertyGroup]
// -- getPropertyType(property_name: String): GarType.Value
// -- getPropertyGroup(property_name: String, adj_list_type: AdjListType.Value): PropertyGroup 
// -- isPrimaryKey(property_name: String): Boolean
// -- getPrimaryKey(): String
// -- isValidated(): Boolean
// -- getAdjListOffsetFilePath(chunk_index: Long, adj_list_type: AdjListType.Value) : String
// -- getAdjListOffsetDirPath(adj_list_type: AdjListType.Value) : String
// -- getAdjListFilePath(vertex_chunk_index: Long, chunk_index: Long, adj_list_type: AdjListType.Value) : String
// -- getAdjListDirPath(adj_list_type: AdjListType.Value) : String
// -- getPropertyFilePath(property_group: PropertyGroup, adj_list_type: AdjListType.Value, vertex_chunk_index: Long, chunk_index: Long): String
// -- getPropertyDirPath(property_group: PropertyGroup, adj_list_type: AdjListType.Value) : String
// -- getVersion: String
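
To make the lookup methods in the proposal more concrete, here is a rough Python sketch of `VertexInfo` and three of its helpers. It only mirrors the fields listed above; the snake_case method names, the empty-string result when no primary key exists, and the sample prefixes are assumptions made for illustration, not the behavior of the Scala or C++ classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Property:
    name: str = ""
    data_type: str = ""
    is_primary: bool = False


@dataclass
class PropertyGroup:
    prefix: str = ""
    file_type: str = ""
    properties: List[Property] = field(default_factory=list)


@dataclass
class VertexInfo:
    label: str = ""
    chunk_size: int = 0
    prefix: str = ""
    property_groups: List[PropertyGroup] = field(default_factory=list)
    version: str = ""

    def contain_property(self, property_name: str) -> bool:
        # True if any property group holds a property with this name.
        return any(p.name == property_name
                   for g in self.property_groups for p in g.properties)

    def get_property_group(self, property_name: str) -> PropertyGroup:
        # Return the first group that contains the named property.
        for group in self.property_groups:
            if any(p.name == property_name for p in group.properties):
                return group
        raise KeyError(f"no property group contains {property_name!r}")

    def get_primary_key(self) -> str:
        # Assumption: return "" when no property is marked as primary.
        for group in self.property_groups:
            for prop in group.properties:
                if prop.is_primary:
                    return prop.name
        return ""


person = VertexInfo(
    label="person",
    chunk_size=100,
    prefix="vertex/person/",
    property_groups=[
        PropertyGroup("id/", "csv", [Property("id", "int64", True)]),
        PropertyGroup("firstName_lastName_gender/", "csv",
                      [Property("firstName", "string"),
                       Property("lastName", "string"),
                       Property("gender", "string")]),
    ],
    version="gar/v1",
)
print(person.get_primary_key())  # id
```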

Refine the README.rst to make it easy for users/developers to understand `What is GraphAr`

Is your feature request related to a problem? Please describe.
The current README of GraphAr is a little clumsy and incomplete. It does not help users/developers understand what GraphAr is.

Describe the solution you'd like

Clear and concise introduction of GraphAr.
Goals of GraphAr
Links to other documents (for advanced reading)
Concise writing
Add Code of conduct

[Feat] Implement GraphAr Spark `IndexGenerator` for helping to generate the vertex index for vertex/edge DataFrame

Is your feature request related to a problem? Please describe.
According to the GAR file format, the global vertex index is important: it must be continuous and unique.

The original data sources for Spark (e.g., the vertex DataFrame and edge DataFrame) usually do not contain such a column.
IndexGenerator is a helper object that helps GraphAr generate the vertex index for vertex and edge DataFrames.

Describe the solution you'd like
Here is an API proposal for IndexGenerator:

import org.apache.spark.sql.DataFrame

// API proposal only: each body is left as ??? (not yet implemented).
object IndexGenerator {
  // Helper methods for the vertex DataFrame.
  def constructVertexIndexMapping(vertexDf: DataFrame, primaryKey: String): DataFrame = {
    // Return a DataFrame that contains two columns: vertex index & primary key.
    ???
  }

  def generateVertexIndexColumn(vertexDf: DataFrame): DataFrame = {
    // Add a column that contains the vertex index.
    ???
  }

  // Helper methods for the edge DataFrame.
  // Generate the index from a vertex mapping.
  def generateSrcIndexForEdgesFromMapping(edgeDf: DataFrame, srcColumnName: String, srcIndexMapping: DataFrame): DataFrame = {
    // Join the edge table with the vertex index mapping for the source column.
    ???
  }

  def generateDstIndexForEdgesFromMapping(edgeDf: DataFrame, dstColumnName: String, dstIndexMapping: DataFrame): DataFrame = {
    // Join the edge table with the vertex index mapping for the destination column.
    ???
  }

  def generateVertexIndexForEdgesFromMapping(edgeDf: DataFrame, srcColumnName: String, dstColumnName: String, srcIndexMapping: DataFrame, dstIndexMapping: DataFrame): DataFrame = {
    // Join the edge table with the vertex index mapping for the source & destination columns.
    ???
  }

  // Generate the index by sorting the src/dst column.
  def generateSrcIndexForEdges(edgeDf: DataFrame, srcColumnName: String): DataFrame = {
    // Construct the vertex index for the source column.
    ???
  }

  def generateDstIndexForEdges(edgeDf: DataFrame, dstColumnName: String): DataFrame = {
    // Construct the vertex index for the destination column.
    ???
  }

  def generateSrcAndDstIndexUnitedlyForEdges(edgeDf: DataFrame, srcColumnName: String, dstColumnName: String): DataFrame = {
    // Construct the vertex index for the source & destination columns together.
    ???
  }
}
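
The mapping-then-join idea behind the proposal can be sketched without Spark. The following plain-Python stand-in uses lists of dicts in place of DataFrames and assigns indices in order of first appearance; the function names and the `src_index`/`dst_index` column names are hypothetical and are not part of the proposed API.

```python
def construct_vertex_index_mapping(vertex_rows, primary_key):
    """Map each primary-key value to a continuous, unique index starting at 0."""
    mapping = {}
    for row in vertex_rows:
        # setdefault keeps the first index assigned to a repeated key.
        mapping.setdefault(row[primary_key], len(mapping))
    return mapping


def generate_vertex_index_for_edges(edge_rows, src_col, dst_col,
                                    src_mapping, dst_mapping):
    """Attach source and destination vertex indices to every edge (the join step)."""
    return [
        {**edge,
         "src_index": src_mapping[edge[src_col]],
         "dst_index": dst_mapping[edge[dst_col]]}
        for edge in edge_rows
    ]


persons = [{"id": 933}, {"id": 1129}, {"id": 4194}]
knows = [{"src": 933, "dst": 4194}, {"src": 1129, "dst": 933}]

mapping = construct_vertex_index_mapping(persons, "id")
indexed = generate_vertex_index_for_edges(knows, "src", "dst", mapping, mapping)
print(mapping)     # {933: 0, 1129: 1, 4194: 2}
print(indexed[0])  # {'src': 933, 'dst': 4194, 'src_index': 0, 'dst_index': 2}
```

In Spark the dictionary lookup would become a join between the edge DataFrame and the mapping DataFrame, which is what the `...FromMapping` methods in the proposal describe.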

[Feat] Integrate NebulaGraph spark connector as input data source for GraphAr spark tool

Is your feature request related to a problem? Please describe.
The graph data migration between NebulaGraph and GraphAr could be an important application of GraphAr. This can be implemented based on the NebulaGraph Spark connector and the GraphAr Spark library, including reading graph data from NebulaGraph to generate GAR files, and reading from GraphAr to create/update instances in NebulaGraph.

Describe the solution you'd like
Please refer to the integration with Neo4j (#107).

[Feat][FileFormat] CSV should include the header row in chunk file

Is your feature request related to a problem? Please describe.
Currently, the CSV chunk files generated by the C++/Spark writers do not contain a header row, which loses the schema information of the data. We should include the header row when generating CSV chunk files.

Describe the solution you'd like
Enable the include_header option of the C++ chunk writer; see: https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N5arrow3csv12WriteOptions14include_headerE
Enable the header option in the Spark DataFrame writer; see: https://spark.apache.org/docs/latest/sql-data-sources-csv.html
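
The effect of the two options can be illustrated with Python's standard `csv` module. This is only a sketch of what a chunk file looks like with and without the header row; `write_csv_chunk` is a hypothetical helper and does not use the Arrow or Spark writers.

```python
import csv
import io


def write_csv_chunk(rows, columns, include_header=True):
    """Serialize one chunk to CSV text, optionally starting with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if include_header:
        writer.writerow(columns)  # keeps the column names with the data
    writer.writerows(rows)
    return buffer.getvalue()


rows = [(0, "Alice"), (1, "Bob")]
with_header = write_csv_chunk(rows, ["id", "firstName"])
without_header = write_csv_chunk(rows, ["id", "firstName"], include_header=False)
print(with_header)
```

Without the header, a reader has to take the column names and their order from the metadata files alone.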

Revise image links in docs

Use doc URL, instead of raw file link to repo.

[Improvement] Reorganize the code directory to make it easy for developers to add libraries for other languages

Is your feature request related to a problem? Please describe.
Since the C++ library was the first library supported by GraphAr, its code sits directly in the root of the source tree. To make it easy to add libraries for other languages, we need to reorganize the code directory like this:

.
├── cpp (c++ library code)
├── docs
├── examples
├── spark
└── thirdparty

Describe the solution you'd like

Move the C++ library code to the cpp directory
~~Add an CMakeList.txt to manage the building of all libraries~~

[Feat] Integrate LDBC spark connector as input data source for GraphAr spark tool

Is your feature request related to a problem? Please describe.
LDBC provides a synthetic graph generator running on Spark (https://github.com/ldbc/ldbc_snb_datagen_spark). We can utilize the GraphAr spark library to integrate with this graph generator, for dumping the generated graph data into GraphAr files.

Describe the solution you'd like
Refer to the API reference of the Reader/Writer and the graph-level interface of the GraphAr Spark library. The integration with the Neo4j Spark connector (#107) can also help.

[Feat]: Support data type extension in graph information based on `version` attribute of Infos

Is your feature request related to a problem? Please describe.
The version attribute of the infos (graph, vertex, edge) is currently only a number. It could also carry implicit information about which property data types that version supports. As the version grows, the supported data types could be extended. For example:
version 1 -> supports bool, int32, int64, float, double, string
version 2 -> supports bool, int32, int64, float, double, string, date32
Describe the solution you'd like

Use a string instead of a number as the version, similar to a browser's User-Agent string.
Version examples:
gar/v1
gar/v2
gar/v3 (user_define1, user_define2) # suppose version 3 or higher supports user-defined types.
Add a VersionMeta class that records the data types supported by each version and parses the version string.
If the YAML contains a data type that its version does not support, raise an error to the user.
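
A minimal sketch of such a `VersionMeta` class is shown below. The type sets for versions 1 and 2 come from the examples in this issue; the entry for version 3, the rule that user-defined types require version 3 or higher, and the method names `parse` and `check_type` are assumptions made for illustration.

```python
import re

# Built-in types per version. Versions 1 and 2 follow the examples in this
# issue; version 3 is assumed to add no new built-in types.
BUILTIN_TYPES = {
    1: {"bool", "int32", "int64", "float", "double", "string"},
    2: {"bool", "int32", "int64", "float", "double", "string", "date32"},
    3: {"bool", "int32", "int64", "float", "double", "string", "date32"},
}
USER_TYPES_MIN_VERSION = 3  # assumption taken from the "gar/v3" example

_VERSION_RE = re.compile(r"^gar/v(\d+)\s*(?:\(([^)]*)\))?$")


class VersionMeta:
    def __init__(self, version, user_types):
        self.version = version
        self.user_types = user_types

    @classmethod
    def parse(cls, text):
        """Parse strings such as 'gar/v1' or 'gar/v3 (type_a, type_b)'."""
        match = _VERSION_RE.match(text.strip())
        if match is None:
            raise ValueError(f"invalid version string: {text!r}")
        version = int(match.group(1))
        if version not in BUILTIN_TYPES:
            raise ValueError(f"unsupported version: {version}")
        raw = match.group(2) or ""
        user_types = {t.strip() for t in raw.split(",") if t.strip()}
        if user_types and version < USER_TYPES_MIN_VERSION:
            raise ValueError("user-defined types need version 3 or higher")
        return cls(version, user_types)

    def check_type(self, type_name):
        """Raise ValueError if this version does not support the data type."""
        if type_name not in BUILTIN_TYPES[self.version] | self.user_types:
            raise ValueError(
                f"type {type_name!r} is not supported by gar/v{self.version}")


meta = VersionMeta.parse("gar/v3 (user_define1, user_define2)")
meta.check_type("date32")        # built-in since version 2
meta.check_type("user_define1")  # declared in the version string
print(meta.version, sorted(meta.user_types))  # 3 ['user_define1', 'user_define2']
```

A YAML loader could call `check_type` for every property it reads, so that an unsupported type is reported when the info file is loaded rather than when the data is read.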

Add CODE_OF_CONDUCT.md

A code of conduct helps establish expectations for the behavior of the project's participants and facilitates healthy, constructive community behavior.

We should add a document to the root of the git repository to direct interested individuals to the CoC.

Improve the performance of Spark Reader

Is your feature request related to a problem? Please describe.
Optimize the Spark Reader to support reading multiple chunks in parallel for better performance, while maintaining the relative order of the chunks in the resulting DataFrame.
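
The requirement (read in parallel, but keep the chunk order) can be sketched with a thread pool from the Python standard library. This is not the Spark Reader; `read_chunks_in_order` and `fake_read_chunk` are hypothetical names, and the artificial delays only make the earlier chunks finish last.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_chunks_in_order(chunk_paths, read_chunk, max_workers=4):
    """Read chunks concurrently and return the results in chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of which
        # task finishes first.
        return list(pool.map(read_chunk, chunk_paths))


def fake_read_chunk(path):
    # Stand-in reader: chunk0 is the slowest, chunk2 the fastest.
    delay = {"chunk0": 0.03, "chunk1": 0.02, "chunk2": 0.0}[path]
    time.sleep(delay)
    return f"rows of {path}"


result = read_chunks_in_order(["chunk0", "chunk1", "chunk2"], fake_read_chunk)
print(result)  # ['rows of chunk0', 'rows of chunk1', 'rows of chunk2']
```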

Refine the documentation to make GraphAr easy for users/developers to use

[Feat] Implement GraphAr Spark Writer for writing Spark DataFrame into GAR format files

Is your feature request related to a problem? Please describe.
Implement the writer of the Spark tool to provide functions that generate GraphAr format files from a Hive table.

Describe the solution you'd like
It is better to read the Hive table as a Spark DataFrame and use DataFrame operators to generate the files.
The writer should include VertexWriter and EdgeWriter.

VertexWriter provides functions to generate the chunk files of a property group based on the user-defined vertex info.
EdgeWriter provides functions to generate the chunk files of the adj list/offset/property group based on the user-defined edge info.
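
The chunking step that both writers rely on can be sketched in plain Python: rows that are already ordered by vertex index are split into pieces of `chunk_size`, so the vertex with index i lands in chunk i // chunk_size. The `chunk<N>` file naming and the helper name `split_into_chunks` are assumptions for illustration; the real writers operate on Spark DataFrames.

```python
def split_into_chunks(rows, chunk_size):
    """Split index-ordered rows into chunks of at most chunk_size rows.

    Returns a dict that maps an assumed chunk file name to its rows.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return {
        f"chunk{start // chunk_size}": rows[start:start + chunk_size]
        for start in range(0, len(rows), chunk_size)
    }


vertices = [{"id": i} for i in range(5)]
chunks = split_into_chunks(vertices, chunk_size=2)
print(list(chunks))           # ['chunk0', 'chunk1', 'chunk2']
print(len(chunks["chunk2"]))  # 1
```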

[Feat] Support more file formats for payload files

Is your feature request related to a problem? Please describe.
GraphAr chunk files can currently be stored in ORC, Parquet, or CSV. We could support more built-in file formats, such as
JSON, HDF5, and Avro, to extend GraphAr's capabilities and satisfy different file format requirements.

Describe the solution you'd like
Support more file types by extending the metadata information and implementing the related reading/writing functions with the help of
Arrow or other third-party libraries.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

[C++] [Improvement] Provide more writing methods in the C++ library

Is your feature request related to a problem? Please describe.
Currently, the low-level writers (VertexPropertyWriter and EdgeChunkWriter) only support writing Arrow tables, so users must construct such tables before writing (e.g., to write PageRank results stored in a std::vector into GAR files). The high-level writers (VerticesBuilder and EdgesBuilder) require constructing Vertex/Edge objects first, which are GraphAr's internal high-level data structures.

Describe the solution you'd like
We propose providing more built-in writing methods in the C++ Writer SDK, to support data structures other than Arrow tables and GraphAr Vertex/Edge. A possible solution is to accept containers from the STL, as the Boost Graph Library does, including:

std::vector
std::list
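std::forward_list (the standard counterpart of the non-standard slist)
std::set
std::unordered_set (the standard counterpart of the non-standard hash_set)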
std::multiset
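One way to accept all of these containers without a separate overload per type is to take an iterator range. The sketch below is only an illustration of that idea: ToChunks is a hypothetical helper, not part of the GraphAr API, and it stops at producing fixed-size in-memory chunks. A real writer would still need to convert each chunk to the target file format.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: splits any iterator range into fixed-size chunks,
// mirroring how values are grouped into chunk files. The last chunk may be
// shorter than chunk_size. Works with std::vector, std::list, std::set, etc.
template <typename InputIt>
auto ToChunks(InputIt first, InputIt last, std::size_t chunk_size)
    -> std::vector<
        std::vector<typename std::iterator_traits<InputIt>::value_type>> {
  assert(chunk_size > 0);
  using Value = typename std::iterator_traits<InputIt>::value_type;
  std::vector<std::vector<Value>> chunks;
  for (; first != last; ++first) {
    // Start a new chunk when there is none yet or the current one is full.
    if (chunks.empty() || chunks.back().size() == chunk_size) {
      chunks.emplace_back();
      chunks.back().reserve(chunk_size);
    }
    chunks.back().push_back(*first);
  }
  return chunks;
}
```

For the PageRank case mentioned above, ToChunks(ranks.begin(), ranks.end(), chunk_size) on a std::vector<double> would give one inner vector per chunk, in vertex order.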

Add a release tutorial to the contributing guide to make the GraphAr release process easy for maintainers

Is your feature request related to a problem? Please describe.
Add a GitHub Action to simplify the GraphAr release process, and add a release tutorial that shows maintainers how to cut a version.

Describe the solution you'd like
Simplify the process with a tool such as action-automatic-releases.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

[Feat] Add appropriate APIs and documentation to the GraphScope client so users can easily write graphs to and build graphs from GAR format files

Is your feature request related to a problem? Please describe.
GraphAr is already integrated into v6d, which enables loading graphs from GraphAr. We should add easy-to-use APIs to the GraphScope client so that users can easily archive graphs to, and load graphs from, GraphAr.

Revise the application example implementation

Is your feature request related to a problem? Please describe.
Currently, the GraphAr examples are implemented like unit tests, so they do not intuitively show users or beginner developers how to use GraphAr.
We need to revise the implementation so that they read more like examples and showcases.

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
related to #37

[Feat] GraphAr Spark library to support adding new rows/columns

Is your feature request related to a problem? Please describe.
In real use cases, the graph data is usually continuously changing, including adding, deleting, and modifying vertices or edges. As part of incremental management functions, we intend to extend the GraphAr Spark tools to support adding new rows/columns conveniently and efficiently.

Describe the solution you'd like
Support adding new rows/columns to vertex/edge tables, and dump the new data either by generating new GAR files or by appending to/rewriting existing GAR files.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

Release version v0.1.0

Is your feature request related to a problem? Please describe.
Release version v0.1.0

Describe the solution you'd like

Check that CI passes.
Prepare the release note.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

[Feat] Add Spark Examples in GAR using GraphAr Spark tools

Is your feature request related to a problem? Please describe.
The GraphAr Spark tools can be applied in scenarios where the graph format needs to be transformed. They can also be used when GraphAr serves as a data source for executing SQL queries or doing graph processing. We can add some examples to show these use cases.

Describe the solution you'd like
Add examples that utilize the Spark tools to:

use GAR as a data source for graph processing (e.g., run CC using GraphX).
transform GAR data between different file types (e.g., from ORC to Parquet).
transform GAR data between different adjList types (e.g., from COO to CSR).
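To make the last item concrete, the following is a minimal sketch of a COO-to-CSR conversion in plain C++. It shows only the transformation itself, not the Spark tools or the GAR file layout. Csr and CooToCsr are hypothetical names, and vertex ids are assumed to be dense integers in [0, vertex_count).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical CSR container: the neighbors of vertex v are stored in
// neighbors[offsets[v]] .. neighbors[offsets[v + 1] - 1].
struct Csr {
  std::vector<std::size_t> offsets;    // size is vertex_count + 1
  std::vector<std::size_t> neighbors;  // destination ids grouped by source
};

// Converts a COO edge list of (source, destination) pairs into CSR.
// Edges with the same source keep their relative input order.
Csr CooToCsr(const std::vector<std::pair<std::size_t, std::size_t>>& edges,
             std::size_t vertex_count) {
  Csr csr;
  csr.offsets.assign(vertex_count + 1, 0);
  // Step 1: count the out-degree of every source vertex.
  for (const auto& edge : edges) {
    assert(edge.first < vertex_count && edge.second < vertex_count);
    ++csr.offsets[edge.first + 1];
  }
  // Step 2: a prefix sum turns the degrees into start offsets.
  for (std::size_t v = 0; v < vertex_count; ++v) {
    csr.offsets[v + 1] += csr.offsets[v];
  }
  // Step 3: write each destination into the next free slot of its source.
  std::vector<std::size_t> cursor(csr.offsets.begin(), csr.offsets.end() - 1);
  csr.neighbors.resize(edges.size());
  for (const auto& edge : edges) {
    csr.neighbors[cursor[edge.first]++] = edge.second;
  }
  return csr;
}
```

The offsets array here plays a role similar to the offset chunks that the readers and writers in these issues handle alongside adjList chunks, although the exact on-disk encoding is defined by GraphAr and is not reproduced here.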

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

[Feat] Add CLA check for PRs

as titled.

Add `AdjListInfo` for EdgeInfo to store adj list information

Is your feature request related to a problem? Please describe.
Based on the graph information file design (example), the AdjList of a graph contains information about align, the edge chunk file type, and the property_groups of the edge. In the C++ library, however, AdjList is only an enum type, and the other information is stored in maps inside EdgeInfo.
This does not match the YAML file design. To address the problem, we could add an intermediate structure, AdjListInfo, between PropertyGroup and EdgeInfo to keep track of the adj list information of the graph.

Describe the solution you'd like
The AdjListInfo could look like this (just a proposal):

#include <string>
#include <vector>

// FileType and PropertyGroup are the existing GraphAr types.
class AdjListInfo {
    FileType file_type_;
    std::string prefix_;
    std::vector<PropertyGroup> property_groups_;

  public:
    // Constructor
    AdjListInfo(FileType file_type, std::string prefix);

    // some add methods
    void AddPropertyGroup(const PropertyGroup& pg);

    // some getter methods
    FileType GetFileType() const;
};

Then, use AdjListInfo objects as member variables to update the implementation of EdgeInfo.
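A minimal, self-contained sketch of how EdgeInfo might hold AdjListInfo objects is given below. Everything in it is an assumption made for illustration: FileType, AdjListType, and PropertyGroup are simplified stand-ins for the real GraphAr types, and the EdgeInfo methods shown (AddAdjList, ContainsAdjList, GetAdjListInfo) are hypothetical names, not the library's API.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for the GraphAr types (assumptions for this sketch).
enum class FileType { kCsv, kParquet, kOrc };
enum class AdjListType { kCoo, kCsr };
struct PropertyGroup {
  std::vector<std::string> property_names;
};

// Groups everything that belongs to one adj list, as in the YAML design.
class AdjListInfo {
 public:
  AdjListInfo(FileType file_type, std::string prefix)
      : file_type_(file_type), prefix_(std::move(prefix)) {}

  void AddPropertyGroup(const PropertyGroup& pg) {
    property_groups_.push_back(pg);
  }

  FileType GetFileType() const { return file_type_; }
  const std::string& GetPrefix() const { return prefix_; }
  const std::vector<PropertyGroup>& GetPropertyGroups() const {
    return property_groups_;
  }

 private:
  FileType file_type_;
  std::string prefix_;
  std::vector<PropertyGroup> property_groups_;
};

// EdgeInfo keeps one AdjListInfo per adj list type instead of several
// parallel maps keyed by the enum.
class EdgeInfo {
 public:
  void AddAdjList(AdjListType type, AdjListInfo info) {
    adj_lists_.erase(type);  // replace any existing entry for this type
    adj_lists_.emplace(type, std::move(info));
  }
  bool ContainsAdjList(AdjListType type) const {
    return adj_lists_.count(type) > 0;
  }
  // Precondition: ContainsAdjList(type) is true (std::map::at throws if not).
  const AdjListInfo& GetAdjListInfo(AdjListType type) const {
    return adj_lists_.at(type);
  }

 private:
  std::map<AdjListType, AdjListInfo> adj_lists_;
};
```

With this layout, looking up the file type or property groups of an adj list is a single map access followed by a getter call, which keeps the C++ structure close to the nesting in the YAML file.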

Refine CONTRIBUTING.rst to make it easy for users/developers to get started

Add these:

What kinds of contributions we are looking for
Getting started: how to report a bug, how to suggest a feature, how to fork the repo and write code, how to open a pull request (for newcomers)
The code review process
Use a warm, friendly tone
Add a code of conduct

Fully utilize the features of different file formats for improved efficiency

Is your feature request related to a problem? Please describe.
GraphAr currently supports the CSV, ORC, and Parquet file formats, and it is going to support more file types such as JSON, HDF5, and Avro. To improve the efficiency of reading, writing, and storing data, the features of each file format should be considered and fully utilized, for example by applying the most appropriate compression and encoding scheme to the data, or by enabling filter pushdown to improve query performance.
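As an illustration of why filter pushdown helps, the sketch below skips whole chunks using per-chunk minimum/maximum statistics, which is the kind of metadata that formats such as Parquet and ORC keep for row groups or stripes. ChunkStats and SelectChunks are hypothetical names for this sketch. A real implementation would rely on the statistics and predicate APIs of the underlying file format library instead.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-chunk statistics for one integer column.
struct ChunkStats {
  long min_value;
  long max_value;
};

// Returns the indices of the chunks that may contain a value v with
// lower <= v <= upper. Chunks whose [min, max] range does not overlap the
// query range are skipped without being read.
std::vector<std::size_t> SelectChunks(const std::vector<ChunkStats>& stats,
                                      long lower, long upper) {
  std::vector<std::size_t> selected;
  for (std::size_t i = 0; i < stats.size(); ++i) {
    const bool overlaps =
        stats[i].max_value >= lower && stats[i].min_value <= upper;
    if (overlaps) {
      selected.push_back(i);
    }
  }
  return selected;
}
```

The benefit depends on how the data is laid out: when chunks are sorted or clustered on the filtered column, most chunks fall outside the range and are never read.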

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

🛣️ Roadmap
