Coder Social home page Coder Social logo

substrait-cpp's People

Contributors

chaojun-zhang avatar epsilonprime avatar ingomueller-net avatar jacques-n avatar vibhatha avatar westonpace avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

substrait-cpp's Issues

The Antlr dependency is not like the others. Is that a problem?

I believe all of our non-antlr dependencies are obtained via submodules. Antlr, on the other hand, is downloaded from github via custom cmake. I think the reasoning behind this discrepancy is that these are the instructions suggested by Antlr itself. Or perhaps it is that Antlr requires java?

Part of me thinks that having these things be consistent will make it easier to understand and maintain for future developers. Is there any particular reason we can't have it as a submodule?

Break cyclic dependencies of CMake targets

There seem to be several cycles in the dependency graph of CMake targets. This isn't a problem for compiling static libraries, but if I build with cmake -DBUILD_SHARED_LIBS=ON ..., I get the following error (uninteresting parts omitted):

CMake Error: The inter-target dependency graph contains the following strongly connected component (cycle):
  "substrait_io" of type STATIC_LIBRARY
    depends on "substrait_textplan_converter" (weak)
    depends on "symbol_table" (weak)
    depends on "substrait_textplan_loader" (weak)
    depends on "substrait_textplan_converter" (strong)
    depends on "substrait_textplan_loader" (strong)
  "symbol_table" of type SHARED_LIBRARY
    depends on "substrait_textplan_converter" (weak)
    depends on "substrait_io" (weak)
    depends on "substrait_textplan_loader" (weak)
  "substrait_textplan_converter" of type SHARED_LIBRARY
    depends on "substrait_io" (weak)
    depends on "substrait_textplan_loader" (weak)
    depends on "symbol_table" (weak)
  "substrait_textplan_loader" of type SHARED_LIBRARY
    depends on "symbol_table" (weak)
    depends on "substrait_textplan_converter" (weak)
    depends on "substrait_io" (weak)
At least one of these targets is not a STATIC_LIBRARY.  Cyclic dependencies are allowed only among static libraries.

I understand that there are reasons not to build shared libraries, but I argue that there are situations where it is useful. (I am working on a project with many top-level binaries that share 99% of the code and compilation and disk usage is a lot better with shared libs, so I hugely prefer that for developing and CI.)

I am currently trying to consume substrait-cpp as a dependency using add_subdirectory. Since I build my top-level project with BUILD_SHARED_LIBS, substrait-cpp is built with the same configuration. For any scenario where that flag is desired or required for the top-level project, support that flag would be great.

Finally, I believe that cyclic dependencies are often an indication for sloppy design or code organization and therefor a good idea to clean up. However, I haven't looked into the cycles yet, so I don't know yet whether there are good reasons to have them or things have just grown this way...

Align on parameter passing style

The rules we use in Arrow are:

  • If a copy of the variable will never be made then pass const T& if the object is deep and T if the object is shallow (note, I find this deep/shallow thing annoying because things that start out shallow can become deep, so I'm open to improvement)
  • If a copy of the variable might be made (or will always be made) then pass T to allow the caller to std::move and avoid the copy
  • If the variable itself needs to be modified (e.g. an out parameter) then pass T*. This is different than the google C++ style guide which prefers T&. In Arrow, T* is preferred because it makes it obvious the value is being modified at the call site.

Substrait exceptions should note the code location of the caller

The current implementation of macros such as SUBSTRAIT_FAIL is to return an exception with the location of the exceptions library:

libc++abi: terminating with uncaught exception of type io::substrait::common::SubstraitRuntimeError: Exception: SubstraitRuntimeError
Error Code: INVALID_STATE
Type: system
Reason: Return type from visitExpression with case of 2 was not std::string!
Function: SubstraitException
File: /home/user/projects/substrait-cpp/src/substrait/common/Exceptions.cpp
:Line: 22

bug: type library does not correctly identify nullable for complex types

Example:

  testDecode<ParameterizedList>(
      "list<list<string?>>",
      [](const std::shared_ptr<const ParameterizedList>& typePtr) {
        ASSERT_FALSE(typePtr->nullable());  // returns true
        ASSERT_FALSE(typePtr->elementType()->nullable());  // returns true
        ASSERT_TRUE(std::dynamic_pointer_cast<const ParameterizedList>(
                        typePtr->elementType())
                        ->elementType()
                        ->nullable());
        ASSERT_EQ(typePtr->signature(), "list<list<str>>");
      });

Introduce a second text format for pipelines/plans

Right now pipelines are kind of hard to read if there are nodes that have multiple inputs. Let's add a second supported pipeline pattern that is more similar to other query plans. It can use root to leaf ordering. In the design, it will list out a straight pipeline one line after another. If an operator has more than one input, it's children are indented (and subsequent as well) but are prefixed with a dash. For example, a plan that was a two tables read followed each followed by a filter joined together and then aggregated would like this in the old format and the new format:

Old:

pipelines {
  read -> filter -> join
  read2 -> filter2 -> join
  join -> aggregate -> root
}

New Format:

plan {
  aggregate
  join
  -filter
   read
  -filter2
   read2
}

I propose that both formats are supported but only one can be used at a time. When printing, we can request which we want.

Extension space and function references

Use the symbol table to be the source of truth for extension URI spaces and function numbers for both the parser and the converter. Currently the parser assigns the numbers during the parse. The converter already has the correct numbers (from the binary plan) and stores them in the symbol table. There needs to be one cohesive data structure.

add initial CI checks

At a minimum we should have CI checks for conventional commit messages as well as running the unit tests.

rename variant to implementation

The current code refers to "variants" in a number of places, typically when talking about multiple implementations of the same function (e.g. add<int32,int32> and add<int64,int64>). Since the spec calls these "implementations" and the spec also has a different concept that it calls "variant" I think it would be good to update the repo changing all instances of "variant" to "implementation".

Disable warnings about usage of deprecated `args`

The converter is tolerant of older plans. This means it will visit the args variant of scalar function arguments which is deprecated. However, this generates compiler warnings. We should suppress those warnings with the appropriate pragmas.

Stress test binary to text converter using fuzzed plans

The plan specification allows for a myriad of combinations that are difficult to test exhaustively. Fuzzing could illuminate unexpected combinations such as those caused by unexpected return types encountered by the visitors.

How can i use this to transform a sql to plan(or something)?

I read the README.md and did as it said, and i got a dir named build-Debug. There are two exe files in substrait-cpp/build-Debug/debug, but i do not know what should i give as a input file, and what the exe file will export. Can someone tell me how to use this project to convert SQL into a plan? I would greatly appreciate your help

Reduce the need to provide output type to functions

Always providing an output type to functions is cumbersome and reduces readability. Investigate using stronger function definitions (such as equal always returning bool if the inputs are non-null) and intelligent defaults (such as add of two like types always returning the same type if not overridden). Override will still be possible and if they are needed they could raise a warning during parsing.

Add lint to CI

Required:

cmake_format
clang-format
Some kind of spdx licensing header check

Consider:

clang-tidy
iwyu

Vendor int128

We are about to be pulling in abseil, which is a rather large library, so that we can get int128 support. Assuming we aren't expecting to need all of abseil it might be better to vendor int128 which looks to be fairly separable (and has already been done, at least once, here).

Clarify if decimals can have negative scale

Per the spec, they cannot, even if it could make sense in certain interpretations. Let's discuss and decide what seems most appropriate and then update the implementation to conform.

Set up IWYU

It would be convenient to be able to run IWYU similarly to how we can run clang format and clang tidy today.

Add Windows build to CI

I don't think anyone is using Windows yet but we want this to be portable and this will help to ensure we don't stray too far into Linux-only headers.

Require pipelines and relation definitions to be one to one

Currently it is not an error to have a pipeline node to have a relation definition (or the reverse). This should be at least a warning as the pipeline will not be complete (a text plan with just pipeline information will emit a nearly empty plan).

Add library for converting binary and text plans

As part of the proposal to add a new text-based language for representing plans add a library that can be used to convert from one to another. Ultimately this new text-based language will replace the JSON-encoded protobuf used today.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.