Coder Social home page Coder Social logo

snowflake-semistructured-functions's Introduction

Snowflake Semi-Structred Data Functions

Implementation of filter_array and transform_array for Snowflake.

Operations are defined using expr_eval - see project for full list of operators and functions.

Motivation

While we can filter and transform arrays using FLATTEN and GROUP BY, it is complicated, and hard to both write and understand.

For instance, let's assume t1 is a table, and it has a column named number_array. If we'd like to filter it and keep only positive numbers, here is how it would look:

select
    array_agg(value) as filtered
from t1, lateral flatten(input => t1.number_array)
where value > 0
GROUP BY seq

Not only is it hard to understand, it would also cause a bug where rows with no positive numbers would disappear (instead of returning an empty array).

Here is how the same thing can be achieved now:

select filter_array('x > 0', t1.number_array) AS filtered
from t1

Usage

The files define two functions, with overloaded implementations:

  • filter_array(condition, array [, additional_context])

  • transform_array(transformation, array [, additional_context])

condition / transformation is a VARCHAR (string) defining the function to run over elements ("Lambda function"). It must be the same for all rows, as it is parsed only once. The array element is given as x, and additional variables can be defined using the optional parameter.

array is an ARRAY/VARIANT with the elements to be transformed/filtered.

additional_context (or kwargs) is an optional parameter. It is of type OBJECT/VARIANT, and defines additional variables to be used. Its usage is demonstrated in the example below. Unlike the transformation, which has to be constant for all rows, this can vary by row.

Full Example

-- Table definition
WITH t1 AS (
    select PARSE_JSON(
    $$
    [
        {"int_value": 5, "should_keep": false},
        {"int_value": 6, "should_keep": true}
    ]
    $$
    ) AS MY_COL,
    10 AS VALUE_TO_ADD
)

-- Usage
select 
    MY_COL, 
    VALUE_TO_ADD,
    transform_array('x.int_value + y', MY_COL, {'y': VALUE_TO_ADD}) AS TRANSFORMED_COL,
    filter_array('x.should_keep', MY_COL) AS FILTERED_COL
from t1
MY_COL VALUE_TO_ADD TRANSFORMED_COL FILTERED_COL
[ {"int_value": 5, "should_keep": false}, {"int_value": 6, "should_keep": true} ] 10 [ 15, 16 ] [ {"int_value": 6, "should_keep": true } ]

Additional functions

In addition to the functions defined in expr_eval, these functions have been implemented:

  • object - Creates an object, similar to OBJECT_CONSTRUCT in snowflake (e.g. transform_array('object("name1", x.arg1, "name2", x.arg2)')).
  • sorted - Sorts an array.

Implementation details

Other platforms, such as Spark and Presto/Trino, allow defining lambda functions for creating the transformations. In order to simulate this ability under Snowflake, we have turned to Javascript. However, Snowflake disallows running eval in UDFs (and rightfully so!), which means we can't define lambda functions dynamically.

For this reason, we have turned to Matthew Crumley's library, expr_eval. It allows us to parse a subset of Javascript operations, and execute them - without dynamically executing arbitrary code.

While currently only filter_array and transform_array are implemented, the underlying mechanism of parsing and executing operations in JS might prove useful in additional cases. It is easy to extract the function parsing the operation and reusing it in other UDFs.

Security considerations

As stated, these functions avoid the use of eval(). While they indeed provide a safer alternative to it, they cannot guarentee the safety of running untrusted user input, and in particular untrusted function statements. Evaluation of arbitrary user expressions is not recommended.

snowflake-semistructured-functions's People

Contributors

orrdermer avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.