feathr-ai / feathr

Feathr – A scalable, unified data and AI engineering platform for enterprise

Home Page: https://join.slack.com/t/feathrai/shared_invite/zt-1ffva5u6v-voq0Us7bbKAw873cEzHOSg

License: Apache License 2.0

Scala 47.32% Makefile 0.01% Python 19.26% Batchfile 0.02% Java 30.22% HTML 0.01% TypeScript 2.80% Less 0.07% CSS 0.02% Dockerfile 0.09% Shell 0.12% JavaScript 0.05%
feature-engineering feature-store artificial-intelligence mlops data-engineering data-quality machine-learning apache-spark azure data-science

feathr's Introduction

A scalable, unified data and AI engineering platform for enterprise

Important Links: Slack & Discussions. Docs.


What is Feathr?

Feathr is a data and AI engineering platform that has been used widely in production at LinkedIn for many years and was open sourced in 2022. It is currently a project under the LF AI & Data Foundation.

Read our announcement on Open Sourcing Feathr and Feathr on Azure, as well as the announcement from LF AI & Data Foundation.

Feathr lets you:

  • Define data and feature transformations based on raw data sources (batch and streaming) using Pythonic APIs.
  • Register transformations by name and retrieve the transformed data (features) for various use cases, including AI modeling, compliance, go-to-market, and more.
  • Share transformations and data (features) across teams and across the company.

Feathr is particularly useful in AI modeling where it automatically computes your feature transformations and joins them to your training data, using point-in-time-correct semantics to avoid data leakage, and supports materializing and deploying your features for use online in production.

🌟 Feathr Highlights

  • Native cloud integration with simplified and scalable architecture.
  • Battle tested in production: LinkedIn has run Feathr in production for over 6 years, backed by a dedicated team.
  • Scalable with built-in optimizations: Feathr can process billions of rows and petabyte-scale data with built-in optimizations such as bloom filters and salted joins.
  • Rich transformation APIs, including time-based aggregations, sliding window joins, and look-up features, all with point-in-time correctness for AI.
  • Pythonic APIs and highly customizable user-defined functions (UDFs) with native PySpark and Spark SQL support to lower the learning curve for all data scientists.
  • Unified data transformation API works in offline batch, streaming, and online environments.
  • Feathr’s built-in registry makes named transformations and data/feature reuse a breeze.

🏃 Getting Started with Feathr - Feathr Sandbox

The easiest way to try Feathr is the Feathr Sandbox, a self-contained container with most of Feathr's capabilities; you should be productive within 5 minutes. To use it, simply run this command:

# 80: Feathr UI, 8888: Jupyter, 7080: Interpret
docker run -it --rm -p 8888:8888 -p 8081:80 -p 7080:7080 -e GRANT_SUDO=yes feathrfeaturestore/feathr-sandbox:releases-v1.0.0

You can then open the Feathr quickstart Jupyter notebook at:

http://localhost:8888/lab/workspaces/auto-w/tree/local_quickstart_notebook.ipynb

After running the notebook, all the features will be registered in the UI, and you can visit the Feathr UI at:

http://localhost:8081

🛠️ Install Feathr Client Locally

If you want to install the Feathr client in a Python environment, use:

pip install feathr

Or use the latest code from GitHub:

pip install git+https://github.com/feathr-ai/feathr.git#subdirectory=feathr_project

☁️ Running Feathr on Cloud for Production

Feathr has native integrations with Databricks and Azure Synapse:

Follow the Feathr ARM deployment guide to run Feathr on Azure. This lets you get started quickly with an automated deployment using an Azure Resource Manager template.

If you want to set everything up manually, check out the Feathr CLI deployment guide to run Feathr on Azure. This helps you understand what is going on and lets you set up one resource at a time.

📓 Documentation

🧪 Samples

Name | Description | Platform
NYC Taxi Demo | Quickstart notebook that showcases how to define, materialize, and register features with NYC taxi-fare prediction sample data. | Azure Synapse, Databricks, Local Spark
Databricks Quickstart NYC Taxi Demo | Quickstart Databricks notebook with NYC taxi-fare prediction sample data. | Databricks
Feature Embedding | Feathr UDF example showing how to define and use feature embedding with a pre-trained Transformer model and hotel review sample data. | Databricks
Fraud Detection Demo | An example to demonstrate Feature Store using multiple data sources such as user account and transaction data. | Azure Synapse, Databricks, Local Spark
Product Recommendation Demo | Feathr Feature Store example notebook with a product recommendation scenario. | Azure Synapse, Databricks, Local Spark

🔡 Feathr Highlighted Capabilities

Please read Feathr Full Capabilities for more examples. Below are a few selected ones:

Feathr UI

Feathr provides an intuitive UI so you can search and explore all the available features and their corresponding lineages.

You can use the Feathr UI to search for features, identify data sources, track feature lineage, and manage access controls. Check out the latest live demo here to see what the Feathr UI can do for you. Use one of the following accounts when you are prompted to log in:

  • A work or school organization account; this includes Office 365 subscribers.
  • A personal Microsoft account; this means an account that can access Skype, Outlook.com, OneDrive, and Xbox LIVE.

Feathr UI

For more information on the Feathr UI and the registry behind it, please refer to Feathr Feature Registry.

Rich UDF Support

Feathr has highly customizable UDFs with native PySpark and Spark SQL integration to lower the learning curve for data scientists:

from pyspark.sql import DataFrame
from pyspark.sql.functions import dayofweek

def add_new_dropoff_and_fare_amount_column(df: DataFrame):
    df = df.withColumn("f_day_of_week", dayofweek("lpep_dropoff_datetime"))
    df = df.withColumn("fare_amount_cents", df.fare_amount.cast('double') * 100)
    return df

from feathr import HdfsSource

batch_source = HdfsSource(name="nycTaxiBatchSource",
                          path="abfss://[email protected]/demo_data/green_tripdata_2020-04.csv",
                          preprocessing=add_new_dropoff_and_fare_amount_column,
                          event_timestamp_column="new_lpep_dropoff_datetime",
                          timestamp_format="yyyy-MM-dd HH:mm:ss")

Defining Window Aggregation Features with Point-in-time correctness

agg_features = [Feature(name="f_location_avg_fare",
                        key=location_id,                          # Query/join key of the feature(group)
                        feature_type=FLOAT,
                        transform=WindowAggTransformation(        # Window Aggregation transformation
                            agg_expr="cast_float(fare_amount)",
                            agg_func="AVG",                       # Apply average aggregation over the window
                            window="90d")),                       # Over a 90-day window
                ]

agg_anchor = FeatureAnchor(name="aggregationFeatures",
                           source=batch_source,
                           features=agg_features)

Define Features on Top of Other Features - Derived Features

# Compute a new feature (a.k.a. derived feature) on top of existing features
derived_feature = DerivedFeature(name="f_trip_time_distance",
                                 feature_type=FLOAT,
                                 key=trip_key,
                                 input_features=[f_trip_distance, f_trip_time_duration],
                                 transform="f_trip_distance * f_trip_time_duration")

# Another example to compute embedding similarity
user_embedding = Feature(name="user_embedding", feature_type=DENSE_VECTOR, key=user_key)
item_embedding = Feature(name="item_embedding", feature_type=DENSE_VECTOR, key=item_key)

user_item_similarity = DerivedFeature(name="user_item_similarity",
                                      feature_type=FLOAT,
                                      key=[user_key, item_key],
                                      input_features=[user_embedding, item_embedding],
                                      transform="cosine_similarity(user_embedding, item_embedding)")

Define Streaming Features

Read the Streaming Source Ingestion Guide for more details.

Point in Time Joins

Read Point-in-time Correctness and Point-in-time Join in Feathr for more details.
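As a rough illustration of how such a point-in-time join is invoked, here is a minimal sketch following the quickstart-style API; client, location_id, and the observation/output paths are assumed from the earlier examples and are placeholders.

from feathr import FeatureQuery, ObservationSettings

# Ask for the window-aggregated feature defined above, keyed by location_id.
feature_query = FeatureQuery(feature_list=["f_location_avg_fare"], key=location_id)

# The observation data and its event timestamp column drive the point-in-time join.
settings = ObservationSettings(
    observation_path="abfss://<container>@<account>.dfs.core.windows.net/demo_data/green_tripdata_2020-04.csv",
    event_timestamp_column="lpep_dropoff_datetime",
    timestamp_format="yyyy-MM-dd HH:mm:ss")

# Each observation row is joined only against feature values computed from data
# available at or before that row's timestamp, which avoids leakage.
client.get_offline_features(observation_settings=settings,
                            feature_query=feature_query,
                            output_path="abfss://<container>@<account>.dfs.core.windows.net/demo_data/output.avro")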

Running Feathr Examples

Follow the quick start Jupyter Notebook to try it out. There is also a companion quick start guide containing a bit more explanation on the notebook.

🗣️ Tech Talks on Feathr

⚙️ Cloud Integrations and Architecture

Architecture Diagram

Feathr component | Cloud Integrations
Offline store – Object Store | Azure Blob Storage, Azure ADLS Gen2, AWS S3
Offline store – SQL | Azure SQL DB, Azure Synapse Dedicated SQL Pools, Azure SQL in VM, Snowflake
Streaming Source | Kafka, EventHub
Online store | Redis, Azure Cosmos DB
Feature Registry and Governance | Azure Purview, ANSI SQL such as Azure SQL Server
Compute Engine | Azure Synapse Spark Pools, Databricks
Machine Learning Platform | Azure Machine Learning, Jupyter Notebook, Databricks Notebook
File Format | Parquet, ORC, Avro, JSON, Delta Lake, CSV
Credentials | Azure Key Vault

🚀 Roadmap

  • More Feathr online client libraries such as Java
  • Support feature versioning
  • Support feature monitoring

👨‍👨‍👦‍👦 Community Guidelines

Built for the community, and built by the community. Check out the Community Guidelines.

📢 Slack Channel

Join our Slack channel for questions and discussions (or click the invitation link).

feathr's People

Contributors

aabbasi-hbo, ahlag, anirudhagar13, atangwbd, blee1234, blrchen, bozhonghu, chinmaytredence, dependabot[bot], donegjookim, dongbumlee, enya-yx, esadler-hbo, fendoe, hangfei, hyingyang-linkedin, jainr, jaymo001, justintyc, lhayhurst, loftiskg, loomlike, rakeshkashyap123, t-curiekim, thurstonchen, windoze, xiaoyongzhu, xiaoyzhuli, yihuiguo, yuqing-cat


feathr's Issues

UDF E2E tests failed

After merging #139, UDF E2E tests failed as the PR changed to use pyspark.cloudpickle instead of inspect.getsource().

The root cause is that cloudpickle, by default, will not pickle modules by value, and PyTest does not run test code in __main__ the way a normal Python execution flow does. As a result, all UDFs defined in the test code belong to a module other than __main__, which leads to a mysterious ModuleNotFound exception when the pickled UDFs are used in the remote PySpark job: the module the pickled function belongs to does not exist on the remote side.

Since version 2.0, cloudpickle resolves this issue with a new register_pickle_by_value(module) function, but unfortunately that function does not exist in the cloudpickle bundled with PySpark, which has stayed at 1.6 for years.

This issue will not happen in typical usage, e.g. in a Python REPL or a Jupyter notebook, because they run the current code in the __main__ module; it only affects PyTest.

One possible solution is to use the new cloudpickle, but that introduces new dependencies which must be installed on the remote Spark cluster manually.

The other solution is to have users package their UDFs into a separate Python file and submit that file along with the Spark job, but this is not compatible with the current test code and would require a major restructuring.

The workaround is pretty straightforward: change the module name to __main__ at the very beginning of the E2E test script. It looks like a hack, but it actually does the job pretty well.
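A minimal sketch of that workaround, assuming the assignment is placed before any UDF definitions in the test script (my_udf is a hypothetical example, not an actual test UDF):

# Pretend this test module is __main__ so functions defined below get
# __module__ == "__main__" and cloudpickle (< 2.0) pickles them by value.
__name__ = "__main__"

def my_udf(df):
    # Hypothetical UDF: cloudpickle now ships its bytecode to the remote PySpark
    # job instead of trying to import the (non-existent) test module there.
    return df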

Consider using Maven based spark job submission

Since Feathr is on Maven now (https://search.maven.org/artifact/com.linkedin.feathr/feathr_2.12), we should consider using Maven as the source for submitting Spark jobs, rather than the public wasb path, which is hard to maintain and doesn't distribute well (no mirrors, etc., and it is slow).

Databricks supports maven based library: https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/libraries#--library

For Synapse, this can be achieved with the spark.jars.packages config. See https://spark.apache.org/docs/3.2.1/configuration.html for more details.
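As a hedged sketch, the job configuration could pull the jar from Maven with Spark's standard group:artifact:version coordinates; the dictionary below is illustrative and <version> is a placeholder:

# Resolve the Feathr jar from Maven at submission time instead of shipping it
# from a public wasb path.
spark_conf = {
    "spark.jars.packages": "com.linkedin.feathr:feathr_2.12:<version>",
    # Optionally point at an explicit repository/mirror if the default is slow.
    # "spark.jars.repositories": "https://repo1.maven.org/maven2",
}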

Get Feature From Registry

Target scenario: get features from the registry (fetched via the Purview SDK for now), then perform transformations and enrichment, turning the Purview objects into lists of FeatureAnchor and DerivedFeature.
With the transformations completed, the return values could be fed into the Feathr client for further operations (like building features).

Should feathr_project/setup.py have `pytest`?

I think the dev guide (docs/dev_guide/python_dev_guide.md) assumes that pytest is installed.

## Integration Test

Run `pytest` in this folder to kick off the integration test. The integration test will test the creation of feature dataset, the materialization to online storage, and retrieve from online storage, as well as the CLI. It usually takes 5 ~ 10 minutes. It needs certain keys for cloud resources.

results in pytest not found.

Adding pytest in the install_requires of feathr_project/setup.py fixed this for me. I can submit a PR if you think pytest should be added to the setup.
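An alternative, sketched below purely for illustration (this is not the actual feathr_project/setup.py), is to expose pytest as a test extra so it is not a hard runtime dependency:

from setuptools import setup, find_packages

setup(
    name="feathr",
    packages=find_packages(),
    install_requires=[],  # runtime dependencies go here (omitted in this sketch)
    # `pip install feathr[test]` then pulls in pytest for running the integration tests.
    extras_require={"test": ["pytest"]},
)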

Cheers

Use Databricks CLI as Python SDK to submit jobs and load files

Currently we are using REST APIs to submit Databricks jobs. This is definitely less than ideal and hard to maintain (there are a lot of hard-coded paths, etc.), and it also causes some trouble when downloading files.

Although there is no official Databricks Python SDK, we can use databricks-cli as a Python SDK since it is written in Python.
https://github.com/databricks/databricks-cli

For example, copying file can be achieved using this API:
https://github.com/databricks/databricks-cli/blob/master/databricks_cli/dbfs/cli.py#L118

This issue is opened to change the implementation to use databricks-cli as a Python SDK to submit jobs, get files, etc., and make the code more elegant.

credit goes to @windoze for this idea
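A rough sketch of the file-upload part using databricks-cli as a library; the class and method names below are assumptions based on the databricks-cli package and should be verified against its source:

from databricks_cli.sdk.api_client import ApiClient
from databricks_cli.dbfs.api import DbfsApi
from databricks_cli.dbfs.dbfs_path import DbfsPath

# Reuse databricks-cli's client instead of hand-rolled REST calls and paths.
api_client = ApiClient(host="https://<workspace-url>", token="<personal-access-token>")
dbfs = DbfsApi(api_client)

# Upload a local file (e.g. the preprocessing UDF script) to DBFS before submitting the job.
dbfs.put_file("./feathr_udf.py", DbfsPath("dbfs:/feathr_jobs/feathr_udf.py"), overwrite=True)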

FeathrClient.register_features() does not raise error when build_features() has not been called.

If I attempt to register features prior to building them in the client, instead of raising the expected RuntimeError("Please call FeathrClient.build_features() first in order to register features"), it fails further down with:

2022-03-23 10:56:08.229 | INFO     | feathr._feature_registry:_register_feathr_feature_types:252 - Feature Type System Initialized.
2022-03-23 10:56:08.231 | INFO     | feathr._feature_registry:_read_config_from_workspace:431 - Reading feature configuration from []
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
c:\Users\tarockey\OneDrive - Microsoft\Documents\source\projects\feathr-test\feathr_user_workspace\new_nyc_notebook.ipynb Cell 26' in <cell line: 1>()
----> 1 client.register_features()

File ~\.conda\envs\feathr\lib\site-packages\feathr\client.py:178, in FeathrClient.register_features(self)
    176 else:
    177     RuntimeError("Please call FeathrClient.build_features() first in order to register features")
--> 178 self.registry.register_features(self.local_workspace_dir)

File ~\.conda\envs\feathr\lib\site-packages\feathr\_feature_registry.py:721, in _FeatureRegistry.register_features(self, workspace_path)
    719 # register feature types each time when we register features.
    720 self._register_feathr_feature_types()
--> 721 self._read_config_from_workspace(workspace_path)
    722 # Upload all entities
    723 # need to be all in one batch to be uploaded, otherwise the GUID reference won't work
    724 results = self.purview_client.upload_entities(
    725     batch=self.entity_batch_queue)

File ~\.conda\envs\feathr\lib\site-packages\feathr\_feature_registry.py:445, in _FeatureRegistry._read_config_from_workspace(self, workspace_path)
    441     feature_join_paths = glob.glob(os.path.join(
    442         workspace_path, "feature_join_conf", '*.conf'))
    443     logger.info("Reading feature join configuration from {}",
    444             feature_join_paths)
--> 445 if len(feature_join_paths) > 0:
    446     feature_join_path = feature_join_paths[0]
    447     self.feathr_feature_join = ConfigFactory.parse_file(feature_join_path)

UnboundLocalError: local variable 'feature_join_paths' referenced before assignment

The RuntimeError is created, but not raised, so it does not stop the code execution.

If build_features() must be called before register_features(), the call from register_features() to save_to_feature_config_from_context() is redundant, as you cannot pass the anchor_list or derived_feature_list to the register_features() method.

The intent seems to be to enable the SDK to dynamically check the context as well as the config files within the working directory. Maybe adding a "from_context" flag to the register_features() method would better allow this?
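Independent of that design question, here is a hedged sketch of the minimal fix for the swallowed error; attribute names such as anchor_list and derived_feature_list are assumed from this issue's description, not taken verbatim from client.py:

def register_features(self):
    # Fail loudly if build_features() never populated the in-memory context,
    # instead of constructing a RuntimeError and silently discarding it.
    if not (getattr(self, "anchor_list", None) or getattr(self, "derived_feature_list", None)):
        raise RuntimeError(
            "Please call FeathrClient.build_features() first in order to register features")
    self.registry.register_features(self.local_workspace_dir)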

Mac developer install: a howto

Hello, I was able to install the project on a Mac with a few tweaks to the python_dev_guide.md.

After making the virtualenv and running python3 -m pip install -e ., I got this error message:

 fatal error: 'librdkafka/rdkafka.h' file not found
    #include <librdkafka/rdkafka.h>
             ^~~~~~~~~~~~~~~~~~~~~~
    1 error generated.
    error: command '/usr/bin/clang' failed with exit code 1

Here's the fix:

  1. Install brew.
  2. Install librdkafka by running the command brew install librdkafka.
  3. Run brew info librdkafka and take note of the library install path (for example, "/opt/homebrew/Cellar/librdkafka/1.8.2/include").
  4. Run export C_INCLUDE_PATH=$LIBRDKAFKA_INCLUDE_PATH, where $LIBRDKAFKA_INCLUDE_PATH is the include path found in step 3.
  5. Run python3 -m pip install -e .

Happy to submit a PR with an update to the python_dev_guide.md if you think this info would be helpful to others; otherwise, I can close this ticket! Cheers

Add validation for feature name

Feature names should follow some naming conventions and rules; otherwise they will break certain underlying engines, such as Spark, or the storage layer.
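As an illustration only (the exact rules would need to match what Spark and each storage backend accept), a hypothetical validation helper could look like this:

import re

# Hypothetical rule set: start with a letter or underscore, then letters, digits,
# or underscores, which avoids characters that break Spark column names and most stores.
_VALID_FEATURE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def validate_feature_name(name: str) -> None:
    if not _VALID_FEATURE_NAME.match(name):
        raise ValueError(
            f"Invalid feature name {name!r}: use letters, digits, and underscores, "
            "and do not start with a digit.")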

Feathr Web UI MVP

We need a web UI as an enhancement to the CLI for feature registry, search, feature lineage, etc.

Following should be included for MVP prototype:

  • Setup UX framework: site global style, layout, menu, header, etc.
  • Account login
  • Data sources listing
  • Feature listing and search
  • Feature lineage flow chart
  • Feature/Project assignment

Add developer related docs

We should have developer focused docs for:

  • CI pipeline
  • integration with Azure resources
  • Feathr internals

User guide review for fresh eyes

We just added quite a few wikis and tutorials. It would be good to have some new users who have never used Feathr try them out.
Have new users review the user guide so we know whether it is easy to understand and onboard with (from easy to hard).

  • Read the documentation to see if it is easy enough to understand.
  • If something is not well explained or is missing, try to fix it or raise an issue with us.

Here are the documentations:
https://github.com/linkedin/feathr/blob/main/docs/concepts/feathr-concepts-for-beginners.md

https://github.com/linkedin/feathr/blob/main/docs/concepts/feathr-capabilities.md

https://github.com/linkedin/feathr/blob/main/docs/concepts/feature-definition.md

https://github.com/linkedin/feathr/blob/main/docs/concepts/feature-generation.md

https://github.com/linkedin/feathr/blob/main/docs/concepts/feature-join.md

https://github.com/linkedin/feathr/blob/main/docs/concepts/point-in-time-join.md

CI test does not work as expected

There are a couple of issues:

  1. After @jainr's PR, the CI code in the main branch will not run.
  2. For forked repositories, the test will only run once and won't run after further changes (reported by @windoze).
  3. More comments below from other folks.

Add more tutorials & cases for UDFs

Currently Feathr supports flexible UDFs for PySpark, Spark SQL, and pandas-on-Spark. We definitely need more help with samples. Some tested ones are below:

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import dayofweek
from feathr import HdfsSource

def add_new_dropoff_and_fare_amount_column(df: DataFrame):
    df = df.withColumn("f_day_of_week", dayofweek("lpep_dropoff_datetime"))
    df = df.withColumn("fare_amount_cents", df.fare_amount.cast('double') * 100)
    return df


def feathr_udf_filter_location_id(df: DataFrame) -> DataFrame:
  # If using Spark SQL, declare the default spark session and create a temp view so that you can run Spark SQL on it.
  global spark
  df.createOrReplaceTempView("feathr_temp_table_feathr_udf_day_calc")
  sqlDF = spark.sql(
  """
  SELECT *
  FROM feathr_temp_table_feathr_udf_day_calc
  WHERE DOLocationID != 100
  """
  )
  return sqlDF

def feathr_udf_pandas_spark(df: DataFrame) -> DataFrame:
  # Using pandas-on-Spark APIs. For more details, refer to the doc here: https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html
  # Note that this API is only available in Spark 3.2 and later, so make sure you submit to a Spark cluster running Spark 3.2 or later.
  psdf = df.to_pandas_on_spark()
  psdf['fare_amount_cents'] = psdf['fare_amount'] * 100
  # Make sure to convert the pandas-on-Spark dataframe back to a Spark DataFrame.
  return psdf.to_spark()

batch_source = HdfsSource(name="nycTaxiBatchSource",
                          path="abfss://[email protected]/demo_data/green_tripdata_2020-04.csv",
                          preprocessing=add_new_dropoff_and_fare_amount_column,
                          event_timestamp_column="new_lpep_dropoff_datetime",
                          timestamp_format="yyyy-MM-dd HH:mm:ss")

Synapse notebook support

Currently, Synapse notebooks are not supported, since the current assumption is that users run feathr init in a shell environment, which Synapse notebooks do not support.

To support Synapse notebooks, I suggest adding an API in the FeathrClient class that can initialize a Feathr workspace from the Python API.

Implementation-wise, this would mean adding an API in this file (https://github.com/linkedin/feathr/blob/main/feathr_project/feathr/client.py) that references some of the CLI functions (https://github.com/linkedin/feathr/blob/main/feathr_project/feathrcli/cli.py)
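A hedged sketch of what the proposed notebook flow could look like; FeathrClient(config_path=...) follows the existing client, while init_workspace is the suggested new API and does not exist yet:

from feathr import FeathrClient

# Today: the client is constructed from a YAML config file.
client = FeathrClient(config_path="./feathr_config.yaml")

# Proposed: bootstrap the workspace layout from Python, mirroring `feathr init`,
# so Synapse notebooks (which have no shell) can do it too.
client.init_workspace(workspace_dir="./feathr_user_workspace")  # proposed API, not implemented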

FeathrClient.register_features() only registers features from first auto_generated_*.conf in feature_conf folder

When calling FeathrClient.register_features(), multiple auto_generated_*.conf files are created; however, in _feature_registry.py, only the first config file is used to identify features. The same is true for the generation paths and feature join paths.

When using build_features(), the method _save_to_feature_config(), or _save_to_feature_config_from_context() is called, both of which create 3 config files within one folder:
https://github.com/linkedin/feathr/blob/73ae7621101ee1ee7e2cee60eb55cab925e89a18/feathr_project/feathr/_feature_registry.py#L638-L651

In _feature_registry.register_features, _read_config_from_workspace is only called once:
https://github.com/linkedin/feathr/blob/73ae7621101ee1ee7e2cee60eb55cab925e89a18/feathr_project/feathr/_feature_registry.py#L712-L721

However, in that method, only the first config file found is used.
https://github.com/linkedin/feathr/blob/73ae7621101ee1ee7e2cee60eb55cab925e89a18/feathr_project/feathr/_feature_registry.py#L429-L439

Then, only the anchors, sources, and derivations for that one config file are referenced:
https://github.com/linkedin/feathr/blob/73ae7621101ee1ee7e2cee60eb55cab925e89a18/feathr_project/feathr/_feature_registry.py#L463-L477
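A hedged sketch of the direction of a fix, reading every generated config instead of only the first glob match; the helper name and folder layout here are illustrative, not the actual _feature_registry.py code:

import glob
import os

def read_all_feature_configs(workspace_path: str) -> list:
    # Collect every auto-generated config rather than only feature_conf_paths[0];
    # the caller would then parse and merge anchors, sources, and derivations from each file.
    return sorted(glob.glob(os.path.join(workspace_path, "feature_conf", "*.conf")))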

Implement Basic RBAC Roles (Admin, Producer, Consumer)

General Design

  • Leverage AAD in the login logic
  • A storage layer is needed for the user/role mapping
  • RBAC info will be retrieved and stored in "user.profile"
  • RBAC will be a Python API protection extension; the API will act differently based on the user profile.
  • Consider an API behavior as an "Action", e.g. list_features(filter: tags, owners, etc.). An Action list can be stored as a "Permission".
  • RBAC-specific APIs
    • admin_management
      • add_user: add user info to the user map with a given / default role
      • add_role: add a new kind of role with a role definition
      • assign_user: assign a user to certain roles
      • add_permission: a set of Actions
      • grant: grant a permission to a role
    • access_management
      • get_user_profile: user login and retrieval of user profile info
      • get_access: list all the permissions allowed for a user
      • check_access: check user access to a certain action / permission (action list)
      • is_user_in_role: whether a user is in a role
      • get_role: get all roles available in the current project
    • review_management
      • get_access_log: return all the access changes for audit (nice to have for MVP)
  • The above APIs should support add, delete, update, list, and other necessary variants.

Azure Resources

Microsoft Azure includes standard and built-in RBAC, which is an authorization system built on Azure Resource Manager that provides detailed access management to Azure resources.

Role Definition

Multi-layer roles are not shown in the definition below.

{
    "Roles":[
        {
            "id":0,
            "name": "Admin",
            "description" : "",
            "permissions": ["admin_management", "access_management", "review_management", "registry_apis", "spark_apis"],
            "AssignableScopes":["project","anchor"]
        },
        {
            "id":1,
            "name":"Producer",
            "description": "",
            "permissions": ["admin_management.add_permission/grant","access_management","registry_apis","spark_apis"],
            "AssignableScopes":["project","anchor"]
        },
        {
            "id":2,
            "name":"Consumer",
            "description":"",
            "permissions":["access_management","registry_apis","spark_apis"],
            "AssignableScopes":["project","anchor"]
        },
        {
            "id":3,
            "name":"Monitoring",
            "description":"",
            "permissions":["review_management", "log_apis","spark_apis"],
            "AssignableScopes":["project"]
        }
    ]
}

Dynamic versioning for feathr

Right now the Feathr version is hardcoded. There are plugins such as https://github.com/sbt/sbt-dynver or https://github.com/sbt/sbt-git that would let us update the version automatically.

Things to be aware of are any configurations that depend on our hardcoded versions, such as the feathr_config.yaml file, which references the 0.1.0 jar.

A potential option is to auto-generate this feathr_config.yaml as part of the assembly, so that it always references the latest version, or to support a dynamic version reference in the config.

Dynamic versioning would also be beneficial for publishing to Maven, as Maven requires unique versions for releases.

A Simple way to validate Data sources connectivity in Spark Job

As more and more data sources are involved, the current E2E test does not scale for pure connectivity testing.
A separate Spark job that simply tries to load all registered data sources into DataFrames (see the sketch after this list) can help to:

  • Enhance the test and engineering experience
    • Scalable checks that each supported data format works
    • Validate data connectivity for customized data sources
  • Further usage in the Feature Store UI / Data Platform*
    • Data source health telemetry (daily job)
    • Data visualization sampling / distribution

To achieve this, we may need to have:

  • A list of sample data sources which covers every supported data format (can be a config file synced to the sample feature registry) with
    • data path
    • credential pointer (to a centralized credential storage, e.g. Key Vault)
    • rules* (customized rules to make sure data source meets requirements)
  • Credential Storage
    • key or token
    • access type*: e.g. admin / read / write...
    • access level*: e.g. single file / folder; table / storage...
  • DataSourceCheckJob(ss, Seq[dataSourceDef])

"*" : means nice to have & low priority
