feathr-ai / feathr

Feathr – A scalable, unified data and AI engineering platform for enterprise

Home Page: https://join.slack.com/t/feathrai/shared_invite/zt-1ffva5u6v-voq0Us7bbKAw873cEzHOSg

License: Apache License 2.0

Scala 47.32% Makefile 0.01% Python 19.26% Batchfile 0.02% Java 30.22% HTML 0.01% TypeScript 2.80% Less 0.07% CSS 0.02% Dockerfile 0.09% Shell 0.12% JavaScript 0.05%
feature-engineering feature-store artificial-intelligence mlops data-engineering data-quality machine-learning apache-spark azure data-science

feathr's Introduction

A scalable, unified data and AI engineering platform for enterprise

Important Links: Slack & Discussions. Docs.


What is Feathr?

Feathr is a data and AI engineering platform that has been used widely in production at LinkedIn for many years and was open sourced in 2022. It is currently a project under the LF AI & Data Foundation.

Read our announcement on Open Sourcing Feathr and Feathr on Azure, as well as the announcement from LF AI & Data Foundation.

Feathr lets you:

  • Define data and feature transformations based on raw data sources (batch and streaming) using Pythonic APIs.
  • Register transformations by name and retrieve the transformed data (features) for various use cases, including AI modeling, compliance, go-to-market, and more.
  • Share transformations and data (features) across teams and across the company.

Feathr is particularly useful in AI modeling where it automatically computes your feature transformations and joins them to your training data, using point-in-time-correct semantics to avoid data leakage, and supports materializing and deploying your features for use online in production.

🌟 Feathr Highlights

  • Native cloud integration with simplified and scalable architecture.
  • Battle tested in production: LinkedIn has run Feathr in production for over 6 years, backed by a dedicated team.
  • Scalable with built-in optimizations: Feathr can process billions of rows and petabyte-scale data with built-in optimizations such as bloom filters and salted joins.
  • Rich transformation APIs, including time-based aggregations, sliding window joins, and look-up features, all with point-in-time correctness for AI.
  • Pythonic APIs and highly customizable user-defined functions (UDFs) with native PySpark and Spark SQL support to lower the learning curve for all data scientists.
  • Unified data transformation API works in offline batch, streaming, and online environments.
  • Feathr’s built-in registry makes named transformations and data/feature reuse a breeze.

🏃 Getting Started with Feathr - Feathr Sandbox

The easiest way to try Feathr is the Feathr Sandbox, a self-contained container with most of Feathr's capabilities; you should be productive within 5 minutes. To use it, simply run this command:

# 80: Feathr UI, 8888: Jupyter, 7080: Interpret
docker run -it --rm -p 8888:8888 -p 8081:80 -p 7080:7080 -e GRANT_SUDO=yes feathrfeaturestore/feathr-sandbox:releases-v1.0.0

You can then open the Feathr quickstart Jupyter notebook at:

http://localhost:8888/lab/workspaces/auto-w/tree/local_quickstart_notebook.ipynb

After running the notebook, all the features will be registered in the UI, and you can visit the Feathr UI at:

http://localhost:8081

🛠️ Install Feathr Client Locally

If you want to install the Feathr client in a Python environment, use:

pip install feathr

Or use the latest code from GitHub:

pip install git+https://github.com/feathr-ai/feathr.git#subdirectory=feathr_project

☁️ Running Feathr on Cloud for Production

Feathr has native integrations with Databricks and Azure Synapse:

Follow the Feathr ARM deployment guide to run Feathr on Azure. This lets you get started quickly with an automated deployment using an Azure Resource Manager template.

If you want to set everything up manually, check out the Feathr CLI deployment guide to run Feathr on Azure. This helps you understand what is going on and lets you set up one resource at a time.

📓 Documentation

🧪 Samples

Name | Description | Platform
NYC Taxi Demo | Quickstart notebook that showcases how to define, materialize, and register features with NYC taxi-fare prediction sample data. | Azure Synapse, Databricks, Local Spark
Databricks Quickstart NYC Taxi Demo | Quickstart Databricks notebook with NYC taxi-fare prediction sample data. | Databricks
Feature Embedding | Feathr UDF example showing how to define and use feature embedding with a pre-trained Transformer model and hotel review sample data. | Databricks
Fraud Detection Demo | An example to demonstrate Feature Store using multiple data sources such as user account and transaction data. | Azure Synapse, Databricks, Local Spark
Product Recommendation Demo | Feathr Feature Store example notebook with a product recommendation scenario. | Azure Synapse, Databricks, Local Spark

🔡 Feathr Highlighted Capabilities

Please read Feathr Full Capabilities for more examples. Below are a few selected ones:

Feathr UI

Feathr provides an intuitive UI so you can search and explore all the available features and their corresponding lineages.

You can use the Feathr UI to search for features, identify data sources, track feature lineage, and manage access controls. Check out the latest live demo here to see what the Feathr UI can do for you. Use one of the following accounts when you are prompted to log in:

  • A work or school organization account; this includes Office 365 subscribers.
  • A personal Microsoft account; this means an account that can access Skype, Outlook.com, OneDrive, and Xbox LIVE.

Feathr UI

For more information on the Feathr UI and the registry behind it, please refer to Feathr Feature Registry.

Rich UDF Support

Feathr has highly customizable UDFs with native PySpark and Spark SQL integration to lower the learning curve for data scientists:

from pyspark.sql import DataFrame
from pyspark.sql.functions import dayofweek

def add_new_dropoff_and_fare_amount_column(df: DataFrame):
    df = df.withColumn("f_day_of_week", dayofweek("lpep_dropoff_datetime"))
    df = df.withColumn("fare_amount_cents", df.fare_amount.cast('double') * 100)
    return df

from feathr import HdfsSource

batch_source = HdfsSource(name="nycTaxiBatchSource",
                          path="abfss://[email protected]/demo_data/green_tripdata_2020-04.csv",
                          preprocessing=add_new_dropoff_and_fare_amount_column,
                          event_timestamp_column="new_lpep_dropoff_datetime",
                          timestamp_format="yyyy-MM-dd HH:mm:ss")

Defining Window Aggregation Features with Point-in-time correctness

agg_features = [Feature(name="f_location_avg_fare",
                        key=location_id,                          # Query/join key of the feature(group)
                        feature_type=FLOAT,
                        transform=WindowAggTransformation(        # Window Aggregation transformation
                            agg_expr="cast_float(fare_amount)",
                            agg_func="AVG",                       # Apply average aggregation over the window
                            window="90d")),                       # Over a 90-day window
                ]

agg_anchor = FeatureAnchor(name="aggregationFeatures",
                           source=batch_source,
                           features=agg_features)

Define Features on Top of Other Features - Derived Features

# Compute a new feature (a.k.a. derived feature) on top of existing features
derived_feature = DerivedFeature(name="f_trip_time_distance",
                                 feature_type=FLOAT,
                                 key=trip_key,
                                 input_features=[f_trip_distance, f_trip_time_duration],
                                 transform="f_trip_distance * f_trip_time_duration")

# Another example to compute embedding similarity
user_embedding = Feature(name="user_embedding", feature_type=DENSE_VECTOR, key=user_key)
item_embedding = Feature(name="item_embedding", feature_type=DENSE_VECTOR, key=item_key)

user_item_similarity = DerivedFeature(name="user_item_similarity",
                                      feature_type=FLOAT,
                                      key=[user_key, item_key],
                                      input_features=[user_embedding, item_embedding],
                                      transform="cosine_similarity(user_embedding, item_embedding)")

Define Streaming Features

Read the Streaming Source Ingestion Guide for more details.

Point in Time Joins

Read Point-in-time Correctness and Point-in-time Join in Feathr for more details.
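As a rough illustration of how such a point-in-time join is invoked, here is a minimal sketch following the quickstart-style API; client, location_id, and the observation/output paths are assumed from the earlier examples and are placeholders.

from feathr import FeatureQuery, ObservationSettings

# Ask for the window-aggregated feature defined above, keyed by location_id.
feature_query = FeatureQuery(feature_list=["f_location_avg_fare"], key=location_id)

# The observation data and its event timestamp column drive the point-in-time join.
settings = ObservationSettings(
    observation_path="abfss://<container>@<account>.dfs.core.windows.net/demo_data/green_tripdata_2020-04.csv",
    event_timestamp_column="lpep_dropoff_datetime",
    timestamp_format="yyyy-MM-dd HH:mm:ss")

# Each observation row is joined only against feature values computed from data
# available at or before that row's timestamp, which avoids leakage.
client.get_offline_features(observation_settings=settings,
                            feature_query=feature_query,
                            output_path="abfss://<container>@<account>.dfs.core.windows.net/demo_data/output.avro")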

Running Feathr Examples

Follow the quick start Jupyter Notebook to try it out. There is also a companion quick start guide containing a bit more explanation on the notebook.

🗣️ Tech Talks on Feathr

⚙️ Cloud Integrations and Architecture

Architecture Diagram

Feathr component | Cloud Integrations
Offline store – Object Store | Azure Blob Storage, Azure ADLS Gen2, AWS S3
Offline store – SQL | Azure SQL DB, Azure Synapse Dedicated SQL Pools, Azure SQL in VM, Snowflake
Streaming Source | Kafka, EventHub
Online store | Redis, Azure Cosmos DB
Feature Registry and Governance | Azure Purview, ANSI SQL such as Azure SQL Server
Compute Engine | Azure Synapse Spark Pools, Databricks
Machine Learning Platform | Azure Machine Learning, Jupyter Notebook, Databricks Notebook
File Format | Parquet, ORC, Avro, JSON, Delta Lake, CSV
Credentials | Azure Key Vault

🚀 Roadmap

  • More Feathr online client libraries such as Java
  • Support feature versioning
  • Support feature monitoring

👨‍👨‍👦‍👦 Community Guidelines

Built for the community, and built by the community. Check out the Community Guidelines.

📢 Slack Channel

Join our Slack channel for questions and discussions (or click the invitation link).

feathr's People

Contributors

aabbasi-hbo, ahlag, anirudhagar13, atangwbd, blee1234, blrchen, bozhonghu, chinmaytredence, dependabot[bot], donegjookim, dongbumlee, enya-yx, esadler-hbo, fendoe, hangfei, hyingyang-linkedin, jainr, jaymo001, justintyc, lhayhurst, loftiskg, loomlike, rakeshkashyap123, t-curiekim, thurstonchen, windoze, xiaoyongzhu, xiaoyzhuli, yihuiguo, yuqing-cat


feathr's Issues

UDF E2E tests failed

After merging #139, UDF E2E tests failed as the PR changed to use pyspark.cloudpickle instead of inspect.getsource().

The root cause is that cloudpickle, by default, will not pickle modules by value, and PyTest does not run test code in __main__ the way a normal Python execution flow does. As a result, all UDFs defined in the test code belong to a module other than __main__, which leads to a mysterious ModuleNotFound exception when the pickled UDFs are used in the remote PySpark job: the module the pickled function belongs to does not exist on the remote side.

Since version 2.0, cloudpickle resolves this issue with a new register_pickle_by_value(module) function, but unfortunately that function does not exist in the cloudpickle bundled with PySpark, which has stayed at 1.6 for years.

This issue will not happen in typical usage, e.g. in a Python REPL or a Jupyter notebook, because they run the current code in the __main__ module; it only affects PyTest.

One possible solution is to use the new cloudpickle, but that introduces new dependencies which must be installed on the remote Spark cluster manually.

The other solution is to have users package their UDFs into a separate Python file and submit that file along with the Spark job, but this is not compatible with the current test code and would require a major restructuring.

The workaround is pretty straightforward: change the module name to __main__ at the very beginning of the E2E test script. It looks like a hack, but it actually does the job pretty well.
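A minimal sketch of that workaround, assuming the assignment is placed before any UDF definitions in the test script (my_udf is a hypothetical example, not an actual test UDF):

# Pretend this test module is __main__ so functions defined below get
# __module__ == "__main__" and cloudpickle (< 2.0) pickles them by value.
__name__ = "__main__"

def my_udf(df):
    # Hypothetical UDF: cloudpickle now ships its bytecode to the remote PySpark
    # job instead of trying to import the (non-existent) test module there.
    return df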

Consider using Maven based spark job submission

Since Feathr is on Maven now (https://search.maven.org/artifact/com.linkedin.feathr/feathr_2.12), we should consider using Maven as the source for submitting Spark jobs, rather than the public wasb path, which is hard to maintain and doesn't distribute well (no mirrors, etc., and it is slow).

Databricks supports maven based library: https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/libraries#--library

For Synapse, this can be achieved with the spark.jars.packages config. See https://spark.apache.org/docs/3.2.1/configuration.html for more details.
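As a hedged sketch, the job configuration could pull the jar from Maven with Spark's standard group:artifact:version coordinates; the dictionary below is illustrative and <version> is a placeholder:

# Resolve the Feathr jar from Maven at submission time instead of shipping it
# from a public wasb path.
spark_conf = {
    "spark.jars.packages": "com.linkedin.feathr:feathr_2.12:<version>",
    # Optionally point at an explicit repository/mirror if the default is slow.
    # "spark.jars.repositories": "https://repo1.maven.org/maven2",
}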

Get Feature From Registry

Target scenario: get features from the registry (fetched via the Purview SDK for now), then perform transformations and enrichment, turning the Purview objects into lists of FeatureAnchor and DerivedFeature.
With the transformations completed, the return values could be fed into the Feathr client for further operations (like building features).

Should feathr_project/setup.py have `pytest`?

I think the dev guide (docs/dev_guide/python_dev_guide.md) assumes that pytest is installed.

## Integration Test

Run `pytest` in this folder to kick off the integration test. The integration test will test the creation of feature dataset, the materialization to online storage, and retrieve from online storage, as well as the CLI. It usually takes 5 ~ 10 minutes. It needs certain keys for cloud resources.

results in pytest not found.

Adding pytest in the install_requires of feathr_project/setup.py fixed this for me. I can submit a PR if you think pytest should be added to the setup.
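An alternative, sketched below purely for illustration (this is not the actual feathr_project/setup.py), is to expose pytest as a test extra so it is not a hard runtime dependency:

from setuptools import setup, find_packages

setup(
    name="feathr",
    packages=find_packages(),
    install_requires=[],  # runtime dependencies go here (omitted in this sketch)
    # `pip install feathr[test]` then pulls in pytest for running the integration tests.
    extras_require={"test": ["pytest"]},
)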

Cheers

Use Databricks CLI as Python SDK to submit jobs and load files

Currently we are using REST APIs to submit Databricks jobs. This is definitely less than ideal and hard to maintain (there are a lot of hard-coded paths, etc.), and it also causes some trouble when downloading files.

Although there is no official Databricks Python SDK, we can use databricks-cli as a Python SDK since it is written in Python.
https://github.com/databricks/databricks-cli

For example, copying file can be achieved using this API:
https://github.com/databricks/databricks-cli/blob/master/databricks_cli/dbfs/cli.py#L118

This issue is opened to change the implementation to use databricks-cli as a Python SDK to submit jobs, get files, etc., and make the code more elegant.

credit goes to @windoze for this idea
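A rough sketch of the file-upload part using databricks-cli as a library; the class and method names below are assumptions based on the databricks-cli package and should be verified against its source:

from databricks_cli.sdk.api_client import ApiClient
from databricks_cli.dbfs.api import DbfsApi
from databricks_cli.dbfs.dbfs_path import DbfsPath

# Reuse databricks-cli's client instead of hand-rolled REST calls and paths.
api_client = ApiClient(host="https://<workspace-url>", token="<personal-access-token>")
dbfs = DbfsApi(api_client)

# Upload a local file (e.g. the preprocessing UDF script) to DBFS before submitting the job.
dbfs.put_file("./feathr_udf.py", DbfsPath("dbfs:/feathr_jobs/feathr_udf.py"), overwrite=True)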

FeathrClient.register_features() does not raise error when build_features() has not been called.

If I attempt to register features prior to building them in the client, instead of raising the expected RuntimeError("Please call FeathrClient.build_features() first in order to register features"), it fails further down with:

2022-03-23 10:56:08.229 | INFO     | feathr._feature_registry:_register_feathr_feature_types:252 - Feature Type System Initialized.
2022-03-23 10:56:08.231 | INFO     | feathr._feature_registry:_read_config_from_workspace:431 - Reading feature configuration from []
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
c:\Users\tarockey\OneDrive - Microsoft\Documents\source\projects\feathr-test\feathr_user_workspace\new_nyc_notebook.ipynb Cell 26' in <cell line: 1>()
----> 1 client.register_features()

File ~\.conda\envs\feathr\lib\site-packages\feathr\client.py:178, in FeathrClient.register_features(self)
    176 else:
    177     RuntimeError("Please call FeathrClient.build_features() first in order to register features")
--> 178 self.registry.register_features(self.local_workspace_dir)

File ~\.conda\envs\feathr\lib\site-packages\feathr\_feature_registry.py:721, in _FeatureRegistry.register_features(self, workspace_path)
    719 # register feature types each time when we register features.
    720 self._register_feathr_feature_types()
--> 721 self._read_config_from_workspace(workspace_path)
    722 # Upload all entities
    723 # need to be all in one batch to be uploaded, otherwise the GUID reference won't work
    724 results = self.purview_client.upload_entities(
    725     batch=self.entity_batch_queue)

File ~\.conda\envs\feathr\lib\site-packages\feathr\_feature_registry.py:445, in _FeatureRegistry._read_config_from_workspace(self, workspace_path)
    441     feature_join_paths = glob.glob(os.path.join(
    442         workspace_path, "feature_join_conf", '*.conf'))
    443     logger.info("Reading feature join configuration from {}",
    444             feature_join_paths)
--> 445 if len(feature_join_paths) > 0:
    446     feature_join_path = feature_join_paths[0]
    447     self.feathr_feature_join = ConfigFactory.parse_file(feature_join_path)

UnboundLocalError: local variable 'feature_join_paths' referenced before assignment

The RuntimeError is created, but not raised, so it does not stop the code execution.

If build_features() must be called before register_features(), the call from register_features() to save_to_feature_config_from_context() is redundant, as you cannot pass the anchor_list or derived_feature_list to the register_features() method.

The intent seems to be to enable the SDK to dynamically check the context as well as the config files within the working directory. Maybe adding a "from_context" flag to the register_features() method would better allow this?
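Independent of that design question, here is a hedged sketch of the minimal fix for the swallowed error; attribute names such as anchor_list and derived_feature_list are assumed from this issue's description, not taken verbatim from client.py:

def register_features(self):
    # Fail loudly if build_features() never populated the in-memory context,
    # instead of constructing a RuntimeError and silently discarding it.
    if not (getattr(self, "anchor_list", None) or getattr(self, "derived_feature_list", None)):
        raise RuntimeError(
            "Please call FeathrClient.build_features() first in order to register features")
    self.registry.register_features(self.local_workspace_dir)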

Mac developer install: a howto

Hello, I was able to install the project on a Mac with a few tweaks to the python_dev_guide.md.

After making the virtualenv and running python3 -m pip install -e ., I got this error message:

 fatal error: 'librdkafka/rdkafka.h' file not found
    #include <librdkafka/rdkafka.h>
             ^~~~~~~~~~~~~~~~~~~~~~
    1 error generated.
    error: command '/usr/bin/clang' failed with exit code 1

Here's the fix:

  1. Install brew.
  2. Install librdkafka by running the command brew install librdkafka.
  3. Run brew info librdkafka and take note of the library install path (for example, "/opt/homebrew/Cellar/librdkafka/1.8.2/include").
  4. Run export C_INCLUDE_PATH=$LIBRDKAFKA_INCLUDE_PATH, where $LIBRDKAFKA_INCLUDE_PATH is the include path found in step 3.
  5. Run python3 -m pip install -e .

Happy to submit a PR with an update to the python_dev_guide.md if you think this info would be helpful to others; otherwise, I can close this ticket! Cheers

Add validation for feature name

Feature names should follow some naming conventions and rules; otherwise they will break certain underlying engines, such as Spark, or the storage layer.
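As an illustration only (the exact rules would need to match what Spark and each storage backend accept), a hypothetical validation helper could look like this:

import re

# Hypothetical rule set: start with a letter or underscore, then letters, digits,
# or underscores, which avoids characters that break Spark column names and most stores.
_VALID_FEATURE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def validate_feature_name(name: str) -> None:
    if not _VALID_FEATURE_NAME.match(name):
        raise ValueError(
            f"Invalid feature name {name!r}: use letters, digits, and underscores, "
            "and do not start with a digit.")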

Feathr Web UI MVP

We need a web UI as an enhancement to the CLI for feature registry, search, feature lineage, etc.

Following should be included for MVP prototype:

  • Setup UX framework: site global style, layout, menu, header, etc.
  • Account login
  • Data sources listing
  • Feature listing and search
  • Feature lineage flow chart
  • Feature/Project assignment

Add developer related docs

We should have developer focused docs for:

  • CI pipeline
  • integration with Azure resources
  • Feathr internals

User guide review for fresh eyes

We just added quite a few wikis and tutorials. It would be good to have some new users who have never used Feathr try them out.
Have new users review the user guide so we know whether it is easy to understand and onboard with (from easy to hard).

  • Read the documentation to see if it is easy enough to understand.
  • If something is not well explained or is missing, try to fix it or raise an issue with us.

Here are the documentations:
https://github.com/linkedin/feathr/blob/main/docs/concepts/feathr-concepts-for-beginners.md

https://github.com/linkedin/feathr/blob/main/docs/concepts/feathr-capabilities.md

https://github.com/linkedin/feathr/blob/main/docs/concepts/feature-definition.md

https://github.com/linkedin/feathr/blob/main/docs/concepts/feature-generation.md

https://github.com/linkedin/feathr/blob/main/docs/concepts/feature-join.md

https://github.com/linkedin/feathr/blob/main/docs/concepts/point-in-time-join.md

CI test does not work as expected

There are a couple of issues:

  1. After @jainr's PR, the CI code in the main branch will not run.
  2. For forked repositories, the test will only run once and won't run after further changes (reported by @windoze).
  3. More comments below from other folks.

Add more tutorials & cases for UDFs

Currently Feathr supports flexible UDFs for PySpark, Spark SQL, and pandas-on-Spark. We definitely need more help with samples. Some tested ones are below:

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import dayofweek
from feathr import HdfsSource

def add_new_dropoff_and_fare_amount_column(df: DataFrame):
    df = df.withColumn("f_day_of_week", dayofweek("lpep_dropoff_datetime"))
    df = df.withColumn("fare_amount_cents", df.fare_amount.cast('double') * 100)
    return df


def feathr_udf_filter_location_id(df: DataFrame) -> DataFrame:
  # If using Spark SQL, declare the default spark session and create a temp view so that you can run Spark SQL on it.
  global spark
  df.createOrReplaceTempView("feathr_temp_table_feathr_udf_day_calc")
  sqlDF = spark.sql(
  """
  SELECT *
  FROM feathr_temp_table_feathr_udf_day_calc
  WHERE DOLocationID != 100
  """
  )
  return sqlDF

def feathr_udf_pandas_spark(df: DataFrame) -> DataFrame:
  # Using pandas-on-Spark APIs. For more details, refer to the doc here: https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html
  # Note that this API is only available in Spark 3.2 and later, so make sure you submit to a Spark cluster running Spark 3.2 or later.
  psdf = df.to_pandas_on_spark()
  psdf['fare_amount_cents'] = psdf['fare_amount'] * 100
  # Make sure to convert the pandas-on-Spark dataframe back to a Spark DataFrame.
  return psdf.to_spark()

batch_source = HdfsSource(name="nycTaxiBatchSource",
                          path="abfss://[email protected]/demo_data/green_tripdata_2020-04.csv",
                          preprocessing=add_new_dropoff_and_fare_amount_column,
                          event_timestamp_column="new_lpep_dropoff_datetime",
                          timestamp_format="yyyy-MM-dd HH:mm:ss")

Synapse notebook support

Currently, Synapse notebooks are not supported, since the current assumption is that users run feathr init in a shell environment, which Synapse notebooks do not support.

To support Synapse notebooks, I suggest adding an API in the FeathrClient class that can initialize a Feathr workspace from the Python API.

Implementation-wise, this would mean adding an API in this file (https://github.com/linkedin/feathr/blob/main/feathr_project/feathr/client.py) that references some of the CLI functions (https://github.com/linkedin/feathr/blob/main/feathr_project/feathrcli/cli.py)
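A hedged sketch of what the proposed notebook flow could look like; FeathrClient(config_path=...) follows the existing client, while init_workspace is the suggested new API and does not exist yet:

from feathr import FeathrClient

# Today: the client is constructed from a YAML config file.
client = FeathrClient(config_path="./feathr_config.yaml")

# Proposed: bootstrap the workspace layout from Python, mirroring `feathr init`,
# so Synapse notebooks (which have no shell) can do it too.
client.init_workspace(workspace_dir="./feathr_user_workspace")  # proposed API, not implemented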

FeathrClient.register_features() only registers features from first auto_generated_*.conf in feature_conf folder

When calling FeathrClient.register_features(), multiple auto_generated_*.conf files are created; however, in _feature_registry.py, only the first config file is used to identify features. The same is true for the generation paths and feature join paths.

When using build_features(), the method _save_to_feature_config(), or _save_to_feature_config_from_context() is called, both of which create 3 config files within one folder:
https://github.com/linkedin/feathr/blob/73ae7621101ee1ee7e2cee60eb55cab925e89a18/feathr_project/feathr/_feature_registry.py#L638-L651

In _feature_registry.register_features, _read_config_from_workspace is only called once:
https://github.com/linkedin/feathr/blob/73ae7621101ee1ee7e2cee60eb55cab925e89a18/feathr_project/feathr/_feature_registry.py#L712-L721

However, in that method, only the first config file found is used.
https://github.com/linkedin/feathr/blob/73ae7621101ee1ee7e2cee60eb55cab925e89a18/feathr_project/feathr/_feature_registry.py#L429-L439

Then, only the anchors, sources, and derivations for that one config file are referenced:
https://github.com/linkedin/feathr/blob/73ae7621101ee1ee7e2cee60eb55cab925e89a18/feathr_project/feathr/_feature_registry.py#L463-L477
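A hedged sketch of the direction of a fix, reading every generated config instead of only the first glob match; the helper name and folder layout here are illustrative, not the actual _feature_registry.py code:

import glob
import os

def read_all_feature_configs(workspace_path: str) -> list:
    # Collect every auto-generated config rather than only feature_conf_paths[0];
    # the caller would then parse and merge anchors, sources, and derivations from each file.
    return sorted(glob.glob(os.path.join(workspace_path, "feature_conf", "*.conf")))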

Implement Basic RBAC Roles (Admin, Producer, Consumer)

General Design

  • Leverage AAD in the login logic
  • A storage layer is needed for the user/role mapping
  • RBAC info will be retrieved and stored in "user.profile"
  • RBAC will be a Python API protection extension; the API will act differently based on the user profile.
  • Consider an API behavior as an "Action", e.g. list_features(filter: tags, owners, etc.). An Action list can be stored as a "Permission".
  • RBAC-specific APIs
    • admin_management
      • add_user: add user info to the user map with a given / default role
      • add_role: add a new kind of role with a role definition
      • assign_user: assign a user to certain roles
      • add_permission: a set of Actions
      • grant: grant a permission to a role
    • access_management
      • get_user_profile: user login and retrieval of user profile info
      • get_access: list all the permissions allowed for a user
      • check_access: check user access to a certain action / permission (action list)
      • is_user_in_role: whether a user is in a role
      • get_role: get all roles available in the current project
    • review_management
      • get_access_log: return all the access changes for audit (nice to have for MVP)
  • The above APIs should support add, delete, update, list, and other necessary variants.

Azure Resources

Microsoft Azure includes standard and built-in RBAC, which is an authorization system built on Azure Resource Manager that provides detailed access management to Azure resources.

Role Definition

Multi-layer roles are not shown in the definition below.

{
    "Roles":[
        {
            "id":0,
            "name": "Admin",
            "description" : "",
            "permissions": ["admin_management", "access_management", "review_management", "registry_apis", "spark_apis"],
            "AssignableScopes":["project","anchor"]
        },
        {
            "id":1,
            "name":"Producer",
            "description": "",
            "permissions": ["admin_management.add_permission/grant","access_management","registry_apis","spark_apis"],
            "AssignableScopes":["project","anchor"]
        },
        {
            "id":2,
            "name":"Consumer",
            "description":"",
            "permissions":["access_management","registry_apis","spark_apis"],
            "AssignableScopes":["project","anchor"]
        },
        {
            "id":3,
            "name":"Monitoring",
            "description":"",
            "permissions":["review_management", "log_apis","spark_apis"],
            "AssignableScopes":["project"]
        }
    ]
}

Dynamic versioning for feathr

Right now the Feathr version is hardcoded. There are plugins such as https://github.com/sbt/sbt-dynver or https://github.com/sbt/sbt-git that would let us update the version automatically.

Things to be aware of are any configurations that depend on our hardcoded versions, such as the feathr_config.yaml file, which references the 0.1.0 jar.

A potential option is to auto-generate this feathr_config.yaml as part of the assembly, so that it always references the latest version, or to support a dynamic version reference in the config.

Dynamic versioning would also be beneficial for publishing to Maven, as Maven requires unique versions for releases.

A Simple way to validate Data sources connectivity in Spark Job

As more and more data sources are involved, the current E2E test does not scale for pure connectivity testing.
A separate Spark job that simply tries to load all registered data sources into DataFrames (see the sketch after this list) can help to:

  • Enhance the test and engineering experience
    • Scalable checks that each supported data format works
    • Validate data connectivity for customized data sources
  • Further usage in the Feature Store UI / Data Platform*
    • Data source health telemetry (daily job)
    • Data visualization sampling / distribution

To achieve this, we may need to have:

  • A list of sample data sources which covers every supported data format (can be a config file synced to the sample feature registry) with
    • data path
    • credential pointer (to a centralized credential storage, e.g. Key Vault)
    • rules* (customized rules to make sure data source meets requirements)
  • Credential Storage
    • key or token
    • access type*: e.g. admin / read / write...
    • access level*: e.g. single file / folder; table / storage...
  • DataSourceCheckJob(ss, Seq[dataSourceDef])

"*" : means nice to have & low priority
