amundsen-io / amundsen

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Home Page: https://www.amundsen.io/amundsen/

License: Apache License 2.0

Shell 0.02% Mustache 0.03% HTML 0.12% Makefile 0.08% Python 67.42% TypeScript 30.03% JavaScript 0.08% SCSS 2.14% Scala 0.08%
amundsen metadata data-catalog data-discovery linuxfoundation

amundsen's Introduction

Amundsen

Slack

Amundsen is a data discovery and metadata engine for improving the productivity of data analysts, data scientists and engineers when interacting with data. It does that today by indexing data resources (tables, dashboards, streams, etc.) and powering a page-rank-style search based on usage patterns (e.g. highly queried tables show up earlier than less queried tables). Think of it as Google search for data. The project is named after Norwegian explorer Roald Amundsen, the first person to reach the South Pole.

LF AI & Data

Amundsen is hosted by the LF AI & Data Foundation. It includes three microservices, one data ingestion library and one common library.

  • amundsenfrontendlibrary: Frontend service which is a Flask application with a React frontend.
  • amundsensearchlibrary: Search service, which leverages Elasticsearch for search capabilities, is used to power frontend metadata searching.
  • amundsenmetadatalibrary: Metadata service, which leverages Neo4j or Apache Atlas as the persistent layer, to provide various metadata.
  • amundsendatabuilder: Data ingestion library for building the metadata graph and search index. Users can load data either with a Python script that uses the library or with an Airflow DAG that imports it.
  • amundsencommon: Amundsen Common library holds code shared among the Amundsen microservices.
  • amundsengremlin: Amundsen Gremlin library holds code for converting model objects into vertices and edges in Gremlin. It is used for loading data into an AWS Neptune backend.
  • amundsenrds: Amundsenrds contains ORM models to support a relational database as the metadata backend store in Amundsen. The schema in the ORM models follows the logic of the databuilder models. Amundsenrds will be used in databuilder and metadatalibrary for metadata storage and retrieval with relational databases.

Documentation

Community Roadmap

We want your input on what is important. To weigh in, add your vote using the 👍 reaction:

Requirements

  • Python >= 3.8
  • Node v12

User Interface

Please note that the mock images serve demonstration purposes only.

  • Landing Page: The landing page for Amundsen, including the search bar and a list of popularly used tables;

  • Search Preview: See inline search results as you type

  • Table Detail Page: Visualization of a Hive / Redshift table

  • Column detail: Visualization of columns of a Hive / Redshift table which includes an optional stats display

  • Data Preview Page: Visualization of table data preview which could integrate with Apache Superset or other Data Visualization Tools.

Getting Started and Installation

Please visit the Amundsen installation documentation for a quick start to bootstrap a default version of Amundsen with dummy data.

Supported Entities

  • Tables (from Databases)
  • Dashboards
  • ML Features
  • People (from HR systems)

Supported Integrations

Table Connectors

Amundsen can also connect to any database that provides a dbapi or sql_alchemy interface (which most DBs provide).
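As a rough illustration of what that interface makes possible, the sketch below pulls table and column metadata through the Python DB-API using stdlib sqlite3. The function name is illustrative, not databuilder's actual API.

```python
import sqlite3

# Hedged sketch: any DB-API-compatible database can be queried for the
# kind of table/column metadata that Amundsen ingests. The function name
# is illustrative, not databuilder's real extractor API.
def extract_column_metadata(conn: sqlite3.Connection) -> dict:
    cursor = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    metadata = {}
    for (table,) in cursor.fetchall():
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        metadata[table] = [{"name": c[1], "type": c[2]} for c in cols]
    return metadata

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (id INTEGER, city TEXT)")
print(extract_column_metadata(conn))
```

A production extractor would point the same idea at Hive, Redshift, Postgres, etc. via their DB-API drivers.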

Table Column Statistics

Dashboard Connectors

ETL Orchestration

Get Involved in the Community

Want help or want to help? Use the button in our header to join our Slack channel.

Contributions are also more than welcome! As explained in CONTRIBUTING.md, there are many ways to contribute; it does not all have to be code with new features and bug fixes. Documentation, FAQ entries, bug reports, blog posts sharing experiences, etc. all help move Amundsen forward. If you find a security vulnerability, please follow this guide.

Architecture Overview

Please visit Architecture for Amundsen architecture overview.

Resources

Blog Posts and Interviews

Talks

  • Disrupting Data Discovery {slides, recording} (Strata SF, March 2019)
  • Amundsen: A Data Discovery Platform from Lyft {slides} (Data Council SF, April 2019)
  • Disrupting Data Discovery {slides} (Strata London, May 2019)
  • ING Data Analytics Platform (Amundsen is mentioned) {slides, recording } (Kubecon Barcelona, May 2019)
  • Disrupting Data Discovery {slides, recording} (Making Big Data Easy SF, May 2019)
  • Disrupting Data Discovery {slides, recording} (Neo4j Graph Tour Santa Monica, September 2019)
  • Disrupting Data Discovery {slides} (IDEAS SoCal AI & Data Science Conference, Oct 2019)
  • Data Discovery with Amundsen by Gerard Toonstra from Coolblue {slides} and {talk} (BigData Vilnius 2019)
  • Towards Enterprise Grade Data Discovery and Data Lineage with Apache Atlas and Amundsen by Verdan Mahmood and Marek Wiewiorka from ING {slides, talk} (Big Data Technology Warsaw Summit 2020)
  • Airflow @ Lyft (which covers how we integrate Airflow and Amundsen) by Tao Feng {slides and website} (Airflow Summit 2020)
  • Data DAGs with lineage for fun and for profit by Bolke de Bruin {website} (Airflow Summit 2020)
  • Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metadata Platform by Tao Feng (Data+AI summit Europe 2020)
  • Data Discovery at Databricks with Amundsen by Tao Feng and Tianru Zhou (Data+AI summit NA 2021)

Related Articles

Community meetings

Community meetings are held on the first Thursday of every month at 9 AM Pacific, Noon Eastern, 6 PM Central European Time. Link to join

Upcoming meetings & notes

You can find the exact date for the next meeting and the agenda a few weeks before the meeting in this doc.

Notes from all past meetings are available here.

Who uses Amundsen?

Here is the list of organizations that are officially using Amundsen today. If your organization uses Amundsen, please file a PR and update this list.

Contributors ✨

Thanks goes to these incredible people:

amundsen's People

Contributors

allisonsuarez, alran, andrewciambrone, b-t-d, csteez, dechoma, dependabot-preview[bot], dependabot[bot], dikshathakur3119, dorianj, feng-tao, friendtocephalopods, github-actions[bot], golodhros, javamonkey79, jinhyukchang, jornh, jroof88, kristenarmes, markgrover, mgorsk1, ozandogrultan, paschalisdim, ryanlieu, samshuster, sewardgw, ttannis, verdan, xuan616, youngyjd

amundsen's Issues

Feature request - knowledge posts/repository

As a 'data discovery' platform, I should aim to empower users across my organization to make better decisions - based on data. I should allow new users to onboard quicker to the organization and learn about the business, help users answer (most of) their own 'data' questions through past analysis, and reduce the barriers of access to analyze data on questions that have yet to be answered.

Once analysis has been done on a 'data' question, the results are often saved or archived in the form of a wiki or knowledge post. For Airbnb + Lyft...Knowledge Repo + Stache Overflow.

https://github.com/airbnb/knowledge-repo
https://stackoverflow.com/teams

Although this is not a concrete implementation plan, the story should be the same: within Amundsen, I should be able to search past analyses or knowledge posts (created by and connected to users on the platform), ranked by popularity, with owners, followers, etc.

Should we add PUT/POST endpoints for tables?

We wanted to do this at Square so our databuilder instance doesn't have to connect directly to the neo4j database instance. It can just call some simple metadata library APIs that handle the database details.

We think this gives a better separation of concerns between the databuilder and the metadata library code.

I built this for the Square instance of amundsen, I'd be happy to upstream it if y'all are amenable.

Implement a config variable that will allow the owners to edit their datasets

For now, we are relying on UNEDITABLE_SCHEMAS, but what if we want the owners to be able to change the datasets they own?
For example, it could look like this (I don't know at the moment which check would take precedence):

is_editable = True

if results['schema'] in app.config['UNEDITABLE_SCHEMAS']:
    is_editable = False

if app.config['ALLOW_ADMINS_TO_EDIT'] and current_user in results['owners']:
    is_editable = True

Thoughts?
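The precedence question above can be made explicit in a small helper. A minimal sketch, reusing the config keys from the snippet (which are proposals in this issue, not an established Amundsen configuration contract), in which the owner override wins:

```python
# Sketch only: config keys mirror the proposal above; they are not an
# actual Amundsen configuration contract.
def is_editable(results: dict, config: dict, current_user: str) -> bool:
    # Start from the schema-level rule ...
    editable = results["schema"] not in config.get("UNEDITABLE_SCHEMAS", [])
    # ... then let the owner override take precedence.
    if config.get("ALLOW_ADMINS_TO_EDIT") and current_user in results.get("owners", []):
        editable = True
    return editable

config = {"UNEDITABLE_SCHEMAS": ["core"], "ALLOW_ADMINS_TO_EDIT": True}
print(is_editable({"schema": "core", "owners": ["alice"]}, config, "alice"))  # True: owner override wins
```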

Add instructions for BigQuery Extractor

Could we please add some setup instructions for the BigQuery metadata extractor and usage extractor? Instructions could include the IAM roles and permissions the extractor needs in order to access and extract metadata and usage information. An example response from BigQuery for metadata and usage would be ideal.

The terms `schema` and `db` have multiple meanings and can be confusing

The terminology used for the Amundsen table URI is a bit confusing, and people can interpret it in multiple ways. We should make it consistent with how these terms are normally used.

As per @feng-tao here: I think the schema term in Amundsen is the same as database (e.g. CREATE DATABASE), while database is used to identify the system (Hive, Redshift, etc.).

And Amundsen is using these terms to generate table URN {db}://{cluster}.{schema}/{table}

In my opinion both terms are overloaded: schema usually means a table's columns, yet in Amundsen we use it as the database in <this thing>.table, while database means a physical database yet is also used as the system in the URI. I think we need a third term here.

I propose to use a different term for the system itself, and use db instead of schema.
Something like this: {system|type|db_type}://{cluster}.{db}/{table}

@feng-tao @jinhyukchang
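Whichever names win, a small parser makes the current layout concrete. A sketch of parsing the `{db}://{cluster}.{schema}/{table}` pattern (assuming, for illustration, that cluster names contain no dots):

```python
import re

# Sketch of the current Amundsen table URI layout; the proposed rename
# would only change which component is called what.
URI_PATTERN = re.compile(
    r"^(?P<db>[^:/]+)://(?P<cluster>[^.]+)\.(?P<schema>[^/]+)/(?P<table>.+)$"
)

def parse_table_uri(uri: str) -> dict:
    match = URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"not a table URI: {uri}")
    return match.groupdict()

print(parse_table_uri("hive://gold.core/fact_rides"))
# {'db': 'hive', 'cluster': 'gold', 'schema': 'core', 'table': 'fact_rides'}
```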

Redshift spectrum metadata extraction is not supported by postgres extractor

Figured this out yesterday while working on some spectrum tables; the current query in amundsen-io/amundsendatabuilder#63 works for native redshift tables (which is what I tested with earlier), but does not work with spectrum tables (as spectrum metadata is stored outside sql standard INFORMATION_SCHEMA). svv_external_columns should provide access to the equivalent external tables metadata.

If/when I have some time I'll add a PR, or if anyone else wants to tackle in the meantime - have at it 😉.

Use entity IDs where possible instead of using the table URI

For the most part, Amundsen metadata relies on the table_uri passed from the frontend, which adds complexity and extra processing for systems that support IDs.
I am proposing to use the IDs of entities like Table, Column, and User for the different kinds of operations. This information can be passed from metadata (when available, None otherwise) and simply be returned from the frontend for each operation.
Example Use Case:

  • Get Column Description:
    At the moment, the frontend sends the table_uri and the column name. I need to parse the table_uri, find the table details, find the column based on the column name, get the column detail entity, and then return the description. If an ID were available for the column, we could simply use it to fetch the record and return the description.

The same is the case for PUT and the table detail itself. If we had the ID, we would not need to parse the table_uri.

Just out of curiosity, how does Neo4j handle IDs? Does it support them?

Document config through env vars

For this repo and the metadata + search repos, much of what is in localConfig.py can also be configured through environment variables. The documentation should reflect that as an alternative to editing localConfig, which might require rebuilding Python packages and/or Docker images.

Related Slack thread: https://amundsenworkspace.slack.com/archives/CHQERT0D7/p1563544756136000?thread_ts=1563303575.106800&channel=CHQERT0D7&message_ts=1563544756.136000
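The environment-variable alternative is a one-liner per setting. A sketch (the variable name below is illustrative, not a documented Amundsen config key):

```python
import os

# Sketch of the alternative being requested: read a setting from an
# environment variable, falling back to the localConfig-style default.
# The variable name SEARCHSERVICE_BASE is illustrative.
def setting(name: str, default: str) -> str:
    return os.environ.get(name, default)

search_base = setting("SEARCHSERVICE_BASE", "http://localhost:5001")
```

This way the same image can be pointed at different backends per environment, with no rebuild.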

Can we add a pre-push git hook to help enforce code style checks?

Amundsen runs flake8, mypy, and tsc to enforce Python + TypeScript style checks and to catch stray warnings. For us, this enforcement takes the form of a pre-push git hook in addition to our CI:

#!/bin/sh

red='\033[0;31m'
green='\033[0;32m'
yellow='\033[0;33m'
NC='\033[0m'

# Get only the files different on this branch
BASE_SHA="$(git merge-base master HEAD)"
FILES="$(git diff --name-only --diff-filter=AMC $BASE_SHA HEAD | grep "\.py$" | tr '\n' ' ')"

echo "${green}[Python Checks][Info]: Checking Python Style, Types${NC}"

if [ -n "$FILES" ]
then
  echo "${green}[Python Checks][Info]: ${FILES}${NC}"

  # Run flake8
  venv/bin/flake8 .

  if [ $? -ne 0 ]; then
    echo "${red}[Python Style][Error]: Fix the issues and commit again (or commit with --no-verify if you are sure)${NC}"
    exit 1
  fi

  # Run mypy
  venv/bin/mypy .

  if [ $? -ne 0 ]; then
    echo "${red}[Python Type Checks][Error]: Fix the issues and commit again (or commit with --no-verify if you are sure)${NC}"
    exit 1
  fi
else
  echo "${green}[Python Checks][Info]: No files to check${NC}"
fi

TS_FILES=$(git diff --name-only --diff-filter=AMC "$BASE_SHA" HEAD | grep "\.tsx\{0,1\}$")
echo "${green}[Typescript Checks][Info]: Checking Typescript Style, Types${NC}"

if [ -n "$TS_FILES" ]
then
  cd amundsen_application/static
  npm run tsc

  if [ $? -ne 0 ]; then
    echo "${red}[Typescript Type Checks][Error]: Fix the issues and commit again (or commit with --no-verify if you are sure)${NC}"
    exit 1
  fi
else
  echo "${green}[Typescript Checks][Info]: No files to check${NC}"
fi

exit 0

It would be nice to have this in amundsen proper too, since I occasionally run into linting differences when pulling from upstream.

Unnecessary code - proxy/elasticsearch.py vs api/table.py

Hi,
in my opinion, the code below can't be reached and the default value is set from different places.

https://github.com/lyft/amundsensearchlibrary/blob/ab13456f29f1effbe7360e300af7c469fe9fe320/search_service/proxy/elasticsearch.py#L219-L220

elasticsearch.py is called from table.py (below), but a default value is already set in table.py; if I set the default value in table.py as a param of add_argument, the else condition in elasticsearch.py can never be reached.

https://github.com/lyft/amundsensearchlibrary/blob/ea46e71de6e9e345a4b0207f364d2e9afc110cc6/search_service/api/table.py#L28

https://github.com/lyft/amundsensearchlibrary/blob/ea46e71de6e9e345a4b0207f364d2e9afc110cc6/search_service/api/table.py#L43

Am I right that this should be fixed, or is it on purpose and I just can't see the reason?
I would prefer to keep the default value only in the config and remove it from add_argument in table.py.

Thanks

Position databuilder more clearly as an ETL tool(box) in its own right

Background originated in Q&A on this PR amundsen-io/amundsendatabuilder#31 (comment)


Hence making databuilder not aware of the search service could help to grow the library into a more generic way.

Yes, that framing certainly puts a different perspective on databuilder. I guess what fooled me into that train of thinking was it being included under the Amundsen naming umbrella etc. instead of it being a lower level thing “closer to the metal” more in its own right. Perhaps longer term worth explaining more — somehow...

Oh, wait; Just found this right at the top of the README.md under concept:

Amundsen Databuilder is a ETL framework for Amundsen ...

(bolding text is mine)

You guys certainly nudged me a bit into my former perspective. :) I guess it could sit on a list like https://github.com/pawl/awesome-etl. Of the tools on ⬅️ that list that I'm familiar with, it reminds me most of https://singer.io, Pandas, or Embulk.

  • Care to share whether you think it's got a unique value proposition compared to its competitors (i.e. why it was built in the first place)?

    Just came across a nice example of such a WHY statement in yet-another-ETL-tool at https://metalpipe.readthedocs.io/en/latest/overview.html#what-isn-t-it

  • Do you think it’s worth it for me to submit a PR or two trying to boost databuilder slightly more in its own right? (Maybe better to spend our time on the 3 Amundsen services)
  • One immediate thing that comes to mind is databuilder (or maybe more the 3 Amundsen services) might benefit from having a CLI wrapper for something like the example Amundsen script, then you could:
    pip install amundsendatabuilder
    ->
    databuilder loadcsv <CSVurl> targetURL <publisherURL> ... parameters ...

Document local workflow

Problem

We want to document how someone would go about getting set up to develop locally in Amundsen.

  1. This is not intuitive for newcomers given our micro service architecture and multiple repositories/submodules.
  2. This is also difficult for folks who do not have Amundsen stood up in their company's internal development environments. For example at Lyft we are able to develop and test features with real data while hooked up to our internal infrastructure.

We have our docker-compose script that helps stand up a default version of Amundsen, but we currently lack any instructions on how to take what is accessible via this repository and set up an environment where a developer can modify and test changes in any part of Amundsen.

Goal

Create support for and document how to get Amundsen running locally, with the ability to build and test local changes with dummy data.

What I'm Considering

  1. @jornh has mentioned a hack that involves modifying docker-amundsen.yml to point the image tag of the service in question at some other dummy tag name, building the container for that service with the local changes, and re-running docker-compose -f docker-amundsen-atlas.yml up. (Jorn, please correct me or give more detail if I misrepresented this idea.) Two problems I faced here:

    • I couldn't get it to work, highly likely due to missing some detail or hack with docker-compose.
    • Folks have to deal with the overhead it takes to build the local container and to re-run the docker-compose script.
  2. I then took a stab at something that worked better than I thought it would.

    • Stood up amundsenfrontend locally on port 5000 (according to instructions in the repo)
    • Stood up amundsensearch locally on port 5001
    • Stood up amundsenmetadata locally on port 5002
    • Ran docker-compose -f docker-amundsen-atlas.yml up, which only started the neo4j_amundsen and es_amundsen containers because the other ports were already in use.
    • Visiting localhost:5000 now actually worked the same way it does for the quickstart, except that instead of the 5 Docker containers working together, it is now my local copies of amundsenfrontend, amundsenmetadata, and amundsensearch utilizing the neo4j_amundsen and es_amundsen containers from the quickstart, so I actually get some dummy data.
    • Made some changes in my local copy of amundsenfrontendlibrary and amundsenmetadatalibrary for kicks, could now see my changes locally.
    • But what about instructions for amundsendatabuilderlibrary? TBD...my understanding is that the logic in databuilder can be tested locally via a couple different methods and I'm thinking the best place to document that is in the docs for that repo.

What I'm Proposing

To document the second approach I described above, then file some issues for the community to help us make this better. Some such ideas include:

  1. Creating a script to do all the steps in the second option above. Besides changing the metadata and search wsgi's to run on different ports, all I did was change directories into each submodule and copy paste the local setup commands from each repo.
  2. Richer dummy data. This local development approach still uses the neo4j and elasticsearch containers from the quick start, so the task of improving that will improve not just the quickstart, but standalone local development as well.

Enable --strictNullChecks

Enabling --strictNullChecks helps us better leverage TypeScript to ensure that if we write code that attempts to access a property of a null or undefined variable, the error will be caught. Expect that many of our type definitions will have to be updated as well.

https://www.typescriptlang.org/docs/handbook/release-notes/typescript-2-0.html

Right now, we are using betterer to keep track of regressions on this side. This ticket will be done once we are ready to remove our betterer check.

Production Installation guidelines

It would be great to have some documentation for guidelines to install Amundsen in production. Maybe with some Helm Chart to deploy it in a Kubernetes Cluster. Thanks!

Add proper CLI parameter handling to the Quickstart sample loader script

A bunch of input parameters are hard-coded, like this: https://github.com/lyft/amundsendatabuilder/blob/v1.3.6/example/scripts/sample_data_loader.py#L37-L43

As is the Elasticsearch host a few lines above. It would be much more elegant to be able to run Databuilder on any host and point it at another host (or hosts): the defaults could stay at localhost, but the script could take input parameters.

@worldwise001 you mentioned work on alternative loading APIs in Slack. Not sure if these two things would affect each other or not. But maybe worth considering both together?

@jinhyukchang etc. I think this is a candidate for a First Issue tag
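A minimal argparse sketch of what that could look like, with defaults staying at localhost but becoming overridable (flag names here are illustrative, not from the actual script):

```python
import argparse

# Hedged sketch of CLI parameter handling for the sample loader.
# Flag names are illustrative, not from the real sample_data_loader.py.
def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Load sample data into Amundsen")
    parser.add_argument("--es-host", default="localhost",
                        help="Elasticsearch host (default: localhost)")
    parser.add_argument("--neo4j-endpoint", default="bolt://localhost:7687",
                        help="Neo4j bolt endpoint (default: local quickstart)")
    return parser.parse_args(argv)

args = parse_args(["--es-host", "es.internal"])
print(args.es_host, args.neo4j_endpoint)
```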

Add sample high/low watermark data to quick start

We use

SELECT From_unixtime(A0.create_time) as create_time,
               C0.NAME as schema_name,
               B0.tbl_name as table_name,
               {func}(A0.part_name) as part_name,
               {watermark} as part_type
        FROM   PARTITIONS A0
               LEFT OUTER JOIN TBLS B0
                            ON A0.tbl_id = B0.tbl_id
               LEFT OUTER JOIN DBS C0
                            ON B0.db_id = C0.db_id
        WHERE  C0.NAME IN {schemas}
               AND B0.tbl_type IN ( 'EXTERNAL_TABLE', 'MANAGED_TABLE' )
               AND A0.PART_NAME NOT LIKE '%%__HIVE_DEFAULT_PARTITION__%%'
        GROUP  BY C0.NAME, B0.tbl_name
        ORDER by create_time desc

as the Hive metastore query to get the high and low watermarks for a Hive table. We have already included the data model in https://github.com/lyft/amundsendatabuilder/blob/master/databuilder/models/hive_watermark.py .

We should provide sample watermark data in CSV format for the quick start. The UI already supports displaying watermark data.

DataPreview needs rendering improvements

The data preview component needs several UI updates. While maintaining the same look and feel, we want to:

  1. Render a table that keeps fixed headers when scrolling vertically
  2. Fit the height of the rendered content to the modal -- the horizontal scroll bar should be visible without having to scroll down.
  3. Support the ability to render non-primitive values. For example, values like numbers, strings, booleans render fine. However if the value is something like a Map<string, int>, we do not currently render a good representation of that data.

We may want to consider leveraging a 3rd-party table component for this, as long as it can be styled to match our current look and feel.
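For point 3, one possible backend-side workaround (an assumption on my part, not a plan from the team) is to serialize non-primitive preview values to JSON strings, so the frontend always receives something it can render:

```python
import json

# Hedged sketch: pass primitives through untouched and serialize anything
# else (maps, lists, etc.) to a JSON string for display.
def to_renderable(value):
    if isinstance(value, (str, int, float, bool)) or value is None:
        return value
    return json.dumps(value, default=str, sort_keys=True)

print(to_renderable({"rides": 3, "city": "SF"}))  # '{"city": "SF", "rides": 3}'
```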

Frontend configurations to be set using API

In order to support all the customizations of the frontend (i.e. the logo path, etc., mentioned here) when it is installed from a pip package (already compiled), we need to pass those configs from the backend, store them in React's store, and use them instead of hardcoding them in frontend .js files.

@ttannis @danwom FYI.

Design lineage UI

Lineage information should be displayed in the UI. Atlas shows the following (screenshot: Atlas lineage view, 2019-04-28), which should be made more "Amundsen"-like.

Would like a guide for How-To deploy Amundsen in production

Please add points on what you expect from such a guide in a comment below. I will then try to consolidate input and draft up an outline in this comment.

The guide can end up as /docs/deployment.md, or is /docs/owners_manual.md better?

Initial outline:

Unique identifier of User entity is email, but should be configurable

In the current implementation, the unique identifier and required field is email, and I think that should be configurable. Alternatively, we could introduce a new required field, username, and make email optional. In cases where username and email address are the same, simply fill both attributes with the email address.

Use Case:
At ING, we use corporate keys, which are fully unique and cannot be changed, whereas an email address can change (for example, a name change after marriage, or a change in the domain part based on geolocation). To handle that scenario we could use the corporate key as the unique identifier. It would make no sense, and might confuse users, if we sent the corporate key in the email address field; there would then also be no attribute left for the actual email address.

Happy to discuss in detail and make the change in both frontend and backend once we finalize this.

@feng-tao @jinhyukchang
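A sketch of the proposal (field names illustrative, not Amundsen's actual user model): a required, immutable user_id plus an optional email, with both set to the email address where they coincide:

```python
from dataclasses import dataclass
from typing import Optional

# Hedged sketch of the proposed split: user_id is the required unique
# key (e.g. a corporate key), email becomes optional and mutable.
@dataclass
class User:
    user_id: str              # immutable unique identifier
    email: Optional[str] = None

# ING-style: corporate key as the ID, email stored separately.
ing_user = User(user_id="ab12cd", email="jane.doe@ing.com")
# Current behavior: where they coincide, both are the email address.
email_user = User(user_id="jdoe@example.com", email="jdoe@example.com")
```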

Experiences running amundsen on latest osx

Could be helpful to have everything in one place, so documenting this here.

When using brew and the latest of everything on macOS, I got errors when running npm install:

../../nan/nan_implementation_12_inl.h:103:42: error: no viable conversion from 'v8::Isolate *' to 'v8::Local<v8::Context>'

This is right after installing npm using brew, which installs node version 12.

I downgraded to node version 10:

brew uninstall node@12
brew install node@10
brew link --force --overwrite node@10

Then npm install as normal, which gave me another error:

Node Sass could not find a binding for your current environment: OS X 64-bit with Node.js 10.x

This got fixed by:

npm rebuild node-sass

Then I managed to build everything as usual.

running amundsen in ec2 results in error

  1. Create an ec2 instance (I am working with t2.xlarge)
  2. switch to root (sudo su)
  3. run the following commands
yum update -y && \
yum install -y docker && \
yum install -y git && \
service docker start && \
sudo usermod -aG docker ec2-user && \
mkdir /amundsen && \
cd /amundsen && \
git clone https://github.com/lyft/amundsenfrontendlibrary && \
git clone https://github.com/lyft/amundsenmetadatalibrary && \
git clone https://github.com/lyft/amundsensearchlibrary && \
cd /amundsen/amundsenmetadatalibrary/ && \
docker build -f public.Dockerfile -t amundsendev/amundsen-metadata:latest . && \
cd /amundsen/amundsenfrontendlibrary/ && \
docker build -f public.Dockerfile -t amundsendev/amundsen-frontend:latest . && \
cd /amundsen/amundsensearchlibrary/ && \
docker build -f public.Dockerfile -t amundsendev/amundsen-search:latest . && \
cd /amundsen && \
curl -L "https://github.com/docker/compose/releases/download/1.24.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose && \
chmod +x /usr/local/bin/docker-compose && \
ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose && \
docker-compose -f amundsenfrontendlibrary/docker-amundsen.yml up

Note the following errors:

es_amundsen         | [1]: max file descriptors [40000] for elasticsearch process is too low, increase to at least [65535]
es_amundsen         | [2]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

Support versioning of Tables in amundsen

To explore historical metadata, we should support versioning of tables in Amundsen so we can understand what changed in each version of a table.
There would be a detail page for each version of a table, including the column information of that particular version.

Apache Atlas provides this information in the form of "Deleted" Table and Columns entities, and can be filtered out using the entityStatus attribute.

Misleading variables in databuilder readme

Hi,
I would like to fix the examples in the databuilder readme. Maybe there is a reason I just don't see, but this is probably just a copy-paste issue.

There are some variables in the readme https://github.com/lyft/amundsendatabuilder/blob/master/README.md, for example in the Elasticsearch publisher:

tmp_folder = '/var/tmp/amundsen/dummy_metadata'
node_files_folder = '{tmp_folder}/nodes/'.format(tmp_folder=tmp_folder)
relationship_files_folder = '{tmp_folder}/relationships/'.format(tmp_folder=tmp_folder)

but those variables are not used in the example code, which was a bit misleading for me at the beginning. I would fix it by removing the unused variables or by actually using them in the example code.
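For reference, the fix could go either way; a sketch of actually wiring the variables into a job configuration (the config keys below are illustrative, check the databuilder readme for the real ones):

```python
# The readme's variables, now actually used to build the job config
# instead of being defined and then ignored.
tmp_folder = '/var/tmp/amundsen/dummy_metadata'
node_files_folder = '{tmp_folder}/nodes/'.format(tmp_folder=tmp_folder)
relationship_files_folder = '{tmp_folder}/relationships/'.format(tmp_folder=tmp_folder)

# Illustrative config keys; the real ones come from the databuilder readme.
job_config = {
    'loader.filesystem_csv_neo4j.node_dir_path': node_files_folder,
    'loader.filesystem_csv_neo4j.relationship_dir_path': relationship_files_folder,
}
print(job_config['loader.filesystem_csv_neo4j.node_dir_path'])
```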
