amundsen-io / amundsen

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Home Page: https://www.amundsen.io/amundsen/

License: Apache License 2.0

Shell 0.02% Mustache 0.03% HTML 0.12% Makefile 0.08% Python 67.42% TypeScript 30.03% JavaScript 0.08% SCSS 2.14% Scala 0.08%
amundsen metadata data-catalog data-discovery linuxfoundation

amundsen's Introduction

Amundsen

Slack

Amundsen is a data discovery and metadata engine for improving the productivity of data analysts, data scientists and engineers when interacting with data. It does that today by indexing data resources (tables, dashboards, streams, etc.) and powering a page-rank-style search based on usage patterns (e.g. highly queried tables show up earlier than less queried tables). Think of it as Google search for data. The project is named after Norwegian explorer Roald Amundsen, the first person to reach the South Pole.

LF AI & Data

Amundsen is hosted by the LF AI & Data Foundation. It includes three microservices, one data ingestion library and one common library.

  • amundsenfrontendlibrary: Frontend service which is a Flask application with a React frontend.
  • amundsensearchlibrary: Search service, which leverages Elasticsearch for search capabilities, is used to power frontend metadata searching.
  • amundsenmetadatalibrary: Metadata service, which leverages Neo4j or Apache Atlas as the persistent layer, to provide various metadata.
  • amundsendatabuilder: Data ingestion library for building the metadata graph and search index. Users can load data either with a Python script that uses the library or with an Airflow DAG that imports it.
  • amundsencommon: Amundsen Common library holds code shared among the Amundsen microservices.
  • amundsengremlin: Amundsen Gremlin library holds code for converting model objects into vertices and edges in Gremlin. It is used for loading data into an AWS Neptune backend.
  • amundsenrds: Amundsenrds contains ORM models to support a relational database as the metadata backend store in Amundsen. The schema in the ORM models follows the logic of the databuilder models. Amundsenrds will be used in databuilder and metadatalibrary for metadata storage and retrieval with relational databases.

Documentation

Community Roadmap

We want your input on what is important. To weigh in, add your vote using the 👍 reaction:

Requirements

  • Python >= 3.8
  • Node v12

User Interface

Please note that the mock images serve demonstration purposes only.

  • Landing Page: The landing page for Amundsen, including the search bar and a list of popularly used tables;

  • Search Preview: See inline search results as you type

  • Table Detail Page: Visualization of a Hive / Redshift table

  • Column detail: Visualization of columns of a Hive / Redshift table which includes an optional stats display

  • Data Preview Page: Visualization of table data preview which could integrate with Apache Superset or other Data Visualization Tools.

Getting Started and Installation

Please visit the Amundsen installation documentation for a quick start to bootstrap a default version of Amundsen with dummy data.

Supported Entities

  • Tables (from Databases)
  • Dashboards
  • ML Features
  • People (from HR systems)

Supported Integrations

Table Connectors

Amundsen can also connect to any database that provides a dbapi or sql_alchemy interface (which most DBs provide).
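As a rough illustration of what that interface makes possible, the sketch below pulls table and column metadata through the Python DB-API using stdlib sqlite3. The function name is illustrative, not databuilder's actual API.

```python
import sqlite3

# Hedged sketch: any DB-API-compatible database can be queried for the
# kind of table/column metadata that Amundsen ingests. The function name
# is illustrative, not databuilder's real extractor API.
def extract_column_metadata(conn: sqlite3.Connection) -> dict:
    cursor = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    metadata = {}
    for (table,) in cursor.fetchall():
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        metadata[table] = [{"name": c[1], "type": c[2]} for c in cols]
    return metadata

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (id INTEGER, city TEXT)")
print(extract_column_metadata(conn))
```

A production extractor would point the same idea at Hive, Redshift, Postgres, etc. via their DB-API drivers.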

Table Column Statistics

Dashboard Connectors

ETL Orchestration

Get Involved in the Community

Want help or want to help? Use the button in our header to join our Slack channel.

Contributions are also more than welcome! As explained in CONTRIBUTING.md, there are many ways to contribute; it does not all have to be code with new features and bug fixes. Documentation, FAQ entries, bug reports, blog posts sharing experiences, etc. all help move Amundsen forward. If you find a security vulnerability, please follow this guide.

Architecture Overview

Please visit Architecture for Amundsen architecture overview.

Resources

Blog Posts and Interviews

Talks

  • Disrupting Data Discovery {slides, recording} (Strata SF, March 2019)
  • Amundsen: A Data Discovery Platform from Lyft {slides} (Data Council SF, April 2019)
  • Disrupting Data Discovery {slides} (Strata London, May 2019)
  • ING Data Analytics Platform (Amundsen is mentioned) {slides, recording } (Kubecon Barcelona, May 2019)
  • Disrupting Data Discovery {slides, recording} (Making Big Data Easy SF, May 2019)
  • Disrupting Data Discovery {slides, recording} (Neo4j Graph Tour Santa Monica, September 2019)
  • Disrupting Data Discovery {slides} (IDEAS SoCal AI & Data Science Conference, Oct 2019)
  • Data Discovery with Amundsen by Gerard Toonstra from Coolblue {slides} and {talk} (BigData Vilnius 2019)
  • Towards Enterprise Grade Data Discovery and Data Lineage with Apache Atlas and Amundsen by Verdan Mahmood and Marek Wiewiorka from ING {slides, talk} (Big Data Technology Warsaw Summit 2020)
  • Airflow @ Lyft (which covers how we integrate Airflow and Amundsen) by Tao Feng {slides and website} (Airflow Summit 2020)
  • Data DAGs with lineage for fun and for profit by Bolke de Bruin {website} (Airflow Summit 2020)
  • Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metadata Platform by Tao Feng (Data+AI summit Europe 2020)
  • Data Discovery at Databricks with Amundsen by Tao Feng and Tianru Zhou (Data+AI summit NA 2021)

Related Articles

Community meetings

Community meetings are held on the first Thursday of every month at 9 AM Pacific, Noon Eastern, 6 PM Central European Time. Link to join

Upcoming meetings & notes

You can find the exact date for the next meeting and the agenda a few weeks before the meeting in this doc.

Notes from all past meetings are available here.

Who uses Amundsen?

Here is the list of organizations that are officially using Amundsen today. If your organization uses Amundsen, please file a PR and update this list.

Contributors ✨

Thanks goes to these incredible people:

amundsen's People

Contributors

allisonsuarez, alran, andrewciambrone, b-t-d, csteez, dechoma, dependabot-preview[bot], dependabot[bot], dikshathakur3119, dorianj, feng-tao, friendtocephalopods, github-actions[bot], golodhros, javamonkey79, jinhyukchang, jornh, jroof88, kristenarmes, markgrover, mgorsk1, ozandogrultan, paschalisdim, ryanlieu, samshuster, sewardgw, ttannis, verdan, xuan616, youngyjd

amundsen's Issues

Feature request - knowledge posts/repository

As a 'data discovery' platform, I should aim to empower users across my organization to make better decisions - based on data. I should allow new users to onboard quicker to the organization and learn about the business, help users answer (most of) their own 'data' questions through past analysis, and reduce the barriers of access to analyze data on questions that have yet to be answered.

Once analysis has been done on a 'data' question, the results are often saved or archived in the form of a wiki or knowledge post. For Airbnb + Lyft...Knowledge Repo + Stache Overflow.

https://github.com/airbnb/knowledge-repo
https://stackoverflow.com/teams

Although this is not a concrete implementation plan, the story should be the same: within Amundsen, I should be able to search past analyses or knowledge posts (created by and connected to users on the platform), ranked by popularity, with owners, followers, etc.

Should we add PUT/POST endpoints for tables?

We wanted to do this at Square so our databuilder instance doesn't have to connect directly to the neo4j database instance. It can just call some simple metadata library APIs that handle the database details.

We think this gives a better separation of concerns between the databuilder and the metadata library code.

I built this for the Square instance of amundsen, I'd be happy to upstream it if y'all are amenable.

Implement a config variable that will allow the owners to edit their datasets

For now, we are relying on UNEDITABLE_SCHEMAS, but what if we want the owners to be able to change the datasets they own?
For example, it could look like this (I don't know at the moment which check would take precedence):

is_editable = True

if results['schema'] in app.config['UNEDITABLE_SCHEMAS']:
    is_editable = False

if app.config['ALLOW_ADMINS_TO_EDIT'] and current_user in results['owners']:
    is_editable = True

Thoughts?
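The precedence question above can be made explicit in a small helper. A minimal sketch, reusing the config keys from the snippet (which are proposals in this issue, not an established Amundsen configuration contract), in which the owner override wins:

```python
# Sketch only: config keys mirror the proposal above; they are not an
# actual Amundsen configuration contract.
def is_editable(results: dict, config: dict, current_user: str) -> bool:
    # Start from the schema-level rule ...
    editable = results["schema"] not in config.get("UNEDITABLE_SCHEMAS", [])
    # ... then let the owner override take precedence.
    if config.get("ALLOW_ADMINS_TO_EDIT") and current_user in results.get("owners", []):
        editable = True
    return editable

config = {"UNEDITABLE_SCHEMAS": ["core"], "ALLOW_ADMINS_TO_EDIT": True}
print(is_editable({"schema": "core", "owners": ["alice"]}, config, "alice"))  # True: owner override wins
```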

Add instructions for BigQuery Extractor

Could we please add some setup instructions for the BigQuery metadata extractor and usage extractor? Instructions could include the IAM roles and permissions the extractor needs in order to access and extract metadata and usage information. An example response from BigQuery for metadata and usage would be ideal.

The terms `schema` and `db` have multiple meanings and can be confusing

The terminology used for the Amundsen table URI is a bit confusing, and people can interpret it in multiple ways. We should make it consistent with how these terms are normally used.

As per @feng-tao here: I think the schema term in Amundsen is the same as database (e.g. CREATE DATABASE), while database is used to identify the system (Hive, Redshift, etc.).

And Amundsen is using these terms to generate table URN {db}://{cluster}.{schema}/{table}

In my opinion both terms are overloaded: schema usually means a table's columns, yet in Amundsen we use it as the database in <this thing>.table, while database means a physical database yet is also used as the system in the URI. I think we need a third term here.

I propose to use a different term for the system itself, and use db instead of schema.
Something like this: {system|type|db_type}://{cluster}.{db}/{table}

@feng-tao @jinhyukchang
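Whichever names win, a small parser makes the current layout concrete. A sketch of parsing the `{db}://{cluster}.{schema}/{table}` pattern (assuming, for illustration, that cluster names contain no dots):

```python
import re

# Sketch of the current Amundsen table URI layout; the proposed rename
# would only change which component is called what.
URI_PATTERN = re.compile(
    r"^(?P<db>[^:/]+)://(?P<cluster>[^.]+)\.(?P<schema>[^/]+)/(?P<table>.+)$"
)

def parse_table_uri(uri: str) -> dict:
    match = URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"not a table URI: {uri}")
    return match.groupdict()

print(parse_table_uri("hive://gold.core/fact_rides"))
# {'db': 'hive', 'cluster': 'gold', 'schema': 'core', 'table': 'fact_rides'}
```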

Redshift spectrum metadata extraction is not supported by postgres extractor

Figured this out yesterday while working on some spectrum tables; the current query in amundsen-io/amundsendatabuilder#63 works for native redshift tables (which is what I tested with earlier), but does not work with spectrum tables (as spectrum metadata is stored outside sql standard INFORMATION_SCHEMA). svv_external_columns should provide access to the equivalent external tables metadata.

If/when I have some time I'll add a PR, or if anyone else wants to tackle in the meantime - have at it 😉.

Use entity IDs where possible instead of using the table URI

For the most part, Amundsen metadata relies on the table_uri passed from the frontend, which adds complexity and extra processing for systems that support IDs.
I am proposing to use the IDs of entities like Table, Column, and User for the different kinds of operations. This information can be passed from metadata (when available, None otherwise) and simply be returned from the frontend for each operation.
Example Use Case:

  • Get Column Description:
    At the moment, the frontend sends the table_uri and the column name. I need to parse the table_uri, find the table details, find the column based on the column name, get the column detail entity, and then return the description. If an ID were available for the column, we could simply use it to fetch the record and return the description.

The same is the case for PUT and the table detail itself. If we had the ID, we would not need to parse the table_uri.

Just out of curiosity, how does Neo4j handle IDs? Does it support them?

Document config through env vars

For this repo and the metadata + search repos, much of what is in localConfig.py can also be configured through environment variables. The documentation should reflect that as an alternative to editing localConfig, which might require rebuilding Python packages and/or Docker images.

Related Slack thread: https://amundsenworkspace.slack.com/archives/CHQERT0D7/p1563544756136000?thread_ts=1563303575.106800&channel=CHQERT0D7&message_ts=1563544756.136000
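The environment-variable alternative is a one-liner per setting. A sketch (the variable name below is illustrative, not a documented Amundsen config key):

```python
import os

# Sketch of the alternative being requested: read a setting from an
# environment variable, falling back to the localConfig-style default.
# The variable name SEARCHSERVICE_BASE is illustrative.
def setting(name: str, default: str) -> str:
    return os.environ.get(name, default)

search_base = setting("SEARCHSERVICE_BASE", "http://localhost:5001")
```

This way the same image can be pointed at different backends per environment, with no rebuild.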

Can we add a pre-push git hook to help enforce code style checks?

Amundsen runs flake8, mypy, and tsc to enforce Python + TypeScript style checks and to catch stray warnings. For us, this enforcement takes the form of a pre-push git hook in addition to our CI:

#!/bin/sh

red='\033[0;31m'
green='\033[0;32m'
yellow='\033[0;33m'
NC='\033[0m'

# Get only the files different on this branch
BASE_SHA="$(git merge-base master HEAD)"
FILES="$(git diff --name-only --diff-filter=AMC $BASE_SHA HEAD | grep "\.py$" | tr '\n' ' ')"

echo "${green}[Python Checks][Info]: Checking Python Style, Types${NC}"

if [ -n "$FILES" ]
then
  echo "${green}[Python Checks][Info]: ${FILES}${NC}"

  # Run flake8
  venv/bin/flake8 .

  if [ $? -ne 0 ]; then
    echo "${red}[Python Style][Error]: Fix the issues and commit again (or commit with --no-verify if you are sure)${NC}"
    exit 1
  fi

  # Run mypy
  venv/bin/mypy .

  if [ $? -ne 0 ]; then
    echo "${red}[Python Type Checks][Error]: Fix the issues and commit again (or commit with --no-verify if you are sure)${NC}"
    exit 1
  fi
else
  echo "${green}[Python Checks][Info]: No files to check${NC}"
fi

TS_FILES=$(git diff --name-only --diff-filter=AMC "$BASE_SHA" HEAD | grep "\.tsx\{0,1\}$")
echo "${green}[Typescript Checks][Info]: Checking Typescript Style, Types${NC}"

if [ -n "$TS_FILES" ]
then
  cd amundsen_application/static
  npm run tsc

  if [ $? -ne 0 ]; then
    echo "${red}[Typescript Type Checks][Error]: Fix the issues and commit again (or commit with --no-verify if you are sure)${NC}"
    exit 1
  fi
else
  echo "${green}[Typescript Checks][Info]: No files to check${NC}"
fi

exit 0

It would be nice to have this in amundsen proper too, since I occasionally run into linting differences when pulling from upstream.

Unnecessary code - proxy/elasticsearch.py vs api/table.py

Hi,
in my opinion, the code below can't be reached and the default value is set from different places.

https://github.com/lyft/amundsensearchlibrary/blob/ab13456f29f1effbe7360e300af7c469fe9fe320/search_service/proxy/elasticsearch.py#L219-L220

elasticsearch.py is called from table.py (below), but a default value is already set in table.py; if I set the default value in table.py as a param of add_argument, the else condition in elasticsearch.py can never be reached.

https://github.com/lyft/amundsensearchlibrary/blob/ea46e71de6e9e345a4b0207f364d2e9afc110cc6/search_service/api/table.py#L28

https://github.com/lyft/amundsensearchlibrary/blob/ea46e71de6e9e345a4b0207f364d2e9afc110cc6/search_service/api/table.py#L43

Am I right that this should be fixed, or is it on purpose and I just can't see the reason?
I would prefer to keep the default value only in the config and remove it from add_argument in table.py.

Thanks

Position databuilder more clearly as an ETL tool(box) in its own right

Background originated in Q&A on this PR amundsen-io/amundsendatabuilder#31 (comment)


Hence making databuilder not aware of the search service could help to grow the library into a more generic way.

Yes, that framing certainly puts a different perspective on databuilder. I guess what fooled me into that train of thinking was it being included under the Amundsen naming umbrella etc. instead of it being a lower level thing “closer to the metal” more in its own right. Perhaps longer term worth explaining more — somehow...

Oh, wait; Just found this right at the top of the README.md under concept:

Amundsen Databuilder is a ETL framework for Amundsen ...

(bolding text is mine)

You guys certainly nudged me a bit into my former perspective. :) I guess it could sit on a list like https://github.com/pawl/awesome-etl. Of the tools on ⬅️ that list that I'm familiar with, it reminds me most of https://singer.io, Pandas, or Embulk.

  • Care to share whether you think it's got a unique value proposition compared to its competitors (i.e. why it was built in the first place)?

    Just came across a nice example of such a WHY statement in yet-another-ETL-tool at https://metalpipe.readthedocs.io/en/latest/overview.html#what-isn-t-it

  • Do you think it’s worth it for me to submit a PR or two trying to boost databuilder slightly more in its own right? (Maybe better to spend our time on the 3 Amundsen services)
  • One immediate thing that comes to mind is databuilder (or maybe more the 3 Amundsen services) might benefit from having a CLI wrapper for something like the example Amundsen script, then you could:
    pip install amundsendatabuilder
    ->
    databuilder loadcsv <CSVurl> targetURL <publisherURL> ... parameters ...

Document local workflow

Problem

We want to document how someone would go about getting set up to develop locally in Amundsen.

  1. This is not intuitive for newcomers given our micro service architecture and multiple repositories/submodules.
  2. This is also difficult for folks who do not have Amundsen stood up in their company's internal development environments. For example at Lyft we are able to develop and test features with real data while hooked up to our internal infrastructure.

We have our docker-compose script that helps stand up a default version of Amundsen, but we currently lack any instructions on how to take what is accessible via this repository and set up an environment where a developer can modify and test changes in any part of Amundsen.

Goal

Create support for and document how to get Amundsen running locally, with the ability to build and test local changes with dummy data.

What I'm Considering

  1. @jornh has mentioned a hack that involves modifying docker-amundsen.yml to point the image tag of the service in question at some other dummy tag name, building the container for that service with the local changes, and re-running docker-compose -f docker-amundsen-atlas.yml up. (Jorn, please correct me or give more detail if I misrepresented this idea.) Two problems I faced here:

    • I couldn't get it to work, highly likely due to missing some detail or hack with docker-compose.
    • Folks have to deal with the overhead it takes to build the local container and to re-run the docker-compose script.
  2. I then took a stab at something that worked better than I thought it would.

    • Stood up amundsenfrontend locally on port 5000 (according to instructions in the repo)
    • Stood up amundsensearch locally on port 5001
    • Stood up amundsenmetadata locally on port 5002
    • Ran docker-compose -f docker-amundsen-atlas.yml up, which only started the neo4j_amundsen and es_amundsen containers because the other ports were already in use.
    • Visiting localhost:5000 now actually worked the same way it does for the quickstart, except that instead of the 5 Docker containers working together, it is now my local copies of amundsenfrontend, amundsenmetadata, and amundsensearch utilizing the neo4j_amundsen and es_amundsen containers from the quickstart, so I actually get some dummy data.
    • Made some changes in my local copy of amundsenfrontendlibrary and amundsenmetadatalibrary for kicks, could now see my changes locally.
    • But what about instructions for amundsendatabuilderlibrary? TBD...my understanding is that the logic in databuilder can be tested locally via a couple different methods and I'm thinking the best place to document that is in the docs for that repo.

What I'm Proposing

To document the second approach I described above, then file some issues for the community to help us make this better. Some such ideas include:

  1. Creating a script to do all the steps in the second option above. Besides changing the metadata and search wsgi's to run on different ports, all I did was change directories into each submodule and copy paste the local setup commands from each repo.
  2. Richer dummy data. This local development approach still uses the neo4j and elasticsearch containers from the quick start, so the task of improving that will improve not just the quickstart, but standalone local development as well.

Enable --strictNullChecks

Enabling --strictNullChecks helps us better leverage TypeScript to ensure that if we write code that attempts to access a property of a null or undefined variable, the error will be caught. Expect that many of our type definitions will have to be updated as well.

https://www.typescriptlang.org/docs/handbook/release-notes/typescript-2-0.html

Right now, we are using betterer to keep track of regressions on this side. This ticket will be done once we are ready to remove our betterer check.

Production Installation guidelines

It would be great to have some documentation for guidelines to install Amundsen in production. Maybe with some Helm Chart to deploy it in a Kubernetes Cluster. Thanks!

Add proper CLI parameter handling to the Quickstart sample loader script

A bunch of input parameters are hard-coded, like this: https://github.com/lyft/amundsendatabuilder/blob/v1.3.6/example/scripts/sample_data_loader.py#L37-L43

As is the Elasticsearch host a few lines above. It would be much more elegant to be able to run Databuilder on any host and point it at another host (or hosts): the defaults could stay at localhost, but the script could take input parameters.

@worldwise001 you mentioned work on alternative loading APIs in Slack. Not sure if these two things would affect each other or not. But maybe worth considering both together?

@jinhyukchang etc. I think this is a candidate for a First Issue tag
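A minimal argparse sketch of what that could look like, with defaults staying at localhost but becoming overridable (flag names here are illustrative, not from the actual script):

```python
import argparse

# Hedged sketch of CLI parameter handling for the sample loader.
# Flag names are illustrative, not from the real sample_data_loader.py.
def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Load sample data into Amundsen")
    parser.add_argument("--es-host", default="localhost",
                        help="Elasticsearch host (default: localhost)")
    parser.add_argument("--neo4j-endpoint", default="bolt://localhost:7687",
                        help="Neo4j bolt endpoint (default: local quickstart)")
    return parser.parse_args(argv)

args = parse_args(["--es-host", "es.internal"])
print(args.es_host, args.neo4j_endpoint)
```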

Add sample high/low watermark data to quick start

We use

SELECT From_unixtime(A0.create_time) as create_time,
               C0.NAME as schema_name,
               B0.tbl_name as table_name,
               {func}(A0.part_name) as part_name,
               {watermark} as part_type
        FROM   PARTITIONS A0
               LEFT OUTER JOIN TBLS B0
                            ON A0.tbl_id = B0.tbl_id
               LEFT OUTER JOIN DBS C0
                            ON B0.db_id = C0.db_id
        WHERE  C0.NAME IN {schemas}
               AND B0.tbl_type IN ( 'EXTERNAL_TABLE', 'MANAGED_TABLE' )
               AND A0.PART_NAME NOT LIKE '%%__HIVE_DEFAULT_PARTITION__%%'
        GROUP  BY C0.NAME, B0.tbl_name
        ORDER by create_time desc

as the Hive metastore query to get the high and low watermarks for a Hive table. We have already included the data model in https://github.com/lyft/amundsendatabuilder/blob/master/databuilder/models/hive_watermark.py .

We should provide sample watermark data in CSV format for the quick start. The UI already supports displaying watermark data.

DataPreview needs rendering improvements

The data preview component needs several UI updates. While maintaining the same look and feel, we want to:

  1. Render a table that keeps fixed headers when scrolling vertically
  2. Fit the height of the rendered content to the modal -- the horizontal scroll bar should be visible without having to scroll down.
  3. Support the ability to render non-primitive values. For example, values like numbers, strings, booleans render fine. However if the value is something like a Map<string, int>, we do not currently render a good representation of that data.

We may want to consider leveraging a 3rd-party table component for this, as long as it can be styled to match our current look and feel.
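For point 3, one possible backend-side workaround (an assumption on my part, not a plan from the team) is to serialize non-primitive preview values to JSON strings, so the frontend always receives something it can render:

```python
import json

# Hedged sketch: pass primitives through untouched and serialize anything
# else (maps, lists, etc.) to a JSON string for display.
def to_renderable(value):
    if isinstance(value, (str, int, float, bool)) or value is None:
        return value
    return json.dumps(value, default=str, sort_keys=True)

print(to_renderable({"rides": 3, "city": "SF"}))  # '{"city": "SF", "rides": 3}'
```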

Frontend configurations to be set using API

In order to support all the customizations of the frontend (i.e. the logo path, etc., mentioned here) when it is installed from a pip package (already compiled), we need to pass those configs from the backend, store them in React's store, and use them instead of hardcoding them in frontend .js files.

@ttannis @danwom FYI.

Design lineage UI

Lineage information should be displayed in the UI. Atlas shows the following (screenshot: Atlas lineage view, 2019-04-28), which should be made more "Amundsen"-like.

Would like a guide for How-To deploy Amundsen in production

Please add points on what you expect from such a guide in a comment below. I will then try to consolidate input and draft up an outline in this comment.

The guide can end up as /docs/deployment.md, or is /docs/owners_manual.md better?

Initial outline:

Unique identifier of User entity is email, but should be configurable

In the current implementation, the unique identifier and required field is email, and I think that should be configurable. Alternatively, we could introduce a new required field, username, and make email optional. In cases where username and email address are the same, simply fill both attributes with the email address.

Use Case:
At ING, we use corporate keys, which are fully unique and cannot be changed, whereas an email address can change (for example, a name change after marriage, or a change in the domain part based on geolocation). To handle that scenario we could use the corporate key as the unique identifier. It would make no sense, and might confuse users, if we sent the corporate key in the email address field; there would then also be no attribute left for the actual email address.

Happy to discuss in detail and make the change in both frontend and backend once we finalize this.

@feng-tao @jinhyukchang
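A sketch of the proposal (field names illustrative, not Amundsen's actual user model): a required, immutable user_id plus an optional email, with both set to the email address where they coincide:

```python
from dataclasses import dataclass
from typing import Optional

# Hedged sketch of the proposed split: user_id is the required unique
# key (e.g. a corporate key), email becomes optional and mutable.
@dataclass
class User:
    user_id: str              # immutable unique identifier
    email: Optional[str] = None

# ING-style: corporate key as the ID, email stored separately.
ing_user = User(user_id="ab12cd", email="jane.doe@ing.com")
# Current behavior: where they coincide, both are the email address.
email_user = User(user_id="jdoe@example.com", email="jdoe@example.com")
```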

Experiences running amundsen on latest osx

Could be helpful to have everything in one place, so documenting this here.

When using brew and the latest of everything on macOS, I got errors when running npm install:

../../nan/nan_implementation_12_inl.h:103:42: error: no viable conversion from 'v8::Isolate *' to 'v8::Local<v8::Context>'

This is right after installing npm using brew, which installs node version 12.

I downgraded to node version 10:

brew uninstall node@12
brew install node@10
brew link --force --overwrite node@10

Then npm install as normal, which gave me another error:

Node Sass could not find a binding for your current environment: OS X 64-bit with Node.js 10.x

This got fixed by:

npm rebuild node-sass

Then I managed to build everything as usual.

running amundsen in ec2 results in error

  1. Create an ec2 instance (I am working with t2.xlarge)
  2. switch to root (sudo su)
  3. run the following commands
yum update -y && \
yum install -y docker && \
yum install -y git && \
service docker start && \
sudo usermod -aG docker ec2-user && \
mkdir /amundsen && \
cd /amundsen && \
git clone https://github.com/lyft/amundsenfrontendlibrary && \
git clone https://github.com/lyft/amundsenmetadatalibrary && \
git clone https://github.com/lyft/amundsensearchlibrary && \
cd /amundsen/amundsenmetadatalibrary/ && \
docker build -f public.Dockerfile -t amundsendev/amundsen-metadata:latest . && \
cd /amundsen/amundsenfrontendlibrary/ && \
docker build -f public.Dockerfile -t amundsendev/amundsen-frontend:latest . && \
cd /amundsen/amundsensearchlibrary/ && \
docker build -f public.Dockerfile -t amundsendev/amundsen-search:latest . && \
cd /amundsen && \
curl -L "https://github.com/docker/compose/releases/download/1.24.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose && \
chmod +x /usr/local/bin/docker-compose && \
ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose && \
docker-compose -f amundsenfrontendlibrary/docker-amundsen.yml up

Note the following errors:

es_amundsen         | [1]: max file descriptors [40000] for elasticsearch process is too low, increase to at least [65535]
es_amundsen         | [2]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

Support versioning of Tables in amundsen

To explore historical metadata, we should support versioning of tables in Amundsen so we can understand what changed in each version of a table.
There would be a detail page for each version of a table, including the column information of that particular version.

Apache Atlas provides this information in the form of "Deleted" Table and Columns entities, and can be filtered out using the entityStatus attribute.

Misleading variables in databuilder readme

Hi,
I would like to fix the examples in the databuilder readme. Maybe there is a reason I just don't see, but this is probably just a copy-paste issue.

There are some variables in the readme https://github.com/lyft/amundsendatabuilder/blob/master/README.md, for example in the Elasticsearch publisher:

tmp_folder = '/var/tmp/amundsen/dummy_metadata'
node_files_folder = '{tmp_folder}/nodes/'.format(tmp_folder=tmp_folder)
relationship_files_folder = '{tmp_folder}/relationships/'.format(tmp_folder=tmp_folder)

but those variables are not used in the example code, which was a bit misleading for me at the beginning. I would fix it by removing the unused variables or by actually using them in the example code.
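For reference, the fix could go either way; a sketch of actually wiring the variables into a job configuration (the config keys below are illustrative, check the databuilder readme for the real ones):

```python
# The readme's variables, now actually used to build the job config
# instead of being defined and then ignored.
tmp_folder = '/var/tmp/amundsen/dummy_metadata'
node_files_folder = '{tmp_folder}/nodes/'.format(tmp_folder=tmp_folder)
relationship_files_folder = '{tmp_folder}/relationships/'.format(tmp_folder=tmp_folder)

# Illustrative config keys; the real ones come from the databuilder readme.
job_config = {
    'loader.filesystem_csv_neo4j.node_dir_path': node_files_folder,
    'loader.filesystem_csv_neo4j.relationship_dir_path': relationship_files_folder,
}
print(job_config['loader.filesystem_csv_neo4j.node_dir_path'])
```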
