
aws-samples / accelerated-data-lake


A packaged Data Lake solution that builds a highly functional Data Lake, with a data catalog queryable via Elasticsearch.

License: Apache License 2.0

Languages: Python 99.30%, Shell 0.70%

accelerated-data-lake's People

Contributors

cchew, grusy, jpeddicord, paulmacey1, sukenshah, tpbrogan


accelerated-data-lake's Issues

Acceptable error threshold [Enhancement]

What is an error threshold?

The accelerated data lake framework follows a strict validation policy: a data file is considered failed if it contains even a single validation error. The error threshold is a mechanism to allow an acceptable number of validation errors per data file.

Does the error threshold apply to all validations?

The error threshold applies only to semantic validation rules (e.g. number ranges and enumeration values). A data file is still rejected if it contains syntactic errors.

How is error threshold configured?

The error threshold can be configured via the data source config file. It is expressed as a percentage of allowed errors relative to the total number of records in the file.
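A minimal sketch of what this might look like in the data source config, expressed here as a Python dict (the errorThreshold key and the surrounding keys are hypothetical; the real key names would be fixed when the enhancement is implemented). With a 2% threshold, a 10,000-record file would tolerate up to 200 semantic validation errors:

    # Hypothetical data source config entry (the real config is a JSON
    # document; key names here are illustrative only).
    data_source_config = {
        "contentType": "text/csv",
        # Proposed enhancement: reject the file only if more than 2% of
        # its records fail semantic validation (number ranges,
        # enumeration values, ...). Syntactic errors still reject the file.
        "errorThreshold": 2.0,
    }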

When would I need an error threshold?

An error threshold is useful when dealing with low-quality data files: customers can still build the data lake without having to correct the errors in every file.

cc @paulmacey1 @tpbrogan

Update to default botocore impacting streaming to Elasticsearch

An update to the botocore version bundled inside Lambda prevents the Elasticsearch streaming function from working. You will see an error in the CloudWatch logs similar to "cannot import name 'BotocoreHTTPSession'".

This can be fixed by adding a Lambda Layer specific to the region the function is running in:

ap-northeast-1: arn:aws:lambda:ap-northeast-1:249908578461:layer:AWSLambda-Python-AWS-SDK:1
us-east-1: arn:aws:lambda:us-east-1:668099181075:layer:AWSLambda-Python-AWS-SDK:1
ap-southeast-1: arn:aws:lambda:ap-southeast-1:468957933125:layer:AWSLambda-Python-AWS-SDK:1
eu-west-1: arn:aws:lambda:eu-west-1:399891621064:layer:AWSLambda-Python-AWS-SDK:1
us-west-1: arn:aws:lambda:us-west-1:325793726646:layer:AWSLambda-Python-AWS-SDK:1
ap-east-1: arn:aws:lambda:ap-east-1:118857876118:layer:AWSLambda-Python-AWS-SDK:1
ap-northeast-2: arn:aws:lambda:ap-northeast-2:296580773974:layer:AWSLambda-Python-AWS-SDK:1
ap-northeast-3: arn:aws:lambda:ap-northeast-3:961244031340:layer:AWSLambda-Python-AWS-SDK:1
ap-south-1: arn:aws:lambda:ap-south-1:631267018583:layer:AWSLambda-Python-AWS-SDK:1
ap-southeast-2: arn:aws:lambda:ap-southeast-2:817496625479:layer:AWSLambda-Python-AWS-SDK:1
ca-central-1: arn:aws:lambda:ca-central-1:778625758767:layer:AWSLambda-Python-AWS-SDK:1
eu-central-1: arn:aws:lambda:eu-central-1:292169987271:layer:AWSLambda-Python-AWS-SDK:1
eu-north-1: arn:aws:lambda:eu-north-1:642425348156:layer:AWSLambda-Python-AWS-SDK:1
eu-west-2: arn:aws:lambda:eu-west-2:142628438157:layer:AWSLambda-Python-AWS-SDK:1
eu-west-3: arn:aws:lambda:eu-west-3:959311844005:layer:AWSLambda-Python-AWS-SDK:1
sa-east-1: arn:aws:lambda:sa-east-1:640010853179:layer:AWSLambda-Python-AWS-SDK:1
us-east-2: arn:aws:lambda:us-east-2:259788987135:layer:AWSLambda-Python-AWS-SDK:1
us-west-2: arn:aws:lambda:us-west-2:420165488524:layer:AWSLambda-Python-AWS-SDK:1
cn-north-1: arn:aws-cn:lambda:cn-north-1:683298794825:layer:AWSLambda-Python-AWS-SDK:1
cn-northwest-1: arn:aws-cn:lambda:cn-northwest-1:382066503313:layer:AWSLambda-Python-AWS-SDK:1
us-gov-west-1: arn:aws-us-gov:lambda:us-gov-west-1:556739011827:layer:AWSLambda-Python-AWS-SDK:1
us-gov-east-1: arn:aws-us-gov:lambda:us-gov-east-1:138526772879:layer:AWSLambda-Python-AWS-SDK:1
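
As a sketch, the layer can also be attached programmatically with boto3. The function name below is a placeholder for the deployed Elasticsearch streaming Lambda, and note that update_function_configuration replaces the function's entire layer list:

    import boto3

    # Attach the region-specific AWS SDK layer to the streaming function.
    # FunctionName is a placeholder -- substitute the name of the
    # Elasticsearch streaming Lambda deployed in your stack.
    lambda_client = boto3.client("lambda", region_name="us-east-1")

    lambda_client.update_function_configuration(
        FunctionName="elasticsearch-streaming-function",
        Layers=[
            "arn:aws:lambda:us-east-1:668099181075:layer:AWSLambda-Python-AWS-SDK:1"
        ],
    )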

Visualisation Template File

When generating the Visualisation Lambdas with sam package --template-file ./lambdaDeploy.yaml --output-template-file lambdaDeployCFN.yaml, the template file created has an incorrect AWSTemplateFormatVersion line at the end of the file and won't deploy (the \xEF\xBB\xBF prefix below is a UTF-8 byte order mark):

"\xEF\xBB\xBFAWSTemplateFormatVersion": '2010-09-09')

When changed to

"AWSTemplateFormatVersion": '2010-09-09')

I was able to deploy. This was also happening for the Staging Engine, but when I downloaded the latest version yesterday the Staging Engine template no longer had that problem.
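
A one-off workaround, assuming the embedded byte order mark is the only problem, is to strip it from the generated template before deploying. Note the BOM here sits mid-file (inside the AWSTemplateFormatVersion key), so reading with the utf-8-sig codec, which only strips a leading BOM, would not be enough:

    # Remove embedded UTF-8 byte order marks (decoded as U+FEFF) from the
    # generated template, then write it back as plain UTF-8.
    with open("lambdaDeployCFN.yaml", encoding="utf-8") as f:
        template = f.read()

    with open("lambdaDeployCFN.yaml", "w", encoding="utf-8") as f:
        f.write(template.replace("\ufeff", ""))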

Multipart Upload Causing Loop

I have been working on implementing the accelerated data lake and was having an issue with larger files that were uploaded via multipart upload.

Once the process hit the staging-engine-AttachTagsAndMetaDataToFi Lambda step, it would start the whole process again for the same file, over and over. If I disabled multipart upload through the AWS CLI and uploaded a large file, the process went through fine. After some testing I was able to narrow it down to this section of the Lambda:

    s3.copy(
        copy_source, bucket, key,
        ExtraArgs={"Metadata": metadata, "MetadataDirective": "REPLACE"}
    )

Changing it to this stopped the looping problem:

    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource=copy_source,
        Metadata=metadata,
        MetadataDirective='REPLACE'
    )
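
A plausible explanation, offered as an assumption rather than a confirmed root cause: s3.copy is boto3's managed transfer helper and switches to multipart copy for large objects, so the copy completes with an s3:ObjectCreated:CompleteMultipartUpload event -- the same event type as the original large-file upload -- which re-triggers the staging pipeline. copy_object issues a single CopyObject API call (limited to objects up to 5 GB), which surfaces as s3:ObjectCreated:Copy instead. A self-contained sketch of the fix, with a hypothetical helper name:

    import boto3

    s3 = boto3.client("s3")

    def replace_metadata(bucket, key, metadata):
        # Hypothetical helper illustrating the fix: copy the object onto
        # itself with a single CopyObject call, replacing its metadata,
        # without going through the managed (multipart) transfer path.
        s3.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
            Metadata=metadata,
            MetadataDirective="REPLACE",
        )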

Updating existing file metadata tags

Hi,

I saw in the presentation describing the accelerated data lake that updating the DataSource in DynamoDB should work retrospectively on existing data, e.g. if I add a metadata tag, all existing files would be updated to contain this metadata tag. I don't seem to be able to get this to work.

Is this actually a feature of the accelerated data lake?
Is it possible to update all existing files with new or updated metadata?

Thanks in advance

Sample cannot be queried through Athena

Whilst the sample "rydebooking-1234567890.json" is successfully ingested and staged, it cannot be queried by Athena, which does not support multi-line JSON. Attempting to query results in an error:

HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]

Stripping the newlines from the input file allows you to query via Athena.
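
As a sketch, the sample can be flattened to the one-object-per-line form that Athena's JSON SerDe expects (this assumes the file contains a single JSON object, which the error message suggests):

    import json

    # Re-serialise the pretty-printed sample as a single line so Athena's
    # JSON SerDe (one JSON object per line) can parse it.
    with open("rydebooking-1234567890.json") as f:
        record = json.load(f)

    with open("rydebooking-1234567890.json", "w") as f:
        f.write(json.dumps(record) + "\n")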

Suggest adding a further step at the end of the instructions covering querying via Athena.
