aws-samples / accelerated-data-lake
A packaged Data Lake solution that builds a highly functional Data Lake, with a data catalog queryable via Elasticsearch.
License: Apache License 2.0
The accelerated data lake framework follows a strict validation policy: a data file is considered failed if it contains even a single validation error. The error threshold is a mechanism that allows an acceptable number of validation errors per data file.
The error threshold applies only to semantic validation rules, e.g. number ranges, enumeration values, etc. A data file is still rejected if it contains syntactic errors.
The error threshold can be configured via the data source config file. It is expressed as a percentage of allowed errors relative to the total number of records in the file.
The error threshold is useful when dealing with low-quality data files: customers can still build the data lake without having to correct the errors in every file.
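As a rough sketch of the rule described above (the function and parameter names are illustrative, not the framework's actual API):

```python
def passes_error_threshold(semantic_error_count, total_records, threshold_percent):
    """Return True if the file's semantic error rate is within the threshold.

    Files containing syntactic errors should be rejected before this check,
    since the threshold applies to semantic validation errors only.
    """
    if total_records == 0:
        return semantic_error_count == 0
    error_rate = 100.0 * semantic_error_count / total_records
    return error_rate <= threshold_percent
```

For example, with a 5% threshold, a 100-record file with 2 semantic errors is accepted, while one with 10 errors is rejected.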
An update to the botocore version inside Lambda prevents the Elasticsearch streaming function from working. You will see an error in the CloudWatch logs similar to "cannot import name 'BotocoreHTTPSession'".
This can be fixed by adding a Lambda Layer specific to the region the function is running in:
ap-northeast-1: arn:aws:lambda:ap-northeast-1:249908578461:layer:AWSLambda-Python-AWS-SDK:1
us-east-1: arn:aws:lambda:us-east-1:668099181075:layer:AWSLambda-Python-AWS-SDK:1
ap-southeast-1: arn:aws:lambda:ap-southeast-1:468957933125:layer:AWSLambda-Python-AWS-SDK:1
eu-west-1: arn:aws:lambda:eu-west-1:399891621064:layer:AWSLambda-Python-AWS-SDK:1
us-west-1: arn:aws:lambda:us-west-1:325793726646:layer:AWSLambda-Python-AWS-SDK:1
ap-east-1: arn:aws:lambda:ap-east-1:118857876118:layer:AWSLambda-Python-AWS-SDK:1
ap-northeast-2: arn:aws:lambda:ap-northeast-2:296580773974:layer:AWSLambda-Python-AWS-SDK:1
ap-northeast-3: arn:aws:lambda:ap-northeast-3:961244031340:layer:AWSLambda-Python-AWS-SDK:1
ap-south-1: arn:aws:lambda:ap-south-1:631267018583:layer:AWSLambda-Python-AWS-SDK:1
ap-southeast-2: arn:aws:lambda:ap-southeast-2:817496625479:layer:AWSLambda-Python-AWS-SDK:1
ca-central-1: arn:aws:lambda:ca-central-1:778625758767:layer:AWSLambda-Python-AWS-SDK:1
eu-central-1: arn:aws:lambda:eu-central-1:292169987271:layer:AWSLambda-Python-AWS-SDK:1
eu-north-1: arn:aws:lambda:eu-north-1:642425348156:layer:AWSLambda-Python-AWS-SDK:1
eu-west-2: arn:aws:lambda:eu-west-2:142628438157:layer:AWSLambda-Python-AWS-SDK:1
eu-west-3: arn:aws:lambda:eu-west-3:959311844005:layer:AWSLambda-Python-AWS-SDK:1
sa-east-1: arn:aws:lambda:sa-east-1:640010853179:layer:AWSLambda-Python-AWS-SDK:1
us-east-2: arn:aws:lambda:us-east-2:259788987135:layer:AWSLambda-Python-AWS-SDK:1
us-west-2: arn:aws:lambda:us-west-2:420165488524:layer:AWSLambda-Python-AWS-SDK:1
cn-north-1: arn:aws-cn:lambda:cn-north-1:683298794825:layer:AWSLambda-Python-AWS-SDK:1
cn-northwest-1: arn:aws-cn:lambda:cn-northwest-1:382066503313:layer:AWSLambda-Python-AWS-SDK:1
us-gov-west-1: arn:aws-us-gov:lambda:us-gov-west-1:556739011827:layer:AWSLambda-Python-AWS-SDK:1
us-gov-east-1: arn:aws-us-gov:lambda:us-gov-east-1:138526772879:layer:AWSLambda-Python-AWS-SDK:1
When generating the Visualisation Lambdas with sam package --template-file ./lambdaDeploy.yaml --output-template-file lambdaDeployCFN.yaml
the template file created has an incorrect TemplateFormatVersion line at the end of the file and won't deploy:
"\xEF\xBB\xBFAWSTemplateFormatVersion": '2010-09-09'
When changed to
"AWSTemplateFormatVersion": '2010-09-09'
I was able to deploy. This was also happening for the Staging Engine, but when I downloaded the latest version yesterday the Staging Engine template didn't have that problem.
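The \xEF\xBB\xBF prefix is the UTF-8 byte-order mark (BOM), which CloudFormation does not accept in front of the key. One way to strip it from a generated template (a minimal sketch; the file name is taken from the command above and the helper name is mine):

```python
def strip_bom(path):
    """Remove a UTF-8 byte-order mark (BOM) from the start of a file, if present."""
    with open(path, "rb") as f:
        data = f.read()
    bom = b"\xef\xbb\xbf"
    if data.startswith(bom):
        with open(path, "wb") as f:
            f.write(data[len(bom):])

# Usage, assuming the output file from the sam package command:
# strip_bom("lambdaDeployCFN.yaml")
```

Re-saving the file from an editor as "UTF-8 without BOM" achieves the same thing.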
I have been working on implementing the accelerated data lake and ran into an issue with larger files that used multipart upload.
Once the process hit the staging-engine-AttachTagsAndMetaDataToFi Lambda step, it would restart the whole process for the same file over and over again. If I disabled multipart upload through the AWS CLI and uploaded a large file, the process went through fine. After some testing I was able to narrow it down to this section in the Lambda:
s3.copy(copy_source, bucket, key, ExtraArgs={"Metadata": metadata, "MetadataDirective": "REPLACE"})
Changing it to this stopped the looping problem:
s3.copy_object(Bucket=bucket, Key=key, CopySource=copy_source, Metadata=metadata, MetadataDirective='REPLACE')
Hi,
I saw in the presentation describing the accelerated data lake that updating the DataSource in DynamoDB should work retrospectively on existing data, e.g. if I add a metadata tag, all existing files would be updated to contain this metadata tag. I don't seem to be able to get this to work.
Is this actually a feature of the accelerated data lake?
Is it possible to update all existing files with new or updated metadata?
Thanks in advance
Whilst the sample "rydebooking-1234567890.json" is successfully ingested and staged, it cannot be queried by Athena, because Athena does not support multi-line JSON. Attempting to query results in an error:
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]
Stripping the newlines from the input file allows you to query via Athena.
I suggest adding a further step at the end of the instructions covering querying via Athena.
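Until such a step exists, one way to strip the newlines is to re-serialise the document onto a single line, since Athena's JSON SerDe expects one JSON object per line (a minimal sketch; function and file names are illustrative):

```python
import json


def flatten_json_file(src_path, dst_path):
    """Re-serialise a pretty-printed JSON document onto a single line,
    the form Athena's JSON SerDe expects (one JSON object per line)."""
    with open(src_path) as src:
        doc = json.load(src)
    with open(dst_path, "w") as dst:
        dst.write(json.dumps(doc, separators=(",", ":")) + "\n")


# Usage, assuming the sample file from above:
# flatten_json_file("rydebooking-1234567890.json", "rydebooking-flat.json")
```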