aws-samples / accelerated-data-lake
A packaged Data Lake solution that builds a highly functional Data Lake, with a data catalog queryable via Elasticsearch.
License: Apache License 2.0
The accelerated data lake framework follows a strict validation policy: a data file is considered failed if it contains even a single validation error. The error threshold is a mechanism that allows an acceptable number of validation errors per data file.
The error threshold applies only to semantic validation rules, e.g. number ranges, enumeration values, etc. A data file is still rejected if it contains syntactic errors.
The error threshold can be configured via the data source config file. It is expressed as a percentage of allowed errors relative to the total number of records in the file.
The error threshold is useful when dealing with low-quality data files: customers can still build the data lake without having to correct the errors in every file.
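As a rough sketch of the rule described above (the function and parameter names are illustrative, not the framework's actual API):

```python
def passes_error_threshold(semantic_error_count, total_records, threshold_percent):
    """Return True if the file's semantic error rate is within the threshold.

    Files containing syntactic errors should be rejected before this check,
    since the threshold applies to semantic validation errors only.
    """
    if total_records == 0:
        return semantic_error_count == 0
    error_rate = 100.0 * semantic_error_count / total_records
    return error_rate <= threshold_percent
```

For example, with a 5% threshold, a 100-record file with 2 semantic errors is accepted, while one with 10 errors is rejected.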
An update to the botocore version inside Lambda prevents the Elasticsearch streaming function from working. You will see an error in the CloudWatch logs similar to "cannot import name 'BotocoreHTTPSession'".
This can be fixed by adding a Lambda Layer specific to the region the function is running in:
ap-northeast-1: arn:aws:lambda:ap-northeast-1:249908578461:layer:AWSLambda-Python-AWS-SDK:1
us-east-1: arn:aws:lambda:us-east-1:668099181075:layer:AWSLambda-Python-AWS-SDK:1
ap-southeast-1: arn:aws:lambda:ap-southeast-1:468957933125:layer:AWSLambda-Python-AWS-SDK:1
eu-west-1: arn:aws:lambda:eu-west-1:399891621064:layer:AWSLambda-Python-AWS-SDK:1
us-west-1: arn:aws:lambda:us-west-1:325793726646:layer:AWSLambda-Python-AWS-SDK:1
ap-east-1: arn:aws:lambda:ap-east-1:118857876118:layer:AWSLambda-Python-AWS-SDK:1
ap-northeast-2: arn:aws:lambda:ap-northeast-2:296580773974:layer:AWSLambda-Python-AWS-SDK:1
ap-northeast-3: arn:aws:lambda:ap-northeast-3:961244031340:layer:AWSLambda-Python-AWS-SDK:1
ap-south-1: arn:aws:lambda:ap-south-1:631267018583:layer:AWSLambda-Python-AWS-SDK:1
ap-southeast-2: arn:aws:lambda:ap-southeast-2:817496625479:layer:AWSLambda-Python-AWS-SDK:1
ca-central-1: arn:aws:lambda:ca-central-1:778625758767:layer:AWSLambda-Python-AWS-SDK:1
eu-central-1: arn:aws:lambda:eu-central-1:292169987271:layer:AWSLambda-Python-AWS-SDK:1
eu-north-1: arn:aws:lambda:eu-north-1:642425348156:layer:AWSLambda-Python-AWS-SDK:1
eu-west-2: arn:aws:lambda:eu-west-2:142628438157:layer:AWSLambda-Python-AWS-SDK:1
eu-west-3: arn:aws:lambda:eu-west-3:959311844005:layer:AWSLambda-Python-AWS-SDK:1
sa-east-1: arn:aws:lambda:sa-east-1:640010853179:layer:AWSLambda-Python-AWS-SDK:1
us-east-2: arn:aws:lambda:us-east-2:259788987135:layer:AWSLambda-Python-AWS-SDK:1
us-west-2: arn:aws:lambda:us-west-2:420165488524:layer:AWSLambda-Python-AWS-SDK:1
cn-north-1: arn:aws-cn:lambda:cn-north-1:683298794825:layer:AWSLambda-Python-AWS-SDK:1
cn-northwest-1: arn:aws-cn:lambda:cn-northwest-1:382066503313:layer:AWSLambda-Python-AWS-SDK:1
us-gov-west-1: arn:aws-us-gov:lambda:us-gov-west-1:556739011827:layer:AWSLambda-Python-AWS-SDK:1
us-gov-east-1: arn:aws-us-gov:lambda:us-gov-east-1:138526772879:layer:AWSLambda-Python-AWS-SDK:1
When generating the Visualisation Lambdas with sam package --template-file ./lambdaDeploy.yaml --output-template-file lambdaDeployCFN.yaml
the template file created has an incorrect TemplateFormatVersion line at the end of the file and won't deploy:
"\xEF\xBB\xBFAWSTemplateFormatVersion": '2010-09-09'
When changed to
"AWSTemplateFormatVersion": '2010-09-09'
I was able to deploy. This was also happening for the Staging Engine, but when I downloaded the latest version yesterday the Staging Engine template didn't have that problem.
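The \xEF\xBB\xBF prefix is the UTF-8 byte-order mark (BOM), which CloudFormation does not accept in front of the key. One way to strip it from a generated template (a minimal sketch; the file name is taken from the command above and the helper name is mine):

```python
def strip_bom(path):
    """Remove a UTF-8 byte-order mark (BOM) from the start of a file, if present."""
    with open(path, "rb") as f:
        data = f.read()
    bom = b"\xef\xbb\xbf"
    if data.startswith(bom):
        with open(path, "wb") as f:
            f.write(data[len(bom):])

# Usage, assuming the output file from the sam package command:
# strip_bom("lambdaDeployCFN.yaml")
```

Re-saving the file from an editor as "UTF-8 without BOM" achieves the same thing.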
I have been working on implementing the accelerated data lake and ran into an issue with larger files that used multipart upload.
Once the process hit the staging-engine-AttachTagsAndMetaDataToFi Lambda step, it would restart the whole process for the same file over and over again. If I disabled multipart upload through the AWS CLI and uploaded a large file, the process went through fine. After some testing I was able to narrow it down to this section in the Lambda:
s3.copy(copy_source, bucket, key, ExtraArgs={"Metadata": metadata, "MetadataDirective": "REPLACE"})
Changing it to this stopped the looping problem:
s3.copy_object(Bucket=bucket, Key=key, CopySource=copy_source, Metadata=metadata, MetadataDirective='REPLACE')
Hi,
I saw in the presentation describing the accelerated data lake that updating the DataSource in DynamoDB should work retrospectively on existing data, e.g. if I add a metadata tag, all existing files would be updated to contain this metadata tag. I don't seem to be able to get this to work.
Is this actually a feature of the accelerated data lake?
Is it possible to update all existing files with new or updated metadata?
Thanks in advance
Whilst the sample "rydebooking-1234567890.json" is successfully ingested and staged, it cannot be queried by Athena, because Athena does not support multi-line JSON. Attempting to query results in an error:
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]
Stripping the newlines from the input file allows you to query via Athena.
I suggest adding a further step at the end of the instructions covering querying via Athena.
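Until such a step exists, one way to strip the newlines is to re-serialise the document onto a single line, since Athena's JSON SerDe expects one JSON object per line (a minimal sketch; function and file names are illustrative):

```python
import json


def flatten_json_file(src_path, dst_path):
    """Re-serialise a pretty-printed JSON document onto a single line,
    the form Athena's JSON SerDe expects (one JSON object per line)."""
    with open(src_path) as src:
        doc = json.load(src)
    with open(dst_path, "w") as dst:
        dst.write(json.dumps(doc, separators=(",", ":")) + "\n")


# Usage, assuming the sample file from above:
# flatten_json_file("rydebooking-1234567890.json", "rydebooking-flat.json")
```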