
data-pipeline-samples's People

Contributors

autumnschild, aws-austin-lee, awsdan, chappidim, hyandell, jamesiri, jotok, jsposato, mbeitchman, mosheeshel, reecestart, rshade, ryanmaclean, sandhyae, sepulworld, soupsez, stmcpherson, tendril, vinayaktaws


data-pipeline-samples's Issues

Failed to open native connection (DataStax): dse spark-submit

I am using a ShellCommandActivity to first copy the script from S3 and then execute it.
The resource is an m3.xlarge instance (paravirtualization).
Running dse spark-submit fails to open a native connection (DataStax).
The error is:
Exception in thread "main" java.io.IOException: Failed to open native connection to Cassandra at {xxx.xxx.xxx.xxx}:9042
Connectivity to the host and port has been checked and is fine.
This is a standalone (non-EMR) DataStax cluster, and the shell activity above runs on a driver machine.
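
For reference, a minimal sketch of what such a ShellCommandActivity script might look like (the bucket path, script name, and Cassandra host below are placeholders, not taken from the report):

    #!/bin/bash
    set -euo pipefail
    # Fetch the Spark job from S3 (placeholder bucket/key), then run it with DSE.
    aws s3 cp s3://my-bucket/jobs/cassandra-job.py /tmp/cassandra-job.py
    # Point the Spark Cassandra connector at the cluster's native transport port (9042).
    dse spark-submit \
      --conf spark.cassandra.connection.host=xxx.xxx.xxx.xxx \
      /tmp/cassandra-job.py

If a manual run of the same command succeeds on the driver machine, comparing the environment and credentials the task runner executes under is a reasonable next step.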

3-node backup fails on Backup Part 1

Hello,

I am trying to back up an EFS volume of about 5 TB, and the Data Pipeline fails on Backup Part 1 with the following error.

Unable to create resource for @EC2Resource1_2017-08-30T05:56:06 due to: Your quota allows for 0 more running instance(s). You requested at least 1 (Service: AmazonEC2; Status Code: 400; Error Code: InstanceLimitExceeded; Request ID: 0585067a-e291-472a-8581-a2a5108a2cdd)

The account is well within its m3.xlarge instance limits, yet it still fails.

AMI - ami-0188776c

I am new to Data Pipeline, and any guidance is appreciated.

Thanks,
Hemanth

3-Node-EFSBackupPipeline.txt
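
One way to check the account-level running-instance limit that this error refers to is the EC2 account attributes API (a quick sketch; the region is a placeholder and should be the one the pipeline launches into):

    # Show the account-wide cap on running On-Demand instances in this region.
    aws ec2 describe-account-attributes \
      --attribute-names max-instances \
      --region us-east-1

If the reported maximum looks fine, the limit that was hit may be specific to the instance type or to the region the EC2Resource is created in, since EC2 limits are tracked per region.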

RedshiftPassword variable is prefixed with an asterisk

    "username": "#{myRedshiftUsername}",
    "*password": "#{*myRedshiftPassword}"

    {
      "description": "The password for the above user to establish connection to the Redshift cluster.",
      "id": "*myRedshiftPassword",
      "type": "String"
    }

EFSBackup runs into timeout when mounting EFS volumes

Hello

I am facing a timeout when applying the Data Pipeline template for EFS backups. That symptom usually suggests a misconfigured security group; however, I manually launched an EC2 instance belonging to mySrcSecGroupID and myBackupSecGroupID, and accessing both EFS volumes from it worked fine.

StdErr.log is attached below.
Thanks,
Peter

--2016-09-22 08:52:24-- https://s3-us-west-2.amazonaws.com/XXXXXX/efs-backup.sh
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 54.231.169.16
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|54.231.169.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2986 (2.9K) [application/x-sh]
Saving to: ‘efs-backup.sh’

 0K ..                                                    100% 93.5M=0s

2016-09-22 08:52:24 (93.5 MB/s) - ‘efs-backup.sh’ saved [2986/2986]

mount.nfs: Connection timed out
mount.nfs: Connection timed out
rm: cannot remove ‘/tmp/efs-backup.log’: No such file or directory
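
When debugging this kind of mount.nfs timeout, a quick manual check from the instance the pipeline actually launches can confirm whether NFS traffic is reachable at all (a hedged sketch; the mount-target IP and mount point are placeholders):

    # Check TCP reachability of the EFS mount target on the NFS port (2049).
    nc -zv 10.0.0.10 2049
    # Attempt the mount with a bounded wait so a blocked security group fails fast.
    sudo mkdir -p /mnt/efs-test
    sudo timeout 60 mount -t nfs4 -o nfsvers=4.1 10.0.0.10:/ /mnt/efs-test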

Not working with Data Pipeline as of Nov 8

This script stopped working for us on Data Pipeline on or around Nov 8, 2016 (using the AWS walkthrough approach with Data Pipeline). We also can't get it running on a new instance. I'm not sure what changed and am still investigating; it seems the instance created by Data Pipeline can't see the mounts.

The mount commands weren't throwing any timeout errors, so I spun up an EC2 instance with the same AMI that Data Pipeline uses. The mount command works the first time, but no files appear on the share (they do on a "standard" EC2 AMI using the same mount command). If I unmount and run the command a second time, it hangs and doesn't time out (even after 20 minutes).

I will keep investigating, but for now we just have the efs-backup.sh command running on a t2.micro as a cron job (which works fine).
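
For reference, the cron workaround might look roughly like this (a minimal sketch; the schedule, path, and log file are placeholders, and the script arguments should be the same ones the pipeline's ShellCommandActivity passed):

    # Nightly at 02:00; pass efs-backup.sh the same arguments the pipeline used.
    0 2 * * * /home/ec2-user/efs-backup.sh <same arguments as in the pipeline> >> /var/log/efs-backup.log 2>&1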

AWS EFS restore fails: ./efs-restore.sh: line 22: [: too many arguments

Hello,

I am testing EFS with Data Pipeline, following the EFS backup sample.
I can back up the EFS, but every attempt to restore it has failed. The EFS is only about 200 MB.

Here is the log from S3:

27 Jul 2017 08:42:35,923 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.taskrunner.TaskPoller: Executing: amazonaws.datapipeline.activity.ShellCommandActivity@31e1783b
27 Jul 2017 08:42:36,027 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.objects.CommandRunner: Executing command: wget https://raw.githubusercontent.com/awslabs/data-pipeline-samples/master/samples/EFSBackup/efs-restore.sh
chmod a+x efs-restore.sh
./efs-restore.sh $1 $2 $3 $4 $5
27 Jul 2017 08:42:36,042 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.objects.CommandRunner: configure ApplicationRunner with stdErr file: output/logs/df-02164142M6NFNIT11Y63/ShellCommandActivityObj/@ShellCommandActivityObj_2017-07-27T08:40:29/@ShellCommandActivityObj_2017-07-27T08:40:29_Attempt=1/StdError  and stdout file :output/logs/df-02164142M6NFNIT11Y63/ShellCommandActivityObj/@ShellCommandActivityObj_2017-07-27T08:40:29/@ShellCommandActivityObj_2017-07-27T08:40:29_Attempt=1/StdOutput
27 Jul 2017 08:42:36,043 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.objects.CommandRunner: Executing command: output/tmp/df-02164142M6NFNIT11Y63-1f87ddb121394f15b1d638c67340a48e/ShellCommandActivityObj20170727T084029Attempt1_command.sh with env variables :{} with argument : [10.1.2.200:/, 10.1.2.251:/, daily, 0, backup-fs-12345678]
27 Jul 2017 08:42:38,569 [ERROR] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.connector.staging.StageFromS3Connector: Script returned with exit status 23
27 Jul 2017 08:42:38,605 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.taskrunner.LogMessageUtil: Returning tail errorMsg :--2017-07-27 08:42:36--  https://raw.githubusercontent.com/awslabs/data-pipeline-samples/master/samples/EFSBackup/efs-restore.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.52.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.52.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1474 (1.4K) [text/plain]
Saving to: ‘efs-restore.sh’

     0K .                                                     100%  127M=0s

2017-07-27 08:42:36 (127 MB/s) - ‘efs-restore.sh’ saved [1474/1474]

./efs-restore.sh: line 22: [: too many arguments
rsync: change_dir "/mnt/backups/backup-fs-12345678/daily.0" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6]
27 Jul 2017 08:42:38,606 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.taskrunner.HeartBeatService: Finished waiting for heartbeat thread @ShellCommandActivityObj_2017-07-27T08:40:29_Attempt=1
27 Jul 2017 08:42:38,606 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.taskrunner.TaskPoller: Work ShellCommandActivity took 0:2 to complete

The configuration file is nothing special:
myImageID uses the Amazon Linux AMI 2017.03.1 (PV), ami-98f3e7e1
myInstanceType uses t1.micro

Thanks for your answer
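
For context, "[: too many arguments" in bash is commonly caused by an unquoted variable that expands to more than one word (or to nothing) inside a test. A minimal, generic illustration of the failure and the usual fix, not the actual contents of efs-restore.sh line 22:

    #!/bin/bash
    interval="daily 0"                  # a value containing a space
    if [ $interval = "daily" ]; then    # unquoted: becomes [ daily 0 = daily ] -> "too many arguments"
      echo "matched"
    fi
    if [ "$interval" = "daily" ]; then  # quoted: always a single word, the test behaves
      echo "matched"
    fi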

(DynamoDB->Redshift) Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.

Hey, for the "RedshiftCopyActivityFromDynamoDBTable" sample, I followed exactly the same steps as described. However, the pipeline always gives me the error "java.lang.RuntimeException: org.postgresql.util.PSQLException: Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections."

If I use SQL Workbench with the JDBC driver directly, the same command works. It just doesn't work from Data Pipeline.
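
Since the JDBC connection works from a workstation but not from the pipeline, a quick check from the pipeline's EC2 resource can tell whether the Redshift cluster's security group admits connections from it (a hedged sketch; the endpoint is a placeholder, and 5439 is Redshift's default port):

    # Run on the pipeline's EC2 resource: check TCP reachability of the Redshift endpoint.
    nc -zv my-cluster.abc123xyz0.us-east-1.redshift.amazonaws.com 5439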

How to run with a standalone cluster

If I have my own standalone Spark cluster with HDFS/YARN configured, what changes are required to run this code?
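
In general, pointing a Spark job at your own YARN-backed cluster rather than an EMR-managed one comes down to how it is submitted; a hedged, generic sketch (the class name and jar path are placeholders, not the sample's actual entry point):

    # Submit against the cluster's own YARN resource manager.
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.MyJob \
      /path/to/my-job.jar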

EFSBackup: rsync error: received SIGINT, SIGTERM, or SIGHUP

Hello,

The rsync process gets killed for an unknown reason; please see the log attached below. The production EFS volume has 50 GB of data, and the backup volume ends up with approximately 17 GB of backup data before rsync is killed.

Thanks
Peter

--2016-09-23 13:24:14-- https://s3-us-west-2.amazonaws.com/xxx/aws/efsbackup/efs-backup.sh
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 54.231.168.196
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|54.231.168.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2986 (2.9K) [application/x-sh]
Saving to: ‘efs-backup.sh’

 0K ..                                                    100% 90.7M=0s

2016-09-23 13:24:14 (90.7 MB/s) - ‘efs-backup.sh’ saved [2986/2986]

rm: cannot remove ‘/tmp/efs-backup.log’: No such file or directory
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(544) [sender=3.0.6]
rsync: writefd_unbuffered failed to write 97 bytes to socket [generator]: Broken pipe (32)
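
If something keeps interrupting the transfer, rsync can at least be made to resume cleanly on the next run. A minimal sketch with placeholder paths (these are standard rsync options, not necessarily the flags efs-backup.sh uses):

    # -a preserves attributes; --partial keeps partially transferred files so a
    # re-run resumes instead of starting over; --timeout aborts a stalled socket.
    rsync -a --partial --timeout=300 /mnt/efs-source/ /mnt/efs-backup/daily.0/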

Just some Boto guidance

Hi,

Poking around in your code looking for useful Boto examples, I noticed that you explicitly delete S3 buckets provisioned by a CloudFormation stack.

https://github.com/awslabs/data-pipeline-samples/blob/master/setup/stacker.py#L79

if r.resource_type == "AWS::S3::Bucket":
    if not s3:
        s3 = boto3.resource("s3")
    # Empty the bucket; CloudFormation cannot delete a non-empty bucket.
    bucket = s3.Bucket(r.physical_resource_id)
    for key in bucket.objects.all():
        key.delete()

I was wondering why you felt the need to explicitly delete S3 buckets that were provisioned by CloudFormation. Are they not handled by stack.delete()?

Thanks
Terry

Fails to execute jar file in export DynamoDB to CSV

Data Pipeline newbie, any thoughts as to what is causing this error?
amazonaws.datapipeline.taskrunner.LogMessageUtil: Returning tail errorMsg : at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:275)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:227)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:430)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:366)
at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:463)
at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:479)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:759)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:697)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:636)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Job Submission failed with exception 'java.lang.NullPointerException(null)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Job 0: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec

EFS backup data pipeline bad format

I've been trying to use the script https://github.com/aws-samples/data-pipeline-samples/blob/master/samples/EFSBackup/efs-backup.sh to make my EFS backups. Even though Data Pipeline reported the run as healthy, the stderr file shows:

mount.nfs: remote share not in 'host:dir' format

When I ran the mount manually, it showed the same message, and I realized that the EFS mount command format has changed from

sudo mount -t nfs -o nfsvers=4.1 -o rsize=1048576 -o wsize=1048576 -o timeo=600 -o retrans=2 -o hard {efs-ip-addr} /backup

to

sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport {efs-id}.efs.eu-west-1.amazonaws.com:/ /backup

It took me a while to figure this out. Or am I doing something wrong, and is the older command still valid?
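
For what it's worth, on AMIs where the amazon-efs-utils package is available, the EFS mount helper avoids spelling out the NFS options at all (a hedged sketch; the file system ID and mount point are placeholders):

    # Install the EFS mount helper and mount by file system ID.
    sudo yum install -y amazon-efs-utils
    sudo mount -t efs fs-12345678:/ /backup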

Issue with the billing sample

Hi,

I was testing your billing sample, but apparently it doesn't work anymore.
It breaks when creating the folder at this step: "directoryPath": "#{myS3StagingLoc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}"

It would be useful to have this sample fixed.

Thanks for your help.

Regards,
Julien.

PostgreSQL -> Redshift?

Is there any working sample / template for loading PostgreSQL data into Redshift?
What is the ideal way to handle schema creation and deleted / updated data?

DynamoDBImportCSV: CSV file format

Hi guys,

Can you please tell me the correct CSV file format for the DynamoDBImportCSV script?
Comma-separated only?
Are headers mandatory?

Thanks for your answer ;)
Cheers

MD5 not working correctly with PostgreSQL

Hi,

I'm moving a small amount of data to Redshift on a daily basis. The data is copied to Redshift by a shell script that uses psql to insert data from a CSV file.

Since it runs every day and pulls data from the last week, a lot of duplicate rows get inserted. To avoid this, I compute an MD5 hash for each row and, using that hash, insert only the new rows and ignore the duplicates. But psql is not computing the hash correctly: when I compute row_hash with the same query from SQLWorkbench it works fine, but not with psql.

The shell script that performs this task is stored in S3.

Code-wise everything is fine, because when I execute the same query from the Workbench, I don't see any problem.

Thanks in advance.
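
For context, the dedup pattern described above usually looks roughly like this when driven through psql (a minimal sketch with hypothetical table and column names, not the reporter's actual script):

    #!/bin/bash
    # Insert only rows whose MD5 row hash is not already present in the target table.
    # Connection string, table names, and columns are all placeholders.
    psql "$REDSHIFT_CONNECTION_STRING" <<'SQL'
    INSERT INTO target_table (col_a, col_b, row_hash)
    SELECT s.col_a, s.col_b, MD5(s.col_a || '|' || s.col_b)
    FROM staging_table s
    WHERE MD5(s.col_a || '|' || s.col_b) NOT IN (SELECT row_hash FROM target_table);
    SQL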
