
data-pipeline-samples's People

Contributors

autumnschild, aws-austin-lee, awsdan, chappidim, hyandell, jamesiri, jotok, jsposato, mbeitchman, mosheeshel, reecestart, rshade, ryanmaclean, sandhyae, sepulworld, soupsez, stmcpherson, tendril, vinayaktaws


data-pipeline-samples's Issues

Failed to open native connection (DataStax): dse spark-submit

I am using a ShellCommandActivity to first copy the script from S3 and then execute it.
The resource is an m3.xlarge instance (paravirtualization).
Running dse spark-submit fails to open a native connection (DataStax).
The error is:
Exception in thread "main" java.io.IOException: Failed to open native connection to Cassandra at {xxx.xxx.xxx.xxx}:9042
Connectivity to the host and port has been checked and is fine.
This is a standalone (non-EMR) DataStax cluster, and the shell activity above runs on a driver machine.
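
For reference, a minimal sketch of what such a ShellCommandActivity script might look like (the bucket path, script name, and Cassandra host below are placeholders, not taken from the report):

    #!/bin/bash
    set -euo pipefail
    # Fetch the Spark job from S3 (placeholder bucket/key), then run it with DSE.
    aws s3 cp s3://my-bucket/jobs/cassandra-job.py /tmp/cassandra-job.py
    # Point the Spark Cassandra connector at the cluster's native transport port (9042).
    dse spark-submit \
      --conf spark.cassandra.connection.host=xxx.xxx.xxx.xxx \
      /tmp/cassandra-job.py

If a manual run of the same command succeeds on the driver machine, comparing the environment and credentials the task runner executes under is a reasonable next step.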

3-node backup fails on Backup Part 1

Hello,

I am trying to back up an EFS volume of about 5 TB, and the Data Pipeline fails on Backup Part 1 with the following error.

Unable to create resource for @EC2Resource1_2017-08-30T05:56:06 due to: Your quota allows for 0 more running instance(s). You requested at least 1 (Service: AmazonEC2; Status Code: 400; Error Code: InstanceLimitExceeded; Request ID: 0585067a-e291-472a-8581-a2a5108a2cdd)

The account is well within its m3.xlarge instance limits, yet it still fails.

AMI - ami-0188776c

I am new to Data Pipeline, and any guidance is appreciated.

Thanks,
Hemanth

3-Node-EFSBackupPipeline.txt
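
One way to check the account-level running-instance limit that this error refers to is the EC2 account attributes API (a quick sketch; the region is a placeholder and should be the one the pipeline launches into):

    # Show the account-wide cap on running On-Demand instances in this region.
    aws ec2 describe-account-attributes \
      --attribute-names max-instances \
      --region us-east-1

If the reported maximum looks fine, the limit that was hit may be specific to the instance type or to the region the EC2Resource is created in, since EC2 limits are tracked per region.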

RedshiftPassword variable is prefixed with an asterisk

    "username": "#{myRedshiftUsername}",
    "*password": "#{*myRedshiftPassword}"

    {
      "description": "The password for the above user to establish connection to the Redshift cluster.",
      "id": "*myRedshiftPassword",
      "type": "String"
    }

EFSBackup runs into timeout when mounting EFS volumes

Hello

I am facing a timeout when applying the Data Pipeline template for EFS backups. That symptom usually suggests a misconfigured security group; however, I manually launched an EC2 instance belonging to mySrcSecGroupID and myBackupSecGroupID, and accessing both EFS volumes from it worked fine.

StdErr.log is attached below.
Thanks,
Peter

--2016-09-22 08:52:24-- https://s3-us-west-2.amazonaws.com/XXXXXX/efs-backup.sh
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 54.231.169.16
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|54.231.169.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2986 (2.9K) [application/x-sh]
Saving to: ‘efs-backup.sh’

 0K ..                                                    100% 93.5M=0s

2016-09-22 08:52:24 (93.5 MB/s) - ‘efs-backup.sh’ saved [2986/2986]

mount.nfs: Connection timed out
mount.nfs: Connection timed out
rm: cannot remove ‘/tmp/efs-backup.log’: No such file or directory
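
When debugging this kind of mount.nfs timeout, a quick manual check from the instance the pipeline actually launches can confirm whether NFS traffic is reachable at all (a hedged sketch; the mount-target IP and mount point are placeholders):

    # Check TCP reachability of the EFS mount target on the NFS port (2049).
    nc -zv 10.0.0.10 2049
    # Attempt the mount with a bounded wait so a blocked security group fails fast.
    sudo mkdir -p /mnt/efs-test
    sudo timeout 60 mount -t nfs4 -o nfsvers=4.1 10.0.0.10:/ /mnt/efs-test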

Not working with Data Pipeline as of Nov 8

This script stopped working for us on Data Pipeline on or around Nov 8, 2016 (using the AWS walkthrough approach with Data Pipeline). We also can't get it running on a new instance. I'm not sure what changed and am still investigating; it seems the instance created by Data Pipeline can't see the mounts.

The mount commands weren't throwing any timeout errors, so I spun up an EC2 instance with the same AMI that Data Pipeline uses. The mount command works the first time, but no files appear on the share (they do on a "standard" EC2 AMI using the same mount command). If I unmount and run the command a second time, it hangs and doesn't time out (even after 20 minutes).

I will keep investigating, but for now we just have the efs-backup.sh command running on a t2.micro as a cron job (which works fine).
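
For reference, the cron workaround might look roughly like this (a minimal sketch; the schedule, path, and log file are placeholders, and the script arguments should be the same ones the pipeline's ShellCommandActivity passed):

    # Nightly at 02:00; pass efs-backup.sh the same arguments the pipeline used.
    0 2 * * * /home/ec2-user/efs-backup.sh <same arguments as in the pipeline> >> /var/log/efs-backup.log 2>&1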

AWS EFS restore fails: ./efs-restore.sh: line 22: [: too many arguments

Hello,

I am testing EFS with Data Pipeline, following the EFS backup sample.
I can back up the EFS, but every attempt to restore it has failed. The EFS is only about 200 MB.

Here is the log from S3:

27 Jul 2017 08:42:35,923 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.taskrunner.TaskPoller: Executing: amazonaws.datapipeline.activity.ShellCommandActivity@31e1783b
27 Jul 2017 08:42:36,027 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.objects.CommandRunner: Executing command: wget https://raw.githubusercontent.com/awslabs/data-pipeline-samples/master/samples/EFSBackup/efs-restore.sh
chmod a+x efs-restore.sh
./efs-restore.sh $1 $2 $3 $4 $5
27 Jul 2017 08:42:36,042 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.objects.CommandRunner: configure ApplicationRunner with stdErr file: output/logs/df-02164142M6NFNIT11Y63/ShellCommandActivityObj/@ShellCommandActivityObj_2017-07-27T08:40:29/@ShellCommandActivityObj_2017-07-27T08:40:29_Attempt=1/StdError  and stdout file :output/logs/df-02164142M6NFNIT11Y63/ShellCommandActivityObj/@ShellCommandActivityObj_2017-07-27T08:40:29/@ShellCommandActivityObj_2017-07-27T08:40:29_Attempt=1/StdOutput
27 Jul 2017 08:42:36,043 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.objects.CommandRunner: Executing command: output/tmp/df-02164142M6NFNIT11Y63-1f87ddb121394f15b1d638c67340a48e/ShellCommandActivityObj20170727T084029Attempt1_command.sh with env variables :{} with argument : [10.1.2.200:/, 10.1.2.251:/, daily, 0, backup-fs-12345678]
27 Jul 2017 08:42:38,569 [ERROR] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.connector.staging.StageFromS3Connector: Script returned with exit status 23
27 Jul 2017 08:42:38,605 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.taskrunner.LogMessageUtil: Returning tail errorMsg :--2017-07-27 08:42:36--  https://raw.githubusercontent.com/awslabs/data-pipeline-samples/master/samples/EFSBackup/efs-restore.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.52.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.52.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1474 (1.4K) [text/plain]
Saving to: ‘efs-restore.sh’

     0K .                                                     100%  127M=0s

2017-07-27 08:42:36 (127 MB/s) - ‘efs-restore.sh’ saved [1474/1474]

./efs-restore.sh: line 22: [: too many arguments
rsync: change_dir "/mnt/backups/backup-fs-12345678/daily.0" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6]
27 Jul 2017 08:42:38,606 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.taskrunner.HeartBeatService: Finished waiting for heartbeat thread @ShellCommandActivityObj_2017-07-27T08:40:29_Attempt=1
27 Jul 2017 08:42:38,606 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.taskrunner.TaskPoller: Work ShellCommandActivity took 0:2 to complete

The configuration file is nothing special:
myImageID uses the Amazon Linux AMI 2017.03.1 (PV), ami-98f3e7e1
myInstanceType uses t1.micro

Thanks for your answer
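
For context, "[: too many arguments" in bash is commonly caused by an unquoted variable that expands to more than one word (or to nothing) inside a test. A minimal, generic illustration of the failure and the usual fix, not the actual contents of efs-restore.sh line 22:

    #!/bin/bash
    interval="daily 0"                  # a value containing a space
    if [ $interval = "daily" ]; then    # unquoted: becomes [ daily 0 = daily ] -> "too many arguments"
      echo "matched"
    fi
    if [ "$interval" = "daily" ]; then  # quoted: always a single word, the test behaves
      echo "matched"
    fi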

(DynamoDB->Redshift) Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.

Hey, for the "RedshiftCopyActivityFromDynamoDBTable" sample, I followed exactly the same steps as described. However, the pipeline always gives me the error "java.lang.RuntimeException: org.postgresql.util.PSQLException: Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections."

If I use SQL Workbench with the JDBC driver directly, the same command works. It just doesn't work from Data Pipeline.
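
Since the JDBC connection works from a workstation but not from the pipeline, a quick check from the pipeline's EC2 resource can tell whether the Redshift cluster's security group admits connections from it (a hedged sketch; the endpoint is a placeholder, and 5439 is Redshift's default port):

    # Run on the pipeline's EC2 resource: check TCP reachability of the Redshift endpoint.
    nc -zv my-cluster.abc123xyz0.us-east-1.redshift.amazonaws.com 5439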

How to run with a standalone cluster

If I have my own standalone Spark cluster with HDFS/YARN configured, what changes are required to run this code?
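
In general, pointing a Spark job at your own YARN-backed cluster rather than an EMR-managed one comes down to how it is submitted; a hedged, generic sketch (the class name and jar path are placeholders, not the sample's actual entry point):

    # Submit against the cluster's own YARN resource manager.
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.MyJob \
      /path/to/my-job.jar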

EFSBackup: rsync error: received SIGINT, SIGTERM, or SIGHUP

Hello,

The rsync process gets killed for an unknown reason; please see the log attached below. The production EFS volume has 50 GB of data, and the backup volume ends up with approximately 17 GB of backup data before rsync is killed.

Thanks
Peter

--2016-09-23 13:24:14-- https://s3-us-west-2.amazonaws.com/xxx/aws/efsbackup/efs-backup.sh
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 54.231.168.196
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|54.231.168.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2986 (2.9K) [application/x-sh]
Saving to: ‘efs-backup.sh’

 0K ..                                                    100% 90.7M=0s

2016-09-23 13:24:14 (90.7 MB/s) - ‘efs-backup.sh’ saved [2986/2986]

rm: cannot remove ‘/tmp/efs-backup.log’: No such file or directory
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(544) [sender=3.0.6]
rsync: writefd_unbuffered failed to write 97 bytes to socket [generator]: Broken pipe (32)
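
If something keeps interrupting the transfer, rsync can at least be made to resume cleanly on the next run. A minimal sketch with placeholder paths (these are standard rsync options, not necessarily the flags efs-backup.sh uses):

    # -a preserves attributes; --partial keeps partially transferred files so a
    # re-run resumes instead of starting over; --timeout aborts a stalled socket.
    rsync -a --partial --timeout=300 /mnt/efs-source/ /mnt/efs-backup/daily.0/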

Just some Boto guidance

Hi,

Poking around in your code looking for useful Boto examples, I noticed that you explicitly delete S3 buckets provisioned by a CloudFormation stack.

https://github.com/awslabs/data-pipeline-samples/blob/master/setup/stacker.py#L79

if r.resource_type == "AWS::S3::Bucket":
    if not s3:
        s3 = boto3.resource("s3")
    # Empty the bucket; CloudFormation cannot delete a non-empty bucket.
    bucket = s3.Bucket(r.physical_resource_id)
    for key in bucket.objects.all():
        key.delete()

I was wondering why you felt the need to explicitly delete S3 buckets that were provisioned by CloudFormation. Are they not handled by stack.delete()?

Thanks
Terry

Fails to execute jar file in export DynamoDB to CSV

Data Pipeline newbie, any thoughts as to what is causing this error?
amazonaws.datapipeline.taskrunner.LogMessageUtil: Returning tail errorMsg : at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:275)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:227)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:430)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:366)
at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:463)
at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:479)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:759)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:697)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:636)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Job Submission failed with exception 'java.lang.NullPointerException(null)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Job 0: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec

EFS backup data pipeline bad format

I've been trying to use the script https://github.com/aws-samples/data-pipeline-samples/blob/master/samples/EFSBackup/efs-backup.sh to make my EFS backups. Even though Data Pipeline reported the run as healthy, the stderr file shows:

mount.nfs: remote share not in 'host:dir' format

When I ran the mount manually, it showed the same message, and I realized that the EFS mount command format has changed from

sudo mount -t nfs -o nfsvers=4.1 -o rsize=1048576 -o wsize=1048576 -o timeo=600 -o retrans=2 -o hard {efs-ip-addr} /backup

to

sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport {efs-id}.efs.eu-west-1.amazonaws.com:/ /backup

It took me a while to figure this out. Or am I doing something wrong, and is the older command still valid?
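
For what it's worth, on AMIs where the amazon-efs-utils package is available, the EFS mount helper avoids spelling out the NFS options at all (a hedged sketch; the file system ID and mount point are placeholders):

    # Install the EFS mount helper and mount by file system ID.
    sudo yum install -y amazon-efs-utils
    sudo mount -t efs fs-12345678:/ /backup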

Issue with the billing sample

Hi,

I was testing your billing sample, but apparently it doesn't work anymore.
It breaks when creating the folder at this step: "directoryPath": "#{myS3StagingLoc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}"

It would be useful to have this sample fixed.

Thanks for your help.

Regards,
Julien.

PostgreSQL -> Redshift?

Is there any working sample / template for loading PostgreSQL data into Redshift?
What is the ideal way to handle schema creation and deleted / updated data?

DynamoDBImportCSV: CSV file format

Hi guys,

Can you please tell me the correct CSV file format for the DynamoDBImportCSV script?
Comma-separated only?
Are headers mandatory?

Thanks for your answer ;)
Cheers

MD5 not working correctly with PostgreSQL

Hi,

I'm moving a small amount of data to Redshift on a daily basis. The data is copied to Redshift by a shell script that uses psql to insert data from a CSV file.

Since it runs every day and pulls data from the last week, a lot of duplicate rows get inserted. To avoid this, I compute an MD5 hash for each row and, using that hash, insert only the new rows and ignore the duplicates. But psql is not computing the hash correctly: when I compute row_hash with the same query from SQLWorkbench it works fine, but not with psql.

The shell script that performs this task is stored in S3.

Code-wise everything is fine, because when I execute the same query from the Workbench, I don't see any problem.

Thanks in advance.
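
For context, the dedup pattern described above usually looks roughly like this when driven through psql (a minimal sketch with hypothetical table and column names, not the reporter's actual script):

    #!/bin/bash
    # Insert only rows whose MD5 row hash is not already present in the target table.
    # Connection string, table names, and columns are all placeholders.
    psql "$REDSHIFT_CONNECTION_STRING" <<'SQL'
    INSERT INTO target_table (col_a, col_b, row_hash)
    SELECT s.col_a, s.col_b, MD5(s.col_a || '|' || s.col_b)
    FROM staging_table s
    WHERE MD5(s.col_a || '|' || s.col_b) NOT IN (SELECT row_hash FROM target_table);
    SQL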
