
s4cmd's Introduction

s4cmd

Super S3 command line tool



Author: Chou-han Yang (@chouhanyang)

Current Maintainers: Debodirno Chandra (@debodirno) | Naveen Vardhi (@rozuur) | Navin Pai (@navinpai)


What's New in s4cmd 2.x

  • Fully migrated from the old boto 2.x to the new boto3 library, which provides a more reliable and up-to-date S3 backend.
  • Support for S3 --API-ServerSideEncryption along with 36 new API pass-through options. See the API pass-through options section for the complete list.
  • Support for batch delete (with the delete_objects API) to delete up to 1000 files with a single call. 100+ times faster than sequential deletion.
  • Support for the S4CMD_OPTS environment variable to apply commonly used options, such as --API-ServerSideEncryption, across all your s4cmd operations (see the example after this list).
  • Support for moving files larger than 5GB with multipart upload. 20+ times faster than a sequential move operation when moving large files.
  • Support for timestamp filtering with the --last-modified-before and --last-modified-after options for all operations. Human-friendly timestamps are supported, e.g. --last-modified-before='2 months ago'
  • Faster upload with lazy evaluation of the md5 hash.
  • Listing of large numbers of files with S3 pagination; available memory is the only limit.
  • The new directory-to-directory dsync command is a standalone implementation that replaces the old sync command, which is built on top of the get/put/mv commands. --delete-removed works for all cases: local to S3, S3 to local, and S3 to S3. The sync command preserves its old behavior in this version for compatibility.
  • Support for S3-compatible storage services such as DreamHost and Cloudian using --endpoint-url (Community Supported Beta Feature).
  • Tested on Python 2.7, 3.6, 3.7, 3.8, 3.9, and nightly.
  • Special thanks to onera.com for supporting s4cmd.
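
As a minimal illustration of the S4CMD_OPTS variable mentioned above (the bucket and file names are placeholders), options set in the environment are applied to every s4cmd invocation:

export S4CMD_OPTS="--API-ServerSideEncryption=AES256"
s4cmd put myfile s3://my-bucket/myfile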

Motivation

S4cmd is a command-line utility for accessing Amazon S3, inspired by s3cmd.

We have used s3cmd heavily for a number of scripted, data-intensive applications. However, as the need for a variety of small improvements arose, we created our own implementation, s4cmd. It is intended as an alternative to s3cmd with enhanced performance, better handling of large files, and a number of additional features and fixes that we have found useful.

It strives to be compatible with the most common usage scenarios for s3cmd. It does not offer exact drop-in compatibility, due to a number of corner cases where different behavior seems preferable, or for bugfixes.

Features

S4cmd supports the regular commands you might expect for fetching and storing files in S3: ls, put, get, cp, mv, sync, del, du.

The main features that distinguish s4cmd are:

  • Simple (less than 1500 lines of code) and implemented in pure Python, based on the widely used Boto3 library.
  • Multi-threaded/multi-connection implementation for enhanced performance on all commands. As with many network-intensive applications (like web browsers), accessing S3 in a single-threaded way is often significantly less efficient than having multiple connections actively transferring data at once. In general, we get a 2X boost to upload/download speeds from this. (See the example after this list.)
  • Path handling: S3 is not a traditional filesystem with built-in support for directory structure: internally, there are only objects, not directories or folders. However, most people use S3 in a hierarchical structure, with paths separated by slashes, to emulate traditional filesystems. S4cmd follows conventions to more closely replicate the behavior of traditional filesystems in certain corner cases. For example, "ls" and "cp" work much like in Unix shells, to avoid odd surprises. (For examples see compatibility notes below.)
  • Wildcard support: Wildcards, including multiple levels of wildcards, like in Unix shells, are handled. For example: s3://my-bucket/my-folder/20120512/*/*chunk00?1?
  • Automatic retry: failed tasks are executed again after a delay.
  • Multi-part upload support for files larger than 5GB.
  • Handling of MD5s properly with respect to multi-part uploads (for the sordid details of this, see below).
  • Miscellaneous enhancements and bugfixes:
    • Partial file creation: Avoid creating empty target files if source does not exist. Avoid creating partial output files when commands are interrupted.
    • General thread safety: Tool can be interrupted or killed at any time without being blocked by child threads or leaving incomplete or corrupt files in place.
    • Ensure exit code is nonzero on all failure scenarios (a very important feature in scripts).
    • Expected handling of symlinks (they are followed).
    • Support both s3:// and s3n:// prefixes (the latter is common with Amazon Elastic MapReduce).
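
For example (the paths are placeholders; -c/--num-threads is documented in the control options below), a recursive download can be spread over more threads:

s4cmd get -r -c 16 s3://my-bucket/my-folder/ ./local-folder/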

Limitations:

  • No CloudFront or other feature support.
  • Currently, we simulate sync with get and put with --recursive --force --sync-check.

Installation and Setup

You can install s4cmd from PyPI:

pip install s4cmd
  • Copy or create a symbolic link so you can run s4cmd.py as s4cmd. (It is just a single file!)
  • If you already have a ~/.s3cfg file from configuring s3cmd, credentials from this file will be used. Otherwise, set the S3_ACCESS_KEY and S3_SECRET_KEY environment variables to contain your S3 credentials (see the example after this list).
  • If no keys are provided, but an IAM role is associated with the EC2 instance, it will be used transparently.
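
For example, a minimal environment-variable setup (the key values and bucket name below are placeholders):

export S3_ACCESS_KEY=AKIAXXXXXXXXXXXXXXXX
export S3_SECRET_KEY=YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
s4cmd ls s3://my-bucket/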

s4cmd Commands

s4cmd ls [path]

List all contents of a directory.

  • -r/--recursive: recursively display all contents including subdirectories under the given path.
  • -d/--show-directory: show the directory entry instead of its content.

s4cmd put [source] [target]

Upload local files to S3 (see the example after the option list).

  • -r/--recursive: also upload directories recursively.
  • -s/--sync-check: check md5 hash to avoid uploading the same content.
  • -f/--force: overwrite existing files instead of showing an error message.
  • -n/--dry-run: emulate the operation without real upload.
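
For example (the paths are placeholders; see the compatibility notes below for how directory paths are interpreted), a recursive upload that skips files whose md5 already matches:

s4cmd put -r -s ./local-folder/ s3://my-bucket/my-folder/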

s4cmd get [source] [target]

Download files from S3 to local filesystem.

  • -r/--recursive: also download directories recursively.
  • -s/--sync-check: check md5 hash to avoid downloading the same content.
  • -f/--force: overwrite existing files instead of showing an error message.
  • -n/--dry-run: emulate the operation without real download.

s4cmd dsync [source dir] [target dir]

Synchronize the contents of two directories. Either directory can be local or remote, but syncing two local directories is not currently supported. (See the example after the option list.)

  • -r/--recursive: also sync directories recursively.
  • -s/--sync-check: check md5 hash to avoid syncing the same content.
  • -f/--force: overwrite existing files instead of showing an error message.
  • -n/--dry-run: emulate the operation without real sync.
  • --delete-removed: delete files not in source directory.
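
For example (the paths are placeholders), mirroring a local directory to S3 and removing remote files that no longer exist locally:

s4cmd dsync -r --delete-removed ./local-dir/ s3://my-bucket/backup/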

s4cmd sync [source] [target]

(Obsolete, use dsync instead.) Synchronize the contents of two directories. Either directory can be local or remote, but syncing two local directories is not currently supported. This command simply invokes the get/put/mv commands.

  • -r/--recursive: also sync directories recursively.
  • -s/--sync-check: check md5 hash to avoid syncing the same content.
  • -f/--force: overwrite existing files instead of showing an error message.
  • -n/--dry-run: emulate the operation without real sync.
  • --delete-removed: delete files not in source directory. Only works when syncing local directory to s3 directory.

s4cmd cp [source] [target]

Copy a file or a directory from one S3 location to another.

  • -r/--recursive: also copy directories recursively.
  • -s/--sync-check: check md5 hash to avoid copying the same content.
  • -f/--force: overwrite existing files instead of showing an error message.
  • -n/--dry-run: emulate the operation without real copy.

s4cmd mv [source] [target]

Move a file or a directory from one S3 location to another.

  • -r/--recursive: also move directories recursively.
  • -s/--sync-check: check md5 hash to avoid moving the same content.
  • -f/--force: overwrite existing files instead of showing an error message.
  • -n/--dry-run: emulate the operation without real move.

s4cmd del [path]

Delete files or directories on S3.

  • -r/--recursive: also delete directories recursively.
  • -n/--dry-run: emulate the operation without real delete.
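
For example (the bucket and prefix are placeholders), a dry run first, then the actual recursive delete, which is batched via delete_objects as described in the What's New section:

s4cmd del -r -n s3://my-bucket/old-data/
s4cmd del -r s3://my-bucket/old-data/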

s4cmd du [path]

Get the size of the given directory.

Available parameters:

  • -r/--recursive: also add sizes of sub-directories recursively.

s4cmd Control Options

-p S3CFG, --config=[filename]

path to s3cfg config file

-f, --force

force overwrite of files when downloading or uploading

-r, --recursive

recursively check subdirectories

-s, --sync-check

check file md5 before download or upload

-n, --dry-run

trial run without actual download or upload

-t RETRY, --retry=[integer]

number of retries before giving up

--retry-delay=[integer]

seconds to sleep between retries

-c NUM_THREADS, --num-threads=NUM_THREADS

number of concurrent threads

--endpoint-url

endpoint url used in boto3 client

-d, --show-directory

show directory instead of its content

--ignore-empty-source

ignore empty source from s3

--use-ssl

(obsolete) use SSL connection to S3

--ignore-certificate

ignore SSL certificate verification

--verbose

verbose output

--debug

debug output

--validate

(obsolete) validate lookup operation

-D, --delete-removed

delete remote files that do not exist in source after sync

--multipart-split-size=[integer]

size in bytes to split multipart transfers

--max-singlepart-download-size=[integer]

files with size (in bytes) greater than this will be downloaded in multipart transfers

--max-singlepart-upload-size=[integer]

files with size (in bytes) greater than this will be uploaded in multipart transfers

--max-singlepart-copy-size=[integer]

files with size (in bytes) greater than this will be copied in multipart transfers

--batch-delete-size=[integer]

Number of files (up to 1000) to combine in a single batch delete request.

--last-modified-before=[datetime]

Only act on files whose last-modified date is before the given timestamp.

--last-modified-after=[datetime]

Only act on files whose last-modified date is after the given timestamp.
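
For example (the path is a placeholder), listing only objects older than two months, using the human-friendly timestamp format mentioned in the What's New section:

s4cmd ls -r s3://my-bucket/logs/ --last-modified-before='2 months ago'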

S3 API Pass-through Options

These options are translated directly into boto3 API parameters. Each option is passed only to the API calls that accept it. For example, --API-ServerSideEncryption applies to put_object and create_multipart_upload, but not to list_buckets or get_object; therefore, providing --API-ServerSideEncryption to s4cmd ls has no effect.

For more information, please see the boto3 S3 documentation: http://boto3.readthedocs.io/en/latest/reference/services/s3.html
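
For example (the bucket and file names are placeholders, and the values are illustrative), requesting server-side encryption and a different storage class on upload:

s4cmd put myfile s3://my-bucket/myfile --API-ServerSideEncryption=AES256 --API-StorageClass=REDUCED_REDUNDANCY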

--API-ACL=[string]

The canned ACL to apply to the object.

--API-CacheControl=[string]

Specifies caching behavior along the request/reply chain.

--API-ContentDisposition=[string]

Specifies presentational information for the object.

--API-ContentEncoding=[string]

Specifies what content encodings have been applied to the object and thus what decoding mechanisms must be applied to obtain the media-type referenced by the Content-Type header field.

--API-ContentLanguage=[string]

The language the content is in.

--API-ContentMD5=[string]

The base64-encoded 128-bit MD5 digest of the part data.

--API-ContentType=[string]

A standard MIME type describing the format of the object data.

--API-CopySourceIfMatch=[string]

Copies the object if its entity tag (ETag) matches the specified tag.

--API-CopySourceIfModifiedSince=[datetime]

Copies the object if it has been modified since the specified time.

--API-CopySourceIfNoneMatch=[string]

Copies the object if its entity tag (ETag) is different than the specified ETag.

--API-CopySourceIfUnmodifiedSince=[datetime]

Copies the object if it hasn't been modified since the specified time.

--API-CopySourceRange=[string]

The range of bytes to copy from the source object. The range value must use the form bytes=first-last, where the first and last are the zero-based byte offsets to copy. For example, bytes=0-9 indicates that you want to copy the first ten bytes of the source. You can copy a range only if the source object is greater than 5 GB.

--API-CopySourceSSECustomerAlgorithm=[string]

Specifies the algorithm to use when decrypting the source object (e.g., AES256).

--API-CopySourceSSECustomerKeyMD5=[string]

Specifies the 128-bit MD5 digest of the encryption key according to RFC 1321. Amazon S3 uses this header for a message integrity check to ensure the encryption key was transmitted without error. Please note that this parameter is automatically populated if it is not provided. Including this parameter is not required.

--API-CopySourceSSECustomerKey=[string]

Specifies the customer-provided encryption key for Amazon S3 to use to decrypt the source object. The encryption key provided in this header must be one that was used when the source object was created.

--API-ETag=[string]

Entity tag returned when the part was uploaded.

--API-Expires=[datetime]

The date and time at which the object is no longer cacheable.

--API-GrantFullControl=[string]

Gives the grantee READ, READ_ACP, and WRITE_ACP permissions on the object.

--API-GrantReadACP=[string]

Allows grantee to read the object ACL.

--API-GrantRead=[string]

Allows grantee to read the object data and its metadata.

--API-GrantWriteACP=[string]

Allows grantee to write the ACL for the applicable object.

--API-IfMatch=[string]

Return the object only if its entity tag (ETag) is the same as the one specified, otherwise return a 412 (precondition failed).

--API-IfModifiedSince=[datetime]

Return the object only if it has been modified since the specified time, otherwise return a 304 (not modified).

--API-IfNoneMatch=[string]

Return the object only if its entity tag (ETag) is different from the one specified, otherwise return a 304 (not modified).

--API-IfUnmodifiedSince=[datetime]

Return the object only if it has not been modified since the specified time, otherwise return a 412 (precondition failed).

--API-Metadata=[dict]

A map (as a JSON string) of metadata to store with the object in S3.

--API-MetadataDirective=[string]

Specifies whether the metadata is copied from the source object or replaced with metadata provided in the request.

--API-MFA=[string]

The concatenation of the authentication device's serial number, a space, and the value that is displayed on your authentication device.

--API-RequestPayer=[string]

Confirms that the requester knows that she or he will be charged for the request. Bucket owners need not specify this parameter in their requests. Documentation on downloading objects from requester pays buckets can be found at http://docs.aws.amazon.com/AmazonS3/latest/dev/ObjectsinRequesterPaysBuckets.html

--API-ServerSideEncryption=[string]

The Server-side encryption algorithm used when storing this object in S3 (e.g., AES256, aws:kms).

--API-SSECustomerAlgorithm=[string]

Specifies the algorithm to use when encrypting the object (e.g., AES256).

--API-SSECustomerKeyMD5=[string]

Specifies the 128-bit MD5 digest of the encryption key according to RFC 1321. Amazon S3 uses this header for a message integrity check to ensure the encryption key was transmitted without error. Please note that this parameter is automatically populated if it is not provided. Including this parameter is not required.

--API-SSECustomerKey=[string]

Specifies the customer-provided encryption key for Amazon S3 to use in encrypting data. This value is used to store the object and then it is discarded; Amazon does not store the encryption key. The key must be appropriate for use with the algorithm specified in the x-amz-server-side-encryption-customer-algorithm header.

--API-SSEKMSKeyId=[string]

Specifies the AWS KMS key ID to use for object encryption. All GET and PUT requests for an object protected by AWS KMS will fail if not made via SSL or using SigV4. Documentation on configuring any of the officially supported AWS SDKs and CLI can be found at http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingAWSSDK.html#specify-signature-version

--API-StorageClass=[string]

The type of storage to use for the object. Defaults to 'STANDARD'.

--API-VersionId=[string]

VersionId used to reference a specific version of the object.

--API-WebsiteRedirectLocation=[string]

If the bucket is configured as a website, redirects requests for this object to another object in the same bucket or to an external URL. Amazon S3 stores the value of this header in the object metadata.

Debugging Tips

Simply enable the --debug option to see the full log of s4cmd. If you also need to check which boto3 APIs s4cmd invokes, you can run:

s4cmd --debug [op] .... 2>&1 >/dev/null | grep S3APICALL

to see all the parameters sent to the S3 API.

Compatibility between s3cmd and s4cmd

Prefix matching: In s3cmd, unlike traditional filesystems, prefix names match listings:

>> s3cmd ls s3://my-bucket/ch
s3://my-bucket/charlie/
s3://my-bucket/chyang/

In s4cmd, behavior is the same as with a Unix shell:

>> s4cmd ls s3://my-bucket/ch
(empty)

To get prefix behavior, use explicit wildcards instead: s4cmd ls s3://my-bucket/ch*

Similarly, the sync and cp commands emulate the Unix cp command, so directory-to-directory sync uses different syntax:

>> s3cmd sync s3://bucket/path/dirA s3://bucket/path/dirB/

will copy contents in dirA to dirB.

>> s4cmd sync s3://bucket/path/dirA s3://bucket/path/dirB/

will copy dirA into dirB.

To achieve the s3cmd behavior, use wildcards:

s4cmd sync s3://bucket/path/dirA/* s3://bucket/path/dirB/

Note that s4cmd does not treat a bare dirA (without a trailing slash or wildcard) as dirA/*, the way rsync-style tools do.

No automatic overwrite for the put command: s4cmd put fileA s3://bucket/path/fileB returns an error if fileB exists. Use -f to force the overwrite, as with the get command.

Bugfixes for handling of non-existent paths: s3cmd often creates empty files when the specified paths do not exist: s3cmd get s3://my-bucket/no_such_file downloads an empty file, while s4cmd get s3://my-bucket/no_such_file returns an error; s3cmd put no_such_file s3://my-bucket/ uploads an empty file, while s4cmd put no_such_file s3://my-bucket/ returns an error.

Additional technical notes

Etags, MD5s and multi-part uploads: Traditionally, the etag of an object in S3 has been its MD5. However, this changed with the introduction of S3 multi-part uploads; in this case the etag is still a unique ID, but it is not the MD5 of the file. Amazon has not revealed the definition of the etag in this case, so there is no way we can calculate and compare MD5s based on the etag header in general. The workaround we use is to upload the MD5 as a supplemental content header (called "md5", instead of "etag"). This enables s4cmd to check the MD5 hash before upload or download. The only limitation is that this only works for files uploaded via s4cmd. Programs that do not understand this header will still have to download and verify the MD5 directly.
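
As a minimal sketch only (the bucket and key names are placeholders, and it assumes the stored value is a hex digest), a client could read this supplemental "md5" metadata with boto3's head_object and compare it to a locally computed MD5:

import hashlib
import boto3

def matches_s4cmd_md5(local_path, bucket, key):
    # Read the supplemental 'md5' metadata that s4cmd attaches to its uploads.
    s3 = boto3.client('s3')
    metadata = s3.head_object(Bucket=bucket, Key=key).get('Metadata', {})
    remote_md5 = metadata.get('md5')
    if remote_md5 is None:
        return False  # not uploaded by s4cmd; the MD5 must be verified another way
    md5 = hashlib.md5()
    with open(local_path, 'rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            md5.update(chunk)
    return md5.hexdigest() == remote_md5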

Unimplemented features

  • CloudFront or other feature support beyond basic S3 access.

Credits

s4cmd's People

Contributors

abridgett, apetresc, burritothief, chenrui333, debodirno, dependabot[bot], doctorxwrites, ehaupt, gitter-badger, holdenk, jlevy, knil-sama, linsomniac, missey, navinpai, nmishin, omega359, onilton, piavlo, rameshrajagopal, rozuur, sodabrew, stormy, thetacoscott, timgates42, viksit, woodb


s4cmd's Issues

--multipart-split-size claims to be in MB, but is actually in bytes

s4cmd --help says:

  --multipart-split-size=MULTIPART_SPLIT_SIZE
                        size in MB to split multipart transfers

I wanted to use a larger-than-normal split size while uploading a very large file, so I tried using --multipart-split-size=1000. s4cmd responded by eating up tens of gigabytes of memory and had to be killed.

It turns out that contrary to what --help says, in the current implementation the units for --multipart-split-size are bytes, not MB. So s4cmd crashed because it tried to allocate python objects to describe tens of millions of splits.

Either the docs should be fixed, or the code should be changed so that this is actually measured in MB (probably the latter is better).
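
As a workaround under the current behavior described above (the file and bucket names are placeholders), the split size has to be expressed in bytes; a 1000 MB split would be:

s4cmd put bigfile s3://my-bucket/bigfile --multipart-split-size=1048576000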

Files that equal 0 bytes break this program

/cassandra_backups/data/cassandra/data/clearcore/transactions/clearcore-transactions-jb-15105-Summary.db => s3://cassandra-backups.us-east[Runtime Failure] Unable to read data from source: /cassandra_backups/data/cassandra/data/clearcore/transactions/clearcore-transactions-jb-17476-Statistics.db
[Runtime Failure] Unable to read data from source: /cassandra_backups/data/cassandra/data/clearcore/transactions/clearcore-transactions-jb-17476-Data.db
[Runtime Failure] Unable to read data from source: /cassandra_backups/data/cassandra/data/clearcore/transactions/clearcore-transactions-jb-17476-CompressionInfo.db
[Runtime Failure] Unable to read data from source: /cassandra_backups/data/cassandra/data/clearcore/transactions/clearcore-transactions-jb-17476-Index.db
[Runtime Failure] Unable to read data from source: /cassandra_backups/data/cassandra/data/clearcore/transactions/clearcore-transactions-jb-17476-Filter.db

The thing all of these files share is that they all (for whatever reason) were 0 bytes. After testing with a single 0-byte file created by dd, sure enough this breaks the program.

Specific flags I am using

s4cmd dsync --verbose --recursive --delete-removed -c 7 $dir_src/ "$s3_url/$dir_target/"

I will look in the code tomorrow to see if anything pops out that I can help on in looking at this issue.

Allow setting access key & secret key from command line, like s3cmd

s3cmd has the option to set these variables from the command line, which I think is very useful and sadly seems to be missing from s4cmd.

I connect to different hosts for s3 (eu, us, china), and/or with different credentials or in subshells spawned by the build server. Not having the ability to set this on the command line means I can't use s4cmd at all, though it seems worth a try.

Add adjustable timeout setting

When trying to download a very large multipart file, I would like to be able to adjust the timeout setting on the socket from the current default of 5 minutes.

eg:
--socket-timeout=30s

SOCKET_TIMEOUT = 5 * 60 # in sec(s) (timeout if we don't receive any recv() callback)
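
For reference, a minimal sketch (this is not an existing s4cmd flag) of how a configurable timeout could be passed to the boto3 client via botocore's Config:

import boto3
from botocore.config import Config

# Hypothetical wiring: a --socket-timeout value (in seconds) forwarded to boto3.
socket_timeout = 30
s3 = boto3.client('s3', config=Config(connect_timeout=socket_timeout,
                                      read_timeout=socket_timeout))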

acl-public not working

I have set acl_public = True in my .s3cfg, but my files are all not publicly readable on my S3 bucket when I do a local-to-S3 sync. Any help?

An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied

I am trying to sync two buckets, both of which I have access to:

# created destination bucket to store data
$ s3cmd ls s3://toil.20k
                       DIR   s3://toil.20k/gtex/
                       DIR   s3://toil.20k/pnoc/
                       DIR   s3://toil.20k/target/
                       DIR   s3://toil.20k/tcga/

# publicly accessible source bucket I am trying to get data from
# without --requester-pays I cannot even ls on this bucket
$ s3cmd --requester-pays ls s3://cgl-rnaseq-recompute-fixed/gtex
                       DIR   s3://cgl-rnaseq-recompute-fixed/gtex/
2016-06-03 17:02    435553   s3://cgl-rnaseq-recompute-fixed/gtex-manifest

I had two approaches:

  1. Get-Put
# get from source bucket to an EBS volume
s3cmd get s3://cgl-rnaseq-recompute-fixed/gtex/$i ./
# put into destination bucket
s3cmd put ./$i s3://toil.20k/gtex/ 
# remove from EBS
rm ./$i 

I have this working in the background but I wanted to test the second approach where you can just copy/sync between buckets without having to store them temporarily on an EBS volume.

  2. Sync between buckets
    I read that s4cmd is faster than s3cmd but when I use this:
$ s4cmd --dry-run sync s3://cgl-rnaseq-recompute-fixed/gtex s3://rnaseq.toil.20k/gtex

# I get the following error
[Exception] An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied
[Thread Failure] An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied

Like I said I cannot ls on the source bucket without --requester-pays. Would that be causing the error? I see there is no --requester-pays argument for s4cmd. How can I use s4cmd to sync two buckets?

Thanks!

Can't GET from buckets with mixed-case names

Hi there,

I'm trying to do an s4cmd GET from a bucket with a mixed case name but I see the following:

"BotoClientError: Bucket names cannot contain upper-case characters when using either the sub-domain or virtual hosting calling format."

Is this a known issue or any workarounds?

Thanks

Does it support adding extra headers ?

I am wondering if there is, or will be, support for adding custom or extra headers; i.e., in s3cmd we do:

s3cmd -v put --add-header=Content-Encoding:gzip --guess-mime-type . . . .

Please update your PyPI package

The latest version on PyPI is 1.5.20, which is incompatible with Python 3; please update to the latest release version (1.5.22 at the moment).

/usr/bin/python: No module named s4cmd

On CentOS 6.7 with python 2.6 system default and python 3.4 side by side via IUS Community repo

locate /usr/bin/pip
/usr/bin/pip
/usr/bin/pip3
/usr/bin/pip3.4

pip --version
pip 7.1.2 from /usr/lib/python3.4/site-packages (python 3.4)

python --version
Python 2.6.6

I installed s4cmd via

pip install s4cmd

but get

s4cmd
/usr/bin/python: No module named s4cmd

best way to correct this ?

S3 compatible storage

Would be useful to be able to specify an endpoint for S3 compatible storage, such as radosgw for ceph.

pytz not required and s4cmd shell script not executable on all systems

Hi,

It looks like pytz is a missing requirement. I have a couple systems that don't have it installed and it seems that s4cmd doesn't list it as an explicit requirement.

Also, I haven't tracked down why, but in Circle Enterprise with pip 8.1.2, the s4cmd shell script is installed non-executable, so you can't just do pip install s4cmd; s4cmd ls. Not sure if anyone else will see that or not. I haven't been able to reproduce on my local system in a virtualenv.

Thanks for a handy utility. We can easily work around these, but I figured I should bring this up.

-Teran

Timeout in EC2 when listing a bucket, s3cmd works.

$  date; /usr/local/bin/s4cmd ls s3://my-redacted-bucket --debug; date
Thu Dec  3 17:16:53 UTC 2015
  (D)s4cmd.py:468  read S3 keys from $HOME/.s3cfg file
  (D)s4cmd.py:140  >> ls_handler(<__main__.CommandHandler object at 0x7fe855009a50>, ['ls', 's3://my-redacted-bucket'])
  (D)s4cmd.py:140  >> validate(<__main__.CommandHandler object at 0x7fe855009a50>, 'cmd|s3', ['ls', 's3://my-redacted-bucket'])
  (D)s4cmd.py:142  << validate(<__main__.CommandHandler object at 0x7fe855009a50>, 'cmd|s3', ['ls', 's3://my-redacted-bucket']): None
  (D)s4cmd.py:140  >> s3walk(<__main__.S3Handler object at 0x7fe855009a10>, 's3://my-redacted-bucket')
  (D)s4cmd.py:140  >> wrapper(<ThreadUtil(Thread-1, started daemon 140635834652416)>, <__main__.S3URL instance at 0x7fe854f7f950>, '', '', [])
  (E)s4cmd.py:359  [Errno 110] Connection timed out
  (D)s4cmd.py:140  >> wrapper(<ThreadUtil(Thread-2, started daemon 140635751315200)>, <__main__.S3URL instance at 0x7fe854f7f950>, '', '', [])
  (E)s4cmd.py:359  [Errno 110] Connection timed out
  (D)s4cmd.py:140  >> wrapper(<ThreadUtil(Thread-3, started daemon 140635740825344)>, <__main__.S3URL instance at 0x7fe854f7f950>, '', '', [])
  (E)s4cmd.py:359  [Errno 110] Connection timed out
  (D)s4cmd.py:140  >> wrapper(<ThreadUtil(Thread-4, started daemon 140635730335488)>, <__main__.S3URL instance at 0x7fe854f7f950>, '', '', [])
  (E)s4cmd.py:194  [Runtime Exception] [Errno 110] Connection timed out
  (E)s4cmd.py:196  Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/s4cmd.py", line 337, in run
    self.__class__.__dict__[func_name](self, *args, **kargs)
  File "/usr/local/lib/python2.7/dist-packages/s4cmd.py", line 141, in wrapper
    ret = func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/s4cmd.py", line 838, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/s4cmd.py", line 916, in s3walk
    for key in bucket.list(s3dir, PATH_SEP):
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/bucketlistresultset.py", line 34, in bucket_lister
    encoding_type=encoding_type)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/bucket.py", line 472, in get_all_keys
    '', headers, **params)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/bucket.py", line 398, in _get_all
    query_args=query_args)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/connection.py", line 664, in make_request
    retry_handler=retry_handler
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 1071, in make_request
    retry_handler=retry_handler)
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 1030, in _mexe
    raise ex
error: [Errno 110] Connection timed out

  (E)s4cmd.py:194  [Thread Failure] [Errno 110] Connection timed out
Thu Dec  3 17:51:27 UTC 2015
$

s4cmd v. slow with large buckets

time s4cmd.py ls s3://my-bucket/my-dir

real    16m53.446s
user    1m43.718s
sys 0m12.333s

which seems slow, although there are ~600,000 files in that bucket. The main problem is that get etc seems to do an ls first, making every operation v. slow.

Add `--force` requirement to `del --recursive`

Following s3cmd, s4cmd should consider requiring a --force flag for del --recursive calls. Because this is currently not required, unintended data loss could occur. I can submit a PR if the project maintainers agree.

V4 Auth not working

V4 auth, required by regions such as eu-central-1, is not working; requests fail with the following error:

[Thread Failure] S3ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InvalidRequest</Code><Message>The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.</Message><RequestId>02FA63493DA8A72F</RequestId><HostId>XANdyiE6GccqIuN9bgHmY8wXcY8ge65pzC/OWnXRxmpsU9oAmX9xfjF4A+57Kw0ue5DMgMHyIGU=</HostId></Error>

recursive put includes the parent directory

moving to s4cmd from s3cmd, found a difference in how recursive put is handled

s4cmd -r -f /home/ec2-user/da-stuff/ s3://da-bucket/uploads/

will successfully put the files, but it includes the parent directory too ('da-stuff' in this example).

/home/ec2-user/da-stuff/6.pdf => s3://da-bucket/uploads/da-stuff/6.pdf

when I was expecting

/home/ec2-user/da-stuff/6.pdf => s3://da-bucket/uploads/6.pdf

What is the right way to get this behavior?

Upload is slower for very large files because of md5 calculation

For larger file uploads, s4cmd takes too long, mainly because of the file_hash (md5) calculation.

This is the line responsible:
mpu = bucket.initiate_multipart_upload(s3url.path, metadata = {'md5': self.file_hash(source), 'privilege': self.get_file_privilege(source)})

https://github.com/bloomreach/s4cmd/blob/master/s4cmd.py#L1043

I've read in the AWS documentation that this md5 metadata is not necessary; I suspect this md5 calculation exists mainly because the local md5 of a file is different from the one generated by S3.

It is used in sync_check:
('md5' in remoteKey.metadata and remoteKey.metadata['md5'] == localmd5)

It is useful mostly for calls that use sync option.

But sometimes, like in my case, the user of s4cmd may not care about sync, since he may be using the -f (force) option.

I thought about two possible solutions:

  • One argument to just disable this file_hash execution like "--disable-multipart-meta-md5"
  • The second one is more ambitious, and I don't know if it is really possible:

Don't calculate the md5 locally, but use S3's generated md5 instead. (We would need to reproduce S3's md5 calculation.)

S3 calculates the s3 etag for multipart uploads using a md5 of the md5 of splitted parts. The algorithm is described in these links:

http://stackoverflow.com/questions/6591047/etag-definition-changed-in-amazon-s3
http://permalink.gmane.org/gmane.comp.file-systems.s3.s3tools/583
http://stackoverflow.com/questions/12186993/what-is-the-algorithm-to-compute-the-amazon-s3-etag-for-a-file-larger-than-5gb
https://forums.aws.amazon.com/thread.jspa?messageID=203510

The thing missing to recalculate S3's md5 for multipart uploads is the SPLIT_SIZE.

But S3's e-tag for the file has a suffix -NUMBER_OF_PARTS.

s3cmd --recursive --list-md5 ls s3://YOURBUCKET/YOURDIR/ 
2015-02-13 05:02 9791602688   f938b15b2edef7d2c23542814bdcb6af-187  s3://FILEPATH

I guess with this info (size and number of parts), we could calculate it the way Amazon S3 does.

In the example above:

parts = 187
file_size = 9791602688 / (1024.0 * 1024)  # ~9338.0 MB

SPLIT_SIZE = ((file_size - (file_size % parts)) / parts) + 1
SPLIT_SIZE = 50 # MB
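
A minimal sketch of that recalculation (the split size in bytes is an assumption the caller must supply; this is not how s4cmd currently works):

import hashlib

def multipart_etag(path, split_size):
    # S3's multipart ETag is the md5 of the concatenated per-part md5 digests,
    # with a '-<number of parts>' suffix.
    part_digests = []
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(split_size), b''):
            part_digests.append(hashlib.md5(chunk).digest())
    if len(part_digests) < 2:
        # Small (or empty) files are uploaded in one piece; the ETag is the plain MD5.
        with open(path, 'rb') as f:
            return hashlib.md5(f.read()).hexdigest()
    combined = hashlib.md5(b''.join(part_digests)).hexdigest()
    return '%s-%d' % (combined, len(part_digests))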

Better interop with s3fs-fuse

AFAICT through testing, https://github.com/s3fs-fuse/s3fs-fuse is the only currently working and developed FUSE filesystem for S3. In order to better interoperate with it, I suggest setting the following additional metadata in a way compatible with that tool:

x-amz-meta-gid
x-amz-meta-mtime
x-amz-meta-uid
x-amz-meta-mode / x-amz-meta-permissions

Unicode characters

Seems to not like unicode - example from sync below:

s3://skrivapa-staging/1491993470/Testavtal.pdf.pdf => /local/1491993470/Testavtal.pdf.pdf
'ascii' codec can't encode character u'\xe5' in position 35: ordinal not in range(128)

threads exceptions on exit

Looks like some threads are raising exceptions on exit:

Exception in thread Thread-903 (most likely raised during interpreter shutdown):
Exception in thread Thread-1289 (most likely raised during interpreter shutdown):
Exception in thread Thread-544 (most likely raised during interpreter shutdown):
Exception in thread Thread-1079 (most likely raised during interpreter shutdown):
Exception in thread Thread-3685 (most likely raised during interpreter shutdown):

Specify the S3 permissions needed for each operation in the doc

I wanted to sync a folder. It works with s3cmd with the following permissions:

  • s3:ListBucket
  • s3:PutObject
  • s3:PutObjectAcl

After trying with s4cmd I get a 403 Access Denied, which I guess comes from permissions. Can you add the needed permissions for each operation to the docs, please?

Killing and restarting a run after moving files breaks the ability to sync that specific directory

I moved some files while a sync run against a specific directory was taking place and so needed to restart the run to pick up the changes. I CTRL-C'd the run. When I went to restart s4cmd, it refused to run, throwing this error:

[Invalid Argument] Invalid number of parameter

This is the command that was killed:

export S4CMD_NUM_THREADS=8; s4cmd sync -rs <full local path>/* s3://<bucket>/<existing path>/

Running any other command, or this command against a different local path, works.

This is the --debug output:

  (D)s4cmd.py:494  read S3 keys from $HOME/.s3cfg file
  (D)s4cmd.py:170  >> sync_handler(<__main__.CommandHandler object at 0x103559910>, ['sync', '<full local path to>.jpg', 's3://<bucket>/<existing target path>/'])
  (D)s4cmd.py:170  >> validate(<__main__.CommandHandler object at 0x103559910>, 'cmd|s3,local|s3,local', ['sync', '<full local path but without trailing slash and filename>', '<full path to>.jpg', 's3://<bucket>/<existing target path/'])
  (E)s4cmd.py:182  [Invalid Argument] Invalid number of parameters

Python Module version + non-blocking?

Could this eventually be a python module, rather than being geared more towards a command line tool? Specific features that would be really useful would be non-blocking behaviors (such as an upload/download manager that could be monitored by other code). Imagine this being part of a python GUI with a progress-bar or similar. What about progress based on bytes uploaded/downloaded versus file count. Finally, what about a continuous scan of an upload path where files are being progressively added, and then uploaded by the manager? Nice job on the code!

Official packages for Debian and PyPI

s3cmd is included both in Debian (and therefore Ubuntu) and in PyPI. It would be great if you could package s4cmd officially in both these places.

I'm sure that would help adoption as well.

Installation succeeds, but Python says module not found

My s4cmd installation finishes successfully, albeit with an SSL warning, but after installation Python reports "No module named s4cmd."

Log:

~ $ sudo pip install s4cmd
/usr/lib/python2.7/site-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
You are using pip version 7.1.0, however version 7.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting s4cmd
/usr/lib/python2.7/site-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
  Downloading s4cmd-1.5.20.tar.gz
Requirement already satisfied (use --upgrade to upgrade): boto>=2.3.0 in /usr/lib/python2.7/site-packages (from s4cmd)
Installing collected packages: s4cmd
  Running setup.py install for s4cmd
Successfully installed s4cmd-1.5.20

~ $ s4cmd
/usr/local/bin/python: No module named s4cmd

s4cmd ls dies of out of memory on very large bucket

I'm trying to do an ls on a big bucket (billions of objects) and I expected the ls to be "streaming". I guess it first consumes all elements into RAM and only then spits them to disk, which won't work for ridiculously large buckets.

Linux killed it after 90 minutes of running the script, at that time it took about 25GB of RAM.

Windows-style backslashes appear in s3 paths when using recursive upload

s4cmd should perform path separator conversion to forward slashes before uploading. For example, when calling s4cmd -r put dir s3://bucket/somefolder/ with a local directory structured as

  • dir
    • myfile

The result now is an s3 path of s3://bucket/somefolder/dir\myfile. The expected path is s3://bucket/somefolder/dir/myfile.

Expose DEFAULT_SPLIT through command line parameter

s4cmd/s4cmd.py

Line 41 in 92a9b53

DEFAULT_SPLIT = 50 * 1024 * 1024

DEFAULT_SPLIT is hardcoded at 50MB.

I've got the feeling that for some larger file uploads it would be better to have a larger "split size". Maybe some tweaking of --num-threads together with this value could yield better results (faster uploads).

I won't go into much detail trying to explain why I have this feeling, but if you want me to, I could explain more, or even fetch some data on this.

Anyway, I think even if my feeling is not true, it would be very nice to have this value exposed to the user (through command line).

Note I am not suggesting changing default_split, but rather just being able to tweak this item.

PS: I'm not familiar with the code, but if you don't have time to do this and are willing to receive pull requests, I could work on this (even if it could take more time for me than it would for you :) ).

Thanks in advance!

[Thread Failure] 'ascii' codec can't decode byte 0xe2 in position 120: ordinal not in range(128)

I'm getting this when running:

s4cmd dsync -r --verbose /Volumes/Volume_Name-1/Foo_Bar_Project/ s3://dir01/archive/Foo_Bar_Project/

I'm running on OS X 10.11.5 El Capitan (I know, it's what I have available at this moment in time) I have admin privileges, and I can verify that PIP, Brew, Python - they're all up to date.

When running in verbose, I can see that it is actually going through and verifying whether an object has been previously synced. I started this transfer with CyberDuck and it timed out at the end, oddly enough, but with no specific error.

Has anyone run into a similar problem in their workflow?

[OSError] 2: No such file or directory

I find when I try to download multiple Gigabytes from S3 to my local filesystem, this error will stop the process and therefore not download the rest of the files.

[OSError] 2: No such file or directory
[OSError] 2: No such file or directory
[OSError] 2: No such file or directory
[Thread Failure] [Errno 2] No such file or directory: './my_directory/my_file-thl8fgmwbcnxy36.tmp'

It usually succeeds in downloading a few hundred of ~1,500 files before it errors out, so I am left with let's say about half the directory's actual contents and lacking the rest of the files. I have found I can re-run the command multiple times and eventually it will finish. So it may be an edge case that crops up every few hundred or so transfers.

Here's the command:
s4cmd.py get -r -s s3://my-s3-bucket/my_directory

Any idea what's going on? I use s4cmd just about every day and absolutely love it; with this issue resolved I have nothing but praise for it!

Sync command re-downloads existing files

Hey everyone!
s4cmd looks really cool, I love how fast the s4cmd du command scales with huge buckets.

Well done!

I did notice that s4cmd sync s3://... /local/directory actually re-downloads files that are already present in the local destination directory -- any way to prevent that?

Thanks,
Elad

s4cmd immediately terminates

Any command that I try immediately terminates without doing any work.

root@ip-10-110-70-203:/mnt/profiles# ~/s4cmd/s4cmd.py ls s3n://rrosario-data/*
[1 task(s) completed, 4 thread(s)]

Am I doing something wrong? Is there some other data I can provide that can help debug this?

v2.0: dies on out-of-memory on single file PUT

Hi,

while trying this tool (for the first time) to dsync 80GB of data (in 50 files) from local to S3, it consumed all the memory and was killed by OOM killer.

s4cmd dsync -r . s3://xxx/normalize_exports/

Then, I tried to PUT individual files one by one and it also ran out of memory.

$ ls -lh 5s/2015-07.jsonl.gz
-rw-rw-r-- 1 ubuntu ubuntu 808M May 25 06:39 5s/2015-07.jsonl.gz
$ s4cmd put 5s/2015-07.jsonl.gz s3://xxx/normalize_exports/5s/2015-07.jsonl.gz

Versions:

$ pip list
awscli (1.10.33)
boto3 (1.3.1)
botocore (1.4.23)
colorama (0.3.3)
docutils (0.12)
futures (3.0.5)
jmespath (0.9.0)
pip (8.1.2)
pyasn1 (0.1.9)
python-dateutil (2.5.3)
pytz (2016.4)
rsa (3.4.2)
s3transfer (0.0.1)
s4cmd (2.0.1)
setuptools (2.2)
six (1.10.0)

Guessing of ContentType

When syncing a directory tree to S3 (using sync/dsync) the ContentType is being set as binary/octet-stream on all files.

s4cmd --num-threads=100 --recursive dsync local_app_folder/static/dist/ s3:/my_s3_bucket/app/static/ --API-ACL="public-read" --API-CacheControl="max-age=31536000, public"

As the directory structure contains varying file types, it's not possible to set them using the --API-ContentType parameter. I can set --API-ContentType to a blank string, but that actually sets the ContentType as blank on S3, which has unexpected behaviour when trying to load some file formats.

s4cmd --num-threads=100 --recursive dsync local_app_folder/static/dist/ s3:/my_s3_bucket/app/static/ --API-ACL="public-read" --API-CacheControl="max-age=31536000, public" --API-ContentType=""

It would be good to support mimetype guessing like s3cmd does (see the sketch below).
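
For illustration only (this is not current s4cmd behavior, and the function name is hypothetical), the requested guessing could look roughly like this with Python's standard mimetypes module:

import mimetypes

def guess_content_type(path, default='binary/octet-stream'):
    # Fall back to the generic type S3 currently gets when nothing matches.
    content_type, _ = mimetypes.guess_type(path)
    return content_type or default

# e.g. guess_content_type('local_app_folder/static/dist/main.css') -> 'text/css'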

reduced performance in version 2

Hello, I noticed a significant performance decrease with version 2.0.1 (boto3 1.3.1) relative to version
1.5.20 (boto 2.40.0). With version 1.5, a list operation of a bucket takes 3 seconds on average, while
with version 2 it takes 5 seconds on average. Previous version was downloading ~70GB in 15 minutes,
current version downloads ~40GB. An automated task that executes periodically a set of different
operations across multiple buckets and regions, took 5 to 6 times longer. Let me know if I could assist
somehow. Thank you.

IAM authentication within s4cmd

Is there a way to use IAM-roles for authentication in the s4cmd version 2.0?
When using "pure" boto3 I can use IAM without any complications.

Can't get all available buckets.

I try to use s4cmd like s3cmd, something like this:

./s4cmd -p /home/user/.s3cfg ls s3
i see my bucket's list
./s4cmd -p /home/user/.s3cfg-n ls s3
[Invalid Argument] Invalid parameter: s3, s3 path expected
./s4cmd -p /home/user/.s3cfg-n ls s3://*
[Errno -2] Name or service not known
[Errno -2] Name or service not known thread(s)]
[Errno -2] Name or service not known thread(s)]

How can I get all my buckets?

error: can't copy 'data/bash-completion/s4cmd': doesn't exist or not a regular file

Installing:

  • using root user
  • on Ubuntu 12.04.5 LTS
  • on AWS 64-bit machine
  • Using pip install s4cmd

During install I'm getting this error:

Installing collected packages: s4cmd, boto3, pytz, botocore, jmespath, futures, python-dateutil, docutils, six
  Running setup.py install for s4cmd
    Running command /usr/bin/python -c "import setuptools;__file__='/tmp/build/s4cmd/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --single-version-externally-managed --record /tmp/pip-Zvc7O7-record/install-record.txt
    running install
    running build
    running build_py
    running build_scripts
    running install_lib
    running install_data
    error: can't copy 'data/bash-completion/s4cmd': doesn't exist or not a regular file
    Complete output from command /usr/bin/python -c "import setuptools;__file__='/tmp/build/s4cmd/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --single-version-externally-managed --record /tmp/pip-Zvc7O7-record/install-record.txt:
    running install

running build

running build_py

running build_scripts

running install_lib

running install_data

error: can't copy 'data/bash-completion/s4cmd': doesn't exist or not a regular file

----------------------------------------
Command /usr/bin/python -c "import setuptools;__file__='/tmp/build/s4cmd/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --single-version-externally-managed --record /tmp/pip-Zvc7O7-record/install-record.txt failed with error code 1
Exception information:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 126, in main
    self.run(options, args)
  File "/usr/lib/python2.7/dist-packages/pip/commands/install.py", line 228, in run
    requirement_set.install(install_options, global_options)
  File "/usr/lib/python2.7/dist-packages/pip/req.py", line 1093, in install
    requirement.install(install_options, global_options)
  File "/usr/lib/python2.7/dist-packages/pip/req.py", line 566, in install
    cwd=self.source_dir, filter_stdout=self._filter_install, show_stdout=False)
  File "/usr/lib/python2.7/dist-packages/pip/__init__.py", line 255, in call_subprocess
    % (command_desc, proc.returncode))
InstallationError: Command /usr/bin/python -c "import setuptools;__file__='/tmp/build/s4cmd/setup.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --single-version-externally-managed --record /tmp/pip-Zvc7O7-record/install-record.txt failed with error code 1
"~/.pip/pip.log" 162L, 10020C

escape % in s3 path

I found an error when the S3 path includes %.

text = (msg % args) 

The source code above raises TypeError: not enough arguments for format string.
I would suggest s4cmd use a more robust string-formatting approach to avoid this problem (see the sketch below).

Thanks.
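
A minimal sketch of the suggested fix (the function name is illustrative): only apply %-formatting when arguments are actually supplied, so paths containing '%' pass through untouched:

def format_message(msg, *args):
    # Avoid "TypeError: not enough arguments for format string" when the
    # message itself contains '%' and no args are given.
    return (msg % args) if args else msg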

"specified endpoint" error

[Exception] An error occurred (PermanentRedirect) when calling the ListObjects operation: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.
[Thread Failure] An error occurred (PermanentRedirect) when calling the ListObjects operation: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.

With 2.0. This is attempting to do an ls, where s3cmd works.
