
azure-data-lake-store-python's Introduction

azure-datalake-store

A pure-Python interface to the Azure Data Lake Store (Gen 1) service, providing Pythonic file-system and file objects, seamless handling of Windows and POSIX remote path styles, and a high-performance uploader and downloader.

This software is under active development and not yet recommended for general use.

Note: This library supports ADLS Gen 1 only. For Gen 2, please see the azure-storage-file-datalake package.

Installation

Using pip:

pip install azure-datalake-store

Manually (bleeding edge):
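
A from-source install would typically look something like the following (a sketch; the repository is the one referenced later in this document, and you may want to check out the dev branch for the truly bleeding edge):

git clone https://github.com/Azure/azure-data-lake-store-python.git
cd azure-data-lake-store-python
pip install -e .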

Auth

Although users can generate and supply their own tokens to the base file-system class, and there is a password-based function in the lib module for generating tokens, the most convenient way to supply credentials is via environment variables. This latter method is the one used by default in the library. The following variables are required (see the sketch after this list):

  • azure_tenant_id

  • azure_username

  • azure_password

  • azure_store_name

  • azure_url_suffix (optional)
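
A minimal sketch of supplying these from Python before constructing the filesystem (all values are placeholders):

import os

os.environ['azure_tenant_id'] = 'my-tenant-id'
os.environ['azure_username'] = 'my-username'
os.environ['azure_password'] = 'my-password'
os.environ['azure_store_name'] = 'mystore'
# os.environ['azure_url_suffix'] = '...'  # optional; only for non-default endpoints

from azure.datalake.store import core
adl = core.AzureDLFileSystem()  # picks up the variables set above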

Pythonic Filesystem

The AzureDLFileSystem object is the main API of this package. It provides typical file-system operations on the remote Azure store.

from azure.datalake.store import core, lib

token = lib.auth(tenant_id, username, password)
adl = core.AzureDLFileSystem(token, store_name=store_name)
# alternatively, adl = core.AzureDLFileSystem()
# uses environment variables

print(adl.ls())  # list files in the root directory
for item in adl.ls(detail=True):
    print(item)  # same, but with file details as dictionaries
print(adl.walk(''))  # list all files at any directory depth
print('Usage:', adl.du('', deep=True, total=True))  # total bytes usage
adl.mkdir('newdir')  # create directory
adl.touch('newdir/newfile') # create empty file
adl.put('remotefile', '/home/myuser/localfile') # upload a local file

In addition, the file-system generates file objects that are compatible with the Python file interface, ensuring compatibility with libraries that work on Python file objects. The recommended way to use them is with a context manager (otherwise, be sure to call close() on the file object).

with adl.open('newfile', 'wb') as f:
    f.write(b'index,a,b\n')
    f.tell()   # now at position 10
    f.flush()  # forces data upstream
    f.write(b'0,1,True')

with adl.open('newfile', 'rb') as f:
    print(f.readlines())

import pandas as pd

with adl.open('newfile', 'rb') as f:
    df = pd.read_csv(f)  # read directly into a pandas DataFrame

To seamlessly handle remote path representations across all supported platforms, the main API accepts several path types: string, pathlib Path/PurePath, and AzureDLPath. On Windows in particular, paths may use either forward slashes or backslashes as separators.

try:
    import pathlib               # Python >= 3.4
except ImportError:
    import pathlib2 as pathlib   # Python <= 3.3 (backport package)

from azure.datalake.store.core import AzureDLPath

# possible remote paths to use on API
p1 = '\\foo\\bar'
p2 = '/foo/bar'
p3 = pathlib.PurePath('\\foo\\bar')
p4 = pathlib.PureWindowsPath('\\foo\\bar')
p5 = pathlib.PurePath('/foo/bar')
p6 = AzureDLPath('\\foo\\bar')
p7 = AzureDLPath('/foo/bar')

# p1, p3, and p6 only work on Windows
for p in [p1, p2, p3, p4, p5, p6, p7]:
    with adl.open(p, 'rb') as f:
        print(f.readlines())

Performant up-/down-loading

The ADLUploader and ADLDownloader classes chunk large files and transfer many files to/from Azure using multiple threads. A whole directory tree, files matching a specific glob pattern, or any single file can be transferred.

# download the whole directory structure using 5 threads, 16MB chunks
ADLDownloader(adl, '', 'my_temp_dir', 5, 2**24)
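
Uploads work the same way in the other direction; for example (remote and local path names below are placeholders):

# upload a local directory tree using 5 threads, 16MB chunks
ADLUploader(adl, 'remote_dir', 'my_local_dir', 5, 2**24, overwrite=True)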

API

class azure.datalake.store.core.AzureDLFileSystem(token=None, per_call_timeout_seconds=60, **kwargs)

Access Azure DataLake Store as if it were a file-system

  • Parameters

    store_name: str ("")

      Store name to connect to.
    

    token: credentials object

      When setting up a new connection, this contains the authorization
      credentials (see lib.auth()).
    

    url_suffix: str (None)

      Domain to send REST requests to. The end-point URL is constructed
      using this and the store_name. If None, use default.
    

    api_version: str (2018-09-01)

      The API version to target with requests. Changing this value will
      change the behavior of the requests, and can cause unexpected or
      breaking behavior. Change this value with caution.
    

    per_call_timeout_seconds: float (60)

      Timeout, in seconds, for each call made via the requests library.
    

    kwargs: optional key/values

      See `lib.auth()`; full list: tenant_id, username, password, client_id,
      client_secret, resource
    
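As an illustration of how these parameters combine with lib.auth() for a service-principal connection (tenant, client id, secret and store name are placeholders):

from azure.datalake.store import core, lib

token = lib.auth(tenant_id='my-tenant-id',
                 client_id='my-client-id',
                 client_secret='my-client-secret')
adl = core.AzureDLFileSystem(token, store_name='mystore')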

Methods

access(self, path, invalidate_cache=True)

Does such a file/directory exist?

  • Parameters

    path: str or AzureDLPath

      Path to query
    

    invalidate_cache: bool

      Whether to invalidate cache
    
  • Returns

    True or False depending on whether the path exists.

cat(self, path)

Return contents of file

  • Parameters

    path: str or AzureDLPath

      Path to query
    
  • Returns

    Contents of file

chmod(self, path, mod)

Change access mode of path

Note this is not recursive.

  • Parameters

    path: str

      Location to change
    

    mod: str

      Octal representation of access, e.g., "0777" for public read/write.
      See the WebHDFS permission docs:
      http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Permission
    
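A minimal illustration, using the file created in the earlier examples:

adl.chmod('newdir/newfile', '0770')  # owner and group get read/write/execute, no access for others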

chown(self, path, owner=None, group=None)

Change owner and/or owning group

Note this is not recursive.

  • Parameters

    path: str

      Location to change
    

    owner: str

      UUID of owning entity
    

    group: str

      UUID of group
    

concat(self, outfile, filelist, delete_source=False)

Concatenate a list of files into one new file

  • Parameters

    outfile: path

      The file which will be concatenated to. If it already exists,
      the extra pieces will be appended.
    

    filelist: list of paths

      Existing adl files to concatenate, in order
    

    delete_source: bool (False)

      If True, assume that the paths to concatenate exist alone in a
      directory, and delete that whole directory when done.
    
  • Returns

    None
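
A minimal illustration (file names are hypothetical):

adl.concat('logs/combined.csv', ['logs/part-0.csv', 'logs/part-1.csv'])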

connect(self)

Establish connection object.

cp(self, path1, path2)

Not implemented. Copy file between locations on ADL

classmethod current()

Return the most recently created AzureDLFileSystem

df(self, path)

Resource summary of path

  • Parameters

    path: str

      Path to query
    

du(self, path, total=False, deep=False, invalidate_cache=True)

Bytes in keys at path

  • Parameters

    path: str or AzureDLPath

      Path to query
    

    total: bool

      Return the total (sum of sizes) rather than a list
    

    deep: bool

      Recursively enumerate or just use files under current dir
    

    invalidate_cache: bool

      Whether to invalidate cache
    
  • Returns

    List of dict of name:size pairs or total size.

exists(self, path, invalidate_cache=True)

Does such a file/directory exist?

  • Parameters

    path: str or AzureDLPath

      Path to query
    

    invalidate_cache: bool

      Whether to invalidate cache
    
  • Returns

    True or False depending on whether the path exists.

get(self, path, filename)

Stream data from file at path to local filename

  • Parameters

    path: str or AzureDLPath

      ADL Path to read
    

    filename: str or Path

      Local file path to write to
    
  • Returns

    None

get_acl_status(self, path)

Gets Access Control List (ACL) entries for the specified file or directory.

  • Parameters

    path: str

      Location to get the ACL.
    

glob(self, path, details=False, invalidate_cache=True)

Find files (not directories) by glob-matching.

  • Parameters

    path: str or AzureDLPath

      Path to query
    

    details: bool

      Whether to include file details
    

    invalidate_cache: bool

      Whether to invalidate cache
    
  • Returns

    List of files
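
For example (the path is hypothetical):

adl.glob('data/*.csv')                 # list of matching file names
adl.glob('data/*.csv', details=True)   # same, with file info dictionaries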

head(self, path, size=1024)

Return first bytes of file

  • Parameters

    path: str or AzureDLPath

      Path to query
    

    size: int

      How many bytes to return
    
  • Returns

    First `size` bytes of file

info(self, path, invalidate_cache=True, expected_error_code=None)

File information for path

  • Parameters

    path: str or AzureDLPath

      Path to query
    

    invalidate_cache: bool

      Whether to invalidate cache or not
    

    expected_error_code: int

      Optionally indicates a specific, expected error code, if any.
    
  • Returns

    File information

invalidate_cache(self, path=None)

Remove entry from object file-cache

  • Parameters

    path: str or AzureDLPath

      Remove the path from object file-cache
    
  • Returns

    None

listdir(self, path='', detail=False, invalidate_cache=True)

List all elements under directory specified with path

  • Parameters

    path: str or AzureDLPath

      Path to query
    

    detail: bool

      Detailed info or not.
    

    invalidate_cache: bool

      Whether to invalidate cache or not
    
  • Returns

    List of elements under directory specified with path

ls(self, path='', detail=False, invalidate_cache=True)

List all elements under directory specified with path

  • Parameters

    path: str or AzureDLPath

      Path to query
    

    detail: bool

      Detailed info or not.
    

    invalidate_cache: bool

      Whether to invalidate cache or not
    
  • Returns

    List of elements under directory specified with path

merge(self, outfile, filelist, delete_source=False)

Concatenate a list of files into one new file

  • Parameters

    outfile: path

      The file which will be concatenated to. If it already exists,
      the extra pieces will be appended.
    

    filelist: list of paths

      Existing adl files to concatenate, in order
    

    delete_source: bool (False)

      If True, assume that the paths to concatenate exist alone in a
      directory, and delete that whole directory when done.
    
  • Returns

    None

mkdir(self, path)

Make new directory

  • Parameters

    path: str or AzureDLPath

      Path to create directory
    
  • Returns

    None

modify_acl_entries(self, path, acl_spec, recursive=False, number_of_sub_process=None)

Modify existing Access Control List (ACL) entries on a file or folder. If the entry does not exist it is added, otherwise it is updated based on the spec passed in. No entries are removed by this process (unlike set_acl).

Note: this is by default not recursive, and applies only to the file or folder specified.

  • Parameters

    path: str

      Location to set the ACL entries on.
    

    acl_spec: str

      The ACL specification to use in modifying the ACL at the path in the format
      '[default:]user|group|other:[entity id or UPN]:r|-w|-x|-,[default:]user|group|other:[entity id or UPN]:r|-w|-x|-,…'
    

    recursive: bool

      Specifies whether to modify ACLs recursively or not
    

mv(self, path1, path2)

Move file between locations on ADL

  • Parameters

    path1:

      Source Path
    

    path2:

      Destination path
    
  • Returns

    None

open(self, path, mode='rb', blocksize=33554432, delimiter=None)

Open a file for reading or writing

  • Parameters

    path: string

      Path of file on ADL
    

    mode: string

      One of 'rb', 'ab' or 'wb'
    

    blocksize: int

      Size of data-node blocks if reading
    

    delimiter: byte(s) or None

      For writing delimiter-ended blocks
    
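A sketch of the delimiter option, based on the parameter description above: writes are sent to the store in blocks that end on the delimiter, so records are not split across uploaded blocks (path and data are illustrative):

with adl.open('logs/events.csv', 'ab', delimiter=b'\n') as f:
    f.write(b'2016-10-06,upload,ok\n')
    f.write(b'2016-10-06,download,ok\n')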

put(self, filename, path, delimiter=None)

Stream data from local filename to file at path

  • Parameters

    filename: str or Path

      Local file path to read from
    

    path: str or AzureDLPath

      ADL Path to write to
    

    delimiter:

      Optional delimiter for delimiter-ended blocks
    
  • Returns

    None

read_block(self, fn, offset, length, delimiter=None)

Read a block of bytes from an ADL file

Starting at offset of the file, read length bytes. If delimiter is set then we ensure that the read starts and stops at delimiter boundaries that follow the locations offset and offset + length. If offset is zero then we start at zero. The bytestring returned WILL include the end delimiter string.

If offset + length is beyond the EOF, reads to EOF.

  • Parameters

    fn: string

      Path to filename on ADL
    

    offset: int

      Byte offset to start read
    

    length: int

      Number of bytes to read
    

    delimiter: bytes (optional)

      Ensure reading starts and stops at delimiter bytestring
    

Examples

>>> adl.read_block('data/file.csv', 0, 13)  # doctest: +SKIP
b'Alice, 100\nBo'
>>> adl.read_block('data/file.csv', 0, 13, delimiter=b'\n')  # doctest: +SKIP
b'Alice, 100\nBob, 200\n'

Use length=None to read to the end of the file.

>>> adl.read_block('data/file.csv', 0, None, delimiter=b'\n')  # doctest: +SKIP
b'Alice, 100\nBob, 200\nCharlie, 300'

remove(self, path, recursive=False)

Remove a file or directory

  • Parameters

    path: str or AzureDLPath

      The location to remove.
    

    recursive: bool (False)

      Whether to remove also all entries below, i.e., which are returned
      by walk().
    
  • Returns

    None

remove_acl(self, path)

Remove the entire non-default ACL from the file or folder, including unnamed entries. Default entries cannot be removed this way; please use remove_default_acl for that.

Note: this is not recursive, and applies only to the file or folder specified.

  • Parameters

    path: str

      Location to remove the ACL.
    

remove_acl_entries(self, path, acl_spec, recursive=False, number_of_sub_process=None)

Remove existing, named, Access Control List (ACL) entries on a file or folder. If the entry does not exist already it is ignored. Default entries cannot be removed this way, please use remove_default_acl for that. Unnamed entries cannot be removed in this way, please use remove_acl for that.

Note: this is by default not recursive, and applies only to the file or folder specified.

  • Parameters

    path: str

      Location to remove the ACL entries.
    

    acl_spec: str

      The ACL specification to remove from the ACL at the path in the format (note that the permission portion is missing)
      '[default:]user|group|other:[entity id or UPN],[default:]user|group|other:[entity id or UPN],…'
    

    recursive: bool

      Specifies whether to remove ACLs recursively or not
    

remove_default_acl(self, path)

Remove the entire default ACL from the folder. Default entries do not exist on files, if a file is specified, this operation does nothing.

Note: this is not recursive, and applies only to the folder specified.

  • Parameters

    path: str

      Location to remove the default ACL from.
    

rename(self, path1, path2)

Move file between locations on ADL

  • Parameters

    path1:

      Source Path
    

    path2:

      Destination path
    
  • Returns

    None

rm(self, path, recursive=False)

Remove a file or directory

  • Parameters

    path: str or AzureDLPath

      The location to remove.
    

    recursive: bool (False)

      Whether to remove also all entries below, i.e., which are returned
      by walk().
    
  • Returns

    None

rmdir(self, path)

Remove empty directory

  • Parameters

    path: str or AzureDLPath

      Directory path to remove
    
  • Returns

    None

set_acl(self, path, acl_spec, recursive=False, number_of_sub_process=None)

Set the Access Control List (ACL) for a file or folder.

Note: this is by default not recursive, and applies only to the file or folder specified.

  • Parameters

    path: str

      Location to set the ACL on.
    

    acl_spec: str

      The ACL specification to set on the path in the format
      '[default:]user|group|other:[entity id or UPN]:r|-w|-x|-,[default:]user|group|other:[entity id or UPN]:r|-w|-x|-,…'
    

    recursive: bool

      Specifies whether to set ACLs recursively or not
    
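For illustration, one spec following the format above, setting the standard entries and granting read/execute to a single named user (the object id is a placeholder):

adl.set_acl('data/folder',
            'user::rwx,group::r-x,other::---,'
            'user:00000000-0000-0000-0000-000000000000:r-x')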

set_expiry(self, path, expiry_option, expire_time=None)

Set or remove the expiration time on the specified file. This operation can only be executed against files.

Note: Folders are not supported.

  • Parameters

    path: str

      File path to set or remove expiration time
    

    expire_time: int

      The time that the file will expire, corresponding to the expiry_option that was set
    

    expiry_option: str

      Indicates the type of expiration to use for the file:
    
          1. NeverExpire: ExpireTime is ignored.

          2. RelativeToNow: ExpireTime is an integer in milliseconds representing the expiration date relative to when file expiration is updated.

          3. RelativeToCreationDate: ExpireTime is an integer in milliseconds representing the expiration date relative to file creation.

          4. Absolute: ExpireTime is an integer in milliseconds, as a Unix timestamp relative to 1/1/1970 00:00:00.
    
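For example, to have a file expire one day from now (the path is hypothetical; the time is in milliseconds as described above):

adl.set_expiry('tmp/report.csv', 'RelativeToNow',
               expire_time=24 * 60 * 60 * 1000)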

stat(self, path, invalidate_cache=True, expected_error_code=None)

File information for path

  • Parameters

    path: str or AzureDLPath

      Path to query
    

    invalidate_cache: bool

      Whether to invalidate cache or not
    

    expected_error_code: int

      Optionally indicates a specific, expected error code, if any.
    
  • Returns

    File information

tail(self, path, size=1024)

Return last bytes of file

  • Parameters

    path: str or AzureDLPath

      Path to query
    

    size: int

      How many bytes to return
    
  • Returns

    Last `size` bytes of file

touch(self, path)

Create empty file

  • Parameters

    path: str or AzureDLPath

      Path of file to create
    
  • Returns

    None

unlink(self, path, recursive=False)

Remove a file or directory

  • Parameters

    path: str or AzureDLPath

      The location to remove.
    

    recursive: bool (False)

      Whether to remove also all entries below, i.e., which are returned
      by walk().
    
  • Returns

    None

walk(self, path='', details=False, invalidate_cache=True)

Get all files below given path

  • Parameters

    path: str or AzureDLPath

      Path to query
    

    details: bool

      Whether to include file details
    

    invalidate_cache: bool

      Whether to invalidate cache
    
  • Returns

    List of files

class azure.datalake.store.multithread.ADLUploader(adlfs, rpath, lpath, nthreads=None, chunksize=268435456, buffersize=4194304, blocksize=4194304, client=None, run=True, overwrite=False, verbose=False, progress_callback=None, timeout=0)

Upload local file(s) using chunks and threads

Launches multiple threads for efficient uploading, with chunksize assigned to each. The path can be a single file, a directory of files or a glob pattern.

  • Parameters

    adlfs: ADL filesystem instance

    rpath: str

      remote path to upload to; if multiple files, this is the directory
      root to write within
    

    lpath: str

      local path. Can be single file, directory (in which case, upload
      recursively) or glob pattern. Recursive glob patterns using ** are
      not supported.
    

    nthreads: int [None]

      Number of threads to use. If None, uses the number of cores.
    

    chunksize: int [2**28]

      Number of bytes for a chunk. Large files are split into chunks. Files
      smaller than this number will always be transferred in a single thread.
    

    buffersize: int [2**22]

      Number of bytes for internal buffer. This block cannot be bigger than
      a chunk and cannot be smaller than a block.
    

    blocksize: int [2**22]

      Number of bytes for a block. Within each chunk, we write a smaller
      block for each API call. This block cannot be bigger than a chunk.
    

    client: ADLTransferClient [None]

      Set an instance of ADLTransferClient when finer-grained control over
      transfer parameters is needed. Ignores nthreads and chunksize
      set by constructor.
    

    run: bool [True]

      Whether to begin executing immediately.
    

    overwrite: bool [False]

      Whether to forcibly overwrite existing files/directories. If False and
      the remote path is a directory, will quit regardless of whether any files
      would be overwritten. If True, only matching filenames are actually
      overwritten.
    

    progress_callback: callable [None]

      Callback for progress with signature function(current, total) where
      current is the number of bytes transferred so far, and total is the
      size of the blob, or None if the total size is unknown.
    

    timeout: int (0)

      Default value 0 means infinite timeout. Otherwise, time in seconds before the
      process will stop and raise an exception if the transfer is still in progress.
    
  • Attributes

    hash

Methods

active(self)

Return whether the uploader is active

static clear_saved()

Remove references to all persisted uploads.

static load()

Load list of persisted transfers from disk, for possible resumption.

  • Returns

    A dictionary of upload instances, keyed by auto-generated unique
    hashes. The state of the chunks (completed, errored, etc.) can be
    seen in the status attribute. Instances can be resumed with `run()`.
    
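A minimal resumption sketch, assuming an earlier upload was interrupted after its state had been persisted:

from azure.datalake.store.multithread import ADLUploader

resumable = ADLUploader.load()   # dict keyed by auto-generated hash
for key, up in resumable.items():
    up.run()                     # resume the remaining chunks, blocking until done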

run(self, nthreads=None, monitor=True)

Populate transfer queue and execute uploads

  • Parameters

    nthreads: int [None]

      Override default nthreads, if given
    

    monitor: bool [True]

      To watch and wait (block) until completion.
    

save(self, keep=True)

Persist this upload

Saves a copy of this transfer process in its current state to disk. This is done automatically for a running transfer, so that as a chunk is completed, this is reflected. Thus, if a transfer is interrupted, e.g., by user action, the transfer can be restarted at another time. All chunks that were not already completed will be restarted at that time.

See the load method to retrieve saved transfers and run to resume a stopped transfer.

  • Parameters

    keep: bool (True)

      If True, the transfer will be saved if some chunks remain to be
      completed; otherwise, it will be removed from the saved state.
    

successful(self)

Return whether the uploader completed successfully.

It will raise AssertionError if the uploader is active.

class azure.datalake.store.multithread.ADLDownloader(adlfs, rpath, lpath, nthreads=None, chunksize=268435456, buffersize=4194304, blocksize=4194304, client=None, run=True, overwrite=False, verbose=False, progress_callback=None, timeout=0)

Download remote file(s) using chunks and threads

Launches multiple threads for efficient downloading, with chunksize assigned to each. The remote path can be a single file, a directory of files or a glob pattern.

  • Parameters

    adlfs: ADL filesystem instance

    rpath: str

      remote path/globstring to use to find remote files. Recursive glob
      patterns using ** are not supported.
    

    lpath: str

      local path. If downloading a single file, will write to this specific
      file, unless it is an existing directory, in which case a file is
      created within it. If downloading multiple files, this is the root
      directory to write within. Will create directories as required.
    

    nthreads: int [None]

      Number of threads to use. If None, uses the number of cores.
    

    chunksize: int [2**28]

      Number of bytes for a chunk. Large files are split into chunks. Files
      smaller than this number will always be transferred in a single thread.
    

    buffersize: int [2**22]

      Ignored in current implementation.
      Number of bytes for internal buffer. This block cannot be bigger than
      a chunk and cannot be smaller than a block.
    

    blocksize: int [2**22]

      Number of bytes for a block. Within each chunk, we write a smaller
      block for each API call. This block cannot be bigger than a chunk.
    

    client: ADLTransferClient [None]

      Set an instance of ADLTransferClient when finer-grained control over
      transfer parameters is needed. Ignores nthreads and chunksize set
      by constructor.
    

    run: bool [True]

      Whether to begin executing immediately.
    

    overwrite: bool [False]

      Whether to forcibly overwrite existing files/directories. If False and
      the local path is a directory, will quit regardless of whether any files
      would be overwritten. If True, only matching filenames are actually
      overwritten.
    

    progress_callback: callable [None]

      Callback for progress with signature function(current, total) where
      current is the number of bytes transferred so far, and total is the
      size of the blob, or None if the total size is unknown.
    

    timeout: int (0)

      Default value 0 means infinite timeout. Otherwise, time in seconds before the
      process will stop and raise an exception if the transfer is still in progress.
    
  • Attributes

    hash

Methods

active(self)

Return whether the downloader is active

static clear_saved()

Remove references to all persisted downloads.

static load()

Load list of persisted transfers from disk, for possible resumption.

  • Returns

    A dictionary of download instances, keyed by auto-generated unique
    hashes. The state of the chunks (completed, errored, etc.) can be
    seen in the status attribute. Instances can be resumed with `run()`.
    

run(self, nthreads=None, monitor=True)

Populate transfer queue and execute downloads

  • Parameters

    nthreads: int [None]

      Override default nthreads, if given
    

    monitor: bool [True]

      To watch and wait (block) until completion.
    

save(self, keep=True)

Persist this download

Saves a copy of this transfer process in its current state to disk. This is done automatically for a running transfer, so that as a chunk is completed, this is reflected. Thus, if a transfer is interrupted, e.g., by user action, the transfer can be restarted at another time. All chunks that were not already completed will be restarted at that time.

See the load method to retrieve saved transfers and run to resume a stopped transfer.

  • Parameters

    keep: bool (True)

      If True, the transfer will be saved if some chunks remain to be
      completed; otherwise, it will be removed from the saved state.
    

successful(self)

Return whether the downloader completed successfully.

It will raise AssertionError if the downloader is active.

azure.datalake.store.lib.auth(tenant_id=None, username=None, password=None, client_id='', client_secret=None, resource='https://datalake.azure.net/', require_2fa=False, authority=None, retry_policy=None, **kwargs)

User/password authentication

  • Parameters

    tenant_id: str

      associated with the user’s subscription, or “common”
    

    username: str

      active directory user
    

    password: str

      sign-in password
    

    client_id: str

      the service principal client
    

    client_secret: str

      the secret associated with the client_id
    

    resource: str

      resource for auth (e.g., https://datalake.azure.net/)
    

    require_2fa: bool

      indicates this authentication attempt requires two-factor authentication
    

    authority: string

      The full URI of the authentication authority to authenticate against (such as https://login.microsoftonline.com/)
    

    kwargs: key/values

      Other parameters, for future use
    
  • Returns

    A DataLakeCredential object

azure-data-lake-store-python's People

Contributors

akharit, asikaria-msft, begoldsm, bluca, chdevala, elvisace, jbcrail, lmazuel, markusweimer, martindurant, matt1883, microsoft-github-policy-service[bot], milanchandna, mvds00, ro-joowan, rrenfrow86, uranusjr


azure-data-lake-store-python's Issues

Upload 100GB file on machine with 112GB of memory results in OOM error and cascading errors

When I attempt to upload a 100GB file with 128 threads on a vm with 112GB of memory I get an out of memory error, followed by a large number of IO errors on closed streams.

When an error is encountered it should not immediately result in cascading errors if possible, and retry should be considered (although with OOM errors that is probably not practical).

Below is a partial stack trace. The IO errors continue for a long time:

PS E:\ingress> python C:\tools\sampleUploadDownload.py
Uploading file...
Traceback (most recent call last):
File "C:\tools\sampleUploadDownload.py", line 24, in
multithread.ADLUploader(adl, lpath=fileLocationToUpload, rpath=remoteFolderName + '/' + remoteFileName, nthreads=128
) # change the thread number up or down depending on tuning for perf
MemoryError
Exception ignored in: <bound method AzureDLFile.__del__ of <ADL file: /tmp/e683991075_35701915648>>
Traceback (most recent call last):
File "c:\tools\azure-data-lake-store-python\adlfs\core.py", line 702, in __del__
self.close()
File "c:\tools\azure-data-lake-store-python\adlfs\core.py", line 685, in close
self.flush(force=True)
File "c:\tools\azure-data-lake-store-python\adlfs\core.py", line 628, in flush
if self.buffer.tell() == 0:
ValueError: I/O operation on closed file.
Exception ignored in: <bound method AzureDLFile.__del__ of <ADL file: /tmp/e683991075_42949672960>>
Traceback (most recent call last):
File "c:\tools\azure-data-lake-store-python\adlfs\core.py", line 702, in __del__
self.close()
File "c:\tools\azure-data-lake-store-python\adlfs\core.py", line 685, in close
self.flush(force=True)
File "c:\tools\azure-data-lake-store-python\adlfs\core.py", line 628, in flush
if self.buffer.tell() == 0:
ValueError: I/O operation on closed file.

AccessControlException after file upload

I am trying to upload the content of the local file into a DataLake store, with the following simple snippet:

token = lib.auth(tenant_id='782633d2-40ee-4e13-a016-xxxx', client_id='6a706405-9ef2-440c-9585-xxxxx', client_secret='yyyy=')
adl = AzureDLFileSystem(store_name='mystore', token=token)

adl.put('updates/update.xml', update.xm)

I can see the file is a data explorer, with a non-zero size, however, once I click on the file, I get

 Error
AccessControlException
Message
FsOpenStream failed with error 0x83090aa2 (Either the resource does not exist or the current user is not authorized to perform the requested operation).

 [58c2c59a-1a6d-42da-aaff-9f9bde8216be] failed with error 0x83090aa2 (Either the resource does not exist or the current user is not authorized to perform the requested operation).

 [58c2c59a-1a6d-42da-aaff-9f9bde8216be][2016-10-06T00:44:11.1559029-07:00]

I can only delete the file from the web browser. My user is the owner of the root Data Lake Store folder.

PRI 0: Uploading 100GB file results in 200GB on the server

Using the multi-part uploader is not resulting in valid files being uploaded to the server. On completion of a 100GB file upload, checking the status of that file reports back a file that is 200GB.

It is critical that the multi-part upload and download logic have self-checks that confirm they have uploaded/download as much data as they are supposed to, and to report problems/failures if there is too much data.

As an example, there can be scenarios on the server where an append request "fails" from the client perspective (due to a timeout or something similar) but is committed on the server. In this case we should make sure that when we think we have finished uploading or downloading a segment that it is actually as long as we expect it to be, and to discard it/error out if it is not the right size.

More flexibility with trace logging

Currently we see a lot of general debug logging when executing, especially on large files/folders. Ideally, it would be good to be able to turn this on/off and have the ability to tweak verbosity (from debug all the way to just error cases). I am not sure if this is possible, but it would also be good to be able to optionally print out the request/response in a nice format. For example, PowerShell today hooks into the tracing logging that takes place within the standard .NET sdk to enable things like this if the user wants to debug what is actually being sent:

Get-AzureRmDataLakeAnalyticsAccount -Debug
DEBUG: ============================ HTTP REQUEST ============================

HTTP Method:
GET

Absolute Uri:
https://api-dogfood.resources.windows-int.net/subscriptions/90585f39-ab85-4921-bb37-0468f7efd1dd/providers/Microsoft.Da
taLakeAnalytics/accounts?api-version=2015-10-01-preview

Headers:
x-ms-client-request-id : fe5fda9a-fd40-475f-9dec-a35c2ae7fe0c
accept-language : en-US

Body:

DEBUG: ============================ HTTP RESPONSE ============================

Status Code:
OK

Headers:
Pragma : no-cache
x-ms-original-request-ids :
ed2622ab-58bf-4cef-b960-650f971e6ada,eea2a95c-20e1-49c9-ba4a-310e31b3ffd4,dc27ecf2-8a38-4d01-94f7-ddd8f9bdc171
x-ms-ratelimit-remaining-subscription-reads: 14863
x-ms-request-id : c55140b3-5d56-409e-a5c3-2d668b8b1134
x-ms-correlation-request-id : c55140b3-5d56-409e-a5c3-2d668b8b1134
x-ms-routing-request-id : CENTRALUS:20160901T233000Z:c55140b3-5d56-409e-a5c3-2d668b8b1134
Strict-Transport-Security : max-age=31536000; includeSubDomains
Cache-Control : no-cache
Date : Thu, 01 Sep 2016 23:29:59 GMT

Body:
{
"value": [
{
"properties": {
"provisioningState": "Succeeded",
"state": "Active",
"endpoint": "e2etestkonappebn3p.konaaccountdogfood.net",
"accountId": "61ab843c-2f11-42cc-9f1d-851b69d1281f",
"creationTime": "2016-08-11T22:58:35.9195777Z",
"lastModifiedTime": "2016-08-11T22:58:35.9195777Z"
},
"location": "brazilsouth",
"tags": null,
"id":
"/subscriptions/90585f39-ab85-4921-bb37-0468f7efd1dd/resourceGroups/konagroup-ppe-bn3p/providers/Microsoft.DataLakeAnal
ytics/accounts/e2etestkonappebn3p",
"name": "e2etestkonappebn3p",
"type": "Microsoft.DataLakeAnalytics/accounts"
} ...
}

Meta: Test scenarios

Cancel upload/download

  • Validate that the final file does not exist
  • Resume the upload/download
  • Validate that it completes successfully

Validate retry logic when a basic REST call fails for uploader/downloader

  • Mock a failure in append/create. Validate the operation retries and eventually fails
  • Resume the upload/download with mocked failure remove, verify it succeeds

Verify overwrite logic

  • Files/folders should not be overwritten if the user does not indicate a force operation
  • Files/folders should be overwritten if the user does indicate force (default should not be force)

Verify record boundary splitting is honored

  • If this is passed in, verify that it is honored.
  • Verify that if the user passes in this boundary splitting and the data is >4MB when a boundary is found, that we throw (since we cannot support records that are longer than 4mb).

Progress tracking/state validation

  • Validate that metadata about the upload/download is correct upon completion/error termination
  • This should also be done during the resume tests to ensure the metadata is accurate/useable.

Task 4.3: Package publishing

Get the functionality ready for package publishing, which includes ensuring our getting started documentation, samples and readthedocs code documentation is ready and has been reviewed.

Lower Pri: Large folder upload/download has a large startup cost in time

When uploading or downloading a very large directory, we see a large startup cost where the client is getting all the information it needs to begin (or resume) the operation. This could benefit from some parallelization or optimizations to reduce the amount of time needed to setup the operation.

The current timing is about 30 seconds for a folder with 10,000 1mb files in a nested structure.

Lots of memory errors when attempting to upload/download large files

When attempting to upload or download a large file (such as a 10GB file) I am constantly running into MemoryErrors. Looking at how much memory Python is using during this time I see that it is right around 2GB. This seems like a lot of memory taken to process a 10GB file. Additionally, 2GB of memory use shouldn't be enough to cause these failures in the python runtime (although I imagine that is just a setting somewhere). Ideally we shouldn't be scaling memory usage to be 20% of the file size, since we need to support the ability to upload files in the terabyte range (for upload and download).

Cancel logic hangs on large folders

On a folder containing 100,000 1mb files if I issue a cancel request the operation still hasn't returned control back to me after about 30 minutes (and it appears to still be uploading, very slowly).

Replace CONCAT with MSCONCAT

There is a design limitation in CONCAT due to passing all of the paths on the URI, which can cause failures when concatenating a large number of files. Until this is fixed in a new WebHDFS protocol, please use the MSCONCAT op code and pass the paths in the body of the request as an application/octet-stream, using the following format for the stream in the body:
sources=<comma-separated list of full paths>

To see more information about this API you can see the swagger specification for it here:
https://github.com/Azure/azure-rest-api-specs/blob/master/arm-datalake-store/filesystem/2015-10-01-preview/swagger/filesystem.json#L335

Additionally, for further optimizations for concat please see issue: #49

Optimization for APPEND, CREATE and OPEN

To avoid redirects and extra web calls, the following query parameters should be added to the APPEND, CREATE and OPEN operations:
APPEND: append=true
CREATE: write=true
OPEN: read=true

Documentation

Placeholder: no part of this project can be considered complete without adequate documentation, both as generated from docstrings, and explicit explanations of motivation, typical use etc.

Uploader fails if the full path to the target file doesn't exist

Creation of a file in ADLS does not require the underlying path to exist. The create call itself will create the entire path or throw if it can't. The recommendation is to remove this requirement and fail only if the target file itself can't be created (which should still be tested for at the very beginning).

Block size variable not set for non-delimited files

When opening non-delimited files in either wb or ab mode, the blocksize instance variable isn't set and causes some tests to fail with this error:

AttributeError: 'AzureDLFile' object has no attribute 'blocksize'

Pri 0: Uploading 50,000 1mb files resulting in a memory exception

The following exception is thrown when attempting to upload 50,000 1MB files. Note that it keeps going on like this for a very long time:

Uploading 50000 1MB files...
Traceback (most recent call last):
File "C:\tools\sampleUploadDownload.py", line 36, in
multithread.ADLUploader(adl, lpath='D:\ingress\largeFolder', rpath=remoteFolderName + '/50000files', nthreads=64)

change the thread number up or down depending on tuning for perf

File "c:\tools\azure-data-lake-store-python\adlfs\multithread.py", line 233, in init
self.run()
File "c:\tools\azure-data-lake-store-python\adlfs\multithread.py", line 271, in run
self.client.run(nthreads, monitor)
File "c:\tools\azure-data-lake-store-python\adlfs\transfer.py", line 337, in run
self.monitor()
File "c:\tools\azure-data-lake-store-python\adlfs\transfer.py", line 370, in monitor
self._wait(poll, timeout)
File "c:\tools\azure-data-lake-store-python\adlfs\transfer.py", line 352, in _wait
self._update()
File "c:\tools\azure-data-lake-store-python\adlfs\transfer.py", line 326, in _update
self.save()
File "c:\tools\azure-data-lake-store-python\adlfs\transfer.py", line 410, in save
pickle.dump(all_downloads, f)
MemoryError
Unhandled exception in thread started by
Unhandled exception in thread started by Unhandled exception in thread started by

Unhandled exception in thread started by <bound method Thread._bootstrap of <Thread(Thread-173, started daemon 11736)>>
Traceback (most recent call last):
Unhandled exception in thread started by
Unhandled exception in thread started by Unhandled exception in thread started by
Exception ignored in: <generator object CaseInsensitiveDict.iter.. at 0x37CD71D8>
Traceback (most recent call last):
Unhandled exception in thread started by <bound method Thread._bootstrap of <Thread(Thread-149, started daemon 17448)>>
Traceback (most recent call last):
Traceback (most recent call last):
Unhandled exception in thread started by <bound method Thread._bootstrap of <Thread(Thread-167, started daemon 4856)>>
Unhandled exception in thread started by <bound method Thread._bootstrap of <Thread(Thread-135, started daemon 18448)>>
File "C:\Users\adlsperf\AppData\Local\Programs\Python\Python35-32\lib\concurrent\futures\thread.py", line 66, in _work
er
Traceback (most recent call last):
File "C:\Users\adlsperf\AppData\Local\Programs\Python\Python35-32\lib\concurrent\futures\thread.py", line 66, in _work
er
Traceback (most recent call last):
Traceback (most recent call last):
<bound method Thread._bootstrap of <Thread(Thread-155, started daemon 6328)>>
Exception in thread Thread-189:
Traceback (most recent call last):
File "C:\Users\adlsperf\AppData\Local\Programs\Python\Python35-32\lib\concurrent\futures\thread.py", line 64, in _work
er
MemoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\adlsperf\AppData\Local\Programs\Python\Python35-32\lib\threading.py", line 914, in bootstrap_inner
File "C:\Users\adlsperf\AppData\Local\Programs\Python\Python35-32\lib\threading.py", line 862, in run
File "C:\Users\adlsperf\AppData\Local\Programs\Python\Python35-32\lib\concurrent\futures\thread.py", line 81, in work
er
File "C:\Users\adlsperf\AppData\Local\Programs\Python\Python35-32\lib\logging__init
.py", line 1326, in critical
MemoryError

File "C:\Users\adlsperf\AppData\Local\Programs\Python\Python35-32\lib\concurrent\futures\thread.py", line 66, in _work
er
File "C:\Users\adlsperf\AppData\Local\Programs\Python\Python35-32\lib\concurrent\futures\thread.py", line 64, in _work
er
Traceback (most recent call last):
File "C:\Users\adlsperf\AppData\Local\Programs\Python\Python35-32\lib\concurrent\futures\thread.py", line 64, in _work
er
File "C:\Users\adlsperf\AppData\Local\Programs\Python\Python35-32\lib\concurrent\futures\thread.py", line 66, in _work
er
Unhandled exception in thread started by <bound method Thread._bootstrap of <Thread(Thread-160, started daemon 17808)>>
Unhandled exception in thread started by <bound method Thread._bootstrap of <Thread(Thread-174, started daemon 14356)>>
MemoryError
MemoryError:
MemoryError

During handling of the above exception, another exception occurred:

File "C:\Users\adlsperf\AppData\Local\Programs\Python\Python35-32\lib\concurrent\futures\thread.py", line 64, in _work
er
Traceback (most recent call last):
Exception ignored in: <bound method AzureDLFile.__del__ of <ADL file: /begoldsm/50000files/f1/f2/16926>>
Traceback (most recent call last):
work_item.run()
File "C:\Users\adlsperf\AppData\Local\Programs\Python\Python35-32\lib\concurrent\futures\thread.py", line 66, in _work
er
Exception ignored in: <bound method AzureDLFile.__del__ of <ADL file: /begoldsm/50000files/f1/11838>>
Traceback (most recent call last):

During handling of the above exception, another exception occurred:

MemoryError

During handling of the above exception, another exception occurred:

MemoryError

During handling of the above exception, another exception occurred:

File "C:\Users\adlsperf\AppData\Local\Programs\Python\Python35-32\lib\concurrent\futures\thread.py", line 64, in _work
er
MemoryError

ADLUploader can silently fail on a segment, resulting in perceived success but no target file created

More logging, segment validation and feedback to the caller is required to indicate the health/status of uploads. For example, for a 100GB file we are seeing execution seem to succeed (no errors reported and the final message is output). It is important that even with the least noisy level of logging that we surface valid status messages to the caller, even if it is just a true/false message indicating that, from our perspective, the upload/download succeeded or failed.

Improvement Recommendation: More unique "tmp" folder per file/folder upload

This is a minor optimization which helps with the finalization step for each file. There is an option within the concatenate call: op=MSCONCAT&deleteSourceDirectory=true.

This call, instead of deleting each file individually, deletes the entire folder instead. This can be used if the only contents in the folder are just the contents for the file being concatenated and it drastically increases the performance of the concatenate operation.
Additionally, MSCONCAT should be the operation used (not op=CONCAT) due to the fact that, for very large files (or paths) putting all of those in the URI can result in failure. MSCONCAT puts all of that content in the body instead, which has a much larger capacity.

PRI0: Upload fails but reports success

Upload of a 50gb file is failing, even though no error is returned to the caller. This bug has two issues:

  1. There is not enough validation taking place in the upload to properly report success/failure to the caller with confidence
  2. Upload is failing consistently, which needs to be addressed and validation for it must be included in future runs.

With debug logs active I see that a lot of transfers are failing, most don't have any error at all associated with them, and some are indicating I/O errors due to accessing closed files (log attached).

uploadLog.txt

Task 5.1: Convenience layer for auto-generated clients

This is much lower priority than the previous four milestones, but if we have time it would be good to go over the auto-generated client functionality for our other four clients and see if there are any good quality of life improvements we can make for users.

Can no longer pass the store name or url_suffix into AzureDLFileSystem. Only option is env variable

In the comments it indicates there is a param called "store". In reality, this is the constructor definition for AzureDLFileSystem (including the comments of what the args are claimed to be as well):

store : str ("")
Store name to connect to
token : dict
When setting up a new connection, this contains the authorization
credentials (see lib.auth()).
url_suffix: str (None)
Domain to send REST requests to. The end-point URL is constructed
using this and the store_name. If None, use default.
kwargs: optional key/values
For auth, such as username, password. See lib.auth()
def __init__(self, token=None, **kwargs)

Note that store and url_suffix are both missing from the list.

Auth with user name and password does not work

token = lib.auth(tenant_id, username, password)

gives

adal.adal_error.AdalError: Get Token request returned http error: 400 and server response: {"error":"invalid_request","error_description":"AADSTS90014: The request body must contain the following parameter: 'client_id'.\r\nTrace ID: 899e028b-469f-4dba-ab97-518e5a3e42e5\r\nCorrelation ID: 24bd2ce3-b084-40ab-9e8f-46f5a54ac501\r\nTimestamp: 2016-10-06 09:21:19Z","error_codes":[90014],"timestamp":"2016-10-06 09:21:19Z","trace_id":"899e028b-469f-4dba-ab97-518e5a3e42e5","correlation_id":"24bd2ce3-b084-40ab-9e8f-46f5a54ac501"}

and according to this line https://github.com/Azure/azure-data-lake-store-python/blob/dev/azure/datalake/store/lib.py#L106

it cannot work just with username and password.

PRI 0: Use the Offset parameter in append to ensure we are always appending at the offset we expect

The append API allows for a query parameter called offset, which allows you to specify where you think the end of the file is to append data to. This is very useful for us, since it allows us to submit an append call where we "think" the file should currently be. If data has already been uploaded at that location, you will receive a 400 error with a BadOffsetException, which you can catch. This will indicate that data is already at that location and we can move on to the next 4mb chunk to upload. This will help preserve the uploaded file and ensure that its final length is the same as the source file.

Here is a sample request/response where the offset is not the end of the file (meaning there is already data at or ahead of the offset specified):

Request:
POST https://adlspysample01.azuredatalakestore.net/webhdfs/v1/sample.txt?op=APPEND&append=true&offset=4&api-version=2015-10-01-preview HTTP/1.1
x-ms-client-request-id: 33401497-4ac0-451e-ab13-eccd305e3706
accept-language: en-US
Authorization:
User-Agent: Microsoft.Azure.Management.DataLake.Store.DataLakeStoreFileSystemManagementClient/0.12.6-preview AzurePowershell/v2.1.0.0
CommandName: Add-AzureRmDataLakeStoreItemContent
ParameterSetName: __AllParameterSets
Content-Type: application/octet-stream
Host: adlspysample01.azuredatalakestore.net
Content-Length: 4
Expect: 100-continue

test

Response
HTTP/1.1 400 Bad Request
Cache-Control: no-cache
Pragma: no-cache
Content-Length: 260
Content-Type: application/json; charset=utf-8
Expires: -1
x-ms-request-id: a8b65992-21f3-4c5d-8fe0-f0b49d0320f1
Server-Perf: [a8b65992-21f3-4c5d-8fe0-f0b49d0320f1][ AuthTime::0::PostAuthTime::0 ][S-FsOpenStream :: 00:00:010 ms]%0a[BufferingTime :: 00:00:000 ms]%0a[WriteTime :: 00:00:000 ms]%0a[S-FsAppendStream :: 00:00:036 ms]%0a[S-FsCloseHandle :: 00:00:001 ms]%0a[APPEND :: 00:00:269 ms]%0a
x-ms-webhdfs-version: 16.07.18.01
Status: 0x83090015
X-Content-Type-Options: nosniff
Strict-Transport-Security: max-age=15724800; includeSubDomains
Date: Thu, 15 Sep 2016 01:21:15 GMT

{"RemoteException":{"exception":"BadOffsetException","message":"FsAppendStream failed with error 0x83090015 (Bad offset). [a8b65992-21f3-4c5d-8fe0-f0b49d0320f1][2016-09-14T18:21:16.0457431-07:00]","javaClassName":"org.apache.hadoop.fs.adl.BadOffsetException"}}

General view of the SDK choices

It would be very helpful for us to have a big picture of the choices you made for the SDK architecture. Clearly you have experience with big data services, and it's amazing how the project is advancing. But I would like to take a step back to better understand how amazing it is :)

Example:

  • What is dask and why do you think it's important to follow this framework? Why do you think future adlfs customers, who are Data Lake specialists, will be glad that the lib follows dask?
  • What are our competitors, like Amazon, Google, etc., doing? Are they following dask too? If not, why not choose to follow their pattern to facilitate the transition to Azure Data Lake?

Thanks

urls are hard-coded

Should allow option to set the REST base URLs of the various user-facing interfaces.

Download operations failing silently sometimes

We are seeing issues where, unless progress tracking is being printed out, download fails without reporting any error messages, resulting in a corrupt downloaded file and the perception of success.

Example output:
remote file : /begoldsmtest/50_1GB_Files
remote file size: 52425000000
[errored] file begoldsmtest/50_1GB_Files/14 -> D:\ingress\50_1GB_Files.out\14, chunk D:\ingress\50_1GB_Files.out\14 0
[ 3/ 4 chunks] begoldsmtest/50_1GB_Files/14 -> D:\ingress\50_1GB_Files.out\14
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/36 -> D:\ingress\50_1GB_Files.out\36
[errored] file begoldsmtest/50_1GB_Files/45 -> D:\ingress\50_1GB_Files.out\45, chunk D:\ingress\50_1GB_Files.out\45 805306368
[ 3/ 4 chunks] begoldsmtest/50_1GB_Files/45 -> D:\ingress\50_1GB_Files.out\45
[errored] file begoldsmtest/50_1GB_Files/30 -> D:\ingress\50_1GB_Files.out\30, chunk D:\ingress\50_1GB_Files.out\30 0
[errored] file begoldsmtest/50_1GB_Files/30 -> D:\ingress\50_1GB_Files.out\30, chunk D:\ingress\50_1GB_Files.out\30 536870912
[ 2/ 4 chunks] begoldsmtest/50_1GB_Files/30 -> D:\ingress\50_1GB_Files.out\30
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/18 -> D:\ingress\50_1GB_Files.out\18
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/31 -> D:\ingress\50_1GB_Files.out\31
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/17 -> D:\ingress\50_1GB_Files.out\17
[errored] file begoldsmtest/50_1GB_Files/26 -> D:\ingress\50_1GB_Files.out\26, chunk D:\ingress\50_1GB_Files.out\26 805306368
[ 3/ 4 chunks] begoldsmtest/50_1GB_Files/26 -> D:\ingress\50_1GB_Files.out\26
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/29 -> D:\ingress\50_1GB_Files.out\29
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/11 -> D:\ingress\50_1GB_Files.out\11
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/50 -> D:\ingress\50_1GB_Files.out\50
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/39 -> D:\ingress\50_1GB_Files.out\39
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/10 -> D:\ingress\50_1GB_Files.out\10
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/7 -> D:\ingress\50_1GB_Files.out\7
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/12 -> D:\ingress\50_1GB_Files.out\12
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/3 -> D:\ingress\50_1GB_Files.out\3
[errored] file begoldsmtest/50_1GB_Files/9 -> D:\ingress\50_1GB_Files.out\9, chunk D:\ingress\50_1GB_Files.out\9 268435456
[ 3/ 4 chunks] begoldsmtest/50_1GB_Files/9 -> D:\ingress\50_1GB_Files.out\9
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/20 -> D:\ingress\50_1GB_Files.out\20
[errored] file begoldsmtest/50_1GB_Files/37 -> D:\ingress\50_1GB_Files.out\37, chunk D:\ingress\50_1GB_Files.out\37 0
[ 3/ 4 chunks] begoldsmtest/50_1GB_Files/37 -> D:\ingress\50_1GB_Files.out\37
[errored] file begoldsmtest/50_1GB_Files/33 -> D:\ingress\50_1GB_Files.out\33, chunk D:\ingress\50_1GB_Files.out\33 0
[ 3/ 4 chunks] begoldsmtest/50_1GB_Files/33 -> D:\ingress\50_1GB_Files.out\33
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/22 -> D:\ingress\50_1GB_Files.out\22
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/19 -> D:\ingress\50_1GB_Files.out\19
[errored] file begoldsmtest/50_1GB_Files/23 -> D:\ingress\50_1GB_Files.out\23, chunk D:\ingress\50_1GB_Files.out\23 805306368
[ 3/ 4 chunks] begoldsmtest/50_1GB_Files/23 -> D:\ingress\50_1GB_Files.out\23
[errored] file begoldsmtest/50_1GB_Files/5 -> D:\ingress\50_1GB_Files.out\5, chunk D:\ingress\50_1GB_Files.out\5 0
[ 3/ 4 chunks] begoldsmtest/50_1GB_Files/5 -> D:\ingress\50_1GB_Files.out\5
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/6 -> D:\ingress\50_1GB_Files.out\6
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/47 -> D:\ingress\50_1GB_Files.out\47
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/16 -> D:\ingress\50_1GB_Files.out\16
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/8 -> D:\ingress\50_1GB_Files.out\8
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/25 -> D:\ingress\50_1GB_Files.out\25
[errored] file begoldsmtest/50_1GB_Files/28 -> D:\ingress\50_1GB_Files.out\28, chunk D:\ingress\50_1GB_Files.out\28 805306368
[errored] file begoldsmtest/50_1GB_Files/28 -> D:\ingress\50_1GB_Files.out\28, chunk D:\ingress\50_1GB_Files.out\28 268435456
[ 2/ 4 chunks] begoldsmtest/50_1GB_Files/28 -> D:\ingress\50_1GB_Files.out\28
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/32 -> D:\ingress\50_1GB_Files.out\32
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/40 -> D:\ingress\50_1GB_Files.out\40
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/13 -> D:\ingress\50_1GB_Files.out\13
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/49 -> D:\ingress\50_1GB_Files.out\49
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/1 -> D:\ingress\50_1GB_Files.out\1
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/43 -> D:\ingress\50_1GB_Files.out\43
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/21 -> D:\ingress\50_1GB_Files.out\21
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/34 -> D:\ingress\50_1GB_Files.out\34
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/48 -> D:\ingress\50_1GB_Files.out\48
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/42 -> D:\ingress\50_1GB_Files.out\42
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/44 -> D:\ingress\50_1GB_Files.out\44
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/46 -> D:\ingress\50_1GB_Files.out\46
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/27 -> D:\ingress\50_1GB_Files.out\27
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/41 -> D:\ingress\50_1GB_Files.out\41
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/4 -> D:\ingress\50_1GB_Files.out\4
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/2 -> D:\ingress\50_1GB_Files.out\2
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/15 -> D:\ingress\50_1GB_Files.out\15
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/35 -> D:\ingress\50_1GB_Files.out\35
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/24 -> D:\ingress\50_1GB_Files.out\24
[ 4/ 4 chunks] begoldsmtest/50_1GB_Files/38 -> D:\ingress\50_1GB_Files.out\38
[bench_download_50_1gb] finished in 180.0904s
16ca4ecebb7987748097be6a87a98c20 D:\ingress\50_1GB_Files
3f38b1eccd3f79f2fde9b0e032c37449 D:\ingress\50_1GB_Files.out

ADLUploader defaults to overwrite if file exists

The operation should fail, and the user should be told that they must explicitly pass a force/overwrite flag to replace an existing file when using ADLUploader. Today, re-running the same upload command against the same target location succeeds silently instead of complaining.
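A minimal sketch of the requested guard, assuming the filesystem object exposes an exists() check; the helper and the overwrite flag are hypothetical, not the current ADLUploader API:

def check_target(adl, rpath, overwrite=False):
    # Refuse to clobber an existing remote file unless the caller opts in.
    if adl.exists(rpath) and not overwrite:
        raise OSError(
            '%s already exists; pass an explicit overwrite/force flag to replace it' % rpath)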

Azure Python strong recommendation: match the naming conventions in os and os.path

@martindurant I just got some feedback from the Azure Python SDK team. They are pushing for our filesystem interface to match the naming conventions and styles that exist in os and os.path.
Another recommendation is a slight change toward more descriptive naming:

  • ls -> listdir or scandir (plus stat for a single item)
  • cat and head -> readlines

I think these renames also mirror the os module and give a more general set of methods that is not tied to any one OS's naming.
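Purely as an illustration of the proposal (none of this exists in the library today), the os-style names could be thin aliases over the current methods:

from azure.datalake.store import core

class OSNamedADLFileSystem(core.AzureDLFileSystem):
    # Hypothetical aliases following os / os.path naming.
    def listdir(self, path=''):
        return self.ls(path)                # names only, like os.listdir
    def scandir(self, path=''):
        return self.ls(path, detail=True)   # per-item dictionaries, akin to os.scandir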

Clients need to handle refresh token if supplied during client construction

I mentioned this in our Slack conversation last week; I'm adding it here to keep it tracked.
We need to ensure that we maintain the same level of support and functionality that exists in the AutoRest clients. In those clients there is a base class called ServiceClient which is ultimately responsible for all HTTP requests being sent and received. This class has logic that handles token expiration and attempts to refresh the token using the credentials that were initially provided during client construction. Our client must do the same to ensure that we don't end up with expired tokens during long-running operations (such as file upload/download).

Here is a link to the specific logic within AutoRest that handles the refresh: https://github.com/Azure/autorest/blob/45ddc1a84c58f5732a2f77418079c9439f39e892/src/client/Python/msrest/msrest/service_client.py#L189

The basic algorithm is:

  1. Attempt to send the request.
  2. Catch invalid_grant and token-expired errors.
  3. On those errors, trigger a token refresh and resend the request.
  4. If the request still fails with invalid_grant or token-expired, raise a token expired error (see the sketch below).
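A minimal sketch of that loop, with a hypothetical exception type and callables standing in for the real request and credential machinery:

class TokenExpiredError(Exception):
    """Hypothetical: raised on invalid_grant / token-expired responses."""

def send_with_refresh(send_request, refresh_credentials):
    # Step 1: attempt the request; steps 2-3: on expiry, refresh and resend;
    # step 4: a second failure propagates to the caller as the expiry error.
    try:
        return send_request()
    except TokenExpiredError:
        refresh_credentials()
        return send_request()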

Uploaded temp files should be 256mb in size

The chunking process needs to be two-tiered:

  1. At most 4 MB of data is uploaded in a single HTTP request.
  2. The intermediate files should be as close to 256 MB as possible.

This is important for a couple of reasons:

  1. A 4 MB upload chunk size is the only way to guarantee atomicity of the upload operation for a single request. For example, if you upload 16 MB in a single request and it fails, on retry you need to determine whether 0, 4, 8, or 12 MB of data were committed (since the server always commits in 4 MB chunks).
  2. The intermediate files should be as close to 256 MB as possible to avoid massive fragmentation and lifecycle performance problems once the data is in the system; 256 MB is the optimal size in our system for chunks to be concatenated at.

So, for example, a 10 GB file should be split into ~40 intermediate files of 256 MB each, and each of those can be uploaded in parallel using <= 4 MB requests.
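As a back-of-the-envelope check of the numbers above (sizes only; the variable names are illustrative):

import math

file_size   = 10 * 2**30     # the 10 GB example
segment     = 256 * 2**20    # target intermediate-file size
request_max = 4 * 2**20      # maximum data per HTTP request

print(math.ceil(file_size / segment))     # 40 intermediate files
print(math.ceil(segment / request_max))   # 64 requests per intermediate file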

Place temporary files under destination path

As mentioned by Ben, we should place uploaded chunks in a temporary directory under the destination path. The temporary directory/file names proposed below are in keeping with other SDK implementations.

If the destination file is /user/deeperFolder/bar.txt, then the temporary directory will be:

/user/deeperFolder/bar.txt.segments.{random string}/

Within the temporary directory, the temporary chunk filenames will be:

bar.txt.{random string}.segment{int}

where {int} is the chunk number.
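An illustrative sketch of the proposed layout (uuid stands in for the random string; the helper itself is hypothetical):

import posixpath
import uuid

def segment_paths(destination, n_chunks):
    name = posixpath.basename(destination)                    # e.g. 'bar.txt'
    rand = uuid.uuid4().hex[:8]
    tmp_dir = '{0}.segments.{1}'.format(destination, rand)
    return [posixpath.join(tmp_dir, '{0}.{1}.segment{2}'.format(name, rand, i))
            for i in range(n_chunks)]

# segment_paths('/user/deeperFolder/bar.txt', 2) ->
# ['/user/deeperFolder/bar.txt.segments.<random>/bar.txt.<random>.segment0',
#  '/user/deeperFolder/bar.txt.segments.<random>/bar.txt.<random>.segment1']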

Licensing

All source files need the correct license notice and headers.

Client Usability: Error messages must be actionable and include trace IDs from the server

When the user receives an error from the client it must be actionable. This means it must either:

  1. Tell the user what they did wrong and how to correct it (FileNotFound, AccessDenied, Conflict, etc.), or
  2. Give the user enough information to effectively escalate to service engineers (full error responses from the service, x-ms-client-request-id, traceId, etc.).

Right now, the parallel upload/download client returns a "RuntimeError: Max number of ADL retries exceeded" message. This should be updated to also include the error that kept triggering the retries, so the user knows which underlying failure caused us to abort the run.

This should also be true for all single calls: this is a service and it could return a 500 error at some point (due to service unavailability, for example), and it's important that customers can give service engineers as much information as possible to quickly and correctly identify and resolve their issue.
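For the parallel client, a sketch of what surfacing the underlying failure might look like (the helper and its arguments are illustrative, not the current implementation):

def raise_retry_exhausted(last_exception, request_id):
    # Keep the existing message, but append the last error and the trace id.
    raise RuntimeError(
        'Max number of ADL retries exceeded; last error: {0!r} '
        '(x-ms-client-request-id: {1})'.format(last_exception, request_id))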

Need to implement block size restrictions

a) When writing, each REST call should pass up to 4 MB of data and should end on a delimiter (e.g., b'\n'), if one is given; see the sketch below.
b) When splitting a file for upload, boundaries should fall on a delimiter, if one is given, so we need to scan the file and find delimiters near the split boundaries.
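A minimal sketch of (a): trimming a write buffer so a single call carries at most 4 MB and, when a delimiter is supplied, ends on it (the helper name is illustrative):

REQUEST_MAX = 4 * 2**20  # 4 MB

def bytes_to_send(buffer, delimiter=None, limit=REQUEST_MAX):
    # How many bytes of `buffer` the next REST call should carry.
    block = buffer[:limit]
    if delimiter is None or len(buffer) <= limit:
        return len(block)
    i = block.rfind(delimiter)
    # If no delimiter falls inside the 4 MB window, fall back to a plain block.
    return i + len(delimiter) if i != -1 else len(block)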

Client credential authentication

We currently support getting a token via a username/password combination, but we also need to support authentication via client credentials (client id and secret).
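For reference, the requested call might look something like the following; the keyword names mirror the existing username/password form of lib.auth but are speculative here:

from azure.datalake.store import lib

# Hypothetical client-credential (client id + secret) flow; the resulting token
# would then be passed to AzureDLFileSystem exactly as a password-derived token is.
token = lib.auth(tenant_id='...', client_id='...', client_secret='...')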
