
hdfs's Issues

`auth=False` in request results in HdfsError: Authentication failure.

I am using the library to work with two kerberized Hadoop clusters (Cluster A and Cluster B). The library works beautifully with Cluster A, but on Cluster B it raises HdfsError: Authentication failure. Check your credentials.

It seems that Cluster B needs the request to the redirect location to be signed as well, but consumer() sets auth=False when it calls _request (Line 394). The request is then sent using _session in _request() (lines 174-180). Since kwargs['auth'] is False, _session.auth is not used. Somehow this is only a problem on Cluster B.

Adding del kwargs['auth'] before Line 174 patches the problem for Cluster B.
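For anyone hitting the same thing, here is a rough sketch of that workaround as a monkey patch, so the installed package does not need to be edited. The _request name and the auth handling come from the issue text above; treat this as an assumption to verify against your installed version, not a documented API.

import hdfs.client

_original_request = hdfs.client.Client._request

def _patched_request(self, *args, **kwargs):
    # Drop an explicit auth=False so the session-level auth (e.g. Kerberos)
    # also signs the redirected request.
    if kwargs.get('auth') is False:
        del kwargs['auth']
    return _original_request(self, *args, **kwargs)

hdfs.client.Client._request = _patched_request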

upload files from local to hdfs failed


from hdfs import *
client = Client('http://172.17.201.16:50070',proxy='hdfs')

client.upload('/user/hdfs/','/home/xxxxx/haha.txt')

ERROR:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/hdfs/client.py", line 484, in upload
    raise err
hdfs.util.HdfsError: Failed to obtain user group information: org.apache.hadoop.security.authorize.AuthorizationException: User: dr.who is not allowed to impersonate hdfs
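A hedged note on the error itself: dr.who is Hadoop's default user for unauthenticated HTTP requests, so the NameNode is refusing to let an anonymous caller impersonate hdfs via proxy=. On a cluster using simple authentication, passing the username directly usually avoids this; a minimal sketch:

from hdfs import InsecureClient

# user= sets user.name on the WebHDFS requests instead of proxying.
client = InsecureClient('http://172.17.201.16:50070', user='hdfs')
client.upload('/user/hdfs/', '/home/xxxxx/haha.txt')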

Upload can delete whole directories accidentally

We were using the upload API call to transfer a file to an existing directory. Unfortunately, due to this line of code it seems that when an error occurs during upload the entire destination folder is deleted recursively.

Fortunately, we only lost test data.

Can you consider disabling deletes by default, or perhaps just when the target path is a folder?

InvalidSchema: No connection adapters were found

Hi,

I've set up WebHDFS and I'm trying to use the CLI with:

$ hdfscli --alias=dev
In [1]: CLIENT.list('/')

However, I'm getting the following error:

InvalidSchema: No connection adapters were found for 'hdfs://<HOST>:<PORT>/webhdfs/v1/<FOLDER>'

The WebHDFS REST API works fine with e.g.:

curl "http://<HOST>:<PORT>/webhdfs/v1/<FOLDER>?op=LISTSTATUS"

My config (~/.hdfscli.cfg) looks like:

[global]
default.alias = dev

[dev.alias]
url = hdfs://<HOST>:<PORT>

Am I using hdfs incorrectly or is there something sketchy going on? And it should work without setting up HttpFS, right?
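For what it's worth, the InvalidSchema error comes from requests not knowing the hdfs:// scheme: the client talks to WebHDFS over plain HTTP(S), so the alias URL needs an http:// (or https://) scheme. And plain WebHDFS on the NameNode (as in the curl test above) should be sufficient; HttpFS is not required. A sketch of the corrected alias section, keeping the same placeholders:

[dev.alias]
url = http://<HOST>:<PORT>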

Support for Get_ACL_Status

Hi @mtth.
It would be nice if the WebHDFS GETACLSTATUS operation were supported. I need a quick way to get the ACL details of a specific path/file in HDFS.

If you like the idea, I would like to contribute and add the feature. Any suggestions on where to look first?
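Until that lands, a minimal sketch that hits the WebHDFS REST endpoint directly (GETACLSTATUS is a standard WebHDFS operation; the host, port, and user below are placeholders):

import requests

def get_acl_status(host, port, hdfs_path, user):
    url = 'http://{0}:{1}/webhdfs/v1{2}'.format(host, port, hdfs_path)
    resp = requests.get(url, params={'op': 'GETACLSTATUS', 'user.name': user})
    resp.raise_for_status()
    return resp.json()['AclStatus']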

Functionality to search for files by name

It seems like there is no method for searching files by name in the current API. I want to be able to pass in a search query like "samp" and get a list of files/folders that contain the string (e.g. "samplefile.txt", "samplefolder", etc.)

The only way I can think of for searching currently is by using this method https://hdfscli.readthedocs.io/en/latest/api.html#hdfs.client.Client.walk and iterating through all of the files/folders, looking for names that match the input string. Not sure how efficient this solution would be.
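That walk-and-filter approach is probably the best available, since WebHDFS has no server-side search operation; a hedged sketch (client and the pattern are placeholders):

import fnmatch
import posixpath

def find(client, root, pattern):
    # Yield HDFS paths under root whose basename matches the glob pattern.
    for path, dirs, files in client.walk(root):
        for name in dirs + files:
            if fnmatch.fnmatch(name, pattern):
                yield posixpath.join(path, name)

# e.g. matches = list(find(client, '/', '*samp*'))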

How is walk supposed to be used?

How is walk supposed to be used?

I'm trying (path, dirs, files) = CLIENT.walk("/", depth=1) but this gives me an error:

>>> (path, dirs, files) = CLIENT.walk("/", depth=1)
Traceback (most recent call last):
    File "<console>", line 1, in <module>
ValueError: need more than 1 value to unpack
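walk() returns a generator that yields one (path, dirs, files) tuple per directory, so it needs to be iterated rather than unpacked directly:

for path, dirs, files in CLIENT.walk('/', depth=1):
    print(path, dirs, files)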

Expose a listdir method

The only way to get a directory listing is to use the internal API like so

client._list_status(path).json()

It'd be great to have a client.listdir(path) method.
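For reference, recent releases expose this directly as Client.list, which returns names by default and richer information with status=True:

names = client.list('/some/path')
infos = client.list('/some/path', status=True)  # also returns each entry's FileStatus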

Mixed path separators when using Windows

os.path.join() will use whatever separator the local system uses. We should force to "/" and not allow mixed separators, lest you get a 400 bad request from WebHDFS.

In the function 'resolve' in 'client.py'

path = osp.normpath(osp.join(self.root, path))

should be changed to

path = osp.normpath(osp.join(self.root, path)).replace("\\", "/")

EDITED because formatting ate a backslash

There are several other places this should be enforced. Perhaps there is a better way to globally force it?
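A hedged sketch of the more general fix: build HDFS paths with posixpath instead of os.path, so the separator is always '/' regardless of the local OS and no replace() is needed.

import posixpath as psp

def hdfs_join(root, *parts):
    # posixpath always uses '/', even on Windows.
    return psp.normpath(psp.join(root, *parts))

# hdfs_join('/user/alice', 'data', 'file.txt') -> '/user/alice/data/file.txt'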

Configurable logging

I'm using Client.status internally in ibis to determine whether a file exists, and this results in various scary logger warnings being printed during normal user actions. Would it be possible to make this configurable so I can hide these messages somehow?
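In the meantime, the standard logging machinery can silence them; the 'hdfs.client' logger name below is taken from the log output shown elsewhere on this page:

import logging

logging.getLogger('hdfs.client').setLevel(logging.ERROR)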

Connection to swebhdfs

The cluster I'm trying to connect to uses secure webhdfs, so I've set up the ~/.hdfscli.cfg in the following way.

[global]
default.alias = dev

[dev.alias]
url = https://<namenode>:14000
user = <my.username>

I've tried to upload a file as shown in the quickstart document but got an error message complaining about the SSL certificate.

$ hdfscli upload --alias=dev empty.txt models/
ERROR   Unexpected exception.
Traceback (most recent call last):
  File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/requests-2.11.1-py3.5.egg/requests/packages/urllib3/connectionpool.py", line 595, in urlopen
    chunked=chunked)
  File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/requests-2.11.1-py3.5.egg/requests/packages/urllib3/connectionpool.py", line 352, in _make_request
    self._validate_conn(conn)
  File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/requests-2.11.1-py3.5.egg/requests/packages/urllib3/connectionpool.py", line 831, in _validate_conn
    conn.connect()
  File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/requests-2.11.1-py3.5.egg/requests/packages/urllib3/connection.py", line 289, in connect
    ssl_version=resolved_ssl_version)
  File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/requests-2.11.1-py3.5.egg/requests/packages/urllib3/util/ssl_.py", line 308, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/lib/python3.5/ssl.py", line 377, in wrap_socket
    _context=self)
  File "/usr/lib/python3.5/ssl.py", line 752, in __init__
    self.do_handshake()
  File "/usr/lib/python3.5/ssl.py", line 988, in do_handshake
    self._sslobj.do_handshake()
  File "/usr/lib/python3.5/ssl.py", line 633, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/requests-2.11.1-py3.5.egg/requests/adapters.py", line 423, in send
    timeout=timeout
  File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/requests-2.11.1-py3.5.egg/requests/packages/urllib3/connectionpool.py", line 621, in urlopen
    raise SSLError(e)
requests.packages.urllib3.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/hdfs/config.py", line 195, in wrapper
    return func(*args, **kwargs)
  File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/hdfs/__main__.py", line 260, in main
    progress=progress,
  File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/hdfs/client.py", line 453, in upload
    hdfs_path = self.resolve(hdfs_path)
  File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/hdfs/client.py", line 220, in resolve
    root = self._get_home_directory('/').json()['Path']
  File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/hdfs/client.py", line 92, in api_handler
    **self.kwargs
  File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/hdfs/client.py", line 178, in _request
    **kwargs
  File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/requests-2.11.1-py3.5.egg/requests/sessions.py", line 475, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/requests-2.11.1-py3.5.egg/requests/sessions.py", line 596, in send
    r = adapter.send(request, **kwargs)
  File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/requests-2.11.1-py3.5.egg/requests/adapters.py", line 497, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)

Does hdfscli support secure webhdfs connections?
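It should, as long as requests can validate (or is told to skip validating) the certificate. A hedged sketch using the documented session parameter on the client constructor to point verification at the cluster's CA bundle; the bundle path is a placeholder:

import requests
from hdfs import InsecureClient

session = requests.Session()
session.verify = '/path/to/internal-ca.pem'  # or False to skip verification (not recommended)

client = InsecureClient('https://<namenode>:14000', user='<my.username>', session=session)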

Format of .hdfsrc file not documented

Hello - I am trying to use the HDFS client from the CLI.
I see that the help ([-a ALIAS]) suggests the alias is optional, but it seems to be required.
I have not found a sample $HOME/.hdfsrc file.
Can you please point to an example file?

Thanks a lot
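For reference, a sample in the .hdfsrc format used by these older releases, mirroring the working example quoted in another issue further down this page (the alias, url, and root are placeholders to adjust):

[hdfs]
default.alias=dev

[dev_alias]
url=http://<namenode>:50070
root=/user/<username>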

Dataframe extension: order of columns not preserved

My dataframe contains the following data when printed:

df
Out[149]: 
       user gender        age  \
0     Peter      F  23.000000   
1   M.A.R.Y      F  27.333333   
2  The King      M  28.000000   
3   Tim Tom      M  28.000000   
4      Mary      F  29.000000   

                                          self_intro  
0  Hello, my name is Peter. I am graduated from t...  
1  I am Mary. I am from Maryland. I was born from...  
2                                I am King. The end.  
3  Hi, I am Tim Tom. I love eating snakes. I am v...  
4                                                ...  

I use the code below to test:

write_dataframe(hdfs_client, df_avro_filepath, df, overwrite=True)
_df = read_dataframe(hdfs_client, df_avro_filepath)
print(_df)
pd.util.testing.assert_frame_equal(df, _df)

and the output has the columns sorted alphabetically:

         age gender                                         self_intro  \
0  29.000000      F                                                ...   
1  27.333334      F  I am Mary. I am from Maryland. I was born from...   
2  23.000000      F  Hello, my name is Peter. I am graduated from t...   
3  28.000000      M  Hi, I am Tim Tom. I love eating snakes. I am v...   
4  28.000000      M                                I am King. The end.   

       user  
0      Mary  
1   M.A.R.Y  
2     Peter  
3   Tim Tom  
4  The King 
...
AssertionError: DataFrame.columns are different

DataFrame.columns values are different (75.0 %)
[left]:  Index(['user', 'gender', 'age', 'self_intro'], dtype='object')
[right]: Index(['age', 'gender', 'self_intro', 'user'], dtype='object')

Progress tracking for write method

Can you add progress tracking to the "write" method like there is for the "upload" method? I want to be able to directly write binary data to my HDFS and track the progress. I can't use the "upload" method because there is no local_path.
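As a workaround in the meantime, a hedged sketch: wrap the data in a generator that reports how many bytes have been yielded, and pass it to write(), which accepts generators. client, hdfs_path, and chunks are placeholders.

def with_progress(chunks, callback):
    # Yield chunks unchanged while reporting the running byte count.
    total = 0
    for chunk in chunks:
        total += len(chunk)
        callback(total)
        yield chunk

client.write(hdfs_path, data=with_progress(chunks, print), overwrite=True)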

PEP8 compliance

It will make it easier for others to contribute to this package if it adheres to Python style best practices (e.g. 4 spaces per indent)

Obtain the plain file stream (instead of a contextmanager generator) while reading

Consider this scenario: I have implemented a middleware that serves a file object from HDFS to a downstream application. In the middleware I wrote:

# file_storage.py
def open(path):
    return hdfs_client.read(path)

And in my downstream app I wrote something like:

# app.py
import contextlib
import file_storage

with contextlib.closing(file_storage.open(path)) as stream:
    stream.read()

If I wrapped hdfs_client.read in a with block inside the middleware, the downstream app would end up reading a closed resource.

So is there some way to obtain the plain file stream directly? Or should the contextmanager decorator be replaced by a more generic approach?
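One possible direction, sketched under the assumption that the object yielded by read() exposes read() (as in the quickstart usage shown elsewhere on this page): enter the context manager eagerly in the middleware and hand back a file-like wrapper whose close() exits it, so contextlib.closing keeps working downstream.

# file_storage.py (sketch)
class HdfsStream(object):
    def __init__(self, client, path):
        self._cm = client.read(path)
        self._stream = self._cm.__enter__()  # open the underlying stream now

    def read(self, *args, **kwargs):
        return self._stream.read(*args, **kwargs)

    def close(self):
        self._cm.__exit__(None, None, None)  # release the connection

def open(path):
    return HdfsStream(hdfs_client, path)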

URL encoding of directories

Many times, you need characters such as '=' in your HDFS path names. For instance, a partitioned Hive table can have '../year=2015/month=6/day=17' as the path name of a directory. Impala, for instance, uses this naming convention by default for partitioned tables.

Using hdfs.InsecureClient(url, user).list(hdfs_path) on such a path results in

HdfsError: File ../year%3D2015/month%3D6/day%3D30 does not exist.

AvroReader - 'module' object is not callable

When using the Confluent schema registry (https://github.com/verisign/python-confluent-schemaregistry) together with this module, AvroReader fails with 'module' object is not callable.

'module' object is not callable
  File "/venv1/lib64/python3.4/site-packages/hdfs/ext/avro/__init__.py", line 195, in __enter__
    self._schema = next(self._records)  # Prime generator to get schema.
  File "/venv1/lib64/python3.4/site-packages/hdfs/ext/avro/__init__.py", line 186, in _reader
    avro_reader = fastavro.reader(_SeekableReader(bytes_reader))

Pseudo-code:

from hdfs import InsecureClient
from hdfs.ext.avro import AvroReader, AvroWriter
from confluent.schemaregistry.client import CachedSchemaRegistryClient
from confluent.schemaregistry.serializers import MessageSerializer

schema_registry_client = CachedSchemaRegistryClient(url='http://' + settings['schema_registry_server'])
serializer = MessageSerializer(schema_registry_client)
# hdfs_client is an InsecureClient instance created elsewhere.
with AvroReader(hdfs_client, settings['hdfs_target_file']) as reader:
    pass

No way to send kwargs to the _Request objects

I am trying to connect to a namenode on a kerberized cluster but am having problems because of an invalid certificate authority. It would be great if I could provide verify=False to the _Request objects, perhaps on a per-session basis.
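A hedged sketch of one way this might be done: build a requests.Session with the desired TLS settings and hand it to the client. Whether KerberosClient forwards a session keyword in your installed version is an assumption to verify (the base Client does document one):

import requests
from hdfs.ext.kerberos import KerberosClient

session = requests.Session()
session.verify = False  # or the path to your internal CA bundle

client = KerberosClient('https://namenode.example.com:50470', session=session)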

/usr/bin/hdfs needs rename

It would be good to rename /usr/bin/hdfs to something else, for example hdfscli. The issue is that when installing your package on hosts with Hadoop 2 already installed, running the hadoop hdfs command will most likely invoke yours instead, as /usr/bin is typically earlier on $PATH.

The other issue is that if someone like Cloudera, Hortonworks, or Apache Bigtop decides to symlink /usr/bin/hdfs to the hadoop hdfs command, there will be a package conflict preventing installation.

Can't walk over dirs with spaces

This is a REPL session with a "/test" dir that contains a directory with a space in its name, "dir with space". When trying to traverse it, the client requests "/test/dir%20with%20space" instead of "/test/dir with space".
Hadoop version: Hadoop 2.5.0-cdh5.3.6

>>> from hdfs import InsecureClient
>>> ic = InsecureClient("http://localhost:50070")
>>> for (path, dirs, files) in ic.walk("/test"):
...     print(dirs)
... 
[u'dir with space']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/hdfs/client.py", line 938, in walk
    for infos in _walk(hdfs_path, s, depth):
  File "/usr/lib/python2.7/site-packages/hdfs/client.py", line 932, in _walk
    for infos in _walk(path, s, depth - 1):
  File "/usr/lib/python2.7/site-packages/hdfs/client.py", line 918, in _walk
    infos = self.list(dir_path, status=True)
  File "/usr/lib/python2.7/site-packages/hdfs/client.py", line 887, in list
    statuses = self._list_status(hdfs_path).json()['FileStatuses']['FileStatus']
  File "/usr/lib/python2.7/site-packages/hdfs/client.py", line 90, in api_handler
    **self.kwargs
  File "/usr/lib/python2.7/site-packages/hdfs/client.py", line 182, in _request
    return _on_error(response)
  File "/usr/lib/python2.7/site-packages/hdfs/client.py", line 44, in _on_error
    raise HdfsError(message)
hdfs.util.HdfsError: File /test/dir%20with%20space does not exist.

Copy command

Hey, is there any way to simply copy data? I can't seem to find it anywhere in the docs. Thanks
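As far as I can tell there is no single copy call, but one can be composed from read() and write(), streaming the content through the client; a minimal sketch (paths are placeholders):

def copy(client, src_path, dst_path):
    # Stream the source straight into the destination without a local temp file.
    with client.read(src_path) as reader:
        client.write(dst_path, data=reader, overwrite=True)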

Implement HDFS trash functionality in hdfs module

After an accidental deletion of some reasonably important data, and thinking I'd be able to get some files back from Trash, I've discovered that WebHDFS does not enable HDFS Trash behind the scenes; it's considered the client's responsibility to implement it.

Is this something that could be incorporated into the hdfs module? I've had a look around and it looks like hue has implemented some of this functionality in its internal code.

https://github.com/cloudera/hue/blob/master/desktop/libs/hadoop/src/hadoop/fs/webhdfs.py
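A hedged sketch of the client-side approach: instead of deleting, rename the path into the user's trash directory using the documented makedirs and rename calls. The /user/<user>/.Trash/Current layout is an assumption borrowed from what the Hadoop shell does, not something this library defines.

import posixpath

def move_to_trash(client, hdfs_path, user):
    trash_dir = '/user/{0}/.Trash/Current'.format(user)
    client.makedirs(trash_dir)
    # Note: this does not handle name collisions inside the trash directory.
    client.rename(hdfs_path, posixpath.join(trash_dir, posixpath.basename(hdfs_path)))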

optional user parameter for InsecureClient

When using hdfs from Windows, the local username often differs from the username on the Hadoop cluster.

Can you add an example of passing the username to InsecureClient to the quickstart tutorial?
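For reference, something along these lines (host and user are placeholders; user sets the user.name sent with each WebHDFS request):

from hdfs import InsecureClient

client = InsecureClient('http://namenode.example.com:50070', user='hadoop_user')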

Client.download repeatedly quotes files

This causes an error when downloading files with '%' in the filename:

 >>> c.write('/foo/foo bar%.txt', 'hello')
 >>> c.download('foo/foo bar%.txt', '/tmp/foobar')
 HdfsError: File foo/foo bar%25.txt does not exist

Which Hadoop version should I use?

Hi dear author(s),

Thank you very much for your brilliant work; this is exactly the combination of pandas and HDFS I need.
However, when I try to run the simplest example, I encounter this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/hdfs/client.py", line 396, in write
    buffersize=buffersize,
  File "/usr/lib/python2.7/site-packages/hdfs/client.py", line 86, in api_handler
    **self.kwargs
  File "/usr/lib/python2.7/site-packages/hdfs/client.py", line 196, in _request
    **kwargs
  File "/usr/lib/python2.7/site-packages/requests/api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 415, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine('\x00\x00\x00|{\x08\xff\xff\xff\xff\x0f\x10\x02\x18\t")org.apache.hadoop.ipc.RPC$VersionMismatch
>Server IPC version 9 cannot communicate with client version 470\x0e:\x00@\x01',))

My Hadoop version is 2.7.1, the latest one.
I've also installed Spark 1.4.1 on my server.

Thanks!

Don't use 'hdfs' as the binary name

Apache Hadoop scripts on PATH use "hdfs". This module's default installation shadows that binary and breaks scripts that rely on the Apache Hadoop one when installed on a host that already has Apache Hadoop.

Could the binary name be changed?

Python 3 support?

I seem to hit some syntax errors when running write_df with Python 3.4. I wasn't sure if this is supported or not.

Could you make it clearer in the documentation whether this is supported? If not, are there any plans to make it compatible in the future?

upload files from local to hdfs failed

>>> from hdfs import Client
>>> client = Client('http://172.17.201.16:50070')
>>> client.list('/')
[u'abc', u'demo', u'input', u'tmp', u'user']
>>> client.upload('/user/hdfs/','/home/spider/haha.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/hdfs/client.py", line 547, in upload
    raise err
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.HTTPConnection object at 0x7fb776c84950>: Failed to establish a new connection: [Errno -2] Name or service not known

However, uploading the file this way from a node inside the cluster succeeds.

Upgrading `requests` package breaks hdfs download

If I do:

# pip install hdfs==2.0.9 requests==2.9.1

I can successfully run a test that uploads and downloads a file from HDFS. Here is the log output of the test:

[2016-08-10 18:21:16,054] {client.py:156} INFO - Instantiated <InsecureClient(url='http://cdh:50070')>.
[2016-08-10 18:21:16,055] {client.py:427} INFO - Uploading '/tmp/tmpPQUs8k' to '/tmp/hive_temp_djE7CTrDdAQh995ydwfrcK.csv'.
[2016-08-10 18:21:16,059] {client.py:891} INFO - Listing '/tmp/hive_temp_djE7CTrDdAQh995ydwfrcK.csv'.
[2016-08-10 18:21:16,070] {connectionpool.py:207} INFO - Starting new HTTP connection (1): cdh
[2016-08-10 18:21:16,076] {client.py:375} INFO - Writing to '/tmp/hive_temp_djE7CTrDdAQh995ydwfrcK.csv'.
[2016-08-10 18:21:16,079] {connectionpool.py:207} INFO - Starting new HTTP connection (1): cdh
[2016-08-10 18:21:16,111] {base_hook.py:53} INFO - Using connection to: cdh
[2016-08-10 18:21:16,112] {client.py:156} INFO - Instantiated <InsecureClient(url='http://cdh:50070')>.
[2016-08-10 18:21:16,112] {client.py:650} INFO - Downloading '/tmp/hive_temp_djE7CTrDdAQh995ydwfrcK.csv' to '/tmp/tmp7imrk9out'.
[2016-08-10 18:21:16,114] {client.py:920} INFO - Walking '/tmp/hive_temp_djE7CTrDdAQh995ydwfrcK.csv' (depth 0).
[2016-08-10 18:21:16,119] {client.py:276} INFO - Fetching status for '/tmp/hive_temp_djE7CTrDdAQh995ydwfrcK.csv'.
[2016-08-10 18:21:16,121] {connectionpool.py:207} INFO - Starting new HTTP connection (1): cdh
[2016-08-10 18:21:16,124] {client.py:591} INFO - Reading file '/tmp/hive_temp_djE7CTrDdAQh995ydwfrcK.csv'.
[2016-08-10 18:21:16,128] {connectionpool.py:207} INFO - Starting new HTTP connection (1): cdh
[2016-08-10 18:21:16,137] {base_hook.py:53} INFO - Using connection to: cdh
[2016-08-10 18:21:16,138] {client.py:156} INFO - Instantiated <InsecureClient(url='http://cdh:50070')>.
[2016-08-10 18:21:16,138] {client.py:766} INFO - Deleting '/tmp/hive_temp_djE7CTrDdAQh995ydwfrcK.csv'.
[2016-08-10 18:21:16,139] {connectionpool.py:207} INFO - Starting new HTTP connection (1): cdh

If I now upgrade the requests package to a later version:

pip install hdfs==2.0.9 requests==2.11.0

I get the following:

[2016-08-10 18:24:45,246] {base_hook.py:53} INFO - Using connection to: cdh
[2016-08-10 18:24:45,247] {client.py:157} INFO - Instantiated <InsecureClient(url='http://cdh:50070')>.
[2016-08-10 18:24:45,248] {client.py:645} INFO - Downloading '/data/barclay/raw_excel/2016-07-10.xlsx' to 'local3.xlsx'.
[2016-08-10 18:24:45,249] {client.py:914} INFO - Walking '/data/barclay/raw_excel/2016-07-10.xlsx' (depth 0).
[2016-08-10 18:24:45,249] {client.py:279} INFO - Fetching status for '/data/barclay/raw_excel/2016-07-10.xlsx'.
[2016-08-10 18:24:45,256] {connectionpool.py:214} INFO - Starting new HTTP connection (1): cdh
[2016-08-10 18:24:45,262] {client.py:586} INFO - Reading file '/data/barclay/raw_excel/2016-07-10.xlsx'.
[2016-08-10 18:24:45,268] {connectionpool.py:214} INFO - Starting new HTTP connection (1): cdh
[2016-08-10 18:25:44,239] {client.py:721} ERROR - Error while downloading. Attempting cleanup.
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/hdfs/client.py", line 717, in download
    _download(fpath_tuple)
  File "/usr/local/lib/python2.7/site-packages/hdfs/client.py", line 660, in _download
    for chunk in reader:
  File "/usr/local/lib/python2.7/site-packages/requests/utils.py", line 372, in stream_decode_response_unicode
    raise UnicodeError("Unable to decode contents with encoding %s." % encoding)
UnicodeError: Unable to decode contents with encoding None.
---------------------------------------------------------------------------
UnicodeError                              Traceback (most recent call last)
<ipython-input-3-e7e086a08bcd> in <module>()
----> 1 hook.hdfs_download('/data/barclay/raw_excel/2016-07-10.xlsx', 'local3.xlsx')

/opt/altx_data_pipeline/altx_data_pipeline/utils/air/hooks/hive_hooks.pyc in hdfs_download(self, hdfs_path, local_path, overwrite)
    255     def hdfs_download(self, hdfs_path, local_path, overwrite=False):
    256         hdfs_client = self._get_hdfs_client()
--> 257         hdfs_client.download(hdfs_path, local_path, overwrite=overwrite)
    258 
    259     def table_from_dataframe(self, df, table_name, engine, as_statement=False):

/opt/altx_data_pipeline/altx_data_pipeline/utils/clients/hdfs_client.pyc in download(self, hdfs_path, local_path, overwrite)
     34     def download(self, hdfs_path, local_path, overwrite=False):
     35         c = self.get_conn()
---> 36         c.download(hdfs_path, local_path, overwrite=overwrite)
     37         logger.debug("[HDFS] Downloaded {} to {}".format(hdfs_path,
     38                                                          local_path))

/usr/local/lib/python2.7/site-packages/hdfs/client.pyc in download(self, hdfs_path, local_path, overwrite, n_threads, temp_dir, **kwargs)
    728         _logger.error('Unable to cleanup temporary folder.')
    729       finally:
--> 730         raise err
    731     else:
    732       if temp_path != local_path:

UnicodeError: Unable to decode contents with encoding None.

This seems to be related to an issue in requests version 2.11.0:

kennethreitz/requests#3481

Question on using Kerberos client in hdfscli

The documentation is not very specific on how to use or configure hdfs to use Kerberos authentication.
Please advise.

$ klist
Ticket cache: FILE:/homes/udb_grid/cache/krb_ticket
Default principal: [email protected]

Valid starting Expires Service principal
04/21/15 17:36:10 04/22/15 17:36:10 krbtgt/[email protected]
renew until 04/26/15 16:30:46
04/21/15 20:43:08 04/22/15 17:36:10 HTTP/[email protected]
renew until 04/26/15 16:30:46
$ python --version
Python 2.6.6
$ pip list | grep hdfs
hdfs (0.5.2)
$ cat /homes/udb_grid/.hdfsrc
[hdfs]
default.alias=udb_grid@bassniumred
[udb_grid@bassniumred_alias]
root=/projects/udb
url=http://bassniumred-nn1.red.ygrid.yahoo.com:50070
$ rm -f /tmp/hdfs.log; hdfscli --list; echo; echo; cat /tmp/hdfs.log
Authentication failure. Check your credentials.

2015-04-21 21:59:49,266 | DEBU | hdfs.client > <Client(url=http://bassniumred-nn1.red.ygrid.yahoo.com:50070, root=/projects/udb)> :: Resolved path '' to '/projects/udb'.
2015-04-21 21:59:49,267 | INFO | hdfs.client > <Client(url=http://bassniumred-nn1.red.ygrid.yahoo.com:50070, root=/projects/udb)> :: Fetching status for /projects/udb.
2015-04-21 21:59:49,267 | DEBU | hdfs.client > <Client(url=http://bassniumred-nn1.red.ygrid.yahoo.com:50070, root=/projects/udb)> :: Resolved path '/projects/udb' to '/projects/udb'.
2015-04-21 21:59:49,284 | WARN | hdfs.client > <Client(url=http://bassniumred-nn1.red.ygrid.yahoo.com:50070, root=/projects/udb)> :: [401] /webhdfs/v1/projects/udb?op=GETFILESTATUS:

<title>Error 401 Authentication required</title>

HTTP ERROR 401

Problem accessing /webhdfs/v1/projects/udb. Reason:

    Authentication required


Powered by Jetty://

2015-04-21 21:59:49,285 | ERRO | hdfs.util > Authentication failure. Check your credentials.
$
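For reference, with releases that ship the Kerberos extension, the config-driven setup looks roughly like the snippet below (mirroring the working example quoted in another issue on this page; it assumes the hdfs[kerberos] extra, i.e. requests-kerberos, is installed, and that the alias name and paths are adjusted to your cluster):

[global]
default.alias = udb_grid
autoload.modules = hdfs.ext.kerberos

[udb_grid.alias]
url = http://bassniumred-nn1.red.ygrid.yahoo.com:50070
root = /projects/udb
client = KerberosClient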

TypeError on mutual_auth since 2.0.14

Since 2.0.14 we get the error below.

Is there something I'm missing? (I don't pass any kwargs in the code.)

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/airflow/models.py", line 1250, in run
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.5/dist-packages/airflow/operators/python_operator.py", line 121, in execute
    condition = super(ShortCircuitOperator, self).execute(context)
  File "/usr/local/lib/python3.5/dist-packages/airflow/operators/python_operator.py", line 66, in execute
    return_value = self.python_callable(*self.op_args, **self.op_kwargs)
  File "/etc/airflow/dags/flux_chanel.zip/flux_chanel_utils/logic.py", line 77, in are_files_here
    return hook.check_for_path(hdfs_path)
  File "/usr/local/lib/python3.5/dist-packages/airflow/hooks/webhdfs_hook.py", line 57, in check_for_path
    c = self.get_conn()
  File "/usr/local/lib/python3.5/dist-packages/airflow/hooks/webhdfs_hook.py", line 39, in get_conn
    client = KerberosClient(connection_str)
  File "/usr/local/lib/python3.5/dist-packages/hdfs/ext/kerberos.py", line 121, in __init__
    session.auth = _HdfsHTTPKerberosAuth(int(max_concurrency), **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/hdfs/ext/kerberos.py", line 68, in __init__
    super(_HdfsHTTPKerberosAuth, self).__init__(**kwargs)
TypeError: __init__() got an unexpected keyword argument 'mutual_auth'

Question about stating root directory.

Hi @mtth.
Just a quick question about something I noticed when using hdfscli in a script to stat my prod HDFS.
When using stat on root in order to get the bytes, like client.content('/')['length'], I get a value, say X.
Shortly after that, I walk over the subdirectories, stating each one of them and also storing the length, like:

import posixpath as psp

fpaths = [
    psp.join(dpath, ddir)
    for dpath, ddirs, fnames in client.walk('/', 3)
    for ddir in ddirs
]

for item in fpaths:
    details = client.content(item)
    totalBytes = details['length']

For some directories on this walk (Y), I get lengths that are up to 300% more than what I get for root (X). Is this expected? What should I use in order to compare the length of a particular subfolder against the complete HDFS length?

TY

Edit:
Never mind, I just found out that this particular issue is with my HDFS. Apparently, it stops iterating over root when it gets to a particular folder. The same happens when doing:
hdfs dfs -du -h / -> gets only the first folder's size
and:
hdfs dfs -du -h -s / -> gets the complete list of first-folder directories.

The workaround is to do the walk myself. Thank you.

Skip empty Avro part files

Empty Avro part-files seem to yield an error when reading. Allow for skipping these rather than erroring out.

example with timeout

It is not clear how a timeout should be specified. In HTTPRequest, one can specify request_timeout.

HTTPRequest(url_with_host, method=method, body=body, request_timeout=HDFS_TIMEOUT)

We are trying to upload a very large file. It is also not clear whether setting parallelism will help; right now, we are getting timeouts.
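For the client itself, a sketch using the documented timeout parameter on the constructor (forwarded to the underlying requests calls; either a single float or a (connect, read) tuple in seconds; host and user are placeholders):

from hdfs import InsecureClient

client = InsecureClient('http://namenode.example.com:50070', user='hadoop_user',
                        timeout=(10, 600))  # 10s to connect, 600s to read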

Install HDFSCLI with kerberos extension over setup.py

Hi mtth.
In our company, we use our own tarballs as the source for installing modules on the servers, via setup.py install. The issue I have installing hdfscli this way is that I require the Kerberos extension, which is not installed by default.

I've googled around, and that seems to be a missing feature of setup.py. So I wanted to ask if you can think of a workaround for the Kerberos extension to be installed by default. (I thought moving the kerberos.py file up one level would do the trick, but the tarball gets messed up.)

Any ideas?

Question about Custom Client

Hi @mtth. I happen to be using hdfscli with the default ~/.hdfscli.cfg approach. However, after a recent server change, I had to re-create the file. So I was thinking of avoiding the file (and even the use of an environment variable) by directly passing the parameters to the client programmatically.

I'm actually just using client = Config().get_client() in the code. And this is the config file:

[global]
default.alias = prod
autoload.modules = hdfs.ext.kerberos

[prod.alias]
url = http://cluster.domain:14000/
client = KerberosClient

Any idea how to avoid using the config file with this setup? I've taken a look at
http://hdfscli.readthedocs.io/en/latest/api.html#hdfs.client.Client
but I can't quite work out how I'm supposed to use the Kerberos client this way...
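A minimal sketch of skipping the config file entirely: import the Kerberos client class and construct it with the same URL the alias pointed at (KerberosClient lives in hdfs.ext.kerberos; any additional keyword arguments depend on your release):

from hdfs.ext.kerberos import KerberosClient

client = KerberosClient('http://cluster.domain:14000/')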

Can't set timeout for requests

Evaluating this package. It looks good, but I would like to be able to set timeouts when making requests, to cope better with any namenode/datanode hiccups.

Can't figure out how to do that easily. Is there a way?

readlines() is extremely slow

I'm reading a 2.67 MB JSON file from an HDFS filesystem. It has 2809922 characters in 254 lines.

from hdfs import InsecureClient
client = InsecureClient(hadoopDirectory, user=me)

Following the quickstart guide and using timeit I measured this bit of code:

with client.read(filename) as reader:
  output = reader.read()

The read() method takes 59.2 ms.

However just substituting read() with readlines() increases the time to 1 min 3 s which is a more than thousand-fold increase for the same file.

Reading only the first line with readline() is also slow, it already takes 991 ms, though the lines are unevenly long.

Reading the file locally with standard python like this

with open(filename) as reader:
  output = reader.readlines()

takes 8.46 ms.
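A hedged workaround until this is improved: fetch the file in a single request and split it locally, which avoids whatever per-line overhead readlines() is incurring here.

with client.read(filename) as reader:
    lines = reader.read().splitlines()  # bytes; decode first if text lines are needed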

AvroWriter exponential file size

I have found a very interesting issue with AvroWriter which breaks the whole writing process.
Basically the file size starts growing exponentially.
When I export 100 records, the file size is 10 KB; when I export 200 records (of the same data), the file size is 670 KB!

I use the standard AvroWriter example (except I added a long string value to make the data bigger).
How to reproduce:
The code below exports 2 records. Just take those 2 records and copy-paste them until you have 200 records, then run the export. Observe that the file size becomes 670 KB.
Remove half of the records, leaving only 100. Export again and observe that the file size is 10 KB!

"""Avro extension example."""
from hdfs import InsecureClient
from hdfs.ext.avro import AvroReader, AvroWriter
import os

os.environ['http_proxy'] = ''
os.environ['user'] = 'my_user'

# Get the default alias' client.
client = InsecureClient('http://xxx:50070', user='my_user')

# Some sample data.
records = [
    {'name': 'Ann', 'age': 23, 'string': 'asdasd asklhj doiuiroe t43785 iofdjkvnuhyt789u judffkguirety9 jdfvhjkfcvhuierytuiy348957hjdfh jkdfgirhytiery5689347589'},
    {'name': 'Bob', 'age': 22, 'string': 'nkdfn kjfhy tuier vgfhe89734 89hIUHIUJKBNUIh iuhiuh48 9 5u7489jkldjvlcdjvoiejuioweufoksdfsdf'},
]

# Write an Avro File to HDFS (since our records' schema is very simple, we let
# the writer infer it automatically, otherwise we would pass it as argument).
# codec can be null, deflate or snappy
with AvroWriter(client, '/user/my_user/names.avro', overwrite=True) as writer:
    for record in records:
        writer.write(record)

% in file names needs to be automatically URL encoded

If I try to access a file with a percent in the filename, I need to manually URL-encode the % sign, otherwise I get a 500 response from the WebHDFS API, as % is a reserved character. RFC 3986 section 2.1 talks about percent-encoding URL characters.

-=-=-=-=-=-=-
File name with %
Note: the blank error string shows that the exception is not being caught.
-=-=-=-=-=-=-
INFO:hdfs.client:<KerberosClient(url=http://mynamnodeserver01.example.com:50070, root=/home/opsmekanix)> :: Uploading /var/tmp/bigfile to /home/opsmekanix/bigfil%e.
INFO:hdfs.client:<KerberosClient(url=http://mynamnodeserver01.example.com:50070, root=/home/opsmekanix)> :: Writing to /home/opsmekanix/bigfil%e.
DEBUG:hdfs.client:<KerberosClient(url=http://mynamnodeserver01.example.com:50070, root=/home/opsmekanix)> :: Resolved path /home/opsmekanix/bigfil%e to /home/opsmekanix/bigfil%e.
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): mynamnodeserver01.example.com
DEBUG:requests.packages.urllib3.connectionpool:"PUT /webhdfs/v1/home/opsmekanix/bigfil%e?overwrite=False&op=CREATE HTTP/1.1" 401 0
DEBUG:requests_kerberos.kerberos_:handle_401(): Handling: 401
DEBUG:requests_kerberos.kerberos_:authenticate_user(): Authorization header: Negotiate YIICXwYJKoZIhvc ... ommited header here ... SQZ8qie6jMeIlxS7+ZaW186FDpdDrDYLa9BigK9qhT3qRXx7R0=
DEBUG:requests.packages.urllib3.connectionpool:"PUT /webhdfs/v1/home/opsmekanix/bigfil%e?overwrite=False&op=CREATE HTTP/1.1" 500 0
DEBUG:requests_kerberos.kerberos_:authenticate_user(): returning <Response [500]>
DEBUG:requests_kerberos.kerberos_:handle_401(): returning <Response [500]>
DEBUG:requests_kerberos.kerberos_:handle_response(): returning <Response [500]>
DEBUG:requests_kerberos.kerberos_:handle_other(): Handling: 500
DEBUG:requests_kerberos.kerberos_:handle_other(): Authenticating the server
DEBUG:requests_kerberos.kerberos_:authenticate_server(): Authenticate header: YGYGCSqGSIb3EgECAgIAb1cwVaADAgEFoQMCAQ+iSTBHoAMCAReiQAQ+G5 ... ommitted header here ... NH6V8mO0OGfmbAxIYfpyIXyaeG8Xikdo8SQbry27Fiui0=
DEBUG:requests_kerberos.kerberos_:authenticate_server(): returning <Response [500]>
DEBUG:requests_kerberos.kerberos_:handle_other(): returning <Response [500]>
DEBUG:requests_kerberos.kerberos_:handle_response(): returning <Response [500]>
WARNING:hdfs.client:<KerberosClient(url=http://mynamnodeserver01.example.com:50070, root=/home/opsmekanix)> :: [500] /webhdfs/v1/home/opsmekanix/bigfil%e?overwrite=False&op=CREATE:

ERROR:

-=-=-=-=-=-=-
File name where the percent is represented as %26
Note: I Ctrl-C'd out of this request as I don't actually care about uploading the file
-=-=-=-=-=-=-
INFO:hdfs.client:<KerberosClient(url=http://mynamnodeserver01.example.com:50070, root=/home/opsmekanix)> :: Uploading /var/tmp/bigfile to /home/opsmekanix/bigfil%26e.
INFO:hdfs.client:<KerberosClient(url=http://mynamnodeserver01.example.com:50070, root=/home/opsmekanix)> :: Writing to /home/opsmekanix/bigfil%26e.
DEBUG:hdfs.client:<KerberosClient(url=http://mynamnodeserver01.example.com:50070, root=/home/opsmekanix)> :: Resolved path /home/opsmekanix/bigfil%26e to /home/opsmekanix/bigfil%26e.
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): mynamnodeserver01.example.com
DEBUG:requests.packages.urllib3.connectionpool:"PUT /webhdfs/v1/home/opsmekanix/bigfil%26e?overwrite=False&op=CREATE HTTP/1.1" 401 0
DEBUG:requests_kerberos.kerberos_:handle_401(): Handling: 401
DEBUG:requests_kerberos.kerberos_:authenticate_user(): Authorization header: Negotiate YIICXwYJKoZIhvc ... ommited header here ... SQZ8qie6jMeIlxS7+ZaW186FDpdDrDYLa9BigK9qhT3qRXx7R0=
DEBUG:requests.packages.urllib3.connectionpool:"PUT /webhdfs/v1/home/opsmekanix/bigfil%26e?overwrite=False&op=CREATE HTTP/1.1" 307 0
DEBUG:requests_kerberos.kerberos_:authenticate_user(): returning <Response [307]>
DEBUG:requests_kerberos.kerberos_:handle_401(): returning <Response [307]>
DEBUG:requests_kerberos.kerberos_:handle_response(): returning <Response [307]>
DEBUG:requests_kerberos.kerberos_:handle_other(): Handling: 307
DEBUG:requests_kerberos.kerberos_:handle_other(): Authenticating the server
DEBUG:requests_kerberos.kerberos_:authenticate_server(): Authenticate header: YGYGCSqGSIb3EgECAgIAb1cwVaADAgEFoQMCAQ+iSTBHoAMCAReiQAQ+G5 ... ommitted header here ... NH6V8mO0OGfmbAxIYfpyIXyaeG8Xikdo8SQbry27Fiui0=
DEBUG:requests_kerberos.kerberos_:authenticate_server(): returning <Response [307]>
DEBUG:requests_kerberos.kerberos_:handle_other(): returning <Response [307]>
DEBUG:requests_kerberos.kerberos_:handle_response(): returning <Response [307]>
DEBUG:hdfs.client:<KerberosClient(url=http://mynamnodeserver01.example.com:50070, root=/home/opsmekanix)> :: [307] /webhdfs/v1/home/opsmekanixbigfil%26e?overwrite=False&op=CREATE
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): mydatanode123.example.com
^CTraceback (most recent call last):

Error when using Kerberos. Ticket Expired?

I'm trying to get the root HDFS directory listing. So when I do CLIENT.list('/'), I get the following error traceback:

In [1]: CLIENT.list('/')
ERROR   generate_request_header(): authGSSClientStep() failed:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/requests_kerberos/kerberos_.py", line 144, in generate_request_header
    negotiate_resp_value)
kerberos.GSSError: (('Unspecified GSS failure.  Minor code may provide more information', 851968), ('Ticket expired', -1765328352))
ERROR   (('Unspecified GSS failure.  Minor code may provide more information', 851968), ('Ticket expired', -1765328352))
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/requests_kerberos/kerberos_.py", line 144, in generate_request_header
    negotiate_resp_value)
kerberos.GSSError: (('Unspecified GSS failure.  Minor code may provide more information', 851968), ('Ticket expired', -1765328352))
ERROR   generate_request_header(): authGSSClientStep() failed:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/requests_kerberos/kerberos_.py", line 144, in generate_request_header
    negotiate_resp_value)
kerberos.GSSError: (('Unspecified GSS failure.  Minor code may provide more information', 851968), ('Ticket expired', -1765328352))
ERROR   (('Unspecified GSS failure.  Minor code may provide more information', 851968), ('Ticket expired', -1765328352))
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/requests_kerberos/kerberos_.py", line 144, in generate_request_header
    negotiate_resp_value)
kerberos.GSSError: (('Unspecified GSS failure.  Minor code may provide more information', 851968), ('Ticket expired', -1765328352))
---------------------------------------------------------------------------
HdfsError                                 Traceback (most recent call last)
<ipython-input-1-b69df44579dc> in <module>()
----> 1 CLIENT.list('/')

/usr/local/lib/python3.5/site-packages/hdfs/client.py in list(self, hdfs_path, status)
    891     _logger.info('Listing %r.', hdfs_path)
    892     hdfs_path = self.resolve(hdfs_path)
--> 893     statuses = self._list_status(hdfs_path).json()['FileStatuses']['FileStatus']
    894     if len(statuses) == 1 and (
    895       not statuses[0]['pathSuffix'] or self.status(hdfs_path)['type'] == 'FILE'

/usr/local/lib/python3.5/site-packages/hdfs/client.py in api_handler(client, hdfs_path, data, strict, **params)
     90         params=params,
     91         strict=strict,
---> 92         **self.kwargs
     93       )
     94

/usr/local/lib/python3.5/site-packages/hdfs/client.py in _request(self, method, url, strict, **kwargs)
    179     )
    180     if strict and not response: # Non 2XX status code.
--> 181       return _on_error(response)
    182     else:
    183       return response

/usr/local/lib/python3.5/site-packages/hdfs/client.py in _on_error(response)
     35   """
     36   if response.status_code == 401:
---> 37     raise HdfsError('Authentication failure. Check your credentials.')
     38   try:
     39     # Cf. http://hadoop.apache.org/docs/r1.0.4/webhdfs.html#Error+Responses

HdfsError: Authentication failure. Check your credentials.

Right now I'm only testing this on dev, so what I have in ~/.hdfscli.cfg is:


[global]
default.alias = dev
autoload.modules = hdfs.ext.kerberos

[dev.alias]
url = http://devxx.atl.enterprise.com:50070/
client = KerberosClient

[prod.alias]
url = http://prod.namenode:port
root = /jobs/

What am I doing wrong? I suspect it may have something to do with the user. (When I shell into the server, the username changes from mine (latencio.site) to hadoop.)

Also, I have access to Cloudera Manager to check the Kerberos configuration.
