mtth / hdfs
API and command line interface for HDFS
Home Page: https://hdfscli.readthedocs.io
License: MIT License
I am using the library to work with two kerberized Hadoop clusters (Cluster A and Cluster B). The library works beautifully with Cluster A, but raises HdfsError: Authentication failure. Check your credentials on Cluster B.
It seems that Cluster B needs the request to the redirect location to be signed as well, but consumer() sets auth=False when it calls _request (line 394). The request is sent using _session in _request() (lines 174-180). Since kwargs['auth'] is False, _session.auth is not used. Somehow this is only a problem on Cluster B.
Adding del kwargs['auth'] before line 174 patches the problem for Cluster B.
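The mechanism described above can be reproduced with requests alone: an explicit auth kwarg, even a falsy one, wins over the session-level auth, which is why the redirected request goes out unsigned. A minimal sketch, assuming requests is installed (merge_setting is the internal helper requests uses when preparing a request):

```python
import requests
from requests.sessions import merge_setting

session = requests.Session()
session.auth = ('user', 'secret')

# auth=None falls back to the session's auth...
assert merge_setting(None, session.auth) == ('user', 'secret')
# ...but auth=False overrides it, disabling authentication for that request.
assert merge_setting(False, session.auth) is False
```

This is consistent with the del kwargs['auth'] workaround: removing the key entirely lets the session's auth apply again.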
from hdfs import *
client = Client('http://172.17.201.16:50070', proxy='hdfs')
client.upload('/user/hdfs/', '/home/xxxxx/haha.txt')
ERROR:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/hdfs/client.py", line 484, in upload
raise err
hdfs.util.HdfsError: Failed to obtain user group information: org.apache.hadoop.security.authorize.AuthorizationException: User: dr.who is not allowed to impersonate hdfs
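For context, WebHDFS identifies the caller via the user.name query parameter; when it is absent, the request runs as the default user dr.who, which is what the AuthorizationException above reflects (proxy='hdfs' asks the current, unauthenticated user to impersonate hdfs). A stdlib sketch of the URL shape involved, reusing the host and path from the snippet above:

```python
from urllib.parse import urlencode

# Hypothetical request URL for the upload above, with the caller identified
# explicitly instead of falling back to dr.who.
base = 'http://172.17.201.16:50070/webhdfs/v1/user/hdfs/haha.txt'
params = {'op': 'CREATE', 'user.name': 'hdfs', 'overwrite': 'true'}
url = base + '?' + urlencode(params)
assert 'user.name=hdfs' in url
```

With the library itself, passing user='hdfs' to InsecureClient (instead of proxy='hdfs') sets this parameter for you.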
We were using the upload API call to transfer a file to an existing directory. Unfortunately, due to this line of code, it seems that when an error occurs during upload the entire destination folder is deleted recursively.
Fortunately, we only lost test data.
Can you consider disabling deletes by default, or perhaps just when the target path is a folder?
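Until the default changes, a defensive pattern is to upload under a temporary name and rename on success, so a failed transfer can only ever affect the temporary file, never the existing directory. A sketch; safe_upload is a hypothetical helper, not part of the library:

```python
import posixpath
import uuid

def safe_upload(client, hdfs_dir, local_path):
    # Upload to a fresh temporary name inside the target directory first.
    tmp = posixpath.join(hdfs_dir, '.upload-' + uuid.uuid4().hex)
    client.upload(tmp, local_path)
    # Only after a successful transfer move it to its final name.
    final = posixpath.join(hdfs_dir, posixpath.basename(local_path))
    client.rename(tmp, final)
    return final
```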
Hi,
I've set up WebHDFS and I'm trying to use the CLI with:
$ hdfscli --alias=dev
In [1]: CLIENT.list('/')
However, I'm getting the following error:
InvalidSchema: No connection adapters were found for 'hdfs://<HOST>:<PORT>/webhdfs/v1/<FOLDER>'
The WebHDFS REST API works fine with e.g.:
curl "http://<HOST>:<PORT>/webhdfs/v1/<FOLDER>?op=LISTSTATUS"
My config (~/.hdfscli.cfg) looks like:
[global]
default.alias = dev
[dev.alias]
url = hdfs://<HOST>:<PORT>
Am I using hdfs incorrectly or is there something sketchy going on? And it should work without setting up HttpFS, right?
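For reference, the InvalidSchema error comes from the hdfs:// scheme: the client talks plain HTTP to WebHDFS, so the alias URL needs http:// (or https://), not hdfs://. A corrected sketch of the config above, with the same placeholders:

```ini
[global]
default.alias = dev

[dev.alias]
url = http://<HOST>:<PORT>
```

And yes, this should work against plain WebHDFS without setting up HttpFS.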
Hi @mtth, is there any plan to support the ORC data format?
Hi @mtth .
It would be nice if the WebHDFS method Get_ACL_Status were supported. I need a quick way to get the ACL details of a specific path/file in HDFS.
If you like the idea, I would like to contribute and add the feature. Any suggestions on where to look first?
It seems like there is no method for searching files by name in the current API. I want to be able to pass in a search query like "samp" and get a list of files/folders containing the string (e.g. "samplefile.txt", "samplefolder", etc.).
The only way I can think of to search currently is to use https://hdfscli.readthedocs.io/en/latest/api.html#hdfs.client.Client.walk and iterate through all of the files/folders, looking for names that match the input string. I'm not sure how efficient this solution would be.
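That is essentially the only option over WebHDFS, which has no server-side search, and it can be wrapped in a small helper. A sketch; find_paths is hypothetical and takes the (root, dirs, files) tuples that Client.walk yields, so it still visits every entry (O(number of files)):

```python
import posixpath

def find_paths(walk_iter, query):
    # Collect full paths whose basename contains the query string.
    matches = []
    for root, dirs, files in walk_iter:
        for name in dirs + files:
            if query in name:
                matches.append(posixpath.join(root, name))
    return matches

# Usage sketch: find_paths(client.walk('/'), 'samp')
```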
When HDFS is set up with Kerberos, there are different principals for each service as shown here: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SecureMode.html
When authenticating with the KerberosClient, do I need to use the NameNode principal or the WebHDFS principal (i.e. nn/<host>@<REALM> or HTTP/<host>@<REALM>)?
pip-installing hdfs uninstalled my newer pandas.
How is walk supposed to be used? I'm trying
(path, dirs, files) = CLIENT.walk("/", depth=1)
but this gives me an error:
>>> (path, dirs, files) = CLIENT.walk("/", depth=1)
Traceback (most recent call last):
File "<console>", line 1, in <module>
ValueError: need more than 1 value to unpack
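The unpack fails because walk follows the os.walk convention used elsewhere in these issues: it returns a generator of (path, dirs, files) tuples, one per directory, so it has to be iterated rather than unpacked directly. The stdlib analogue shows the pattern (using a temporary directory so it is self-contained):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'sub'))
    results = [(path, dirs, files) for path, dirs, files in os.walk(root)]

# One tuple per directory visited, not a single triple:
assert len(results) == 2

# With the hdfs client the pattern is the same (a sketch):
# for path, dirs, files in CLIENT.walk('/', depth=1): ...
```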
The only way to get a directory listing is to use the internal API, like so:
client._list_status(path).json()
It'd be great to have a client.listdir(path) method.
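Until such a method exists, the internal call above can be wrapped. A sketch; listdir is a hypothetical helper built on the response shape of the WebHDFS LISTSTATUS operation:

```python
def listdir(client, path):
    # LISTSTATUS returns {'FileStatuses': {'FileStatus': [...]}}; each
    # entry's 'pathSuffix' is the name relative to the listed directory.
    statuses = client._list_status(path).json()['FileStatuses']['FileStatus']
    return [s['pathSuffix'] for s in statuses]
```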
os.path.join() will use whatever separator the local system uses. We should force "/" and not allow mixed separators, lest you get a 400 Bad Request from WebHDFS.
In the function resolve in client.py,
path = osp.normpath(osp.join(self.root, path))
should be changed to
path = osp.normpath(osp.join(self.root, path)).replace("\\", "/")
There are several other places this should be enforced. Perhaps there is a better way to globally force it?
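One way to enforce this globally is to use posixpath instead of os.path for HDFS path manipulation, since HDFS paths are POSIX-style regardless of the local platform. A sketch showing both the patched expression and the posixpath alternative (paths here are made up for illustration):

```python
import os.path as osp
import posixpath

root, path = '/user/data', 'sub\\dir'  # a Windows-style relative path

# The patched expression: normalize, then force forward slashes.
fixed = osp.normpath(osp.join(root, path)).replace('\\', '/')

# Equivalent without the replace: posixpath always uses '/'.
clean = posixpath.normpath(posixpath.join(root, 'sub/dir'))
```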
I'm using Client.status internally in ibis to determine whether a file exists, and this results in various scary logger warnings being printed during normal user actions. Would it be possible to make this configurable so I can hide these messages somehow?
The cluster I'm trying to connect to uses secure WebHDFS, so I've set up ~/.hdfscli.cfg in the following way.
[global]
default.alias = dev
[dev.alias]
url = https://<namenode>:14000
user = <my.username>
I've tried to upload a file as shown in the quickstart document but got an error message complaining about the SSL certificate.
$ hdfscli upload --alias=dev empty.txt models/
ERROR Unexpected exception.
Traceback (most recent call last):
File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/requests-2.11.1-py3.5.egg/requests/packages/urllib3/connectionpool.py", line 595, in urlopen
chunked=chunked)
File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/requests-2.11.1-py3.5.egg/requests/packages/urllib3/connectionpool.py", line 352, in _make_request
self._validate_conn(conn)
File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/requests-2.11.1-py3.5.egg/requests/packages/urllib3/connectionpool.py", line 831, in _validate_conn
conn.connect()
File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/requests-2.11.1-py3.5.egg/requests/packages/urllib3/connection.py", line 289, in connect
ssl_version=resolved_ssl_version)
File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/requests-2.11.1-py3.5.egg/requests/packages/urllib3/util/ssl_.py", line 308, in ssl_wrap_socket
return context.wrap_socket(sock, server_hostname=server_hostname)
File "/usr/lib/python3.5/ssl.py", line 377, in wrap_socket
_context=self)
File "/usr/lib/python3.5/ssl.py", line 752, in __init__
self.do_handshake()
File "/usr/lib/python3.5/ssl.py", line 988, in do_handshake
self._sslobj.do_handshake()
File "/usr/lib/python3.5/ssl.py", line 633, in do_handshake
self._sslobj.do_handshake()
ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/requests-2.11.1-py3.5.egg/requests/adapters.py", line 423, in send
timeout=timeout
File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/requests-2.11.1-py3.5.egg/requests/packages/urllib3/connectionpool.py", line 621, in urlopen
raise SSLError(e)
requests.packages.urllib3.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/hdfs/config.py", line 195, in wrapper
return func(*args, **kwargs)
File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/hdfs/__main__.py", line 260, in main
progress=progress,
File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/hdfs/client.py", line 453, in upload
hdfs_path = self.resolve(hdfs_path)
File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/hdfs/client.py", line 220, in resolve
root = self._get_home_directory('/').json()['Path']
File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/hdfs/client.py", line 92, in api_handler
**self.kwargs
File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/hdfs/client.py", line 178, in _request
**kwargs
File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/requests-2.11.1-py3.5.egg/requests/sessions.py", line 475, in request
resp = self.send(prep, **send_kwargs)
File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/requests-2.11.1-py3.5.egg/requests/sessions.py", line 596, in send
r = adapter.send(request, **kwargs)
File "/home/gianluca/.virtualenvs/search/lib/python3.5/site-packages/requests-2.11.1-py3.5.egg/requests/adapters.py", line 497, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)
Does hdfscli support secure webhdfs connections?
Hi, I have a question about how the KerberosClient works at: http://stackoverflow.com/questions/41292891/how-does-the-hdfscli-library-decide-which-kerberos-principal-to-use
Basically, it seems like the KerberosClient doesn't have any parameters to specify the keytab location or the principal to log in with. How does it find the keytab and decide which principal to use? Thank you.
Hello - I am trying to use hdfscli from the CLI.
I see that the help suggests [-a ALIAS] is optional, but it seems to be required.
Also, I am not seeing a sample $HOME/.hdfsrc file. Can you please point me to an example file?
Thanks a lot
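For reference, recent releases read ~/.hdfscli.cfg (older versions used ~/.hdfsrc), and the alias flag only becomes optional once a default.alias is set there. A minimal sketch, with placeholder host and user:

```ini
[global]
default.alias = dev

[dev.alias]
url = http://namenode.example.com:50070
user = someuser
```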
My dataframe has below data when printed:
df
Out[149]:
user gender age \
0 Peter F 23.000000
1 M.A.R.Y F 27.333333
2 The King M 28.000000
3 Tim Tom M 28.000000
4 Mary F 29.000000
self_intro
0 Hello, my name is Peter. I am graduated from t...
1 I am Mary. I am from Maryland. I was born from...
2 I am King. The end.
3 Hi, I am Tim Tom. I love eating snakes. I am v...
4 ...
I use below code to test:
write_dataframe(hdfs_client, df_avro_filepath, df, overwrite=True)
_df = read_dataframe(hdfs_client, df_avro_filepath)
print(_df)
pd.util.testing.assert_frame_equal(df, _df)
and the output has the column sorted alphabetically:
age gender self_intro \
0 29.000000 F ...
1 27.333334 F I am Mary. I am from Maryland. I was born from...
2 23.000000 F Hello, my name is Peter. I am graduated from t...
3 28.000000 M Hi, I am Tim Tom. I love eating snakes. I am v...
4 28.000000 M I am King. The end.
user
0 Mary
1 M.A.R.Y
2 Peter
3 Tim Tom
4 The King
...
AssertionError: DataFrame.columns are different
DataFrame.columns values are different (75.0 %)
[left]: Index(['user', 'gender', 'age', 'self_intro'], dtype='object')
[right]: Index(['age', 'gender', 'self_intro', 'user'], dtype='object')
Can you add progress tracking to the "write" method like there is for the "upload" method? I want to be able to directly write binary data to HDFS and track the progress. I can't use the "upload" method because there is no local_path
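In the meantime, since write can take a generator of byte chunks as its data argument, progress can be tracked by wrapping the data source. iter_with_progress below is a hypothetical helper, not a library API:

```python
def iter_with_progress(chunks, callback):
    # Yield each chunk unchanged, reporting cumulative bytes after each one.
    total = 0
    for chunk in chunks:
        total += len(chunk)
        callback(total)
        yield chunk

# Usage sketch:
# client.write('/path', data=iter_with_progress(my_chunks, print))
```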
It will make it easier for others to contribute to this package if it adheres to Python style best practices (e.g. 4 spaces per indent)
Suppose such a circumstance: I have implemented a middleware to serve a file object from HDFS to the downstream application. So in the middleware I wrote:
# file_storage.py
def open(path):
    return hdfs_client.read(path)
And in my downstream app I wrote something like:
# app.py
with contextlib.closing(file_storage.open(path)) as stream:
    stream.read()
If I wrapped hdfs_client.read into a with context, then the downstream app would read a closed resource.
So is there some way to obtain the plain file stream directly? Or shall we replace the contextmanager decorator with a more generic approach?
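One generic way to keep the contextmanager decorator and still hand a plain stream downstream is to enter the context eagerly with contextlib.ExitStack and return the close function alongside the stream. A sketch; open_stream is hypothetical:

```python
import contextlib

def open_stream(read_cm):
    # Enter the context manager now; the caller closes it when done.
    stack = contextlib.ExitStack()
    stream = stack.enter_context(read_cm)
    return stream, stack.close

# Usage sketch in the middleware:
# stream, close = open_stream(hdfs_client.read(path))
# ...hand `stream` downstream, and call `close()` when the app is finished.
```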
Many times, you need characters such as '=' in your HDFS path names. For instance, a partitioned Hive table can have '../year=2015/month=6/day=17' as the path name of a directory. Impala, for instance, uses this naming convention by default for partitioned tables.
Using hdfs.InsecureClient(url, user).list(hdfs_path) on such a path results in
HdfsError: File ../year%3D2015/month%3D6/day%3D30 does not exist.
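The %3D in the error is consistent with the path being percent-encoded too aggressively: '=' is a reserved character that default URL quoting escapes, but WebHDFS expects it literal in partition-style path names. A stdlib sketch of the difference (path is made up for illustration):

```python
from urllib.parse import quote

path = '/warehouse/year=2015/month=6/day=17'
# Default quoting escapes '=', producing the %3D seen in the error above:
assert quote(path, safe='/') == '/warehouse/year%3D2015/month%3D6/day%3D17'
# Adding '=' to the safe set leaves the partition names intact:
assert quote(path, safe='/=') == path
```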
When using the confluent schema registry (https://github.com/verisign/python-confluent-schemaregistry) together with this module for AvroReader, it fails with 'module' object is not callable:
File "/venv1/lib64/python3.4/site-packages/hdfs/ext/avro/__init__.py", line 195, in __enter__
    self._schema = next(self._records)  # Prime generator to get schema.
File "/venv1/lib64/python3.4/site-packages/hdfs/ext/avro/__init__.py", line 186, in _reader
    avro_reader = fastavro.reader(_SeekableReader(bytes_reader))
'module' object is not callable
Pseudo-code:
from hdfs import InsecureClient
from hdfs.ext.avro import AvroReader, AvroWriter
from confluent.schemaregistry.client import CachedSchemaRegistryClient
from confluent.schemaregistry.serializers import MessageSerializer

schema_registry_client = CachedSchemaRegistryClient(url='http://' + settings['schema_registry_server'])
serializer = MessageSerializer(schema_registry_client)
with AvroReader(hdfs_client, settings['hdfs_target_file']) as reader:
    pass
I am trying to connect to a namenode on a kerberized cluster but am having problems because of an invalid certificate authority. It would be great if I could provide verify=False to the _Request objects, perhaps on a per-session basis.
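This appears reachable already by handing the client a preconfigured requests.Session, since the client constructors accept one. A sketch; the client line is commented out and the URL is a placeholder:

```python
import requests

session = requests.Session()
session.verify = False  # or: session.verify = '/path/to/private-ca.pem'

# from hdfs.ext.kerberos import KerberosClient
# client = KerberosClient('https://namenode.example.com:50470',
#                         session=session)  # sketch, placeholder URL
```

Pointing verify at a CA bundle for the private authority is generally preferable to disabling verification outright.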
It would be good to rename /usr/bin/hdfs to something else, for example hdfscli. The issue is that when installing this package on hosts with Hadoop 2 already installed, running the Hadoop hdfs command will most likely invoke this package's script instead, as /usr/bin is typically earlier on $PATH.
The other issue is that if someone like Cloudera, Hortonworks, or Apache Bigtop decides to symlink /usr/bin/hdfs to the Hadoop hdfs command, there will be a package conflict preventing installation.
It would be great if the url parameter for Client accepted multiple host:port pairs.
This is a REPL session with a "/test" dir that contains a directory with spaces, "dir with space". When trying to traverse that dir, the library requests "/test/dir%20with%20space" instead of "/test/dir with space".
Hadoop version is: Hadoop 2.5.0-cdh5.3.6
>>> from hdfs import InsecureClient
>>> ic = InsecureClient("http://localhost:50070")
>>> for (path, dirs, files) in ic.walk("/test"):
... print(dirs)
...
[u'dir with space']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/site-packages/hdfs/client.py", line 938, in walk
for infos in _walk(hdfs_path, s, depth):
File "/usr/lib/python2.7/site-packages/hdfs/client.py", line 932, in _walk
for infos in _walk(path, s, depth - 1):
File "/usr/lib/python2.7/site-packages/hdfs/client.py", line 918, in _walk
infos = self.list(dir_path, status=True)
File "/usr/lib/python2.7/site-packages/hdfs/client.py", line 887, in list
statuses = self._list_status(hdfs_path).json()['FileStatuses']['FileStatus']
File "/usr/lib/python2.7/site-packages/hdfs/client.py", line 90, in api_handler
**self.kwargs
File "/usr/lib/python2.7/site-packages/hdfs/client.py", line 182, in _request
return _on_error(response)
File "/usr/lib/python2.7/site-packages/hdfs/client.py", line 44, in _on_error
raise HdfsError(message)
hdfs.util.HdfsError: File /test/dir%20with%20space does not exist.
Hey, is there any way to simply copy data? I can't seem to find it anywhere in the docs. Thanks
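There is no copy method, and WebHDFS itself has no server-side copy operation, so the bytes have to be streamed through the client. A sketch; copy_file is a hypothetical helper built on read's chunked mode and write's acceptance of an iterable of chunks:

```python
def copy_file(client, src, dst, chunk_size=2 ** 16):
    # read(chunk_size=...) yields the file in chunks, and write consumes an
    # iterable of chunks, so the whole file never sits in memory at once.
    with client.read(src, chunk_size=chunk_size) as reader:
        client.write(dst, data=reader)
```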
After an accidental deletion of some reasonably important data, and thinking I'd be able to get some files back from Trash, I've discovered that webhdfs does not enable HDFS Trash behind the scenes; it's considered the client's responsibility to do this.
Is this something that could be incorporated into the hdfs module? I've had a look around, and it looks like hue has implemented some of this functionality in its internal code:
https://github.com/cloudera/hue/blob/master/desktop/libs/hadoop/src/hadoop/fs/webhdfs.py
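Until something like that lands, the hue approach can be approximated client-side by renaming into a trash directory instead of deleting. A sketch; move_to_trash and the trash layout are assumptions modeled on HDFS conventions, not library APIs:

```python
import posixpath

def move_to_trash(client, hdfs_path,
                  trash_dir='/user/someuser/.Trash/Current'):
    # Mirror the file's absolute path under the trash dir, then rename.
    target = posixpath.join(trash_dir, hdfs_path.lstrip('/'))
    client.makedirs(posixpath.dirname(target))
    client.rename(hdfs_path, target)
    return target
```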
When using hdfs from Windows, the local username often differs from the username on Hadoop.
Can you add an example of passing the username to InsecureClient to the quickstart tutorial?
This causes an error when downloading files with '%' in the filename:
>>> c.write('/foo/foo bar%.txt', 'hello')
>>> c.download('foo/foo bar%.txt', '/tmp/foobar')
HdfsError: File foo/foo bar%25.txt does not exist
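The round-trip fails because '%' is the URL escape character itself: it has to be encoded exactly once on the way out, and the stored name must not be re-encoded when it comes back (hence the doubled %25 in the error). The stdlib shows the expected single round-trip:

```python
from urllib.parse import quote, unquote

name = 'foo/foo bar%.txt'
encoded = quote(name, safe='/')
assert encoded == 'foo/foo%20bar%25.txt'   # matches the path in the error
assert unquote(encoded) == name            # decoding once restores the name
```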
Hi dear author(s),
Thank you very much for your brilliant work; I am exactly in need of the combination of pandas and HDFS.
However, when I try to run the simplest example, I encounter this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/site-packages/hdfs/client.py", line 396, in write
buffersize=buffersize,
File "/usr/lib/python2.7/site-packages/hdfs/client.py", line 86, in api_handler
**self.kwargs
File "/usr/lib/python2.7/site-packages/hdfs/client.py", line 196, in _request
**kwargs
File "/usr/lib/python2.7/site-packages/requests/api.py", line 50, in request
response = session.request(method=method, url=url, **kwargs)
File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 465, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 573, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 415, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine('\x00\x00\x00|{\x08\xff\xff\xff\xff\x0f\x10\x02\x18\t")org.apache.hadoop.ipc.RPC$VersionMismatch>Server IPC version 9 cannot communicate with client version 470\x0e:\x00@\x01',))
My hadoop version is 2.7.1, the latest one.
And I've installed spark on my server, version 1.4.1
Thanks!
Apache Hadoop scripts on PATH use "hdfs". This module's default installation overwrites that and breaks scripts that rely on the Apache Hadoop one, when installed on a host that already has Apache Hadoop installed on it.
Could the binary name be changed?
hi,
similar to https://github.com/traviscrawford/python-hdfs/blob/master/hdfs/hfile.py, is it possible for you to support the HBase/HFile format? And especially being able to read and write Avro data from HBase.
References:
I seem to hit some syntax errors when running write_df with Python 3.4. I wasn't sure whether this is supported or not.
Could you make this clearer in your documentation? If it is not supported, are there any plans to make it compatible in the future?
>>> from hdfs import Client
>>> client = Client('http://172.17.201.16:50070')
>>> client.list('/')
[u'abc', u'demo', u'input', u'tmp', u'user']
>>> client.upload('/user/hdfs/','/home/spider/haha.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/hdfs/client.py", line 547, in upload
    raise err
requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.HTTPConnection object at 0x7fb776c84950>: Failed to establish a new connection: [Errno -2] Name or service not known
However, uploading the file this way from within the server cluster succeeds.
If I do:
# pip install hdfs==2.0.9 requests==2.9.1
I can successfully run a test that uploads and downloads a file from HDFS. Here is the log output of the test:
[2016-08-10 18:21:16,054] {client.py:156} INFO - Instantiated <InsecureClient(url='http://cdh:50070')>.
[2016-08-10 18:21:16,055] {client.py:427} INFO - Uploading '/tmp/tmpPQUs8k' to '/tmp/hive_temp_djE7CTrDdAQh995ydwfrcK.csv'.
[2016-08-10 18:21:16,059] {client.py:891} INFO - Listing '/tmp/hive_temp_djE7CTrDdAQh995ydwfrcK.csv'.
[2016-08-10 18:21:16,070] {connectionpool.py:207} INFO - Starting new HTTP connection (1): cdh
[2016-08-10 18:21:16,076] {client.py:375} INFO - Writing to '/tmp/hive_temp_djE7CTrDdAQh995ydwfrcK.csv'.
[2016-08-10 18:21:16,079] {connectionpool.py:207} INFO - Starting new HTTP connection (1): cdh
[2016-08-10 18:21:16,111] {base_hook.py:53} INFO - Using connection to: cdh
[2016-08-10 18:21:16,112] {client.py:156} INFO - Instantiated <InsecureClient(url='http://cdh:50070')>.
[2016-08-10 18:21:16,112] {client.py:650} INFO - Downloading '/tmp/hive_temp_djE7CTrDdAQh995ydwfrcK.csv' to '/tmp/tmp7imrk9out'.
[2016-08-10 18:21:16,114] {client.py:920} INFO - Walking '/tmp/hive_temp_djE7CTrDdAQh995ydwfrcK.csv' (depth 0).
[2016-08-10 18:21:16,119] {client.py:276} INFO - Fetching status for '/tmp/hive_temp_djE7CTrDdAQh995ydwfrcK.csv'.
[2016-08-10 18:21:16,121] {connectionpool.py:207} INFO - Starting new HTTP connection (1): cdh
[2016-08-10 18:21:16,124] {client.py:591} INFO - Reading file '/tmp/hive_temp_djE7CTrDdAQh995ydwfrcK.csv'.
[2016-08-10 18:21:16,128] {connectionpool.py:207} INFO - Starting new HTTP connection (1): cdh
[2016-08-10 18:21:16,137] {base_hook.py:53} INFO - Using connection to: cdh
[2016-08-10 18:21:16,138] {client.py:156} INFO - Instantiated <InsecureClient(url='http://cdh:50070')>.
[2016-08-10 18:21:16,138] {client.py:766} INFO - Deleting '/tmp/hive_temp_djE7CTrDdAQh995ydwfrcK.csv'.
[2016-08-10 18:21:16,139] {connectionpool.py:207} INFO - Starting new HTTP connection (1): cdh
If I now upgrade the requests package to a later version:
pip install hdfs==2.0.9 requests==2.11.0
I get the following:
[2016-08-10 18:24:45,246] {base_hook.py:53} INFO - Using connection to: cdh
[2016-08-10 18:24:45,247] {client.py:157} INFO - Instantiated <InsecureClient(url='http://cdh:50070')>.
[2016-08-10 18:24:45,248] {client.py:645} INFO - Downloading '/data/barclay/raw_excel/2016-07-10.xlsx' to 'local3.xlsx'.
[2016-08-10 18:24:45,249] {client.py:914} INFO - Walking '/data/barclay/raw_excel/2016-07-10.xlsx' (depth 0).
[2016-08-10 18:24:45,249] {client.py:279} INFO - Fetching status for '/data/barclay/raw_excel/2016-07-10.xlsx'.
[2016-08-10 18:24:45,256] {connectionpool.py:214} INFO - Starting new HTTP connection (1): cdh
[2016-08-10 18:24:45,262] {client.py:586} INFO - Reading file '/data/barclay/raw_excel/2016-07-10.xlsx'.
[2016-08-10 18:24:45,268] {connectionpool.py:214} INFO - Starting new HTTP connection (1): cdh
[2016-08-10 18:25:44,239] {client.py:721} ERROR - Error while downloading. Attempting cleanup.
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/hdfs/client.py", line 717, in download
_download(fpath_tuple)
File "/usr/local/lib/python2.7/site-packages/hdfs/client.py", line 660, in _download
for chunk in reader:
File "/usr/local/lib/python2.7/site-packages/requests/utils.py", line 372, in stream_decode_response_unicode
raise UnicodeError("Unable to decode contents with encoding %s." % encoding)
UnicodeError: Unable to decode contents with encoding None.
---------------------------------------------------------------------------
UnicodeError Traceback (most recent call last)
<ipython-input-3-e7e086a08bcd> in <module>()
----> 1 hook.hdfs_download('/data/barclay/raw_excel/2016-07-10.xlsx', 'local3.xlsx')
/opt/altx_data_pipeline/altx_data_pipeline/utils/air/hooks/hive_hooks.pyc in hdfs_download(self, hdfs_path, local_path, overwrite)
255 def hdfs_download(self, hdfs_path, local_path, overwrite=False):
256 hdfs_client = self._get_hdfs_client()
--> 257 hdfs_client.download(hdfs_path, local_path, overwrite=overwrite)
258
259 def table_from_dataframe(self, df, table_name, engine, as_statement=False):
/opt/altx_data_pipeline/altx_data_pipeline/utils/clients/hdfs_client.pyc in download(self, hdfs_path, local_path, overwrite)
34 def download(self, hdfs_path, local_path, overwrite=False):
35 c = self.get_conn()
---> 36 c.download(hdfs_path, local_path, overwrite=overwrite)
37 logger.debug("[HDFS] Downloaded {} to {}".format(hdfs_path,
38 local_path))
/usr/local/lib/python2.7/site-packages/hdfs/client.pyc in download(self, hdfs_path, local_path, overwrite, n_threads, temp_dir, **kwargs)
728 _logger.error('Unable to cleanup temporary folder.')
729 finally:
--> 730 raise err
731 else:
732 if temp_path != local_path:
UnicodeError: Unable to decode contents with encoding None.
This seems to be related to an issue in requests version 2.11.0:
kennethreitz/requests#3481
Client.resolve uses self.root without checking whether it is empty, which leads to an exception.
The documentation is not very specific about how to use or configure hdfs with Kerberos authentication.
Please advise.
$ klist
Ticket cache: FILE:/homes/udb_grid/cache/krb_ticket
Default principal: [email protected]
Valid starting Expires Service principal
04/21/15 17:36:10 04/22/15 17:36:10 krbtgt/[email protected]
renew until 04/26/15 16:30:46
04/21/15 20:43:08 04/22/15 17:36:10 HTTP/[email protected]
renew until 04/26/15 16:30:46
$ python --version
Python 2.6.6
$ pip list | grep hdfs
hdfs (0.5.2)
$ cat /homes/udb_grid/.hdfsrc
[hdfs]
default.alias=udb_grid@bassniumred
[udb_grid@bassniumred_alias]
root=/projects/udb
url=http://bassniumred-nn1.red.ygrid.yahoo.com:50070
$ rm -f /tmp/hdfs.log; hdfscli --list; echo; echo; cat /tmp/hdfs.log
Authentication failure. Check your credentials.
2015-04-21 21:59:49,266 | DEBU | hdfs.client > <Client(url=http://bassniumred-nn1.red.ygrid.yahoo.com:50070, root=/projects/udb)> :: Resolved path '' to '/projects/udb'.
2015-04-21 21:59:49,267 | INFO | hdfs.client > <Client(url=http://bassniumred-nn1.red.ygrid.yahoo.com:50070, root=/projects/udb)> :: Fetching status for /projects/udb.
2015-04-21 21:59:49,267 | DEBU | hdfs.client > <Client(url=http://bassniumred-nn1.red.ygrid.yahoo.com:50070, root=/projects/udb)> :: Resolved path '/projects/udb' to '/projects/udb'.
2015-04-21 21:59:49,284 | WARN | hdfs.client > <Client(url=http://bassniumred-nn1.red.ygrid.yahoo.com:50070, root=/projects/udb)> :: [401] /webhdfs/v1/projects/udb?op=GETFILESTATUS:
Problem accessing /webhdfs/v1/projects/udb. Reason:
Authentication required
2015-04-21 21:59:49,285 | ERRO | hdfs.util > Authentication failure. Check your credentials.
$
Since 2.0.14 we get this kind of error:
Is there something I'm missing? (I don't pass any kwargs in the code.)
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/airflow/models.py", line 1250, in run
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.5/dist-packages/airflow/operators/python_operator.py", line 121, in execute
condition = super(ShortCircuitOperator, self).execute(context)
File "/usr/local/lib/python3.5/dist-packages/airflow/operators/python_operator.py", line 66, in execute
return_value = self.python_callable(*self.op_args, **self.op_kwargs)
File "/etc/airflow/dags/flux_chanel.zip/flux_chanel_utils/logic.py", line 77, in are_files_here
return hook.check_for_path(hdfs_path)
File "/usr/local/lib/python3.5/dist-packages/airflow/hooks/webhdfs_hook.py", line 57, in check_for_path
c = self.get_conn()
File "/usr/local/lib/python3.5/dist-packages/airflow/hooks/webhdfs_hook.py", line 39, in get_conn
client = KerberosClient(connection_str)
File "/usr/local/lib/python3.5/dist-packages/hdfs/ext/kerberos.py", line 121, in __init__
session.auth = _HdfsHTTPKerberosAuth(int(max_concurrency), **kwargs)
File "/usr/local/lib/python3.5/dist-packages/hdfs/ext/kerberos.py", line 68, in __init__
super(_HdfsHTTPKerberosAuth, self).__init__(**kwargs)
TypeError: __init__() got an unexpected keyword argument 'mutual_auth'
Hi mtth.
Just a quick question about something I realized when using hdfscli in a script to stat my prod HDFS.
When using stat on root in order to get the total bytes, like client.content('/')['length'], I get a value, say X.
Shortly after that, I walk a subdirectory, statting each one of them and also storing the length, like:
fpaths = [
    psp.join(dpath, ddir)
    for dpath, ddirs, fnames in client.walk('/', 3)
    for ddir in ddirs
]
for item in fpaths:
    details = client.content(item)
    totalBytes = details['length']
For some directories on this walk (Y), I get lengths that are up to 300% more than what I get for root (X). Is this expected? What should I use to compare the length of a particular subfolder to the complete HDFS length?
TY
Edit:
Never mind, I just found out that this particular issue is with my HDFS. Apparently, it stops iterating over root when it gets to a particular folder. So the same happens when doing:
hdfs dfs -du -h /
-> gets only the first folder's size
and:
hdfs dfs -du -h -s /
-> gets the complete list of first-level directories.
The workaround is to do the walk myself. Thank you.
Empty Avro part-files seem to yield an error when reading. Please allow skipping these rather than erroring out.
It is not clear how a timeout should be specified. In HTTPRequest, one can specify request_timeout:
HTTPRequest(url_with_host, method=method, body=body, request_timeout=HDFS_TIMEOUT)
We are trying to upload a very large file. It is also not clear whether setting parallelism will help; right now, we are getting timeouts.
Hi mtth.
In our company, we use our own tarballs as the source for installing modules on the servers, via setup.py install.
The issue I have installing hdfscli this way is that I require the Kerberos extension, which is not installed by default.
I've googled around, and this seems to be a missing feature of setup.py. So I wanted to ask whether you can think of a workaround for installing the Kerberos extension by default (I thought moving the kerberos.py file up one level would do the trick, but the tarball gets messed up).
Any ideas?
Hi @mtth . I happen to be using hdfscli with the default ~/.hdfscli.cfg approach. However, after a recent server change, i had to re-create the file. So i was thinking to avoid the file (and even the use of a Environment Variable) by directly passing the parameters to the Client programmatically.
I'm actually just using client = Config().get_client() in the code. And for the config file:
[global]
default.alias = prod
autoload.modules = hdfs.ext.kerberos

[prod.alias]
url = http://cluster.domain:14000/
client = KerberosClient
Any idea how to avoid using the config file with this setup? I've taken a look at
http://hdfscli.readthedocs.io/en/latest/api.html#hdfs.client.Client
but can't quite see how I'm supposed to use the Kerberos client with this...
Evaluating this package. It looks good, but I would like to be able to set timeouts when making requests, to cope better with any namenode/datanode hiccups.
I can't figure out how to do that easily. Is there a way?
I'm reading a 2.67 MB JSON file from an HDFS filesystem. It has 2809922 characters in 254 lines.
from hdfs import InsecureClient
client = InsecureClient(hadoopDirectory, user=me)
Following the quickstart guide and using timeit, I measured this bit of code:
with client.read(filename) as reader:
    output = reader.read()
The read() method takes 59.2 ms. However, just substituting read() with readlines() increases the time to 1 min 3 s, a more than thousand-fold increase for the same file. Reading only the first line with readline() is also slow; it already takes 991 ms, though the lines are unevenly long.
Reading the file locally with standard Python like this
with open(filename) as reader:
    output = reader.readlines()
takes 8.46 ms.
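Given those timings, a practical workaround is to fetch the whole file with the fast read() and split lines locally. A sketch; read_lines is a hypothetical helper, with client and filename as in the snippet above:

```python
def read_lines(client, filename, encoding='utf-8'):
    # One bulk read over HTTP, then split client-side.
    with client.read(filename) as reader:
        return reader.read().decode(encoding).splitlines()
```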
I have found a very interesting issue with AvroWriter which breaks the whole writing process: the file size starts growing exponentially.
When I export 100 records, the file size is 10 KB; when I export 200 records (of the same data), the file size is 670 KB!
I use the standard AvroWriter example (except I added a long string value to make the data bigger).
How to reproduce:
The code below exports 2 records. Copy-paste those 2 records until you have 200 records of the same, then run the export and observe that the file size becomes 670 KB.
Remove half of the records, leaving only 100. Export again and observe that the file size is 10 KB!
"""Avro extension example."""
from hdfs import InsecureClient
from hdfs.ext.avro import AvroReader, AvroWriter
import os
os.environ['http_proxy'] = ''
os.environ['user'] = 'my_user'
# Get the default alias' client.
client = InsecureClient('http://xxx:50070', user='my_user')
# Some sample data.
records = [
{'name': 'Ann', 'age': 23, 'string': 'asdasd asklhj doiuiroe t43785 iofdjkvnuhyt789u judffkguirety9 jdfvhjkfcvhuierytuiy348957hjdfh jkdfgirhytiery5689347589'},
{'name': 'Bob', 'age': 22, 'string': 'nkdfn kjfhy tuier vgfhe89734 89hIUHIUJKBNUIh iuhiuh48 9 5u7489jkldjvlcdjvoiejuioweufoksdfsdf'},
]
# Write an Avro File to HDFS (since our records' schema is very simple, we let
# the writer infer it automatically, otherwise we would pass it as argument).
# codec can be null, deflate or snappy
with AvroWriter(client, '/user/my_user/names.avro', overwrite=True) as writer:
    for record in records:
        writer.write(record)
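To rule out schema inference as the culprit, the schema can also be passed explicitly (as the example's comment mentions, AvroWriter accepts one as an argument). A sketch: the SCHEMA dict below is hand-written by me to match the sample records, not taken from the library.

```python
# Hand-written Avro schema matching the sample records above (assumption:
# this is the shape the writer would otherwise infer).
SCHEMA = {
    'type': 'record',
    'name': 'Person',
    'fields': [
        {'name': 'name', 'type': 'string'},
        {'name': 'age', 'type': 'int'},
        {'name': 'string', 'type': 'string'},
    ],
}

# Passing it skips inference (same client and records as in the example):
# with AvroWriter(client, '/user/my_user/names.avro', schema=SCHEMA,
#                 overwrite=True) as writer:
#     for record in records:
#         writer.write(record)
```

If the size jump still reproduces with an explicit schema, inference can be excluded as the cause.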
If I try to access a file with a percent sign in the filename, I need to manually URL-encode the %; otherwise I get a 500 response from the WebHDFS API, since % is a reserved character. RFC 3986 Section 2.1 covers percent-encoding of URL characters.
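A workaround sketch until the library escapes paths itself: percent-encode the path with the standard library before handing it to the client. Per RFC 3986, '%' itself encodes to '%25'.

```python
from urllib.parse import quote

# Percent-encode reserved characters in an HDFS path; '/' stays
# unescaped by default so the path structure is preserved.
raw = '/home/opsmekanix/bigfil%e'
encoded = quote(raw)
print(encoded)  # -> /home/opsmekanix/bigfil%25e
```

Note that the correct escape for '%' is '%25' ('%26' is the escape for '&').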
-=-=-=-=-=-=-
File name with %
Note: the blank error string shows that the exception is not being caught.
-=-=-=-=-=-=-
INFO:hdfs.client:<KerberosClient(url=http://mynamnodeserver01.example.com:50070, root=/home/opsmekanix)> :: Uploading /var/tmp/bigfile to /home/opsmekanix/bigfil%e.
INFO:hdfs.client:<KerberosClient(url=http://mynamnodeserver01.example.com:50070, root=/home/opsmekanix)> :: Writing to /home/opsmekanix/bigfil%e.
DEBUG:hdfs.client:<KerberosClient(url=http://mynamnodeserver01.example.com:50070, root=/home/opsmekanix)> :: Resolved path /home/opsmekanix/bigfil%e to /home/opsmekanix/bigfil%e.
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): mynamnodeserver01.example.com
DEBUG:requests.packages.urllib3.connectionpool:"PUT /webhdfs/v1/home/opsmekanix/bigfil%e?overwrite=False&op=CREATE HTTP/1.1" 401 0
DEBUG:requests_kerberos.kerberos_:handle_401(): Handling: 401
DEBUG:requests_kerberos.kerberos_:authenticate_user(): Authorization header: Negotiate YIICXwYJKoZIhvc ... ommited header here ... SQZ8qie6jMeIlxS7+ZaW186FDpdDrDYLa9BigK9qhT3qRXx7R0=
DEBUG:requests.packages.urllib3.connectionpool:"PUT /webhdfs/v1/home/opsmekanix/bigfil%e?overwrite=False&op=CREATE HTTP/1.1" 500 0
DEBUG:requests_kerberos.kerberos_:authenticate_user(): returning <Response [500]>
DEBUG:requests_kerberos.kerberos_:handle_401(): returning <Response [500]>
DEBUG:requests_kerberos.kerberos_:handle_response(): returning <Response [500]>
DEBUG:requests_kerberos.kerberos_:handle_other(): Handling: 500
DEBUG:requests_kerberos.kerberos_:handle_other(): Authenticating the server
DEBUG:requests_kerberos.kerberos_:authenticate_server(): Authenticate header: YGYGCSqGSIb3EgECAgIAb1cwVaADAgEFoQMCAQ+iSTBHoAMCAReiQAQ+G5 ... ommitted header here ... NH6V8mO0OGfmbAxIYfpyIXyaeG8Xikdo8SQbry27Fiui0=
DEBUG:requests_kerberos.kerberos_:authenticate_server(): returning <Response [500]>
DEBUG:requests_kerberos.kerberos_:handle_other(): returning <Response [500]>
DEBUG:requests_kerberos.kerberos_:handle_response(): returning <Response [500]>
WARNING:hdfs.client:<KerberosClient(url=http://mynamnodeserver01.example.com:50070, root=/home/opsmekanix)> :: [500] /webhdfs/v1/home/opsmekanix/bigfil%e?overwrite=False&op=CREATE:
ERROR:
-=-=-=-=-=-=-
File name where the percent is represented as %26
Note: I Ctrl-C'd out of this request, as I don't actually care about uploading the file
-=-=-=-=-=-=-
INFO:hdfs.client:<KerberosClient(url=http://mynamnodeserver01.example.com:50070, root=/home/opsmekanix)> :: Uploading /var/tmp/bigfile to /home/opsmekanix/bigfil%26e.
INFO:hdfs.client:<KerberosClient(url=http://mynamnodeserver01.example.com:50070, root=/home/opsmekanix)> :: Writing to /home/opsmekanix/bigfil%26e.
DEBUG:hdfs.client:<KerberosClient(url=http://mynamnodeserver01.example.com:50070, root=/home/opsmekanix)> :: Resolved path /home/opsmekanix/bigfil%26e to /home/opsmekanix/bigfil%26e.
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): mynamnodeserver01.example.com
DEBUG:requests.packages.urllib3.connectionpool:"PUT /webhdfs/v1/home/opsmekanix/bigfil%26e?overwrite=False&op=CREATE HTTP/1.1" 401 0
DEBUG:requests_kerberos.kerberos_:handle_401(): Handling: 401
DEBUG:requests_kerberos.kerberos_:authenticate_user(): Authorization header: Negotiate YIICXwYJKoZIhvc ... ommited header here ... SQZ8qie6jMeIlxS7+ZaW186FDpdDrDYLa9BigK9qhT3qRXx7R0=
DEBUG:requests.packages.urllib3.connectionpool:"PUT /webhdfs/v1/home/opsmekanix/bigfil%26e?overwrite=False&op=CREATE HTTP/1.1" 307 0
DEBUG:requests_kerberos.kerberos_:authenticate_user(): returning <Response [307]>
DEBUG:requests_kerberos.kerberos_:handle_401(): returning <Response [307]>
DEBUG:requests_kerberos.kerberos_:handle_response(): returning <Response [307]>
DEBUG:requests_kerberos.kerberos_:handle_other(): Handling: 307
DEBUG:requests_kerberos.kerberos_:handle_other(): Authenticating the server
DEBUG:requests_kerberos.kerberos_:authenticate_server(): Authenticate header: YGYGCSqGSIb3EgECAgIAb1cwVaADAgEFoQMCAQ+iSTBHoAMCAReiQAQ+G5 ... ommitted header here ... NH6V8mO0OGfmbAxIYfpyIXyaeG8Xikdo8SQbry27Fiui0=
DEBUG:requests_kerberos.kerberos_:authenticate_server(): returning <Response [307]>
DEBUG:requests_kerberos.kerberos_:handle_other(): returning <Response [307]>
DEBUG:requests_kerberos.kerberos_:handle_response(): returning <Response [307]>
DEBUG:hdfs.client:<KerberosClient(url=http://mynamnodeserver01.example.com:50070, root=/home/opsmekanix)> :: [307] /webhdfs/v1/home/opsmekanixbigfil%26e?overwrite=False&op=CREATE
INFO:requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): mydatanode123.example.com
^CTraceback (most recent call last):
I'm trying to get the root HDFS directory listing. So when I do CLIENT.list('/'), I get the following error traceback:
In [1]: CLIENT.list('/')
ERROR generate_request_header(): authGSSClientStep() failed:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/requests_kerberos/kerberos_.py", line 144, in generate_request_header
negotiate_resp_value)
kerberos.GSSError: (('Unspecified GSS failure. Minor code may provide more information', 851968), ('Ticket expired', -1765328352))
ERROR (('Unspecified GSS failure. Minor code may provide more information', 851968), ('Ticket expired', -1765328352))
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/requests_kerberos/kerberos_.py", line 144, in generate_request_header
negotiate_resp_value)
kerberos.GSSError: (('Unspecified GSS failure. Minor code may provide more information', 851968), ('Ticket expired', -1765328352))
ERROR generate_request_header(): authGSSClientStep() failed:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/requests_kerberos/kerberos_.py", line 144, in generate_request_header
negotiate_resp_value)
kerberos.GSSError: (('Unspecified GSS failure. Minor code may provide more information', 851968), ('Ticket expired', -1765328352))
ERROR (('Unspecified GSS failure. Minor code may provide more information', 851968), ('Ticket expired', -1765328352))
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/requests_kerberos/kerberos_.py", line 144, in generate_request_header
negotiate_resp_value)
kerberos.GSSError: (('Unspecified GSS failure. Minor code may provide more information', 851968), ('Ticket expired', -1765328352))
---------------------------------------------------------------------------
HdfsError Traceback (most recent call last)
<ipython-input-1-b69df44579dc> in <module>()
----> 1 CLIENT.list('/')
/usr/local/lib/python3.5/site-packages/hdfs/client.py in list(self, hdfs_path, status)
891 _logger.info('Listing %r.', hdfs_path)
892 hdfs_path = self.resolve(hdfs_path)
--> 893 statuses = self._list_status(hdfs_path).json()['FileStatuses']['FileStatus']
894 if len(statuses) == 1 and (
895 not statuses[0]['pathSuffix'] or self.status(hdfs_path)['type'] == 'FILE'
/usr/local/lib/python3.5/site-packages/hdfs/client.py in api_handler(client, hdfs_path, data, strict, **params)
90 params=params,
91 strict=strict,
---> 92 **self.kwargs
93 )
94
/usr/local/lib/python3.5/site-packages/hdfs/client.py in _request(self, method, url, strict, **kwargs)
179 )
180 if strict and not response: # Non 2XX status code.
--> 181 return _on_error(response)
182 else:
183 return response
/usr/local/lib/python3.5/site-packages/hdfs/client.py in _on_error(response)
35 """
36 if response.status_code == 401:
---> 37 raise HdfsError('Authentication failure. Check your credentials.')
38 try:
39 # Cf. http://hadoop.apache.org/docs/r1.0.4/webhdfs.html#Error+Responses
HdfsError: Authentication failure. Check your credentials.
Right now I'm only testing this on dev, so what I have in ~/.hdfscli.cfg is:
[global]
default.alias = dev
autoload.modules = hdfs.ext.kerberos
[dev.alias]
url = http://devxx.atl.enterprise.com:50070/
client = KerberosClient
[prod.alias]
url = http://prod.namenode:port
root = /jobs/
What am I doing wrong? I suspect it may have something to do with the user (when I shell into the server, the username changes from mine, latencio.site, to hadoop).
Also, I have access to Cloudera Manager to check the Kerberos configuration.
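The GSSError above says "Ticket expired", which points at the Kerberos ticket cache rather than the library: renewing the TGT with kinit before starting the CLI usually clears this. A pre-flight check sketch (`klist -s` runs silently and signals validity via its exit status; the helper name is my own):

```python
import subprocess

def has_valid_ticket():
    """Return True if the Kerberos ticket cache holds a non-expired TGT."""
    try:
        # `klist -s` exits non-zero when the cache is missing or expired.
        return subprocess.run(['klist', '-s']).returncode == 0
    except FileNotFoundError:
        # Kerberos client tools are not installed.
        return False

# Example guard before constructing KerberosClient:
# if not has_valid_ticket():
#     raise SystemExit('Run kinit before using KerberosClient.')
```

This also helps with the username question: the principal shown by klist is the identity the cluster sees, regardless of the local shell user.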