
pyhs2's People

Contributors

bradruderman, charith-qubit, waziers, wyukawa, zklopman


pyhs2's Issues

Code should not have build artifacts checked in

Currently, a number of .pyc files are checked in, as well as the build directory (created by distutils). These add no value (they are regenerated automatically, and you can't meaningfully compare them between commits), and they make development harder, since every build produces spurious working-tree changes you don't care about. As a developer, it's much easier to have these built automatically and ignored by version control. A minimal .gitignore is sketched below.
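A minimal .gitignore covering the artifacts described above might look like this (exact entries depend on how the project is built):

# compiled Python bytecode
*.pyc
__pycache__/

# distutils / setuptools output
build/
dist/
*.egg-info/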

Have the Result Set Returned as a Generator

Now that large result sets can be brought over, we should have the option of getting the result set as a generator: blocks of rows fetched from the server on demand instead of the whole set at once. Fetching on demand allows processing to start immediately and removes memory limits as a constraint on result size. A sketch of what this could look like follows.
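A minimal sketch of the requested behaviour, built on top of the fetchmany call that already exists on the cursor (the helper name and block size are illustrative, not part of pyhs2):

def iter_rows(cursor, block_size=1000):
    # Pull one block at a time from the server and yield rows lazily,
    # so memory use is bounded by block_size rather than by the result set.
    while True:
        block = cursor.fetchmany(block_size)
        if not block:
            return
        for row in block:
            yield row

# usage:
#   cur.execute("select * from big_table")
#   for row in iter_rows(cur):
#       process(row)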

pyhs2.connect() does not accommodate an account with no password

Calling pyhs2.connect() with no password raises an exception

If your Hive user account has no password, supplying none is not supported (see the example below). Passing in password='' is also problematic: the package then asks for a password. A possible workaround is sketched after the traceback.

$ python
Python 2.7.5 (default, Aug 25 2013, 00:04:04)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyhs2
>>> conn = pyhs2.connect(host='184.169.209.24', port=10000, authMechanism="PLAIN", user='hive', database='default')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "build/bdist.macosx-10.9-intel/egg/pyhs2/__init__.py", line 7, in connect
  File "build/bdist.macosx-10.9-intel/egg/pyhs2/connections.py", line 28, in __init__
  File "build/bdist.macosx-10.9-intel/egg/sasl/saslwrapper.py", line 91, in setAttr
ValueError: invalid null reference in method 'Client_setAttr', argument 3 of type 'std::string const &'
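A possible workaround, assuming the server ignores the password under PLAIN authentication (whether it does depends on the HiveServer2 configuration): pass any non-empty placeholder string so the SASL wrapper never receives a null.

conn = pyhs2.connect(host='184.169.209.24', port=10000,
                     authMechanism="PLAIN", user='hive',
                     password='ignored',  # non-empty placeholder, not a real credential
                     database='default')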

parameter replacement in Cursor.execute

There does not appear to be any. With other DB connectors I can do something like:

cur.execute("select * from foo where v=?",('safer'))

But pyhs2's Cursor.execute cannot use this safer way of generating queries.
Is this coming, or is there a workaround?
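pyhs2's Cursor.execute has no placeholder support, so any substitution has to happen client-side. A rough stopgap sketch (the helper and its escaping are illustrative and not a substitute for real server-side parameter binding):

def render_param(value):
    # Hypothetical helper: quote/escape a single value for interpolation.
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("\\", "\\\\").replace("'", "\\'") + "'"

cur.execute("select * from foo where v=" + render_param('safer'))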

Connection Issue: thrift.transport.TTransport.TTransportException: TSocket read 0 bytes

I followed example.py but I am still having issues connecting to Hive.

example.py seems straightforward enough, so I'm not sure where I went wrong.

>>> import pyhs2
>>> pyhs2.connect(host=host_server, port=8088, authMechanism="PLAIN", user=username, password=password, database=database)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/pyhs2/__init__.py", line 7, in connect
    return Connection(*args, **kwargs)
  File "/Library/Python/2.7/site-packages/pyhs2/connections.py", line 46, in __init__
    transport.open()
  File "/Library/Python/2.7/site-packages/pyhs2/cloudera/thrift_sasl.py", line 74, in open
    status, payload = self._recv_sasl_message()
  File "/Library/Python/2.7/site-packages/pyhs2/cloudera/thrift_sasl.py", line 95, in _recv_sasl_message
    payload = self._trans.readAll(length)
  File "/Library/Python/2.7/site-packages/thrift/transport/TTransport.py", line 58, in readAll
    chunk = self.read(sz - have)
  File "/Library/Python/2.7/site-packages/thrift/transport/TSocket.py", line 120, in read
    message='TSocket read 0 bytes')
thrift.transport.TTransport.TTransportException: TSocket read 0 bytes

Unable to successfully run pyhs2 code beyond a certain set of rows and columns

Hi,
Wish you a very Happy and Prosperous New Year!

I am trying to execute a query in Hive using the pyhs2 library. My objective is to capture the results of the Hive query in a Pandas dataframe. My pyhs2 code fails beyond a certain limit (a specific combination of rows and columns).

My Hive query fetches 145 columns and around 4.5 million rows. The code fails in the following circumstances:

a) When I fetch more than 52 rows and all 145 columns, the code fails.

b) If I reduce the columns to 4, I can fetch up to a million rows, but beyond that it fails again.

Will you please help me resolve the error? Thanks in advance!

Cheers
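If the failure is memory-related, fetching in blocks instead of one giant fetch may help. A sketch assuming the cursor's fetchmany and getSchema calls shown elsewhere in these issues, plus pandas:

import pandas as pd

def fetch_in_chunks(cur, chunk_size=10000):
    # Accumulate the result set block by block rather than all at once.
    cols = [c['columnName'] for c in cur.getSchema()]
    frames = []
    while True:
        block = cur.fetchmany(chunk_size)
        if not block:
            break
        frames.append(pd.DataFrame(block, columns=cols))
    if not frames:
        return pd.DataFrame(columns=cols)
    return pd.concat(frames, ignore_index=True)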

test example.py: fetch does not return anything

Dear all,
I am trying to connect to HiveServer2 on the Hortonworks sandbox as the hue user, and to select from the table sample_08. I have granted the hue user access to that table.
When I execute the script it only returns:
[['default', '']] <-- database
[{'comment': None, 'columnName': 'sample_08.code', 'type': 'STRING_TYPE'}, {'comment': None, 'columnName': 'sample_08.description', 'type': 'STRING_TYPE'}, {'comment': None, 'columnName': 'sample_08.total_emp', 'type': 'INT_TYPE'}, {'comment': None, 'columnName': 'sample_08.salary', 'type': 'INT_TYPE'}] <-- table schema

but fetching the table results returns nothing. Is there a step that I missed?

thank you

jimmy

Could not start SASL: Error in sasl_client_start (-4) SASL(-4)

I have CentOS 6.6 running Python 2.6. I installed Python 2.7 in a separate location (/usr/local/bin) to be able to use the pyhs2 Hive client. I installed all the additional packages needed; e.g., all the SASL packages are available:

[tap@localhost test]$ rpm -qa | grep sasl
cyrus-sasl-gssapi-2.1.23-15.el6_6.1.x86_64
cyrus-sasl-lib-2.1.23-15.el6_6.1.x86_64
cyrus-sasl-md5-2.1.23-15.el6_6.1.x86_64
cyrus-sasl-2.1.23-15.el6_6.1.x86_64
cyrus-sasl-plain-2.1.23-15.el6_6.1.x86_64
cyrus-sasl-devel-2.1.23-15.el6_6.1.x86_64

But I still see the following error:
/usr/local/lib/python2.7/site-packages/setuptools-12.3-py2.7.egg/pkg_resources/__init__.py:1224: UserWarning: /home/tap/.python-eggs is writable by group/others and vulnerable to attack when used with get_resource_filename. Consider a more secure location (set with .set_extraction_path or the PYTHON_EGG_CACHE environment variable).
Traceback (most recent call last):
  File "testPktFlow2.py", line 7, in <module>
    database='default') as conn:
  File "build/bdist.linux-x86_64/egg/pyhs2/__init__.py", line 7, in connect
  File "build/bdist.linux-x86_64/egg/pyhs2/connections.py", line 45, in __init__
  File "build/bdist.linux-x86_64/egg/pyhs2/cloudera/thrift_sasl.py", line 66, in open
thrift.transport.TTransport.TTransportException: Could not start SASL: Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found

Unicode support

I have some Unicode data (encoded as UTF-8) in HDFS: fŏŏbārbaß (or in Python, u'f\u014f\u014fb\u0101rba\xdf')

When reading (querying) the table that contains this data, the cursor's fetch returns data of type str rather than unicode, as I would expect. The Unicode characters have become question marks (?, ordinal value 0x3f, so it isn't a representation problem). Querying using Beeswax via Hue returns the expected result.

Is there something I need to do to get the data in the desired format, or isn't Unicode supported?

pyhs2 version: 0.4.1
Python: 2.7.6
Platforms: OS X and Ubuntu

SSL support?

Does pyhs2 support SSL? If so, is there an example of pyhs2 usage with it?

We have a use case where our connection to Hive has to be encrypted, and we are wondering whether pyhs2 is something we can look into. Thanks!
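pyhs2 itself exposes no SSL option, so encryption would have to be wired in at the Thrift transport level. A sketch of that level only, assuming the TSSLSocket class shipped with thrift 0.9.x (the host and certificate path are placeholders):

from thrift.transport import TSSLSocket

sock = TSSLSocket.TSSLSocket(host='hive.example.com', port=10000,
                             validate=True,
                             ca_certs='/etc/ssl/hive-ca.pem')
# This socket would have to replace the plain TSocket that
# pyhs2/connections.py builds before the buffered/SASL wrapping.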

Feature request: install sasl as an optional extra

The sasl module is completely unsupported on Windows; I've tried finding a workaround for some time and no joy (I'm running a Mac laptop and Linux servers, but some of my users who would like to run Hive queries through python are on Windows).

The sasl module is only needed for some authentication methods. pyhs2 setup.py could provide a sasl extra, which users only install on request if they know they need SASL-backed auth. For instance, in setup.py:

setup(...
      install_requires=['thrift'],
      extras_require={'sasl': ['sasl']},
      ...)

Users can choose to install pyhs2 with sasl or not by executing:

pip install pyhs2
pip install pyhs2[sasl]

connections.py could be modified to:

try:
    import sasl
    sasl_available = True
except ImportError:
    sasl_available = False

# ... in __init__, in the else clause for sasl:
    else:
        if not sasl_available:
            raise ImportError('sasl module required for this auth mechanism, install with pip install pyhs2[sasl]')
        sasl_mech = 'PLAIN'
        # ...

If you're interested in pursuing this, I'd be more than happy to submit a PR.

Getting select privileges error via pyhs2 but not getting it while accessing directly from hive

Hi,
I am getting a peculiar error. While trying to access a table via pyhs2, I am getting the following error: pyhs2.error.Pyhs2Exception: 'Error while compiling statement: FAILED: HiveAccessControlException Permission denied: user [user] does not have [SELECT] privilege on [db/tablename]'

However, when I try to access the same table directly through Hive by executing the query "select count(*) from db.tablename", I do not get any error. The username is the same in both attempts. Can you please let me know why I am not able to access the table using pyhs2? Is there any specific permission / privilege which pyhs2 is seeking?

Cheers

description field is missing from Cursor object

Hello,

I'm in need of the field cursor.description. I would do it myself but I don't know how to.

This is a field that exists in many (most?) database modules and is defined in PEP249.
[https://www.python.org/dev/peps/pep-0249/#cursor-attributes]

I have made a "workaround" at my company but it's terrible.

The idea was that since I know most queries will be in the form of:
"select blah,blahblah,etc from my_awesome_tbl where 1=1"
I can do the below to fill the first 2 fields:

Invoke:

description = ["name", "type", "display_size", "internal_size", "precision", "scale", "null_ok"]
description[0], description[1] = hive_description(conn, sql_query)

Terrible implementation:

def hive_description(conn, sql):
    """
    This is my terrible fix.

    Only the first 2 fields of the description field are populated
    Modules such as cx_Oracle populate all 7.

    :param conn: ConnectionInfo object
    :param sql: str object
    :return: 2 lists: column name list and column type list
    """
    assert(sql.strip().lower().startswith("select"))

    # yes, this could have been done with regex, but i was lazy
    table = sql[sql.index("from ")+len("from "):]  # remove before tbl name
    table = table[:table.index(" ")]  # remove after table name

    cursor = conn.cursor()
    cursor.execute("describe " + table)
    res = cursor.fetchall()   # its tiny so get the whole thing
    col_names = [i[0] for i in res]
    col_types = [i[1] for i in res]

    return col_names, col_types

Another fun fact:

cur = conn.cursor()
cur.execute('select * from test_me_tbl')
print cur.fetchall()
[['2014-09-18', 'goodbye', 'world'], ['2014-09-17', 'hello', 'world']]
cur.getSchema()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../pyhs2/cursor.py", line 199, in getSchema
    col['type'] = get_type(c.typeDesc)
  File ".../pyhs2/cursor.py", line 12, in get_type
    return TTypeId._VALUES_TO_NAMES[ttype.primitiveEntry.type]
KeyError: 18
cur = conn.cursor()
print cur.getDatabases()
[['default', '']]
print cur.getSchema()
[{'comment': 'Schema name.', 'columnName': 'TABLE_SCHEM', 'type': 'STRING_TYPE'}, {'comment': 'Catalog name.', 'columnName': 'TABLE_CATALOG', 'type': 'STRING_TYPE'}]
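A less query-dependent sketch: after execute(), a PEP 249-style description can be approximated from getSchema(), whose dict layout is shown above. Only the first two of the seven fields are available from this metadata, so the remaining five stay None:

def make_description(cur):
    # Build [(name, type_code, display_size, internal_size,
    #         precision, scale, null_ok), ...] from pyhs2 metadata.
    schema = cur.getSchema()
    if schema is None:
        return None
    return [(c['columnName'], c['type'], None, None, None, None, None)
            for c in schema]

# usage, after cur.execute(sql):
#   description = make_description(cur)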

Looking for new owner

Hi All-
My company recently migrated off hive, so I don't think I have the time to maintain this library anymore. Is there anyone who wants to take ownership? Let me know and I can grant access. There is a lot to be done:

  • Review Pull Requests
  • Update to PEP standards
  • Python 3 compatibility
  • Include dependencies
  • Test for memory leaks + processing large sets.

Thanks!
Brad

StopIteration exception not working correctly

If I do a query that looks like "SELECT * FROM DB" with code that looks like

while True:
    try:
        pass  # ... fetch the next row and process it ...
    except StopIteration:
        break

this will loop forever

However If I do a query that looks like "SELECT * FROM DB LIMIT N" with the same code it throws the StopIteration exception as expected.

Can you fix this to make it work for queries without limits? I don't want to find the count beforehand because that will take way too long (working with a 40M+ db)
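Until the iterator protocol is fixed, one workaround is to stop on an empty block rather than rely on StopIteration, using the fetchmany call mentioned elsewhere in these issues (the block size is arbitrary):

while True:
    block = cur.fetchmany(10000)
    if not block:
        break
    for row in block:
        pass  # process the row here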

Getting Job ID

Hi,

How can I get the Hadoop job ID just after submitting the job using cur.execute(ajob)? I think there is no method defined for getting the job ID.

Could you please help?

Thanks.

Could not connect

Every time I run the client I get the following error:

Traceback (most recent call last):
  File "/home/suman/Documents/APIConsumer/APIConsumer/getcodes.py", line 19, in <module>
    submitjob.newjob(hivequery)
  File "/home/suman/Documents/APIConsumer/APIConsumer/submitjob.py", line 5, in newjob
    conn = pyhs2.connect(host='ZZ.XX.SS.QQ', port = 10000, authMechanism = "PLAIN", user='wrahman', password='******_', database='default')
  File "/usr/local/lib/python2.7/dist-packages/pyhs2/__init__.py", line 7, in connect
    return Connection(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pyhs2/connections.py", line 36, in __init__
    transport.open()
  File "/usr/local/lib/python2.7/dist-packages/pyhs2/cloudera/thrift_sasl.py", line 55, in open
    self._trans.open()
  File "/usr/local/lib/python2.7/dist-packages/thrift/transport/TSocket.py", line 99, in open
    message=message)
thrift.transport.TTransport.TTransportException: Could not connect to ZZ.XX.SS.QQ:10000

Please help.

Could not start SASL

Hi @BradRuderman ,

I'm using Python to connect to Hive in local/embedded mode but I end up with the error below.
Script used:

#!/usr/bin/env python
import pyhs2

with pyhs2.connect(host='localhost', port=50070, authMechanism="PLAIN") as conn:
    with conn.cursor() as cur:
        # Show databases
        print cur.getDatabases()

Output:
vaibhav@vaibhav-Lenovo-G570:/home/hduser$ ./Automation2.py
Traceback (most recent call last):
  File "./Automation2.py", line 4, in <module>
    with pyhs2.connect(host='localhost',port=50070,authMechanism="PLAIN") as conn:
  File "/home/vaibhav/.local/lib/python2.7/site-packages/pyhs2/__init__.py", line 7, in connect
    return Connection(*args, **kwargs)
  File "/home/vaibhav/.local/lib/python2.7/site-packages/pyhs2/connections.py", line 46, in __init__
    transport.open()
  File "/home/vaibhav/.local/lib/python2.7/site-packages/pyhs2/cloudera/thrift_sasl.py", line 66, in open
    message=("Could not start SASL: %s" % self.sasl.getError()))
thrift.transport.TTransport.TTransportException: Could not start SASL: Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found

pip install pyhs2 is not installing the latest master branch?

If I issue a cur.fetchmany(i), it fails with

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/apps/Python/lib/python2.7/site-packages/pyhs2/cursor.py", line 152, in fetchmany
    if size < 0 or size > MAX_BLOCK_SIZE:
NameError: global name 'MAX_BLOCK_SIZE' is not defined

I can see in this repo that the code has been changed to if size < 0 or size > self.MAX_BLOCK_SIZE:

However, the version installed by pip doesn't have the self. prefix and throws this error.

Is there a mismatch between this repo and pip?

pip search pyhs2
pyhs2                     - Python Hive Server 2 Client Driver
  INSTALLED: 0.6.0 (latest)
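Until a new release is published, installing directly from the repository picks up the fixed line (assuming this is the canonical repo URL):

pip install git+https://github.com/BradRuderman/pyhs2.git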

Broken pipe pyhs2

The query does not finish correctly and shows a broken pipe. How could this be fixed? Is there any way to time out queries, and how do we fix the broken pipe?

Cannot connect to Hive when hive.server2.transport.mode is set to http

Hi, we can connect to any of our clusters when the transport mode is set to binary, but we seem to have problems connecting to one that uses thrift over http. Is that even supported by pyhs2? If so, what kind of changes need to be made to the connection options in a python script?
Thanks!
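pyhs2 only builds socket-based transports, so http transport mode would need a custom transport underneath the generated TCLIService client. A sketch at the Thrift level only; the port, /cliservice path, and basic-auth header reflect common HiveServer2 http-mode setups, not anything pyhs2 provides:

import base64
from thrift.transport import THttpClient

transport = THttpClient.THttpClient('http://hive.example.com:10001/cliservice')
transport.setCustomHeaders({
    'Authorization': 'Basic ' + base64.b64encode(b'user:password').decode('ascii'),
})
# A TCLIService.Client would then be built over TBinaryProtocol(transport)
# in place of the SASL transport pyhs2 normally constructs.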

Issue accessing Hive via ipython using pyhs2 package

I am using windows 7 64 bit machine and Hive / Hadoop is on the Linux box.

I am using Anaconda and tried to pip install your package (pyhs2).

But I am getting below errors -

C:\Users\viral.parikh\AppData\Local\Continuum\Anaconda\Scripts\gcc.bat -DMS_WIN64 -mdll -O -Wall -Isasl -IC:\Users\viral.parikh\AppD
ata\Local\Continuum\Anaconda\include -IC:\Users\viral.parikh\AppData\Local\Continuum\Anaconda\PC -c sasl/saslwrapper.cpp -o build\te
mp.win-amd64-2.7\Release\sasl\saslwrapper.o

sasl/saslwrapper.cpp:21:23: fatal error: sasl/sasl.h: No such file or directory

compilation terminated.

error: command 'C:\Users\viral.parikh\AppData\Local\Continuum\Anaconda\Scripts\gcc.bat' failed with exit status 1


Cleaning up...
Command C:\Users\viral.parikh\AppData\Local\Continuum\Anaconda\python.exe -c "import setuptools, tokenize;__file__='c:\users\viral~1.par\appdata\local\temp\pip_build_Viral.Parikh\sasl\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record c:\users\viral~1.par\appdata\local\temp\pip-hs8b3m-record\install-record.txt --single-version-externally-managed --compile failed with error code 1 in c:\users\viral~1.par\appdata\local\temp\pip_build_Viral.Parikh\sasl
Storing debug log for failure in C:\Users\viral.parikh\pip\pip.log

I look forward to your help! Does the package only work with Cloudera installations? What if I have a Hortonworks installation?

Thank you,
Viral

Point to impyla?

If you're willing, you could also point to impyla on the README, which will be maintained by Cloudera.

thrift.transport.TTransport.TTransportException: TSocket read 0 bytes

I'm trying to connect to Hive using Python. I tried both PyHive and pyhs2, but each gives the following error.

Traceback (most recent call last):
  File "hello.py", line 8, in <module>
    database='hello') as conn:
  File "/usr/local/lib/python2.7/dist-packages/pyhs2/__init__.py", line 7, in connect
    return Connection(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pyhs2/connections.py", line 47, in __init__
    res = self.client.OpenSession(TOpenSessionReq(username=user, password=password, configuration=configuration))
  File "/usr/local/lib/python2.7/dist-packages/pyhs2/TCLIService/TCLIService.py", line 154, in OpenSession
    return self.recv_OpenSession()
  File "/usr/local/lib/python2.7/dist-packages/pyhs2/TCLIService/TCLIService.py", line 165, in recv_OpenSession
    (fname, mtype, rseqid) = self._iprot.readMessageBegin()
  File "build/bdist.linux-x86_64/egg/thrift/protocol/TBinaryProtocol.py", line 148, in readMessageBegin
  File "build/bdist.linux-x86_64/egg/thrift/transport/TTransport.py", line 60, in readAll
  File "build/bdist.linux-x86_64/egg/thrift/transport/TTransport.py", line 161, in read
  File "build/bdist.linux-x86_64/egg/thrift/transport/TSocket.py", line 132, in read
thrift.transport.TTransport.TTransportException: TSocket read 0 bytes

Nothing mentioned in forums or internet was able to help me bypass this issue. Is there a workaround for this ?

Could not start SASL ?

[arrange@localhost pyhs2-master]$ python example.py
Traceback (most recent call last):
  File "example.py", line 8, in <module>
    database='default') as conn:
  File "/home/arrange/software/pyhs2-master/pyhs2/__init__.py", line 7, in connect
    return Connection(*args, **kwargs)
  File "/home/arrange/software/pyhs2-master/pyhs2/connections.py", line 45, in __init__
    transport.open()
  File "/home/arrange/software/pyhs2-master/pyhs2/cloudera/thrift_sasl.py", line 66, in open
    message=("Could not start SASL: %s" % self.sasl.getError()))
thrift.transport.TTransport.TTransportException: Could not start SASL: Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found

how to use "group by" sql

I execute "select day,count(*) from tb_name group by day" and get:

raise Pyhs2Exception(res.status.errorCode, res.status.errorMessage)
pyhs2.error.Pyhs2Exception: 'Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask'

cur.fetch() can be extremely slow

I haven't quite figured out when this happens, but some queries can cause cursor.fetch() to take several hours to complete, even though running the same query from the Hive CLI and piping the output to a file takes only 60 seconds.

Is there anything I can do to help you diagnose this problem?

Set construction in connections.py not compatible with Python 2.6

I am using pyhs2 with latest CentOS (6.4). It comes with Python 2.6.6 installed globally and this is difficult to change, since doing so breaks parts of the OS.

Line #19 of your connections.py uses the curly-brace set literal, which was only added in Python 2.7 and fails under Python 2.6. If you change the line to

authMechanisms = set(['NOSASL', 'PLAIN', 'KERBEROS', 'LDAP'])

It will work in 2.6 (I tested it) and should be forward compatible to Python 2.7 and 3.x.

Announcing New Project Ownership

All-
I have talked with a few people via email, and originally thought someone was going to take the project over. That person went silent, so @kkennedy314 from BMW has graciously offered to start maintaining this library. I have spoken with @kkennedy314 and made myself completely available for guidance and any issues he wants me to tackle specifically. I have added @kkennedy314 as a collaborator and if this works out will transfer project ownership to him in the near future. Please keep logging issues, I look forward to @kkennedy314 taking pyhs2 to the next level!

Thanks,
Brad

cur.getSchema() when <date> fields exist throws exception (when field is NULL)?

I've been using pyhs2 with a Hortonworks cluster.

I have a simple table with a type "date" field, when I call the cursor::getSchema() function an exception is thrown (KeyError 17). I think this only happens when a date field value is NULL.

Here is the table:
hive -e 'describe extended test_date'
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
OK
col_name data_type comment
mydate date
_c1 bigint

Detailed Table Information Table(tableName:test_date, dbName:default, owner:jprior, createTime:1422053397, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:mydate, type:date, comment:null), FieldSchema(name:c1, type:bigint, comment:null)], location:hdfs://__.com:8020/apps/hive/warehouse/test_date, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{numFiles=5, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1422053397, numRows=386, totalSize=6910, rawDataSize=6524}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 2.275 seconds, Fetched: 4 row(s)

Here is how to generate the exception:

ipython

In [1]: import pyhs2
In [2]: cnx = pyhs2.connect(eval("{'host':'', 'port':10000, 'authMechanism':'**', 'database':'default', 'user':'', 'password':'', }"))
In [3]: cur = cnx.cursor()
In [4]: query = 'select mydate from test_date'
In [5]: cur.execute(query)
In [6]: rows = cur.fetch()
In [7]: rows[:10]
Out[7]:
[[None],
['2013-12-31'],
['2014-01-05'],
['2014-01-10'],
['2014-01-15'],
['2014-01-20'],
['2014-01-25'],
['2014-01-30'],
['2014-02-04'],
['2014-02-09']]

In [8]: column_names = [a['columnName'] for a in cur.getSchema()]

KeyError                                  Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 column_names = [a['columnName'] for a in cur.getSchema()]

/edge/1/anaconda/lib/python2.7/site-packages/pyhs2/cursor.pyc in getSchema(self)
    196         for c in self.client.GetResultSetMetadata(req).schema.columns:
    197             col = {}
--> 198             col['type'] = get_type(c.typeDesc)
    199             col['columnName'] = c.columnName
    200             col['comment'] = c.comment

/edge/1/anaconda/lib/python2.7/site-packages/pyhs2/cursor.pyc in get_type(typeDesc)
     10     for ttype in typeDesc.types:
     11         if ttype.primitiveEntry is not None:
---> 12             return TTypeId._VALUES_TO_NAMES[ttype.primitiveEntry.type]
     13         elif ttype.mapEntry is not None:
     14             return ttype.mapEntry

KeyError: 17

can't execute SQL containing ' or " characters

like the following

sql=r"select 'qq' from test"
cur.execute(sql)

I always get the error:
pyhs2.error.Pyhs2Exception: 'Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask'

But I can execute it directly in hive CLI.

Enable compression for transfer (TZlibTransport?)

Hi, I'm absolutely not an expert, but I wonder whether there is a way to activate some sort of compression for the data that is retrieved from Hive.

I've read about TZlibTransport, but I couldn't find any way to use it while reusing the existing pyhs2 code.

Any suggestions here?

Many Thanks!
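TZlibTransport does exist in the Thrift Python package, but compression only works if both ends agree on the framing, and stock HiveServer2 does not speak zlib-framed Thrift; the sketch below is therefore illustrative of the client side only:

from thrift.transport.TSocket import TSocket
from thrift.transport.TZlibTransport import TZlibTransport

sock = TSocket('hive.example.com', 10000)
transport = TZlibTransport(sock, compresslevel=9)  # wraps any TTransport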

connecting to HiveServ from Python 3.6.1

Hello
I am trying to connect to Hive using python 3.6.1
import pyhs2

with pyhs2.connect(host='xx.xxx.xxx.xxx',
                   port=xxxx,
                   authMechanism="PLAIN",
                   user='group',
                   password='xxxx',
                   database='hadoop_introduction') as conn:
    with conn.cursor() as curr:
        # Show databases
        # print curr.getDatabases()

        # Execute query
        curr.execute("select * from slocations")

        # Return column info from query
        # print curr.getSchema()

        # Fetch table results
        for i in curr.fetch():
            print(i)

However, when I connect I get the error below:

Traceback (most recent call last):
  File "connect_hadoop.py", line 14, in <module>
    database='hadoop_introduction') as conn:
  File "\AppData\Roaming\Python\Python36\site-packages\pyhs2-0.6.0-py3.6.egg\pyhs2\__init__.py", line 6, in connect
  File "\AppData\Roaming\Python\Python36\site-packages\pyhs2-0.6.0-py3.6.egg\pyhs2\connections.py", line 7, in <module>
ModuleNotFoundError: No module named 'cloudera'

ImportError: No module named _saslwrapper,

Traceback (most recent call last):
  File "atul_api.py", line 14, in <module>
    database='rashmi') as conn:
  File "/usr/lib64/python2.6/site-packages/pyhs2/__init__.py", line 6, in connect
    from .connections import Connection
  File "/usr/lib64/python2.6/site-packages/pyhs2/connections.py", line 6, in <module>
    import sasl
  File "/usr/lib64/python2.6/site-packages/sasl/__init__.py", line 1, in <module>
    from sasl.saslwrapper import *
  File "/usr/lib64/python2.6/site-packages/sasl/saslwrapper.py", line 6, in <module>
    import _saslwrapper
ImportError: No module named _saslwrapper

thrift.Thrift.TApplicationException: Internal error processing ExecuteStatement

Things were working fine and then this error started showing up. Here is the full stack trace:
Traceback (most recent call last):
  File "/Users/adsouza/software/pycharm/hive.py", line 30, in <module>
    data = hive_con(("select * from alan_test"))
  File "/Users/adsouza/software/pycharm/hive.py", line 12, in hive_con
    database='lc')
  File "/Users/adsouza/anaconda/lib/python2.7/site-packages/pyhs2/__init__.py", line 7, in connect
    return Connection(*args, **kwargs)
  File "/Users/adsouza/anaconda/lib/python2.7/site-packages/pyhs2/connections.py", line 42, in __init__
    cur.execute(query)
  File "/Users/adsouza/anaconda/lib/python2.7/site-packages/pyhs2/cursor.py", line 50, in execute
    res = self.client.ExecuteStatement(query)
  File "/Users/adsouza/anaconda/lib/python2.7/site-packages/pyhs2/TCLIService/TCLIService.py", line 244, in ExecuteStatement
    return self.recv_ExecuteStatement()
  File "/Users/adsouza/anaconda/lib/python2.7/site-packages/pyhs2/TCLIService/TCLIService.py", line 260, in recv_ExecuteStatement
    raise x
thrift.Thrift.TApplicationException: Internal error processing ExecuteStatement

And here is my code:
def hive_con(query):
    conn = pyhs2.connect(host='foo.bar.com',
                         port=10000,
                         authMechanism="PLAIN",
                         user='user',
                         password='user',
                         database='lc')
    cur = conn.cursor()
    cur.execute(query)
    # Return column info from query
    if cur.getSchema() is None:
        cur.close()
        conn.close()
        return None
    columnNames = [a['columnName'] for a in cur.getSchema()]
    print columnNames
    columnNamesStrings = [a['columnName'] for a in cur.getSchema() if a['type']=='STRING_TYPE']
    output = pd.DataFrame(cur.fetch(), columns=columnNames)

    cur.close()
    conn.close()
    return output

data = hive_con(("select * from alan_test"))
pprint(data)

Could not Connect | Thrift Error

Hi -

I am trying to connect to the Beeline server using your example. I have KERBEROS authentication. Do I need to install Thrift on the client side?


ImportError                               Traceback (most recent call last)
<ipython-input> in <module>()
      9 u = "p624626"
     10 s = "roy2015"
---> 11 connection = beeline.connect(host=DEFAULT_SERVER, port=DEFAULT_PORT, authMechanism='KERBEROS', user=u + '@' + DEFAULT_DOMAIN, password=s)
     12 statement = "select * from reads.CL_COST_CNTR limit 100"
     13 cur = connection.cursor()

/Users/taposh/anaconda/lib/python3.4/site-packages/pyhs2/__init__.py in connect(*args, **kwargs)
      4     more information.
      5     """
----> 6     from .connections import Connection
      7     return Connection(*args, **kwargs)

/Users/taposh/anaconda/lib/python3.4/site-packages/pyhs2/connections.py in <module>()
      1 import sys
      2
----> 3 from thrift.protocol.TBinaryProtocol import TBinaryProtocol
      4 from thrift.transport.TSocket import TSocket
      5 from thrift.transport.TTransport import TBufferedTransport

ImportError: No module named 'thrift'

TTransportException: TSocket read 0 bytes

This is the code I am using

# NPS Data
with pyhs2.connect(host='localhost', authMechanism="PLAIN", port=10000, user='kumara', database='project_krypton') as connection:
    with connection.cursor() as cursor:
        print "Connection Established..."
        print "Reading Data..."
        print "-" * 50
        cursor.execute("SELECT * FROM nps")
        schema = cursor.getSchema()
        print schema
        cols = [s['columnName'] for s in schema]
        print "\n\nColumns Importing: ", cols
        nps = pd.DataFrame(data=cursor.fetchall(), columns=cols)
        print "Data import complete!"
        print "-" * 50
        print "\n\nDimensions of Imported data: ", all_data.shape
        print "\nSnapshot of data: \n\n", all_data.head()
        all_data.to_csv('all_data_export.csv')

I am getting this error. Can you help please ?


TTransportException                       Traceback (most recent call last)
<ipython-input> in <module>()
      1 # NPS Data
----> 2 with pyhs2.connect(host='localhost',authMechanism = "PLAIN", port = 10000, user='kumara',database='project_krypton') as connection:
      3     with connection.cursor() as cursor:
      4         print "Connection Established..."
      5         print "Reading Data..."

/Users/kumara/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pyhs2/__init__.pyc in connect(*args, **kwargs)
      5     """
      6     from .connections import Connection
----> 7     return Connection(*args, **kwargs)

/Users/kumara/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pyhs2/connections.pyc in __init__(self, host, port, authMechanism, user, password, database, configuration, timeout)
     44
     45         self.client = TCLIService.Client(TBinaryProtocol(transport))
---> 46         transport.open()
     47         res = self.client.OpenSession(TOpenSessionReq(username=user, password=password, configuration=configuration))
     48         self.session = res.sessionHandle

/Users/kumara/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pyhs2/cloudera/thrift_sasl.pyc in open(self)
     72     # SASL negotiation loop
     73     while True:
---> 74         status, payload = self._recv_sasl_message()
     75         if status not in (self.OK, self.COMPLETE):
     76             raise TTransportException(type=TTransportException.NOT_OPEN,

/Users/kumara/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pyhs2/cloudera/thrift_sasl.pyc in _recv_sasl_message(self)
     90
     91     def _recv_sasl_message(self):
---> 92         header = self._trans.readAll(5)
     93         status, length = struct.unpack(">BI", header)
     94         if length > 0:

/Users/kumara/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/thrift/transport/TTransport.pyc in readAll(self, sz)
     56         have = 0
     57         while (have < sz):
---> 58             chunk = self.read(sz - have)
     59             have += len(chunk)
     60             buff += chunk

/Users/kumara/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/thrift/transport/TSocket.pyc in read(self, sz)
    118         if len(buff) == 0:
    119             raise TTransportException(type=TTransportException.END_OF_FILE,
--> 120                                       message='TSocket read 0 bytes')
    121         return buff
    122

TTransportException: TSocket read 0 bytes

pip3 install fails on Ubuntu 14.10

Error in compile detailed below.
This is on an ubuntu 14.10 system, using python 3.4.2 (along with the python3-setuptools package and pip3). We have standardized on python 3, so this is an issue, but I was able to install with the ubuntu default python 2.7 instance and pip2)

x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -fPIC -Isasl -I/usr/include/python3.4m -c sasl/saslwrapper.cpp -o build/temp.linux-x86_64-3.4/sasl/saslwrapper.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
sasl/saslwrapper.cpp: In member function ‘void saslwrapper::ClientImpl::interact(sasl_interact_t*)’:
sasl/saslwrapper.cpp:341:11: warning: unused variable ‘input’ [-Wunused-variable]
     char* input;
           ^
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -fPIC -Isasl -I/usr/include/python3.4m -c sasl/saslwrapper_wrap.cxx -o build/temp.linux-x86_64-3.4/sasl/saslwrapper_wrap.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
sasl/saslwrapper_wrap.cxx: In function ‘void SWIG_Python_AddErrorMsg(const char*)’:
sasl/saslwrapper_wrap.cxx:884:58: error: ‘PyString_AsString’ was not declared in this scope
     PyErr_Format(type, "%s %s", PyString_AsString(old_str), mesg);
     ^
sasl/saslwrapper_wrap.cxx: In function ‘PySwigClientData* PySwigClientData_New(PyObject*)’:
sasl/saslwrapper_wrap.cxx:1275:26: error: ‘PyClass_Check’ was not declared in this scope
     if (PyClass_Check(obj)) {
     ^
sasl/saslwrapper_wrap.cxx: In function ‘PyObject* PySwigObject_format(const char*, PySwigObject*)’:
sasl/saslwrapper_wrap.cxx:1348:47: error: ‘PyString_FromString’ was not declared in this scope
     PyObject *ofmt = PyString_FromString(fmt);
     ^
sasl/saslwrapper_wrap.cxx:1350:33: error: ‘PyString_Format’ was not declared in this scope
     res = PyString_Format(ofmt,args);
     ^
sasl/saslwrapper_wrap.cxx: In function ‘PyObject* PySwigObject_repr(PySwigObject*)’:
sasl/saslwrapper_wrap.cxx:1380:105: error: ‘PyString_AsString’ was not declared in this scope
     PyObject *repr = PyString_FromFormat("<Swig Object of type '%s' at 0x%s>", name, PyString_AsString(hex));
     ^
sasl/saslwrapper_wrap.cxx:1380:106: error: ‘PyString_FromFormat’ was not declared in this scope
     PyObject *repr = PyString_FromFormat("<Swig Object of type '%s' at 0x%s>", name, PyString_AsString(hex));
     ^
sasl/saslwrapper_wrap.cxx:1388:37: error: ‘PyString_ConcatAndDel’ was not declared in this scope
     PyString_ConcatAndDel(&repr,nrep);
     ^
sasl/saslwrapper_wrap.cxx: In function ‘int PySwigObject_print(PySwigObject*, FILE*, int)’:
sasl/saslwrapper_wrap.cxx:1402:33: error: ‘PyString_AsString’ was not declared in this scope
     fputs(PyString_AsString(repr), fp);
     ^
sasl/saslwrapper_wrap.cxx: In function ‘PyObject* PySwigObject_str(PySwigObject*)’:
sasl/saslwrapper_wrap.cxx:1415:31: error: ‘PyString_FromString’ was not declared in this scope
     PyString_FromString(result) : 0;
     ^
sasl/saslwrapper_wrap.cxx: In function ‘PyTypeObject* _PySwigObject_type()’:
sasl/saslwrapper_wrap.cxx:1624:6: error: ‘coercion’ was not declared in this scope
     (coercion)0, /*nb_coerce*/
     ^
sasl/saslwrapper_wrap.cxx:1624:15: error: expected ‘}’ before numeric constant
     (coercion)0, /*nb_coerce*/
     ^
sasl/saslwrapper_wrap.cxx:1624:15: error: expected ‘,’ or ‘;’ before numeric constant
sasl/saslwrapper_wrap.cxx:1604:15: warning: unused variable ‘swigobject_doc’ [-Wunused-variable]
     static char swigobject_doc[] = "Swig object carries a C/C++ instance pointer";
     ^
sasl/saslwrapper_wrap.cxx:1606:26: warning: unused variable ‘PySwigObject_as_number’ [-Wunused-variable]
     static PyNumberMethods PySwigObject_as_number = {
     ^
sasl/saslwrapper_wrap.cxx:1637:3: warning: no return statement in function returning non-void [-Wreturn-type]
     };
     ^
sasl/saslwrapper_wrap.cxx: At global scope:
sasl/saslwrapper_wrap.cxx:1641:3: error: expected unqualified-id before ‘if’
     if (!type_init) {
     ^
In file included from /usr/include/c++/4.9/stdexcept:38:0,
                 from sasl/saslwrapper_wrap.cxx:2542:
/usr/include/c++/4.9/exception:35:37: error: expected ‘}’ before end of line
 #pragma GCC visibility push(default)
                                     ^
/usr/include/c++/4.9/exception:35:37: error: expected declaration before end of line
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

----------------------------------------
Command "/usr/bin/python3 -c "import setuptools, tokenize;__file__='/tmp/pip-build-v5gghcou/sasl/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-022s9m5k-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-v5gghcou/sasl

cursor _fetch raises NoneType AttributeError on empty response

I am upgrading from Hive 0.10 in CDH4 to Hive 0.14 in CDH5. In this version, the attribute TFetchResultsResp.results can be None for queries that return an empty response (such as create database or create table statements). Consequently, cursor._fetch raises an error in Line 222 on the call to get the rows in the results attribute, which is None. This is trivially fixed with the following implementation of _fetch in cursor:

    def _fetch(self, rows, fetchReq):
        resultsRes = self.client.FetchResults(fetchReq)
        if resultsRes.results is None or len(resultsRes.results.rows) == 0:
            self.hasMoreRows = False
            return rows
        for row in resultsRes.results.rows:
            rowData = []
            for i, col in enumerate(row.colVals):
                rowData.append(get_value(col))
            rows.append(rowData)
        return rows

However, since this project is no longer maintained and it appears PRs are not reviewed, I will not be submitting a PR with the fix. I am simply filing the issue for reference. I'll (hackily) handle by catching the error and returning an empty set, though this is naturally a very poor solution to the problem.

Print execution logs of a query using pyhs2

Is it possible to print the execution logs of a hive query using pyhs2?
Is there any configuration that needs to be set to enable python client to print hive query execution logs?

Feature request: Cursors and connections should be context managers to guarantee that operations and sessions get closed

Thanks for writing this library! My company has been using our own wrapper around the thrift API for a while now, but it'll be great to get something standard.

The request:

As the code stands right now, one has to manually ensure that cursors and connections get closed, doing something like:

try:
    conn = Connection(...)
finally:
    conn.close()

This works, but isn't particularly Pythonic; it would be much nicer to have connections and cursors be context managers, so one could simply write:

with Connection(...) as conn:
    <do things>

and have the session closing handled for you. The same is true for cursors, which also need to perform some cleanup at the end.
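A minimal sketch of the requested protocol: any object with a close() method can gain context-manager behaviour via the standard __enter__/__exit__ hooks, which could be mixed into both Connection and Cursor:

class ClosingMixin(object):
    # Gives `with obj as x: ...` semantics to anything with close().
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, tb):
        self.close()
        return False  # propagate exceptions raised inside the with-block

With that in place, the `with Connection(...) as conn:` form above works, and nested with-blocks handle cursor cleanup the same way.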

EOFError when using Kerberos with a result longer than 782 rows when arraysize value is 1000

For some reason, after I moved my HiveServer2 setup to Kerberos I started receiving the following errors every time my query results cross 783 rows.

The issue seems to be arraysize = 1000 in cursor.py

Changing the value to 700 seems to allow the query to complete

Would anyone know why? Should we change the default?

  File "/home/trixpan/TestReporting/Aggregation/HiveQueries/ACMEcorp/reports/emails/TopSources.py", line 48, in TopSources
    __results = cur.fetchall()
  File "/home/trixpan/reporting/lib/python2.6/site-packages/pyhs2/connections.py", line 58, in __exit__
    self.close()
  File "/home/trixpan/reporting/lib/python2.6/site-packages/pyhs2/connections.py", line 78, in close
    self.client.CloseSession(req)
  File "/home/trixpan/reporting/lib/python2.6/site-packages/pyhs2/TCLIService/TCLIService.py", line 184, in CloseSession
    return self.recv_CloseSession()
  File "/home/trixpan/reporting/lib/python2.6/site-packages/pyhs2/TCLIService/TCLIService.py", line 195, in recv_CloseSession
    (fname, mtype, rseqid) = self._iprot.readMessageBegin()
  File "/home/trixpan/reporting/lib/python2.6/site-packages/thrift/protocol/TBinaryProtocol.py", line 126, in readMessageBegin
    sz = self.readI32()
  File "/home/trixpan/reporting/lib/python2.6/site-packages/thrift/protocol/TBinaryProtocol.py", line 206, in readI32
    buff = self.trans.readAll(4)
  File "/home/trixpan/reporting/lib/python2.6/site-packages/thrift/transport/TTransport.py", line 63, in readAll
    raise EOFError()
EOFError
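A workaround following the report, assuming arraysize is reachable as an attribute on the cursor rather than only a hard-coded constant (the report mentions editing cursor.py directly, so this may not apply to all versions):

cur = conn.cursor()
cur.arraysize = 700  # hypothetical: shrink the per-fetch block size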
