heavyai / pymapd

Python client for OmniSci GPU-accelerated SQL engine and analytics platform

Home Page: https://pymapd.readthedocs.io/en/latest/

License: Apache License 2.0

gpu sqlalchemy ibis pydata machine-learning rapids gpu-dataframe python hpc

pymapd's Introduction

pymapd

A wrapper for the pyomnisci library (http://github.com/omnisci/pyomnisci), maintained for backwards compatibility.

Existing scripts should be migrated from pymapd to pyomnisci; this library will not be updated going forward.
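
A minimal migration sketch, assuming pyomnisci mirrors pymapd's connect signature (which the wrapper implies); credentials shown are the OmniSci defaults:

# Before: existing scripts
# from pymapd import connect

# After: migrate to pyomnisci (same connection call, per the wrapper's purpose)
from pyomnisci import connect

con = connect(user="admin", password="HyperInteractive",
              host="localhost", dbname="omnisci")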

Quick Install (CPU)

Packages are available on conda-forge and PyPI:

conda install -c conda-forge pymapd

pip install pymapd

Quick Install (GPU)

We recommend creating a fresh conda environment with Python 3.7 or 3.8 when installing pymapd with GPU capabilities.

To install pymapd and cudf for GPU Dataframe support (conda-only):

conda create -n omnisci-gpu -c rapidsai -c nvidia -c conda-forge \
 -c defaults cudf=0.15 python=3.7 cudatoolkit=10.2 pymapd

pymapd's People

Contributors

andrewseidl, anmyachev, dennisdawson, esamuel1, hugovk, ian-r-rose, jclay, jeremyvillalobostaste, jp-harvey, kkraus14, mattdlh, mflaxman10, pearu, randyzwitch, sergiilagutin, tmostak, tomaugspurger, tonyfast, wamsiv, wesm, xmnlab


pymapd's Issues

select_ipc error

A couple of questions: does pymapd require Python 3.x? I am using Python 2.7 on a Mac Pro with the community edition of MapD.

I'm trying a simple query, "SELECT depdelay, arrdelay FROM flights_2008_10k LIMIT 100", and getting an error. It works with con.execute(query) but fails with:

df = con.select_ipc(query)

File "/Users/username/anaconda2/lib/python2.7/site-packages/pymapd/connection.py", line 308, in select_ipc
sm_buf = load_buffer(tdf.sm_handle, tdf.sm_size)
File "pymapd/shm.pyx", line 31, in pymapd.shm.load_buffer
File "pymapd/shm.pyx", line 36, in pymapd.shm.load_buffer
ValueError: Invalid shared memory key 719885386

Any help is appreciated.
Thanks

module 'numpy' has no attribute 'py_buffer'

In pymapd 0.5.0, select_ipc_gpu throws an error on select:

In [7]: gdf= conn.select_ipc_gpu("select * from fordgobike_tripdata_v2 limit 1000")                                 
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-c5f38609c9a3> in <module>
----> 1 gdf= conn.select_ipc_gpu("select * from fordgobike_tripdata_v2 limit 1000")

~/miniconda3/envs/earlboston2018/lib/python3.6/site-packages/pymapd/connection.py in select_ipc_gpu(self, operation, parameters, device_id, first_n)
    270             self._session, operation, device_id=device_id, first_n=first_n)
    271         self._tdf = tdf
--> 272         return _parse_tdf_gpu(tdf)
    273 
    274     def select_ipc(self, operation, parameters=None, first_n=-1):

~/miniconda3/envs/earlboston2018/lib/python3.6/site-packages/pymapd/_parsers.py in _parse_tdf_gpu(tdf)
    181     schema_buffer = load_buffer(tdf.sm_handle, tdf.sm_size)
    182     # TODO: extra copy.
--> 183     schema_buffer = np.py_buffer(schema_buffer.to_pybytes(), dtype=np.uint8)
    184 
    185     dtype = np.dtype(np.byte)

I suspect that should read pyarrow.py_buffer, not np.
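
If a NumPy array over the schema bytes was intended, a minimal sketch of the suspected fix (not the committed patch; schema_buffer stands in for the result of load_buffer in pymapd/_parsers.py):

import numpy as np
import pyarrow as pa

# Stand-in for load_buffer(tdf.sm_handle, tdf.sm_size) from pymapd/_parsers.py
schema_buffer = pa.py_buffer(b"\x01\x02\x03")

raw = schema_buffer.to_pybytes()
buf = pa.py_buffer(raw)                   # the pyarrow spelling
arr = np.frombuffer(raw, dtype=np.uint8)  # the NumPy spelling (np.py_buffer does not exist)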

numpy.dtype size changed (pymapd 0.4.0)

On import, from a fresh conda environment, I get the following message:

/home/rzwitch/miniconda3/envs/bursting/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
(bursting) rzwitch@randyzwitch-workstation:~$ conda list
# packages in environment at /home/rzwitch/miniconda3/envs/bursting:
#
# Name                    Version                   Build  Channel
arrow-cpp                 0.7.1                    py36_2    conda-forge
backcall                  0.1.0                      py_0    conda-forge
blas                      1.0                         mkl  
bleach                    2.1.4                      py_1    conda-forge
bzip2                     1.0.6                h470a237_2    conda-forge
ca-certificates           2018.8.24            ha4d7672_0    conda-forge
certifi                   2018.8.24                py36_1    conda-forge
decorator                 4.3.0                      py_0    conda-forge
entrypoints               0.2.3                    py36_2    conda-forge
gmp                       6.1.2                hfc679d8_0    conda-forge
html5lib                  1.0.1                      py_0    conda-forge
intel-openmp              2019.0                      118  
ipykernel                 4.9.0                    py36_0    conda-forge
ipython                   6.5.0                    py36_0    conda-forge
ipython_genutils          0.2.0                      py_1    conda-forge
jedi                      0.12.1                   py36_0    conda-forge
jinja2                    2.10                       py_1    conda-forge
jsonschema                2.6.0                    py36_2    conda-forge
jupyter_client            5.2.3                      py_1    conda-forge
jupyter_core              4.4.0                      py_0    conda-forge
libffi                    3.2.1                hfc679d8_5    conda-forge
libgcc-ng                 7.2.0                hdf63c60_3    conda-forge
libgfortran-ng            7.2.0                hdf63c60_3    conda-forge
libsodium                 1.0.16               h470a237_1    conda-forge
libstdcxx-ng              7.2.0                hdf63c60_3    conda-forge
markupsafe                1.0              py36h470a237_1    conda-forge
mistune                   0.8.3            py36h470a237_2    conda-forge
mkl                       2019.0                      118  
mkl_fft                   1.0.6                    py36_0    conda-forge
mkl_random                1.0.1                    py36_0    conda-forge
nbconvert                 5.3.1                      py_1    conda-forge
nbformat                  4.4.0                      py_1    conda-forge
ncurses                   6.1                  hfc679d8_1    conda-forge
notebook                  5.6.0                    py36_1    conda-forge
numpy                     1.15.0           py36h1b885b7_0  
numpy-base                1.15.0           py36h3dfced4_0  
openssl                   1.0.2p               h470a237_0    conda-forge
pandas                    0.23.4           py36hf8a1672_0    conda-forge
pandoc                    2.2.2                         1    conda-forge
pandocfilters             1.4.2                      py_1    conda-forge
parquet-cpp               1.3.0.post                    2    conda-forge
parso                     0.3.1                      py_0    conda-forge
pexpect                   4.6.0                    py36_0    conda-forge
pickleshare               0.7.4                    py36_0    conda-forge
pip                       18.0                     py36_1    conda-forge
prometheus_client         0.3.1                      py_1    conda-forge
prompt_toolkit            1.0.15                     py_1    conda-forge
ptyprocess                0.6.0                    py36_0    conda-forge
pyarrow                   0.7.1                    py36_1    conda-forge
pygments                  2.2.0                      py_1    conda-forge
pymapd                    0.4.0                    py36_0    conda-forge
python                    3.6.6                h5001a0f_0    conda-forge
python-dateutil           2.7.3                      py_0    conda-forge
pytz                      2018.5                     py_0    conda-forge
pyzmq                     17.1.2           py36hae99301_0    conda-forge
readline                  7.0                  haf1bffa_1    conda-forge
send2trash                1.5.0                      py_0    conda-forge
setuptools                40.4.0                   py36_0    conda-forge
simplegeneric             0.8.1                      py_1    conda-forge
six                       1.11.0                   py36_1    conda-forge
sqlalchemy                1.2.11           py36h470a237_0    conda-forge
sqlite                    3.24.0               h2f33b56_1    conda-forge
terminado                 0.8.1                    py36_1    conda-forge
testpath                  0.3.1                    py36_1    conda-forge
thrift                    0.11.0           py36hfc679d8_1    conda-forge
tk                        8.6.8                ha92aebf_0    conda-forge
tornado                   5.1.1            py36h470a237_0    conda-forge
traitlets                 4.3.2                    py36_0    conda-forge
wcwidth                   0.1.7                      py_1    conda-forge
webencodings              0.5.1                      py_1    conda-forge
wheel                     0.31.1                   py36_1    conda-forge
xz                        5.2.4                h470a237_1    conda-forge
zeromq                    4.2.5                hfc679d8_5    conda-forge
zlib                      1.2.11               h470a237_3    conda-forge

Column order of existing table not respected in columnar insert

Tables can be created in OmniSci with the columns in a user-specified order. However, when a pandas DataFrame is created, it can automatically order the columns alphabetically, with uppercase letters first. PyMapD does not check for this, so the columns can be ordered incorrectly on insert, and either the wrong data is inserted into a column or an error is returned. See the reorder sketch below.
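
A client-side guard, sketched under the assumption that Connection.get_table_details is available (it is in recent pymapd releases); "my_table", con, and df are placeholders:

# Reorder DataFrame columns to the table's declared order before loading.
table_cols = [c.name for c in con.get_table_details("my_table")]
df = df[table_cols]
con.load_table_columnar("my_table", df)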

sql_execute_gpudf returns 'Exception: CHAR is not supported in temporary table.') for TEXT columns

I have opened an issue here; I don't know whether it is a better fit for this repo.

To reproduce the error:

mapdql> CREATE TABLE test_table (col1 TEXT ENCODING DICT);
mapdql> INSERT INTO test_table VALUES ('r1');
mapdql> INSERT INTO test_table VALUES ('r2');
from thrift.protocol import TBinaryProtocol
from thrift.protocol import TJSONProtocol
from thrift.transport import TSocket
from thrift.transport import THttpClient
from thrift.transport import TTransport
from pymapd import MapD

def get_client(host_or_uri, port, http):
  if http:
    transport = THttpClient.THttpClient(host_or_uri)
    protocol = TJSONProtocol.TJSONProtocol(transport)
  else:
    socket = TSocket.TSocket(host_or_uri, port)
    transport = TTransport.TBufferedTransport(socket)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)

  client = MapD.Client(protocol)
  transport.open()
  return client

db_name = 'mapd'
user_name = 'mapd'
passwd = 'HyperInteractive'
hostname = 'localhost'
portno = 9091

client = get_client(hostname, portno, False)
session = client.connect(user_name, passwd, db_name)
print('Connection complete')

client.sql_execute_gpudf(session, "select * from test_table",device_id=0, first_n=-1)

Replace pygdf with cudf

Goal would be to allow pymapd to use cudf if the user has it installed, fall back to pygdf if they still have that installed, or throw error if neither is installed (current behavior).

Edit: Upon speaking with Keith at NVIDIA, there is no reason to maintain pygdf compatibility; the Pascal-and-newer card requirement for cudf should really just be guidance. Additionally, initial work suggests that pygdf and cudf are not drop-in swappable, so we're going with cudf outright to minimize the surface area for bugs.

At first glance, this could be blocked by #107, as a simple swap-in of cudf throws an IPC error. Given that we know the IPC process is suboptimal, we probably need to fix IPC before tackling this.
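
A minimal sketch of the cudf-only import guard this decision implies (helper name hypothetical):

try:
    import cudf
except ImportError:
    cudf = None

def _require_cudf():
    # GPU DataFrame support would require cudf outright; no pygdf fallback.
    if cudf is None:
        raise ImportError("cudf is required for GPU DataFrame support")
    return cudf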

Incorrect decoding of null value

I have a column encoded as SMALLINT ENCODING FIXED(8). When grouping by that column, all results return as expected except null which comes across as the value -32768.

It appears to be an issue with how pymapd parses the Thrift responses, because when I execute the same query through MapD's Immerse tool the results contain null as expected.
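
The reported value is exactly the SMALLINT minimum, which suggests the server's NULL sentinel isn't being translated on the client. A sketch of the missing decoding step (sentinel inferred from the symptom, not from the pymapd source):

import numpy as np

SMALLINT_NULL = np.iinfo(np.int16).min  # -32768, the value leaking through

def decode_smallint(values):
    # Map the server-side NULL sentinel back to Python None.
    return [None if v == SMALLINT_NULL else v for v in values]

print(decode_smallint([1, 2, -32768]))  # [1, 2, None]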

load_table: TIMESTAMP import fails

Looks like no values are being sent over to the server. Below is an example, and a file ([input.zip](https://github.com/mapd/pymapd/files/2422205/input.zip)) to reproduce the issue:

import pymapd
import pandas as pd

connection = pymapd.connect(user='mapd', password='HyperInteractive', host='localhost', dbname='mapd', port=7091, protocol='binary')
print(connection)

table_name = 'test'

command = 'drop table if exists %s' % (table_name)
connection.execute(command)

create_table_str = 'CREATE TABLE IF NOT EXISTS %s (download_time TIMESTAMP(0), download_ip TEXT ENCODING DICT(8), download_path TEXT ENCODING DICT(8))' % (table_name)
connection.execute(create_table_str)

csv_file = 'input.csv'
df = pd.read_csv(csv_file)
df['download_time'] = pd.to_datetime(df['download_time'])
df.reset_index(drop=True, inplace=True)
connection.load_table(table_name, df)

Query string sensitive to leading whitespace

A query string with a control character/return before the select statement fails. mapdql accepts this sort of spacing (a return just moves to a new line, and a space before select works):

pymapd fail:

cts = """
create table utahshp6(
    COUNTY_NAM     TEXT ENCODING DICT,
    COUNTY_ID      SMALLINT,
    ASSESSOR_S     TEXT ENCODING DICT,
    BOUNDARY_S     TEXT ENCODING DICT,
    DISCLAIMER     TEXT ENCODING DICT,
    CURRENT_AS     TEXT ENCODING DICT,
    PARCEL_ID      TEXT ENCODING DICT,
    SERIAL_NUM     TEXT,
    PARCEL_ADD     TEXT,
    PARCEL_CIT     TEXT ENCODING DICT,
    TAXEXEMPT_     TEXT ENCODING DICT,
    TAX_DISTRI     TEXT ENCODING DICT,
    TOTAL_MKT_    float,
    LAND_MKT_V    float,
    PARCEL_ACR    float,
    PROP_CLASS     TEXT ENCODING DICT,
    PRIMARY_RE     TEXT ENCODING DICT,
    HOUSE_CNT      TEXT ENCODING DICT,
    SUBDIV_NAM     TEXT ENCODING DICT,
    BLDG_SQFT     float,
    BLDG_SQFT_     TEXT ENCODING DICT,
    FLOORS_CNT    float,
    FLOORS_INF     TEXT ENCODING DICT,
    BUILT_YR       TEXT ENCODING DICT,
    EFFBUILT_Y     TEXT ENCODING DICT,
    CONST_MATE     TEXT ENCODING DICT,
    SHAPE_Leng    float,
    SHAPE_Area    float,
    mapd_geo      GEOMETRY(POLYGON, 4326)
);
"""

pymapd success:

cts = """create table utahshp6(
    COUNTY_NAM     TEXT ENCODING DICT,
    COUNTY_ID      SMALLINT,
    ASSESSOR_S     TEXT ENCODING DICT,
    BOUNDARY_S     TEXT ENCODING DICT,
    DISCLAIMER     TEXT ENCODING DICT,
    CURRENT_AS     TEXT ENCODING DICT,
    PARCEL_ID      TEXT ENCODING DICT,
    SERIAL_NUM     TEXT,
    PARCEL_ADD     TEXT,
    PARCEL_CIT     TEXT ENCODING DICT,
    TAXEXEMPT_     TEXT ENCODING DICT,
    TAX_DISTRI     TEXT ENCODING DICT,
    TOTAL_MKT_    float,
    LAND_MKT_V    float,
    PARCEL_ACR    float,
    PROP_CLASS     TEXT ENCODING DICT,
    PRIMARY_RE     TEXT ENCODING DICT,
    HOUSE_CNT      TEXT ENCODING DICT,
    SUBDIV_NAM     TEXT ENCODING DICT,
    BLDG_SQFT     float,
    BLDG_SQFT_     TEXT ENCODING DICT,
    FLOORS_CNT    float,
    FLOORS_INF     TEXT ENCODING DICT,
    BUILT_YR       TEXT ENCODING DICT,
    EFFBUILT_Y     TEXT ENCODING DICT,
    CONST_MATE     TEXT ENCODING DICT,
    SHAPE_Leng    float,
    SHAPE_Area    float,
    mapd_geo      GEOMETRY(POLYGON, 4326)
);
"""

Statements were run using conn.execute so that the column types could be defined explicitly instead of being inferred by pymapd.
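
Until this is fixed, a client-side workaround is to strip leading whitespace before submitting; a sketch, with a hypothetical helper name:

def execute_trimmed(con, sql):
    # Remove the leading newline/control characters the server appears to reject.
    return con.execute(sql.strip())

execute_trimmed(con, cts)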

Installation fails.

Trying to install via pip:

pip install pymapd
Collecting pymapd
  Using cached pymapd-0.3.1.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-91aap1_o/pymapd/setup.py", line 108, in <module>
        **extra_kwargs
    NameError: name 'extra_kwargs' is not defined
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-91aap1_o/pymapd/

Environment:

pip --version
pip 9.0.1 from <env_dir>/lib/python3.5/site-packages (python 3.5)

pymapd example doesn't work under pip or conda-forge

In trying to do the example here, a community user gets an error via pip install:

In [1]: import pandas as pd  
   ...: import sys  
   ...: from pymapd import connect
   ...: 
   ...: con = connect(user="mapd", password="HyperInteractive", host="localhost", dbname="mapd")
   ...: 
   ...: 

In [2]: df = con.select_ipc("""select CAST(nppes_provider_zip5 as INT) as zipcode,
   ...: sum(total_claim_count) as total_claims,
   ...: sum(opioid_claim_count) as opioid_claims from cms_prescriber 
   ...: group by 1 order by opioid_claims desc limit 100""")
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-2-b23320627e1e> in <module>()
      2 sum(total_claim_count) as total_claims,
      3 sum(opioid_claim_count) as opioid_claims from cms_prescriber
----> 4 group by 1 order by opioid_claims desc limit 100""")

~/miniconda3/lib/python3.6/site-packages/pymapd/connection.py in select_ipc(self, operation, parameters, first_n)
    296             raise ImportError("pandas is required for `select_ipc`")
    297 
--> 298         from .shm import load_buffer
    299 
    300         if parameters is not None:

ModuleNotFoundError: No module named 'pymapd.shm'

Unfortunately, installing pymapd via conda-forge gives a different error (using separate conda env for both):

(condainstall) mapdadmin@MapDCE:~$ ipython
Python 3.6.5 | packaged by conda-forge | (default, Apr  6 2018, 13:39:56) 
Type 'copyright', 'credits' or 'license' for more information
IPython 6.3.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd  
   ...: import sys  
   ...: from pymapd import connect
   ...: 
   ...: con = connect(user="mapd", password="HyperInteractive", host="localhost", dbname="mapd")  
   ...: 
   ...: 

In [2]: import pandas as pd  
   ...: import sys  
   ...: from pymapd import connect
   ...: 
   ...: con = connect(user="mapd", password="HyperInteractive", host="localhost", dbname="mapd")  
   ...: 
   ...: 

In [3]: prescriber_df = pd.read_csv("data/PartD_Prescriber_PUF_NPI_15.txt", sep='\t', low_memory=False)  
   ...: 

In [4]: str_cols = prescriber_df.columns[prescriber_df.dtypes==object]  
   ...: prescriber_df[str_cols] = prescriber_df[str_cols].fillna('NA')  
   ...: prescriber_df.fillna(0,inplace=True)
   ...: 
   ...: 

In [5]: con.execute('drop table if exists cms_prescriber')  
   ...: con.create_table("cms_prescriber",prescriber_df, preserve_index=False)  
   ...: %time con.load_table("cms_prescriber", prescriber_df, preserve_index=False)
   ...: 
   ...: 
---------------------------------------------------------------------------
TMapDException                            Traceback (most recent call last)
<timed eval> in <module>()

~/miniconda3/envs/condainstall/lib/python3.6/site-packages/pymapd/connection.py in load_table(self, table_name, data, method, preserve_index, create)
    418         if method == 'infer':
    419             if (_is_pandas(data) or _is_arrow(data)) and _HAS_ARROW:
--> 420                 return self.load_table_arrow(table_name, data)
    421 
    422             elif _is_pandas(data):

~/miniconda3/envs/condainstall/lib/python3.6/site-packages/pymapd/connection.py in load_table_arrow(self, table_name, data, preserve_index)
    520                                            preserve_index=preserve_index)
    521         self._client.load_table_binary_arrow(self._session, table_name,
--> 522                                              payload.to_pybytes())
    523 
    524 

~/miniconda3/envs/condainstall/lib/python3.6/site-packages/mapd/MapD.py in load_table_binary_arrow(self, session, table_name, arrow_stream)
   1614         """
   1615         self.send_load_table_binary_arrow(session, table_name, arrow_stream)
-> 1616         self.recv_load_table_binary_arrow()
   1617 
   1618     def send_load_table_binary_arrow(self, session, table_name, arrow_stream):

~/miniconda3/envs/condainstall/lib/python3.6/site-packages/mapd/MapD.py in recv_load_table_binary_arrow(self)
   1638         iprot.readMessageEnd()
   1639         if result.e is not None:
-> 1640             raise result.e
   1641         return
   1642 

TMapDException: TMapDException(error_msg='Expected a single Arrow record batch. Import aborted')

_build_input_rows doesn't handle array types properly

When loading a table row-wise with array types, Python lists are not converted into the string form MapD expects for ingestion; MapD expects arrays to be formatted as "{value1,value2,value3,...}" (see the formatting sketch after the reproduction).

Reproducible use case:

from pymapd import connect

# Connect to MapD and create table
con = connect(user="mapd", password="HyperInteractive", host="localhost", dbname="mapd")
cur.execute("DROP TABLE IF EXISTS keith_test_lists;")
cur.execute("CREATE TABLE IF NOT EXISTS keith_test_lists (col1 TEXT, col2 INT[]);")

# Create 2 Test Rows: badrow causes error in ingestion for MapD, goodrow works as expected
badrow = [("row1", [1,2,3]), ("row2", [4,5,6])]
goodrow = [("row1", "{10,20,30}"), ("row2", "{40,50,60}")]

# Try inserting the rows into MapD
con.load_table_rowwise("keith_test_lists", badrow)
con.load_table_rowwise("keith_test_lists", goodrow)

Use mapdql to query the data:

mapdql> select * from keith_test_lists ;
col1|col2
row1|{}
row2|{}
row1|{10, 20, 30}
row2|{40, 50, 60}
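
A sketch of the formatting _build_input_rows would need to apply to Python lists (helper name hypothetical):

def to_mapd_array_literal(values):
    # MapD expects array literals as "{value1,value2,value3}".
    return "{" + ",".join(str(v) for v in values) + "}"

badrow = [("row1", [1, 2, 3]), ("row2", [4, 5, 6])]
fixed = [(c1, to_mapd_array_literal(c2)) for c1, c2 in badrow]
print(fixed)  # [('row1', '{1,2,3}'), ('row2', '{4,5,6}')]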

arrow: add support for Arrow 0.10.0

The unit tests include some pre-generated copies of Arrow batch data+schema buffers, which are version specific.

These need to either be updated to be generated with Arrow 0.10.0, or else we should look into generating them on the fly, as @pearu did in rapidsai/cudf#200

There are also some minor changes required (py_buffer vs. frombuffer).

Note: the current builds of mapd-core in https://hub.docker.com/r/mapd/core-os-cpu/ are still built with Arrow 0.7.1. I'll have to push a build with 0.10.0 for the Travis tests to use. For local testing, mapd-core can be built with either 0.7.1 or 0.10.0.

DATE not supported in temporary table

When reading from an existing MapD table having a DATE column defined, instead of TIMESTAMP, the following error is thrown:

TMapDException                            Traceback (most recent call last)
<ipython-input-7-32eca2c737c5> in <module>()
      1 query = "SELECT * from hourly_loads limit 100"
----> 2 df = con.select_ipc(query)

~/miniconda3/envs/pygdf_dev/lib/python3.5/site-packages/pymapd/connection.py in select_ipc(self, operation, parameters, first_n)
    303         tdf = self._client.sql_execute_df(
    304             self._session, operation, device_type=0, device_id=0,
--> 305             first_n=first_n
    306         )
    307 

~/miniconda3/envs/pygdf_dev/lib/python3.5/site-packages/mapd/MapD.py in sql_execute_df(self, session, query, device_type, device_id, first_n)
   1069         """
   1070         self.send_sql_execute_df(session, query, device_type, device_id, first_n)
-> 1071         return self.recv_sql_execute_df()
   1072 
   1073     def send_sql_execute_df(self, session, query, device_type, device_id, first_n):

~/miniconda3/envs/pygdf_dev/lib/python3.5/site-packages/mapd/MapD.py in recv_sql_execute_df(self)
   1097             return result.success
   1098         if result.e is not None:
-> 1099             raise result.e
   1100         raise TApplicationException(TApplicationException.MISSING_RESULT, "sql_execute_df failed: unknown result")
   1101 

TMapDException: TMapDException(error_msg='Exception: DATE is not supported in temporary table.')

I suspect this is an underlying pygdf issue, but maybe it makes sense to auto-convert the results to the supported TIMESTAMP type, packing the end with 00:00:00?
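
A possible workaround in the meantime, sketched with hypothetical column names, is to cast DATE columns to TIMESTAMP in the query itself so the temporary table only sees supported types:

# Hypothetical workaround: cast the DATE column explicitly before select_ipc.
query = """SELECT CAST(load_date AS TIMESTAMP) AS load_ts, val
           FROM hourly_loads LIMIT 100"""
df = con.select_ipc(query)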

Make load_table_columnar the default load_table

If load_table_columnar is roughly 10x faster than load_table (row-wise), and there are no other downsides, it seems to make sense for load_table to default to load_table_columnar. If there are good reasons, we can keep the row-wise loader around as load_table_rowwise or something similar. Thoughts? A sketch of the proposal follows.
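
A sketch of what the proposed default could look like (method names as discussed here, not a committed design):

def load_table(self, table_name, data, method="columnar"):
    # Default to the ~10x faster columnar path, keeping row-wise as an opt-in.
    if method == "columnar":
        return self.load_table_columnar(table_name, data)
    return self.load_table_rowwise(table_name, data)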

pymapd unable to create pygdf if column datatype is fixed encoding

Hello,

I have a database that has many columns which are of fixed encoding datatype. When I try to use

query = 'Select * from Table'

df = con.select_ipc_gpu(query)

I receive the following error:

---------------------------------------------------------------------------
TTransportException                       Traceback (most recent call last)
<ipython-input-23-9322698f472e> in <module>()
      1 t0 =time.time()
----> 2 df = con.select_ipc_gpu(query)
      3 tf = time.time()
      4 
      5 print('time taken to create pygdf dataframe: ', tf-t0)

/home/appuser/.conda/envs/pycudf_notebook_py35/lib/python3.5/site-packages/pymapd-0.3.0-py3.5-linux-x86_64.egg/pymapd/connection.py in select_ipc_gpu(self, operation, parameters, device_id, first_n)
    264 
    265         tdf = self._client.sql_execute_gdf(
--> 266             self._session, operation, device_id=device_id, first_n=first_n)
    267         return _parse_tdf_gpu(tdf)
    268 

/home/appuser/.conda/envs/pycudf_notebook_py35/lib/python3.5/site-packages/pymapd-0.3.0-py3.5-linux-x86_64.egg/mapd/MapD.py in sql_execute_gdf(self, session, query, device_id, first_n)
   1109         """
   1110         self.send_sql_execute_gdf(session, query, device_id, first_n)
-> 1111         return self.recv_sql_execute_gdf()
   1112 
   1113     def send_sql_execute_gdf(self, session, query, device_id, first_n):

/home/appuser/.conda/envs/pycudf_notebook_py35/lib/python3.5/site-packages/pymapd-0.3.0-py3.5-linux-x86_64.egg/mapd/MapD.py in recv_sql_execute_gdf(self)
   1124     def recv_sql_execute_gdf(self):
   1125         iprot = self._iprot
-> 1126         (fname, mtype, rseqid) = iprot.readMessageBegin()
   1127         if mtype == TMessageType.EXCEPTION:
   1128             x = TApplicationException()

/home/appuser/.conda/envs/pycudf_notebook_py35/lib/python3.5/site-packages/thrift/protocol/TBinaryProtocol.py in readMessageBegin(self)
    132 
    133     def readMessageBegin(self):
--> 134         sz = self.readI32()
    135         if sz < 0:
    136             version = sz & TBinaryProtocol.VERSION_MASK

/home/appuser/.conda/envs/pycudf_notebook_py35/lib/python3.5/site-packages/thrift/protocol/TBinaryProtocol.py in readI32(self)
    215 
    216     def readI32(self):
--> 217         buff = self.trans.readAll(4)
    218         val, = unpack('!i', buff)
    219         return val

/home/appuser/.conda/envs/pycudf_notebook_py35/lib/python3.5/site-packages/thrift/transport/TTransport.py in readAll(self, sz)
     58         have = 0
     59         while (have < sz):
---> 60             chunk = self.read(sz - have)
     61             have += len(chunk)
     62             buff += chunk

/home/appuser/.conda/envs/pycudf_notebook_py35/lib/python3.5/site-packages/thrift/transport/TTransport.py in read(self, sz)
    159         if len(ret) != 0:
    160             return ret
--> 161         self.__rbuf = BufferIO(self.__trans.read(max(sz, self.__rbuf_size)))
    162         return self.__rbuf.read(sz)
    163 

/home/appuser/.conda/envs/pycudf_notebook_py35/lib/python3.5/site-packages/thrift/transport/TSocket.py in read(self, sz)
    130         if len(buff) == 0:
    131             raise TTransportException(type=TTransportException.END_OF_FILE,
--> 132                                       message='TSocket read 0 bytes')
    133         return buff
    134 

TTransportException: TSocket read 0 bytes

However, if I restrict the query to columns which are not fixed-encoded, it works.

Is there a way for me to use pymapd to make a pygdf dataframe without having to change the datatypes in my table?

Thanks,
Abraham

Support new Geo types import from WKT

Now that geo types have landed, it would be great to see a pymapd update that accommodates them directly. Currently we are limited to importing geometries as independent columns, or as strings. The only way to get true points, for example, is to re-export as CSV and then re-import, which is awkward and inefficient.

Current behavior:

  • WKT columns are treated as strings.
  • Latitude and longitude columns are treated as independent numeric columns.

Expected behavior:

  • Generally, behave as Immerse "Import" and "Copy From" do.
  • For adjacent x,y columns with a table DDL defining a POINT, have table uploads create POINTs.
  • For supported WKT types, set the column type from the WKT geometry type (POINT(...) -> POINT, etc.).

Merge docs and travis environment files

Per #92, we should make a single environment file for the docs and travis to avoid maintaining two locations for an installable environment.

Creating this issue as a placeholder; I plan to work on this myself.

Remove numpy shim, replace with pyarrow functionality

Newer versions of pyarrow (v0.10+) have improved IPC for both CPU and GPU. Evaluate whether we can replace the Cython-based NumPy buffer shim with a pyarrow solution, both for maintainability/simplicity and to possibly fix #46.

Hangs in 'COPY <table> FROM...'

Hi!

I'm creating a daily job that moves data from another database into MapD. It generates a .csv file that is moved onto the MapD server, then uses pymapd (execute) to run the "COPY <table> FROM '<.csv>'" statement. The problem is that when the .csv is large enough, the call hangs and never returns. Any idea how to prevent this?

>>> copy_response = list(mapd.execute("COPY FACT_USER_ACQUISITION_DAILY_MAPD FROM '/raidStorage/imports/FACT_USER_ACQUISITION_DAILY_MAPD_FINAL.csv';"))[0][0]

^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/.conda/envs/teradata_env/lib/python3.5/site-packages/pymapd/connection.py", line 226, in execute
    return c.execute(operation, parameters=parameters)
  File "/home/user/.conda/envs/teradata_env/lib/python3.5/site-packages/pymapd/cursor.py", line 120, in execute
    nonce=None, first_n=-1)
  File "/home/user/.conda/envs/teradata_env/lib/python3.5/site-packages/mapd/MapD.py", line 1030, in sql_execute
    return self.recv_sql_execute()
  File "/home/user/.conda/envs/teradata_env/lib/python3.5/site-packages/mapd/MapD.py", line 1046, in recv_sql_execute
    (fname, mtype, rseqid) = iprot.readMessageBegin()
  File "/home/user/.conda/envs/teradata_env/lib/python3.5/site-packages/thrift/protocol/TBinaryProtocol.py", line 134, in readMessageBegin
    sz = self.readI32()
  File "/home/user/.conda/envs/teradata_env/lib/python3.5/site-packages/thrift/protocol/TBinaryProtocol.py", line 217, in readI32
    buff = self.trans.readAll(4)
  File "/home/user/.conda/envs/teradata_env/lib/python3.5/site-packages/thrift/transport/TTransport.py", line 60, in readAll
    chunk = self.read(sz - have)
  File "/home/user/.conda/envs/teradata_env/lib/python3.5/site-packages/thrift/transport/TTransport.py", line 161, in read
    self.__rbuf = BufferIO(self.__trans.read(max(sz, self.__rbuf_size)))
  File "/home/user/.conda/envs/teradata_env/lib/python3.5/site-packages/thrift/transport/TSocket.py", line 117, in read
    buff = self.handle.recv(sz)
KeyboardInterrupt

Fix master build issue on Travis

On Travis, our master badge indicates that our build doesn't pass, but it appears that this is only an issue with automated uploads of the built package (which we don't want to do anyway).

Debug and fix this, so that the badge reflects that the package is working correctly (which it is).

select_ipc hangs

refs #31

My business code is quite complicated, but I managed to simplify things to make this easy to debug.

On my server (CentOS 7.3 with an M40), the program hangs every time count reaches 2048, and it cannot fetch data any more until the server is restarted.

Table schema

create table pymapd_test ( t text not null );

Data

insert into pymapd_test values ('hello world');

Python code

from pymapd import connect

conn = connect(user="mapd", password="HyperInteractive", host="localhost", dbname="mapd")

count = 0
while True:
    query = "select * from pymapd_test"

    # will hang
    df = conn.select_ipc(query)

    # # won't hang
    # with conn as c:
    #     c.execute(query)

    count = count + 1
    print(count)

TIMESTAMP is not supported in temporary table

Hi everyone!

I am trying to execute a simple SQL query with select_ipc, but it seems to conflict with TIMESTAMP fields; using plain execute works fine.

mapd_cli.con.select_ipc('''
SELECT *
FROM mapd.flights_2008_10k
LIMIT 5;
''')
---------------------------------------------------------------------------
TMapDException                            Traceback (most recent call last)
<ipython-input-29-3a4f34b7bd4d> in <module>()
      3 FROM mapd.flights_2008_10k
      4 LIMIT 5;
----> 5 ''')

~/miniconda3/envs/ibis/lib/python3.6/site-packages/pymapd/connection.py in select_ipc(self, operation, parameters, first_n)
    303         tdf = self._client.sql_execute_df(
    304             self._session, operation, device_type=0, device_id=0,
--> 305             first_n=first_n
    306         )
    307 

~/miniconda3/envs/ibis/lib/python3.6/site-packages/mapd/MapD.py in sql_execute_df(self, session, query, device_type, device_id, first_n)
   1069         """
   1070         self.send_sql_execute_df(session, query, device_type, device_id, first_n)
-> 1071         return self.recv_sql_execute_df()
   1072 
   1073     def send_sql_execute_df(self, session, query, device_type, device_id, first_n):

~/miniconda3/envs/ibis/lib/python3.6/site-packages/mapd/MapD.py in recv_sql_execute_df(self)
   1097             return result.success
   1098         if result.e is not None:
-> 1099             raise result.e
   1100         raise TApplicationException(TApplicationException.MISSING_RESULT, "sql_execute_df failed: unknown result")
   1101 

TMapDException: TMapDException(error_msg='Exception: TIMESTAMP is not supported in temporary table.')

Test failing on nvidia machine

Likely just a different version of MapD running.

pytest tests/test_integration.py -k test_invalid_sql

self = <mapd.MapD.Client object at 0x7f09ad53f1d0>

    def recv_sql_execute(self):
        iprot = self._iprot
        (fname, mtype, rseqid) = iprot.readMessageBegin()
        if mtype == TMessageType.EXCEPTION:
            x = TApplicationException()
            x.read(iprot)
            iprot.readMessageEnd()
            raise x
        result = sql_execute_result()
        result.read(iprot)
        iprot.readMessageEnd()
        if result.success is not None:
            return result.success
        if result.e is not None:
>           raise result.e
E           mapd.ttypes.TMapDException: TMapDException(error_msg="Exception: Exception occurred: org.apache.calcite.runtime.CalciteContextException: From line 1, column 8 to line 1, column 11: Column 'it' not found in any table")

mapd/MapD.py:947: TMapDException

The above exception was the direct cause of the following exception:

self = <tests.test_integration.TestIntegration object at 0x7f099d569cc0>, con = Connection(mapd://mapd:***@localhost:9091/mapd?protocol=binary)

    def test_invalid_sql(self, con):
        with pytest.raises(ProgrammingError) as r:
>           con.cursor().execute("select it;")

tests/test_integration.py:50:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pymapd/cursor.py:115: in execute
    six.raise_from(_translate_exception(e), e)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

value = DatabaseError("Exception: Exception occurred: org.apache.calcite.runtime.CalciteContextException: From line 1, column 8 to line 1, column 11: Column 'it' not found in any table",)
from_value = TMapDException(error_msg="Exception: Exception occurred: org.apache.calcite.runtime.CalciteContextException: From line 1, column 8 to line 1, column 11: Column 'it' not found in any table")

>   ???
E   pymapd.exceptions.DatabaseError: Exception: Exception occurred: org.apache.calcite.runtime.CalciteContextException: From line 1, column 8 to line 1, column 11: Column 'it' not found in any table

<string>:2: DatabaseError
==================================================================== 8 tests deselected ====================================================================

Add sqlalchemy to requirements

From a fresh install of pymapd using the environment.yml conda environment file, SQLAlchemy is not installed:

(mapd-dev-docs) mapdadmin@MapDCE:~/randy/pydatanyc2018/pymapd$ python
Python 3.5.5 | packaged by conda-forge | (default, Jul 23 2018, 23:45:43) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pymapd
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mapdadmin/randy/pydatanyc2018/pymapd/pymapd/__init__.py", line 18, in <module>
    from .connection import (  # noqa
  File "/home/mapdadmin/randy/pydatanyc2018/pymapd/pymapd/connection.py", line 8, in <module>
    from sqlalchemy.engine.url import make_url
ImportError: No module named 'sqlalchemy'

Array columns aren't properly parsed which causes exception in fetching results

Minimal example to reproduce error:

from pymapd import connect
con = connect(user="mapd", password="HyperInteractive", host="localhost", dbname="mapd")

con.execute("DROP TABLE IF EXISTS keith_test_lists;")
con.execute("CREATE TABLE IF NOT EXISTS keith_test_lists (col1 TEXT, col2 INT[]);")

row = [("row1", "{10,20,30}"), ("row2", "{40,50,60}")]

con.load_table_rowwise("keith_test_lists", row)

con.execute("select * from keith_test_lists").fetchall()

yields:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/splunk/miniconda2/envs/splunk_mapd/lib/python2.7/site-packages/pymapd/cursor.py", line 170, in fetchall
    return list(self)
  File "/opt/splunk/miniconda2/envs/splunk_mapd/lib/python2.7/site-packages/pymapd/cursor.py", line 206, in make_row_results_set
    yield tuple(columns[j][i] for j in range(ncols))
  File "/opt/splunk/miniconda2/envs/splunk_mapd/lib/python2.7/site-packages/pymapd/cursor.py", line 206, in <genexpr>
    yield tuple(columns[j][i] for j in range(ncols))
IndexError: list index out of range

I'm guessing the issue is related to the following definitions not checking if the columns are array types: https://github.com/mapd/pymapd/blob/b52f92c47f4a192e8f33ef14a04327daa8893dd0/pymapd/_parsers.py#L36-L72

Shouldn't Pandas 'categorical' type map to OmniSci dictionary encoded string?

Pymapd currently does not accept categorical columns from pandas, and issues an appropriate error message.

So this is a feature enhancement request: one would expect the pandas categorical type to match the intent of an OmniSci dictionary-encoded string, and to always be representable as one.

Current behavior:
Unsupported type error

Expected behavior:
Pandas categorical columns are mapped to dictionary-encoded text. This could either be done for the general case, or further refined so that the dictionary size is optimized.
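
Until that lands, a workaround sketch is to cast categoricals to plain strings before loading, letting the server dictionary-encode the resulting TEXT column ("my_table", con, and df are placeholders):

# Cast pandas categorical columns to plain strings before load_table.
cat_cols = df.select_dtypes(include="category").columns
df[cat_cols] = df[cat_cols].astype(str)
con.load_table("my_table", df)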

execute sometimes returns results from previous call

I'm using pymapd 0.4.0 in Jupyter with MapD 4.0.2. I've seen this happen a few times, and it happened in 0.3.x also. I just restart the connection or the notebook to work around it.

(screenshot of the stale results omitted)

I don't know how to reproduce this. One guess is that it happens when a call to execute is killed.

Improve documentation

Could do more around a full tutorial, document known limitations, etc.

  • clean up copyright notice
  • remove remaining MapD references from docstrings
  • Tagging a new version for PyPI and conda (so future @randyzwitch remembers)
  • Setting up dev environment
  • Python 3 only starting with version 0.8
  • Default ports changed in OmniSci 4.5
  • Statement that by default, auto-creating tables uses TEXT ENCODED DICT(32) for text columns
  • Fix autodoc docstrings on API page
  • Shared memory/IPC methods
  • How to connect to cloud / difference between binary and http(s) protocols
  • System word size not checked, can overflow when selecting BIGINT column on 32-bit system
  • Note decimal limitation when using con.execute, that the result is returned as float
  • Highlight timestamps should be uploaded in UTC (will fail columnar if not)
  • full tutorial
  • Working with geospatial data (#115)
  • When to use/not to use various load_table methods (#208)

Add Support for Deallocation of Resources

User Story

As a Pymapd user, I would like to deallocate resources assigned to the (MapD) TGPUDataFrames, once I am done with the operations.

Issue Description

As of now, each time I select data from MapD, resources get allocated for IPC and accumulate, ultimately leading to memory exhaustion.
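
For reference, later pymapd releases added deallocate_ipc / deallocate_ipc_gpu for this; a sketch of the requested usage (table and column names hypothetical):

# Desired usage: explicitly release server-side IPC resources when done.
gdf = con.select_ipc_gpu("SELECT a, b FROM t LIMIT 1000")
# ... work with gdf ...
con.deallocate_ipc_gpu(gdf)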

PKG: Specify required and optional dependencies

I've been a bit cavalier with exactly what is a required dependency. We need to

  1. decide on what's required
  2. ensure setup.py works with just the required
  3. document that

Here are the proposed requirements (a setup.py sketch follows the lists):

Required dependencies

  • six
  • thrift
  • setuptools_scm (build only)
  • sqlalchemy (for parametrized queries)

Optional Dependencies

  1. GPU

    • pygdf
    • libgdf
  2. CPU shared memory

    • pandas (and numpy)
    • pyarrow / arrow-cpp
    • cython (build)
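
A sketch of how this split could be expressed in setup.py (extras names hypothetical, not the actual packaging config):

from setuptools import setup

setup(
    name="pymapd",
    setup_requires=["setuptools_scm"],
    install_requires=["six", "thrift", "sqlalchemy"],
    extras_require={
        "gpu": ["pygdf", "libgdf"],              # GPU DataFrame support
        "shm": ["pandas", "pyarrow", "cython"],  # CPU shared memory
    },
)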

Dates less than epoch

I've noticed this fails on historical data. I know MapD can handle the range 1901-2038, but does anyone know if that's possible with this module?

'Connection' object has no attribute 'load_table_arrow'

import pyarrow as pa
import pandas as pd
import pymapd as pm

mapd_con = pm.connect(user="mapd", password='password', host="localhost", dbname="mapd")
mapd_cursor = mapd_con.cursor()
df = pd.DataFrame({"A": [1, 2], "B": ['c', 'd']})
table = pa.Table.from_pandas(df)
mapd_con.load_table_arrow("test", table)

#error text
'Connection' object has no attribute 'load_table_arrow'

I installed pymapd via conda install -c conda-forge pymapd, but it can't find load_table_arrow or load_table_columnar, and I get a weird error about columns when trying to use load_table.

With so little code being used and no errors on pm.connect, this seems more likely to be a bug in pymapd than something on my end.

Conda-forge package doesn't have thrift version pinned to 0.10.0

It ends up pulling thrift 0.11.0, which causes the following error when trying to create a connection:

TypeError: expecting list of size 2 for struct args

Sample conda create to show issue:

conda create --dry-run -n dummy -c conda-forge pymapd                    
Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /home/nfs/kkraus/anaconda3/envs/dummy:

The following NEW packages will be INSTALLED:

    arrow-cpp:       0.8.0-py36_2          conda-forge
    ca-certificates: 2017.11.5-0           conda-forge
    certifi:         2017.11.5-py36_0      conda-forge
    intel-openmp:    2018.0.0-hc7b2577_8              
    libgcc-ng:       7.2.0-h7cc24e2_2                 
    libgfortran-ng:  7.2.0-h9f7466a_2                 
    mkl:             2018.0.1-h19d6760_4              
    ncurses:         5.9-10                conda-forge
    numpy:           1.14.0-py36h3dfced4_1            
    openssl:         1.0.2n-0              conda-forge
    pandas:          0.22.0-py36_0         conda-forge
    parquet-cpp:     1.4.0.pre-0           conda-forge
    pip:             9.0.1-py36_1          conda-forge
    pyarrow:         0.8.0-py36_0          conda-forge
    pymapd:          0.3.1-py36_0          conda-forge
    python:          3.6.4-0               conda-forge
    python-dateutil: 2.6.1-py36_0          conda-forge
    pytz:            2017.3-py_2           conda-forge
    readline:        7.0-0                 conda-forge
    setuptools:      38.4.0-py36_0         conda-forge
    six:             1.11.0-py36_1         conda-forge
    sqlalchemy:      1.2.1-py36_0          conda-forge
    sqlite:          3.20.1-2              conda-forge
    thrift:          0.11.0-py36_0         conda-forge
    tk:              8.6.7-0               conda-forge
    wheel:           0.30.0-py36_2         conda-forge
    xz:              5.2.3-0               conda-forge
    zlib:            1.2.11-0              conda-forge

For a workaround until thrift 0.11.0 is supported, do:

conda install -c conda-forge pymapd thrift=0.10.0
