wiseio / paratext Goto Github PK

View Code? Open in Web Editor NEW

1.1K 1.1K 103.0 319 KB

A library for reading text files over multiple cores.

License: Apache License 2.0

Python 27.63% C++ 69.04% Shell 1.84% R 1.49%

paratext's People

Stargazers

Watchers

Forkers

kiapper galacticsurfer yut148 micous xdjcl yanlinaung flycrane security-geeks cdfpaz mckellyln 42machinelearning bmaggard melsiddieg caseymacphee chagge thomasdic2000 fangzheng354 tzubal chorior deviffy ricohuang codeaudit pombredanne andrewseidl mshah2188 aggayush zonemercy chris-b1 deads dstarr1 dav009 lexsf khanhdinh jonathananderson catchmrbharath basio longscu v0dro krmohanty hmalphettes prehensilecode cc272309126 stangelandcl nkhuyu jamesramm datalytica wangjiahong yubobo zclongpop123 maoyikun simonzhangsm linhua-sun bryonglodencissp leigeng2014 zhuanglineu jiapei100 zhangboxun pdbaines zfq308 jgraving mlh14 laeeth arita37 kumarashwani2710 batterseapower dongzibo rchicoli awesome-python bobquest33 dmerson abk-chicago wenyanghan ajinkyakulkarni sprapat bmritz radovankavicky gapdata gsvijayraajaa cquliaoli cyg056 mrlevo520 tiledb-inc tdenniston tharamathew16 kingsrd xuqianjin-stars wlnpu haochuang tiagokapp iloleg lirunhua kirosg fanfanruyun nikotinus kipropguoguo rcrowell chopinwang muse117 nirvana2211 mzapukhlyak

paratext's Issues

Windows cygwin build fails

I'm experiencing some build issues for a windows machine with cygwin. The g++ compiler is successfully located, but still ends with several compile errors. System information is below, and I have attached the build's output.

OS: Windows 7 64-bit
Compiler: g++ v5.3
Python v3.5

Build command:
python setup.py build --compiler=cygwin > build_output.txt
build_output.txt

Paratext fails to parse test file(s)

Hi there. I have a csv-game on bitbucket. I ran the test file through paratext and it failed. The test file is generated with this script:

#!/bin/bash

# Simple csv file which should flex escaping a little.
for i in $(seq 1 1000000); 
  do echo 'hello,","," ",world,"!"'; 
done > /tmp/hello.csv

# Test for 'hello world'
touch /tmp/empty.csv

The code I use is as follows:

#!/usr/bin/env python
import paratext
print sum(map(lambda x: len(x[1]), paratext.load_raw_csv("/dev/stdin",
    no_header=True, allow_quoted_newlines=True)))

$ python2/csvreader-paratext.py < /tmp/hello.csv
Traceback (most recent call last):
  File "python2/csvreader-paratext.py", line 4, in <module>
    allow_quoted_newlines=True)])
  File "/home/ehiggs/.virtualenvs/paratext/lib/python2.7/site-packages/paratext/core.py", line 271, in load_raw_csv
    loader = internal_create_csv_loader(filename, *args, **kwargs)
  File "/home/ehiggs/.virtualenvs/paratext/lib/python2.7/site-packages/paratext/core.py", line 161, in internal_create_csv_loader
    loader.load(_make_posix_filename(filename), params)
  File "/home/ehiggs/.virtualenvs/paratext/lib/python2.7/site-packages/paratext_internal.py", line 414, in load
    return _paratext_internal.ColBasedLoader_load(self, filename, params)
RuntimeError: The file ends with an open quote (4506147)

Changing to num_threads=1 fixes this, but obviously it's a racey bug. There are a few other bugs that are associated.

To get an idea of the baseline performance of I also parse an empty file. paratext segfaults:

$ cat /tmp/empty.csv
$ python2/csvreader-paratext.py < /tmp/empty.csv 
Segmentation fault: 11

Normally to test how we process a csv file, we should be able to use basic cmd line tools to subsample the file and pass it to the csv reader. This seems to fail with paratext. The following hangs on 100% cpu and doesn't respond to SIGINT (iow I can't use Ctrl-C and must use SIGSUSP (Ctrl-Z) and then kill %1

$ head -5 /tmp/hello.csv  | time python2/csvreader-paratext.py

Build fails on Windows 7 with Anaconda

I am having issues building this package. Here's what I've got installed:

Python 2.7.12 :: Anaconda 4.1.1 (64-bit)
gcc.exe (GCC) 4.9.3
SWIG Version 3.0.10
Compiled with i686-w64-mingw32-g++ [i686-w64-mingw32]
Configured options: +pcre

This is the output I get when I try to build:

$ python setup.py build install /c/Users/icassidy/AppData/Local/Continuum/Anaconda2/Library/bin/swig ..\src\diagnostic\parse_and_sum.hpp(158) : Warning 302: Identifier 'parse_token' redefined (ignored), ..\src\diagnostic\parse_and_sum.hpp(129) : Warning 302: previous definition of 'parse_token'. 0.2.1rc1 ('running swig: ', ['swig', '-c++', '-python', '-I../src/', '-outdir', './', '../src/paratext_internal.i']) running build running config_cc unifing config_cc, config, build_clib, build_ext, build commands --compiler options running config_fc unifing config_fc, config, build_clib, build_ext, build commands --fcompiler options running build_src build_src building py_modules sources building extension "_paratext_internal" sources build_src: building npy-pkg config files running build_py copying paratext_internal.py -> build\lib.win-amd64-2.7 copying paratext\__init__.py -> build\lib.win-amd64-2.7\paratext running build_ext customize MSVCCompiler customize MSVCCompiler using build_ext customize MSVCCompiler Missing compiler_cxx fix for MSVCCompiler customize MSVCCompiler using build_ext building '_paratext_internal' extension compiling C sources C:\Users\icassidy\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -I../src/ -IC:\Users\icassidy\AppData\Local\Continuum\Anaconda2\lib\site-packages\numpy\core\include -IC:\Users\icassidy\AppData\Local\Continuum\Anaconda2\include -IC:\Users\icassidy\AppData\Local\Continuum\Anaconda2\PC /Tp../src/paratext_internal_wrap.cxx /Fobuild\temp.win-amd64-2.7\Release\../src/paratext_internal_wrap.obj -std=c++11 -Wall -Wextra -pthread /Zm1000 Found executable C:\Users\icassidy\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\amd64\cl.exe C:\Users\icassidy\AppData\Local\Continuum\Anaconda2\lib\distutils\dist.py:267: UserWarning: Unknown distribution option: 'include_package_data' warnings.warn(msg) cl : Command line error D8021 : invalid numeric argument '/Wextra' error: Command "C:\Users\icassidy\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -I../src/ -IC:\Users\icassidy\AppData\Local\Continuum\Anaconda2\lib\site-packages\numpy\core\include -IC:\Users\icassidy\AppData\Local\Continuum\Anaconda2\include -IC:\Users\icassidy\AppData\Local\Continuum\Anaconda2\PC /Tp../src/paratext_internal_wrap.cxx /Fobuild\temp.win-amd64-2.7\Release\../src/paratext_internal_wrap.obj -std=c++11 -Wall -Wextra -pthread /Zm1000" failed with exit status 2

Please help!

Remove "forget=True" from user API

This should be handled as part of tp_dealloc for this object (or, in the interim, set forget = True by default). Or is there a use case for forget=False that I'm missing?

https://github.com/wiseio/paratext/blob/master/python/paratext/core.py#L207

Allow 'filename' to accept URI-style filenames

It would be very useful if filename accepted URI-style filenames, in addition to standard filepaths. For example,

>>> df = paratext.load_csv_to_pandas(filename="/data/featurization_data.csv", allow_quoted_newlines=True)
>>> df = paratext.load_csv_to_pandas(filename="file:///data/featurization_data.csv", allow_quoted_newlines=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda/lib/python2.7/site-packages/paratext-0.9.1rc1-py2.7-linux-x86_64.egg/paratext/core.py", line 309, in load_csv_to_pandas
    return pandas.DataFrame.from_items(load_csv_to_expanded_columns(filename, *args, **kwargs))
  File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 1085, in from_items
    keys, values = lzip(*items)
  File "/anaconda/lib/python2.7/site-packages/paratext-0.9.1rc1-py2.7-linux-x86_64.egg/paratext/core.py", line 284, in load_csv_to_expanded_columns
    for name, col, semantics, levels in load_raw_csv(filename, *args, **kwargs):
  File "/anaconda/lib/python2.7/site-packages/paratext-0.9.1rc1-py2.7-linux-x86_64.egg/paratext/core.py", line 227, in load_raw_csv
    loader = internal_create_csv_loader(filename, *args, **kwargs)
  File "/anaconda/lib/python2.7/site-packages/paratext-0.9.1rc1-py2.7-linux-x86_64.egg/paratext/core.py", line 117, in internal_create_csv_loader
    loader.load(filename, params)
  File "/anaconda/lib/python2.7/site-packages/paratext-0.9.1rc1-py2.7-linux-x86_64.egg/paratext_internal.py", line 251, in load
    def load(self, *args): return _paratext_internal.ColBasedLoader_load(self, *args)
RuntimeError: cannot open file 'file:///data/featurization_data.csv'

setup.py does not support Python 3.5

With Python 3.5.0, running python3.5 setup.py raises error:

$ python3.5 setup.py build
  File "setup.py", line 7
    print "Error: you must install SWIG first."
                                              ^
SyntaxError: Missing parentheses in call to 'print'

Unable to read csv

Below is my code
import pandas as pd
import paratext
%%time
df = paratext.load_csv_to_pandas("sample.csv")

the last line never comes out of execution mode in jupyter notebook. The file isn't that big (contains about 10k rows and file size ~2mb). Any reasons as to why this may be occurring?

OS: Mac OS 10.12.6
python version: 3.6

Suggest where the header starts (line number)

Python Installation Problem

I downloaded the library, went to the python folder, and did python setup.py build install but I get this error when I try to import paratext

'''
Traceback (most recent call last):
File "test_paratext.py", line 3, in
import paratext.testing
File "/home/tom/anaconda2/lib/python2.7/site-packages/paratext/init.py", line 4, in
from paratext.core import *
File "/home/tom/anaconda2/lib/python2.7/site-packages/paratext/core.py", line 29, in
import paratext_internal as pti
File "/home/tom/anaconda2/lib/python2.7/site-packages/paratext_internal.py", line 21, in
_paratext_internal = swig_import_helper()
File "/home/tom/anaconda2/lib/python2.7/site-packages/paratext_internal.py", line 20, in swig_import_helper
return importlib.import_module('_paratext_internal')
File "/home/tom/anaconda2/lib/python2.7/importlib/init.py", line 37, in import_module
import(name)
ImportError: /home/tom/anaconda2/lib/python2.7/site-packages/_paratext_internal.so: undefined symbol: _ZTVNSt7__cxx1115basic_stringbufIcSt11char_traitsIcESaIcEEE
'''

Problem compiling paratext on OS X

My setup
OS X version: El Capitan version 10.11.2 (on Macbook Pro)
Swig version: 3.0.8
Python version: 3.5.1

Output when doing: sudo python setup.py install

sudo python setup.py build
/usr/local/bin/swig
0.1.1rc1
running swig: ['swig', '-c++', '-python', '-I../src/', '-outdir', './', '../src/paratext_internal.i']
../src/diagnostic/parse_and_sum.hpp:158: Warning 302: Identifier 'parse_token' redefined (ignored),
../src/diagnostic/parse_and_sum.hpp:129: Warning 302: previous definition of 'parse_token'.
/Users/amund/anaconda/lib/python3.5/distutils/dist.py:261: UserWarning: Unknown distribution option: 'include_package_data'
warnings.warn(msg)
running build
running config_cc
unifing config_cc, config, build_clib, build_ext, build commands --compiler options
running config_fc
unifing config_fc, config, build_clib, build_ext, build commands --fcompiler options
running build_src
build_src
building py_modules sources
building extension "_paratext_internal" sources
build_src: building npy-pkg config files
running build_py
creating build
creating build/lib.macosx-10.5-x86_64-3.5
copying paratext_internal.py -> build/lib.macosx-10.5-x86_64-3.5
creating build/lib.macosx-10.5-x86_64-3.5/paratext
copying paratext/init.py -> build/lib.macosx-10.5-x86_64-3.5/paratext
copying paratext/core.py -> build/lib.macosx-10.5-x86_64-3.5/paratext
copying paratext/helpers.py -> build/lib.macosx-10.5-x86_64-3.5/paratext
running build_ext
customize UnixCCompiler
customize UnixCCompiler using build_ext
customize UnixCCompiler
customize UnixCCompiler using build_ext
building '_paratext_internal' extension
compiling C++ sources
C compiler: g++ -fno-strict-aliasing -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -I/Users/amund/anaconda/include -arch x86_64

creating build/temp.macosx-10.5-x86_64-3.5
creating build/temp.macosx-10.5-x86_64-3.5/paratext
creating build/temp.macosx-10.5-x86_64-3.5/paratext/src
compile options: '-I../src/ -I/Users/amund/anaconda/lib/python3.5/site-packages/numpy/core/include -I/Users/amund/anaconda/include/python3.5m -c'
extra options: '--stdlib=libc++ -std=c++11 -Wall -Wextra -pthread -m64 -D_REENTRANT'
g++: ../src/paratext_internal_wrap.cxx
In file included from ../src/paratext_internal_wrap.cxx:3142:
../src/python/numpy_helper.hpp:218:19: error: use of undeclared identifier 'PyString_FromStringAndSize'; did you mean 'PyBytes_FromStringAndSize'?
PyObject s = PyString_FromStringAndSize(output.c_str(), output.size());
^~~~~~~~~~~~~~~~~~~~~~~~~~
PyBytes_FromStringAndSize
/Users/amund/anaconda/include/python3.5m/bytesobject.h:51:24: note: 'PyBytes_FromStringAndSize' declared here
PyAPI_FUNC(PyObject *) PyBytes_FromStringAndSize(const char *, Py_ssize_t);
^
In file included from ../src/paratext_internal_wrap.cxx:4993:
In file included from ../src/csv/colbased_loader.hpp:34:
In file included from ../src/csv/colbased_chunk.hpp:31:
../src/util/widening_vector.hpp:308:17: error: no member named 'accumulate' in namespace 'std'
return std::accumulate<decltype(begin)>(values.begin(), values_.end(), (T)0);
~~~~~^
../src/util/widening_vector.hpp:308:43: error: expected '(' for function-style cast or type construction
return std::accumulate<decltype(begin)>(values_.begin(), values_.end(), (T)0);
~~~~~~~~~~~~~~~^
../src/util/widening_vector.hpp:365:17: error: no member named 'accumulate' in namespace 'std'
return std::accumulate<decltype(begin)>(values_.begin(), values_.end(), (T)0);
~~~~~^
../src/util/widening_vector.hpp:365:43: error: expected '(' for function-style cast or type construction
return std::accumulate<decltype(begin)>(values_.begin(), values_.end(), (T)0);
~~~~~~~~~~~~~~~^
In file included from ../src/paratext_internal_wrap.cxx:4993:
In file included from ../src/csv/colbased_loader.hpp:36:
../src/csv/parallel.hpp:67:46: warning: initialized lambda captures are a C++14 extension [-Wc++14-extensions]
.emplace_back( it, step, thread_id, f = std::forward(f) {
^
In file included from ../src/paratext_internal_wrap.cxx:3142:
../src/python/numpy_helper.hpp:159:28: error: use of undeclared identifier 'PyString_FromStringAndSize'
PyObject *newobj = PyString_FromStringAndSize((_it).c_str(), (_it).size());
^
../src/python/numpy_helper.hpp:304:60: note: in instantiation of member function 'build_array_from_range_implstd::__1::__wrap_iter<std::_1::basic_string<char *>, void>::build_array' requested here
return (PyObject)build_array_from_range_impl::build_array(range);
^
../src/paratext_internal_wrap.cxx:10031:32: note: in instantiation of function template specialization 'build_array_from_rangestd::__1::__wrap_iter<std::1::basic_string >' requested here
resultobj = (PyObject)::build_array_from_range(range);
^
In file included from ../src/paratext_internal_wrap.cxx:3142:
../src/python/numpy_helper.hpp:68:86: error: no member named 'id' in 'numpy_type'
PyObject array = (PyObject)PyArray_SimpleNew(1, fdims, numpy_type<value_type>::id);
~~~~~~~~~~~~~~~~~~~~~~~~^
/Users/amund/anaconda/lib/python3.5/site-packages/numpy/core/include/numpy/ndarrayobject.h:135:46: note: expanded from macro 'PyArray_SimpleNew'
PyArray_New(&PyArray_Type, nd, dims, typenum, NULL, NULL, 0, 0, NULL)
^~~~~~~
../src/python/numpy_helper.hpp:299:50: note: in instantiation of member function 'build_array_impl<std::1::vector<unsigned long, std::1::allocator >, void>::build_array' requested here
return (PyObject)build_array_impl::build_array(container);
^
../src/paratext_internal_wrap.cxx:10366:30: note: in instantiation of function template specialization 'build_array<std::1::vector<unsigned long, std::1::allocator > >' requested here
resultobj = (PyObject)::build_arraystd::vector<size_t>(result);
^
1 warning and 7 errors generated.
In file included from ../src/paratext_internal_wrap.cxx:3142:
../src/python/numpy_helper.hpp:218:19: error: use of undeclared identifier 'PyString_FromStringAndSize'; did you mean 'PyBytes_FromStringAndSize'?
PyObject s = PyString_FromStringAndSize(output.c_str(), output.size());
^~~~~~~~~~~~~~~~~~~~~~~~~~
PyBytes_FromStringAndSize
/Users/amund/anaconda/include/python3.5m/bytesobject.h:51:24: note: 'PyBytes_FromStringAndSize' declared here
PyAPI_FUNC(PyObject *) PyBytes_FromStringAndSize(const char *, Py_ssize_t);
^
In file included from ../src/paratext_internal_wrap.cxx:4993:
In file included from ../src/csv/colbased_loader.hpp:34:
In file included from ../src/csv/colbased_chunk.hpp:31:
../src/util/widening_vector.hpp:308:17: error: no member named 'accumulate' in namespace 'std'
return std::accumulate<decltype(begin)>(values.begin(), values.end(), (T)0);
~~~~~^
../src/util/widening_vector.hpp:308:43: error: expected '(' for function-style cast or type construction
return std::accumulate<decltype(begin)>(values.begin(), values.end(), (T)0);
~~~~~~~~~~~~~~~^
../src/util/widening_vector.hpp:365:17: error: no member named 'accumulate' in namespace 'std'
return std::accumulate<decltype(begin)>(values.begin(), values.end(), (T)0);
~~~~~^
../src/util/widening_vector.hpp:365:43: error: expected '(' for function-style cast or type construction
return std::accumulate<decltype(begin)>(values.begin(), values.end(), (T)0);
~~~~~~~~~~~~~~~^
In file included from ../src/paratext_internal_wrap.cxx:4993:
In file included from ../src/csv/colbased_loader.hpp:36:
../src/csv/parallel.hpp:67:46: warning: initialized lambda captures are a C++14 extension [-Wc++14-extensions]
.emplace_back( it, step, thread_id, f = std::forward(f) {
^
In file included from ../src/paratext_internal_wrap.cxx:3142:
../src/python/numpy_helper.hpp:159:28: error: use of undeclared identifier 'PyString_FromStringAndSize'
PyObject *newobj = PyString_FromStringAndSize((_it).c_str(), (_it).size());
^
../src/python/numpy_helper.hpp:304:60: note: in instantiation of member function 'build_array_from_range_implstd::__1::__wrap_iter<std::_1::basic_string<char *>, void>::build_array' requested here
return (PyObject)build_array_from_range_impl::build_array(range);
^
../src/paratext_internal_wrap.cxx:10031:32: note: in instantiation of function template specialization 'build_array_from_rangestd::__1::__wrap_iter<std::_1::basic_string >' requested here
resultobj = (PyObject)::build_array_from_range(range);
^
In file included from ../src/paratext_internal_wrap.cxx:3142:
../src/python/numpy_helper.hpp:68:86: error: no member named 'id' in 'numpy_type'
PyObject array = (PyObject)PyArray_SimpleNew(1, fdims, numpy_type<value_type>::id);
~~~~~~~~~~~~~~~~~~~~~~~~^
/Users/amund/anaconda/lib/python3.5/site-packages/numpy/core/include/numpy/ndarrayobject.h:135:46: note: expanded from macro 'PyArray_SimpleNew'
PyArray_New(&PyArray_Type, nd, dims, typenum, NULL, NULL, 0, 0, NULL)
^~~~~~~
../src/python/numpy_helper.hpp:299:50: note: in instantiation of member function 'build_array_impl<std::__1::vector<unsigned long, std::_1::allocator >, void>::build_array' requested here
return (PyObject)build_array_impl::build_array(container);
^
../src/paratext_internal_wrap.cxx:10366:30: note: in instantiation of function template specialization 'build_array<std::__1::vector<unsigned long, std::_1::allocator > >' requested here
resultobj = (PyObject)::build_arraystd::vector<size_t>(result);
^
1 warning and 7 errors generated.
error: Command "g++ -fno-strict-aliasing -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -I/Users/amund/anaconda/include -arch x86_64 -I../src/ -I/Users/amund/anaconda/lib/python3.5/site-packages/numpy/core/include -I/Users/amund/anaconda/include/python3.5m -c ../src/paratext_internal_wrap.cxx -o build/temp.macosx-10.5-x86_64-3.5/paratext/src/paratext_internal_wrap.o --stdlib=libc++ -std=c++11 -Wall -Wextra -pthread -m64 -D_REENTRANT" failed with exit status 1

Add support for arbitrary delimiters

\t in particular

Windows support

I started looking at this, the only substantive change that seems to be necessary is removing the use of VLAs (e.g. here) - would you accept a PR replacing them with std::vector<char>?

unistd.h is also included in several places, which isn't available on Windows, but it doesn't look anything is actually used from it?

C++ rowbase stream processing?

Is there a quick way to use this library for row by row processing without loading the whole data set?

Related to #45 but I am just looking for an example of how to use existing functionality from C++.

Thank you.

Add support for comments in CSV files

Add Python 3.5 support

As mentioned here, there is lack of support for Python 3.5. We all know about Python 2 vs. Python 3, but with simple project like this it should be very easy to port it to something which has its future. Furthermore, you say Python (2.7 or above) in README.

There are e.g. xrange at https://github.com/wiseio/paratext/blob/master/python/paratext/core.py#L192, which won't work on Python 3. I am willing to help here - is there any specific reason why not to use Python 2 and Python 3? I would recommend using shared codebase with __future__ or six package.

Thanks

Receiving Error when installing on Windows 7

Newbie here, came across paratext via a google search on processing large .cvs files to pandas DataFrames. I've followed the directions, but when I get to the command "python setup.py build install", I keep receiving this error:

(C:\Users\mtrette\AppData\Local\Continuum\Anaconda3\envs\ParaText) c:\User
tte\paratext\python>python setup.py build install
Traceback (most recent call last):
File "setup.py", line 7, in
p = subprocess.Popen(["which", "swig"])
File "C:\Users\mtrette\AppData\Local\Continuum\Anaconda3\envs\ParaText\l
process.py", line 947, in init
restore_signals, start_new_session)
File "C:\Users\mtrette\AppData\Local\Continuum\Anaconda3\envs\ParaText\l
process.py", line 1224, in _execute_child
startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified

Like I said, I'm a newbie so not sure exactly what is going on, was hoping that someone may be able to shed some light and help me out. Thank you! By the way, I've completed the steps using the "normal" (ie not via Anaconda) cmd prompt and got the same error message.

Issue with paratext.load_csv_to_pandas()

Thank you in advanced for your time.

I'm trying to load a csv file with:

data = paratext.load_csv_to_pandas('data.csv')

I'm getting a:

AttributeError: module 'ntpath' has no attribute 'splitunc'

I am able to load the csv file with the traditional method using pd.read_csv().

Full Error Output:

C:\ProgramData\Anaconda3\lib\site-packages\paratext\core.py:403: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
  return pandas.DataFrame.from_items(expanded)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-42-de2c6a8a93be> in <module>()
----> 1 data = paratext.load_csv_to_pandas('2016.csv')

C:\ProgramData\Anaconda3\lib\site-packages\paratext\core.py in load_csv_to_pandas(filename, *args, **kwargs)
    401               return pandas.DataFrame()
    402     else:
--> 403          return pandas.DataFrame.from_items(expanded)
    404 
    405 @_docstring_parameter(_csv_load_params_doc)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in from_items(cls, items, columns, orient)
   1458                       FutureWarning, stacklevel=2)
   1459 
-> 1460         keys, values = lzip(*items)
   1461 
   1462         if orient == 'columns':

C:\ProgramData\Anaconda3\lib\site-packages\paratext\core.py in load_csv_to_expanded_columns(filename, *args, **kwargs)
    353         return pandas.DataFrame.from_items(filename, *args, **kwargs)
    354     """
--> 355     for name, col, semantics, levels in load_raw_csv(filename, *args, **kwargs):
    356         if levels is not None and len(levels) > 0:
    357             yield name, levels[col]

C:\ProgramData\Anaconda3\lib\site-packages\paratext\core.py in load_raw_csv(filename, *args, **kwargs)
    296 
    297     """
--> 298     loader = internal_create_csv_loader(filename, *args, **kwargs)
    299     return internal_csv_loader_transfer(loader, forget=True)
    300 

C:\ProgramData\Anaconda3\lib\site-packages\paratext\core.py in internal_create_csv_loader(filename, num_threads, allow_quoted_newlines, block_size, number_only, no_header, max_level_name_length, max_levels, cat_names, text_names, num_names, in_encoding, out_encoding, convert_null_to_space)
    186     if out_encoding == "utf-8":
    187         loader.set_out_encoding(pti.UNICODE_UTF8)
--> 188     loader.load(_make_posix_filename(filename), params)
    189     return loader
    190 

C:\ProgramData\Anaconda3\lib\site-packages\paratext\core.py in _make_posix_filename(fn_or_uri)
    118 
    119 def _make_posix_filename(fn_or_uri):
--> 120      if ntpath.splitdrive(fn_or_uri)[0] or ntpath.splitunc(fn_or_uri)[0]:
    121          result = fn_or_uri
    122      else:

AttributeError: module 'ntpath' has no attribute 'splitunc'

Thank you again for your time.

Support for tab-delimited files?

New to python here! Might be a basic question, but does paratext support reading of tab-delimited files with load_csv_to_pandas?

add support for opening multiple files

it would be nice to be able to pass a list of files to load and return a dataframe with the results merged by concatenating rows consistent with using ignore_index=True. This would avoid relying on df.append which creates a copy.

how to install paratext for python3?

Not being able to stat a file should be a runtime_error; not logic_error

In rowbased_loader.hpp there is a logic_error thrown if a file cannot be statted. This should be runtime_error.

can i access data by row

I want to access the csv data row by row,how can i do that?

Initial release for a conda-forge pkg

Thanks for the library it looks great! I would like to have a conda pkg that people can install based on conda-forge. I have started that work here: conda-forge/staged-recipes#731

The only thing needed is an "official" release with a git tag that I can use to freeze the build of the pkg.

Also, would someone want to be added as a maintainer for that conda pkg? If not I am ok doing that.

Thanks!

Add support for DateTime objects.

Allow 'filename' to be a unicode string

Currently, load_csv_to_pandas fails when filename is a unicode string. Example given below:

>>> df = paratext.load_csv_to_pandas(filename="/data/featurization_data.csv", allow_quoted_newlines=True)
>>> df = paratext.load_csv_to_pandas(filename=u"/data/featurization_data.csv", allow_quoted_newlines=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda/lib/python2.7/site-packages/paratext-0.9.1rc1-py2.7-linux-x86_64.egg/paratext/core.py", line 309, in load_csv_to_pandas
    return pandas.DataFrame.from_items(load_csv_to_expanded_columns(filename, *args, **kwargs))
  File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 1085, in from_items
    keys, values = lzip(*items)
TypeError: zip() argument after * must be a sequence, not generator

Add support for ARFF format.

Import fails on OSX 10.12.5 using Anaconda

I made a virtualenv to test out the package, and importing the module into python failed as follows:

Python 2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:05:08) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> import paratext
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "paratext/__init__.py", line 4, in <module>
    from paratext.core import *
  File "paratext/core.py", line 29, in <module>
    import paratext_internal as pti
  File "paratext_internal.py", line 113
    def value(self) -> "PyObject *":
                    ^
SyntaxError: invalid syntax

Tested on AWS Lambda - EFS ?

Has anyone tried using paratext inside of an AWS Lambda function, reading from the Elastic File System service?

Add support for escaped characters

Add support for pip

Modern Python comes with pip bundled in, making it easy to install the library.

ParaText could be installed using pip. It could be hosted on git or, better, on pypi.

Type Detection

Summary:

When parsing files I ran into problems with how paratext parses some of the elements in a file. I believe these may be related to the process_token() function in colbased_worker.hpp that tries to detect whether a token is an integer, float, exponential/scientific number or otherwise.

I have written code to make changes to this function, but was not able to create a pull request. Please let me know if there is a way I can create a pull request or provide the code change suggestions.

Examples:

It detects a string such as A.1 as a number and I get 0.000000
It detects a string such as 3ABC as a number and I get 3.000000

Details:

Let A.1 be the input. When reading the token and checking if the token is an integer (line 270) it checks if token_[i] is a digit. If not, we move on to see if we are dealing with a float instead. However we advance the index, i, regardless of whether the integer check passes or fails. Therefore, when we get to the float check on line 279 we are looking at the . character instead of A. Then the check for a float passes since we see . and only digits after it. Finally the result is 0.000000 since A.1 gets converted to a float before it is passed to process_float.

Numbers like 12.345 are not picked up as floats (because the integer check fails on . and we check for float on the next character, here 3), but instead as exponentials. They pass as exponentials not because they pass the exponential check, but because exp_possible is set as true at the beginning (on line 272 after the integer check passes on line 270) and does not become false. Both exponentials and floats are passed to process_float in the end. For the same reason 3ABC is detected as an exponential instead of a string and we get 3.000000. (Numbers like .123 are not detected as floats or exponentials because exp_possible is set as false after the integer check .)

Note:

When making updates to the process_token() function I did some simplifications, but did not change the behaviour of the function other than for the issues found.

In the code change suggestion I have let a number on the form 14e-3 be a valid exponential (compared to 14.0e-3). This can be changed if not desired.

Add support for reading large files in chunks

Similar to the chunksize parameter of pandas.read_csv()...

Is this planned or even already possible somehow? Since paratext will most likely be used for reading large CSV files (pandas is usually already fast enough for small ones) which might not fit in memory, this would be very useful in my opinion.

Add usecols option

With same functionality as pandas.read_csv (see docs here).

How to build the library with no Python

Ok, as title says. If I want to build just the lib and no python wrapper how can we do that?

Thanks

Paratext <-> Apache Arrow bridge

@deads at some point in the next 6 months, I would like to use the paratext codebase to emit native Arrow C++ array objects (and native categorical aka arrow::DictionaryArray). Eventually we can deprecate the existing CSV reader in pandas and make the paratext+Arrow-powered CSV reader the next-gen CSV reader for pandas (since I've already spent a lot of time optimizing the Arrow->pandas code path -- in pandas 2.0 the overhead should drop to 0).

The simplest thing would be to fork the codebase into a libarrow_csv shared library that lives in the Arrow codebase, since the code might diverge (and there will be overlapping concerns where code sharing might benefit, like on-the-fly dictionary encoding). Another option is to add a libparatext_arrow library within this repo, and make that a dependency of the pyarrow library, similar to how we've already build libparquet_arrow inside parquet-cpp (https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow). Thoughts?

how to use it in c or c++?

i am not able to figure out how to use it in c++ program?

C++11 compiler warning with clang on OS X

Hello,

On OS X 10.11.6, with clang version:

Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin15.6.0

I'm seeing a warning that a C++14 feature is being used. The README states that only a C++ compliant compiler is required --- is C++14 required? Here is the warning:

In file included from paratext/src/csv/colbased_loader.hpp:36:
paratext/src/csv/parallel.hpp:67:46: warning: initialized lambda captures are a C++14
      extension [-Wc++14-extensions]
        .emplace_back([ it, step, thread_id, f = std::forward<F>(f) ]() {
                                             ^

We're constrained to C++11, so it'd be great if there was a workaround for this (other than suppressing the warning).

This warning also appears in the Travis build output:
https://travis-ci.org/wiseio/paratext/jobs/313212429#L1380

Provide for files containing non-UTF8 strings

see https://github.com/wiseio/paratext/pull/20/files#r66715298

Build fails with Python 3.5.0 & SWIG 3.0.10 & g++ 4.8.1

System is RHEL 6.4 with gcc 4.8.1 and SWIG 3.0.10 and Python 3.5.0 installed.

Firstly, python/setup.py is not Python3 clean: it uses print statements rather than print() functions. After running it through 2to3, the build fails with a few occurences of this error:

g++: ../src/paratext_internal_wrap.cxx
In file included from ../src/paratext_internal_wrap.cxx:3140:0:
../src/python/numpy_helper.hpp: In member function ‘string_array_output_iterator& string_array_output_iterator::operator++()’:
../src/python/numpy_helper.hpp:218:75: error: ‘PyString_FromStringAndSize’ was not declared in this scope
     PyObject *s = PyString_FromStringAndSize(output.c_str(), output.size());

Modifying src/python/numpy_helper.hpp to replace "PyString_FromStringAndSize()" with "PyUnicode_FromStringAndSize()" makes the build succeed.

Perhaps something like this is required to support both Python2 and Python3.

add support for .gz files

would be nice to add support for opening .gz files
Ideally we could pass a file handle, eg

import gzip, paratext
with gzip.open(f, 'rb') as fh:
  paratext.read(fh)

It seems like the file handle is opened by the C code, so perhaps this is not practical, and easier to add gzip reading support directly to the C code?

perf problems

I'm getting consistently better timings with pandas.read_csv. Is there some build problem perhaps on OSX? For example this seems typical on some smallish dataset (1 million rows)

In [8]: %time d = pd.read_csv('j.csv', header=None, dtype=str)
CPU times: user 4.15 s, sys: 422 ms, total: 4.57 s
Wall time: 4.57 s

In [9]: %time df = paratext.load_csv_to_dict('j.csv', no_header=True)
CPU times: user 14.5 s, sys: 914 ms, total: 15.4 s
Wall time: 6.45 s

In [12]: paratext.__version__
Out[12]: '0.3.1rc1'

$ python --version
Python 3.5.3 :: Anaconda custom (x86_64)

Feature request : Add support for conda

Hi,

It would be really great if one could install your python package using conda.

Reading from stdin should use single thread or report an error

When using multiple threads to read from stdin, paratext fails. It works fine with actual files or with a single thread reading stdin.

The code I use is as follows:

#!/usr/bin/env python
import paratext
print sum(map(lambda x: len(x[1]), paratext.load_raw_csv("/dev/stdin",
    no_header=True, allow_quoted_newlines=True)))

$ python2/csvreader-paratext.py < /tmp/hello.csv
Traceback (most recent call last):
  File "python2/csvreader-paratext.py", line 4, in <module>
    allow_quoted_newlines=True)])
  File "/home/ehiggs/.virtualenvs/paratext/lib/python2.7/site-packages/paratext/core.py", line 271, in load_raw_csv
    loader = internal_create_csv_loader(filename, *args, **kwargs)
  File "/home/ehiggs/.virtualenvs/paratext/lib/python2.7/site-packages/paratext/core.py", line 161, in internal_create_csv_loader
    loader.load(_make_posix_filename(filename), params)
  File "/home/ehiggs/.virtualenvs/paratext/lib/python2.7/site-packages/paratext_internal.py", line 414, in load
    return _paratext_internal.ColBasedLoader_load(self, filename, params)
RuntimeError: The file ends with an open quote (4506147)

This could also be an issue if someone were to read in an actual file by name but pass in /dev/stdin.

uuid,document_id,timestamp,platform,geo_location,traffic_source
1fd5f051fba643,120,31905835,1,RS,2
8557aa9004be3b,120,32053104,1,VN>44,2
c351b277a358f0,120,54013023,1,KR>12,1
8205775c5387f9,120,44196592,1,IN>16,2
9cb0ccd8458371,120,65817371,1,US>CA>807,2
2aa611f32875c7,120,71495491,1,CA>ON,2
f55a6eaf2b34ab,120,73309199,1,BR>27,2
cc01b582c8cbff,120,50033577,1,CA>BC,2
6c802978b8dd4d,120,66590306,1,CA>ON,2

But paratext reads it as following:

The uuid conversion is totally unexpected - and the issue persists even if I say text_names=['uuid']

Add a Travis CI build for Python 2/3 on Linux/OS X

related: #13