wiseio / paratext
A library for reading text files over multiple cores.
License: Apache License 2.0
I'm experiencing some build issues on a Windows machine with Cygwin. The g++ compiler is successfully located, but the build still ends with several compile errors. System information is below, and I have attached the build's output.
OS: Windows 7 64-bit
Compiler: g++ v5.3
Python v3.5
Build command:
python setup.py build --compiler=cygwin > build_output.txt
build_output.txt
Hi there. I have a csv-game on bitbucket. I ran the test file through paratext and it failed. The test file is generated with this script:
#!/bin/bash
# Simple csv file which should flex escaping a little.
for i in $(seq 1 1000000);
do echo 'hello,","," ",world,"!"';
done > /tmp/hello.csv
# Test for 'hello world'
touch /tmp/empty.csv
The code I use is as follows:
#!/usr/bin/env python
import paratext
print sum(map(lambda x: len(x[1]), paratext.load_raw_csv("/dev/stdin",
no_header=True, allow_quoted_newlines=True)))
$ python2/csvreader-paratext.py < /tmp/hello.csv
Traceback (most recent call last):
File "python2/csvreader-paratext.py", line 4, in <module>
allow_quoted_newlines=True)])
File "/home/ehiggs/.virtualenvs/paratext/lib/python2.7/site-packages/paratext/core.py", line 271, in load_raw_csv
loader = internal_create_csv_loader(filename, *args, **kwargs)
File "/home/ehiggs/.virtualenvs/paratext/lib/python2.7/site-packages/paratext/core.py", line 161, in internal_create_csv_loader
loader.load(_make_posix_filename(filename), params)
File "/home/ehiggs/.virtualenvs/paratext/lib/python2.7/site-packages/paratext_internal.py", line 414, in load
return _paratext_internal.ColBasedLoader_load(self, filename, params)
RuntimeError: The file ends with an open quote (4506147)
Changing to num_threads=1 fixes this, but obviously it's a racy bug. There are a few other associated bugs.
To get an idea of baseline performance, I also parse an empty file. paratext segfaults:
$ cat /tmp/empty.csv
$ python2/csvreader-paratext.py < /tmp/empty.csv
Segmentation fault: 11
Normally, to test how we process a csv file, we should be able to use basic command line tools to subsample the file and pass it to the csv reader. This seems to fail with paratext. The following hangs at 100% CPU and doesn't respond to SIGINT (in other words I can't use Ctrl-C and must use SIGTSTP (Ctrl-Z) and then kill %1):
$ head -5 /tmp/hello.csv | time python2/csvreader-paratext.py
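A caller-side workaround sketch (my own suggestion, not part of paratext): spool the non-seekable pipe to a real temporary file first, so every worker thread can seek independently, then point paratext at that file. The `paratext.load_raw_csv` call in the comment mirrors the usage above.

```python
# Workaround: multi-threaded readers need a seekable file, which a pipe is not.
# Copy stdin to a named temp file and hand that path to paratext instead.
import shutil
import tempfile

def spool_to_tempfile(stream, suffix=".csv"):
    """Copy a non-seekable stream (e.g. a pipe) to a seekable temp file."""
    tmp = tempfile.NamedTemporaryFile(mode="wb", suffix=suffix, delete=False)
    with tmp:
        shutil.copyfileobj(stream, tmp)
    return tmp.name

# Usage (hypothetical, following the script above):
#   import sys, paratext
#   path = spool_to_tempfile(sys.stdin.buffer)
#   for name, col, semantics, levels in paratext.load_raw_csv(path, no_header=True):
#       ...
```

This trades disk space for correctness; it also sidesteps the `head -5` hang, since the temp file has a definite end.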
I am having issues building this package. Here's what I've got installed:
This is the output I get when I try to build:
$ python setup.py build install
/c/Users/icassidy/AppData/Local/Continuum/Anaconda2/Library/bin/swig
..\src\diagnostic\parse_and_sum.hpp(158) : Warning 302: Identifier 'parse_token' redefined (ignored),
..\src\diagnostic\parse_and_sum.hpp(129) : Warning 302: previous definition of 'parse_token'.
0.2.1rc1
('running swig: ', ['swig', '-c++', '-python', '-I../src/', '-outdir', './', '../src/paratext_internal.i'])
running build
running config_cc
unifing config_cc, config, build_clib, build_ext, build commands --compiler options
running config_fc
unifing config_fc, config, build_clib, build_ext, build commands --fcompiler options
running build_src
build_src
building py_modules sources
building extension "_paratext_internal" sources
build_src: building npy-pkg config files
running build_py
copying paratext_internal.py -> build\lib.win-amd64-2.7
copying paratext\__init__.py -> build\lib.win-amd64-2.7\paratext
running build_ext
customize MSVCCompiler
customize MSVCCompiler using build_ext
customize MSVCCompiler
Missing compiler_cxx fix for MSVCCompiler
customize MSVCCompiler using build_ext
building '_paratext_internal' extension
compiling C sources
C:\Users\icassidy\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -I../src/ -IC:\Users\icassidy\AppData\Local\Continuum\Anaconda2\lib\site-packages\numpy\core\include -IC:\Users\icassidy\AppData\Local\Continuum\Anaconda2\include -IC:\Users\icassidy\AppData\Local\Continuum\Anaconda2\PC /Tp../src/paratext_internal_wrap.cxx /Fobuild\temp.win-amd64-2.7\Release\../src/paratext_internal_wrap.obj -std=c++11 -Wall -Wextra -pthread /Zm1000
Found executable C:\Users\icassidy\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\amd64\cl.exe
C:\Users\icassidy\AppData\Local\Continuum\Anaconda2\lib\distutils\dist.py:267: UserWarning: Unknown distribution option: 'include_package_data'
  warnings.warn(msg)
cl : Command line error D8021 : invalid numeric argument '/Wextra'
error: Command "C:\Users\icassidy\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -I../src/ -IC:\Users\icassidy\AppData\Local\Continuum\Anaconda2\lib\site-packages\numpy\core\include -IC:\Users\icassidy\AppData\Local\Continuum\Anaconda2\include -IC:\Users\icassidy\AppData\Local\Continuum\Anaconda2\PC /Tp../src/paratext_internal_wrap.cxx /Fobuild\temp.win-amd64-2.7\Release\../src/paratext_internal_wrap.obj -std=c++11 -Wall -Wextra -pthread /Zm1000" failed with exit status 2
Please help!
This should be handled as part of tp_dealloc for this object (or, in the interim, set forget=True by default). Or is there a use case for forget=False that I'm missing?
https://github.com/wiseio/paratext/blob/master/python/paratext/core.py#L207
It would be very useful if filename accepted URI-style filenames in addition to standard filepaths. For example,
>>> df = paratext.load_csv_to_pandas(filename="/data/featurization_data.csv", allow_quoted_newlines=True)
>>> df = paratext.load_csv_to_pandas(filename="file:///data/featurization_data.csv", allow_quoted_newlines=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/anaconda/lib/python2.7/site-packages/paratext-0.9.1rc1-py2.7-linux-x86_64.egg/paratext/core.py", line 309, in load_csv_to_pandas
return pandas.DataFrame.from_items(load_csv_to_expanded_columns(filename, *args, **kwargs))
File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 1085, in from_items
keys, values = lzip(*items)
File "/anaconda/lib/python2.7/site-packages/paratext-0.9.1rc1-py2.7-linux-x86_64.egg/paratext/core.py", line 284, in load_csv_to_expanded_columns
for name, col, semantics, levels in load_raw_csv(filename, *args, **kwargs):
File "/anaconda/lib/python2.7/site-packages/paratext-0.9.1rc1-py2.7-linux-x86_64.egg/paratext/core.py", line 227, in load_raw_csv
loader = internal_create_csv_loader(filename, *args, **kwargs)
File "/anaconda/lib/python2.7/site-packages/paratext-0.9.1rc1-py2.7-linux-x86_64.egg/paratext/core.py", line 117, in internal_create_csv_loader
loader.load(filename, params)
File "/anaconda/lib/python2.7/site-packages/paratext-0.9.1rc1-py2.7-linux-x86_64.egg/paratext_internal.py", line 251, in load
def load(self, *args): return _paratext_internal.ColBasedLoader_load(self, *args)
RuntimeError: cannot open file 'file:///data/featurization_data.csv'
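Until the loader understands URIs, a small caller-side shim (my own sketch, not part of paratext) can strip the file:// scheme before handing the path over, so both spellings work. This uses the Python 3 spellings of urllib.

```python
# Map file:// URIs to plain filesystem paths; pass other strings through
# untouched, so ordinary filepaths keep working.
from urllib.parse import urlparse
from urllib.request import url2pathname

def to_local_path(fn_or_uri):
    """Return a local filesystem path for a file:// URI or a plain path."""
    parsed = urlparse(fn_or_uri)
    if parsed.scheme == "file":
        return url2pathname(parsed.path)
    return fn_or_uri

# Usage (hypothetical):
#   df = paratext.load_csv_to_pandas(
#       filename=to_local_path("file:///data/featurization_data.csv"),
#       allow_quoted_newlines=True)
```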
With Python 3.5.0, running python3.5 setup.py raises an error:
$ python3.5 setup.py build
File "setup.py", line 7
print "Error: you must install SWIG first."
^
SyntaxError: Missing parentheses in call to 'print'
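The failing line is a Python 2 print statement. A version-agnostic sketch (assuming nothing else in setup.py blocks Python 3) is to use the print() function on both interpreters via __future__:

```python
# With this import, print() is a function on Python 2.7 as well as Python 3,
# so the same setup.py parses under both interpreters.
from __future__ import print_function

def warn_missing_swig():
    # The message from setup.py line 7, now in function-call form.
    print("Error: you must install SWIG first.")
```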
Below is my code
import pandas as pd
import paratext
%%time
df = paratext.load_csv_to_pandas("sample.csv")
The last line never finishes executing in the Jupyter notebook. The file isn't that big (about 10k rows, roughly 2 MB). Any idea why this may be occurring?
OS: Mac OS 10.12.6
python version: 3.6
I downloaded the library, went to the python folder, and ran python setup.py build install, but I get this error when I try to import paratext:
'''
Traceback (most recent call last):
  File "test_paratext.py", line 3, in <module>
    import paratext.testing
  File "/home/tom/anaconda2/lib/python2.7/site-packages/paratext/__init__.py", line 4, in <module>
    from paratext.core import *
  File "/home/tom/anaconda2/lib/python2.7/site-packages/paratext/core.py", line 29, in <module>
    import paratext_internal as pti
  File "/home/tom/anaconda2/lib/python2.7/site-packages/paratext_internal.py", line 21, in <module>
    _paratext_internal = swig_import_helper()
  File "/home/tom/anaconda2/lib/python2.7/site-packages/paratext_internal.py", line 20, in swig_import_helper
    return importlib.import_module('_paratext_internal')
  File "/home/tom/anaconda2/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
ImportError: /home/tom/anaconda2/lib/python2.7/site-packages/_paratext_internal.so: undefined symbol: _ZTVNSt7__cxx1115basic_stringbufIcSt11char_traitsIcESaIcEEE
'''
My setup
OS X version: El Capitan version 10.11.2 (on Macbook Pro)
Swig version: 3.0.8
Python version: 3.5.1
Output when doing: sudo python setup.py install
sudo python setup.py build
/usr/local/bin/swig
0.1.1rc1
running swig: ['swig', '-c++', '-python', '-I../src/', '-outdir', './', '../src/paratext_internal.i']
../src/diagnostic/parse_and_sum.hpp:158: Warning 302: Identifier 'parse_token' redefined (ignored),
../src/diagnostic/parse_and_sum.hpp:129: Warning 302: previous definition of 'parse_token'.
/Users/amund/anaconda/lib/python3.5/distutils/dist.py:261: UserWarning: Unknown distribution option: 'include_package_data'
warnings.warn(msg)
running build
running config_cc
unifing config_cc, config, build_clib, build_ext, build commands --compiler options
running config_fc
unifing config_fc, config, build_clib, build_ext, build commands --fcompiler options
running build_src
build_src
building py_modules sources
building extension "_paratext_internal" sources
build_src: building npy-pkg config files
running build_py
creating build
creating build/lib.macosx-10.5-x86_64-3.5
copying paratext_internal.py -> build/lib.macosx-10.5-x86_64-3.5
creating build/lib.macosx-10.5-x86_64-3.5/paratext
copying paratext/__init__.py -> build/lib.macosx-10.5-x86_64-3.5/paratext
copying paratext/core.py -> build/lib.macosx-10.5-x86_64-3.5/paratext
copying paratext/helpers.py -> build/lib.macosx-10.5-x86_64-3.5/paratext
running build_ext
customize UnixCCompiler
customize UnixCCompiler using build_ext
customize UnixCCompiler
customize UnixCCompiler using build_ext
building '_paratext_internal' extension
compiling C++ sources
C compiler: g++ -fno-strict-aliasing -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -I/Users/amund/anaconda/include -arch x86_64
creating build/temp.macosx-10.5-x86_64-3.5
creating build/temp.macosx-10.5-x86_64-3.5/paratext
creating build/temp.macosx-10.5-x86_64-3.5/paratext/src
compile options: '-I../src/ -I/Users/amund/anaconda/lib/python3.5/site-packages/numpy/core/include -I/Users/amund/anaconda/include/python3.5m -c'
extra options: '--stdlib=libc++ -std=c++11 -Wall -Wextra -pthread -m64 -D_REENTRANT'
g++: ../src/paratext_internal_wrap.cxx
In file included from ../src/paratext_internal_wrap.cxx:3142:
../src/python/numpy_helper.hpp:218:19: error: use of undeclared identifier 'PyString_FromStringAndSize'; did you mean 'PyBytes_FromStringAndSize'?
PyObject *s = PyString_FromStringAndSize(output.c_str(), output.size());
^~~~~~~~~~~~~~~~~~~~~~~~~~
PyBytes_FromStringAndSize
/Users/amund/anaconda/include/python3.5m/bytesobject.h:51:24: note: 'PyBytes_FromStringAndSize' declared here
PyAPI_FUNC(PyObject *) PyBytes_FromStringAndSize(const char *, Py_ssize_t);
^
In file included from ../src/paratext_internal_wrap.cxx:4993:
In file included from ../src/csv/colbased_loader.hpp:34:
In file included from ../src/csv/colbased_chunk.hpp:31:
../src/util/widening_vector.hpp:308:17: error: no member named 'accumulate' in namespace 'std'
return std::accumulate<decltype(begin)>(values_.begin(), values_.end(), (T)0);
~~~~~^
../src/util/widening_vector.hpp:308:43: error: expected '(' for function-style cast or type construction
return std::accumulate<decltype(begin)>(values_.begin(), values_.end(), (T)0);
~~~~~~~~~~~~~~~^
../src/util/widening_vector.hpp:365:17: error: no member named 'accumulate' in namespace 'std'
return std::accumulate<decltype(begin)>(values_.begin(), values_.end(), (T)0);
~~~~~^
../src/util/widening_vector.hpp:365:43: error: expected '(' for function-style cast or type construction
return std::accumulate<decltype(begin)>(values_.begin(), values_.end(), (T)0);
~~~~~~~~~~~~~~~^
In file included from ../src/paratext_internal_wrap.cxx:4993:
In file included from ../src/csv/colbased_loader.hpp:36:
../src/csv/parallel.hpp:67:46: warning: initialized lambda captures are a C++14 extension [-Wc++14-extensions]
.emplace_back([ it, step, thread_id, f = std::forward<F>(f) ]() {
^
In file included from ../src/paratext_internal_wrap.cxx:3142:
../src/python/numpy_helper.hpp:159:28: error: use of undeclared identifier 'PyString_FromStringAndSize'
PyObject *newobj = PyString_FromStringAndSize((_it).c_str(), (_it).size());
^
../src/python/numpy_helper.hpp:304:60: note: in instantiation of member function 'build_array_from_range_implstd::__1::__wrap_iter<std::_1::basic_string<char *>, void>::build_array' requested here
return (PyObject)build_array_from_range_impl::build_array(range);
^
../src/paratext_internal_wrap.cxx:10031:32: note: in instantiation of function template specialization 'build_array_from_rangestd::__1::__wrap_iter<std::1::basic_string >' requested here
resultobj = (PyObject)::build_array_from_range(range);
^
In file included from ../src/paratext_internal_wrap.cxx:3142:
../src/python/numpy_helper.hpp:68:86: error: no member named 'id' in 'numpy_type'
PyObject *array = (PyObject*)PyArray_SimpleNew(1, fdims, numpy_type<value_type>::id);
~~~~~~~~~~~~~~~~~~~~~~~~^
/Users/amund/anaconda/lib/python3.5/site-packages/numpy/core/include/numpy/ndarrayobject.h:135:46: note: expanded from macro 'PyArray_SimpleNew'
PyArray_New(&PyArray_Type, nd, dims, typenum, NULL, NULL, 0, 0, NULL)
^~~~~~~
../src/python/numpy_helper.hpp:299:50: note: in instantiation of member function 'build_array_impl<std::1::vector<unsigned long, std::1::allocator >, void>::build_array' requested here
return (PyObject)build_array_impl::build_array(container);
^
../src/paratext_internal_wrap.cxx:10366:30: note: in instantiation of function template specialization 'build_array<std::1::vector<unsigned long, std::1::allocator > >' requested here
resultobj = (PyObject)::build_arraystd::vector<size_t>(result);
^
1 warning and 7 errors generated.
error: Command "g++ -fno-strict-aliasing -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -I/Users/amund/anaconda/include -arch x86_64 -I../src/ -I/Users/amund/anaconda/lib/python3.5/site-packages/numpy/core/include -I/Users/amund/anaconda/include/python3.5m -c ../src/paratext_internal_wrap.cxx -o build/temp.macosx-10.5-x86_64-3.5/paratext/src/paratext_internal_wrap.o --stdlib=libc++ -std=c++11 -Wall -Wextra -pthread -m64 -D_REENTRANT" failed with exit status 1
I started looking at this; the only substantive change that seems to be necessary is removing the use of VLAs (e.g. here). Would you accept a PR replacing them with std::vector<char>? unistd.h is also included in several places, which isn't available on Windows, but it doesn't look like anything is actually used from it.
Is there a quick way to use this library for row by row processing without loading the whole data set?
Related to #45 but I am just looking for an example of how to use existing functionality from C++.
Thank you.
As mentioned here, there is a lack of support for Python 3.5. We all know about Python 2 vs. Python 3, but with a simple project like this it should be very easy to port it to something which has a future. Furthermore, you say Python (2.7 or above) in the README.
There are uses of e.g. xrange at https://github.com/wiseio/paratext/blob/master/python/paratext/core.py#L192, which won't work on Python 3. I am willing to help here; is there any specific reason not to support both Python 2 and Python 3? I would recommend a shared codebase using __future__ or the six package.
Thanks
Newbie here; I came across paratext via a Google search on processing large .csv files into pandas DataFrames. I've followed the directions, but when I get to the command "python setup.py build install", I keep receiving this error:
(C:\Users\mtrette\AppData\Local\Continuum\Anaconda3\envs\ParaText) c:\User
tte\paratext\python>python setup.py build install
Traceback (most recent call last):
File "setup.py", line 7, in <module>
p = subprocess.Popen(["which", "swig"])
File "C:\Users\mtrette\AppData\Local\Continuum\Anaconda3\envs\ParaText\l
process.py", line 947, in __init__
restore_signals, start_new_session)
File "C:\Users\mtrette\AppData\Local\Continuum\Anaconda3\envs\ParaText\l
process.py", line 1224, in _execute_child
startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
Like I said, I'm a newbie, so I'm not sure exactly what is going on; I was hoping that someone may be able to shed some light and help me out. Thank you! By the way, I've completed the steps using the "normal" (i.e. not via Anaconda) cmd prompt and got the same error message.
Thank you in advance for your time.
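The traceback comes from setup.py shelling out to the Unix-only `which` command, which doesn't exist on Windows. A portable alternative (my suggestion, not the shipped code) is `shutil.which` from Python 3.3+, which consults PATH (and PATHEXT on Windows) without spawning a subprocess:

```python
# Cross-platform replacement for subprocess.Popen(["which", "swig"]):
# shutil.which searches PATH itself and returns None when swig is absent,
# instead of raising FileNotFoundError on Windows.
import shutil

def find_swig():
    """Return the full path to the swig executable, or None if not on PATH."""
    return shutil.which("swig")
```

setup.py could then print a friendly "install SWIG first" message when `find_swig()` returns None, on every platform.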
I'm trying to load a csv file with:
data = paratext.load_csv_to_pandas('data.csv')
I'm getting a:
AttributeError: module 'ntpath' has no attribute 'splitunc'
I am able to load the csv file with the traditional method using pd.read_csv().
Full Error Output:
C:\ProgramData\Anaconda3\lib\site-packages\paratext\core.py:403: FutureWarning: from_items is deprecated. Please use DataFrame.from_dict(dict(items), ...) instead. DataFrame.from_dict(OrderedDict(items)) may be used to preserve the key order.
return pandas.DataFrame.from_items(expanded)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-42-de2c6a8a93be> in <module>()
----> 1 data = paratext.load_csv_to_pandas('2016.csv')
C:\ProgramData\Anaconda3\lib\site-packages\paratext\core.py in load_csv_to_pandas(filename, *args, **kwargs)
401 return pandas.DataFrame()
402 else:
--> 403 return pandas.DataFrame.from_items(expanded)
404
405 @_docstring_parameter(_csv_load_params_doc)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in from_items(cls, items, columns, orient)
1458 FutureWarning, stacklevel=2)
1459
-> 1460 keys, values = lzip(*items)
1461
1462 if orient == 'columns':
C:\ProgramData\Anaconda3\lib\site-packages\paratext\core.py in load_csv_to_expanded_columns(filename, *args, **kwargs)
353 return pandas.DataFrame.from_items(filename, *args, **kwargs)
354 """
--> 355 for name, col, semantics, levels in load_raw_csv(filename, *args, **kwargs):
356 if levels is not None and len(levels) > 0:
357 yield name, levels[col]
C:\ProgramData\Anaconda3\lib\site-packages\paratext\core.py in load_raw_csv(filename, *args, **kwargs)
296
297 """
--> 298 loader = internal_create_csv_loader(filename, *args, **kwargs)
299 return internal_csv_loader_transfer(loader, forget=True)
300
C:\ProgramData\Anaconda3\lib\site-packages\paratext\core.py in internal_create_csv_loader(filename, num_threads, allow_quoted_newlines, block_size, number_only, no_header, max_level_name_length, max_levels, cat_names, text_names, num_names, in_encoding, out_encoding, convert_null_to_space)
186 if out_encoding == "utf-8":
187 loader.set_out_encoding(pti.UNICODE_UTF8)
--> 188 loader.load(_make_posix_filename(filename), params)
189 return loader
190
C:\ProgramData\Anaconda3\lib\site-packages\paratext\core.py in _make_posix_filename(fn_or_uri)
118
119 def _make_posix_filename(fn_or_uri):
--> 120 if ntpath.splitdrive(fn_or_uri)[0] or ntpath.splitunc(fn_or_uri)[0]:
121 result = fn_or_uri
122 else:
AttributeError: module 'ntpath' has no attribute 'splitunc'
Thank you again for your time.
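For context, ntpath.splitunc was deprecated and eventually removed from Python's standard library, while ntpath.splitdrive recognizes both drive letters and UNC shares. A sketch of how the check in _make_posix_filename could be rewritten (my own, not the shipped code):

```python
# ntpath.splitdrive already handles UNC paths ("\\server\share\...") as well
# as drive letters ("C:\..."), so the removed splitunc call is unnecessary.
import ntpath

def is_windows_path(fn_or_uri):
    """True when the string starts with a drive letter or a UNC share."""
    return bool(ntpath.splitdrive(fn_or_uri)[0])
```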
New to Python here! This might be a basic question, but does paratext support reading tab-delimited files with load_csv_to_pandas?
It would be nice to be able to pass a list of files to load and return a dataframe with the results merged by concatenating rows, consistent with using ignore_index=True. This would avoid relying on df.append, which creates a copy.
In rowbased_loader.hpp, a logic_error is thrown if a file cannot be statted. This should be a runtime_error.
I want to access the csv data row by row. How can I do that?
Thanks for the library it looks great! I would like to have a conda pkg that people can install based on conda-forge. I have started that work here: conda-forge/staged-recipes#731
The only thing needed is an "official" release with a git tag that I can use to freeze the build of the pkg.
Also, would someone want to be added as a maintainer for that conda pkg? If not I am ok doing that.
Thanks!
Currently, load_csv_to_pandas fails when filename is a unicode string. Example given below:
>>> df = paratext.load_csv_to_pandas(filename="/data/featurization_data.csv", allow_quoted_newlines=True)
>>> df = paratext.load_csv_to_pandas(filename=u"/data/featurization_data.csv", allow_quoted_newlines=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/anaconda/lib/python2.7/site-packages/paratext-0.9.1rc1-py2.7-linux-x86_64.egg/paratext/core.py", line 309, in load_csv_to_pandas
return pandas.DataFrame.from_items(load_csv_to_expanded_columns(filename, *args, **kwargs))
File "/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 1085, in from_items
keys, values = lzip(*items)
TypeError: zip() argument after * must be a sequence, not generator
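The TypeError comes from older pandas' DataFrame.from_items insisting on a sequence while the loader hands it a generator. A caller-side workaround sketch (my own; the paratext call in the comment is an assumption based on the traceback above) is to materialize the generator first:

```python
def materialize(columns):
    """Force a lazy iterable of (name, values) column pairs into a list,
    which DataFrame.from_items accepts where a generator fails."""
    return list(columns)

# Usage (hypothetical):
#   import pandas, paratext
#   items = materialize(paratext.load_csv_to_expanded_columns(
#       u"/data/featurization_data.csv", allow_quoted_newlines=True))
#   df = pandas.DataFrame.from_items(items)
```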
I made a virtualenv to test out the package, and importing the module into python failed as follows:
Python 2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:05:08)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> import paratext
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "paratext/__init__.py", line 4, in <module>
from paratext.core import *
File "paratext/core.py", line 29, in <module>
import paratext_internal as pti
File "paratext_internal.py", line 113
def value(self) -> "PyObject *":
^
SyntaxError: invalid syntax
Has anyone tried using paratext inside of an AWS Lambda function, reading from the Elastic File System service?
Modern Python comes with pip bundled in, making it easy to install libraries. ParaText could be made installable with pip; it could be hosted on GitHub or, better, on PyPI.
When parsing files I ran into problems with how paratext parses some of the elements in a file. I believe these may be related to the process_token() function in colbased_worker.hpp that tries to detect whether a token is an integer, float, exponential/scientific number, or otherwise.
I have written code to make changes to this function, but was not able to create a pull request. Please let me know if there is a way I can create a pull request or provide the code change suggestions.
It detects a string such as A.1 as a number and I get 0.000000. It detects a string such as 3ABC as a number and I get 3.000000.
Let A.1 be the input. When reading the token and checking if it is an integer (line 270), the code checks whether token_[i] is a digit. If not, we move on to see if we are dealing with a float instead. However, the index i is advanced regardless of whether the integer check passes or fails. Therefore, when we get to the float check on line 279 we are looking at the . character instead of A. The check for a float then passes, since we see . with only digits after it. Finally the result is 0.000000, since A.1 gets converted to a float before it is passed to process_float.
Numbers like 12.345 are not picked up as floats (because the integer check fails on . and we check for a float on the next character, here 3), but instead as exponentials. They pass as exponentials not because they pass the exponential check, but because exp_possible is set to true at the beginning (on line 272, after the integer check passes on line 270) and never becomes false. Both exponentials and floats are passed to process_float in the end. For the same reason, 3ABC is detected as an exponential instead of a string and we get 3.000000. (Numbers like .123 are not detected as floats or exponentials because exp_possible is set to false after the integer check fails.)
When making updates to the process_token() function I did some simplifications, but did not change the behaviour of the function other than for the issues found. In the code change suggestion I have let a number of the form 14e-3 be a valid exponential (compared to 14.0e-3). This can be changed if not desired.
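To illustrate the intended behaviour, here is a small Python model of the classifier (my own sketch, not the actual C++ patch): a token only counts as numeric when the whole token matches, so A.1 and 3ABC stay strings while 12.345, .123, and 14e-3 parse as floats.

```python
import re

# Anchored patterns: the entire token must match, which is exactly what the
# index-advancing bug in process_token() fails to guarantee.
_INTEGER = re.compile(r"[+-]?\d+$")
_FLOAT = re.compile(r"[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?$")

def classify_token(token):
    """Return 'integer', 'float', or 'string' for a CSV token."""
    if _INTEGER.match(token):
        return "integer"
    if _FLOAT.match(token):  # covers plain floats and exponentials like 14e-3
        return "float"
    return "string"
```

Under this model "A.1" and "3ABC" are strings, matching the expected (rather than the observed) behaviour described above.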
Similar to the chunksize parameter of pandas.read_csv()...
Is this planned or even already possible somehow? Since paratext will most likely be used for reading large CSV files (pandas is usually already fast enough for small ones) which might not fit in memory, this would be very useful in my opinion.
With the same functionality as pandas.read_csv (see docs here).
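For reference, this is the pandas idiom the request mirrors: passing chunksize makes read_csv return an iterator of DataFrames instead of one big frame, so a file never has to fit in memory at once.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a large on-disk file.
csv = io.StringIO("a,b\n1,2\n3,4\n5,6\n")

total = 0
for chunk in pd.read_csv(csv, chunksize=2):
    # Each chunk is an ordinary DataFrame of at most 2 rows; process and drop it.
    total += len(chunk)
```

A paratext equivalent would presumably yield column blocks or row batches the same way.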
OK, as the title says: if I want to build just the library and no Python wrapper, how can I do that?
Thanks
@deads at some point in the next 6 months, I would like to use the paratext codebase to emit native Arrow C++ array objects (and native categoricals, aka arrow::DictionaryArray). Eventually we can deprecate the existing CSV reader in pandas and make the paratext+Arrow-powered CSV reader the next-gen CSV reader for pandas (since I've already spent a lot of time optimizing the Arrow->pandas code path; in pandas 2.0 the overhead should drop to 0).
The simplest thing would be to fork the codebase into a libarrow_csv shared library that lives in the Arrow codebase, since the code might diverge (and there will be overlapping concerns where code sharing might benefit, like on-the-fly dictionary encoding). Another option is to add a libparatext_arrow library within this repo, and make that a dependency of the pyarrow library, similar to how we've already built libparquet_arrow inside parquet-cpp (https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow). Thoughts?
I am not able to figure out how to use it in a C++ program.
Hello,
On OS X 10.11.6, with clang version:
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin15.6.0
I'm seeing a warning that a C++14 feature is being used. The README states that only a C++11-compliant compiler is required; is C++14 actually required? Here is the warning:
In file included from paratext/src/csv/colbased_loader.hpp:36:
paratext/src/csv/parallel.hpp:67:46: warning: initialized lambda captures are a C++14
extension [-Wc++14-extensions]
.emplace_back([ it, step, thread_id, f = std::forward<F>(f) ]() {
^
We're constrained to C++11, so it'd be great if there was a workaround for this (other than suppressing the warning).
This warning also appears in the Travis build output:
https://travis-ci.org/wiseio/paratext/jobs/313212429#L1380
System is RHEL 6.4 with gcc 4.8.1 and SWIG 3.0.10 and Python 3.5.0 installed.
Firstly, python/setup.py is not Python 3 clean: it uses print statements rather than print() functions. After running it through 2to3, the build fails with a few occurrences of this error:
g++: ../src/paratext_internal_wrap.cxx
In file included from ../src/paratext_internal_wrap.cxx:3140:0:
../src/python/numpy_helper.hpp: In member function ‘string_array_output_iterator& string_array_output_iterator::operator++()’:
../src/python/numpy_helper.hpp:218:75: error: ‘PyString_FromStringAndSize’ was not declared in this scope
PyObject *s = PyString_FromStringAndSize(output.c_str(), output.size());
Modifying src/python/numpy_helper.hpp to replace "PyString_FromStringAndSize()" with "PyUnicode_FromStringAndSize()" makes the build succeed.
Perhaps something like this is required to support both Python 2 and Python 3.
would be nice to add support for opening .gz files
Ideally we could pass a file handle, eg
import gzip, paratext
with gzip.open(f, 'rb') as fh:
paratext.read(fh)
It seems like the file handle is opened by the C++ code, so perhaps this is not practical, and it would be easier to add gzip reading support directly there?
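Until the native reader learns gzip, a workaround sketch (my own, not part of paratext): decompress to a temporary file and load that, so the C++ side sees an ordinary seekable file.

```python
# Decompress a .gz file to a named temp file and return the new path;
# paratext can then open it like any regular CSV.
import gzip
import shutil
import tempfile

def gunzip_to_tempfile(gz_path):
    """Decompress gz_path into a temp file and return the temp file's path."""
    tmp = tempfile.NamedTemporaryFile(mode="wb", suffix=".csv", delete=False)
    with gzip.open(gz_path, "rb") as src, tmp:
        shutil.copyfileobj(src, tmp)
    return tmp.name

# Usage (hypothetical):
#   df = paratext.load_csv_to_pandas(gunzip_to_tempfile("data.csv.gz"))
```

The cost is the uncompressed file on disk, which is exactly what native gzip support in the C++ reader would avoid.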
I'm getting consistently better timings with pandas.read_csv. Is there some build problem perhaps on OSX? For example this seems typical on some smallish dataset (1 million rows)
In [8]: %time d = pd.read_csv('j.csv', header=None, dtype=str)
CPU times: user 4.15 s, sys: 422 ms, total: 4.57 s
Wall time: 4.57 s
In [9]: %time df = paratext.load_csv_to_dict('j.csv', no_header=True)
CPU times: user 14.5 s, sys: 914 ms, total: 15.4 s
Wall time: 6.45 s
In [12]: paratext.__version__
Out[12]: '0.3.1rc1'
$ python --version
Python 3.5.3 :: Anaconda custom (x86_64)
Hi,
It would be really great if one could install your python package using conda.
When using multiple threads to read from stdin, paratext fails. It works fine with actual files or with a single thread reading stdin.
The code I use is as follows:
#!/usr/bin/env python
import paratext
print sum(map(lambda x: len(x[1]), paratext.load_raw_csv("/dev/stdin",
no_header=True, allow_quoted_newlines=True)))
$ python2/csvreader-paratext.py < /tmp/hello.csv
Traceback (most recent call last):
File "python2/csvreader-paratext.py", line 4, in <module>
allow_quoted_newlines=True)])
File "/home/ehiggs/.virtualenvs/paratext/lib/python2.7/site-packages/paratext/core.py", line 271, in load_raw_csv
loader = internal_create_csv_loader(filename, *args, **kwargs)
File "/home/ehiggs/.virtualenvs/paratext/lib/python2.7/site-packages/paratext/core.py", line 161, in internal_create_csv_loader
loader.load(_make_posix_filename(filename), params)
File "/home/ehiggs/.virtualenvs/paratext/lib/python2.7/site-packages/paratext_internal.py", line 414, in load
return _paratext_internal.ColBasedLoader_load(self, filename, params)
RuntimeError: The file ends with an open quote (4506147)
This could also be an issue if someone were to read an actual file by name but pass in /dev/stdin.
It would be great to have a Ruby wrapper over paratext to bring fast CSV reading to Ruby.
We can use it in daru, for example.
It would be nice to have an option to use tab separated files with this since they sometimes are smaller in size.
I'm reading the following csv file:
uuid,document_id,timestamp,platform,geo_location,traffic_source
1fd5f051fba643,120,31905835,1,RS,2
8557aa9004be3b,120,32053104,1,VN>44,2
c351b277a358f0,120,54013023,1,KR>12,1
8205775c5387f9,120,44196592,1,IN>16,2
9cb0ccd8458371,120,65817371,1,US>CA>807,2
2aa611f32875c7,120,71495491,1,CA>ON,2
f55a6eaf2b34ab,120,73309199,1,BR>27,2
cc01b582c8cbff,120,50033577,1,CA>BC,2
6c802978b8dd4d,120,66590306,1,CA>ON,2
But paratext reads it as follows:
The uuid conversion is totally unexpected, and the issue persists even if I say text_names=['uuid'].
related: #13