nvidia / nvstrings
Legacy repository for nvstrings
License: Other
Usage:
import nvstrings, numpy
arr = numpy.array(['apples', 'foobar', 'cowboy'], dtype=object)
strs = nvstrings.to_device(arr)
strs
Output:
<nvstrings count=3>
I've been evaluating cuStrings/nvstrings in the latest RAPIDS container, and I like it: it is very fast. I'm going to try it with some more involved string processing later.
I've tried it on company names from the UK Companies House corporate register, and, in this example, on the content of 10.8 million UK tweets.
import nvstrings
filename = "/data/mesh_twitter_20181211/Data/mesh_twitter_20181211.csv"
%%time
a = nvstrings.from_csv(filename,3)
CPU times: user 1.69 s, sys: 1.82 s, total: 3.51 s
Wall time: 3.5 s
I've observed that an empty string in the source data is converted into None, and this causes errors in later analysis, for example when summing records that match a criterion.
%%time
s = a.size()
c = a.count(r'\#')
h = c.count(0)
print("Number of tweets:",s,"Number of hashtags:",sum(c),"Tweets including a hashtag:",s-h,round(100.0*(s-h)/s,2),"%")
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<timed exec> in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
I am having to use a Python list comprehension, or a filter, to check for None before further processing in such cases. filter is fine for map-reduce style operations, but if order or length needs to be preserved for concatenation, then a list comprehension is required.
%%time
s = a.size()
c = [x if x is not None else 0 for x in a.count(r'\#')]
h = c.count(0)
print("Number of tweets:",s,"Number of hashtags:",sum(c),"Tweets including a hashtag:",s-h,round(100.0*(s-h)/s,2),"%")
Number of tweets: 10836607 Number of hashtags: 10801624 Tweets including a hashtag: 4641176 42.83 %
CPU times: user 833 ms, sys: 169 ms, total: 1 s
Wall time: 998 ms
In strings, an empty string can legitimately be a value, ''. It may also be concatenated with other fields.
My suggestion, therefore: it would be good for the from_csv function to have an option to treat empty strings as either a value or None. I would recommend '' as the default. Possibly a Boolean argument?
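A minimal pure-Python sketch of the behavior I am proposing; the `empty_as_null` flag name is hypothetical and not part of any existing nvstrings API:

```python
# Hypothetical sketch of the proposed from_csv option, in plain Python.
# The empty_as_null flag name is illustrative, not an existing parameter.
def parse_field(raw, empty_as_null=False):
    """Return one CSV field, deciding how empty strings are treated."""
    if raw == "":
        return None if empty_as_null else ""
    return raw

parse_field("abc")                   # a non-empty field is unchanged
parse_field("")                      # proposed default: keep '' as a value
parse_field("", empty_as_null=True)  # current behavior: empty becomes None
```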
I am trying to read the following data
head data/G1_1e7_1e2_0_0.csv
#id1,id2,id3,id4,id5,id6,v1,v2,v3
#id046,id007,id0000043878,51,10,59276,1,1,96.8126
#id041,id026,id0000068300,12,58,78315,4,1,83.5654
which is a mix of string, int and float columns.
But I am getting the following error from read_csv:
gdf = cudf.read_csv("data/G1_1e7_1e2_0_0.csv", skiprows=1,
                    names=['id1','id2','id3','id4','id5','id6','v1','v2','v3'],
                    dtype=['str','str','str','int','int','int','int','int','float'])
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/io/csv.py", line 141, in read_csv
    newcol = Column.from_cffi_view(out[i])
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/column.py", line 70, in from_cffi_view
    data_mem, mask_mem = _gdf.cffi_view_to_column_mem(cffi_view)
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/_gdf.py", line 215, in cffi_view_to_column_mem
    dtype=gdf_to_np_dtype(cffi_view.dtype),
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/_gdf.py", line 153, in gdf_to_np_dtype
    }[dtype])
KeyError: 11
I tried to use category instead of str, but it produces int columns instead.
print(gdf.head(2))
# id1 id2 id3 id4 id5 id6 v1 ... v3
#0 1947148449 -1736452051 -493298331 51 10 59276 1 ... 96.81261
#1 1264700808 1947148449 -493298331 12 58 78315 4 ... 83.56539
#[1 more columns]
gdf.dtypes
#id1 int32
#id2 int32
#id3 int32
#id4 int32
#id5 int32
#id6 int32
#v1 int32
#v2 int32
#v3 float32
#dtype: object
CUDA 9.2, cudf 0.4, nvstrings 0.1 (following the suggestion in rapidsai/cudf#736)
The find_from function does not work as expected when passing an array as the starts parameter: the parameter only takes effect for the first string in the string array.
Code sample:
>>> import nvstrings
>>> from numba import cuda
>>> s1 = "aaaaaaaaa"
>>> s2 = "aaaaaaaaa"
>>> strings = [s1,s2]
>>> gpu_strings = nvstrings.to_device(strings)
>>> gpu_strings.find('a')
[0, 0]
>>> gpu_strings.find('a', start=5)
[5, 5]
>>> starts = cuda.to_device([5,7])
>>> gpu_strings.find_from('a', starts=starts.device_ctypes_pointer.value)
[5, 0]
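For reference, here is the per-element behavior I would expect find_from to match, sketched with plain Python host strings (no nvstrings involved):

```python
# Expected per-element semantics of find_from, sketched on host strings:
# each string should be searched from its own start offset.
strings = ["aaaaaaaaa", "aaaaaaaaa"]
starts = [5, 7]
expected = [s.find("a", st) for s, st in zip(strings, starts)]
print(expected)  # -> [5, 7], whereas find_from above returns [5, 0]
```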
https://nvstrings.readthedocs.io/en/latest/api.html#nvstrings.nvstrings.sublist
For joins, sorts, group-bys etc., cuDF performs "gather" operations, which amount to looking up values from a list of indexes.
nvstrings.sublist can be used for that, but we should support writing the result as a device array.
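A gather is just index lookup; a minimal host-side sketch of what a string gather computes (the device version would write the result out as a device array, and the names here are illustrative):

```python
# Host-side sketch of a string gather: look up strings by index.
strings = ["apples", "foobar", "cowboy"]
indexes = [2, 0, 0, 1]          # e.g. produced by a join or sort
gathered = [strings[i] for i in indexes]
print(gathered)  # -> ['cowboy', 'apples', 'apples', 'foobar']
```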
As I am looking at these APIs, a few concerns are popping up for me.
One of the premises of the value of cudf is the ability for us to define new primitives that interact with gdf_column without having to be 100% aware of how data is encoded and represented, while still being able to take advantage of that knowledge when needed. One of the issues I see with nvstrings is that instead of device functions that can be called from kernels to perform string manipulation, we only have array-level functions. Below I lay out some of the issues I see with trying to use this to represent strings in a gdf_column.
Granularity of Operations
There is no way to define an operator between one nvstring and another, or between an nvstring and other gdf_types. As an example of functions that could take such input, think of a pretty common use case where I have a list of names:
NAME
"Ada Lovelace"
"Charles Babbage"
"Mary Shelley"
I want to get all the first names by implementing a kernel which does something like the following, per thread: find the first space and take the substring before it,
to end up with the list "Ada", "Charles", "Mary". While I agree you could make a primitive to do this, or solve it using some combination of the primitives that exist, having a data structure for which writing new kernels is IMPOSSIBLE feels like a huge limitation for anyone trying to build their own execution kernels on top of cudf. That is a use case we should encourage, not only to drive adoption but hopefully, someday, to start turning users into contributors. So even if we accept that all of the operations on these strings will be closed source, not having them defined at the level of device functions that operate on inputs, some of which can be nvstrings, means that we cannot recombine them to build more interesting and complex features.
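The per-thread operation described above amounts to the following, sketched here in sequential Python rather than as an actual device kernel:

```python
# Sequential sketch of the per-thread logic: take the text before the
# first space in each name. A real implementation would be a CUDA kernel
# with one thread per string, which nvstrings does not currently allow.
names = ["Ada Lovelace", "Charles Babbage", "Mary Shelley"]
first_names = [n.split(" ", 1)[0] for n in names]
print(first_names)  # -> ['Ada', 'Charles', 'Mary']
```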
Incompatibility with existing cudf apis
Unless I am missing something huge (this happens more frequently than I care to admit), there would be no way to add this to the type dispatcher or to use it with the algorithms we currently have: group by, sort, join, column concatenation, etc. Yes, we could make primitives for all of these things, and NVIDIA could commit to maintaining all of the operations we need, but that is still not enough. While in these early days people are materializing and combining operators at a high level, this should not be the case forever. We need to be able to define optimized, non-materialized versions of the algorithms we are interested in.
Right now the guts of cudf have the flexibility for someone to come in and say, OK, I want to take group by and make it operate on some new type that I defined, which could then be used to combine multiple inputs into one output to feed the group-by reducer. The way most of our primitive operations are defined now, we have left ourselves open to the possibility of cleverly combining these operations in all kinds of ways. nvstrings would break that ability for any operation that involves strings.
Implements functionality that already exists in cudf
So I understand there is a bit of a chicken and egg problem when it comes to making functions like
unsigned int set_null_bitarray( unsigned char* bitarray, bool emptyIsNull=false, bool todevice=true );
// set int array with position of null strings
unsigned int get_nulls( unsigned int* pos, bool emptyIsNull=false, bool todevice=true );
If we want long-term compatibility and maintainability, we should somehow give nvstrings access to these functions as they are defined in cudf. We only very recently started removing such duplicate functions from everywhere. In this particular case with null bitmasks it was a pain, because not everyone was following the Arrow spec for how bitmasks are handled, leading to lots of confusion.
Closed Source Data Representation
Unless I am mistaken, we have no actual information about how the data is laid out or how it can be accessed. If that is the case, how could we even use this data in another project where nvstrings might not be available? Not everyone uses cudf, but that doesn't matter as long as they can operate on buffers of information that can be provided via zero-copy IPC, like the ones inside a gdf_column. If we don't have access to the representation, we can't build anything like a lightweight adapter for using it from another language when all we need is access to the data itself, not the execution. cudf is supposed to be execution primitives on top of data representations that can be consumed via zero-copy IPC by other processes. We shouldn't have to assume that other tools only use these representations through cudf; it is perfectly reasonable for people to want to plug into the data representation directly.
Closed Source algorithms
This is the least concerning for me, because someone who is desperate for more granular control over execution can always roll their own, provided they have access to the data representation.
Arrow on GPU forgotten?
How could this ever make it into Arrow on GPU if it's a closed source data representation?
Memory Allocation
This constructor:
static NVStrings* create_from_array( const char** strs, unsigned int count );
performs a large number of persistent allocations that the user has no control over. This means that someone who uses their own memory manager, wrapping RMM for example, loses that control for part of their workflow. For anyone trying to solve larger-than-GPU-RAM problems this could be a showstopper, and even below that scale it takes control away from the users of cudf.
I have multiple GPUs available. Can you please advise me how I can do computations on, for example, GPU 1 instead of GPU 0?
NVStrings.h needs include guards, like any C/C++ header.