nvstrings's People

Contributors

randerzander, raydouglass

nvstrings's Issues

Default handling of empty strings

I've been evaluating cuStrings/nvstrings in the latest RAPIDS container, and I like it. It is very fast. I'm going to try it with some more involved string processing later.

I've tried it on company names from the UK Companies House corporate register and, in this example, on the content of 10.8 million UK tweets.

import nvstrings
filename = "/data/mesh_twitter_20181211/Data/mesh_twitter_20181211.csv"
%%time
a = nvstrings.from_csv(filename,3)

CPU times: user 1.69 s, sys: 1.82 s, total: 3.51 s
Wall time: 3.5 s

I've observed that an empty string in the source data is converted into None, and this causes errors in downstream analysis, for example when summing the records that match a criterion.

%%time
s = a.size()
c = a.count('\#')
h = c.count(0)
print("Number of tweets:",s,"Number of hashtags:",sum(c),"Tweets including a hashtag:",s-h,round(100.0*(s-h)/s,2),"%")

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<timed exec> in <module>

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

In such cases I have to use a Python list comprehension, or a filter, to check for None before further processing. A filter is fine for map-reduce style operations, but if the order or the row count needs to be preserved, for example for concatenation, a list comprehension is needed.

%%time
s = a.size()
c = [x if x is not None else 0 for x in a.count('\#')]
h = c.count(0)
print("Number of tweets:",s,"Number of hashtags:",sum(c),"Tweets including a hashtag:",s-h,round(100.0*(s-h)/s,2),"%")

Number of tweets: 10836607 Number of hashtags: 10801624 Tweets including a hashtag: 4641176 42.83 %
CPU times: user 833 ms, sys: 169 ms, total: 1 s
Wall time: 998 ms
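
The difference between the two workarounds can be illustrated in plain Python. This is only a minimal sketch: the list counts is hand-written stand-in data for what a.count(...) returns when some source rows are empty, not actual nvstrings output.

```python
# Stand-in for the per-row result of a.count(...); None marks a row
# whose source string was empty (hypothetical data for illustration).
counts = [2, None, 0, 1, None]

# filter() drops the None entries: fine for a reduction such as sum(),
# but the result no longer lines up row-for-row with the source.
filtered = list(filter(lambda x: x is not None, counts))

# A list comprehension can substitute 0 instead, keeping one entry per row.
padded = [x if x is not None else 0 for x in counts]

print(sum(filtered), len(filtered))  # 3 3
print(sum(padded), len(padded))      # 3 5
```

Both give the same sum, but only the padded list keeps one entry per source row, which matters when the result is later joined back to other columns.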

In strings, an empty string can be a legitimate value, ''. It may also be concatenated with other fields.

My suggestion, therefore, is that from_csv should have an option to treat empty strings either as a value or as None, possibly via a Boolean argument. I would recommend '' as the default.

Cannot read string data with read_csv

I am trying to read the following data

head data/G1_1e7_1e2_0_0.csv
#id1,id2,id3,id4,id5,id6,v1,v2,v3
#id046,id007,id0000043878,51,10,59276,1,1,96.8126
#id041,id026,id0000068300,12,58,78315,4,1,83.5654

which is a mix of string, int and float columns.
But I get the following error from read_csv:

gdf = cudf.read_csv("data/G1_1e7_1e2_0_0.csv", skiprows=1,
                     names=['id1','id2','id3','id4','id5','id6','v1','v2','v3'],
                     dtype=['str','str','str','int','int','int','int','int','float'])
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/io/csv.py", line 141, in read_csv
    newcol = Column.from_cffi_view(out[i])
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/column.py", line 70, in from_cffi_view
    data_mem, mask_mem = _gdf.cffi_view_to_column_mem(cffi_view)
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/_gdf.py", line 215, in cffi_view_to_column_mem
    dtype=gdf_to_np_dtype(cffi_view.dtype),
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/_gdf.py", line 153, in gdf_to_np_dtype
    }[dtype])
KeyError: 11

I tried to use category instead of str but it produces int columns instead.

print(gdf.head(2))
#         id1         id2        id3  id4  id5   id6   v1 ...       v3
#0 1947148449 -1736452051 -493298331   51   10 59276    1 ... 96.81261
#1 1264700808  1947148449 -493298331   12   58 78315    4 ... 83.56539
#[1 more columns]
gdf.dtypes
#id1      int32
#id2      int32
#id3      int32
#id4      int32
#id5      int32
#id6      int32
#v1       int32
#v2       int32
#v3     float32
#dtype: object

cuda 9.2, cudf 0.4, nvstrings 0.1 (according to suggestion in rapidsai/cudf#736)

Bug in find_from

The find_from function does not work as expected when an array is passed as the starts parameter. The parameter only takes effect for the first string in the string array.
Code sample:

>>> import nvstrings
>>> from numba import cuda
>>> s1 = "aaaaaaaaa"
>>> s2 = "aaaaaaaaa"
>>> strings = [s1,s2]
>>> gpu_strings = nvstrings.to_device(strings)
>>> gpu_strings.find('a')
[0, 0]
>>> gpu_strings.find('a', start=5)
[5, 5]
>>> starts = cuda.to_device([5,7])
>>> gpu_strings.find_from('a', starts=starts.device_ctypes_pointer.value)
[5, 0]
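
For reference, the per-string result one would expect here is [5, 7]. The sketch below computes that expectation with plain str.find, and also illustrates one possible cause of the [5, 0] output. That cause is purely an assumption, not a confirmed diagnosis: cuda.to_device on a Python list would typically produce 64-bit integers, and if find_from reads the buffer as 32-bit integers, the first two values it sees are 5 and 0. Both helper names are hypothetical.

```python
import struct

def expected_find_from(strings, sub, starts):
    """Reference result: search each string from its own start offset."""
    return [s.find(sub, start) for s, start in zip(strings, starts)]

def int64_bytes_as_int32(values):
    """Reinterpret little-endian int64 values as a sequence of int32 values."""
    raw = struct.pack("<%dq" % len(values), *values)
    return list(struct.unpack("<%di" % (2 * len(values)), raw))

strings = ["aaaaaaaaa", "aaaaaaaaa"]
print(expected_find_from(strings, "a", [5, 7]))  # [5, 7]
print(int64_bytes_as_int32([5, 7]))              # [5, 0, 7, 0]
```

If the assumption holds, building the starts array with an explicit 32-bit integer dtype before copying it to the device would be worth trying.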

Compatibility issues between cudf API and nvstrings API

So as I am looking at these apis a few concerns are popping up for me.

One of the premises of cudf's value is the ability for us to define new primitives that interact with gdf_column without having to be 100% aware of how data is encoded and represented, while still being able to take advantage of those details when needed. One of the issues I see with nvstrings is that, instead of device functions that can be called from kernels to perform string manipulation, we have array-level functions. Below I lay out some of the issues I see with trying to use this to represent strings in a gdf_column.

Granularity of Operations
There is no way to define an operator between an nvstring and another nvstring, or between an nvstring and other gdf_types. As an example, take a pretty common use case where I have a list of names.

NAME
"Ada Lovelace"
"Charles Babbage"
"Mary Shelley"

I want to get all the first names by implementing a kernel which does something like the following:
Per thread

  1. find the location of the first space character
  2. get the left most n characters where n is the position of the first space character

to end up with a list "Ada", "Charles", "Mary". While I agree you could make a primitive to do this, or solve it with some combination of the existing primitives, having a data structure for which writing new kernels is IMPOSSIBLE feels like a huge limitation for anyone trying to build their own execution kernels on top of cudf. That is a use case we should encourage, not only to drive adoption but hopefully, someday, to turn users into contributors. So even if we accept that all of the operations on these strings will be closed source, not having them defined as device functions that operate on inputs, some of which can be nvstring, means that we cannot recombine them to build more interesting and complex features.
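
The per-thread logic described above can be sketched in plain Python, with one function call standing in for the work of one GPU thread. This is only an illustration of the element-level granularity being asked for, not nvstrings or cudf code. The function name is hypothetical, and returning the whole string when there is no space is an assumption.

```python
def first_name(full_name):
    """Per-element logic that a single thread would run on one string."""
    n = full_name.find(" ")   # step 1: position of the first space
    if n == -1:               # assumption: no space means keep the whole string
        return full_name
    return full_name[:n]      # step 2: leftmost n characters

names = ["Ada Lovelace", "Charles Babbage", "Mary Shelley"]
print([first_name(s) for s in names])  # ['Ada', 'Charles', 'Mary']
```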

Incompatibility with existing cudf apis
Unless I am missing something huge (which happens more frequently than I care to admit), there would be no way to add this to the type dispatcher or to use it with the algorithms we currently have: group by, sort, join, concatenate columns, etc. Yes, we could make primitives for all of these things, and NVIDIA could commit to maintaining all of the operations we need, but this is still not enough. While in these early days people are materializing and combining operators at a high level, this should not be the case forever. We need to be able to define optimized, non-materialized versions of the algorithms we may be interested in.

Right now the guts of cudf have the flexibility for someone to be able to basically come in and say ok I want to take group by and make it operate on some new type that I defined which could then be used to combine multiple inputs into one output to feed the group by reducer. The way most of our primitive operations are being defined now we have left ourselves open to the possibility of cleverly combining these operations in all kinds of ways. Nvstrings would break that ability for any operations that contained strings.

Implements functionality that already exists in cudf
So I understand there is a bit of a chicken and egg problem when it comes to making functions like
unsigned int set_null_bitarray( unsigned char* bitarray, bool emptyIsNull=false, bool todevice=true );
// set int array with position of null strings
unsigned int get_nulls( unsigned int* pos, bool emptyIsNull=false, bool todevice=true );
If we want long term compatibility and maintainability we will somehow allow nvstrings to have access to these functions as they are defined in cudf. We just very recently started removing these duplicate functions from everywhere. In this particular case with null bitmasks it was a pain because not everyone was following the arrow spec for how bitmasks are handled leading to lots of confusion.
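
For context, the Arrow columnar format stores validity as a bitmap with least-significant-bit numbering, where a set bit means the value is valid (non-null). A minimal pure-Python sketch of that layout follows. The function name is hypothetical, and the empty_is_null flag only mirrors the emptyIsNull parameter in the signatures above as an assumption about its meaning. It is not the nvstrings or cudf implementation.

```python
def arrow_validity_bitmap(values, empty_is_null=False):
    """Build an Arrow-style validity bitmap: bit i is 1 if values[i] is valid."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        is_null = v is None or (empty_is_null and v == "")
        if not is_null:
            # Least-significant-bit numbering within each byte.
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)

values = ["Ada", None, "", "Mary"]
print(bin(arrow_validity_bitmap(values)[0]))                      # 0b1101
print(bin(arrow_validity_bitmap(values, empty_is_null=True)[0]))  # 0b1001
```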

Closed Source Data representation
Unless I am mistaken, we have no actual information about how the data is laid out or how it can be accessed. If that is the case, how could we even use this information in another project where nvstrings might not be available? For example, not everyone uses cudf, but that doesn't matter if they can operate on buffers that are provided via zero ipc, like the ones inside of gdf_column. If we don't have access to the representation, we can't make anything like a lightweight adapter for using it in another language when all we need is access to the data itself, not the execution. CUDF is supposed to be execution primitives on data representations that can be consumed via zero ipc by other processes. We shouldn't assume that other tools only use these representations through cudf; it is perfectly reasonable for people to want to just plug into the data representation.

Closed Source algorithms
This is the least concerning for me because someone can always just roll their own if they are desperate for more granular control over execution if they have access to the data representation.

Arrow on gpu forgotten?
How could this ever make it into arrow on gpu if it's a closed source data representation?

Memory Allocation
The constructor `static NVStrings* create_from_array( const char** strs, unsigned int count);` performs a large number of persistent allocations that the user has no control over. This means that someone using their own memory manager, one which wraps RMM for example, loses that control for part of their workflow. For anyone trying to solve larger-than-GPU-RAM problems this could be a show stopper, and even for smaller problems it takes control away from the users of cudf.

How to use GPU other than 0

I have multiple GPUs available. Can you please advise how I can run computations on, for example, GPU 1 instead of GPU 0?
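
One general CUDA mechanism that may apply here is the CUDA_VISIBLE_DEVICES environment variable, which restricts the devices a process can see. The sketch below assumes it is set before any CUDA-using library is imported. Whether nvstrings offers its own device-selection call is not stated in this issue, so this is a workaround sketch rather than the library's documented method.

```python
import os

# Must run before any CUDA-using library (nvstrings, numba, cudf) initializes
# CUDA in this process; setting it afterwards has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# Libraries imported from here on see only physical GPU 1, exposed as device 0.
print(os.environ["CUDA_VISIBLE_DEVICES"])  # 1
```

The same effect can be had by setting the variable in the shell before launching Python or the notebook server.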
