Comments (6)
A DataFrame and each of its series all contain references to the same index. This currently isn't checked, but must be true for correctness.
It is checked at https://github.com/gpuopenanalytics/pygdf/blob/f3330612316e2f40fb0bda205ee2ad6ad212bc62/pygdf/dataframe.py#L286. Series' index must be equivalent to the dataframe index.
from cudf.
The numeric implementation contains no data, but the categorical implementation does. This makes figuring out where to put methods kind of confusing.
The categorical impl doesn't contain data. I think _categories and _ordered are metadata.
No SeriesImpl subclass contains data. They are delegates: when a type-specialized operation is invoked, a Series calls its SeriesImpl to handle the operation. SeriesImpl methods always take the "calling" series as one of their arguments.
To add a new type-specialized method, put the actual implementation in SeriesImpl. The Series gets a small wrapper that delegates the work.
However, I have sometimes been lazy and put the actual implementation directly in Series. That's okay for now.
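A minimal sketch of the delegate pattern described above, in plain Python rather than the real pygdf classes. The names NumericalImpl, CategoricalImpl and to_pylist are hypothetical and chosen only for illustration. The impl objects hold at most metadata, and the calling series is always passed in explicitly.

```python
class NumericalImpl:
    """Hypothetical delegate for numeric series: holds no data at all."""

    def to_pylist(self, series):
        # The "calling" series is an explicit argument; nothing lives on self.
        return list(series.values)


class CategoricalImpl:
    """Hypothetical delegate for categorical series: metadata only."""

    def __init__(self, categories, ordered=False):
        self._categories = tuple(categories)  # metadata, not data
        self._ordered = ordered               # metadata, not data

    def to_pylist(self, series):
        # series.values holds integer codes; map them back to labels.
        return [self._categories[code] for code in series.values]


class Series:
    """Owns the data and forwards type-specialized work to its impl."""

    def __init__(self, values, impl):
        self.values = list(values)
        self._impl = impl

    def to_pylist(self):
        # Small wrapper: the real logic lives in the impl.
        return self._impl.to_pylist(self)


numbers = Series([1.5, 2.5], NumericalImpl())
labels = Series([0, 2, 1], CategoricalImpl(["a", "b", "c"]))
print(numbers.to_pylist())
print(labels.to_pylist())
```

Adding another type-specialized operation under this layout means one method on each impl plus a one-line wrapper on Series.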
from cudf.
As for the proposed class hierarchy, the only difference I see would be where the data is. In SeriesImpl, the data is always passed as an arg of Series type, whereas in the new Data class it will be under self. I would prefer not to couple the data and the type-specialized operation.
from cudf.
Series' index must be equivalent to the dataframe index.
The categorical impl doesn't contain data. I think _categories and _ordered are metadata.
Ah, you are correct in both cases. Ignore my gripes there :).
As for the proposed class hierarchy, the only difference I see would be where the data is. In SeriesImpl, the data is always passed as an arg of Series type, whereas in the new Data class it will be under self. I would prefer not to couple the data and the type-specialized operation.
My only remaining issue is that I'm finding myself frequently using Series where an array would suffice. For example, GenericIndex is backed by a series (which also has an index). Concat is another example - to avoid repeatedly concatenating the same index arrays I had to do a kludgy hack in #40. With array-like objects (that understand categoricals), concat could work on the arrays, and then wrap the final output in a series/dataframe with a concatenated index.
I think it would be nice to build operations on a generic data container without an index. Since categorical data has metadata (categories and ordered), at least the categorical data container would have to be a step higher than a gpu array. This is similar to pandas - a pd.Series is backed by either a numpy array or a pd.Categorical.
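To make the concat idea concrete, here is a rough sketch in plain Python. CategoricalData, concat_data and concat_series are hypothetical names, not pygdf APIs, and a "series" is modelled as a bare (index, data) pair to keep the example small. The point is only that the data containers are concatenated without touching any index, and the index is concatenated once at the end.

```python
class CategoricalData:
    """Hypothetical index-free container: integer codes plus metadata."""

    def __init__(self, codes, categories, ordered=False):
        self.codes = list(codes)
        self.categories = tuple(categories)
        self.ordered = ordered


def concat_data(parts):
    """Concatenate index-free containers that share the same metadata."""
    first = parts[0]
    for part in parts[1:]:
        if part.categories != first.categories or part.ordered != first.ordered:
            raise ValueError("categorical metadata must match")
    codes = [code for part in parts for code in part.codes]
    return CategoricalData(codes, first.categories, first.ordered)


def concat_series(series_list):
    """Each item is an (index, data) pair; the index is joined exactly once."""
    data = concat_data([data for _, data in series_list])
    index = [label for index, _ in series_list for label in index]
    return index, data


a = ([0, 1], CategoricalData([0, 1], ["x", "y"]))
b = ([2], CategoricalData([1], ["x", "y"]))
index, data = concat_series([a, b])
print(index, data.codes, data.categories)
```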
Again, feel free to ignore, I'm still figuring things out. I think the differences between this project and pandas are tripping me up and that's leading to confusion on my part.
from cudf.
I agree that the GenericIndex design has issues. I hacked it up to make progress. It's backed by a Series just because Series has a lot of features. But the Series in GenericIndex should always have the basic RangeIndex, so as not to recursively contain more Series.
My only remaining issue is that I'm finding myself frequently using Series where an array would suffice.
That's true. Sadly, the underlying numba gpu array is not as feature-rich.
Now, I think it makes sense to introduce the Data classes. They will resolve the problem in GenericIndex nicely.
So the Buffer class will be the physical layer. The Data class will be the logical layer.
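As a sketch of how that split could look, assuming nothing about the eventual implementation: Buffer below wraps a stdlib array as a stand-in for GPU memory, and NumericalData, CategoricalData and the to_pylist method are hypothetical names. GenericIndex is backed by a Data object instead of a Series, so there is no nested index to worry about.

```python
from array import array


class Buffer:
    """Physical layer: typed memory with no knowledge of what it means."""

    def __init__(self, typecode, values):
        # array.array stands in for a device allocation in this sketch.
        self.mem = array(typecode, values)

    def __len__(self):
        return len(self.mem)


class NumericalData:
    """Logical layer: interprets a buffer as plain numbers."""

    def __init__(self, buffer):
        self.buffer = buffer

    def to_pylist(self):
        return list(self.buffer.mem)


class CategoricalData:
    """Logical layer: interprets a buffer as codes into categories."""

    def __init__(self, buffer, categories, ordered=False):
        self.buffer = buffer
        self.categories = tuple(categories)
        self.ordered = ordered

    def to_pylist(self):
        return [self.categories[code] for code in self.buffer.mem]


class GenericIndex:
    """Index backed by a Data object, which itself carries no index."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data.buffer)


idx = GenericIndex(CategoricalData(Buffer("i", [1, 0, 1]), ["lo", "hi"]))
print(len(idx), idx.data.to_pylist())
```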
from cudf.
Closing due to #54
from cudf.