Comments (12)
We can discuss if this makes sense in the call tomorrow.
Agreed, methodology for constructing the APIs will be the main topic for tomorrow's call. Starting with arrays, where we're a little further along with tooling, and obtaining data and making decisions is generally a bit easier. And then for dataframes we should consider what parts of that methodology are applicable, and how to deal with dataframe-specific pain points.
from dataframe-api.
meta-question (which is probably appropriate for the Thursday call): why do this in a GitHub issue rather than the RFC document? IMO this document is a better home for it so that we can comment inline.
I hope we can call it something other than "DataFrame". To me, dataframe implies the pandas version of this concept, and I would prefer another term to help distinguish the two. For example, Matlab has Datasets and Tables, R has Tables and xtabs.
Examples: Table, Grid, Dataset, DataGrid, DataMatrix, DataSheet, DataPage, RowCol, Screen, Slate, Panel, Lattice, Board, DataBoard, etc.
I don't care much, as long as we don't overload the term DataFrame.
It will get confusing when pandas APIs for its DataFrames are different than this group's recommended APIs for DataFrames.
Yeah I see R with data.frame vs data.table. For many python data analytics users, the word dataframe is intertwined with pandas dataframe. Feels like this is the "pandas" club, and I hope we can break out from its orbit and declare independence. Is this group's goal to
1. Clean up pandas APIs, consolidate them, and get rid of duplicate methods
2. Come up with a better model for the common tasks at hand
If no. 1, we should just state that is what this group is doing. At which point I'm disappointed and feeling hoodwinked.
If no. 2, we should then indicate the most common tasks a user wants to perform, including two-step tasks that can be consolidated into one. For instance, set_index followed by pivot can be made into one task, and it indicates some of the problems of following the pandas API model.
One step toward breaking free from pandas APIs is to change the name. Further, having the same name and methods that do things differently is more confusing for future users.
Yes the impression I get from most of those was a conscious effort to be similar enough to the pandas dataframe so that existing code could be ported over more easily, and common methods could be used in the same way.
In riptide we called it a Dataset, and the name difference worked out well. Our examples use "ds" instead of "df". If we called it DataTable or DataGrid we could use "dt" or "dg". Then it would be easier to explain to users why we have different APIs with different kwargs.
Is the goal to be ~80% like the pandas Dataframe API or to be a new class with similar but different methods? Different group members may have different opinions.
Perhaps we can take a vote on this because I do care about the name. From my experience, I think calling it Dataframe is bound to lead to similar methods. Maybe that is what most people want, but it is not what I signed up for.
I think it's best to just vote on it -- we can either do that at tomorrow's meeting, or we can decide to pick a working name for now then vote on a final name later on (when we're finalizing the spec).
DataFrame is more recognizable to end users (due to its widespread use in existing projects), but re-using that name could also lead to some confusion on their part if the specification we produce has non-trivial differences from those existing implementations. I assume that's why R's data.table project chose a different name: its API is still recognizable as a "DataFrame" in spirit but diverges enough from the built-in R data.frame that calling it the same thing would have been confusing to end users.
Some additional data points:
- datatable calls its data structure Frame.
- pyarrow uses the name Table.
- Cassandra (the database) uses the term column family.
Note: I don't have a preference towards any particular name. I do feel like the Data prefix is somewhat implied though, so I'd lean towards a simpler name like Frame or Table.
@datapythonista The naming discussion has more-or-less taken over this issue, but I think it's important we address the other points you brought up as well. Maybe we just rename / simplify this issue to be about the naming only and copy the rest of your original post to another issue so we can discuss those points?
My main idea here, more than making decisions on the specific points, was about the methodology to move forward. We've got several issues now, with very interesting discussions. But I felt that instead of ending up with even more open discussions, finding the points for a minimal API, and deciding on those, could help get started.
Then we could work as follows:
- We decide the initial points to discuss (e.g. class name, get/set columns...)
- We open issues for those, and we try to reach decisions
- Once there is agreement, we write the minimal API in the intended formats (RFC, not sure if we want a Python definition too, or something else)
- Then, for the rest of issues, once there seems to be consensus in one of them, we open a PR to the RFC... with the outcome, to finalize the discussion.
I think this should help a bit keep focused. Being able to see the progress on what has already been agreed, and somehow limiting the number of open discussions at a time. But that's just a personal preference, to work in a more structured way. Surely other people will have ideas on how to work efficiently.
We can discuss if this makes sense in the call tomorrow. Also, if we want to take a subset of pandas as a starting point or not, as it was asked here. And what are the points we want to start with, if people like this approach.
When designing an API I often request good use case examples. I am not sure we have those yet (perhaps we do and I missed it). Different industries have different use cases, which can help us determine the APIs. For instance, if I want to select all rows with last name 'Dimitri' and first name 'Thomas', what are the ways to do that? (That is just a simple one; I am hoping for more sophisticated ones.)
Therefore I think one step in the methodology for constructing the APIs involves good use cases from different industries.
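To make the simple use case above concrete, here is a rough sketch of a few ways it can be written with today's pandas. The sample data and the column names `first_name`, `last_name`, and `age` are assumptions made up for illustration; this is not a proposal for the API this group would specify, only one existing answer to the question.

```python
import pandas as pd

# Hypothetical sample data; column names and values are illustrative assumptions.
people = pd.DataFrame(
    {
        "first_name": ["Thomas", "Anna", "Thomas"],
        "last_name": ["Dimitri", "Dimitri", "Smith"],
        "age": [41, 35, 29],
    }
)

# Way 1: build a boolean mask and index the frame with it.
# Each comparison needs its own parentheses because & binds tighter than ==.
mask = (people["last_name"] == "Dimitri") & (people["first_name"] == "Thomas")
by_mask = people[mask]

# Way 2: pass the same mask to the label-based indexer .loc.
by_loc = people.loc[mask]

# Way 3: express the condition as a string evaluated by DataFrame.query.
by_query = people.query("last_name == 'Dimitri' and first_name == 'Thomas'")
```

All three spellings select the same single row, which is arguably the kind of overlap a use-case-driven design would need to weigh.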
This has been superseded by #25, closing.