Dataframe MVP (dataframe-api issue, closed)

data-apis commented on August 26, 2024
Dataframe MVP

from dataframe-api.

Comments (12)

rgommers commented on August 26, 2024

We can discuss if this makes sense in the call tomorrow.

Agreed, the methodology for constructing the APIs will be the main topic for tomorrow's call. We'll start with arrays, where we're a little further along with tooling and where obtaining data and making decisions is generally a bit easier. Then for dataframes we should consider which parts of that methodology are applicable, and how to deal with dataframe-specific pain points.

TomAugspurger commented on August 26, 2024

meta-question (which is probably appropriate for the Thursday call): why do this in a GitHub issue rather than the RFC document? IMO this document is a better home for it so that we can comment inline.

tdimitri commented on August 26, 2024

I hope we can call it something other than "DataFrame". To me, dataframe implies the pandas version of this concept, and I would prefer another term to help distinguish the two. For example, MATLAB has Datasets and Tables, and R has Tables and xtabs.

Examples: Table, Grid, Dataset, DataGrid, DataMatrix, DataSheet, DataPage, RowCol, Screen, Slate, Panel, Lattice, Board, DataBoard, etc.

I don't care much, as long as we don't overload the term DataFrame.
It will get confusing when the pandas APIs for its DataFrames differ from this group's recommended APIs for DataFrames.

TomAugspurger commented on August 26, 2024

tdimitri commented on August 26, 2024

Yeah, I see R with data.frame vs data.table. For many Python data analytics users, the word dataframe is intertwined with the pandas dataframe. It feels like this is the "pandas" club, and I hope we can break out from its orbit and declare independence. Is this group's goal to:

  1. Clean up the pandas APIs, consolidate them, and get rid of duplicate methods
  2. Come up with a better model for the common tasks at hand

If no. 1, we should just state that this is what the group is doing. At which point I'm disappointed and feeling hoodwinked.

If no. 2, we should then indicate the most common tasks a user wants to perform, including two-step tasks that can be consolidated into one. For instance, set_index followed by pivot can be made into one task, and it illustrates some of the problems of following the pandas API model.
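To make the consolidation idea concrete, here is a minimal sketch in plain Python, not pandas and not a proposal for the spec. The function name `pivot_rows` and its parameters are hypothetical; the point is only that the row label is passed as an argument, so no separate "set the index first" step is needed.

```python
from collections import defaultdict


def pivot_rows(rows, index, columns, values):
    """Hypothetical one-step pivot over a list of dict records.

    `index` names the field used as the row label, `columns` the field
    whose values become column labels, and `values` the field that
    fills the cells.
    """
    table = defaultdict(dict)
    for row in rows:
        table[row[index]][row[columns]] = row[values]
    return dict(table)


# Long-format records: one row per (date, city) observation.
records = [
    {"date": "2020-01-01", "city": "Paris", "temp": 5},
    {"date": "2020-01-01", "city": "Rome", "temp": 11},
    {"date": "2020-01-02", "city": "Paris", "temp": 7},
    {"date": "2020-01-02", "city": "Rome", "temp": 12},
]

# One call reshapes to wide format, keyed by date and then by city.
wide = pivot_rows(records, index="date", columns="city", values="temp")
print(wide["2020-01-02"]["Rome"])
```

A real design would also have to decide what happens with duplicate (index, columns) pairs; this sketch silently keeps the last value.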

One step towards breaking free from the pandas APIs is to change the name. Further, having the same name and methods that do things differently is more confusing for future users.

TomAugspurger commented on August 26, 2024

tdimitri commented on August 26, 2024

Yes, the impression I get from most of those is that there was a conscious effort to be similar enough to the pandas dataframe that existing code could be ported over more easily and common methods could be used in the same way.

In riptide we called it a Dataset, and the name difference worked out well. Our examples use "ds" instead of "df". If we called it DataTable or DataGrid, we could use "dt" or "dg". Then it would be easier to explain to users why we have different APIs with different kwargs.

Is the goal to be ~80% like the pandas DataFrame API, or to be a new class with similar but different methods? Different group members may have different opinions.

Perhaps we can take a vote on this, because I do care about the name. From my experience, I think calling it DataFrame is bound to lead to similar methods. Maybe that is what most people want, but it is not what I signed up for.

jack-pappas commented on August 26, 2024

I think it's best to just vote on it -- we can either do that at tomorrow's meeting, or we can decide to pick a working name for now then vote on a final name later on (when we're finalizing the spec).

DataFrame is more recognizable to end users (due to its widespread use in existing projects), but re-using that name could also lead to some confusion on their part if the specification we produce has non-trivial differences from those existing implementations. I assume that's why R's data.table project chose a different name -- its API is still recognizable as a "DataFrame" in spirit but diverges enough from the built-in R data.frame that calling it the same thing would have been confusing to end users.

Some additional data points:

Note: I don't have a preference towards any particular name. I do feel like the Data prefix is somewhat implied though, so I'd lean towards a simpler name like Frame or Table.

jack-pappas commented on August 26, 2024

@datapythonista The naming discussion has more or less taken over this issue, but I think it's important we address the other points you brought up as well. Maybe we just rename / simplify this issue to be about the naming only and copy the rest of your original post to another issue so we can discuss those points?

datapythonista commented on August 26, 2024

My main idea here, more than making decisions on the specific points, was about the methodology to move forward. We've got several issues now, with very interesting discussions. But I felt that instead of ending up with even more open discussions, finding the points for a minimal API, and deciding on those, could help get started.

Then we could work as follows:

  • We decide the initial points to discuss (e.g. class name, get/set columns...)
  • We open issues for those, and we try to reach decisions
  • Once there is agreement, we write the minimal API in the intended formats (RFC, not sure if we want a Python definition too, or something else)
  • Then, for the rest of the issues, once there seems to be consensus in one of them, we open a PR to the RFC... with the outcome, to finalize the discussion.

I think this should help us stay focused: we can see the progress on what has already been agreed, and we limit the number of open discussions at a time. But that's just a personal preference, to work in a more structured way. Surely other people will have ideas on how to work efficiently.

We can discuss if this makes sense in the call tomorrow. We can also discuss whether to take a subset of pandas as a starting point, as was asked here, and which points we want to start with, if people like this approach.

tdimitri commented on August 26, 2024

When designing an API I often request good use case examples. I am not sure we have those yet (perhaps we do and I missed them). Different industries have different use cases, which can help us determine the APIs. For instance, if I want to select all rows with last name 'Dimitri' and first name 'Thomas', what are the ways to do that? (That is just a simple one; I am hoping for more sophisticated ones.)

Therefore I think one step in the methodology for constructing the APIs involves good use cases from different industries.
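As a starting point for that simple use case, here is a sketch of two possible ways to express the selection, written in plain Python over a list of dict records. The field names `first` and `last`, the sample data, and the `select` helper are all made up for illustration; neither form is being proposed as the API.

```python
# Made-up records standing in for rows of a table.
people = [
    {"first": "Thomas", "last": "Dimitri", "team": "A"},
    {"first": "Anna", "last": "Dimitri", "team": "B"},
    {"first": "Thomas", "last": "Smith", "team": "A"},
]

# Way 1: an explicit boolean predicate combining both conditions.
matches = [
    row for row in people
    if row["last"] == "Dimitri" and row["first"] == "Thomas"
]


# Way 2: a hypothetical keyword-based helper that matches on equality.
def select(rows, **criteria):
    """Return the rows whose fields equal every given keyword value."""
    return [
        row for row in rows
        if all(row.get(key) == value for key, value in criteria.items())
    ]


same_matches = select(people, last="Dimitri", first="Thomas")
print(matches == same_matches)
```

The predicate form generalizes to arbitrary conditions, while the keyword form is shorter but only covers equality tests. Collecting use cases like this one would show which trade-off matters in practice.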

datapythonista commented on August 26, 2024

This has been superseded by #25, closing.
