
Dataframe namespaces · data-apis/dataframe-api · OPEN · 9 comments

data-apis commented on August 26, 2024
Dataframe namespaces

from dataframe-api.

Comments (9)

rgommers commented on August 26, 2024

Small comment: the reduce + sum example is a bit odd; it should be reduce + add. E.g. np.sum is np.add.reduce; summing is an aggregation op that implies reduction. reduce should be combined with the element-wise operations (add, multiply, divide, subtract, etc.).
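The distinction can be seen with Python's own functools.reduce, without NumPy: "sum" is not a primitive, it is the generic reduce combined with the element-wise "add" operation (which is exactly why np.sum is np.add.reduce).

```python
from functools import reduce
import operator

x = [1, 2, 3, 4]

# "sum" = reduce + element-wise add (in NumPy terms, np.sum is np.add.reduce):
assert reduce(operator.add, x) == sum(x) == 10

# The same pattern pairs reduce with other element-wise ops:
assert reduce(operator.mul, x) == 24  # a "product" reduction
```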


devin-petersohn commented on August 26, 2024

@datapythonista Great writeup. What do you see as the biggest benefit of defining things as df.reduce(mod.reductions.sum)?

From my perspective, it feels a bit cumbersome, but I can see value in making it visible that the shape (dimension) of the data will change (in this example) without needing to understand the details of sum.

I don't think I am leaning in any particular way yet, but I want to understand your thoughts.


datapythonista commented on August 26, 2024

Thanks @devin-petersohn, I 100% agree with your comments. The syntax is trickier than some other options, and we also need to consider extra parameters, for example: df.reduce(mod.reductions.var, ddof=1).

The main advantages I see are:

  • It decouples things with an interface, so it's easier to deal with the complexity at different levels of abstraction.
  • This means we would not need to care about reductions individually; that part of a dataframe becomes a framework that can be extended with functions implementing a certain signature. We probably want to define at least a small subset of map and reduce operations, but then we only need to know which functions are part of the standard, their extra parameters, and which types they support; no other decisions need to be made case by case.
  • We don't need to think about how to extend the API for map, reductions, and everything else that can be implemented this way. Developers of third-party libraries can implement functions with a certain signature, and users can use them naturally, without a registry, entry points, or other "magic".
  • For users (and this means developers here, since friendlier APIs will be defined on top of this one), the code becomes more readable in the sense that the returned shape is known. Of course the code is less readable in other ways (df.sum() vs df.reduce(mod.reductions.sum)), but for less well-known functions it's obvious that df.reduce(mod.reductions.sem) returns a scalar (or a scalar per column), while for df.sem() you need to check the signature.
  • Implementations can take advantage of this predictability of shapes to, for example, preallocate memory. You can surely do that with another syntax, but a single .reduce() method whose output shape (including types) is known from its input feels much easier to manage. Not sure if this is only true for lazy implementations, but I guess optimizations are much easier to implement for a single .reduce() method than for each reduction individually.
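A minimal sketch of what such a decoupled interface could look like. All names here (the `reduce` method, the `Reduction` signature, `rsum`, `rmean`) are hypothetical illustrations, not part of any standard:

```python
from typing import Any, Callable

# Hypothetical: a reduction is any callable taking a column of values
# (plus optional keyword parameters) and returning a scalar.
Reduction = Callable[..., Any]

class DataFrame:
    def __init__(self, columns: dict):
        self._columns = columns

    def reduce(self, func: Reduction, **kwargs) -> dict:
        # A single entry point for every reduction: the result shape
        # (one scalar per column) is known without inspecting `func`.
        return {name: func(values, **kwargs)
                for name, values in self._columns.items()}

# Reductions live outside the DataFrame class; third parties can add
# their own just by implementing the same signature.
def rsum(values, **kwargs):
    return sum(values)

def rmean(values, **kwargs):
    return sum(values) / len(values)

df = DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
assert df.reduce(rsum) == {"a": 6, "b": 15}
assert df.reduce(rmean) == {"a": 2.0, "b": 5.0}
```

Note how the dataframe never needs to know which reductions exist; it only enforces the calling convention.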

I think similar things can be achieved with other syntaxes, like for example df.reduce.sum(). But IMHO everything becomes trickier. For users, what's going on under the hood is somewhat magic. Reductions need to be registered instead of just used directly. And the syntax itself doesn't imply that all the reductions share the same signature, so validating and enforcing that feels somewhat magic too.

But to me, the main thing would be to use a divide and conquer approach to reduce the complexity. I see this as similar to what Linux achieves with the X server (and many other components). You're building a complex piece of software, and instead of dealing with all the complexity in desktops, you just create an interface to build on top of. Besides allowing a free market of software to interact with yours, the complexity is reduced dramatically, and the modularity makes changes much simpler. Also, I think we could end up having an independent project for reductions (and maps, like string functions...). If every dataframe library needs the same reductions, and all of them are implemented on top of the buffer protocol, array API or whatever, it feels like it could be healthier for the ecosystem to have a common dependency (better maintained, fewer bugs, more optimized...).

So, in summary, even if I also think the functional API is somewhat cumbersome, I think it's the one that best captures all these goals and advantages, and makes things simpler. Certainly for the implementation, I'd say, but also for users, who are presented with a more structured interface.


rgommers commented on August 26, 2024

Based on pandas, the number of methods can be 300 or more, so it may be problematic to implement everything in the same namespace

A quick count with [s for s in dir(df) if not s.startswith('_')] says there are currently 217 methods + attributes. 300 would probably still be fine, but I agree that many more would start to become messy.
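The counting expression works on any object; a self-contained illustration with a toy class (rather than a real pandas DataFrame, which would report its full ~217-name public surface):

```python
class Toy:
    _private = 1          # underscore-prefixed: filtered out below
    def foo(self): ...
    def bar(self): ...

# Same idiom as in the comment: public (non-underscore) names only.
public = [s for s in dir(Toy()) if not s.startswith('_')]
assert public == ['bar', 'foo']  # dir() returns names sorted alphabetically
```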


datapythonista commented on August 26, 2024

A quick count with [s for s in dir(df) if not s.startswith('_')] says there are currently 217 methods + attributes. 300 would probably still be fine, but I agree that many more would start to become messy.

To be more specific about the figure I gave: there are around 200 methods for DataFrame, and somewhat more than that for Series (many are the same, but I guess the union is probably between 250 and 300). And then there are the accessors: around 50 string methods, and around 50 more for datetime. So, if we implement most things in pandas, merge Series functionality into a single dataframe class, and don't use accessors (everything is a direct method of the dataframe), I guess we're talking about that order of magnitude (300 or more).


rgommers commented on August 26, 2024

Also, I think it would be good if the API could be extended easily. A couple of examples of how pandas can be extended with custom functions:

df.apply(my_custom_function): this is not API-extending, I'd say; this is regular use of the apply method.

@pd.api.extensions.register_dataframe_accessor('my_accessor')
class MyAccessor:
    def my_custom_method(self):
        return True

df.my_accessor.my_custom_method()

This one seems pretty horrifying to me. It gives end users and consumer libraries the ability to basically monkeypatch the DataFrame object. This is Python, so anyone can monkeypatch things anyway (unless things are implemented as a compiled extension, in which case the language won't let you), but providing an API to do this seems very odd. If pandas would like to do that, it's of course free to do so, but I'd much prefer not to standardize such a pattern.
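The "this is basically monkeypatching" point can be made concrete: in pure Python, an accessor registry is essentially sugar over assigning an attribute on the class after the fact. A toy sketch (not the actual pandas implementation):

```python
class DataFrame:
    """Stand-in for a dataframe class; no accessor API provided."""

class MyAccessor:
    def __init__(self, df):
        self._df = df
    def my_custom_method(self):
        return True

# "Registration" by plain attribute assignment: a property, so the
# accessor is constructed per-instance, as pandas accessors are.
# Any user code can already do this without any registry API.
DataFrame.my_accessor = property(lambda self: MyAccessor(self))

df = DataFrame()
assert df.my_accessor.my_custom_method() is True
```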


TomAugspurger commented on August 26, 2024


datapythonista commented on August 26, 2024

Some comments made in the meeting:

  • Discoverability is important
  • More +1's for method chaining
  • Several naming conflicts were pointed out: replace, count... which mean different things for .str than for a Series
  • With a functional API it could be possible to avoid having specific maps/reductions in the standard. For example, define a df.reduce() method in the standard, but don't make the specific reductions part of the standard (same for string methods and others)
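The last point can be sketched with a typing.Protocol: the standard would fix only the calling convention, while the concrete reductions could come from any library. All names below (`Reduction`, `sem`) are hypothetical illustrations:

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class Reduction(Protocol):
    # The standard would specify only this signature;
    # which reductions actually exist is left to libraries.
    def __call__(self, values: list, /, **kwargs) -> Any: ...

def sem(values: list, /, ddof: int = 1, **kwargs) -> float:
    # A third-party reduction: standard error of the mean.
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - ddof)
    return (var / n) ** 0.5

# Any callable satisfies the runtime protocol check (runtime_checkable
# protocols verify member presence, not full signatures).
assert isinstance(sem, Reduction)
```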


maurosilber commented on August 26, 2024

Is this still being considered?

I'd love to use an API with few methods appearing in the autocomplete (df.<TAB>), which would imply hiding them somehow from the "main namespace" (that is, neither the top-level nor the prefixed-methods approach).

I'd vote for the accessor approach:

df.reduce.sum() # or add

as it can include the functional one by creating an accessor with a __call__ method, as pandas.DataFrame.plot does.

df.reduce(np.sum)

Then, when using the autocomplete, we would only see a short list of primitive actions to perform on a DataFrame (reduce, transform, plot, export, etc).
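The accessor-plus-__call__ pattern can be sketched with a toy namespace object; all class and method names here are hypothetical illustrations, not an existing API:

```python
class ReduceAccessor:
    def __init__(self, df):
        self._df = df

    def __call__(self, func):
        # Functional form: df.reduce(some_function)
        return {k: func(v) for k, v in self._df._columns.items()}

    def sum(self):
        # Method form: df.reduce.sum()
        return self(sum)

    def max(self):
        return self(max)

class DataFrame:
    def __init__(self, columns):
        self._columns = columns

    @property
    def reduce(self):
        # A fresh accessor bound to this frame, as pandas accessors are.
        return ReduceAccessor(self)

df = DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
assert df.reduce.sum() == {"a": 6, "b": 15}  # accessor-method form
assert df.reduce(max) == {"a": 3, "b": 6}    # functional form via __call__
```

Only `reduce` itself appears in `df.<TAB>`; the individual reductions live one level down, which is the autocomplete behavior described above.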

A drawback of this would be auto-formatters, which may split the accessor and the chosen method onto separate lines.

Instead of this:

(
    df
    .fill_nan(0)
    .max()
    .sort()
)

we would write this

(
    df
    .transform.fill_nan(0)
    .reduce.max()
    .transform.sort()
)

but it could be auto-formatted to this

(
    df
    .transform
    .fill_nan(0)
    .reduce
    .max()
    .transform
    .sort()
)

