
Comments (43)

teoliphant avatar teoliphant commented on August 26, 2024 6

I think this is a great point and am glad @datapythonista brought it up. I have seen this distinction with users of NumPy too. In that context, you have "interactive usage" vs. "library usage". Initially, NumPy was primarily used for interactive use. Over the years, it has become more used as a library. The personas of the people that rely on the tool want different things from the API depending on their primary use-case.

For interactive use, people want time-saving help and they want more "intelligence" from the tool like run-time hiding, robustness to inputs, and very little typing information.

For library use, developers typically want more clarity and less "magic" (where "magic" is a bit in the eye of the beholder in terms of what they are used to). In my experience, developers want type information, quick errors when inputs are wrong rather than robustness, and more exposure of run-time details they can take advantage of for speed.

For Python, we have both use-cases interweaving all the time. In practice, in my view, that means there are always at least 2 API categories: One that is more dev-centric and another that is more user-centric. For clarity, and now that packaging is in a better situation, two libraries are often more effective than one.

So, as we think about APIs, I agree in eventually separating them into at least 2 categories. But, I recommend we do that only as it becomes clear we need to because of different spec requirements for the different APIs that the data tell us are in regular use.


TomAugspurger avatar TomAugspurger commented on August 26, 2024 4

Yes, it means that that call won't be lazy itself, but that's fine, because it's out of scope of this proposal to try to convert Python control flow to a lazy form.

Agreed.

This would be different than the current behavior of Dask Array, or Ibis, at the moment, because they usually don't execute until you tell them to, not implicitly AFAIK.

Just to clarify, dask.array & dataframe will implicitly compute things when requested, where "requested" means something like a __bool__, __iter__, __array__

In [1]: import numpy as np

In [2]: import dask.array as da

In [3]: a = da.ones((4, 4))

In [5]: a.mean() > 0  #lazy
Out[5]: dask.array<gt, shape=(), dtype=bool, chunksize=(), chunktype=numpy.ndarray>

In [6]: if a.mean() > 0:  # executes (a.mean() > 0).__bool__()
   ...:     print("hi")
   ...:
hi

In [7]: np.asarray(a)
Out[7]:
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

I'm not sure if it's worth looking at, but dask-glm does have some examples of distributed optimization algorithms. They aren't really in use today though. I would say that we haven't really figured out a good model here yet.

https://blog.dask.org/2018/08/07/incremental-saga and https://blog.dask.org/2017/04/19/dask-glm-2 have some discussions as well.


datapythonista avatar datapythonista commented on August 26, 2024 3

Thanks @maartenbreddels, that's a very good point. I clarified my comment above to note that.

My main point is that there may be API implications on the decision made regarding laziness. If we totally forget about it, and decide once the API is finished, there will be limitations. Maybe they are not important enough to prevent postponing the discussion, and we can keep things simpler for now. But I don't think the user API and eagerness are fully independent topics.


devin-petersohn avatar devin-petersohn commented on August 26, 2024 2

My personal opinion here is that adding explicit compute or execute leaks execution details to the user. Separating the execution details from the API is going to be important because systems may want to control when computation is actually triggered, like in the case @maartenbreddels showed with Vaex.

I think we can get away without needing some explicit computation trigger word, lazy systems can just trigger computation when an API calls for the dataframe to be converted to an external format (e.g. array, csv file, __repr__, etc.). This is how laziness is handled in Modin (preliminarily).


TomAugspurger avatar TomAugspurger commented on August 26, 2024 2

Yes, the "can" vs. "should" distinction is a good one :)

So I think we're back to Ralph's original point that eager vs. lazy is primarily about performance, rather than API? Does anyone object to that?

@datapythonista you noted in #5 (comment)

My main point is that there may be API implications on the decision made regarding laziness. If we totally forget about it, and decide once the API is finished, there will be limitations.

Can we delve into that a bit? In my experience, this is primarily an issue when we have value-dependent behavior, i.e. the output dtype / shape depends on the values in the dataframe, rather than just the input dtype / shape. We've been removing this type of behavior in pandas wherever possible, but in some cases it's unavoidable (e.g. pivot(df, index="a", columns="b") will return a DataFrame whose columns are the values in b). In practice, this just means that lazy systems will need to evaluate the values in b before proceeding with the rest of the computation.
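
For concreteness, a small eager pandas illustration of that value dependence (the data here is made up):

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "y", "x"], "v": [10, 20, 30]})

# The columns of the result are the *values* found in "b", so a lazy engine
# cannot know the output schema without first evaluating column "b".
wide = pd.pivot(df, index="a", columns="b", values="v")
print(list(wide.columns))   # ['x', 'y']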

I think that if we agree that eager vs. lazy is (primarily) about performance then we can

  1. Rule out a .compute / .persist API as part of the standard
  2. Try to avoid APIs that aren't sensible in either a lazy or eager system


maartenbreddels avatar maartenbreddels commented on August 26, 2024 1

I would like to add that vaex sits somewhere in between:

import vaex
df = vaex.example()
# lazy element wise operations
df['y'] = df.x**2  # just the expression gets added
display(df['y'])  # will compute/show only the first and last 5 rows
# lazy filtering
dff = df[df['x'] > 1]  # no explicit compute() needed
# half lazy join
df.join(df, 'x', rsuffix='r', allow_duplication=True)  # eagerly calculates the lookup, lazily fetches the values

So in usage, it seems eager, but a lot is done lazily behind the scenes.

I think my point is that even if you have an eager-acting system, it can do a lot of things lazily, and it could build computational graphs/DAGs/plans behind the scenes to automatically generate pipelines, for instance (like what we do in Vaex).


wesm avatar wesm commented on August 26, 2024 1

In Ibis automatic lazy execution is configurable:

https://github.com/ibis-project/ibis/blob/master/ibis/expr/types.py#L25

If you __repr__ an expression object in a Jupyter notebook, for example, then it executes automatically (rather than having to type expr.execute()). I think it's important to support both modes, since you don't want your library firing off expensive queries to BigQuery by accident.
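
For concreteness, a hedged sketch of that switch (the connection and table names here are made up; the relevant flag is ibis.options.interactive):

import ibis

con = ibis.sqlite.connect('events.db')    # hypothetical database file
expr = con.table('events').count()

# Default: fully lazy. Displaying expr in a notebook shows the expression
# tree; nothing runs until .execute() is called.
ibis.options.interactive = False
expr.execute()

# Interactive mode: displaying expr (i.e. its __repr__) runs the query
# automatically - convenient for exploration, risky for expensive backends.
ibis.options.interactive = True
expr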


rgommers avatar rgommers commented on August 26, 2024 1

View vs copy is about semantics (it e.g. changes what a subsequent in-place operation does), so that's clearly something that must be defined.

I guess it depends a bit on what level of API we are talking about. sklearn partially works on C-pointer level, which will never be implementation independent.

Right, I think your questions are much more critical for low-level code. For a high level API it can still matter of course, however in many cases it doesn't. For example, Dask is lazy and does a pretty good job of providing APIs that match NumPy and Pandas; a lot of code consuming that API doesn't change when you use Dask.

For C/Cython, let's continue on that other thread where you brought it up; there was more discussion there.

As the consumer of the result of a dataframe query, why is it important to know these details, outside of performance?

I think I get @amueller's concern: from the point of view of a library like scikit-learn, one wants to provide performance (memory, computation speed) optimal algorithms. And those are more often than not data structure implementation dependent.

I think the answer to that is: a standard API will still help, because it makes code run across different array/dataframe libraries. For really getting optimal performance, one will then still have to add custom code paths. Whether that is worth doing is a performance vs. maintenance trade-off.


rgommers avatar rgommers commented on August 26, 2024 1

@amueller I don't believe those will be a problem, but let's make sure. Can you point to a concrete algorithm you have in mind, and we'll assess if there's a potential issue with it given this eager/lazy discussion?


amueller avatar amueller commented on August 26, 2024 1

Part of the reason to let libraries control execution is that optimizing for each individual implementation can end up removing the benefit of the standard API. This is my main concern with exposing these execution semantics.

@devin-petersohn totally agreed! I'm just trying to see how this would work in practice. We all come from very different directions, and so I think it's important to get on the same page. I think lazy vs eager is probably one of the core issues in defining this API.

I read your suggestion as saying that arrays would always be eager while dataframes would not be.
That indeed would probably solve my concerns for sklearn. However, there are several lazy array libraries that probably wouldn't like that.


saulshanabrook avatar saulshanabrook commented on August 26, 2024 1

Is this the problem or do I misunderstand?

I agree that you don't necessarily need to materialize the full computation just to get access to a single check like this. But I think that's all a backend implementation detail and not anything this spec should concern itself with.

IMHO all that sklearn needs to know is that if x is some array that follows the spec, then code like if (x.mean() < 0): ... will work. So to make this work, any lazy backend will need to compute enough to return the right value for this control flow to proceed in Python.

Yes, it means that that call won't be lazy itself, but that's fine, because it's out of scope of this proposal to try to convert Python control flow to a lazy form.

We could possibly try to add lazy versions of control flow, so instead it could be written in a form that lazy backends could take advantage of, like (x.mean() < 0).if_(..., ...), but IMHO this is probably out of scope for at least the first draft here.

EDIT: So to clarify, I think all this spec needs to be clear about is the return values of the different methods and functions, and specify that things like __iter__ and __bool__ should be present to fulfil the spec. Then, sklearn can build its API around this spec and know it can use Python control flow. If something like Dask Arrays wants to fulfil the spec, they can, as long as they properly implement those methods. This would be different than the current behavior of Dask Array, or Ibis, at the moment, because they usually don't execute until you tell them to, not implicitly AFAIK. See the comment below for how they currently behave.


amueller avatar amueller commented on August 26, 2024 1

As long as the framework / contract handles triggering the computation in these places then I'm happy :) It might leave a couple of sklearn related issues but they are probably not broadly important.


rgommers avatar rgommers commented on August 26, 2024

This is a great point @datapythonista, and type casting behaviour is probably the best example of where you want different semantics for the same functionality.

Eager vs lazy really is a separate topic, I think, that's independent of the API (it's about performance, not semantics). Also for developing production pipelines, you want to develop using eager mode - it's simply the more intuitive way to develop code. And then you want to flip a switch to get lazy behaviour for possible performance optimizations in production. This is exactly what the PyTorch/Tensorflow experience showed (Tensorflow 2 switched to eager by default, and both libraries have a graph-compiled mode for production use cases). It's completely analogous for dataframes, except the libraries are a little less mature. At this point I'd suggest keeping eager/lazy out of this topic. @teoliphant volunteered to write up a summary of eager/lazy as a standalone topic.


datapythonista avatar datapythonista commented on August 26, 2024

I mostly agree, but I think the API may be affected. With an example:

import sql

query = sql.connect('foo.db')

query = query.select('*')
query = query.from_('table1')
query.execute()

This is obviously a lazy API. In eager mode the line query = query.select('*') is meaningless. So, if you want to make this eager, the API needs to be changed.

So, IMO, a decision needs to be made before defining an API among:

  • Lazy
  • Eager
  • Hybrid (the same syntax can be executed both in lazy and eager modes)
  • Mixed (some things are executed in eager mode, some in lazy mode, see @maartenbreddels comment below)

Maybe we're already assuming hybrid is the way to go. But I don't think it's the only option (not saying it's not the best).


maartenbreddels avatar maartenbreddels commented on August 26, 2024

Interesting to hear from Devin & Wes. I wasn't totally sure we could get away with it, but if 3 libraries are doing this already (Vaex, Modin and Ibis), I guess it has been demonstrated to be possible.

Then I guess the next question is: should lazy execution be something we consider explicitly in the API design, or can it be an implementation detail of the libraries?


wesm avatar wesm commented on August 26, 2024

This is one of the central questions of this whole project: does the API expose execution semantics or not? The pandas API in general exposes eager execution semantics. This is a significant decision with many implications. My design work with Ibis was an exploration of a data frame library without exposing any eager execution semantics.

In R people use both kinds of APIs, ones that expose execution semantics (base R, data.table) and ones that don't (dplyr).


devin-petersohn avatar devin-petersohn commented on August 26, 2024

I don't necessarily agree that the pandas API itself is what exposes eager execution semantics; instead it is the implementation. It is not too difficult to intercept API calls under __getattribute__ and build a DAG (this is what Modin does). Doing a print(df.expensive().head()), users don't care if the entire expensive() call is fully executed before they see the top 5 lines, especially in EDA/data cleaning.
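
As an illustration only (this is not Modin's actual implementation), here is a minimal sketch of that interception idea, using __getattr__ for brevity: method calls are recorded rather than executed, and the recorded chain only runs when something concrete such as __repr__ is requested.

import pandas as pd

class DeferredFrame:
    """Toy lazy wrapper: records method calls, executes them on __repr__."""
    def __init__(self, df, ops=None):
        self._df = df
        self._ops = ops or []              # list of (method name, args, kwargs)

    def __getattr__(self, name):
        def record(*args, **kwargs):
            # Return a new deferred frame with this call appended to the plan.
            return DeferredFrame(self._df, self._ops + [(name, args, kwargs)])
        return record

    def _materialize(self):
        out = self._df
        for name, args, kwargs in self._ops:
            out = getattr(out, name)(*args, **kwargs)
        return out

    def __repr__(self):
        # Only a "concrete" request like printing triggers execution.
        return repr(self._materialize())

lazy = DeferredFrame(pd.DataFrame({"a": [3, 1, 2]})).sort_values("a").head(2)
print(lazy)   # the recorded sort_values/head chain runs here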

I don't think we should support an explicit execute or compute or persist in the API. It leaks too much to the user. I would prefer to have full control over when things are executed, not rely on the user to give their best guess about when there is enough information. Ideally users should not have to think about execution, it should just work.


wesm avatar wesm commented on August 26, 2024

It is not too difficult to intercept API calls under __getattribute__ and build a DAG (this is what Modin does).

IMHO, just because you can do this does not mean that you should. If the barrier to entry of implementing the API is a significant amount of metaprogramming (regardless of whether there is an example implementation of it or not), that does not strike me as necessarily a good thing.


aregm avatar aregm commented on August 26, 2024

I don't think we should support an explicit execute or compute or persist in the API. It leaks too much to the user. I would prefer to have full control over when things are executed, not rely on the user to give their best guess about when there is enough information. Ideally users should not have to think about execution, it should just work.

@devin-petersohn What about the pipeline execution? Don't you think that having the API that allows more granular control, or even execution scheduling of various portions of DAG is beneficial - should the DAG be exposed as a work item, like, e.g. in Airflow?


devin-petersohn avatar devin-petersohn commented on August 26, 2024

IMHO, just because you can do this does not mean that you should. If the barrier to entry of implementing the API is a significant amount of metaprogramming (regardless of whether there is an example implementation of it or not), that does not strike me as necessarily a good thing.

@wesm I think you misunderstand the point I was trying to make. The __getattribute__ approach I mention is just one way to handle building the DAG, you can also handle this on an API to API basis (as @maartenbreddels mentioned he does with Vaex) or implement each API to add itself to the DAG or be completely eager. My point was the pandas API itself does not imply eager.

@devin-petersohn What about the pipeline execution? Don't you think that having the API that allows more granular control, or even execution scheduling of various portions of DAG is beneficial - should the DAG be exposed as a work item, like, e.g. in Airflow?

@aregm I am imagining each dataframe system as a black box, similar to SQL systems. Exposing the actual DAG also comes with a new set of requirements and would require some additional API standards. I expect there is another can of worms involved in exposing this level of execution detail as well because some systems may not want to do it at all. I'm interested to hear your thoughts.


wesm avatar wesm commented on August 26, 2024

My point was the pandas API itself does not imply eager.

I don't know about this. By the same logic, no library has eager evaluation so long as API calls can be intercepted and translated into an AST.

Consider this pandas code

df[col][df[col].isnull()] = 5
result = df.groupby(col)[col].sum()

Yes, you can transform this into:

df <- Mutate(
  df,
  AssignWhere(
    ColumnRef(df, col),
    IsNull(ColumnRef(df, col)),
    Literal(5)
  )
)

result <- GroupBy(
  df, ColumnRef(df, col),
  Aggregate(ColumnRef(df, col)))

But was that the intention of the library creators? The execution semantics (especially when dealing with very large data sets) influence a great deal how people write and think about the code.


maartenbreddels avatar maartenbreddels commented on August 26, 2024

just because you can do this does not mean that you should

I agree those two are separate questions. The can question is answered with yes.

I don't think we should support an explicit execute or compute or persist in the API.

For a data-science user, I agree. As a programmer/engineer user I am not sure which is the right way.


devin-petersohn avatar devin-petersohn commented on August 26, 2024

The "can" vs "should" discussion is a bit of a derailment, not sure why we went there. This is not the place to discuss the merits of how implementations handle things, and was an illustrative example. There's a better place for this type of discussion.

I don't know about this. By the same logic, no library has eager evaluation so long as API calls can be intercepted and translated into an AST.

Yes, this is my point. API and the underlying evaluation should be distinct and separate. It's not that no library has eager evaluation, it's that the API should not imply anything about execution.

But was that the intention of the library creators? The execution semantics (especially when dealing with very large data sets) influence a great deal how people write and think about the code.

You are the library creator in this case, so I will not try to guess your intention 😄. I think intention is not really the point. I think you make a great point here about how the scale of data influences how people write and think about code. I think this is a signal that there has been a failure in the design of the APIs of recent data systems. When I write SQL it will run on 10 rows or 1 trillion rows, I don't think about the size of the data in constructing the query. This is something I think modern data systems have mostly gotten wrong, that deep details of execution and data layout are pushed onto the user as points that they individually have to optimize for their particular workflow.


rgommers avatar rgommers commented on August 26, 2024

I don't think we should support an explicit execute or compute or persist in the API.

For a data-science user, I agree. As a programmer/engineer user I am not sure which is the right way.

As a general point: we don't have to figure out every last issue at once. I'd definitely punt on this in the first draft of the spec. One can use execute/compute methods, or decorators, or context managers, or whatever to express you want to do something lazy, or JIT-compiled, or on a GPU, or whatever. There'll be a number of such "out of scope for now" things.

the API should not imply anything about execution.

I think there's general agreement on adopting this principle.


saulshanabrook avatar saulshanabrook commented on August 26, 2024
  1. Rule out a .compute / .persist API as part of the standard
  2. Try to avoid APIs that aren't sensible in either a lazy or eager system

@TomAugspurger I like that approach!

If we assume number 1, then this seems to suggest a number of things to avoid which would be awkward in lazy mode, addressing your number 2:

  1. Avoid returning concrete Python types (int, list, bool, etc.); instead return protocols/duck-typed objects. Otherwise, we have to materialize to concrete Python types instead of keeping things lazy. So instead of saying "ndarray.shape returns a tuple of integers" we could say "ndarray.shape returns an object with a __getitem__ that takes and returns integers and a __len__" (see the sketch below). Obviously this would need to be a bit more complete.
  2. Being clear about view/copy semantics. This would help for @wesm's example. So we should be clear for every argument to a function whether it is mutated or copied.
  3. Building in control flow, so we can avoid Python control flow. This is already the case in lots of Pandas and NumPy (i.e. np.where), so it shouldn't really be a problem. The point being that any Python control flow breaks lazy mode, because you cannot override it.

Anything else I am missing?
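
A hedged sketch of what point 1 could look like for .shape, using a hypothetical LazyShape helper: consumers only rely on __len__ and __getitem__, so a lazy implementation is free to defer computing the actual sizes until they are asked for.

class LazyShape:
    """Hypothetical shape object: behaves enough like a tuple of ints
    (len() and indexing) without forcing the sizes to be known up front."""
    def __init__(self, ndim, compute_dim):
        self._ndim = ndim
        self._compute_dim = compute_dim    # callable: axis -> int, run on demand

    def __len__(self):
        return self._ndim

    def __getitem__(self, axis):
        return self._compute_dim(axis)     # only here is the size computed

shape = LazyShape(2, lambda axis: (1_000_000, 20)[axis])
print(len(shape), shape[1])   # 2 20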


amueller avatar amueller commented on August 26, 2024

I just want to make sure I follow the conversation, if we're talking about 'user' here, who is that?

As an author of a library consuming the dataframe, I feel that I need to know about the execution semantics to write efficient code.

When I write SQL it will run on 10 rows or 1 trillion rows, I don't think about the size of the data in constructing the query. This is something I think modern data systems have mostly gotten wrong, that deep details of execution and data layout are pushed onto the user as points that they individually have to optimize for their particular workflow.

I am very inexperienced with SQL systems, but when I write numerical code, the exact opposite is true.


saulshanabrook avatar saulshanabrook commented on August 26, 2024

I just want to make sure I follow the conversation, if we're talking about 'user' here, who is that?

Any code or library that consumes a dataframe I believe.

As an author of a library consuming the dataframe, I feel that I need to know about the execution semantics to write efficient code.

When you say "execution semantics" could you give an example? To me, the user (in this case you, a library author) should depend on the semantics of the operation, but should not on it having particular performance characteristics, because those could be very implementation dependent.


amueller avatar amueller commented on August 26, 2024

Well, I would want to know what is lazy and when something gets evaluated, for one, or whether something is a view or a copy, or potentially peak memory usage. I guess it depends a bit on what level of API we are talking about. sklearn partially works on C-pointer level, which will never be implementation independent.

However, there are parts (like scaling and one-hot encoding) that could reasonably be implemented on an abstract API. But even then, the implementation will probably depend quite a lot on the performance characteristics.


devin-petersohn avatar devin-petersohn commented on August 26, 2024

Well, I would want to know what is lazy and when something gets evaluated, for one, or whether something is a view or a copy, or potentially peak memory usage.

I would like to understand why you want know these things. As the consumer of the result of a dataframe query, why is it important to know these details, outside of performance?


amueller avatar amueller commented on August 26, 2024

Honestly, I have a hard time even conceptualizing an iterative algorithm that uses lazy evaluation. If you look at tensorflow or pytorch, the optimizers work on the computational graph, i.e. they deeply inspect the internals and computational structure.

Tensorflow and pytorch basically implement one optimizer: stochastic gradient descent. Scikit-learn implements maybe 50 or so. If you have examples of how it looks like to implement iterative optimization with lazy data structures, I'd love to see it.

@rgommers sorry didn't see your reply, needed to reload the page.
The low-level question is better discussed in the other thread maybe, but the question about how to write / think about iterative algorithms is somewhat separate to me. If people think it's possible to write this without knowing what triggers a compute in a lazy graph, I'm happy to be convinced, I just have a hard time wrapping my head around it.


rgommers avatar rgommers commented on August 26, 2024

The low-level question is better discussed in the other thread maybe, but the question about how to write / think about iterative algorithms is somewhat separate to me. If people think it's possible to write this without knowing what triggers a compute in a lazy graph, I'm happy to be convinced, I just have a hard time wrapping my head around it.

Yes I think you're right, iterative algorithms are a bit out of the scope of this discussion - you can't just make those lazy. That mainly applies to a subset of array-based code typically written by library authors though; most end user code for both arrays and dataframes, and a good amount of library code as well, can be lazy just fine (otherwise Dask wouldn't get away with just copying the NumPy and Pandas APIs for example).


amueller avatar amueller commented on August 26, 2024

Hm, saying iterative algorithms are out of scope basically means most downstream libraries for numpy are out of scope, right? So you're basically saying any machine learning stuff is out of scope. If that's the framing that people want, I guess that's fine. Everybody just needs to be aware that this means sklearn is completely out, any probabilistic modelling and forecasting is out, any optimization, portfolios and planning etc...


saulshanabrook avatar saulshanabrook commented on August 26, 2024

So it seems like we want to hit a sweet spot with this proposal to enable lazy usage, but also support existing users who write their algorithms imperatively. And along @teoliphant's point about multiple APIs, a high and a low level one, it might make sense for the high level one to try to mirror almost exactly some subset of NumPy's semantics and the lower level one to expose some more functional primitives, to enable higher performance lazy usage.

If people think it's possible to write this without knowing what triggers a compute in a lazy graph, I'm happy to be convinced, I just have a hard time wrapping my head around it.

@amueller IMHO this is implicitly defined by Python, if we assume that users of this API will be evaluating it under a default Python interpreter, not through a tool like Numba or Pythran, which seems like a reasonable assumption. Any builtin dunder methods that must return Python types have to be "eager", i.e. __len__, __bool__, __iter__... Which means that if you have a lazy implementation, any of those methods needs to actually do enough computation to return an accurate result. If they do that, then you are free to continue using the non-functional algorithms you listed in sklearn on lazy backends, they just would end up evaluating quite a lot. There are some other methods, like __str__ or __repr__, that might not be specified in the standard, which means that lazy backends wouldn't have to evaluate on those if they did not want to.
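
To make that contract concrete, here is a minimal sketch with a hypothetical LazyScalar, where compute() stands in for running the backend's graph: a comparison can stay lazy, but Python's if statement calls __bool__, which must return a real bool and therefore forces evaluation.

class LazyScalar:
    """Hypothetical deferred scalar; _compute stands in for evaluating a graph."""
    def __init__(self, compute):
        self._compute = compute

    def __lt__(self, other):
        # The comparison itself can remain lazy: just record another node.
        return LazyScalar(lambda: self._compute() < other)

    def __bool__(self):
        # Python control flow needs a real bool, so laziness ends here.
        return bool(self._compute())

mean = LazyScalar(lambda: 0.25)        # e.g. the result of x.mean()
if mean < 0:                           # __lt__ stays lazy, __bool__ forces it
    print("negative")
else:
    print("non-negative")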

If there are functional ways of representing the iterative algorithms present in the array APIs we settle on, it might then be feasible for sklearn to move to them if they wanted a backend to able to more holistically compile the algorithm, but they wouldn't be forced to.

Honestly, I have a hard time even conceptualizing an iterative algorithm that uses lazy evaluation.

Forcing all users to switch to fully lazy versions of iterative algorithms is definitely out of scope for this group, but I did want to highlight some prior work here: https://futhark-lang.org/ and https://www.acceleratehs.org/index.html. Also see the work Google is doing in AutoGraph to convert imperative Python code into a functional XLA graph ("AutoGraph: Imperative-style Coding with Graph-based Performance"):

[screenshot: figure from the AutoGraph paper]

Stella Laurenzo, who is at Google, is also working along this line, trying to see how to build a Python frontend for a NumPy MLIR dialect ("npcomp - an aspirational MLIR-based NumPy compiler").

To take a step back, what we need is a way for users to express what they want (in this case sklearn) and for the backend/implementation to execute that in a way that makes sense for it. The "issue" is that the way users want to express things is through iterative algorithms using Python's built-in control flow. So from a high level, that means either a) stop relying on Python's control flow and express everything in functional form in the DSL of the library (how TF worked before AutoGraph, more similar to Futhark or Accelerate, but as a DSL embedded in Python), or b) find ways to translate Python's control flow into a form that the backend can control (Numba or AutoGraph), so that users can continue using Python control flow but have it be performant. Cython is a shortcut around option b) that works for a particular backend.

But at the moment, as I said above, I think we can avoid making that choice here, if we support an outlet valve/fallback to support the existing APIs that allow imperative execution.


amueller avatar amueller commented on August 26, 2024

So what you're saying is that the user/library should have some control over whether something is evaluated lazily, which I think would be the easiest solution for supporting iterative algorithms, but which seems to go against what @devin-petersohn was imagining.

I don't have a strong opinion (or expertise) to come up with solutions here, I just wanted to bring up the importance of iterative algorithms in my part of the data science world and that I'm not sure how that would relate to lazy evaluation.

It sounds to me like we do need to make some choice here on how much control the user has.


rgommers avatar rgommers commented on August 26, 2024

.... so that users can continue using Python control flow but have it be performant.

Like you said, that's definitely out of scope. This is a really hard problem where all solutions involve compilers and are immature for the foreseeable future. For a standard or even simply adoption in something like scikit-learn, we're talking many years (if ever).

It sounds to me like we do need to make some choice here on how much control the user has.

I think very little or nothing is needed in the API here. It just doesn't apply to iterative algorithms - those are all in Cython/C/C++/Fortran.


amueller avatar amueller commented on August 26, 2024

@rgommers there's plenty of iterative algorithms in scikit-learn in pure numpy. As I said above, if we don't allow iterative algorithms, we basically say scikit-learn is not in scope. So there is no adoption of the standard by scikit-learn [in the sense that there's basically nothing of sklearn left that could adopt the standard].


amueller avatar amueller commented on August 26, 2024

Let's say NMF with multiplicative updates:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/_nmf.py#L710

I guess what I don't really understand is whether computation of the stopping criterion will be lazy or not.

Another maybe simpler example would be gaussian mixture models (though the fit is actually in a parent class):
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/mixture/_gaussian_mixture.py#L434

We also have a pure numpy neural net (not super relevant in practice but maybe a good example as well):
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neural_network/_multilayer_perceptron.py#L625


rgommers avatar rgommers commented on August 26, 2024

Thanks.

I guess what I don't really understand is whether computation of the stopping criterion will be lazy or not.

It should not be - the lazy implementations normally know when they need to evaluate, and that then happens without any change to the API. (previous_error - error) / error_at_init < tol cannot be kept lazy for something like Dask.


amueller avatar amueller commented on August 26, 2024

For dask you need to explicitly add a sync point from what I remember, but that might have changed?


devin-petersohn avatar devin-petersohn commented on August 26, 2024

From my perspective, iterative algorithms are much more natural in arrays than in dataframes (though they are used). An API for going from dataframe API standard -> array API standard might solve this issue.

to_array_standard() would always trigger computation, we can define this in the spec. Of course this could be abused as well so some thought has to go into not incentivizing

type(df)(df.to_array_standard())

to manually trigger computation in purely lazy systems.

Part of the reason to let libraries control execution is that optimizing for each individual implementation can end up removing the benefit of the standard API. This is my main concern with exposing these execution semantics.


saulshanabrook avatar saulshanabrook commented on August 26, 2024

It should not be - the lazy implementations normally know when they need to evaluate, and that then happens without any change to the API.

Yeah, I agree.

So every method that must return a concrete Python type would trigger computation.

So for [this line](https://github.com/scikit-learn/scikit-learn/blob/a1cc566605e61bcac5cd9d9f676ca430b5f90a1d/sklearn/decomposition/_nmf.py#L831):

            if (previous_error - error) / error_at_init < tol:

It would end up calling __bool__ on the value emitted by the __lt__, which is the point at which it would force it to not be lazy anymore and would force computation.

For dask you need to explicitly add a sync point from what I remember, but that might have changed?

My assumption is then that Dask array would not fulfil the API as is, because it doesn't force computation, AFAIK, when using it in Python control flow. However, I imagine that there could be a compatible wrapper object written on top of the Dask array that does fulfil it.


devin-petersohn avatar devin-petersohn commented on August 26, 2024

So for [this line](scikit-learn/scikit-learn:sklearn/decomposition/_nmf.py@a1cc566#L831):

            if (previous_error - error) / error_at_init < tol:

It would end up calling __bool__ on the value emitted by the __lt__, which is the point at which it would force it to not be lazy anymore and would force computation.

I'm not sure I agree that it would force the values to be materialized. It can be much cheaper to know __lt__ than __eq__ and since the implementation has control over both, as long as they return booleans, materialization of any part of the result is not required or guaranteed. Some computation would be required, but materialization is not a prerequisite to either from my perspective.

Perhaps the problem here is that in performance critical code, you don't want to both evaluate some estimation for really fast __lt__ evaluation and evaluate a value later when you need it. Is this the problem or do I misunderstand?


saulshanabrook avatar saulshanabrook commented on August 26, 2024

Just to clarify, dask.array & dataframe will implicitly compute things when requested

Ah, thank you for clarifying this! I was incorrect above then, I updated my comment.

