Missing Data (dataframe-api, open, 18 comments)

data-apis commented on July 23, 2024

Comments (18)

rgommers commented on July 23, 2024

@markusweimer that's what this is about indeed; it would be good to be more explicit in the final description. "Not-a-number" (NaN) is pretty universally available; "missing data" here means "not recorded", and support for it is much more recent and patchier.

maartenbreddels commented on July 23, 2024

I think we can reasonably choose between NA, null, or missing as a general name for "missing" values.

I feel like 'null' is a bit strange in the context of numbers, since it reminds me of pointers. I think 'missing' is more fitting, though less neutral. But maybe that is what we want (to give it an explicit meaning).

In Vaex we defined isna(x) as isnan(x) | ismissing(x), where ismissing(x) flags missing values (implemented with masked arrays or Arrow arrays, which naturally have null bitmasks) and isnan(x) just follows the IEEE 754 standard. So isna is short for 'get rid of anything messy'.
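For illustration, here is a minimal sketch of that decomposition, using a NumPy masked array as a stand-in for a masked/Arrow column (the function names mirror the Vaex methods, but this is not the Vaex implementation):

import numpy as np

x = np.ma.array([1.0, np.nan, 3.0], mask=[False, False, True])

def ismissing(arr):
    # masked-out elements, analogous to an Arrow null bitmask
    return np.ma.getmaskarray(arr)

def isnan(arr):
    # IEEE 754 NaN check on the underlying float data
    return np.isnan(np.ma.getdata(arr))

def isna(arr):
    # 'get rid of anything messy': NaN or missing
    return isnan(arr) | ismissing(arr)

print(isna(x))  # [False  True  True]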

I strongly dislike using sentinels/special values for missing values in a library, since for integers there is basically no solution. This means you need to support a byte- or bitmask anyway to keep track of them. Mixing sentinels and missing values just makes life more complex.

I see NaN (just a float number) as orthogonal to missing values; the only connection they have in Vaex is through the convenience methods isna/countna/fillna, which follow the definition above. I also think having both NaN and missing values in a column can indicate different things, and a user should be able to distinguish between them.

TomAugspurger commented on July 23, 2024

Following up on a question from @apaszke on the call. This demonstrates why a typed NA might be desirable.

We have that datetime - datetime is a timedelta. But datetime - timedelta is a datetime.

If we have an untyped, scalar NA then you have to arbitrarily choose that datetime - NA interprets the NA as, say, a datetime, so the result is a timedelta.

In [17]: a  # datetime
Out[17]:
           A
0 2000-01-01
1 2000-01-01

In [18]: b  # datetime
Out[18]:
           A
0        NA
1 2000-01-01

In [19]: (a - b.iloc[0, 0]).dtypes
Out[19]:
A    timedelta64[ns]
dtype: object

But that loses the (I think desirable) property of knowing the result dtype of the operation (a - any_scalar_from_b) in advance: it would now depend on whether the particular scalar from b was NA or not.

Having a typed NA scalar like NA<datetime> would resolve this.
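A minimal sketch of what that could look like (hypothetical names and dtype rules, not an existing pandas API):

from dataclasses import dataclass

@dataclass(frozen=True)
class TypedNA:
    # a missing scalar that still carries its dtype
    dtype: str

# assumed dtype-inference rules for subtraction (illustration only)
SUB_RULES = {
    ("datetime64[ns]", "datetime64[ns]"): "timedelta64[ns]",
    ("datetime64[ns]", "timedelta64[ns]"): "datetime64[ns]",
}

def sub_result_dtype(left_dtype, right):
    right_dtype = right.dtype if isinstance(right, TypedNA) else right
    return SUB_RULES[(left_dtype, right_dtype)]

na_dt = TypedNA("datetime64[ns]")
print(sub_result_dtype("datetime64[ns]", na_dt))  # timedelta64[ns], independent of missingness

With this, the result dtype of a - any_scalar_from_b is knowable from the dtypes alone.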

teoliphant commented on July 23, 2024

There were many long discussions about missing data that pre-dated Pandas, from 1998 to 2011, while NumPy was being written and growing in popularity. There were several NEPs and mailing-list discussions that didn't result in broad agreement. No one was funded to work on this. I remember these debates well but could not facilitate them effectively, because I was either running a consulting company or leaving that company to start Anaconda. I do remember getting two particularly active participants in the discussion together to further the conversation. The output of these efforts was a document written by those two participants, Mark and Nathaniel, published here: https://numpy.org/neps/nep-0026-missing-data-summary.html. It goes into a lot of detail about the opportunities and challenges from the NumPy perspective.

I think it's very important that we understand much of the challenge they faced in coming to agreement about NumPy: changing an existing library, and working out all the details of what must be changed in the code, is much harder than proposing an API based on existing work.

Of course, for any reference to be relevant, it has to be used, and so it's not completely orthogonal. However, now there are many, many more array libraries and dataframe libraries. Our efforts here are to do our best to express the best API we can confidently describe and then work with projects to consume or produce these.

My personal conclusion about the missing data APIs: the problem actually rests in the fact that NumPy only created an approximate type system (dtypes) and did not build well on Python's type system.

A type system is what connects the bytes contained in a data structure to how those bytes should be interpreted by code. The sentinel concept is clearly a new kind of type (in ndtypes we called it an optional type). Even the masked concept could be considered a kind of type (if you consider the mask bits part of the element data, even though they are stored elsewhere). It is probably better, though, to consider a masked array as a separate container type that could be used for a dataframe with native support for missing data.

NumPy has a nascent type system, but it is not easily extended (though you can do it in C with some effort). The type-extension mechanism is very different from the builtin types, which leaves NumPy's types in a situation somewhat like Python 1.0 classes. If NumPy had a more easily extended type system, then we could have had many more experiments with missing data and would be farther along.

So, in my mind, the missing data problem is actually deeply connected to the "type" problem, which does not have a great solution in Python today. I have ideas and designs about how to fix this fundamentally (anyone want to fund me to fix it?). There is even quite a bit of code in the xnd, ndtypes, and mtypes repositories (some of which may be useful).

For the purposes of this consortium, however, I think we will have to effectively follow what Vaex is doing here (and it sounds like Pandas is heading the same way): have both NaN and NA, and leave it to libraries to comply with the standard.

TomAugspurger commented on July 23, 2024

Yes, we should make the distinction between NA and NaN clear.

There might be reason to support both within a single column. For example,

>>> a = DataFrame({"A": [0, 1, NA, np.nan]})
>>> b = DataFrame({"A": [0, 0, 0, 0]})
>>> a / b
DataFrame({"A": [nan, inf, NA, nan]})  # float dtype

0/0 is defined to be nan. We would be saying that NA / 0 is NA, on the principle that the result depends on the value the NA stands for, and is therefore itself unknown.

This has implications for other parts of the API: does DataFrame.dropna() drop just NA values? Or does it drop NaN values as well?

There is discussion of the NA vs. NaN distinction at pandas-dev/pandas#32265. In particular, cudf developers have reported that some users appreciate being able to store both NaN and NA values within a single column: pandas-dev/pandas#32265 (comment).

markusweimer commented on July 23, 2024

Do others see value in different kinds of missing? E.g. not_recorded, which indicates the data was not present at the source, vs. NaN, which would indicate that a computation returned an invalid result.

amueller commented on July 23, 2024

What's NA * NaN?

I agree that having both might be useful, though I'm not entirely decided on whether it's necessary. They do have different semantics, but the cases where the different semantics change the outcome are pretty rare, right?

Having both in a single column will certainly make life harder for downstream packages, because now we might need to deal with two special cases everywhere. Unless they are both mapped to the same value at the numeric level and only differ at the dataframe level?

TomAugspurger commented on July 23, 2024

What's NA * NaN?

NA. Edit: I'm not actually sure about this. Pandas' current implementation (returning NA) may not be valid.

They do have different semantics, but the cases where the different semantics change the outcome are pretty rare, right?

What do you mean by "change the outcome", or rather, how does that differ from "different semantics"? To me those sound the same :) (e.g. np.nan > 0 is False, while NA > 0 is NA, which sounds like both different semantics and a different outcome).
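For concreteness, these behaviors with today's numpy/pandas scalars (illustrative of the discussion, not a normative spec):

import numpy as np
import pandas as pd

print(np.nan > 0)      # False -- NaN comparisons are IEEE-defined
print(pd.NA > 0)       # <NA>  -- NA comparisons propagate
print(pd.NA * np.nan)  # <NA>  -- pandas' current (debated) answer to NA * NaN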

Having both in a single column will certainly make life for downstream packages harder because now we might need to deal with two special cases everywhere. Unless they are both mapped to the same at the numeric level and only are different on the dataframe level?

Indeed, handling both might be difficult, or at least requires some thought. Scikit-Learn is choosing to treat pandas.NA as np.nan at the boundary in check_array: scikit-learn/scikit-learn#16508. This results in a loss of precision for large integers, but that might be the right choice for that library.
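A sketch of that kind of boundary conversion, using pandas' public to_numpy rather than scikit-learn's actual check_array internals:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, pd.NA], dtype="Int64")  # nullable integer column
arr = s.to_numpy(dtype="float64", na_value=np.nan)
print(arr)  # [ 1.  2. nan]

Integers above 2**53 cannot be represented exactly as float64, which is the precision loss mentioned above.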

amueller commented on July 23, 2024

The NA * NaN question was a bit tongue-in-cheek, though if we want to allow both types inside a column then it actually needs an answer, and there are probably other situations that are at least as unclear.

Ok, good example of different behavior. By "semantics" I meant semantics in user code, i.e. even if both had exactly the same behavior within a library, it might still be useful to have both so that users could distinguish them in their own code.

maartenbreddels commented on July 23, 2024

Is there an agreement on what 'NA' means? Does it mean 'not available'?

I would say the meaning of 'missing' is the least ambiguous (which has its pros and cons). NaN also has a very explicit meaning; the meanings of 'null' and 'NA' are less clear to me.

TomAugspurger commented on July 23, 2024

Yes, I think "Not available".

saulshanabrook commented on July 23, 2024

FWIW, when I hear NA I think 'not applicable', but maybe I am just not used to the domain-specific usage here.

amueller commented on July 23, 2024

I strongly dislike using sentinels/special values for missing values in a library, since for integers there is basically no solution. This means you need to support a byte- or bitmask anyway to keep track of them. Mixing sentinels and missing values just makes life more complex.

I'm not sure I follow this part, could you please elaborate?

datapythonista commented on July 23, 2024

I see NaN as a particular value of float. My understanding is that the advantage is that CPUs understand it and can operate with it, so I assume operating with it should be much faster than using a boolean mask.

I guess for projects like numpy that can be important, but I don't think it is worth the trouble for dataframes. My opinion is that it'd make life easier for users if NaN values were automatically converted to NA in the boolean mask, and users of dataframes could forget about them.

maartenbreddels commented on July 23, 2024

I'm not sure I follow this part, could you please elaborate?

Since integers don't have a special value like NaN, you cannot 'abuse' NaN as a missing value. You could use a special (sentinel) value, but that would cause trouble: if you happen to have data which includes that special value, you suddenly have an accidental missing value.

A user might be able to get away with that I think, but having that solution as a building block for an ecosystem to build on does not sound like a good plan.

I think that means you have to keep track of a mask (bit- or bytemask). And I guess that's also what databases do; they will not forbid you from using some particular integer value because it's reserved as a 'missing value sentinel' (correct me if I'm wrong, but I'd be surprised).

On top of that, NaN and missing values can have different meanings. A missing value indicates exactly that, while a NaN could mean a measurement gone wrong, a math 'error', etc. NaN and missing values are fundamentally different things, although one could group them (say, call them NA).

I think I fully agree with Apache Arrow's idea: each array can have missing values, and in that case it has a bitmask attached to the array, but the bitmask is optional. If you compute on such an array, I think the plan is to just brute-force compute over all the data, ignoring that the array has missing values, since it's all vectorized/SIMD down in the loops. Apart from that, the optional bitmasks can be combined in whatever way the algorithm thinks is required. I think performance-wise that should be quite efficient.
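A quick illustration of the optional validity bitmask with pyarrow (assumes a reasonably recent pyarrow; note that NaN and null stay distinct):

import pyarrow as pa

arr = pa.array([1.0, None, float("nan"), 4.0])
print(arr.null_count)                 # 1 -- only the None is missing
print(arr.is_null())                  # [false, true, false, false]
print(arr.is_null(nan_is_null=True))  # [false, true, true, false] -- an isna-like view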

Note that using a bitmask is not that memory-consuming. Say a 1-billion-element column of float64 uses 1e9 × 8 B ≈ 8 GB; a full mask (1 bit per element) would require 1e9 / 8 B ≈ 125 MB extra, i.e. 1/64 ≈ 1.6%.

My opinion is that it'd make life easier for users if NaN values were automatically converted to NA in the boolean mask, and users of dataframes could forget about them.

I both agree and disagree here: I think you should be able to distinguish between them, but also to have the "I don't care, so throw away any data that's NA/null/missing/NaN, whatever" option. This is the reason why I chose, in Vaex, to have isna/isnan/ismissing and countna/countnan/countmissing, etc. I usually use isna, but sometimes I need isnan or ismissing.

maartenbreddels commented on July 23, 2024

Thanks @TomAugspurger, that's really useful to keep in mind.

@teoliphant said it should be possible for ndarrays to have extra data attached to them, like a mask (bit or byte). If normal ndarrays were more like numpy's masked arrays and kept track of their masks, and numpy scalars also held this information (a single bit or byte), we could have masked scalar values. You wouldn't need a sentinel value.

I think masked arrays have to become a native part of numpy someday (not an add-on), instead of solving this in the DataFrame layer (recognizing that the numpy route is probably an order of magnitude more difficult to get off the ground).

gatorsmile commented on July 23, 2024

FYI, PySpark follows the NULL semantics defined in ANSI SQL. We documented the behavior at http://spark.apache.org/docs/latest/sql-ref-null-semantics.html
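For example, a minimal sketch of that three-valued logic as seen from PySpark (assumes a local pyspark installation):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([(1,), (None,)], "x int")
df.selectExpr("x", "x > 0 AS gt_zero", "x IS NULL AS is_null").show()
# For the NULL row, x > 0 evaluates to NULL (unknown), not false;
# only IS NULL / IS NOT NULL return a definite true/false.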
