Comments (18)
@markusweimer that's what this is about indeed; it would be good to be more explicit in the final description. "Not-a-number" (`nan`) is pretty universally available; "missing data" here means "not recorded", and support for it is much more recent and patchier.
from dataframe-api.
I think we can reasonably choose between `NA`, `null`, or `missing` as a general name for "missing" values.
I feel like 'null' is a bit strange in the context of numbers, since it reminds me of pointers. I think 'missing' is more fitting, though less neutral. But maybe that is what we want (giving it an explicit meaning).
In Vaex we defined `isna(x)` as `isnan(x) | ismissing(x)`, where `ismissing(x)` means missing values (implemented as masked arrays or Arrow arrays, which naturally have null bitmasks) and `isnan(x)` just follows the IEEE standard. So `isna` is short for 'get rid of anything messy'.
I strongly dislike using sentinels/special values for missing values in a library, since for integers there is basically no solution. This means you need to support a byte- or bitmask anyway to keep track of them. Mixing sentinels and missing values just makes life more complex.
I see NaN (just a float number) as orthogonal to missing values; the only connection they have in Vaex is through the convenience methods `isna`/`countna`/`fillna`, which follow the definition above. I also think having both NaN and missing values in a column can indicate different things, and a user should be able to distinguish between them.
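As a minimal sketch (not Vaex's actual implementation), the definitions above can be written over a pair of arrays: float values plus an explicit missing-mask, with `isna` as the union of `isnan` and `ismissing`:

```python
import numpy as np

# Sketch of the isna/isnan/ismissing split: a column stored as float values
# plus an explicit missing-mask (e.g. an inverted Arrow validity bitmap).
values = np.array([1.0, np.nan, 3.0, 4.0])
missing = np.array([False, False, True, False])

def isnan(v, m):
    # Just follows IEEE 754: true only for actual NaN payloads.
    return np.isnan(v)

def ismissing(v, m):
    # True where the mask says the value was never recorded.
    return m

def isna(v, m):
    # "Get rid of anything messy": NaN or missing.
    return isnan(v, m) | ismissing(v, m)

print(isna(values, missing).tolist())  # [False, True, True, False]
```

The point of keeping the two predicates separate is exactly the one made above: NaN and missing can then still be distinguished when a user cares.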
from dataframe-api.
Following up on a question on the call by @apaszke. This demonstrates why a typed `NA` might be desirable.
We have that `datetime - datetime` is a `timedelta`, but `datetime - timedelta` is a `datetime`.
If we have an untyped, scalar NA, then you have to arbitrarily choose that `datetime - NA` interprets the `NA` as, say, a datetime, so the result is a timedelta.
```
In [17]: a  # datetime
Out[17]:
           A
0 2000-01-01
1 2000-01-01

In [18]: b  # datetime
Out[18]:
           A
0         NA
1 2000-01-01

In [19]: (a - b.iloc[0, 0]).dtypes
Out[19]:
A    timedelta64[ns]
dtype: object
```
But that loses the (I think desirable) property of knowing the result dtype of the operation `a - any_scalar_from_b`. It would now depend on whether the particular scalar from `b` was NA or not.
Having a typed NA scalar like `NA<datetime>` would resolve this.
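To make the point concrete, here is a purely hypothetical sketch; `TypedNA` and the promotion table are illustrative assumptions, not an existing API. With a typed NA, result-dtype inference no longer depends on whether the scalar happens to be missing:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TypedNA:
    dtype: str  # the dtype this missing scalar would have had

# Minimal dtype-promotion rules for the subtraction example above.
SUB_RESULT = {
    ("datetime", "datetime"): "timedelta",
    ("datetime", "timedelta"): "datetime",
}

def sub_dtype(left_dtype, right):
    # The result dtype is computable from types alone, even for a missing value.
    right_dtype = right.dtype if isinstance(right, TypedNA) else right
    return SUB_RESULT[(left_dtype, right_dtype)]

assert sub_dtype("datetime", TypedNA("datetime")) == "timedelta"
assert sub_dtype("datetime", TypedNA("timedelta")) == "datetime"
```

An untyped NA would force `sub_dtype` to guess the right-hand dtype, which is exactly the ambiguity described above.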
from dataframe-api.
There were many long discussions about missing data that pre-dated Pandas, from 1998 to 2011, while NumPy was being written and emerging in popularity. There were several NEPs and mailing list discussions that didn't result in broad agreement. No-one was funded to work on this. I remember these debates well but could not facilitate them effectively because I was either running a consulting company or leaving that company to start Anaconda. I do remember getting two particularly active participants in the discussion together to further the conversation. The output of these efforts was a document by the two participants, Mark and Nathaniel, published here: https://numpy.org/neps/nep-0026-missing-data-summary.html, which goes into a lot of detail about the opportunities and challenges from the NumPy perspective.
I think it's very important that we understand that much of the challenge they faced in coming to agreement about NumPy is that changing an existing library and working out all the details of what must be changed in the code is much harder than proposing an API based on existing work.
Of course, for any reference to be relevant, it has to be used, and so it's not completely orthogonal. However, now there are many, many more array libraries and dataframe libraries. Our efforts here are to do our best to express the best API we can confidently describe and then work with projects to consume or produce these.
My personal conclusion about the missing data APIs: the problem actually rests in the fact that NumPy only created an approximate type system (dtypes) and did not build well on Python's type system.
A type system is what connects the bytes contained in a data structure to how those bytes should be interpreted by code. The sentinel concept is clearly a new kind of type (in ndtypes we called it an optional type). Even the masked concept could be considered a kind of type (if you consider the mask bits part of the element data, even though they are stored elsewhere). It is probably better, though, to consider a masked array as a separate container type that could be used for a dataframe with native support for missing data.
NumPy has a nascent type system, but it is not easily extended (though you can do it in C with some effort). The type-extension system is very different from the builtin types, which makes NumPy's types somewhat like Python 1.0 classes. If NumPy had a more easily extended type system, we could have had many more experiments with missing data and would be farther along.
So, in my mind, the missing data problem is actually deeply connected to the "type" problem, which currently has no great solution in Python. I have ideas and designs about how to fix this fundamentally (anyone want to fund me to fix it?). There is even quite a bit of code in the xnd, ndtypes, and mtypes repositories (some of which may be useful).
For the purposes of this consortium, however, I think we will have to effectively follow what Vaex is doing here (and it sounds like Pandas is heading the same way): have both NaN and NA, and leave it to libraries to comply with the standard.
from dataframe-api.
Yes, we should make the distinction between NA and NaN clear.
There might be reason to support both within a single column. For example,
```
>>> a = DataFrame({"A": [0, 1, NA, np.nan]})
>>> b = DataFrame({"A": [0, 0, 0, 0]})
>>> a / b
DataFrame({"A": [nan, inf, NA, nan]})  # float dtype
```
`0 / 0` is defined to be `nan`. We would be saying that `NA / 0` is NA, by the principle that the result depends on the NA value.
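These two rules can be checked directly; a small sketch using NumPy floats and pandas' `pd.NA` (assuming pandas' NA-propagation semantics):

```python
import numpy as np
import pandas as pd

# IEEE floats: 0/0 -> nan and 1/0 -> inf (warnings suppressed);
# pd.NA propagates through arithmetic, including division by zero.
with np.errstate(divide="ignore", invalid="ignore"):
    out = np.array([0.0, 1.0]) / 0.0

assert np.isnan(out[0])        # 0 / 0 -> nan
assert np.isposinf(out[1])     # 1 / 0 -> inf
assert (pd.NA / 0) is pd.NA    # NA / 0 -> NA, per the propagation rule
```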
This has implications for other parts of the API: does `DataFrame.dropna()` drop just `NA` values, or does it drop `NaN` values as well?
Discussion on the NA vs. NaN distinction is at pandas-dev/pandas#32265. In particular, it has been reported that some cudf users appreciate being able to store both NaN and NA values within a single column: pandas-dev/pandas#32265 (comment).
from dataframe-api.
Do others see value in different kinds of missing? E.g. `not_recorded`, which indicates the data was not present at the source, vs. `NaN`, which would indicate a computation returned an invalid result.
from dataframe-api.
What's `NA * NaN`?
I agree that having both might be useful, though I think I'm not entirely decided on whether it's necessary. They do have different semantics, but the cases where the different semantics change the outcome are pretty rare, right?
Having both in a single column will certainly make life for downstream packages harder, because now we might need to deal with two special cases everywhere. Unless they are both mapped to the same thing at the numeric level and only differ at the dataframe level?
from dataframe-api.
What's NA * NaN?
NA. Edit: I'm not actually sure about this; pandas' current implementation (returning NA) may not be valid.
They do have different semantics, but the cases where the different semantics change the outcome are pretty rare, right?
What do you mean by "change the outcome", or rather, how does that differ from "different semantics"? To me those sound the same :) (e.g. `np.nan > 0` is `False`, while `NA > 0` is `NA` sounds like both different semantics and different outcomes).
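That comparison difference is directly observable with pandas' `pd.NA` (assuming pandas semantics here; the standard itself need not match):

```python
import numpy as np
import pandas as pd

# NaN comparisons follow IEEE 754 and evaluate to False;
# pd.NA propagates -- "unknown" compared with anything stays unknown.
assert (np.nan > 0) == False
assert (pd.NA > 0) is pd.NA
```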
Having both in a single column will certainly make life for downstream packages harder because now we might need to deal with two special cases everywhere. Unless they are both mapped to the same at the numeric level and only are different on the dataframe level?
Indeed, handling both might be difficult, or at least requires some thought. Scikit-learn is choosing to treat `pandas.NA` as `np.nan` at the boundary in `check_array`: scikit-learn/scikit-learn#16508. This results in a loss of precision for large integers, but that might be the right choice for that library.
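A small sketch of that boundary conversion and the precision loss it implies (similar in spirit to `check_array`'s choice, not its actual code):

```python
import numpy as np
import pandas as pd

# Convert a nullable-integer column to a plain float64 array:
# NA becomes NaN, but integers above 2**53 lose precision as float64.
col = pd.array([1, 2**53 + 1, pd.NA], dtype="Int64")
as_float = col.to_numpy(dtype="float64", na_value=np.nan)

assert np.isnan(as_float[2])      # NA -> NaN
assert int(as_float[1]) == 2**53  # 2**53 + 1 is not representable in float64
```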
from dataframe-api.
The `NA * NaN` was a bit tongue-in-cheek, though if we want to have both types inside a column then this actually needs an answer, and there are probably other situations that are at least as unclear.
Ok, good example of different behavior. With semantics I meant more semantics in user code, i.e. even if both had exactly the same behavior within a library, it might still be useful to have both so that users could distinguish them in their code.
from dataframe-api.
Is there agreement on what 'NA' means? Does it mean 'Not available'?
I would say the meaning of 'missing' is the least ambiguous (which has its pros and cons); NaN also has a very explicit meaning, while the meanings of null and NA are less clear to me.
from dataframe-api.
Yes, I think "Not available".
from dataframe-api.
FWIW when I hear NA I think Not Applicable, but maybe I am just not used to the domain specific usage here.
from dataframe-api.
I strongly dislike using sentinels/special values for missing values in a library since for integers there is basically no solution. This means you need to support a byte or bitmask anyway to keep track of them. Mixing sentinels and missing values just makes life more complex.

I'm not sure I follow this part, could you please elaborate?
from dataframe-api.
I see `NaN` as a particular value of float. My understanding is that the advantage is that CPUs understand it and can operate with it, so performance should be much faster than having a boolean mask. I guess for projects like NumPy that can be important, but I don't think it is worth the trouble for dataframes. My opinion is that it'd make life easier for users if `NaN` values were automatically converted to `NA` in the boolean mask, and users of dataframes could forget about them.
from dataframe-api.
I'm not sure I follow this part, could you please elaborate?
Since integers don't have a special value like NaN, you cannot 'abuse' NaN as a missing value. You could use a special value, but that would cause trouble: if you happen to have data which includes that special value, you suddenly have an accidental missing value.
A user might be able to get away with that I think, but having that solution as a building block for an ecosystem to build on does not sound like a good plan.
I think that means you have to keep track of a mask (bit- or bytemask). And I guess that's also what databases do: they will not stop you from using some special integer value because it's reserved as a 'missing value sentinel' (correct me if I'm wrong, but I'd be surprised).
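The collision problem is easy to demonstrate; a sketch with a purely hypothetical sentinel choice of `-999`:

```python
import numpy as np

# An in-band sentinel can collide with legitimate data.
SENTINEL = -999
data = np.array([10, -999, 42])  # -999 here is *real* data in this column
looks_missing = data == SENTINEL
assert looks_missing.tolist() == [False, True, False]  # false positive!

# An out-of-band mask removes the ambiguity: missingness is stored
# separately from the values, so every int64 value stays usable.
mask = np.array([False, False, True])  # only the last element is missing
assert mask.tolist() == [False, False, True]
```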
On top of that, NaN and missing values can have different meanings. A missing value is just that, missing, as it indicates; a NaN could mean a measurement gone wrong, a math 'error', etc. NaN and missing values are fundamentally different things, although one could group them (say, call them NA).
I think I fully agree with Apache Arrow's idea. Each array can have missing values, and in that case it has a bitmask attached to the array, but it's optional. If you compute on this, I think the plan is to just brute-force compute over all the data (ignoring that the array has missing values), since it's all vectorized/SIMD down in the loops.
Apart from that, the optional bitmasks can be combined in whatever way the algorithm thinks is required. I think performance-wise that should be quite efficient.
Note that using a bitmask is not that memory-consuming. Say a 1-billion-element column of float64 would use 1e9 × 8 B ≈ 8 GB; a full mask (1 bit per element) would require an extra 1e9 / 8 B ≈ 125 MB, i.e. 1/64 ≈ 1.5%.
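The overhead arithmetic checks out:

```python
# Verifying the bitmask-overhead figures: 1e9 float64 values vs. a
# 1-bit-per-element validity mask.
n = 1_000_000_000            # 1e9 elements
data_bytes = n * 8           # float64: 8 bytes each -> 8 GB
mask_bytes = n // 8          # 1 bit per element -> 125 MB
assert data_bytes == 8_000_000_000
assert mask_bytes == 125_000_000
assert mask_bytes / data_bytes == 1 / 64   # ~1.5% overhead
```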
My opinion is that it'd make life easier for the users if `NaN` values are automatically converted to `NA` in the boolean mask, and users of dataframes forget about them.

I disagree and agree here, because I think you should be able to distinguish between them, but also have the "I don't care, so throw away any data that's NA/null/missing/NaN, whatever" option.
This is the reason why I chose in Vaex to have `isna`/`isnan`/`ismissing` and `countna`/`countnan`/`countmissing` etc. I usually use `isna`, but sometimes I need `isnan` or `ismissing`.
from dataframe-api.
Thanks @TomAugspurger, that's really useful to keep in mind.
@teoliphant said it should be possible for ndarrays to have extra data added to them, like a mask (bit or byte). If normal ndarrays were more like numpy masked arrays and kept track of their masks, and numpy scalars also held this information (a single bit or byte), we could have masked scalar values. You wouldn't need a sentinel value.
I think masked arrays in numpy have to happen someday (built in, not added on), instead of solving it in the dataframe layer (recognizing that's probably an order of magnitude more difficult to get off the ground).
from dataframe-api.
FYI, PySpark follows the NULL semantics defined in ANSI SQL. We documented our behavior at http://spark.apache.org/docs/latest/sql-ref-null-semantics.html
from dataframe-api.