We should decide if this is a bug or a feature. The main question is how much we requi

Indexing a DataFrame removes column groupings about dataframes.jl HOT 19 CLOSED

juliadata commented on July 21, 2024

Indexing a DataFrame removes column groupings

from dataframes.jl.

Comments (19)

johnmyleswhite commented on July 21, 2024

Is there a definite case in which one wouldn't want the groupings to persist?

doobwa commented on July 21, 2024

We probably wouldn't want them to persist if the indexed DataFrame no longer includes a particular column belonging to one of the groups.

johnmyleswhite commented on July 21, 2024

That was my original reaction, but I'm not as sure now. Suppose you originally had columns January1900, February1900, ..., December1999. You might have groupings of Winter, Spring, Summer and Fall. If you removed February, it seems like the Winter grouping could still be useful to have and would never be harmful to have. Basically, it seems valuable to keep the grouping if the grouping could be written using simple column-wise predicates: column C is in Group G iff predicate(column C) holds. That doesn't depend on the other members of the group. Or do we think grouping has more content than this?
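
A minimal sketch of the predicate idea, using only Base Julia and plain vectors of column names. The `is_winter` predicate and the shortened column list are assumptions made for illustration; nothing here is a DataFrames function.

```julia
# Hypothetical column names standing in for January1900 ... December1999.
colnames = ["January1900", "February1900", "March1900", "December1900"]

# Assumed predicate: a column belongs to "Winter" if its name starts with a winter month.
is_winter(name) = any(m -> startswith(name, m), ["December", "January", "February"])

# Membership is computed per column, so it never depends on the other members.
winter = filter(is_winter, colnames)

# Remove February; the remaining winter columns still satisfy the predicate.
remaining = filter(n -> n != "February1900", colnames)
winter_after = filter(is_winter, remaining)
```

Under this view a grouping is a rule rather than a stored list, so deleting one member leaves the rest of the grouping well defined.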

tshort commented on July 21, 2024

There are some corner cases:

1. If I delete a column that's in a grouping, does that delete the whole grouping or just remove that column from the grouping?
2. On cbind, what happens when two dfs have the same grouping?
3. What about indexing that ends up repeating columns, like df[:,[1,1,2,2,3:6]]? What happens to a grouping that includes column 1?
4. If we rbind two DataFrames, and one has a grouping and the other doesn't, what do you do?
5. If I do a grouping operation (by(df, ["colA", "colB"], [:sum])) on a DataFrame, do we try to maintain column groups on the answer? For some operations, it would make sense, but for others, it would not.

doobwa commented on July 21, 2024

My preference is that column groups are preserved if they can be used to index the DataFrame. For the above corner cases, this more or less implies the following behaviour:

1. deleting a member of a group deletes the group
2. cbind only includes unique groupings in the newly formed DataFrame
3. if all of the members of a group are present in the final indexed DataFrame, include the group
4. include the grouping on rbind
5. maintaining column groups in the answer if all of the members of the group are present in the answer

IMHO the above behaviour sounds quite useful. I have not, however, thought through how this would be implemented.

HarlanH commented on July 21, 2024

Hm, without actual workflows, I'm sorta guessing, but I have different instincts from Chris...

1. Deleting a member of a group retains the other members (as John suggested).
2. On cbind, use the same behavior as if you've got duplicate column names. Looks like we're using make_unique() to append numbers to repeated column names. I'd suggest doing the same for group names.
3. If indexing repeats columns, the group is preserved with the indexed columns, plus any column duplicates with names created by make_unique().
4. Include the grouping on rbind. Probably throw an error if you have the same group names but different columns in the groups.
5. Maintain column groups in the answer if any of the members of the group are present in the answer.

But again, I might change my mind after actually using this in anger...

doobwa commented on July 21, 2024

Quick clarification:

1. I meant: deleting a column that is a member of a group deletes the grouping (i.e. just the element of the Dict), but not the member columns.
2. Interesting.
3. Do you mind providing a concrete example? I think this would help us discuss the merits one way or another.
4. We agree on this one.
5. I prefer requiring all rather than any: if one of the members is not there, you are no longer able to index the DataFrame using that column group.

johnmyleswhite commented on July 21, 2024

To help us keep track of this discussion, here's a quick summary of the issues we've raised about the persistence of column groupings under various operations. For some of these, we don't seem to have definite answers yet, nor do we seem to have strong reasons for any of our individual first-reaction positions.

When are column groupings deleted after column deletion?

Deletion always takes place
Deletion takes place if any member of the previous group is missing
Deletion takes place if all members of the previous group are missing

It's not clear to me that we have good arguments for any of these. If we come to the conclusion that none of these causes harm, then I think we should just pick whichever is preferred by a majority of us.
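
To make the second and third options concrete, here is a rough sketch that models groupings as a plain `Dict` from group name to member column names. Both helper functions are hypothetical names invented for this illustration, not part of the package.

```julia
# Option "any": drop every grouping that contained the deleted column.
function delete_drop_group(groups::Dict{String,Vector{String}}, col::String)
    return Dict(k => v for (k, v) in groups if !(col in v))
end

# Option "all": shrink each grouping and drop it only once no members remain.
function delete_shrink_group(groups::Dict{String,Vector{String}}, col::String)
    shrunk = Dict(k => filter(c -> c != col, v) for (k, v) in groups)
    return Dict(k => v for (k, v) in shrunk if !isempty(v))
end

groups = Dict("odd_predictors" => ["x1", "x3"], "even_predictors" => ["x2"])

strict  = delete_drop_group(groups, "x3")    # odd_predictors is gone
lenient = delete_shrink_group(groups, "x3")  # odd_predictors keeps x1
```

The strict variant guarantees that indexing by a group always returns the full set of columns the group was defined with; the lenient variant keeps as many groupings as possible.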

What happens to groupings when we use cbind to combine df1 and df2?

We always delete all groupings when applying cbind.
If there is a grouping with name X in only one of the DataFrames, we keep it.
If there is a grouping with name X in both DataFrames, we delete it.
If there is a grouping with name X in both DataFrames, we use make_unique() to append numbers to the repeated group names of df2 until there is no repetition.

I don't see any strong objections to the last proposal, which was raised by Harlan.
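
One way the last proposal could work, sketched with plain vectors and Dicts. `unique_name` is a hypothetical stand-in for make_unique() (whose real signature is not shown in this thread), and `cbind_groups` is likewise an invented name. The sketch assumes column names are unique within each input.

```julia
# Hypothetical helper: append _1, _2, ... until `name` is not in `taken`.
function unique_name(name, taken)
    name in taken && begin
        i = 1
        while "$(name)_$i" in taken
            i += 1
        end
        return "$(name)_$i"
    end
    return name
end

# Combine column names and groupings, renaming clashes coming from the second table.
function cbind_groups(cols1, groups1, cols2, groups2)
    cols = copy(cols1)
    renamed = Dict{String,String}()   # column name in df2 => its name in the result
    for c in cols2
        new = unique_name(c, cols)
        renamed[c] = new
        push!(cols, new)
    end
    groups = Dict(k => copy(v) for (k, v) in groups1)
    for (g, members) in groups2
        # Group names are de-duplicated the same way as column names.
        groups[unique_name(g, keys(groups))] = [renamed[m] for m in members]
    end
    return cols, groups
end

cols = ["x1", "x2", "x3"]
g = Dict("odd_predictors" => ["x1", "x3"])
newcols, newgroups = cbind_groups(cols, g, cols, g)
```

The members of the renamed group follow the renamed columns, so each grouping in the result still refers to columns that exist under exactly those names.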

What happens to groupings when we use rbind to combine df1 and df2?

If there is a grouping with name X in one DataFrame, but not the other, we delete this grouping from the result.
If there is a grouping with name X in one DataFrame, but not the other, we propagate this grouping to the result.

I didn't see any arguments raised for either side. I'd say that we should decide this using a more general rule also applicable to column deletion: if there's no harm, should we try to keep as many groupings as possible?

What happens to groupings when we use column indexing with a repeated index?

If indexing repeats columns, the group is preserved with the indexed columns, plus any column duplicates with names created by make_unique().

I'm unclear about the exact results of this. In the new DataFrame are we asserting that the grouping that previously contained Column C should now contain two copies of Column C with different names?
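
For reference, one possible reading of that proposal, as a sketch only. It assumes duplicates are named with a numeric suffix (x1, x1_1, ...) and that every copy of a member column joins the group; `unique_name` and `index_with_repeats` are hypothetical helpers, not package functions.

```julia
# Hypothetical stand-in for make_unique(): append _1, _2, ... until the name is free.
function unique_name(name, taken)
    candidate = name
    i = 0
    while candidate in taken
        i += 1
        candidate = "$(name)_$i"
    end
    return candidate
end

# Index columns by position (repeats allowed) and carry groupings along.
function index_with_repeats(cols, groups, idx)
    newcols = String[]
    copies = Dict{String,Vector{String}}()   # original name => names of its copies
    for i in idx
        new = unique_name(cols[i], newcols)
        push!(newcols, new)
        push!(get!(copies, cols[i], String[]), new)
    end
    newgroups = Dict{String,Vector{String}}()
    for (g, members) in groups
        kept = reduce(vcat, [get(copies, m, String[]) for m in members]; init=String[])
        isempty(kept) || (newgroups[g] = kept)   # drop groups with no surviving member
    end
    return newcols, newgroups
end

cols = ["x1", "x2", "x3"]
groups = Dict("odd_predictors" => ["x1", "x3"])
newcols, newgroups = index_with_repeats(cols, groups, [1, 1, 2, 2, 3])
```

Under this reading the answer to the question is yes: the grouping ends up holding both copies of the repeated column under their distinct names.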

What happens to groupings when we use by to group entries of df?

If I do a grouping operation (by(df, ["colA", "colB"], [:sum])) on a DataFrame, do we try to maintain column groups on the answer?

I have no real intuitions for this one.

doobwa commented on July 21, 2024

A few concrete examples:

# Column grouping example
a = DataFrame(quote
    x1 = randn(50)
    x2 = randn(50)
    x3 = randn(50)
end)

set_group(a, "odd_predictors", ["x1","x3"])

del(a, "x3")
a["odd_predictors"]  # should this still be possible?

I vote the above should not be possible. It seems like indexing should be predictable: if I ask for 3 columns (e.g. via a column group), I want to get 3 columns -- if that is not possible, then I should get an error.

On the next item, I agree with Harlan that groupings should use the names given by make_unique():

# Column grouping example
b = DataFrame(quote
    x1 = randn(50)
    x2 = randn(50)
    x3 = randn(50)
end)

set_group(b, "odd_predictors", ["x1", "x3"])

x = cbind(a,b)
@assert colnames(x) == ["x1", "x2", "x3", "x1_1", "x2_1", "x3_1"]
# Probably want the following for get_groups(x):
# odd_predictors:   x1, x3
# odd_predictors_1: x1_1, x3_1

Also, when using repeated column indices, should the resulting DataFrame have unique column names? This is not currently the case, it seems:

x = a[:,[1,1,2,2,3]]
colnames(x)  # "x1" "x1" "x2" "x2" "x3"

If we end up using make_unique() for the column names, then I think it makes sense to have get_groups(x) return the same groupings as a if all of the members are present in x. I think all vs any is the crux of the issue. I wouldn't say it's "harmful" to do any, but I think it's more predictable to do all.

PS. I'm not sure it makes any sense to maintain column groups for that example grouping operation. Won't that do something like plyr, so that colA and colB are actually no longer columns in the resulting DataFrame?

doobwa commented on July 21, 2024

On second thought, maybe we should just show a warning when somebody indexes a DataFrame with a column group when not all the members are present.

HarlanH commented on July 21, 2024

Not a fan of warnings in libraries, if we can avoid it, unless it's optional... Makes production code weird.

On Tue, Aug 7, 2012 at 11:15 AM, Chris DuBois wrote:

On second thought, maybe we should just show a warning when somebody indexes a DataFrame with a column group when not all the members are present.

johnmyleswhite commented on July 21, 2024

IIRC, there are methods in Julia for only outputting errors during interactive sessions.

doobwa commented on July 21, 2024

What about modifying the groupings? i.e. deleting the column removes it from all column groupings of that DataFrame.

HarlanH commented on July 21, 2024

Yes, I thought that modifying groupings is what I was arguing for!

doobwa commented on July 21, 2024

My apologies: I was getting caught up on the use of "group" vs. "grouping"!

doobwa commented on July 21, 2024

@tshort It seems that del! in index.jl correctly removes a column from any groups that it was a member of. del does not, however. It seems that both functions modify the object in place, so couldn't del just call del!? That at least takes care of the first point on the above lists, but I haven't gotten around to looking at the others quite yet.

doobwa commented on July 21, 2024

I've made some progress on the above items in ac8a9c0. @tshort: I'm still uncertain about the role of del and deepcopy, and I was unable to modify deepcopy to respect groupings while still having tests/data.jl pass.

Part of me thinks that DataFrames with groupings should be its own type: it might keep us from having functions that check for groupings, but might be less convenient to the end user.

HarlanH commented on July 21, 2024

and I think I fixed copy() and deepcopy() too. e63e6a9

Do we need to do anything else on this issue, or can it be closed?

doobwa commented on July 21, 2024

Looks good to me. I added a test in 2d220b7 for bullet 3 of tshort's original corner cases. All cases are now addressed but the last one; since groupings might not preserve columns I think we can ignore it.
