Comments (19)
Is there a definite case in which one wouldn't want the groupings to persist?
from dataframes.jl.
We probably wouldn't want them to persist if the indexed DataFrame no longer includes a particular column belonging to one of the groups.
from dataframes.jl.
That was my original reaction, but I'm not as sure now. Suppose you originally had columns January1900, February1900, ..., December1999. You might have groupings of Winter, Spring, Summer and Fall. If you removed February, it seems like the Winter grouping could still be useful to have and would never be harmful to have. Basically, it seems valuable to keep the grouping if the grouping could be written using simple column-wise predicates: column C is in Group G iff predicate(column C) holds. That doesn't depend on the other members of the group. Or do we think grouping has more content than this?
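A minimal sketch of the predicate idea in plain Julia, with no DataFrames dependency. The names `WINTER`, `is_winter` and `group_members` are hypothetical and only illustrate that membership can be recomputed from the columns that remain:

```julia
# Hypothetical sketch: a grouping defined by a predicate on column names,
# so membership never depends on which other columns are present.
const WINTER = ("December", "January", "February")

# True when a column name such as "January1900" starts with a winter month.
is_winter(col::AbstractString) = any(m -> startswith(col, m), WINTER)

# Recompute the members of a predicate-defined group for the current columns.
group_members(pred, cols) = filter(pred, cols)

cols = ["January1900", "February1900", "March1900", "December1900"]
winter_before = group_members(is_winter, cols)

# Drop February; the Winter grouping still describes the remaining columns.
remaining = filter(c -> c != "February1900", cols)
winter_after = group_members(is_winter, remaining)
```

Under this reading a grouping is never stale, because it is derived from the column names rather than stored as a fixed member list.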
from dataframes.jl.
There are some corner cases:
- If I delete a column that's in a grouping, does that delete the whole grouping or just remove that column from the grouping?
- On cbind, what happens when two dfs have the same grouping?
- What about indexing that ends up repeating columns, like df[:,[1,1,2,2,3:6]]? What happens to a grouping that includes column 1?
- If we rbind two DataFrames, and one has a grouping, and the other doesn't, what do you do?
- If I do a grouping operation (by(df, ["colA", "colB"], [:sum])) on a DataFrame, do we try to maintain column groups on the answer? For some operations, it would make sense, but for others, it would not.
from dataframes.jl.
My preference is that column groups are preserved if they can be used to index the DataFrame. For the above corner cases, this more or less implies the following behaviour:
- deleting a member of a group deletes the group
- cbind only includes unique groupings in the newly formed DataFrame
- if all of the members of a group are present in the final indexed DataFrame, include the group
- include the grouping on rbind
- maintaining column groups in the answer if all of the members of the group are present in the answer
IMHO the above behaviour sounds quite useful. I have not, however, thought through how this would be implemented.
from dataframes.jl.
Hm, without actual workflows, I'm sorta guessing, but I have different instincts from Chris...
- deleting a member of a group retains the other members (as John suggested)
- on cbind, use the same behavior if you've got duplicate column names. Looks like we're using make_unique() to append numbers to repeated column names. I'd suggest doing the same for group names.
- If indexing repeats columns, the group is preserved with the indexed columns, plus any column duplicates with names created by make_unique().
- Include the grouping on rbind. Probably throw an error if you have the same group names but different columns in the groups.
- maintain column groups in the answer if any of the members of the group are present in the answer.
But again, I might change my mind after actually using this in anger...
from dataframes.jl.
Quick clarification:
- I meant: deleting a column that is a member of a group deletes the grouping (i.e. just the element of the Dict), but not the member columns.
- Interesting.
- Do you mind providing a concrete example? I think this would help us discuss the merits one way or another.
- We agree on this one.
- I prefer requiring all rather than any: if one of the members is not there, you are no longer able to index the DataFrame using that column group.
from dataframes.jl.
To help us keep track of this discussion, here's a quick summary of the issues we've raised about the persistence of column groupings under various operations. For some of these, we don't seem to have definite answers yet, nor do we seem to have great reasons for any of our individual first-reaction positions.
When are column groupings deleted after column deletion?
- Deletion always takes place
- Deletion takes place if any member of the previous group is missing
- Deletion takes place if all members of the previous group are missing
It's not clear to me that we have good arguments for any of these. If we come to the conclusion that none of these causes harm, then I think we should just pick whichever is preferred by a majority of us.
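To make the three options concrete, here is a rough sketch that models groupings as a plain Dict from group name to member column names. This representation, the function `surviving_groups` and the policy symbols are all assumptions made for illustration, and reading "deletion always takes place" as "every grouping is dropped" is one interpretation:

```julia
# Hypothetical sketch: which groupings survive once the frame only has `cols`.
#   :always      -> every grouping is dropped
#   :any_missing -> a grouping is dropped if any of its members is gone
#   :all_missing -> a grouping is dropped only when all of its members are gone
function surviving_groups(groups::Dict{String,Vector{String}},
                          cols::Vector{String}, policy::Symbol)
    policy in (:always, :any_missing, :all_missing) ||
        throw(ArgumentError("unknown policy: $policy"))
    result = Dict{String,Vector{String}}()
    policy == :always && return result
    for (name, members) in groups
        present = filter(in(cols), members)   # members still in the frame
        if policy == :any_missing
            length(present) == length(members) && (result[name] = present)
        else
            isempty(present) || (result[name] = present)
        end
    end
    return result
end

groups = Dict("odd_predictors" => ["x1", "x3"])
cols_after_delete = ["x1", "x2"]   # x3 was deleted
kept_all_rule = surviving_groups(groups, cols_after_delete, :any_missing)
kept_any_rule = surviving_groups(groups, cols_after_delete, :all_missing)
```

With the second policy the group disappears as soon as x3 is deleted; with the third it shrinks to the members that remain.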
What happens to groupings when we use cbind to combine df1 and df2?
- We always delete all groupings when applying cbind.
- If there is a grouping with name X in only one of the DataFrames, we keep it.
- If there is a grouping with name X in both DataFrames, we delete it.
- If there is a grouping with name X in both DataFrames, we use make_unique() to append numbers to the repeated group names of df2 until there is no repetition.
I don't see any strong objections to the last proposal, which was raised by Harlan.
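A sketch of how that last proposal might look, again with groupings as a plain Dict. `unique_name` and `merge_groups` are hypothetical helpers, not the library's make_unique(), and the `_1` suffix format is an assumption:

```julia
# Hypothetical helper: append _1, _2, ... until the name is not taken.
function unique_name(name::String, taken)
    candidate, n = name, 0
    while candidate in taken
        n += 1
        candidate = string(name, "_", n)
    end
    return candidate
end

# Merge the groupings of two frames being cbind-ed. `renamed` maps each df2
# column name to the name it received after column de-duplication.
function merge_groups(g1, g2, renamed)
    merged = Dict{String,Vector{String}}(k => copy(v) for (k, v) in g1)
    for name in sort!(collect(keys(g2)))
        new_name = unique_name(name, keys(merged))
        merged[new_name] = [get(renamed, col, col) for col in g2[name]]
    end
    return merged
end

g1 = Dict("odd_predictors" => ["x1", "x3"])
g2 = Dict("odd_predictors" => ["x1", "x3"])
renamed = Dict("x1" => "x1_1", "x2" => "x2_1", "x3" => "x3_1")
merged = merge_groups(g1, g2, renamed)
```

The group from df2 has to follow its columns through the rename, otherwise the second group would point back at df1's columns.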
What happens to groupings when we use rbind to combine df1 and df2?
- If there is a grouping with name X in one DataFrame, but not the other, we delete this grouping from the result.
- If there is a grouping with name X in one DataFrame, but not the other, we propagate this grouping to the result.
I didn't see any arguments raised for either side. I'd say that we should decide this using a more general rule also applicable to column deletion: if there's no harm, should we try to keep as many groupings as possible?
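For illustration, a small sketch of the "propagate" option combined with Harlan's suggestion to throw an error when the same group name covers different columns. `rbind_groups` is a hypothetical name and the Dict representation is an assumption:

```julia
# Hypothetical sketch: combine groupings when stacking two frames by row.
# A grouping present in only one frame is propagated; a name that refers to
# different columns in the two frames is treated as an error.
function rbind_groups(g1, g2)
    merged = Dict{String,Vector{String}}(k => copy(v) for (k, v) in g1)
    for (name, members) in g2
        if haskey(merged, name)
            Set(merged[name]) == Set(members) ||
                throw(ArgumentError("group $name has different columns in the two DataFrames"))
        else
            merged[name] = copy(members)
        end
    end
    return merged
end

with_group = Dict("odd_predictors" => ["x1", "x3"])
without_group = Dict{String,Vector{String}}()
propagated = rbind_groups(with_group, without_group)
```

Since rbind requires both frames to share column names, a propagated grouping always refers to columns that exist in the result.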
What happens to groupings when we use column indexing with a repeated index?
- If indexing repeats columns, the group is preserved with the indexed columns, plus any column duplicates with names created by make_unique().
I'm unclear about the exact results of this. In the new DataFrame are we asserting that the grouping that previously contained Column C should now contain two copies of Column C with different names?
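One possible reading, sketched below: each repeated column gets a suffixed name, and a group that contained the original column now contains every copy. `reindex_groups` is hypothetical, the `_n` suffix is an assumption, and the sketch ignores the case where a suffixed name collides with an existing column:

```julia
# Hypothetical sketch: rename repeated columns and expand groups to all copies.
function reindex_groups(groups, selected::Vector{String})
    counts = Dict{String,Int}()
    new_names = String[]
    copies = Dict{String,Vector{String}}()   # original name => names of its copies
    for col in selected
        n = get(counts, col, 0)
        name = n == 0 ? col : string(col, "_", n)
        counts[col] = n + 1
        push!(new_names, name)
        push!(get!(copies, col, String[]), name)
    end
    new_groups = Dict{String,Vector{String}}()
    for (g, members) in groups
        expanded = String[]
        for m in members
            append!(expanded, get(copies, m, String[]))
        end
        # Keep the group if at least one member (or copy) was selected.
        isempty(expanded) || (new_groups[g] = expanded)
    end
    return new_names, new_groups
end

groups = Dict("odd_predictors" => ["x1", "x3"])
new_names, new_groups = reindex_groups(groups, ["x1", "x1", "x2", "x2", "x3"])
```

So yes, under this reading the grouping that contained Column C ends up with two differently named copies of it.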
What happens to groupings when we use by to group entries of df?
- If I do a grouping operation (by(df, ["colA", "colB"], [:sum])) on a DataFrame, do we try to maintain column groups on the answer?
I have no real intuitions for this one.
from dataframes.jl.
A few concrete examples:
# Column grouping example
a = DataFrame(quote
x1 = randn(50)
x2 = randn(50)
x3 = randn(50)
end)
set_group(a, "odd_predictors", ["x1","x3"])
del(a, "x3")
a["odd_predictors"] # should this still be possible?
I vote the above should not be possible. It seems like indexing should be predictable: if I ask for 3 columns (e.g. via a column group), I want to get 3 columns -- if that is not possible, then I should get an error.
On the next item, I agree with Harlan that groupings should use the names given by make_unique():
# Column grouping example
b = DataFrame(quote
x1 = randn(50)
x2 = randn(50)
x3 = randn(50)
end)
set_group(b, "odd_predictors", ["x1", "x3"])
x = cbind(a,b)
@assert colnames(x) == ["x1", "x2", "x3", "x1_1", "x2_1", "x3_1"]
# Probably want the following for get_groups(x):
# odd_predictors: x1, x3
# odd_predictors_1: x1_1, x3_1
Also, when using repeated column indices, should the resulting DataFrame have unique column names? This is not currently the case, it seems:
x = a[:,[1,1,2,2,3]]
colnames(x) # "x1" "x1" "x2" "x2" "x3"
If we end up using make_unique() for the column names, then I think it makes sense to have get_groups(x) return the same groupings as a if all of the members are present in x. I think all vs any is the crux of the issue. I wouldn't say it's "harmful" to do any, but I think it's more predictable to do all.
PS. I'm not sure it makes any sense to maintain column groups for that example grouping operation. Won't that do something like plyr, so that colA and colB are actually no longer columns in the resulting DataFrame?
from dataframes.jl.
On second thought, maybe we should just show a warning when somebody indexes a DataFrame with a column group when not all the members are present.
from dataframes.jl.
Not a fan of warnings in libraries, if we can avoid it, unless it's optional... Makes production code weird.
On Tue, Aug 7, 2012 at 11:15 AM, Chris DuBois wrote:
> On second thought, maybe we should just show a warning when somebody indexes a DataFrame with a column group when not all the members are present.
from dataframes.jl.
IIRC, there are methods in Julia for only outputting errors during interactive sessions.
from dataframes.jl.
What about modifying the groupings? i.e. deleting the column removes it from all column groupings of that DataFrame.
from dataframes.jl.
Yes, I thought that modifying groupings is what I was arguing for!
from dataframes.jl.
My apologies: I was getting caught up on the use of "group" vs. "grouping"!
from dataframes.jl.
@tshort It seems that del! in index.jl correctly removes a column from any groups that it was a member of. del does not, however. It seems that both functions will modify the object in place, so couldn't del just call del!? That at least takes care of the first point on the above lists, but I haven't gotten around to looking at the others quite yet.
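Roughly what I have in mind, as a toy sketch rather than the actual code in index.jl. `ToyIndex` is a hypothetical stand-in for the real index type, and dropping groupings that become empty is an extra assumption:

```julia
# Hypothetical stand-in for an index: column names plus named groupings.
mutable struct ToyIndex
    names::Vector{String}
    groups::Dict{String,Vector{String}}
end

# Remove a column in place and drop it from every grouping it belonged to.
function del!(idx::ToyIndex, col::String)
    col in idx.names || throw(ArgumentError("no column named $col"))
    filter!(!=(col), idx.names)
    for members in values(idx.groups)
        filter!(!=(col), members)
    end
    # Discard groupings that no longer have any members.
    filter!(kv -> !isempty(kv.second), idx.groups)
    return idx
end

# If del also mutates in place, it can simply delegate to del!.
del(idx::ToyIndex, col::String) = del!(idx, col)

idx = ToyIndex(["x1", "x2", "x3"],
               Dict("odd_predictors" => ["x1", "x3"], "only_x3" => ["x3"]))
del(idx, "x3")
```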
from dataframes.jl.
I've made some progress on the above items in ac8a9c0. @tshort: I'm still uncertain about the role of del and deepcopy, and I was unable to modify deepcopy to respect groupings while still having tests/data.jl pass.
Part of me thinks that DataFrames with groupings should be its own type: it might keep us from having functions that check for groupings, but might be less convenient to the end user.
from dataframes.jl.
and I think I fixed copy() and deepcopy() too. e63e6a9
Do we need to do anything else on this issue, or can it be closed?
from dataframes.jl.
Looks good to me. I added a test in 2d220b7 for bullet 3 of tshort's original corner cases. All cases are now addressed but the last one; since groupings might not preserve columns I think we can ignore it.
from dataframes.jl.