I am having an issue with partitioning a dataframe, which I don't understand. The groupe

Partition not working about multidplyr HOT 16 CLOSED

tidyverse commented on August 28, 2024

Partition not working

from multidplyr.

Comments (16)

hadley commented on August 28, 2024

Without a reproducible example, I think you'll need to step through the code in partition() yourself, and figure out what's going wrong.

How many total combinations of year and month do you have? It's possible that it's small and because of the randomisation partition does, sometimes there aren't enough to spread over 31 nodes?
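A quick way to answer that question is to count the distinct (year, month) pairs before partitioning. The sketch below uses base R only; the data frame `df` and its values are made up for illustration and are not the original poster's data.

```r
# Hypothetical stand-in for the poster's data: only the grouping columns matter
df <- data.frame(
  year  = c(2015, 2015, 2015, 2016, 2016, 2016, 2016),
  month = c(1, 1, 2, 1, 2, 3, 3)
)

# Each distinct (year, month) pair is one group that partition() has to place
n_groups <- nrow(unique(df[c("year", "month")]))
n_groups
```

If `n_groups` is smaller than the number of nodes, some nodes cannot receive a group, which is the situation discussed in this thread.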

fugufisch commented on August 28, 2024

Yes, it's only 6 combinations. But even if I initialize a cluster with only 5 or 6 nodes I get this error.
It works if I only use 2 nodes though.

I get the same error with:

data(movies)
movies %>% partition(length)

This works:

movies %>% partition(length, cluster=create_cluster(2))

So I always need more combinations than nodes? How do I know how many nodes I can use?

hadley commented on August 28, 2024

You shouldn't have to, but clearly there's a bug in my implementation of partition().

fugufisch commented on August 28, 2024

Ok, I'll see if I can fix it and maybe make a pull request. Thanks!

jmorten commented on August 28, 2024

I am having the same problem. I have 35,000,000 lines with 4,600,000 unique values that I am partitioning on. I would assume that partition would be able to create shards given the input.

partition() seems to work at random, sometimes giving the below error and sometimes returning the data frame.

I tried 32, 16, 8, and 7 partitions; all failed. However, as above, 2 partitions do work.

fugufisch commented on August 28, 2024

I did a fix for this in my fork that solved the issue for me. It would be great if you could test my fix with your dataset.

jmorten commented on August 28, 2024

I would love to implement your fix, but I believe there might be an underlying issue, since partition() works in a random manner.

Sometimes I restart my session and partition() works; other times it does not.

gdbassett commented on August 28, 2024

I also get the "Error: length(values) == length(cluster) is not TRUE" error when passing a feature to partition(). It even happens when the cluster is created with as many nodes as there are unique values of the partitioning feature.

> names(bla)
 [1] "id"        "sha256"    "timestamp" "src_port"  "dst_port"  "thing"       "thing2"  "thing3"     "thing4"       "thing5"
> length(unique(bla$thing4))
[1] 20
> cluster <- create_cluster(20)
Initialising 20 core cluster.
> chunk <- bla %>% partition(thing4, cluster=cluster) %>% group_by(thing3, thing4) %>% summarize(x=n()) %>% collect() %>% group_by(thing2) %>% mutate(n=sum(x), freq=x/n) %>% select(industry, ext, x, n, freq) %>% ungroup()
Error: length(values) == length(cluster) is not TRUE

I get the same error if 'cluster' isn't specified in partition or if the cluster size is 10 or 5 cores.

julou commented on August 28, 2024

a fix for this would be really helpful!
For now I'm living with the workaround of @fugufisch i.e. using my cluster[1:n] where n is smaller than the number of shards…

btw, I would say that fugufisch@fee752e is more general than Ax3man@cd04680 which as far as I understand is only relevant when no cluster argument is passed (but maybe I missed some subtlety…)

Ax3man commented on August 28, 2024

IIRC the fix from @fugufisch didn't always work for me (depending on the random shard distribution I think). Also, if no cluster is specified, it may create a cluster that is larger than necessary.

But it does automatically work when you supply a cluster of the wrong size. I think I can steal that and combine.

julou commented on August 28, 2024

@Ax3man could you elaborate?
This fix only subsets the cluster when there are fewer shards than nodes…

Ax3man commented on August 28, 2024

Hmm, sorry that was another bug that I fixed that day. I now see they were separate commits. But it does create an oversized cluster (when cluster isn't specified and there are too few groups).

sw1sh commented on August 28, 2024

I have the same problem when partitioning a data.frame that also belongs to the 'data.table' class (because I use fread for fast reading). The solution in this case is simply class(df) <- 'data.frame'.
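A minimal sketch of that workaround, assuming the object carries both classes as an fread() result would. The class vector is set by hand here so the example runs without data.table installed; a real data.table also carries extra attributes that this stand-in does not reproduce.

```r
# Stand-in for an fread() result: a data frame tagged with the data.table class
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
class(df) <- c("data.table", "data.frame")
class(df)

# The workaround from the comment above: keep only the plain data.frame class
class(df) <- "data.frame"
class(df)
```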

gmonaie commented on August 28, 2024

The line creating the issue is in partition_ where groups are being assigned to partitions or shards.

shard.R#51
groups$part_id <- floor(m * (cumsum(groups$n) - 1) / sum(groups$n) + 1)

As an example, number of rows (n) is 18, number of shards (m) is 5.

data is to be partitioned by date, looks like this:

> data %>% select(date)
Source: local data frame [18 x 1]
Groups: date

         date
1  2016-08-04
2  2016-08-04
3  2016-08-10
4  2016-08-10
5  2016-08-10
6  2016-08-10
7  2016-08-11
8  2016-08-11
9  2016-08-11
10 2016-08-11
11 2016-08-11
12 2016-08-11
13 2016-08-11
14 2016-08-11
15 2016-08-17
16 2016-08-17
17 2016-08-18
18 2016-08-18

Before assigning partitions:

> groups
Source: local data frame [5 x 2]

  id n
1  4 2
2  3 8
3  1 2
4  5 2
5  2 4

shard.R#51 groups$part_id <- floor(m * (cumsum(groups$n) - 1) / sum(groups$n) + 1)
This part_id assignment spreads the groups unevenly across the shards (no group is assigned to shard 2), with this result:

> groups
Source: local data frame [5 x 3]

  id n part_id
1  4 2       1
2  3 8       3
3  1 2       4
4  5 2       4
5  2 4       5
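The assignment above can be reproduced in base R without a cluster. This sketch rebuilds the `groups` table from the example (same order, same sizes) and applies the quoted formula; the variable names `used` and `empty_shards` are introduced here for illustration only.

```r
# Group ids and sizes in the order shown in the example
groups <- data.frame(id = c(4, 3, 1, 5, 2), n = c(2, 8, 2, 2, 4))
m <- 5  # number of shards, i.e. nodes in the cluster

# The assignment quoted from shard.R line 51
groups$part_id <- floor(m * (cumsum(groups$n) - 1) / sum(groups$n) + 1)
groups$part_id                            # 1 3 4 4 5

# Which shards actually received a group, and which did not
used <- unique(groups$part_id)
empty_shards <- setdiff(seq_len(m), used)
length(used)                              # 4 shards used for a 5-node cluster
empty_shards                              # 2
```

Only four distinct shard ids come out for five nodes, which is consistent with the `length(values) == length(cluster)` check failing.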

We can either change line #51, or change the cluster assignment on line #59 to have a one-to-one mapping with the shards:

cluster_assign_each(cluster[as.integer(names(shards))], name, shards)
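To see what that subsetting does, here is a sketch with a character vector standing in for the cluster. It assumes `shards` is a list named by part_id (built with `split()` here); that structure is a guess at what partition_ holds at that point, not something confirmed in the thread.

```r
# part_id values from the example above
groups <- data.frame(id = c(4, 3, 1, 5, 2), part_id = c(1, 3, 4, 4, 5))

# Assumed shape of `shards`: one element per non-empty shard, named by part_id
shards <- split(groups$id, groups$part_id)
names(shards)                             # "1" "3" "4" "5"

# Stand-in for a 5-node cluster: labels only, no real workers
cluster <- paste0("node", 1:5)

# Pick exactly the nodes whose index matches a non-empty shard
used_nodes <- cluster[as.integer(names(shards))]
used_nodes                                # "node1" "node3" "node4" "node5"
```

With this selection the number of nodes equals the number of shards, so node 2 is simply left idle instead of triggering the length check.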

kendonB commented on August 28, 2024

Please see and test my solution at https://github.com/kendonB/multidplyr

cwbishop commented on August 28, 2024

I'm a huge fan of this package, but I am also running into the same issue frequently when the number of partitions is smaller than the number of cores in the cluster.

My typical workflow is to create a forked cluster across all available cores after loading all necessary libraries, etc. in my interactive environment. I then leverage this cluster as needed for embarrassingly parallel operations (e.g., repetitive queries against a database, partition-wise computations or summaries, etc.). Cluster creation in this case is a heavy operation, so creating a new cluster for each computation is prohibitively time-consuming. It is also difficult to preemptively detect edge cases in which I have fewer partitions than cores in my cluster. The result is frequent errors and crashes.

Minimal reproducible example below.

# dplyr provides %>%, group_by(), slice(), mutate() and collect()
library(dplyr)

# Example data
library(nycflights13)

# Fails
flights %>% group_by(dest) %>% slice(1) %>% ungroup() %>% slice(1) %>% multidplyr::partition(dest, cluster=multidplyr::create_cluster(2)) %>% mutate(new_col=arr_time - 10) %>% collect(n=Inf)

# Fails
flights %>% group_by(dest) %>% slice(1) %>% ungroup() %>% slice(2) %>% multidplyr::partition(dest, cluster=multidplyr::create_cluster(2)) %>% mutate(new_col=arr_time - 10) %>% collect(n=Inf)

# Works
flights %>% group_by(dest) %>% slice(1) %>% ungroup() %>% slice(1:2) %>% multidplyr::partition(dest, cluster=multidplyr::create_cluster(2)) %>% mutate(new_col=arr_time - 10) %>% collect(n=Inf)

I have not yet had a chance to try the solution suggested by @kendonB, but wanted to provide a reproducible example.
