Partition not working (multidplyr, CLOSED)

tidyverse commented on August 28, 2024
Partition not working

from multidplyr.

Comments (16)

hadley commented on August 28, 2024

Without a reproducible example, I think you'll need to step through the code in partition() yourself, and figure out what's going wrong.

How many total combinations of year and month do you have? It's possible that it's small and because of the randomisation partition does, sometimes there aren't enough to spread over 31 nodes?

fugufisch commented on August 28, 2024

Yes, it's only 6 combinations. But even if I initialize a cluster with only 5 or 6 nodes I get this error.
It works if I only use 2 nodes though.

I get the same error with:

data(movies)
movies %>% partition(length)

This works:

movies %>% partition(length, cluster=create_cluster(2))

So I always need more combinations than nodes? How do I know how many nodes I can use?
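As a rough way to answer the "how many nodes" question, the number of distinct group combinations can be counted before a cluster is created. This is only a sketch in base R: the toy data frame and the names `n_groups`, `n_cores` and `n_nodes` are made up for illustration, and the result is an upper bound on useful nodes, not a guarantee that partition() will succeed with that many.

```r
# Toy data standing in for the real data set (hypothetical values).
df <- data.frame(
  year  = c(2015, 2015, 2015, 2016, 2016, 2016, 2016),
  month = c(1, 1, 2, 1, 2, 3, 3),
  value = c(10, 20, 30, 40, 50, 60, 70)
)

# Number of distinct (year, month) combinations, i.e. the number of groups.
n_groups <- nrow(unique(df[c("year", "month")]))

# Cores on this machine; detectCores() can return NA, so fall back to 1.
n_cores <- parallel::detectCores()
if (is.na(n_cores)) n_cores <- 1L

# Never ask for more nodes than there are groups to spread over them.
n_nodes <- min(n_groups, n_cores)
```

With the real data, `n_nodes` could then be passed to create_cluster(), keeping in mind that fewer nodes may still be needed.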

hadley commented on August 28, 2024

You shouldn't have to, but clearly there's a bug in my implementation of partition().

fugufisch commented on August 28, 2024

Ok, I'll see if I can fix it and maybe make a pull request. Thanks!

jmorten commented on August 28, 2024

I am having the same problem. I have 35,000,000 lines with 4,600,000 unique values that I am partitioning on. I would assume that partition would be able to create shards given the input.

partition() seems to work at random, sometimes giving the error below and sometimes returning the data frame.

I tried with 32 partitions, 16, 8, 7, all failed. However, as above, 2 partitions do work.

fugufisch commented on August 28, 2024

I did a fix for this in my fork that solved the issue for me. It would be great if you could test my fix with your dataset.

jmorten commented on August 28, 2024

I would love to implement your fix, but I believe there might be an underlying issue, since partition() works in a random manner.

Sometimes I restart my session and partition() works; other times it does not.

gdbassett commented on August 28, 2024

I also get the "Error: length(values) == length(cluster) is not TRUE" error when passing a feature to partition(). It even happens when the cluster has exactly as many nodes as there are unique values in the partitioning feature.

> names(bla)
 [1] "id"        "sha256"    "timestamp" "src_port"  "dst_port"  "thing"       "thing2"  "thing3"     "thing4"       "thing5"
> length(unique(bla$thing4))
[1] 20
> cluster <- create_cluster(20)
Initialising 20 core cluster.
> chunk <- bla %>% partition(thing4, cluster=cluster) %>% group_by(thing3, thing4) %>% summarize(x=n()) %>% collect() %>% group_by(thing2) %>% mutate(n=sum(x), freq=x/n) %>% select(industry, ext, x, n, freq) %>% ungroup()
Error: length(values) == length(cluster) is not TRUE

I get the same error if 'cluster' isn't specified in partition or if the cluster size is 10 or 5 cores.

julou commented on August 28, 2024

a fix for this would be really helpful!
For now I'm living with the workaround of @fugufisch i.e. using my cluster[1:n] where n is smaller than the number of shards…

btw, I would say that fugufisch@fee752e is more general than Ax3man@cd04680 which as far as I understand is only relevant when no cluster argument is passed (but maybe I missed some subtlety…)

Ax3man commented on August 28, 2024

IIRC the fix from @fugufisch didn't always work for me (depending on the random shard distribution I think). Also, if no cluster is specified, it may create a cluster that is larger than necessary.

But it does automatically work when you supply a cluster of the wrong size. I think I can steal that and combine.

julou commented on August 28, 2024

@Ax3man could you elaborate?
This fix only subsets the cluster when there are fewer shards than nodes…

Ax3man commented on August 28, 2024

Hmm, sorry that was another bug that I fixed that day. I now see they were separate commits. But it does create an oversized cluster (when cluster isn't specified and there are too few groups).

sw1sh commented on August 28, 2024

I have the same problem when partitioning a data.frame that also belongs to the 'data.table' class (because I use fread for fast reading). The solution in this case is simply class(df) <- 'data.frame'.
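A minimal sketch of that workaround follows. To stay self-contained it does not load data.table; instead it mimics the class vector that fread() output carries, which is an assumption made purely for illustration.

```r
# Mimic the class vector of an object returned by data.table::fread().
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
class(df) <- c("data.table", "data.frame")

# Drop the extra class before handing the data to partition().
if (inherits(df, "data.table")) {
  class(df) <- "data.frame"
}
```

The check with inherits() keeps the step harmless for inputs that are already plain data frames.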

gmonaie commented on August 28, 2024

The line creating the issue is in partition_ where groups are being assigned to partitions or shards.

shard.R#51
groups$part_id <- floor(m * (cumsum(groups$n) - 1) / sum(groups$n) + 1)

As an example, number of rows (n) is 18, number of shards (m) is 5.

data is to be partitioned by date, looks like this:

> data %>% select(date)
Source: local data frame [18 x 1]
Groups: date

         date
1  2016-08-04
2  2016-08-04
3  2016-08-10
4  2016-08-10
5  2016-08-10
6  2016-08-10
7  2016-08-11
8  2016-08-11
9  2016-08-11
10 2016-08-11
11 2016-08-11
12 2016-08-11
13 2016-08-11
14 2016-08-11
15 2016-08-17
16 2016-08-17
17 2016-08-18
18 2016-08-18

Before assigning partitions:

> groups
Source: local data frame [5 x 2]

  id n
1  4 2
2  3 8
3  1 2
4  5 2
5  2 4

shard.R#51 groups$part_id <- floor(m * (cumsum(groups$n) - 1) / sum(groups$n) + 1)
This part_id assignment makes an uneven assignment of groups to shards with this result:

> groups
Source: local data frame [5 x 3]

  id n part_id
1  4 2       1
2  3 8       3
3  1 2       4
4  5 2       4
5  2 4       5
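The assignment above can be reproduced in base R without a cluster. This is a minimal sketch that assumes the formula quoted from shard.R#51 is applied to the groups in the shuffled order shown; the names `used` and `empty` are introduced here only for illustration.

```r
# Group sizes in the shuffled order shown above.
groups <- data.frame(id = c(4, 3, 1, 5, 2), n = c(2, 8, 2, 2, 4))
m <- 5  # number of shards, one per cluster node

# Formula quoted from shard.R#51.
groups$part_id <- floor(m * (cumsum(groups$n) - 1) / sum(groups$n) + 1)
print(groups)

# Shards that actually receive at least one group.
used <- sort(unique(groups$part_id))

# Shards that stay empty.
empty <- setdiff(seq_len(m), used)
print(empty)
```

Shard 2 receives no group, so only four distinct part_id values exist for a five-node cluster, which is consistent with the length(values) == length(cluster) check failing.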

We can either change line #51 or change the cluster assignment to have a 1 to 1 mapping with the shards on line #59

cluster_assign_each(cluster[as.integer(names(shards))], name, shards)
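The second option can be sketched as follows. A character vector stands in for the cluster here, on the assumption that a real cluster object can be indexed by position in the same way; `group_id`, `part_id` and `used_nodes` are illustrative names, not part of multidplyr.

```r
# part_id values from the example above, one per group.
group_id <- c(4, 3, 1, 5, 2)
part_id  <- c(1, 3, 4, 4, 5)

# Split groups into shards; empty shards simply do not appear.
shards <- split(group_id, part_id)
print(names(shards))

# Stand-in for a five-node cluster (hypothetical).
cluster <- paste0("node", 1:5)

# Keep only the nodes that have a shard, giving a 1 to 1 mapping.
used_nodes <- cluster[as.integer(names(shards))]
print(used_nodes)
```

After the subsetting, the number of nodes equals the number of shards, so node 2 is simply left idle instead of triggering the error.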

kendonB commented on August 28, 2024

Please see and test my solution at https://github.com/kendonB/multidplyr

cwbishop commented on August 28, 2024

I'm a huge fan of this package, but I am also running into the same issue frequently when the number of partitions < the number of cores in the cluster.

My typical work flow is to create a forked cluster across all available cores after loading all necessary libraries, etc. in my interactive environment. I then leverage this cluster as needed for embarrassingly parallel operations (e.g., repetitive queries against a database, partition-wise computations or summaries, etc.). Cluster creation in this case is a heavy operation and thus creating a new cluster for each computation is prohibitively time consuming. It is also difficult to preemptively detect edge cases in which I have fewer partitions than cores in my cluster. The result is frequent errors and crashes.

Minimal reproducible example below.

# Example data
library(dplyr)
library(nycflights13)

# Fails
flights %>% group_by(dest) %>% slice(1) %>% ungroup() %>% slice(1) %>% multidplyr::partition(dest, cluster=multidplyr::create_cluster(2)) %>% mutate(new_col=arr_time - 10) %>% collect(n=Inf)

# Fails
flights %>% group_by(dest) %>% slice(1) %>% ungroup() %>% slice(2) %>% multidplyr::partition(dest, cluster=multidplyr::create_cluster(2)) %>% mutate(new_col=arr_time - 10) %>% collect(n=Inf)

# Works
flights %>% group_by(dest) %>% slice(1) %>% ungroup() %>% slice(1:2) %>% multidplyr::partition(dest, cluster=multidplyr::create_cluster(2)) %>% mutate(new_col=arr_time - 10) %>% collect(n=Inf)

I have not yet had a chance to try the solution suggested by @kendonB, but wanted to provide a reproducible example.
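The three cases above line up with the part_id formula that gmonaie quotes from shard.R#51. The sketch below assumes that formula is what partition() runs; `shards_used()` is a hypothetical helper written for this check, not a multidplyr function.

```r
# Count how many distinct shards the quoted formula would fill,
# given the group sizes n and the number of nodes m.
shards_used <- function(n, m) {
  part_id <- floor(m * (cumsum(n) - 1) / sum(n) + 1)
  length(unique(part_id))
}

one_dest <- shards_used(1, 2)        # a single destination, 2 nodes
two_dest <- shards_used(c(1, 1), 2)  # two destinations, 2 nodes

print(c(one_dest = one_dest, two_dest = two_dest))
```

A single one-row group fills only one of the two shards, matching the failing cases, while two one-row groups fill both, matching the case that works.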
