Comments (12)

JoshRosen commented on August 19, 2024

Come to think of it, there are cases where we want to support cross-region transfers. Therefore, we might choose to split this into two separate issues: a more informative warning message, and instructions on how to configure the cross-region UNLOAD command.

As far as I know, there's no easy way to determine the cluster's region over JDBC, so I'm not sure we'd be able to figure out the correct UNLOAD command automatically: http://stackoverflow.com/q/32545040/590203

cfeduke commented on August 19, 2024

Having aws_region as a configuration parameter might be enough. That's the first thing I looked for when I ran into this problem. (I worked around it by creating an S3 bucket in US Standard.)

JoshRosen commented on August 19, 2024

That sounds reasonable to me; I was considering doing something similar. I think that tempdir_aws_region or s3_aws_region might be a slightly clearer configuration name, though.
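
For illustration, a purely hypothetical sketch of how such an option might be used; neither name existed in the library at the time, and the host, credentials, table, and bucket below are placeholders:

    // Hypothetical sketch only: "tempdir_aws_region" was merely a proposed
    // name and was not implemented at the time of this thread. Assumes a
    // Spark 1.x SQLContext in scope; all identifiers are placeholders.
    val df = sqlContext.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/db?user=user&password=pass")
      .option("dbtable", "my_table")
      .option("tempdir", "s3n://my-bucket/tmp/")
      .option("tempdir_aws_region", "us-west-2") // hypothetical option
      .load()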

JoshRosen commented on August 19, 2024

I'm a bit overloaded with other work at the moment and this task isn't part of our current sprint, so it's up for grabs if anyone wants to work on it. I do have time to review and revise small patches to spark-redshift if they won't involve too many rounds of back-and-forth.

JoshRosen commented on August 19, 2024

Now that #35 has been merged, this can be worked around using the new extracopyoptions configuration; note, however, that we have not published a release containing that patch yet, so you'd have to build your own version of spark-redshift in order for that to work.
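
For concreteness, the workaround looks roughly like this; a sketch assuming Spark 1.x with spark-redshift on the classpath and a DataFrame df to be written, where the URL, table, and bucket names are placeholders:

    import org.apache.spark.sql.SaveMode

    // extracopyoptions is appended verbatim to the generated COPY command;
    // REGION lets the cluster read staged files from a bucket in another
    // region. All identifiers below are placeholders.
    df.write
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/db?user=user&password=pass")
      .option("dbtable", "my_table")
      .option("tempdir", "s3n://bucket-in-us-east-1/tmp/")
      .option("extracopyoptions", "REGION 'us-east-1'")
      .mode(SaveMode.ErrorIfExists)
      .save()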

JoshRosen commented on August 19, 2024

extracopyoptions doesn't entirely address this issue, since you also need the ability to specify the region during reads. Therefore, we should still add a configuration option / patch for this.

JoshRosen commented on August 19, 2024

Users: please comment on this thread to vote for this issue if it's important to you. I'd like to implement it but am holding off until I hear of more demand, since I have limited time to devote to spark-redshift and need to prioritize features.

tristanreid commented on August 19, 2024

While the COPY command seems to support a region, it appears that the UNLOAD command doesn't:

From http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html:

    Important: The Amazon S3 bucket where Amazon Redshift will write the output files must reside in the same region as your cluster.

I'm glad to add this functionality, but it appears it would only be one-way (write-only).

JoshRosen commented on August 19, 2024

If you want this functionality yourself, I would be happy to accept a patch for it; the one-way limitation is fine.

tristanreid commented on August 19, 2024

I was just finishing the unit test for this when I realized that there's already a trivial workaround using the existing extracopyoptions option. Maybe it's cleaner to just document that REGION can be passed that way. Should I do that, or open a PR for the distinct option? Either way seems fine to me.

karanveerm commented on August 19, 2024

+1. We have a use case where we pull data from Redshift for several different clients and would prefer to use only a single S3 bucket instead of having one in every region. This would be very helpful.

JoshRosen commented on August 19, 2024

@karanveerm, note the limitation described in #87 (comment):

While the COPY command seems to support a region, it appears that the UNLOAD command doesn't:

From http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html:

    Important: The Amazon S3 bucket where Amazon Redshift will write the output files must reside in the same region as your cluster.

Given this limitation, I don't think we'll be able to support your use case of pulling data from several clients through a single bucket. However, I believe that you could use a single bucket as the staging area for writes by using the extracopyoptions technique described upthread.
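
In practice the split looks like this: writes can stage through the single cross-region bucket via extracopyoptions (as sketched upthread), while reads must point tempdir at a bucket in the cluster's own region, since UNLOAD has no REGION option. A read-side sketch, with placeholder names and a Spark 1.x SQLContext assumed in scope:

    // Reads go through UNLOAD, which cannot write cross-region, so tempdir
    // must name a bucket in the same region as the cluster.
    val df = sqlContext.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/db?user=user&password=pass")
      .option("dbtable", "clients")
      .option("tempdir", "s3n://bucket-in-cluster-region/tmp/")
      .load()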
