Comments (12)

JoshRosen commented on August 19, 2024

Come to think of it, there are cases where we want to support cross-region transfers. Therefore, we might choose to split this into two separate issues: a more informative warning message, and instructions on how to configure the cross-region UNLOAD command.

As far as I know, there's no easy way to determine the cluster's region over JDBC, so I'm not sure we'd be able to figure out the correct UNLOAD command automatically: http://stackoverflow.com/q/32545040/590203

cfeduke commented on August 19, 2024

Having aws_region as a configuration parameter might be enough. That's the first thing I looked for when I ran into this problem. (I worked around it by creating an S3 bucket in US Standard.)

JoshRosen commented on August 19, 2024

That sounds reasonable to me; I was considering doing something similar. I think that tempdir_aws_region or s3_aws_region might be a slightly clearer configuration name, though.
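
For illustration, a purely hypothetical sketch of how such an option might be used; neither name existed in the library at the time, and the host, credentials, table, and bucket below are placeholders:

    // Hypothetical sketch only: "tempdir_aws_region" was merely a proposed
    // name and was not implemented at the time of this thread. Assumes a
    // Spark 1.x SQLContext in scope; all identifiers are placeholders.
    val df = sqlContext.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/db?user=user&password=pass")
      .option("dbtable", "my_table")
      .option("tempdir", "s3n://my-bucket/tmp/")
      .option("tempdir_aws_region", "us-west-2") // hypothetical option
      .load()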

JoshRosen commented on August 19, 2024

I'm a bit overloaded with other work at the moment and this task isn't part of our current sprint, so it's up for grabs if anyone wants to work on it. I do have time to review and revise small patches to spark-redshift if they won't involve too many rounds of back-and-forth.

JoshRosen commented on August 19, 2024

Now that #35 has been merged, this can be worked around using the new extracopyoptions configuration; note, however, that we have not published a release containing that patch yet, so you'd have to build your own version of spark-redshift in order for that to work.
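
For concreteness, the workaround looks roughly like this; a sketch assuming Spark 1.x with spark-redshift on the classpath and a DataFrame df to be written, where the URL, table, and bucket names are placeholders:

    import org.apache.spark.sql.SaveMode

    // extracopyoptions is appended verbatim to the generated COPY command;
    // REGION lets the cluster read staged files from a bucket in another
    // region. All identifiers below are placeholders.
    df.write
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/db?user=user&password=pass")
      .option("dbtable", "my_table")
      .option("tempdir", "s3n://bucket-in-us-east-1/tmp/")
      .option("extracopyoptions", "REGION 'us-east-1'")
      .mode(SaveMode.ErrorIfExists)
      .save()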

JoshRosen commented on August 19, 2024

extracopyoptions doesn't entirely address this issue, since you also need the ability to specify the region during reads. Therefore, we should still add a configuration option / patch for this.

JoshRosen commented on August 19, 2024

Users: please comment on this thread to vote for this issue if it's important to you. I'd like to implement it but am holding off until I hear of more demand, since I have limited time to devote to spark-redshift and need to prioritize features.

tristanreid commented on August 19, 2024

While the COPY command seems to support a region, it appears that the UNLOAD command doesn't:

From http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html:

    Important: The Amazon S3 bucket where Amazon Redshift will write the output files must reside in the same region as your cluster.

I'm glad to add this functionality, but it appears it would only be one-way (write-only).

JoshRosen commented on August 19, 2024

If you want this functionality yourself, I would be happy to accept a patch for it; the one-way limitation is fine.

tristanreid commented on August 19, 2024

I was just finishing the unit test for this when I realized that there's already a trivial workaround using the existing extracopyoptions option. Maybe it's cleaner to just document that REGION can be passed that way. Should I do that, or open a PR for the distinct option? Either way seems fine to me.

karanveerm commented on August 19, 2024

+1. We have a use case where we pull data from Redshift for several different clients and would prefer to use only a single S3 bucket instead of having one in every region. This would be very helpful.

JoshRosen commented on August 19, 2024

@karanveerm, note the limitation described in #87 (comment):

While the COPY command seems to support a region, it appears that the UNLOAD command doesn't:

From http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html:

    Important: The Amazon S3 bucket where Amazon Redshift will write the output files must reside in the same region as your cluster.

Given this limitation, I don't think we'll be able to support your use case of pulling data from several clients through a single bucket. However, I believe that you could use a single bucket as the staging area for writes by using the extracopyoptions technique described upthread.
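
In practice the split looks like this: writes can stage through the single cross-region bucket via extracopyoptions (as sketched upthread), while reads must point tempdir at a bucket in the cluster's own region, since UNLOAD has no REGION option. A read-side sketch, with placeholder names and a Spark 1.x SQLContext assumed in scope:

    // Reads go through UNLOAD, which cannot write cross-region, so tempdir
    // must name a bucket in the same region as the cluster.
    val df = sqlContext.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/db?user=user&password=pass")
      .option("dbtable", "clients")
      .option("tempdir", "s3n://bucket-in-cluster-region/tmp/")
      .load()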
