java.sql.SQLException: [Amazon](500310) Invalid operation: S3ServiceException:The specified key does not exist (spark-redshift, open, 22 comments)

lehnerm commented on August 19, 2024
java.sql.SQLException: [Amazon](500310) Invalid operation: S3ServiceException:The specified key does not exist

Comments (22)

JoshRosen commented on August 19, 2024

Hi @lehnerm,

Just to help me narrow down potential causes, could you let me know which Spark Redshift version you're using and which AWS region(s) are hosting your Spark driver, S3 bucket, and Redshift cluster?

The files are definitely written to S3 before we issue the Redshift COPY command, so my hunch is that any race condition is due to S3 eventual-consistency issues.

In spark-redshift 0.5.1+ (see #99), we use a manifest to tell Redshift the exact set of files which should be loaded. If we were to just pass Redshift the name of the directory containing the Avro files, without explicitly listing every file's name, then eventual consistency could mean that Redshift wouldn't see some partitions and would just silently skip them since it didn't know to expect them.
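
Roughly, a manifest of this kind and the corresponding COPY look like the following sketch (paths, table name, and credentials are placeholders, and the exact text spark-redshift generates may differ), shown here as Scala string constants:

    // Hypothetical illustration of the manifest-based COPY described above.
    // All bucket names, paths, and the table name are placeholders.
    val manifestJson =
      """{
        |  "entries": [
        |    {"url": "s3://my-temp-bucket/tmp/run-1234/part-r-00000.avro", "mandatory": true},
        |    {"url": "s3://my-temp-bucket/tmp/run-1234/part-r-00001.avro", "mandatory": true}
        |  ]
        |}""".stripMargin

    // With mandatory=true, a listed file that Redshift cannot find fails the COPY
    // instead of being silently skipped.
    val copySql =
      """COPY my_table
        |FROM 's3://my-temp-bucket/tmp/run-1234/manifest.json'
        |CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
        |AVRO 'auto' MANIFEST""".stripMargin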

What I suspect might be happening is that you are hitting such an eventual consistency issue and this error message means that Redshift caught the issue and failed rather than silently losing data.

There's one aspect of this that I'm slightly confused about, though. According to Redshift's guide on managing data consistency:

All regions provide read-after-write consistency for uploads of new objects with unique object keys. [...] Amazon S3 provides eventual consistency in all regions for overwrite operations. Creating new file names, or object keys, in Amazon S3 for each data load operation provides strong consistency in all regions.

We always produce new keys, so according to this I think our S3 writes should appear to be strongly consistent as long as the S3 bucket and Redshift cluster are in the same AWS region.

Do you happen to be doing a cross-region copy (i.e. is your S3 bucket in a different AWS region than your Redshift cluster)?

lehnerm commented on August 19, 2024

Hi @JoshRosen,

Thanks for your reply; here's an overview of our setup:

  • Spark 1.5.0 on YARN (Amazon EMR 4.1)
  • Spark-Redshift 0.5.2
  • Scala 2.10 / Java 7
  • Redshift JDBC 4.1 (1.1.7.1007)

The job uses the new manifest.json and generates new files (new keys) on every run. However, it also uses 24 partitions, so 24 files are written to S3 each time, which might be causing issues with S3? I'm a bit confused about the S3 consistency rules as well. The only thing I did notice is that the second job repartitions to the number of Redshift nodes (8) because of a shuffle in the middle of the job, and that significantly improves stability (though it doesn't eliminate the problem completely). I'll give repartitioning to the number of Redshift nodes a try here as well.
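
For what it's worth, the repartitioning I'm describing would be nothing more exotic than the following sketch (the DataFrame name, JDBC URL, and paths are placeholders; the option names follow the spark-redshift README):

    // Hypothetical sketch: write one Avro file per Redshift node (8) instead of 24.
    df.repartition(8)
      .write
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://example-cluster:5439/mydb?user=...&password=...")
      .option("dbtable", "my_table")
      .option("tempdir", "s3n://my-temp-bucket/tmp/")
      .mode("append")
      .save()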

EMR, Redshift, and the S3 bucket are all in the eu-west-1 region. However, the s3n:// URI that I use for the temp folder does not include the region name.

With the job running every 5 minutes from the same machine, a UUID collision should be highly unlikely.

lehnerm commented on August 19, 2024

I've been doing some research on S3 consistency. From the S3 docs, one would expect that the observed behavior is not possible, assuming DataFrame.save() in Spark is a fully blocking operation.

I'm trying to get hold of somebody at AWS who can shed some light on the details of S3's read-after-write consistency for us.

Since you already list the bucket before issuing the COPY command in #99, I'll add some diagnostic logging to track whether at least EMR's view of S3 is always fully consistent. The unfortunate truth, however, is that even if that turns out to be the case, we can't really rely on EMR's and Redshift's views of S3 being the same.
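
The diagnostic logging I have in mind is roughly the following (just a sketch; it assumes a SparkContext `sc` is in scope, and the temp directory path and partition count are placeholders):

    // Rough sketch: list the temp directory from the EMR side just before the COPY
    // and log whether the expected number of Avro part files is already visible.
    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    val tempDir = "s3n://my-temp-bucket/tmp/run-1234/"   // placeholder path
    val expectedParts = 24                                // partitions written by the job

    val fs = FileSystem.get(new URI(tempDir), sc.hadoopConfiguration)
    val avroCount = fs.listStatus(new Path(tempDir))
      .count(_.getPath.getName.endsWith(".avro"))

    if (avroCount < expectedParts) {
      println(s"S3 listing shows only $avroCount of $expectedParts expected Avro files")
    }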

JoshRosen commented on August 19, 2024

@lehnerm, I think that I may have some contacts on the S3 and Redshift teams, so I'll forward this thread to them to see if they have any insights.

Ping @aarondav, this is that weird S3 issue that I mentioned yesterday.

JoshRosen commented on August 19, 2024

Taking another look at the Amazon S3 Data Consistency Model docs (emphasis mine):

Amazon S3 achieves high availability by replicating data across multiple servers within Amazon's data centers. If a PUT request is successful, your data is safely stored. However, information about the changes must replicate across Amazon S3, which can take some time, and so you might observe the following behaviors:

  • A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.
  • [...]

According to the S3 FAQ:

Amazon S3 buckets in all Regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES.

On the AWS forum, someone asked about how to reconcile these statements, and according to ChrisP@AWS:

Read-after-write consistency is only valid for GETS of new objects - LISTS might not contain the new objects until the change is fully propagated.

So given this, if we used a LIST operation we might silently skip missing files. But since we specify every filename in the manifest, I would imagine that Redshift issues GETs directly for the listed files and hence would see the objects.

JoshRosen commented on August 19, 2024

In addition, according to the documentation for Redshift manifest files, emphasis mine:

You can explicitly specify which files to load by using a manifest file. When you use a manifest file, COPY enforces strong consistency by searching secondary servers if it does not find a listed file on the primary server. The manifest file can be configured with an optional mandatory flag. If mandatory is true and the file is not found, COPY returns an error.

JoshRosen commented on August 19, 2024

@lehnerm, a couple of other questions that I just thought of:

  • Have you configured any Hadoop or Spark OutputCommitter settings to be different from their defaults?
  • Are you using some sort of DirectOutputCommitter?
  • Is Spark's speculative execution enabled?

As of Spark 1.5.0, I'm not aware of any bugs related to these components which would explain the behavior that you saw here, but I just wanted to check.
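
For anyone else who wants to double-check the same three things, the relevant settings can be read back along these lines (a sketch only; it assumes a SparkContext `sc` in scope, and the property names are the standard Spark/Hadoop ones rather than anything spark-redshift-specific):

    // Sketch: confirm the job is running with default committers and without
    // speculative execution.
    val speculation  = sc.getConf.getBoolean("spark.speculation", false)
    val sqlCommitter = sc.getConf.get("spark.sql.sources.outputCommitterClass", "<default>")
    val mrCommitter  = sc.hadoopConfiguration.get("mapred.output.committer.class", "<default>")

    println(s"speculation=$speculation, sqlCommitter=$sqlCommitter, mapredCommitter=$mrCommitter")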

lankygit commented on August 19, 2024

We know that S3 is clustered, and we know that a request can beat replication if it arrives before a change has propagated; the consistency level is "strong", not "absolute". Using a manifest with the "mandatory" flag means we find out whenever a listed object couldn't be fetched, so it looks like we are protected: when we fail to get the object, we know about it, and this is probably working as designed. I suspect the AWS documentation could be made clearer, but the fact is that we have a closed control loop; we're not going to trip and fall, we just have to re-take a step sometimes when we run very fast.

lehnerm commented on August 19, 2024

Hi @JoshRosen,

Sorry for my late response; to answer your questions:

  • Have you configured any Hadoop or Spark OutputCommitter settings to be different from their defaults? No.
  • Are you using some sort of DirectOutputCommitter? No; I guess that follows from the first answer.
  • Is Spark's speculative execution enabled? If it's not enabled by default, no.

Would adding a retry mechanism, catching the Redshift exception and then repeating just the COPY statement after a configurable delay, be a valid solution from your perspective? It seems like we are only seeing minor consistency glitches lasting a couple of seconds, and once that happens we can simply wait for the file to appear.
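
Something along these lines is what I have in mind (a sketch only, with made-up names; the real thing would live inside spark-redshift where the COPY statement is issued):

    // Hypothetical retry around just the COPY statement: if Redshift reports a
    // missing key, wait for S3 to catch up and re-issue the COPY.
    import java.sql.{Connection, SQLException}

    def copyWithRetry(conn: Connection, copySql: String,
                      maxAttempts: Int = 3, delayMs: Long = 10000L): Unit = {
      var attempt = 1
      var done = false
      while (!done) {
        try {
          conn.createStatement().execute(copySql)
          done = true
        } catch {
          case e: SQLException
              if e.getMessage.contains("The specified key does not exist") && attempt < maxAttempts =>
            attempt += 1
            Thread.sleep(delayMs)   // configurable delay before retrying the COPY
        }
      }
    }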

JoshRosen commented on August 19, 2024

Based on a more careful reading of the announcement about the availability of strong read-after-write consistency in all regions (https://forums.aws.amazon.com/ann.jspa?annID=3112), it sounds like this might actually be saying that each region has an endpoint which provides these consistency guarantees, not that all endpoints in all regions have them. I guess one question is whether Redshift uses these endpoints for its writes and reads and whether our AWS client does the same. If possible, I wonder if we can pin spark-redshift to use the s3-external-1.amazonaws.com endpoint, which definitely supports the newer, stronger consistency guarantees.
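
If pinning the endpoint turns out to be viable, from the user side it would presumably look something like the sketch below. This assumes an s3a:// tempdir and Hadoop's standard fs.s3a.endpoint property, not anything spark-redshift exposes today, and it would only affect the Spark side of the transfer, not the endpoint that Redshift itself reads from:

    // Hypothetical: point the Hadoop S3A client at the read-after-write-consistent
    // endpoint discussed above (only meaningful for s3a:// tempdir URIs).
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-external-1.amazonaws.com")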

camspilly commented on August 19, 2024

Is this still being looked into?

caseyvu commented on August 19, 2024

Anyone looking into this issue? I'm getting the same error.
@lehnerm: did you find any workaround for this?

camspilly commented on August 19, 2024

Has there been any update on this issue?

tzhang101 commented on August 19, 2024

We are hitting the same issue occasionally with Spark 1.6.2 and "com.databricks" %% "spark-redshift" % "1.1.0". Any updates on this?

pnc commented on August 19, 2024

The S3 docs also say that:

Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket in all regions with one caveat. The caveat is that if you make a HEAD or GET request to the key name (to find if the object exists) before creating the object, Amazon S3 provides eventual consistency for read-after-write.

(Emphasis mine.)

Does anybody who understands the nitty-gritty of FileOutputCommitter better know if it indeed ends up triggering a GET or HEAD prior to writing? (Assuming algorithm 2, since that's the new default.)

lchhieu commented on August 19, 2024

Hi @JoshRosen, when I run the Redshift COPY command I get the error "Problem reading manifest file - S3ServiceException:The specified key does not exist.,Status 404". I don't know what this error means.

pnc commented on August 19, 2024

@lchhieu, can you post the exact command (perhaps with any role or IAM info scrubbed) that you're running that triggers this error? Also, does the error happen every time, or just on some runs?

lchhieu commented on August 19, 2024

@pnc , thank you

syntropo commented on August 19, 2024

I was discussing the consistency issue with someone from Databricks at a recent Spark summit, and they thought that using EMRFS's consistency option would alleviate it. I'm here to report that it does not seem to... We (very occasionally) get this error for an individual partition file when trying to write a DataFrame to Redshift:

S3ServiceException:The specified key does not exist.,Status 404,Error NoSuchKey

I asked AWS support to look into it and they said EMRFS would not even be used since Redshift COPY is not a Hadoop job. They suggested waiting 5-10 seconds between writing the files to S3 and calling upon Redshift to COPY, in order to avoid the error. Evidently S3 also 'negatively' caches GET and HEAD requests for up to 90 seconds, so subsequent requests must wait that long after getting the error.

Is there no way to add an option to spark-redshift that injects a little wait time? I'm using their RedshiftJDBC42-1.1.17.1017 driver - might I have better luck with the standard one?
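
Until there is such an option, the only workaround I can think of on our side is to catch the failure and re-run the whole write after a pause long enough to outlive that negative cache. A sketch only (the helper and names below are made up, not part of spark-redshift, and it assumes the COPY failure surfaces from save() as a SQLException as in the original report):

    // Hypothetical user-side workaround: retry the entire DataFrame save after a
    // pause that outlasts S3's ~90-second negative cache mentioned above.
    import java.sql.SQLException

    def saveWithRetry(write: () => Unit, maxAttempts: Int = 3): Unit = {
      var attempt = 1
      var done = false
      while (!done) {
        try { write(); done = true }
        catch {
          case e: SQLException
              if e.getMessage.contains("The specified key does not exist") && attempt < maxAttempts =>
            attempt += 1
            Thread.sleep(90 * 1000L)   // wait out the negative cache before retrying
        }
      }
    }

    // Usage (options elided for brevity):
    // saveWithRetry(() => df.write.format("com.databricks.spark.redshift").save())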

sphinks commented on August 19, 2024

Any update on fixing this issue? AWS Glue constantly throws this error, and Support is referring us to this issue. Is there any known workaround?

imranece59 commented on August 19, 2024

+1
This is affecting our larger load runs, and we have had to set up re-running the job every time.

SumitMoodys commented on August 19, 2024

Any update on the above issue?
