Created a spark job subclassing CCSparkJob to retrieve html text data. This job

boto3 credentials error when running CCSparkJob with ~100 S3 WARC paths as input, but works with <10 S3 WARC paths as input (cc-pyspark, closed)

commoncrawl commented on June 18, 2024

boto3 credentials error when running CCSparkJob with ~100 S3 warc paths as input, but works with <10 S3 warc paths as input

from cc-pyspark.

Comments (5)

sebastian-nagel commented on June 18, 2024

Hi @praveenr019, given the error message "botocore.exceptions.NoCredentialsError: Unable to locate credentials": is the job run on a Spark cluster or on a single instance? If on a cluster: how are the credentials deployed to the cluster instances (e.g. via IAM roles)?

See https://github.com/commoncrawl/cc-pyspark#authenticated-s3-access-or-access-via-http
If not running on AWS: use --input_base_url https://data.commoncrawl.org/
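
To illustrate the HTTP alternative, here is a minimal sketch of how a WARC path could be joined with that base URL. The helper name `to_http_url` and the example path are hypothetical, and this is not necessarily how cc-pyspark resolves paths internally; it only assumes that paths are either relative (`crawl-data/...`) or prefixed with `s3://commoncrawl/`.

```python
S3_PREFIX = "s3://commoncrawl/"  # public Common Crawl bucket


def to_http_url(warc_path: str, base_url: str = "https://data.commoncrawl.org/") -> str:
    """Map a WARC path (relative or s3://commoncrawl/...) to an HTTPS URL.

    Hypothetical helper for illustration only.
    """
    if warc_path.startswith(S3_PREFIX):
        # Drop the bucket prefix so only the object key remains.
        warc_path = warc_path[len(S3_PREFIX):]
    # Avoid a double or missing slash between base URL and key.
    return base_url.rstrip("/") + "/" + warc_path.lstrip("/")


# Placeholder path, not a real file:
example = to_http_url("s3://commoncrawl/crawl-data/EXAMPLE/warc/example.warc.gz")
```

Fetching over HTTPS needs no AWS credentials, so it sidesteps the `NoCredentialsError` entirely.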

sebastian-nagel commented on June 18, 2024

If on a single instance: I haven't seen a credential error just because of processing more data. How are the credentials configured?

praveenr019 commented on June 18, 2024

Thanks for the reply @sebastian-nagel. Yes, the job is run on a Spark cluster in AWS and the credentials are set up using IAM roles.

sebastian-nagel commented on June 18, 2024

No clue what the reason could be. I have never seen this.

My assumption is that in cluster mode, every Python runner is a separate process. This would exclude any concurrency issues while fetching the credentials (for example here).

To address the problem, I'd catch the NoCredentialsError alongside the ClientError (sparkcc.py, line 283), log the error, re-instantiate the S3 client and try the download a second time. Let me know if you need help implementing this. Otherwise, it would be interesting to hear whether this solves the problem.
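
A minimal sketch of the retry described above, not the actual sparkcc.py code. The function `download_with_client_retry` and its parameters are hypothetical; the client factory, the download call and the exception types are passed in, so the boto3 wiring appears only as a comment and is an assumption about how it might be hooked up.

```python
import logging
from typing import Callable, Tuple, Type

logger = logging.getLogger(__name__)


def download_with_client_retry(
    make_client: Callable[[], object],
    download: Callable[[object], None],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 2,
) -> bool:
    """Run download(client); on a listed error, log it, rebuild the client, retry.

    Returns True on success and False if every attempt failed.
    """
    client = make_client()
    for attempt in range(1, max_attempts + 1):
        try:
            download(client)
            return True
        except retry_on as exc:
            logger.error("Download attempt %d/%d failed: %s", attempt, max_attempts, exc)
            if attempt < max_attempts:
                # Re-instantiate the client so credentials are looked up again.
                client = make_client()
    return False


# Hypothetical wiring with boto3 (bucket, key and fileobj are placeholders):
#   import boto3
#   from botocore.exceptions import ClientError, NoCredentialsError
#   ok = download_with_client_retry(
#       lambda: boto3.client("s3"),
#       lambda s3: s3.download_fileobj(bucket, key, fileobj),
#       (ClientError, NoCredentialsError),
#   )
```

If the download writes into a file object, that object would likely need to be rewound and truncated before the second attempt so that a partial first download does not corrupt the result.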

sebastian-nagel commented on June 18, 2024

Closing for now. @praveenr019 let me know if this is still an issue!
