
Comments (6)

dgraeber commented on August 11, 2024

Is there any way to determine which bucket it cannot access? Is it the bucket where the PySpark code resides, or a bucket where the data resides? Can you look in the logs for more details?

from autonomous-driving-data-framework.

manojrajpurohit commented on August 11, 2024

One thing I cannot understand: my EMR virtual cluster's EKS namespace is emr-eks-spark (see point 1 below), whereas there is no namespace called "emr-eks-spark" in the EKS cluster (see point 2 below). My understanding is that the EKS namespace must be created first, and only then can the EMR virtual cluster be created (I may be wrong).
If a namespace is a hard prerequisite for an EMR virtual cluster, then the namespace emr-eks-spark must have existed at some point, but I cannot find any place in ADDF where the EKS namespace emr-eks-spark is created.

1. describe virtual cluster to get namespace 
(.venv) (base) aws emr-containers describe-virtual-cluster --id <masked>

{
    "virtualCluster": {
        "id": "<masked>",
        "name": "addf-ros-image-demo-emr-emr-<masked>",
        "arn": "arn:aws:emr-containers:<masked>:<masked>:/virtualclusters/<masked>",
        "state": "RUNNING",
        "containerProvider": {
            "type": "EKS",
            "id": "addf-ros-image-demo-core-eks-cluster",
            "info": {
                "eksInfo": {
                    "namespace": "emr-eks-spark"
                }
            }
        },
        "createdAt": "<masked>",
        "tags": {
            "Deployment": "addf-ros-image-demo"
        }
    }
}

2. describe all namespaces
(.venv) (base) kubectl describe ns
Name:         default
Labels:       kubernetes.io/metadata.name=default
Annotations:  <none>
Status:       Active

No resource quota.

No LimitRange resource.


Name:         kube-node-lease
Labels:       kubernetes.io/metadata.name=kube-node-lease
Annotations:  <none>
Status:       Active

No resource quota.

No LimitRange resource.


Name:         kube-public
Labels:       kubernetes.io/metadata.name=kube-public
Annotations:  <none>
Status:       Active

No resource quota.

No LimitRange resource.


Name:         kube-system
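
The two checks above can be combined into a single comparison. This is a minimal sketch, assuming `aws` and `kubectl` are already configured for the same account and cluster; the function names are made up for illustration and are not part of ADDF.

```shell
# Hypothetical helpers; the names are illustrative, not part of ADDF.

# Print the EKS namespace a virtual cluster is registered against.
virtual_cluster_namespace() {
  aws emr-containers describe-virtual-cluster --id "$1" \
    --query 'virtualCluster.containerProvider.info.eksInfo.namespace' \
    --output text
}

# Succeed only if the namespace exists in the current kubectl context.
namespace_exists() {
  kubectl get namespace "$1" >/dev/null 2>&1
}

# Report whether the virtual cluster's namespace is present in the cluster.
check_virtual_cluster_namespace() {
  local ns
  ns=$(virtual_cluster_namespace "$1") || return 2
  if namespace_exists "$ns"; then
    echo "namespace '$ns' exists"
  else
    echo "namespace '$ns' is missing"
    return 1
  fi
}
```

Calling `check_virtual_cluster_namespace <virtual-cluster-id>` would then print a one-line verdict and return non-zero when the namespace is absent.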


manojrajpurohit commented on August 11, 2024

Is there any way to determine which bucket it cannot access? Is it the bucket where the PySpark code resides, or a bucket where the data resides? Can you look in the logs for more details?

I tried submitting the PySpark job with an S3 log bucket and CloudWatch Logs specified (see the command below). The job was submitted successfully but failed with the same error as stated in the issue, and no logs appeared in either S3 or CloudWatch Logs. I had granted temporary elevated access to the "execution-role-arn" role before submitting the job, so it does not seem to be an IAM access issue on the EMR job's end.

aws emr-containers start-job-run \
--virtual-cluster-id <masked> \
--name scene_detection_manual \
--execution-role-arn arn:aws:iam::<masked>:role/addf-ros-image-demo-emr-e-<masked> \
--release-label emr-6.8.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
    "entryPoint": "s3://addf-ros-image-demo-artifacts-bucket-<masked>/dags/ros-image-demo/dags-aws/spark_scripts/detect_scenes.py",
    "entryPointArguments": ["--batch-metadata-table-name addf-ros-image-demo-dags-aws-drive-tracking --batch-id 6_dec_2022 --bucket addf-ros-image-demo-curated-bucket-<masked> --region ap-south-1 --output-dynamo-table addf-ros-image-demo-dags-aws-scenes"],
    "sparkSubmitParameters": "--conf spark.executor.instances=3 --conf spark.executor.memory=4G --conf spark.driver.memory=2G --conf spark.executor.cores=2 --conf spark.sql.shuffle.partitions=60 --conf spark.dynamicAllocation.enabled=false --packages com.audienceproject:spark-dynamodb_2.12:1.1.1"
  }
}' --configuration-overrides '{
  "monitoringConfiguration": {
    "cloudWatchMonitoringConfiguration": {
      "logGroupName": "/emr-on-eks/emr-on-eks-to-delete",
      "logStreamNamePrefix": "detect_scenes_todelete"
    },
    "s3MonitoringConfiguration": {
       "logUri": "s3://addf-ros-image-demo-logs-bucket-<masked>/emr-on-eks"
    }
  }
}' 
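
When a job run fails before any logs reach S3 or CloudWatch, the job run record itself may still carry a reason. The sketch below assumes the job run ID returned by `start-job-run` is at hand; the function name `job_run_failure` is hypothetical, and the fields it prints may be empty depending on how the run failed.

```shell
# Hypothetical helper: print the state, failure reason, and state details
# recorded for a job run. Some fields may come back as "None".
job_run_failure() {
  local cluster_id="$1" job_id="$2"
  aws emr-containers describe-job-run \
    --virtual-cluster-id "$cluster_id" \
    --id "$job_id" \
    --query 'jobRun.[state, failureReason, stateDetails]' \
    --output text
}
```

Usage would look like `job_run_failure <virtual-cluster-id> <job-run-id>`.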


manojrajpurohit commented on August 11, 2024

Since the namespace 'emr-eks-spark' did not exist, we followed this doc link to test further:

  1. created a new namespace "spark"
    kubectl create namespace spark
  2. added emr-containers to the config map of the EKS cluster
    eksctl create iamidentitymapping --cluster addf-ros-image-demo-core-eks-cluster --namespace spark --service-name "emr-containers"
  3. enabled IAM roles (associated an IAM OIDC provider with the cluster)
    eksctl utils associate-iam-oidc-provider --cluster addf-ros-image-demo-core-eks-cluster --approve
  4. created a new virtual cluster
aws emr-containers create-virtual-cluster \
--name to-delete \
--container-provider '{
    "id": "addf-ros-image-demo-core-eks-cluster",
    "type": "EKS",
    "info": {
        "eksInfo": {
            "namespace": "spark"
        }
    }
}'
  5. I was able to submit Spark jobs to this new virtual cluster; the jobs stayed in the scheduled state for 15 minutes and then failed.

Observation: Earlier, the Spark jobs failed as soon as they were submitted (<2 seconds). In the new EMR virtual cluster, created with an existing namespace, the jobs stay in the scheduled state for 15 minutes and then fail. In the scheduled state, Spark's resource manager negotiates resource allocation with the cluster manager, so I think the communication or resource allocation between Spark and the EKS cluster is the root cause of this issue.
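
One way to narrow down a job that sits in the scheduled state is to look at what Kubernetes reports for the namespace. This is only a sketch, assuming the job's pods are created in the namespace passed in (here that would be `spark`) and remain unschedulable; the function name `scheduling_report` is hypothetical.

```shell
# Hypothetical helper: show pods stuck in Pending and the most recent
# events for a namespace, which often explain why scheduling stalls.
scheduling_report() {
  local ns="$1"
  echo "== Pending pods in $ns =="
  kubectl get pods -n "$ns" --field-selector=status.phase=Pending
  echo "== Recent events in $ns =="
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
}
```

Running `scheduling_report spark` while a job is in the scheduled state would list any pending pods alongside the events recorded for them.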

PS: I have discussed this with @kevinsoucy


dgraeber commented on August 11, 2024

@srinivasreddych @manojrajpurohit Sooo...I think you may have hosed your cluster when you ran the eksctl utils associate-iam-oidc-provider --cluster addf-ros-image-demo-core-eks-cluster --approve command as the ADDF cluster already has an OIDC provider. It sounds like the service account for EMR-on-EKS was not installed by the module correctly? @srinivasreddych can you look and we can circle back later?
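
Whether a cluster already has an IAM OIDC provider can be checked before running the associate command. The following is a minimal sketch, assuming the ID at the end of the cluster's OIDC issuer URL also appears in the ARN of the matching IAM provider; the function name `oidc_provider_registered` is hypothetical.

```shell
# Hypothetical helper: succeed if the cluster's OIDC issuer is already
# registered as an IAM OIDC provider in this account.
oidc_provider_registered() {
  local cluster="$1" issuer id
  issuer=$(aws eks describe-cluster --name "$cluster" \
    --query 'cluster.identity.oidc.issuer' --output text) || return 2
  id=${issuer##*/}               # keep only the trailing issuer ID
  [ -n "$id" ] || return 2
  aws iam list-open-id-connect-providers \
    --query 'OpenIDConnectProviderList[].Arn' --output text \
    | grep -qF -- "$id"
}
```

For example, `oidc_provider_registered addf-ros-image-demo-core-eks-cluster && echo "already associated"` would print the message only when a matching provider is found.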


dgraeber commented on August 11, 2024

Closing due to inactivity. Please reopen once eyes can focus on it
