awslabs / autonomous-driving-data-framework

ADDF is a collection of modules deployed using the SeedFarmer orchestration tool. ADDF modules enable users to quickly bootstrap environments for processing and analyzing autonomous driving data.

License: Apache License 2.0

Python 87.22% Shell 6.13% Dockerfile 1.20% TypeScript 2.59% Jupyter Notebook 2.74% HCL 0.12%

autonomous-driving-data-framework's Introduction

Autonomous Driving Data Framework (ADDF)

ADDF is a collection of modules for scene detection, simulation (mock), visualization, compute, storage, centralized logging, and more, deployed using the SeedFarmer orchestration tool. ADDF lets you build distinct, standalone Infrastructure as Code (IaC) modules and exchange dependency information through metadata that is exported from one module and imported into another. Each module can be found under the modules directory of this repository.

Deployment Instructions

Refer to the SeedFarmer guide to understand how the SeedFarmer CLI is used to bootstrap and deploy ADDF.

Follow the instructions in the Deployment Guide Readme. The blog post explains ADDF in more detail.

Please see the ADDF Security and Operations Guide for in-depth recommendations on assessing, deploying, customizing, and operating ADDF.

Different types of modules supported by ADDF

Use-case specific Modules

Type	Description
Rosbag Image Pipeline Module	Deploys a Rosbag Image pipeline for use in ADDF
Rosbag WebViz Module	Deploys and visualizes Rosbag data on AWS using Webviz for use in ADDF

Optional Modules

Type	Description
DataLake Buckets Module	Deploys shared datalake buckets such as input, intermediate, output, logging, artifact buckets for use in ADDF

Integration Modules

Type	Description
DDB to Opensearch Module	This module integrates DynamoDB table with Opensearch cluster
EKS to Opensearch Module	This module integrates EKS Cluster with Opensearch cluster
EMR to Opensearch Module	This module integrates EMR Cluster with Opensearch cluster
Opensearch Proxy Module	This module deploys a Proxy server to access Opensearch cluster

Simulation Modules

Type	Description
K8s-Managed Module	This module helps run simulations on AWS EKS when triggered by the KubernetesJob Operator from an Airflow environment
AWS Batch Module	This module helps run simulations on AWS Batch when triggered by the Batch Operator from an Airflow environment

IDE Modules

Type	Description
Self Managed JupyterHub Module	This module deploys a self-managed JupyterHub environment on AWS EKS
Self Managed VSCode Module	This module deploys a self-managed VSCode environment on AWS EKS
AWS Managed EMR Studio Module	This module deploys an AWS-managed EMR Studio with EMR on EKS

Example Modules

Type	Description
Example DAG Module	This module deploys a pattern for integrating a target DAG module with a shared MWAA cluster

Reporting Issues

If you notice a defect, feel free to create an Issue.

Deployment FAQ

If you need to debug a deployment in ADDF, the Readme lists a few things you can check.


autonomous-driving-data-framework's Issues

Resizing Images before lane detection [FEATURE]

Is your feature request related to a problem? Please describe.
When running ADDF for the 'ros-image-demo' deployment with sample bag files, I get an error during lane detection similar to this GitHub issue. The proposed solution was to resize the images.

To test the solution, I resized the images to 1280x720 using a custom Python script. Then I re-ran the SageMaker Processing job by cloning it and giving it a different source S3 URI.

The SageMaker job succeeded with the new image size and produced proper lane detection results (see attached images).

Describe the solution you'd like
I resized the images to 1280x720 in a separate Python script. However, I'm not sure how to have ADDF do this by default before running the lane detection script.

Describe alternatives you've considered
N/A


[FEATURE] ADDF 2.0 release

For ADDF 2.0 release:

• The Core and Optional modules from ADDF have been replicated to IDF by making them project agnostic (removing ADDF hardcoding) and giving them unit-test coverage above 80%
• Modified the manifests in ADDF repo to use gitPath feature for the above modules
• Performed a full destroy and deploy of IDF modules using the manifests from ADDF repo


[FEATURE] CloudWatch Alarm - SNS - Email

Is your feature request related to a problem? Please describe.
You might want to use CloudWatch Alarms to monitor certain aspects of deployed resources and send notifications via Email.

Describe the solution you'd like
I could see a lot of use for a module that contains a stack of CloudWatch Alarm, SNS, and Email forwarding as an ADDF (or even IDF) SeedFarmer module.

Describe alternatives you've considered
Since I especially wanted to monitor a DLQ, forwarding the event to SNS would not be as good as using CloudWatch Alarms, because it would remove the event from the DLQ.

Additional context
I wanted to use this for DLQ monitoring, but having a pre-integrated stack with CWA-SNS-Email would be enough.
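
As a rough illustration of the alarm part of such a stack, the sketch below builds the keyword arguments that CloudWatch's PutMetricAlarm API accepts for an SQS dead-letter queue. It is not an ADDF or IDF module; the function name, queue name, topic ARN, period, and threshold are all assumptions chosen for the example. The email subscription on the SNS topic would be set up separately.

```python
def dlq_alarm_params(queue_name: str, topic_arn: str) -> dict:
    """Build PutMetricAlarm arguments that fire when a DLQ holds any message.

    Hypothetical helper: the alarm name, period, and threshold are example values.
    """
    return {
        "AlarmName": f"{queue_name}-not-empty",
        "Namespace": "AWS/SQS",
        # Messages stay in the queue; the alarm only reads the metric.
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # SNS topic that forwards the notification (for example, to email).
        "AlarmActions": [topic_arn],
    }


params = dlq_alarm_params(
    "my-dlq", "arn:aws:sns:eu-west-1:111111111111:my-alerts"
)
print(params["AlarmName"])
# With boto3 installed and credentials configured, the dict could be passed as:
# boto3.client("cloudwatch").put_metric_alarm(**params)
```

Because the alarm watches a queue metric instead of consuming messages, the events remain in the DLQ for later inspection.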

[BUG] rosbag-webviz deploying issue

Describe the bug
modules/demo-only/rosbag-webviz had errors deploying

To Reproduce
Steps to reproduce the behavior:

follow the ADDF deployment guide:
git clone --origin upstream --branch main https://github.com/awslabs/autonomous-driving-data-framework
...
seedfarmer apply manifests/demo/deployment.yaml

Expected behavior
All selected modules deployed without issues

Screenshots

Additional context
It looks like a dependency issue in rosbag-webviz

[codeseeder-addf]:

[Container] 2023/01/30 13:50:19 Running command npm install
npm ERR! code ERESOLVE
npm ERR! ERESOLVE could not resolve
npm ERR!
npm ERR! While resolving: [email protected]
npm ERR! Found: [email protected]
npm ERR! node_modules/typescript
npm ERR! dev typescript@"~3.9.7" from the root project
npm ERR! peer typescript@">=2.7" from [email protected]
npm ERR! node_modules/ts-node
npm ERR! dev ts-node@"^9.0.0" from the root project
npm ERR! peerOptional ts-node@">=9.0.0" from [email protected]
npm ERR! node_modules/jest-config
npm ERR! jest-config@"^29.3.1" from @jest/[email protected]
npm ERR! node_modules/@jest/core
npm ERR! @jest/core@"^29.3.1" from [email protected]
npm ERR! node_modules/jest
npm ERR! 1 more (jest-cli)
npm ERR! 1 more (jest-cli)
npm ERR!
npm ERR! Could not resolve dependency:
npm ERR! peer typescript@">=4.3" from [email protected]
npm ERR! node_modules/ts-jest
npm ERR! dev ts-jest@"^29.0.3" from the root project
npm ERR!
npm ERR! Conflicting peer dependency: [email protected]
npm ERR! node_modules/typescript
npm ERR! peer typescript@">=4.3" from [email protected]
npm ERR! node_modules/ts-jest
npm ERR! dev ts-jest@"^29.0.3" from the root project
npm ERR!
npm ERR! Fix the upstream dependency conflict, or retry
npm ERR! this command with --force, or --legacy-peer-deps
npm ERR! to accept an incorrect (and potentially broken) dependency resolution.
npm ERR!
npm ERR! See /root/.npm/eresolve-report.txt for a full report.

npm ERR! A complete log of this run can be found in:
npm ERR! /root/.npm/_logs/2023-01-30T13_50_19_881Z-debug-0.log

[Container] 2023/01/30 13:50:20 Command did not exit successfully npm install exit status 1
[Container] 2023/01/30 13:50:20 Phase complete: INSTALL State: FAILED_WITH_ABORT
[Container] 2023/01/30 13:50:20 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: npm install. Reason: exit status 1

[BUG] Broken link in main Readme

Describe the bug

In the page https://github.com/awslabs/autonomous-driving-data-framework/blob/main/README.md
the link Rosbag Scene Detection Module should point to https://github.com/awslabs/autonomous-driving-data-framework/blob/main/modules/analysis/rosbag-scene-detection/README.md

To Reproduce
Steps to reproduce the behavior:

visit homepage of repository and click on link

Expected behavior
Loading the right page


[BUG] Multiple deploy attempts for a successful deployment

Describe the bug
According to the AWS blog post, the deploy command "seedfarmer apply ./manifests/demo/deployment.yaml" should deploy all the needed modules to your AWS account.

In my case, on a MacBook Pro with CDK version 2.20 and seed-farmer~=0.1.1, I had to run the deploy command multiple times because each execution failed with an error similar to the following.

...
Traceback (most recent call last):
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/bin/seedfarmer", line 8, in <module>
    sys.exit(main())
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/seedfarmer/__main__.py", line 518, in main
    cli()
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/seedfarmer/__main__.py", line 70, in apply
    commands.apply(spec, dry_run, show_manifest)
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/seedfarmer/commands/_deployment_commands.py", line 454, in apply
    deploy_deployment(
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/seedfarmer/commands/_deployment_commands.py", line 383, in deploy_deployment
    _deploy_deployment_is_not_dry_run(
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/seedfarmer/commands/_deployment_commands.py", line 192, in _deploy_deployment_is_not_dry_run
    _ = list(workers.map(_exec_deploy, params))
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 608, in result_iterator
    yield fs.pop().result()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 438, in result
    return self.__get_result()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/seedfarmer/commands/_deployment_commands.py", line 177, in _exec_deploy
    return _execute_deploy(
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/seedfarmer/commands/_deployment_commands.py", line 111, in _execute_deploy
    return commands.deploy_module(
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/seedfarmer/commands/_module_commands.py", line 106, in deploy_module
    return _execute_module_commands(
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/seedfarmer/commands/_module_commands.py", line 212, in _execute_module_commands
    _execute_module_commands(
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/aws_codeseeder/codeseeder.py", line 348, in wrapper
    build_info = _remote.run(
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/aws_codeseeder/_remote.py", line 102, in run
    build_info = _execute_codebuild(
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/aws_codeseeder/_remote.py", line 82, in _execute_codebuild
    return _wait_execution(
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/aws_codeseeder/_remote.py", line 43, in _wait_execution
    for status in codebuild.wait(build_id=build_id):
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/aws_codeseeder/services/codebuild.py", line 240, in wait
    build = fetch_build_info(build_id=build_id)
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/aws_codeseeder/services/codebuild.py", line 189, in fetch_build_info
    phases=[
  File "/Users/andmoa/Documents/workspace/aws/artifacts/autonomous-driving-data-framework/.venv/lib/python3.9/site-packages/aws_codeseeder/services/codebuild.py", line 192, in <listcomp>
    status=None if "phaseStatus" not in p else BuildPhaseStatus(value=p["phaseStatus"]),
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/enum.py", line 360, in __call__
    return cls.__new__(cls, value)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/enum.py", line 677, in __new__
    raise ve_exc
ValueError: 'CLIENT_ERROR' is not a valid BuildPhaseStatus
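
The final frames show an Enum lookup failing because the service returned a status value the enum does not define. The sketch below reproduces that mechanism and shows one way to tolerate unknown values using Enum's `_missing_` hook. It is an illustration only, not the aws_codeseeder code: the member list is an assumed subset, and the `UNKNOWN` member and the `parse_phase_status` helper are hypothetical.

```python
from enum import Enum
from typing import Optional


class BuildPhaseStatus(Enum):
    """Illustrative stand-in for the enum named in the traceback."""

    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    FAULT = "FAULT"
    TIMED_OUT = "TIMED_OUT"
    IN_PROGRESS = "IN_PROGRESS"
    STOPPED = "STOPPED"
    UNKNOWN = "UNKNOWN"  # hypothetical catch-all member added for this sketch

    @classmethod
    def _missing_(cls, value: object) -> "BuildPhaseStatus":
        # Enum calls this when `value` matches no member. Returning a member
        # here avoids the ValueError seen at the end of the traceback.
        return cls.UNKNOWN


def parse_phase_status(phase: dict) -> Optional[BuildPhaseStatus]:
    """Same shape as the failing expression, but tolerant of new values."""
    if "phaseStatus" not in phase:
        return None
    return BuildPhaseStatus(phase["phaseStatus"])


print(parse_phase_status({"phaseStatus": "CLIENT_ERROR"}))
```

Mapping unrecognized values to a catch-all keeps the polling loop alive when the service introduces a new status, at the cost of hiding the exact value unless it is logged separately.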

In addition to the ValueError above, the build history of the main build project contains very short builds with FAILED status.
The logs in the Phase details tab show the following message:

ACCESS_DENIED: Service role arn:aws:iam::086314134900:role/addf-demo-integration-opensearch-proxy-c4fd8732 does not allow AWS CodeBuild to create Amazon CloudWatch Logs log streams for build arn:aws:codebuild:eu-west-1:086314134900:build/codeseeder-addf:115d0983-6fbe-4b28-a9fd-c30189935653. Error message: User: arn:aws:sts::086314134900:assumed-role/addf-demo-integration-opensearch-proxy-c4fd8732/AWSCodeBuild-115d0983-6fbe-4b28-a9fd-c30189935653 is not authorized to perform: logs:CreateLogStream on resource: arn:aws:logs:eu-west-1:086314134900:log-group:/aws/codebuild/codeseeder-addf:log-stream:codeseeder-vtqrmder/115d0983-6fbe-4b28-a9fd-c30189935653 because no identity-based policy allows the logs:CreateLogStream action

To Reproduce
Steps to reproduce the behavior:

follow blog steps
run deploy command
wait for output

Expected behavior
The deploy command should successfully complete for all modules, without the user needing to run the same command multiple times.

Screenshots


[Feature] SageMaker MLOps Modules

Is your feature request related to a problem? Please describe.
As a customer, I'd like to leverage the full capabilities of SageMaker for ML training for different models. To do this, I need to deploy a development environment for ML (SageMaker Studio), custom kernels for use during model development, custom images for use with SageMaker Processing & SageMaker Training jobs.

Finally, I need the ability to scale ML solutions for multiple model training pipelines and for this we should leverage SageMaker Projects, which needs a custom Module that allows designing and deploying custom organizational templates.

Describe the solution you'd like
Add support for a SageMaker-native MLOps solution including:

Module to Deploy SageMaker Studio
Module to Deploy a Custom SageMaker Project
Module to Build, Deploy and Register Custom Kernels for Studio

[BUG] YOLOP Lane Detection Image Scaling Issue

Describe the bug
When running the out-of-the-box VSI-DEMO MWAA DAG on the public Rosbag files from the blog post (s3://aws-autonomous-driving-datasets/test-vehicle-01/072021/small1__2020-11-19-16-21-22_4.bag), the SageMaker Processing job for Lane Detection fails with errors about image size (scaling). The size always seems to be off by a factor of 0.8 (600 -> 480, 1200 -> 960).

To Reproduce

Deploy the ros-image-demo manifest
Copy the public bag file above into the addf raw bucket in your account (you need to copy it twice because the batch job needs a minimum of 2 files, which is another bug)
Run the VSI sample data DAG with a config similar to:

{
    "drives_to_process": {
        "test-vehicle-01": {
            "bucket": "<addf-raw-bucket>",
            "prefix": "test-vehicle-01/072021/"
        },
        "test-vehicle-02": {
            "bucket": "<addf-raw-bucket>",
            "prefix": "test-vehicle-02/072021/"
        }
    }
}

Check the pipeline. The last step, the SageMaker Processing job, fails with the logs below.

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots

/opt/conda/lib/python3.8/site-packages/torchvision/io/image.py:11: UserWarning: Failed to load image Python extension: 
  warn(f"Failed to load image Python extension: {e}")

Using torch 1.10.2+cpu CPU
['/YOLOP/tools', '/opt/conda/lib/python38.zip', '/opt/conda/lib/python3.8', '/opt/conda/lib/python3.8/lib-dynload', '/opt/conda/lib/python3.8/site-packages', '/YOLOP']
=> creating runs/BddDataset/_2022-11-18-15-44
  0%|          | 0/202 [00:00<?, ?it/s]/opt/conda/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2156.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py:3631: UserWarning: Default upsampling behavior when mode=bilinear is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
  warnings.warn(
  0%|          | 0/202 [00:00<?, ?it/s]
image 1/202 /opt/ml/processing/input/image/frame_0005407_1605824495_875984259.png: 
Traceback (most recent call last):
  File "tools/detect_lanes.py", line 226, in <module>
    detect(cfg, opt)
  File "tools/detect_lanes.py", line 147, in detect
    img_det = show_seg_result(img_det, (da_seg_mask, ll_seg_mask), _, _, is_demo=True)
  File "/YOLOP/lib/utils/plot.py", line 57, in show_seg_result
    img[color_mask != 0] = img[color_mask != 0] * 0.5 + color_seg[color_mask != 0] * 0.5
IndexError: boolean index did not match indexed array along dimension 0; dimension is 1200 but corresponding boolean dimension is 960
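
The IndexError means the segmentation mask and the image disagree along dimension 0 (960 versus 1200). A minimal sketch of a pre-check that surfaces this mismatch with a clearer message, and that confirms the 0.8 factor reported above, could look like the following. The helper names are hypothetical and are not part of YOLOP or ADDF.

```python
from fractions import Fraction


def mask_to_image_ratio(image_height: int, mask_height: int) -> Fraction:
    """Return the exact ratio between mask height and image height."""
    return Fraction(mask_height, image_height)


def check_mask_height(image_height: int, mask_height: int) -> None:
    """Raise a descriptive error before any boolean indexing is attempted."""
    if image_height != mask_height:
        ratio = mask_to_image_ratio(image_height, mask_height)
        raise ValueError(
            f"mask height {mask_height} does not match image height "
            f"{image_height} (ratio {ratio}); resize one of them first"
        )


# Both pairs reported in the issue reduce to the same ratio, 4/5.
print(mask_to_image_ratio(600, 480), mask_to_image_ratio(1200, 960))
```

A check like this does not fix the scaling itself; it only makes the failure point at the mismatched dimensions instead of at the indexing expression.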

[BUG] Investigate Multi-Instance Deployment Issues (same account, different regions)

By design, deploying multiple ADDF instances in a single AWS account should work, whether they are in the same region or in different regions.

Same account - Different region deployment issues:

addf-example-dev-core-eks-masterrole already exists (IAM roles are global)

Custom::AWSCDKOpenIdConnectProviderCustomResourceProvider/Role (CustomAWSCDKOpenIdConnectProviderCustomResourceProviderRole517FED65) 
addf-example-dev-core-eks | 11/82 | 9:54:08 AM | DELETE_COMPLETE      | AWS::EFS::FileSystem                  | EFSFilesystem (EFSFilesystem073A610A) 

addf-example-dev-core-eks | 12/82 | 9:54:09 AM | ROLLBACK_COMPLETE    | AWS::CloudFormation::Stack            | addf-example-dev-core-eks 

Failed resources:

addf-example-dev-core-eks | 9:53:27 AM | CREATE_FAILED        | AWS::IAM::Role                        | ClusterAdminRole (ClusterAdminRole047D4FCA) addf-example-dev-core-eks-masterrole already exists

 ❌  addf-example-dev-core-eks failed: Error: The stack named addf-example-dev-core-eks failed creation, it may need to be manually deleted from the AWS console: ROLLBACK_COMPLETE

    at waitForStackDeploy (/usr/lib/node_modules/aws-cdk/lib/api/util/cloudformation.ts:311:11)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at prepareAndExecuteChangeSet (/usr/lib/node_modules/aws-cdk/lib/api/deploy-stack.ts:376:26)
    at CdkToolkit.deploy (/usr/lib/node_modules/aws-cdk/lib/cdk-toolkit.ts:209:24)
    at initCommandLine (/usr/lib/node_modules/aws-cdk/lib/cli.ts:342:12)
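
IAM role names must be unique within an account regardless of region, which is why a second deployment of the same stack in another region collides on the role name. One common way around this is to include the region in the name. The sketch below shows the idea; the helper name and the name layout are assumptions for illustration, not how the ADDF EKS module currently builds its names.

```python
def regional_role_name(stack_name: str, region: str, suffix: str) -> str:
    """Compose a role name that differs per region (hypothetical layout)."""
    return f"{stack_name}-{region}-{suffix}"


# The stack name and suffix are taken from the log above; regions are examples.
first = regional_role_name("addf-example-dev-core-eks", "us-east-1", "masterrole")
second = regional_role_name("addf-example-dev-core-eks", "eu-west-1", "masterrole")
print(first)
print(second)
```

Adding the region lengthens the name, so the result still has to stay within IAM's 64-character limit for role names.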

[BUG] Getting validation error during deployment of 'ros-image-demo' (fix added)

Describe the bug
I was trying to deploy the 'ros-image-demo' deployment, and during deployment I got a validation error when a DAG role was created. I checked, and it seems that the role name created at line 135 exceeds the maximum length of 64 characters.

I shortened the role name, but then started getting an error that the resource already exists. It seems this DAG role is created twice in the stack (in different templates), so to resolve the issue I commented out line 150. After that, the deployment succeeded without any issues.
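
A general way to keep generated role names within the 64-character limit is to truncate long names and append a short hash of the full name, so that distinct long names stay distinct. The sketch below illustrates that technique under stated assumptions: the helper name and the 8-character SHA-1 suffix are choices made for this example, not the fix applied in the repository.

```python
import hashlib

IAM_ROLE_NAME_MAX = 64  # IAM's documented limit for role names


def bounded_role_name(name: str, max_len: int = IAM_ROLE_NAME_MAX) -> str:
    """Return `name` unchanged if it fits, else a truncated name plus a hash."""
    if len(name) <= max_len:
        return name
    # A short digest of the full name keeps truncated names unique.
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()[:8]
    # Reserve room for the hyphen and the digest.
    return f"{name[: max_len - len(digest) - 1]}-{digest}"


# Hypothetical over-long name, used only to exercise the helper.
long_name = "addf-ros-image-demo-" + "x" * 60 + "-dag-role"
print(len(bounded_role_name(long_name)))
```

The hash is deterministic, so repeated deployments produce the same name for the same input.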

To Reproduce
Deploy the 'ros-image-demo' deployment until you get a validation error in one of the stacks where a role name exceeds the maximum length of 64 characters.

Expected behavior
The deployment should have succeeded without errors.

Screenshots

Additional context

Step Function to AWS Batch Sample Module

[FEATURE] Add Module/Example for Step Function to AWS Batch Workloads

Is your feature request related to a problem? Please describe.
Provide AWS Step Functions as a sample orchestrator in addition to Airflow/MWAA.

Describe the solution you'd like
Add an example module for EventBridge -> AWS Step Functions -> AWS Batch execution.

This is a common pattern in many AV/ADAS workloads such as resimulation with SILs for example.

[FEATURE] Pattern for Airflow starting EMR Clusters / kicking off Spark Applications

Describe the solution you'd like
Part of the Image Pipeline requires Spark applications. Therefore we need a module to spin up an EMR cluster from the Airflow DAG, or to leverage EMR Serverless.

Need to run latest version of EMR with Spark 3+ and Iceberg Support
Ability to spin up and terminate cluster in Ros-Image Airflow dag
Extend Rosbag-Image Dag with steps to spin up cluster and terminate
Use existing EKS module, potentially use the Airflow EMR on EKS operator and see if this works
Easy way to pass in EMR / Spark Configurations like:


[BUG] OpenSearch Proxy Using IMDSv1

The EC2 instance is using IMDSv1; we should default it to IMDSv2 in the CFN templates.
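
In CloudFormation, IMDSv2 is enforced by setting `HttpTokens` to `required` in the `MetadataOptions` of a launch template. The sketch below builds such a resource as a Python dictionary and prints it as JSON. How the proxy instance is actually defined in the module is not shown in this issue, so the use of a launch template and the logical ID are assumptions.

```python
import json


def imdsv2_launch_template(logical_id: str = "ProxyLaunchTemplate") -> dict:
    """Return a CloudFormation resource fragment that requires IMDSv2.

    The logical ID is a hypothetical name chosen for this example.
    """
    return {
        logical_id: {
            "Type": "AWS::EC2::LaunchTemplate",
            "Properties": {
                "LaunchTemplateData": {
                    "MetadataOptions": {
                        "HttpEndpoint": "enabled",
                        # "required" rejects IMDSv1 requests without a token.
                        "HttpTokens": "required",
                    }
                }
            },
        }
    }


fragment = imdsv2_launch_template()
print(json.dumps(fragment, indent=2))
```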

[BUG] Opensearch stack is stuck in DELETE_FAILED status

Describe the bug
When we deploy and then remove the modules/integration/ddb-to-opensearch/ module, the OpenSearch stack addf-ros-image-demo-core-opensearch ends up in DELETE_FAILED status, so we are unable to deploy modules/integration/ddb-to-opensearch/ again. The stack fails to delete because its AWS::EC2::SecurityGroup resource is stuck in DELETE_FAILED with the following message: resource _sg-id_ has a dependent object (Service: AmazonEC2; Status Code: 400; Error Code: DependencyViolation; Request ID: _masked_; Proxy: null)

To Reproduce

Deploy ADDF with OpenSearch; refer to the manifest here: https://github.com/manojrajpurohit/autonomous-driving-data-framework/blob/trigger-spark-on-eks/manifests/ros-image-demo/integration-modules.yaml
Remove manifests/ros-image-demo/optional-modules.yaml from the ADDF manifest and deploy again; this attempts to delete OpenSearch and puts the stack in DELETE_FAILED state
Include the OpenSearch module again in ADDF and deploy; OpenSearch is not deployed and the stack remains in DELETE_FAILED state

Expected behavior
The OpenSearch stack should be deployed and deleted without entering DELETE_FAILED state.

Screenshots

Additional context
I think this is because the security group is still associated with network interfaces, so CloudFormation cannot delete it due to the dependency. See https://aws.amazon.com/premiumsupport/knowledge-center/ec2-find-security-group-resources/

[BUG] Update aws-emr-launch library to newer version

Describe the bug
Update aws-emr-launch version to 2.0.1 to unblock use of rosbag-scene-detection module.

To Reproduce
Steps to reproduce the behavior:

Deploy the rosbag-scene-detection module; when the Step Function tries to create an EMR cluster, it fails.

Expected behavior
The EMR cluster should be created successfully.


Additional context
aws/aws-cdk#24842

[FEATURE] Improve example tf module

Is your feature request related to a problem? Please describe.
Improve the existing example-tf module to demonstrate how to use terraform.tfvars file and establish the wiring between root variables.tf, module level variables.tf and vars/terraform.tfvars per env

[BUG] - DDB-to-Opensearch Readme missing a parameter

The module modules/integration/ddb-to-opensearch is missing a parameter in the README - the PRIVATE_SUBNET_IDS (aka private-subnet-ids) is a required parameter but is not denoted.

Please add this parameter definition to the README of the module

[FEATURE] Refactor Rosbag Image pipeline w/ scene detection

Describe the solution you'd like
Refactor the Rosbag Image pipeline w/ scene detection workflow, so we don't have to deploy the rosbag-scene-detection module.


[Q1 2023] Example Terraform module

Provide a pattern for deploying Terraform IaC solutions using SeedFarmer.

[BUG] Replace EMR-Launch from stack

With the new emr-serverless module in IDF-modules, using emr-launch (from pypi) should be re-evaluated.

Remove emr-launch dependency from rosbag-scene-detection module
search all modules for emr-launch and remove

Split EFS from EKS Module

I just reviewed this module...you are correct that the EFS is indeed created. I would suggest leaving the CNI addon support and just moving the EFS creation to a different module altogether...and parameterize the rest (storage class, etc)

Example Terraform prereqs module

App Seeding Module (Seed Code, CodeCommit, CICD Pipeline)

[INVESTIGATION] OpenLineage support

Look into this blog post and supporting code to evaluate whether OpenLineage can be used in ADDF.

https://aws.amazon.com/blogs/big-data/automate-data-lineage-on-amazon-mwaa-with-openlineage/
https://github.com/aws-samples/aws-mwaa-openlineage

Focus on:

which modules we currently have in ADDF that can be repurposed (ex. S3, VPC, EKS)
which modules need to be created (ex. Redshift Serverless)
which modules need to be modified, if any (ex MWAA and requirements) that MUST be backward compatible
which compute should be used for Marquez and OpenLineage (suggest EKS)

[on hold] EKS based Ubuntu Desktop + NiceDCV using Kubectl port forwarding

@srinivasreddych adding more context:

AV/ADAS customers need to visualize sensor data collected in the vehicle. This includes the ability to visualize camera, LiDAR 3D point cloud, map, and other raw sensor data.

Customers use ROS visualization tools, such as Foxglove Studio, to do this. We've built a solution that allows streaming ROS bag files from s3 directly to Foxglove Studio. However, the problem is that streaming these files via the internet is not ideal.

We need a solution that allows deploying a visualization developer instance into a VPC and having the bag files streamed via S3 Gateway endpoints instead. This instance should use NiceDCV to allow connecting to the Desktop + have all the required ROS tooling and Foxglove studio installed (Ubuntu required).

We have a module that allows spinning up exactly such an EC2 instance (https://github.com/awslabs/autonomous-driving-data-framework/tree/release/1.0.0-reinvent/modules/visualization/dev-instance). However, again EC2 is not ideal since we want to be able to create and destroy these visualization development instances on demand (potentially from a UI). For this we are looking to replicate the dev-instance module above on EKS instead, with PODs that deploy Ubuntu Desktop + NiceDCV & Foxglove Studio on demand and can be created and destroyed on demand.

[BUG] PySpark Job fails on EMR-on-EKS with message Internal Failure in retrieving data

Describe the bug
When we try to submit a PySpark job on an EMR-on-EKS cluster, the job fails with the error message "Internal Failure in retrieving data".

To Reproduce

Deploy ADDF by including the module modules/core/emr-on-eks/ in the manifest.
Refer to the manifest YAML here: https://github.com/manojrajpurohit/autonomous-driving-data-framework/blob/trigger-spark-on-eks/manifests/ros-image-demo/emr-modules.yaml
Submit a PySpark script from Cloud9; you may use the command below for reference.
Use the command aws emr-containers list-virtual-clusters to get the virtual cluster ID.

aws emr-containers start-job-run \
--virtual-cluster-id <virtual cluster id> \
--name scene_detection_manual \
--execution-role-arn <EMR execution role> \
--release-label emr-6.8.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
    "entryPoint": "s3://<bucket_name>/path/to/pyspark/script.py",
    "entryPointArguments": ["--batch-metadata-table-name addf-ros-image-demo-dags-aws-drive-tracking --batch-id 6_dec_2022 --bucket addf-ros-image-demo-curated-bucket-75fe6115 --region ap-south-1 --output-dynamo-table addf-ros-image-demo-dags-aws-scenes"],
    "sparkSubmitParameters": "--conf spark.executor.instances=3 --conf spark.executor.memory=4G --conf spark.driver.memory=2G --conf spark.executor.cores=2 --conf spark.sql.shuffle.partitions=60 --conf spark.dynamicAllocation.enabled=false --packages com.audienceproject:spark-dynamodb_2.12:1.1.1"
  }
}'

Expected behavior
The PySpark jobs should run successfully.

Screenshots

Additional context

This execution was performed for scene detection, so you may use this PySpark code; the entryPointArguments above are provided to match that script. However, the script requires the previous ADDF steps to have completed, with data in the respective buckets and DynamoDB tables, so you may run any simple PySpark script to reproduce this error.
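
One detail of the command above worth checking is that `entryPointArguments` is an array of strings, while the command passes all flags as a single array element. Whether this is related to the reported failure is not established here. The sketch below shows one way to build the `--job-driver` JSON with each token as its own element; the helper name, the bucket path, and the shortened argument string are placeholders for the example.

```python
import json
import shlex


def build_job_driver(entry_point: str, argument_string: str,
                     spark_submit_parameters: str) -> dict:
    """Build the job-driver structure with one array element per argument."""
    return {
        "sparkSubmitJobDriver": {
            "entryPoint": entry_point,
            # shlex.split tokenizes the string the way a POSIX shell would.
            "entryPointArguments": shlex.split(argument_string),
            "sparkSubmitParameters": spark_submit_parameters,
        }
    }


driver = build_job_driver(
    "s3://<bucket_name>/path/to/pyspark/script.py",
    "--batch-id 6_dec_2022 --region ap-south-1",
    "--conf spark.executor.instances=3 --conf spark.executor.memory=4G",
)
# The printed JSON can be passed as the value of --job-driver.
print(json.dumps(driver, indent=2))
```

A script that parses its arguments with argparse receives four separate tokens in this form instead of one long string.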

Merge re:Invent workshop changes

Pull request for re:Invent workshop: Deploy the dev instance (niceDCV…), PR by [@oostha] (High Prio) review by [@dgraeber]

[BUG] OpenSearch Domain does not have module name embedded in name

When deploying the OpenSearch module, the domain that is created does not have the module name embedded...:

ex: domain addf-aws-solutions--074ff5b4

[BUG] modules/analysis/aws-batch-demo/modulestack.yaml too permissive

The modules/analysis/aws-batch-demo/modulestack.yaml contains the following permissions which should be addressed:

          - Action:
              - "secretsmanager:Get*"
              - "secretsmanager:DescribeSecret"
              - "secretsmanager:ListSecretVersionIds"
              - "kms:Decrypt*"
            Effect: Allow
            Resource: !Sub "arn:aws:secretsmanager:${AWS::Region}:${AWS::AccountId}:secret:*"

The kms:Decrypt action has no effect, as there isn't a KMS Resource specified.
The Resource for secretsmanager doesn't follow least-privilege best practices and should be scoped to an addf-prefixed secret or a specific secret.
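
As an illustration of the two points above, the sketch below builds a pair of narrower statements as plain Python dictionaries. It is an assumption-laden example rather than the module's actual fix: scoped_statements, the addf- prefix default, the choice of secretsmanager:GetSecretValue in place of secretsmanager:Get*, and the KMS key ARN are all hypothetical.

```python
import json


def scoped_statements(region: str, account_id: str, kms_key_arn: str,
                      secret_prefix: str = "addf-") -> list:
    """Build two IAM statements: one for secrets, one for the KMS key."""
    secret_arn = (
        f"arn:aws:secretsmanager:{region}:{account_id}:secret:{secret_prefix}*"
    )
    return [
        {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue",  # assumed replacement for Get*
                "secretsmanager:DescribeSecret",
                "secretsmanager:ListSecretVersionIds",
            ],
            # Only secrets whose name starts with the prefix.
            "Resource": secret_arn,
        },
        {
            # kms:Decrypt is placed in its own statement with a KMS key resource.
            "Effect": "Allow",
            "Action": ["kms:Decrypt"],
            "Resource": kms_key_arn,
        },
    ]


statements = scoped_statements(
    "eu-central-1",
    "111111111111",
    "arn:aws:kms:eu-central-1:111111111111:key/<key-id>",  # placeholder key ARN
)
print(json.dumps(statements, indent=2))
```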

[WEB-APP] Deploy Sketch Wireframes on AWS

[Q4 2022] Web Application High-fidelity wireframes

For frontend we need additional security review. Start early to take this into account

[BUG]

Describe the bug
I was following the blog here and ran into the error below:
An error occurred (AccessDenied) when calling the AssumeRole operation: User: arn:aws:iam::111111111111:user/temitayo is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::<111111111111>:role/seedfarmer-addf-toolchain-role

To Reproduce
git clone https://github.com/awslabs/autonomous-driving-data-framework.git
pip install -r requirements.txt
cp -R manifests/example-dev manifests/demo
source ./scripts/setup-secrets-example.sh
./scripts/setup-secrets-dockerhub.sh
seedfarmer apply ./manifests/demo/deployment.yaml

I also tried with release 1.2.1 and received a similar error

[BUG]rosbag scene detection no longer deploys

Describe the bug
When deploying rosbag-scene-detection, the Glue crawler resource in the CDK stack fails with an error that it is not able to find the S3 bucket analysis-scene-detection. `ecs-stack.py` deploys the Glue crawler, which references an incorrect S3 path.

Update deployment_guide.md with new bootstrap info

[FEATURE] Deploy Visualization EC2 Instance

Is your feature request related to a problem? Please describe.
We've added a module for visualizing ROS data on EC2 using NiceDCV. Add it to the ros-image-demo manifest to showcase how to configure and deploy the module.

Describe the solution you'd like

Add manifest to deploy the modules/visualization/dev-instance module

[BUG] Update CDK versions in older modules

Some modules referring to an older CDK should be updated as it appears that python 3.8 is no longer supported by AWS Lambda (for custom resources).

autonomous-driving-data-framework/modules/examples/eb-sf-batch/
- [email protected]
autonomous-driving-data-framework/modules/integration/efs-on-eks/
- [email protected]
autonomous-driving-data-framework/modules/integration/fsx-lustre-on-eks/
- [email protected]

[FEATURE]Tests coverage for ADDF

Is your feature request related to a problem? Please describe.
Need test-coverage for qualifying for AWS Solutions

[BUG] Incorrect package "apache-airflow-providers-amazon" mentioned in requirements.txt for MWAA

Describe the bug
The MWAA core module requirements.txt (https://github.com/awslabs/autonomous-driving-data-framework/blob/main/modules/core/mwaa/requirements/requirements.txt) uses the latest version of the package "apache-airflow[amazon]", which does not contain the class "AwsBatchOperator" referenced in https://github.com/awslabs/autonomous-driving-data-framework/blob/9fd8224e7a44e501dc7a9a952d1f7e023229f718/modules/analysis/rosbag-image-pipeline/image_dags/ros_image_pipeline.py (line: 33)

To Reproduce
Steps to reproduce the behavior:

Deploy the "release/1.0.0-reinvent" branch with the core-->mwaa module in Workshop Studio for the ADDF Immersion Day

Expected behavior
Remove package apache-airflow[amazon] from requirements.txt to use the correct package installed from other modules.

[BUG] sensorextraction/ros-to-parquet module failing at run time due to requests lib not installed

Describe the bug
The requests lib is missing from requirements.txt in the ros-to-parquet module, but it is imported in autonomous-driving-data-framework/modules/sensor-extraction/ros-to-parquet/src/main.py. This causes the batch job to fail at runtime with the ModuleNotFoundError shown in the logs below.

To Reproduce
Steps to reproduce the behavior:

deploy ros-image-demo
run airflow job
sensor-extraction.parquet-extraction-batch-job fails
view aws batch job logs

@timestamp	@message
2022-11-04 18:18:21.649	Traceback (most recent call last):
2022-11-04 18:18:21.649	File "main.py", line 24, in <module>
2022-11-04 18:18:21.649	import requests
2022-11-04 18:18:21.649	ModuleNotFoundError: No module named 'requests'

Expected behavior

Module does not error and successfully converts ros topics to parquet.

[BUG] OpenSearch Proxy Module breaks when using passwords with special characters

Describe the bug
When using auto-generated passwords from Secrets Manager, the passwords may include special characters that, unless properly escaped, break the /modules/demo-only/opensearch-proxy module.

To Reproduce
Steps to reproduce the behavior:

Go to Secrets Manager and generate a username and password complying with the module README
Use the automated password generation to include special characters; generate a password containing " or ' or $.
Deploy the module
Use EC2 Connect to ssh into the machine
Check the cloud-init logs: cat /var/log/cloud-init-output.log
Grab the generated user-data script: curl http://169.254.169.254/latest/user-data >> user-data.txt
Because of the unescaped bash special characters, the user-data script fails and the proxy module does not bootstrap properly

Expected behavior

Any password can be used and the user-data script and basic authentication should work
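
One way to meet this expectation is to quote every argument before it is written into the user-data command line. The sketch below assumes the command line is assembled in Python, which may not match how the module actually builds it; proxy_command, the script path, and the domain value are hypothetical. It uses shlex.quote from the standard library and the hand-made example password from this report.

```python
import shlex


def proxy_command(script_path: str, domain: str, username: str, password: str) -> str:
    """Return a shell command line with every argument safely quoted."""
    parts = (script_path, domain, username, password)
    return " ".join(shlex.quote(part) for part in parts)


# Hand-made example password containing ", ', $ and other shell metacharacters.
password = """[x1#x&"X-1.x1X(x}'X'X$]z=!lU'X;X[x"""
command = proxy_command(
    "/tmp/bootstrap.sh",  # hypothetical script path
    "vpc-example.eu-central-1.es.amazonaws.com",  # hypothetical domain
    "foo",
    password,
)
print(command)

# Parsing the command back with POSIX shell rules recovers the original values.
round_trip = shlex.split(command)
print(round_trip[-1] == password)
```

Quoting at the point where the command is generated keeps the secret itself unchanged, so basic authentication still receives the exact password stored in Secrets Manager.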

Logs

/var/log/cloud-init-output.log

cloud-init v. 19.3-45.amzn2 running 'modules:final' at Wed, 03 Aug 2022 16:17:33 +0000. Up 62.60 seconds.
download: s3://<removed>-eu-central-1/9db3c2c8b9cf8226e2576a982eac11f32c7e1bfcc97daa87d48d8c56a1b8dd65.sh to tmp/9db3c2c8b9cf8226e2576a982eac11f32c7e1bfcc97daa87d48d8c56a1b8dd65.sh
/var/lib/cloud/instance/scripts/part-001: line 6: unexpected EOF while looking for matching `"'
Aug 03 16:17:35 cloud-init[3222]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [2]
Aug 03 16:17:35 cloud-init[3222]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Aug 03 16:17:35 cloud-init[3222]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed

user-data.txt

#!/bin/bash
mkdir -p $(dirname '/tmp/9db3c2c8b9cf8226e2576a982eac11f32c7e1bfcc97daa87d48d8c56a1b8dd65.sh')
aws s3 cp 's3://<removed>-eu-central-1/9db3c2c8b9cf8226e2576a982eac11f32c7e1bfcc97daa87d48d8c56a1b8dd65.sh' '/tmp/9db3c2c8b9cf8226e2576a982eac11f32c7e1bfcc97daa87d48d8c56a1b8dd65.sh'
set -e
chmod +x '/tmp/9db3c2c8b9cf8226e2576a982eac11f32c7e1bfcc97daa87d48d8c56a1b8dd65.sh'
'/tmp/9db3c2c8b9cf8226e2576a982eac11f32c7e1bfcc97daa87d48d8c56a1b8dd65.sh' vpc-<removed>.eu-central-1.es.amazonaws.com foo [x1#x&"X-1.x1X(x}'X'X$]z=!lU'X;X[x

Note: the password above is only an example, generated by hand.

[BUG] Misleading word in Module Manifest section

Describe the bug
I'm getting familiar with seedfarmer and I'm not sure whether this is a bug or done on purpose, but I think "the name of the group" could be changed to "the name of the module". That would help readers understand what this name means exactly. If I'm wrong, please add more context to this section of the documentation.

To Reproduce
https://github.com/awslabs/seed-farmer/blob/54eac1d433c2e2f7eefafb055229814382eb2146/docs/source/manifests.md?plain=1#L247

Expected behavior
the name of the module

[BUG] Typos in ADDF deployment guide

Describe the bug
There are typos in the first two commands of the ADDF Deployment Guide Readme.
Neither the https://github.com/awslabs/autonomouse-driving-data-framework URL nor the autonomouse-driving-data-framework folder exists (see the attached file).

To Reproduce
Execute the first two commands of the deployment guide:
git clone --origin upstream --branch release/1.0.0 https://github.com/awslabs/autonomouse-driving-data-framework
cd autonomouse-driving-data-framework

Expected behavior
Clone of the branch and folder change

Screenshots

Additional context
NA

[BUG] MWAA version conflict

Describe the bug
When deploying core/mwaa, the dags for analysis/rosbag-image fail to load due to references to BatchCreateComputeEnvironmentOperator in dependency libraries.

Triage shows that this issue is related to the installation of the library apache-airflow[amazon] in modules/core/mwaa/requirements/requirements.txt. A temporary fix was to remove this reference, but that breaks the dags located in modules/examples/example-spark-dags/example_spark_dags/citibike_all_dag.py

On main and on release/1.1.0, in order to support EMR-on-EKS from MWAA while keeping backward support for rosbag pipelines, we will need to parametrize the requirements file and allow the manifests to pass in which requirements file is needed for each use case (e.g. one MWAA module manifest entry for rosbag-image, one MWAA module manifest entry for emr-on-eks).

This change needs to occur to support multiple versions of DAGs.
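
A rough sketch of what such a parameter could look like on the module side is shown below. Everything in it is an assumption made for illustration: the parameter name ADDF_PARAMETER_REQUIREMENTS_FILE, the file name requirements-emr.txt, and the helper resolve_requirements_file do not come from the repository.

```python
from pathlib import Path

DEFAULT_REQUIREMENTS = "requirements.txt"
PARAMETER_NAME = "ADDF_PARAMETER_REQUIREMENTS_FILE"  # hypothetical parameter


def resolve_requirements_file(requirements_dir: Path, env: dict) -> Path:
    """Pick the requirements file named by a module parameter, else the default."""
    name = env.get(PARAMETER_NAME, DEFAULT_REQUIREMENTS)
    candidate = requirements_dir / name
    # Reject names that would escape the requirements directory.
    if candidate.resolve().parent != requirements_dir.resolve():
        raise ValueError(f"requirements file must live in {requirements_dir}")
    return candidate


base = Path("modules/core/mwaa/requirements")
default_file = resolve_requirements_file(base, {})
emr_file = resolve_requirements_file(base, {PARAMETER_NAME: "requirements-emr.txt"})
print(default_file.name, emr_file.name)
```

In a real module the env argument would typically be os.environ, so that each MWAA manifest entry can select its own file while existing manifests keep the default.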

[Q1 2023]FSx for Lustre on EKS

Create 2 new modules:

support for Lustre on FSx
support for FSx-on-EKS

[FEATURE]Update the example manifests to use the `git` path

Update the manifests to use the git path feature of seedfarmer and consume the modules from IDF

Add Visualization Instance to ros-image-demo manifest

new issue

[BUG] vscode-on-eks README.md wrong title

Describe the bug
Please validate vscode-on-eks README.md title. It might be wrong. The current title is for another module - JupyterHub.

To Reproduce
Steps to reproduce the behavior:

open https://github.com/awslabs/autonomous-driving-data-framework/blob/main/modules/demo-only/vscode-on-eks/README.md and find the title (Deploying JupyterHub)

Expected behavior
Can be - Deploying VSCode IDE or similar.

Screenshots
Deploying JupyterHub

Additional context
not required

[FEATURE] Add pipeline for training YoloV5 model

Is your feature request related to a problem? Please describe.
The MLOps Modules allow us to do model training on AWS either via SageMaker or EKS. Add a sample training pipeline for the model that is used by the object/lane detection pipelines, to allow customers to improve the model over time.

Describe the solution you'd like

SageMaker Project Template that includes a SageMaker Pipeline to train the YoloV5 Model
KubeFlow Project + Pipeline that includes the training pipeline for training the YoloV5 Model
