azure / aml-compute

GitHub Action that allows you to attach, create and scale Azure Machine Learning compute resources.

License: MIT License

aml-compute's Introduction

GitHub Action for creating compute targets for Azure Machine Learning

Deprecation notice

This Action is deprecated. Instead, consider using the CLI (v2) to manage and interact with Azure Machine Learning compute in GitHub Actions.

Important: The CLI (v2) is not recommended for production use while in preview.

Usage

This action lets you create a new compute target in Azure Machine Learning from a GitHub Actions workflow.

Get started today with a free Azure account!

This repository contains a GitHub Action for creating and connecting to Azure Machine Learning compute resources, so you can later train or deploy machine learning models remotely. If the compute target exists, the action connects to it; otherwise, it can create a new compute target based on the provided parameters. Currently, the action only supports Azure ML Clusters and AKS Clusters.
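The connect-or-create behavior described above can be sketched roughly as follows. This is a minimal illustration and not the action's actual code: `ComputeNotFound`, `attach_or_create`, and the two callables `get_existing` and `create` are hypothetical names standing in for whatever lookup and provisioning calls the action makes.

```python
from typing import Callable, Optional


class ComputeNotFound(Exception):
    """Hypothetical error raised when no compute target has the given name."""


def attach_or_create(
    name: str,
    compute_type: Optional[str],
    get_existing: Callable[[str], object],
    create: Callable[[str, str], object],
) -> object:
    """Return the existing compute target, or create one if it is missing."""
    try:
        # First try to connect to a compute target with this name.
        return get_existing(name)
    except ComputeNotFound:
        # Creating a target requires knowing which kind of cluster to provision.
        if compute_type not in ("amlcluster", "akscluster"):
            raise ValueError(
                "compute_type must be 'amlcluster' or 'akscluster' to create a target"
            )
        return create(name, compute_type)
```

Passing the lookup and creation steps in as callables keeps the decision logic easy to try out without an Azure subscription.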

Dependencies on other GitHub Actions

  • Checkout: Checks out your Git repository content into the GitHub Actions agent.
  • aml-workspace: This action requires an Azure Machine Learning workspace to be present. You can either create a new one or re-use an existing one using the action.

Utilize GitHub Actions and Azure Machine Learning to train and deploy a machine learning model

This action is one in a series of actions that can be used to set up an ML Ops process. We suggest getting started with one of our template repositories, which will allow you to create an ML Ops process in less than 5 minutes.

  1. Simple template repository: ml-template-azure

    Go to this template and follow the getting started guide to set up an ML Ops process within minutes and learn how to use the Azure Machine Learning GitHub Actions in combination. This template demonstrates a very simple process for training and deploying machine learning models.

  2. Advanced template repository: mlops-enterprise-template

    This template demonstrates how the actions can be extended to include the normal pull request approval process and how training and deployment workflows can be split. More enhancements will be added to this template in the future to make it more enterprise ready.

Example workflow for creating compute targets for Azure Machine Learning

name: My Workflow
on: [push, pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
    - name: Check Out Repository
      id: checkout_repository
      uses: actions/checkout@v2

    # AML Workspace Action
    - uses: Azure/aml-workspace@v1
      id: aml_workspace
      with:
        azure_credentials: ${{ secrets.AZURE_CREDENTIALS }}

    # AML Compute Action
    - uses: Azure/aml-compute@v1
      id: aml_compute
      with:
        # required inputs as secrets
        azure_credentials: ${{ secrets.AZURE_CREDENTIALS }}
        # optional
        parameters_file: "compute.json"

Inputs

| Input | Required | Default | Description |
| --- | --- | --- | --- |
| azure_credentials | x | - | Output of az ad sp create-for-rbac --name <your-sp-name> --role contributor --scopes /subscriptions/<your-subscriptionId>/resourceGroups/<your-rg> --sdk-auth. This should be stored in your secrets. |
| parameters_file | | "compute.json" | The action expects a JSON file in the .cloud/.azure folder in the root of your repository that specifies your Azure Machine Learning compute target details. If you want to provide these details in a file other than "compute.json", you need to provide this input in the action. |

azure_credentials (Azure Credentials)

Azure credentials are required to connect to your Azure Machine Learning workspace. If you already created them for another action in your repository, you can skip the steps below.

Install the Azure CLI on your computer or use the Cloud CLI and execute the following command to generate the required credentials:

# Replace {service-principal-name}, {subscription-id} and {resource-group} with a name of your choice for the service principal, your Azure subscription ID and your resource group name
az ad sp create-for-rbac --name {service-principal-name} \
                         --role contributor \
                         --scopes /subscriptions/{subscription-id}/resourceGroups/{resource-group} \
                         --sdk-auth

This will generate the following JSON output:

{
  "clientId": "<GUID>",
  "clientSecret": "<GUID>",
  "subscriptionId": "<GUID>",
  "tenantId": "<GUID>",
  (...)
}

Add this JSON output as a secret with the name AZURE_CREDENTIALS in your GitHub repository.
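Before storing the secret, it can help to confirm that the JSON contains the four fields shown above. The following is a small sketch for that purpose; the helper name `missing_credential_keys` is hypothetical, and the check covers only the keys listed in the sample output, not any further fields elided by "(...)".

```python
import json

# Keys taken from the sample output above; other fields may also be present.
REQUIRED_KEYS = ("clientId", "clientSecret", "subscriptionId", "tenantId")


def missing_credential_keys(raw_secret: str) -> list:
    """Return the expected keys that are absent from the secret's JSON."""
    credentials = json.loads(raw_secret)
    return [key for key in REQUIRED_KEYS if key not in credentials]
```

An empty list means all four keys are present; run this locally only and avoid printing the secret itself.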

parameters_file (Parameter File)

The action tries to load a JSON file from the .cloud/.azure folder in your repository, which specifies the details of your Azure Machine Learning compute target. By default, the action looks for a file named compute.json. If your JSON file has a different name, you can specify it with this input parameter. Currently, the action only supports Azure ML Clusters and AKS Clusters. None of these values are required; if they are absent, defaults are created using the repository name.
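The file lookup described above could look roughly like the sketch below. It is an illustration under assumptions, not the action's implementation: the function name `load_parameters` is hypothetical, and treating a missing file as an empty parameter set is an assumption, since the text only states that individual values are optional.

```python
import json
from pathlib import Path


def load_parameters(
    repo_root: str, default_name: str, parameters_file: str = "compute.json"
) -> dict:
    """Load compute parameters from .cloud/.azure, falling back to defaults."""
    path = Path(repo_root) / ".cloud" / ".azure" / parameters_file
    parameters = {}
    if path.is_file():
        parameters = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: the repository name is used when no name is given.
    parameters.setdefault("name", default_name)
    return parameters
```

For example, `load_parameters(".", "my-repo")` reads `./.cloud/.azure/compute.json` if it exists and otherwise returns only the default name.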

Sample files for AML and AKS clusters can be found in this repository in the folder .cloud/.azure. The JSON file can include the following parameters:

Common parameters
| Parameter | Required | Allowed Values | Default | Description |
| --- | --- | --- | --- | --- |
| name | | str | <REPOSITORY_NAME> | The name of the compute object to retrieve or create. The name can be at most 16 characters long and can include letters, digits and dashes. It must start with a letter and end with a letter or digit. |
| compute_type | (only for creating compute target) | str: "amlcluster", "akscluster" | - | Specifies the type of compute target that the action creates if no compute target with the specified name is found. |
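The naming rule for the name parameter can be checked locally before running the workflow. The sketch below encodes only the constraints stated in the table (at most 16 characters; letters, digits and dashes; starts with a letter; ends with a letter or digit). The helper name `is_valid_compute_name` is hypothetical, and the service may enforce additional rules, such as a minimum length, that are not listed here.

```python
import re

# One leading letter, then optionally up to 14 middle characters and a
# final letter or digit, for a total length of 1 to 16 characters.
NAME_PATTERN = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]{0,14}[A-Za-z0-9])?")


def is_valid_compute_name(name: str) -> bool:
    """Check a compute name against the constraints listed in the table."""
    return NAME_PATTERN.fullmatch(name) is not None
```

For instance, `is_valid_compute_name("cpu-cluster")` passes, while a name that starts with a digit or ends with a dash does not.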
AML Cluster
| Parameter | Required | Allowed Values | Default | Description |
| --- | --- | --- | --- | --- |
| vm_size | | str: "Basic_A0", "Standard_DS3_v2", etc. | "Standard_DS3_v2" | The size of agent VMs. Note that not all sizes are available in all regions. |
| vm_priority | | str: "dedicated", "lowpriority" | "dedicated" | The VM priority. |
| min_nodes | | int: [0, inf[ | 0 | The minimum number of nodes to use on the cluster. |
| max_nodes | | int: [1, inf[ | 4 | The maximum number of nodes to use on the cluster. |
| idle_seconds_before_scaledown | | int: [0, inf[ | 120 | Node idle time in seconds before scaling down the cluster. |
| vnet_resource_group_name | | str | null | The name of the resource group where the virtual network is located. |
| vnet_name | | str | null | The name of the virtual network. |
| subnet_name | | str | null | The name of the subnet inside the VNet. |
| remote_login_port_public_access | | str: "Enabled", "Disabled", "NotSpecified" | "NotSpecified" | State of the public SSH port. "Disabled" indicates that the public SSH port is closed on all nodes of the cluster. "Enabled" indicates that the public SSH port is open on all nodes of the cluster. "NotSpecified" indicates that the public SSH port is closed on all nodes of the cluster if a VNet is defined; otherwise it is open on all public nodes. It can have this default value only at cluster creation time. After creation, it will be either enabled or disabled. |
| identity_type | | str: "SystemAssigned", "UserAssigned" | null | Specifies the type of identity that should be assigned to the AML Cluster. Supported values are SystemAssigned and UserAssigned. |
| identity_id | | list[ str ] | null | User assigned identities. |

Please visit this website for more details.
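To illustrate how the defaults in the AML Cluster table combine with values from the parameters file, here is a minimal sketch. The names `AML_CLUSTER_DEFAULTS` and `resolve_aml_settings` are hypothetical, the dictionary copies only the non-null defaults from the table, and the check that min_nodes does not exceed max_nodes is an assumption rather than a documented rule.

```python
# Non-null defaults copied from the AML Cluster table above.
AML_CLUSTER_DEFAULTS = {
    "vm_size": "Standard_DS3_v2",
    "vm_priority": "dedicated",
    "min_nodes": 0,
    "max_nodes": 4,
    "idle_seconds_before_scaledown": 120,
    "remote_login_port_public_access": "NotSpecified",
}


def resolve_aml_settings(parameters: dict) -> dict:
    """Overlay user-provided values on the table defaults and check ranges."""
    settings = {**AML_CLUSTER_DEFAULTS, **parameters}
    if settings["min_nodes"] < 0:
        raise ValueError("min_nodes must be 0 or greater")
    if settings["max_nodes"] < 1:
        raise ValueError("max_nodes must be 1 or greater")
    # Assumption: a minimum above the maximum is treated as a mistake.
    if settings["min_nodes"] > settings["max_nodes"]:
        raise ValueError("min_nodes must not exceed max_nodes")
    return settings
```

Calling `resolve_aml_settings({"max_nodes": 8})` keeps every other default and only raises the node ceiling.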

AKS Cluster
| Parameter | Required | Allowed Values | Default | Description |
| --- | --- | --- | --- | --- |
| agent_count | | int: [1, inf[ | 3 | The number of agents (VMs) to host containers. |
| vm_size | | str: "Standard_A1_v2", "Standard_D3_v2", etc. | "Standard_D3_v2" | The size of agent VMs. |
| location | | str: supported region | location of workspace | The location to provision the cluster in. |
| service_cidr | | str | null | A CIDR notation IP range from which to assign service cluster IPs. |
| dns_service_ip | | str | null | Containers DNS server IP address. |
| docker_bridge_cidr | | str | null | A CIDR notation IP for Docker bridge. |
| cluster_purpose | | str: "DevTest", "FastProd" | "FastProd" | Targeted usage of the cluster. This is used to provision Azure Machine Learning components to ensure the desired level of fault-tolerance and QoS. "FastProd" will provision components to handle higher levels of traffic with production quality fault-tolerance. This will default the AKS cluster to have 3 nodes. "DevTest" will provision components at a minimal level for testing. This will default the AKS cluster to have 1 node. |
| vnet_resource_group_name | | str | null | The name of the resource group where the virtual network is located. |
| vnet_name | | str | null | The name of the virtual network. |
| subnet_name | | str | null | The name of the subnet inside the VNet. |
| ssl_cname | | str | null | A CName to use if enabling SSL validation on the cluster. Must provide all three CName, cert file, and key file to enable SSL validation. |
| ssl_cert_pem_file | | str | null | A file path to a file containing cert information for SSL validation. Must provide all three CName, cert file, and key file to enable SSL validation. |
| ssl_key_pem_file | | str | null | A file path to a file containing key information for SSL validation. Must provide all three CName, cert file, and key file to enable SSL validation. |
| load_balancer_type | | str: "PublicIp", "InternalLoadBalancer" | "PublicIp" | Load balancer type of AKS cluster. |
| load_balancer_subnet | | str | equal to subnet_name | Load balancer subnet of AKS cluster. It can be used only when Internal Load Balancer is used as load balancer type. |

Please visit this website for more details.
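The table states that the three SSL parameters must be provided together. A local check for that all-or-nothing rule might look like the sketch below; the helper name `check_ssl_settings` is hypothetical, and the action or the underlying SDK may report this condition differently.

```python
SSL_KEYS = ("ssl_cname", "ssl_cert_pem_file", "ssl_key_pem_file")


def check_ssl_settings(parameters: dict) -> bool:
    """Return True if SSL is fully configured, False if it is not used.

    Raises ValueError when only some of the three values are present.
    """
    provided = [key for key in SSL_KEYS if parameters.get(key) is not None]
    if provided and len(provided) != len(SSL_KEYS):
        missing = [key for key in SSL_KEYS if key not in provided]
        raise ValueError("SSL settings incomplete, missing: " + ", ".join(missing))
    return bool(provided)
```

Running this against the loaded parameters before the workflow starts catches a partially filled SSL configuration early.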

Outputs

This action does not provide any outputs.

Environment variables

Certain parameters are considered secrets and should therefore be passed as environment variables from your secrets, if you want to use custom values.

| Environment variable | Required | Allowed Values | Default | Description |
| --- | --- | --- | --- | --- |
| ADMIN_USER_NAME | | str | null | The name of the administrator user account which can be used to SSH into nodes. This parameter is AML Cluster specific. |
| ADMIN_USER_PASSWORD | | str | null | The password of the administrator user account. This parameter is AML Cluster specific. |
| ADMIN_USER_SSH_KEY | | str | null | The SSH public key of the administrator user account. This parameter is AML Cluster specific. |
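As a rough illustration of how these optional variables could be read, the sketch below collects them from the process environment and leaves unset values as None, matching the null defaults in the table. The function name `read_admin_settings` is hypothetical and does not come from the action's source.

```python
import os
from typing import Mapping, Optional

ADMIN_VARIABLES = ("ADMIN_USER_NAME", "ADMIN_USER_PASSWORD", "ADMIN_USER_SSH_KEY")


def read_admin_settings(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the optional admin variables; unset ones map to None."""
    source = os.environ if environ is None else environ
    return {name: source.get(name) for name in ADMIN_VARIABLES}
```

Accepting a mapping argument makes the helper easy to exercise without changing the real environment.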

Other Azure Machine Learning Actions

  • aml-workspace - Connects to or creates a new workspace
  • aml-compute - Connects to or creates a new compute target in Azure Machine Learning
  • aml-run - Submits a ScriptRun, an Estimator or a Pipeline to Azure Machine Learning
  • aml-registermodel - Registers a model to Azure Machine Learning
  • aml-deploy - Deploys a model and creates an endpoint for the model

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

aml-compute's People

Contributors

ashishonce, awmatheson, lostmygithubaccount, marvinbuss, microsoftopensource, pulkitaggarwl, vivishno

aml-compute's Issues

Change default SKU

  • Change default SKU for AML Cluster
  • Change default SKU for AKS Cluster

Action is failing when deploying AKS in custom Vnet

Hi Team,

When we try to build an AKS cluster in a custom VNet, it fails. We pass the info below in compute.json. When we created the cluster manually with the same configuration, it worked.

Compute.json:
{
  "name": "testml12",
  "compute_type": "akscluster",
  "agent_count": 3,
  "vm_size": "Standard_D3_v2",
  "location": "northcentralus",
  "service_cidr": "192.168.19.0/24",
  "dns_service_ip": "192.168.19.10",
  "docker_bridge_cidr": "172.19.0.1/16",
  "cluster_purpose": "DevTest",
  "vnet_resource_group_name": "vnet-rg",
  "vnet_name": "ml-vnet-workerpool",
  "subnet_name": "ml-subnet-1",
  "load_balancer_type": "PublicIp"
}

Error:


"error": ***
    "message": "ComputeTargetNotFound: Compute Target with name testml12 not found in provided workspace"
***

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/code/main.py", line 140, in <module>
main()
File "/code/main.py", line 128, in main
compute_target = create_aks_cluster(
File "/code/utils.py", line 86, in create_aks_cluster
aks_config = AksCompute.provisioning_configuration(
File "/usr/local/lib/python3.8/site-packages/azureml/core/compute/aks.py", line 255, in provisioning_configuration
config = AksProvisioningConfiguration(agent_count, vm_size, ssl_cname, ssl_cert_pem_file, ssl_key_pem_file,
File "/usr/local/lib/python3.8/site-packages/azureml/core/compute/aks.py", line 779, in __init__
self.validate_configuration()
File "/usr/local/lib/python3.8/site-packages/azureml/core/compute/aks.py", line 809, in validate_configuration
raise ComputeTargetException('Invalid configuration, not all virtual net information provided. To use '
azureml.exceptions._azureml_exception.ComputeTargetException: ComputeTargetException:
Message: Invalid configuration, not all virtual net information provided. To use a custom virtual net with aks, please provide vnet name, vnet resource group, subnet name, service cidr, dns service ip and docker bridge cidr
InnerException None
ErrorResponse


"error": ***
    "message": "Invalid configuration, not all virtual net information provided. To use a custom virtual net with aks, please provide vnet name, vnet resource group, subnet name, service cidr, dns service ip and docker bridge cidr"
***

Add support for Sovereign clouds like AzureUSGovernment

When using updated aml-workspace action (target_cloud branch) I get the following when running the aml-compute action.

Error:
##[error]Microsoft REST Authentication Error: Get Token request returned http error: 400 and server response: "error":"invalid_request","error_description":"AADSTS900382: Confidential Client is not supported in Cross Cloud request.\r\nTrace ID: 602fff1b-a2ef-4b88-b157-c12c330d8300\r\nCorrelation ID: 9b38af2a-baeb-4b80-b727-d1bcd300a67b\r\nTimestamp: 2020-06-26 19:38:33Z","error_codes":[900382],"timestamp":"2020-06-26 19:38:33Z","trace_id":"602fff1b-a2ef-4b88-b157-c12c330d8300","correlation_id":"9b38af2a-baeb-4b80-b727-d1bcd300a67b"

Same issue as aml-workspace Azure/aml-workspace#18
Please update for Sovereign clouds.

See PR in aml-workspace for fix.
Azure/aml-workspace#20
