
ray-on-aml's Introduction

Ray on Azure ML

This package enables you to use Ray and Ray components such as Dask on Ray, ray[air], and ray[data] on top of Azure ML compute instances and compute clusters. With it, you can combine Ray's distributed computing capabilities with the Azure Machine Learning platform, for example running Ray's distributed ML inside an Azure ML pipeline on a managed compute cluster.

With support for both interactive and job modes, you can develop interactively in client/interactive mode and then operationalize your workload in job mode.

[Updates 12/14/2022]

Support AML SDK v2

  • If you have the AML SDK v2 for Python in your environment, Ray-On-AML will detect it and use the AML SDK v2 packages
  • This package remains compatible with AML SDK v1
  • If you have both v1 and v2, v2 is used by default

Better control of ray versions and ray packages by user

  • Users are no longer tied to the fixed Ray packages that ship with Ray-On-AML. You can specify the Ray components and versions to use in the getRay() method for interactive mode, or include the Ray version and packages in your job environment/dependencies for job mode.

Ability to mount inputs and outputs to ray cluster (with AML v2) for interactive use

  • No more downloading or moving large volumes of data from the Data Lake to the compute cluster for processing. Simply mount the data and read and write it in place.
  • Manage data using Data(Set) in AML and reference it by name to mount it as an input or output.
  • The path to the mounted folder can be used in the Ray client so Ray can access the data.

Support for user-defined Docker environments to further customize the Ray environment

  • If you need greater control over Ray's runtime environment, you can build the environment using Azure ML environments.

Setup & Quick Start Guide

Option 1: Run a Ray workload within an Azure ML job (non-interactive mode)

  1. Set up an Azure ML compute cluster
  2. Include ray-on-aml, azureml-defaults, azureml-mlflow and the Ray package(s) as job dependencies, for example in the conda file of your job's environment:
channels:
  - anaconda
  - conda-forge
dependencies:
  - python=3.8.5
  - pip:
      - azureml-mlflow
      - azureml-defaults
      - ray-on-aml
      - ray[data]==2.2.0  # add ray packages and versions
      # ...other packages

In your job script, the Ray cluster handle is available on the head node:

from ray_on_aml.core import Ray_On_AML

ray_on_aml = Ray_On_AML()
ray = ray_on_aml.getRay()  # returns a Ray handle on the head node, None on worker nodes

if __name__ == "__main__":
    if ray:  # in the head node
        ray.init(address="auto")
        print(ray.cluster_resources())
        # Your ray logic follows

    else:
        print("in worker node, do nothing")

See the example at job.

No VNet setup is needed for this mode.
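For reference, below is a minimal sketch of submitting such a script as an AML SDK v2 command job. The script name, environment name, and compute cluster name are placeholders/assumptions, not values from this repository:

from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

# Connect to the workspace (subscription, resource group, and workspace names are placeholders)
ml_client = MLClient(DefaultAzureCredential(), "<subscription_id>", "<resource_group>", "<workspace_name>")

# Run the Ray job on 2 nodes; ray-on-aml wires up the head and worker processes inside the job
job = command(
    code="./src",                       # folder containing your job script and conda file
    command="python job.py",            # hypothetical script name
    environment="ray-job-env@latest",   # assumed name of an environment built from the conda file above
    compute="{COMPUTE_CLUSTER_NAME}",
    instance_count=2,
    distribution={"type": "mpi", "process_count_per_instance": 1},
    experiment_name="ray_on_aml_job",
)
ml_client.jobs.create_or_update(job)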

If you would like to set up an interactive Ray cluster to work with from a Ray client or directly on the head node, follow the steps below.

Option 2: Use a Ray cluster interactively

You can set up a Ray cluster and use it to develop and test interactively, either from the head node or through a Ray client. For this, ray-on-aml uses an AML Compute Instance (CI) as the head node or Ray client machine, and an AML compute cluster as either a complete remote Ray cluster (when the CI is used only as a Ray client) or as the Ray worker nodes (when the CI is used as the head node).

Architecture for Interactive Mode

(Architecture diagram: RayOnAML_Interactive_Arch)

1. Set up resources

To set up this mode, you will need a compute instance and a compute cluster, and they need to be in the same VNet so they can communicate with each other. Review the following checklist for service provisioning:

[ ] Azure Machine Learning Workspace

[ ] Virtual network/Subnet

[ ] Network Security Group in/outbound

[ ] Create Compute Instance (CI) in the Virtual Network

[ ] Create Compute Cluster in the same Virtual Network

2. Select kernel

Use a Python 3.7+ conda environment from a Notebook in Azure Machine Learning Studio or a Jupyter Notebook on an Azure Machine Learning Compute Instance (CI).

3. Install library

Download and install ray-on-aml and the Ray packages in your notebook's conda environment.

For example, the following command installs ray 2.2.0, the Azure Machine Learning SDK v2 for Python, and other packages:

pip install --upgrade ray==2.2.0 ray[air]==2.2.0 ray[data]==2.2.0 azure-ai-ml ray-on-aml

4. Run Ray interactively

There are two modes for running Ray interactively.

4.1. Client mode

By default, the CI won't be part of the Ray cluster; it is used as a terminal to execute jobs on the Ray cluster running on the compute cluster.

from ray_on_aml.core import Ray_On_AML

ray_on_aml = Ray_On_AML(ml_client=ml_client, compute_cluster="{COMPUTE_CLUSTER_NAME}")

# May take 7 minutes or longer. Check the AML run under the ray_on_aml experiment for cluster status.
ray = ray_on_aml.getRay(num_node=2,pip_packages=["ray[air]==2.2.0","ray[data]==2.2.0","torch==1.13.0","fastparquet==2022.12.0", 
"azureml-mlflow==1.48.0", "pyarrow==6.0.1", "dask==2022.12.0", "adlfs==2022.11.2", "fsspec==2022.11.0"])

client = ray.init(f"ray://{ray_on_aml.headnode_private_ip}:10001")

If you run the sample above, make sure you have the same version of ray (2.2.0) installed on the CI. If you don't specify pip_packages, ray[default] with the same Ray version installed on your CI is used for the cluster. Behind the scenes, an Azure ML job is launched to create a remote Ray cluster that your client connects to. Once connected, check the resources with ray.cluster_resources() to see how much capacity your Ray cluster has.
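Once connected, the session behaves like any other Ray client session. A minimal generic sketch (not from the repo) run from the CI after the ray.init call above:

import ray

@ray.remote
def square(x):
    # Executes on the remote cluster provisioned by ray-on-aml
    return x * x

print(ray.cluster_resources())
print(ray.get([square.remote(i) for i in range(8)]))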

4.2. Run at head node

In this mode the CI is set up as the head node of the cluster, and a remote Azure ML job is launched to provide the worker nodes. To enable this, set ci_is_head=True.

from ray_on_aml.core import Ray_On_AML

ray_on_aml = Ray_On_AML(ml_client=ml_client, compute_cluster="{COMPUTE_CLUSTER_NAME}")

# May take 7 minutes or longer. Check the AML run under the ray_on_aml experiment for cluster status.
# MODE II: CI as the Ray cluster head node
ray = ray_on_aml.getRay(ci_is_head=True, num_node=2)

Note: To install additional libraries, use the pip_packages and conda_packages parameters. The Ray cluster will request 2 nodes from AML if num_node is not specified.

5. (AML SDK v2 only) Mount Data(Set) to ray cluster

If you are using AML SDK v2, you can mount Data(Set) assets to the compute cluster:

from azure.ai.ml import command, Input, Output
from ray_on_aml.core import Ray_On_AML

ray_on_aml = Ray_On_AML(ml_client=ml_client, compute_cluster="{COMPUTE_CLUSTER_NAME}")

inputs={
    "Input1": Input(
        type="uri_folder",
        path="azureml://datastores/{Data(Set)NAME}/paths/{FolderName}",
    )
}

outputs={
    "Output1": Output(
        type="uri_folder",
        path="azureml://datastores/{Data(Set)NAME}/paths/{FolderName}",
    ),
    "output2": Output(
        type="uri_folder",
        path="azureml://datastores/{Data(Set)NAME}/paths/{FolderName}",
    )
}

ray = ray_on_aml.getRay(inputs = inputs, outputs=outputs, num_node=2,
pip_packages=["ray[air]==2.2.0","ray[data]==2.2.0","torch==1.13.0","fastparquet==2022.12.0", 
"azureml-mlflow==1.48.0", "pyarrow==6.0.1", "dask==2022.2.0", "adlfs==2022.11.2", "fsspec==2022.11.0"])

client = ray.init(f"ray://{ray_on_aml.headnode_private_ip}:10001")
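After getRay() returns, the mounted inputs and outputs are visible to Ray workloads as ordinary folders. A minimal sketch, assuming input_path is set to the mount path of Input1 on the cluster nodes (the placeholder below is not a real path):

import ray

# Assumption: replace this placeholder with the actual mount path of Input1 on the cluster
input_path = "/path/to/mounted/Input1"

# Read the mounted folder with Ray Data and inspect it
ds = ray.data.read_parquet(input_path)
print(ds.count())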

6. Ray Dashboard

[Only when the CI is used as the head node, i.e. ci_is_head=True] The easiest way to view the Ray Dashboard is through the VS Code connection for Azure ML: open VS Code connected to your Compute Instance, open a terminal, type http://127.0.0.1:8265/, then Ctrl+click the URL to open the Ray Dashboard (the VS Code terminal trick).

This trick tells VS Code to forward the port to your local machine without having to set up SSH port forwarding, using the VS Code extension on the CI.

(Ray Dashboard screenshot)

When running Ray in client mode or in job mode on an Azure ML cluster, you will need to SSH into the head node and configure port forwarding to view the Ray Dashboard.

7. Shut down the Ray cluster

IMPORTANT: To stop the compute cluster, you must run the shutdown function. Note that this function does not stop the CI; it only shuts down the compute cluster.

To shut down the cluster, run the following:

ray_on_aml.shutdown()

8. Specify Ray version and add other Ray and python packages

For an interactive cluster: you can use the pip_packages and conda_packages arguments of the getRay() function of the Ray_On_AML object to configure Ray's runtime environment. You can also supply your own custom Azure ML environment using the environment argument of getRay(); it can be an Azure ML environment object or the name of an environment.

ray_on_aml = Ray_On_AML(ml_client=ml_client, compute_cluster="{COMPUTE_CLUSTER_NAME}")

ray = ray_on_aml.getRay(inputs = inputs, outputs=outputs, num_node=2,
pip_packages=["ray[air]==2.2.0","ray[data]==2.2.0","torch==1.13.0","fastparquet==2022.12.0", 
"azureml-mlflow==1.48.0", "pyarrow==6.0.1", "dask==2022.2.0", "adlfs==2022.11.2", "fsspec==2022.11.0"])

For a job cluster: simply add ray-on-aml and the Ray component(s), among your other dependencies, to the conda file of your Azure ML job or pipeline.

      - ray-on-aml==0.2.5
      - ray[air]==2.2.0

9. Quick start examples

Check out the quick start examples to learn more.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Security

Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, including Microsoft, Azure, DotNet, AspNet, and Xamarin.

If you believe you have found a security vulnerability in any Microsoft-owned repository that meets Microsoft's definition of a security vulnerability, please report it to us as described below.

Reporting Security Issues

Please do not report security vulnerabilities through public GitHub issues.

Instead, please report them to the Microsoft Security Response Center (MSRC) at https://msrc.microsoft.com/create-report.

If you prefer to submit without logging in, send email to [email protected]. If possible, encrypt your message with our PGP key; please download it from the Microsoft Security Response Center PGP Key page.

You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at microsoft.com/msrc.

Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:

  • Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
  • Full paths of source file(s) related to the manifestation of the issue
  • The location of the affected source code (tag/branch/commit or direct URL)
  • Any special configuration required to reproduce the issue
  • Step-by-step instructions to reproduce the issue
  • Proof-of-concept or exploit code (if possible)
  • Impact of the issue, including how an attacker might exploit the issue

This information will help us triage your report more quickly.

If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our Microsoft Bug Bounty Program page for more details about our active programs.

Data Collection

The software may collect information about you and your use of the software and send it to Microsoft. Microsoft may use this information to provide services and improve our products and services. You may turn off the telemetry as described in the repository. There are also some features in the software that may enable you and Microsoft to collect data from users of your applications. If you use these features, you must comply with applicable law, including providing appropriate notices to users of your applications together with a copy of Microsoft’s privacy statement. Our privacy statement is located at https://go.microsoft.com/fwlink/?LinkID=824704. You can learn more about data collection and use in the help documentation and our privacy statement. Your use of the software operates as your consent to these practices.

Information on managing Azure telemetry is available at https://azure.microsoft.com/en-us/privacy-data-management/.

Preferred Languages

We prefer all communications to be in English.

Policy

Microsoft follows the principle of Coordinated Vulnerability Disclosure.

ray-on-aml's People

Contributors

hyssh, james-tn, microsoft-github-operations[bot]


ray-on-aml's Issues

compute instance died

Hi
I created a cluster with 5 Standard_DS3_v2 nodes and a Standard_DS3_v2 compute instance and followed quick_use_case.ipynb. In the middle of running I lost the cluster, and then the compute instance was killed because of a memory issue.

(screenshot attached)

Workers all run on same node.

Hello, I've got a bug when I start a job with multiple nodes and multiple workers using ray_on_aml:

  • When I start a job with placement_strategy="SPREAD", print(ray.get_runtime_context().node_id) shows the same node for each worker; it should be a different node for each worker.
  • When I start a job with placement_strategy="STRICT_SPREAD", the job stays pending and does not start.
    Notably, ray.cluster_resources() shows that one node is alive and one node is not when I start the job.
    ws = Workspace.from_config()
    ray_on_aml = Ray_On_AML(ws=ws, compute_cluster="gpu-cluster", maxnode=2)
    ray = ray_on_aml.getRay(ci_is_head=True, num_node=2)

    trainer = TorchTrainer(
        train_func,
        train_loop_config={
            "num_epochs": 10,
            "feature_idx": 0,
            "feature_dim": 1433,
            "label_idx": 1,
            "label_dim": 1,
            "num_classes": 7,
        },
        run_config=RunConfig(),
        scaling_config=ScalingConfig(
            num_workers=2, placement_strategy="SPREAD"
        ),
    )
    result = trainer.fit()
    ray_on_aml.shutdown()

Can't use @ray.remote when running in job

I'm new to Ray, but basically, I don't appear to be able to do this:

ray_on_aml = Ray_On_AML()
ray = ray_on_aml.getRay()


@ray.remote
def remote_function():
    return 1


if ray:
    for _ in range(1000):
        ray.get(remote_function.remote())

The reason is that ray_on_aml.getRay() doesn't return anything when it is run on a worker, meaning that ray is NoneType, and @ray.remote therefore throws:

Traceback (most recent call last):
  File "preprocess.py", line 13, in <module>
    @ray.remote
AttributeError: 'NoneType' object has no attribute 'remote'

Hope this makes sense.
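A minimal sketch of one way to work around this (not from the issue thread): keep the @ray.remote definition behind the head-node check so worker processes never touch the None handle.

from ray_on_aml.core import Ray_On_AML

ray_on_aml = Ray_On_AML()
ray = ray_on_aml.getRay()  # Ray handle on the head node, None on worker nodes

if __name__ == "__main__":
    if ray:  # only the head node defines and runs the remote function
        @ray.remote
        def remote_function():
            return 1

        print(sum(ray.get([remote_function.remote() for _ in range(1000)])))
    else:
        print("in worker node, do nothing")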

Bump version to ray 1.13.0

The current code is not installing dependencies properly in the cluster.

When doing:

from azureml.core import Workspace, Experiment, Environment,ScriptRunConfig
# from azureml.widgets import RunDetails
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.environment import Environment
from ray_on_aml.core import Ray_On_AML
import time

ws = Workspace.from_config()
ray_on_aml = Ray_On_AML(ws=ws,
                        compute_cluster="ray-final-test",
                        maxnode=max_cluster_nodes)
ray_on_aml.getRay()

Runtime error appears:

RuntimeError: Version mismatch: The cluster was started with:
    Ray: 1.13.0
    Python: 3.8.5
This process on node 10.0.0.5 was started with:
    Ray: 1.12.0
    Python: 3.8.5

Job somehow seen as interactive mode

Hello,

I'm trying to run the following code as AML job:

from ray_on_aml.core import Ray_On_AML
from ray.rllib.algorithms.ppo import PPOConfig
from ray.tune.logger import pretty_print
from ray.tune.registry import register_env


from sim import SimpleAdder


def env_creator(env_config):
    return SimpleAdder(env_config)


register_env("simple_adder", env_creator)

if __name__ == "__main__":
    ray_on_aml = Ray_On_AML()
    ray = ray_on_aml.getRay()

    if ray:
        print("head node detected")
        ray.init(address="auto")
        print(ray.cluster_resources())
        algo = (
            PPOConfig()
            .rollouts(num_rollout_workers=1)
            .resources(num_gpus=0)
            .environment(env="simple_adder")
            .build()
        )
        for i in range(10):
            result = algo.train()
            print(pretty_print(result))

            if i % 5 == 0:
                checkpoint_dir = algo.save()
                print(f"Checkpoint saved in directory {checkpoint_dir}")

    else:
        print("in worker node")

But it fails with the following error, indicating that it somehow thinks it is in interactive mode:

/bin/bash: /azureml-envs/azureml_d9f69b31ab72fa234d4e7d56f0a1c374/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Traceback (most recent call last):
  File "main.py", line 17, in <module>
    ray_on_aml = Ray_On_AML()
  File "/azureml-envs/azureml_d9f69b31ab72fa234d4e7d56f0a1c374/lib/python3.8/site-packages/ray_on_aml/core.py", line 68, in __init__
    raise Exception("For interactive use, please pass ML client for azureml sdk v2 or workspace for azureml sdk v1 and compute cluster name to the init")
Exception: For interactive use, please pass ML client for azureml sdk v2 or workspace for azureml sdk v1 and compute cluster name to the init

The AML environment is created in the Azure UI using the "create from conda" option, and the conda env is defined by:

channels:
  - anaconda
  - conda-forge
dependencies:
  - python=3.8.5
  - pip=22.3.1
  - pip:
      - azureml-mlflow
      - azureml-defaults
      - ray-on-aml
      - 'ray[data]==2.3.0'
      - 'ray[rllib]==2.3.0'
      - gymnasium
      - numpy==1.24.2

Unable to initialize cluster

Hi @james-tn ,

Copying the issue from: james-tn/ray-on-aml#24 with some modifications.

Thank you for this library. We are trying to use this library using the example code (https://github.com/microsoft/ray-on-aml/blob/master/examples/quick_start_examples.ipynb) in an interactive environment in Azure ML.
The Jupyter notebook is a Python 3.8 Azure ML notebook. We are using the latest version of ray-on-aml 0.2.1

from azureml.core import Workspace, Run, Environment
from ray_on_aml.core import Ray_On_AML
ws = Workspace.from_config()
ray_on_aml =Ray_On_AML(ws=ws, compute_cluster ='ray-test', additional_pip_packages=['lightgbm_ray', 'sklearn'], maxnode=4)
ray = ray_on_aml.getRay()

The image builds correctly on Azure ML. However, the cluster doesn't turn on. Below is what we see in the notebook:

Cancel active AML runs if any
Shutting down ray if any
Found existing cluster ray-test
Creating new Environment ray-0.2.1-5974090952704054762
Waiting for cluster to start
......................
{'memory': 3001307136.0,
 'CPU': 2.0,
 'object_store_memory': 1500653568.0,
 'node:10.54.42.20': 1.0}

And the following error inside the ray_on_aml experiment:

jars files are not copied, probably due to packages such as raydp is not installed


[2022-06-10T04:14:04.526543] The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 900.0 seconds
2 items cleaning up...
Cleanup took 0.11278533935546875 seconds
Traceback (most recent call last):
  File "source_file.py", line 100, in <module>
    startRay(master_ip)
  File "source_file.py", line 43, in startRay
    ip = socket.gethostbyname(socket.gethostname())
socket.gaierror: [Errno -3] Temporary failure in name resolution

This error comes with both True and False for ci_is_head. All machines are inside the same VNET.

We are also facing an error while running as a job. Scripts below:

ray_test.py

import logging
from ray_on_aml.core import Ray_On_AML
from adlfs import AzureBlobFileSystem
import ray

logging.info(f'Initializing Ray')
ray_on_aml = Ray_On_AML()
logging.info(f'Getting head node')
ray = ray_on_aml.getRay()
logging.info(f'Retrieved head node')

if __name__ == "__main__":
    logging.info(f'Initializing file system')
    abfs = AzureBlobFileSystem(account_name="azureopendatastorage", container_name="isdweatherdatacontainer")

    if ray:  # in the headnode
        logging.info(f'Read parquet data')
        data = ray.data.read_parquet(["az://isdweatherdatacontainer/ISDWeather/year=2015/"], filesystem=abfs)
        logging.info(f'Read parquet data finished')
        pass
        # logic to use Ray for distributed ML training, tunning or distributed data transformation with Dask

    else:
        print("in worker node")

ray_trigger.py

from azureml.core import ScriptRunConfig, Experiment, Environment
from azureml.core.runconfig import DockerConfiguration, RunConfiguration
import azure_init, submit_wait_for_completion

ENV_NAME = 'Ray_Test'
workspace, datastore, compute_cluster = azure_init(cluster_name="ray-test")
docker_config = DockerConfiguration(use_docker=True)
env = Environment.from_conda_specification(name=ENV_NAME, file_path="ray_conda_env.yml")
env.docker.base_image = "mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu18.04:20220329.v1"
aml_run_config_ml = RunConfiguration(communicator='OpenMpi')

aml_run_config_ml.node_count = 4
aml_run_config_ml.target = compute_cluster
aml_run_config_ml.environment = env
aml_run_config_ml.docker = docker_config

src = ScriptRunConfig(source_directory='.', script='ray_test.py', run_config=aml_run_config_ml)

experiment_name = 'Ray_Test'
experiment = Experiment(workspace=workspace, name=experiment_name)
run, details = submit_wait_for_completion(src, experiment, {}, show_output=True,
                                          wait_post_processing=False)

ray_conda_env.yml

channels:
  - anaconda
  - conda-forge
dependencies:
  - python=3.8.1
  - pip:
      - azureml-mlflow==1.41.0
      - ray-on-aml==0.2.1
      - protobuf==3.20.1
      - azureml-defaults==1.41.0
  - matplotlib
  - pip < 20.3
name: azureml_cfc9e96c7b0b43301a0ba4c6bd3548e5

We get the following error:

This is an MPI job. Rank:0
Script type = None
[2022-06-10T08:33:48.187712] Entering Run History Context Manager.
[2022-06-10T08:33:48.205558] Writing error with error_code ServiceError and error_hierarchy ServiceError/ImportError to hosttool error file located at /mnt/batch/tasks/workitems/7cf850e9-7ee8-40ed-847c-c5042bef5d51/job-1/ray_test_1654848910__11088471-ba23-4b28-a130-ee11d896e099/wd/runTaskLetTask_error.json
Starting the daemon thread to refresh tokens in background for process with pid = 156
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/jobs/ml-poc-workspace/azureml/ray_test_1654848910_399771ac/wd/azureml/Ray_Test_1654848910_399771ac/azureml-setup/context_manager_injector.py", line 452, in <module>
    execute_with_context(cm_objects, options.invocation)
  File "/mnt/batch/tasks/shared/LS_root/jobs/ml-poc-workspace/azureml/ray_test_1654848910_399771ac/wd/azureml/Ray_Test_1654848910_399771ac/azureml-setup/context_manager_injector.py", line 132, in execute_with_context
    stack.enter_context(wrapper)
  File "/mnt/batch/tasks/shared/LS_root/jobs/ml-poc-workspace/azureml/ray_test_1654848910_399771ac/wd/azureml/Ray_Test_1654848910_399771ac/azureml-setup/_vendor_contextlib2.py", line 356, in enter_context
    result = _cm_type.__enter__(cm)
  File "/mnt/batch/tasks/shared/LS_root/jobs/ml-poc-workspace/azureml/ray_test_1654848910_399771ac/wd/azureml/Ray_Test_1654848910_399771ac/azureml-setup/context_manager_injector.py", line 80, in __enter__
    self.context_manager.__enter__()
  File "/mnt/batch/tasks/shared/LS_root/jobs/ml-poc-workspace/azureml/ray_test_1654848910_399771ac/wd/azureml/Ray_Test_1654848910_399771ac/azureml-setup/context_managers.py", line 384, in __enter__
    self.history_context = get_history_context_manager(**self.history_config)
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azureml/history/_tracking.py", line 167, in get_history_context_manager
    py_wd_cm = get_py_wd()
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azureml/history/_tracking.py", line 304, in get_py_wd
    return PythonWorkingDirectory.get()
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azureml/history/_tracking.py", line 274, in get
    from azureml._history.utils.filesystem import PythonFS
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azureml/_history/utils/filesystem.py", line 8, in <module>
    from azureml._restclient.constants import RUN_ORIGIN
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azureml/_restclient/__init__.py", line 7, in <module>
    from .rest_client import RestClient
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azureml/_restclient/rest_client.py", line 12, in <module>
    from msrest.service_client import ServiceClient
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/msrest/__init__.py", line 28, in <module>
    from .configuration import Configuration
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/msrest/configuration.py", line 38, in <module>
    from .universal_http.requests import (
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/msrest/universal_http/__init__.py", line 53, in <module>
    from ..exceptions import ClientRequestError, raise_with_traceback
  File "/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/msrest/exceptions.py", line 31, in <module>
    from azure.core.exceptions import SerializationError, DeserializationError
ImportError: cannot import name 'SerializationError' from 'azure.core.exceptions' (/azureml-envs/azureml_99373199ba07e6c57a5dae087d393b12/lib/python3.8/site-packages/azure/core/exceptions.py)

Let me know in case anything wrong with our setup or if this is an issue with the library.

Thanks a lot!

Publish 0.22 to PyPI

Hi,

The last version that can be installed from PyPI is 0.21; that version does not include the ray start arguments, which are really needed when running Ray interactively (you have to tune the system config so you don't saturate the disk space of a compute instance, which is limited to 120 GB).

Could you please publish it there?

BR
E

Ray_on_AML doesn't support running from a python 3.10 environment or newer

If you run Ray_on_AML with python 3.10 or newer, all jobs in AML will fail because the environment creation fails. This is probably due to the create environment function inheriting the python version from the user (and not allowing it to be changed through kwargs) and conda not supporting python 3.10 without updating some libraries beforehand.

Expected behaviour:

  • Either allow specifying the Python version of the environment, in case the version we want running in the cloud differs from the user's.
  • Or do not rely on a conda environment and instead use a Python Docker image with a requirements file.

Ray cluster dashboard: Not able see to ray cluster jobs in dashboard

Hi
We could launch the Ray cluster dashboard, but we cannot see jobs running on the Ray workers in the dashboard.
I can see job progress for the Ray client in the dashboard. See the attached screenshot, in which the yellow job is progressing on the Ray client but the cluster workers' jobs are not moving, which suggests (though I'm not sure) that Ray jobs are not running on the cluster's workers.

(ray-dashboard screenshot)

Even in the article below, job progress can be seen only on the Ray client machine, not on the Ray workers.
https://github.com/microsoft/ray-on-aml

We are using the @ray.remote decorator to run the Ray job.

Please clarify my doubt or give me a sample in which I can see job progress on the Ray cluster's workers.

Thanks

Ray workers exit immediately

When running the following script as an AML job, the worker node exits immediately, stating that the run has completed successfully.

import logging
import time

import ray
from ray_on_aml.core import Ray_On_AML


logging.basicConfig(level=logging.INFO)

ray_on_aml = Ray_On_AML()
master = ray_on_aml.getRay()


@ray.remote
def slow_function():
    time.sleep(60)
    return 1


if master:
    for _ in range(1000):
        ray.get(slow_function.remote())

Logs:

NFO:root:workder node detected
INFO:root:- env: MASTER_ADDR: 10.0.0.7
INFO:root:- my rank is 1
INFO:root:- my ip is 10.0.0.5
INFO:root:- master is 10.0.0.7
INFO:root:free disk space on /tmp
Filesystem     1024-blocks    Used Available Capacity Mounted on
overlay           57534560 2821292  51760976       6% /
INFO:root:ray start --address=10.0.0.7:6379
INFO:root:Start ray successfully
2022-05-16 23:10:54,619	INFO scripts.py:852 -- Local node IP: 10.0.0.5
2022-05-16 23:10:56,742	WARNING services.py:1994 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 2147483648 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=8.89gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
[2022-05-16 23:10:56,978 I 126 126] global_state_accessor.cc:357: This node has an IP address of 10.0.0.5, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.
2022-05-16 23:10:56,980	SUCC scripts.py:864 -- --------------------
2022-05-16 23:10:56,980	SUCC scripts.py:865 -- Ray runtime started.
2022-05-16 23:10:56,980	SUCC scripts.py:866 -- --------------------
2022-05-16 23:10:56,981	INFO scripts.py:868 -- To terminate the Ray runtime, run
2022-05-16 23:10:56,981	INFO scripts.py:869 --   ray stop


[2022-05-16T23:10:57.967413] The experiment completed successfully. Finalizing run...
INFO:__main__:Exiting context: TrackUserError
INFO:__main__:Exiting context: RunHistory
Cleaning up all outstanding Run operations, waiting 900.0 seconds
1 items cleaning up...
Cleanup took 1.086500883102417 seconds
INFO:__main__:Exiting context: ProjectPythonPath
[2022-05-16T23:11:00.227254] Finished context manager injector.

ERROR: Communicators are not supported for local runs.

I'm getting the following error when using sample code from https://github.com/james-tn/ray-on-aml/blob/master/examples/quick_use_cases.ipynb

Here is the code from the cell causing the error:

ray_cluster  = Ray_On_AML(ws=ws, compute_cluster ="head-gpu", maxnode=4, 
additional_pip_packages=['torch==1.10.0', 'torchvision', 'sklearn', 'pyspark','gym==0.2.1','dm-tree','scikit-image','opencv-python','tensorflow']) 
aml_run_config_ml = RunConfiguration(communicator='OpenMpi')
rayEnv = Environment.from_conda_specification(name = "RLEnv",
                                             file_path = "job/ray_job_env.yml")
aml_run_config_ml.target = ray_cluster
aml_run_config_ml.environment = rayEnv
aml_run_config_ml.node_count = 1
src = ScriptRunConfig(source_directory='../super_cabs_project/job',
                    script='rl_job.py',
                    run_config = aml_run_config_ml,
                   )

run = Experiment(ws, "supercabs-v0-exp1").submit(src)
RunDetails(run).show()

Head GPU is a cluster of size STANDARD_NC6

Error trace

---------------------------------------------------------------------------
ExperimentExecutionException              Traceback (most recent call last)
<ipython-input-31-0318424ca459> in <module>
     12                    )
     13 
---> 14 run = Experiment(ws, "supercabs-v0-exp1").submit(src)
     15 RunDetails(run).show()

/anaconda/envs/azureml_py38/lib/python3.8/site-packages/azureml/core/experiment.py in submit(self, config, tags, **kwargs)
    218         submit_func = get_experiment_submit(config)
    219         with self._log_context("submit config {}".format(config.__class__.__name__)):
--> 220             run = submit_func(config, self.workspace, self.name, **kwargs)
    221         if tags is not None:
    222             run.set_tags(tags)

/anaconda/envs/azureml_py38/lib/python3.8/site-packages/azureml/core/script_run_config.py in submit(script_run_config, workspace, experiment_name, run_id, _parent_run_id, credential_passthrough)
     61     collect_datasets_usage(module_logger, _SCRIPT_RUN_SUBMIT_ACTIVITY, inputs,
     62                            workspace, run_config.target)
---> 63     run = _commands.start_run(project, run_config,
     64                               telemetry_values=script_run_config._telemetry_values,
     65                               run_id=run_id, parent_run_id=_parent_run_id)

/anaconda/envs/azureml_py38/lib/python3.8/site-packages/azureml/_execution/_commands.py in start_run(project_object, run_config_object, run_id, injected_files, telemetry_values, parent_run_id, prepare_only, check)
    115         if prepare_only and check:
    116             raise ExperimentExecutionException("Can not check preparation of local targets")
--> 117         return _start_internal_local_cloud(project_object, run_config_object,
    118                                            **shared_start_run_kwargs)
    119     else:

/anaconda/envs/azureml_py38/lib/python3.8/site-packages/azureml/_execution/_commands.py in _start_internal_local_cloud(project_object, run_config_object, prepare_only, custom_target_dict, run_id, injected_files, telemetry_values, parent_run_id)
    266 
    267             response = ClientBase._execute_func(requests.post, uri, files=files, headers=headers)
--> 268             _raise_request_error(response, "starting run")
    269 
    270             invocation_zip_path = os.path.join(project_temp_dir, "invocation.zip")

/anaconda/envs/azureml_py38/lib/python3.8/site-packages/azureml/_execution/_commands.py in _raise_request_error(response, action)
    568         # response.text is a JSON from execution service.
    569         response_message = get_http_exception_response_string(response)
--> 570         raise ExperimentExecutionException(response_message)
    571 
    572 

ExperimentExecutionException: ExperimentExecutionException:
	Message: {
    "error_details": {
        "componentName": "execution",
        "correlation": {
            "operation": "d3d66b641ba30e6e8d67a3c5877cc31c",
            "request": "09f00e75e0315d0f"
        },
        "environment": "eastus",
        "error": {
            "code": "UserError",
            "innerError": {
                "code": "BadArgument",
                "innerError": {
                    "code": "CommunicatorNotSupportedForLocalRuns"
                }
            },
            "message": "Communicators are not supported for local runs.",
            "messageFormat": "Communicators are not supported for local runs."
        },
        "location": "eastus",
        "time": "2022-04-18T12:11:12.6365215+00:00"
    },
    "status_code": 400,

Unable to initialize cluster

I am not able to initialize my cluster for ray using ray-on-aml version 0.2.4. I'm running a notebook in the Python 3.8 AzureML environment. Using the following piece of code:

from ray_on_aml.core import Ray_On_AML

ray_on_aml =Ray_On_AML(ws=ws, compute_cluster ="CC-RayWorker-CPU-DS12-v2")

# May take 7 mintues or longer. Check the AML run under ray_on_aml experiment for cluster status.  
ray = ray_on_aml.getRay(ci_is_head=True, num_node=2,pip_packages=["ray[air]==2.2.0","ray[data]==2.2.0","torch==1.13.0","fastparquet==2022.12.0", "azureml-mlflow==1.48.0", "pyarrow==6.0.1", "dask==2022.12.0", "adlfs==2022.11.2", "fsspec==2022.11.0"])

While the compute instance initializes successfully, the ray_on_aml job fails in the cluster with the following error:

Cleaning up all outstanding Run operations, waiting 300.0 seconds
1 items cleaning up...
Cleanup took 0.2714250087738037 seconds
Traceback (most recent call last):
  File "source_file.py", line 175, in <module>
    startRayMaster()
  File "source_file.py", line 103, in startRayMaster
    ip = socket.gethostbyname(socket.gethostname())
socket.gaierror: [Errno -2] Name or service not known

Retrying due to transient client side error HTTPSConnectionPool(host='westus-0.in.applicationinsights.azure.com', port=443): Max retries exceeded with url: /v2.1/track (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f1ee8697220>: Failed to establish a new connection: [Errno -2] Name or service not known')).
2023-02-16 13:21:17,476	INFO usage_lib.py:516 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2023-02-16 13:21:17,476	INFO scripts.py:702 -- Local node IP: 10.62.79.24
2023-02-16 13:21:19,380	SUCC scripts.py:739 -- --------------------
2023-02-16 13:21:19,380	SUCC scripts.py:740 -- Ray runtime started.
2023-02-16 13:21:19,380	SUCC scripts.py:741 -- --------------------
2023-02-16 13:21:19,380	INFO scripts.py:743 -- Next steps
2023-02-16 13:21:19,381	INFO scripts.py:744 -- To connect to this Ray runtime from another node, run
2023-02-16 13:21:19,381	INFO scripts.py:747 --   ray start --address='10.62.79.24:6379'
2023-02-16 13:21:19,381	INFO scripts.py:763 -- Alternatively, use the following Python code:
2023-02-16 13:21:19,381	INFO scripts.py:765 -- import ray
2023-02-16 13:21:19,381	INFO scripts.py:769 -- ray.init(address='auto')
2023-02-16 13:21:19,381	INFO scripts.py:781 -- To connect to this Ray runtime from outside of the cluster, for example to
2023-02-16 13:21:19,381	INFO scripts.py:785 -- connect to a remote cluster from your laptop directly, use the following
2023-02-16 13:21:19,381	INFO scripts.py:789 -- Python code:
2023-02-16 13:21:19,381	INFO scripts.py:791 -- import ray
2023-02-16 13:21:19,381	INFO scripts.py:792 -- ray.init(address='ray://<head_node_ip_address>:10001')
2023-02-16 13:21:19,381	INFO scripts.py:801 -- To see the status of the cluster, use
2023-02-16 13:21:19,381	INFO scripts.py:802 --   ray status
2023-02-16 13:21:19,381	INFO scripts.py:812 -- If connection fails, check your firewall settings and network configuration.
2023-02-16 13:21:19,381	INFO scripts.py:820 -- To terminate the Ray runtime, run
2023-02-16 13:21:19,381	INFO scripts.py:821 --   ray stop

I have this entire setup within a VNet and all the compute resources have been created in the same subnet. Due to certain policies, I am forced to enable 'No Public IP'(npip) on my computes.

Could this be an issue due to my setup - npip or NSG? Or is it something to do with the library? Please help mitigate this.

Thank you

Ray and python version mismatch

Using the following code,

from ray_on_aml.core import Ray_On_AML

ray_on_aml = Ray_On_AML(ml_client=ml_client, compute_cluster='rl-agents-cluster')

ray = ray_on_aml.getRay(ci_is_head=True,
    num_node=2,
    pip_packages=[
        "tensorflow==2.11"
    ]
)

I am getting this error:

RuntimeError: Version mismatch: The cluster was started with:
    Ray: 2.2.0
    Python: 3.8.10
This process on node 10.2.1.4 was started with:
    Ray: 2.3.0
    Python: 3.8.5

I checked the source code and the python version is assigned from platform.python_version(). Why is it possible that the python version does not match?

As for the Ray version, the cluster's Ray version should be the same as the head node's if it is not specified, shouldn't it?

Any thoughts on why this is happening?

Thanks in advance.
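One way to avoid the mismatch, following the interactive-mode guidance above, is to pin the Ray version in pip_packages so the cluster matches what is installed on the CI. A minimal sketch (the version shown is only an example; use whatever Ray version your CI has):

from ray_on_aml.core import Ray_On_AML

ray_on_aml = Ray_On_AML(ml_client=ml_client, compute_cluster="rl-agents-cluster")

# Pin ray to the same version installed on the CI (2.3.0 here is only an example)
ray = ray_on_aml.getRay(
    ci_is_head=True,
    num_node=2,
    pip_packages=["ray[default]==2.3.0", "tensorflow==2.11"],
)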

How to save a checkpoint after training a model for deployment ?


I am able to run the scripts on a Ray cluster with multiple worker nodes to train an RLlib model. However, I want to deploy the model so that it can be queried for the optimal action, given a state as input.

I tried using the save_checkpoint method; however, I don't see any model being saved anywhere. I gave the file path as the same folder as the script. I also don't see any errors when running the script.
