
aml's Introduction


Pachyderm – Automate data transformations with data versioning and lineage

Pachyderm is cost-effective at scale, enabling data engineering teams to automate complex pipelines with sophisticated data transformations across any type of data. Our unique approach provides parallelized processing of multi-stage, language-agnostic pipelines with data versioning and data lineage tracking. Pachyderm delivers the ultimate CI/CD engine for data.

Features

  • Data-driven pipelines automatically trigger based on detecting data changes.
  • Immutable data lineage with data versioning of any data type.
  • Autoscaling and parallel processing built on Kubernetes for resource orchestration.
  • Uses standard object stores for data storage with automatic deduplication.
  • Runs across all major cloud providers and on-premises installations.

Getting Started

To start deploying your end-to-end version-controlled data pipelines, run Pachyderm locally, or deploy on AWS/GCE/Azure in about 5 minutes.

You can also refer to our complete documentation to see tutorials, check out example projects, and learn about advanced features of Pachyderm.

If you'd like to see some examples and learn about core use cases for Pachyderm, start with the example projects and tutorials in the documentation below.

Documentation

Official Documentation

Community

Keep up to date and get Pachyderm support via:

  • Twitter: follow us on Twitter.
  • Slack: join our community Slack channel to get help from the Pachyderm team and other users.

Contributing

To get started, sign the Contributor License Agreement.

You should also check out our contributing guide.

Send us PRs, we would love to see what you do! You can also check our GitHub issues for things labeled "help-wanted" as a good place to start. We're sometimes bad about keeping that label up to date, so if you don't see any, just let us know.

Usage Metrics

Pachyderm automatically reports anonymized usage metrics. These metrics help us understand how people are using Pachyderm and make it better. They can be disabled by setting the METRICS environment variable to false in the pachd container.


aml's Issues

Bug when running edges tutorial

Run through: https://github.com/pachyderm/aml#advanced-using-pachctl-locally

Then do:

pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/1.13.x/examples/opencv/edges.json

Run like:

$ cat foo.py
# azureml-core of version 1.0.72 or higher is required
from azureml.core import Workspace, Dataset

subscription_id = '04701c5f-d635-4103-a3a9-0d74aa3ddc51'
resource_group = 'luke-testing'
workspace_name = 'extant'

workspace = Workspace(subscription_id, resource_group, workspace_name)

dataset = Dataset.get_by_name(workspace, name='Pachyderm repo edges')
dataset.download(target_path='.', overwrite=False)

Add pdb to /anaconda/envs/azureml_py38/lib/python3.8/site-packages/azureml/dataprep/api/_rslex_executor.py:

 35                 import pdb; pdb.set_trace()
 36  ->             (batches, num_partitions, stream_columns) = ex.execute(
Exception: Failed with execution error: error in streaming from input data sources
        ExecutionError(StreamError(Unknown("Could not deserialize result \n custom: missing field `Name`", None)))
=> error in streaming from input data sources
        StreamError(Unknown("Could not deserialize result \n custom: missing field `Name`", None))
=> unexpected error
        Unknown("Could not deserialize result \n custom: missing field `Name`", None)

ADLS Gen 2 spout

It's increasingly obvious that ADLS gen 2 is a must-have for Microsoft sales folks. Need a spout that copies ADLS gen 2 data into pachyderm as it comes in.

How we can do ADLS gen 2 spouts into pachyderm-AML: get the terraform to:

  1. set up an ADLS gen 2 storage account (i.e. "enable hierarchical namespace when creating a storage account")
  2. create a service bus instance and a topic within it
  3. configure an ADLS gen 2 event to push the appropriate events to the service bus
  4. have terraform output appropriate credentials for both of the above that we can pass into pachyderm (have terraform write them into a k8s secret? or can we get the pachyderm spout to inherit an Azure service account somehow?)
  5. create a pachyderm spout which reads from the queue and uses the notifications to go and download data from ADLS gen 2 into pachyderm (with caveats below)
  6. tada! demo dropping some json into ADLS gen 2 and the spout runs and a pachyderm commit appears magically as an immutable AML dataset version that the data scientist can read - this is cool because ADLS gen 2 doesn't itself support versioning

Making it production-grade will be hard. Considerations (a rough spout sketch follows this list):

  • what happens when there's data in ADLS before the spout starts listening
  • what happens when the spout gets disconnected
  • what happens when there's terabytes of data and billions of files
  • what happens if there are conflicting changes on both sides
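
A very rough sketch of the receive loop for the spout, assuming blob-created events are forwarded from Event Grid to a Service Bus subscription and that pachctl is available inside the spout container; the topic, subscription, environment variables, and event payload shape below are all assumptions, not a tested design, and a real spout would write into Pachyderm via whatever spout mechanism the Pachyderm version in use supports:

# Hypothetical spout loop: consume blob-created events from Service Bus and
# copy the referenced ADLS Gen2 objects into a Pachyderm repo via pachctl.
import json
import os
import subprocess

from azure.servicebus import ServiceBusClient                   # pip install azure-servicebus
from azure.storage.filedatalake import DataLakeServiceClient    # pip install azure-storage-file-datalake

CONN_STR = os.environ["SERVICE_BUS_CONNECTION_STRING"]          # assumed Terraform output
ADLS_URL = os.environ["ADLS_ACCOUNT_URL"]                       # e.g. https://<account>.dfs.core.windows.net
ADLS_KEY = os.environ["ADLS_ACCOUNT_KEY"]
PACH_REPO = os.environ.get("PACH_REPO", "adls-ingest")          # assumed target repo

lake = DataLakeServiceClient(account_url=ADLS_URL, credential=ADLS_KEY)

with ServiceBusClient.from_connection_string(CONN_STR) as bus:
    with bus.get_subscription_receiver(topic_name="adls-events", subscription_name="pachyderm-spout") as receiver:
        while True:
            for msg in receiver.receive_messages(max_message_count=10, max_wait_time=30):
                event = json.loads(str(msg))
                # Assumed Event Grid BlobCreated payload: subject looks like
                # /blobServices/default/containers/<filesystem>/blobs/<path>
                _, _, _, _, filesystem, _, path = event["subject"].split("/", 6)
                data = lake.get_file_client(filesystem, path).download_file().readall()
                # Mirror the object into Pachyderm at the same path; pachctl reads stdin via `-f -`.
                subprocess.run(
                    ["pachctl", "put", "file", f"{PACH_REPO}@master:/{path}", "-f", "-"],
                    input=data, check=True,
                )
                receiver.complete_message(msg)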

[rslex] Improve how we pass `uri` into `request_builder`

In rslex-pachyderm/src/pachyderm_stream_handler/request_builder.rs

impl RequestBuilder {
    /// Build a RequestBuilder from uri and SyncRecord.
    /// Return InvalidInput if uri is ill-formatted.
    pub fn new(uri: &str, credential: String) -> StreamResult<RequestBuilder> {
        // Sometimes we get instantiated with uri like
        // commit.branch.repo/foo, sometimes like
        // http://localhost:30600/commit.branch.repo/foo. Work around the latter
        // case by special-casing it for now.
        let the_uri = uri.strip_prefix("http://localhost:30600/").unwrap_or(uri);
...

Strange folder structure

When using dataset.download(), the files get downloaded into a bonkers directory structure like http%blah.

(optional) Figure out how to pass custom data into the VM

Decided not to do.

https://docs.microsoft.com/en-us/azure/virtual-machines/custom-data

We'll need to:

  1. Build a VM image with Provisioning.DecodeCustomData enabled and which starts the syncer in a systemd unit
  2. Have the syncer read the data from /var/lib/waagent/CustomData, write it to a persistent location, and read it back on startup (a rough sketch follows at the end of this issue)

In particular, the custom data for the syncer will include:

# azureml instance
export AZURE_SUBSCRIPTION_ID="${data.azurerm_client_config.current.subscription_id}"
export AZURE_RESOURCE_GROUP="${azurerm_resource_group.main.name}"
export AZURE_ML_WORKSPACE_NAME="${azurerm_machine_learning_workspace.example.name}"

# storage for pachyderm
export AZURE_STORAGE_CONTAINER="${azurerm_storage_container.pachyderm.name}"
export AZURE_STORAGE_ACCOUNT_NAME="${azurerm_storage_account.pachyderm.name}"
export AZURE_STORAGE_ACCOUNT_KEY="${azurerm_storage_account.pachyderm.primary_access_key}"

Or alternatively, we can use the existing rsync env.sh file approach but just with our marketplace VM.
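
A minimal sketch of step 2, assuming Provisioning.DecodeCustomData leaves the decoded payload at /var/lib/waagent/CustomData; the persistent path and the idea of parsing it as an env file are assumptions:

# Hypothetical syncer startup step: persist the VM custom data once, then load it on every start.
import os
import pathlib
import shlex

CUSTOM_DATA = pathlib.Path("/var/lib/waagent/CustomData")
PERSISTED = pathlib.Path("/var/lib/pachyderm-aml/env.sh")   # assumed persistent location

def load_syncer_env():
    # Copy the custom data to a persistent location the first time we see it.
    if CUSTOM_DATA.exists() and not PERSISTED.exists():
        PERSISTED.parent.mkdir(parents=True, exist_ok=True)
        PERSISTED.write_text(CUSTOM_DATA.read_text())
    # Parse the `export KEY="value"` lines into the process environment.
    for line in PERSISTED.read_text().splitlines():
        line = line.strip()
        if line.startswith("export "):
            key, _, value = line[len("export "):].partition("=")
            os.environ[key] = shlex.split(value)[0] if value.strip() else ""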

Programmatically define multiple AML workspaces.

Problem: how do we enable users to have multiple Azure ML workspaces, but talk to a single Pachyderm datastore?

We want to create a new syncer for each AML workspace.

Solution:

There are two scenarios for an existing Pachyderm cluster:

  1. The Pachyderm cluster was created outside of the current Terraform context, and the user can provide their existing workspaces.
  • Add a variable to indicate an existing Pachyderm cluster.
  • Use this variable to avoid creating a new Pachyderm cluster.
  • The user would have to provide their own kubeconfig and copy it to the new syncer.

export TF_VAR_existing_resource_group_name="resources-8cfbb924"
export TF_VAR_existing_pachyderm_cluster_name="pachyderm-8cfbb924"
bash scripts/setup.sh

  2. The Pachyderm cluster was created inside the current Terraform context.
  • In this case, we only need to create a new syncer per workspace.
  • We should configure Terraform to be able to create N syncers given N workspaces.

Consider applying the diff from `rslex-azure-storage/blob` in the rebase of `rslex-pachyderm`

In the rebase from hell (TM), I ended up not applying the changes that had been made to the code that the pachyderm handler was based on. It might be fine that the pachyderm handler remains based on an old version of the code, but let's look at this and think about it a little more.

This is true for both the rslex-azure-storage/blob code and the rslex-http code. Ask Luke for more details...

Hide kubectl port-forward outputs

Currently when a user executes rslex-pachyderm code via Python, they see kubectl outputs like:

Forwarding from 127.0.0.1:30600 -> 600
Forwarding from [::1]:30600 -> 600
Handling connection for 30600

These are printed to stdout by kubectl. We should redirect them to /dev/null.
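
If the port-forward is launched from Python via subprocess (an assumption about how rslex-pachyderm shells out to kubectl), a minimal sketch of the fix is:

import subprocess

# Discard kubectl's "Forwarding from ..." / "Handling connection ..." chatter so it
# doesn't pollute the user's stdout. The service name and ports here are assumptions
# matching the output above.
port_forward = subprocess.Popen(
    ["kubectl", "port-forward", "svc/pachd", "30600:600"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)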

Write internal documentation for how to set up a complete development environment for the end-to-end stack

[Syncer] Fix `UnboundLocalError: local variable 'ds_new' referenced before assignment`

Line https://github.com/pachyderm/aml/blob/main/syncer/sync.py#L107 can fail when we don't create a new ds_new object, which raises UnboundLocalError: local variable 'ds_new' referenced before assignment. For example, when the mode is configured to be something other than files or jsonl.

Tasks:

  • Check that mode is a valid value in the set [files, jsonl].
  • Only call ds_new.add_tags when a ds_new object is actually in scope (sketched below).
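
A minimal sketch of the guard, loosely modeled on what sync.py presumably does; the dataset-creation calls and names here are simplified assumptions, not the real code:

from azureml.core import Dataset

VALID_MODES = {"files", "jsonl"}

def build_dataset(datastore, path, mode):
    if mode == "files":
        ds_new = Dataset.File.from_files(path=[(datastore, path)])
    elif mode == "jsonl":
        ds_new = Dataset.Tabular.from_json_lines_files(path=[(datastore, path)])
    else:
        # Fail loudly instead of falling through to an unbound ds_new.
        raise ValueError(f"unsupported mode {mode!r}; expected one of {sorted(VALID_MODES)}")
    return ds_new   # callers only touch ds_new.add_tags(...) when this returns successfully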

Make syncer not recreate dataset versions every time it's restarted

Right now, every time you restart the syncer, it creates a new dataset version for every commit in every Pachyderm repo, even if they've already been created. We need to list the existing dataset versions when we start up and avoid recreating ones that already exist.

It's actually kinda useful that it does this for debugging right now.
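
A rough sketch of the startup check, assuming the syncer tags each registered dataset version with the Pachyderm commit ID it came from (the tag key and helper name are assumptions):

from azureml.core import Dataset

def existing_commit_ids(workspace, dataset_name):
    # Collect Pachyderm commit IDs already registered as versions of this dataset.
    seen = set()
    try:
        latest = Dataset.get_by_name(workspace, name=dataset_name)
    except Exception:
        return seen   # dataset not registered yet
    for version in range(1, latest.version + 1):
        ds = Dataset.get_by_name(workspace, name=dataset_name, version=version)
        commit = (ds.tags or {}).get("pachyderm_commit")   # assumed tag key
        if commit:
            seen.add(commit)
    return seen

# In the sync loop, skip commits that are already registered:
# if commit_id in existing_commit_ids(workspace, repo_name): continue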

Figure out how to log from rust code when executed via datasets SDK

Currently we have to return errors to debug anything, and then run the code from the python debugger, like this:

        return Err(StreamError::InvalidInput {
            message: format!("HELLO FROM PACHYDERM HANDLER! I can haz credential {:#?}", credential.clone()),
            source: None,
        });

and then in _rslex_executor.py manually place import pdb; pdb.set_trace()

Something on the Python side of the SDK is swallowing println! output from the Rust code somehow.

Bundle kubectl binary in the wheel file

Right now rslex-pachyderm depends on having kubectl installed. Instead, bundle it in the python wheel file (or, if that's hard, download it from the internet on-demand).

We won't need this for users using the workaround in #23 but we will need it once Microsoft are bundling and publishing rslex officially (as our installer script which installs kubectl won't run in that case).
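
If we go with the on-demand download fallback, a minimal sketch (the version pin, cache location, and platform are placeholders):

import pathlib
import stat
import urllib.request

KUBECTL_VERSION = "v1.21.0"   # placeholder pin
KUBECTL_URL = f"https://dl.k8s.io/release/{KUBECTL_VERSION}/bin/linux/amd64/kubectl"

def ensure_kubectl(cache_dir="~/.cache/rslex-pachyderm"):
    # Download kubectl into a per-user cache the first time, then reuse it.
    target = pathlib.Path(cache_dir).expanduser() / "kubectl"
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(KUBECTL_URL, target)
        target.chmod(target.stat().st_mode | stat.S_IEXEC)
    return str(target)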

Fix killing the port-forward at the right time

Currently the port-forward subprocess is started once and never killed. There's commented-out code for killing it, but currently it kills it too soon.

I've observed the port-forwards becoming unusable after a while and needing a manual restart (pkill -f port-forward).

However, ideally we could use pidfiles or something to leave the port-forward running between instantiations, so we don't need to make a new port-forward instance for every single request we make to Pachyderm. There must be a good balance. Maybe we can do a port-forward per execution of a lariat script?

We should eliminate the sleep statements as well, as they probably make rslex-pachyderm unusably slow for real workloads. When creating a new port-forward, we should rapidly poll for the port being open with a timeout, rather than just assuming it will be open after N seconds.
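
A minimal sketch of polling instead of sleeping, assuming the forward listens on localhost:30600:

import socket
import time

def wait_for_port(host="127.0.0.1", port=30600, timeout=10.0, interval=0.1):
    # Poll until something accepts connections on host:port, or give up after `timeout` seconds.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return
        except OSError:
            time.sleep(interval)
    raise TimeoutError(f"port-forward to {host}:{port} did not come up within {timeout}s")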

Support migrating data from ADLS Gen2 to Pachyderm

We want to support Azure customers who use ADLS Gen2.

As a first pass, we can use Pachyderm pipelines to ingest data from ADLS Gen2. The pipeline would be a cron pipeline that runs on a cadence, and has authority to access ADLS Gen2.

After this is implemented, we can work on a more sophisticated variant described here #41
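
A rough sketch of what that cron pipeline spec might look like on Pachyderm 1.x, piped into pachctl from Python; the image, command, and interval are placeholders, not a tested spec:

import json
import subprocess

# Hypothetical cron pipeline: on every tick, an ingest script copies new ADLS Gen2 objects into /pfs/out.
spec = {
    "pipeline": {"name": "adls_ingest"},
    "input": {"cron": {"name": "tick", "spec": "@every 1h"}},
    "transform": {
        "image": "example.azurecr.io/adls-ingest:latest",   # placeholder image with access to ADLS credentials
        "cmd": ["python3", "/app/ingest.py"],                # placeholder ingest script
    },
}

# pachctl accepts a pipeline spec on stdin with `-f -`.
subprocess.run(["pachctl", "create", "pipeline", "-f", "-"], input=json.dumps(spec).encode(), check=True)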

CI for aml repo

Maybe only worth doing this once we've demonstrated that someone is willing to pay for the solution.

  • Actually run Terraform against Azure
  • Write a commit to Pachyderm
  • Assert that dataset versions show up in AML
  • Assert that data can be read from them using an appropriately configured AML compute instance

Test if pkill -f port-forward is working

Anecdotally, I saw the port-forward get stuck, and rerunning the test script didn't kill it. Manually killing it and rerunning the test script did work, however.

Documentation for end users (private preview edition)

Suggest we publish this as an mkdocs site on readthedocs perhaps? Or just a markdown file in a public repo? Or integrate into the public Pachyderm docs?

In the future, also consider a Pachyderm marketing site page about the integration.

Deployment uses standard storage class

I see that the current deployment uses a standard storage class instead of premium, which is the recommended type for Pachyderm.

Also note that if you don't choose a larger PV, Azure won't meet the IOPS requirements for Pachyderm, so you need a PV of about 512 GB. See the warning in the Pachyderm docs.
