pachyderm / aml
AML planning repo
After #8 is done and tested, click the Go Live button on https://partner.microsoft.com/en-us/dashboard/commercial-marketplace/offers/bb037439-c75e-4024-8a25-ba35a3f51663/overview!
Currently when a user executes rslex-pachyderm code via Python, they see kubectl outputs like:
Forwarding from 127.0.0.1:30600 -> 600
Forwarding from [::1]:30600 -> 600
Handling connection for 30600
These are printed to stdout by kubectl. We should redirect them to /dev/null.
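A minimal sketch of the fix, assuming the port-forward is spawned from Python via subprocess (the exact launch site and arguments in rslex-pachyderm may differ):

import subprocess

# Launch kubectl port-forward with its chatty "Forwarding from ..." /
# "Handling connection for ..." messages discarded, so they never reach
# the user's stdout. The service name and ports here are illustrative.
proc = subprocess.Popen(
    ["kubectl", "port-forward", "service/pachd", "30600:600"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)

(If we want to keep error diagnostics, we could leave stderr attached and only discard stdout.)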
I deleted this file when tidying up the PR, but really we should rewrite it as a Rust test.
Right now rslex-pachyderm depends on having kubectl installed. Instead, bundle it in the Python wheel file (or, if that's hard, download it from the internet on-demand).
We won't need this for users using the workaround in #23 but we will need it once Microsoft are bundling and publishing rslex officially (as our installer script which installs kubectl won't run in that case).
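A hedged sketch of the download-on-demand option (the version pin and install path are placeholders; dl.k8s.io is the official Kubernetes release host):

import os
import stat
import urllib.request

KUBECTL_VERSION = "v1.21.0"  # placeholder; pin whatever version we test against
KUBECTL_PATH = os.path.expanduser("~/.rslex-pachyderm/kubectl")

def ensure_kubectl() -> str:
    """Download kubectl to a private location if it isn't already there."""
    if not os.path.exists(KUBECTL_PATH):
        os.makedirs(os.path.dirname(KUBECTL_PATH), exist_ok=True)
        url = f"https://dl.k8s.io/release/{KUBECTL_VERSION}/bin/linux/amd64/kubectl"
        urllib.request.urlretrieve(url, KUBECTL_PATH)
        # Make the downloaded binary executable.
        os.chmod(KUBECTL_PATH, os.stat(KUBECTL_PATH).st_mode | stat.S_IEXEC)
    return KUBECTL_PATH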
i.e. run it in a systemd unit rather than just backgrounding it in the install script!
Problem: how do we enable users to have multiple Azure ML workspaces, but talk to a single Pachyderm datastore?
We want to create a new syncer for each AML workspace.
Solution: there are two scenarios; for an existing pachyderm cluster:
export TF_VAR_existing_resource_group_name="resources-8cfbb924"
export TF_VAR_existing_pachyderm_cluster_name="pachyderm-8cfbb924"
bash scripts/setup.sh
Basically, build Docker images of Pachyderm from the post-pachyderm/pachyderm#6293 1.13.x branch.
See: https://pachyderm.slack.com/archives/CEY70V55G/p1623681261101000
Suggest jumping on a video call with him to do this live, as much as possible, to reduce latency / back and forth
Suggest we publish this as an mkdocs site on readthedocs perhaps? Or just a markdown file in a public repo? Or integrate into the public Pachyderm docs?
For future, also consider a Pachyderm marketing site page about the integration.
I see that the current deployment uses a standard storage class instead of premium, which is the recommended type for Pachyderm.
Also note that if you don't choose a larger PV, Azure won't meet the IOPS requirements for Pachyderm, so you need about a 512GB PV. See the warning in the Pachyderm docs.
In rslex-pachyderm/src/pachyderm_stream_handler/request_builder.rs:
impl RequestBuilder {
    /// Build a RequestBuilder from uri and SyncRecord.
    /// Return InvalidInput if uri is ill-formatted.
    pub fn new(uri: &str, credential: String) -> StreamResult<RequestBuilder> {
        // Sometimes we get instantiated with a uri like
        // commit.branch.repo/foo, sometimes like
        // http://localhost:30600/commit.branch.repo/foo. Work around the
        // latter case by special-casing it for now.
        let the_uri = uri.strip_prefix("http://localhost:30600/").unwrap_or(uri);
        ...
Also delete print statements :)
make all the warnings go away :)
make the test code compile again :)
In the rebase from hell (TM), I ended up not applying the changes that had been made to the code that the pachyderm handler was based on. It might be fine that the pachyderm handler remains based on an old version of the code, but let's look at this and think about it a little more.
This is true for both the rslex-azure-storage/blob code and the rslex-http code. Ask Luke for more details...
We can just use the S3 library to list directories. Something like:
s3.list(prefix, delimiter="/")
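For example, in Python with boto3 pointed at the Pachyderm S3 gateway (rslex itself would use its Rust S3 client; the endpoint and branch.repo bucket name here are illustrative):

import boto3

# The Pachyderm S3 gateway exposes each branch as a bucket (e.g. "master.edges");
# the endpoint/port are whatever the port-forward uses.
s3 = boto3.client("s3", endpoint_url="http://localhost:30600")

resp = s3.list_objects_v2(Bucket="master.edges", Prefix="images/", Delimiter="/")
files = [obj["Key"] for obj in resp.get("Contents", [])]
subdirs = [p["Prefix"] for p in resp.get("CommonPrefixes", [])]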
To test this, we need to configure the environment variable
RSLEX_DIRECT_VOLUME_MOUNT=true
Anecdotally, I saw the port-forward get stuck, and rerunning the test script didn't kill it. Manually killing it and rerunning the test script did work, however.
When using dataset.download(), the files get downloaded into a bonkers directory structure like http%blah.
It's increasingly obvious that ADLS gen 2 is a must-have for Microsoft sales folks. Need a spout that copies ADLS gen 2 data into pachyderm as it comes in.
How we can do ADLS Gen2 spouts into pachyderm-AML: get the terraform to:
Making it production-grade will be hard; considerations:
This code seems to be all about parsing Azure blob responses, but we delegate all the S3 XML parsing to the S3 library, so we shouldn't need this code.
Line https://github.com/pachyderm/aml/blob/main/syncer/sync.py#L107 can fail when we don't create a new ds_new object, which raises UnboundLocalError: local variable 'ds_new' referenced before assignment. For example, when the mode is configured to be something other than files or jsonl.
Tasks (sketched below):
Check that mode is a valid value in the set of [files, jsonl]
Only call ds_new.add_tags when a ds_new object is in scope
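A minimal sketch of both tasks, with hypothetical helper names (the real control flow in sync.py will differ):

VALID_MODES = {"files", "jsonl"}

def register_dataset(mode, commit):
    # Task 1: validate mode up front.
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")

    ds_new = None
    if mode == "files":
        ds_new = create_file_dataset(commit)   # hypothetical helper
    elif mode == "jsonl":
        ds_new = create_jsonl_dataset(commit)  # hypothetical helper

    # Task 2: only touch ds_new when it was actually created, avoiding the
    # UnboundLocalError seen at sync.py#L107.
    if ds_new is not None:
        ds_new.add_tags({"commit": commit})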
Andrei mentioned that the current instructions in the README in this repo won't actually work without Microsoft allowlisting the custom datastores feature for a user (I think). Work with Microsoft to figure out a workflow for enabling this when we have a private preview user who wants to try it out, and update the README accordingly.
running terraform to deploy a stack for an individual developer
setting up rslex build environment on an AML VM and how to iterate
running the packer build script and publishing updated VM images to the marketplace
Now that the marketplace offer is working, we should be able to update the terraform code to refer to our marketplace VM (after #8 is done), possibly also adding a marketplace agreement resource.
Then users can automatically terraform an AML/pachyderm solution which pays us money!
Ref: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/virtual_machine#name
Currently the port-forward subprocess is started once and never killed. There's commented-out code for killing it, but currently it kills it too soon.
I've observed the port-forwards becoming unusable after a while and needing a manual restart: pkill -f port-forward
However, ideally we could use pidfiles or something to leave the port-forward running between instantiations, so we don't need to make a new port-forward instance for every single request we make to pachyderm. There must be a good balance. Maybe we can do a port-forward per execution of a lariat script?
We should eliminate the sleep statements as well, as they probably make rslex-pachyderm unusably slow for real workloads. When creating a new port-forward we should rapidly poll for the port being open, with a timeout, rather than just assuming that it will be open after N seconds.
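A sketch of the polling approach (the timeout values are guesses):

import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 10.0) -> None:
    """Poll until the port-forward is accepting connections, instead of
    sleeping for a fixed N seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.25):
                return
        except OSError:
            time.sleep(0.05)
    raise TimeoutError(f"port-forward to {host}:{port} not ready after {timeout}s")

# e.g. call wait_for_port("127.0.0.1", 30600) right after spawning kubectl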
The current version installed by the install script in this repo isn't the latest working code.
In the rebase from hell (TM), I ended up just copying over the old rslex-http-stream and renaming it. This is cruft which should probably be removed before we open the PR.
Update the packer code so that it does the job of this script: https://github.com/pachyderm/aml/blob/main/scripts/install.sh
And remove that script from the aml repo.
On the syncer VM, the syncer itself crashloops (under systemd) until terraform copies the env.sh file onto the VM.
Run this https://github.com/pachyderm/azureml-demo-syncer/blob/terraform/setup.md#packer to build the image.
Essentially, rewrite test_pachyderm.py as a Rust test. (I deleted this file when tidying up the PR.)
We want to support Azure customers who use ADLS Gen2.
As a first pass, we can use Pachyderm pipelines to ingest data from ADLS Gen2. The pipeline would be a cron pipeline that runs on a cadence, and has authority to access ADLS Gen2.
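A rough sketch of what the cron pipeline's user code could look like, assuming the azure-storage-file-datalake SDK and placeholder env vars for credentials (everything here is illustrative, not the final design):

import os
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders: real values would come from the pipeline's secrets/env.
ACCOUNT_URL = os.environ["ADLS_ACCOUNT_URL"]   # e.g. https://<account>.dfs.core.windows.net
FILESYSTEM = os.environ["ADLS_FILESYSTEM"]     # the ADLS Gen2 container name

client = DataLakeServiceClient(account_url=ACCOUNT_URL,
                               credential=os.environ["ADLS_ACCOUNT_KEY"])
fs = client.get_file_system_client(FILESYSTEM)

# Copy every file into /pfs/out; Pachyderm commits the output when the
# cron pipeline run finishes, so each run becomes one commit.
for path in fs.get_paths(recursive=True):
    if path.is_directory:
        continue
    dest = os.path.join("/pfs/out", path.name)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    with open(dest, "wb") as f:
        f.write(fs.get_file_client(path.name).download_file().readall())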
After this is implemented, we can work on a more sophisticated variant described here #41
Given a pachyderm repo with subdirectories containing files and possibly more subdirectories, the Searcher currently cannot recurse into these subdirectories.
Currently we have to return errors to debug anything, and then run the code from the python debugger, like this:
return Err(StreamError::InvalidInput {
    message: format!("HELLO FROM PACHYDERM HANDLER! I can haz credential {:#?}", credential.clone()),
    source: None,
});
and then in _rslex_executor.py, manually place import pdb; pdb.set_trace()
Something on the Python side of the SDK is somehow swallowing stdout println! output from the Rust code.
Maybe only worth doing this once we've demonstrated that someone is willing to pay for the solution.
Decided not to do.
https://docs.microsoft.com/en-us/azure/virtual-machines/custom-data
We'll need to:
build a VM image with Provisioning.DecodeCustomData enabled and which starts the syncer in a systemd unit
have the syncer read /var/lib/waagent/CustomData, write it to a persistent location, and read it on startup
In particular, the custom data for the syncer will include:
# azureml instance
export AZURE_SUBSCRIPTION_ID="${data.azurerm_client_config.current.subscription_id}"
export AZURE_RESOURCE_GROUP="${azurerm_resource_group.main.name}"
export AZURE_ML_WORKSPACE_NAME="${azurerm_machine_learning_workspace.example.name}"
# storage for pachyderm
export AZURE_STORAGE_CONTAINER="${azurerm_storage_container.pachyderm.name}"
export AZURE_STORAGE_ACCOUNT_NAME="${azurerm_storage_account.pachyderm.name}"
export AZURE_STORAGE_ACCOUNT_KEY="${azurerm_storage_account.pachyderm.primary_access_key}"
Or alternatively, we can use the existing rsync env.sh file approach, but just with our marketplace VM.
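If we go the custom data route, a minimal sketch of the persist-on-first-boot step (the persistent location is an assumption):

import os
import shutil

CUSTOM_DATA = "/var/lib/waagent/CustomData"  # decoded when Provisioning.DecodeCustomData is enabled
PERSISTED = "/etc/pachyderm-syncer/env.sh"   # assumed persistent location

def persist_custom_data() -> None:
    """waagent may not keep CustomData around forever, so copy it to a
    persistent location on first boot; later boots just reuse the copy."""
    if not os.path.exists(PERSISTED):
        os.makedirs(os.path.dirname(PERSISTED), exist_ok=True)
        shutil.copy(CUSTOM_DATA, PERSISTED)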
We probably shouldn't have to ship the kubectl binary along with the rslex code. To avoid that, we'd need to implement kubectl port-forward in Rust.
Or! Figure out how we can reliably and securely access pachyderm on a Kube cluster.
Or! Bundle the kubectl binary in the wheel file, or download it on-demand?
Reference: https://github.com/kubernetes/kubectl/blob/master/pkg/cmd/portforward/portforward.go
Add terraform variables to allow a syncer to "adopt" an existing Pachyderm cluster, rather than creating a new one every time.
This is so that we can support connecting one data source to multiple AML workspaces.
Prior to MSFT accepting our PR and rolling out the updated version of rslex globally, we should have an easy way for users to try this out.
Build and publish the rslex wheel from AML VMs with the correct versions of Python, and create a quick shell script that users can run (via !) from an AML notebook to upgrade their rslex to a pachyderm-compatible one.
Run through: https://github.com/pachyderm/aml#advanced-using-pachctl-locally
Then do:
pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/1.13.x/examples/opencv/edges.json
Run like:
$ cat foo.py
# azureml-core of version 1.0.72 or higher is required
from azureml.core import Workspace, Dataset
subscription_id = '04701c5f-d635-4103-a3a9-0d74aa3ddc51'
resource_group = 'luke-testing'
workspace_name = 'extant'
workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='Pachyderm repo edges')
dataset.download(target_path='.', overwrite=False)
Add pdb to /anaconda/envs/azureml_py38/lib/python3.8/site-packages/azureml/dataprep/api/_rslex_executor.py:
35 import pdb; pdb.set_trace()
36 -> (batches, num_partitions, stream_columns) = ex.execute(
Exception: Failed with execution error: error in streaming from input data sources
ExecutionError(StreamError(Unknown("Could not deserialize result \n custom: missing field `Name`", None)))
=> error in streaming from input data sources
StreamError(Unknown("Could not deserialize result \n custom: missing field `Name`", None))
=> unexpected error
Unknown("Could not deserialize result \n custom: missing field `Name`", None)
Right now, every time you restart the syncer, it creates new dataset versions for every commit in every pachyderm repo, even if they've already been created. We need to list the existing dataset versions when we start up, and avoid recreating ones that have already been created.
It's actually kinda useful that it does this for debugging right now.
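One possible shape for the dedupe, assuming we tag each registered dataset version with the Pachyderm commit it came from (the pachyderm_commit tag name is our invention; checking only the latest version is a simplification):

from azureml.core import Dataset

def already_synced(workspace, dataset_name: str, commit_id: str) -> bool:
    """Skip re-registration if a version tagged with this Pachyderm commit
    already exists. A fuller implementation would walk all versions."""
    try:
        ds = Dataset.get_by_name(workspace, name=dataset_name, version="latest")
    except Exception:
        return False  # dataset has never been registered
    return bool(ds.tags) and ds.tags.get("pachyderm_commit") == commit_id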
In particular, let list_bucket_result = blob_list[0].clone(); in searcher.rs makes us suspicious!