pachyderm / aml
AML planning repo
After #8 is done and tested, click the Go Live button on https://partner.microsoft.com/en-us/dashboard/commercial-marketplace/offers/bb037439-c75e-4024-8a25-ba35a3f51663/overview!
Currently when a user executes rslex-pachyderm code via Python, they see kubectl outputs like:
Forwarding from 127.0.0.1:30600 -> 600
Forwarding from [::1]:30600 -> 600
Handling connection for 30600
These are printed to stdout by kubectl. We should redirect them to /dev/null.
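A minimal sketch of the fix, assuming the port-forward is spawned from Python via subprocess (the exact launch site and arguments in rslex-pachyderm may differ):

import subprocess

# Launch kubectl port-forward with its chatty "Forwarding from ..." /
# "Handling connection for ..." messages discarded, so they never reach
# the user's stdout. The service name and ports here are illustrative.
proc = subprocess.Popen(
    ["kubectl", "port-forward", "service/pachd", "30600:600"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)

(If we want to keep error diagnostics, we could leave stderr attached and only discard stdout.)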
I deleted this file when tidying up the PR, but really we should rewrite it as a Rust test.
Right now rslex-pachyderm depends on having kubectl installed. Instead, bundle it in the Python wheel file (or, if that's hard, download it from the internet on-demand).
We won't need this for users using the workaround in #23 but we will need it once Microsoft are bundling and publishing rslex officially (as our installer script which installs kubectl won't run in that case).
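A hedged sketch of the download-on-demand option (the version pin and install path are placeholders; dl.k8s.io is the official Kubernetes release host):

import os
import stat
import urllib.request

KUBECTL_VERSION = "v1.21.0"  # placeholder; pin whatever version we test against
KUBECTL_PATH = os.path.expanduser("~/.rslex-pachyderm/kubectl")

def ensure_kubectl() -> str:
    """Download kubectl to a private location if it isn't already there."""
    if not os.path.exists(KUBECTL_PATH):
        os.makedirs(os.path.dirname(KUBECTL_PATH), exist_ok=True)
        url = f"https://dl.k8s.io/release/{KUBECTL_VERSION}/bin/linux/amd64/kubectl"
        urllib.request.urlretrieve(url, KUBECTL_PATH)
        # Make the downloaded binary executable.
        os.chmod(KUBECTL_PATH, os.stat(KUBECTL_PATH).st_mode | stat.S_IEXEC)
    return KUBECTL_PATH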
i.e. run it in a systemd unit rather than just backgrounding it in the install script!
Problem: how do we enable users to have multiple Azure ML workspaces, but talk to a single Pachyderm datastore?
We want to create a new syncer for each AML workspace.
Solution: there are two scenarios; for an existing pachyderm cluster:
export TF_VAR_existing_resource_group_name="resources-8cfbb924"
export TF_VAR_existing_pachyderm_cluster_name="pachyderm-8cfbb924"
bash scripts/setup.sh
Basically, build Docker images of Pachyderm from the post-pachyderm/pachyderm#6293 1.13.x branch.
See: https://pachyderm.slack.com/archives/CEY70V55G/p1623681261101000
Suggest jumping on a video call with him to do this live, as much as possible, to reduce latency / back and forth
Suggest we publish this as an mkdocs site on readthedocs perhaps? Or just a markdown file in a public repo? Or integrate into the public Pachyderm docs?
For future, also consider a Pachyderm marketing site page about the integration.
I see that the current deployment uses a standard storage class instead of premium, which is the recommended type for Pachyderm.
Also note that if you don't choose a larger PV, Azure won't meet the IOPS requirements for Pachyderm, so you need about a 512GB PV. See the warning in the Pachyderm docs.
In rslex-pachyderm/src/pachyderm_stream_handler/request_builder.rs:
impl RequestBuilder {
    /// Build a RequestBuilder from uri and SyncRecord.
    /// Return InvalidInput if uri is ill-formatted.
    pub fn new(uri: &str, credential: String) -> StreamResult<RequestBuilder> {
        // Sometimes we get instantiated with a uri like
        // commit.branch.repo/foo, sometimes like
        // http://localhost:30600/commit.branch.repo/foo. Work around the
        // latter case by special-casing it for now.
        let the_uri = uri.strip_prefix("http://localhost:30600/").unwrap_or(uri);
        ...
Also delete print statements :)
make all the warnings go away :)
make the test code compile again :)
In the rebase from hell (TM), I ended up not applying the changes that had been made to the code that the pachyderm handler was based on. It might be fine that the pachyderm handler remains based on an old version of the code, but let's look at this and think about it a little more.
This is true for both the rslex-azure-storage/blob code and the rslex-http code. Ask Luke for more details...
We can just use the S3 library to list directories. Something like:
s3.list(prefix, delimiter="/")
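For example, in Python with boto3 pointed at the Pachyderm S3 gateway (rslex itself would use its Rust S3 client; the endpoint and branch.repo bucket name here are illustrative):

import boto3

# The Pachyderm S3 gateway exposes each branch as a bucket (e.g. "master.edges");
# the endpoint/port are whatever the port-forward uses.
s3 = boto3.client("s3", endpoint_url="http://localhost:30600")

resp = s3.list_objects_v2(Bucket="master.edges", Prefix="images/", Delimiter="/")
files = [obj["Key"] for obj in resp.get("Contents", [])]
subdirs = [p["Prefix"] for p in resp.get("CommonPrefixes", [])]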
To test this, we need to configure the environment variable
RSLEX_DIRECT_VOLUME_MOUNT=true
Anecdotally, I saw the port-forward get stuck, and rerunning the test script didn't kill it. Manually killing it and rerunning the test script did work, however.
When using dataset.download(), the files get downloaded into a bonkers directory structure like http%blah.
It's increasingly obvious that ADLS gen 2 is a must-have for Microsoft sales folks. Need a spout that copies ADLS gen 2 data into pachyderm as it comes in.
How we can do ADLS Gen2 spouts into pachyderm-AML: get the terraform to:
Making it production-grade will be hard; considerations:
This code seems to be all about parsing Azure blob responses, but we delegate all the S3 XML parsing to the S3 library, so we shouldn't need this code.
Line https://github.com/pachyderm/aml/blob/main/syncer/sync.py#L107 can fail when we don't create a new ds_new object, which raises UnboundLocalError: local variable 'ds_new' referenced before assignment. For example, when the mode is configured to be something other than files or jsonl.
Tasks (sketched below):
Check that mode is a valid value in the set of [files, jsonl]
Only call ds_new.add_tags when a ds_new object is in scope
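A minimal sketch of both tasks, with hypothetical helper names (the real control flow in sync.py will differ):

VALID_MODES = {"files", "jsonl"}

def register_dataset(mode, commit):
    # Task 1: validate mode up front.
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")

    ds_new = None
    if mode == "files":
        ds_new = create_file_dataset(commit)   # hypothetical helper
    elif mode == "jsonl":
        ds_new = create_jsonl_dataset(commit)  # hypothetical helper

    # Task 2: only touch ds_new when it was actually created, avoiding the
    # UnboundLocalError seen at sync.py#L107.
    if ds_new is not None:
        ds_new.add_tags({"commit": commit})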
Andrei mentioned that the current instructions in the README in this repo won't actually work without Microsoft allowlisting the custom datastores feature for a user (I think). Work with Microsoft to figure out a workflow for enabling this when we have a private preview user who wants to try it out, and update the README accordingly.
running terraform to deploy a stack for an individual developer
setting up rslex build environment on an AML VM and how to iterate
running the packer build script and publishing updated VM images to the marketplace
Now that the marketplace offer is working, we should be able to update the terraform code to refer to our marketplace VM (after #8 is done), possibly also adding a marketplace agreement resource.
Then users can automatically terraform an AML/pachyderm solution which pays us money!
Ref: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/virtual_machine#name
Currently the port-forward subprocess is started once and never killed. There's commented-out code for killing it, but currently it kills it too soon.
I've observed the port-forwards becoming unusable after a while and needing a manual restart: pkill -f port-forward
However, ideally we could use pidfiles or something to leave the port-forward running between instantiations, so we don't need to make a new port-forward instance for every single request we make to pachyderm. There must be a good balance. Maybe we can do a port-forward per execution of a lariat script?
We should eliminate the sleep statements as well, as they probably make rslex-pachyderm unusably slow for real workloads. When creating a new port-forward we should rapidly poll for the port being open, with a timeout, rather than just assuming that it will be open after N seconds.
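A sketch of the polling approach (the timeout values are guesses):

import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 10.0) -> None:
    """Poll until the port-forward is accepting connections, instead of
    sleeping for a fixed N seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.25):
                return
        except OSError:
            time.sleep(0.05)
    raise TimeoutError(f"port-forward to {host}:{port} not ready after {timeout}s")

# e.g. call wait_for_port("127.0.0.1", 30600) right after spawning kubectl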
The current version installed by the install script in this repo isn't the latest working code.
In the rebase from hell (TM), I ended up just copying over the old rslex-http-stream and renaming it. This is cruft which should probably be removed before we open the PR.
Update the packer code so that it does the job of this script: https://github.com/pachyderm/aml/blob/main/scripts/install.sh
And remove that script from the aml repo.
On the syncer VM, the syncer itself crashloops (under systemd) until terraform copies the env.sh file onto the VM.
Run this https://github.com/pachyderm/azureml-demo-syncer/blob/terraform/setup.md#packer to build the image.
Essentially, rewrite test_pachyderm.py as a Rust test. (I deleted this file when tidying up the PR.)
We want to support Azure customers who use ADLS Gen2.
As a first pass, we can use Pachyderm pipelines to ingest data from ADLS Gen2. The pipeline would be a cron pipeline that runs on a cadence, and has authority to access ADLS Gen2.
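A rough sketch of what the cron pipeline's user code could look like, assuming the azure-storage-file-datalake SDK and placeholder env vars for credentials (everything here is illustrative, not the final design):

import os
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders: real values would come from the pipeline's secrets/env.
ACCOUNT_URL = os.environ["ADLS_ACCOUNT_URL"]   # e.g. https://<account>.dfs.core.windows.net
FILESYSTEM = os.environ["ADLS_FILESYSTEM"]     # the ADLS Gen2 container name

client = DataLakeServiceClient(account_url=ACCOUNT_URL,
                               credential=os.environ["ADLS_ACCOUNT_KEY"])
fs = client.get_file_system_client(FILESYSTEM)

# Copy every file into /pfs/out; Pachyderm commits the output when the
# cron pipeline run finishes, so each run becomes one commit.
for path in fs.get_paths(recursive=True):
    if path.is_directory:
        continue
    dest = os.path.join("/pfs/out", path.name)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    with open(dest, "wb") as f:
        f.write(fs.get_file_client(path.name).download_file().readall())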
After this is implemented, we can work on a more sophisticated variant described here #41
Given a pachyderm repo with subdirectories containing files and possibly more subdirectories, the Searcher currently cannot recurse into these subdirectories.
Currently we have to return errors to debug anything, and then run the code from the python debugger, like this:
return Err(StreamError::InvalidInput {
    message: format!("HELLO FROM PACHYDERM HANDLER! I can haz credential {:#?}", credential.clone()),
    source: None,
});
and then in _rslex_executor.py, manually place import pdb; pdb.set_trace()
Something on the Python side of the SDK is somehow swallowing stdout println! output from the Rust code.
Maybe only worth doing this once we've demonstrated that someone is willing to pay for the solution.
Decided not to do.
https://docs.microsoft.com/en-us/azure/virtual-machines/custom-data
We'll need to:
build a VM image with Provisioning.DecodeCustomData enabled and which starts the syncer in a systemd unit
have the syncer read /var/lib/waagent/CustomData, write it to a persistent location, and read it on startup
In particular, the custom data for the syncer will include:
# azureml instance
export AZURE_SUBSCRIPTION_ID="${data.azurerm_client_config.current.subscription_id}"
export AZURE_RESOURCE_GROUP="${azurerm_resource_group.main.name}"
export AZURE_ML_WORKSPACE_NAME="${azurerm_machine_learning_workspace.example.name}"
# storage for pachyderm
export AZURE_STORAGE_CONTAINER="${azurerm_storage_container.pachyderm.name}"
export AZURE_STORAGE_ACCOUNT_NAME="${azurerm_storage_account.pachyderm.name}"
export AZURE_STORAGE_ACCOUNT_KEY="${azurerm_storage_account.pachyderm.primary_access_key}"
Or alternatively, we can use the existing rsync env.sh file approach, but just with our marketplace VM.
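If we go the custom data route, a minimal sketch of the persist-on-first-boot step (the persistent location is an assumption):

import os
import shutil

CUSTOM_DATA = "/var/lib/waagent/CustomData"  # decoded when Provisioning.DecodeCustomData is enabled
PERSISTED = "/etc/pachyderm-syncer/env.sh"   # assumed persistent location

def persist_custom_data() -> None:
    """waagent may not keep CustomData around forever, so copy it to a
    persistent location on first boot; later boots just reuse the copy."""
    if not os.path.exists(PERSISTED):
        os.makedirs(os.path.dirname(PERSISTED), exist_ok=True)
        shutil.copy(CUSTOM_DATA, PERSISTED)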
We probably shouldn't have to ship the kubectl binary along with the rslex code. To avoid that, we'd need to implement kubectl port-forward in Rust.
Or! Figure out how we can reliably and securely access pachyderm on a Kube cluster.
Or! Bundle the kubectl binary in the wheel file, or download it on-demand?
Reference: https://github.com/kubernetes/kubectl/blob/master/pkg/cmd/portforward/portforward.go
Add terraform variables to allow a syncer to "adopt" an existing Pachyderm cluster, rather than creating a new one every time.
This is so that we can support connecting one data source to multiple AML workspaces.
Prior to MSFT accepting our PR and rolling out the updated version of rslex globally, we should have an easy way for users to try this out.
Build and publish the rslex wheel from AML VMs with the correct versions of Python, and create a quick shell script that users can run (via !) from an AML notebook to upgrade their rslex to a pachyderm-compatible one.
Run through: https://github.com/pachyderm/aml#advanced-using-pachctl-locally
Then do:
pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/1.13.x/examples/opencv/edges.json
Run like:
$ cat foo.py
# azureml-core of version 1.0.72 or higher is required
from azureml.core import Workspace, Dataset
subscription_id = '04701c5f-d635-4103-a3a9-0d74aa3ddc51'
resource_group = 'luke-testing'
workspace_name = 'extant'
workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='Pachyderm repo edges')
dataset.download(target_path='.', overwrite=False)
Add pdb to /anaconda/envs/azureml_py38/lib/python3.8/site-packages/azureml/dataprep/api/_rslex_executor.py:
35 import pdb; pdb.set_trace()
36 -> (batches, num_partitions, stream_columns) = ex.execute(
Exception: Failed with execution error: error in streaming from input data sources
ExecutionError(StreamError(Unknown("Could not deserialize result \n custom: missing field `Name`", None)))
=> error in streaming from input data sources
StreamError(Unknown("Could not deserialize result \n custom: missing field `Name`", None))
=> unexpected error
Unknown("Could not deserialize result \n custom: missing field `Name`", None)
Right now, every time you restart the syncer, it creates new dataset versions for every commit in every pachyderm repo, even if they've already been created. We need to list the existing dataset versions when we start up, and avoid recreating ones that have already been created.
It's actually kinda useful that it does this for debugging right now.
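One possible shape for the dedupe, assuming we tag each registered dataset version with the Pachyderm commit it came from (the pachyderm_commit tag name is our invention; checking only the latest version is a simplification):

from azureml.core import Dataset

def already_synced(workspace, dataset_name: str, commit_id: str) -> bool:
    """Skip re-registration if a version tagged with this Pachyderm commit
    already exists. A fuller implementation would walk all versions."""
    try:
        ds = Dataset.get_by_name(workspace, name=dataset_name, version="latest")
    except Exception:
        return False  # dataset has never been registered
    return bool(ds.tags) and ds.tags.get("pachyderm_commit") == commit_id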
In particular, let list_bucket_result = blob_list[0].clone(); in searcher.rs makes us suspicious!