
aml's Introduction


Pachyderm – Automate data transformations with data versioning and lineage

Pachyderm is cost-effective at scale, enabling data engineering teams to automate complex pipelines with sophisticated data transformations across any type of data. Our unique approach provides parallelized processing of multi-stage, language-agnostic pipelines with data versioning and data lineage tracking. Pachyderm delivers the ultimate CI/CD engine for data.

Features

  • Data-driven pipelines automatically trigger based on detecting data changes.
  • Immutable data lineage with data versioning of any data type.
  • Autoscaling and parallel processing built on Kubernetes for resource orchestration.
  • Uses standard object stores for data storage with automatic deduplication.
  • Runs across all major cloud providers and on-premises installations.

Getting Started

To start deploying your end-to-end version-controlled data pipelines, run Pachyderm locally, or deploy on AWS/GCE/Azure in about 5 minutes.

You can also refer to our complete documentation to see tutorials, check out example projects, and learn about advanced features of Pachyderm.

If you'd like to see some examples and learn about core use cases for Pachyderm, start with the example projects and tutorials in the documentation below.

Documentation

Official Documentation

Community

Keep up to date and get Pachyderm support via:

  • Twitter: follow us on Twitter.
  • Slack: join our community Slack channel to get help from the Pachyderm team and other users.

Contributing

To get started, sign the Contributor License Agreement.

You should also check out our contributing guide.

Send us PRs, we would love to see what you do! You can also check our GitHub issues for things labeled "help-wanted" as a good place to start. We're sometimes bad about keeping that label up to date, so if you don't see any, just let us know.

Usage Metrics

Pachyderm automatically reports anonymized usage metrics. These metrics help us understand how people are using Pachyderm and make it better. They can be disabled by setting the METRICS environment variable to false in the pachd container.


aml's Issues

Bug when running edges tutorial

Run through: https://github.com/pachyderm/aml#advanced-using-pachctl-locally

Then do:

pachctl create pipeline -f https://raw.githubusercontent.com/pachyderm/pachyderm/1.13.x/examples/opencv/edges.json

Run like:

$ cat foo.py
# azureml-core of version 1.0.72 or higher is required
from azureml.core import Workspace, Dataset

subscription_id = '04701c5f-d635-4103-a3a9-0d74aa3ddc51'
resource_group = 'luke-testing'
workspace_name = 'extant'

workspace = Workspace(subscription_id, resource_group, workspace_name)

dataset = Dataset.get_by_name(workspace, name='Pachyderm repo edges')
dataset.download(target_path='.', overwrite=False)

Add pdb to /anaconda/envs/azureml_py38/lib/python3.8/site-packages/azureml/dataprep/api/_rslex_executor.py:

 35                 import pdb; pdb.set_trace()
 36  ->             (batches, num_partitions, stream_columns) = ex.execute(
Exception: Failed with execution error: error in streaming from input data sources
        ExecutionError(StreamError(Unknown("Could not deserialize result \n custom: missing field `Name`", None)))
=> error in streaming from input data sources
        StreamError(Unknown("Could not deserialize result \n custom: missing field `Name`", None))
=> unexpected error
        Unknown("Could not deserialize result \n custom: missing field `Name`", None)

ADLS Gen 2 spout

It's increasingly obvious that ADLS gen 2 is a must-have for Microsoft sales folks. Need a spout that copies ADLS gen 2 data into pachyderm as it comes in.

How we can do ADLS gen 2 spouts into pachyderm-AML: get the terraform to:

  1. set up an ADLS gen 2 storage account (i.e. "enable hierarchical namespace when creating a storage account")
  2. create a service bus instance and a topic within it
  3. configure an ADLS gen 2 event to push the appropriate events to the service bus
  4. have terraform output appropriate credentials for both of the above that we can pass into pachyderm (have terraform write them into a k8s secret? or can we get the pachyderm spout to inherit an Azure service account somehow?)
  5. create a pachyderm spout which reads from the queue and uses the notifications to go and download data from ADLS gen 2 into pachyderm (with caveats below)
  6. tada! demo dropping some json into ADLS gen 2 and the spout runs and a pachyderm commit appears magically as an immutable AML dataset version that the data scientist can read - this is cool because ADLS gen 2 doesn't itself support versioning

Making it production-grade will be hard. Considerations (a rough spout sketch follows this list):

  • what happens when there's data in ADLS before the spout starts listening
  • what happens when the spout gets disconnected
  • what happens when there's terabytes of data and billions of files
  • what happens if there are conflicting changes on both sides
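
A very rough sketch of the receive loop for the spout, assuming blob-created events are forwarded from Event Grid to a Service Bus subscription and that pachctl is available inside the spout container; the topic, subscription, environment variables, and event payload shape below are all assumptions, not a tested design, and a real spout would write into Pachyderm via whatever spout mechanism the Pachyderm version in use supports:

# Hypothetical spout loop: consume blob-created events from Service Bus and
# copy the referenced ADLS Gen2 objects into a Pachyderm repo via pachctl.
import json
import os
import subprocess

from azure.servicebus import ServiceBusClient                   # pip install azure-servicebus
from azure.storage.filedatalake import DataLakeServiceClient    # pip install azure-storage-file-datalake

CONN_STR = os.environ["SERVICE_BUS_CONNECTION_STRING"]          # assumed Terraform output
ADLS_URL = os.environ["ADLS_ACCOUNT_URL"]                       # e.g. https://<account>.dfs.core.windows.net
ADLS_KEY = os.environ["ADLS_ACCOUNT_KEY"]
PACH_REPO = os.environ.get("PACH_REPO", "adls-ingest")          # assumed target repo

lake = DataLakeServiceClient(account_url=ADLS_URL, credential=ADLS_KEY)

with ServiceBusClient.from_connection_string(CONN_STR) as bus:
    with bus.get_subscription_receiver(topic_name="adls-events", subscription_name="pachyderm-spout") as receiver:
        while True:
            for msg in receiver.receive_messages(max_message_count=10, max_wait_time=30):
                event = json.loads(str(msg))
                # Assumed Event Grid BlobCreated payload: subject looks like
                # /blobServices/default/containers/<filesystem>/blobs/<path>
                _, _, _, _, filesystem, _, path = event["subject"].split("/", 6)
                data = lake.get_file_client(filesystem, path).download_file().readall()
                # Mirror the object into Pachyderm at the same path; pachctl reads stdin via `-f -`.
                subprocess.run(
                    ["pachctl", "put", "file", f"{PACH_REPO}@master:/{path}", "-f", "-"],
                    input=data, check=True,
                )
                receiver.complete_message(msg)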

[rslex] Improve how we pass `uri` into `request_builder`

In rslex-pachyderm/src/pachyderm_stream_handler/request_builder.rs

impl RequestBuilder {
    /// Build a RequestBuilder from uri and SyncRecord.
    /// Return InvalidInput if uri is ill-formatted.
    pub fn new(uri: &str, credential: String) -> StreamResult<RequestBuilder> {
        // Sometimes we get instantiated with uri like
        // commit.branch.repo/foo, sometimes like
        // http://localhost:30600/commit.branch.repo/foo. Work around the latter
        // case by special-casing it for now.
        let the_uri = uri.strip_prefix("http://localhost:30600/").unwrap_or(uri);
...

Strange folder structure

When using dataset.download(), the files get downloaded into a bonkers directory structure like http%blah.

(optional) Figure out how to pass custom data into the VM

Decided not to do.

https://docs.microsoft.com/en-us/azure/virtual-machines/custom-data

We'll need to:

  1. Build a VM image with Provisioning.DecodeCustomData enabled and which starts the syncer in a systemd unit
  2. Have the syncer read the data from /var/lib/waagent/CustomData, write it to a persistent location, and read it back on startup (a rough sketch follows at the end of this issue)

In particular, the custom data for the syncer will include:

# azureml instance
export AZURE_SUBSCRIPTION_ID="${data.azurerm_client_config.current.subscription_id}"
export AZURE_RESOURCE_GROUP="${azurerm_resource_group.main.name}"
export AZURE_ML_WORKSPACE_NAME="${azurerm_machine_learning_workspace.example.name}"

# storage for pachyderm
export AZURE_STORAGE_CONTAINER="${azurerm_storage_container.pachyderm.name}"
export AZURE_STORAGE_ACCOUNT_NAME="${azurerm_storage_account.pachyderm.name}"
export AZURE_STORAGE_ACCOUNT_KEY="${azurerm_storage_account.pachyderm.primary_access_key}"

Or alternatively, we can use the existing rsync env.sh file approach but just with our marketplace VM.
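
A minimal sketch of step 2, assuming Provisioning.DecodeCustomData leaves the decoded payload at /var/lib/waagent/CustomData; the persistent path and the idea of parsing it as an env file are assumptions:

# Hypothetical syncer startup step: persist the VM custom data once, then load it on every start.
import os
import pathlib
import shlex

CUSTOM_DATA = pathlib.Path("/var/lib/waagent/CustomData")
PERSISTED = pathlib.Path("/var/lib/pachyderm-aml/env.sh")   # assumed persistent location

def load_syncer_env():
    # Copy the custom data to a persistent location the first time we see it.
    if CUSTOM_DATA.exists() and not PERSISTED.exists():
        PERSISTED.parent.mkdir(parents=True, exist_ok=True)
        PERSISTED.write_text(CUSTOM_DATA.read_text())
    # Parse the `export KEY="value"` lines into the process environment.
    for line in PERSISTED.read_text().splitlines():
        line = line.strip()
        if line.startswith("export "):
            key, _, value = line[len("export "):].partition("=")
            os.environ[key] = shlex.split(value)[0] if value.strip() else ""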

Programmatically define multiple AML workspaces.

Problem: how do we enable users to have multiple Azure ML workspaces, but talk to a single Pachyderm datastore?

We want to create a new syncer for each AML workspace.

Solution:

There are two scenarios for an existing Pachyderm cluster:

  1. The Pachyderm cluster was created outside of the current Terraform context, and the user can provide their existing workspaces.
  • Add a variable to indicate an existing Pachyderm cluster.
  • Use this variable to avoid creating a new Pachyderm cluster.
  • The user would have to provide their own kubeconfig and copy it to the new syncer.

export TF_VAR_existing_resource_group_name="resources-8cfbb924"
export TF_VAR_existing_pachyderm_cluster_name="pachyderm-8cfbb924"
bash scripts/setup.sh

  2. The Pachyderm cluster was created inside the current Terraform context.
  • In this case, we only need to create a new syncer per workspace.
  • We should configure Terraform to be able to create N syncers given N workspaces.

Consider applying the diff from `rslex-azure-storage/blob` in the rebase of `rslex-pachyderm`

In the rebase from hell (TM), I ended up not applying the changes that had been made to the code that the pachyderm handler was based on. It might be fine that the pachyderm handler remains based on an old version of the code, but let's look at this and think about it a little more.

This is true for both the rslex-azure-storage/blob code and the rslex-http code. Ask Luke for more details...

Hide kubectl port-forward outputs

Currently when a user executes rslex-pachyderm code via Python, they see kubectl outputs like:

Forwarding from 127.0.0.1:30600 -> 600
Forwarding from [::1]:30600 -> 600
Handling connection for 30600

These are printed to stdout by kubectl. We should redirect them to /dev/null.
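
If the port-forward is launched from Python via subprocess (an assumption about how rslex-pachyderm shells out to kubectl), a minimal sketch of the fix is:

import subprocess

# Discard kubectl's "Forwarding from ..." / "Handling connection ..." chatter so it
# doesn't pollute the user's stdout. The service name and ports here are assumptions
# matching the output above.
port_forward = subprocess.Popen(
    ["kubectl", "port-forward", "svc/pachd", "30600:600"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)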

Write internal documentation for how to set up a complete development environment for the end-to-end stack

[Syncer] Fix `UnboundLocalError: local variable 'ds_new' referenced before assignment`

Line https://github.com/pachyderm/aml/blob/main/syncer/sync.py#L107 can fail when we don't create a new ds_new object, which raises UnboundLocalError: local variable 'ds_new' referenced before assignment. For example, when the mode is configured to be something other than files or jsonl.

Tasks:

  • Check that mode is a valid value in the set [files, jsonl].
  • Only call ds_new.add_tags when a ds_new object is actually in scope (sketched below).
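
A minimal sketch of the guard, loosely modeled on what sync.py presumably does; the dataset-creation calls and names here are simplified assumptions, not the real code:

from azureml.core import Dataset

VALID_MODES = {"files", "jsonl"}

def build_dataset(datastore, path, mode):
    if mode == "files":
        ds_new = Dataset.File.from_files(path=[(datastore, path)])
    elif mode == "jsonl":
        ds_new = Dataset.Tabular.from_json_lines_files(path=[(datastore, path)])
    else:
        # Fail loudly instead of falling through to an unbound ds_new.
        raise ValueError(f"unsupported mode {mode!r}; expected one of {sorted(VALID_MODES)}")
    return ds_new   # callers only touch ds_new.add_tags(...) when this returns successfully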

Make syncer not recreate dataset versions every time it's restarted

Right now, every time you restart the syncer, it creates a new dataset version for every commit in every Pachyderm repo, even if they've already been created. We need to list the existing dataset versions when we start up and avoid recreating ones that already exist.

It's actually kinda useful that it does this for debugging right now.
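
A rough sketch of the startup check, assuming the syncer tags each registered dataset version with the Pachyderm commit ID it came from (the tag key and helper name are assumptions):

from azureml.core import Dataset

def existing_commit_ids(workspace, dataset_name):
    # Collect Pachyderm commit IDs already registered as versions of this dataset.
    seen = set()
    try:
        latest = Dataset.get_by_name(workspace, name=dataset_name)
    except Exception:
        return seen   # dataset not registered yet
    for version in range(1, latest.version + 1):
        ds = Dataset.get_by_name(workspace, name=dataset_name, version=version)
        commit = (ds.tags or {}).get("pachyderm_commit")   # assumed tag key
        if commit:
            seen.add(commit)
    return seen

# In the sync loop, skip commits that are already registered:
# if commit_id in existing_commit_ids(workspace, repo_name): continue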

Figure out how to log from rust code when executed via datasets SDK

Currently we have to return errors to debug anything, and then run the code from the python debugger, like this:

        return Err(StreamError::InvalidInput {
            message: format!("HELLO FROM PACHYDERM HANDLER! I can haz credential {:#?}", credential.clone()),
            source: None,
        });

and then in _rslex_executor.py manually place import pdb; pdb.set_trace()

Something on the Python side of the SDK is swallowing println! output from the Rust code somehow.

Bundle kubectl binary in the wheel file

Right now rslex-pachyderm depends on having kubectl installed. Instead, bundle it in the python wheel file (or, if that's hard, download it from the internet on-demand).

We won't need this for users using the workaround in #23 but we will need it once Microsoft are bundling and publishing rslex officially (as our installer script which installs kubectl won't run in that case).
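
If we go with the on-demand download fallback, a minimal sketch (the version pin, cache location, and platform are placeholders):

import pathlib
import stat
import urllib.request

KUBECTL_VERSION = "v1.21.0"   # placeholder pin
KUBECTL_URL = f"https://dl.k8s.io/release/{KUBECTL_VERSION}/bin/linux/amd64/kubectl"

def ensure_kubectl(cache_dir="~/.cache/rslex-pachyderm"):
    # Download kubectl into a per-user cache the first time, then reuse it.
    target = pathlib.Path(cache_dir).expanduser() / "kubectl"
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(KUBECTL_URL, target)
        target.chmod(target.stat().st_mode | stat.S_IEXEC)
    return str(target)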

Fix killing the port-forward at the right time

Currently the port-forward subprocess is started once and never killed. There's commented-out code for killing it, but currently it kills it too soon.

I've observed the port-forwards becoming unusable after a while and needing a manual restart (pkill -f port-forward).

However, ideally we could use pidfiles or something to leave the port-forward running between instantiations, so we don't need to make a new port-forward instance for every single request we make to Pachyderm. There must be a good balance. Maybe we can do a port-forward per execution of a lariat script?

We should eliminate the sleep statements as well, as they probably make rslex-pachyderm unusably slow for real workloads. When creating a new port-forward, we should rapidly poll for the port being open with a timeout, rather than just assuming it will be open after N seconds.
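
A minimal sketch of polling instead of sleeping, assuming the forward listens on localhost:30600:

import socket
import time

def wait_for_port(host="127.0.0.1", port=30600, timeout=10.0, interval=0.1):
    # Poll until something accepts connections on host:port, or give up after `timeout` seconds.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return
        except OSError:
            time.sleep(interval)
    raise TimeoutError(f"port-forward to {host}:{port} did not come up within {timeout}s")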

Support migrating data from ADLS Gen2 to Pachyderm

We want to support Azure customers who use ADLS Gen2.

As a first pass, we can use Pachyderm pipelines to ingest data from ADLS Gen2. The pipeline would be a cron pipeline that runs on a cadence, and has authority to access ADLS Gen2.

After this is implemented, we can work on a more sophisticated variant described here #41
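
A rough sketch of what that cron pipeline spec might look like on Pachyderm 1.x, piped into pachctl from Python; the image, command, and interval are placeholders, not a tested spec:

import json
import subprocess

# Hypothetical cron pipeline: on every tick, an ingest script copies new ADLS Gen2 objects into /pfs/out.
spec = {
    "pipeline": {"name": "adls_ingest"},
    "input": {"cron": {"name": "tick", "spec": "@every 1h"}},
    "transform": {
        "image": "example.azurecr.io/adls-ingest:latest",   # placeholder image with access to ADLS credentials
        "cmd": ["python3", "/app/ingest.py"],                # placeholder ingest script
    },
}

# pachctl accepts a pipeline spec on stdin with `-f -`.
subprocess.run(["pachctl", "create", "pipeline", "-f", "-"], input=json.dumps(spec).encode(), check=True)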

CI for aml repo

Maybe only worth doing this once we've demonstrated that someone is willing to pay for the solution.

  • Actually run Terraform against Azure
  • Write a commit to Pachyderm
  • Assert that dataset versions show up in AML
  • Assert that data can be read from them using an appropriately configured AML compute instance

Test if pkill -f port-forward is working

Anecdotally, I saw the port-forward get stuck, and rerunning the test script didn't kill it. Manually killing it and rerunning the test script did work, however.

Documentation for end users (private preview edition)

Suggest we publish this as an mkdocs site on readthedocs perhaps? Or just a markdown file in a public repo? Or integrate into the public Pachyderm docs?

In the future, also consider a Pachyderm marketing site page about the integration.

Deployment uses standard storage class

I see that the current deployment uses a standard storage class instead of premium, which is the recommended type for Pachyderm.

Also note that if you don't choose a larger PV, Azure won't meet the IOPS requirements for Pachyderm, so you need a PV of about 512 GB. See the warning in the Pachyderm docs.
