datarevenue-berlin / openmlops
License: MIT License
When I try to follow the instructions at https://github.com/datarevenue-berlin/OpenMLOps/blob/master/tutorials/set-up-open-source-production-mlops-architecture-aws.md
I get to the step running: kubectl get svc -n ambassador
But I get the following output:
NAME               TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
ambassador         LoadBalancer   172.20.12.231   <pending>     443:30209/TCP       55m
ambassador-admin   ClusterIP      172.20.48.162   <none>        8877/TCP,8005/TCP   55m
Ideally everything should be abstracted away but I know that's impossible.
I assume that things like Prefect, MLflow, Dask, Feast, and Seldon cannot be abstracted away; I will need to learn them from their own tutorials in order to use this repo, which I find acceptable.
But I really don't want to dive into Terraform, Kubernetes, and especially AWS services as it seems there are lots of things to learn and it could be a big rabbit hole. Am I understanding this correctly?
To ask it differently: what baseline knowledge is needed to use this repo (and to fix bugs when they happen)?
I am trying to follow the tutorial Set up your local minikube machine learning architecture for experimenting hands-on with the components of OpenMLOps. Unfortunately, I am not able to follow all the instructions to the end.
Disclaimer: I am new to Terraform and Kubernetes.
To begin with, I tried to start minikube in two ways:
1. minikube start --kubernetes-version=v1.17.17
2. minikube start (without specifying a version)
In case 1, I could not get minikube to start running. Here is the entire output of the command.
In case 2, minikube defaults to v1.22.2 as per the output I get from kubectl version. Here I am able to get minikube running and start a tunnel as well:
Status:
machine: minikube
pid: 10105
route: 10.96.0.0/12 -> 192.168.49.2
minikube: Running
services: [mlflow]
errors:
minikube: no errors
router: no errors
loadbalancer emulator: no errors
All is OK until I get to the step terraform apply -var-file=my_vars.tfvars, which outputs the following at the end (the only three changes that fail to complete):
module.ambassador[0].helm_release.ambassador[0]: Creating...
module.dask-jupyterhub.helm_release.dask-jupyterhub: Creating...
module.prefect-server.helm_release.prefect-server: Creating...
╷
│ Error: failed to install CRD crds/filter.yaml: unable to recognize "": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
│
│ with module.ambassador[0].helm_release.ambassador[0],
│ on modules/ambassador/main.tf line 1, in resource "helm_release" "ambassador":
│ 1: resource "helm_release" "ambassador" {
│
╵
╷
│ Error: failed to install CRD crds/daskclusters.yaml: unable to recognize "": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
│
│ with module.dask-jupyterhub.helm_release.dask-jupyterhub,
│ on modules/dask-jupyterhub/main.tf line 1, in resource "helm_release" "dask-jupyterhub":
│ 1: resource "helm_release" "dask-jupyterhub" {
│
╵
╷
│ Error: unable to build kubernetes objects from release manifest: unable to recognize "": no matches for kind "RoleBinding" in version "rbac.authorization.k8s.io/v1beta1"
│
│ with module.prefect-server.helm_release.prefect-server,
│ on modules/prefect-server/main.tf line 1, in resource "helm_release" "prefect-server":
│ 1: resource "helm_release" "prefect-server" {
│
╵
Why is this happening?
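A likely explanation, given the v1.22.2 default mentioned above: apiextensions.k8s.io/v1beta1 and rbac.authorization.k8s.io/v1beta1 were removed in Kubernetes 1.22, so Helm charts that still ship CRDs and RBAC objects under those versions fail with exactly these "no matches for kind" errors. Downgrading the cluster below 1.22 may help, though a later report in this thread hits similar errors on v1.21.5, so chart versions may also matter. As an illustrative sketch (not part of OpenMLOps), a rendered manifest can be scanned for the removed apiVersions:

```python
# Illustrative sketch (not part of OpenMLOps): scan a rendered manifest for
# apiVersions that Kubernetes 1.22 removed. On a 1.22+ cluster, charts that
# still ship CRDs/RBAC under these versions fail with the
# "no matches for kind ..." errors shown above.
REMOVED_IN_K8S_1_22 = {
    "apiextensions.k8s.io/v1beta1",
    "rbac.authorization.k8s.io/v1beta1",
    "admissionregistration.k8s.io/v1beta1",
}

def find_removed_api_versions(manifest: str) -> list:
    """Return every removed apiVersion referenced in a YAML manifest string."""
    hits = []
    for line in manifest.splitlines():
        stripped = line.strip()
        if stripped.startswith("apiVersion:"):
            version = stripped.split(":", 1)[1].strip()
            if version in REMOVED_IN_K8S_1_22:
                hits.append(version)
    return hits

crd = "apiVersion: apiextensions.k8s.io/v1beta1\nkind: CustomResourceDefinition\n"
print(find_removed_api_versions(crd))  # ['apiextensions.k8s.io/v1beta1']
```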
Hello,
I followed the tutorial https://github.com/datarevenue-berlin/OpenMLOps/blob/master/tutorials/basic-usage-of-jupyter-mlflow-and-prefect.md
But when I get to mlflow.sklearn.log_model(lr, "model"), I get the error NoCredentialsError: Unable to locate credentials.
If I understand correctly, MLflow does not have the AWS credentials to save files to the S3 bucket? I thought the bucket was created by the Terraform script, so shouldn't it already have the credentials? How should I go about fixing this? Thanks
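One plausible cause (an assumption, not something the tutorial confirms): mlflow.sklearn.log_model uploads artifacts from the client side via boto3, so the notebook environment itself needs AWS credentials even though Terraform created the bucket. A minimal sketch, with placeholder values instead of real keys:

```python
import os

# Placeholder values: substitute your own IAM access key pair and the region
# of the MLflow artifact bucket. boto3 (which MLflow uses for S3 artifact
# uploads) reads these standard environment variables automatically.
os.environ["AWS_ACCESS_KEY_ID"] = "YOUR_ACCESS_KEY_ID"
os.environ["AWS_SECRET_ACCESS_KEY"] = "YOUR_SECRET_ACCESS_KEY"
os.environ["AWS_DEFAULT_REGION"] = "eu-west-1"  # adjust to your bucket's region

# Any subsequent mlflow.sklearn.log_model(...) call in the same process
# should now find credentials.
```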
Hello, enjoying the MLOps ecosystem. Great work!
I am following the minikube setup and was able to operate MLflow and JupyterHub, but I could not connect to the GraphQL endpoint from Prefect; the error is: "Couldn't connect to Prefect Server at http://localhost:4200/graphql"
I have made sure that I am using install_locally=True and aws=False, and that the GraphQL site at http://localhost:4200 is up and running.
I have also executed this command to forward the graphql 4200 port:
kubectl port-forward -n prefect svc/prefect-server-apollo 4200
But for some reason it is still not connecting.
Any tips to troubleshoot?
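As a generic diagnostic (a sketch assuming the port-forward above is running in another terminal), the endpoint can be probed with a query that any GraphQL server answers, and the local-to-remote port mapping can be made explicit:

```shell
# Make the port mapping explicit: local 4200 -> service port 4200
kubectl port-forward -n prefect svc/prefect-server-apollo 4200:4200

# In another terminal: __typename is valid against any GraphQL endpoint,
# so a JSON response here confirms Apollo itself is reachable.
curl -s -X POST http://localhost:4200/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ __typename }"}'
```

If curl succeeds but the Prefect client still fails, the client may be configured against a different endpoint than the one being forwarded.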
Good afternoon,
In the prefect configuration step, I get the following error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/tmp/ipykernel_164/2133135275.py in <module>
42 flow_run_id = prefect_client.create_flow_run(flow_id=training_flow_id, run_name=f"run {prefect_project_name}")
43
---> 44 create_prefect_flow()
/tmp/ipykernel_164/2133135275.py in create_prefect_flow()
30 storage = S3(s3_bucket)
31
---> 32 session_token = get_prefect_token()
33 prefect_client = Client(api_server=prefect_url, api_token=session_token)
34 schedule = IntervalSchedule(interval=timedelta(minutes=2))
/tmp/ipykernel_164/2133135275.py in get_prefect_token()
14 r = requests.get(auth_url)
15 jsn = r.json()
---> 16 action_url = jsn["methods"]["methods"]["password"]["config"]["action"]
17 data = {"identifier": username, "password": password}
18 headers = {"Accept": "application/json", "Content-Type": "application/json"}
KeyError: 'methods'
In the response, the key "methods" is not present.
Example response:
{
"id": "bad96217-aac0-4456-8ae7-54467b4c3813",
"type": "api",
"expires_at": "2021-08-23T14:00:44.432803488Z",
"issued_at": "2021-08-23T13:50:44.432803488Z",
"request_url": "http://mlops.mydomain.com/self-service/login/api",
"ui": {
"action": "https://mlops.mydomain.com/.ory/kratos/public/self-service/login?flow=bad96217-aac0-4456-8ae4-54467b4c323e2",
"method": "POST",
"nodes": [
{
"type": "input",
"group": "default",
"attributes": {
"name": "csrf_token",
"type": "hidden",
"value": "",
"required": true,
"disabled": false
},
"messages": null,
"meta": {}
},
{
"type": "input",
"group": "password",
"attributes": {
"name": "password_identifier",
"type": "text",
"value": "",
"required": true,
"disabled": false
},
"messages": null,
"meta": {
"label": {
"id": 1070004,
"text": "ID",
"type": "info"
}
}
},
{
"type": "input",
"group": "password",
"attributes": {
"name": "password",
"type": "password",
"required": true,
"disabled": false
},
"messages": null,
"meta": {
"label": {
"id": 1070001,
"text": "Password",
"type": "info"
}
}
},
{
"type": "input",
"group": "password",
"attributes": {
"name": "method",
"type": "submit",
"value": "password",
"disabled": false
},
"messages": null,
"meta": {
"label": {
"id": 1010001,
"text": "Sign in",
"type": "info",
"context": {}
}
}
}
]
},
"created_at": "2021-08-23T13:50:44.433949Z",
"updated_at": "2021-08-23T13:50:44.433949Z",
"forced": false
}
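In the response above, the login form metadata lives under "ui" rather than "methods", which matches newer Kratos releases. A tolerant lookup (an illustrative sketch, not code from the tutorial) could try both shapes; the sample dict below is trimmed from the response in this issue:

```python
# Trimmed from the example response above.
sample = {
    "id": "bad96217-aac0-4456-8ae7-54467b4c3813",
    "ui": {
        "action": "https://mlops.mydomain.com/.ory/kratos/public/self-service/login?flow=bad96217-aac0-4456-8ae4-54467b4c323e2",
        "method": "POST",
    },
}

def extract_action_url(flow: dict) -> str:
    """Return the login form's action URL from a Kratos self-service flow,
    handling both the newer "ui" schema and the older "methods" schema."""
    if "ui" in flow:
        return flow["ui"]["action"]
    return flow["methods"]["password"]["config"]["action"]

print(extract_action_url(sample))
```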
It would be very usefully for other tools (and users) to download models and embedings that were previously created.
We would be interested in automatically downloading both the spec2vec and MS2Deepscore models and embeddings from the most recent GNPS dataset.
When I try to follow the instructions at https://github.com/datarevenue-berlin/OpenMLOps/blob/master/tutorials/set-up-open-source-production-mlops-architecture-aws.md
I get to the step running: terraform apply -var-file=my_vars.tfvars
But I get: Error creating S3 bucket: BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
│
│ with aws_s3_bucket.mlflow_artifact_root,
│ on main.tf line 11, in resource "aws_s3_bucket" "mlflow_artifact_root":
│ 11: resource "aws_s3_bucket" "mlflow_artifact_root"
If I delete the S3 bucket and rerun the command, I get: Error loading state: S3 bucket does not exist.
So maybe Terraform was using that bucket for its state, but then later tried to create the bucket again?
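When a bucket exists in AWS but is missing from the Terraform state, importing it is usually safer than deleting it. A sketch: the resource address comes from the error above, and the bucket name placeholder must be replaced with the value from my_vars.tfvars:

```shell
# Bring the already-existing bucket under Terraform management instead of
# letting `apply` try to create it again.
terraform import -var-file=my_vars.tfvars aws_s3_bucket.mlflow_artifact_root <your-bucket-name>
```

After a successful import, terraform apply should treat the bucket as existing state rather than a new resource.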
As far as I know, Airflow offers a higher degree of freedom than MLflow, in that many kinds of functions can be used. However, OpenMLOps uses MLflow, so I would like to know which parts of MLflow it uses and what they are used for.
Hi, I was wondering whether there is a plan to support Ray for distributed training. I want an ML platform that supports both deep learning and traditional models, and it looks like Dask does not have very good support for distributed deep learning. Thanks
I am following this tutorial and getting this error at the statement "terraform apply -var-file=my_vars.tfvars":
https://github.com/datarevenue-berlin/OpenMLOps/blob/master/tutorials/set-up-minikube-cluster.md
I am using the following versions:
Terraform v1.0.8
kubectl server version v1.21.5 and client version v1.21.5
minikube version v1.23.2
minikube start --kubernetes-version=v1.21.5 --vm-driver=hyperv
I am attempting to set up the minikube cluster locally on a Windows machine.
When I attempt to run with the provided my_vars.tfvars as shown in the instructions,
aws = false
db_username = "mlflow-db-user"
db_password = "mlflow-db-pasword"
hostname = "myambassador.com"
ory_kratos_cookie_secret = "secret"
ory_kratos_db_password = "password"
install_metrics_server = false
install_feast = false
install_seldon = false
prefect_create_tenant_enabled = false
jhub_proxy_secret_token = "IfYouDecideToUseJhubProxyYouShouldChangeThisValueToARandomString"
enable_ory_authentication = false
oauth2_providers = []
mlflow_artifact_root = "/tmp"
install_locally = true
I get the following errors.
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
kubernetes_service_account.daskhub-sa: Destroying... [id=daskhub/daskhub-sa]
kubernetes_service_account.daskhub-sa: Destruction complete after 0s
kubernetes_service_account.daskhub-sa: Creating...
module.ambassador[0].helm_release.ambassador[0]: Creating...
module.dask-jupyterhub.helm_release.dask-jupyterhub: Creating...
module.dask.helm_release.dask: Creating...
module.postgres.helm_release.postgres: Creating...
module.prefect-server.helm_release.prefect-server: Creating...
kubernetes_service_account.daskhub-sa: Still creating... [10s elapsed]
module.dask.helm_release.dask: Still creating... [10s elapsed]
kubernetes_service_account.daskhub-sa: Still creating... [20s elapsed]
module.dask.helm_release.dask: Still creating... [20s elapsed]
module.dask.helm_release.dask: Still creating... [30s elapsed]
module.dask.helm_release.dask: Still creating... [40s elapsed]
module.dask.helm_release.dask: Still creating... [50s elapsed]
module.dask.helm_release.dask: Still creating... [1m0s elapsed]
module.dask.helm_release.dask: Still creating... [1m10s elapsed]
module.dask.helm_release.dask: Creation complete after 1m19s [id=dask]
╷
│ Error: Waiting for default secret of "daskhub/daskhub-sa" to appear
│
│ with kubernetes_service_account.daskhub-sa,
│ on main.tf line 12, in resource "kubernetes_service_account" "daskhub-sa":
│ 12: resource "kubernetes_service_account" "daskhub-sa" {
│
╵
╷
│ Error: failed to install CRD crds/filter.yaml: unable to recognize "": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
│
│ with module.ambassador[0].helm_release.ambassador[0],
│ on modules\ambassador\main.tf line 1, in resource "helm_release" "ambassador":
│ 1: resource "helm_release" "ambassador" {
│
╵
╷
│ Error: failed to install CRD crds/daskclusters.yaml: unable to recognize "": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
│
│ with module.dask-jupyterhub.helm_release.dask-jupyterhub,
│ on modules\dask-jupyterhub\main.tf line 1, in resource "helm_release" "dask-jupyterhub":
│ 1: resource "helm_release" "dask-jupyterhub" {
│
╵
╷
│ Error: chart "postgresql" version "10.9.1" not found in https://charts.bitnami.com/bitnami repository
│
│ with module.postgres.helm_release.postgres,
│ on modules\postgres\main.tf line 2, in resource "helm_release" "postgres":
│ 2: resource "helm_release" "postgres" {
│
╵
╷
│ Error: unable to build kubernetes objects from release manifest: unable to recognize "": no matches for kind "RoleBinding" in version "rbac.authorization.k8s.io/v1beta1"
│
│ with module.prefect-server.helm_release.prefect-server,
│ on modules\prefect-server\main.tf line 1, in resource "helm_release" "prefect-server":
│ 1: resource "helm_release" "prefect-server" {
│
╵
Any direction would be much appreciated.
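Regarding the postgresql chart error above: one known behaviour (worth verifying) is that the Bitnami repository prunes older chart versions from its default index, so a version pinned in modules/postgres/main.tf can disappear over time. Listing the currently published versions shows what the pin can be updated to:

```shell
# Refresh the Bitnami index and list the PostgreSQL chart versions it
# currently publishes; pin modules/postgres/main.tf to one of these.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm search repo bitnami/postgresql --versions | head
```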
Hello, and thank you for this great resource. I'm running into an error while going through the architecture set-up tutorial for AWS. I'm able to successfully run everything including "terraform plan -var-file=my_vars.tfvars". I get the following error when I run "terraform apply -var-file=my_vars.tfvars". Let me know if there's any other helpful information I could provide:
Hi, the OpenMLOps project is awesome and our research team wants to use it to adopt MLOps practices. We have successfully set up OpenMLOps on AWS. From my understanding, the development (data preprocessing, training models, etc.) and deployment processes will both run on AWS, which allows us to form the big loop (retrieving raw data -> processed data -> trained models -> served models -> monitoring -> problems -> triggers -> retrieving new data -> processed data -> new trained model -> ...). However, during the development process, we use our local servers for EDA, preprocessing data, coding, and conducting experiments.
I have two questions
Thanks!
Just a question.
Where are GPUs for training and serving?
Did I miss something obvious?
Thanks
│ Error: error creating EKS Cluster (eks-mlops): InvalidParameterException: unsupported Kubernetes version
│ {
│ RespMetadata: {
│ StatusCode: 400,
│ RequestID: "21cfc835-7fe3-4c5c-9aa3-ed0e64ff1d56"
│ },
│ ClusterName: "eks-mlops",
│ Message_: "unsupported Kubernetes version"
│ }
│
│ with module.mlops-architecture-eks.module.eks.aws_eks_cluster.this[0],
│ on .terraform\modules\mlops-architecture-eks.eks\cluster.tf line 9, in resource "aws_eks_cluster" "this":
│ 9: resource "aws_eks_cluster" "this" {
│
It would be great to also have the default values of the URLs and domains when setting up a local environment with the default my_vars.tfvars defined in the minikube guide. So, for example:
how these values
aws = false
db_username = "mlflow-db-user"
db_password = "mlflow-db-pasword"
hostname = "myambassador.com"
ory_kratos_cookie_secret = "secret"
ory_kratos_db_password = "password"
install_metrics_server = false
install_feast = false
install_seldon = false
prefect_create_tenant_enabled = false
jhub_proxy_secret_token = "IfYouDecideToUseJhubProxyYouShouldChangeThisValueToARandomString"
enable_ory_authentication = false
oauth2_providers = []
mlflow_artifact_root = "/tmp"
match with
domain = "mlops.ritza-route53.com" # the domain where you are hosting Open MLOps
username = "[email protected]" # the username you used to register on Open MLOps
password = "DkguP5GsB9yiPk8" # the password you used to register on Open MLOps
s3_bucket = "another-mlops-bucket" # the S3 bucket you specified when setting up Open MLOps
prefect_project_name = "wine-quality-project" # you can use what you want here
docker_image = "drtools/prefect:wine-classifier-3" # any docker image that has the required Python dependencies
and also, how to reach the following in the minikube case:
=> https://mlops.example.com/profile/auth/registration
=> https://jupyter.mlops.example.com
That would be great, thanks!
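In the minikube case there is no public DNS, so one common approach (an assumption based on the tfvars above, not an official instruction) is to map the hostname from my_vars.tfvars to the ambassador service's external IP in /etc/hosts; the IP comes from kubectl get svc -n ambassador while minikube tunnel is running:

```
# /etc/hosts (example entries; replace the IP with the ambassador EXTERNAL-IP
# reported by `kubectl get svc -n ambassador`)
192.168.49.2   myambassador.com
192.168.49.2   jupyter.myambassador.com
```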
Hi
I tried the tutorial here: https://github.com/datarevenue-berlin/OpenMLOps/blob/master/tutorials/set-up-minikube-cluster.md
It looks like this service does not exist: http://[EXTERNAL_IP_OF_JUPYTER_HUB_PROXY_PUBLIC]:80
Here are my load balancers:
ambassador ambassador LoadBalancer 10.105.12.93 10.105.12.93 80:32714/TCP,443:32385/TCP 5h20m
mlflow mlflow LoadBalancer 10.96.12.12 10.96.12.12 5000:32132/TCP 3h11m
prefect prefect-server-ui LoadBalancer 10.105.217.102 10.105.217.102 8080:31497/TCP 3h4m
The mlflow and prefect services can be accessed, but the ambassador's EXTERNAL-IP can't be reached.
Thanks
I was following this guide: https://github.com/datarevenue-berlin/OpenMLOps/blob/master/tutorials/set-up-open-source-production-mlops-architecture-aws.md#configuring-the-my_varstfvars-file and have a few questions.
Do I need to set the additional_aws_users field in my_vars.tfvars to an empty list if I have only one user (me)?
If I am the only user, do I need to fill in oauth2_providers, or do I just make it an empty list? If I need to create a new OAuth application, what should I fill in the Authorization callback URL text box?
If I set both attributes above to empty lists and run terraform init, it asks me to enter an S3 bucket name. What should I fill in here? If it's related to the bucket_name field in the my_vars.tfvars file, I've already set it.
Are there any plans to add a tutorial with all the steps to correctly stop the whole architecture? Or a set of scripts to perform this?
This will be helpful for people starting to experiment with OpenMLOps on AWS, or individual practitioners and students who can't afford to have all this architecture running all the time. If they are on a budget, they can run experiments, try the architecture, and then stop everything.
OpenMLOps has no license, according to GitHub[1]:
...without a license, the default copyright laws apply, meaning that you (Data Revenue) retain all rights to your source code and no one may reproduce, distribute, or create derivative works from your work.
So no one external to Data Revenue can use this repo.
In many organisations, notebooks are not preferred for production; conventional IDEs and scripts dominate. How can we accommodate framework-agnostic development environments in this project, e.g. VS Code?
I know there's a ton of conditional nuance to even asking how much the AWS resources will cost, but I'm asking in the spirit of a baseline for people who've wondered the same. Assuming basic weekday usage by a team of 2 data scientists (or perhaps based on DataRevenue's own consumption load) how much would the services provisioned cost monthly?
When trying to set up a minikube cluster with MLflow, MLflow is not able to connect to its dedicated PostgreSQL DB.
I specified the username and password in the variables, but it seems to ignore them. I can't use any user besides the default "postgres"; otherwise it says the role doesn't exist.
Edit: the default password "postgres" sadly doesn't work either.
Here are the logs of the postgres pod:
postgresql 10:01:18.61
postgresql 10:01:18.62 Welcome to the Bitnami postgresql container
postgresql 10:01:18.62 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-postgresql
postgresql 10:01:18.62 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-postgresql/issues
postgresql 10:01:18.63
postgresql 10:01:18.83 INFO ==> ** Starting PostgreSQL setup **
postgresql 10:01:18.85 INFO ==> Validating settings in POSTGRESQL_* env vars..
postgresql 10:01:18.86 INFO ==> Loading custom pre-init scripts...
postgresql 10:01:18.92 INFO ==> Initializing PostgreSQL database...
postgresql 10:01:18.97 INFO ==> pg_hba.conf file not detected. Generating it...
postgresql 10:01:18.97 INFO ==> Generating local authentication configuration
postgresql 10:01:23.24 INFO ==> Starting PostgreSQL in background...
postgresql 10:01:23.39 INFO ==> Changing password of postgres
postgresql 10:01:23.56 INFO ==> Configuring replication parameters
postgresql 10:01:23.67 INFO ==> Configuring fsync
postgresql 10:01:23.85 INFO ==> Loading custom scripts...
postgresql 10:01:23.87 INFO ==> Enabling remote connections
postgresql 10:01:23.92 INFO ==> Stopping PostgreSQL...
waiting for server to shut down.... done
server stopped
postgresql 10:01:24.16 INFO ==> ** PostgreSQL setup finished! **
postgresql 10:01:24.18 INFO ==> ** Starting PostgreSQL **
2023-01-13 10:01:24.236 GMT [1] LOG: pgaudit extension initialized
2023-01-13 10:01:24.237 GMT [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-01-13 10:01:24.237 GMT [1] LOG: listening on IPv6 address "::", port 5432
2023-01-13 10:01:24.258 GMT [1] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-01-13 10:01:24.337 GMT [130] LOG: database system was shut down at 2023-01-13 10:01:24 GMT
2023-01-13 10:01:24.344 GMT [1] LOG: database system is ready to accept connections
2023-01-13 10:01:53.073 GMT [159] FATAL: password authentication failed for user "postgres"
2023-01-13 10:01:53.073 GMT [159] DETAIL: Password does not match for user "postgres".
Connection matched pg_hba.conf line 1: "host all all 0.0.0.0/0 md5"
2023-01-13 10:01:53.254 GMT [160] FATAL: password authentication failed for user "postgres"
2023-01-13 10:01:53.254 GMT [160] DETAIL: Password does not match for user "postgres".
Connection matched pg_hba.conf line 1: "host all all 0.0.0.0/0 md5"
2023-01-13 10:01:53.574 GMT [161] FATAL: password authentication failed for user "postgres"
2023-01-13 10:01:53.574 GMT [161] DETAIL: Password does not match for user "postgres".
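The repeated "password authentication failed" lines above are consistent with a stale persistent volume: the Bitnami chart only sets credentials when it initializes a fresh data directory, so reinstalling with new variables keeps whatever password is stored in the existing PVC. A possible fix (the namespace and resource names below are assumptions; verify them first) is to remove the old volume and let the chart re-initialize:

```shell
# WARNING: this deletes the database's data. Verify the actual PVC and pod
# names with `kubectl get pvc -A` and `kubectl get pods -A` before running.
kubectl delete pvc -n postgres data-postgres-postgresql-0
kubectl delete pod -n postgres postgres-postgresql-0
```

After the pod restarts with a fresh volume, the credentials from the Terraform variables should apply.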
Hello,
I have a problem following the steps in the tutorial. During the first terraform apply, in the creation of the resource module.mlops-architecture-eks.helm_release.autoscaler, I always get the following error:
│ Error: Post "http://localhost/api/v1/namespaces/kube-system/configmaps": dial tcp 127.0.0.1:80: connect: connection refused
│
│ with module.mlops-architecture-eks.module.eks.kubernetes_config_map.aws_auth[0],
│ on .terraform/modules/mlops-architecture-eks.eks/aws_auth.tf line 63, in resource "kubernetes_config_map" "aws_auth":
│ 63: resource "kubernetes_config_map" "aws_auth" {
│
╵
I am not trying to contest this project; I genuinely want to understand the reasoning/philosophy behind building something from scratch when there are much more (relatively) mature alternatives like Kubeflow, which seem to do very similar things.
It would be great if the maintainers of OpenMLOps could compare it with Kubeflow and other alternatives and shine some light on the philosophy of this one.
I reckon it would be helpful for people to make more informed decisions :)
Thanks for sharing your OpenMLOps framework. I hope to find some guidance to configure this framework to make it work on our local Kubernetes cluster. What modifications need to be done for each component?
I'm trying to install OpenMLOps in my AWS account following this tutorial, but I ran into the error shown above and below when running terraform init.
The tutorial doesn't include provisioning the S3 bucket before this step, and presumably it's created by the Terraform scripts. What am I missing?
Is there an Azure version of the setup guide forthcoming? It seems like an interesting architecture, but we're an Azure house, so I'm not sure how many changes would be required to implement this.
Would you be able to explain how to set up oauth2_providers with GitHub? I'm not sure what goes in the URL fields.
I followed the steps and got this error when I run:
terraform apply -var-file=my_vars.tfvars
Error: error creating EKS Cluster (eks-mlops): InvalidParameterException: unsupported Kubernetes version
I don't know what the problem is here or how to debug it.