datarevenue-berlin / openmlops
License: MIT License
When I try to follow the instructions at https://github.com/datarevenue-berlin/OpenMLOps/blob/master/tutorials/set-up-open-source-production-mlops-architecture-aws.md
I get to the step running: kubectl get svc -n ambassador
But I get the following output:
NAME               TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
ambassador         LoadBalancer   172.20.12.231   <pending>     443:30209/TCP       55m
ambassador-admin   ClusterIP      172.20.48.162   <none>        8877/TCP,8005/TCP   55m
Ideally everything should be abstracted away but I know that's impossible.
I assume that things like Prefect, MLflow, Dask, Feast, and Seldon cannot be abstracted away; I will need to learn them from their own tutorials in order to use this repo, which I find acceptable.
But I really don't want to dive into Terraform, Kubernetes, and especially AWS services as it seems there are lots of things to learn and it could be a big rabbit hole. Am I understanding this correctly?
To ask it differently: what baseline knowledge is needed to use this repo (and to fix bugs when they happen)?
I am trying to follow the tutorial Set up your local minikube machine learning architecture for experimenting hands-on with the components of OpenMLOps. Unfortunately, I am not able to follow all the instructions to the end.
Disclaimer: I am new to Terraform and Kubernetes.
To begin with, I tried to start minikube in two ways:
1. minikube start --kubernetes-version=v1.17.17
2. minikube start (without specifying a version)
In case 1, I could not get minikube to start running. Here is the entire output of the command.
In case 2, minikube defaults to v1.22.2 as per the output I get from kubectl version. Here I am able to get minikube running and start a tunnel as well:
Status:
machine: minikube
pid: 10105
route: 10.96.0.0/12 -> 192.168.49.2
minikube: Running
services: [mlflow]
errors:
minikube: no errors
router: no errors
loadbalancer emulator: no errors
All is OK until I get to the step terraform apply -var-file=my_vars.tfvars, which outputs the following at the end (the only three changes that fail to complete):
module.ambassador[0].helm_release.ambassador[0]: Creating...
module.dask-jupyterhub.helm_release.dask-jupyterhub: Creating...
module.prefect-server.helm_release.prefect-server: Creating...
╷
│ Error: failed to install CRD crds/filter.yaml: unable to recognize "": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
│
│ with module.ambassador[0].helm_release.ambassador[0],
│ on modules/ambassador/main.tf line 1, in resource "helm_release" "ambassador":
│ 1: resource "helm_release" "ambassador" {
│
╵
╷
│ Error: failed to install CRD crds/daskclusters.yaml: unable to recognize "": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
│
│ with module.dask-jupyterhub.helm_release.dask-jupyterhub,
│ on modules/dask-jupyterhub/main.tf line 1, in resource "helm_release" "dask-jupyterhub":
│ 1: resource "helm_release" "dask-jupyterhub" {
│
╵
╷
│ Error: unable to build kubernetes objects from release manifest: unable to recognize "": no matches for kind "RoleBinding" in version "rbac.authorization.k8s.io/v1beta1"
│
│ with module.prefect-server.helm_release.prefect-server,
│ on modules/prefect-server/main.tf line 1, in resource "helm_release" "prefect-server":
│ 1: resource "helm_release" "prefect-server" {
│
╵
Why is this happening?
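A likely explanation, given the v1.22.2 default mentioned above: apiextensions.k8s.io/v1beta1 and rbac.authorization.k8s.io/v1beta1 were removed in Kubernetes 1.22, so Helm charts that still ship CRDs and RBAC objects under those versions fail with exactly these "no matches for kind" errors. Downgrading the cluster below 1.22 may help, though a later report in this thread hits similar errors on v1.21.5, so chart versions may also matter. As an illustrative sketch (not part of OpenMLOps), a rendered manifest can be scanned for the removed apiVersions:

```python
# Illustrative sketch (not part of OpenMLOps): scan a rendered manifest for
# apiVersions that Kubernetes 1.22 removed. On a 1.22+ cluster, charts that
# still ship CRDs/RBAC under these versions fail with the
# "no matches for kind ..." errors shown above.
REMOVED_IN_K8S_1_22 = {
    "apiextensions.k8s.io/v1beta1",
    "rbac.authorization.k8s.io/v1beta1",
    "admissionregistration.k8s.io/v1beta1",
}

def find_removed_api_versions(manifest: str) -> list:
    """Return every removed apiVersion referenced in a YAML manifest string."""
    hits = []
    for line in manifest.splitlines():
        stripped = line.strip()
        if stripped.startswith("apiVersion:"):
            version = stripped.split(":", 1)[1].strip()
            if version in REMOVED_IN_K8S_1_22:
                hits.append(version)
    return hits

crd = "apiVersion: apiextensions.k8s.io/v1beta1\nkind: CustomResourceDefinition\n"
print(find_removed_api_versions(crd))  # ['apiextensions.k8s.io/v1beta1']
```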
Hello,
I followed the tutorial https://github.com/datarevenue-berlin/OpenMLOps/blob/master/tutorials/basic-usage-of-jupyter-mlflow-and-prefect.md
But when I get to mlflow.sklearn.log_model(lr, "model"), I get the error NoCredentialsError: Unable to locate credentials.
If I understand correctly, MLflow does not have the AWS credentials to save files to the S3 bucket? I thought the bucket was created by the Terraform script, so shouldn't it already have the credentials? How should I go about fixing this? Thanks
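One plausible cause (an assumption, not something the tutorial confirms): mlflow.sklearn.log_model uploads artifacts from the client side via boto3, so the notebook environment itself needs AWS credentials even though Terraform created the bucket. A minimal sketch, with placeholder values instead of real keys:

```python
import os

# Placeholder values: substitute your own IAM access key pair and the region
# of the MLflow artifact bucket. boto3 (which MLflow uses for S3 artifact
# uploads) reads these standard environment variables automatically.
os.environ["AWS_ACCESS_KEY_ID"] = "YOUR_ACCESS_KEY_ID"
os.environ["AWS_SECRET_ACCESS_KEY"] = "YOUR_SECRET_ACCESS_KEY"
os.environ["AWS_DEFAULT_REGION"] = "eu-west-1"  # adjust to your bucket's region

# Any subsequent mlflow.sklearn.log_model(...) call in the same process
# should now find credentials.
```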
Hello, enjoying the MLOps ecosystem. Great work!
I am following the minikube setup and was able to operate MLflow and JupyterHub, but I could not connect to the GraphQL endpoint from Prefect; the error is: "Couldn't connect to Prefect Server at http://localhost:4200/graphql"
I have made sure that I am using install_locally=True and aws=False, and that the GraphQL site at http://localhost:4200 is up and running.
I have also executed this command to forward the graphql 4200 port:
kubectl port-forward -n prefect svc/prefect-server-apollo 4200
But for some reason it is still not connecting.
Any tips to troubleshoot?
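As a generic diagnostic (a sketch assuming the port-forward above is running in another terminal), the endpoint can be probed with a query that any GraphQL server answers, and the local-to-remote port mapping can be made explicit:

```shell
# Make the port mapping explicit: local 4200 -> service port 4200
kubectl port-forward -n prefect svc/prefect-server-apollo 4200:4200

# In another terminal: __typename is valid against any GraphQL endpoint,
# so a JSON response here confirms Apollo itself is reachable.
curl -s -X POST http://localhost:4200/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ __typename }"}'
```

If curl succeeds but the Prefect client still fails, the client may be configured against a different endpoint than the one being forwarded.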
Good afternoon,
In the prefect configuration step, I get the following error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/tmp/ipykernel_164/2133135275.py in <module>
42 flow_run_id = prefect_client.create_flow_run(flow_id=training_flow_id, run_name=f"run {prefect_project_name}")
43
---> 44 create_prefect_flow()
/tmp/ipykernel_164/2133135275.py in create_prefect_flow()
30 storage = S3(s3_bucket)
31
---> 32 session_token = get_prefect_token()
33 prefect_client = Client(api_server=prefect_url, api_token=session_token)
34 schedule = IntervalSchedule(interval=timedelta(minutes=2))
/tmp/ipykernel_164/2133135275.py in get_prefect_token()
14 r = requests.get(auth_url)
15 jsn = r.json()
---> 16 action_url = jsn["methods"]["methods"]["password"]["config"]["action"]
17 data = {"identifier": username, "password": password}
18 headers = {"Accept": "application/json", "Content-Type": "application/json"}
KeyError: 'methods'
In the response, the key "methods" is not present.
Example response:
{
"id": "bad96217-aac0-4456-8ae7-54467b4c3813",
"type": "api",
"expires_at": "2021-08-23T14:00:44.432803488Z",
"issued_at": "2021-08-23T13:50:44.432803488Z",
"request_url": "http://mlops.mydomain.com/self-service/login/api",
"ui": {
"action": "https://mlops.mydomain.com/.ory/kratos/public/self-service/login?flow=bad96217-aac0-4456-8ae4-54467b4c323e2",
"method": "POST",
"nodes": [
{
"type": "input",
"group": "default",
"attributes": {
"name": "csrf_token",
"type": "hidden",
"value": "",
"required": true,
"disabled": false
},
"messages": null,
"meta": {}
},
{
"type": "input",
"group": "password",
"attributes": {
"name": "password_identifier",
"type": "text",
"value": "",
"required": true,
"disabled": false
},
"messages": null,
"meta": {
"label": {
"id": 1070004,
"text": "ID",
"type": "info"
}
}
},
{
"type": "input",
"group": "password",
"attributes": {
"name": "password",
"type": "password",
"required": true,
"disabled": false
},
"messages": null,
"meta": {
"label": {
"id": 1070001,
"text": "Password",
"type": "info"
}
}
},
{
"type": "input",
"group": "password",
"attributes": {
"name": "method",
"type": "submit",
"value": "password",
"disabled": false
},
"messages": null,
"meta": {
"label": {
"id": 1010001,
"text": "Sign in",
"type": "info",
"context": {}
}
}
}
]
},
"created_at": "2021-08-23T13:50:44.433949Z",
"updated_at": "2021-08-23T13:50:44.433949Z",
"forced": false
}
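In the response above, the login form metadata lives under "ui" rather than "methods", which matches newer Kratos releases. A tolerant lookup (an illustrative sketch, not code from the tutorial) could try both shapes; the sample dict below is trimmed from the response in this issue:

```python
# Trimmed from the example response above.
sample = {
    "id": "bad96217-aac0-4456-8ae7-54467b4c3813",
    "ui": {
        "action": "https://mlops.mydomain.com/.ory/kratos/public/self-service/login?flow=bad96217-aac0-4456-8ae4-54467b4c323e2",
        "method": "POST",
    },
}

def extract_action_url(flow: dict) -> str:
    """Return the login form's action URL from a Kratos self-service flow,
    handling both the newer "ui" schema and the older "methods" schema."""
    if "ui" in flow:
        return flow["ui"]["action"]
    return flow["methods"]["password"]["config"]["action"]

print(extract_action_url(sample))
```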
It would be very usefully for other tools (and users) to download models and embedings that were previously created.
We would be interested in automatically downloading both the spec2vec and MS2Deepscore models and embeddings from the most recent GNPS dataset.
When I try to follow the instructions at https://github.com/datarevenue-berlin/OpenMLOps/blob/master/tutorials/set-up-open-source-production-mlops-architecture-aws.md
I get to the step running: terraform apply -var-file=my_vars.tfvars
But I get: Error creating S3 bucket: BucketAlreadyOwnedByYou: Your previous request to create the named bucket succeeded and you already own it.
│
│ with aws_s3_bucket.mlflow_artifact_root,
│ on main.tf line 11, in resource "aws_s3_bucket" "mlflow_artifact_root":
│ 11: resource "aws_s3_bucket" "mlflow_artifact_root"
If I delete the S3 bucket and rerun the command, I get: Error loading state: S3 bucket does not exist.
So maybe Terraform was using that bucket for its state, but then later tried to create the bucket again?
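When a bucket exists in AWS but is missing from the Terraform state, importing it is usually safer than deleting it. A sketch: the resource address comes from the error above, and the bucket name placeholder must be replaced with the value from my_vars.tfvars:

```shell
# Bring the already-existing bucket under Terraform management instead of
# letting `apply` try to create it again.
terraform import -var-file=my_vars.tfvars aws_s3_bucket.mlflow_artifact_root <your-bucket-name>
```

After a successful import, terraform apply should treat the bucket as existing state rather than a new resource.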
As far as I know, Airflow offers a higher degree of freedom than MLflow, in that many kinds of functions can be used. However, OpenMLOps uses MLflow, so I would like to know which parts of MLflow it uses and what they are used for.
Hi, I was wondering whether there is a plan to support Ray for distributed training. I want an ML platform that supports both deep learning and traditional models, and it looks like Dask does not have very good support for distributed deep learning. Thanks
I am following this tutorial and getting this error at the statement "terraform apply -var-file=my_vars.tfvars":
https://github.com/datarevenue-berlin/OpenMLOps/blob/master/tutorials/set-up-minikube-cluster.md
I am using the following versions:
Terraform v1.0.8
kubectl server version v1.21.5 and client version v1.21.5
minikube version v1.23.2
minikube start --kubernetes-version=v1.21.5 --vm-driver=hyperv
I am attempting to set up the minikube cluster locally on a Windows machine.
When I attempt to run with the provided my_vars.tfvars as shown in the instructions,
aws = false
db_username = "mlflow-db-user"
db_password = "mlflow-db-pasword"
hostname = "myambassador.com"
ory_kratos_cookie_secret = "secret"
ory_kratos_db_password = "password"
install_metrics_server = false
install_feast = false
install_seldon = false
prefect_create_tenant_enabled = false
jhub_proxy_secret_token = "IfYouDecideToUseJhubProxyYouShouldChangeThisValueToARandomString"
enable_ory_authentication = false
oauth2_providers = []
mlflow_artifact_root = "/tmp"
install_locally = true
I get the following errors.
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
kubernetes_service_account.daskhub-sa: Destroying... [id=daskhub/daskhub-sa]
kubernetes_service_account.daskhub-sa: Destruction complete after 0s
kubernetes_service_account.daskhub-sa: Creating...
module.ambassador[0].helm_release.ambassador[0]: Creating...
module.dask-jupyterhub.helm_release.dask-jupyterhub: Creating...
module.dask.helm_release.dask: Creating...
module.postgres.helm_release.postgres: Creating...
module.prefect-server.helm_release.prefect-server: Creating...
kubernetes_service_account.daskhub-sa: Still creating... [10s elapsed]
module.dask.helm_release.dask: Still creating... [10s elapsed]
kubernetes_service_account.daskhub-sa: Still creating... [20s elapsed]
module.dask.helm_release.dask: Still creating... [20s elapsed]
module.dask.helm_release.dask: Still creating... [30s elapsed]
module.dask.helm_release.dask: Still creating... [40s elapsed]
module.dask.helm_release.dask: Still creating... [50s elapsed]
module.dask.helm_release.dask: Still creating... [1m0s elapsed]
module.dask.helm_release.dask: Still creating... [1m10s elapsed]
module.dask.helm_release.dask: Creation complete after 1m19s [id=dask]
╷
│ Error: Waiting for default secret of "daskhub/daskhub-sa" to appear
│
│ with kubernetes_service_account.daskhub-sa,
│ on main.tf line 12, in resource "kubernetes_service_account" "daskhub-sa":
│ 12: resource "kubernetes_service_account" "daskhub-sa" {
│
╵
╷
│ Error: failed to install CRD crds/filter.yaml: unable to recognize "": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
│
│ with module.ambassador[0].helm_release.ambassador[0],
│ on modules\ambassador\main.tf line 1, in resource "helm_release" "ambassador":
│ 1: resource "helm_release" "ambassador" {
│
╵
╷
│ Error: failed to install CRD crds/daskclusters.yaml: unable to recognize "": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
│
│ with module.dask-jupyterhub.helm_release.dask-jupyterhub,
│ on modules\dask-jupyterhub\main.tf line 1, in resource "helm_release" "dask-jupyterhub":
│ 1: resource "helm_release" "dask-jupyterhub" {
│
╵
╷
│ Error: chart "postgresql" version "10.9.1" not found in https://charts.bitnami.com/bitnami repository
│
│ with module.postgres.helm_release.postgres,
│ on modules\postgres\main.tf line 2, in resource "helm_release" "postgres":
│ 2: resource "helm_release" "postgres" {
│
╵
╷
│ Error: unable to build kubernetes objects from release manifest: unable to recognize "": no matches for kind "RoleBinding" in version "rbac.authorization.k8s.io/v1beta1"
│
│ with module.prefect-server.helm_release.prefect-server,
│ on modules\prefect-server\main.tf line 1, in resource "helm_release" "prefect-server":
│ 1: resource "helm_release" "prefect-server" {
│
╵
Any direction would be much appreciated.
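Regarding the postgresql chart error above: one known behaviour (worth verifying) is that the Bitnami repository prunes older chart versions from its default index, so a version pinned in modules/postgres/main.tf can disappear over time. Listing the currently published versions shows what the pin can be updated to:

```shell
# Refresh the Bitnami index and list the PostgreSQL chart versions it
# currently publishes; pin modules/postgres/main.tf to one of these.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm search repo bitnami/postgresql --versions | head
```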
Hello, and thank you for this great resource. I'm running into an error while going through the architecture set-up tutorial for AWS. I'm able to successfully run everything including "terraform plan -var-file=my_vars.tfvars". I get the following error when I run "terraform apply -var-file=my_vars.tfvars". Let me know if there's any other helpful information I could provide:
Hi, the OpenMLOps project is awesome and our research team wants to use it to adopt MLOps practices. We have successfully set up OpenMLOps on AWS. From my understanding, the development (data preprocessing, training models, etc.) and deployment processes will both run on AWS, which allows us to form the big loop (retrieving raw data -> processed data -> trained models -> served models -> monitoring -> problems -> triggers -> retrieving new data -> processed data -> new trained model -> ...). However, during the development process, we use our local servers for EDA, preprocessing data, coding, and conducting experiments.
I have two questions
Thanks!
Just a question.
Where are GPUs for training and serving?
Did I miss something obvious?
Thanks
│ Error: error creating EKS Cluster (eks-mlops): InvalidParameterException: unsupported Kubernetes version
│ {
│ RespMetadata: {
│ StatusCode: 400,
│ RequestID: "21cfc835-7fe3-4c5c-9aa3-ed0e64ff1d56"
│ },
│ ClusterName: "eks-mlops",
│ Message_: "unsupported Kubernetes version"
│ }
│
│ with module.mlops-architecture-eks.module.eks.aws_eks_cluster.this[0],
│ on .terraform\modules\mlops-architecture-eks.eks\cluster.tf line 9, in resource "aws_eks_cluster" "this":
│ 9: resource "aws_eks_cluster" "this" {
│
It would be great to also have the default values of the URLs and domains when setting up a local environment with the default my_vars.tfvars defined in the minikube guide. So, for example:
how these values
aws = false
db_username = "mlflow-db-user"
db_password = "mlflow-db-pasword"
hostname = "myambassador.com"
ory_kratos_cookie_secret = "secret"
ory_kratos_db_password = "password"
install_metrics_server = false
install_feast = false
install_seldon = false
prefect_create_tenant_enabled = false
jhub_proxy_secret_token = "IfYouDecideToUseJhubProxyYouShouldChangeThisValueToARandomString"
enable_ory_authentication = false
oauth2_providers = []
mlflow_artifact_root = "/tmp"
match with
domain = "mlops.ritza-route53.com" # the domain where you are hosting Open MLOps
username = "[email protected]" # the username you used to register on Open MLOps
password = "DkguP5GsB9yiPk8" # the password you used to register on Open MLOps
s3_bucket = "another-mlops-bucket" # the S3 bucket you specified when setting up Open MLOps
prefect_project_name = "wine-quality-project" # you can use what you want here
docker_image = "drtools/prefect:wine-classifier-3" # any docker image that has the required Python dependencies
and also, how to reach the following in the minikube case:
=> https://mlops.example.com/profile/auth/registration
=> https://jupyter.mlops.example.com
That would be great, thanks!
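In the minikube case there is no public DNS, so one common approach (an assumption based on the tfvars above, not an official instruction) is to map the hostname from my_vars.tfvars to the ambassador service's external IP in /etc/hosts; the IP comes from kubectl get svc -n ambassador while minikube tunnel is running:

```
# /etc/hosts (example entries; replace the IP with the ambassador EXTERNAL-IP
# reported by `kubectl get svc -n ambassador`)
192.168.49.2   myambassador.com
192.168.49.2   jupyter.myambassador.com
```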
Hi
I tried the tutorial here: https://github.com/datarevenue-berlin/OpenMLOps/blob/master/tutorials/set-up-minikube-cluster.md
It looks like this service does not exist: http://[EXTERNAL_IP_OF_JUPYTER_HUB_PROXY_PUBLIC]:80
Here are my load balancers:
ambassador ambassador LoadBalancer 10.105.12.93 10.105.12.93 80:32714/TCP,443:32385/TCP 5h20m
mlflow mlflow LoadBalancer 10.96.12.12 10.96.12.12 5000:32132/TCP 3h11m
prefect prefect-server-ui LoadBalancer 10.105.217.102 10.105.217.102 8080:31497/TCP 3h4m
The mlflow and prefect services can be accessed, but the ambassador's EXTERNAL-IP can't be reached.
Thanks
I was following this guide: https://github.com/datarevenue-berlin/OpenMLOps/blob/master/tutorials/set-up-open-source-production-mlops-architecture-aws.md#configuring-the-my_varstfvars-file and have a few questions.
Do I need to set the additional_aws_users field in my_vars.tfvars to an empty list if I have only one user (me)?
If I am the only user, do I need to fill in oauth2_providers, or do I just make it an empty list? If I need to create a new OAuth application, what should I fill in the Authorization callback URL text box?
If I set both attributes above to empty lists and run terraform init, it asks me to enter an S3 bucket name. What should I fill in here? If it's related to the bucket_name field in the my_vars.tfvars file, I've already set it.
Are there any plans to add a tutorial with all the steps to correctly stop the whole architecture? Or a set of scripts to perform this?
This will be helpful for people starting to experiment with OpenMLOps on AWS, or individual practitioners and students who can't afford to have all this architecture running all the time. If they are on a budget, they can run experiments, try the architecture, and then stop everything.
OpenMLOps has no license, according to GitHub[1]:
...without a license, the default copyright laws apply, meaning that you (Data Revenue) retain all rights to your source code and no one may reproduce, distribute, or create derivative works from your work.
So no one external to Data Revenue can use this repo.
In many organisations, notebooks are not preferred for production; conventional IDEs and scripts dominate. How can we accommodate framework-agnostic development environments in this project, e.g. VS Code?
I know there's a ton of conditional nuance to even asking how much the AWS resources will cost, but I'm asking in the spirit of a baseline for people who've wondered the same. Assuming basic weekday usage by a team of 2 data scientists (or perhaps based on DataRevenue's own consumption load) how much would the services provisioned cost monthly?
When trying to set up a minikube cluster with MLflow, MLflow is not able to connect to its dedicated PostgreSQL DB.
I specified the username and password in the variables, but it seems to ignore them. I can't use any user besides the default "postgres"; otherwise it says the role doesn't exist.
Edit: the default password "postgres" sadly doesn't work either.
Here are the logs of the postgres pod:
postgresql 10:01:18.61
postgresql 10:01:18.62 Welcome to the Bitnami postgresql container
postgresql 10:01:18.62 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-postgresql
postgresql 10:01:18.62 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-postgresql/issues
postgresql 10:01:18.63
postgresql 10:01:18.83 INFO ==> ** Starting PostgreSQL setup **
postgresql 10:01:18.85 INFO ==> Validating settings in POSTGRESQL_* env vars..
postgresql 10:01:18.86 INFO ==> Loading custom pre-init scripts...
postgresql 10:01:18.92 INFO ==> Initializing PostgreSQL database...
postgresql 10:01:18.97 INFO ==> pg_hba.conf file not detected. Generating it...
postgresql 10:01:18.97 INFO ==> Generating local authentication configuration
postgresql 10:01:23.24 INFO ==> Starting PostgreSQL in background...
postgresql 10:01:23.39 INFO ==> Changing password of postgres
postgresql 10:01:23.56 INFO ==> Configuring replication parameters
postgresql 10:01:23.67 INFO ==> Configuring fsync
postgresql 10:01:23.85 INFO ==> Loading custom scripts...
postgresql 10:01:23.87 INFO ==> Enabling remote connections
postgresql 10:01:23.92 INFO ==> Stopping PostgreSQL...
waiting for server to shut down.... done
server stopped
postgresql 10:01:24.16 INFO ==> ** PostgreSQL setup finished! **
postgresql 10:01:24.18 INFO ==> ** Starting PostgreSQL **
2023-01-13 10:01:24.236 GMT [1] LOG: pgaudit extension initialized
2023-01-13 10:01:24.237 GMT [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-01-13 10:01:24.237 GMT [1] LOG: listening on IPv6 address "::", port 5432
2023-01-13 10:01:24.258 GMT [1] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-01-13 10:01:24.337 GMT [130] LOG: database system was shut down at 2023-01-13 10:01:24 GMT
2023-01-13 10:01:24.344 GMT [1] LOG: database system is ready to accept connections
2023-01-13 10:01:53.073 GMT [159] FATAL: password authentication failed for user "postgres"
2023-01-13 10:01:53.073 GMT [159] DETAIL: Password does not match for user "postgres".
Connection matched pg_hba.conf line 1: "host all all 0.0.0.0/0 md5"
2023-01-13 10:01:53.254 GMT [160] FATAL: password authentication failed for user "postgres"
2023-01-13 10:01:53.254 GMT [160] DETAIL: Password does not match for user "postgres".
Connection matched pg_hba.conf line 1: "host all all 0.0.0.0/0 md5"
2023-01-13 10:01:53.574 GMT [161] FATAL: password authentication failed for user "postgres"
2023-01-13 10:01:53.574 GMT [161] DETAIL: Password does not match for user "postgres".
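The repeated "password authentication failed" lines above are consistent with a stale persistent volume: the Bitnami chart only sets credentials when it initializes a fresh data directory, so reinstalling with new variables keeps whatever password is stored in the existing PVC. A possible fix (the namespace and resource names below are assumptions; verify them first) is to remove the old volume and let the chart re-initialize:

```shell
# WARNING: this deletes the database's data. Verify the actual PVC and pod
# names with `kubectl get pvc -A` and `kubectl get pods -A` before running.
kubectl delete pvc -n postgres data-postgres-postgresql-0
kubectl delete pod -n postgres postgres-postgresql-0
```

After the pod restarts with a fresh volume, the credentials from the Terraform variables should apply.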
Hello,
I have a problem following the steps in the tutorial. During the first terraform apply, in the creation of the resource module.mlops-architecture-eks.helm_release.autoscaler, I always get the following error:
│ Error: Post "http://localhost/api/v1/namespaces/kube-system/configmaps": dial tcp 127.0.0.1:80: connect: connection refused
│
│ with module.mlops-architecture-eks.module.eks.kubernetes_config_map.aws_auth[0],
│ on .terraform/modules/mlops-architecture-eks.eks/aws_auth.tf line 63, in resource "kubernetes_config_map" "aws_auth":
│ 63: resource "kubernetes_config_map" "aws_auth" {
│
╵
I am not trying to contest this project; I genuinely want to understand the reasoning/philosophy behind building something from scratch when there are much more (relatively) mature alternatives like Kubeflow, which seem to do very similar things.
It would be great if the maintainers of OpenMLOps could compare it with Kubeflow and other alternatives and shine some light on the philosophy of this one.
I reckon it would be helpful for people to make more informed decisions :)
Thanks for sharing your OpenMLOps framework. I hope to find some guidance to configure this framework to make it work on our local Kubernetes cluster. What modifications need to be done for each component?
I'm trying to install OpenMLOps in my AWS account following this tutorial, but I ran into the error shown above and below when running terraform init.
The tutorial doesn't include provisioning the S3 bucket before this step, and presumably it's created by the Terraform scripts. What am I missing?
Is there an Azure version of the setup guide forthcoming? It seems like an interesting architecture, but we're an Azure house, so I'm not sure how many changes would be required to implement this.
Would you be able to explain how to set up oauth2_providers with GitHub? I'm not sure what goes in the URL fields.
I followed the steps and got this error when I run:
terraform apply -var-file=my_vars.tfvars
Error: error creating EKS Cluster (eks-mlops): InvalidParameterException: unsupported Kubernetes version
I don't know what the problem is here or how to debug it.