
Full-stack Highly Scalable Cloud-native Machine Learning system for demand forecasting with realtime data streaming, inference, retraining loop, and more

License: MIT License

airflow cloud-native data-stream demand-forecasting docker fastapi full-stack google-cloud helm kafka


Sales Forecast MLOps at Scale

▶️ Highly scalable Cloud-native Machine Learning system ◀️


Overview

"Sales Forecast MLOps at Scale" delivers a full-stack, production-ready solution designed to streamline the entire sales forecasting system – from development and deployment to continuous improvement. It offers flexible deployment options, supporting both on-premises environments (Docker Compose, Kubernetes) and cloud-based setups (Kubernetes, Helm), ensuring adaptability to your infrastructure.

Demo on YouTube: https://youtu.be/PwV8TIsMEME

Key Features

  • Dual-Mode Inference: Supports both batch and online inference modes, providing adaptability to various use cases and real-time prediction needs.
  • Automated Forecast Generation: Airflow DAGs orchestrate weekly model training and batch predictions, with the ability for on-demand retraining based on the latest data.
  • Data-Driven Adaptability: Kafka handles real-time data streaming, enabling the system to incorporate the latest sales information into predictions. Models are retrained on demand to maintain accuracy.
  • Scalable Pipeline and Training: Leverages Spark and Ray for efficient data processing and distributed model training, ensuring the system can handle large-scale datasets and training.
  • Transparent Monitoring: Ray and Grafana provide visibility into training performance, while Prometheus enables system-wide monitoring.
  • User-Friendly Interface: Streamlit offers a clear view of predictions. MLflow tracks experiments and model versions, ensuring reproducibility and streamlined updates.
  • Best-Practices Serving: Robust serving stack with Nginx, Gunicorn, and FastAPI for reliable and performant model deployment (a minimal endpoint sketch follows this list).
  • CI/CD Automation: GitHub Actions streamline the build and deployment process, automatically pushing images to Docker Hub and GCP.
  • Cloud-Native Scalability and Flexibility: Kubernetes and Google Cloud Platform ensure adaptability to growing data and workloads. The open-source foundation (Docker, Ray, FastAPI, etc.) offers customization and extensibility.
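
To make the serving half of this concrete, here is a minimal FastAPI sketch of what an online-inference endpoint behind Nginx and Gunicorn could look like. The endpoint path, payload fields, and stub response are hypothetical illustrations, not the repo's actual API; the real forecast service serves predictions from the latest registered MLflow model.

```python
# Hypothetical sketch of an online-inference endpoint (not the repo's code).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ForecastRequest(BaseModel):
    store_id: int          # hypothetical field names
    horizon_days: int = 7


@app.post("/forecast")
def forecast(req: ForecastRequest) -> dict:
    # The real service would load the latest registered model from MLflow and
    # return its predictions; this stub just echoes the request shape.
    return {"store_id": req.store_id, "forecast": [0.0] * req.horizon_days}
```

In the actual stack, such an app would run under Gunicorn with Uvicorn workers (e.g. gunicorn -k uvicorn.workers.UvicornWorker main:app) behind the Nginx reverse proxy.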

Tools / Technologies

Note: Most of the service ports can be found and customized in the .env file at the root of this repository (or values.yaml and sfmlops-helm/templates/global-configmap.yaml for Kubernetes and Helm).

Development environment

  1. Docker (ref: Docker version 24.0.6, build ed223bc)
  2. Kubernetes (ref: v1.27.2 (via Docker Desktop))
  3. Helm (ref: v3.14.3)

How things work

  1. After you start up the system, the data producer reads the last 5 months of data from services/data-producer/datasets/rossman-store-sales/train_exclude_last_10d.csv and stores it in Postgres, shifting the last date in the data to YESTERDAY. Afterward, it keeps publishing new messages (from train_only_last_19d.csv in the same directory), effectively TODAY's data, to a Kafka topic every 10 seconds in an infinite loop.
  2. There are two main DAGs in Airflow:
    1. Daily DAG:
      >> Ingest data from this Kafka topic
      >> Process and transform with Spark Streaming
      >> Store it in Postgres
    2. Weekly DAG:
      >> Pull the last four months of sales data from Postgres
      >> Use it to train new Prophet models with Ray (1,115 models in total), which are tracked and registered by MLflow (see the training sketch after this list)
      >> Use these newly trained models to forecast sales for the upcoming week (next 7 days)
      >> Store the forecasts in Postgres (another table)
  3. During training, you can monitor your system and infrastructure with Grafana and Prometheus.
  4. By default, the data stream from the sale_rossman_store topic is stored in the rossman_sales table and the forecast results in the forecast_results table; you can use pgAdmin to access them.
  5. After the previous steps have executed successfully, users can access the Streamlit website, which is proxied by Nginx.
  6. This website fetches the latest 7 predictions (technically, the next 7 days) for each store and each product and displays them in a good-looking line chart (thanks to Altair).
  7. From the website, users can view the sales forecast for any product from any store. Notice that the subtitle of the chart contains the model ID and version.
  8. Since these forecasts are made weekly, users see the same chart whether they access the website on Monday or Wednesday. If, during the week, a forecast seems out of sync or outdated, users can trigger retraining for the specific model of that product and store.
  9. When users click the retrain button, the website submits a model training job to the training service, which then calls Ray to retrain that model. Retraining is fast, usually done in under a minute, and it follows the same training strategy as the weekly training DAG (but, of course, with the newest data available).
  10. Right after retraining is done, users can select the number of future days to predict and click the forecast button to have the forecast service make predictions with the latest model.
  11. The new forecast results are then displayed in the line chart below. Notice that the model version number has increased! Yoohoo! (Note: for simplicity, this new forecast result isn't stored anywhere.)
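
To make the weekly training step more concrete, here is a minimal, self-contained sketch of the per-store fan-out with Ray, Prophet, and MLflow. The column names, tracking URI, model-naming scheme, and synthetic data are illustrative assumptions, not the repo's actual DAG or training-service code.

```python
# Hedged sketch of the weekly fan-out: one Prophet model per store, trained as
# parallel Ray tasks and logged/registered in MLflow (assumptions noted inline).
import os

import mlflow
import pandas as pd
import ray
from prophet import Prophet

ray.init(ignore_reinit_error=True)  # inside the cluster you'd pass address="auto"


@ray.remote
def train_one_store(store_id: int, history: pd.DataFrame) -> str:
    """Fit one Prophet model on one store's sales history and log it to MLflow."""
    # Assumed service URL; the repo configures this via its env/config files.
    mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI", "http://mlflow:5000"))
    model = Prophet()
    model.fit(history.rename(columns={"date": "ds", "sales": "y"}))
    with mlflow.start_run(run_name=f"store-{store_id}") as run:
        mlflow.prophet.log_model(
            model,
            artifact_path="model",
            registered_model_name=f"prophet-store-{store_id}",  # hypothetical naming
        )
    return run.info.run_id


# In the real weekly DAG this would be the last 4 months of sales per store
# pulled from Postgres; a tiny synthetic frame stands in here.
sales_by_store = {
    1: pd.DataFrame(
        {"date": pd.date_range("2024-01-01", periods=120), "sales": range(120)}
    ),
}

run_ids = ray.get(
    [train_one_store.remote(sid, df) for sid, df in sales_by_store.items()]
)
```

The on-demand retraining triggered from the website (steps 9-10) follows the same pattern, just for a single model with the newest available data.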

How to setup

Prerequisites: Docker, Kubernetes, and Helm

With Docker Compose

  1. (Optional) In case you want to build the images yourself (instead of pulling them):
    docker-compose build
    
  2. docker-compose -f docker-compose.yml -f docker-compose-airflow.yml up -d
    
  3. Sometimes it can freeze or fail the first time, especially if your machine is not that high-spec (like mine T_T). Just wait a moment, run the last command again, and it should start up fine.
  4. That's it!

Note: Most services have their restart policy left unspecified, so they won't restart on failure (because restarting can be quite resource-consuming during development; you see, I have a poor laptop lol).

With Kubernetes/Helm (Local cluster)

The system is quite large and heavy... I recommend running it locally just for setup testing purposes. Then, if it works, move to the cloud if you want to play around longer, OR stick with Docker Compose (which went more smoothly in my case).

  1. Install Helm
    bash install-helm.sh
    
  2. Create airflow namespace:
    kubectl create namespace airflow
    
  3. Deploy the main chart:
    1. Fetch all dependencies
      cd sfmlops-helm
      helm dependency build
      
    2. helm -n mlops upgrade --install sfmlops-helm ./ --create-namespace -f values.yaml -f values-ray.yaml
      
  4. Deploy Kafka:
    1. (1st time only)
      helm repo add bitnami https://charts.bitnami.com/bitnami
      
    2. helm -n kafka upgrade --install kafka-release oci://registry-1.docker.io/bitnamicharts/kafka --create-namespace --version 23.0.7 -f values-kafka.yaml
      
  5. Deploy Airflow:
    1. (1st time only)
      helm repo add apache-airflow https://airflow.apache.org
      
    2. helm -n airflow upgrade --install airflow apache-airflow/airflow --create-namespace --version 1.13.1 -f values-airflow.yaml
      
    3. Sometimes you might get a timeout error from this command (if you do, it means your machine spec is too low for this system, like mine lol). It's totally fine. Just keep checking the status with kubectl; if all resources start up correctly, you're good to go. Otherwise, try running the command again.
  6. Deploy Prometheus and Grafana:
    1. (1st time only)
      helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
      
    2. helm -n monitoring upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack  --create-namespace --version 57.2.0 -f values-kube-prometheus.yaml
      
    3. Forward port for Grafana:
      kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
      
      OR assign grafana.service.type: LoadBalancer in values-kube-prometheus.yaml
    4. One of the good things about kube-prometheus-stack is that it comes with many pre-installed/pre-configured dashboards for Kubernetes. Feel free to explore!
  7. That's it! Enjoy your highly scalable Machine Learning system for Sales forecasting! ;)

Note: If you want to change the Kafka namespace kafka and/or the release name kafka-release, please also change them in values.yaml and in the KAFKA_BOOTSTRAP_SERVER env var in values-airflow.yaml, since they are also used in templating.

Note 2: In Docker Compose, Ray has already been configured to pull the embedded dashboards from Grafana. In Kubernetes, however, this process involves a lot more manual steps, so I intentionally left it out for ease of setup. You can follow the guide here if you want to set it up anyway.

With Kubernetes/Helm (on GCP)

Prerequisites: GKE Cluster (Standard cluster, NOT Autopilot), Artifact Registry, Service Usage API, gcloud cli

  1. Follow this Medium blog. Instead of using the default Service Account (as done in the blog), I recommend creating a new Service Account with the Owner role for a quick-and-dirty run (but of course, please consult your cloud engineer if you have security concerns).
  2. Download your Service Account's JSON key
  3. Activate your Service Account:
    gcloud auth activate-service-account --key-file=<PATH_TO_JSON_KEY>
    
  4. Connect local kubectl to cloud:
    gcloud container clusters get-credentials <GKE_CLUSTER_NAME> --zone <GKE_ZONE> --project <PROJECT_NAME>
    
  5. Now kubectl (and helm) will work in the context of the GKE environment.
  6. Follow the steps in With Kubernetes/Helm (Local cluster) section
  7. If you face a timeout error when running the Helm command for Airflow, or the system struggles to start up and work correctly, I recommend upgrading the machine type in your cluster.

Note: For the node pool machine type in the GKE cluster, e2-medium (the default) is not quite enough from my experiments, especially for Airflow and Ray. In my case, I went with e2-standard-8 and 1 node (the explanation for why only 1 node is in the Important note on MLflow on Cloud section). I also found I needed to increase the PVC quota in IAM.

Cleanup steps

helm uninstall sfmlops-helm -n mlops
helm uninstall kafka-release -n kafka
helm uninstall airflow -n airflow
helm uninstall kube-prometheus-stack -n monitoring

Important note on MLflow on Cloud

In this setting, I set MLflow's artifact path to point to a local path. Internally, MLflow expects this path to be accessible from both the MLflow client and the server (honestly, I'm not a fan of this model either); it is really meant to be an object storage path like S3 (AWS) or Cloud Storage (GCP). For a full on-premises experience, we can address this by creating a Docker volume and mounting it at the EXACT same path on both client and server. In a local Kubernetes cluster, we can do the same thing by creating a PVC with accessModes: ReadWriteOnce (in sfmlops-helm/templates/mlflow-pvc.yaml).

However, for on-cloud Kubernetes with a typical multi-node cluster, if we want the PVC to be readable and writable across nodes, we need to set accessModes: ReadWriteMany. Most cloud providers DO NOT support this type of PVC and recommend using centralized storage instead. Therefore, if you just want to try it out and run it for fun, you can use this exact setting and create a single-node cluster (which behaves similarly to a local Kubernetes cluster, just on the cloud). For a real production environment, please create a cloud storage bucket, remove mlflow-pvc.yaml and its mount paths, and change the artifact path variable MLFLOW_ARTIFACT_ROOT in sfmlops-helm/templates/global-configmap.yaml to the cloud storage path. See the official doc for more information.
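
As a concrete illustration of why client and server must agree on this path, here is a minimal client-side sketch that sets an explicit artifact location when creating an experiment. The repo itself configures the path on the server side via MLFLOW_ARTIFACT_ROOT as described above; the experiment name, fallback values, and service URL below are assumptions.

```python
# Hedged sketch: point MLflow at a shared artifact root (assumptions inline).
import os

import mlflow

# Assumed tracking-server URL; the repo configures this via its env/config files.
mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI", "http://mlflow:5000"))

# On-prem / single-node: a path mounted at the SAME location on client and
# server (shared Docker volume or the ReadWriteOnce PVC), e.g. /storage/mlruns.
# On cloud: an object-storage URI such as "gs://<your-bucket>/mlruns".
artifact_root = os.environ.get("MLFLOW_ARTIFACT_ROOT", "/storage/mlruns")

experiment_name = "sales-forecast"  # hypothetical experiment name
if mlflow.get_experiment_by_name(experiment_name) is None:
    mlflow.create_experiment(experiment_name, artifact_location=artifact_root)
mlflow.set_experiment(experiment_name)

with mlflow.start_run():
    # Artifacts logged in this run are written under artifact_root, which is
    # why both the client and the server must be able to resolve that path.
    mlflow.log_param("demo", True)
```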


My notes

If you have any comments or questions, or want to learn more, check out notes/README.md. I have included a lot of useful notes about how and why I made certain choices during development. Mostly, they cover tool selection, design choices, and some caveats.


sales-forecast-mlops-at-scale's Issues

Error when setting up the project with Docker Compose

After running the command docker-compose -f docker-compose.yml -f docker-compose-airflow.yml up -d, I'm getting the error below:

```
 => [nginx 2/3] COPY default.conf.template /etc/nginx/templates/                                                                      0.5s
 => ERROR [nginx 3/3] RUN ln -sf /dev/stdout /var/log/nginx/access.log && ln -sf /dev/stderr /var/log/nginx/error.log                 1.5s
------
 > [nginx 3/3] RUN ln -sf /dev/stdout /var/log/nginx/access.log && ln -sf /dev/stderr /var/log/nginx/error.log:
1.430 exec /bin/sh: exec format error
------
failed to solve: process "/bin/sh -c ln -sf /dev/stdout /var/log/nginx/access.log && ln -sf /dev/stderr /var/log/nginx/error.log" did not complete successfully: exit code: 1
```



**and `docker-compose build` is also giving the following error**

```
 => ERROR [ray 2/4] RUN mkdir -p /storage/mlruns/ &&     chown -R ray /storage/mlruns/ &&     chmod -R go+rX /storage/mlruns/         1.4s
 => CACHED [mlflow 1/5] FROM docker.io/library/python:3.9.17-slim@sha256:42a5da33675ec5a692e8cdbb09ffa4e39588c10dd9a96235e543c498484  0.0s
 => [web-ui internal] load build context                                                                                              0.9s
 => => transferring context: 14.81kB                                                                                                  0.0s
 => [data-producer internal] load build context                                                                                       1.6s
 => => transferring context: 34.01MB                                                                                                  0.4s
 => ERROR [data-producer 2/7] RUN mkdir -p /service/datasets                                                                          1.7s
 => [mlflow internal] load build context                                                                                              1.2s
 => => transferring context: 79B                                                                                                      0.0s
 => [forecast-service internal] load build context                                                                                    0.6s
 => => transferring context: 6.80kB                                                                                                   0.0s
 => CACHED [forecast-service 2/5] COPY requirements.txt /service/requirements.txt                                                     0.1s
 => CACHED [web-ui 2/5] COPY requirements.txt /service/requirements.txt                                                               0.0s
 => CANCELED [web-ui 3/5] RUN pip install -r /service/requirements.txt                                                                1.8s
 => CANCELED [forecast-service 3/5] RUN pip install -r /service/requirements.txt                                                      2.1s
 => CACHED [mlflow 2/3] COPY requirements.txt .                                                                                       0.0s
 => CANCELED [mlflow 3/3] RUN pip install -r requirements.txt                                                                         2.3s
------
 > [ray 2/4] RUN mkdir -p /storage/mlruns/ &&     chown -R ray /storage/mlruns/ &&     chmod -R go+rX /storage/mlruns/:
0.884 exec /bin/bash: exec format error
------
------
 > [data-producer 2/7] RUN mkdir -p /service/datasets:
1.488 exec /bin/sh: exec format error
------
failed to solve: process "/bin/bash -c mkdir -p $MLFLOW_ARTIFACT_ROOT &&     chown -R ray $MLFLOW_ARTIFACT_ROOT &&     chmod -R go+rX $MLFLOW_ARTIFACT_ROOT" did not complete successfully: exit code: 1
```





Failed to create new KafkaAdminClient

I'm getting the following error when I try to run the kafka_spark_db_dag in Airflow:

```
Traceback (most recent call last):
  File "/workspaces/sales-forecast-mlops-at-scale/services/airflow/dags/spark_streaming.py", line 110, in <module>
    stream_kafka_to_db()
  File "/workspaces/sales-forecast-mlops-at-scale/services/airflow/dags/spark_streaming.py", line 107, in stream_kafka_to_db
    write_df_to_db(processed_df)
  File "/workspaces/sales-forecast-mlops-at-scale/services/airflow/dags/spark_streaming.py", line 97, in write_df_to_db
    return query.awaitTermination()
  File "/workspaces/sales-forecast-mlops-at-scale/venv/lib/python3.10/site-packages/pyspark/sql/streaming/query.py", line 221, in awaitTermination
    return self._jsq.awaitTermination()
  File "/workspaces/sales-forecast-mlops-at-scale/venv/lib/python3.10/site-packages/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
  File "/workspaces/sales-forecast-mlops-at-scale/venv/lib/python3.10/site-packages/pyspark/errors/exceptions/captured.py", line 185, in deco
    raise converted from None
pyspark.errors.exceptions.captured.StreamingQueryException: [STREAM_FAILED] Query [id = 568bfb7c-0611-4ed0-9c8c-d5efb7a22068, runId = 54a32339-a883-4985-b4f8-11d1d58f4f0e] terminated with exception: Failed to create new KafkaAdminClient
```

Set up issue

I have some problems setting up and I do not know how to solve them.
I am setting up with docker-compose on Windows. The first thing I am confused about is whether I should run wsl first or not. In this case, I did not, but the program still seems to work. (screenshot attached)
After installation, there is an error from the daemon path, and it seems like none of the UIs in the containers work. Or am I missing a way to open the other programs? (screenshot attached)
