packtpublishing / practical-deep-learning-at-scale-with-mlflow

Practical Deep Learning at Scale with MLflow, published by Packt

License: MIT License

Python 24.20% Shell 0.02% Dockerfile 0.24% Jupyter Notebook 74.62% HTML 0.91%

practical-deep-learning-at-scale-with-mlflow's Introduction

Practical Deep Learning at Scale with MLflow

This is the code repository for Practical Deep Learning at Scale with MLflow, published by Packt.

Bridge the gap between offline experimentation and online production

What is this book about?

The book starts with an overview of the deep learning (DL) life cycle and the emerging Machine Learning Ops (MLOps) field, providing a clear picture of the four pillars of deep learning (data, model, code, and explainability) and the role of MLflow in these areas.

This book covers the following exciting features:

Understand MLOps and deep learning life cycle development
Track deep learning models, code, data, parameters, and metrics
Build, deploy, and run deep learning model pipelines anywhere
Run hyperparameter optimization at scale to tune deep learning models
Build production-grade multi-step deep learning inference pipelines
Implement scalable deep learning explainability as a service
Deploy deep learning batch and streaming inference services
Ship practical NLP solutions from experimentation to production

If you feel this book is for you, get your copy at Amazon today!

Instructions and Navigations

All of the code is organized into folders.

The code will look like the following:

client = boto3.client('sagemaker-runtime')
response = client.invoke_endpoint(
    EndpointName=app_name,
    ContentType=content_type,
    Accept=accept,
    Body=payload
)

Following is what you need for this book: This book is for machine learning practitioners including data scientists, data engineers, ML engineers, and scientists who want to build scalable full life cycle deep learning pipelines with reproducibility and provenance tracking using MLflow. A basic understanding of data science and machine learning is necessary to grasp the concepts presented in this book.

With the following software and hardware list you can run all code files present in the book (Chapter 1-10).

Software and Hardware List

The majority of the code in this book can be implemented and executed using the open source MLflow tool, with a few exceptions where a 14-day full Databricks trial is needed (sign up at https://databricks.com/try-databricks) along with an AWS Free Tier account (sign up at https://aws.amazon.com/free/). The following lists some major software packages covered in this book:

MLflow 1.20.2 and above
Python 3.8.10
Lightning-flash 0.5.0
Transformers 4.9.2
SHAP 0.40.0
PySpark 3.2.1
Ray[tune] 1.9.2
Optuna 2.10.0

The complete package dependencies are listed in each chapter's requirements.txt file or the conda.yaml file in this book's GitHub repository. All code has been tested to run successfully in a macOS or Linux environment. If you are a Microsoft Windows user, it is recommended to install WSL2 to run the bash scripts provided in this book: https://www.windowscentral.com/how-install-wsl2-windows-10. It is a known issue that the MLflow CLI does not work properly in the Microsoft Windows command line.
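As a quick sanity check before running a chapter, the versions listed above can be compared with what is installed in the active environment. The following is a minimal sketch, not part of the book's code: the `EXPECTED` table and the `installed_versions` helper are hypothetical names, and the keys are assumed to be the PyPI distribution names of the listed packages.

```python
from importlib import metadata

# Versions copied from the software list above.
# Keys are assumed PyPI distribution names (an assumption, not from the book).
EXPECTED = {
    "mlflow": "1.20.2",
    "lightning-flash": "0.5.0",
    "transformers": "4.9.2",
    "shap": "0.40.0",
    "pyspark": "3.2.1",
    "ray": "1.9.2",
    "optuna": "2.10.0",
}


def installed_versions(expected):
    """Map each package name to (version listed in the book, installed version or None)."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        report[name] = (wanted, found)
    return report


if __name__ == "__main__":
    for name, (wanted, found) in installed_versions(EXPECTED).items():
        status = "missing" if found is None else found
        print(f"{name}: book lists {wanted}, installed {status}")
```

The sketch only reports versions; each chapter's requirements.txt or conda.yaml remains the reference for what to install.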

Starting from Chapter 3, Tracking Models, Parameters, and Metrics of this book, you will also need to have Docker Desktop (https://www.docker.com/products/docker-desktop/) installed to set up a fully-fledged local MLflow tracking server for executing the code in this book. AWS SageMaker is needed in Chapter 8, Deploying a DL Inference Pipeline at Scale, for the cloud deployment example. VS Code version 1.60 or above (https://code.visualstudio.com/updates/v1_60) is used as the integrated development environment (IDE) in this book. Miniconda version 4.10.3 or above (https://docs.conda.io/en/latest/miniconda.html) is used throughout this book for creating and activating virtual environments.

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. Click here to download it.

Get to Know the Author

Yong Liu has been working in big data science, machine learning, and optimization since his doctoral student years at the University of Illinois at Urbana-Champaign (UIUC) and later as a senior research scientist and principal investigator at the National Center for Supercomputing Applications (NCSA), where he led data science R&D projects funded by the National Science Foundation and Microsoft Research. He then joined Microsoft and AI/ML start-ups in the industry. He has shipped ML and DL models to production and has been a speaker at the Spark/Data+AI summit and NLP summit. He has recently published peer-reviewed papers on deep learning, linked data, and knowledge-infused learning at various ACM/IEEE conferences and journals.

Download a free PDF

If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost.
Simply click on the link to claim your free PDF.

https://packt.link/free-ebook/9781803241333

practical-deep-learning-at-scale-with-mlflow's Issues

Issue running Ch 2 script

I followed the Ch 2 setup instructions, but when I run python first_dl_with_mlflow.py I get an error that originates from importing flash.

import flash
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\flash\__init__.py", line 18, in <module>
    from flash.core.utilities.imports import _TORCH_AVAILABLE
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\flash\core\utilities\imports.py", line 125, in <module>
    _PL_GREATER_EQUAL_1_4_3 = _compare_version("pytorch_lightning", operator.ge, "1.4.3")
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\flash\core\utilities\imports.py", line 58, in _compare_version
    pkg = importlib.import_module(package)
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\pytorch_lightning\__init__.py", line 21, in <module>
    from pytorch_lightning.callbacks import Callback  # noqa: E402
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\pytorch_lightning\callbacks\__init__.py", line 24, in <module>
    from pytorch_lightning.callbacks.pruning import ModelPruning
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\pytorch_lightning\callbacks\pruning.py", line 31, in <module>
    from pytorch_lightning.core.lightning import LightningModule
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\pytorch_lightning\core\__init__.py", line 16, in <module>
    from pytorch_lightning.core.lightning import LightningModule
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\pytorch_lightning\core\lightning.py", line 41, in <module>
    from pytorch_lightning.trainer.connectors.logger_connector.fx_validator import FxValidator
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\pytorch_lightning\trainer\__init__.py", line 18, in <module>
    from pytorch_lightning.trainer.trainer import Trainer
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 31, in <module>
    from pytorch_lightning.loggers import LightningLoggerBase
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\pytorch_lightning\loggers\__init__.py", line 23, in <module>
    from pytorch_lightning.loggers.mlflow import _MLFLOW_AVAILABLE, MLFlowLogger  # noqa: F401
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\pytorch_lightning\loggers\mlflow.py", line 32, in <module>
    import mlflow
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\mlflow\__init__.py", line 32, in <module>
    import mlflow.tracking._model_registry.fluent
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\mlflow\tracking\__init__.py", line 8, in <module>
    from mlflow.tracking.client import MlflowClient
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\mlflow\tracking\client.py", line 16, in <module>
    from mlflow.entities import Experiment, Run, RunInfo, Param, Metric, RunTag, FileInfo, ViewType
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\mlflow\entities\__init__.py", line 6, in <module>
    from mlflow.entities.experiment import Experiment
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\mlflow\entities\experiment.py", line 2, in <module>
    from mlflow.entities.experiment_tag import ExperimentTag
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\mlflow\entities\experiment_tag.py", line 2, in <module>
    from mlflow.protos.service_pb2 import ExperimentTag as ProtoExperimentTag
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\mlflow\protos\service_pb2.py", line 18, in <module>
    from .scalapb import scalapb_pb2 as scalapb_dot_scalapb__pb2
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\mlflow\protos\scalapb\scalapb_pb2.py", line 29, in <module>
    options = _descriptor.FieldDescriptor(
  File "C:\Users\Bob\miniconda3\envs\dl_model\lib\site-packages\google\protobuf\descriptor.py", line 553, in __new__
    _message.Message._CheckCalledFromGeneratedFile()

TypeError: Descriptors cannot be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

I am running this on Windows. I'm not able to import mlflow either.
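The error message above names two workarounds itself: downgrade protobuf to 3.20.x or lower (for example with pip install "protobuf<3.21"), or set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python. A minimal sketch of the second option follows; the helper name `use_pure_python_protobuf` is hypothetical, and the variable has to be set before mlflow (and therefore protobuf) is imported for the first time. As the message warns, pure-Python parsing will be slower.

```python
import os


def use_pure_python_protobuf():
    """Select the pure-Python protobuf implementation, as the error message suggests."""
    os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
    return os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"]


# Call this at the very top of the script, before any import that pulls in protobuf.
use_pure_python_protobuf()

# import mlflow  # import mlflow (or flash) only after the variable is set
```

Pinning protobuf in the environment is probably the more durable fix; the environment variable is mainly useful for confirming the diagnosis.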

ImportError: cannot import name 'ROUGEScore' from 'torchmetrics.text'

ImportError Traceback (most recent call last)
Input In [13], in <cell line: 2>()
1 from torchmetrics.text.rouge import ROUGEScore
----> 2 from flash.text import TextClassificationData, TextClassifier
3 datamodule = TextClassificationData.from_csv(
4 input_fields="review",
5 target_fields="sentiment",
(...)
8 test_file="data/imdb/test.csv"
9 )

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/flash/text/__init__.py:3, in <module>
1 from flash.text.classification import TextClassificationData, TextClassifier # noqa: F401
2 from flash.text.embedding import TextEmbedder # noqa: F401
----> 3 from flash.text.question_answering import QuestionAnsweringData, QuestionAnsweringTask # noqa: F401
4 from flash.text.seq2seq import ( # noqa: F401
5 Seq2SeqData,
6 Seq2SeqTask,
(...)
10 TranslationTask,
11 )

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/flash/text/question_answering/__init__.py:2, in <module>
1 from flash.text.question_answering.data import QuestionAnsweringData # noqa: F401
----> 2 from flash.text.question_answering.model import QuestionAnsweringTask

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/flash/text/question_answering/model.py:40, in <module>
38 from flash.text.ort_callback import ORTCallback
39 from flash.text.question_answering.finetuning import _get_question_answering_bacbones_for_freezing
---> 40 from flash.text.seq2seq.core.metrics import RougeMetric
42 if _TEXT_AVAILABLE:
43 from transformers import AutoModelForQuestionAnswering

File /anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/flash/text/seq2seq/core/metrics.py:25, in <module>
23 from pytorch_lightning.utilities import rank_zero_deprecation
24 from torchmetrics.text import BLEUScore as _BLEUScore
---> 25 from torchmetrics.text import ROUGEScore as _ROUGEScore
27 _deprecated_text_metrics = partial(deprecated, deprecated_in="0.6.0", remove_in="0.7.0", stream=rank_zero_deprecation)
30 class BLEUScore(_BLEUScore):

ImportError: cannot import name 'ROUGEScore' from 'torchmetrics.text' (/anaconda/envs/azureml_py38_PT_TF/lib/python3.8/site-packages/torchmetrics/text/__init__.py)
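In the traceback above, the direct import from torchmetrics.text.rouge on line 1 succeeds, while the package-level import inside flash fails. One way to narrow this down is to check which names each module actually exposes in the current environment. The sketch below is only a diagnostic, and `can_import` is a hypothetical helper; the likely cause (a mismatch between the installed lightning-flash and torchmetrics versions, or a missing optional dependency) is an assumption, not something the traceback confirms.

```python
import importlib


def can_import(module_name, attr):
    """Return True if `module_name` imports cleanly and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False  # the module itself is missing or fails to import
    return hasattr(module, attr)


if __name__ == "__main__":
    # The two locations that appear in the traceback above.
    for module_name in ("torchmetrics.text", "torchmetrics.text.rouge"):
        print(module_name, "exposes ROUGEScore:", can_import(module_name, "ROUGEScore"))
```

If the two checks disagree, comparing the installed torchmetrics version against the one pinned in the chapter's requirements.txt is a reasonable next step.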

1062, "Duplicate entry 'mlflow.runName-f13c66c81dc842059edad4ee436e11f7' for key 'PRIMARY'"

Hello,

I am currently on chapter 3 and have started all of the containers for the mlflow tracking server. I can see minio, the mlflow server information, and mysql is running just fine. I am currently working on the notebook dl_model_tracking and am attempting to run cell 6:
mlflow.pytorch.autolog()
with mlflow.start_run(experiment_id=experiment.experiment_id, run_name="chapter03") as dl_model_tracking_run:
    trainer.finetune(classifier_model, datamodule=datamodule, strategy="freeze")
    trainer.test()

I am getting the error "BAD_REQUEST: (pymysql.err.IntegrityError) (1062, "Duplicate entry 'mlflow.runName-f13c66c81dc842059edad4ee436e11f7' for key 'PRIMARY'")" [SQL: INSERT INTO tags....; however, when I peek into the mysql database at the tags table, it's empty. So there can't be a primary key already there if it's empty. Do you by chance know how to solve this? Thank you!

Errors while training on multiple GPUs

Hi!

I'm trying to run the code as-is from this repository but there are some errors if you try to train on multiple GPUs.

I've created the conda environment with the following commands:

conda create --name dl_model python==3.8.10
conda activate dl_model
pip install -r requirements.txt

The result from conda list | grep lightning is:

lightning-bolts           0.5.0                    pypi_0    pypi
lightning-flash           0.5.0                    pypi_0    pypi
pytorch-lightning         1.4.9                    pypi_0    pypi

ErrorMultipleGPUs.txt

RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

Traceback (most recent call last):
  File "first_dl.py", line 23, in <module>
    trainer.finetune(classifier_model, datamodule=datamodule, strategy="freeze")
  File "/home/steeve/Anaconda3/envs/dl_model/lib/python3.8/site-packages/flash/core/trainer.py", line 165, in finetune
    return super().fit(model, train_dataloader, val_dataloaders, datamodule)
  File "/home/steeve/Anaconda3/envs/dl_model/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/home/steeve/Anaconda3/envs/dl_model/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 922, in _run
    self._dispatch()
  File "/home/steeve/Anaconda3/envs/dl_model/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _dispatch
    self.accelerator.start_training(self)
  File "/home/steeve/Anaconda3/envs/dl_model/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/steeve/Anaconda3/envs/dl_model/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 158, in start_training
    mp.spawn(self.new_process, **self.mp_spawn_kwargs)
  File "/home/steeve/Anaconda3/envs/dl_model/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/steeve/Anaconda3/envs/dl_model/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/steeve/Anaconda3/envs/dl_model/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 149, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1

nvidia-smi output is:

Tue Aug 16 10:40:38 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:05:00.0  On |                  N/A |
| 41%   71C    P0    70W / 250W |    633MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:06:00.0 Off |                  N/A |
| 28%   50C    P8    12W / 250W |     11MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:09:00.0 Off |                  N/A |
| 24%   43C    P8    10W / 250W |     11MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:0A:00.0 Off |                  N/A |
| 23%   37C    P8     9W / 250W |     11MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
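The RuntimeError in this issue points at the main-module idiom: with the spawn start method (visible as ddp_spawn in the traceback), each child process re-imports the script, so any training call at module level runs again in every child. A minimal sketch of the restructuring the message asks for is shown below. `run_training` is a hypothetical stand-in for the body of first_dl.py (building the datamodule, model, and trainer, then calling trainer.finetune); whether this alone resolves the multi-GPU run is an assumption.

```python
def run_training():
    """Hypothetical stand-in for the training code in first_dl.py.

    In the real script, this is where the datamodule, model, and trainer
    would be created and trainer.finetune(...) would be called.
    """
    return "training finished"


def main():
    # Keep all work that starts processes behind this function.
    return run_training()


# Child processes started with spawn re-import this file under a different
# module name, so the guard stops them from launching training again.
if __name__ == "__main__":
    print(main())
```

The freeze_support() call mentioned in the message is only needed when the program is frozen into an executable, as the message itself notes.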

TypeError: from_csv() missing 1 required positional argument: 'input_fields'

We hit this error every time we run the code below.

Note: This happens in both Chapter01 and Chapter02.

Code :

datamodule = TextClassificationData.from_csv(
    input_fields="review", # the error is raised here; possibly a typo
    target_fields="sentiment",
    train_file="data/imdb/train.csv",
    val_file="data/imdb/valid.csv",
    test_file="data/imdb/test.csv"
)

Traceback (most recent call last):
  File "d:/Books/Practical Deep learning at Scale with mlflow/chapter02/first_dl_with_mlflow.py", line 8, in <module>
    datamodule = TextClassificationData.from_csv(
TypeError: from_csv() missing 1 required positional argument: 'input_field'
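The traceback names `input_field` (singular), while the code passes `input_fields`, which suggests the installed lightning-flash expects a different keyword than the one used in the script. One way to confirm this without guessing is to inspect the signature of the function that is actually installed. The helper `pick_keyword` below is hypothetical, and `fake_from_csv` is only a stand-in so the sketch runs without flash; the real check would target `TextClassificationData.from_csv`.

```python
import inspect


def pick_keyword(func, candidates):
    """Return the first name in `candidates` that `func` accepts as a parameter, else None."""
    params = inspect.signature(func).parameters
    for name in candidates:
        if name in params:
            return name
    return None


# Stand-in with the signature implied by the traceback (an assumption).
def fake_from_csv(input_field, target_fields=None, **kwargs):
    return input_field, target_fields


if __name__ == "__main__":
    print(pick_keyword(fake_from_csv, ["input_fields", "input_field"]))
    # With flash installed, the same check would be:
    # from flash.text import TextClassificationData
    # pick_keyword(TextClassificationData.from_csv, ["input_fields", "input_field"])
```

If the check returns `input_field`, either rename the keyword in the script or install the lightning-flash version pinned in the chapter's requirements.txt.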