Awesome MLOps

An awesome list of references for MLOps - Machine Learning Operations 👉 ml-ops.org

Table of Contents


MLOps Core	MLOps Communities
MLOps Books	MLOps Articles
MLOps Workflow Management	MLOps: Feature Stores
MLOps: Data Engineering (DataOps)	MLOps: Model Deployment and Serving
MLOps: Testing, Monitoring and Maintenance	MLOps: Infrastructure
MLOps Papers	Talks About MLOps
Existing ML Systems	Machine Learning
Software Engineering	Product Management for ML/AI
The Economics of ML/AI	Model Governance, Ethics, Responsible AI
MLOps: People & Processes	Newsletters About MLOps, Machine Learning, Data Science and Co.

MLOps Core

MLOps Communities

MLOps Books

"Machine Learning Engineering" by Andriy Burkov, 2020
"ML Ops: Operationalizing Data Science" by David Sweenor, Steven Hillion, Dan Rope, Dev Kannabiran, Thomas Hill, Michael O'Connell
"Building Machine Learning Powered Applications" by Emmanuel Ameisen
"Building Machine Learning Pipelines" by Hannes Hapke, Catherine Nelson, 2020, O’Reilly
"Managing Data Science" by Kirill Dubovikov
"Accelerated DevOps with AI, ML & RPA: Non-Programmer's Guide to AIOPS & MLOPS" by Stephen Fleming
"Evaluating Machine Learning Models" by Alice Zheng
"Agile AI" by Carlo Appugliese, Paco Nathan, William S. Roberts. O'Reilly Media, Inc. 2020
"Machine Learning Logistics" by T. Dunning et al. O'Reilly Media, Inc. 2017
"Machine Learning Design Patterns" by Valliappa Lakshmanan, Sara Robinson, Michael Munn. O'Reilly 2020
"Serving Machine Learning Models: A Guide to Architecture, Stream Processing Engines, and Frameworks" by Boris Lublinsky, O'Reilly Media, Inc. 2017
"Kubeflow for Machine Learning" by Holden Karau, Trevor Grant, Ilan Filonenko, Richard Liu, Boris Lublinsky
"Clean Machine Learning Code" by Moussa Taifi. Leanpub. 2020
E-Book "Practical MLOps. How to Get Ready for Production Models"
"Introducing MLOps" by Mark Treveil, et al. O'Reilly Media, Inc. 2020
"Machine Learning for Data Streams with Practical Examples in MOA" by Albert Bifet, Ricard Gavaldà, Geoff Holmes, Bernhard Pfahringer. MIT Press, 2018
"Machine Learning Product Manual" by Laszlo Sragner, Chris Kelly
"Data Science Bootstrap Notes" by Eric J. Ma
"Data Teams" by Jesse Anderson, 2020

MLOps Articles

MLOps: Workflow Management

Open-source Workflow Management Tools: A Survey by Ploomber

MLOps: Feature Stores

MLOps: Data Engineering (DataOps)

MLOps: Model Deployment and Serving

MLOps: Testing, Monitoring and Maintenance

MLOps: Infrastructure

MLOps Papers

(2021) Asset management in machine learning: a survey. This paper presents a feature-based survey of 17 tools with ML asset management support identified in a systematic search. It overviews these tools’ features for managing the different types of assets used for engineering ML-based systems and performing experiments. Go to paper
(2021) Ease.ML: a lifecycle management system for MLDev and MLOps. This paper presents a system for managing and automating the entire lifecycle of machine learning application development. Go to paper
(2021) Challenges in deploying machine learning: a survey of case studies. This survey reviews published reports of deploying machine learning solutions in a variety of use cases, industries and applications and extracts practical considerations corresponding to stages of the machine learning deployment workflow. Go to paper
(2020) Adoption and effects of software engineering best practices in machine learning. This paper aims to empirically determine the state of the art in how teams develop, deploy and maintain software with ML components. Go to paper
(2020) A viz recommendation system: ML lifecycle at Tableau. This paper covers Tableau's research and development effort for the ML models behind the recommendation system, especially in the areas of model life-cycle management, deployment, and monitoring. Go to paper
(2020) Building continuous integration services for machine learning. This paper presents a CI system for ML that integrates seamlessly with existing ML development tools. Go to paper
(2020) CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking. This paper presents CodeReef, an open source platform to share all the components necessary to enable cross-platform MLOps (MLSysOps), i.e., automating the deployment of ML models across diverse systems in the most efficient way. Go to paper
(2020) Common problems with creating machine learning pipelines from existing code. This workshop paper shares common problems observed in industry when developing machine learning pipelines. Go to paper
(2020) Data engineering for data analytics: a classification of the issues and case studies. This paper provides a description and classification of data engineering tasks (such as acquiring, understanding, cleaning, and preparing the data) into high-levels groups, namely data organization, data quality, and feature engineering. Go to paper
(2020) DevOps for AI - challenges in development of AI-enabled applications. This paper points out the challenges in developing complex systems that include ML components, and discusses possible solutions driven by the combination of DevOps and ML workflow processes. Industrial cases are presented to illustrate these challenges and the possible solutions. Go to paper
(2020) Developments in MLflow: a system to accelerate the machine learning lifecycle. This paper discusses user feedback collected since MLflow was launched in 2018, as well as three major features introduced in response to this feedback. Go to paper
(2020) Engineering AI systems: a research agenda. This paper presents a research agenda for AI engineering that provides an overview of the key engineering challenges surrounding ML solutions and an overview of open items that need to be addressed by the research community at large. Go to paper
(2020) Explainable machine learning in deployment. This study explores how organizations view and use explainability for stakeholder consumption. Go to paper
(2020) From what to how: an initial review of publicly available AI ethics tools, methods and research to translate principles into practices. This paper aims to help close the gap between principles and practices in Machine Learning by constructing a typology that may help practically-minded developers apply ethics at each stage of the Machine Learning development pipeline, and to signal to researchers where further work is needed. Go to paper
(2020) Implicit provenance for machine learning artifacts. This paper presents an approach, called implicit provenance, where a distributed file system and APIs are instrumented to capture changes to ML artifacts. Together with file naming conventions, this means that full lineage can be tracked for TensorFlow/Keras/PyTorch programs without requiring code changes. Go to paper
(2020) Machine learning testing: survey, landscapes and horizons. This paper provides a comprehensive survey of Machine Learning Testing (ML testing) research. Go to paper
(2020) MLModelCI: an automatic cloud platform for efficient MLaaS. This paper presents MLModelCI, a one-step platform for efficient machine learning (ML) services that leverages DevOps techniques to optimize, test, and manage models. It also containerizes and deploys these optimized and validated models as cloud services. Go to paper
(2020) Monitoring and explainability of models in production. This paper discusses the challenges to successful implementation of solutions in key areas (such as model performance and data monitoring, detecting outliers and data drift using statistical techniques) with some recent examples of production ready solutions using open source tools. Go to paper
(2020) Principles and practice of explainable machine learning. This paper focuses on data-driven methods - machine learning and pattern recognition models in particular - so as to survey and distill the results and observations from the literature about the following challenges: how do we understand the decisions suggested by these systems in order that we can trust them? Go to paper
(2020) sensAI: fast ConvNets serving on live data via class parallelism. This paper presents sensAI, a novel and generic approach to faster inference on a single data item that distributes a single CNN into disconnected subnets and achieves decent serving accuracy with negligible communication overhead (1 float value). Go to paper
(2020) Software engineering for artificial intelligence and machine learning software: a systematic literature review. This study aims to investigate how software engineering (SE) has been applied in the development of AI/ML systems and identify challenges and practices that are applicable and determine whether they meet the needs of professionals. Go to paper
(2020) Software engineering patterns for machine learning applications (SEP4MLA). From 33 ML patterns, this paper describes three major ML architecture patterns and one ML design pattern in the standard pattern format so that practitioners can (re)use them in their contexts. Go to part 1 or part 2
(2020) Simulating performance of ML systems with offline profiling. This paper advocates that simulation based on offline profiling is a promising approach to better understand and improve complex ML systems, and proposes an approach that uses operation-level profiling and dataflow-based simulation to ensure a unified and automated solution for all frameworks and ML models. Go to paper
(2020) Towards automating the AI operations lifecycle. This paper presents a set of enabling technologies that can be used to increase the level of automation in AI operations, thus lowering the human effort required. Go to paper
(2020) Towards CRISP-ML(Q): a machine learning process model with quality assurance methodology. This paper proposes a process model for the development of machine learning applications that guides machine learning practitioners and project organizations from industry and academia with a checklist of tasks that spans the complete project life-cycle. Go to paper
(2020) Towards distribution transparency for supervised ML with oblivious training functions. This paper introduces the distribution oblivious training function as an abstraction for ML development in Python, whereby developers can reuse the same training function when running a notebook on a laptop or performing scale-out hyperparameter search and distributed training on clusters. Go to paper
(2020) Towards ML engineering: a brief history of TensorFlow Extended (TFX). This paper gives a whirlwind tour of Sibyl and TensorFlow Extended (TFX), two successive end-to-end ML platforms at Alphabet. It also shares the lessons learned from over a decade of applied ML built on these platforms, and explains both their similarities and their differences. Go to paper
(2019) Assuring the machine learning lifecycle: desiderata, methods, and challenges. This paper provides a comprehensive survey of the state-of-the-art in the assurance of ML, i.e., in the generation of evidence that ML is sufficiently safe for its intended use. Go to paper
(2019) Continuous integration of machine learning models with ease.ml/ci: towards a rigorous yet practical treatment. This paper presents ease.ml/ci, a continuous integration system for machine learning to provide rigorous guarantees with a practical amount of labeling effort. Go to paper
(2019) Challenges in the deployment and operation of machine learning in practice. In this work, the authors aim to systematically elicit the challenges in deployment and operation to enable broader practical dissemination of machine learning applications. Go to paper
(2019) Overton: a data system for monitoring and improving machine-learned products. This paper describes a system called Overton, whose main design goal is to support engineers in building, monitoring, and improving production machine learning systems. Go to paper
(2019) Studying software engineering patterns for designing machine learning systems. This paper collects good/bad software engineering design patterns for ML techniques to provide developers with a comprehensive classification of such patterns. Go to paper
(2019) Towards automated ML model monitoring: measure, improve and quantify data quality. This paper focuses on the arising challenge of automating the operation of deployed ML applications, especially with respect to monitoring the quality of their input data. Go to paper
(2018) A systems perspective to reproducibility in production machine learning domain. This paper presents a system that enables ML experts to track and reproduce ML models and pipelines in production. Go to paper
(2018) Building a reproducible machine learning pipeline. This paper discusses some problems encountered while building a variety of machine learning models, and subsequently describes a framework to tackle the problem of model reproducibility. Go to paper
(2018) On challenges in machine learning model management. This paper discusses a selection of ML use cases, develops an overview of conceptual, engineering, and data-processing related challenges arising in the management of the corresponding ML models, and points out future research directions. Go to paper
(2018) Ease.ml in action: towards multi-tenant declarative learning services. This demo paper presents the design principles of ease.ml, highlights the implementation of its key components, and showcases how ease.ml can help ease machine learning tasks that often perplex even experienced users. Go to paper
(2017) Clipper: a low-latency online prediction serving system. This paper introduces Clipper, a general-purpose low-latency prediction serving system that aims to simplify model deployment across frameworks and applications, reduce prediction latency, and improve prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. Go to paper
(2017) Ease.ml: towards multi-tenant resource sharing for machine learning workloads. This paper presents ease.ml, a declarative machine learning service platform. Go to paper
(2017) Data management challenges in production machine learning. This paper discusses data-management issues that arise in the context of machine learning pipelines deployed in production. Go to paper
(2017) TFX: A TensorFlow-based production-scale machine learning platform. This paper presents TensorFlow Extended (TFX), a TensorFlow-based general-purpose machine learning platform implemented at Google to reduce the time to production from the order of months to weeks, while providing platform stability that minimizes disruptions. Go to paper
(2016) ModelDB: a system for machine learning model management. This paper describes ModelDB, a novel end-to-end system for the management of machine learning models. Go to paper
(2016) Scaling Machine Learning as a Service. This paper presents the scalable MLaaS built for Uber that operates globally. It focuses on several challenges, including: (i) how to scale feature computation for many machine learning use cases; (ii) how to build accurate models using global data; (iii) how to enable scalable model deployment and real-time serving for many models across multiple data centers. Go to paper
(2016) What’s your ML test score? A rubric for ML production systems. This paper presents an ML Test Score rubric based on a set of actionable tests to help quantify a host of issues not found in small toy examples or even large offline research experiments. Go to paper
(2015) Hidden technical debt in machine learning systems. This paper explores several ML-specific risk factors to account for in system design. Go to paper
(2020) Towards complaint-driven ML workflow debugging. Go to paper
(NA) PerfGuard: Deploying ML-for-Systems without Performance Regressions. Go to paper
Addressing the Memory Bottleneck in AI Model-Training
Reliance on Metrics is a Fundamental Challenge for AI
Teaching Software Engineering for AI-Enabled Systems

Additional Resources

Adversarial machine learning reading list
Workshop at ICML 2020: "Challenges in Deploying and Monitoring Machine Learning Systems" (Accepted Papers)
Workshop on MLOps Systems (MLSys)
A survey on concept drift adaptation
Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
Conversational Applications and Natural Language Understanding Services at Scale. Minh Tue Vo Thanh and Vijay Ramakrishnan.
Efficient Scheduling of DNN Training on Multitenant Clusters. Deepak Narayanan, Keshav Santhanam, Amar Phanishayee and Matei Zaharia.
MLBox: Towards Reproducible ML. Victor Bittorf, Xinyuan Huang, Peter Mattson, Debojyoti Dutta, David Aronchick, Emad Barsoum, Sarah Bird, Sergey Serebryakov, Natalia Vassilieva, Tom St. John, Grigori Fursin, Srini Bala, Sivanagaraju Yarramaneni, Alka Roy, David Kanter and Elvira Dzhuraeva.
MLPM: Machine Learning Package Manager. Xiaozhe Yao.
Tools for machine learning experiment management. Vlad Velici and Adam Prügel-Bennett.
Towards split learning at scale: System design. Iker Rodríguez, Eduardo Muñagorri, Alberto Roman, Abhishek Singh, Praneeth Vepakomma and Ramesh Raskar.
