Awesome MLOps

MLOps. You Design It. You Train It. You Run It.

An awesome list of references for MLOps - Machine Learning Operations 👉 ml-ops.org

Table of Contents

MLOps Core | MLOps Communities
MLOps Books | MLOps Articles
MLOps: Workflow Management | MLOps: Feature Stores
MLOps: Data Engineering (DataOps) | MLOps: Model Deployment and Serving
MLOps: Testing, Monitoring and Maintenance | MLOps: Infrastructure
MLOps Papers | Talks About MLOps
Existing ML Systems | Machine Learning
Software Engineering | Product Management for ML/AI
The Economics of ML/AI | Model Governance, Ethics, Responsible AI
MLOps: People & Processes | Newsletters About MLOps, Machine Learning, Data Science and Co.

MLOps Core

  1. Machine Learning Operations: You Design It, You Train It, You Run It!
  2. MLOps SIG Specification
  3. ML in Production
  4. Awesome production machine learning: State of MLOps Tools and Frameworks
  5. Udemy “Deployment of ML Models”
  6. Full Stack Deep Learning
  7. Engineering best practices for Machine Learning
  8. 🚀 Putting ML in Production
  9. Stanford MLSys Seminar Series
  10. IBM ML Operationalization Starter Kit
  11. Productize ML. A self-study guide for Developers and Product Managers building Machine Learning products.
  12. MLOps (Machine Learning Operations) Fundamentals on GCP
  13. ML full Stack preparation

MLOps Communities

  1. MLOps.community
  2. CDF Special Interest Group - MLOps
  3. RsqrdAI - Robust and Responsible AI
  4. DataTalks.Club
  5. Synthetic Data Community

MLOps Books

  1. “Machine Learning Engineering” by Andriy Burkov, 2020
  2. "ML Ops: Operationalizing Data Science" by David Sweenor, Steven Hillion, Dan Rope, Dev Kannabiran, Thomas Hill, Michael O'Connell
  3. "Building Machine Learning Powered Applications" by Emmanuel Ameisen
  4. "Building Machine Learning Pipelines" by Hannes Hapke, Catherine Nelson, 2020, O’Reilly
  5. "Managing Data Science" by Kirill Dubovikov
  6. "Accelerated DevOps with AI, ML & RPA: Non-Programmer's Guide to AIOPS & MLOPS" by Stephen Fleming
  7. "Evaluating Machine Learning Models" by Alice Zheng
  8. Agile AI. 2020. By Carlo Appugliese, Paco Nathan, William S. Roberts. O'Reilly Media, Inc.
  9. "Machine Learning Logistics". 2017. By T. Dunning et al. O'Reilly Media Inc.
  10. "Machine Learning Design Patterns" by Valliappa Lakshmanan, Sara Robinson, Michael Munn. O'Reilly 2020
  11. "Serving Machine Learning Models: A Guide to Architecture, Stream Processing Engines, and Frameworks" by Boris Lublinsky, O'Reilly Media, Inc. 2017
  12. "Kubeflow for Machine Learning" by Holden Karau, Trevor Grant, Ilan Filonenko, Richard Liu, Boris Lublinsky
  13. "Clean Machine Learning Code" by Moussa Taifi. Leanpub. 2020
  14. E-Book "Practical MLOps. How to Get Ready for Production Models"
  15. "Introducing MLOps" by Mark Treveil, et al. O'Reilly Media, Inc. 2020
  16. "Machine Learning for Data Streams with Practical Examples in MOA" by Albert Bifet, Ricard Gavaldà, Geoff Holmes, Bernhard Pfahringer. MIT Press, 2018
  17. "Machine Learning Product Manual" by Laszlo Sragner, Chris Kelly
  18. "Data Science Bootstrap Notes" by Eric J. Ma
  19. "Data Teams" by Jesse Anderson, 2020

MLOps Articles

  1. Continuous Delivery for Machine Learning (by Thoughtworks)
  2. What is MLOps? NVIDIA Blog
  3. MLSpec: A project to standardize the intercomponent schemas for a multi-stage ML Pipeline.
  4. The 2021 State of Enterprise Machine Learning | State of Enterprise ML 2020: PDF and Interactive
  5. Organizing machine learning projects: project management guidelines.
  6. Rules for ML Project (Best practices)
  7. ML Pipeline Template
  8. Data Science Project Structure
  9. Reproducible ML
  10. ML project template facilitating both research and production phases.
  11. Machine learning requires a fundamentally different deployment approach. As organizations embrace machine learning, the need for new deployment tools and strategies grows.
  12. Why is DevOps for Machine Learning so Different?
  13. Lessons learned turning machine learning models into real products and services – O’Reilly
  14. MLOps: Model management, deployment and monitoring with Azure Machine Learning
  15. Guide to File Formats for Machine Learning: Columnar, Training, Inferencing, and the Feature Store
  16. Architecting a Machine Learning Pipeline. How to build scalable Machine Learning systems
  17. Why Machine Learning Models Degrade In Production
  18. Concept Drift and Model Decay in Machine Learning
  19. Bringing ML to Production
  20. A Tour of End-to-End Machine Learning Platforms
  21. MLOps: Continuous delivery and automation pipelines in machine learning
  22. AI meets operations
  23. What would machine learning look like if you mixed in DevOps? Wonder no more, we lift the lid on MLOps
  24. Forbes: The Emergence Of ML Ops
  25. Cognilytica Report "ML Model Management and Operations 2020 (MLOps)"
  26. Introducing Cloud AI Platform Pipelines
  27. A Guide to Production Level Deep Learning
  28. The 5 Components Towards Building Production-Ready Machine Learning Systems
  29. Deep Learning in Production (references about deploying deep learning-based models in production)
  30. Machine Learning Experiment Tracking
  31. 15 Best Tools for Tracking Machine Learning Experiments
  32. The Team Data Science Process (TDSP)
  33. MLOps Solutions (Azure based)
  34. Monitoring ML pipelines
  35. Deployment & Explainability of Machine Learning COVID-19 Solutions at Scale with Seldon Core and Alibi
  36. Demystifying AI Infrastructure
  37. Organizing machine learning projects: project management guidelines.
  38. The Checklist for Machine Learning Projects (from Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn and TensorFlow")
  39. Data Project Checklist by Jeremy Howard
  40. MLOps: not as Boring as it Sounds
  41. 10 Steps to Making Machine Learning Operational. Cloudera White Paper
  42. MLOps is Not Enough. The Need for an End-to-End Data Science Lifecycle Process.
  43. Data Science Lifecycle Repository Template
  44. Template: code and pipeline definition for a machine learning project demonstrating how to automate an end to end ML/AI workflow.
  45. Nitpicking Machine Learning Technical Debt
  46. The Best Tools, Libraries, Frameworks and Methodologies that Machine Learning Teams Actually Use – Things We Learned from 41 ML Startups
  47. Software Engineering for AI/ML - An Annotated Bibliography
  48. Intelligent System. Machine Learning in Practice
  49. CMU 17-445/645: Software Engineering for AI-Enabled Systems (SE4AI)
  50. Machine Learning is Requirements Engineering
  51. Machine Learning Reproducibility Checklist
  52. Machine Learning Ops. A collection of resources on how to facilitate Machine Learning Ops with GitHub.
  53. Task Cheatsheet for Almost Every Machine Learning Project. A checklist of tasks for building End-to-End ML projects
  54. Web services vs. streaming for real-time machine learning endpoints
  55. How PyTorch Lightning became the first ML framework to run continuous integration on TPUs
  56. The ultimate guide to building maintainable Machine Learning pipelines using DVC
  57. Continuous Machine Learning (CML) is CI/CD for Machine Learning Projects (DVC)
  58. What I learned from looking at 200 machine learning tools | Update: MLOps Tooling Landscape v2 (+84 new tools) - Dec '20
  59. Big Data & AI Landscape
  60. Deploying Machine Learning Models as Data, not Code — A better match?
  61. “Thou shalt always scale” — 10 commandments of MLOps
  62. Three Risks in Building Machine Learning Systems
  63. Blog about ML in production (by maiot.io)
  64. Back to the Machine Learning fundamentals: How to write code for Model deployment. Part 1, Part 2, Part 3
  65. MLOps: Machine Learning as an Engineering Discipline
  66. ML Engineering on Google Cloud Platform (hands-on labs and code samples)
  67. Deep Reinforcement Learning in Production. The use of Reinforcement Learning to Personalize User Experience at Zynga
  68. What is Data Observability?
  69. A Practical Guide to Maintaining Machine Learning in Production
  70. Continuous Machine Learning. Part 1, Part 2. Part 3 is coming soon.
  71. The Agile approach in data science explained by an ML expert
  72. Here is what you need to look for in a model server to build ML-powered services
  73. The problem with AI developer tools for enterprises (and what IKEA has to do with it)
  74. Streaming Machine Learning with Tiered Storage
  75. Best practices for performance and cost optimization for machine learning (Google Cloud)
  76. Lean Data and Machine Learning Operations
  77. A Brief Guide to Running ML Systems in Production. Best Practices for Site Reliability Engineers
  78. AI engineering practices in the wild - SIG | Getting software right for a healthier digital world
  79. SE-ML | The 2020 State of Engineering Practices for Machine Learning
  80. Awesome Software Engineering for Machine Learning (GitHub repository)
  81. Sampling isn’t enough, profile your ML data instead
  82. Reproducibility in ML: why it matters and how to achieve it
  83. 12 Factors of reproducible Machine Learning in production
  84. MLOps: More Than Automation
  85. Lean Data Science
  86. Engineering Skills for Data Scientists
  87. DAGsHub Blog. Read about data science and machine learning workflows, MLOps, and open source data science
  88. Data Science Project Flow for Startups
  89. Data Science Engineering at Shopify
  90. Building state-of-the-art machine learning technology with efficient execution for the crypto economy
  91. Completing the Machine Learning Loop
  92. Deploying Machine Learning Models: A Checklist
  93. Global MLOps and ML tools landscape (by MLReef)
  94. Why all Data Science teams need to get serious about MLOps
  95. MLOps Values (by Bart Grasza)
  96. Machine Learning Systems Design (by Chip Huyen)
  97. Designing an ML system (Stanford | CS 329 | Chip Huyen)
  98. How COVID-19 Has Infected AI Models (about the data drift or model drift concept)

MLOps: Workflow Management

  1. Open-source Workflow Management Tools: A Survey by Ploomber

MLOps: Feature Stores

  1. Feature Stores for Machine Learning Medium Blog
  2. MLOps with a Feature Store
  3. Feature Stores for ML
  4. Hopsworks: Data-Intensive AI with a Feature Store
  5. Feast: An open-source Feature Store for Machine Learning
  6. What is a Feature Store?
  7. ML Feature Stores: A Casual Tour
  8. Comprehensive List of Feature Store Architectures for Data Scientists and Big Data Professionals
  9. ML Engineer Guide: Feature Store vs Data Warehouse (vendor blog)
  10. Building a Gigascale ML Feature Store with Redis, Binary Serialization, String Hashing, and Compression (DoorDash blog)
  11. Feature Stores: Variety of benefits for Enterprise AI.
  12. Feature Store as a Foundation for Machine Learning

MLOps: Data Engineering (DataOps)

  1. The state of data quality in 2020 – O’Reilly
  2. Why We Need DevOps for ML Data
  3. Data Preparation for Machine Learning (7-Day Mini-Course)
  4. Best practices in data cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data.
  5. 17 Strategies for Dealing with Data, Big Data, and Even Bigger Data
  6. DataOps Data Architecture
  7. Data Orchestration — A Primer
  8. 4 Data Trends to Watch in 2020
  9. CSE 291D / 234: Data Systems for Machine Learning
  10. A complete picture of the modern data engineering landscape
  11. Continuous Integration for your data with GitHub Actions and Great Expectations. One step closer to CI/CD for your data pipelines
  12. Emerging Architectures for Modern Data Infrastructure
  13. Awesome Data Engineering. Learning path and resources to become a data engineer
  14. Data Quality at Airbnb Part 1 | Part 2
  15. DataHub: Popular metadata architectures explained
  16. Financial Times Data Platform: From zero to hero. An in-depth walkthrough of the evolution of our Data Platform
  17. Alki, or how we learned to stop worrying and love cold metadata (Dropbox)
  18. A Beginner's Guide to Clean Data. Practical advice to spot and avoid data quality problems (by Benjamin Greve)
  19. ML Lake: Building Salesforce’s Data Platform for Machine Learning
  20. Data Catalog 3.0: Modern Metadata for the Modern Data Stack
  21. Metadata Management Systems
  22. Essential resources for data engineers (a curated recommended read and watch list for scalable data processing)

MLOps: Model Deployment and Serving

  1. AI Infrastructure for Everyone: DeterminedAI
  2. Deploying R Models with MLflow and Docker
  3. What Does it Mean to Deploy a Machine Learning Model?
  4. Software Interfaces for Machine Learning Deployment
  5. Batch Inference for Machine Learning Deployment
  6. AWS Cost Optimization for ML Infrastructure - EC2 spend
  7. CI/CD for Machine Learning & AI
  8. Itaú Unibanco: How we built a CI/CD Pipeline for machine learning with online training in Kubeflow
  9. 101 For Serving ML Models
  10. Deploying Machine Learning models to production — Inference service architecture patterns
  11. Serverless ML: Deploying Lightweight Models at Scale
  12. ML Model Rollout To Production. Part 1 | Part 2
  13. Deploying Python ML Models with Flask, Docker and Kubernetes
  14. Deploying Python ML Models with Bodywork

MLOps: Testing, Monitoring and Maintenance

  1. Building dashboards for operational visibility (AWS)
  2. Monitoring Machine Learning Models in Production
  3. Effective testing for machine learning systems
  4. Unit Testing Data: What is it and how do you do it?
  5. How to Test Machine Learning Code and Systems (Accompanying code)
  6. Wu, T., Dong, Y., Dong, Z., Singa, A., Chen, X. and Zhang, Y., 2020. Testing Artificial Intelligence System Towards Safety and Robustness: State of the Art. IAENG International Journal of Computer Science, 47(3).
  7. Multi-Armed Bandits and the Stitch Fix Experimentation Platform
  8. A/B Testing Machine Learning Models
  9. Data validation for machine learning. Polyzotis, N., Zinkevich, M., Roy, S., Breck, E. and Whang, S., 2019. Proceedings of Machine Learning and Systems
  10. Testing machine learning based systems: a systematic mapping
  11. Explainable Monitoring: Stop flying blind and monitor your AI
  12. WhyLogs: Embrace Data Logging Across Your ML Systems
  13. Evidently AI. Insights on doing machine learning in production. (Vendor blog.)
  14. The definitive guide to comprehensively monitoring your AI
  15. Introduction to Unit Testing for Machine Learning
  16. Production Machine Learning Monitoring: Outliers, Drift, Explainers & Statistical Performance
  17. Test-Driven Development in MLOps Part 1

MLOps: Infrastructure

  1. MLOps Infrastructure Stack Canvas
  2. Rise of the Canonical Stack in Machine Learning. How a Dominant New Software Stack Will Unlock the Next Generation of Cutting Edge AI Apps
  3. AI Infrastructure Alliance. Building the canonical stack for AI/ML
  4. Linux Foundation AI Foundation
  5. ML Infrastructure Tools for Production | Part 1 — Production ML — The Final Stage of the Model Workflow | Part 2 — Model Deployment and Serving
  6. The MLOps Stack Template (by valohai)

MLOps Papers

  1. (2021) Asset management in machine learning: a survey. This paper presents a feature-based survey of 17 tools with ML asset management support identified in a systematic search. It overviews these tools’ features for managing the different types of assets used for engineering ML-based systems and performing experiments. Go to paper
  2. (2021) Ease.ML: a lifecycle management system for MLDev and MLOps. This paper presents a system for managing and automating the entire lifecycle of machine learning application development. Go to paper
  3. (2021) Challenges in deploying machine learning: a survey of case studies. This survey reviews published reports of deploying machine learning solutions in a variety of use cases, industries and applications and extracts practical considerations corresponding to stages of the machine learning deployment workflow. Go to paper
  4. (2020) Adoption and effects of software engineering best practices in machine learning. This paper aims to empirically determine the state of the art in how teams develop, deploy and maintain software with ML components. Go to paper
  5. (2020) A viz recommendation system: ML lifecycle at Tableau. This paper covers Tableau's research and development effort for the ML models behind the recommendation system, especially in the area of model life-cycle management, deployment, and monitoring. Go to paper
  6. (2020) Building continuous integration services for machine learning. This paper presents a CI system for ML that integrates seamlessly with existing ML development tools. Go to paper
  7. (2020) CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking. This paper presents CodeReef, an open source platform to share all the components necessary to enable cross-platform MLOps (MLSysOps), i.e., automating the deployment of ML models across diverse systems in the most efficient way. Go to paper
  8. (2020) Common problems with creating machine learning pipelines from existing code. This workshop paper shares common problems observed in industry on developing machine learning pipelines. Go to paper
  9. (2020) Data engineering for data analytics: a classification of the issues and case studies. This paper provides a description and classification of data engineering tasks (such as acquiring, understanding, cleaning, and preparing the data) into high-level groups, namely data organization, data quality, and feature engineering. Go to paper
  10. (2020) DevOps for AI - challenges in development of AI-enabled applications. This paper points out the challenges in development of complex systems that include ML components, and discusses possible solutions driven by the combination of DevOps and ML workflow processes. Industrial cases are presented to illustrate these challenges and the possible solutions. Go to paper
  11. (2020) Developments in MLflow: a system to accelerate the machine learning lifecycle. This paper discusses user feedback collected since MLflow was launched in 2018, as well as three major features introduced in response to this feedback. Go to paper
  12. (2020) Engineering AI systems: a research agenda. This paper presents a research agenda for AI engineering that provides an overview of the key engineering challenges surrounding ML solutions and an overview of open items that need to be addressed by the research community at large. Go to paper
  13. (2020) Explainable machine learning in deployment. This study explores how organizations view and use explainability for stakeholder consumption. Go to paper
  14. (2020) From what to how: an initial review of publicly available AI ethics tools, methods and research to translate principles into practices. This paper aims at contributing to closing the gap between principles and practices in Machine Learning by constructing a typology that may help practically-minded developers apply ethics at each stage of the Machine Learning development pipeline, and to signal to researchers where further work is needed. Go to paper
  15. (2020) Implicit provenance for machine learning artifacts. This paper presents an approach, called implicit provenance, where a distributed file system and APIs are instrumented to capture changes to ML artifacts, that, along with file naming conventions, mean that full lineage can be tracked for TensorFlow/Keras/Pytorch programs without requiring code changes. Go to paper
  16. (2020) Machine learning testing: survey, landscapes and horizons. This paper provides a comprehensive survey of Machine Learning Testing (ML testing) research. Go to paper
  17. (2020) MLModelCI: an automatic cloud platform for efficient MLaaS. This paper presents MLModelCI, a one-step platform for efficient machine learning (ML) services that leverages DevOps techniques to optimize, test, and manage models. It also containerizes and deploys these optimized and validated models as cloud services. Go to paper
  18. (2020) Monitoring and explainability of models in production. This paper discusses the challenges to successful implementation of solutions in key areas (such as model performance and data monitoring, detecting outliers and data drift using statistical techniques) with some recent examples of production ready solutions using open source tools. Go to paper
  19. (2020) Principles and practice of explainable machine learning. This paper focuses on data-driven methods - machine learning and pattern recognition models in particular - so as to survey and distill the results and observations from the literature about the following challenges: how do we understand the decisions suggested by these systems in order that we can trust them? Go to paper
  20. (2020) sensAI: fast ConvNets serving on live data via class parallelism. This paper presents sensAI, a novel and generic approach to faster inference on a single data item that distributes a single CNN into disconnected subnets and achieves decent serving accuracy with negligible communication overhead (1 float value). Go to paper
  21. (2020) Software engineering for artificial intelligence and machine learning software: a systematic literature review. This study aims to investigate how software engineering (SE) has been applied in the development of AI/ML systems and identify challenges and practices that are applicable and determine whether they meet the needs of professionals. Go to paper
  22. (2020) Software engineering patterns for machine learning applications (SEP4MLA). From 33 ML patterns, this paper describes three major ML architecture patterns and one ML design pattern in the standard pattern format so that practitioners can (re)use them in their contexts. Go to part 1 or part 2
  23. (2020) Simulating performance of ML systems with offline profiling. This paper advocates that simulation based on offline profiling is a promising approach to better understand and improve complex ML systems, and proposes an approach that uses operation-level profiling and dataflow-based simulation to ensure a unified and automated solution for all frameworks and ML models. Go to paper
  24. (2020) Towards automating the AI operations lifecycle. This paper presents a set of enabling technologies that can be used to increase the level of automation in AI operations, thus lowering the human effort required. Go to paper
  25. (2020) Towards CRISP-ML(Q): a machine learning process model with quality assurance methodology. This paper proposes a process model for the development of machine learning applications that guides machine learning practitioners and project organizations from industry and academia with a checklist of tasks that spans the complete project life-cycle. Go to paper
  26. (2020) Towards distribution transparency for supervised ML with oblivious training functions. This paper introduces the distribution oblivious training function as an abstraction for ML development in Python, whereby developers can reuse the same training function when running a notebook on a laptop or performing scale-out hyperparameter search and distributed training on clusters. Go to paper
  27. (2020) Towards ML engineering: a brief history of TensorFlow Extended (TFX). This paper gives a whirlwind tour of Sibyl and TensorFlow Extended (TFX), two successive end-to-end ML platforms at Alphabet. It also shares the lessons learned from over a decade of applied ML built on these platforms, and explains both their similarities and their differences. Go to paper
  28. (2019) Assuring the machine learning lifecycle: desiderata, methods, and challenges. This paper provides a comprehensive survey of the state-of-the-art in the assurance of ML, i.e., in the generation of evidence that ML is sufficiently safe for its intended use. Go to paper
  29. (2019) Continuous integration of machine learning models with ease.ml/ci: towards a rigorous yet practical treatment. This paper presents ease.ml/ci, a continuous integration system for machine learning to provide rigorous guarantees with a practical amount of labeling effort. Go to paper
  30. (2019) Challenges in the deployment and operation of machine learning in practice. In this work, the authors aim to systematically elicit the challenges in deployment and operation to enable broader practical dissemination of machine learning applications. Go to paper
  31. (2019) Overton: a data system for monitoring and improving machine-learned products. This paper describes a system called Overton, whose main design goal is to support engineers in building, monitoring, and improving production machine learning systems. Go to paper
  32. (2019) Studying software engineering patterns for designing machine learning systems. This paper collects good/bad software engineering design patterns for ML techniques to provide developers with a comprehensive classification of such patterns. Go to paper
  33. (2019) Towards automated ML model monitoring: measure, improve and quantify data quality. This paper focuses on the arising challenge of automating the operation of deployed ML applications, especially with respect to monitoring the quality of their input data. Go to paper
  34. (2018) A systems perspective to reproducibility in production machine learning domain. This paper presents a system that enables ML experts to track and reproduce ML models and pipelines in production. Go to paper
  35. (2018) Building a reproducible machine learning pipeline. This paper discusses some problems encountered while building a variety of machine learning models, and subsequently describes a framework to tackle the problem of model reproducibility. Go to paper
  36. (2018) On challenges in machine learning model management. This paper discusses a selection of ML use cases, develops an overview of conceptual, engineering, and data-processing related challenges arising in the management of the corresponding ML models, and points out future research directions. Go to paper
  37. (2018) Ease.ml in action: towards multi-tenant declarative learning services. This demo paper presents the design principles of ease.ml, highlights the implementation of its key components, and showcases how ease.ml can help ease machine learning tasks that often perplex even experienced users. Go to paper
  38. (2017) Clipper: a low-latency online prediction serving system. This paper introduces Clipper, a general-purpose low-latency prediction serving system that aims to simplify model deployment across frameworks and applications, reduce prediction latency, and improve prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. Go to paper
  39. (2017) Ease.ml: towards multi-tenant resource sharing for machine learning workloads. This paper presents ease.ml, a declarative machine learning service platform. Go to paper
  40. (2017) Data management challenges in production machine learning. This paper discusses data-management issues that arise in the context of machine learning pipelines deployed in production. Go to paper
  41. (2017) TFX: A TensorFlow-based production-scale machine learning platform. This paper presents TensorFlow Extended (TFX), a TensorFlow-based general-purpose machine learning platform implemented at Google to reduce the time to production from the order of months to weeks, while providing platform stability that minimizes disruptions. Go to paper
  42. (2016) ModelDB: a system for machine learning model management. This paper describes ModelDB, a novel end-to-end system for the management of machine learning models. Go to paper
  43. (2016) Scaling Machine Learning as a Service. This paper presents the scalable MLaaS built for Uber that operates globally. It focuses on several challenges, including: (i) how to scale feature computation for many machine learning use cases; (ii) how to build accurate models using global data; (iii) how to enable scalable model deployment and real-time serving for many models across multiple data centers. Go to paper
  44. (2016) What’s your ML test score? A rubric for ML production systems. This paper presents an ML Test Score rubric based on a set of actionable tests to help quantify a host of issues not found in small toy examples or even large offline research experiments. Go to paper
  45. (2015) Hidden technical debt in machine learning systems. This paper explores several ML-specific risk factors to account for in system design. Go to paper
  46. (2020) Towards complaint-driven ML workflow debugging. Go to paper
  47. (NA) PerfGuard: Deploying ML-for-Systems without Performance Regressions. Go to paper
  48. Addressing the Memory Bottleneck in AI Model-Training
  49. Reliance on Metrics is a Fundamental Challenge for AI
  50. Teaching Software Engineering for AI-Enabled Systems

Additional Resources

  1. Adversarial machine learning reading list
  2. Workshop at ICML 2020: "Challenges in Deploying and Monitoring Machine Learning Systems" (Accepted Papers)
  3. Workshop on MLOps Systems (MLSys)
  4. A survey on concept drift adaptation
  5. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
  6. Conversational Applications and Natural Language Understanding Services at Scale. Minh Tue Vo Thanh and Vijay Ramakrishnan.
  7. Efficient Scheduling of DNN Training on Multitenant Clusters. Deepak Narayanan, Keshav Santhanam, Amar Phanishayee and Matei Zaharia.
  8. MLBox: Towards Reproducible ML. Victor Bittorf, Xinyuan Huang, Peter Mattson, Debojyoti Dutta, David Aronchick, Emad Barsoum, Sarah Bird, Sergey Serebryakov, Natalia Vassilieva, Tom St. John, Grigori Fursin, Srini Bala, Sivanagaraju Yarramaneni, Alka Roy, David Kanter and Elvira Dzhuraeva.
  9. MLPM: Machine Learning Package Manager. Xiaozhe Yao.
  10. Tools for machine learning experiment management. Vlad Velici and Adam Prügel-Bennett.
  11. Towards split learning at scale: System design. Iker Rodríguez, Eduardo Muñagorri, Alberto Roman, Abhishek Singh, Praneeth Vepakomma and Ramesh Raskar.

Talks About MLOps

  1. DeliveryConf 2020. "Continuous Delivery For Machine Learning: Patterns And Pains" by Emily Gorcenski
  2. MLOps Conference: Talks from 2019
  3. A CI/CD Framework for Production Machine Learning at Massive Scale (using Jenkins X and Seldon Core)
  4. MLOps Virtual Event (Databricks)
  5. MLOps NY conference 2019
  6. MLOps.community YouTube Channel
  7. MLinProduction YouTube Channel
  8. Introducing MLflow for End-to-End Machine Learning on Databricks. Spark+AI Summit 2020. Sean Owen
  9. MLOps Tutorial #1: Intro to Continuous Integration for ML
  10. Machine Learning At Speed: Operationalizing ML For Real-Time Data Streams (2019)
  11. Damian Brady - The emerging field of MLops
  12. MLOps - Entwurf, Entwicklung, Betrieb (INNOQ Podcast in German)
  13. Instrumentation, Observability & Monitoring of Machine Learning Models
  14. Efficient ML engineering: Tools and best practices
  15. Beyond the jupyter notebook: how to build data science products
  16. An introduction to MLOps on Google Cloud (First 19 min are vendor-, language-, and framework-agnostic. @visenger)
  17. How ML Breaks: A Decade of Outages for One Large ML Pipeline
  18. Clean Machine Learning Code: Practical Software Engineering
  19. Machine Learning Engineering: 10 Fundamentale Praktiken
  20. Architecture of machine learning systems (3-part series)
  21. Machine Learning Design Patterns

Existing ML Systems

  1. Introducing FBLearner Flow: Facebook’s AI backbone
  2. TFX: A TensorFlow-Based Production-Scale Machine Learning Platform
  3. Getting started with Kubeflow Pipelines
  4. Meet Michelangelo: Uber’s Machine Learning Platform
  5. Meson: Workflow Orchestration for Netflix Recommendations
  6. What are Azure Machine Learning pipelines?
  7. Uber ATG’s Machine Learning Infrastructure for Self-Driving Vehicles
  8. An overview of ML development platforms
  9. Snorkel AI: Putting Data First in ML Development
  10. A Tour of End-to-End Machine Learning Platforms
  11. Introducing WhyLabs, a Leap Forward in AI Reliability
  12. Project: Ease.ml (ETH Zürich)
  13. Bodywork: model-training and deployment automation

Machine Learning

  1. Book: Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn and TensorFlow"
  2. Foundations of Machine Learning
  3. Best Resources to Learn Machine Learning
  4. Awesome TensorFlow
  5. "Papers with Code" - Browse the State-of-the-Art in Machine Learning
  6. Zhi-Hua Zhou. 2012. Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC.
  7. Feature Engineering for Machine Learning. Principles and Techniques for Data Scientists. By Alice Zheng, Amanda Casari
  8. Google Research: Looking Back at 2019, and Forward to 2020 and Beyond
  9. O’Reilly: The road to Software 2.0
  10. Machine Learning and Data Science Applications in Industry
  11. Curated papers, articles, and blogs on data science & machine learning in production.
  12. Deep Learning for Anomaly Detection
  13. Federated Learning for Mobile Keyboard Prediction
  14. Federated Learning. Building better products with on-device data and privacy by default
  15. Federated Learning: Collaborative Machine Learning without Centralized Training Data
  16. Yang, Q., Liu, Y., Cheng, Y., Kang, Y., Chen, T. and Yu, H., 2019. Federated learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 13(3). Chapters 1 and 2.
  17. Federated Learning by FastForward
  18. THE FEDERATED & DISTRIBUTED MACHINE LEARNING CONFERENCE
  19. Federated Learning: Challenges, Methods, and Future Directions
  20. Book: Molnar, Christoph. "Interpretable machine learning. A Guide for Making Black Box Models Explainable", 2019
  21. Book: Hutter, Frank, Lars Kotthoff, and Joaquin Vanschoren. "Automated Machine Learning". Springer, 2019.
  22. ML resources by topic, curated by the community.
  23. An Introduction to Machine Learning Interpretability, by Patrick Hall, Navdeep Gill, 2nd Edition. O'Reilly 2019
  24. Examples of techniques for training interpretable machine learning (ML) models, explaining ML models, and debugging ML models for accuracy, discrimination, and security.
  25. Paper: "Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence", by Sebastian Raschka, Joshua Patterson, and Corey Nolet. 2020
  26. Distill: Machine Learning Research
  27. AtHomeWithAI: Curated Resource List by DeepMind
  28. Awesome Data Science
  29. Intro to probabilistic programming. A use case using TensorFlow Probability (TFP)
  30. Dive into Snorkel: Weak-Supervision on German Texts. inovex Blog
  31. Dive into Deep Learning. An interactive deep learning book with code, math, and discussions. Provides NumPy/MXNet, PyTorch, and TensorFlow implementations
  32. Data Science Collected Resources (GitHub repository)
  33. A resource list for causality in statistics, data science and physics
  34. Set of illustrated Machine Learning cheatsheets
  35. "Machine Learning Bookcamp" by Alexey Grigorev
  36. 130 Machine Learning Projects Solved and Explained
  37. Machine learning cheat sheet
  38. Stateoftheart AI. An open-data and free platform built by the research community to facilitate the collaborative development of AI
  39. Online Machine Learning Courses: 2020 Edition
  40. End-to-End Machine Learning Library
  41. Machine Learning Toolbox (by Amit Chaudhary)
  42. Causality for Machine Learning

Software Engineering

  1. The Twelve Factors
  2. Book "Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations", 2018 by Nicole Forsgren et al.
  3. Book "The DevOps Handbook" by Gene Kim, et al. 2016
  4. State of DevOps 2019
  5. Clean Code concepts adapted for machine learning and data science.
  6. School of SRE

Product Management for ML/AI

  1. What you need to know about product management for AI. A product manager for AI does everything a traditional PM does, and much more.
  2. Bringing an AI Product to Market. Previous articles have gone through the basics of AI product management. Here we get to the meat: how do you bring a product to market?
  3. The People + AI Guidebook
  4. User Needs + Defining Success
  5. Building machine learning products: a problem well-defined is a problem half-solved.
  6. Talk: Designing Great ML Experiences (Apple)
  7. Machine Learning for Product Managers
  8. Understanding the Data Landscape and Strategic Play Through Wardley Mapping

The Economics of ML/AI

  1. Book: "Prediction Machines: The Simple Economics of Artificial Intelligence"
  2. Book: "The AI Organization" by David Carmona
  3. Book: "Succeeding with AI". 2020. By Veljko Krunic. Manning Publications
  4. A list of articles about AI and the economy
  5. Gartner AI Trends 2019
  6. Global AI Survey: AI proves its worth, but few scale impact
  7. Getting started with AI? Start here! Everything you need to know to dive into your project
  8. 11 questions to ask before starting a successful Machine Learning project
  9. What AI still can’t do
  10. Demystifying AI Part 4: What is an AI Canvas and how do you use it?
  11. A Data Science Workflow Canvas to Kickstart Your Projects
  12. Is your AI project a nonstarter? Here’s a reality check(list) to help you avoid the pain of learning the hard way
  13. What is THE main reason most ML projects fail?
  14. Designing great data products. The Drivetrain Approach: A four-step process for building data products.
  15. The New Business of AI (and How It’s Different From Traditional Software)
  16. The idea maze for AI startups
  17. The Enterprise AI Challenge: Common Misconceptions
  18. Misconception 1 (of 5): Enterprise AI Is Primarily About The Technology
  19. Misconception 2 (of 5): Automated Machine Learning Will Unlock Enterprise AI
  20. Three Principles for Designing ML-Powered Products
  21. A Step-by-Step Guide to Machine Learning Problem Framing
  22. AI adoption in the enterprise 2020
  23. How Adopting MLOps can Help Companies With ML Culture?
  24. Weaving AI into Your Organization
  25. What to Do When AI Fails
  26. Introduction to Machine Learning Problem Framing
  27. Structured Approach for Identifying AI Use Cases
  28. Book: "Machine Learning for Business" by Doug Hudgeon, Richard Nichol, O'Reilly
  29. Why Commercial Artificial Intelligence Products Do Not Scale (FemTech)
  30. Google Cloud’s AI Adoption Framework (White Paper)
  31. Data Science Project Management
  32. Book: "Competing in the Age of AI" by Marco Iansiti, Karim R. Lakhani. Harvard Business Review Press. 2020
  33. Laszlo Sragner Newsletter
  34. The Three Questions about AI that Startups Need to Ask. The first is: Are you sure you need AI?
  35. Taming the Tail: Adventures in Improving AI Economics
  36. Managing the Risks of Adopting AI Engineering
  37. Get rid of AI Saviorism
  38. Collection of articles listing reasons why data science projects fail
  39. How to Choose Your First AI Project by Andrew Ng
  40. How to Set AI Goals
  41. Expanding AI's Impact With Organizational Learning
  42. Potemkin Data Science

Model Governance, Ethics, Responsible AI

This content will be extracted into our new Awesome ML Model Governance repository.

  1. Book: "Practical Fairness". 2020. By Aileen Nielsen. O'Reilly Media, Inc.
  2. Book: "Fairness and machine learning: Limitations and Opportunities." Barocas, S., Hardt, M. and Narayanan, A., 2018.
  3. What are model governance and model operations? A look at the landscape of tools for building and deploying robust, production-ready machine learning models
  4. Specialized tools for machine learning development and model governance are becoming essential. Why companies are turning to specialized machine learning tools like MLflow.
  5. What are model governance and model operations? – O’Reilly
  6. AI Fairness 360, A Step Towards Trusted AI - IBM Research
  7. Responsible AI
  8. Learn how to integrate Responsible AI practices into your ML workflow using TensorFlow
  9. ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT)
  10. Programming Fairness in Algorithms. Understanding and combating issues of fairness in supervised learning.
  11. Secure, privacy-preserving and federated machine learning in medical imaging
  12. Artificial intelligence and machine learning security (by Microsoft). The references therein are useful.
  13. Evtimov, Ivan, Weidong Cui, Ece Kamar, Emre Kiciman, Tadayoshi Kohno, and Jerry Li. "Security and Machine Learning in the Real World." arXiv (2020).
  14. Explainable AI (Gartner Prediction for 2023)
  15. What We've Learned to Control. By Ben Recht
  16. State of AI Ethics June 2020 Report by the Montreal AI Ethics Institute
  17. Practical Data Ethics
  18. Vasudevan, Sriram and Kenthapadi, Krishnaram. "LiFT: A Scalable Framework for Measuring Fairness in ML Applications" (2020) - Code: The LinkedIn Fairness Toolkit (LiFT)
  19. Four Principles of Explainable Artificial Intelligence (NIST Draft). Phillips, P.J., Hahn, A.C., Fontana, P.C., Broniatowski, D.A. and Przybocki, M.A., 2020.
  20. Data Ethics Canvas. Helps identify and manage ethical issues – at the start of a project that uses data, and throughout. Also see Ethics Canvas for broader scope.
  21. ABOUT ML - Annotation and Benchmarking on Understanding and Transparency of Machine learning Lifecycles.
  22. Mitchell, Margaret and Wu, Simone and Zaldivar, Andrew and Barnes, Parker and Vasserman, Lucy and Hutchinson, Ben and Spitzer, Elena and Raji, Inioluwa Deborah and Gebru, Timnit. "Model Cards for Model Reporting" (2019) - Code: Model Card Toolkit
  23. Navigate the road to Responsible AI – Gradient Flow Blog
  24. Machine Learning Systems: Security
  25. 😈 Awful AI is a curated list to track current scary usages of AI - hoping to raise awareness
  26. Seven legal questions for data scientists
  27. 2020 in Review: 8 New AI Regulatory Proposals from Governments

MLOps: People & Processes

  1. Scaling An ML Team (0–10 People)

Newsletters About MLOps, Machine Learning, Data Science and Co.

  1. ML in Production newsletter
  2. MLOps.community
  3. Andriy Burkov newsletter
  4. Decision Intelligence by Cassie Kozyrkov
  5. Laszlo's Newsletter about Data Science
  6. Data Elixir newsletter for a weekly dose of the top data science picks from around the web. Covering machine learning, data visualization, analytics, and strategy.
  7. The Data Science Roundup by Tristan Handy
  8. Vicki Boykis Newsletter about Data Science
  9. KDnuggets News
  10. Analytics Vidhya: business analytics, data science, big data, and data visualization tools and techniques
  11. Data Science Weekly Newsletter: A free weekly newsletter featuring curated news, articles and jobs related to Data Science
  12. The Machine Learning Engineer Newsletter
  13. Gradient Flow helps you stay ahead of the latest technology trends and tools with in-depth coverage, analysis and insights. See the latest on data, technology and business, with a focus on machine learning and AI
  14. Your guide to AI by Nathan Benaich. Monthly analysis of AI technology, geopolitics, research, and startups.
  15. O'Reilly Data & AI Newsletter
  16. deeplearning.ai’s newsletter by Andrew Ng
  17. Deep Learning Weekly
  18. Import AI is a weekly newsletter about artificial intelligence, read by more than ten thousand experts. By Jack Clark.
  19. AI Ethics Weekly
  20. Announcing Projects To Know, a weekly machine intelligence and data science newsletter
  21. TWIML: This Week in Machine Learning and AI newsletter
  22. featurestore.org: Monthly Newsletter on Feature Stores for ML
  23. DataTalks.Club Community: Slack, Newsletter, Podcast, Weekly Events
  24. Machine Learning Ops Roundup
  25. Data Science Programming Newsletter by Eric Ma
