Machine learning operations, or MLOps, refers to the principles, practices, culture, and tools that enable organizations to develop, deploy, and maintain production machine learning and AI systems. It combines aspects of machine learning, data engineering, software engineering, and site operations to create a more efficient and reliable workflow for machine learning (ML) projects. MLOps emphasizes automation, collaboration, and continuous improvement throughout the entire ML lifecycle, from exploratory analysis, data preparation, and model development to deployment, monitoring, and ongoing optimization.
The figure below illustrates a characteristic machine learning discovery workflow, showing how ML teams turn experiments into production systems. This workflow features seven stages, each of which informs the next, and four prototypical personas: data scientists, who find and exploit structure in data to solve business problems; business analysts, who characterize structured data with queries and dashboards; data engineers, who make data available, reliable, and clean at scale; and engineering teams, who build and maintain production systems. Because software development in general is iterative, ML teams can revisit decisions made in earlier stages based on what they learn later in the process.
MLOps is an extension of the existing discipline of DevOps, the modern practice of efficiently writing, deploying, and running enterprise applications. The central insight of DevOps is that building, maintaining, and operating applications are more deeply interrelated than software teams had previously understood; therefore, it doesn’t make sense for development and operations or IT teams to exist in silos. Some of the characteristic advances of the DevOps movement include:

- Continuous integration and continuous delivery (CI/CD), so changes are built, tested, and released in small, frequent increments
- Infrastructure as code, which makes environments reproducible and auditable
- Automated testing and monitoring, which catch defects early and surface problems in production
- A culture of shared ownership between development and operations teams
Ideally, these DevOps practices lead to greater team velocity, higher quality, and greater application reliability. They also make it possible for teams building complex distributed applications to mitigate the impact of changes and defects. Since machine learning systems are, at heart, complex software systems, these methods apply to developing machine learning systems as well. However, machine learning systems raise additional concerns, like managing data at rest and in motion, designing and tracking more involved experiments as part of both development and production, and providing a pathway to realize the insights from those experiments in production software.
Machine learning systems present formidable engineering challenges. Datasets are massive and growing, and they can change in real time. Messy or shifting data can dramatically affect the predictive performance of an ML system. AI models require careful tracking through cycles of experiments, tuning, and retraining. MLOps needs a powerful AI infrastructure that can scale as companies grow. For this foundation, many companies use the NVIDIA DGX™ platform and NVIDIA AI Enterprise, which includes AI tools and frameworks like TAO Toolkit, NVIDIA Triton Inference Server™, RAPIDS, and more.
It’s easy to get lost in the forest of AI operations terminology, but one thing is clear: the industry has unified around MLOps.
By contrast, AIOps is a narrower practice of using machine learning to automate IT functions. One part of AIOps is IT operations analytics, or ITOA, which examines the data AIOps generates to figure out how to improve IT practices.
Similarly, some have coined the phrases DataOps and ModelOps to refer to the people and processes for creating and managing datasets and AI models, respectively. Those are two important pieces of the overall MLOps puzzle; at NVIDIA, we use these terms to describe categories of MLOps tools.
Interestingly, thousands of people search for the meaning of DLOps every month. While some might assume DLOps means IT operations for deep learning, the industry uses the term MLOps instead, since deep learning is part of the broader field of machine learning.
MLOps and GenAIOps are both operational frameworks for AI technologies, but they differ significantly in their focus and scope. MLOps is the overarching concept covering the core tools, processes, and best practices for end-to-end machine learning system development and operations in production. GenAIOps extends MLOps to develop and operationalize generative AI solutions. The distinguishing characteristic of GenAIOps is the management of, and interaction with, a foundation model.
While MLOps primarily deals with traditional machine learning models, focusing on model training, deployment, monitoring, and performance optimization, GenAIOps is tailored specifically to generative AI technologies and the unique challenges of managing and operationalizing them.
Let’s look at the ML workflow again to see the kinds of tools that support each stage of this process:
End-to-end platforms represent the topmost category, including ML platforms that incorporate a control plane and support for several lifecycle phases. It’s important to note that the term “end-to-end” isn't a value judgment or statement of completeness but merely indicates that a particular offering covers a broad slice of the lifecycle and is designed to be operated by itself. Since ML platforms represent an integrated solution, they’re a great place to start your MLOps journey.
Data wrangling tools include data exploration, visualization, and high-level federation capabilities, conventional business intelligence and analytics solutions, and labeling technologies for unstructured data, like natural-language text, speech, images, or video. The types of problems you're solving will determine which of these resources are most relevant to your workflows.
Interactive development solutions provide a control plane to give data science and ML practitioners access to on-demand compute resources. These often provide a facility for managing development environments and integrate with external version control systems, desktop IDEs, and other standalone developer tools, facilitating collaboration within teams.
Experiment management offerings provide a way to track results from various model configurations, along with versioned code and data, to understand modeling performance over time. AutoML systems build on experiment management to automatically search the space of candidate techniques and their hyperparameters, producing a trained model with minimal practitioner input.
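To make this concrete, here is a minimal sketch of experiment tracking using the open-source MLflow library, one of many trackers in this category. The model, the tiny hyperparameter sweep, and the dataset are illustrative placeholders, not a recommended setup.

```python
# Minimal experiment-tracking sketch using the open-source MLflow library.
# The model, hyperparameter sweep, and dataset are illustrative placeholders.
import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for n_estimators in (10, 50, 100):  # a tiny hyperparameter sweep
    with mlflow.start_run():
        mlflow.log_param("n_estimators", n_estimators)
        model = RandomForestClassifier(n_estimators=n_estimators)
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        mlflow.log_metric("accuracy", acc)  # comparable across runs in the UI
```

An AutoML system effectively automates this loop: it proposes candidate configurations, records each run, and returns the best-performing model.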
Data management frameworks support data warehousing, versioning, provenance, ingest, and access control. Data versioning and data provenance are critical components of building reproducible ML systems.
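As a simple illustration of what data versioning provides, a dataset snapshot can be identified by a content hash, so any change to the data yields a new, traceable version. The sketch below is a hand-rolled stand-in for dedicated tools such as DVC or lakeFS; the file and registry names are hypothetical.

```python
# Toy data-versioning sketch: identify a dataset snapshot by a content hash
# so experiments can record exactly which data they used. Dedicated tools
# (e.g., DVC, lakeFS) add remote storage, diffing, and lineage on top.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def dataset_version(path: str) -> str:
    """Return a short content hash that uniquely identifies the file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()[:12]

def record_version(path: str, registry: str = "data_versions.json") -> str:
    """Append (version, path, timestamp) to a JSON registry for provenance."""
    version = dataset_version(path)
    reg = Path(registry)
    entries = json.loads(reg.read_text()) if reg.exists() else []
    entries.append({
        "version": version,
        "path": path,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })
    reg.write_text(json.dumps(entries, indent=2))
    return version
```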
Feature management and pipeline management tools provide two complementary approaches to enabling data processing, development-to-production workflows, and collaboration. Feature stores enable users to track derived, aggregated, or expensive-to-compute features for development and production, along with their provenance. Pipeline management solutions provide a way to declare reproducible workflows that generate data and models, manage orchestration, and monitor the multiple software components involved in exploratory and production workflows.
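The core pattern behind a feature store is small: register a feature once, with its provenance, and serve it identically to training and production. The FeatureStore, register, and get_features names below are hypothetical, chosen only to illustrate that pattern, not taken from any particular product.

```python
# Hypothetical feature-store sketch: features are registered once, with
# provenance, and served the same way to training and inference, which
# helps avoid training/serving skew.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class FeatureStore:
    _features: Dict[str, dict] = field(default_factory=dict)

    def register(self, name: str, fn: Callable, source: str) -> None:
        """Register a derived feature along with a note on where it comes from."""
        self._features[name] = {"fn": fn, "source": source}

    def get_features(self, names: List[str], entity: dict) -> dict:
        """Compute the requested features for one entity (e.g., a customer)."""
        return {n: self._features[n]["fn"](entity) for n in names}

store = FeatureStore()
store.register(
    "order_count_30d",
    fn=lambda e: len(e.get("recent_orders", [])),
    source="orders table, 30-day window",
)

# The same call is used when building training sets and at prediction time.
features = store.get_features(["order_count_30d"], {"recent_orders": [101, 102]})
```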
ModelOps platforms address the concerns of publishing models as deployable services, managing and scaling these services, and monitoring their outputs, particularly for detecting data drift.
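A common drift check compares the distribution of an input feature in recent production traffic against the training distribution, for example with a two-sample Kolmogorov-Smirnov test. In this sketch the data is synthetic and the 0.05 significance threshold is an illustrative choice; production monitors typically track many features with more elaborate alerting rules.

```python
# Sketch of a simple data-drift check: compare a feature's production
# distribution against its training distribution with a two-sample
# Kolmogorov-Smirnov test. Data and threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_values = rng.normal(loc=0.0, scale=1.0, size=5_000)    # reference
production_values = rng.normal(loc=0.4, scale=1.0, size=5_000)  # shifted

statistic, p_value = ks_2samp(training_values, production_values)
if p_value < 0.05:
    print(f"Possible drift detected (KS statistic = {statistic:.3f})")
```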
Infrastructure management provides an interface to schedule compute jobs and services on underlying hardware or cloud resources. For ML in particular, key capabilities include reserving multiple nodes for training jobs and requesting resources with specific memory capacities or GPUs.
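As one concrete example, a training job on a Kubernetes cluster can declare exactly the resources it needs, such as one GPU and a specific memory capacity. The sketch below uses the official Kubernetes Python client; the image name, pod name, and namespace are placeholders, and it assumes a reachable cluster with NVIDIA's device plugin installed.

```python
# Sketch: request one GPU and 32 GiB of memory for a training pod on
# Kubernetes via the official Python client. Image, names, and namespace
# are placeholders; assumes a cluster with the NVIDIA device plugin.
from kubernetes import client, config

config.load_kube_config()  # assumes a local kubeconfig with cluster access

container = client.V1Container(
    name="trainer",
    image="registry.example.com/train:latest",  # placeholder image
    command=["python", "train.py"],
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "1", "memory": "32Gi"},
    ),
)
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="training-job"),
    spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```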
For more details on these categories, see Demystifying Enterprise MLOps on NVIDIA’s technical blog.
MLOps offers numerous benefits for organizations implementing machine learning projects, delivering significant advantages across multiple dimensions:

- Faster time to production, because repeatable pipelines replace ad hoc handoffs between teams
- Higher model quality and reliability, through systematic testing, monitoring, and retraining
- Reproducibility and auditability, since code, data, and experiments are versioned and tracked
- Better collaboration among data scientists, engineers, and business stakeholders
- Scalability, allowing many models to be developed, deployed, and maintained consistently
MLOps faces several key technical challenges as organizations strive to implement and scale machine learning operations, because different ML problems impose different requirements on ML systems:

- Problems involving unstructured data, like understanding video, audio, or natural language, require far more effort, including manual human effort, to label training examples than problems involving tabular business data, where labeling can often be trivial or automated.
- Some problems can be solved with a model that only needs data immediately available from a single source, while others depend on federating historical and aggregated data from multiple sources with a new observation to make a prediction.
- Novel applications of ML may benefit from better support for experimental and exploratory development, while mature systems may benefit more from development process automation.
- Systems that automate critical decisions that can affect human lives, control dangerous machinery, or manage financial portfolios need to be simulated in a range of conditions, including unlikely or adversarial scenarios, to validate their suitability and safety (see the sketch after this list).

If you're working with problems that imply special requirements, make sure you land on an MLOps solution that can help you meet those requirements.
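On that last point, one basic validation step is to check that a model's predictions remain stable when its inputs are perturbed slightly. The sketch below illustrates the idea only; the model, noise level, and flip-rate metric are illustrative stand-ins for a real validation suite, which would also cover rare events and adversarial scenarios.

```python
# Minimal stress-test sketch: measure how often predictions flip when
# inputs are perturbed with small random noise. A real validation suite
# covers far more conditions (rare events, adversarial inputs, etc.).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

rng = np.random.default_rng(0)
baseline = model.predict(X)
flip_rates = []
for _ in range(100):  # 100 noisy replays of the evaluation set
    noisy = X + rng.normal(scale=0.05, size=X.shape)  # illustrative noise
    flip_rates.append(np.mean(model.predict(noisy) != baseline))

print(f"Mean prediction flip rate under noise: {np.mean(flip_rates):.3%}")
```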
More than 100 MLOps software and service providers are working with NVIDIA. The software providers include:
Cloud service providers have integrated MLOps into NVIDIA-accelerated platforms, including: