
What Does the ML Lifecycle Look Like for LLMs in Practice?

I’m not going to say I was a full-blown deep learning denier, but I was stubbornly holding out to avoid getting swept up in the LLM hype. While I find almost all ML research interesting, I am most enamored with ML systems research, which is dedicated to the practical implementation of ML models. And it seemed like generative AI was being touted more as a fun toy to “see what it could do” than as something feasible to put into production. 

As usual, the LinkedIn posts started to pour in. However, it wasn’t until recently, when those same posts began to shift from “Wow, look at this response” to “Wow, look at this application I built,” that I realized I could no longer keep my head in the sand. Sooner than I expected, these applications would become part of the ML systems work I am so interested in. So, I spent some time trying to find out what makes the development of these models different from that of typical ML models. 

What Does the Traditional ML Lifecycle Look Like? 

Over the years, researchers have used many different formats to describe the steps teams go through to build an ML model. I will not detail each step here; the diagram below gives an overview.

We can see in the diagram two core pipelines that work to put ML models into production (and one version-controlled layer that helps to manage them). The first development pipeline is where data scientists and research teams work to develop the model. In the second pipeline, ML Engineers and developers work to take their model at the final stage of development and productionalize it into something that can be used to make real-world predictions. 

What Does an LLM Lifecycle Look Like?

Although LLMs are trying to carve out their own phrase within MLOps (LLMOps), it’s important to remember that they’re still machine learning systems. This means that even if they use different tools or phrases, they still follow most of the same lifecycle and best practices. 

I’ve broken down the LLM lifecycle into three phases, seen in the drawing below. In the following sections, I will give a high-level overview of each phase and how it relates back to the traditional ML lifecycle. 

Foundation Models 

On the left, we can see the most famous aspect of these new LLM systems, the foundation models. Foundation models are large pre-trained language models capable of understanding and generating natural language text. This ability to perform general language tasks well enables these models to serve as the starting point for various NLP tasks, such as language translation, text summarization, and question-answering. 

Foundation models are the most significant shift away from the traditional ML lifecycle. API access to foundation models has made it easier for teams to leverage NLP in their operations, regardless of industry or size. They can implement just the API as their production model—skipping most of the ML lifecycle workflow—if they are just looking for general language task capabilities.

This is a standard route we see organizations go down as a first iteration of putting an LLM into production. However, this is no longer remarkable, as any company can create an OpenAI account. Instead, users seek customized experiences based on the specific use case for the model they interact with. To do this, ML teams need to use many of their existing techniques for ML model development to fine-tune and improve their model based on its specific use case. 

Note: Many foundation models are closed source, meaning the code and data used to train them are not publicly available. Without access to this code and data, it is difficult to verify the model’s accuracy and biases. This can lead to unintended consequences and perpetuate existing biases in the data. 


Development

The development phase is where ML practitioners build and improve upon these use case–specific ML models. As we can see from the diagrams above, it exists in both the traditional and LLM lifecycles. However, one key difference is that there are currently fewer development steps for LLMs. For example, teams do not need to select and test different model architectures. While development for LLMs will undoubtedly continue to advance, the workflow is currently streamlined into three main steps: 

Defining the Use Case

While not explicitly listed in the diagram above, the first step to building any worthwhile ML model for production, LLM or not, is to define and understand your use case. Teams will need to spend time with business and product stakeholders to understand the purpose of the model they are putting into production. 

Data Curation & Model Fine-Tuning

Data science teams must curate and clean use case–specific data to build out use case–specific LLMs. This data will be used to train/fine-tune the foundation model’s language understanding to their task requirements. 

Cleaning and curating data is something that data science teams are used to, as it is a part of their traditional ML lifecycle. However, one benefit of using LLMs over traditional ML techniques is that they already have a solid understanding of language from all the data they were originally trained on. Because the new data is used only to fine-tune an existing large model, teams do not need to curate and clean as much of it. 
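
As a rough illustration of what curating fine-tuning data can involve, the sketch below normalizes whitespace, drops empty or very short examples, and removes exact duplicates. The record shape (`prompt` and `completion` keys), the `min_chars` threshold, and the function name are assumptions made for this example, not a required format for any particular provider.

```python
import re


def curate_examples(records, min_chars=10):
    """Clean and de-duplicate prompt/completion records for fine-tuning.

    `records` is a list of dicts with hypothetical "prompt" and
    "completion" keys.
    """
    seen = set()
    cleaned = []
    for record in records:
        # Collapse runs of whitespace and trim both fields.
        prompt = re.sub(r"\s+", " ", record.get("prompt", "")).strip()
        completion = re.sub(r"\s+", " ", record.get("completion", "")).strip()
        # Skip examples with no prompt or a completion too short to be useful.
        if not prompt or len(completion) < min_chars:
            continue
        # Case-insensitive exact-duplicate check.
        key = (prompt.lower(), completion.lower())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"prompt": prompt, "completion": completion})
    return cleaned
```

Real curation usually goes further (near-duplicate detection, PII scrubbing, label review), but the shape of the work is the same as in a traditional ML pipeline.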

Qualitative Validation with Prompts

Similar to all traditional ML models, these models must be tested and validated before they can interact with the world in production. In traditional ML model lifecycles, this is done with the help of well-established historical benchmarks and metrics. As we will cover in a future post, this is different with LLMs. 

Instead, teams must understand the use case enough to create realistic tests and adversarial prompts to evaluate the model. They then can use metrics built to quantify essential qualities of the text (such as tone or context) against example responses provided for each prompt. Additionally, teams may choose to qualitatively assess the model’s performance based on their understanding of human language and use cases.
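
To make this concrete, here is a minimal sketch of a prompt-based test suite, assuming each test case pairs a prompt with an example response. The token-overlap F1 score is a crude stand-in for the richer text-quality metrics described above, and `model_fn`, the case keys, and the `threshold` value are all hypothetical names chosen for this illustration.

```python
from collections import Counter


def token_f1(candidate, reference):
    """Token-overlap F1 between a model response and an example response."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(model_fn, test_cases, threshold=0.5):
    """Run every prompt through model_fn and flag low-scoring responses."""
    results = []
    for case in test_cases:
        response = model_fn(case["prompt"])
        score = token_f1(response, case["expected"])
        results.append(
            {"prompt": case["prompt"], "score": score, "passed": score >= threshold}
        )
    return results
```

Here `model_fn` can be any callable that takes a prompt string and returns a response string, so the same suite can be pointed at a fine-tuned model, a baseline, or a stub. Cases that fail the threshold are good candidates for the qualitative review mentioned above.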

Application Schema 

The final block in our LLM lifecycle diagram is the application schema. This refers to how your LLM is implemented and interacted with in production, and it is similar to the productionalization process that ML Engineers and developers go through when implementing traditional ML models. 

In practice, this is often built with prompt orchestration, where multiple prompts are chained together. Before getting into application schemas where multiple prompts interact, let’s look at what this would look like for an application with a single user prompt—like a chatbot.  

The process of constructing a single prompt is more complicated in production than just taking in the user’s input. One concept of LLMs that is different from traditional ML models is that beyond fine-tuning the model, prompt engineers are able to “fine-tune” users’ inputs to the application at the time of inference with a prompt template. 

Taken from LangChain Prompt Template article mentioned later in this post.

Prompt Template

Prompt engineers write a prompt template as an additional step of model “fine-tuning.” Traditionally, we think of fine-tuning as adding parametric knowledge to the model. This is the knowledge that a model learns at training time and is stored within the model weights (or parameters). 

On the other hand, prompt templates work as source knowledge added to the model. This is knowledge added during inference via inputs to the model. They provide additional information on top of users’ input requests during inference. This information typically includes additional background information, context, and rules for model responses. For a deeper dive into prompt templates, check out this article.
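
A prompt template can be sketched with nothing more than Python’s standard library. The example below assumes a support-assistant use case; the template wording, the `ExampleApp` product name, and the `build_prompt` helper are invented for illustration and are not taken from LangChain or any other framework.

```python
from string import Template

# Hypothetical template: background, rules, and a slot for the user's input.
SUPPORT_TEMPLATE = Template(
    "You are a support assistant for $product.\n"
    "Answer only from the context below. "
    "If the answer is not there, say you do not know.\n\n"
    "Context:\n$context\n\n"
    "Question: $question\n"
    "Answer:"
)


def build_prompt(question, context, product="ExampleApp"):
    """Wrap the raw user question with source knowledge before inference."""
    return SUPPORT_TEMPLATE.substitute(
        product=product,
        context=context.strip(),
        question=question.strip(),
    )


print(build_prompt("How do I export data?", "Exports live under Settings > Data."))
```

The user only ever types the question; the background, context, and rules are added at inference time, which is what distinguishes this source knowledge from the parametric knowledge stored in the model weights.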

Prompt Orchestration 

Prompt orchestration refers to the chaining of prompts together interactively. Some applications, like chatbots, may be simple enough application-wise to work with one core prompt. However, in production, many LLM use cases are actually much more complex. 

This article does a great job introducing LangChain—a popular prompt orchestrator—by using the metaphor of baking a cake. You can ask a chatbot to provide you with the ingredients for the cake, but that’s not actually very useful for your end goal of a finished cake. Instead, models need to be able to use that prompt in conjunction with other prompts and actions to get the end result they are looking for. 
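
The core idea of chaining can be sketched in plain Python without any orchestration library. In the sketch below, each step fills a prompt from the outputs of earlier steps; `run_chain`, the step format, and the stub `echo_llm` are assumptions made for this example and do not reflect LangChain’s actual API.

```python
def run_chain(llm, steps, inputs):
    """Run prompt steps in order, feeding each output into later prompts.

    `llm` is any callable mapping a prompt string to a response string.
    Each step is (output_key, prompt_template); templates use str.format fields.
    """
    state = dict(inputs)
    for output_key, template in steps:
        prompt = template.format(**state)
        state[output_key] = llm(prompt)
    return state


# Hypothetical two-step chain following the cake metaphor.
CAKE_STEPS = [
    ("ingredients", "List the ingredients for a {cake} cake."),
    ("instructions", "Write baking steps using only these ingredients: {ingredients}"),
]


def echo_llm(prompt):
    # Stand-in for a real model call; returns a tagged copy of the prompt.
    return f"<response to: {prompt}>"


result = run_chain(echo_llm, CAKE_STEPS, {"cake": "chocolate"})
print(result["instructions"])
```

A production orchestrator adds much more (tool calls, branching, retries, output validation), but the state passed from one prompt to the next is the part that turns a list of ingredients into a finished cake.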

The article also provides a more “real world” example, shown below, for pulling information from generated SQL queries. For those interested, you can also see a real-world example utilizing Arthur written by one of our ML Engineers, who built a chatbot to interact with our documentation.

Taken from LangChain Prompt Orchestration article mentioned above.

Note: For traditional ML models, there has been a push in the community to recognize the data pipeline, and not the model itself, as the logical unit of ML work. What I find interesting about LLMs in production is that it seems there will be a similar push to view prompt orchestration as the logical unit of LLM work. 

This would make chaining prompts together in practice more similar to how data engineers orchestrate ETL pipelines. Prompt engineers and application designers will need to spend more time and effort defining how these flowcharts will look and how outputs will be validated and monitored.  

User Feedback

Measuring the performance of generative models in production can be an even greater challenge than the already mentioned challenge of validating the model during development. Teams must navigate practical constraints, such as the infeasibility of scaling human labelers to generate common metrics. One approach that has proven successful for teams is tracking user feedback. 

This feedback provides valuable insights into how well their model is performing, enabling teams to continuously fine-tune and improve model performance. The specific techniques used will depend on the nature of the feedback and desired outcomes for the specific model. 
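
As one possible shape for this, the sketch below aggregates thumbs-up/thumbs-down events per prompt template version. The event keys (`template_version`, `thumbs_up`) and the function name are hypothetical; real feedback signals may be ratings, edits, or free-text comments instead.

```python
from collections import defaultdict


def summarize_feedback(events):
    """Aggregate thumbs-up/down events per prompt template version.

    Each event is a dict with hypothetical keys "template_version" (str)
    and "thumbs_up" (bool).
    """
    counts = defaultdict(lambda: {"up": 0, "total": 0})
    for event in events:
        bucket = counts[event["template_version"]]
        bucket["total"] += 1
        if event["thumbs_up"]:
            bucket["up"] += 1
    return {
        version: {"responses": c["total"], "approval_rate": c["up"] / c["total"]}
        for version, c in counts.items()
    }
```

Grouping by template version (or by model version) makes it possible to compare changes over time, and low-rated interactions can be fed back into the curation and validation steps described earlier.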

Conclusion & A Look Ahead

In conclusion, although LLMs are getting their own fancy suite of new tools and job titles, they are still rooted in the best practices and techniques that the ML community has been using for years. 

  • Foundation Models: serve as a jumpstart for teams to develop strong baseline NLP models
  • Development: teams still need to fine-tune and evaluate models for a specific end task, even if there are new techniques and job titles
  • Application Schema: the process for putting LLMs into production still needs to be validated and monitored, even if it relies on new tools and prompts

LLMs are most definitely finding their way into production systems near you, and fast. Hopefully, this was an informative first look into how they fit into the frameworks that teams already use for their traditional ML approaches. We’re busy at Arthur helping folks build with LLMs, so stay tuned for more related content soon.


How do LLMs compare with traditional ML models in terms of computational resources and environmental impact?

Large Language Models (LLMs) generally require significantly more computational resources for training compared to traditional machine learning models. This is due to their vast number of parameters, extensive datasets, and the complexity of the tasks they perform. For instance, models like GPT-3 have billions of parameters and require substantial amounts of data and processing power to train effectively. This increased computational requirement translates to higher energy consumption and, consequently, a larger environmental impact. The carbon footprint associated with training and operating LLMs is a concern, as it contributes to greenhouse gas emissions. In contrast, traditional ML models, which might focus on more constrained tasks and possess fewer parameters, typically require less computational power, leading to lower energy usage and a smaller environmental footprint. However, efforts are being made to make LLMs more energy-efficient and to reduce their environmental impact through methods such as more efficient hardware, better model design, and by fine-tuning pre-trained models instead of training new ones from scratch.

What are the specific challenges in ensuring the ethical use and bias mitigation in LLMs compared to traditional models?

The ethical use and bias mitigation in LLMs present unique challenges primarily due to the scale and nature of the data they are trained on. LLMs are trained on vast datasets sourced from the internet, which can contain biased, incorrect, or harmful information. These biases can be amplified and perpetuated by the models, leading to ethical concerns, especially when the models are used in sensitive or impactful contexts. The sheer volume of data makes it difficult to fully audit and clean, resulting in challenges in identifying and mitigating all sources of bias. Additionally, because LLMs generate human-like text, there is a risk of them producing harmful or misleading information that appears credible. This is less of a concern with traditional ML models, which typically perform more narrowly defined tasks and therefore have a more controlled and limited scope for bias introduction and propagation. Addressing these challenges requires ongoing efforts in data curation, model transparency, and the development of robust evaluation frameworks to detect and mitigate biases.

How can businesses measure the return on investment (ROI) when implementing LLMs into their operations?

Measuring the return on investment (ROI) for businesses implementing LLMs involves assessing both the tangible and intangible benefits against the costs associated with these systems. Tangible benefits can include increased efficiency, reduced operational costs, and enhanced customer satisfaction, which can be measured through metrics such as time saved, reduction in customer service expenses, and improvements in sales or customer retention rates. Intangible benefits might include improved brand reputation, customer experience, and innovation. Costs to consider include not only the direct expenses related to developing, training, and maintaining the LLMs but also indirect costs such as training staff to use the technology and potential risks associated with model biases or errors. Businesses can assess ROI by setting clear objectives before implementation, monitoring performance metrics closely, and adjusting the use of LLMs to align with strategic goals. Regularly reviewing these metrics against the initial investment and operational costs helps in understanding the value LLMs bring to the organization.