Large language models (LLMs) have become integral to numerous applications across industries, from enhanced customer interactions to automated business processes. Deploying these models in real-world scenarios presents significant challenges, particularly in ensuring accuracy, fairness, and relevance, and in mitigating hallucinations. Thorough evaluation of the performance and outputs of these models is therefore critical to maintaining trust and safety.

Evaluation plays a central role in the generative AI application lifecycle, much like in traditional machine learning. Robust evaluation methodologies enable informed decision-making regarding the choice of models and prompts. However, evaluating LLMs is a complex and resource-intensive process given their free-form text output. Methods such as human evaluation provide valuable insights but are costly and difficult to scale. Consequently, there is a demand for automated evaluation frameworks that are highly scalable and can be integrated into application development, much like unit and integration tests in software development.

In this post, we introduce an automated evaluation framework that addresses these challenges and is deployable on AWS. The solution can integrate multiple LLMs, use customized evaluation metrics, and enable businesses to continuously monitor model performance. We also provide LLM-as-a-judge evaluation metrics using the newly released Amazon Nova models, whose advanced capabilities and low latency make evaluations scalable. Additionally, we provide a user-friendly interface to enhance ease of use.

In the following sections, we discuss various approaches to evaluate LLMs. We then present a typical evaluation workflow, followed by our AWS-based solution that facilitates this process.

Evaluation methods

Prior to implementing evaluation processes for generative AI solutions, it’s crucial to establish clear metrics and criteria for assessment and gather an evaluation dataset.

The evaluation dataset should be representative of the actual real-world use case. It should consist of diverse samples and ideally contain ground truth values generated by experts. The size of the dataset will depend on the exact application and the cost of acquiring data; at a minimum, however, the dataset should span the relevant and diverse use cases of the application. Developing an evaluation dataset can itself be an iterative task: the dataset is progressively enhanced by adding new samples, especially samples where model performance is lacking. After the evaluation dataset is acquired, the evaluation criteria can be defined.
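
For illustration, a minimal evaluation record might pair a question with reference context and an expert-written ground truth answer. The following Python sketch is illustrative only; the field names are assumptions for this example, not a schema required by the solution.

```python
# Illustrative only: field names are hypothetical, not a schema required by the solution.
evaluation_dataset = [
    {
        "question": "What is the standard warranty period for product X?",
        "context": "Product X ships with a 24-month limited warranty covering manufacturing defects.",
        "ground_truth": "The standard warranty period for product X is 24 months.",
    },
    # Add diverse samples over time, especially cases where the model currently underperforms.
]
```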

The evaluation criteria can be broadly divided into three main areas: latency, cost, and performance (the quality of the model output).

Generally, these three areas trade off against one another: higher performance typically comes at the cost of higher latency, higher price, or both. Depending on the use case, one factor might be more critical than the others. Having metrics for these categories across different models can help you make data-driven decisions to determine the optimum choice for your specific use case.

Although measuring latency and cost can be relatively straightforward, assessing performance requires a deep understanding of the use case and knowing what is crucial for success. Depending on the application, you might be interested in evaluating the factual accuracy of the model’s output (particularly if the output is based on specific facts or reference documents), or you might want to assess whether the model’s responses are consistently polite and helpful, or both.
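
For example, latency can be measured around each model invocation and cost estimated from the token counts that Amazon Bedrock returns. The following is a minimal sketch using the Bedrock Converse API; the per-token prices are placeholders that you would replace with the current pricing for your chosen model.

```python
import time

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Placeholder prices in USD per 1,000 tokens; replace with current pricing for your model.
INPUT_PRICE_PER_1K = 0.0008
OUTPUT_PRICE_PER_1K = 0.0032


def invoke_and_measure(model_id: str, prompt: str) -> dict:
    """Invoke a Bedrock model once and return its output, latency, and estimated cost."""
    start = time.perf_counter()
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    latency_seconds = time.perf_counter() - start
    usage = response["usage"]
    estimated_cost = (
        usage["inputTokens"] / 1000 * INPUT_PRICE_PER_1K
        + usage["outputTokens"] / 1000 * OUTPUT_PRICE_PER_1K
    )
    return {
        "output": response["output"]["message"]["content"][0]["text"],
        "latency_seconds": latency_seconds,
        "estimated_cost_usd": estimated_cost,
    }
```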

To support these diverse scenarios, our solution incorporates several evaluation metrics, including LLM-as-a-judge metrics powered by the Amazon Nova models and metrics from established evaluation frameworks such as the FMeval Library, Ragas, and LLMeter.
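
As an example of an LLM-as-a-judge metric, a fast judge model such as Amazon Nova can be asked to score a candidate response against a reference answer. The sketch below is illustrative only: the prompt wording, the 1-5 scale, and the model ID are assumptions, not the exact metric shipped with the solution.

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# The model ID is an assumption; use whichever Amazon Nova model is available in your Region.
JUDGE_MODEL_ID = "amazon.nova-lite-v1:0"

JUDGE_PROMPT = """You are an impartial judge. Rate how factually consistent the candidate
answer is with the reference answer on a scale of 1 (contradictory) to 5 (fully consistent).
Reference answer: {reference}
Candidate answer: {candidate}
Respond with a single integer only."""


def judge_factual_consistency(reference: str, candidate: str) -> int:
    """Ask the judge model for a 1-5 factual consistency score."""
    response = bedrock_runtime.converse(
        modelId=JUDGE_MODEL_ID,
        messages=[{
            "role": "user",
            "content": [{"text": JUDGE_PROMPT.format(reference=reference, candidate=candidate)}],
        }],
        inferenceConfig={"temperature": 0.0},
    )
    return int(response["output"]["message"]["content"][0]["text"].strip())
```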

Evaluation workflow

Now that we know what metrics we care about, how do we go about evaluating our solution? A typical generative AI application development (proof of concept) process can be abstracted as follows:

  1. Builders use a few test examples and try out different prompts to gauge performance and get a rough idea of the prompt template and model they want to start with (online evaluation).
  2. Builders test the first version of the prompt template with a selected LLM against a test dataset with ground truth, computing a list of evaluation metrics to check performance (offline evaluation). Based on the evaluation results, they might need to modify the prompt template, fine-tune the model, or implement Retrieval Augmented Generation (RAG) to add context and improve performance.
  3. Builders implement the change and evaluate the updated solution against the dataset to validate the improvements. They then repeat the previous steps until the performance of the developed solution meets the business requirements.

The two key stages in the evaluation process are online evaluation, in which prompts and models are compared interactively on a handful of examples, and offline evaluation, in which the solution is assessed in batch against a ground truth dataset using a defined set of metrics.

This process can add significant operational complexity and effort for the builder and operations teams. To achieve this workflow, you need tooling that manages prompts, runs generation at scale over evaluation datasets, and computes evaluation metrics automatically and repeatably.

In the next section, we describe how we can create this workflow on AWS.

Solution overview

In this section, we present an automated generative AI evaluation solution that can be used to simplify the evaluation process. The architecture diagram of the solution is shown in the following figure.

This solution provides both online (real-time comparison) and offline (batch evaluation) evaluation options that fulfill different needs during the generative AI solution development lifecycle. Each component in this evaluation infrastructure can be developed using existing open source tools or AWS native services.

The architecture of the automated LLM evaluation pipeline focuses on modularity, flexibility, and scalability. This design makes sure that individual components can be reused or adapted for other generative AI projects. The following sections walk through these components and the roles they play in the solution.

The evaluation solution can significantly enhance team productivity throughout the development lifecycle by reducing manual intervention and increasing automation. As new LLMs emerge, builders can compare the current production LLM with new models to determine whether upgrading would improve the system’s performance. This ongoing evaluation process makes sure that the generative AI solution remains optimal and up to date.

Prerequisites

For scripts to set up the solution, refer to the GitHub repository. After the backend and the frontend are up and running, you can start the evaluation process.

To start, open the UI in your browser. The UI provides the ability to do both online and offline evaluations.

Online evaluation

To iteratively refine prompts, you can follow these steps:

  1. Choose the options menu (three lines) on the top left side of the page to set the AWS Region.
  2. After you choose the Region, the model lists will be prefilled with the available Amazon Bedrock models in that Region.
  3. You can choose two models for side-by-side comparison.
  4. You can select a prompt already stored in Amazon Bedrock prompt management from the dropdown menu. Selecting one automatically fills in the prompt text.
  5. You can also create a new prompt by entering it in the text box. You can select generation configurations (temperature, top P, and so on) on the Generation Configuration tab. The prompt template can also use dynamic variables, entered in {{}} (for example, for additional context, add a variable like {{context}}). Then define the value of these variables on the Context tab.
  6. Choose Enter to start generation.
  7. This will invoke the two models and present the output in the text boxes below each model, along with the latency and cost of each invocation. A minimal sketch of an equivalent side-by-side invocation follows this list.
  8. To save the prompt to Amazon Bedrock, choose Save.
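
Behind the scenes, a side-by-side comparison like this amounts to rendering the prompt template with its variables and invoking two Amazon Bedrock models in parallel. The following is a minimal sketch of that idea; the template, variable values, and model IDs are placeholders, and the UI handles all of this for you.

```python
import concurrent.futures
import re

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")


def render_template(template: str, variables: dict) -> str:
    """Substitute {{variable}} placeholders with the values defined on the Context tab."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: variables.get(m.group(1), m.group(0)), template)


def invoke(model_id: str, prompt: str) -> str:
    """Invoke a single Bedrock model and return the generated text."""
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]


# Placeholder template, variables, and model IDs; choose any two models available in your Region.
template = "Answer the question using the context.\nContext: {{context}}\nQuestion: {{question}}"
prompt = render_template(template, {"context": "Product X has a 24-month warranty.",
                                    "question": "How long is the warranty for product X?"})
model_ids = ["amazon.nova-lite-v1:0", "anthropic.claude-3-haiku-20240307-v1:0"]

with concurrent.futures.ThreadPoolExecutor() as pool:
    outputs = list(pool.map(invoke, model_ids, [prompt] * len(model_ids)))
for model_id, output in zip(model_ids, outputs):
    print(f"--- {model_id} ---\n{output}\n")
```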

Offline generation and evaluation

After you have made the model and prompt choice, you can run batch generation and evaluation over a larger dataset.

  1. To run batch generation, choose the model from the dropdown list.
  2. You can provide an Amazon Bedrock knowledge base ID if additional context is required for generation.
  3. You can also provide a prompt template ID. This prompt will be used for generation.
  4. Upload a dataset file. This file will be uploaded to the S3 bucket set in the sidebar. This file should be a pipe (|) separated CSV file. For more details on the expected data file format, see the project’s GitHub README file.
  5. Choose Start Generation to start the job. This will trigger a Step Functions workflow that you can track by choosing the link in the pop-up. A sketch of the equivalent programmatic trigger follows this list.
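
For reference, the equivalent programmatic flow is to upload the dataset to Amazon S3 and start the Step Functions state machine. The sketch below is hypothetical: the bucket name, state machine ARN, and input payload keys are assumptions, and the deployed solution defines its own payload shape.

```python
import json

import boto3

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

# All names below are illustrative assumptions; the UI performs these steps for you.
bucket = "my-llm-evaluation-bucket"
key = "datasets/eval_dataset.csv"  # pipe (|) separated CSV, per the project README
state_machine_arn = "arn:aws:states:us-east-1:123456789012:stateMachine:llm-batch-generation"

# Upload the evaluation dataset to the S3 bucket configured in the sidebar.
s3.upload_file("eval_dataset.csv", bucket, key)

# Start the batch generation workflow with a hypothetical input payload.
execution = sfn.start_execution(
    stateMachineArn=state_machine_arn,
    input=json.dumps({
        "dataset_s3_uri": f"s3://{bucket}/{key}",
        "model_id": "amazon.nova-lite-v1:0",
        "knowledge_base_id": "KB12345678",    # optional: Amazon Bedrock knowledge base for context
        "prompt_template_id": "PROMPT12345",  # optional: prompt stored in Bedrock prompt management
    }),
)
print(execution["executionArn"])
```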

Select model for Batch Generation

Invoking batch generation triggers a Step Functions workflow, which is shown in the following figure. The logic follows these steps:

  1. GetPrompts – This step retrieves a CSV file containing prompts from an S3 bucket. The contents of this file become the Step Functions workflow’s payload.
  2. convert_to_json – This step parses the CSV output and converts it into JSON format. This transformation enables the state machine to use the Map state to process the invoke_llm flow concurrently (a hypothetical sketch of this step follows the list).
  3. Map step – This is an iterative step that processes the JSON payload by invoking the invoke_llm Lambda function concurrently for each item in the payload. A concurrency limit is set, with a default value of 3. You can adjust this limit based on the capacity of your backend LLM service. Within each Map iteration, the invoke_llm Lambda function calls the backend LLM service to generate a response for a single question and its associated context.
  4. InvokeSummary – This step combines the output from each iteration of the Map step. It generates a JSON Lines result file containing the outputs, which is then stored in an S3 bucket for evaluation purposes.
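
To illustrate the convert_to_json step, the following is a hypothetical Lambda handler that reads the pipe-separated CSV from S3 and emits one JSON object per row for the Map state to fan out over; the event keys are assumptions, not the workflow’s actual contract.

```python
import csv
import io

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    """Hypothetical sketch of the convert_to_json step: read the pipe-separated CSV
    from S3 and return one JSON object per row for the Map state to iterate over."""
    # Event keys are assumptions; the deployed workflow defines its own payload shape.
    bucket = event["bucket"]
    key = event["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    reader = csv.DictReader(io.StringIO(body), delimiter="|")
    return {"items": list(reader)}
```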

When the batch generation is complete, you can trigger a batch evaluation pipeline with the selected metrics from the predefined metric list. You can also specify the location of an S3 file that contains already generated LLM outputs to perform batch evaluation.

Select model for Evaluation

Invoking batch evaluation triggers the Evaluate-LLM Step Functions workflow, shown in the following figure. This workflow is designed to comprehensively assess LLM performance using multiple evaluation frameworks, including the FMeval Library, Ragas, and LLMeter.

The architecture implements these evaluations as nested Step Functions workflows that execute concurrently, enabling efficient and comprehensive model assessment. This design also makes it straightforward to add new frameworks to the evaluation workflow.
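
As one illustration of what an individual framework step computes, the following is a minimal Ragas sketch (assuming a Ragas 0.1-style API) that scores faithfulness and answer relevancy for a toy record. In the deployed workflow, the records come from the batch generation output in S3, and the judge LLM and embeddings would be configured to use Amazon Bedrock rather than the Ragas defaults.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# Toy record; in the pipeline these rows come from the JSON Lines batch generation output in S3.
records = {
    "question": ["What is the standard warranty period for product X?"],
    "answer": ["Product X is covered by a 24-month limited warranty."],
    "contexts": [["Product X ships with a 24-month limited warranty covering manufacturing defects."]],
    "ground_truth": ["The standard warranty period for product X is 24 months."],
}

# By default Ragas uses its own judge LLM settings; pass Bedrock-backed llm and embeddings
# arguments to evaluate() to keep the evaluation inside your AWS account.
scores = evaluate(Dataset.from_dict(records), metrics=[faithfulness, answer_relevancy])
print(scores)
```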

Step Functions workflow for Evaluation

Clean up

To delete the local deployment of the frontend, run run.sh delete_local. To delete the cloud deployment, run run.sh delete_cloud. For the backend, delete the AWS CloudFormation stack llm-evaluation-stack. For resources that can’t be deleted automatically, delete them manually on the AWS Management Console.

Conclusion

In this post, we explored the importance of evaluating LLMs in the context of generative AI applications, highlighting the challenges posed by issues like hallucinations and biases. We introduced a comprehensive solution using AWS services to automate the evaluation process, allowing for continuous monitoring and assessment of LLM performance. By using tools like the FMeval Library, Ragas, LLMeter, and Step Functions, the solution provides flexibility and scalability, meeting the evolving needs of LLM consumers.

With this solution, businesses can confidently deploy LLMs, knowing they adhere to the necessary standards for accuracy, fairness, and relevance. We encourage you to explore the GitHub repository and start building your own automated LLM evaluation pipeline on AWS today. This setup can not only streamline your AI workflows but also help make sure your models deliver high-quality outputs for your specific applications.


About the Authors

Deepak Dalakoti, PhD, is a Deep Learning Architect at the Generative AI Innovation Centre in Sydney, Australia. With expertise in artificial intelligence, he partners with clients to accelerate their GenAI adoption through customized, innovative solutions. Outside the world of AI, he enjoys exploring new activities and experiences, currently focusing on strength training.

Rafa XU is a passionate Amazon Web Services (AWS) Senior Cloud Architect focused on helping public sector customers design, build, and run infrastructure, applications, and services on AWS. With more than 10 years of experience working across multiple information technology disciplines, Rafa has spent the last five years focused on AWS Cloud infrastructure, serverless applications, and automation. More recently, Rafa has expanded his skill set to include generative AI, machine learning, big data, and the Internet of Things (IoT).

Dr. Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Sam Edwards is a Solutions Architect at AWS based in Sydney and focused on Media & Entertainment. He is a Subject Matter Expert for Amazon Bedrock and Amazon SageMaker AI services. He is passionate about helping customers solve issues related to machine learning workflows and creating new solutions for them. In his spare time, he likes traveling and enjoying time with family.

Dr. Kai Zhu currently works as a Cloud Support Engineer at AWS, helping customers with issues in AI/ML related services like SageMaker and Bedrock. He is a SageMaker and Bedrock Subject Matter Expert. Experienced in data science and data engineering, he is interested in building generative AI powered projects.