Large language models (LLMs) have become integral to numerous applications across industries, from enhanced customer interactions to automated business processes. Deploying these models in real-world scenarios presents significant challenges, particularly in ensuring accuracy, fairness, and relevance, and in mitigating hallucinations. Thorough evaluation of the performance and outputs of these models is therefore critical to maintaining trust and safety.

Evaluation plays a central role in the generative AI application lifecycle, much like in traditional machine learning. Robust evaluation methodologies enable informed decision-making regarding the choice of models and prompts. However, evaluating LLMs is a complex and resource-intensive process given their free-form text output. Methods such as human evaluation provide valuable insights but are costly and difficult to scale. Consequently, there is a demand for automated evaluation frameworks that are highly scalable and can be integrated into application development, much like unit and integration tests in software development.

In this post, we introduce an automated evaluation framework that addresses these challenges and is deployable on AWS. The solution can integrate multiple LLMs, use customized evaluation metrics, and enable businesses to continuously monitor model performance. We also provide LLM-as-a-judge evaluation metrics using the newly released Amazon Nova models, whose advanced capabilities and low latency make evaluations scalable. Additionally, we provide a user-friendly interface to enhance ease of use.

In the following sections, we discuss various approaches to evaluate LLMs. We then present a typical evaluation workflow, followed by our AWS-based solution that facilitates this process.

Evaluation methods

Prior to implementing evaluation processes for generative AI solutions, it’s crucial to establish clear metrics and criteria for assessment and gather an evaluation dataset.

The evaluation dataset should be representative of the actual real-world use case. It should consist of diverse samples and ideally contain ground truth values generated by experts. The size of the dataset will depend on the exact application and the cost of acquiring data; at a minimum, however, the dataset should span the relevant and diverse use cases of the application. Developing an evaluation dataset can itself be an iterative task: the dataset is progressively enhanced by adding new samples, especially samples where model performance is lacking. After the evaluation dataset is acquired, the evaluation criteria can be defined.
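
For illustration, a minimal evaluation record might pair a question with reference context and an expert-written ground truth answer. The following Python sketch is illustrative only; the field names are assumptions for this example, not a schema required by the solution.

```python
# Illustrative only: field names are hypothetical, not a schema required by the solution.
evaluation_dataset = [
    {
        "question": "What is the standard warranty period for product X?",
        "context": "Product X ships with a 24-month limited warranty covering manufacturing defects.",
        "ground_truth": "The standard warranty period for product X is 24 months.",
    },
    # Add diverse samples over time, especially cases where the model currently underperforms.
]
```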

The evaluation criteria can be broadly divided into three main areas: latency, cost, and performance (the quality of the model output).

Generally, these three areas trade off against one another: higher performance typically comes at the cost of higher latency, higher price, or both. Depending on the use case, one factor might be more critical than the others. Having metrics for these categories across different models can help you make data-driven decisions to determine the optimum choice for your specific use case.

Although measuring latency and cost can be relatively straightforward, assessing performance requires a deep understanding of the use case and knowing what is crucial for success. Depending on the application, you might be interested in evaluating the factual accuracy of the model’s output (particularly if the output is based on specific facts or reference documents), or you might want to assess whether the model’s responses are consistently polite and helpful, or both.
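
For example, latency can be measured around each model invocation and cost estimated from the token counts that Amazon Bedrock returns. The following is a minimal sketch using the Bedrock Converse API; the per-token prices are placeholders that you would replace with the current pricing for your chosen model.

```python
import time

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Placeholder prices in USD per 1,000 tokens; replace with current pricing for your model.
INPUT_PRICE_PER_1K = 0.0008
OUTPUT_PRICE_PER_1K = 0.0032


def invoke_and_measure(model_id: str, prompt: str) -> dict:
    """Invoke a Bedrock model once and return its output, latency, and estimated cost."""
    start = time.perf_counter()
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    latency_seconds = time.perf_counter() - start
    usage = response["usage"]
    estimated_cost = (
        usage["inputTokens"] / 1000 * INPUT_PRICE_PER_1K
        + usage["outputTokens"] / 1000 * OUTPUT_PRICE_PER_1K
    )
    return {
        "output": response["output"]["message"]["content"][0]["text"],
        "latency_seconds": latency_seconds,
        "estimated_cost_usd": estimated_cost,
    }
```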

To support these diverse scenarios, our solution incorporates several evaluation metrics, including LLM-as-a-judge metrics powered by the Amazon Nova models and metrics from established evaluation frameworks such as the FMeval Library, Ragas, and LLMeter.
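
As an example of an LLM-as-a-judge metric, a fast judge model such as Amazon Nova can be asked to score a candidate response against a reference answer. The sketch below is illustrative only: the prompt wording, the 1-5 scale, and the model ID are assumptions, not the exact metric shipped with the solution.

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# The model ID is an assumption; use whichever Amazon Nova model is available in your Region.
JUDGE_MODEL_ID = "amazon.nova-lite-v1:0"

JUDGE_PROMPT = """You are an impartial judge. Rate how factually consistent the candidate
answer is with the reference answer on a scale of 1 (contradictory) to 5 (fully consistent).
Reference answer: {reference}
Candidate answer: {candidate}
Respond with a single integer only."""


def judge_factual_consistency(reference: str, candidate: str) -> int:
    """Ask the judge model for a 1-5 factual consistency score."""
    response = bedrock_runtime.converse(
        modelId=JUDGE_MODEL_ID,
        messages=[{
            "role": "user",
            "content": [{"text": JUDGE_PROMPT.format(reference=reference, candidate=candidate)}],
        }],
        inferenceConfig={"temperature": 0.0},
    )
    return int(response["output"]["message"]["content"][0]["text"].strip())
```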

Evaluation workflow

Now that we know what metrics we care about, how do we go about evaluating our solution? A typical generative AI application development (proof of concept) process can be abstracted as follows:

  1. Builders use a few test examples and try out different prompts to gauge performance and get a rough idea of the prompt template and model they want to start with (online evaluation).
  2. Builders test the first version of the prompt template with a selected LLM against a test dataset with ground truth, computing a list of evaluation metrics to check performance (offline evaluation). Based on the evaluation results, they might need to modify the prompt template, fine-tune the model, or implement Retrieval Augmented Generation (RAG) to add context and improve performance.
  3. Builders implement the change and evaluate the updated solution against the dataset to validate the improvements. They then repeat the previous steps until the performance of the developed solution meets the business requirements.

The two key stages in the evaluation process are online evaluation, in which prompts and models are compared interactively on a handful of examples, and offline evaluation, in which the solution is assessed in batch against a ground truth dataset using a defined set of metrics.

This process can add significant operational complexity and effort for the builder and operations teams. To achieve this workflow, you need tooling that manages prompts, runs generation at scale over evaluation datasets, and computes evaluation metrics automatically and repeatably.

In the next section, we describe how we can create this workflow on AWS.

Solution overview

In this section, we present an automated generative AI evaluation solution that can be used to simplify the evaluation process. The architecture diagram of the solution is shown in the following figure.

This solution provides both online (real-time comparison) and offline (batch evaluation) evaluation options that fulfill different needs during the generative AI solution development lifecycle. Each component in this evaluation infrastructure can be developed using existing open source tools or AWS native services.

The architecture of the automated LLM evaluation pipeline focuses on modularity, flexibility, and scalability. This design makes sure that individual components can be reused or adapted for other generative AI projects. The following sections walk through these components and the roles they play in the solution.

The evaluation solution can significantly enhance team productivity throughout the development lifecycle by reducing manual intervention and increasing automation. As new LLMs emerge, builders can compare the current production LLM with new models to determine whether upgrading would improve the system’s performance. This ongoing evaluation process makes sure that the generative AI solution remains optimal and up to date.

Prerequisites

For scripts to set up the solution, refer to the GitHub repository. After the backend and the frontend are up and running, you can start the evaluation process.

To start, open the UI in your browser. The UI provides the ability to do both online and offline evaluations.

Online evaluation

To iteratively refine prompts, you can follow these steps:

  1. Choose the options menu (three lines) on the top left side of the page to set the AWS Region.
  2. After you choose the Region, the model lists will be prefilled with the available Amazon Bedrock models in that Region.
  3. You can choose two models for side-by-side comparison.
  4. You can select a prompt already stored in Amazon Bedrock prompt management from the dropdown menu. Selecting one automatically fills in the prompt text.
  5. You can also create a new prompt by entering it in the text box. You can select generation configurations (temperature, top P, and so on) on the Generation Configuration tab. The prompt template can also use dynamic variables, entered in {{}} (for example, for additional context, add a variable like {{context}}). Then define the value of these variables on the Context tab.
  6. Choose Enter to start generation.
  7. This will invoke the two models and present the output in the text boxes below each model, along with the latency and cost of each invocation. A minimal sketch of an equivalent side-by-side invocation follows this list.
  8. To save the prompt to Amazon Bedrock, choose Save.
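
Behind the scenes, a side-by-side comparison like this amounts to rendering the prompt template with its variables and invoking two Amazon Bedrock models in parallel. The following is a minimal sketch of that idea; the template, variable values, and model IDs are placeholders, and the UI handles all of this for you.

```python
import concurrent.futures
import re

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")


def render_template(template: str, variables: dict) -> str:
    """Substitute {{variable}} placeholders with the values defined on the Context tab."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: variables.get(m.group(1), m.group(0)), template)


def invoke(model_id: str, prompt: str) -> str:
    """Invoke a single Bedrock model and return the generated text."""
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]


# Placeholder template, variables, and model IDs; choose any two models available in your Region.
template = "Answer the question using the context.\nContext: {{context}}\nQuestion: {{question}}"
prompt = render_template(template, {"context": "Product X has a 24-month warranty.",
                                    "question": "How long is the warranty for product X?"})
model_ids = ["amazon.nova-lite-v1:0", "anthropic.claude-3-haiku-20240307-v1:0"]

with concurrent.futures.ThreadPoolExecutor() as pool:
    outputs = list(pool.map(invoke, model_ids, [prompt] * len(model_ids)))
for model_id, output in zip(model_ids, outputs):
    print(f"--- {model_id} ---\n{output}\n")
```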

Offline generation and evaluation

After you have made the model and prompt choice, you can run batch generation and evaluation over a larger dataset.

  1. To run batch generation, choose the model from the dropdown list.
  2. You can provide an Amazon Bedrock knowledge base ID if additional context is required for generation.
  3. You can also provide a prompt template ID. This prompt will be used for generation.
  4. Upload a dataset file. This file will be uploaded to the S3 bucket set in the sidebar. This file should be a pipe (|) separated CSV file. For more details on the expected data file format, see the project’s GitHub README file.
  5. Choose Start Generation to start the job. This will trigger a Step Functions workflow that you can track by choosing the link in the pop-up. A sketch of the equivalent programmatic trigger follows this list.
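
For reference, the equivalent programmatic flow is to upload the dataset to Amazon S3 and start the Step Functions state machine. The sketch below is hypothetical: the bucket name, state machine ARN, and input payload keys are assumptions, and the deployed solution defines its own payload shape.

```python
import json

import boto3

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

# All names below are illustrative assumptions; the UI performs these steps for you.
bucket = "my-llm-evaluation-bucket"
key = "datasets/eval_dataset.csv"  # pipe (|) separated CSV, per the project README
state_machine_arn = "arn:aws:states:us-east-1:123456789012:stateMachine:llm-batch-generation"

# Upload the evaluation dataset to the S3 bucket configured in the sidebar.
s3.upload_file("eval_dataset.csv", bucket, key)

# Start the batch generation workflow with a hypothetical input payload.
execution = sfn.start_execution(
    stateMachineArn=state_machine_arn,
    input=json.dumps({
        "dataset_s3_uri": f"s3://{bucket}/{key}",
        "model_id": "amazon.nova-lite-v1:0",
        "knowledge_base_id": "KB12345678",    # optional: Amazon Bedrock knowledge base for context
        "prompt_template_id": "PROMPT12345",  # optional: prompt stored in Bedrock prompt management
    }),
)
print(execution["executionArn"])
```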

Select model for Batch Generation

Invoking batch generation triggers a Step Functions workflow, which is shown in the following figure. The logic follows these steps:

  1. GetPrompts – This step retrieves a CSV file containing prompts from an S3 bucket. The contents of this file become the Step Functions workflow’s payload.
  2. convert_to_json – This step parses the CSV output and converts it into JSON format. This transformation enables the state machine to use the Map state to process the invoke_llm flow concurrently (a hypothetical sketch of this step follows the list).
  3. Map step – This is an iterative step that processes the JSON payload by invoking the invoke_llm Lambda function concurrently for each item in the payload. A concurrency limit is set, with a default value of 3. You can adjust this limit based on the capacity of your backend LLM service. Within each Map iteration, the invoke_llm Lambda function calls the backend LLM service to generate a response for a single question and its associated context.
  4. InvokeSummary – This step combines the output from each iteration of the Map step. It generates a JSON Lines result file containing the outputs, which is then stored in an S3 bucket for evaluation purposes.
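
To illustrate the convert_to_json step, the following is a hypothetical Lambda handler that reads the pipe-separated CSV from S3 and emits one JSON object per row for the Map state to fan out over; the event keys are assumptions, not the workflow’s actual contract.

```python
import csv
import io

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    """Hypothetical sketch of the convert_to_json step: read the pipe-separated CSV
    from S3 and return one JSON object per row for the Map state to iterate over."""
    # Event keys are assumptions; the deployed workflow defines its own payload shape.
    bucket = event["bucket"]
    key = event["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    reader = csv.DictReader(io.StringIO(body), delimiter="|")
    return {"items": list(reader)}
```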

When the batch generation is complete, you can trigger a batch evaluation pipeline with the selected metrics from the predefined metric list. You can also specify the location of an S3 file that contains already generated LLM outputs to perform batch evaluation.

Select model for Evaluation

Invoking batch evaluation triggers the Evaluate-LLM Step Functions workflow, shown in the following figure. This workflow is designed to comprehensively assess LLM performance using multiple evaluation frameworks, including the FMeval Library, Ragas, and LLMeter.

The architecture implements these evaluations as nested Step Functions workflows that execute concurrently, enabling efficient and comprehensive model assessment. This design also makes it straightforward to add new frameworks to the evaluation workflow.
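
As one illustration of what an individual framework step computes, the following is a minimal Ragas sketch (assuming a Ragas 0.1-style API) that scores faithfulness and answer relevancy for a toy record. In the deployed workflow, the records come from the batch generation output in S3, and the judge LLM and embeddings would be configured to use Amazon Bedrock rather than the Ragas defaults.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# Toy record; in the pipeline these rows come from the JSON Lines batch generation output in S3.
records = {
    "question": ["What is the standard warranty period for product X?"],
    "answer": ["Product X is covered by a 24-month limited warranty."],
    "contexts": [["Product X ships with a 24-month limited warranty covering manufacturing defects."]],
    "ground_truth": ["The standard warranty period for product X is 24 months."],
}

# By default Ragas uses its own judge LLM settings; pass Bedrock-backed llm and embeddings
# arguments to evaluate() to keep the evaluation inside your AWS account.
scores = evaluate(Dataset.from_dict(records), metrics=[faithfulness, answer_relevancy])
print(scores)
```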

Step Functions workflow for Evaluation

Clean up

To delete the local deployment of the frontend, run run.sh delete_local. To delete the cloud deployment, run run.sh delete_cloud. For the backend, delete the AWS CloudFormation stack llm-evaluation-stack. For resources that can’t be deleted automatically, delete them manually on the AWS Management Console.

Conclusion

In this post, we explored the importance of evaluating LLMs in the context of generative AI applications, highlighting the challenges posed by issues like hallucinations and biases. We introduced a comprehensive solution using AWS services to automate the evaluation process, allowing for continuous monitoring and assessment of LLM performance. By using tools like the FMeval Library, Ragas, LLMeter, and Step Functions, the solution provides flexibility and scalability, meeting the evolving needs of LLM consumers.

With this solution, businesses can confidently deploy LLMs, knowing they adhere to the necessary standards for accuracy, fairness, and relevance. We encourage you to explore the GitHub repository and start building your own automated LLM evaluation pipeline on AWS today. This setup can not only streamline your AI workflows but also help make sure your models deliver high-quality outputs for your specific applications.


About the Authors

Deepak Dalakoti, PhD, is a Deep Learning Architect at the Generative AI Innovation Centre in Sydney, Australia. With expertise in artificial intelligence, he partners with clients to accelerate their GenAI adoption through customized, innovative solutions. Outside the world of AI, he enjoys exploring new activities and experiences, currently focusing on strength training.

Rafa XU is a passionate Amazon Web Services (AWS) Senior Cloud Architect focused on helping public sector customers design, build, and run infrastructure, applications, and services on AWS. With more than 10 years of experience working across multiple information technology disciplines, Rafa has spent the last five years focused on AWS Cloud infrastructure, serverless applications, and automation. More recently, Rafa has expanded his skill set to include generative AI, machine learning, big data, and the Internet of Things (IoT).

Dr. Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Sam Edwards is a Solutions Architect at AWS based in Sydney and focused on Media & Entertainment. He is a Subject Matter Expert for Amazon Bedrock and Amazon SageMaker AI services. He is passionate about helping customers solve issues related to machine learning workflows and creating new solutions for them. In his spare time, he likes traveling and enjoying time with family.

Dr. Kai Zhu currently works as a Cloud Support Engineer at AWS, helping customers with issues in AI/ML related services like SageMaker and Bedrock. He is a SageMaker and Bedrock Subject Matter Expert. Experienced in data science and data engineering, he is interested in building generative AI powered projects.