Recently, we’ve been witnessing the rapid development and evolution of generative AI applications, with observability and evaluation emerging as critical aspects for developers, data scientists, and stakeholders. Observability refers to the ability to understand the internal state and behavior of a system by analyzing its outputs, logs, and metrics. Evaluation, on the other hand, involves assessing the quality and relevance of the generated outputs, enabling continual improvement.

Comprehensive observability and evaluation are essential for troubleshooting, identifying bottlenecks, optimizing applications, and providing relevant, high-quality responses. Observability empowers you to proactively monitor and analyze your generative AI applications, and evaluation helps you collect feedback, refine models, and enhance output quality.

In the context of Amazon Bedrock, observability and evaluation become even more crucial. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. As the complexity and scale of these applications grow, providing comprehensive observability and robust evaluation mechanisms are essential for maintaining high performance, quality, and user satisfaction.

We have built a custom observability solution that Amazon Bedrock users can quickly implement using just a few key building blocks and existing logs using FMs, Amazon Bedrock Knowledge BasesAmazon Bedrock Guardrails, and Amazon Bedrock Agents. This solution uses decorators in your application code to capture and log metadata such as input prompts, output results, run time, and custom metadata, offering enhanced security, ease of use, flexibility, and integration with native AWS services.

Notably, the solution supports comprehensive Retrieval Augmented Generation (RAG) evaluation so you can assess the quality and relevance of generated responses, identify areas for improvement, and refine the knowledge base or model accordingly.

In this post, we set up the custom solution for observability and evaluation of Amazon Bedrock applications. Through code examples and step-by-step guidance, we demonstrate how you can seamlessly integrate this solution into your Amazon Bedrock application, unlocking a new level of visibility, control, and continual improvement for your generative AI applications.

By the end of this post, you will:

  1. Understand the importance of observability and evaluation in generative AI applications
  2. Learn about the key features and benefits of this solution
  3. Gain hands-on experience in implementing the solution through step-by-step demonstrations
  4. Explore best practices for integrating observability and evaluation into your Amazon Bedrock workflows

Prerequisites

To implement the observability solution discussed in this post, you need the following prerequisites:

Solution overview

The observability solution for Amazon Bedrock empowers users to track and analyze interactions with FMs, knowledge bases, guardrails, and agents using decorators in their source code. Key highlights of the solution include:

Here’s a high-level overview of the observability solution architecture:

The following steps explain how the solution works:

  1. Application code using Amazon Bedrock is decorated with @bedrock_logs.watch to save the log
  2. Logged data streams through Amazon Data Firehose
  3. AWS Lambda transforms the data and applies dynamic partitioning based on call_type variable
  4. Amazon S3 stores the data securely
  5. Optional components for advanced analytics
  6. AWS Glue creates tables from S3 data
  7. Amazon Athena enables data querying
  8. Visualize logs and insights in your favorite dashboard tool

This architecture provides comprehensive logging, efficient data processing, and powerful analytics capabilities for your Amazon Bedrock applications.

Getting started

To help you get started with the observability solution, we have provided example notebooks in the attached GitHub repository, covering knowledge bases, evaluation, and agents for Amazon Bedrock. These notebooks demonstrate how to integrate the solution into your Amazon Bedrock application and showcase various use cases and features including feedback collected from users or quality assurance (QA) teams.

The repository contains well-documented notebooks that cover topics such as:

To get started with the example notebooks, follow these steps:

  1. Clone the GitHub repository
    git clone https://github.com/aws-samples/amazon-bedrock-samples.git
  2. Navigate to the observability solution directory
    cd amazon-bedrock-samples/evaluation-observe/Custom-Observability-Solution
  1. Follow the instructions in the README file to set up the required AWS resources and configure the solution
  2. Open the provided Jupyter notebooks and follow along with the examples and demonstrations

These notebooks provide a hands-on learning experience and serve as a starting point for integrating our solution into your generative AI applications. Feel free to explore, modify, and adapt the code examples to suit your specific requirements.

Key features

The solution offers a range of powerful features to streamline observability and evaluation for your generative AI applications on Amazon Bedrock:

This concise list highlights the key features you can use to gain insights, optimize performance, and drive continual improvement for your generative AI applications on Amazon Bedrock. For a detailed breakdown of the features and implementation specifics, refer to the comprehensive documentation in the GitHub repository.

Implementation and best practices

The solution is designed to be modular and flexible so you can customize it according to your specific requirements. Although the implementation is straightforward, following best practices is crucial for the scalability, security, and maintainability of your observability infrastructure.

Solution deployment

This solution includes an AWS CloudFormation template that streamlines the deployment of required AWS resources, providing consistent and repeatable deployments across environments. The CloudFormation template provisions resources such as Amazon Data Firehose delivery streams, AWS Lambda functions, Amazon S3 buckets, and AWS Glue crawlers and databases.

Decorator pattern

The solution uses the decorator pattern to integrate observability logging into your application functions seamlessly. The @bedrock_logs.watch decorator wraps your functions, automatically logging inputs, outputs, and metadata to Amazon Kinesis Firehose. Here’s an example of how to use the decorator:

# import observability
from observability import BedrockLogs

# instantiate BedrockLogs in Firehose mode
bedrock_logs = BedrockLogs(delivery_stream_name='your-firehose-delivery-stream', feedback_variables=True)

# decorate your function
@bedrock_logs.watch(capture_input=True, capture_output=True, call_type='<your-custom-dataset-name>')
def your_function(arg1, arg2):
    # Your function code here along with any custom metric of your choosing
    return output

Human-in-the-loop evaluation

The solution supports human-in-the-loop evaluation so you can incorporate human feedback into the performance evaluation of your generative AI application. You can involve end users, experts, or QA teams in the evaluation process, providing insights to enhance output quality and relevance. Here’s an example of how you can implement human-in-the-loop evaluation:

@bedrock_logs.watch(call_type='Retrieve-and-Generate-with-KB')
def main(input_arguments):
    # Your code to interact with Amazon Bedrock Knowledge Base or Agent
    return response, custom_metric, etc.

@bedrock_logs.watch(call_type='observation-feedback')
def observation_level_feedback(feedback):
    pass

# Invoke main function with user input and get run_id and observation_id
tuple_of_function_outputs, run_id, observation_id = main(input_arguments)

# Collect human feedback on model response in your application
user_feedback = 'thumbs-up'

observation_feedback_from_front_end = {
    'user_id': 'User-1',
    'f_run_id': run_id,
    'f_observation_id': observation_id,
    'actual_feedback': user_feedback
}

# Log the human-in-loop feedback using observation_level_feedback function
observation_level_feedback(observation_feedback_from_front_end)

By using the run_id and observation_id generated, you can associate human feedback with specific model responses or sessions. This feedback can then be analyzed and used to refine the knowledge base, fine-tune models, or identify areas for improvement.

Best practices

It’s recommended to follow these best practices:

By following these best practices and using the features of the solution, you can set up comprehensive observability and evaluation for your generative AI applications to gain valuable insights, identify areas for improvement, and enhance the overall user experience.

In the next post in this three-part series, we dive deeper into observability and evaluation for RAG and agent-based generative AI applications, providing in-depth insights and guidance.

Clean up

To avoid incurring costs and maintain a clean AWS account, you can remove the associated resources by deleting the AWS CloudFormation stack you created for this walkthrough. You can follow the steps provided in the Deleting a stack on the AWS CloudFormation console documentation to delete the resources created for this solution.

Conclusion and next steps

This comprehensive solution empowers you to seamlessly integrate comprehensive observability into your generative AI applications in Amazon Bedrock. Key benefits include streamlined integration, selective logging, custom metadata tracking, and comprehensive evaluation capabilities, including RAG evaluation. Use AWS services such as Athena to analyze observability data, drive continual improvement, and connect with your favorite dashboard tool to visualize the data.

This post focused is on Amazon Bedrock, but it can be extended to broader machine learning operations (MLOps) workflows or integrated with other AWS services such as AWS Lambda or Amazon SageMaker. We encourage you to explore this solution and integrate it into your workflows. Access the source code and documentation in our GitHub repository  and start your integration journey. Embrace the power of observability and unlock new heights for your generative AI applications.


About the authors

Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

Chris Pecora is a Generative AI Data Scientist at Amazon Web Services. He is passionate about building innovative products and solutions while also focused on customer-obsessed science. When not running experiments and keeping up with the latest developments in generative AI, he loves spending time with his kids.

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Mani Khanuja is a Tech Lead – Generative AI Specialists, author of the book Applied Machine Learning and High Performance Computing on AWS, and a member of the Board of Directors for Women in Manufacturing Education Foundation Board. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.