Best practices for Meta Llama 3.2 multimodal fine-tuning on Amazon Bedrock

Multimodal fine-tuning represents a powerful approach for customizing foundation models (FMs) to excel at specific tasks that involve both visual and textual information. Although base multimodal models offer impressive general capabilities, they often fall short when faced with specialized visual tasks, domain-specific content, or particular output formatting requirements. Fine-tuning addresses these limitations by adapting models to your specific data and use cases, dramatically improving performance on tasks that mater to your business. Our experiments show that fine-tuned Meta Llama 3.2 models can achieve up to 74% improvements in accuracy scores compared to their base versions with prompt optimization on specialized visual understanding tasks. Amazon Bedrock now offers fine-tuning capabilities for Meta Llama 3.2 multimodal models, so you can adapt these sophisticated models to your unique use case.

In this post, we share comprehensive best practices and scientific insights for fine-tuning Meta Llama 3.2 multimodal models on Amazon Bedrock. Our recommendations are based on extensive experiments using public benchmark datasets across various vision-language tasks, including visual question answering, image captioning, and chart interpretation and understanding. By following these guidelines, you can fine-tune smaller, more cost-effective models to achieve performance that rivals or even surpasses much larger models—potentially reducing both inference costs and latency, while maintaining high accuracy for your specific use case.

Recommended use cases for fine-tuning

Meta Llama 3.2 multimodal fine-tuning excels in scenarios where the model needs to understand visual information and generate appropriate textual responses. Based on our experimental findings, the following use cases demonstrate substantial performance improvements through fine-tuning:

Visual question answering (VQA) – Customization enables the model to accurately answer questions about images.
Chart and graph interpretation – Fine-tuning allows models to comprehend complex visual data representations and answer questions about them.
Image captioning – Fine-tuning helps models generate more accurate and descriptive captions for images.
Document understanding – Fine-tuning is particularly effective for extracting structured information from document images. This includes tasks like form field extraction, table data retrieval, and identifying key elements in invoices, receipts, or technical diagrams. When working with documents, note that Meta Llama 3.2 processes documents as images (such as PNG format), not as native PDFs or other document formats. For multi-page documents, each page should be converted to a separate image and processed individually.
Structured output generation – Fine-tuning can teach models to output information in consistent JSON formats or other structured representations based on visual inputs, making integration with downstream systems more reliable.

One notable advantage of multimodal fine-tuning is its effectiveness with mixed datasets that contain both text-only and image and text examples. This versatility allows organizations to improve performance across a range of input types with a single fine-tuned model.

Prerequisites

To use this feature, make sure that you have satisfied the following requirements:

An active AWS account.
Meta Llama 3.2 models enabled in your Amazon Bedrock account. You can confirm that the models are enabled on the Model access page of the Amazon Bedrock console.
As of writing this post, Meta Llama 3.2 model customization is available in the US West (Oregon) AWS Region. Refer to Supported models and Regions for fine-tuning and continued pre-training for updates on Regional availability and quotas.
The required training dataset (and optional validation dataset) prepared and stored in Amazon Simple Storage Service (Amazon S3).

To create a model customization job using Amazon Bedrock, you need to create an AWS Identity and Access Management (IAM) role with the following permissions (for more details, see Create a service role for model customization):

A trust relationship, which allows Amazon Bedrock to assume the role
Permissions to access training and validation data in Amazon S3
Permissions to write output data to Amazon S3
Optionally, permissions to decrypt an AWS Key Management Service (AWS KMS) key if you have encrypted resources with a KMS key

The following code is the trust relationship, which allows Amazon Bedrock to assume the IAM role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": <account-id>
                },
                "ArnEquals": {
                    "aws:SourceArn": "arn:aws:bedrock:<region>:account-id:model-customization-job/*" 
                }
            }
        }
    ] 
}

Key multimodal datasets and experiment setup

To develop our best practices, we conducted extensive experiments using three representative multimodal datasets:

LlaVA-Instruct-Mix-VSFT – This comprehensive dataset contains diverse visual question-answering pairs specifically formatted for vision-language supervised fine-tuning. The dataset includes a wide variety of natural images paired with detailed instructions and high-quality responses.
ChartQA – This specialized dataset focuses on question answering about charts and graphs. It requires sophisticated visual reasoning to interpret data visualizations and answer numerical and analytical questions about the presented information.
Cut-VQAv2 – This is a carefully curated subset of the VQA dataset, containing diverse image-question-answer triplets designed to test various aspects of visual understanding and reasoning.

Our experimental approach involved systematic testing with different sample sizes (ranging between 100–10,000 samples) from each dataset to understand how performance scales with data quantity. We fine-tuned both Meta Llama 3.2 11B and Meta Llama 3.2 90B models, using Amazon Bedrock Model Customization, to compare the impact of model size on performance gains. The models were evaluated using the SQuAD F1 score metric, which measures the word-level overlap between generated responses and reference answers.

Best practices for data preparation

The quality and structure of your training data fundamentally determine the success of fine-tuning. Our experiments revealed several critical insights for preparing effective multimodal datasets:

Data structure – You should use a single image per example rather than multiple images. Our research shows this approach consistently yields superior performance in model learning. With one image per example, the model forms clearer associations between specific visual inputs and corresponding textual outputs, leading to more accurate predictions across various tasks. Although we recommend single-image training examples for optimal results, you can include multiple images per training record based on your use case. Refer to Model requirements for training and validation datasets for detailed data preparation requirements.
Start small, scale as needed – Larger datasets generally produce better results, but initial gains are often substantial even with minimal data. Our experiments demonstrate that even small datasets (approximately 100 samples) yield significant performance improvements over base models. For specialized use cases, we recommend starting with approximately 100 high-quality examples, which can deliver meaningful improvements, then scaling up as resources allow to further enhance performance. This approach makes multimodal fine-tuning accessible even with limited data collection resources.
Annotation quality – This emerges as another critical factor, with the precision and consistency of text associated with images dramatically affecting fine-tuning results. Investing time in creating accurate, detailed annotations that reflect the desired output format yields substantial dividends in model performance. For complex reasoning tasks like chart interpretation, comprehensive annotations should capture the entire reasoning process rather than merely providing the final answer, enabling the model to learn sophisticated analytical pathways.
- Quality annotation (ChartQA as an example):
  - Question: Which mode of transport has CO2 emissions value 191.6 g?
  - Good annotation: Medium car (petrol)
  - Bad annotation: Medium car (diesel)
- Quality annotation (Cut-VQAv2 as an example):
  - Question: What does the sign in the window read?
  - Good annotation: ATM inside
  - Bad annotation: a clock
- Quality annotation (LlaVA as an example):
  - Question:
    - User: What type of building is featured in the image?
    - Assistant: The image features a historic church with two tall towers, which is a large and beautiful building.
    - User: Where is the church located in the city?
    - Assistant: The historic church is located at an intersection in the city.
    - User: Is there a street sign in the image? If so, can you describe it?
    - Assistant: Yes, there is a small street sign in the image, with the large and beautiful church in the background.
    - User: Are there any cars or other vehicles visible in the image?
    - Assistant: Yes, there are cars on the road near the old, large building in the city.
    - User: Is there any celebration or event happening in the area?
  - Good annotation:
    - Assistant: It appears that people are celebrating Cinco de Mayo in this area.
  - Bad annotation:
    - Assistant: People gather annually to celebrate National Pizza Day by launching tacos into orbit from the church rooftops.
Validation data – This provides additional performance insights during fine-tuning. We recommend allocating 10–20% of the dataset for validation purposes. Amazon Bedrock customization outputs validation loss metrics throughput the training process, allowing you to assess model convergence and potential overfitting without conducting extensive inference benchmarks. These validation metrics serve as early indicators of how your fine-tuned model performs on unseen data, providing additional performance insights during fine-tuning.
Formatting consistency – Consistency throughout your dataset further enhances learning efficiency. Standardizing the structure of training examples, particularly how images are referenced within the text, helps the model develop stable patterns for interpreting the relationship between visual and textual elements. This consistency enables more reliable learning across diverse examples and facilitates better generalization to new inputs during inference. Importantly, make sure that the data you plan to use for inference follows the same format and structure as your training data; significant differences between training and testing inputs can reduce the effectiveness of the fine-tuned model.

Configuring fine-tuning parameters

When fine-tuning Meta Llama 3.2 multimodal models on Amazon Bedrock, you can configure the following key parameters to optimize performance for your specific use case:

Epochs – The number of complete passes through your training dataset significantly impacts model performance. Our findings suggest:
- For smaller datasets (fewer than 500 examples): Consider using more epochs (7–10) to allow the model sufficient learning opportunities with limited data. With the ChartQA dataset at 100 samples, increasing from 3 to 8 epochs improved F1 scores by approximately 5%.
- For medium datasets (500–5,000 examples): The default setting of 5 epochs typically works well, balancing effective learning with training efficiency.
- For larger datasets (over 5,000 examples): You might achieve good results with fewer epochs (3–4), because the model sees sufficient examples to learn patterns without overfitting.
Learning rate – This parameter controls how quickly the model adapts to your training data, with significant implications for performance:
- For smaller datasets: Lower learning rates (5e-6 to 1e-5) can help prevent overfitting by making more conservative parameter updates.
- For larger datasets: Slightly higher learning rates (1e-5 to 5e-5) can achieve faster convergence without sacrificing quality.
- If uncertain: Start with a learning rate of 1e-5 (the default), which performed robustly across most of our experimental conditions.
Behind-the-scenes optimizations – Through extensive experimentation, we’ve optimized implementations of Meta Llama 3.2 multimodal fine-tuning in Amazon Bedrock for better efficiency and performance. These include batch processing strategies, LoRA configuration settings, and prompt masking techniques that improved fine-tuned model performance by up to 5% compared to open-source fine-tuning recipe performance. These optimizations are automatically applied, allowing you to focus on data quality and the configurable parameters while benefiting from our research-backed tuning strategies.

Model size selection and performance comparison

Choosing between Meta Llama 3.2 11B and Meta Llama 3.2 90B for fine-tuning presents an important decision that balances performance against cost and latency considerations. Our experiments reveal that fine-tuning dramatically enhances performance regardless of model size. Looking at ChartQA as an example, the 11B base model improved from 64.1 with prompt optimization to 69.5 F1 score with fine-tuning, a 8.4% increase, whereas the 90B model improved from 64.0 to 71.9 F1 score (12.3% increase). For Cut-VQAv2, the 11B model improved from 42.17 to 73.2 F1 score (74% increase) and the 90B model improved from 67.4 to 76.5 (13.5% increase). These substantial gains highlight the transformative impact of multimodal fine-tuning even before considering model size differences.

The following visualization demonstrates how these fine-tuned models perform across different datasets and training data volumes.

The visualization demonstrates that the 90B model (orange bars) consistently outperforms the 11B model (blue bars) across all three datasets and training sizes. This advantage is most pronounced in complex visual reasoning tasks such as ChartQA, where the 90B model achieves 71.9 F1 score compared to 69.5 for the 11B model at 10,000 samples. Both models show improved performance as training data increases, with the most dramatic gains observed in the LLaVA dataset, where the 11B model improves from 76.2 to 82.4 F1 score and 90B model improves from 76.6 to 83.1 F1 score, when scaling from 100 to 10,000 samples.

An interesting efficiency pattern emerges when comparing across sample sizes: in several cases, the 90B model with fewer training samples outperforms the 11B model with significantly more data. For instance, in the Cut-VQAv2 dataset, the 90B model trained on just 100 samples (72.9 F1 score) exceeds the performance of the 11B model trained on 1,000 samples (68.6 F1 score).

For optimal results, we recommend selecting the 90B model for applications demanding maximum accuracy, particularly with complex visual reasoning tasks or limited training data. The 11B model remains an excellent choice for balanced applications where resource efficiency is important, because it still delivers substantial improvements over base models while requiring fewer computational resources.

Conclusion

Fine-tuning Meta Llama 3.2 multimodal models on Amazon Bedrock offers organizations a powerful way to create customized AI solutions that understand both visual and textual information. Our experiments demonstrate that following best practices—using high-quality data with consistent formatting, selecting appropriate parameters, and validating results—can yield dramatic performance improvements across various vision-language tasks. Even with modest datasets, fine-tuned models can achieve remarkable enhancements over base models, making this technology accessible to organizations of all sizes.

Ready to start fine-tuning your own multimodal models? Explore our comprehensive code samples and implementation examples in our GitHub repository. Happy fine-tuning!

About the authors

Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.

Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

Sovik Kumar Nath is an AI/ML and Generative AI senior solution architect with AWS. He has extensive experience designing end-to-end machine learning and business analytics solutions in finance, operations, marketing, healthcare, supply chain management, and IoT. He has double masters degrees from the University of South Florida, University of Fribourg, Switzerland, and a bachelors degree from the Indian Institute of Technology, Kharagpur. Outside of work, Sovik enjoys traveling, taking ferry rides, and watching movies.

Karel Mundnich is a Sr. Applied Scientist in AWS Agentic AI. He has previously worked in AWS Lex and AWS Bedrock, where he worked in speech recognition, speech LLMs, and LLM fine-tuning. He holds a PhD in Electrical Engineering from the University of Southern California. In his free time, he enjoys skiing, hiking, and cycling.

Marcelo Aberle is a Sr. Research Engineer at AWS Bedrock. In recent years, he has been working at the intersection of science and engineering to enable new AWS service launches. This includes various LLM projects across Titan, Bedrock, and other AWS organizations. Outside of work, he keeps himself busy staying up-to-date on the latest GenAI startups in his adopted home city of San Francisco, California.

Jiayu Li is an Applied Scientist at AWS Bedrock, where he contributes to the development and scaling of generative AI applications using foundation models. He holds a Ph.D. and a Master’s degree in computer science from Syracuse University. Outside of work, Jiayu enjoys reading and cooking.

Fang Liu is a principal machine learning engineer at Amazon Web Services, where he has extensive experience in building AI/ML products using cutting-edge technologies. He has worked on notable projects such as Amazon Transcribe and Amazon Bedrock. Fang Liu holds a master’s degree in computer science from Tsinghua University.

Jennifer Zhu is a Senior Applied Scientist at AWS Bedrock, where she helps building and scaling generative AI applications with foundation models. Jennifer holds a PhD degree from Cornell University, and a master degree from University of San Francisco. Outside of work, she enjoys reading books and watching tennis games.