Description
A 90-billion-parameter multimodal model excelling at visual reasoning and language tasks. The model handles image captioning, visual question answering, and advanced image-text comprehension through pre-training on multimodal datasets and human feedback fine-tuning. It is designed for industries requiring sophisticated real-time visual and textual analysis.
This model is part of the Llama 3.2 Vision collection, which represents Meta's multimodal expansion of the Llama family, enabling models to process and understand both text and images.
Technical Specifications
| Specification | Value |
| --- | --- |
| Context Window | 128,000 tokens |
| Parameters | 90 billion |
| Context Length | 32,768 tokens |
| Max Completion Tokens | 16,384 tokens |
| Input Modalities | Text, Image |
| Output Modalities | Text |
| Instruction Type | Llama 3 |
| Architecture | Transformer-based multimodal |
Default Stop Tokens
- `<|eot_id|>`
- `<|end_of_text|>`
Pricing
Via DeepInfra (Primary Provider)
| Type | Price |
| --- | --- |
| Input | $0.35 per 1M tokens |
| Output | $0.40 per 1M tokens |
| Image Processing | $0.0005058 per image |
Cost Calculation Examples
| Usage Scenario | Approximate Cost |
| --- | --- |
| 1M input tokens + 100K output tokens | $0.39 |
| 100 images processed | $0.05 |
| Average conversation (2K input tokens, 500 output tokens) | $0.0009 |
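The figures in the table above follow directly from the per-token and per-image rates. A minimal cost calculator, using the DeepInfra rates listed earlier (the function itself is illustrative arithmetic, not part of any official SDK):

```python
# Cost calculator for the DeepInfra rates listed above.
INPUT_PRICE_PER_M = 0.35     # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.40    # USD per 1M output tokens
IMAGE_PRICE = 0.0005058      # USD per processed image

def estimate_cost(input_tokens: int, output_tokens: int, images: int = 0) -> float:
    """Return the approximate request cost in USD."""
    return (input_tokens / 1_000_000 * INPUT_PRICE_PER_M
            + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
            + images * IMAGE_PRICE)

print(round(estimate_cost(1_000_000, 100_000), 2))  # 0.39
print(round(estimate_cost(0, 0, images=100), 2))    # 0.05
print(round(estimate_cost(2_000, 500), 4))          # 0.0009
```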
Capabilities
Vision Tasks
- Image captioning and description
- Visual question answering (VQA)
- Image-text comprehension
- Multi-image analysis
- Chart and diagram interpretation
- Document understanding
- Scene analysis and description
Language Tasks
- Instruction following
- Conversational AI
- Text generation
- Reasoning and analysis
- Code understanding (with visual context)
Multimodal Reasoning
- Cross-modal inference
- Visual grounding
- Image-based reasoning
- Context-aware responses
Supported Parameters
| Parameter | Description |
| --- | --- |
| `max_tokens` | Maximum number of tokens to generate |
| `temperature` | Controls randomness (0.0-2.0) |
| `top_p` | Nucleus sampling threshold |
| `stop` | Stop sequences that end generation |
| `frequency_penalty` | Penalizes tokens in proportion to how often they have already appeared |
| `presence_penalty` | Penalizes tokens that have appeared at all, regardless of frequency |
| `repetition_penalty` | General repetition penalty |
| `top_k` | Top-k sampling parameter |
| `seed` | Random seed for reproducibility |
| `min_p` | Minimum probability threshold |
| `response_format` | Format specification for the response |
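A request body exercising several of these parameters can be sketched as follows. The payload shape follows the OpenAI-compatible chat format used in the curl examples later in this document; whether every parameter is honored is provider-dependent, and the specific values here are illustrative:

```python
# Illustrative request body using the sampling parameters above.
import json

payload = {
    "model": "meta-llama/llama-3.2-90b-vision-instruct",
    "messages": [
        {"role": "user", "content": "Summarize Llama 3.2 Vision in one paragraph."}
    ],
    "max_tokens": 512,
    "temperature": 0.3,        # low temperature for factual output
    "top_p": 0.9,
    "stop": ["<|eot_id|>"],    # one of the model's default stop tokens
    "frequency_penalty": 0.2,
    "seed": 42,                # fixed seed for reproducible sampling
}

body = json.dumps(payload)  # JSON string ready to send as the POST body
```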
Limitations
- Cannot process video content (images only)
- May hallucinate details not present in images
- Performance varies with image quality and complexity
- Subject to biases present in training data
- Not suitable for real-time video analysis
API Usage Example
LangMart API Request
```bash
curl https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer $LANGMART_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/llama-3.2-90b-vision-instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in detail."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/image.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 1024,
    "temperature": 0.7
  }'
```
LangMart Gateway Request
```bash
curl -X POST https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/llama-3.2-90b-vision-instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What objects are visible in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "data:image/jpeg;base64,..."
            }
          }
        ]
      }
    ]
  }'
```
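The base64 data URL in the gateway request above can be built from a local image file. A minimal sketch in Python; `image_data_url` and `build_payload` are hypothetical helpers, not part of any SDK, and the payload shape matches the OpenAI-compatible format shown in the curl examples:

```python
# Build a chat payload carrying a local image as a base64 data URL.
import base64

def image_data_url(path: str, mime: str = "image/jpeg") -> str:
    """Read a local image file and return it as a data: URL."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"

def build_payload(image_path: str, question: str) -> dict:
    """Assemble an OpenAI-compatible multimodal chat request body."""
    return {
        "model": "meta-llama/llama-3.2-90b-vision-instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": image_data_url(image_path)}},
            ],
        }],
    }

# The resulting dict can be serialized to JSON and POSTed to
# https://api.langmart.ai/v1/chat/completions with the Authorization header.
```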
Llama 3.2 Vision Family
| Model | Parameters | Description |
| --- | --- | --- |
| meta-llama/llama-3.2-11b-vision-instruct | 11B | Lightweight vision model |
| meta-llama/llama-3.2-90b-vision-instruct | 90B | Large-scale vision model (this model) |
Llama 3.2 Text-Only Family
| Model | Parameters | Description |
| --- | --- | --- |
| meta-llama/llama-3.2-1b-instruct | 1B | Ultra-lightweight text model |
| meta-llama/llama-3.2-3b-instruct | 3B | Compact text model |
Newer Llama Models
| Model | Description |
| --- | --- |
| meta-llama/llama-3.3-70b-instruct | Latest text-only Llama model |
| meta-llama/llama-4-maverick | Next-generation Llama model |
Model Identity
| Property | Value |
| --- | --- |
| Model Name | Meta: Llama 3.2 90B Vision Instruct |
| Model ID | meta-llama/llama-3.2-90b-vision-instruct |
| Author | Meta-Llama |
| Created | September 25, 2024 |
| License | Meta Llama 3.2 Community License |
Providers
| Provider | Quantization | Status |
| --- | --- | --- |
| DeepInfra | BF16 (Brain Float 16) | Primary |
Quantization Details
- BF16 (Brain Float 16): the model's native precision, balancing accuracy and inference performance
- No quantization loss relative to the original model weights
Usage Policy
This model is subject to Meta's Acceptable Use Policy. Key points:
- Prohibited uses include generating illegal content, spam, or misleading information
- Commercial use is permitted under the Llama 3.2 Community License
- Model weights available via Hugging Face with license agreement
Model Weights Access
Access the model weights through Hugging Face:
https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct
Latency
- Larger model size results in higher latency compared to 11B variant
- Image processing adds additional latency
- Recommended for quality-critical applications over latency-sensitive ones
Throughput
- Batch processing recommended for high-volume image analysis
- Streaming supported for real-time applications
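When streaming is enabled, responses typically arrive as server-sent events. A minimal parser sketch, assuming the gateway emits OpenAI-style `data: {...}` lines when `"stream": true` is set in the request (the exact wire format is provider-dependent):

```python
# Minimal parser for OpenAI-style SSE stream lines.
import json

def extract_delta(sse_line: str) -> str:
    """Return the text delta carried by one SSE data line ('' if none)."""
    line = sse_line.strip()
    if not line.startswith("data:"):
        return ""                      # comments / keepalives carry no delta
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return ""                      # end-of-stream sentinel
    chunk = json.loads(data)
    return chunk["choices"][0]["delta"].get("content", "")

sample = 'data: {"choices":[{"delta":{"content":"Hello"}}]}'
print(extract_delta(sample))  # Hello
```

Concatenating the deltas from successive lines reconstructs the full completion text.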
Best Practices
- Image Resolution: Provide clear, high-resolution images for best results
- Prompt Engineering: Be specific about what aspects of the image to analyze
- Context Management: Keep conversations within context window limits
- Temperature Setting: Use lower temperature (0.3-0.5) for factual descriptions, higher (0.7-1.0) for creative tasks
Version History
| Date | Version | Changes |
| --- | --- | --- |
| September 25, 2024 | Initial | Model release with the Llama 3.2 family |
Data sourced from LangMart. Last updated: December 2024