Meta: Llama 3.2 90B Vision Instruct

Meta · Vision · 128K context · $0.35 /1M input tokens · $0.40 /1M output tokens · 16K max output

Description

A 90-billion-parameter multimodal model excelling at visual reasoning and language tasks. The model handles image captioning, visual question answering, and advanced image-text comprehension through pre-training on multimodal datasets and human feedback fine-tuning. It is designed for industries requiring sophisticated real-time visual and textual analysis.

This model is part of the Llama 3.2 Vision collection, which represents Meta's multimodal expansion of the Llama family, enabling models to process and understand both text and images.

Technical Specifications

Specification Value
Context Window 128,000 tokens (model maximum)
Parameters 90 billion
Provider Context Length 32,768 tokens (as served by the listed provider)
Max Completion Tokens 16,384 tokens
Input Modalities Text, Image
Output Modalities Text
Instruction Type Llama 3
Architecture Transformer-based multimodal

Default Stop Tokens

  • <|eot_id|>
  • <|end_of_text|>

Pricing

Via DeepInfra (Primary Provider)

Type Price
Input $0.35 per 1M tokens
Output $0.40 per 1M tokens
Image Processing $0.0005058 per image

Cost Calculation Examples

Usage Scenario Approximate Cost
1M input tokens + 100K output tokens $0.39
100 images processed $0.05
Average conversation (2K input, 500 output) $0.0009
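
For quick budgeting, per-request cost follows directly from these rates. The sketch below is a minimal Python example using the prices listed above; the helper name estimate_cost is illustrative only, and the rates may change.

INPUT_PER_M = 0.35       # USD per 1M input tokens
OUTPUT_PER_M = 0.40      # USD per 1M output tokens
PER_IMAGE = 0.0005058    # USD per processed image

def estimate_cost(input_tokens: int, output_tokens: int, images: int = 0) -> float:
    """Return the approximate request cost in USD."""
    return (
        (input_tokens / 1_000_000) * INPUT_PER_M
        + (output_tokens / 1_000_000) * OUTPUT_PER_M
        + images * PER_IMAGE
    )

# Reproduces the scenarios above
print(round(estimate_cost(1_000_000, 100_000), 2))  # 0.39
print(round(estimate_cost(0, 0, images=100), 2))    # 0.05
print(round(estimate_cost(2_000, 500), 4))          # 0.0009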

Capabilities

Vision Tasks

  • Image captioning and description
  • Visual question answering (VQA)
  • Image-text comprehension
  • Multi-image analysis
  • Chart and diagram interpretation
  • Document understanding
  • Scene analysis and description

Language Tasks

  • Instruction following
  • Conversational AI
  • Text generation
  • Reasoning and analysis
  • Code understanding (with visual context)

Multimodal Reasoning

  • Cross-modal inference
  • Visual grounding
  • Image-based reasoning
  • Context-aware responses

Supported Parameters

Parameter Description
max_tokens Maximum number of tokens to generate
temperature Controls randomness (0.0-2.0)
top_p Nucleus sampling threshold
stop Stop sequences to end generation
frequency_penalty Penalizes tokens in proportion to how often they have already appeared
presence_penalty Penalizes tokens that have already appeared at least once, regardless of count
repetition_penalty Multiplicative penalty applied to repeated tokens
top_k Restricts sampling to the k most likely tokens
seed Random seed for reproducible sampling
min_p Minimum token probability threshold, relative to the most likely token
response_format Output format specification (e.g., JSON mode)
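
As an illustration, the sketch below combines several of these parameters in one chat completion request. It assumes the OpenAI-compatible endpoint shown in the API examples further down; the specific values are arbitrary, not tuned recommendations.

import os
import requests

payload = {
    "model": "meta-llama/llama-3.2-90b-vision-instruct",
    "messages": [{"role": "user", "content": "List three facts about llamas."}],
    "max_tokens": 512,
    "temperature": 0.4,           # lower = more deterministic
    "top_p": 0.9,
    "top_k": 40,
    "frequency_penalty": 0.2,
    "presence_penalty": 0.0,
    "repetition_penalty": 1.05,
    "seed": 42,                   # for reproducible sampling
    "stop": ["<|eot_id|>"],
}

resp = requests.post(
    "https://api.langmart.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['LANGMART_API_KEY']}"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])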

Limitations

  • Cannot process video content (images only)
  • May hallucinate details not present in images
  • Performance varies with image quality and complexity
  • Subject to biases present in training data
  • Not suitable for real-time video analysis

API Usage Example

LangMart API Request

curl https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer $LANGMART_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/llama-3.2-90b-vision-instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in detail."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/image.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 1024,
    "temperature": 0.7
  }'

LangMart Gateway Request

curl -X POST https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/llama-3.2-90b-vision-instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What objects are visible in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "data:image/jpeg;base64,..."
            }
          }
        ]
      }
    ]
  }'
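
For reference, the same gateway request can be issued from Python, building the base64 data URL (abbreviated above) from a local file. This is a sketch only: the file name photo.jpg and the environment variable are placeholders, and it assumes the endpoint accepts the OpenAI-style multimodal content array shown in the curl examples.

import base64
import os
import requests

with open("photo.jpg", "rb") as f:
    data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

payload = {
    "model": "meta-llama/llama-3.2-90b-vision-instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What objects are visible in this image?"},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }
    ],
    "max_tokens": 512,
}

resp = requests.post(
    "https://api.langmart.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['LANGMART_API_KEY']}"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])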

Llama 3.2 Vision Family

Model Parameters Description
meta-llama/llama-3.2-11b-vision-instruct 11B Lightweight vision model
meta-llama/llama-3.2-90b-vision-instruct 90B Large-scale vision model (this model)

Llama 3.2 Text-Only Family

Model Parameters Description
meta-llama/llama-3.2-1b-instruct 1B Ultra-lightweight text model
meta-llama/llama-3.2-3b-instruct 3B Compact text model

Other Meta Models

Model Description
meta-llama/llama-3.3-70b-instruct Latest text-only Llama model
meta-llama/llama-4-maverick Next-generation Llama model

Model Identity

Property Value
Model Name Meta: Llama 3.2 90B Vision Instruct
Model ID meta-llama/llama-3.2-90b-vision-instruct
Author Meta-Llama
Created September 25, 2024
License Meta Llama 3.2 Community License

Provider Information

Provider Quantization Status
DeepInfra BF16 (Brain Float 16) Primary

Quantization Details

  • BF16 (Brain Float 16): Inference in the model's native 16-bit precision, balancing accuracy and performance
  • No quantization loss relative to the original released weights

Usage Policy

This model is subject to Meta's Acceptable Use Policy. Key points:

  • Prohibited uses include generating illegal content, spam, or misleading information
  • Commercial use is permitted under the Llama 3.2 Community License
  • Model weights available via Hugging Face with license agreement

Model Weights Access

Access the model weights through Hugging Face:

https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct

Performance Considerations

Latency

  • Larger model size results in higher latency compared to 11B variant
  • Image processing adds additional latency
  • Recommended for quality-critical applications over latency-sensitive ones

Throughput

  • Batch processing recommended for high-volume image analysis
  • Streaming supported for real-time applications (a minimal streaming sketch follows below)
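
The sketch below shows one way a streaming response might be consumed from Python. It assumes the gateway emits OpenAI-style server-sent events when "stream": true is set; that event format is not confirmed on this page, so treat it as a starting point.

import json
import os
import requests

payload = {
    "model": "meta-llama/llama-3.2-90b-vision-instruct",
    "messages": [{"role": "user", "content": "Describe a typical llama."}],
    "stream": True,
}

with requests.post(
    "https://api.langmart.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['LANGMART_API_KEY']}"},
    json=payload,
    stream=True,
    timeout=120,
) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":
            break
        delta = json.loads(chunk)["choices"][0]["delta"]
        print(delta.get("content", ""), end="", flush=True)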

Best Practices

  1. Image Resolution: Provide clear, high-resolution images for best results
  2. Prompt Engineering: Be specific about what aspects of the image to analyze
  3. Context Management: Keep conversations within context window limits
  4. Temperature Setting: Use lower temperature (0.3-0.5) for factual descriptions, higher (0.7-1.0) for creative tasks

Version History

Date Changes
September 25, 2024 Initial release as part of the Llama 3.2 family

Data sourced from LangMart. Last updated: December 2024