Description
A 90-billion-parameter multimodal model excelling at visual reasoning and language tasks. The model handles image captioning, visual question answering, and advanced image-text comprehension through pre-training on multimodal datasets and human feedback fine-tuning. It is designed for industries requiring sophisticated real-time visual and textual analysis.
This model is part of the Llama 3.2 Vision collection, which represents Meta's multimodal expansion of the Llama family, enabling models to process and understand both text and images.
Technical Specifications
| Specification | Value |
| --- | --- |
| Context Window | 128,000 tokens |
| Parameters | 90 billion |
| Context Length | 32,768 tokens |
| Max Completion Tokens | 16,384 tokens |
| Input Modalities | Text, Image |
| Output Modalities | Text |
| Instruction Type | Llama 3 |
| Architecture | Transformer-based multimodal |
Default Stop Tokens
- `<|eot_id|>`
- `<|end_of_text|>`
Pricing
Via DeepInfra (Primary Provider)
| Type | Price |
| --- | --- |
| Input | $0.35 per 1M tokens |
| Output | $0.40 per 1M tokens |
| Image Processing | $0.0005058 per image |
Cost Calculation Examples
| Usage Scenario | Approximate Cost |
| --- | --- |
| 1M input tokens + 100K output tokens | $0.39 |
| 100 images processed | $0.05 |
| Average conversation (2K input tokens, 500 output tokens) | $0.0009 |
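The figures in the table above follow directly from the per-token and per-image rates. A minimal cost calculator, using the DeepInfra rates listed earlier (the function itself is illustrative arithmetic, not part of any official SDK):

```python
# Cost calculator for the DeepInfra rates listed above.
INPUT_PRICE_PER_M = 0.35     # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.40    # USD per 1M output tokens
IMAGE_PRICE = 0.0005058      # USD per processed image

def estimate_cost(input_tokens: int, output_tokens: int, images: int = 0) -> float:
    """Return the approximate request cost in USD."""
    return (input_tokens / 1_000_000 * INPUT_PRICE_PER_M
            + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
            + images * IMAGE_PRICE)

print(round(estimate_cost(1_000_000, 100_000), 2))  # 0.39
print(round(estimate_cost(0, 0, images=100), 2))    # 0.05
print(round(estimate_cost(2_000, 500), 4))          # 0.0009
```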
Capabilities
Vision Tasks
- Image captioning and description
- Visual question answering (VQA)
- Image-text comprehension
- Multi-image analysis
- Chart and diagram interpretation
- Document understanding
- Scene analysis and description
Language Tasks
- Instruction following
- Conversational AI
- Text generation
- Reasoning and analysis
- Code understanding (with visual context)
Multimodal Reasoning
- Cross-modal inference
- Visual grounding
- Image-based reasoning
- Context-aware responses
Supported Parameters
| Parameter | Description |
| --- | --- |
| `max_tokens` | Maximum number of tokens to generate |
| `temperature` | Controls randomness (0.0-2.0) |
| `top_p` | Nucleus sampling threshold |
| `stop` | Stop sequences that end generation |
| `frequency_penalty` | Penalizes tokens in proportion to how often they have already appeared |
| `presence_penalty` | Penalizes tokens that have appeared at all, regardless of frequency |
| `repetition_penalty` | General repetition penalty |
| `top_k` | Top-k sampling parameter |
| `seed` | Random seed for reproducibility |
| `min_p` | Minimum probability threshold |
| `response_format` | Format specification for the response |
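A request body exercising several of these parameters can be sketched as follows. The payload shape follows the OpenAI-compatible chat format used in the curl examples later in this document; whether every parameter is honored is provider-dependent, and the specific values here are illustrative:

```python
# Illustrative request body using the sampling parameters above.
import json

payload = {
    "model": "meta-llama/llama-3.2-90b-vision-instruct",
    "messages": [
        {"role": "user", "content": "Summarize Llama 3.2 Vision in one paragraph."}
    ],
    "max_tokens": 512,
    "temperature": 0.3,        # low temperature for factual output
    "top_p": 0.9,
    "stop": ["<|eot_id|>"],    # one of the model's default stop tokens
    "frequency_penalty": 0.2,
    "seed": 42,                # fixed seed for reproducible sampling
}

body = json.dumps(payload)  # JSON string ready to send as the POST body
```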
Limitations
- Cannot process video content (images only)
- May hallucinate details not present in images
- Performance varies with image quality and complexity
- Subject to biases present in training data
- Not suitable for real-time video analysis
API Usage Example
LangMart API Request
```bash
curl https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer $LANGMART_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/llama-3.2-90b-vision-instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in detail."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/image.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 1024,
    "temperature": 0.7
  }'
```
LangMart Gateway Request
```bash
curl -X POST https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/llama-3.2-90b-vision-instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What objects are visible in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "data:image/jpeg;base64,..."
            }
          }
        ]
      }
    ]
  }'
```
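The base64 data URL in the gateway request above can be built from a local image file. A minimal sketch in Python; `image_data_url` and `build_payload` are hypothetical helpers, not part of any SDK, and the payload shape matches the OpenAI-compatible format shown in the curl examples:

```python
# Build a chat payload carrying a local image as a base64 data URL.
import base64

def image_data_url(path: str, mime: str = "image/jpeg") -> str:
    """Read a local image file and return it as a data: URL."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"

def build_payload(image_path: str, question: str) -> dict:
    """Assemble an OpenAI-compatible multimodal chat request body."""
    return {
        "model": "meta-llama/llama-3.2-90b-vision-instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": image_data_url(image_path)}},
            ],
        }],
    }

# The resulting dict can be serialized to JSON and POSTed to
# https://api.langmart.ai/v1/chat/completions with the Authorization header.
```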
Llama 3.2 Vision Family
| Model | Parameters | Description |
| --- | --- | --- |
| meta-llama/llama-3.2-11b-vision-instruct | 11B | Lightweight vision model |
| meta-llama/llama-3.2-90b-vision-instruct | 90B | Large-scale vision model (this model) |
Llama 3.2 Text-Only Family
| Model | Parameters | Description |
| --- | --- | --- |
| meta-llama/llama-3.2-1b-instruct | 1B | Ultra-lightweight text model |
| meta-llama/llama-3.2-3b-instruct | 3B | Compact text model |
Newer Llama Models
| Model | Description |
| --- | --- |
| meta-llama/llama-3.3-70b-instruct | Latest text-only Llama model |
| meta-llama/llama-4-maverick | Next-generation Llama model |
Model Identity
| Property | Value |
| --- | --- |
| Model Name | Meta: Llama 3.2 90B Vision Instruct |
| Model ID | meta-llama/llama-3.2-90b-vision-instruct |
| Author | Meta-Llama |
| Created | September 25, 2024 |
| License | Meta Llama 3.2 Community License |
Providers
| Provider | Quantization | Status |
| --- | --- | --- |
| DeepInfra | BF16 (Brain Float 16) | Primary |
Quantization Details
- BF16 (Brain Float 16): the model's native precision, balancing accuracy and inference performance
- No quantization loss relative to the original model weights
Usage Policy
This model is subject to Meta's Acceptable Use Policy. Key points:
- Prohibited uses include generating illegal content, spam, or misleading information
- Commercial use is permitted under the Llama 3.2 Community License
- Model weights available via Hugging Face with license agreement
Model Weights Access
Access the model weights through Hugging Face:
https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct
Latency
- Larger model size results in higher latency compared to 11B variant
- Image processing adds additional latency
- Recommended for quality-critical applications over latency-sensitive ones
Throughput
- Batch processing recommended for high-volume image analysis
- Streaming supported for real-time applications
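When streaming is enabled, responses typically arrive as server-sent events. A minimal parser sketch, assuming the gateway emits OpenAI-style `data: {...}` lines when `"stream": true` is set in the request (the exact wire format is provider-dependent):

```python
# Minimal parser for OpenAI-style SSE stream lines.
import json

def extract_delta(sse_line: str) -> str:
    """Return the text delta carried by one SSE data line ('' if none)."""
    line = sse_line.strip()
    if not line.startswith("data:"):
        return ""                      # comments / keepalives carry no delta
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return ""                      # end-of-stream sentinel
    chunk = json.loads(data)
    return chunk["choices"][0]["delta"].get("content", "")

sample = 'data: {"choices":[{"delta":{"content":"Hello"}}]}'
print(extract_delta(sample))  # Hello
```

Concatenating the deltas from successive lines reconstructs the full completion text.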
Best Practices
- Image Resolution: Provide clear, high-resolution images for best results
- Prompt Engineering: Be specific about what aspects of the image to analyze
- Context Management: Keep conversations within context window limits
- Temperature Setting: Use lower temperature (0.3-0.5) for factual descriptions, higher (0.7-1.0) for creative tasks
Version History
| Date | Version | Changes |
| --- | --- | --- |
| September 25, 2024 | Initial | Model release with the Llama 3.2 family |
Data sourced from LangMart. Last updated: December 2024