LangMart: Mistral: Pixtral 12B
Model Overview
| Property | Value |
|---|---|
| Model ID | openrouter/mistralai/pixtral-12b |
| Name | Mistral: Pixtral 12B |
| Provider | mistralai |
| Released | 2024-09-10 |
Description
The first multi-modal, text+image-to-text model from Mistral AI. Its weights were launched via torrent: https://x.com/mistralai/status/1833758285167722836.
Description
LangMart: Mistral: Pixtral 12B is a language model provided by mistralai. This model offers advanced capabilities for natural language processing tasks.
Provider
mistralai
Specifications
| Spec | Value |
|---|---|
| Context Window | 32,768 tokens |
| Modalities | text+image->text |
| Input Modalities | text, image |
| Output Modalities | text |
Pricing
| Type | Price |
|---|---|
| Input | $0.10 per 1M tokens |
| Output | $0.10 per 1M tokens |
Capabilities
- Frequency penalty
- Logit bias
- Max tokens
- Min p
- Presence penalty
- Repetition penalty
- Response format
- Seed
- Stop
- Structured outputs
- Temperature
- Tool choice
- Tools
- Top k
- Top p
Detailed Analysis
Pixtral 12B (September 2024) is Mistral's first multimodal model, combining a 12B parameter language model with a 400M parameter vision encoder to process both text and images. This groundbreaking release extends Mistral's capabilities into vision-language tasks while maintaining open-source accessibility (Apache 2.0 license). Pixtral 12B achieves 52.5% on MMMU reasoning benchmark, surpassing many larger models by understanding both natural images and documents. The model's unique architecture accepts images at native resolution and aspect ratio, providing flexibility in token usage - users control the token budget for image processing. The 128K context window accommodates extensive text alongside multiple images, enabling document analysis, multi-image reasoning, and long-context vision tasks. Pixtral 12B excels at document understanding (invoices, forms, reports), chart and graph interpretation, visual reasoning, image captioning with context, and OCR with comprehension. Ideal for document processing workflows, visual QA systems, accessibility tools, and applications requiring combined vision-language intelligence at 12B efficiency.