LangMart: Mistral: Pixtral 12B

Model Overview

Property	Value
Model ID	`openrouter/mistralai/pixtral-12b`
Name	Mistral: Pixtral 12B
Provider	mistralai
Released	2024-09-10

Description

The first multi-modal, text+image-to-text model from Mistral AI. Its weights were launched via torrent: https://x.com/mistralai/status/1833758285167722836.

Description

LangMart: Mistral: Pixtral 12B is a language model provided by mistralai. This model offers advanced capabilities for natural language processing tasks.

Provider

mistralai

Specifications

Spec	Value
Context Window	32,768 tokens
Modalities	text+image->text
Input Modalities	text, image
Output Modalities	text

Pricing

Type	Price
Input	$0.10 per 1M tokens
Output	$0.10 per 1M tokens

Capabilities

Frequency penalty
Logit bias
Max tokens
Min p
Presence penalty
Repetition penalty
Response format
Seed
Stop
Structured outputs
Temperature
Tool choice
Tools
Top k
Top p

Detailed Analysis

Pixtral 12B (September 2024) is Mistral's first multimodal model, combining a 12B parameter language model with a 400M parameter vision encoder to process both text and images. This groundbreaking release extends Mistral's capabilities into vision-language tasks while maintaining open-source accessibility (Apache 2.0 license). Pixtral 12B achieves 52.5% on MMMU reasoning benchmark, surpassing many larger models by understanding both natural images and documents. The model's unique architecture accepts images at native resolution and aspect ratio, providing flexibility in token usage - users control the token budget for image processing. The 128K context window accommodates extensive text alongside multiple images, enabling document analysis, multi-image reasoning, and long-context vision tasks. Pixtral 12B excels at document understanding (invoices, forms, reports), chart and graph interpretation, visual reasoning, image captioning with context, and OCR with comprehension. Ideal for document processing workflows, visual QA systems, accessibility tools, and applications requiring combined vision-language intelligence at 12B efficiency.