O

LangMart: Mistral: Pixtral 12B

Openrouter
Vision
33K
Context
$0.1000
Input /1M
$0.1000
Output /1M
N/A
Max Output

LangMart: Mistral: Pixtral 12B

Model Overview

Property Value
Model ID openrouter/mistralai/pixtral-12b
Name Mistral: Pixtral 12B
Provider mistralai
Released 2024-09-10

Description

The first multi-modal, text+image-to-text model from Mistral AI. Its weights were launched via torrent: https://x.com/mistralai/status/1833758285167722836.

Description

LangMart: Mistral: Pixtral 12B is a language model provided by mistralai. This model offers advanced capabilities for natural language processing tasks.

Provider

mistralai

Specifications

Spec Value
Context Window 32,768 tokens
Modalities text+image->text
Input Modalities text, image
Output Modalities text

Pricing

Type Price
Input $0.10 per 1M tokens
Output $0.10 per 1M tokens

Capabilities

  • Frequency penalty
  • Logit bias
  • Max tokens
  • Min p
  • Presence penalty
  • Repetition penalty
  • Response format
  • Seed
  • Stop
  • Structured outputs
  • Temperature
  • Tool choice
  • Tools
  • Top k
  • Top p

Detailed Analysis

Pixtral 12B (September 2024) is Mistral's first multimodal model, combining a 12B parameter language model with a 400M parameter vision encoder to process both text and images. This groundbreaking release extends Mistral's capabilities into vision-language tasks while maintaining open-source accessibility (Apache 2.0 license). Pixtral 12B achieves 52.5% on MMMU reasoning benchmark, surpassing many larger models by understanding both natural images and documents. The model's unique architecture accepts images at native resolution and aspect ratio, providing flexibility in token usage - users control the token budget for image processing. The 128K context window accommodates extensive text alongside multiple images, enabling document analysis, multi-image reasoning, and long-context vision tasks. Pixtral 12B excels at document understanding (invoices, forms, reports), chart and graph interpretation, visual reasoning, image captioning with context, and OCR with comprehension. Ideal for document processing workflows, visual QA systems, accessibility tools, and applications requiring combined vision-language intelligence at 12B efficiency.