
Ollama: Mistral 7B Instruct

Ollama
Context              32K
Input (per 1M)       Free
Output (per 1M)      Free
Max Output           N/A

Overview

Property       Value
Model ID       ollama/mistral:7b-instruct
Provider       Mistral AI
Display Name   Mistral 7B Instruct
Parameters     7 Billion (7B)
Type           Instruction-Tuned Language Model
Release Date   2023

Description

Mistral 7B Instruct is Mistral AI's 7-billion-parameter instruction-tuned language model, known for its efficiency and speed. Despite its modest size, it rivals models roughly three times larger thanks to grouped-query and sliding-window attention and an optimized transformer architecture. Fast inference makes it well suited to real-time applications, edge deployment, and latency-sensitive tasks while maintaining strong output quality.

Mistral 7B's compact size and fast inference speed make it a favorite for production systems requiring quick response times without sacrificing quality.

Specifications

  • Parameters: 7 Billion (7B)
  • Context Window: 32,768 tokens (32K)
  • Quantization Formats: Q4, Q5, Q8, FP16
  • Architecture: Transformer with Grouped Query Attention (GQA)
  • Attention Mechanism: Sliding Window Attention (SWA)
  • Training Tokens: 2+ Trillion tokens
  • Memory Required (Float16): ~15GB VRAM
  • Memory Required (Quantized Q4): ~4GB VRAM (see the rough estimate after this list)
  • Inference Speed: 93.3 tokens/second on single GPU
  • Latency to First Token: ~0.27 seconds
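
The VRAM figures above follow a simple rule of thumb: weight memory is roughly the parameter count times the bytes per weight, with the KV cache and runtime overhead adding a few GB on top. A quick back-of-the-envelope check (illustrative arithmetic only):

# Weight memory only: parameters x bytes per weight (FP16 = 2 bytes, Q8 ~ 1 byte, Q4 ~ 0.5 bytes)
awk 'BEGIN { p = 7e9; printf "FP16: %.1f GB   Q8: %.1f GB   Q4: %.1f GB\n", p*2/1e9, p*1/1e9, p*0.5/1e9 }'
# FP16: 14.0 GB   Q8: 7.0 GB   Q4: 3.5 GB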

Use Cases

  • Real-Time Applications: Chatbots, customer service bots
  • Fast Inference Required: Live translation, concurrent requests
  • Edge Deployment: Local servers, mobile devices
  • Code Assistant: IDE plugins, code completion
  • Data Processing: Log analysis, fast summarization
  • API Services: High-throughput inference services
  • Structured Output: JSON generation, data extraction

Limitations

  • Smaller context window (32K vs 128K for Llama 3.1)
  • May be less detailed than larger models
  • Weaker at very complex reasoning
  • Limited multilingual support compared to Llama 3.1
  • Not as strong at creative writing as larger models

Related Models

  • mistral:7b - Base model (not instruction-tuned)
  • mistral:medium - API-only larger variant
  • mistral:large - Larger version with extended context
  • mixtral:8x7b-instruct - Mixture-of-Experts variant

Key Capabilities

  • Ultra-fast inference (fastest among 7B models)
  • Instruction following
  • Reasoning and problem-solving
  • Code generation
  • Multi-turn conversation
  • Summarization
  • Text analysis
  • Knowledge retrieval
  • JSON/structured output generation

Installation

Quick Start

# Pull and run the model directly
ollama pull mistral:7b-instruct
ollama run mistral:7b-instruct

With Specific Quantization

# Q4 Quantization (4GB RAM) - fastest, recommended
ollama pull mistral:7b-instruct-q4_0

# Q5 Quantization (6GB RAM)
ollama pull mistral:7b-instruct-q5_0

# Q8 Quantization (8GB RAM)
ollama pull mistral:7b-instruct-q8_0

# Full Precision (Float16 - 15GB RAM)
ollama pull mistral:7b-instruct-fp16
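
After pulling, you can confirm which tags are installed and how much disk space they use:

# List installed models and their on-disk sizes
ollama list

# Show details (architecture, parameter count, quantization) for a specific tag
ollama show mistral:7b-instruct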

Usage

Basic Chat

ollama run mistral:7b-instruct "Write a Python function for sorting"

Interactive Conversation

ollama run mistral:7b-instruct
# Type your message and press Enter
# Type /bye to exit
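
A few in-session commands are worth knowing (a brief sketch; run /? inside the session for the full list):

/show info                        # model architecture, parameters, quantization
/set parameter temperature 0.3    # adjust sampling for the current session
/bye                              # exit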

API Usage (Ollama server on port 11434)

curl -X POST http://localhost:11434/api/generate \
  -d '{
    "model": "mistral:7b-instruct",
    "prompt": "Explain REST APIs",
    "stream": false
  }' | jq
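
For multi-turn exchanges over the API, Ollama also exposes a chat endpoint that accepts the running message history:

curl -X POST http://localhost:11434/api/chat \
  -d '{
    "model": "mistral:7b-instruct",
    "messages": [
      {"role": "user", "content": "What is a REST API?"},
      {"role": "assistant", "content": "A REST API exposes resources over HTTP using standard verbs."},
      {"role": "user", "content": "Show a one-line example request."}
    ],
    "stream": false
  }'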

Streaming Response

curl -X POST http://localhost:11434/api/generate \
  -d '{
    "model": "mistral:7b-instruct",
    "prompt": "Write a haiku about AI",
    "stream": true
  }'

OpenAI Compatible API

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral:7b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'
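
For the structured-output use case, the native endpoints also accept a format field that constrains the reply to valid JSON; describe the desired keys in the prompt itself. A minimal example:

curl -X POST http://localhost:11434/api/generate \
  -d '{
    "model": "mistral:7b-instruct",
    "prompt": "Extract the city and country from: The Eiffel Tower is in Paris, France. Respond as JSON with keys city and country.",
    "format": "json",
    "stream": false
  }'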

Performance & Hardware Requirements

Minimum Requirements

  • CPU: 4+ cores, 2.5GHz+
  • RAM: 8GB (12GB+ recommended)
  • VRAM: 4GB minimum for quantized versions
  • Disk: 4-6GB free space

Recommended Setup

  • CPU: 8+ cores, 3.0GHz+
  • RAM: 16GB or more
  • GPU: NVIDIA (6GB+ VRAM), AMD, or Apple Silicon
  • Quantization: Q4 or Q5

Inference Performance

  • Throughput: 93.3 output tokens/second (single GPU; see how to measure this below)
  • First Token Latency: 0.27 seconds
  • CPU Only: Not recommended - very slow
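
Throughput depends heavily on hardware and quantization, so it is worth measuring on your own machine; ollama run prints timing statistics (including eval rate in tokens/second) when given the --verbose flag:

# Prints prompt and generation speed after the response
ollama run mistral:7b-instruct --verbose "Summarize what a hash table is in two sentences."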

Memory by Format

Quantization   VRAM Required   Quality     Speed
Q4 (GGUF)      4 GB            Good        Very Fast
Q5 (GGUF)      5-6 GB          Very Good   Fast
Q8 (GGUF)      8 GB            Excellent   Medium
FP16           15 GB           Excellent   Slower

The full 32K context window is available in every quantization format.

Efficiency Features

Grouped Query Attention (GQA)

  • Reduces memory requirements for KV cache
  • Speeds up inference compared to standard multi-head attention
  • Maintains output quality

Sliding Window Attention (SWA)

  • Each layer attends over a fixed 4,096-token window, backed by a rolling buffer cache
  • Caps KV-cache memory at the window size, sharply reducing cache use on long sequences
  • Stacked layers still propagate information beyond the window, enabling efficient handling of longer sequences

Performance Comparison

Model          Parameters   Context   Speed     Memory    Quality
Mistral 7B     7B           32K       Fastest   4GB       Very Good
Llama 3.1 8B   8B           128K      Fast      4-16GB    Excellent
Qwen2.5 7B     7.6B         128K      Fast      4-16GB    Excellent
Llama 2 7B     7B           4K        Medium    4-16GB    Good

Advantages

  • Exceptionally fast inference (best in class for 7B)
  • Very memory-efficient (4GB minimum)
  • Excellent quality-to-size ratio
  • Low latency to first token
  • Sliding window attention is innovative and efficient
  • Great for production systems
  • Excellent instruction following

Troubleshooting

Out of Memory Error

# Use Q4 quantization (smallest)
ollama pull mistral:7b-instruct-q4_0

# Offload fewer layers to the GPU if VRAM is tight (remaining layers run on CPU)
curl http://localhost:11434/api/generate \
  -d '{"model": "mistral:7b-instruct", "prompt": "Hello", "options": {"num_gpu": 20}}'

Poor Response Quality

# Try higher precision quantization
ollama pull mistral:7b-instruct-q5_0

# Adjust temperature (lower = more deterministic)
curl http://localhost:11434/v1/chat/completions \
  -d '{"model": "mistral:7b-instruct", "messages": [{"role": "user", "content": "Explain REST APIs"}], "temperature": 0.5}'

Slow Inference

# Check GPU acceleration is enabled
ollama ps

# Use a quantized version for faster inference
ollama run mistral:7b-instruct-q4_0

Optimization Tips

For Production

  • Use Q4 quantization for the best speed/quality balance
  • Cap loaded models and GPU layer offload to prevent system slowdown
  • Use streaming for better perceived latency
  • Batch or parallelize requests for throughput (see the serving sketch below)
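
A minimal serving sketch using Ollama's standard server environment variables (OLLAMA_NUM_PARALLEL, OLLAMA_MAX_LOADED_MODELS, OLLAMA_KEEP_ALIVE); the values shown are assumptions to tune for your hardware:

# Serve up to 4 concurrent requests per model, keep a single model resident,
# and hold it in memory for an hour between requests to avoid reload latency
OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_KEEP_ALIVE=1h ollama serve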

For Quality

  • Use Q5 or Q8 quantization
  • Lower temperature (0.3-0.5) for factual tasks
  • Higher temperature (0.7-0.9) for creative tasks

For Speed

  • Use Q4 quantization
  • Enable streaming responses
  • Batch requests if possible
  • Run on GPU (not CPU)

Pricing

Type     Price
Input    Free
Output   Free

Note: Ollama models run locally on your own hardware. No API costs apply - the model is free and open-source.

Resources & Documentation

  • Ollama model library: https://ollama.com/library/mistral
  • Mistral AI documentation: https://docs.mistral.ai
  • Mistral 7B paper: https://arxiv.org/abs/2310.06825

Version History

  • Mistral 7B Instruct v0.3: Latest version (extended vocabulary, function-calling support)
  • Mistral 7B Instruct v0.2: Previous version (32K context window)
  • Mistral 7B Instruct v0.1: Original release