Ollama: Mistral 7B Instruct
Overview
| Property | Value |
|---|---|
| Model ID | ollama/mistral:7b-instruct |
| Provider | Mistral AI |
| Display Name | Mistral 7B Instruct |
| Parameters | 7 Billion (7B) |
| Type | Instruction-Tuned Language Model |
| Release Date | 2023 |
Description
Mistral 7B Instruct is Mistral AI's 7-billion-parameter instruction-tuned language model, known for exceptional efficiency and speed. Despite its size, Mistral AI reports that it outperforms Llama 2 13B on all evaluated benchmarks, a result of an optimized architecture built on Grouped Query Attention (plus Sliding Window Attention in v0.1). Its fast inference makes it a strong fit for real-time applications, edge deployment, and other latency-sensitive tasks, while maintaining high output quality.
Mistral 7B's compact size and fast inference speed make it a favorite for production systems requiring quick response times without sacrificing quality.
Specifications
- Parameters: 7 Billion (7B)
- Context Window: 32,000 tokens (32K) in v0.2 and later; v0.1 shipped with an 8K context
- Quantization Formats: Q4, Q5, Q8, FP16 (distributed as GGUF files)
- Architecture: Transformer with Grouped Query Attention (GQA)
- Attention Mechanism: Sliding Window Attention (SWA) in v0.1; full attention in v0.2+
- Training Tokens: not officially disclosed by Mistral AI
- Memory Required (Float16): ~15GB VRAM
- Memory Required (Quantized Q4): ~4GB VRAM (see the back-of-envelope sketch after this list)
- Inference Speed: 93.3 tokens/second on a single GPU (hardware-dependent)
- Latency to First Token: ~0.27 seconds (hardware-dependent)
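The memory figures above follow from simple arithmetic: parameter count times bytes per weight, plus runtime overhead. A minimal sketch; the ~20% overhead factor (KV cache, activations) is an assumption of mine, not a published figure:

```bash
# Rough VRAM estimate: 7B params x bytes-per-weight x ~1.2 overhead.
# The 1.2 overhead factor is an assumption, not a spec.
for fmt in "FP16:2.0" "Q8:1.0" "Q5:0.625" "Q4:0.5"; do
  name=${fmt%%:*}; bpw=${fmt##*:}
  awk -v b="$bpw" -v n="$name" 'BEGIN { printf "%-4s ~%.1f GB\n", n, 7 * b * 1.2 }'
done
```

The results land close to the quantization table further below; actual usage also grows with the active context length.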
Use Cases
- Real-Time Applications: Chatbots, customer service bots
- Fast Inference Required: Live translation, concurrent requests
- Edge Deployment: Local servers, mobile devices
- Code Assistant: IDE plugins, code completion
- Data Processing: Log analysis, fast summarization
- API Services: High-throughput inference services
- Structured Output: JSON generation, data extraction
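The structured-output use case above is easy to demonstrate: Ollama's native API accepts a `format: "json"` option that constrains generation to valid JSON. A minimal sketch against a local server on the default port:

```bash
# "format": "json" forces syntactically valid JSON output
curl -s http://localhost:11434/api/generate -d '{
  "model": "mistral:7b-instruct",
  "prompt": "Extract name and age from: Ada is 36. Reply as JSON.",
  "format": "json",
  "stream": false
}' | jq '.response | fromjson'
```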
Limitations
- Smaller context window (32K vs 128K for Llama 3.1)
- May be less detailed than larger models
- Weaker at very complex reasoning
- Limited multilingual support compared to Llama 3.1
- Not as strong at creative writing as larger models
Related Models
- mistral:7b - base model (not instruction-tuned)
- Mistral Medium - API-only larger variant
- Mistral Large - larger model with a longer context (in the Ollama library as mistral-large)
- mixtral:8x7b-instruct - Mixture-of-Experts variant
Key Capabilities
- Ultra-fast inference (among the fastest models in the 7B class)
- Instruction following
- Reasoning and problem-solving
- Code generation
- Multi-turn conversation
- Summarization
- Text analysis
- Knowledge retrieval
- JSON/structured output generation
Installation
Quick Start
```bash
# Pull and run the model directly
ollama pull mistral:7b-instruct
ollama run mistral:7b-instruct
```
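Once the pull completes, it is worth confirming the model is installed and checking its metadata (template, parameters, license); both commands are standard Ollama CLI:

```bash
# List installed models and inspect this one's details
ollama list
ollama show mistral:7b-instruct
```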
With Specific Quantization
Ollama tags use a single colon (name:tag), so quantized builds are separate tags. Exact tag names vary by release; check https://ollama.com/library/mistral for the current list:

```bash
# Q4 quantization (~4GB) - fastest, recommended
ollama pull mistral:7b-instruct-q4_0
# Q5 quantization (~6GB)
ollama pull mistral:7b-instruct-q5_0
# Q8 quantization (~8GB)
ollama pull mistral:7b-instruct-q8_0
# Full precision (Float16, ~15GB)
ollama pull mistral:7b-instruct-fp16
```
Usage
Basic Chat
```bash
ollama run mistral:7b-instruct "Write a Python function for sorting"
```
Interactive Conversation
```bash
ollama run mistral:7b-instruct
# Type your message and press Enter
# Type /bye to exit
```
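Inside an interactive session, slash commands adjust the model without restarting it; a few useful ones (the temperature value is just an example):

```bash
ollama run mistral:7b-instruct
>>> /show info                      # architecture, parameters, quantization
>>> /set parameter temperature 0.3  # more deterministic answers this session
>>> /bye                            # exit
```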
API Usage (Ollama server on port 11434)
```bash
curl -X POST http://localhost:11434/api/generate \
  -d '{
    "model": "mistral:7b-instruct",
    "prompt": "Explain REST APIs",
    "stream": false
  }' | jq
```
Streaming Response
```bash
curl -X POST http://localhost:11434/api/generate \
  -d '{
    "model": "mistral:7b-instruct",
    "prompt": "Write a haiku about AI",
    "stream": true
  }'
```
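Each streamed line is a standalone JSON object carrying a `response` fragment; piping through jq reassembles the full text:

```bash
# -N disables curl's buffering; jq -rj joins the fragments without newlines
curl -sN -X POST http://localhost:11434/api/generate -d '{
  "model": "mistral:7b-instruct",
  "prompt": "Write a haiku about AI",
  "stream": true
}' | jq -rj '.response'
```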
OpenAI Compatible API
```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral:7b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'
```
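Because the endpoint follows the OpenAI chat schema, system prompts and multi-turn histories work as expected; this sketch also extracts just the reply text with jq:

```bash
# System prompt plus user turn; jq pulls out the assistant's reply
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral:7b-instruct",
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user", "content": "Explain REST APIs in two sentences."}
    ],
    "temperature": 0.7
  }' | jq -r '.choices[0].message.content'
```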
Minimum Requirements
- CPU: 4+ cores, 2.5GHz+
- RAM: 8GB (12GB+ recommended)
- VRAM: 4GB minimum for quantized versions
- Disk: 4-6GB free space
Recommended Setup
- CPU: 8+ cores, 3.0GHz+
- RAM: 16GB or more
- GPU: NVIDIA (6GB+ VRAM), AMD, or Apple Silicon
- Quantization: Q4 or Q5
Expected Performance
- Throughput: 93.3 output tokens/second on a single GPU (hardware-dependent; see the measurement command below)
- First Token Latency: ~0.27 seconds
- CPU Only: works, but far slower; not recommended
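Throughput varies heavily with hardware, so measure locally: `ollama run --verbose` prints prompt and generation token rates after each response:

```bash
# Prints eval counts and tokens/second after the reply
ollama run mistral:7b-instruct --verbose "Explain GQA in one paragraph."
```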
Quantization Options
All quantized builds share the same 32,000-token context window.

| Quantization | VRAM Required | Quality | Speed |
|---|---|---|---|
| Q4 (GGUF) | 4 GB | Good | Very Fast |
| Q5 (GGUF) | 5-6 GB | Very Good | Fast |
| Q8 (GGUF) | 8 GB | Excellent | Medium |
| FP16 | 15 GB | Excellent | Slower |
Efficiency Features
Grouped Query Attention (GQA)
- Shares key/value heads across groups of query heads, reducing KV-cache memory
- Speeds up inference compared to standard multi-head attention (GQA sits between MHA and MQA)
- Maintains output quality
Sliding Window Attention (SWA) - v0.1 only
- Each layer attends to a fixed 4,096-token window; a rolling buffer caps the KV cache at window size
- Cuts cache memory sharply on long prompts (the Mistral paper reports up to 8x at 32K tokens; see the sketch below)
- Stacked layers still propagate information beyond the window, so longer sequences remain usable
- Dropped in v0.2+, which use full attention over the 32K context
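To see the cache saving concretely, a back-of-envelope sketch using Mistral 7B v0.1's published configuration (32 layers, 8 KV heads, head dimension 128); the fp16-cache assumption is mine:

```bash
# KV cache = layers x 2 (K and V) x kv_heads x head_dim x window x 2 bytes (fp16)
awk 'BEGIN {
  layers = 32; kv_heads = 8; head_dim = 128; window = 4096; bytes = 2
  cache = layers * 2 * kv_heads * head_dim * window * bytes
  printf "KV cache capped at ~%d MB regardless of prompt length\n", cache / 2^20
}'
```

Without the rolling buffer, the same arithmetic grows linearly with sequence length instead of stopping at the window size.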
Comparison with Similar Models

| Model | Parameters | Context | Speed | Memory | Quality |
|---|---|---|---|---|---|
| Mistral 7B | 7B | 32K | Fastest | 4GB | Very Good |
| Llama 3.1 8B | 8B | 128K | Fast | 4-16GB | Excellent |
| Qwen2.5 7B | 7.6B | 128K | Fast | 4-16GB | Excellent |
| Llama 2 7B | 7B | 4K | Medium | 4-16GB | Good |
Advantages
- Exceptionally fast inference (among the best in the 7B class)
- Very memory-efficient (4GB minimum)
- Excellent quality-to-size ratio
- Low latency to first token
- Efficient attention design (GQA throughout; SWA in v0.1)
- Great for production systems
- Excellent instruction following
Troubleshooting
Out of Memory Error
```bash
# Use Q4 quantization (smallest footprint)
ollama pull mistral:7b-instruct-q4_0
```

If the model still does not fit, offload fewer layers to the GPU via the standard num_gpu model option (the value 20 below is just a starting point to tune):

```bash
# Inside an `ollama run` session: keep some layers on the CPU
ollama run mistral:7b-instruct
>>> /set parameter num_gpu 20
```
Poor Response Quality
```bash
# Try higher-precision quantization
ollama pull mistral:7b-instruct-q5_0
```

Lowering temperature also makes answers more deterministic for factual tasks:

```bash
# Lower temperature = more deterministic output
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral:7b-instruct",
    "messages": [{"role": "user", "content": "Explain REST APIs"}],
    "temperature": 0.5
  }'
```
Slow Inference
```bash
# Check that the model is running on GPU (see the PROCESSOR column)
ollama ps
# Use a quantized build for faster inference
ollama run mistral:7b-instruct-q4_0
```
Optimization Tips
For Production
- Use Q4 quantization for best speed/quality balance
- Set GPU memory limits to prevent system slowdown
- Use streaming for better perceived latency
- Implement request batching for throughput
For Quality
- Use Q5 or Q8 quantization
- Lower temperature (0.3-0.5) for factual tasks
- Higher temperature (0.7-0.9) for creative tasks
For Speed
- Use Q4 quantization
- Enable streaming responses
- Batch requests if possible (see the fan-out sketch after this list)
- Run on GPU (not CPU)
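On the batching point: the Ollama server handles concurrency itself via the OLLAMA_NUM_PARALLEL setting, so client code only needs to fan requests out; a sketch (the prompt list is illustrative):

```bash
# Start the server allowing 4 concurrent requests per loaded model
OLLAMA_NUM_PARALLEL=4 ollama serve &
sleep 2  # give the server a moment to come up

# Fan out four requests in parallel and wait for all of them
for topic in GQA SWA GGUF KV-cache; do
  curl -s http://localhost:11434/api/generate -d '{
    "model": "mistral:7b-instruct",
    "prompt": "One sentence on '"$topic"'",
    "stream": false
  }' | jq -r '.response' &
done
wait
```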
Pricing
| Type | Price |
|---|---|
| Input | Free |
| Output | Free |
Note: Ollama models run locally on your own hardware. No API costs apply - the model is free and open-source.
Resources & Documentation
- Ollama model page: https://ollama.com/library/mistral
- Mistral 7B paper: https://arxiv.org/abs/2310.06825
- Mistral AI documentation: https://docs.mistral.ai
Version History
- Mistral 7B Instruct v0.3: latest version; extended vocabulary and function-calling support
- Mistral 7B Instruct v0.2: extended the context to 32K and dropped Sliding Window Attention
- Mistral 7B Instruct v0.1: original release (8K context with SWA)