Yi Vision 34B Model Documentation
Note: This model (Yi-VL-34B) is not currently available on LangMart. It is available through Hugging Face, ModelScope, and WiseModel. The information below is compiled from official 01.AI documentation and the Hugging Face model card.
Model Overview
| Property | Value |
|---|---|
| Model Name | Yi-VL-34B (Vision-Language) |
| Developer | 01.AI (Zero One Everything / Lingyiwanwu) |
| Model ID | 01-ai/Yi-VL-34B (Hugging Face), 01ai/Yi-VL-34B (ModelScope) |
| Release Date | January 2024 |
| Model Type | Open-Source Vision-Language Model |
| Parameters | 34 billion |
| Architecture | LLaVA-based multimodal model |
| Availability | Open-source (Apache 2.0 License) |
| License | Apache 2.0 |
Description
Yi-VL-34B is the world's first open-source 34 billion parameter vision-language model, combining advanced image understanding with multilingual text generation capabilities. It represents a significant advancement in open-source multimodal AI, enabling sophisticated interactions between visual and textual information.
The model integrates a CLIP ViT-H/14 image encoder with the Yi-34B-Chat language model, creating a system capable of understanding images and generating contextually relevant responses in both English and Chinese. As of January 2024, it ranked first among open-source models on the MMMU and CMMMU vision-language benchmarks.
Capabilities
Core Features
- ✅ Multi-round text-image conversations
- ✅ Visual question answering (VQA)
- ✅ Image content comprehension and recognition
- ✅ Optical character recognition (OCR) - text in images
- ✅ Information extraction from visual content
- ✅ Image summarization and analysis
- ✅ Bilingual responses (English & Chinese)
- ✅ Complex visual reasoning
Supported Tasks
- Document analysis and extraction
- Scene understanding and description
- Object detection and identification
- Text recognition and extraction from images
- Visual reasoning and inference
- Multi-step visual problem solving
Supported Parameters
Inference Configuration
| Parameter | Type | Range | Description |
|---|---|---|---|
| temperature | float | 0.0 - 2.0 | Controls response randomness |
| top_p | float | 0.0 - 1.0 | Nucleus sampling threshold |
| top_k | integer | 1 - 100 | Top-k sampling count |
| max_tokens | integer | 1 - 2048 | Maximum output length |
| repetition_penalty | float | 0.0 - 2.0 | Penalty for token repetition |
| stream | boolean | true/false | Enable streaming output |
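For reference, these parameters map onto the standard `generate()` keyword arguments in Hugging Face transformers. The sketch below is illustrative only; the values are example settings, not tuned recommendations for Yi-VL-34B.

```python
# Illustrative mapping of the parameters above to transformers' generate() kwargs.
# Values are examples, not tuned recommendations for Yi-VL-34B.
generation_kwargs = {
    "temperature": 0.7,          # response randomness (0.0 - 2.0)
    "top_p": 0.9,                # nucleus sampling threshold
    "top_k": 40,                 # top-k sampling count
    "max_new_tokens": 512,       # corresponds to max_tokens above
    "repetition_penalty": 1.1,   # penalty for repeated tokens
    "do_sample": True,           # required for temperature/top_p/top_k to take effect
}
# Streaming output is handled separately, e.g. via transformers.TextStreamer.
# output_ids = model.generate(**inputs, **generation_kwargs)
```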
Image Handling Parameters
| Parameter | Value | Description |
|---|---|---|
| image_resolution | 448×448 | Fixed input resolution |
| image_format | JPG, PNG, BMP | Supported formats |
| max_images_per_query | 1 | Single image limitation |
Use Cases
Recommended Applications
- Document digitization and data extraction
- Visual quality assurance and inspection
- Medical image analysis (with appropriate disclaimers)
- Product catalog enrichment
- Educational content analysis
- Accessibility features (image description)
- Content moderation and filtering
- Technical diagram understanding
- Scene understanding and navigation
- Research and academic analysis
Not Recommended For
- Real-time video processing
- Multi-image batch processing (without modification)
- Production systems requiring high availability without self-hosting
- Tasks requiring proprietary API guarantees
Related Models
By 01.AI
| Model | Type | Parameters | Context | Status |
|---|---|---|---|---|
| Yi-Lightning | LLM | - | 16K | Proprietary API |
| Yi-Large | LLM | - | 32K | Proprietary API |
| Yi-34B | Base LLM | 34B | 4K-200K | Open-source |
| Yi-9B | Base LLM | 9B | 4K-200K | Open-source |
| Yi-6B | Base LLM | 6B | 4K-200K | Open-source |
| Yi-Coder | Code LLM | - | 128K | Open-source |
| Yi-VL-34B | Vision-Language | 34B | Multi-round | Open-source |
| Yi-VL-6B | Vision-Language | 6B | Multi-round | Open-source |
Similar Vision-Language Models
| Model | Organization | Parameters | Open-source |
|---|---|---|---|
| LLaVA-34B | LLaVA team | 34B | Yes |
| Qwen-VL | Alibaba | 9.6B | Yes |
| Flamingo | DeepMind | 80B | No |
| GPT-4 Vision | OpenAI | - | No (API only) |
| Claude 3 Vision | Anthropic | - | No (API only) |
| Gemini Vision | Google | - | No (API only) |
Model Specifications
Architecture & Capacity
| Specification | Value |
|---|---|
| Total Parameters | 34 billion |
| Vision Encoder | CLIP ViT-H/14 (Vision Transformer) |
| Projection Module | Two-layer MLP with layer normalization |
| Language Model | Yi-34B-Chat base |
| Input Image Resolution | 448×448 pixels |
| Bilingual Support | English and Chinese |
Input/Output Modalities
| Modality | Support |
|---|---|
| Image Input | Yes (single image per query) |
| Text Input | Yes |
| Text Output | Yes (multilingual: English & Chinese) |
| Multi-round Conversations | Yes (text-image conversations) |
Context & Processing
| Property | Value |
|---|---|
| Context Type | Multi-round visual question answering |
| Image Encoding | CLIP ViT-H/14 transformer |
| Maximum Image Resolution | 448×448 (fixed) |
| Image Input Limitation | Single image per query |
Performance Metrics
Benchmark Rankings (January 2024)
| Benchmark | Rank | Status |
|---|---|---|
| MMMU (English) | 1st among open-source | ✅ Best in class |
| CMMMU (Chinese) | 1st among open-source | ✅ Best in class |
Comparison with Alternatives
Yi-VL-34B stands out as:
- The largest open-source vision-language model at release (34B parameters)
- A bilingual (English & Chinese) model served from a single checkpoint
- The strongest open-source performer on MMMU and CMMMU (as of January 2024)
- A lower-cost option than proprietary alternatives when self-hosted
Technical Architecture
Component Breakdown
1. Vision Encoder
- Model: CLIP ViT-H/14 (Vision Transformer)
- Purpose: Encodes images into feature vectors
- Resolution: Processes 448×448 pixel images
- Output: Visual embeddings aligned with text space
2. Projection Module
- Type: Two-layer MLP (Multi-Layer Perceptron)
- Layers: Dense layers with layer normalization
- Purpose: Aligns image feature space with language model feature space
- Key Innovation: Enables seamless integration between vision and language
3. Language Model
- Base Model: Yi-34B-Chat
- Capabilities: Text generation, understanding, and reasoning
- Languages: English and Chinese
- Purpose: Generates contextually relevant responses to visual and textual inputs
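To picture how the projection module sits between the vision encoder and the language model, here is a minimal PyTorch sketch. The layer sizes (1280-dim ViT-H/14 features, 7168-dim LLM hidden size) and the GELU activation are assumptions for illustration, not the exact released Yi-VL-34B configuration.

```python
import torch
import torch.nn as nn

class ProjectionMLP(nn.Module):
    """Illustrative two-layer MLP projector with layer normalization.

    Dimensions and activation are assumptions for this sketch, not the
    exact released Yi-VL-34B configuration.
    """

    def __init__(self, vision_dim: int = 1280, llm_dim: int = 7168):
        super().__init__()
        self.fc1 = nn.Linear(vision_dim, llm_dim)
        self.norm1 = nn.LayerNorm(llm_dim)
        self.fc2 = nn.Linear(llm_dim, llm_dim)
        self.norm2 = nn.LayerNorm(llm_dim)
        self.act = nn.GELU()

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the CLIP ViT encoder
        x = self.act(self.norm1(self.fc1(image_features)))
        return self.norm2(self.fc2(x))

# Dummy patch features standing in for the ViT-H/14 output of a 448x448 image
features = torch.randn(1, 1024, 1280)
print(ProjectionMLP()(features).shape)  # torch.Size([1, 1024, 7168])
```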
Training Methodology
Three-Stage Training Process:
| Stage | Resolution | Focus | Data Size | Purpose |
|---|---|---|---|---|
| Stage 1 | 224×224 | ViT-LLM alignment | 100M image-text pairs | Initial projection learning |
| Stage 2 | 448×448 | Fine visual detail | 25M image-text pairs | Enhanced visual discernment |
| Stage 3 | 448×448 | Multimodal chat | 1M conversational pairs | Chat proficiency & alignment |
Training Infrastructure:
- GPU Resources: 128 × NVIDIA A800 (80GB) GPUs
- Training Duration: ~10 days for full model
- Distributed Training: Large-scale parallel training
Data Composition:
- High-quality image-text pairs from diverse sources
- Multilingual training data (English and Chinese)
- Conversational data for chat fine-tuning
- Synthetic data generation for specific tasks
Hardware Requirements
For Inference
Recommended Setup:
- GPU VRAM: 80 GB total minimum
- GPU Configuration:
- 4 × RTX 4090 (24GB each), or
- 1 × NVIDIA A800 (80GB), or
- 1 × NVIDIA A100 (80GB)
- CPU RAM: 64 GB minimum
- Storage: ~70 GB for model weights
Minimal Setup:
- Requires at least 80GB of GPU VRAM for full model loading
- Quantization or model parallelism may reduce requirements
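As a reference for model parallelism, below is a minimal sketch of sharding the FP16 weights across several smaller GPUs with `device_map`/`max_memory`. The 4 × 24 GB split is an assumption; adjust the budgets to your hardware, and note that the official inference code in the 01-ai/Yi repository may be required for this checkpoint.

```python
import torch
from transformers import AutoModelForVision2Seq

# Sketch: shard FP16 weights across four 24 GB GPUs via accelerate's device_map.
# Memory budgets below are assumptions; adjust them to your hardware.
model = AutoModelForVision2Seq.from_pretrained(
    "01-ai/Yi-VL-34B",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "22GiB", 1: "22GiB", 2: "22GiB", 3: "22GiB", "cpu": "64GiB"},
)
```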
For Training (Original)
- 128 × NVIDIA A800 (80GB) GPUs
- Multiple nodes with high-bandwidth interconnect
- Distributed training frameworks (DeepSpeed, PyTorch DDP)
API Providers & Access
Official Sources
| Provider | Type | Status | Access |
|---|---|---|---|
| Hugging Face | Model Hub | Available | https://huggingface.co/01-ai/Yi-VL-34B |
| ModelScope | Model Hub | Available | https://www.modelscope.cn/models/01ai/Yi-VL-34B |
| WiseModel | Model Hub | Available | https://wisemodel.cn/models/01.AI/Yi-VL-34B |
| LangMart | API Gateway | Not Available | - |
No Official API Endpoint
Unlike Yi-Lightning or other proprietary Yi models, Yi-VL-34B does not have an official API endpoint. Users must:
- Download and self-host the model, or
- Use integration services (Hugging Face Inference, Together AI, etc.)
Usage Examples
Local Deployment with Transformers
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Load model and processor
# Note: official Yi-VL inference code lives in the 01-ai/Yi GitHub repository;
# loading through transformers may require a recent version and/or trust_remote_code=True.
model_name = "01-ai/Yi-VL-34B"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load and process image
image = Image.open("path/to/image.jpg")

# Create messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]

# Prepare input: render the chat template to a prompt string, then tokenize it with the image
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# Generate response
output_ids = model.generate(**inputs, max_new_tokens=200)
response = processor.decode(output_ids[0], skip_special_tokens=True)
print(response)
```
Using Hugging Face Inference API
```python
from huggingface_hub import InferenceClient
from PIL import Image
import base64
import io

# Note: serverless availability depends on whether a provider currently hosts this model.
client = InferenceClient(api_key="your_hf_api_key")

# Load and encode image
image = Image.open("path/to/image.jpg")
img_buffer = io.BytesIO()
image.save(img_buffer, format="PNG")
img_base64 = base64.b64encode(img_buffer.getvalue()).decode()

# Create message with image
message = {
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_base64}"}},
        {"type": "text", "text": "Analyze this image in detail."},
    ],
}

# Call model
response = client.chat_completion(
    model="01-ai/Yi-VL-34B",
    messages=[message],
    max_tokens=500,
)
print(response.choices[0].message.content)
```
Docker Deployment
```dockerfile
FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04

WORKDIR /app

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3.11 python3-pip git \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN python3 -m pip install --no-cache-dir -r requirements.txt

COPY model_server.py .

EXPOSE 8000
CMD ["python3", "model_server.py"]
```
REST API Server (Flask Example)
```python
import base64
import io

import torch
from flask import Flask, request, jsonify
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

app = Flask(__name__)

# Load model once at startup
processor = AutoProcessor.from_pretrained("01-ai/Yi-VL-34B")
model = AutoModelForVision2Seq.from_pretrained(
    "01-ai/Yi-VL-34B",
    torch_dtype=torch.float16,
    device_map="auto",
)

@app.route("/analyze", methods=["POST"])
def analyze_image():
    data = request.json
    image_b64 = data["image"]
    question = data["question"]

    # Decode base64 image
    image_data = base64.b64decode(image_b64)
    image = Image.open(io.BytesIO(image_data)).convert("RGB")

    # Create messages
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question},
            ],
        }
    ]

    # Process: render the chat template, then tokenize the prompt together with the image
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=200)
    response = processor.decode(output_ids[0], skip_special_tokens=True)

    return jsonify({"response": response})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```
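A minimal client for this endpoint might look like the following; it assumes the server above is running locally on port 8000 and that a local `test.jpg` exists.

```python
import base64
import requests

# Read and base64-encode a local test image (the path is an assumption for this example)
with open("test.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8000/analyze",
    json={"image": image_b64, "question": "What is in this image?"},
    timeout=300,  # generous timeout: 34B inference can take tens of seconds
)
resp.raise_for_status()
print(resp.json()["response"])
```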
Known Limitations
Image Handling
| Limitation | Impact | Workaround |
|---|---|---|
| Single image per query | Cannot process multiple images simultaneously | Process images sequentially |
| Fixed 448×448 resolution | Images larger than this don't provide additional detail | Crop/resize appropriately |
| No 3D/video support | Cannot process video or 3D content | Extract frames for video |
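Given the single-image limit and the fixed input resolution, a common pattern is to resize each image and submit queries sequentially, as sketched below. Here `ask(image, question)` is a hypothetical helper wrapping the inference code shown earlier, not part of the model's API.

```python
from PIL import Image

def prepare(path: str) -> Image.Image:
    """Resize an image toward the model's fixed 448x448 input resolution."""
    img = Image.open(path).convert("RGB")
    return img.resize((448, 448), Image.LANCZOS)

# Workaround for the single-image limit: query each image sequentially.
# ask(image, question) is a hypothetical helper wrapping the inference code above.
paths = ["page_1.png", "page_2.png", "page_3.png"]
results = [ask(prepare(p), "Summarize the text in this image.") for p in paths]
```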
Model Behavior
| Issue | Description | Mitigation |
|---|---|---|
| Hallucination | May generate non-existent content or miss objects | Use detailed prompts, validate output |
| Multi-object scenes | Can miss objects in complex scenes with many items | Ask for comprehensive scene descriptions |
| Small text | May struggle with very small text in images | Provide higher contrast images |
| Context limitations | Long conversations may lose context | Summarize previous context explicitly |
Deployment Options
Option 1: Self-Hosted (Recommended for Control)
Pros:
- Full control over inference
- No rate limits
- Privacy (data stays on your servers)
- Customization options
Cons:
- Requires significant GPU resources (80GB VRAM)
- Maintenance burden
- Infrastructure costs
Setup Time: 2-4 hours
Option 2: Third-Party Inference Services
Services:
- Hugging Face Inference Endpoints
- Together AI
- Replicate
- Modal
Pros:
- No infrastructure management
- Automatic scaling
- Easy API integration
Cons:
- Usage costs
- Latency (API calls)
- Privacy considerations
- Rate limitations
Option 3: Quantized Inference
Options:
- 4-bit quantization (bitsandbytes)
- 8-bit quantization (bitsandbytes INT8)
- GGUF (successor to GGML) format for CPU inference
Benefits:
- Reduced VRAM requirements (from 80GB to ~20GB)
- Faster inference
- Lower costs
Tradeoff:
- Slight accuracy reduction
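Below is a minimal 4-bit loading sketch with bitsandbytes. It assumes the checkpoint loads through transformers as in the earlier examples; in practice the official code in the 01-ai/Yi repository may be required.

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

# 4-bit NF4 quantization via bitsandbytes; reduces VRAM from ~80 GB to roughly 20 GB
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

processor = AutoProcessor.from_pretrained("01-ai/Yi-VL-34B")
model = AutoModelForVision2Seq.from_pretrained(
    "01-ai/Yi-VL-34B",
    quantization_config=bnb_config,
    device_map="auto",
)
```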
Installation & Setup
Prerequisites
```bash
# Python 3.9+
python --version

# NVIDIA GPU with CUDA support
nvidia-smi

# At least 80GB GPU VRAM
# At least 64GB system RAM
# At least 100GB free storage
```
Quick Start Installation
```bash
# Create virtual environment
python -m venv yi-vl-env
source yi-vl-env/bin/activate  # Linux/Mac
# or
yi-vl-env\Scripts\activate     # Windows

# Install dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers pillow

# Download the model weights (cached locally on first run)
python -c "from huggingface_hub import snapshot_download; snapshot_download('01-ai/Yi-VL-34B')"
```
Detailed Setup Guide
```bash
# Install transformers with vision support
pip install "transformers>=4.36.0" "pillow>=10.0.0"

# For quantization (optional, to reduce VRAM)
pip install bitsandbytes

# For accelerated inference
pip install accelerate

# Verify installation
python -c "from transformers import AutoProcessor; print('Ready!')"
```
Performance & Benchmarks
Inference Speed
Typical Performance (A100-class GPU, or RTX 4090 with quantization):
- First token latency: 2-4 seconds
- Token generation rate: 15-25 tokens/second
- Full response time (200 tokens): 10-15 seconds
Note: Actual performance varies based on image complexity and hardware
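To measure throughput on your own setup, you can time `generate()` directly. The rough sketch below reuses the `model` and `inputs` variables from the transformers example above; it is a benchmarking aid, not part of the model's API.

```python
import time

# Rough throughput check, reusing `model` and `inputs` from the earlier example
start = time.perf_counter()
output_ids = model.generate(**inputs, max_new_tokens=200)
elapsed = time.perf_counter() - start

new_tokens = output_ids.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.1f}s "
      f"({new_tokens / elapsed:.1f} tokens/s, excluding image preprocessing)")
```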
Memory Usage
| Configuration | VRAM Required | System RAM |
|---|---|---|
| FP16 (full precision) | 80 GB | 64 GB |
| 8-bit quantization | 40 GB | 32 GB |
| 4-bit quantization | 20 GB | 16 GB |
Troubleshooting
Issue: "Out of Memory" Error
```
RuntimeError: CUDA out of memory
```
Solutions:
- Use 4-bit or 8-bit quantization
- Use model parallelism across multiple GPUs
- Use a service with larger GPU clusters
Issue: Slow Inference
Causes & Solutions:
- Running on CPU: Use GPU acceleration
- Full precision: Use quantization
- Large batch size: Reduce batch size to 1
- Competing processes: Check `nvidia-smi`
Issue: Model Won't Download
```bash
# Set HF token for large models
huggingface-cli login

# Or set environment variable
export HF_TOKEN="your_token_here"
```
Sources & References
- Official Repository: https://github.com/01-ai/Yi
- Model Card (Hugging Face): https://huggingface.co/01-ai/Yi-VL-34B
- ModelScope Page: https://www.modelscope.cn/models/01ai/Yi-VL-34B
- WiseModel Page: https://wisemodel.cn/models/01.AI/Yi-VL-34B
- 01.AI Official Website: https://www.01.ai/
- Yi Paper (arXiv): https://arxiv.org/abs/2403.04652
Citation
```bibtex
@misc{ai2024yi,
  title={Yi: Open Foundation Models by 01.AI},
  author={01.AI and others},
  year={2024},
  eprint={2403.04652},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
License
This model is released under the Apache 2.0 License, allowing:
- ✅ Commercial use
- ✅ Distribution
- ✅ Modification
- ✅ Private use
With the requirement to:
- Include license and copyright notice
- State changes made to the code
Last Updated: December 2024
Data Sources: Hugging Face Model Card, Official 01.AI Documentation, GitHub Repository
Availability Status: Open-source, self-hosted deployment required