Yi Vision 34B Model Documentation

Note: This model (Yi-VL-34B) is not currently available on LangMart. It is available through Hugging Face, ModelScope, and WiseModel. The information below is compiled from official 01.AI documentation and Hugging Face model card.

Model Overview

| Property | Value |
| --- | --- |
| Model Name | Yi-VL-34B (Vision-Language) |
| Developer | 01.AI (Zero One Everything / Lingyiwanwu) |
| Model ID | 01-ai/Yi-VL-34B (Hugging Face), 01ai/Yi-VL-34B (ModelScope) |
| Release Date | January 2024 |
| Model Type | Open-Source Vision-Language Model |
| Parameters | 34 billion |
| Architecture | LLaVA-based multimodal model |
| Availability | Open-source (Apache 2.0 License) |
| License | Apache 2.0 |

Description

Yi-VL-34B is the world's first open-source 34 billion parameter vision-language model, combining advanced image understanding with multilingual text generation capabilities. It represents a significant advancement in open-source multimodal AI, enabling sophisticated interactions between visual and textual information.

The model integrates CLIP ViT-H/14 for image encoding with the Yi-34B-Chat language model, creating a powerful system capable of understanding images and generating contextually relevant responses in both English and Chinese. As of January 2024, it ranked first among all existing open-source models on key vision-language benchmarks.

Capabilities

Core Features

  • ✅ Multi-round text-image conversations
  • ✅ Visual question answering (VQA)
  • ✅ Image content comprehension and recognition
  • ✅ Optical character recognition (OCR) - text in images
  • ✅ Information extraction from visual content
  • ✅ Image summarization and analysis
  • ✅ Bilingual responses (English & Chinese)
  • ✅ Complex visual reasoning

Supported Tasks

  • Document analysis and extraction
  • Scene understanding and description
  • Object detection and identification
  • Text recognition and extraction from images
  • Visual reasoning and inference
  • Multi-step visual problem solving

Supported Parameters

Inference Configuration

| Parameter | Type | Range | Description |
| --- | --- | --- | --- |
| temperature | float | 0.0 - 2.0 | Controls response randomness |
| top_p | float | 0.0 - 1.0 | Nucleus sampling threshold |
| top_k | integer | 1 - 100 | Top-k sampling count |
| max_tokens | integer | 1 - 2048 | Maximum output length |
| repetition_penalty | float | 0.0 - 2.0 | Penalty for token repetition |
| stream | boolean | true/false | Enable streaming output |
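
These names follow common inference-server conventions; for local Transformers inference they map onto keyword arguments of model.generate(). A minimal sketch, assuming the Transformers-based setup shown under Usage Examples (max_tokens corresponds to max_new_tokens, and streaming uses a streamer object rather than a boolean flag):

# Hypothetical sampling settings mirroring the table above, for local transformers inference
generation_kwargs = {
    "do_sample": True,             # sampling must be on for temperature/top_p/top_k to take effect
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "max_new_tokens": 512,         # "max_tokens" in the table
    "repetition_penalty": 1.1,
}
# output_ids = model.generate(**inputs, **generation_kwargs)
# Streaming output is handled with transformers.TextStreamer rather than a stream flag.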

Image Handling Parameters

| Parameter | Value | Description |
| --- | --- | --- |
| image_resolution | 448×448 | Fixed input resolution |
| image_format | JPG, PNG, BMP | Supported formats |
| max_images_per_query | 1 | Single image limitation |
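
Because inputs are processed at a fixed 448×448 resolution, it can help to resize or letterbox images yourself before sending them. A minimal sketch with Pillow; the 448×448 target and RGB conversion follow the table above, while the letterboxing approach itself is an illustrative choice rather than the official preprocessing:

from PIL import Image

def prepare_image(path, size=448):
    """Load an image, convert it to RGB, and letterbox it onto a size x size canvas."""
    img = Image.open(path).convert("RGB")   # JPG, PNG, and BMP are supported formats
    img.thumbnail((size, size))             # downscale in place, preserving aspect ratio
    canvas = Image.new("RGB", (size, size), (0, 0, 0))
    canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
    return canvas

# image = prepare_image("path/to/image.jpg")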

Use Cases

Recommended For

  • Document digitization and data extraction
  • Visual quality assurance and inspection
  • Medical image analysis (with appropriate disclaimers)
  • Product catalog enrichment
  • Educational content analysis
  • Accessibility features (image description)
  • Content moderation and filtering
  • Technical diagram understanding
  • Scene understanding and navigation
  • Research and academic analysis

Not Recommended For

  • Real-time video processing (no video input support)
  • Multi-image batch processing without modification (single image per query)
  • Production systems requiring high availability without self-hosting (no official API)
  • Tasks requiring proprietary API guarantees

Related Models by 01.AI

| Model | Type | Parameters | Context | Status |
| --- | --- | --- | --- | --- |
| Yi-Lightning | LLM | - | 16K | Proprietary API |
| Yi-Large | LLM | - | 32K | Proprietary API |
| Yi-34B | Base LLM | 34B | 4K-200K | Open-source |
| Yi-9B | Base LLM | 9B | 4K-200K | Open-source |
| Yi-6B | Base LLM | 6B | 4K-200K | Open-source |
| Yi-Coder | Code LLM | - | 128K | Open-source |
| Yi-VL-34B | Vision-Language | 34B | Multi-round | Open-source |
| Yi-VL-6B | Vision-Language | 6B | Multi-round | Open-source |

Similar Vision-Language Models

| Model | Organization | Parameters | Open-source |
| --- | --- | --- | --- |
| LLaVA-34B | LLaVA project | 34B | Yes |
| Qwen-VL | Alibaba | 9.6B | Yes |
| Flamingo | DeepMind | 80B | No |
| GPT-4 Vision | OpenAI | - | No (API only) |
| Claude 3 Vision | Anthropic | - | No (API only) |
| Gemini Vision | Google | - | No (API only) |

Model Specifications

Architecture & Capacity

| Specification | Value |
| --- | --- |
| Total Parameters | 34 billion |
| Vision Encoder | CLIP ViT-H/14 (Vision Transformer) |
| Projection Module | Two-layer MLP with layer normalization |
| Language Model | Yi-34B-Chat base |
| Input Image Resolution | 448×448 pixels |
| Bilingual Support | English and Chinese |

Input/Output Modalities

| Modality | Support |
| --- | --- |
| Image Input | Yes (single image per query) |
| Text Input | Yes |
| Text Output | Yes (multilingual: English & Chinese) |
| Multi-round Conversations | Yes (text-image conversations) |

Context & Processing

| Property | Value |
| --- | --- |
| Context Type | Multi-round visual question answering |
| Image Encoding | CLIP ViT-H/14 transformer |
| Maximum Image Resolution | 448×448 (fixed) |
| Image Input Limitation | Single image per query |

Performance Metrics

Benchmark Rankings (January 2024)

| Benchmark | Rank | Status |
| --- | --- | --- |
| MMMU (English) | 1st among open-source models | ✅ Best in class |
| CMMMU (Chinese) | 1st among open-source models | ✅ Best in class |

Comparison with Alternatives

As of its January 2024 release, Yi-VL-34B stands out as:

  • Largest open-source vision-language model (34B parameters)
  • Bilingual capability (English & Chinese) from a single model
  • Strongest performance on MMMU and CMMMU benchmarks among open-source models
  • Lower inference cost compared to proprietary alternatives

Technical Architecture

Component Breakdown

1. Vision Encoder

  • Model: CLIP ViT-H/14 (Vision Transformer)
  • Purpose: Encodes images into feature vectors
  • Resolution: Processes 448×448 pixel images
  • Output: Visual embeddings aligned with text space

2. Projection Module

  • Type: Two-layer MLP (Multi-Layer Perceptron)
  • Layers: Dense layers with layer normalization
  • Purpose: Aligns image feature space with language model feature space
  • Key Innovation: Enables seamless integration between vision and language

3. Language Model

  • Base Model: Yi-34B-Chat
  • Capabilities: Text generation, understanding, and reasoning
  • Languages: English and Chinese
  • Purpose: Generates contextually relevant responses to visual and textual inputs
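
The following schematic sketches how these three components fit together at inference time. Module names, dimensions, and the activation choice are illustrative assumptions; the actual implementation lives in the official 01-ai repository:

import torch
import torch.nn as nn

class ProjectionMLP(nn.Module):
    """Two-layer MLP with layer normalization that maps vision features into the LLM embedding space."""
    def __init__(self, vision_dim, llm_dim):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.LayerNorm(llm_dim),
            nn.GELU(),                      # activation choice is illustrative
            nn.Linear(llm_dim, llm_dim),
            nn.LayerNorm(llm_dim),
        )

    def forward(self, vision_features):
        return self.proj(vision_features)

# Conceptual data flow (vision_encoder and language_model are placeholders):
#   patches    = vision_encoder(image_448x448)          # CLIP ViT-H/14 -> patch embeddings
#   vis_tokens = ProjectionMLP(...)(patches)            # align with the LLM embedding space
#   embeds     = concat(vis_tokens, text_embeddings)    # image tokens joined with the text
#   reply      = language_model(inputs_embeds=embeds)   # Yi-34B-Chat generates the response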

Training Methodology

Three-Stage Training Process:

| Stage | Resolution | Focus | Data Size | Purpose |
| --- | --- | --- | --- | --- |
| Stage 1 | 224×224 | ViT-LLM alignment | 100M image-text pairs | Initial projection learning |
| Stage 2 | 448×448 | Fine visual detail | 25M image-text pairs | Enhanced visual discernment |
| Stage 3 | 448×448 | Multimodal chat | 1M conversational pairs | Chat proficiency & alignment |

Training Infrastructure:

  • GPU Resources: 128 × NVIDIA A800 (80GB) GPUs
  • Training Duration: ~10 days for full model
  • Distributed Training: Large-scale parallel training

Data Composition:

  • High-quality image-text pairs from diverse sources
  • Multilingual training data (English and Chinese)
  • Conversational data for chat fine-tuning
  • Synthetic data generation for specific tasks

Hardware Requirements

For Inference

Recommended Setup:

  • GPU VRAM: 80 GB total minimum
  • GPU Configuration:
    • 4 × RTX 4090 (24GB each), or
    • 1 × NVIDIA A800 (80GB), or
    • 1 × NVIDIA A100 (80GB)
  • CPU RAM: 64 GB minimum
  • Storage: ~70 GB for model weights

Minimal Setup:

  • Requires at least 80GB of GPU VRAM for full model loading
  • Quantization or model parallelism may reduce requirements

For Training (Original)

  • 128 × NVIDIA A800 (80GB) GPUs
  • Multiple nodes with high-bandwidth interconnect
  • Distributed training frameworks (DeepSpeed, PyTorch DDP)

API Providers & Access

Official Sources

| Provider | Type | Status | Access |
| --- | --- | --- | --- |
| Hugging Face | Model Hub | Available | https://huggingface.co/01-ai/Yi-VL-34B |
| ModelScope | Model Hub | Available | https://www.modelscope.cn/models/01ai/Yi-VL-34B |
| WiseModel | Model Hub | Available | https://wisemodel.cn/models/01.AI/Yi-VL-34B |
| LangMart | API Gateway | Not Available | - |

No Official API Endpoint

Unlike Yi-Lightning or other proprietary Yi models, Yi-VL-34B does not have an official API endpoint. Users must:

  1. Download and self-host the model, or
  2. Use integration services (Hugging Face Inference, Together AI, etc.)

Usage Examples

Local Deployment with Transformers

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
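
# Note: Yi-VL publishes LLaVA-format weights; if the generic Auto classes below
# cannot load it with your version of transformers, fall back to the LLaVA-based
# inference code in the official 01-ai GitHub repository.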

# Load model and processor
model_name = "01-ai/Yi-VL-34B"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load and process image
image = Image.open("path/to/image.jpg")

# Create messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is in this image?"}
        ]
    }
]

# Prepare input: render the chat template to a prompt string, then encode text and image together
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

# Generate response
output_ids = model.generate(**inputs, max_new_tokens=200)
response = processor.decode(output_ids[0], skip_special_tokens=True)
print(response)

Using Hugging Face Inference API

from huggingface_hub import InferenceClient
from PIL import Image
import base64
import io

client = InferenceClient(api_key="your_hf_api_key")

# Load and encode image
image = Image.open("path/to/image.jpg")
img_buffer = io.BytesIO()
image.save(img_buffer, format="PNG")
img_base64 = base64.b64encode(img_buffer.getvalue()).decode()

# Create message with image
message = {
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_base64}"}},
        {"type": "text", "text": "Analyze this image in detail."}
    ]
}

# Call model
response = client.chat_completion(
    model="01-ai/Yi-VL-34B",
    messages=[message],
    max_tokens=500
)

print(response.choices[0].message.content)

Docker Deployment

FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04

WORKDIR /app

RUN apt-get update && apt-get install -y \
    python3 python3-pip git \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY model_server.py .

EXPOSE 8000

CMD ["python3", "model_server.py"]

REST API Server (Flask Example)

from flask import Flask, request, jsonify
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
import base64
import io

app = Flask(__name__)

# Load model once at startup
processor = AutoProcessor.from_pretrained("01-ai/Yi-VL-34B")
model = AutoModelForVision2Seq.from_pretrained(
    "01-ai/Yi-VL-34B",
    torch_dtype=torch.float16,
    device_map="auto"
)

@app.route("/analyze", methods=["POST"])
def analyze_image():
    data = request.json
    image_b64 = data["image"]
    question = data["question"]

    # Decode base64 image
    image_data = base64.b64decode(image_b64)
    image = Image.open(io.BytesIO(image_data))

    # Create messages
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question}
            ]
        }
    ]

    # Process: render the chat prompt, encode text and image, then generate
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=200)
    response = processor.decode(output_ids[0], skip_special_tokens=True)

    return jsonify({"response": response})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
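
A small client for the server above. The /analyze route, the base64 image field, and port 8000 match the Flask example; the file path and question are placeholders:

import base64
import requests  # pip install requests

# Encode an image and query the local REST server defined above
with open("path/to/image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8000/analyze",
    json={"image": image_b64, "question": "What is in this image?"},
    timeout=300,  # generation on a 34B model can take a while
)
resp.raise_for_status()
print(resp.json()["response"])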

Known Limitations

Image Handling

| Limitation | Impact | Workaround |
| --- | --- | --- |
| Single image per query | Cannot process multiple images simultaneously | Process images sequentially |
| Fixed 448×448 resolution | Images larger than this don't provide additional detail | Crop/resize appropriately |
| No 3D/video support | Cannot process video or 3D content | Extract frames for video |
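
For the sequential-processing and frame-extraction workarounds above, one approach is to extract frames with OpenCV and query the model one image at a time. A rough sketch; describe_image() is a hypothetical helper standing in for whichever inference path (local or hosted) you use:

import cv2  # pip install opencv-python

def iter_frames(video_path, every_n=30):
    """Yield every n-th frame of a video as an RGB array."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            yield cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        idx += 1
    cap.release()

# for frame in iter_frames("clip.mp4"):
#     print(describe_image(frame))  # hypothetical: wraps a single-image query to Yi-VL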

Model Behavior

| Issue | Description | Mitigation |
| --- | --- | --- |
| Hallucination | May generate non-existent content or miss objects | Use detailed prompts, validate output |
| Multi-object scenes | Can miss objects in complex scenes with many items | Ask for comprehensive scene descriptions |
| Small text | May struggle with very small text in images | Provide higher-contrast images |
| Context limitations | Long conversations may lose context | Summarize previous context explicitly |

Deployment Options

Option 1: Self-Hosted Deployment

Pros:

  • Full control over inference
  • No rate limits
  • Privacy (data stays on your servers)
  • Customization options

Cons:

  • Requires significant GPU resources (80GB VRAM)
  • Maintenance burden
  • Infrastructure costs

Setup Time: 2-4 hours

Option 2: Third-Party Inference Services

Services:

  • Hugging Face Inference Endpoints
  • Together AI
  • Replicate
  • Modal

Pros:

  • No infrastructure management
  • Automatic scaling
  • Easy API integration

Cons:

  • Usage costs
  • Latency (API calls)
  • Privacy considerations
  • Rate limitations

Option 3: Quantized Inference

Options:

  • 4-bit quantization (bitsandbytes; see the sketch below)
  • 8-bit quantization (bitsandbytes INT8)
  • GGUF/GGML format (for CPU inference)

Benefits:

  • Reduced VRAM requirements (from 80GB to ~20GB)
  • Faster inference
  • Lower costs

Tradeoff:

  • Slight accuracy reduction
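
A sketch of 4-bit loading with bitsandbytes through the BitsAndBytesConfig option in transformers. As with the earlier examples, whether the generic Auto classes accept Yi-VL directly depends on your transformers version:

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

# 4-bit NF4 quantization: roughly 80 GB -> ~20 GB of VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

processor = AutoProcessor.from_pretrained("01-ai/Yi-VL-34B")
model = AutoModelForVision2Seq.from_pretrained(
    "01-ai/Yi-VL-34B",
    quantization_config=bnb_config,
    device_map="auto",  # requires accelerate and bitsandbytes
)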

Installation & Setup

Prerequisites

# Python 3.9+
python --version

# NVIDIA GPU with CUDA support
nvidia-smi

# At least 80GB GPU VRAM
# At least 64GB system RAM
# At least 100GB free storage

Quick Start Installation

# Create virtual environment
python -m venv yi-vl-env
source yi-vl-env/bin/activate  # Linux/Mac
# or
yi-vl-env\Scripts\activate  # Windows

# Install dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers accelerate pillow

# Pre-download the model weights (~70 GB; cached under ~/.cache/huggingface)
python -c "from huggingface_hub import snapshot_download; snapshot_download('01-ai/Yi-VL-34B')"

Detailed Setup Guide

# Install transformers with vision support
pip install "transformers>=4.36.0" "pillow>=10.0.0"

# For quantization (optional, to reduce VRAM)
pip install bitsandbytes

# For accelerated inference
pip install accelerate

# Verify installation
python -c "from transformers import AutoProcessor; print('Ready!')"

Performance & Benchmarks

Inference Speed

Typical Performance (1 × A100 80GB or 4 × RTX 4090):

  • First token latency: 2-4 seconds
  • Token generation rate: 15-25 tokens/second
  • Full response time (200 tokens): 10-15 seconds

Note: Actual performance varies based on image complexity and hardware
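
A rough way to measure throughput on your own hardware. This assumes the model and inputs variables from the "Local Deployment with Transformers" example; the figures above are indicative only:

import time

# Time a single generation and report tokens per second
start = time.perf_counter()
output_ids = model.generate(**inputs, max_new_tokens=200)
elapsed = time.perf_counter() - start

new_tokens = output_ids.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.1f} tokens/s")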

Memory Usage

| Configuration | VRAM Required | System RAM |
| --- | --- | --- |
| FP16 (full precision) | 80 GB | 64 GB |
| 8-bit quantization | 40 GB | 32 GB |
| 4-bit quantization | 20 GB | 16 GB |

Troubleshooting

Issue: "Out of Memory" Error

RuntimeError: CUDA out of memory

Solutions:

  1. Use 4-bit or 8-bit quantization
  2. Use model parallelism across multiple GPUs (see the sketch below)
  3. Use a service with larger GPU clusters
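
For solution 2, transformers together with accelerate can shard the weights across several GPUs automatically. A sketch assuming four 24 GB cards; the per-device memory budgets are illustrative:

import torch
from transformers import AutoModelForVision2Seq

# Shard the model across 4 GPUs, capping each at ~22 GiB to leave headroom for activations
model = AutoModelForVision2Seq.from_pretrained(
    "01-ai/Yi-VL-34B",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "22GiB", 1: "22GiB", 2: "22GiB", 3: "22GiB", "cpu": "64GiB"},
)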

Issue: Slow Inference

Causes & Solutions:

  • Running on CPU: Use GPU acceleration
  • Full precision: Use quantization
  • Large batch size: Reduce batch size to 1
  • Competing processes: Check nvidia-smi

Issue: Model Won't Download

# Set HF token for large models
huggingface-cli login

# Or set environment variable
export HF_TOKEN="your_token_here"

Sources & References

Citation

@misc{ai2024yi,
    title={Yi: Open Foundation Models by 01.AI},
    author={01.AI and others},
    year={2024},
    eprint={2403.04652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

License

This model is released under the Apache 2.0 License, allowing:

  • ✅ Commercial use
  • ✅ Distribution
  • ✅ Modification
  • ✅ Private use

With the requirement to:

  • Include license and copyright notice
  • State changes made to the code

Last Updated: December 2024
Data Sources: Hugging Face Model Card, Official 01.AI Documentation, GitHub Repository
Availability Status: Open-source, self-hosted deployment required