Yi Vision 34B Model Documentation

Note: This model (Yi-VL-34B) is not currently available on LangMart. It is available through Hugging Face, ModelScope, and WiseModel. The information below is compiled from official 01.AI documentation and Hugging Face model card.

Model Overview

| Property | Value |
| --- | --- |
| Model Name | Yi-VL-34B (Vision-Language) |
| Developer | 01.AI (Zero One Everything / Lingyiwanwu) |
| Model ID | 01-ai/Yi-VL-34B (Hugging Face), 01ai/Yi-VL-34B (ModelScope) |
| Release Date | January 2024 |
| Model Type | Open-Source Vision-Language Model |
| Parameters | 34 billion |
| Architecture | LLaVA-based multimodal model |
| Availability | Open-source (Apache 2.0 License) |
| License | Apache 2.0 |

Description

Yi-VL-34B is the world's first open-source 34 billion parameter vision-language model, combining advanced image understanding with multilingual text generation capabilities. It represents a significant advancement in open-source multimodal AI, enabling sophisticated interactions between visual and textual information.

The model integrates CLIP ViT-H/14 for image encoding with the Yi-34B-Chat language model, creating a powerful system capable of understanding images and generating contextually relevant responses in both English and Chinese. As of January 2024, it ranked first among all existing open-source models on key vision-language benchmarks.

Capabilities

Core Features

  • ✅ Multi-round text-image conversations
  • ✅ Visual question answering (VQA)
  • ✅ Image content comprehension and recognition
  • ✅ Optical character recognition (OCR) - text in images
  • ✅ Information extraction from visual content
  • ✅ Image summarization and analysis
  • ✅ Bilingual responses (English & Chinese)
  • ✅ Complex visual reasoning

Supported Tasks

  • Document analysis and extraction
  • Scene understanding and description
  • Object detection and identification
  • Text recognition and extraction from images
  • Visual reasoning and inference
  • Multi-step visual problem solving

Supported Parameters

Inference Configuration

| Parameter | Type | Range | Description |
| --- | --- | --- | --- |
| temperature | float | 0.0 - 2.0 | Controls response randomness |
| top_p | float | 0.0 - 1.0 | Nucleus sampling threshold |
| top_k | integer | 1 - 100 | Top-k sampling count |
| max_tokens | integer | 1 - 2048 | Maximum output length |
| repetition_penalty | float | 0.0 - 2.0 | Penalty for token repetition |
| stream | boolean | true/false | Enable streaming output |
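
These names follow common inference-server conventions; for local Transformers inference they map onto keyword arguments of model.generate(). A minimal sketch, assuming the Transformers-based setup shown under Usage Examples (max_tokens corresponds to max_new_tokens, and streaming uses a streamer object rather than a boolean flag):

# Hypothetical sampling settings mirroring the table above, for local transformers inference
generation_kwargs = {
    "do_sample": True,             # sampling must be on for temperature/top_p/top_k to take effect
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "max_new_tokens": 512,         # "max_tokens" in the table
    "repetition_penalty": 1.1,
}
# output_ids = model.generate(**inputs, **generation_kwargs)
# Streaming output is handled with transformers.TextStreamer rather than a stream flag.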

Image Handling Parameters

| Parameter | Value | Description |
| --- | --- | --- |
| image_resolution | 448×448 | Fixed input resolution |
| image_format | JPG, PNG, BMP | Supported formats |
| max_images_per_query | 1 | Single image limitation |
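
Because inputs are processed at a fixed 448×448 resolution, it can help to resize or letterbox images yourself before sending them. A minimal sketch with Pillow; the 448×448 target and RGB conversion follow the table above, while the letterboxing approach itself is an illustrative choice rather than the official preprocessing:

from PIL import Image

def prepare_image(path, size=448):
    """Load an image, convert it to RGB, and letterbox it onto a size x size canvas."""
    img = Image.open(path).convert("RGB")   # JPG, PNG, and BMP are supported formats
    img.thumbnail((size, size))             # downscale in place, preserving aspect ratio
    canvas = Image.new("RGB", (size, size), (0, 0, 0))
    canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
    return canvas

# image = prepare_image("path/to/image.jpg")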

Use Cases

Recommended For

  • Document digitization and data extraction
  • Visual quality assurance and inspection
  • Medical image analysis (with appropriate disclaimers)
  • Product catalog enrichment
  • Educational content analysis
  • Accessibility features (image description)
  • Content moderation and filtering
  • Technical diagram understanding
  • Scene understanding and navigation
  • Research and academic analysis

Not Recommended For

  • Real-time video processing (no video input support)
  • Multi-image batch processing without modification (single image per query)
  • Production systems requiring high availability without self-hosting (no official API)
  • Tasks requiring proprietary API guarantees

Related Models by 01.AI

| Model | Type | Parameters | Context | Status |
| --- | --- | --- | --- | --- |
| Yi-Lightning | LLM | - | 16K | Proprietary API |
| Yi-Large | LLM | - | 32K | Proprietary API |
| Yi-34B | Base LLM | 34B | 4K-200K | Open-source |
| Yi-9B | Base LLM | 9B | 4K-200K | Open-source |
| Yi-6B | Base LLM | 6B | 4K-200K | Open-source |
| Yi-Coder | Code LLM | - | 128K | Open-source |
| Yi-VL-34B | Vision-Language | 34B | Multi-round | Open-source |
| Yi-VL-6B | Vision-Language | 6B | Multi-round | Open-source |

Similar Vision-Language Models

| Model | Organization | Parameters | Open-source |
| --- | --- | --- | --- |
| LLaVA-34B | LLaVA project | 34B | Yes |
| Qwen-VL | Alibaba | 9.6B | Yes |
| Flamingo | DeepMind | 80B | No |
| GPT-4 Vision | OpenAI | - | No (API only) |
| Claude 3 Vision | Anthropic | - | No (API only) |
| Gemini Vision | Google | - | No (API only) |

Model Specifications

Architecture & Capacity

| Specification | Value |
| --- | --- |
| Total Parameters | 34 billion |
| Vision Encoder | CLIP ViT-H/14 (Vision Transformer) |
| Projection Module | Two-layer MLP with layer normalization |
| Language Model | Yi-34B-Chat base |
| Input Image Resolution | 448×448 pixels |
| Bilingual Support | English and Chinese |

Input/Output Modalities

| Modality | Support |
| --- | --- |
| Image Input | Yes (single image per query) |
| Text Input | Yes |
| Text Output | Yes (multilingual: English & Chinese) |
| Multi-round Conversations | Yes (text-image conversations) |

Context & Processing

| Property | Value |
| --- | --- |
| Context Type | Multi-round visual question answering |
| Image Encoding | CLIP ViT-H/14 transformer |
| Maximum Image Resolution | 448×448 (fixed) |
| Image Input Limitation | Single image per query |

Performance Metrics

Benchmark Rankings (January 2024)

| Benchmark | Rank | Status |
| --- | --- | --- |
| MMMU (English) | 1st among open-source models | ✅ Best in class |
| CMMMU (Chinese) | 1st among open-source models | ✅ Best in class |

Comparison with Alternatives

As of its January 2024 release, Yi-VL-34B stands out as:

  • Largest open-source vision-language model (34B parameters)
  • Bilingual capability (English & Chinese) from a single model
  • Strongest performance on MMMU and CMMMU benchmarks among open-source models
  • Lower inference cost compared to proprietary alternatives

Technical Architecture

Component Breakdown

1. Vision Encoder

  • Model: CLIP ViT-H/14 (Vision Transformer)
  • Purpose: Encodes images into feature vectors
  • Resolution: Processes 448×448 pixel images
  • Output: Visual embeddings aligned with text space

2. Projection Module

  • Type: Two-layer MLP (Multi-Layer Perceptron)
  • Layers: Dense layers with layer normalization
  • Purpose: Aligns image feature space with language model feature space
  • Key Innovation: Enables seamless integration between vision and language

3. Language Model

  • Base Model: Yi-34B-Chat
  • Capabilities: Text generation, understanding, and reasoning
  • Languages: English and Chinese
  • Purpose: Generates contextually relevant responses to visual and textual inputs
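
The following schematic sketches how these three components fit together at inference time. Module names, dimensions, and the activation choice are illustrative assumptions; the actual implementation lives in the official 01-ai repository:

import torch
import torch.nn as nn

class ProjectionMLP(nn.Module):
    """Two-layer MLP with layer normalization that maps vision features into the LLM embedding space."""
    def __init__(self, vision_dim, llm_dim):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.LayerNorm(llm_dim),
            nn.GELU(),                      # activation choice is illustrative
            nn.Linear(llm_dim, llm_dim),
            nn.LayerNorm(llm_dim),
        )

    def forward(self, vision_features):
        return self.proj(vision_features)

# Conceptual data flow (vision_encoder and language_model are placeholders):
#   patches    = vision_encoder(image_448x448)          # CLIP ViT-H/14 -> patch embeddings
#   vis_tokens = ProjectionMLP(...)(patches)            # align with the LLM embedding space
#   embeds     = concat(vis_tokens, text_embeddings)    # image tokens joined with the text
#   reply      = language_model(inputs_embeds=embeds)   # Yi-34B-Chat generates the response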

Training Methodology

Three-Stage Training Process:

| Stage | Resolution | Focus | Data Size | Purpose |
| --- | --- | --- | --- | --- |
| Stage 1 | 224×224 | ViT-LLM alignment | 100M image-text pairs | Initial projection learning |
| Stage 2 | 448×448 | Fine visual detail | 25M image-text pairs | Enhanced visual discernment |
| Stage 3 | 448×448 | Multimodal chat | 1M conversational pairs | Chat proficiency & alignment |

Training Infrastructure:

  • GPU Resources: 128 × NVIDIA A800 (80GB) GPUs
  • Training Duration: ~10 days for full model
  • Distributed Training: Large-scale parallel training

Data Composition:

  • High-quality image-text pairs from diverse sources
  • Multilingual training data (English and Chinese)
  • Conversational data for chat fine-tuning
  • Synthetic data generation for specific tasks

Hardware Requirements

For Inference

Recommended Setup:

  • GPU VRAM: 80 GB total minimum
  • GPU Configuration:
    • 4 × RTX 4090 (24GB each), or
    • 1 × NVIDIA A800 (80GB), or
    • 1 × NVIDIA A100 (80GB)
  • CPU RAM: 64 GB minimum
  • Storage: ~70 GB for model weights

Minimal Setup:

  • Requires at least 80GB of GPU VRAM for full model loading
  • Quantization or model parallelism may reduce requirements

For Training (Original)

  • 128 × NVIDIA A800 (80GB) GPUs
  • Multiple nodes with high-bandwidth interconnect
  • Distributed training frameworks (DeepSpeed, PyTorch DDP)

API Providers & Access

Official Sources

| Provider | Type | Status | Access |
| --- | --- | --- | --- |
| Hugging Face | Model Hub | Available | https://huggingface.co/01-ai/Yi-VL-34B |
| ModelScope | Model Hub | Available | https://www.modelscope.cn/models/01ai/Yi-VL-34B |
| WiseModel | Model Hub | Available | https://wisemodel.cn/models/01.AI/Yi-VL-34B |
| LangMart | API Gateway | Not Available | - |

No Official API Endpoint

Unlike Yi-Lightning or other proprietary Yi models, Yi-VL-34B does not have an official API endpoint. Users must:

  1. Download and self-host the model, or
  2. Use integration services (Hugging Face Inference, Together AI, etc.)

Usage Examples

Local Deployment with Transformers

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
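
# Note: Yi-VL publishes LLaVA-format weights; if the generic Auto classes below
# cannot load it with your version of transformers, fall back to the LLaVA-based
# inference code in the official 01-ai GitHub repository.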

# Load model and processor
model_name = "01-ai/Yi-VL-34B"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load and process image
image = Image.open("path/to/image.jpg")

# Create messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is in this image?"}
        ]
    }
]

# Prepare input: render the chat template to a prompt string, then encode text and image together
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

# Generate response
output_ids = model.generate(**inputs, max_new_tokens=200)
response = processor.decode(output_ids[0], skip_special_tokens=True)
print(response)

Using Hugging Face Inference API

from huggingface_hub import InferenceClient
from PIL import Image
import base64
import io

client = InferenceClient(api_key="your_hf_api_key")

# Load and encode image
image = Image.open("path/to/image.jpg")
img_buffer = io.BytesIO()
image.save(img_buffer, format="PNG")
img_base64 = base64.b64encode(img_buffer.getvalue()).decode()

# Create message with image
message = {
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_base64}"}},
        {"type": "text", "text": "Analyze this image in detail."}
    ]
}

# Call model
response = client.chat_completion(
    model="01-ai/Yi-VL-34B",
    messages=[message],
    max_tokens=500
)

print(response.choices[0].message.content)

Docker Deployment

FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04

WORKDIR /app

RUN apt-get update && apt-get install -y \
    python3 python3-pip git \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY model_server.py .

EXPOSE 8000

CMD ["python3", "model_server.py"]

REST API Server (Flask Example)

from flask import Flask, request, jsonify
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
import base64
import io

app = Flask(__name__)

# Load model once at startup
processor = AutoProcessor.from_pretrained("01-ai/Yi-VL-34B")
model = AutoModelForVision2Seq.from_pretrained(
    "01-ai/Yi-VL-34B",
    torch_dtype=torch.float16,
    device_map="auto"
)

@app.route("/analyze", methods=["POST"])
def analyze_image():
    data = request.json
    image_b64 = data["image"]
    question = data["question"]

    # Decode base64 image
    image_data = base64.b64decode(image_b64)
    image = Image.open(io.BytesIO(image_data))

    # Create messages
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question}
            ]
        }
    ]

    # Process: render the chat prompt, encode text and image, then generate
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=200)
    response = processor.decode(output_ids[0], skip_special_tokens=True)

    return jsonify({"response": response})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
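
A small client for the server above. The /analyze route, the base64 image field, and port 8000 match the Flask example; the file path and question are placeholders:

import base64
import requests  # pip install requests

# Encode an image and query the local REST server defined above
with open("path/to/image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:8000/analyze",
    json={"image": image_b64, "question": "What is in this image?"},
    timeout=300,  # generation on a 34B model can take a while
)
resp.raise_for_status()
print(resp.json()["response"])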

Known Limitations

Image Handling

| Limitation | Impact | Workaround |
| --- | --- | --- |
| Single image per query | Cannot process multiple images simultaneously | Process images sequentially |
| Fixed 448×448 resolution | Images larger than this don't provide additional detail | Crop/resize appropriately |
| No 3D/video support | Cannot process video or 3D content | Extract frames for video |
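
For the sequential-processing and frame-extraction workarounds above, one approach is to extract frames with OpenCV and query the model one image at a time. A rough sketch; describe_image() is a hypothetical helper standing in for whichever inference path (local or hosted) you use:

import cv2  # pip install opencv-python

def iter_frames(video_path, every_n=30):
    """Yield every n-th frame of a video as an RGB array."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            yield cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        idx += 1
    cap.release()

# for frame in iter_frames("clip.mp4"):
#     print(describe_image(frame))  # hypothetical: wraps a single-image query to Yi-VL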

Model Behavior

| Issue | Description | Mitigation |
| --- | --- | --- |
| Hallucination | May generate non-existent content or miss objects | Use detailed prompts, validate output |
| Multi-object scenes | Can miss objects in complex scenes with many items | Ask for comprehensive scene descriptions |
| Small text | May struggle with very small text in images | Provide higher-contrast images |
| Context limitations | Long conversations may lose context | Summarize previous context explicitly |

Deployment Options

Option 1: Self-Hosted Deployment

Pros:

  • Full control over inference
  • No rate limits
  • Privacy (data stays on your servers)
  • Customization options

Cons:

  • Requires significant GPU resources (80GB VRAM)
  • Maintenance burden
  • Infrastructure costs

Setup Time: 2-4 hours

Option 2: Third-Party Inference Services

Services:

  • Hugging Face Inference Endpoints
  • Together AI
  • Replicate
  • Modal

Pros:

  • No infrastructure management
  • Automatic scaling
  • Easy API integration

Cons:

  • Usage costs
  • Latency (API calls)
  • Privacy considerations
  • Rate limitations

Option 3: Quantized Inference

Options:

  • 4-bit quantization (bitsandbytes; see the sketch below)
  • 8-bit quantization (bitsandbytes INT8)
  • GGUF/GGML format (for CPU inference)

Benefits:

  • Reduced VRAM requirements (from 80GB to ~20GB)
  • Faster inference
  • Lower costs

Tradeoff:

  • Slight accuracy reduction
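
A sketch of 4-bit loading with bitsandbytes through the BitsAndBytesConfig option in transformers. As with the earlier examples, whether the generic Auto classes accept Yi-VL directly depends on your transformers version:

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

# 4-bit NF4 quantization: roughly 80 GB -> ~20 GB of VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

processor = AutoProcessor.from_pretrained("01-ai/Yi-VL-34B")
model = AutoModelForVision2Seq.from_pretrained(
    "01-ai/Yi-VL-34B",
    quantization_config=bnb_config,
    device_map="auto",  # requires accelerate and bitsandbytes
)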

Installation & Setup

Prerequisites

# Python 3.9+
python --version

# NVIDIA GPU with CUDA support
nvidia-smi

# At least 80GB GPU VRAM
# At least 64GB system RAM
# At least 100GB free storage

Quick Start Installation

# Create virtual environment
python -m venv yi-vl-env
source yi-vl-env/bin/activate  # Linux/Mac
# or
yi-vl-env\Scripts\activate  # Windows

# Install dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers accelerate pillow

# Pre-download the model weights (~70 GB; cached under ~/.cache/huggingface)
python -c "from huggingface_hub import snapshot_download; snapshot_download('01-ai/Yi-VL-34B')"

Detailed Setup Guide

# Install transformers with vision support
pip install "transformers>=4.36.0" "pillow>=10.0.0"

# For quantization (optional, to reduce VRAM)
pip install bitsandbytes

# For accelerated inference
pip install accelerate

# Verify installation
python -c "from transformers import AutoProcessor; print('Ready!')"

Performance & Benchmarks

Inference Speed

Typical Performance (1 × A100 80GB or 4 × RTX 4090):

  • First token latency: 2-4 seconds
  • Token generation rate: 15-25 tokens/second
  • Full response time (200 tokens): 10-15 seconds

Note: Actual performance varies based on image complexity and hardware
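
A rough way to measure throughput on your own hardware. This assumes the model and inputs variables from the "Local Deployment with Transformers" example; the figures above are indicative only:

import time

# Time a single generation and report tokens per second
start = time.perf_counter()
output_ids = model.generate(**inputs, max_new_tokens=200)
elapsed = time.perf_counter() - start

new_tokens = output_ids.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.1f} tokens/s")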

Memory Usage

| Configuration | VRAM Required | System RAM |
| --- | --- | --- |
| FP16 (full precision) | 80 GB | 64 GB |
| 8-bit quantization | 40 GB | 32 GB |
| 4-bit quantization | 20 GB | 16 GB |

Troubleshooting

Issue: "Out of Memory" Error

RuntimeError: CUDA out of memory

Solutions:

  1. Use 4-bit or 8-bit quantization
  2. Use model parallelism across multiple GPUs (see the sketch below)
  3. Use a service with larger GPU clusters
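
For solution 2, transformers together with accelerate can shard the weights across several GPUs automatically. A sketch assuming four 24 GB cards; the per-device memory budgets are illustrative:

import torch
from transformers import AutoModelForVision2Seq

# Shard the model across 4 GPUs, capping each at ~22 GiB to leave headroom for activations
model = AutoModelForVision2Seq.from_pretrained(
    "01-ai/Yi-VL-34B",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "22GiB", 1: "22GiB", 2: "22GiB", 3: "22GiB", "cpu": "64GiB"},
)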

Issue: Slow Inference

Causes & Solutions:

  • Running on CPU: Use GPU acceleration
  • Full precision: Use quantization
  • Large batch size: Reduce batch size to 1
  • Competing processes: Check nvidia-smi

Issue: Model Won't Download

# Set HF token for large models
huggingface-cli login

# Or set environment variable
export HF_TOKEN="your_token_here"

Sources & References

Citation

@misc{ai2024yi,
    title={Yi: Open Foundation Models by 01.AI},
    author={01.AI and others},
    year={2024},
    eprint={2403.04652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

License

This model is released under the Apache 2.0 License, allowing:

  • ✅ Commercial use
  • ✅ Distribution
  • ✅ Modification
  • ✅ Private use

With the requirement to:

  • Include license and copyright notice
  • State changes made to the code

Last Updated: December 2024
Data Sources: Hugging Face Model Card, Official 01.AI Documentation, GitHub Repository
Availability Status: Open-source, self-hosted deployment required