Qwen2.5 VL 72B Instruct
Model Overview
| Property | Value |
|---|---|
| Model ID | qwen/qwen2.5-vl-72b-instruct |
| Full Name | Qwen: Qwen2.5 VL 72B Instruct |
| Author | Qwen (Alibaba Cloud) |
| Created | February 1, 2025 |
| Model Type | Multimodal (Vision-Language) |
| Architecture | Transformer-based VL Model |
| Parameters | 72 Billion |
Description
Qwen2.5 VL 72B Instruct is a state-of-the-art multimodal model that excels at visual understanding and reasoning tasks. The model demonstrates exceptional capabilities in:
- Object Recognition: Recognizing common objects such as flowers, birds, fish, and insects with high accuracy
- Document Understanding: Analyzing texts, charts, icons, graphics, and layouts within images
- Visual Reasoning: Understanding complex visual scenes and providing detailed descriptions
- OCR Capabilities: Extracting and interpreting text from images
This model represents a significant advancement in vision-language AI, combining powerful visual perception with sophisticated language understanding.
Technical Specifications
Context and Capacity
| Specification | Value |
|---|---|
| Context Length | 32,768 tokens |
| Max Completion Tokens | 32,768 tokens |
| Total Window | 32K tokens |
Modalities
| Direction | Supported Types |
|---|---|
| Input | Text, Images |
| Output | Text |
Capabilities
- Vision understanding and analysis
- Multi-turn conversations with image context
- Document and chart interpretation
- Object detection and classification
- Scene description and visual reasoning
- Text extraction from images (OCR)
- Layout analysis
Pricing
Via Chutes Provider
| Type | Price (per million tokens) |
|---|---|
| Input Tokens | $0.07 |
| Output Tokens | $0.26 |
| Image Processing | No additional charge |
Cost Estimation Examples
| Use Case | Estimated Cost |
|---|---|
| 1M input tokens + 100K output | $0.07 + $0.026 = $0.096 |
| 10M input tokens + 1M output | $0.70 + $0.26 = $0.96 |
| Single image analysis (~1K tokens) | ~$0.0001 |
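A minimal sketch for estimating spend from the rates above (the rates are hard-coded from the Chutes pricing table and would need updating if pricing changes):

```python
# Minimal cost estimator based on the Chutes rates listed above
# ($0.07 per million input tokens, $0.26 per million output tokens).
INPUT_PRICE_PER_M = 0.07
OUTPUT_PRICE_PER_M = 0.26

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for a given token count."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Reproduces the first row of the table: 1M input + 100K output tokens.
print(f"${estimate_cost(1_000_000, 100_000):.3f}")  # $0.096
```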
Supported Parameters
Generation Parameters
| Parameter | Type | Description |
|---|---|---|
| temperature | float | Controls randomness (0.0-2.0) |
| top_p | float | Nucleus sampling threshold (0.0-1.0) |
| top_k | integer | Limit sampling to the k most likely tokens |
| max_tokens | integer | Maximum tokens to generate |
| stop | array | Stop sequences |
| frequency_penalty | float | Penalize tokens in proportion to how often they have appeared (-2.0 to 2.0) |
| presence_penalty | float | Penalize tokens that have already appeared at all (-2.0 to 2.0) |
| repetition_penalty | float | Alternative repetition penalty |
| seed | integer | Random seed for reproducibility |
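A hedged sketch of how these parameters map onto an OpenAI-compatible request (the base URL and key placeholder follow the LangMart examples later in this page; the specific values are illustrative, not recommendations):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.langmart.ai/v1",
    api_key="your-langmart-api-key",
)

# Illustrative sampling settings; tune per task.
response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[{"role": "user", "content": "Describe the typical layout of an invoice."}],
    temperature=0.3,        # lower randomness for extraction-style tasks
    top_p=0.9,              # nucleus sampling threshold
    max_tokens=512,         # cap on generated tokens
    frequency_penalty=0.2,  # mildly discourage repeated wording
    seed=42,                # best-effort reproducibility
)
print(response.choices[0].message.content)
```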
Response Control
| Parameter | Type | Description |
|---|---|---|
| response_format | object | Structured output format (JSON mode) |
| stream | boolean | Enable streaming responses |
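A minimal streaming sketch for the stream flag (client setup mirrors the Python SDK example later in this page):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.langmart.ai/v1", api_key="your-langmart-api-key")

# stream=True returns an iterator of chunks instead of one full response.
stream = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[{"role": "user", "content": "Summarize what a vision-language model does."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```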
Tool Calling
| Parameter | Type | Description |
|---|---|---|
| tools | array | List of available tools/functions |
| tool_choice | string/object | Tool selection mode: none, auto, required |
Best Practices
Image Input Guidelines
- Resolution: Use images with sufficient resolution for the task (recommended: 512x512 to 2048x2048)
- Format: Supports JPEG, PNG, GIF, WebP
- Size: Keep images under 20MB for optimal performance (see the pre-processing sketch after this list)
- Multiple Images: Can process multiple images in a single request
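A minimal pre-processing sketch following the guidelines above, assuming Pillow is available (any image library would do):

```python
from io import BytesIO
from PIL import Image  # Pillow is an assumption here, not a requirement of the API

MAX_BYTES = 20 * 1024 * 1024  # 20MB limit noted above
MAX_SIDE = 2048               # upper end of the recommended resolution range

def prepare_image(path: str) -> bytes:
    """Downscale an image so its longest side is <= MAX_SIDE and re-encode as JPEG."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((MAX_SIDE, MAX_SIDE))  # preserves aspect ratio, only shrinks
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=90)
    data = buf.getvalue()
    if len(data) > MAX_BYTES:
        raise ValueError("Image still exceeds 20MB after resizing")
    return data
```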
Prompt Engineering Tips
- Be Specific: Clearly state what you want the model to analyze in the image
- Context First: Provide text context before the image when relevant
- Structured Queries: For document analysis, specify the format you want for extracted data
- Multi-turn: Use conversation history to refine image understanding
- Token Management: Monitor token usage, since images consume context (a usage-tracking sketch follows this list)
- Streaming: Use streaming for long responses to improve perceived latency
- Batching: Group related image analyses when possible
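A minimal usage-tracking sketch for the token-management tip above (field names follow the OpenAI-compatible usage schema; on most OpenAI-compatible APIs image inputs are counted inside prompt_tokens rather than reported separately):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.langmart.ai/v1", api_key="your-langmart-api-key")

response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=64,
)

# usage reports prompt, completion, and total tokens for the request.
u = response.usage
print(f"prompt={u.prompt_tokens} completion={u.completion_tokens} total={u.total_tokens}")
```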
Related Models
- qwen/qwen2.5-vl-7b-instruct - Smaller 7B parameter version
- qwen/qwen2.5-72b-instruct - Text-only version
- qwen/qwen-2-vl-7b-instruct - Previous generation VL model
Primary Provider: Chutes
| Property | Value |
|---|---|
| Provider Name | Chutes |
| Base URL | https://llm.chutes.ai/v1 |
| Status | Active |
Model Weights
Available on Hugging Face: Qwen/Qwen2.5-VL-72B-Instruct
Features
- Multipart Image Inputs: Support for multiple images in a single request
- Function Calling: Native support for tool use and function calling
- Structured Outputs: JSON mode and structured response formats (see the sketch after this list)
- Streaming: Real-time token streaming support
- Abortable Requests: Cancel in-flight requests
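A minimal sketch of the structured-output feature using OpenAI-compatible JSON mode (how strictly the provider enforces valid JSON may vary, so the prompt should still request JSON explicitly):

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.langmart.ai/v1", api_key="your-langmart-api-key")

response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "List the objects in this image as a JSON array with keys 'name' and 'count'."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/image.jpg"}},
            ],
        }
    ],
    response_format={"type": "json_object"},  # JSON mode
    max_tokens=512,
)
objects = json.loads(response.choices[0].message.content)
print(objects)
```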
Usage Examples
Basic Text + Image Request (OpenAI Compatible)
curl https://api.langmart.ai/v1/chat/completions \
-H "Authorization: Bearer $LANGMART_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen/qwen2.5-vl-72b-instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What objects do you see in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image.jpg"
}
}
]
}
],
"max_tokens": 1024
}'
Python SDK Example
from openai import OpenAI
client = OpenAI(
base_url="https://api.langmart.ai/v1",
api_key="your-langmart-api-key"
)
response = client.chat.completions.create(
model="qwen/qwen2.5-vl-72b-instruct",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in detail."
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/photo.png"
}
}
]
}
],
max_tokens=2048,
temperature=0.7
)
print(response.choices[0].message.content)
Document Analysis Example
import base64
# Load image as base64
with open("document.png", "rb") as f:
image_data = base64.b64encode(f.read()).decode("utf-8")
response = client.chat.completions.create(
model="qwen/qwen2.5-vl-72b-instruct",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Extract all text from this document and summarize its contents."
},
{
"type": "image_url",
"image_url": {
"url": f"data:image/png;base64,{image_data}"
}
}
]
}
],
max_tokens=4096
)
Multi-Image Comparison
response = client.chat.completions.create(
model="qwen/qwen2.5-vl-72b-instruct",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Compare these two images and describe the differences."
},
{
"type": "image_url",
"image_url": {"url": "https://example.com/image1.jpg"}
},
{
"type": "image_url",
"image_url": {"url": "https://example.com/image2.jpg"}
}
]
}
],
max_tokens=2048
)
Function Calling Example
response = client.chat.completions.create(
model="qwen/qwen2.5-vl-72b-instruct",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Identify all products in this image and get their prices."
},
{
"type": "image_url",
"image_url": {"url": "https://example.com/products.jpg"}
}
]
}
],
tools=[
{
"type": "function",
"function": {
"name": "get_product_price",
"description": "Get the price of a product by name",
"parameters": {
"type": "object",
"properties": {
"product_name": {
"type": "string",
"description": "Name of the product"
}
},
"required": ["product_name"]
}
}
}
],
tool_choice="auto"
)
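The response to a tool-enabled request may contain tool_calls instead of plain text. The sketch below continues from the request above (`client` and `response` as defined there); get_product_price here is a hypothetical local stand-in for the declared function:

```python
import json

# Hypothetical local implementation of the declared get_product_price function.
def get_product_price(product_name: str) -> dict:
    return {"product_name": product_name, "price": "9.99"}

message = response.choices[0].message  # `response` from the request above
if message.tool_calls:
    tool_results = []
    for call in message.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_product_price(**args)
        tool_results.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
        )
    # Send the tool outputs back so the model can compose the final answer.
    # In a real run, resend the original multi-part user turn (including the image).
    followup = client.chat.completions.create(
        model="qwen/qwen2.5-vl-72b-instruct",
        messages=[
            {"role": "user", "content": "Identify all products in this image and get their prices."},
            message,        # assistant turn containing the tool calls
            *tool_results,  # tool outputs keyed by tool_call_id
        ],
    )
    print(followup.choices[0].message.content)
else:
    print(message.content)
```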
LangMart Integration
Model ID for LangMart
qwen/qwen2.5-vl-72b-instruct
LangMart API Example
curl -X POST https://api.langmart.ai/v1/chat/completions \
-H "Authorization: Bearer sk-your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen/qwen2.5-vl-72b-instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is shown in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image.jpg"
}
}
]
}
]
}'
Comparison with Similar Models
| Model | Context | Vision | Pricing (Input/Output) |
|---|---|---|---|
| Qwen2.5 VL 72B | 32K | Yes | $0.07 / $0.26 |
| GPT-4 Vision | 128K | Yes | $10.00 / $30.00 |
| Claude 3.5 Sonnet | 200K | Yes | $3.00 / $15.00 |
| Llama 3.2 90B Vision | 128K | Yes | $0.90 / $0.90 |
References
Last Updated: December 2024
Source: LangMart API