Qwen2.5 VL 72B Instruct

Model Overview

Property      Value
Model ID      qwen/qwen2.5-vl-72b-instruct
Full Name     Qwen: Qwen2.5 VL 72B Instruct
Author        Qwen (Alibaba Cloud)
Created       February 1, 2025
Model Type    Multimodal (Vision-Language)
Architecture  Transformer-based VL Model
Parameters    72 Billion

Description

Qwen2.5 VL 72B Instruct is a 72-billion-parameter multimodal model for visual understanding and reasoning tasks. Its core capabilities are:

  • Object Recognition: Recognizing common objects such as flowers, birds, fish, and insects with high accuracy
  • Document Understanding: Analyzing texts, charts, icons, graphics, and layouts within images
  • Visual Reasoning: Understanding complex visual scenes and providing detailed descriptions
  • OCR Capabilities: Extracting and interpreting text from images

This model represents a significant advancement in vision-language AI, combining powerful visual perception with sophisticated language understanding.

Technical Specifications

Context and Capacity

Specification          Value
Context Length         32,768 tokens
Max Completion Tokens  32,768 tokens
Total Window           32K tokens

Input/Output Modalities

Direction  Supported Types
Input      Text, Images
Output     Text

Capabilities

  • Vision understanding and analysis
  • Multi-turn conversations with image context
  • Document and chart interpretation
  • Object detection and classification
  • Scene description and visual reasoning
  • Text extraction from images (OCR)
  • Layout analysis

Pricing

Via Chutes Provider

Type              Price (per million tokens)
Input Tokens      $0.07
Output Tokens     $0.26
Image Processing  No additional charge

Cost Estimation Examples

Use Case                                   Estimated Cost
1M input tokens + 100K output tokens       $0.07 + $0.026 = $0.096
10M input tokens + 1M output tokens        $0.70 + $0.26 = $0.96
Single image analysis (~1K input tokens)   ~$0.00007 for input (under $0.0001), plus output tokens
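The arithmetic in the table can be reproduced with a small helper. This is a minimal sketch that assumes the Chutes prices listed above; the function and constant names are illustrative and not part of any SDK.

```python
# Per-million-token prices copied from the pricing table (Chutes provider).
INPUT_PRICE_PER_M = 0.07
OUTPUT_PRICE_PER_M = 0.26


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one request or workload."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# First row of the table: 1M input tokens + 100K output tokens.
print(f"${estimate_cost(1_000_000, 100_000):.3f}")  # prints $0.096
```

Image tokens are billed as input tokens, so an image that consumes roughly 1K tokens can be estimated with `estimate_cost(1_000, 0)`.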

Supported Parameters

Generation Parameters

Parameter           Type     Description
temperature         float    Controls randomness (0.0 to 2.0)
top_p               float    Nucleus sampling threshold (0.0 to 1.0)
top_k               integer  Top-k sampling
max_tokens          integer  Maximum tokens to generate
stop                array    Stop sequences
frequency_penalty   float    Penalizes tokens in proportion to how often they have already appeared (-2.0 to 2.0)
presence_penalty    float    Penalizes tokens that have already appeared at least once (-2.0 to 2.0)
repetition_penalty  float    Alternative repetition penalty
seed                integer  Random seed for reproducibility
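The documented ranges can be checked on the client before a request is sent. The sketch below is one possible approach, not part of the API: `build_sampling_kwargs` is a hypothetical helper, and it assumes that `top_k` and `repetition_penalty`, which are not named arguments of the OpenAI Python SDK, are forwarded through the SDK's `extra_body` option.

```python
# Ranges copied from the parameter table; parameters without a stated
# range (top_k, repetition_penalty, seed, ...) are passed through unchecked.
PARAM_RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "frequency_penalty": (-2.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
}

# Assumption: these are not named SDK arguments, so they go in `extra_body`.
NON_STANDARD = {"top_k", "repetition_penalty"}


def build_sampling_kwargs(**params):
    """Validate documented ranges and split params into SDK keyword arguments."""
    kwargs, extra = {}, {}
    for name, value in params.items():
        if name in PARAM_RANGES:
            low, high = PARAM_RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        if name in NON_STANDARD:
            extra[name] = value
        else:
            kwargs[name] = value
    if extra:
        kwargs["extra_body"] = extra
    return kwargs


print(build_sampling_kwargs(temperature=0.7, top_k=40))
```

The result can be unpacked into a request, for example `client.chat.completions.create(model=..., messages=..., **build_sampling_kwargs(temperature=0.7, top_k=40))`.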

Response Control

Parameter        Type     Description
response_format  object   Structured output format (JSON mode)
stream           boolean  Enable streaming responses
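With `stream` enabled, an OpenAI-compatible endpoint returns the reply as a sequence of chunks whose text lives in `choices[0].delta.content`. The sketch below shows one way to reassemble the text; `collect_stream` and the stand-in chunks are illustrative, and the chunk shape is assumed to match the OpenAI Python SDK.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Concatenate the text deltas of an OpenAI-style chat completion stream."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some providers send chunks without choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # role-only and final chunks carry no text
            parts.append(delta)
    return "".join(parts)


def _fake_chunk(text):
    """Stand-in for an SDK chunk so the sketch runs without network access."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [_fake_chunk("Hel"), _fake_chunk(None), _fake_chunk("lo")]
print(collect_stream(demo))  # prints Hello
```

In real use the iterable would be the object returned by `client.chat.completions.create(..., stream=True)`.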

Tool/Function Calling

Parameter    Type           Description
tools        array          List of available tools/functions
tool_choice  string/object  Tool selection mode: none, auto, required

Best Practices

Image Input Recommendations

  1. Resolution: Use images with sufficient resolution for the task (recommended: 512x512 to 2048x2048)
  2. Format: Supports JPEG, PNG, GIF, WebP
  3. Size: Keep images under 20MB for optimal performance
  4. Multiple Images: Can process multiple images in a single request

Prompt Engineering Tips

  1. Be Specific: Clearly state what you want the model to analyze in the image
  2. Context First: Provide text context before the image when relevant
  3. Structured Queries: For document analysis, specify the format you want for extracted data
  4. Multi-turn: Use conversation history to refine image understanding
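The "Context First" tip corresponds to placing the text part ahead of the image parts in the `content` array. A small sketch of that ordering follows; `build_user_message` is an illustrative helper name, not an SDK function.

```python
def build_user_message(text, image_urls):
    """Build a user message with the text instruction first, then the images."""
    content = [{"type": "text", "text": text}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}


message = build_user_message(
    "This is a scanned invoice. Return the invoice number and total as JSON.",
    ["https://example.com/invoice.png"],
)
print([part["type"] for part in message["content"]])  # prints ['text', 'image_url']
```

For multi-turn refinement, the same helper can be called for each follow-up turn and the results appended to the running `messages` list.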

Performance Optimization

  1. Token Management: Monitor token usage as images consume context
  2. Streaming: Use streaming for long responses to improve perceived latency
  3. Batching: Group related image analyses when possible
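Token monitoring can be as simple as comparing the prompt size against the documented context length. The sketch below assumes that prompt and completion tokens share the 32,768-token window; the helper name is illustrative, and the prompt token count would come from `response.usage.prompt_tokens` of an earlier call.

```python
CONTEXT_LENGTH = 32_768  # context length from the Technical Specifications


def max_completion_budget(prompt_tokens, reserve=0):
    """Return how many completion tokens still fit in the context window.

    Assumes prompt (text plus image tokens) and completion share one window.
    `reserve` holds back tokens for later turns.
    """
    remaining = CONTEXT_LENGTH - prompt_tokens - reserve
    return max(remaining, 0)


print(max_completion_budget(30_000))  # prints 2768
```

Passing the result as `max_tokens` keeps a request with several large images from asking for more output than the window can hold.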

Related Models

  • qwen/qwen2.5-vl-7b-instruct - Smaller 7B parameter version
  • qwen/qwen2.5-72b-instruct - Text-only version
  • qwen/qwen-2-vl-7b-instruct - Previous generation VL model

Provider Information

Primary Provider: Chutes

Property       Value
Provider Name  Chutes
Base URL       https://llm.chutes.ai/v1
Status         Active

Model Weights

Available on Hugging Face: Qwen/Qwen2.5-VL-72B-Instruct

Features

  • Multipart Image Inputs: Support for multiple images in a single request
  • Function Calling: Native support for tool use and function calling
  • Structured Outputs: JSON mode and structured response formats
  • Streaming: Real-time token streaming support
  • Abortable Requests: Cancel in-flight requests
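Structured outputs are requested through `response_format`. The sketch below assumes the endpoint accepts the OpenAI-style `{"type": "json_object"}` value; both helper names are hypothetical, and the reply is parsed defensively because JSON mode does not guarantee a particular schema.

```python
import json


def json_mode_kwargs(prompt, image_url):
    """Build request arguments that ask for a JSON object reply."""
    return {
        "model": "qwen/qwen2.5-vl-72b-instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        # Assumption: OpenAI-style JSON mode value.
        "response_format": {"type": "json_object"},
    }


def parse_json_reply(text):
    """Return the parsed reply, or None if the model did not emit valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


print(parse_json_reply('{"objects": ["cat", "sofa"]}'))
```

The arguments can be unpacked into `client.chat.completions.create(**json_mode_kwargs(...))`, and the reply text passed to `parse_json_reply`.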

Usage Examples

Basic Text + Image Request (OpenAI Compatible)

curl https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer $LANGMART_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen2.5-vl-72b-instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What objects do you see in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/image.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 1024
  }'

Python SDK Example

from openai import OpenAI

client = OpenAI(
    base_url="https://api.langmart.ai/v1",
    api_key="your-langmart-api-key"
)

response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in detail."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/photo.png"
                    }
                }
            ]
        }
    ],
    max_tokens=2048,
    temperature=0.7
)

print(response.choices[0].message.content)

Document Analysis Example

import base64

# Reuses the `client` created in the Python SDK example above.
# Load the image and encode it as base64
with open("document.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract all text from this document and summarize its contents."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_data}"
                    }
                }
            ]
        }
    ],
    max_tokens=4096
)

print(response.choices[0].message.content)

Multi-Image Comparison

# Reuses the `client` created in the Python SDK example above.
response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Compare these two images and describe the differences."
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image1.jpg"}
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image2.jpg"}
                }
            ]
        }
    ],
    max_tokens=2048
)

print(response.choices[0].message.content)

Function Calling Example

# Reuses the `client` created in the Python SDK example above.
response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Identify all products in this image and get their prices."
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/products.jpg"}
                }
            ]
        }
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_product_price",
                "description": "Get the price of a product by name",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "product_name": {
                            "type": "string",
                            "description": "Name of the product"
                        }
                    },
                    "required": ["product_name"]
                }
            }
        }
    ],
    tool_choice="auto"
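)

# The model may answer with tool calls instead of text.
print(response.choices[0].message.tool_calls)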
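When the model replies with tool calls, the client is expected to run the named functions and send the results back as `tool` messages. The sketch below shows one way to do that dispatch. The local `get_product_price` implementation, its placeholder price table, and the stand-in message are all hypothetical; only the message shape follows the OpenAI-compatible format.

```python
import json
from types import SimpleNamespace


def get_product_price(product_name):
    """Hypothetical local lookup; the prices here are placeholders."""
    placeholder_prices = {"widget": 9.99}
    return placeholder_prices.get(product_name.lower())


TOOL_REGISTRY = {"get_product_price": get_product_price}


def run_tool_calls(message):
    """Execute each tool call on an assistant message and build tool replies."""
    replies = []
    for call in message.tool_calls or []:
        func = TOOL_REGISTRY[call.function.name]
        args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        replies.append(
            {
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(func(**args)),
            }
        )
    return replies


# Stand-in for response.choices[0].message so the sketch runs offline.
fake_message = SimpleNamespace(
    tool_calls=[
        SimpleNamespace(
            id="call_1",
            function=SimpleNamespace(
                name="get_product_price",
                arguments='{"product_name": "Widget"}',
            ),
        )
    ]
)
print(run_tool_calls(fake_message))
```

To continue the conversation, the assistant message and the returned tool replies would be appended to `messages` and sent in a follow-up request.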

LangMart Integration

Model ID for LangMart

qwen/qwen2.5-vl-72b-instruct

LangMart API Example

curl -X POST https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen2.5-vl-72b-instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is shown in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/image.jpg"
            }
          }
        ]
      }
    ]
  }'

Comparison with Similar Models

Model                 Context  Vision  Pricing (Input / Output, per 1M tokens)
Qwen2.5 VL 72B        32K      Yes     $0.07 / $0.26
GPT-4 Vision          128K     Yes     $10.00 / $30.00
Claude 3.5 Sonnet     200K     Yes     $3.00 / $15.00
Llama 3.2 90B Vision  128K     Yes     $0.90 / $0.90
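To compare the listed prices on a concrete workload, the table can be turned into a lookup. This is a sketch that assumes all prices are in USD per million tokens, as for the Qwen row; the function name and the sample workload are illustrative.

```python
# (input price, output price) in USD per 1M tokens, copied from the table.
PRICES = {
    "Qwen2.5 VL 72B": (0.07, 0.26),
    "GPT-4 Vision": (10.00, 30.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Llama 3.2 90B Vision": (0.90, 0.90),
}


def workload_cost(model, input_tokens, output_tokens):
    """Return the cost in USD of a workload on the given model."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Rank the models for 1M input tokens and 100K output tokens, cheapest first.
for name in sorted(PRICES, key=lambda m: workload_cost(m, 1_000_000, 100_000)):
    print(f"{name}: ${workload_cost(name, 1_000_000, 100_000):.3f}")
```

Price is only one axis of the comparison; the context lengths in the table differ by up to a factor of six and are not reflected in this calculation.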

Last Updated: December 2024
Source: LangMart API