Qwen2.5 VL 72B Instruct
Model Overview
| Property | Value |
|---|---|
| Model ID | qwen/qwen2.5-vl-72b-instruct |
| Full Name | Qwen: Qwen2.5 VL 72B Instruct |
| Author | Qwen (Alibaba Cloud) |
| Created | February 1, 2025 |
| Model Type | Multimodal (Vision-Language) |
| Architecture | Transformer-based VL Model |
| Parameters | 72 Billion |
Description
Qwen2.5 VL 72B Instruct is a state-of-the-art multimodal model that excels at visual understanding and reasoning tasks. The model demonstrates exceptional capabilities in:
- Object Recognition: Recognizing common objects such as flowers, birds, fish, and insects with high accuracy
- Document Understanding: Analyzing texts, charts, icons, graphics, and layouts within images
- Visual Reasoning: Understanding complex visual scenes and providing detailed descriptions
- OCR Capabilities: Extracting and interpreting text from images
This model represents a significant advancement in vision-language AI, combining powerful visual perception with sophisticated language understanding.
Technical Specifications
Context and Capacity
| Specification | Value |
|---|---|
| Context Length | 32,768 tokens |
| Max Completion Tokens | 32,768 tokens |
| Total Window | 32K tokens |
Modalities
| Direction | Supported Types |
|---|---|
| Input | Text, Images |
| Output | Text |
Capabilities
- Vision understanding and analysis
- Multi-turn conversations with image context
- Document and chart interpretation
- Object detection and classification
- Scene description and visual reasoning
- Text extraction from images (OCR)
- Layout analysis
Pricing
Via Chutes Provider
| Type | Price (per million tokens) |
|---|---|
| Input Tokens | $0.07 |
| Output Tokens | $0.26 |
| Image Processing | No additional charge |
Cost Estimation Examples
| Use Case | Estimated Cost |
|---|---|
| 1M input tokens + 100K output | $0.07 + $0.026 = $0.096 |
| 10M input tokens + 1M output | $0.70 + $0.26 = $0.96 |
| Single image analysis (~1K tokens) | ~$0.0001 |
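A minimal sketch for estimating spend from the rates above (the rates are hard-coded from the Chutes pricing table and would need updating if pricing changes):

```python
# Minimal cost estimator based on the Chutes rates listed above
# ($0.07 per million input tokens, $0.26 per million output tokens).
INPUT_PRICE_PER_M = 0.07
OUTPUT_PRICE_PER_M = 0.26

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for a given token count."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Reproduces the first row of the table: 1M input + 100K output tokens.
print(f"${estimate_cost(1_000_000, 100_000):.3f}")  # $0.096
```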
Supported Parameters
Generation Parameters
| Parameter | Type | Description |
|---|---|---|
| temperature | float | Controls randomness (0.0-2.0) |
| top_p | float | Nucleus sampling threshold (0.0-1.0) |
| top_k | integer | Limit sampling to the k most likely tokens |
| max_tokens | integer | Maximum tokens to generate |
| stop | array | Stop sequences |
| frequency_penalty | float | Penalize tokens in proportion to how often they have appeared (-2.0 to 2.0) |
| presence_penalty | float | Penalize tokens that have already appeared at all (-2.0 to 2.0) |
| repetition_penalty | float | Alternative repetition penalty |
| seed | integer | Random seed for reproducibility |
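A hedged sketch of how these parameters map onto an OpenAI-compatible request (the base URL and key placeholder follow the LangMart examples later in this page; the specific values are illustrative, not recommendations):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.langmart.ai/v1",
    api_key="your-langmart-api-key",
)

# Illustrative sampling settings; tune per task.
response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[{"role": "user", "content": "Describe the typical layout of an invoice."}],
    temperature=0.3,        # lower randomness for extraction-style tasks
    top_p=0.9,              # nucleus sampling threshold
    max_tokens=512,         # cap on generated tokens
    frequency_penalty=0.2,  # mildly discourage repeated wording
    seed=42,                # best-effort reproducibility
)
print(response.choices[0].message.content)
```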
Response Control
| Parameter | Type | Description |
|---|---|---|
| response_format | object | Structured output format (JSON mode) |
| stream | boolean | Enable streaming responses |
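A minimal streaming sketch for the stream flag (client setup mirrors the Python SDK example later in this page):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.langmart.ai/v1", api_key="your-langmart-api-key")

# stream=True returns an iterator of chunks instead of one full response.
stream = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[{"role": "user", "content": "Summarize what a vision-language model does."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```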
Tool Calling
| Parameter | Type | Description |
|---|---|---|
| tools | array | List of available tools/functions |
| tool_choice | string/object | Tool selection mode: none, auto, required |
Best Practices
Image Input Guidelines
- Resolution: Use images with sufficient resolution for the task (recommended: 512x512 to 2048x2048)
- Format: Supports JPEG, PNG, GIF, WebP
- Size: Keep images under 20MB for optimal performance (see the pre-processing sketch after this list)
- Multiple Images: Can process multiple images in a single request
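A minimal pre-processing sketch following the guidelines above, assuming Pillow is available (any image library would do):

```python
from io import BytesIO
from PIL import Image  # Pillow is an assumption here, not a requirement of the API

MAX_BYTES = 20 * 1024 * 1024  # 20MB limit noted above
MAX_SIDE = 2048               # upper end of the recommended resolution range

def prepare_image(path: str) -> bytes:
    """Downscale an image so its longest side is <= MAX_SIDE and re-encode as JPEG."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((MAX_SIDE, MAX_SIDE))  # preserves aspect ratio, only shrinks
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=90)
    data = buf.getvalue()
    if len(data) > MAX_BYTES:
        raise ValueError("Image still exceeds 20MB after resizing")
    return data
```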
Prompt Engineering Tips
- Be Specific: Clearly state what you want the model to analyze in the image
- Context First: Provide text context before the image when relevant
- Structured Queries: For document analysis, specify the format you want for extracted data
- Multi-turn: Use conversation history to refine image understanding
- Token Management: Monitor token usage, since images consume context (a usage-tracking sketch follows this list)
- Streaming: Use streaming for long responses to improve perceived latency
- Batching: Group related image analyses when possible
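A minimal usage-tracking sketch for the token-management tip above (field names follow the OpenAI-compatible usage schema; on most OpenAI-compatible APIs image inputs are counted inside prompt_tokens rather than reported separately):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.langmart.ai/v1", api_key="your-langmart-api-key")

response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=64,
)

# usage reports prompt, completion, and total tokens for the request.
u = response.usage
print(f"prompt={u.prompt_tokens} completion={u.completion_tokens} total={u.total_tokens}")
```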
Related Models
- qwen/qwen2.5-vl-7b-instruct - Smaller 7B parameter version
- qwen/qwen2.5-72b-instruct - Text-only version
- qwen/qwen-2-vl-7b-instruct - Previous generation VL model
Primary Provider: Chutes
| Property | Value |
|---|---|
| Provider Name | Chutes |
| Base URL | https://llm.chutes.ai/v1 |
| Status | Active |
Model Weights
Available on Hugging Face: Qwen/Qwen2.5-VL-72B-Instruct
Features
- Multipart Image Inputs: Support for multiple images in a single request
- Function Calling: Native support for tool use and function calling
- Structured Outputs: JSON mode and structured response formats (see the sketch after this list)
- Streaming: Real-time token streaming support
- Abortable Requests: Cancel in-flight requests
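A minimal sketch of the structured-output feature using OpenAI-compatible JSON mode (how strictly the provider enforces valid JSON may vary, so the prompt should still request JSON explicitly):

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.langmart.ai/v1", api_key="your-langmart-api-key")

response = client.chat.completions.create(
    model="qwen/qwen2.5-vl-72b-instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "List the objects in this image as a JSON array with keys 'name' and 'count'."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/image.jpg"}},
            ],
        }
    ],
    response_format={"type": "json_object"},  # JSON mode
    max_tokens=512,
)
objects = json.loads(response.choices[0].message.content)
print(objects)
```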
Usage Examples
Basic Text + Image Request (OpenAI Compatible)
curl https://api.langmart.ai/v1/chat/completions \
-H "Authorization: Bearer $LANGMART_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen/qwen2.5-vl-72b-instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What objects do you see in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image.jpg"
}
}
]
}
],
"max_tokens": 1024
}'
Python SDK Example
from openai import OpenAI
client = OpenAI(
base_url="https://api.langmart.ai/v1",
api_key="your-langmart-api-key"
)
response = client.chat.completions.create(
model="qwen/qwen2.5-vl-72b-instruct",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in detail."
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/photo.png"
}
}
]
}
],
max_tokens=2048,
temperature=0.7
)
print(response.choices[0].message.content)
Document Analysis Example
import base64
# Load image as base64
with open("document.png", "rb") as f:
image_data = base64.b64encode(f.read()).decode("utf-8")
response = client.chat.completions.create(
model="qwen/qwen2.5-vl-72b-instruct",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Extract all text from this document and summarize its contents."
},
{
"type": "image_url",
"image_url": {
"url": f"data:image/png;base64,{image_data}"
}
}
]
}
],
max_tokens=4096
)
Multi-Image Comparison
response = client.chat.completions.create(
model="qwen/qwen2.5-vl-72b-instruct",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Compare these two images and describe the differences."
},
{
"type": "image_url",
"image_url": {"url": "https://example.com/image1.jpg"}
},
{
"type": "image_url",
"image_url": {"url": "https://example.com/image2.jpg"}
}
]
}
],
max_tokens=2048
)
Function Calling Example
response = client.chat.completions.create(
model="qwen/qwen2.5-vl-72b-instruct",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Identify all products in this image and get their prices."
},
{
"type": "image_url",
"image_url": {"url": "https://example.com/products.jpg"}
}
]
}
],
tools=[
{
"type": "function",
"function": {
"name": "get_product_price",
"description": "Get the price of a product by name",
"parameters": {
"type": "object",
"properties": {
"product_name": {
"type": "string",
"description": "Name of the product"
}
},
"required": ["product_name"]
}
}
}
],
tool_choice="auto"
)
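The response to a tool-enabled request may contain tool_calls instead of plain text. The sketch below continues from the request above (`client` and `response` as defined there); get_product_price here is a hypothetical local stand-in for the declared function:

```python
import json

# Hypothetical local implementation of the declared get_product_price function.
def get_product_price(product_name: str) -> dict:
    return {"product_name": product_name, "price": "9.99"}

message = response.choices[0].message  # `response` from the request above
if message.tool_calls:
    tool_results = []
    for call in message.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_product_price(**args)
        tool_results.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
        )
    # Send the tool outputs back so the model can compose the final answer.
    # In a real run, resend the original multi-part user turn (including the image).
    followup = client.chat.completions.create(
        model="qwen/qwen2.5-vl-72b-instruct",
        messages=[
            {"role": "user", "content": "Identify all products in this image and get their prices."},
            message,        # assistant turn containing the tool calls
            *tool_results,  # tool outputs keyed by tool_call_id
        ],
    )
    print(followup.choices[0].message.content)
else:
    print(message.content)
```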
LangMart Integration
Model ID for LangMart
qwen/qwen2.5-vl-72b-instruct
LangMart API Example
curl -X POST https://api.langmart.ai/v1/chat/completions \
-H "Authorization: Bearer sk-your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen/qwen2.5-vl-72b-instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is shown in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image.jpg"
}
}
]
}
]
}'
Comparison with Similar Models
| Model | Context | Vision | Pricing (Input/Output) |
|---|---|---|---|
| Qwen2.5 VL 72B | 32K | Yes | $0.07 / $0.26 |
| GPT-4 Vision | 128K | Yes | $10.00 / $30.00 |
| Claude 3.5 Sonnet | 200K | Yes | $3.00 / $15.00 |
| Llama 3.2 90B Vision | 128K | Yes | $0.90 / $0.90 |
References
Last Updated: December 2024
Source: LangMart API