Collections: Flash 2.5
Overview
| Property | Value |
|---|---|
| Model ID | collection/flash2.5 |
| Display Name | Flash 2.5 Collection |
| Type | Organization Collection |
| Access Method | collection/flash2.5 |
| Scope | Organization-level |
| Routing Strategy | Round Robin |
Description
The Flash 2.5 Collection is a curated, organization-level collection of lightweight, fast-responding language models optimized for speed and cost-effectiveness. It targets real-time conversational AI applications that need strong performance at low latency.
Key Characteristics
- Speed-Optimized: Models selected for low response latency and high throughput
- Cost-Effective: Emphasis on efficient token consumption and reasonable pricing
- Organization-Scoped: Shared resource for all members within an organization
- Load Distribution: Uses round-robin routing to balance requests across models
- Flexible Membership: Organization admins can add/remove models from the collection
Specifications
| Aspect | Details |
|---|---|
| Context Window | 10,000 tokens |
| Collection Type | Organization-level |
| Access Level | Organization members only |
| Scope | organization |
| Routing Strategy | Round Robin (sequential model selection) |
| Visibility | Organization (shared with all org members) |
| Minimum Members | 1 model |
| Maximum Members | Unlimited |
| Model Heterogeneity | Can mix different providers and model sizes |
Use Cases
1. High-Throughput Chat Applications
- Customer support chatbots serving thousands of concurrent users
- Multi-tenant SaaS platforms with real-time chat features
- AI assistant applications requiring low latency
- Load distribution across lightweight model variants
2. Cost-Optimized Production Deployments
- Reduce per-request inference costs while maintaining quality
- Run at scale with budget constraints
- Balance quality and cost for non-critical features
- Implement tiered service offerings
3. Real-Time Assistant Interactions
- Live chat with human agents
- Real-time translation services
- Immediate response requirements (<2s)
- Interactive web application backends
4. Educational Demos and Prototypes
- Fast feedback loops for development
- Educational institution deployments
- Proof-of-concept applications
- Internal tool prototyping
5. Distributed Workload Balancing
- Prevent bottleneck on single model
- Handle burst traffic gracefully
- Optimize resource utilization
- Graceful degradation under load
Characteristics of Included Models
Models in Flash 2.5 typically exhibit:
| Characteristic | Value |
|---|---|
| Type | Lightweight LLMs and fast variants |
| Average Latency | <2 seconds response time |
| Throughput | >100 requests/second |
| Context Windows | 4K - 32K tokens (typically 8K-16K) |
| Inference Cost | Budget-friendly (lower per-token pricing) |
| Use Case Focus | Conversational, real-time applications |
| Tool Support | Most support function calling |
| Multi-modal | Mix of text-only and vision-capable |
Model Selection Strategy
Round Robin Routing
The Flash 2.5 collection uses round-robin routing to distribute requests:
- Sequential Selection: Models are selected in order
- Load Balancing: Requests distributed evenly across collection members
- Cycling: Returns to first model after last model is selected
- Stateful: Maintains selection index per user/organization
Example Rotation
Collection with 3 models: [Model A, Model B, Model C]
Request 1 → Model A
Request 2 → Model B
Request 3 → Model C
Request 4 → Model A
Request 5 → Model B
...
Fallback Behavior
- If selected model is unavailable: Skip to next available model
- If all models unavailable: Return error with available models list
- If insufficient permissions: Skip to next accessible model
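To make the selection and fallback behavior concrete, here is a minimal illustrative sketch in Python, assuming a simple in-memory selection index; the class and method names are hypothetical, not the gateway's actual implementation:
from dataclasses import dataclass, field

@dataclass
class RoundRobinRouter:
    models: list[str]                                   # collection members, in order
    unavailable: set[str] = field(default_factory=set)  # models to skip
    _index: int = 0                                     # stateful selection index

    def select(self) -> str:
        """Return the next available model, skipping unavailable ones."""
        for _ in range(len(self.models)):
            model = self.models[self._index]
            self._index = (self._index + 1) % len(self.models)  # cycle back after last
            if model not in self.unavailable:
                return model
        raise RuntimeError("No models available in collection")

router = RoundRobinRouter(["Model A", "Model B", "Model C"])
print([router.select() for _ in range(4)])  # ['Model A', 'Model B', 'Model C', 'Model A']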
Usage
Basic Request
curl -X POST https://api.langmart.ai/v1/chat/completions \
-H "Authorization: Bearer sk-your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "collection/flash2.5",
"messages": [{"role": "user", "content": "Explain quantum computing"}],
"temperature": 0.7
}'
With Parameters
curl -X POST https://api.langmart.ai/v1/chat/completions \
-H "Authorization: Bearer sk-your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "collection/flash2.5",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 100,
"temperature": 0.5,
"top_p": 0.9
}'
Collection Management
Viewing Collection Details
SELECT
id,
collection_name,
display_name,
description,
scope,
routing_strategy,
organization_id
FROM model_collections
WHERE collection_name = 'flash2.5'
AND is_active = true;
Viewing Collection Members
SELECT
mcm.id,
mc.category_display_id as model_id,
mc.model_name,
mcm.priority,
mcm.weight,
mcm.is_active
FROM model_collection_members mcm
JOIN model_categories mc ON mcm.model_category_id = mc.id
WHERE mcm.collection_id = (
SELECT id FROM model_collections
WHERE collection_name = 'flash2.5'
)
ORDER BY mcm.priority DESC;
API: List Collection Models
curl -X GET https://api.langmart.ai/api/user/model-collections/COLLECTION_ID \
-H "Authorization: Bearer sk-your-api-key" | jq '.models'
API: Add Model to Collection
curl -X POST https://api.langmart.ai/api/user/model-collections/COLLECTION_ID/members \
-H "Authorization: Bearer sk-your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model_id": "openai/gpt-3.5-turbo"
}'
API: Remove Model from Collection
curl -X DELETE "https://api.langmart.ai/api/user/model-collections/COLLECTION_ID/members/openai%2Fgpt-3.5-turbo" \
-H "Authorization: Bearer sk-your-api-key"
Response Format
Responses include collection routing metadata:
{
"id": "chatcmpl-...",
"object": "chat.completion",
"created": 1234567890,
"model": "google/gemini-flash-latest",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Quantum computing uses quantum bits (qubits)..."
}
}
],
"usage": {
"prompt_tokens": 10,
"completion_tokens": 45,
"total_tokens": 55
},
"_collection_routed": true,
"_collection_name": "flash2.5",
"_selected_model": "google/gemini-flash-latest"
}
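Since the model that actually serves a request varies, clients may want to log the routing metadata. A small sketch, assuming the response body has already been parsed into a dict shaped like the example above:
def log_routing(response: dict) -> None:
    # The underscore-prefixed fields carry the collection routing metadata.
    if response.get("_collection_routed"):
        usage = response.get("usage", {})
        print(
            f"collection={response.get('_collection_name')} "
            f"served_by={response.get('_selected_model')} "
            f"total_tokens={usage.get('total_tokens')}"
        )

log_routing({
    "_collection_routed": True,
    "_collection_name": "flash2.5",
    "_selected_model": "google/gemini-flash-latest",
    "usage": {"total_tokens": 55},
})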
Performance Characteristics
| Metric | Target |
|---|---|
| Selection Latency | <1ms |
| P50 Response Time | <1 second |
| P95 Response Time | <3 seconds |
| P99 Response Time | <5 seconds |
| Throughput | >100 req/s (model-dependent) |
| Cache TTL | 60 seconds |
| Request Success Rate | >99% (with fallback) |
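One way to sanity-check the response-time targets from the client side is to sample request latencies and compute percentiles. The snippet below is a rough sketch, not a load-testing tool; the API key and sample size are placeholders:
import statistics
import time
import requests  # third-party: pip install requests

def measure_latencies(n: int = 20) -> None:
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        requests.post(
            "https://api.langmart.ai/v1/chat/completions",
            headers={"Authorization": "Bearer sk-your-api-key"},
            json={
                "model": "collection/flash2.5",
                "messages": [{"role": "user", "content": "ping"}],
                "max_tokens": 8,
            },
            timeout=30,
        )
        samples.append(time.perf_counter() - start)
    # quantiles(n=100) yields 99 cut points; indexes 49/94/98 are P50/P95/P99.
    q = statistics.quantiles(samples, n=100)
    print(f"P50={q[49]:.2f}s  P95={q[94]:.2f}s  P99={q[98]:.2f}s")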
Typical Collection Composition
A typical Flash 2.5 collection might include:
1. google/gemini-flash-latest (Google)
- Fast, multimodal, 10K context window
2. openai/gpt-3.5-turbo (OpenAI)
- Proven performance, 4K context window
3. mistralai/mistral-7b-instruct (Mistral)
- Open source, lightweight, fast
4. anthropic/claude-3-haiku (Anthropic)
- Balanced speed/quality, 200K context
Configuration and Metadata
Collection Metadata Storage
Collection-specific settings stored in JSONB metadata field:
{
"max_context_window": 16000,
"preferred_providers": ["google", "openai", "mistral"],
"description": "Fast models for real-time chat",
"tags": ["production", "cost-optimized", "low-latency"],
"sla_target_latency_ms": 2000,
"min_models_available": 2
}
Member Configuration
Per-model settings in model_collection_members:
{
"priority": 1,
"weight": 1,
"tags": ["fallback"],
"health_check": true
}
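The exact semantics of priority and weight are not specified here. As a hedged sketch, assuming higher priority means "try first" (consistent with ORDER BY mcm.priority DESC in the member query above), a priority strategy might reduce to:
# Hypothetical priority-strategy selection; not the gateway's implementation.
def select_by_priority(members: list[dict]) -> str:
    active = [m for m in members if m.get("is_active", True)]
    if not active:
        raise RuntimeError("No active members in collection")
    # Highest priority wins; ties fall back to membership order.
    return max(active, key=lambda m: m.get("priority", 0))["model_id"]

members = [
    {"model_id": "google/gemini-flash-latest", "priority": 2},
    {"model_id": "openai/gpt-3.5-turbo", "priority": 1, "tags": ["fallback"]},
]
print(select_by_priority(members))  # google/gemini-flash-latest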
Error Handling
Collection Not Found
{
"error": {
"code": "collection_not_found",
"message": "Collection 'flash2.5' not found or access denied"
}
}
No Available Models
{
"error": {
"code": "collection_no_models",
"message": "No models available in flash2.5 collection. All models are rate-limited or inaccessible.",
"available_models": [],
"total_members": 3
}
}
User Access Denied
{
"error": {
"code": "access_denied",
"message": "Your organization does not have access to flash2.5 collection",
"required_scope": "organization"
}
}
Model Error with Fallback
{
"error": {
"code": "model_overloaded",
"message": "Selected model is overloaded, attempting fallback...",
"fallback_attempted": true,
"fallback_model": "openai/gpt-3.5-turbo"
}
}
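On the client side, these codes can be mapped to retry-or-fail decisions. A sketch follows; the retry policy itself is an assumption, not gateway-mandated behavior:
import time
import requests  # third-party: pip install requests

def chat_with_retries(payload: dict, api_key: str, retries: int = 2) -> dict:
    url = "https://api.langmart.ai/v1/chat/completions"
    headers = {"Authorization": f"Bearer {api_key}"}
    for attempt in range(retries + 1):
        body = requests.post(url, json=payload, headers=headers, timeout=30).json()
        code = body.get("error", {}).get("code")
        if code is None:
            return body  # success
        if code in ("collection_not_found", "access_denied"):
            raise PermissionError(body["error"]["message"])  # not retryable
        if code in ("collection_no_models", "model_overloaded") and attempt < retries:
            time.sleep(2 ** attempt)  # back off before trying again
            continue
        raise RuntimeError(body["error"]["message"])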
Billing and Credits
- Cost: Based on selected model
- Tracking: Per-model usage tracked in request_logs
- Attribution: Each request logs actual model used
- Organization Quotas: Collection requests count toward org quota
- Cost Optimization: Round-robin spreads traffic evenly across cheaper and more expensive members, averaging out per-request cost (see the sketch below)
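Because round-robin splits traffic evenly, the expected per-token cost is simply the mean of the members' prices. A quick sketch with hypothetical prices:
# Hypothetical per-1M-token prices, for illustration only.
prices = {
    "model-a": 0.25,  # cheap, lightweight
    "model-b": 0.50,
    "model-c": 1.50,  # pricier member
}
# Even round-robin traffic means the blended cost is the simple average.
blended = sum(prices.values()) / len(prices)
print(f"Blended cost: ${blended:.2f} per 1M tokens")  # $0.75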
Limits and Constraints
| Constraint | Value |
|---|---|
| Min Models | 1 |
| Max Models | Unlimited |
| Collection Name | 100 characters max (lowercase, alphanumeric, hyphens) |
| Description Length | Unlimited (text field) |
| Routing Strategies | round_robin, random, priority, least_used |
| Request Rate | Model-dependent |
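A client-side pre-check for collection names might look like the sketch below. Note that the documented character set is extended with a dot here, since the name flash2.5 itself contains one; the actual server-side rule is an assumption:
import re

# Assumed rule: lowercase alphanumerics, hyphens, and dots, 1-100 chars.
NAME_RE = re.compile(r"[a-z0-9.-]{1,100}")

def is_valid_collection_name(name: str) -> bool:
    return NAME_RE.fullmatch(name) is not None

print(is_valid_collection_name("flash2.5"))   # True
print(is_valid_collection_name("Flash_2.5"))  # False: uppercase and underscore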
Integration Examples
Example 1: Customer Support Bot
import anthropic
client = anthropic.Anthropic(
api_key="sk-your-api-key",
base_url="https://api.langmart.ai/v1"
)
response = client.messages.create(
model="collection/flash2.5",
max_tokens=256,
messages=[
{"role": "user", "content": "I can't reset my password"}
]
)
print(response.content[0].text)
Example 2: Load Balancing with Fallback
async function chat(userMessage) {
try {
const response = await fetch('https://api.langmart.ai/v1/chat/completions', {
method: 'POST',
headers: {
'Authorization': 'Bearer sk-your-api-key',
'Content-Type': 'application/json'
},
body: JSON.stringify({
model: 'collection/flash2.5',
messages: [{ role: 'user', content: userMessage }],
temperature: 0.7
})
});
    // Parse the body once; a fetch Response body can only be consumed a single time.
    const data = await response.json();
    if (!response.ok) {
      console.error('Collection error:', data);
    }
    return data;
} catch (error) {
console.error('Request failed:', error);
// Fallback to single model or different collection
}
}
Monitoring and Analytics
Request Distribution
SELECT
-- Extract model from response metadata
response_data->>'_selected_model' as actual_model,
COUNT(*) as request_count,
ROUND(AVG(CAST(response_data->>'_latency_ms' AS NUMERIC)), 2) as avg_latency_ms
FROM request_logs
WHERE request_data->>'model' = 'collection/flash2.5'
AND DATE(created_at) = CURRENT_DATE
GROUP BY actual_model
ORDER BY request_count DESC;
Collection Performance
SELECT
  EXTRACT(HOUR FROM created_at) as hour,
  COUNT(*) as requests,
  ROUND(AVG(CAST(response_data->>'_latency_ms' AS NUMERIC)), 2) as avg_latency,
  -- total_tokens sits under the usage object in the response body
  SUM(CAST(response_data->'usage'->>'total_tokens' AS INTEGER)) as total_tokens
FROM request_logs
WHERE request_data->>'model' = 'collection/flash2.5'
  AND DATE(created_at) = CURRENT_DATE
GROUP BY EXTRACT(HOUR FROM created_at)
ORDER BY hour DESC;
Database Schema
See /datastore/tables/99_model_collections.sql for complete schema details.
Key Tables
- model_collections: Collection metadata
- model_collection_members: Collection membership and routing weights
- model_categories: Available models that can be added to collections
- request_logs: Tracks all requests and selected models
Related Resources
- Collection Tools: /gateway-type3/collection-tools.ts
- Model Collection Router: /gateway-type1/lib/services/model-collection-router.ts
- Database Schema: /datastore/tables/99_model_collections.sql
- Migrations: /datastore/migrations/20251220_*.sql
Version History
- v1.0 (2025-12-20): Initial organization collection implementation
- Round-robin routing strategy implemented
- Collection member management API created
- Metadata storage for collection configuration