Collections: Flash 2.5

Property Value
Type Collection
Capabilities Vision
Context Window 10K tokens
Input /1M N/A
Output /1M N/A
Max Output N/A

Overview

Property Value
Model ID collection/flash2.5
Display Name Flash 2.5 Collection
Type Organization Collection
Access Method collection/flash2.5
Scope Organization-level
Routing Strategy Round Robin

Description

The Flash 2.5 Collection is a curated, organization-level collection of lightweight, fast-responding language models optimized for speed and cost-effectiveness. It targets real-time conversational AI applications that need strong quality at low latency.

Key Characteristics

  • Speed-Optimized: Models selected for low response latency and high throughput
  • Cost-Effective: Emphasis on efficient token consumption and reasonable pricing
  • Organization-Scoped: Shared resource for all members within an organization
  • Load Distribution: Uses round-robin routing to balance requests across models
  • Flexible Membership: Organization admins can add/remove models from the collection

Specifications

Aspect Details
Context Window 10,000 tokens
Collection Type Organization-level
Access Level Organization members only
Scope organization
Routing Strategy Round Robin (sequential model selection)
Visibility Organization (shared with all org members)
Minimum Members 1 model
Maximum Members Unlimited
Model Heterogeneity Can mix different providers and model sizes

Use Cases

1. High-Throughput Chat Applications

  • Customer support chatbots serving thousands of concurrent users
  • Multi-tenant SaaS platforms with real-time chat features
  • AI assistant applications requiring low latency
  • Load distribution across lightweight model variants

2. Cost-Optimized Production Deployments

  • Reduce per-request inference costs while maintaining quality
  • Run at scale with budget constraints
  • Balance quality and cost for non-critical features
  • Implement tiered service offerings

3. Real-Time Assistant Interactions

  • Live chat with human agents
  • Real-time translation services
  • Immediate response requirements (<2s)
  • Interactive web application backends

4. Educational Demos and Prototypes

  • Fast feedback loops for development
  • Educational institution deployments
  • Proof-of-concept applications
  • Internal tool prototyping

5. Distributed Workload Balancing

  • Prevent bottleneck on single model
  • Handle burst traffic gracefully
  • Optimize resource utilization
  • Smooth degradation under load

Characteristics of Included Models

Models in Flash 2.5 typically exhibit:

Characteristic Value
Type Lightweight LLMs and fast variants
Average Latency <2 seconds
Throughput >100 requests/second
Context Windows 4K - 32K tokens (typically 8K-16K)
Inference Cost Budget-friendly (lower per-token pricing)
Use Case Focus Conversational, real-time applications
Tool Support Most support function calling
Multi-modal Mix of text-only and vision-capable

Model Selection Strategy

Round Robin Routing

The Flash 2.5 collection uses round-robin routing to distribute requests:

  1. Sequential Selection: Models are selected in order
  2. Load Balancing: Requests distributed evenly across collection members
  3. Cycling: Returns to first model after last model is selected
  4. Stateful: Maintains selection index per user/organization

Example Rotation

Collection with 3 models: [Model A, Model B, Model C]

Request 1 → Model A
Request 2 → Model B
Request 3 → Model C
Request 4 → Model A
Request 5 → Model B
...

Fallback Behavior

  • If selected model is unavailable: Skip to next available model
  • If all models unavailable: Return error with available models list
  • If insufficient permissions: Skip to next accessible model
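
The rotation and fallback rules above can be sketched as a small selector. This is a minimal illustration, not the gateway's actual implementation (which lives in /gateway-type1/lib/services/model-collection-router.ts); the class and its `is_available` callback are hypothetical:

```python
class RoundRobinRouter:
    """Sketch of round-robin selection with fallback skipping."""

    def __init__(self, models):
        self.models = models
        self._index = 0  # selection index, maintained per user/organization

    def select(self, is_available):
        """Return the next available model, skipping unavailable members.

        Raises RuntimeError when every member is unavailable, mirroring the
        "all models unavailable" error described above.
        """
        for _ in range(len(self.models)):
            model = self.models[self._index % len(self.models)]
            self._index += 1
            if is_available(model):
                return model
        raise RuntimeError("No models available in collection")


router = RoundRobinRouter(["model-a", "model-b", "model-c"])
print([router.select(lambda m: True) for _ in range(4)])
# → ['model-a', 'model-b', 'model-c', 'model-a']
```

Because the selection index advances even when a member is skipped, an unavailable model simply drops out of the rotation without stalling it.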

Usage

Basic Request

curl -X POST https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "collection/flash2.5",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "temperature": 0.7
  }'

With Parameters

curl -X POST https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "collection/flash2.5",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 100,
    "temperature": 0.5,
    "top_p": 0.9
  }'

Collection Management

Viewing Collection Details

SELECT
    id,
    collection_name,
    display_name,
    description,
    scope,
    routing_strategy,
    organization_id
FROM model_collections
WHERE collection_name = 'flash2.5'
  AND is_active = true;

Viewing Collection Members

SELECT
    mcm.id,
    mc.category_display_id as model_id,
    mc.model_name,
    mcm.priority,
    mcm.weight,
    mcm.is_active
FROM model_collection_members mcm
JOIN model_categories mc ON mcm.model_category_id = mc.id
WHERE mcm.collection_id = (
    SELECT id FROM model_collections
    WHERE collection_name = 'flash2.5'
)
ORDER BY mcm.priority DESC;

API: List Collection Models

curl -X GET https://api.langmart.ai/api/user/model-collections/COLLECTION_ID \
  -H "Authorization: Bearer sk-your-api-key" | jq '.models'

API: Add Model to Collection

curl -X POST https://api.langmart.ai/api/user/model-collections/COLLECTION_ID/members \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "openai/gpt-3.5-turbo"
  }'

API: Remove Model from Collection

curl -X DELETE "https://api.langmart.ai/api/user/model-collections/COLLECTION_ID/members/openai%2Fgpt-3.5-turbo" \
  -H "Authorization: Bearer sk-your-api-key"
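
Model IDs contain a slash, so they must be percent-encoded in the member path (the `%2F` in the URL above). A minimal sketch of building that path; the `path` variable is illustrative and COLLECTION_ID remains a placeholder:

```python
from urllib.parse import quote

model_id = "openai/gpt-3.5-turbo"
# safe="" forces '/' to be encoded as %2F instead of being left as-is
encoded = quote(model_id, safe="")
print(encoded)  # → openai%2Fgpt-3.5-turbo

path = f"/api/user/model-collections/COLLECTION_ID/members/{encoded}"
```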

Response Format

Responses include collection routing metadata:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "google/gemini-flash-latest",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quantum computing uses quantum bits (qubits)..."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 45,
    "total_tokens": 55
  },
  "_collection_routed": true,
  "_collection_name": "flash2.5",
  "_selected_model": "google/gemini-flash-latest"
}

Performance Characteristics

Metric Target
Selection Latency <1ms
P50 Response Time <1 second
P95 Response Time <3 seconds
P99 Response Time <5 seconds
Throughput >100 req/s (model-dependent)
Cache TTL 60 seconds
Request Success Rate >99% (with fallback)

Typical Collection Composition

A typical Flash 2.5 collection might include:

1. google/gemini-flash-latest (Google)
   - Fast, multimodal, 10K context window

2. openai/gpt-3.5-turbo (OpenAI)
   - Proven performance, 4K context window

3. mistralai/mistral-7b-instruct (Mistral)
   - Open source, lightweight, fast

4. anthropic/claude-3-haiku (Anthropic)
   - Balanced speed/quality, 200K context

Configuration and Metadata

Collection Metadata Storage

Collection-specific settings are stored in the collection's JSONB metadata field:

{
  "max_context_window": 16000,
  "preferred_providers": ["google", "openai", "mistral"],
  "description": "Fast models for real-time chat",
  "tags": ["production", "cost-optimized", "low-latency"],
  "sla_target_latency_ms": 2000,
  "min_models_available": 2
}
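
One way such settings could be enforced is a pre-request check against the `min_models_available` floor. The field name comes from the metadata example above; the enforcement logic itself is a hypothetical sketch:

```python
def meets_availability_floor(metadata: dict, available_models: list) -> bool:
    """True when enough collection members are currently available;
    defaults to a floor of 1 when the setting is absent."""
    return len(available_models) >= metadata.get("min_models_available", 1)


metadata = {"min_models_available": 2}
print(meets_availability_floor(metadata, ["gemini-flash", "gpt-3.5-turbo"]))  # → True
print(meets_availability_floor(metadata, ["gemini-flash"]))                   # → False
```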

Member Configuration

Per-model settings in model_collection_members:

{
  "priority": 1,
  "weight": 1,
  "tags": ["fallback"],
  "health_check": true
}
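
The per-member `priority` and `is_active` fields can drive a fallback ordering, matching the `ORDER BY mcm.priority DESC` in the membership query earlier. A hypothetical sketch:

```python
def fallback_order(members: list) -> list:
    """Active members, highest priority first."""
    active = [m for m in members if m["is_active"]]
    return [m["model"] for m in sorted(active, key=lambda m: -m["priority"])]


members = [
    {"model": "openai/gpt-3.5-turbo", "priority": 1, "is_active": True},
    {"model": "google/gemini-flash-latest", "priority": 2, "is_active": True},
    {"model": "mistralai/mistral-7b-instruct", "priority": 1, "is_active": False},
]
print(fallback_order(members))
# → ['google/gemini-flash-latest', 'openai/gpt-3.5-turbo']
```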

Error Handling

Collection Not Found

{
  "error": {
    "code": "collection_not_found",
    "message": "Collection 'flash2.5' not found or access denied"
  }
}

No Available Models

{
  "error": {
    "code": "collection_no_models",
    "message": "No models available in flash2.5 collection. All models are rate-limited or inaccessible.",
    "available_models": [],
    "total_members": 3
  }
}

User Access Denied

{
  "error": {
    "code": "access_denied",
    "message": "Your organization does not have access to flash2.5 collection",
    "required_scope": "organization"
  }
}

Model Error with Fallback

{
  "error": {
    "code": "model_overloaded",
    "message": "Selected model is overloaded, attempting fallback...",
    "fallback_attempted": true,
    "fallback_model": "openai/gpt-3.5-turbo"
  }
}
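
A client might branch on the error codes above. The dispatch below is a hypothetical sketch; the actual retry and fallback policy is up to the caller:

```python
def handle_collection_error(error: dict) -> str:
    """Map documented collection error codes to a suggested client action."""
    code = error.get("code")
    if code == "collection_not_found":
        return "check-collection-name"
    if code == "collection_no_models":
        return "retry-later"
    if code == "access_denied":
        return "request-org-access"
    if code == "model_overloaded":
        # When fallback_attempted is set, the gateway already tried another model
        return "inspect-fallback" if error.get("fallback_attempted") else "retry"
    return "raise"


print(handle_collection_error({"code": "collection_no_models"}))  # → retry-later
```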

Billing and Credits

  • Cost: Based on selected model
  • Tracking: Per-model usage tracked in request_logs
  • Attribution: Each request logs actual model used
  • Organization Quotas: Collection requests count toward org quota
  • Cost Optimization: Round-robin balances expensive/cheap models

Limits and Constraints

Constraint Value
Min Models 1
Max Models Unlimited
Collection Name 100 characters max (lowercase, alphanumeric, hyphens)
Description Unbounded (text field)
Routing Strategies round_robin, random, priority, least_used
Request Rate Model-dependent

Integration Examples

Example 1: Customer Support Bot

import anthropic

client = anthropic.Anthropic(
    api_key="sk-your-api-key",
    base_url="https://api.langmart.ai/v1"
)

response = client.messages.create(
    model="collection/flash2.5",
    max_tokens=256,
    messages=[
        {"role": "user", "content": "I can't reset my password"}
    ]
)
print(response.content[0].text)

Example 2: Load Balancing with Fallback

async function chat(userMessage) {
    try {
        const response = await fetch('https://api.langmart.ai/v1/chat/completions', {
            method: 'POST',
            headers: {
                'Authorization': 'Bearer sk-your-api-key',
                'Content-Type': 'application/json'
            },
            body: JSON.stringify({
                model: 'collection/flash2.5',
                messages: [{ role: 'user', content: userMessage }],
                temperature: 0.7
            })
        });

        const data = await response.json();

        if (!response.ok) {
            console.error('Collection error:', data);
        }

        return data;
    } catch (error) {
        console.error('Request failed:', error);
        // Fallback to single model or different collection
    }
}

Monitoring and Analytics

Request Distribution

SELECT
    -- Extract model from response metadata
    response_data->>'_selected_model' as actual_model,
    COUNT(*) as request_count,
    ROUND(AVG(CAST(response_data->>'_latency_ms' AS NUMERIC)), 2) as avg_latency_ms
FROM request_logs
WHERE request_data->>'model' = 'collection/flash2.5'
  AND DATE(created_at) = CURRENT_DATE
GROUP BY actual_model
ORDER BY request_count DESC;

Collection Performance

SELECT
    EXTRACT(HOUR FROM created_at) as hour,
    COUNT(*) as requests,
    ROUND(AVG(CAST(response_data->>'_latency_ms' AS NUMERIC)), 2) as avg_latency,
    SUM(CAST(response_data->'usage'->>'total_tokens' AS INTEGER)) as total_tokens
FROM request_logs
WHERE request_data->>'model' = 'collection/flash2.5'
  AND DATE(created_at) = CURRENT_DATE
GROUP BY EXTRACT(HOUR FROM created_at)
ORDER BY hour DESC;

Database Schema

See /datastore/tables/99_model_collections.sql for complete schema details.

Key Tables

  • model_collections: Collection metadata
  • model_collection_members: Collection membership and routing weights
  • model_categories: Available models that can be added to collections
  • request_logs: Tracks all requests and selected models

Related Files

  • Collection Tools: /gateway-type3/collection-tools.ts
  • Model Collection Router: /gateway-type1/lib/services/model-collection-router.ts
  • Database Schema: /datastore/tables/99_model_collections.sql
  • Migrations: /datastore/migrations/20251220_*.sql

Version History

  • v1.0 (2025-12-20): Initial organization collection implementation
      • Round-robin routing strategy implemented
      • Collection member management API created
      • Metadata storage for collection configuration