Collections: Flash 2.5

Property Value
Type Collection
Capabilities Vision
Context Window 10K tokens
Input /1M N/A
Output /1M N/A
Max Output N/A

Overview

Property Value
Model ID collection/flash2.5
Display Name Flash 2.5 Collection
Type Organization Collection
Access Method collection/flash2.5
Scope Organization-level
Routing Strategy Round Robin

Description

The Flash 2.5 Collection is a curated, organization-level collection of lightweight, fast-responding language models optimized for speed and cost-effectiveness. It targets real-time conversational AI applications that need strong quality at low latency.

Key Characteristics

  • Speed-Optimized: Models selected for low response latency and high throughput
  • Cost-Effective: Emphasis on efficient token consumption and reasonable pricing
  • Organization-Scoped: Shared resource for all members within an organization
  • Load Distribution: Uses round-robin routing to balance requests across models
  • Flexible Membership: Organization admins can add/remove models from the collection

Specifications

Aspect Details
Context Window 10,000 tokens
Collection Type Organization-level
Access Level Organization members only
Scope organization
Routing Strategy Round Robin (sequential model selection)
Visibility Organization (shared with all org members)
Minimum Members 1 model
Maximum Members Unlimited
Model Heterogeneity Can mix different providers and model sizes

Use Cases

1. High-Throughput Chat Applications

  • Customer support chatbots serving thousands of concurrent users
  • Multi-tenant SaaS platforms with real-time chat features
  • AI assistant applications requiring low latency
  • Load distribution across lightweight model variants

2. Cost-Optimized Production Deployments

  • Reduce per-request inference costs while maintaining quality
  • Run at scale with budget constraints
  • Balance quality and cost for non-critical features
  • Implement tiered service offerings

3. Real-Time Assistant Interactions

  • Live chat with human agents
  • Real-time translation services
  • Immediate response requirements (<2s)
  • Interactive web application backends

4. Educational Demos and Prototypes

  • Fast feedback loops for development
  • Educational institution deployments
  • Proof-of-concept applications
  • Internal tool prototyping

5. Distributed Workload Balancing

  • Prevent bottleneck on single model
  • Handle burst traffic gracefully
  • Optimize resource utilization
  • Smooth degradation under load

Characteristics of Included Models

Models in Flash 2.5 typically exhibit:

Characteristic Value
Type Lightweight LLMs and fast variants
Average Latency <2 seconds
Throughput >100 requests/second
Context Windows 4K - 32K tokens (typically 8K-16K)
Inference Cost Budget-friendly (lower per-token pricing)
Use Case Focus Conversational, real-time applications
Tool Support Most support function calling
Multi-modal Mix of text-only and vision-capable

Model Selection Strategy

Round Robin Routing

The Flash 2.5 collection uses round-robin routing to distribute requests:

  1. Sequential Selection: Models are selected in order
  2. Load Balancing: Requests distributed evenly across collection members
  3. Cycling: Returns to first model after last model is selected
  4. Stateful: Maintains selection index per user/organization

Example Rotation

Collection with 3 models: [Model A, Model B, Model C]

Request 1 → Model A
Request 2 → Model B
Request 3 → Model C
Request 4 → Model A
Request 5 → Model B
...

Fallback Behavior

  • If selected model is unavailable: Skip to next available model
  • If all models unavailable: Return error with available models list
  • If insufficient permissions: Skip to next accessible model
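
The rotation and fallback rules above can be sketched as a small selector. This is a minimal illustration, not the gateway's actual implementation (which lives in /gateway-type1/lib/services/model-collection-router.ts); the class and its `is_available` callback are hypothetical:

```python
class RoundRobinRouter:
    """Sketch of round-robin selection with fallback skipping."""

    def __init__(self, models):
        self.models = models
        self._index = 0  # selection index, maintained per user/organization

    def select(self, is_available):
        """Return the next available model, skipping unavailable members.

        Raises RuntimeError when every member is unavailable, mirroring the
        "all models unavailable" error described above.
        """
        for _ in range(len(self.models)):
            model = self.models[self._index % len(self.models)]
            self._index += 1
            if is_available(model):
                return model
        raise RuntimeError("No models available in collection")


router = RoundRobinRouter(["model-a", "model-b", "model-c"])
print([router.select(lambda m: True) for _ in range(4)])
# → ['model-a', 'model-b', 'model-c', 'model-a']
```

Because the selection index advances even when a member is skipped, an unavailable model simply drops out of the rotation without stalling it.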

Usage

Basic Request

curl -X POST https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "collection/flash2.5",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "temperature": 0.7
  }'

With Parameters

curl -X POST https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "collection/flash2.5",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 100,
    "temperature": 0.5,
    "top_p": 0.9
  }'

Collection Management

Viewing Collection Details

SELECT
    id,
    collection_name,
    display_name,
    description,
    scope,
    routing_strategy,
    organization_id
FROM model_collections
WHERE collection_name = 'flash2.5'
  AND is_active = true;

Viewing Collection Members

SELECT
    mcm.id,
    mc.category_display_id as model_id,
    mc.model_name,
    mcm.priority,
    mcm.weight,
    mcm.is_active
FROM model_collection_members mcm
JOIN model_categories mc ON mcm.model_category_id = mc.id
WHERE mcm.collection_id = (
    SELECT id FROM model_collections
    WHERE collection_name = 'flash2.5'
)
ORDER BY mcm.priority DESC;

API: List Collection Models

curl -X GET https://api.langmart.ai/api/user/model-collections/COLLECTION_ID \
  -H "Authorization: Bearer sk-your-api-key" | jq '.models'

API: Add Model to Collection

curl -X POST https://api.langmart.ai/api/user/model-collections/COLLECTION_ID/members \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "openai/gpt-3.5-turbo"
  }'

API: Remove Model from Collection

curl -X DELETE "https://api.langmart.ai/api/user/model-collections/COLLECTION_ID/members/openai%2Fgpt-3.5-turbo" \
  -H "Authorization: Bearer sk-your-api-key"
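
Model IDs contain a slash, so they must be percent-encoded in the member path (the `%2F` in the URL above). A minimal sketch of building that path; the `path` variable is illustrative and COLLECTION_ID remains a placeholder:

```python
from urllib.parse import quote

model_id = "openai/gpt-3.5-turbo"
# safe="" forces '/' to be encoded as %2F instead of being left as-is
encoded = quote(model_id, safe="")
print(encoded)  # → openai%2Fgpt-3.5-turbo

path = f"/api/user/model-collections/COLLECTION_ID/members/{encoded}"
```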

Response Format

Responses include collection routing metadata:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "google/gemini-flash-latest",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quantum computing uses quantum bits (qubits)..."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 45,
    "total_tokens": 55
  },
  "_collection_routed": true,
  "_collection_name": "flash2.5",
  "_selected_model": "google/gemini-flash-latest"
}

Performance Characteristics

Metric Target
Selection Latency <1ms
P50 Response Time <1 second
P95 Response Time <3 seconds
P99 Response Time <5 seconds
Throughput >100 req/s (model-dependent)
Cache TTL 60 seconds
Request Success Rate >99% (with fallback)

Typical Collection Composition

A typical Flash 2.5 collection might include:

1. google/gemini-flash-latest (Google)
   - Fast, multimodal, 10K context window

2. openai/gpt-3.5-turbo (OpenAI)
   - Proven performance, 4K context window

3. mistralai/mistral-7b-instruct (Mistral)
   - Open source, lightweight, fast

4. anthropic/claude-3-haiku (Anthropic)
   - Balanced speed/quality, 200K context

Configuration and Metadata

Collection Metadata Storage

Collection-specific settings are stored in the collection's JSONB metadata field:

{
  "max_context_window": 16000,
  "preferred_providers": ["google", "openai", "mistral"],
  "description": "Fast models for real-time chat",
  "tags": ["production", "cost-optimized", "low-latency"],
  "sla_target_latency_ms": 2000,
  "min_models_available": 2
}
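
One way such settings could be enforced is a pre-request check against the `min_models_available` floor. The field name comes from the metadata example above; the enforcement logic itself is a hypothetical sketch:

```python
def meets_availability_floor(metadata: dict, available_models: list) -> bool:
    """True when enough collection members are currently available;
    defaults to a floor of 1 when the setting is absent."""
    return len(available_models) >= metadata.get("min_models_available", 1)


metadata = {"min_models_available": 2}
print(meets_availability_floor(metadata, ["gemini-flash", "gpt-3.5-turbo"]))  # → True
print(meets_availability_floor(metadata, ["gemini-flash"]))                   # → False
```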

Member Configuration

Per-model settings in model_collection_members:

{
  "priority": 1,
  "weight": 1,
  "tags": ["fallback"],
  "health_check": true
}
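
The per-member `priority` and `is_active` fields can drive a fallback ordering, matching the `ORDER BY mcm.priority DESC` in the membership query earlier. A hypothetical sketch:

```python
def fallback_order(members: list) -> list:
    """Active members, highest priority first."""
    active = [m for m in members if m["is_active"]]
    return [m["model"] for m in sorted(active, key=lambda m: -m["priority"])]


members = [
    {"model": "openai/gpt-3.5-turbo", "priority": 1, "is_active": True},
    {"model": "google/gemini-flash-latest", "priority": 2, "is_active": True},
    {"model": "mistralai/mistral-7b-instruct", "priority": 1, "is_active": False},
]
print(fallback_order(members))
# → ['google/gemini-flash-latest', 'openai/gpt-3.5-turbo']
```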

Error Handling

Collection Not Found

{
  "error": {
    "code": "collection_not_found",
    "message": "Collection 'flash2.5' not found or access denied"
  }
}

No Available Models

{
  "error": {
    "code": "collection_no_models",
    "message": "No models available in flash2.5 collection. All models are rate-limited or inaccessible.",
    "available_models": [],
    "total_members": 3
  }
}

User Access Denied

{
  "error": {
    "code": "access_denied",
    "message": "Your organization does not have access to flash2.5 collection",
    "required_scope": "organization"
  }
}

Model Error with Fallback

{
  "error": {
    "code": "model_overloaded",
    "message": "Selected model is overloaded, attempting fallback...",
    "fallback_attempted": true,
    "fallback_model": "openai/gpt-3.5-turbo"
  }
}
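
A client might branch on the error codes above. The dispatch below is a hypothetical sketch; the actual retry and fallback policy is up to the caller:

```python
def handle_collection_error(error: dict) -> str:
    """Map documented collection error codes to a suggested client action."""
    code = error.get("code")
    if code == "collection_not_found":
        return "check-collection-name"
    if code == "collection_no_models":
        return "retry-later"
    if code == "access_denied":
        return "request-org-access"
    if code == "model_overloaded":
        # When fallback_attempted is set, the gateway already tried another model
        return "inspect-fallback" if error.get("fallback_attempted") else "retry"
    return "raise"


print(handle_collection_error({"code": "collection_no_models"}))  # → retry-later
```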

Billing and Credits

  • Cost: Based on selected model
  • Tracking: Per-model usage tracked in request_logs
  • Attribution: Each request logs actual model used
  • Organization Quotas: Collection requests count toward org quota
  • Cost Optimization: Round-robin balances expensive/cheap models

Limits and Constraints

Constraint Value
Min Models 1
Max Models Unlimited
Collection Name 100 characters max (lowercase, alphanumeric, hyphens)
Description Unbounded (text field)
Routing Strategies round_robin, random, priority, least_used
Request Rate Model-dependent

Integration Examples

Example 1: Customer Support Bot

import anthropic

client = anthropic.Anthropic(
    api_key="sk-your-api-key",
    base_url="https://api.langmart.ai/v1"
)

response = client.messages.create(
    model="collection/flash2.5",
    max_tokens=256,
    messages=[
        {"role": "user", "content": "I can't reset my password"}
    ]
)
print(response.content[0].text)

Example 2: Load Balancing with Fallback

async function chat(userMessage) {
    try {
        const response = await fetch('https://api.langmart.ai/v1/chat/completions', {
            method: 'POST',
            headers: {
                'Authorization': 'Bearer sk-your-api-key',
                'Content-Type': 'application/json'
            },
            body: JSON.stringify({
                model: 'collection/flash2.5',
                messages: [{ role: 'user', content: userMessage }],
                temperature: 0.7
            })
        });

        const data = await response.json();

        if (!response.ok) {
            console.error('Collection error:', data);
        }

        return data;
    } catch (error) {
        console.error('Request failed:', error);
        // Fallback to single model or different collection
    }
}

Monitoring and Analytics

Request Distribution

SELECT
    -- Extract model from response metadata
    response_data->>'_selected_model' as actual_model,
    COUNT(*) as request_count,
    ROUND(AVG(CAST(response_data->>'_latency_ms' AS NUMERIC)), 2) as avg_latency_ms
FROM request_logs
WHERE request_data->>'model' = 'collection/flash2.5'
  AND DATE(created_at) = CURRENT_DATE
GROUP BY actual_model
ORDER BY request_count DESC;

Collection Performance

SELECT
    EXTRACT(HOUR FROM created_at) as hour,
    COUNT(*) as requests,
    ROUND(AVG(CAST(response_data->>'_latency_ms' AS NUMERIC)), 2) as avg_latency,
    SUM(CAST(response_data->'usage'->>'total_tokens' AS INTEGER)) as total_tokens
FROM request_logs
WHERE request_data->>'model' = 'collection/flash2.5'
  AND DATE(created_at) = CURRENT_DATE
GROUP BY EXTRACT(HOUR FROM created_at)
ORDER BY hour DESC;

Database Schema

See /datastore/tables/99_model_collections.sql for complete schema details.

Key Tables

  • model_collections: Collection metadata
  • model_collection_members: Collection membership and routing weights
  • model_categories: Available models that can be added to collections
  • request_logs: Tracks all requests and selected models

Related Files

  • Collection Tools: /gateway-type3/collection-tools.ts
  • Model Collection Router: /gateway-type1/lib/services/model-collection-router.ts
  • Database Schema: /datastore/tables/99_model_collections.sql
  • Migrations: /datastore/migrations/20251220_*.sql

Version History

  • v1.0 (2025-12-20): Initial organization collection implementation
      • Round-robin routing strategy implemented
      • Collection member management API created
      • Metadata storage for collection configuration