# Nous: Hermes 3.1 Llama 3.1 405B

## Model Overview

| Property | Value |
|---|---|
| Model Name | Nous: Hermes 3.1 Llama 3.1 405B Instruct |
| Model ID | `nousresearch/nous-hermes-3.1-llama-3.1-405b` |
| Author/Organization | Nous Research |
| Release Date | November 2024 |
| Base Model | Llama 3.1 405B (full-parameter finetune) |
| Architecture | Transformer (Llama 3.1 architecture) |
## Description
Nous Hermes 3.1 is an iterative refinement of the Hermes 3 model family, built on top of Llama 3.1 405B. It is among the most capable open-weight instruction-tuned models, with improved reasoning, agentic capabilities, and long-context understanding.
## Key Improvements Over Hermes 3
- Enhanced Reasoning Capabilities: Improved logical reasoning and problem-solving with better accuracy
- Advanced Agentic Performance: Superior autonomous agent behavior with improved planning
- Extended Context Handling: Better utilization of full 131K token context window
- Improved Instruction Following: More precise adherence to complex instructions
- Better Multi-turn Coherence: Enhanced context retention and conversation continuity
- Refined Function Calling: More reliable structured output and tool invocation
- Advanced Code Generation: Improved code quality across programming languages
- Enhanced Roleplay Capabilities: Better character consistency and creativity
Technical Specifications
| Specification |
Value |
| Context Window |
128,000 tokens |
| Context Length |
131,072 tokens |
| Max Completion Tokens |
16,384 tokens |
| Input Modalities |
Text |
| Output Modalities |
Text |
| Instruction Format |
ChatML |
| Quantization |
FP8 |
| Parameters |
405 Billion |
| Training Data Cutoff |
November 2024 |
## Pricing

| Type | Price |
|---|---|
| Input Tokens | $1.00 per 1M tokens |
| Output Tokens | $1.00 per 1M tokens |
### Cost Examples

| Use Case | Input Tokens | Output Tokens | Estimated Cost |
|---|---|---|---|
| Short conversation | 1,000 | 500 | $0.0015 |
| Code generation task | 5,000 | 2,000 | $0.007 |
| Long document analysis | 50,000 | 10,000 | $0.06 |
| Extended agent session | 100,000 | 50,000 | $0.15 |
| Full-context research task | 130,000 | 10,000 | $0.14 |
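At a flat $1.00 per million tokens for both input and output, the estimate is simply `(input_tokens + output_tokens) / 1,000,000` dollars. A minimal sketch of that arithmetic (the helper name and constants are illustrative, not part of any SDK):

```python
# Rough cost estimator for this model's flat per-token pricing.
# Rates are taken from the pricing table above; verify current
# pricing with your provider before relying on these numbers.

INPUT_PRICE_PER_M = 1.00   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 1.00  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated request cost in USD."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Matches the "Short conversation" row: 1,000 in + 500 out
print(f"${estimate_cost(1_000, 500):.4f}")  # → $0.0015
```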
## Capabilities

### Core Capabilities
- Text Generation: High-quality text completion and generation
- Function Calling: Structured tool invocation with JSON schemas
- Code Generation: Multi-language code writing, debugging, and refactoring
- Reasoning: Complex logical reasoning, analysis, and problem-solving
- Multi-turn Conversation: Extended dialogue with superior context retention
- Agentic Tasks: Autonomous task execution with planning and tool use
- Long-context Processing: Efficient handling of documents up to 131K tokens
- Instruction Following: Precise adherence to complex, multi-step instructions
### Tool Calling

The `tool_choice` parameter controls tool invocation:

| Tool Choice Option | Description |
|---|---|
| `none` | Disable tool calling |
| `auto` | Model decides whether to use tools |
| `required` | Force tool usage for all responses |
| `function` | Specify an exact function to call |
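In the OpenAI-compatible request format, the first three options are plain strings, while forcing a specific function is conventionally expressed as an object naming it. A sketch assuming that convention (verify the exact shape against your provider's API reference):

```python
from typing import Optional

# Build a tool_choice value for an OpenAI-compatible chat request.
# The string modes come from the table above; the object form for a
# forced function follows the common convention and is an assumption.

def make_tool_choice(mode: str, function_name: Optional[str] = None):
    if mode == "function":
        if function_name is None:
            raise ValueError("mode='function' requires a function name")
        return {"type": "function", "function": {"name": function_name}}
    if mode in ("none", "auto", "required"):
        return mode
    raise ValueError(f"unknown tool_choice mode: {mode}")

print(make_tool_choice("auto"))
print(make_tool_choice("function", "get_weather"))
```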
### Structured Outputs

Supports the `response_format` parameter for:
- JSON Mode: Generate valid JSON output
- JSON Schema: Validate output against custom schemas
- Custom Structured Outputs: Define specific response structures
- XML Mode: Generate XML-formatted outputs
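Even in JSON mode it is worth validating the decoded payload before acting on it. A minimal stdlib-only sketch; the field names echo the `CodeReview` schema used in the API examples below and the sample string stands in for a model reply:

```python
import json

# Defensively parse a JSON-mode response. json.loads raises on
# malformed output, and the explicit field check catches a reply
# that is valid JSON but missing what the application needs.

sample_reply = '{"issues": [], "overall_rating": 9.5}'

def parse_review(raw: str) -> dict:
    data = json.loads(raw)
    if "overall_rating" not in data:
        raise ValueError("missing required field: overall_rating")
    return data

print(parse_review(sample_reply)["overall_rating"])  # → 9.5
```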
## Supported Parameters

| Parameter | Type | Range | Default | Description |
|---|---|---|---|---|
| `temperature` | float | 0.0 – 2.0 | 0.7 | Controls randomness in responses |
| `top_p` | float | 0.0 – 1.0 | 0.9 | Nucleus sampling threshold |
| `top_k` | integer | 1 – 100 | 40 | Limits sampling to the top K tokens |
| `stop` | array | - | - | Stop sequences to end generation |
| `frequency_penalty` | float | -2.0 – 2.0 | 0.0 | Penalizes tokens in proportion to how often they have appeared |
| `presence_penalty` | float | -2.0 – 2.0 | 0.0 | Penalizes tokens that have appeared at all |
| `repetition_penalty` | float | 0.0 – 2.0 | 1.0 | Alternative repetition control |
| `seed` | integer | 0 – 2^32 | - | Random seed for reproducibility |
| `min_p` | float | 0.0 – 1.0 | 0.0 | Minimum probability threshold |
| `max_tokens` | integer | 1 – 16,384 | - | Maximum tokens to generate |
| `response_format` | object | - | - | Structured output format specification |
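Requests built client-side can be kept within the documented ranges before sending. A sketch (the helper is illustrative; the bounds mirror the table above):

```python
# Clamp sampling parameters to the documented ranges so an
# out-of-range value never reaches the API. Bounds are taken
# from the Supported Parameters table above.

PARAM_RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "top_k": (1, 100),
    "frequency_penalty": (-2.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
    "repetition_penalty": (0.0, 2.0),
    "min_p": (0.0, 1.0),
    "max_tokens": (1, 16_384),
}

def clamp_params(params: dict) -> dict:
    """Return a copy of params with known values clamped into range."""
    out = dict(params)
    for name, (lo, hi) in PARAM_RANGES.items():
        if name in out:
            out[name] = min(max(out[name], lo), hi)
    return out

print(clamp_params({"temperature": 3.0, "max_tokens": 50_000}))
# temperature is clamped to 2.0, max_tokens to 16384
```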
## Use Cases

### Recommended For
- Advanced Agentic Applications: Complex autonomous agents, multi-step workflows
- Technical Problem Solving: Debugging, optimization, architectural design
- High-quality Code Development: Complex algorithms, system design, refactoring
- Advanced Reasoning Tasks: Logic puzzles, mathematical proofs, analysis
- Long-form Content Creation: Books, technical documentation, research papers
- Complex Multi-turn Interactions: Extended conversations with context preservation
- Roleplaying & Creative Writing: Character development, narrative creation
- Knowledge Integration: Combining information from extensive documents
### Not Recommended For
- Low-latency Requirements: Large model size results in higher latency
- Cost-sensitive Applications: Premium pricing compared to smaller models
- Simple Queries: Overkill for basic Q&A or simple tasks
- Mobile/Edge Deployment: Requires cloud infrastructure
- Real-time Requirements: Not suitable for sub-second response needs
## Hermes 3.1 Family

| Model | Parameters | Context | Characteristics |
|---|---|---|---|
| Nous Hermes 3.1 Llama 3.1 405B | 405B | 131K | Maximum capability, latest iteration |
| Nous Hermes 3 Llama 3.1 405B | 405B | 131K | Previous version, stable |
| Nous Hermes 3.1 Llama 3.1 70B | 70B | 131K | Balanced performance/cost |
| Nous Hermes 3.1 Llama 3.1 8B | 8B | 131K | Fast, cost-effective |
## Alternative 405B Options

| Model | Provider | Context |
|---|---|---|
| Llama 3.1 405B Instruct | Meta | 131K |
| Llama 4 Maverick | Meta | 131K |
| Mixtral 8x22B | Mistral | 65K |
## Providers

### Available Providers

| Provider | Status | Details |
|---|---|---|
| DeepInfra | Primary | Full model availability |
| OpenRouter | Secondary | Aggregated access (if available) |
| Nous Research | Official | Direct access via Nous API |
### Provider Details

**DeepInfra:**

- Provider Model ID: `NousResearch/Nous-Hermes-3.1-Llama-3.1-405B`
- Max Completion Tokens: 16,384
- Request Rate: Default limits
- Availability: Full model access
## Strengths
- Exceptional Reasoning: Superior logical reasoning from 405B parameters
- Industry-leading Agentic Performance: Top-tier autonomous agent capabilities for open models
- Strong Multi-turn Coherence: Maintains context quality over 100K+ token conversations
- Reliable Function Calling: Consistent structured output and tool invocation
- Excellent Code Quality: High-quality code generation across multiple languages
- Superior Instruction Following: Precise execution of complex instructions
- Extended Context Utilization: Efficiently uses full 131K token context window
## Considerations
- Higher Latency: Larger model size increases response time vs. smaller models
- Premium Pricing: Higher costs compared to 70B or smaller models
- Infrastructure Requirements: Requires substantial compute resources
- Memory Footprint: Large model requires significant GPU/TPU memory (FP8 quantization required for practical deployment)
## Comparison with Other 405B Models

| Model | Organization | Parameters | Context | Use Case |
|---|---|---|---|---|
| Nous Hermes 3.1 Llama 3.1 405B | Nous Research | 405B | 131K | Advanced reasoning & agentic |
| Llama 3.1 405B Instruct | Meta | 405B | 131K | General-purpose, official base |
| Claude 3 Opus | Anthropic | ~200B\* | 200K | Enterprise, safety-focused |
| GPT-4 Turbo | OpenAI | ~1.7T\* | 128K | Premium closed-source |
| Mixtral 8x22B | Mistral | 141B | 65K | Efficient, open-source |

\*Estimated parameters
## API Usage Examples

### Basic Chat Completion

```bash
curl https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer $LANGMART_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nousresearch/nous-hermes-3.1-llama-3.1-405b",
    "messages": [
      {
        "role": "system",
        "content": "You are an expert software architect with deep knowledge of system design patterns."
      },
      {
        "role": "user",
        "content": "Design a distributed caching system for a high-traffic web application. Consider consistency, fault tolerance, and performance."
      }
    ],
    "temperature": 0.7,
    "max_tokens": 4096
  }'
```
### Function Calling Example

```bash
curl https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer $LANGMART_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nousresearch/nous-hermes-3.1-llama-3.1-405b",
    "messages": [
      {
        "role": "user",
        "content": "I need to retrieve the weather for New York, Boston, and Los Angeles for my trip planning."
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get current weather and forecast for a location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "City and state or country"
              },
              "days": {
                "type": "integer",
                "description": "Number of forecast days (1-7)"
              }
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": "auto"
  }'
```
### Structured Output with JSON Schema

```bash
curl https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer $LANGMART_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nousresearch/nous-hermes-3.1-llama-3.1-405b",
    "messages": [
      {
        "role": "user",
        "content": "Analyze this code snippet and provide a detailed review with improvements."
      }
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "CodeReview",
        "schema": {
          "type": "object",
          "properties": {
            "issues": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "severity": {"type": "string"},
                  "description": {"type": "string"},
                  "fix": {"type": "string"}
                }
              }
            },
            "improvements": {
              "type": "array",
              "items": {"type": "string"}
            },
            "overall_rating": {"type": "number"}
          }
        }
      }
    }
  }'
```
### Long-context Processing

```bash
curl https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer $LANGMART_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nousresearch/nous-hermes-3.1-llama-3.1-405b",
    "messages": [
      {
        "role": "system",
        "content": "You are a research assistant. Analyze the provided documents and extract key insights."
      },
      {
        "role": "user",
        "content": "[Long document content - up to 131K tokens]\n\nProvide a comprehensive summary with key findings."
      }
    ],
    "temperature": 0.3,
    "max_tokens": 2048
  }'
```
### Multi-tool Agent Example

```bash
curl https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer $LANGMART_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nousresearch/nous-hermes-3.1-llama-3.1-405b",
    "messages": [
      {
        "role": "user",
        "content": "Help me plan a business trip to San Francisco. I need flight bookings, hotel recommendations, and weather information."
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "search_flights",
          "description": "Search for flight options",
          "parameters": {
            "type": "object",
            "properties": {
              "from": {"type": "string"},
              "to": {"type": "string"},
              "date": {"type": "string"}
            },
            "required": ["from", "to", "date"]
          }
        }
      },
      {
        "type": "function",
        "function": {
          "name": "search_hotels",
          "description": "Find hotel recommendations",
          "parameters": {
            "type": "object",
            "properties": {
              "city": {"type": "string"},
              "check_in": {"type": "string"},
              "check_out": {"type": "string"},
              "price_range": {"type": "string"}
            },
            "required": ["city", "check_in", "check_out"]
          }
        }
      },
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get weather forecast",
          "parameters": {
            "type": "object",
            "properties": {
              "city": {"type": "string"},
              "days": {"type": "integer"}
            },
            "required": ["city"]
          }
        }
      }
    ],
    "tool_choice": "auto"
  }'
```
## Prompt Format

Nous Hermes 3.1 uses the ChatML instruction format:

```
<|im_start|>system
You are a helpful and knowledgeable AI assistant with expertise in multiple domains.<|im_end|>
<|im_start|>user
What is quantum entanglement?<|im_end|>
<|im_start|>assistant
Quantum entanglement is a phenomenon in quantum mechanics where two or more particles become correlated in such a way that the quantum state of each particle cannot be described independently, even when the particles are separated by large distances.<|im_end|>
<|im_start|>user
How is it used in quantum computing?<|im_end|>
<|im_start|>assistant
In quantum computing, entanglement is fundamental to creating quantum gates and circuits. It allows quantum computers to process information in ways that classical computers cannot...
```
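Hosted APIs normally apply this template server-side, but for self-hosted inference or prompt debugging the message list can be rendered client-side. A minimal sketch (the helper name is illustrative):

```python
# Render an OpenAI-style message list into the ChatML format shown
# above: each turn is wrapped in <|im_start|>role ... <|im_end|>,
# and an open assistant header cues the model to generate.

def to_chatml(messages, add_generation_prompt=True):
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
             for m in messages]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is quantum entanglement?"},
])
print(prompt)
```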
## Reasoning & Analysis

| Task | Performance | Notes |
|---|---|---|
| Complex Problem Solving | Excellent | Superior reasoning chains |
| Mathematical Proofs | Excellent | Handles complex logic |
| Code Generation | Excellent | Production-quality code |
| Analysis & Synthesis | Excellent | Integrates complex information |
## Conversation Quality

| Metric | Performance |
|---|---|
| Context Retention (50K tokens) | Excellent |
| Context Retention (100K+ tokens) | Excellent |
| Multi-turn Coherence | Excellent |
| Instruction Following | Excellent |
## Optimization Tips

### For Best Results
- Use Clear System Prompts: Provide detailed role definitions for optimal performance
- Structure Complex Requests: Break multi-step tasks into clear steps
- Leverage Tool Use: Use function calling for structured information needs
- Set Appropriate Temperature: Use 0.3-0.5 for deterministic tasks, 0.7-0.9 for creative content
- Use Full Context: This model excels with extended context (50K+ tokens)
- Enable Structured Output: Use JSON schema for consistent, machine-readable responses
### Cost Optimization

- Consider Hermes 3.1 70B for similar tasks to reduce costs
- Batch requests to reduce overhead
- Use `max_tokens` to avoid unnecessary token generation
- Implement prompt caching for repeated queries
## Limitations & Considerations
- Knowledge Cutoff: Information current only through November 2024
- No Real-time Information: Cannot access current data, weather, or news
- No Internet Access: Cannot browse the web or fetch external URLs
- Training Data Bias: May reflect biases present in training data
- Hallucinations: Can generate plausible but incorrect information
- Token Limits: Context and completion tokens have hard limits
- Processing Speed: Large model means slower response times than smaller alternatives
## Source & Documentation

## Support & Issues

For model-specific issues or questions:
Last updated: December 23, 2024
Note: This model documentation is based on publicly available information. Model availability and pricing may vary by provider. Please verify current availability and pricing with your chosen provider before implementation.