## Overview

| Attribute | Value |
|---|---|
| Model Name | Meta: Llama 3.1 405B Instruct |
| Model ID | meta-llama/llama-3.1-405b-instruct |
| Creator | Meta (meta-llama) |
| Release Date | July 23, 2024 |
| Parameters | 405 billion |
| Architecture | Auto-regressive transformer with Grouped-Query Attention (GQA) |
| Context Length | 130,815 tokens (~128K) |
| Knowledge Cutoff | December 2023 |
## Description

Llama 3.1 405B Instruct is Meta's largest and most capable open-source language model and the flagship of the Llama 3.1 series. This 405-billion-parameter model offers a 128K-token context window and performs competitively against leading closed-source models, including GPT-4o and Claude 3.5 Sonnet, in benchmark evaluations.
The model was fine-tuned using:
- Supervised Fine-Tuning (SFT)
- Reinforcement Learning with Human Feedback (RLHF)
- Over 25 million synthetically generated examples plus human-generated data
## Supported Languages
- English
- German
- French
- Italian
- Portuguese
- Hindi
- Spanish
- Thai
## Technical Specifications

### Model Architecture
- Type: Auto-regressive transformer
- Attention: Grouped-Query Attention (GQA) for improved inference scalability
- Input/Output: Text only
- Instruction Type: Llama3
- Vocabulary Size: 128,256 tokens
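
As a quick sanity check of the tokenizer, the vocabulary size can be read from the Hugging Face tokenizer directly. A minimal sketch; the meta-llama repositories are gated, so this assumes an approved access request and an authenticated Hugging Face login:

```python
from transformers import AutoTokenizer

# Gated repository: requires accepting the Llama 3.1 license on Hugging Face
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-405B-Instruct")

# len() counts the base BPE vocabulary plus added special tokens;
# for Llama 3.1 this should come to 128,256
print(len(tokenizer))
```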
### Training Data

| Attribute | Value |
|---|---|
| Context Window | 128,000 tokens |
| Training Data Size | ~15 trillion tokens from publicly available sources |
| Fine-tuning Data | >25M synthetically generated examples + human-generated data |
| Training Infrastructure | Custom Meta GPU cluster |
### Training Compute

| Metric | Value |
|---|---|
| GPU Hours | ~30.8M GPU hours |
| Hardware | H100-80GB GPUs (700W TDP) |
| Total Compute | ~3.8 × 10^25 FLOPs |
| Location-Based Emissions | ~8,930 tons CO2eq |
| Market-Based Emissions | 0 tons CO2eq (100% renewable energy) |
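
As a back-of-the-envelope check, the GPU-hour and FLOP figures above are mutually consistent: dividing total compute by total GPU-seconds gives the average sustained throughput per GPU.

```python
# Cross-check the reported training-compute figures (values from the table above)
gpu_hours = 30.8e6       # ~30.8M H100-80GB GPU hours
total_flops = 3.8e25     # ~3.8 x 10^25 FLOPs

per_gpu_flops = total_flops / (gpu_hours * 3600)
print(f"{per_gpu_flops:.2e} FLOP/s per GPU")  # ~3.4e14, i.e. ~340 TFLOP/s

# H100 peak dense BF16 throughput is roughly 1e15 FLOP/s, so this implies
# on the order of 35% sustained utilization -- a plausible figure at this scale.
```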
## Supported Parameters

| Parameter | Supported |
|---|---|
| `max_tokens` | Yes |
| `temperature` | Yes |
| `top_p` | Yes |
| `top_k` | Yes |
| `stop` | Yes |
| `frequency_penalty` | Yes |
| `presence_penalty` | Yes |
| `repetition_penalty` | Yes |
| `logit_bias` | Yes |
| `min_p` | Yes |
| `tools` | Yes |
| `tool_choice` | Yes |
### Default Stop Tokens

- `<|eot_id|>`
- `<|end_of_text|>`
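
These parameters map directly onto an OpenAI-compatible request body. A minimal sketch against the LangMart endpoint shown later in this page; passing the non-standard fields (`top_k`, `min_p`, `repetition_penalty`) through `extra_body` is an assumption about how the gateway forwards them:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.langmart.ai/v1", api_key="YOUR_LANGMART_API_KEY")

response = client.chat.completions.create(
    model="meta-llama/llama-3.1-405b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
    stop=["<|eot_id|>"],  # one of the default stop tokens listed above
    # Non-OpenAI sampling knobs; forwarding via extra_body is an assumption
    extra_body={"top_k": 40, "min_p": 0.05, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
```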
## Features

| Feature | Status |
|---|---|
| Tool Calling | Supported |
| Multipart Requests | Supported |
| Abortable Requests | Supported |
| Reasoning Capabilities | Not Supported |
| JSON Mode | Supported |
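
With JSON mode listed as supported, a structured-output request can be sketched as follows, assuming the provider honors the OpenAI-style `response_format` field. JSON mode constrains the output to syntactically valid JSON; the schema itself still has to be described in the prompt:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.langmart.ai/v1", api_key="YOUR_LANGMART_API_KEY")

response = client.chat.completions.create(
    model="meta-llama/llama-3.1-405b-instruct",
    messages=[
        {"role": "system", "content": "Reply with a JSON object with keys 'city' and 'country'."},
        {"role": "user", "content": "Where is the Eiffel Tower?"},
    ],
    response_format={"type": "json_object"},  # JSON mode
)
print(response.choices[0].message.content)  # e.g. {"city": "Paris", "country": "France"}
```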
## Llama Model Family

| Model | Parameters | Context | Use Case |
|---|---|---|---|
| Llama 3.1 8B Instruct | 8B | 128K | Lightweight deployment |
| Llama 3.1 70B Instruct | 70B | 128K | Balanced performance/cost |
| Llama 3.1 405B Instruct | 405B | 128K | Maximum capability |
| Llama 3.2 Vision | Various | Various | Multimodal (image + text) |
| Llama 3.3 70B Instruct | 70B | 128K | Next-gen optimized 70B |
## Providers

### Together AI (Primary Provider)

| Attribute | Value |
|---|---|
| Provider Slug | together |
| Model ID at Provider | meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo |
| Base URL | https://api.langmart.ai/v1 |
| Region | US |
| Data Training | No |
| Prompt Retention | No |
| Terms | https://www.together.ai/terms-of-service |
### Additional Providers
The Llama 3.1 405B model is also available through:
- Fireworks AI - High-performance inference
- Lepton AI - Alternative hosting
- Novita AI - Cost-effective option
## Pricing (via LangMart)

### Standard Tier (Together AI)

| Type | Price per Million Tokens |
|---|---|
| Input | $3.50 |
| Output | $3.50 |

- Quantization: FP8
- Provider Model ID: `meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo`
### Free Tier

| Type | Price per Million Tokens |
|---|---|
| Input | $0.00 |
| Output | $0.00 |

- Model ID: `meta-llama/llama-3.1-405b-instruct:free`
- Limited rate/quota applies
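
With identical input and output rates, cost is linear in the total token count. A quick estimate at the standard-tier price:

```python
# Standard-tier pricing: $3.50 per million tokens, input and output alike
PRICE_PER_MILLION = 3.50

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated cost in USD for one request at the standard tier."""
    return (prompt_tokens + completion_tokens) * PRICE_PER_MILLION / 1_000_000

# Example: a 2,000-token prompt with a 500-token completion
print(f"${request_cost(2_000, 500):.5f}")  # $0.00875
```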
## Comparison with Other Leading Models

| Benchmark | Category | Llama 3.1 70B | Llama 3.1 405B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|---|---|
| MMLU (CoT) | Knowledge | 86.0% | 88.6% | 88.7% | 88.7% |
| MMLU Pro (CoT) | Knowledge | 66.4% | 73.3% | - | - |
| IFEval | Steerability | 87.5% | 88.6% | - | - |
| GPQA Diamond | Reasoning | 48.0% | 49.0% | - | - |
| HumanEval | Code | 80.5% | 89.0% | 90.2% | 92.0% |
| MBPP EvalPlus | Code | 86.0% | 88.6% | - | - |
| MATH (CoT) | Math | 68.0% | 73.8% | 76.6% | 78.3% |
| BFCL v2 | Tool Use | 77.5% | 81.1% | - | - |
| MGSM | Multilingual | 86.9% | 91.6% | - | - |

### Highlights
- Flagship open-source model with 405B parameters
- Competitive with GPT-4o and Claude 3.5 Sonnet across major benchmarks
- 91.6% on MGSM (multilingual reasoning)
- 89.0% on HumanEval (code generation)
- 81.1% on BFCL v2 (tool use / function calling)
## Hardware Requirements

### Inference

The 405B model requires significant GPU resources for inference:

| Configuration | VRAM Required |
|---|---|
| FP16 (full precision) | ~810 GB |
| FP8 (quantized) | ~405 GB |
| INT4 (quantized) | ~203 GB |
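
These figures follow directly from the parameter count times bytes per parameter; the sketch below reproduces them (weights only; real deployments need additional headroom for the KV cache, activations, and runtime overhead):

```python
# Weights-only VRAM estimate: parameters x bytes per parameter
PARAMS = 405e9

for precision, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    gigabytes = PARAMS * bytes_per_param / 1e9
    print(f"{precision}: ~{gigabytes:.1f} GB")
# FP16: ~810.0 GB, FP8: ~405.0 GB, INT4: ~202.5 GB -- matching the table
```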
### Recommended Setup
For FP8 inference (used by most providers):
- 8x A100 80GB GPUs, or
- 8x H100 80GB GPUs, or
- Multi-node setup with smaller GPUs
### Using vLLM

```python
from vllm import LLM, SamplingParams

# Tensor parallelism across 8 GPUs. Note that full-precision BF16 weights
# (~810 GB) exceed 8x80GB of VRAM, which is why most providers quantize to FP8.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",
    tensor_parallel_size=8,  # 8 GPUs
    dtype="bfloat16",
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=2048,
)

outputs = llm.generate(["Hello, how are you?"], sampling_params)
print(outputs[0].outputs[0].text)  # generated completion for the first prompt
```
## Function Calling

The model supports function calling with the following pattern:
```python
from transformers import AutoTokenizer

def get_current_weather(location: str, unit: str = "celsius") -> dict:
    """Get the current weather in a given location.

    Args:
        location: The city and country (e.g., "Paris, France")
        unit: Temperature unit ("celsius" or "fahrenheit")

    Returns:
        Weather information as a dictionary
    """
    return {"temperature": 22, "condition": "sunny"}

messages = [
    {"role": "system", "content": "You are a helpful weather assistant."},
    {"role": "user", "content": "What's the weather in Paris?"},
]

# Using transformers: passing the function as a tool lets the chat template
# render its signature and docstring into the prompt as a JSON schema
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-405B-Instruct")
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_current_weather],
    add_generation_prompt=True,
)

# After the model generates a tool call, echo it back as an assistant turn
tool_call = {
    "name": "get_current_weather",
    "arguments": {"location": "Paris, France", "unit": "celsius"},
}
messages.append({
    "role": "assistant",
    "tool_calls": [{"type": "function", "function": tool_call}],
})

# Append the tool result so the model can compose the final answer
messages.append({
    "role": "tool",
    "name": "get_current_weather",
    "content": '{"temperature": 22, "condition": "sunny"}',
})
```
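
To produce the final answer, the updated conversation (now including the tool result) is rendered through the chat template again and passed back to the model; continuing the sketch above:

```python
# Second pass: the model now sees the tool output and composes
# a natural-language answer from it
final_inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_current_weather],
    add_generation_prompt=True,
)
```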
## License

**License:** Llama 3.1 Community License Agreement

### Key Terms
- Non-exclusive, worldwide, non-transferable, royalty-free limited license
- Use, reproduce, distribute, copy, create derivative works
- Modify the Llama Materials
### Commercial Requirements
- If monthly active users exceed 700M, you must request a license from Meta
- Must include "Built with Llama" on related websites, user interfaces, and documentation
- Must include "Llama" at the beginning of any AI model name built with Llama 3.1
### Attribution Required

> Llama 3.1 is licensed under the Llama 3.1 Community License,
> Copyright © Meta Platforms, Inc. All Rights Reserved.
### Prohibited Uses
- Violence, terrorism, and illegal activities
- Child exploitation and abuse material
- Human trafficking and sexual violence
- Harassment and bullying
- Discrimination in employment, credit, housing
- Unauthorized professional practice (legal, medical, financial)
- Malware and malicious code creation
- Fraud, disinformation, and defamation
- Impersonation and misrepresentation
- Violations of ITAR, biological/chemical weapons regulations
## Safety Considerations

### Critical Risk Mitigation Areas
- CBRNE Materials - Uplift testing to assess proliferation risks
- Child Safety - Expert red teaming across supported languages
- Cyber Attack Enablement - Hacking task capability evaluation
### Recommended Safeguards

| Tool | Purpose |
|---|---|
| Llama Guard 3 | Input/output filtering |
| Prompt Guard | Prompt injection detection |
| Code Shield | Code security analysis |
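
As an illustration of input filtering, here is a minimal sketch using the Llama Guard 3 8B checkpoint via transformers (the choice of checkpoint is an assumption for illustration; Llama Guard's chat template wraps the conversation in its moderation prompt, and the model replies with `safe` or `unsafe` plus a hazard category):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint: the 8B Llama Guard 3 classifier (gated repository)
guard_id = "meta-llama/Llama-Guard-3-8B"
tokenizer = AutoTokenizer.from_pretrained(guard_id)
model = AutoModelForCausalLM.from_pretrained(
    guard_id, torch_dtype=torch.bfloat16, device_map="auto"
)

chat = [{"role": "user", "content": "User message to screen goes here."}]

# The chat template renders Llama Guard's moderation prompt around the chat
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=32)

# Decode only the newly generated verdict, e.g. "safe" or "unsafe\nS9"
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```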
### Multilinguality Caution

The model supports seven non-English languages that meet Meta's safety thresholds. Use in unsupported languages is strongly discouraged without:

- Fine-tuning
- System controls aligned with use-case policies
## Data Policy

Per the provider table above, the primary provider (Together AI) neither trains on user prompts nor retains them.
## API Usage Examples

### LangMart API
```bash
curl https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer $LANGMART_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/llama-3.1-405b-instruct",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'
```
### Free Tier

```bash
curl https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer $LANGMART_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/llama-3.1-405b-instruct:free",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'
```
### Together AI Direct

```bash
curl https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'
```
### Python (OpenAI SDK)

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.langmart.ai/v1",
    api_key="YOUR_LANGMART_API_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/llama-3.1-405b-instruct",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    temperature=0.7,
    max_tokens=1024,
)

print(response.choices[0].message.content)
```
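
Streaming responses work through the same SDK; a minimal sketch:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.langmart.ai/v1", api_key="YOUR_LANGMART_API_KEY")

# Stream tokens as they arrive instead of waiting for the full completion
stream = client.chat.completions.create(
    model="meta-llama/llama-3.1-405b-instruct",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```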
## Resources

### Official Documentation

### Issue Reporting
---

Last updated: December 2024

Source: LangMart, Hugging Face, Meta Model Card