Meta Llama 3.1 405B Instruct

Overview

Attribute Value
Model Name Meta: Llama 3.1 405B Instruct
Model ID meta-llama/llama-3.1-405b-instruct
Creator Meta (meta-llama)
Release Date July 23, 2024
Parameters 405 billion
Architecture Auto-regressive transformer with Grouped-Query Attention (GQA)
Context Length 130,815 tokens (~128K)
Knowledge Cutoff December 2023

Description

Llama 3.1 405B Instruct is Meta's largest and most capable open-source language model and the flagship of the Llama 3.1 series. This 405-billion-parameter model has a 128K-token context window and performs competitively with leading closed-source models, including GPT-4o and Claude 3.5 Sonnet, on standard benchmarks.

The model was fine-tuned using:

  • Supervised Fine-Tuning (SFT)
  • Reinforcement Learning with Human Feedback (RLHF)
  • Over 25 million synthetically generated examples plus human-generated data

Supported Languages

  • English
  • German
  • French
  • Italian
  • Portuguese
  • Hindi
  • Spanish
  • Thai

Technical Specifications

Model Architecture

  • Type: Auto-regressive transformer
  • Attention: Grouped-Query Attention (GQA) for improved inference scalability
  • Input/Output: Text only
  • Instruction Type: Llama3
  • Vocabulary Size: 128,256 tokens
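
The vocabulary size can be read off the published tokenizer. A minimal sketch using Hugging Face transformers (assumes access to the gated meta-llama repository has been granted):

from transformers import AutoTokenizer

# Gated repo: requires accepting the Llama 3.1 license on Hugging Face.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-405B-Instruct")
print(len(tokenizer))  # 128256, matching the vocabulary size above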

Training Data

Attribute Value
Context Window 128,000 tokens
Training Data Size ~15 trillion tokens from publicly available sources
Fine-tuning Data >25M synthetically generated examples + human-generated data
Training Infrastructure Custom Meta GPU cluster

Training Compute

Metric Value
GPU Hours ~30.8M GPU hours
Hardware H100-80GB GPUs (700W TDP)
Total Compute Approximately 3.8 x 10^25 FLOPs
Location-Based Emissions ~8,930 tons CO2eq
Market-Based Emissions 0 tons CO2eq (100% renewable energy)
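
As a rough cross-check, the energy implied by these figures is GPU hours × TDP, a lower bound that ignores host machines and cooling:

# Rough lower bound: GPU hours x per-GPU TDP only.
gpu_hours = 30.8e6          # ~30.8M GPU hours
tdp_kw = 0.7                # H100-80GB at 700 W
print(f"~{gpu_hours * tdp_kw / 1e6:.1f} GWh")  # ~21.6 GWh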

Supported Parameters

Parameter Supported
max_tokens Yes
temperature Yes
top_p Yes
top_k Yes
stop Yes
frequency_penalty Yes
presence_penalty Yes
repetition_penalty Yes
logit_bias Yes
min_p Yes
tools Yes
tool_choice Yes
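
All of these can be sent through the OpenAI-compatible endpoint. A minimal sketch with the OpenAI Python SDK; the values are illustrative, and passing the non-standard samplers (top_k, min_p, repetition_penalty) via extra_body is an assumption about how LangMart accepts them:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.langmart.ai/v1",
    api_key="YOUR_LANGMART_API_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/llama-3.1-405b-instruct",
    messages=[{"role": "user", "content": "Name three prime numbers."}],
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
    presence_penalty=0.0,
    frequency_penalty=0.0,
    stop=["<|eot_id|>"],
    # Samplers outside the OpenAI schema go in the raw request body.
    extra_body={"top_k": 40, "min_p": 0.05, "repetition_penalty": 1.1},
)
print(response.choices[0].message.content)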

Default Stop Tokens

  • <|eot_id|>
  • <|end_of_text|>
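
Both markers come from the Llama 3 prompt format: <|eot_id|> ends each message turn, and <|end_of_text|> ends the sequence. Rendering the chat template makes this visible (same gated-tokenizer assumption as above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-405B-Instruct")
rendered = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hi"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(rendered)
# The rendered string wraps each turn in <|start_header_id|>...<|end_header_id|>
# and terminates it with <|eot_id|>.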

Features

Feature Status
Tool Calling Supported
Multipart Requests Supported
Abortable Requests Supported
Reasoning Capabilities Not Supported
JSON Mode Supported
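
JSON mode is exposed through the OpenAI-compatible response_format field; a minimal sketch (assuming LangMart honors response_format for this model):

from openai import OpenAI

client = OpenAI(base_url="https://api.langmart.ai/v1", api_key="YOUR_LANGMART_API_KEY")

# response_format constrains the model to emit a valid JSON object.
response = client.chat.completions.create(
    model="meta-llama/llama-3.1-405b-instruct",
    messages=[
        {"role": "system", "content": "Reply only with a JSON object."},
        {"role": "user", "content": 'Give two Llama 3.1 sizes as {"sizes": []}.'},
    ],
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)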

Model Family

Model Parameters Context Use Case
Llama 3.1 8B Instruct 8B 128K Lightweight deployment
Llama 3.1 70B Instruct 70B 128K Balanced performance/cost
Llama 3.1 405B Instruct 405B 128K Maximum capability
Llama 3.2 Vision Various Various Multimodal (image + text)
Llama 3.3 70B Instruct 70B 128K Next-gen optimized 70B

Providers

Together AI - Primary Provider

Attribute Value
Provider Slug together
Model ID at Provider meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo
Base URL https://api.langmart.ai/v1
Region US
Data Training No
Prompt Retention No
Terms https://www.together.ai/terms-of-service

Additional Providers

The Llama 3.1 405B model is also available through:

  • Fireworks AI - High-performance inference
  • Lepton AI - Alternative hosting
  • Novita AI - Cost-effective option

Pricing (via LangMart)

Standard Tier (Together AI)

Type Price per Million Tokens
Input $3.50
Output $3.50
  • Quantization: FP8
  • Provider Model ID: meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo

Free Tier

Type Price per Million Tokens
Input $0.00
Output $0.00
  • Model ID: meta-llama/llama-3.1-405b-instruct:free
  • Limited rate/quota applies

Performance Benchmarks

Comparison with Other Leading Models

Benchmark Category Llama 3.1 70B Llama 3.1 405B GPT-4o Claude 3.5 Sonnet
MMLU (CoT) Knowledge 86.0% 88.6% 88.7% 88.7%
MMLU Pro (CoT) Knowledge 66.4% 73.3% - -
IFEval Steerability 87.5% 88.6% - -
GPQA Diamond Reasoning 48.0% 49.0% - -
HumanEval Code 80.5% 89.0% 90.2% 92.0%
MBPP EvalPlus Code 86.0% 88.6% - -
MATH (CoT) Math 68.0% 73.8% 76.6% 78.3%
BFCL v2 Tool Use 77.5% 81.1% - -
MGSM Multilingual 86.9% 91.6% - -

Performance Highlights

  • Flagship open-source model with 405B parameters
  • Competitive with GPT-4o and Claude 3.5 Sonnet across major benchmarks
  • 91.6% on MGSM (multilingual reasoning)
  • 89.0% on HumanEval (code generation)
  • 81.1% on BFCL v2 (tool use / function calling)

Hardware Requirements

Inference

The 405B model requires significant GPU resources for inference:

Configuration VRAM Required
FP16 (full precision) ~810 GB
FP8 (quantized) ~405 GB
INT4 (quantized) ~203 GB
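
These figures are weights-only (parameter count × bytes per weight); KV cache and activations add more on top:

# Weights-only VRAM: parameters x bytes per weight.
params = 405e9
for precision, bytes_per_weight in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    print(f"{precision}: ~{params * bytes_per_weight / 1e9:g} GB")
# FP16: ~810 GB, FP8: ~405 GB, INT4: ~202.5 GB (≈203)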

Serving at 8-bit precision (used by most providers) typically requires:

  • 8x H100 80GB GPUs (native FP8 support), or
  • 8x A100 80GB GPUs (Ampere lacks native FP8; use INT8 or another weight-only quantization instead), or
  • A multi-node setup with smaller GPUs

Using vLLM

from vllm import LLM, SamplingParams

# Load the FP8 checkpoint: the BF16 weights (~810 GB) do not fit on a
# single 8x 80GB node, while the FP8 weights (~405 GB) do.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    tensor_parallel_size=8,  # shard across 8 GPUs
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=2048,
)

outputs = llm.generate(["Hello, how are you?"], sampling_params)
print(outputs[0].outputs[0].text)

Tool Use

The model supports function calling with the following pattern:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3.1-405B-Instruct"
)

def get_current_weather(location: str, unit: str = "celsius") -> dict:
    """Get the current weather in a given location.

    Args:
        location: The city and country (e.g., "Paris, France")
        unit: Temperature unit ("celsius" or "fahrenheit")
    Returns:
        Weather information as a dictionary
    """
    return {"temperature": 22, "condition": "sunny"}

messages = [
    {"role": "system", "content": "You are a helpful weather assistant."},
    {"role": "user", "content": "What's the weather in Paris?"},
]

# transformers renders the function signature and docstring into the
# Llama 3.1 tool-use prompt format.
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_current_weather],
    add_generation_prompt=True,
    return_tensors="pt",
)

Tool Response Handling

# After model generates tool call
tool_call = {
    "name": "get_current_weather",
    "arguments": {"location": "Paris, France", "unit": "celsius"}
}
messages.append({
    "role": "assistant",
    "tool_calls": [{"type": "function", "function": tool_call}]
})

# Append tool result
messages.append({
    "role": "tool",
    "name": "get_current_weather",
    "content": '{"temperature": 22, "condition": "sunny"}'
})
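
With the tool result appended, the conversation is rendered again and the model produces its final natural-language answer. A minimal sketch of that last step, reusing the tokenizer from above and assuming a loaded `model` handle (hypothetical here, given the hardware the 405B requires):

# Re-render the conversation, now including the tool result.
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_current_weather],
    add_generation_prompt=True,
    return_tensors="pt",
)
# `model` is assumed to be an already-loaded causal LM.
output_ids = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))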

License

License: Llama 3.1 Community License Agreement

Key Terms

  • Non-exclusive, worldwide, non-transferable, royalty-free limited license
  • Use, reproduce, distribute, copy, create derivative works
  • Modify the Llama Materials

Commercial Requirements

  • If monthly active users exceed 700M, you must request a license from Meta
  • Must include "Built with Llama" on related websites, user interfaces, and documentation
  • Must include "Llama" at the beginning of any AI model name built with Llama 3.1

Attribution Required

Llama 3.1 is licensed under the Llama 3.1 Community License,
Copyright © Meta Platforms, Inc. All Rights Reserved.

Prohibited Uses

  • Violence, terrorism, and illegal activities
  • Child exploitation and abuse material
  • Human trafficking and sexual violence
  • Harassment and bullying
  • Discrimination in employment, credit, housing
  • Unauthorized professional practice (legal, medical, financial)
  • Malware and malicious code creation
  • Fraud, disinformation, and defamation
  • Impersonation and misrepresentation
  • Violations of ITAR, biological/chemical weapons regulations

Safety Considerations

Critical Risk Mitigation Areas

  1. CBRNE Materials - Uplift testing to assess proliferation risks
  2. Child Safety - Expert red teaming across supported languages
  3. Cyber Attack Enablement - Hacking task capability evaluation

Safety Tools

Meta recommends pairing deployments with its open safeguard tools:

Tool Purpose
Llama Guard 3 Input/output filtering
Prompt Guard Prompt injection detection
Code Shield Code security analysis

Multilinguality Caution

The model supports seven non-English languages for which Meta's safety thresholds were met. Use in unsupported languages is strongly discouraged without:

  • Fine-tuning
  • System controls aligned with use case policies

Data Policy

Policy Status
Training on User Data No
Prompt Retention No
Acceptable Use Policy https://llama.meta.com/llama3/use-policy/

API Usage Examples

LangMart API

curl https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer $LANGMART_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/llama-3.1-405b-instruct",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'
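
The response follows the standard OpenAI chat completions schema. An abridged, illustrative body (all values are placeholders):

{
  "id": "chatcmpl-...",
  "model": "meta-llama/llama-3.1-405b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "I'm doing well, thank you!"},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 14, "completion_tokens": 9, "total_tokens": 23}
}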

Free Tier

curl https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer $LANGMART_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/llama-3.1-405b-instruct:free",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'

Together AI Direct

curl https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="https://api.langmart.ai/v1",
    api_key="YOUR_LANGMART_API_KEY"
)

response = client.chat.completions.create(
    model="meta-llama/llama-3.1-405b-instruct",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    temperature=0.7,
    max_tokens=1024
)

print(response.choices[0].message.content)
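
For long generations the same endpoint can stream tokens as they arrive, which also pairs naturally with the abortable-requests feature noted above; a minimal sketch:

from openai import OpenAI

client = OpenAI(base_url="https://api.langmart.ai/v1", api_key="YOUR_LANGMART_API_KEY")

# stream=True yields incremental deltas; breaking out of the loop
# (or closing the stream) ends the request early.
stream = client.chat.completions.create(
    model="meta-llama/llama-3.1-405b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about llamas"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)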

Resources

Official Documentation

  • Meta Llama: https://llama.meta.com
  • Model card: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md
  • Hugging Face: https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct

Issue Reporting

Issue Type Contact
Model Issues https://github.com/meta-llama/llama-models/issues
Risky Content developers.facebook.com/llama_output_feedback
Security Bugs facebook.com/whitehat/info
Policy Violations LlamaUseReport@meta.com


Last updated: December 2024 · Sources: LangMart, Hugging Face, Meta Model Card