## Overview

| Attribute | Value |
|---|---|
| Model Name | Meta: Llama 3.1 8B Instruct |
| Model ID | `meta-llama/llama-3.1-8b-instruct` |
| Creator | Meta (meta-llama) |
| Release Date | July 23, 2024 |
| Parameters | 8 billion |
| Architecture | Auto-regressive transformer with Grouped-Query Attention (GQA) |
| Context Length | 131,072 tokens (128K) |
| Knowledge Cutoff | December 2023 |
## Description

Llama 3.1 8B Instruct is part of Meta's Llama 3.1 family of language models, balancing efficiency and capability. This 8-billion-parameter instruction-tuned variant emphasizes speed and efficiency while delivering performance comparable to leading closed-source models in human evaluations.

The model was fine-tuned using:

- Supervised Fine-Tuning (SFT)
- Reinforcement Learning from Human Feedback (RLHF)
- Over 25 million synthetically generated examples plus human-generated data
## Supported Languages
- English
- German
- French
- Italian
- Portuguese
- Hindi
- Spanish
- Thai
## Technical Specifications

### Model Architecture

- Type: Auto-regressive transformer
- Attention: Grouped-Query Attention (GQA) for improved inference scalability (see the sketch after this list)
- Input/Output: Multilingual text in / text and code out
- Instruction Type: Llama3
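Since GQA is the architecture's key attention optimization, a minimal sketch of the mechanism may help: several query heads share one key/value head, which shrinks the KV cache at inference time. The head counts below are illustrative, not Llama 3.1 8B's exact configuration.

```python
import torch

def grouped_query_attention(q, k, v):
    """Toy GQA: each KV head is shared by a group of query heads."""
    # q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_q_heads // n_kv_heads
    # Expand each KV head so `group` query heads attend to the same K/V
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(1, 8, 16, 64)  # 8 query heads
k = torch.randn(1, 2, 16, 64)  # only 2 KV heads need to be cached
v = torch.randn(1, 2, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 8, 16, 64])
```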
### Training Data

| Attribute | Value |
|---|---|
| Context Window | 131,072 tokens (128K) |
| Training Data Size | ~15 trillion tokens from publicly available sources |
| Fine-tuning Data | >25M synthetically generated examples + human-generated data |
| Data Source | New mix of publicly available online data |
| Training Infrastructure | Custom Meta GPU cluster |
### Training Compute

| Metric | Value |
|---|---|
| GPU Hours | 1.46M GPU hours |
| Hardware | H100-80GB GPUs (700W TDP) |
| Location-Based Emissions | ~420 tons CO2eq |
## Supported Parameters

| Parameter | Supported |
|---|---|
| `max_tokens` | Yes |
| `temperature` | Yes |
| `top_p` | Yes |
| `top_k` | Yes |
| `stop` | Yes |
| `frequency_penalty` | Yes |
| `presence_penalty` | Yes |
| `repetition_penalty` | Yes |
| `seed` | Yes |
| `min_p` | Yes |
| `response_format` | Yes |
| `tools` | Yes |
| `tool_choice` | Yes |
### Default Stop Tokens

- `<|eot_id|>`
- `<|end_of_text|>`
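As a sketch of how these parameters and stop tokens come together in a request, here is a call through an OpenAI-compatible client. The endpoint shape is inferred from the curl examples later on this page; passing provider-specific knobs such as `min_p` via `extra_body` is an assumption, so check your provider's documentation.

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.langmart.ai/v1",  # inferred from the curl examples below
    api_key=os.environ["LANGMART_API_KEY"],
)

resp = client.chat.completions.create(
    model="meta-llama/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Summarize GQA in one sentence."}],
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
    seed=42,
    stop=["<|eot_id|>"],  # usually redundant: this is already a default stop token
    extra_body={"min_p": 0.05, "repetition_penalty": 1.1},  # provider-specific fields
)
print(resp.choices[0].message.content)
```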
## Features

| Feature | Status |
|---|---|
| Tool Calling | Supported |
| Multipart Requests | Supported |
| Abortable Requests | Supported |
| Reasoning Capabilities | Not Supported |
## Related Models

| Model | Parameters | Context | Use Case |
|---|---|---|---|
| Llama 3.1 70B Instruct | 70B | 128K | Higher capability |
| Llama 3.1 405B Instruct | 405B | 128K | Maximum capability |
| Llama 3.2 1B Instruct | 1B | 128K | Edge deployment |
| Llama 3.2 3B Instruct | 3B | 128K | Mobile/edge |
| Llama 3.3 70B Instruct | 70B | 128K | Latest 70B model |
| Llama 3.2 Vision | Various | Various | Multimodal (image + text) |
## Providers

### DeepInfra (Turbo) - Primary Provider
## Pricing (via LangMart)

### Standard Tier (DeepInfra Turbo)

| Type | Price per Million Tokens |
|---|---|
| Input | $0.02 |
| Output | $0.03 |

- Quantization: FP8
- Max Completion Tokens: 16,384
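For a rough sense of what these rates mean per request, a quick back-of-envelope calculation:

```python
# Standard-tier rates from the table above, expressed per token
INPUT_USD_PER_TOKEN = 0.02 / 1_000_000
OUTPUT_USD_PER_TOKEN = 0.03 / 1_000_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_USD_PER_TOKEN + output_tokens * OUTPUT_USD_PER_TOKEN

# A 2,000-token prompt with a 500-token completion costs $0.000055
print(f"${request_cost(2_000, 500):.6f}")
```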
### Free Tier

| Type | Price per Million Tokens |
|---|---|
| Input | $0.00 |
| Output | $0.00 |

- Model ID: `meta-llama/llama-3.1-8b-instruct:free`
- Limited rate/quota applies
## Benchmarks

| Category | Benchmark | Metric | Llama 3.1 8B |
|---|---|---|---|
| General | MMLU (5-shot) | macro_avg/acc | 69.4% |
| | MMLU (CoT) | macro_avg/acc | 73.0% |
| | IFEval | accuracy | 80.4% |
| Reasoning | ARC-C | accuracy | 83.4% |
| | GPQA | exact match | 30.4% |
| | GPQA Diamond | exact match | 31.8% |
| Code | HumanEval | pass@1 | 72.6% |
| | MBPP++ | pass@1 | 72.8% |
| Math | GSM-8K (CoT) | exact match | 84.5% |
| | MATH (CoT) | final_em | 51.9% |
| Tool Use | API-Bank | accuracy | 82.6% |
| | BFCL | accuracy | 76.1% |
| | BFCL v2 | accuracy | 65.4% |
| Multilingual | MGSM (CoT) | exact match | 68.9% |
### Comparison with Larger Models

| Benchmark | Llama 3.1 8B | Llama 3.1 70B | Llama 3.1 405B |
|---|---|---|---|
| MMLU (CoT) | 73.0% | 86.0% | 88.6% |
| IFEval | 80.4% | 87.5% | 88.6% |
| HumanEval | 72.6% | 80.5% | 89.0% |
| MATH (CoT) | 51.9% | 68.0% | 73.8% |
| MGSM | 68.9% | 86.9% | 91.6% |
## Hardware Requirements

### Inference

Use `device_map="auto"` for automatic device placement:

```python
import transformers
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Load weights in bfloat16 and let accelerate spread layers across devices
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
```
### Quantization Options

**8-bit Quantization:**

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)
```

**4-bit Quantization:**

```python
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
```
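For completeness, a fuller 4-bit load might look like the following. The NF4 and double-quantization settings are common community defaults, not values prescribed by the model card.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    device_map="auto",
    quantization_config=quantization_config,
)
```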
## Tool Calling

The model supports function calling with the following pattern:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

def get_current_temperature(location: str) -> float:
    """Get the current temperature at a location.

    Args:
        location: The location (format: "City, Country")
    Returns:
        Temperature as a float
    """
    return 22.0

messages = [
    {"role": "system", "content": "You are a weather bot."},
    {"role": "user", "content": "What's the temperature in Paris?"},
]

# Render the prompt with the tool schema included
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_current_temperature],
    add_generation_prompt=True,
)

# After the model generates a tool call, echo it back into the history
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})

# Append the tool result so the model can compose the final answer
messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})
```
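From here, a typical loop re-applies the chat template to the updated `messages` (tool result included) and generates again, so the model can turn the tool output into its final natural-language answer.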
## License

**License:** Llama 3.1 Community License Agreement

### Key Terms

- Non-exclusive, worldwide, non-transferable, royalty-free limited license
- Use, reproduce, distribute, copy, create derivative works
- Modify the Llama Materials

### Commercial Requirements

- If monthly active users exceed 700M, you must request a license from Meta
- Must include "Built with Llama" on related websites, user interfaces, and documentation
- Must include "Llama" at the beginning of any AI model name built with Llama 3.1

### Attribution Required

Llama 3.1 is licensed under the Llama 3.1 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.
### Prohibited Uses
- Violence, terrorism, and illegal activities
- Child exploitation and abuse material
- Human trafficking and sexual violence
- Harassment and bullying
- Discrimination in employment, credit, housing
- Unauthorized professional practice (legal, medical, financial)
- Malware and malicious code creation
- Fraud, disinformation, and defamation
- Impersonation and misrepresentation
- Violations of ITAR, biological/chemical weapons regulations
## Safety Considerations

### Critical Risk Mitigation Areas

- CBRNE Materials - Uplift testing to assess proliferation risks
- Child Safety - Expert red teaming across supported languages
- Cyber Attack Enablement - Hacking task capability evaluation

### Recommended Safeguards

| Tool | Purpose |
|---|---|
| Llama Guard 3 | Input/output filtering |
| Prompt Guard | Prompt injection detection |
| Code Shield | Code security analysis |
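As an illustration of input filtering with Llama Guard 3, a minimal sketch follows. The model ID and chat-template usage follow the standard `transformers` pattern; verify the details against the Llama Guard 3 model card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

guard_id = "meta-llama/Llama-Guard-3-8B"  # assumed model ID; check the model card
tokenizer = AutoTokenizer.from_pretrained(guard_id)
guard = AutoModelForCausalLM.from_pretrained(
    guard_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Screen a user prompt before it reaches Llama 3.1 8B Instruct
chat = [{"role": "user", "content": "How do I pick a lock?"}]
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(guard.device)
out = guard.generate(input_ids, max_new_tokens=32)

# Llama Guard answers "safe" or "unsafe" plus the violated category codes
print(tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))
```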
### Multilinguality Caution

The model meets safety thresholds for its 8 supported languages. Use in unsupported languages is strongly discouraged without:

- Fine-tuning for the target language
- System controls aligned with your use-case policies
## API Usage Examples

### LangMart API

```bash
curl https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer $LANGMART_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/llama-3.1-8b-instruct",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'
```

### Free Tier

```bash
curl https://api.langmart.ai/v1/chat/completions \
  -H "Authorization: Bearer $LANGMART_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/llama-3.1-8b-instruct:free",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'
```

### DeepInfra Direct

```bash
curl https://api.deepinfra.com/v1/openai/chat/completions \
  -H "Authorization: Bearer $DEEPINFRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'
```
### Local Inference (Transformers)

```python
import transformers
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
# The pipeline returns the full chat history; the last entry is the reply
print(outputs[0]["generated_text"][-1])
```
## Resources

### Official Documentation

### Issue Reporting
Last updated: December 23, 2024
Source: LangMart, Hugging Face, Meta Model Card
Verified: Data confirmed accurate via LangMart API scrape