What's new in DeepSeek V4 compared to V3.2? Explain it to a dummy
Let me dig into both documents and pull the key intro and evaluation pages from each to give you a grounded comparison.
Here's a plain-English breakdown of what's new in DeepSeek V4 compared to V3.2.
Think of it like this: V3.2 was a faster car. V4 is a rocket ship built for a different road.
1. The Big Goal Changed: 1 Million Token Context
V3.2 focused on being smarter and more efficient than before: good at reasoning and agent tasks, with context up to 128K tokens.
V4 goes way further: it can process up to 1 million tokens in a single conversation. That's like fitting an entire novel (or a massive codebase) in one prompt.
2. Completely Redesigned Attention (How It "Reads" Text)
This is the core change. Both V3.2 and V4 tackle the same problem (vanilla attention is slow for long texts), but in different ways:
- V3.2 introduced DeepSeek Sparse Attention (DSA): instead of every word looking at every other word, it only picks the most relevant ones.
- V4 goes further with a hybrid system:
  - CSA (Compressed Sparse Attention): squishes groups of tokens together and picks the most relevant ones
  - HCA (Heavily Compressed Attention): squishes even more aggressively, like reading a summary instead of the full text
The result? At 1M tokens, V4-Pro uses only 27% of the computing power and 10% of the memory that V3.2 would need. V4-Flash is even leaner at 10% compute and 7% memory.
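To make the compress-then-select idea concrete, here's a toy NumPy sketch. This is not DeepSeek's actual CSA: the block size, mean-pooling, and scoring rule below are all illustrative assumptions. The shape of the trick is real, though: summarize blocks of keys, let the query score the summaries, then run full attention only inside the top-scoring blocks.

```python
import numpy as np

def compressed_sparse_attention(q, keys, values, block=4, top_k=2):
    """Toy compress-then-select attention for a single query vector.
    q: (d,); keys, values: (n, d). Assumes n is a multiple of `block`."""
    n, d = keys.shape
    n_blocks = n // block
    # 1) Compress: mean-pool each block of `block` keys into one summary vector.
    k_comp = keys[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    # 2) Select: score the summaries against the query, keep the top-k blocks.
    block_scores = k_comp @ q
    keep = np.argsort(block_scores)[-top_k:]
    # 3) Attend at full resolution, but only over tokens in the kept blocks.
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in keep])
    scores = keys[idx] @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values[idx]
```

The savings come from step 3: instead of `n` score computations per query, you do `n_blocks` cheap summary scores plus `top_k * block` full ones, which is why cost can stay nearly flat as context grows.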
3. Smarter Internal Connections (mHC)
V4 introduces Manifold-Constrained Hyper-Connections (mHC), a fancy upgrade to how information flows between the model's layers (like upgrading the highways inside the brain). It makes the model more stable and expressive.
V3.2 didn't have this.
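The exact mHC math isn't spelled out here, but the base idea it builds on, hyper-connections, is simple to sketch: instead of one residual stream, keep several parallel streams with learned weights that mix them into each layer's input and scatter the output back. Everything below (names, shapes, and the missing "manifold constraint" on the mixing weights) is an illustrative assumption, not V4's implementation.

```python
import numpy as np

def hyper_connection_step(streams, layer_fn, alpha, beta):
    """Toy sketch of one hyper-connection layer step.
    streams: (n_streams, d) parallel residual streams (plain residual = 1 stream)
    alpha:   (n_streams,) learned weights mixing streams into the layer input
    beta:    (n_streams,) learned weights scattering the layer output back
    """
    x = alpha @ streams                  # combine streams into one layer input
    y = layer_fn(x)                      # the transformer sub-layer itself
    return streams + np.outer(beta, y)   # residual update applied to every stream
```

The "manifold-constrained" part of mHC would additionally restrict those mixing weights; that constraint is not modeled in this sketch.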
4. New Optimizer: Muon
V4 uses a new training algorithm called Muon (instead of the widely used standard, AdamW). Think of the optimizer as the coach telling the model how to improve; Muon converges faster and trains more stably.
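Muon is a real, openly published optimizer for weight matrices. Its core move: keep an ordinary momentum buffer, then approximately orthogonalize that buffer with a Newton-Schulz iteration before applying it as the update. Here's a minimal NumPy sketch; the classic cubic iteration and the hyperparameters below are simplifications (production Muon uses specially tuned iteration coefficients), so treat this as illustrative.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize a matrix with the classic cubic
    Newton-Schulz iteration X <- 1.5*X - 0.5*X @ X.T @ X."""
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so the iteration converges
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_update(param, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style step: momentum on the raw gradient, then
    orthogonalize the momentum buffer before applying it."""
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(momentum)
    return param - lr * update, momentum
```

Orthogonalizing the update equalizes its singular values, so no single direction in the weight matrix dominates a step, which is the intuition behind the stability claim.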
5. New Post-Training Strategy: "Teach Experts First, Then Merge"
- V3.2 trained specialist models and merged them using a mix of RL and distillation.
- V4 is more structured: first train separate expert models for math, coding, agents, and so on, then merge them into one unified model using On-Policy Distillation (OPD). The unified model learns directly from the experts' full output distributions.
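The key word is "on-policy": the student model generates its own rollouts, and the expert teacher is consulted only to score the student's own outputs, position by position, against its full distribution. A toy sketch of such a loss follows; V4's actual OPD objective is not public, so the reverse-KL choice here is an assumption.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def on_policy_distillation_loss(student_logits, teacher_logits):
    """Average per-position divergence between student and teacher
    distributions, evaluated on a rollout the *student* generated.
    Both logit arrays: (seq_len, vocab_size)."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    p_student = np.exp(s)
    # reverse KL(student || teacher), averaged over sequence positions
    return float((p_student * (s - t)).sum(axis=-1).mean())
```

Because the loss compares whole distributions rather than single sampled tokens, the student gets a dense training signal at every position, which is what "learns from the experts' full output distributions" refers to.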
6. New Hardware Tricks (FP4 Precision)
V4 uses FP4 quantization (an even lower-precision number format) for storing model weights during training and inference. This means less memory and faster computation, especially on future hardware.
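An FP4 number (in the common E2M1 layout) can take only 16 values, so in practice weights are rounded to the nearest point on that tiny grid, with a shared scale per block of weights to recover dynamic range. Here's a toy round-trip sketch; the block size and max-based scaling are illustrative, not DeepSeek's actual recipe.

```python
import numpy as np

# The 8 non-negative magnitudes representable in FP4 (E2M1 layout):
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_quantize(x, block=4):
    """Toy block-wise FP4 quantization: scale each block so its largest
    magnitude maps to 6.0 (the FP4 max), round every value to the nearest
    representable FP4 number, and return the dequantized array."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 6.0
    scale[scale == 0] = 1.0              # avoid dividing by zero on all-zero blocks
    scaled = x / scale
    # round each magnitude to the nearest grid point, keeping the sign
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(scaled) * FP4_GRID[idx] * scale).reshape(-1)
```

Each weight then needs only 4 bits plus a small amortized cost for the per-block scale, which is where the memory savings over FP8 come from.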
TL;DR Table
| Feature | V3.2 | V4 |
|---|---|---|
| Max context | 128K tokens | 1M tokens |
| Attention | DSA (sparse) | CSA + HCA (compressed + sparse hybrid) |
| Residual connections | Standard | mHC (new) |
| Optimizer | AdamW | Muon + AdamW |
| Post-training | Specialist distillation + mixed RL | Expert cultivation → On-Policy Distillation |
| Precision | FP8 | FP4 + FP8 |
| Model sizes | One main model | V4-Pro (1.6T params) + V4-Flash (284B params) |
In short: V4 is purpose-built for the age of million-token reasoning, with a completely rethought attention architecture to make it actually feasible at scale.