What's new in DeepSeek V4 compared to V3.2? Explain it to a dummy
Let me dig into both documents and pull the key intro and evaluation pages from each to give you a grounded comparison.
Here's a plain-English breakdown of what's new in DeepSeek V4 compared to V3.2.
Think of it like this: V3.2 was a faster car. V4 is a rocket ship built for a different road.
1. The Big Goal Changed: 1 Million Token Context
V3.2 focused on being smarter and more efficient than before: good at reasoning and agent tasks, with context up to 128K tokens.
V4 goes way further: it can process up to 1 million tokens in a single conversation. That's like fitting an entire novel (or a massive codebase) in one prompt.
2. Completely Redesigned Attention (How It "Reads" Text)
This is the core change. Both V3.2 and V4 tackle the same problem (vanilla attention is slow for long texts), but in different ways:
- V3.2 introduced DeepSeek Sparse Attention (DSA): instead of every word looking at every other word, it only picks the most relevant ones.
- V4 goes further with a hybrid system:
  - CSA (Compressed Sparse Attention): squishes groups of tokens together and picks the most relevant ones
  - HCA (Heavily Compressed Attention): squishes even more aggressively, like reading a summary instead of the full text
The result? At 1M tokens, V4-Pro uses only 27% of the computing power and 10% of the memory that V3.2 would need. V4-Flash is even leaner at 10% compute and 7% memory.
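To make the compress-then-select idea concrete, here's a toy NumPy sketch. This is not DeepSeek's actual CSA: the block size, mean-pooling, and scoring rule below are all illustrative assumptions. The shape of the trick is real, though: summarize blocks of keys, let the query score the summaries, then run full attention only inside the top-scoring blocks.

```python
import numpy as np

def compressed_sparse_attention(q, keys, values, block=4, top_k=2):
    """Toy compress-then-select attention for a single query vector.
    q: (d,); keys, values: (n, d). Assumes n is a multiple of `block`."""
    n, d = keys.shape
    n_blocks = n // block
    # 1) Compress: mean-pool each block of `block` keys into one summary vector.
    k_comp = keys[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    # 2) Select: score the summaries against the query, keep the top-k blocks.
    block_scores = k_comp @ q
    keep = np.argsort(block_scores)[-top_k:]
    # 3) Attend at full resolution, but only over tokens in the kept blocks.
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in keep])
    scores = keys[idx] @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values[idx]
```

The savings come from step 3: instead of `n` score computations per query, you do `n_blocks` cheap summary scores plus `top_k * block` full ones, which is why cost can stay nearly flat as context grows.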
3. Smarter Internal Connections (mHC)
V4 introduces Manifold-Constrained Hyper-Connections (mHC), a fancy upgrade to how information flows between the model's layers (like upgrading the highways inside the brain). It makes the model more stable and expressive.
V3.2 didn't have this.
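The exact mHC math isn't spelled out here, but the base idea it builds on, hyper-connections, is simple to sketch: instead of one residual stream, keep several parallel streams with learned weights that mix them into each layer's input and scatter the output back. Everything below (names, shapes, and the missing "manifold constraint" on the mixing weights) is an illustrative assumption, not V4's implementation.

```python
import numpy as np

def hyper_connection_step(streams, layer_fn, alpha, beta):
    """Toy sketch of one hyper-connection layer step.
    streams: (n_streams, d) parallel residual streams (plain residual = 1 stream)
    alpha:   (n_streams,) learned weights mixing streams into the layer input
    beta:    (n_streams,) learned weights scattering the layer output back
    """
    x = alpha @ streams                  # combine streams into one layer input
    y = layer_fn(x)                      # the transformer sub-layer itself
    return streams + np.outer(beta, y)   # residual update applied to every stream
```

The "manifold-constrained" part of mHC would additionally restrict those mixing weights; that constraint is not modeled in this sketch.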
4. New Optimizer: Muon
V4 uses a new training algorithm called Muon (instead of the widely used standard, AdamW). Think of the optimizer as the coach telling the model how to improve; Muon converges faster and trains more stably.
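Muon is a real, openly published optimizer for weight matrices. Its core move: keep an ordinary momentum buffer, then approximately orthogonalize that buffer with a Newton-Schulz iteration before applying it as the update. Here's a minimal NumPy sketch; the classic cubic iteration and the hyperparameters below are simplifications (production Muon uses specially tuned iteration coefficients), so treat this as illustrative.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize a matrix with the classic cubic
    Newton-Schulz iteration X <- 1.5*X - 0.5*X @ X.T @ X."""
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so the iteration converges
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_update(param, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style step: momentum on the raw gradient, then
    orthogonalize the momentum buffer before applying it."""
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(momentum)
    return param - lr * update, momentum
```

Orthogonalizing the update equalizes its singular values, so no single direction in the weight matrix dominates a step, which is the intuition behind the stability claim.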
5. New Post-Training Strategy: "Teach Experts First, Then Merge"
- V3.2 trained specialist models and merged them using a mix of RL and distillation.
- V4 is more structured: first train separate expert models for math, coding, agents, and so on, then merge them into one unified model using On-Policy Distillation (OPD). The unified model learns directly from the experts' full output distributions.
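The key word is "on-policy": the student model generates its own rollouts, and the expert teacher is consulted only to score the student's own outputs, position by position, against its full distribution. A toy sketch of such a loss follows; V4's actual OPD objective is not public, so the reverse-KL choice here is an assumption.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def on_policy_distillation_loss(student_logits, teacher_logits):
    """Average per-position divergence between student and teacher
    distributions, evaluated on a rollout the *student* generated.
    Both logit arrays: (seq_len, vocab_size)."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    p_student = np.exp(s)
    # reverse KL(student || teacher), averaged over sequence positions
    return float((p_student * (s - t)).sum(axis=-1).mean())
```

Because the loss compares whole distributions rather than single sampled tokens, the student gets a dense training signal at every position, which is what "learns from the experts' full output distributions" refers to.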
6. New Hardware Tricks (FP4 Precision)
V4 uses FP4 quantization (an even lower-precision number format) for storing model weights during training and inference. This means less memory and faster computation, especially on future hardware.
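An FP4 number (in the common E2M1 layout) can take only 16 values, so in practice weights are rounded to the nearest point on that tiny grid, with a shared scale per block of weights to recover dynamic range. Here's a toy round-trip sketch; the block size and max-based scaling are illustrative, not DeepSeek's actual recipe.

```python
import numpy as np

# The 8 non-negative magnitudes representable in FP4 (E2M1 layout):
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_quantize(x, block=4):
    """Toy block-wise FP4 quantization: scale each block so its largest
    magnitude maps to 6.0 (the FP4 max), round every value to the nearest
    representable FP4 number, and return the dequantized array."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 6.0
    scale[scale == 0] = 1.0              # avoid dividing by zero on all-zero blocks
    scaled = x / scale
    # round each magnitude to the nearest grid point, keeping the sign
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(scaled) * FP4_GRID[idx] * scale).reshape(-1)
```

Each weight then needs only 4 bits plus a small amortized cost for the per-block scale, which is where the memory savings over FP8 come from.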
TL;DR Table
| Feature | V3.2 | V4 |
|---|---|---|
| Max context | 128K tokens | 1M tokens |
| Attention | DSA (sparse) | CSA + HCA (compressed + sparse hybrid) |
| Residual connections | Standard | mHC (new) |
| Optimizer | AdamW | Muon + AdamW |
| Post-training | Specialist distillation + mixed RL | Expert cultivation → On-Policy Distillation |
| Precision | FP8 | FP4 + FP8 |
| Model sizes | One main model | V4-Pro (1.6T params) + V4-Flash (284B params) |
In short: V4 is purpose-built for the age of million-token reasoning, with a completely rethought attention architecture to make it actually feasible at scale.