Granite 4.1 LLMs: What IBM Did Differently This Time

Granite 4.1 LLMs: What IBM Did Differently This Time

3 0 0

IBM just dropped the Granite 4.1 family — three dense, decoder-only LLMs (3B, 8B, and 30B) trained from scratch on about 15 trillion tokens. The headline: the 8B instruct model matches or beats the previous Granite 4.0-H-Small (32B-A9B MoE) despite being a simpler dense architecture with way fewer parameters. That’s not nothing.

All three are released under Apache 2.0. The collection is on Hugging Face, the code is on GitHub, and the docs are live. But what’s actually interesting is how they got there.

The Architecture: No Surprises, But Solid Choices

The models use a standard decoder-only dense transformer with Grouped Query Attention (GQA), Rotary Position Embeddings (RoPE), SwiGLU activations, RMSNorm, and shared input/output embeddings. Nothing wild here, but the execution matters.

| Component | 3B Dense | 8B Dense | 30B Dense |
|———–|———-|———-|———–|
| Embedding size | 2560 | 4096 | 4096 |
| Number of layers | 40 | 40 | 64 |
| Attention head size | 64 | 128 | 128 |
| Number of attention heads | 40 | 32 | 32 |
| Number of KV heads | 8 | 8 | 8 |
| MLP hidden size | 8192 | 12800 | 32768 |
| MLP activation | SwiGLU | SwiGLU | SwiGLU |
| Position embedding | RoPE | RoPE | RoPE |

All three sizes share the same training pipeline and data strategy. The only differences are architectural dimensions. That’s a smart move — it means the smaller models benefit from the same data engineering work that went into the big one.

The Pre-Training Pipeline: Five Phases, No Shortcuts

This is where Granite 4.1 separates itself from the pack. Instead of one giant training run, they split pre-training into five distinct phases, each with its own data mixture and learning rate schedule. The progression is clear: start broad, then gradually filter for quality and domain specificity.

Phase 1: General Pre-Training (10T tokens)

This is the foundation. Broad language understanding, standard stuff. Data mix:

  • CommonCrawl ~59% — general web data
  • Code ~20% — programming languages and repos
  • Math ~7% — mathematical reasoning
  • Technical ~10.5% — scientific papers, technical docs
  • Multilingual ~2% — non-English
  • Domain Specific ~1.5% — niche content

Power learning rate schedule with warmup. Nothing controversial here.

Phase 2: Math/Code Pre-Training (2T tokens)

This is where they pivot hard toward reasoning. Math jumps 5x to 35%, code goes up 1.5x to 30%. CommonCrawl drops to 12% and is now a high-quality subset. Synthetic data enters at 9%. The message is clear: broad understanding is done, now let’s get smart.

Phase 3: High-Quality Data Annealing (2T tokens)

Mid-training starts here. Exponential decay learning rate. The data mix becomes more balanced and includes chain-of-thought reasoning trajectories (12.5%) and instruction tuning data (12% combined language and code instructions). This is where the model starts learning how to follow instructions and reason step by step.

Phase 4: High-Quality Data Annealing — Refinement (0.5T tokens)

Linear decay to zero. The mix is now almost entirely high-quality: CommonCrawl-HQ at 40%, Code at 20%, Math at 20%, plus instructions and CoT data. This is the polishing stage.

Phase 5: Long Context Training (LCE)

The final phase extends the context window from 4K to 512K in three stages: 32K, 128K, then 512K. The 512K extension uses 80% books and 20% code repository data (for 8B and 30B only). After each LCE stage, they do a model merge to prevent short-context degradation. Smart.

RULER benchmark results for base models:

| Model | 32K | 64K | 128K |
|——-|—–|—–|——|
| granite-4.1-3b-base | 75.0 | 66.6 | 58.0 |
| granite-4.1-8b-base | 83.6 | 79.1 | 73.0 |

These numbers are decent for the size. The 8B holds up well even at 128K.

Post-Training: SFT + On-Policy RL

After pre-training comes supervised fine-tuning on about 4.1M curated samples. They used an LLM-as-Judge framework for data quality filtering, which is becoming standard practice but still takes serious engineering to get right.

Then reinforcement learning via on-policy GRPO with DAPO loss (Yu et al., 2025). The multi-stage RL pipeline targets math, coding, instruction following, and general chat. This is where the 8B model pulls ahead of the previous 32B MoE — the RL work clearly paid off.

What I Think

Granite 4.1 is a textbook example of what happens when you prioritize data quality over brute force. The five-phase pipeline is more complex than the usual “train on everything and hope for the best” approach, but the results speak for themselves. The 8B model outperforming a 32B MoE is no accident.

The Apache 2.0 license is a nice touch too. IBM is clearly playing the long game here — build trust, get adoption, then figure out monetization later.

My only real criticism: the 3B model’s RULER scores at 128K (58.0) are mediocre. Long-context on small models is always a struggle, but I’d have liked to see better. Then again, for most use cases, 32K is plenty.

If you’re building enterprise applications that need solid reasoning and long-context support without the compute cost of a 70B model, Granite 4.1 is worth a serious look. The 8B instruct model is the sweet spot.

Comments (0)

Be the first to comment!