Transformers & Attention

From the original paper to multi-head attention, positional encoding, and why the architecture took over everything.

  • Self-Attention
  • Multi-Head
  • Positional Encoding
  • Softmax
  • Scaled Dot-Product

Diagram (attention flow): Input Tokens → Q / K / V → Attention Scores → Weighted Sum → Output
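
The stages named above (input tokens, Q / K / V, attention scores, weighted sum, output) can be sketched in a few lines of plain Python. This is a minimal illustration of single-head scaled dot-product self-attention, not a reference implementation: the helper names (`softmax`, `matmul`, `transpose`, `self_attention`), the identity projection matrices, and the toy inputs are all assumptions made for the example, and it leaves out masking, multiple heads, and positional encoding.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    x:             (seq_len, d_model) token embeddings
    w_q, w_k, w_v: (d_model, d_k) projection matrices (hypothetical values)
    Returns (output, weights).
    """
    # Input tokens -> Q / K / V: three linear projections of the same input.
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)

    # Attention scores: dot product of every query with every key,
    # scaled by sqrt(d_k) to keep the softmax inputs in a moderate range.
    d_k = len(q[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]

    # Softmax turns each row of scores into weights that sum to 1.
    weights = [softmax(row) for row in scores]

    # Weighted sum -> output: each output row is a mix of the value rows.
    output = matmul(weights, v)
    return output, weights


# Toy input: 3 tokens with 2-dimensional embeddings (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # assumed projections, chosen for readability
output, weights = self_attention(x, identity, identity, identity)

for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one token attends to every token in the sequence, including itself. With identity projections the first token scores itself and the third token equally, because both share its nonzero component, and scores the second token lower.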

Overview

The Big Picture

> Overview — explain the concept from first principles — content coming soon

Core Concepts

Key Ideas to Know

> Core concepts — break down each key idea with diagrams — content coming soon

Deep Dive

How It Actually Works

> Deep dive — step-by-step walkthrough with math or code — content coming soon

Interview Questions

Common Questions & Answers

> Q&A — frequently asked questions with model answers — content coming soon

Gotchas

What Trips People Up

> Gotchas — subtle points, common misconceptions, edge cases — content coming soon