Feynman Explanation:
DeepSeek optimization intuition:
SDPA/FlashAttention: compute attention inside fused CUDA kernels, so the full n×n score matrix is never materialized in memory (see the first sketch below).
Mixed precision (bf16/fp16): near-identical accuracy, half the bytes moved, faster tensor-core math (second sketch below).
KV-cache: remember previous keys/values so each decoding step projects K/V only for the one new token — O(1) projection work per token instead of re-projecting all n previous tokens (the attention itself still scans the n cached keys; third sketch below).
(MoE, progressive distillation, routing come later.)
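
A minimal SDPA sketch, assuming PyTorch ≥ 2.0; shapes are made up for illustration and this is not DeepSeek's actual code. `F.scaled_dot_product_attention` dispatches to a fused kernel (FlashAttention on eligible GPUs), so the score matrix never hits global memory:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

# (batch, heads, seq_len, head_dim) — illustrative shapes only
q = torch.randn(1, 8, 128, 64, device=device)
k = torch.randn(1, 8, 128, 64, device=device)
v = torch.randn(1, 8, 128, 64, device=device)

# One fused call: no explicit 128x128 attention matrix is ever built here.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```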
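A mixed-precision sketch using `torch.autocast` (the model and sizes are placeholders). Matmuls run in bf16 while the weights stay fp32; fp16 training usually adds a `GradScaler`, bf16 typically doesn't need one:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(32, 1024, device=device)

# Inside autocast, matmul-heavy ops execute in bf16: fewer bytes moved,
# faster math, while master weights remain fp32.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # torch.bfloat16
```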
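A toy KV-cache decode loop showing just the bookkeeping — the "projections" are hypothetical stand-ins, not a real model. The point: each step computes K/V for one token and appends, rather than re-projecting the whole prefix:

```python
import torch

d = 64  # head dim (illustrative)
k_cache = torch.empty(0, d)
v_cache = torch.empty(0, d)

for step in range(4):
    x = torch.randn(1, d)      # hidden state of the newest token
    k_new, v_new = x, x        # stand-ins for the K/V projections of ONE token
    k_cache = torch.cat([k_cache, k_new])  # O(1) projection work per step
    v_cache = torch.cat([v_cache, v_new])

    q = torch.randn(1, d)      # query for the newest token
    # Attention still scans all n cached keys — that part stays O(n) per step.
    out = (q @ k_cache.T).softmax(-1) @ v_cache
```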