Feynman Explanation:
DeepSeek optimization intuition:
SDPA/FlashAttention: compute attention inside fused CUDA kernels, so the full n×n score matrix is never materialized in memory (see the first sketch below).
Mixed precision (bf16/fp16): near-identical accuracy, half the bytes moved, faster tensor-core math (second sketch below).
KV-cache: remember previous keys/values so each decoding step projects K/V only for the one new token — O(1) projection work per token instead of re-projecting all n previous tokens (the attention itself still scans the n cached keys; third sketch below).
(MoE, progressive distillation, routing come later.)
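
A minimal SDPA sketch, assuming PyTorch ≥ 2.0; shapes are made up for illustration and this is not DeepSeek's actual code. `F.scaled_dot_product_attention` dispatches to a fused kernel (FlashAttention on eligible GPUs), so the score matrix never hits global memory:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

# (batch, heads, seq_len, head_dim) — illustrative shapes only
q = torch.randn(1, 8, 128, 64, device=device)
k = torch.randn(1, 8, 128, 64, device=device)
v = torch.randn(1, 8, 128, 64, device=device)

# One fused call: no explicit 128x128 attention matrix is ever built here.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```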
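A mixed-precision sketch using `torch.autocast` (the model and sizes are placeholders). Matmuls run in bf16 while the weights stay fp32; fp16 training usually adds a `GradScaler`, bf16 typically doesn't need one:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(32, 1024, device=device)

# Inside autocast, matmul-heavy ops execute in bf16: fewer bytes moved,
# faster math, while master weights remain fp32.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # torch.bfloat16
```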
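A toy KV-cache decode loop showing just the bookkeeping — the "projections" are hypothetical stand-ins, not a real model. The point: each step computes K/V for one token and appends, rather than re-projecting the whole prefix:

```python
import torch

d = 64  # head dim (illustrative)
k_cache = torch.empty(0, d)
v_cache = torch.empty(0, d)

for step in range(4):
    x = torch.randn(1, d)      # hidden state of the newest token
    k_new, v_new = x, x        # stand-ins for the K/V projections of ONE token
    k_cache = torch.cat([k_cache, k_new])  # O(1) projection work per step
    v_cache = torch.cat([v_cache, v_new])

    q = torch.randn(1, d)      # query for the newest token
    # Attention still scans all n cached keys — that part stays O(n) per step.
    out = (q @ k_cache.T).softmax(-1) @ v_cache
```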