I’m non-technical but want to deeply understand AI.
Andrej Karpathy’s “Intro to LLMs” is the best resource I’ve found so far.
Here are my biggest takeaways from his 60-minute talk:
1. An LLM is basically two files: a giant weight file and a tiny run file. The architecture is simple and public; the learned weights are the real asset.
2. Open-weights vs closed models: open models (like LLaMA-2) are customizable and inspectable; closed models (like GPT-4/Claude) are more powerful but opaque.
3. Training vs inference: running a model is cheap; training is the expensive industrial process where most value gets created.
4. Training scale: LLaMA-2-70B took thousands of GPUs and millions of dollars; frontier models scale these numbers by another ~10×.
5. Frontier = more scale: top models (e.g., GPT-5 class) mainly push parameters, data, and compute dramatically higher.
6. The core objective is simple: predict the next word. Capabilities like reasoning and coding emerge from pushing that objective to extremes.
7. Architecture is known: the Transformer is public, mature, and relatively simple. Most differentiation comes from the data and weights, not the wiring.
8. Parameters are a black box: billions of interacting weights produce behavior we can steer but not fully interpret.
9. LLMs are empirical artifacts: closer to biological organisms than engineered machines—you observe, evaluate, and characterize them.
10. Pre-training vs fine-tuning: pre-training fills the model with world knowledge; fine-tuning (including RLHF) shapes behavior and usefulness.
11. RLHF via comparisons: labelers rank outputs rather than write them—an efficient way to align a model’s preferences.
12. Closed vs open as a strategy choice: closed models win on raw capability; open models win on control, customization, and on-premise deployment.
13. Scaling laws: performance increases predictably with more parameters and data; no clear saturation yet.
14. The GPU/data gold rush: belief in scaling laws is driving the race for compute, data, and capital.
15. LLMs as tool users: they don’t just generate text—they browse, write code, call calculators, generate plots, and coordinate many tools.
16. How tool use works: the model emits special tokens (like |BROWSER|) learned from fine-tuning examples, triggering tool calls.
17. A desired future capability: trading time for accuracy, letting models think longer on harder problems in a principled way; an early glimpse of reasoning models.
18. Retrieval-augmented generation (RAG): rather than browsing the web, the model searches your own files and injects relevant snippets into its context.
19. The LLM as an emerging operating system: context window ≈ RAM; browsing/RAG ≈ disk access; closed vs open mirrors Windows/macOS vs Linux; context management becomes a product surface.
20. New stack → new security risks: prompt injection, jailbreaks, and adversarial prompts are novel attack surfaces unique to probabilistic systems.
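If you like seeing ideas in code: the "predict the next word" objective (takeaway 6) boils down to scoring every token in a vocabulary and sampling from the resulting distribution. A toy sketch with a made-up four-word vocabulary and made-up scores, not a real model:

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary and made-up scores for the context "the cat sat on the".
# In a real LLM, a Transformer computes these logits over ~100k tokens.
vocab = ["mat", "dog", "moon", "roof"]
logits = [3.2, 0.1, -1.5, 1.0]

probs = softmax(logits)
next_word = random.choices(vocab, weights=probs, k=1)[0]
print(dict(zip(vocab, [round(p, 3) for p in probs])), "->", next_word)
```

Everything else (chat, reasoning, coding) is this loop run over and over at enormous scale.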
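The comparison-based labeling behind RLHF (takeaway 11) typically trains a reward model with a pairwise loss: the answer labelers preferred should score higher than the one they rejected. A minimal sketch, with plain scalar scores standing in for a real reward model:

```python
import math

def pairwise_loss(score_chosen, score_rejected):
    """Bradley-Terry style loss: small when the chosen answer
    outscores the rejected one, large when the ranking is wrong."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The reward model is penalized more the more it prefers the rejected answer.
print(pairwise_loss(2.0, 0.5))   # correct ranking -> small loss
print(pairwise_loss(0.5, 2.0))   # wrong ranking  -> large loss
```

This is why ranking outputs is such an efficient labeling strategy: comparing two answers is much easier for a human than writing a good one from scratch.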
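The scaling-law claim (takeaway 13) is usually expressed as a power law in parameters N and training tokens D. The constants below are illustrative placeholders, not fitted values, but the shape is the point: loss keeps falling smoothly as both grow.

```python
def predicted_loss(n_params, n_tokens,
                   E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    """Chinchilla-style scaling law sketch.
    E, A, B, alpha, beta are illustrative constants, not real fitted values."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Loss improves predictably as you scale parameters and data together.
for scale in [1e9, 1e10, 1e11]:
    print(f"N=D={scale:.0e}: predicted loss {predicted_loss(scale, scale):.3f}")
```

The smooth, predictable curve (with no saturation in sight) is exactly what fuels the gold rush in takeaway 14.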
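The special-token mechanism for tool use (takeaway 16) can be mimicked with a simple dispatcher: when the model emits a tool marker, the runtime runs the tool and splices the result back into the text. The token syntax and tools here are hypothetical, purely for illustration:

```python
import re

# Hypothetical tools; a real system would browse, run code in a sandbox, etc.
TOOLS = {
    "CALCULATOR": lambda expr: str(eval(expr, {"__builtins__": {}})),  # demo only; never eval untrusted input
    "BROWSER": lambda query: f"[top search result for {query!r}]",
}

def run_with_tools(model_output: str) -> str:
    """Replace |TOOL:args| markers in the model's output with tool results."""
    def dispatch(match):
        tool, args = match.group(1), match.group(2)
        return TOOLS[tool](args)
    return re.sub(r"\|(\w+):([^|]*)\|", dispatch, model_output)

print(run_with_tools("The answer is |CALCULATOR:17*23|."))
```

The real insight is that the model learns *when* to emit these tokens from fine-tuning examples; the plumbing around them is ordinary software.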
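And the RAG takeaway is, at its simplest, "search your documents, paste the best match into the prompt." A toy keyword-overlap retriever; real systems use embedding similarity, but the prompt-injection step looks much the same:

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase word set, punctuation stripped (toy tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, documents: list[str]) -> str:
    """Return the document sharing the most words with the query (toy scoring)."""
    q = tokenize(query)
    return max(documents, key=lambda d: len(q & tokenize(d)))

def build_prompt(query: str, documents: list[str]) -> str:
    """Inject the retrieved snippet into the model's context before the question."""
    context = retrieve(query, documents)
    return f"Context: {context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "Our refund policy allows returns within 30 days.",
    "The office is closed on public holidays.",
]
print(build_prompt("How many days do I have for a refund?", docs))
```

This is also where the OS analogy bites: retrieval is the "disk access" that pages the right knowledge into the context window's "RAM."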
Link to full “Intro to LLMs” video below 👇

