What Large Language Models Actually Do (in Plain English)
A clear explanation of how LLMs work under the hood, from tokenization to generation, without the jargon or hype.
Last updated: May 14, 2026
LLMs are functions that predict the next token in a sequence by computing a probability distribution over their vocabulary. Training on massive amounts of text compresses the patterns in that text into the model's parameters.
Large language models have become the most discussed technology of the decade, yet most explanations either drown in mathematical notation or resort to vague analogies about “predicting the next word.” Neither approach helps practitioners understand what is actually happening when they interact with these systems. This article bridges that gap.
The Core Mechanism
At the most fundamental level, a large language model is a function that takes a sequence of tokens and produces a probability distribution over what token should come next. That single sentence contains the entire conceptual foundation, but each piece deserves unpacking.
Tokens are not words. They are chunks of text that the model has learned to treat as atomic units. The word “understanding” might be split into “under” and “standing,” or kept whole, depending on how frequently it appeared in the training data. A tokenizer converts raw text into these chunks before the model ever sees it, and converts the model output back into readable text afterward.
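As a rough illustration of the idea, the sketch below splits text by greedy longest match against a tiny, hand-written vocabulary. This is a simplification: real tokenizers learn their vocabulary from data (for example with byte-pair encoding), and `TOY_VOCAB`, `tokenize`, and `detokenize` are hypothetical names used only here.

```python
# Hypothetical toy vocabulary; real tokenizers learn theirs from training data.
TOY_VOCAB = ["under", "standing", "stand", "ing", "un", "der"]


def tokenize(text: str, vocab: list[str]) -> list[str]:
    """Split text by greedy longest match, falling back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Pick the longest vocabulary entry that matches at position i.
        match = max(
            (piece for piece in vocab if text.startswith(piece, i)),
            key=len,
            default=text[i],  # unknown character becomes its own token
        )
        tokens.append(match)
        i += len(match)
    return tokens


def detokenize(tokens: list[str]) -> str:
    """Convert tokens back into readable text."""
    return "".join(tokens)


print(tokenize("understanding", TOY_VOCAB))  # ['under', 'standing']
```

Removing "standing" from the vocabulary would change the split to "under", "stand", "ing", which is the sense in which the vocabulary decides what counts as an atomic unit.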
The probability distribution is where things get interesting. When you ask a model to complete the sentence “The capital of France is,” it does not look up the answer in a database. Instead, it computes a score for every token in its vocabulary (typically tens of thousands of possibilities, and more in some models) and converts those scores into probabilities. The token with the highest probability happens to be “Paris” because the patterns in its training data overwhelmingly associate that sequence with that completion.
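To make the scoring step concrete, here is a minimal sketch of how raw scores can be turned into a probability distribution with a softmax. The scores in `toy_scores` are invented for illustration and do not come from any real model, and the four-entry vocabulary stands in for tens of thousands of tokens.

```python
import math


def softmax(scores: dict[str, float]) -> dict[str, float]:
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores.values())
    # Subtracting the maximum keeps exp() numerically stable.
    exps = {tok: math.exp(s - highest) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Made-up scores for candidate next tokens after "The capital of France is".
toy_scores = {" Paris": 9.0, " Lyon": 4.0, " a": 3.0, " the": 2.5}

probs = softmax(toy_scores)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

Every token receives some probability, even implausible ones. "Paris" wins only because its score is much higher than the rest.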
Training: Pattern Compression at Scale
Training an LLM means showing it enormous quantities of text and adjusting its internal parameters so that it becomes better at predicting what comes next. The model starts with random parameters and gradually learns patterns: grammar, facts, reasoning structures, coding conventions, and stylistic tendencies.
Think of it as compression. The model cannot memorize all of its training data — a model with 70 billion parameters trained on trillions of tokens must compress that information. What survives this compression are the patterns that appear most consistently and most broadly. This is why models know that water boils at 100 degrees Celsius but might struggle with the phone number of a specific restaurant: one pattern is reinforced millions of times across the training data, while the other appears perhaps once.
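A back-of-the-envelope calculation shows why memorization is not an option. The figures below are assumptions chosen for illustration, not measurements of any particular model: 2 bytes per parameter (16-bit weights), 2 trillion training tokens, and roughly 4 bytes of raw text per token.

```python
# All four constants are illustrative assumptions.
PARAMS = 70e9            # 70 billion parameters
BYTES_PER_PARAM = 2      # assumes 16-bit weights
TRAINING_TOKENS = 2e12   # assumes 2 trillion tokens
BYTES_PER_TOKEN = 4      # assumes ~4 bytes of text per token

model_bytes = PARAMS * BYTES_PER_PARAM
text_bytes = TRAINING_TOKENS * BYTES_PER_TOKEN
ratio = text_bytes / model_bytes

print(
    f"model: {model_bytes / 1e9:.0f} GB, "
    f"text: {text_bytes / 1e12:.0f} TB, "
    f"ratio: {ratio:.0f}x"
)
# model: 140 GB, text: 8 TB, ratio: 57x
```

Under these assumptions the training text is dozens of times larger than the model, so only patterns that recur can be retained.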
The training process uses backpropagation and gradient descent, but the conceptual insight is simpler: the model makes a prediction, checks how wrong it was, and adjusts its parameters slightly to be less wrong next time. Repeated billions of times across diverse text, this process produces something that looks remarkably like understanding.
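The predict, check, adjust loop can be sketched on a deliberately tiny problem: one fixed context, three candidate tokens, and three parameters (one score per token). This is a toy under stated assumptions, not how a real LLM is trained. The parameters start at zero rather than random values so the run is repeatable, and `VOCAB`, `observed_next`, and `LEARNING_RATE` are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


VOCAB = ["Paris", "Lyon", "banana"]
# Toy training data: the token observed after the same context, ten times.
observed_next = ["Paris"] * 8 + ["Lyon"] * 2

logits = [0.0, 0.0, 0.0]  # the model's parameters (zeros for repeatability)
LEARNING_RATE = 0.05


def average_loss(params: list[float]) -> float:
    """Mean cross-entropy: how surprised the model is by the observed tokens."""
    probs = softmax(params)
    losses = [-math.log(probs[VOCAB.index(tok)]) for tok in observed_next]
    return sum(losses) / len(losses)


loss_before = average_loss(logits)

for _ in range(200):
    for target in observed_next:
        probs = softmax(logits)            # make a prediction
        target_idx = VOCAB.index(target)
        for i in range(len(logits)):
            # Gradient of cross-entropy w.r.t. each score: prob - one_hot(target)
            grad = probs[i] - (1.0 if i == target_idx else 0.0)
            logits[i] -= LEARNING_RATE * grad  # adjust to be less wrong

loss_after = average_loss(logits)
final_probs = softmax(logits)
print(dict(zip(VOCAB, (round(p, 2) for p in final_probs))))
```

After training, the predicted probabilities drift toward the frequencies in the data: "Paris" ends up most likely, "Lyon" second, and "banana", which never appeared, close to zero.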
The Transformer Architecture
Modern LLMs use the transformer architecture, introduced in 2017. The key innovation is the attention mechanism, which lets the model relate every token in a sequence to the others directly and in parallel, rather than passing information along the sequence one step at a time as earlier recurrent networks did.
When processing the sentence “The bank by the river was eroding,” the attention mechanism helps the model recognize that “bank” here refers to a riverbank rather than a financial institution, because it can attend to “river” regardless of how far apart the words are in the sequence.
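The sketch below shows scaled dot-product attention for a single query, using hand-picked two-dimensional vectors. In a real transformer the query, key, and value vectors are produced by learned projections and have many more dimensions. Here they are invented so that "bank" and "river" point in a similar direction, and the values simply reuse the keys.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    highest = max(xs)
    exps = [math.exp(x - highest) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys
    ]
    weights = softmax(scores)
    # Output is the attention-weighted average of the value vectors.
    output = [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return weights, output


tokens = ["The", "bank", "by", "the", "river", "was", "eroding"]
# Invented vectors: dimension 0 ~ "water/terrain", dimension 1 ~ "function word".
keys = [
    [0.0, 1.0],  # The
    [1.0, 0.0],  # bank
    [0.0, 1.0],  # by
    [0.0, 1.0],  # the
    [3.0, 0.0],  # river
    [0.0, 1.0],  # was
    [2.0, 0.0],  # eroding
]
values = keys  # simplification: reuse keys as values
query_for_bank = [2.0, 0.0]

weights, output = attention(query_for_bank, keys, values)
most_attended = tokens[weights.index(max(weights))]
print(most_attended)  # river
```

Nothing in the computation depends on how far apart two tokens are, which is why "river" can influence "bank" at any distance.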
Transformers stack multiple layers of attention and feed-forward networks. Each layer refines the representation of the input, building increasingly abstract features. Early layers might capture syntax and local patterns, while later layers capture semantic relationships and long-range dependencies.
Generation: Sampling from Distributions
When you prompt a model and it generates text, it is running the forward pass repeatedly. It processes your prompt, produces a probability distribution over the next token, selects one (using various sampling strategies), appends it to the sequence, and repeats.
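The loop itself is short. In the sketch below, `TOY_MODEL` is a hypothetical lookup table standing in for a real forward pass: it conditions only on the last token, whereas an actual model conditions on the whole sequence. The `<start>` and `<end>` markers are also assumptions of this example.

```python
import random

# Hypothetical stand-in for a model: last token -> next-token distribution.
TOY_MODEL = {
    "<start>": {"The": 1.0},
    "The": {"capital": 0.7, "river": 0.3},
    "capital": {"of": 1.0},
    "of": {"France": 0.9, "Spain": 0.1},
    "France": {"is": 1.0},
    "Spain": {"is": 1.0},
    "is": {"Paris": 0.6, "Madrid": 0.2, "<end>": 0.2},
    "river": {"<end>": 1.0},
    "Paris": {"<end>": 1.0},
    "Madrid": {"<end>": 1.0},
}


def next_token_distribution(sequence: list[str]) -> dict[str, float]:
    """Toy 'forward pass': looks only at the last token."""
    return TOY_MODEL[sequence[-1]]


def generate(prompt: list[str], max_new_tokens: int = 10, seed: int = 0) -> list[str]:
    rng = random.Random(seed)  # seeded so runs are repeatable
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(sequence)  # 1. compute distribution
        token = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]  # 2. select
        if token == "<end>":
            break
        sequence.append(token)  # 3. append and repeat
    return sequence


print(" ".join(generate(["<start>"])[1:]))
```

Changing `seed` can change the output, because each step samples from the distribution rather than always taking the top choice.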
The sampling strategy matters enormously. Greedy decoding always picks the highest-probability token, producing deterministic but often repetitive output. Temperature sampling rescales the model's scores before they are converted to probabilities: a higher temperature flattens the distribution, making unlikely tokens more probable (more varied but less coherent), while a lower temperature concentrates probability on the top choices (more predictable, and approaching greedy decoding as the temperature nears zero).
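A small sketch of temperature scaling, assuming the common formulation in which the raw scores are divided by the temperature before the softmax is applied. The three scores in `logits` are arbitrary example values.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    highest = max(scaled)
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # arbitrary example scores

cold = softmax_with_temperature(logits, 0.5)     # sharper distribution
neutral = softmax_with_temperature(logits, 1.0)  # unchanged softmax
warm = softmax_with_temperature(logits, 2.0)     # flatter distribution

# Greedy decoding ignores temperature and takes the top-scoring index.
greedy_choice = logits.index(max(logits))

for name, probs in (("0.5", cold), ("1.0", neutral), ("2.0", warm)):
    print(name, [round(p, 3) for p in probs])
```

The ranking of tokens never changes with temperature. Only the gap between the top choice and the rest does.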
Top-p (nucleus) sampling truncates the distribution to the smallest set of highest-probability tokens whose cumulative probability reaches a threshold p, then samples only from that set. This prevents the model from occasionally selecting extremely unlikely tokens while still allowing variety.
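As a minimal sketch, nucleus filtering can be written as: sort tokens by probability, keep them until the running total reaches the threshold, then renormalize. The distribution in `toy_probs` is invented for illustration, and `top_p_filter` is a hypothetical helper name.

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the surviving probabilities sum to 1 again.
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


toy_probs = {"Paris": 0.70, "Lyon": 0.15, "Nice": 0.10, "banana": 0.05}
nucleus = top_p_filter(toy_probs, p=0.9)
print(list(nucleus))  # ['Paris', 'Lyon', 'Nice']
```

With p = 0.9, "banana" falls outside the nucleus and can never be sampled, while the three plausible tokens keep their relative odds.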
What Models Do Not Do
Understanding what LLMs are not is as important as understanding what they are. They do not have persistent memory between conversations (unless explicitly given context). On their own, they do not access the internet during generation; any external information has to be placed in the prompt by the surrounding system. They do not reason in the way humans do: they pattern-match at a scale and speed that produces outputs resembling reasoning.
They do not have beliefs, desires, or consciousness. When a model says “I think,” it is producing tokens that follow the pattern of how humans express thoughts in text. The model has no inner experience of thinking.
They can and do produce confident-sounding falsehoods. Because they optimize for plausible-sounding continuations rather than truth, they will generate text that reads well but may be factually wrong. This is not a bug that more training alone will fix. It follows from the training objective itself, which rewards likely continuations rather than verified ones.
Why This Matters for Practitioners
If you are building with LLMs, this understanding shapes your approach in concrete ways. You know that providing relevant context in the prompt is not optional — the model has no other way to access information not in its training data. You know that asking for step-by-step reasoning can improve outputs because it forces the model to generate intermediate tokens that constrain later predictions. You know that verification is always necessary because the model optimizes for plausibility, not accuracy.
You also know that prompt engineering is not magic — it is the practice of constructing inputs that activate the right patterns in the model. And you know that fine-tuning changes which patterns the model prioritizes, while retrieval-augmented generation provides context the model could not otherwise access.
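To illustrate what "providing context" means in practice, here is a toy sketch of retrieval-augmented prompting. Retrieval is done by counting shared words, which is a stand-in for the embedding search a real system would likely use. `DOCS`, `retrieve`, `build_prompt`, and the prompt wording are all hypothetical.

```python
def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Rank documents by how many words they share with the question (toy scoring)."""
    question_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, documents: list[str]) -> str:
    """Place the retrieved text in the prompt, where the model can actually see it."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents))
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Invented facts that a model could not know from training data.
DOCS = [
    "The cafe on Elm Street opens at 7 am on weekdays.",
    "The library closes at 9 pm on Fridays.",
]

prompt = build_prompt("When does the cafe open on weekdays?", DOCS)
print(prompt)
```

The model's parameters are untouched here. The only thing that changes is the token sequence it is asked to continue.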
- LLMs predict the next token by computing probability distributions over their vocabulary
- Training compresses patterns from massive text datasets into model parameters
- The transformer architecture uses attention to capture relationships between all tokens simultaneously
- Generation involves repeated forward passes with various sampling strategies
- Models do not reason, remember, or access external information unless explicitly provided
The gap between what LLMs actually do and what people imagine they do creates both unrealistic expectations and missed opportunities. By understanding the mechanism, you can use these tools more effectively and recognize their genuine limitations without dismissing their genuine capabilities.
Frequently Asked Questions
Do LLMs actually understand language?
No. LLMs pattern-match at scale, producing outputs that resemble understanding. They compute statistical relationships between tokens without any inner experience or comprehension.
Why do LLMs sometimes produce false information?
Because they optimize for plausible-sounding continuations rather than truth. A confident falsehood and a correct answer can both be high-probability completions.
What is the difference between GPT-4 and earlier models?
Primarily scale: more parameters, more training data, and architectural refinements. The exact details of GPT-4 have not been published, but the fundamental mechanism of next-token prediction remains the same.
Can LLMs learn new information after training?
Not in their parameters, unless they are trained further (for example, by fine-tuning). During inference, they can only use information learned during training or provided in the prompt context, which is how retrieval augmentation supplies new information.