Markov Chains: the grandparents of LLMs

If you have been on social media in the past few years, it feels like people believe that AI started in 2022 when ChatGPT was released. Out of a sudden, OpenAI, a company unknown to many, released this groundbreaking service that felt surprisingly smart. But before that, many things happened that led to the current level of AI research, and here I will talk about one very old method that managed to do something very similar: Markov chains.

It’s not too wrong to say that current LLMs are just super advanced versions of what I will talk about in this post.


Markov process

Let’s start by defining what a process is, since all the other definitions will come from this.
We need three ingredients: an actor, some states and some transitions.

For example, imagine a person (the actor) going about their daily routine. The different locations they can be in, such as their house, the office, or a coffee shop, represent the states. The act of moving from one location to another as time passes represents the transitions. By defining these three elements, we can map out a sequence of events. This foundational concept gives us a framework to track behaviour over time. To make it a Markov process, we just need another property: memorylessness.
A specific state in the process depends solely on the previous state, and it has no memory of what happened before.

Let’s transfer it to words, and pick a phrase as an example: “The cat sat on the mat, and the dog sat on the rug.”
Every single word is a state, and the transition is the act of moving from one word to the next.

If we assume this text represents our entire vocabulary, we can calculate the probabilities of moving from one word to another just by counting how often they follow each other. For instance, if the current state is “the”, the next state could be “cat”, “mat”, “dog”, or “rug”.

Current State Next State Options & Probabilities
“the” “cat” (25%), “mat” (25%), “dog” (25%), “rug” (25%)
“cat” “sat” (100%)
“dog” “sat” (100%)
“sat” “on” (100%)
“on” “the” (100%)
“and” “the” (100%)

We create a chain where every word is connected to the previous one. The probability of every word solely depends on its predecessor.

If our Markov chain is currently at the word “sat”, it looks at the probabilities and sees that 100% of the time, the next word is “on”.
But what happens when it reaches “on”? The next word could be “the” (following “sat on…”) or “the” (following the other “sat on…”).
It always leads to “the”. Once we are back at “the”, the chain completely forgets whether we started with the cat or the dog.
The probability of the next word being either cat, mat, dog or rug is equal; we could just roll a four-sided die and pick the next word at random, and we would still follow the same Markov process.

How is this similar to an LLM?

Imagine creating this Markov chain using a lot of books, and then using the chain as an autoregressive generator to make up your own pseudo-logical sentences.
Since the Markov chain was created with real texts, the sentences would probably look correct at first sight, but they do not make too much sense when you try to read them properly.
You have trained a generator that is solely based on probability, and it doesn’t bring any context into its state.

You could think about expanding the state to look at two words instead of one (moving from unigrams to bigrams). This makes the text feel a bit more logical, but it’s still far from smart.

To get truly coherent sentences, your instinct might be to just keep expanding the state: why not look at the last 10 or 20 words?
That’s the problem with Markov chains: if you require an exact match for a long sequence of words, you will rarely find that exact combination repeated anywhere else in your training data.
The chain either gets stuck or it simply copies the original text as it is, because there is only one possible path.

But maybe we can map those words in a different dimension, a dimension where the link between certain sequences of words is more abstract, based more on the meaning than on the actual words used.
This is more or less what LLMs do. They map words into a continuous mathematical space, becoming embeddings, following a probability-based transition based on abstract, mapped concepts rather than rigid words.
The only remaining problem is the memorylessness limit. If the probability of a concept depends solely on the previous one, we still can’t have a very smart machine.

But, like in the simple Markov chain example, maybe we can use more concepts to define a single state.
Instead of looking at just the last concept only, we can define the current state as the entire sequence of abstract concepts we have generated so far. We can give a different weight to all those past concepts so they contribute differently to the probability of the next transition.
That is the famous attention mechanism from the paper “Attention Is All You Need” (2017, Vaswani et al.), the foundation of modern LLMs.

By using the attention mechanism to weigh the entire text, LLMs effectively break free from the memoryless limits of traditional Markov chains.

Conclusion

LLMs, like their grandparents, are still probabilistic models, very complex Markov chains that link abstract concepts together in a way that makes sense.
The better you model those concepts, the more the LLM will appear smart, but that doesn’t change its probabilistic nature, so don’t be surprised if, at some point, you’ll find your cat on the rug instead of your dog.

Do you want to try how a Markov chain works? You want to train a model on your WhatsApp group chat? Export the WhatsApp chat of your friend group as .txt, go to WhatsApp Markov Generator and generate new phrases!