BERT & GPT-1: The Fork
The 2018 split that divided AI into readers and writers.
In 2017, a team at Google published "Attention Is All You Need" and handed the world a new architecture: the Transformer. It had two halves. An Encoder that reads. A Decoder that writes.
For about a year, the full architecture sat there. Then 2018 happened, and two separate teams made the same decision at almost the same time: they would each take one half and push it as far as it could go.
Google took the Encoder and turned it into BERT. Jacob Devlin and his team trained it to read in both directions simultaneously. A left-to-right model only ever sees the words that come before the current one, so it represents each word without knowing what follows. BERT flipped that. Every word attends to every other word at once, left and right, building a richer representation of meaning. It was pre-trained to predict masked words in a sentence (about 15% of the tokens are selected for prediction), alongside a next-sentence prediction task, which forced it to draw on context from both sides. The result was a model that set new state-of-the-art results on question answering (SQuAD) and on the GLUE language-understanding suite, which includes sentiment analysis, and posted strong results on named entity recognition.
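To make the masked-word objective concrete, here is a minimal sketch in plain Python. It is a simplification, not BERT's actual data pipeline: the function names, the `[MASK]` string handling, and the example sentence are assumptions for illustration, and the real recipe sometimes swaps a selected token for a random word or leaves it unchanged instead of always masking it.

```python
import random

MASK = "[MASK]"  # placeholder string; a real tokenizer uses a reserved token id

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide some tokens and remember the originals as prediction targets.

    Simplified: every selected token becomes [MASK].
    Returns (masked_tokens, labels); labels[i] is None where nothing is predicted.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # the model must recover this word
        else:
            masked.append(tok)
            labels.append(None)  # no loss is computed at this position
    return masked, labels

def visible_context(tokens, i):
    """Bidirectional view: when predicting position i, both sides are visible."""
    return tokens[:i], tokens[i + 1:]

sentence = "the cat sat on the mat".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(labels)
print(visible_context(sentence, 2))  # left and right context around "sat"
```

The point of `visible_context` is the contrast with a left-to-right model: the words after the gap are available as evidence, not just the words before it.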
OpenAI took the Decoder and turned it into GPT-1. Alec Radford's team trained it to predict the next token given all the tokens before it. That direction is crucial. The model never looks ahead. It is autoregressive, building the output one token at a time and feeding each prediction back in as input. That constraint sounds limiting, but it is exactly what generation needs: training and generation pose the same problem, so a model that gets good at predicting what comes next can write text by doing just that, over and over. GPT-1 was released in June 2018, a few months before BERT. It was smaller than the largest BERT model and drew less attention, but it established the pattern that would eventually lead to GPT-3 and GPT-4.
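The autoregressive loop is easy to sketch. In the toy example below, the "model" is a hypothetical lookup table that only consults the last token; a real GPT scores every vocabulary item using the whole prefix. The names `NEXT_TOKEN`, `predict_next`, and `generate` are assumptions made for illustration. What carries over is the shape of the loop: predict one token, append it, and predict again from the longer prefix.

```python
# Hypothetical stand-in for a trained model: previous token -> next token.
NEXT_TOKEN = {
    "<s>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "down",
    "down": "</s>",
}

def predict_next(prefix):
    """Choose the next token from what has been generated so far.

    The function receives only the prefix, so it cannot look ahead.
    This toy version uses just the last token of that prefix.
    """
    return NEXT_TOKEN.get(prefix[-1], "</s>")

def generate(max_len=10):
    """Greedy autoregressive generation: one token per step, fed back in."""
    tokens = ["<s>"]
    while len(tokens) < max_len:
        nxt = predict_next(tokens)
        tokens.append(nxt)  # the output becomes part of the next input
        if nxt == "</s>":   # stop at the end-of-sequence marker
            break
    return tokens

print(generate())  # ['<s>', 'the', 'cat', 'sat', 'down', '</s>']
```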
The fork was not a rivalry. It was a natural division of labor. Comprehension problems benefit from bidirectionality: to understand a sentence, you want all of it at once. Generation problems need unidirectionality: to write a sentence, you have to commit to each word before producing the next.
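In practice, the difference between the two directions comes down to which positions each token is allowed to attend to. The sketch below builds both visibility patterns as plain lists of 0s and 1s (row i shows what position i can see); the function names are illustrative, not taken from either model's code.

```python
def bidirectional_mask(n):
    """Encoder-style: every position may attend to every position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style: position i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def show(mask):
    """Format a mask as rows of space-separated 0/1 values."""
    return "\n".join(" ".join(str(v) for v in row) for row in mask)

print("bidirectional (reader):")
print(show(bidirectional_mask(4)))
print("causal (writer):")
print(show(causal_mask(4)))
```

The causal pattern is a lower triangle: the first token sees only itself, and the last token sees everything before it. The bidirectional pattern is all ones.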
BERT made its way into Google Search, where Google began using it in 2019 to better interpret queries. GPT's descendants became the backbone of ChatGPT. Two paths from one paper.