May 10, 2024

Naga Vydyanathan

On the Origin of Large Language Models: Tracing AI’s Big Bang

Discover how Large Language Models (LLMs) originated. Learn about the transition from language models to LARGE language models, thereby triggering AI’s Big Bang.

Much like the universe's explosive expansion, Large Language Models or LLMs have propelled the field of AI into new realms of comprehension and creativity. These models have fueled a wave of generative AI, leaving an indelible mark across every sphere—from revolutionising art and literature creation to propelling technological advancements and reshaping industries worldwide. ChatGPT has become such a household name that even kids are consulting it for advice on their 'homework emergencies'—who knew bedtime stories could be generated by a friendly AI?

The goal of this article is to break down the concept of large language models in easily understandable terms and explore their origin and evolution. We will look at the various forces that together brought about the transition from language models to LARGE language models, thereby triggering AI’s Big Bang!

What is a Language Model, in ‘Plain’ Language?

A language model, in very simple terms, predicts the most appropriate next word, given a sequence of words. To do this, it analyzes patterns and relationships among words within vast training datasets, and generates a probability distribution to predict the likelihood of various words occurring next in a sentence or phrase. For instance, if the preceding words in a sequence are "The sky is", the language model might predict "blue" as the next word, given its understanding of typical language patterns.
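To make the idea of a probability distribution over the next word concrete, here is a minimal sketch. It assumes a tiny made-up corpus and a hypothetical helper, next_word_distribution, that simply counts which words follow a given context. Real language models learn from far larger datasets and use far richer methods than raw counting.

```python
from collections import Counter

def next_word_distribution(corpus, context):
    """Estimate P(next word | context) by counting what follows `context` in `corpus`."""
    tokens = corpus.lower().split()
    ctx = context.lower().split()
    n = len(ctx)
    # Count every word that appears immediately after an occurrence of the context.
    counts = Counter(
        tokens[i + n]
        for i in range(len(tokens) - n)
        if tokens[i:i + n] == ctx
    )
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}

# Toy corpus invented for illustration only.
toy_corpus = "the sky is blue . the sky is clear . the sky is blue today ."
dist = next_word_distribution(toy_corpus, "The sky is")
prediction = max(dist, key=dist.get)  # the most probable next word
print(dist, prediction)
```

In this toy corpus "blue" follows "the sky is" twice and "clear" once, so "blue" receives the highest probability and becomes the prediction.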

Statistical Models: Where Language Modeling Began

The earliest language models were based on Markov models, named after the Russian mathematician Andrey Markov, who developed them in the early 20th century. In 1948, Claude Shannon, in his paper “A Mathematical Theory of Communication”, used Markov chains to create a statistical model of the sequences of letters in the English language. Markov models predict the probability of a word occurring based solely on the preceding words in a sequence, assuming that the probability distribution of each word depends only on the previous word (a first-order model) or on a fixed-size window of preceding words (higher-order Markov models). For instance, in a Markov model of order one (a bigram model), the probability of the next word is determined solely by how often each word follows the current word in the corpus, disregarding all context before the current word.

Let us consider a set of simple sentences:

  1. the quick brown fox jumps over the lazy dog.
  2. the lazy brown dog barks loudly.

A simple first-order Markov chain based on these examples would be:

Figure: A first-order Markov chain for the corpus “the quick brown fox jumps over the lazy dog. the lazy brown dog barks loudly.”

Now, given “the”, this model could generate any of the following:

  1. the lazy brown dog.
  2. the lazy brown fox jumps over the lazy dog.
  3. the lazy brown fox jumps over the lazy dog barks loudly.
  4. the lazy brown fox jumps over the lazy brown fox jumps over the lazy dog.

      and so on…

Although Markov models offer simplicity and efficiency, they may not always generate grammatically correct predictions, as seen in the above example. Higher-order Markov models, such as the n-gram model, improve prediction accuracy by conditioning each word on the preceding n-1 words. However, the number of possible word combinations grows exponentially with n, so larger values of n demand far more memory and training data. This limits the ability of the n-gram model to capture long-range dependencies and contextual nuances.
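The first-order Markov chain for the two example sentences can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names build_chain and generate are our own, the full stop is treated as a separate token, and the next word is picked at random from the words seen after the current one.

```python
import random
from collections import defaultdict

corpus = "the quick brown fox jumps over the lazy dog . the lazy brown dog barks loudly ."

def build_chain(text):
    """Map each word to the list of words observed immediately after it."""
    tokens = text.split()
    chain = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, rng, max_words=30):
    """Random walk over the chain from `start` until a full stop or `max_words`."""
    words = [start]
    while words[-1] != "." and len(words) < max_words:
        followers = chain.get(words[-1])
        if not followers:
            break
        # Only the current word matters: this is the first-order Markov assumption.
        words.append(rng.choice(followers))
    return " ".join(words)

chain = build_chain(corpus)
rng = random.Random(0)  # fixed seed so repeated runs give the same samples
for _ in range(3):
    print(generate(chain, "the", rng))
```

Because each step looks only at the current word, the walk can wander from "brown" to "fox" even when it arrived via "lazy", which is how sequences that never appeared in the corpus come about.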

Neural Network Models: Working with Larger Contexts

Recurrent Neural Networks (RNNs), which were introduced in the 1980s, revolutionised language modelling by remembering past information while processing current inputs, mimicking human-like sequential thinking. Imagine reading a sentence: RNNs grasp context by considering each word in relation to the previous ones. This enables them to predict the next word more accurately. In simple words, RNNs are like a brain that recalls earlier words to understand and predict what comes next in a sentence.

RNN Language Model Unrolled in Time (Courtesy: Conference Paper by John D Kelleher and Simon Dobnik). h1..ht are the hidden internal states of the RNN that capture the context of the previous sequence of words.

How do RNNs do this in a scalable manner? RNNs achieve scalability by sharing parameters across time steps, rather than treating each word in the sequence independently. This means that as the network processes each word, it updates its internal state using the same set of parameters. By maintaining context from previous words in this internal state, RNNs can predict the next word in the sequence. This efficient parameter sharing allows RNNs to handle sequences of varying lengths without significantly increasing computational cost, making them well-suited for processing longer texts than the n-gram model.
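Parameter sharing across time steps can be illustrated with a minimal sketch of a plain RNN forward pass. The sizes, the random weights and the one-hot "words" below are all made up for illustration, and no training is performed. The point is only that the same W_xh, W_hh and b are reused at every step, whatever the length of the sequence.

```python
import math
import random

def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One time step: h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b)."""
    h_new = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new

def run_rnn(inputs, W_xh, W_hh, b):
    """Feed a sequence through the RNN, reusing the same parameters at every step."""
    h = [0.0] * len(b)
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b)
    return h

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

# Hypothetical sizes and untrained random weights, for illustration only.
rng = random.Random(0)
input_size, hidden_size = 4, 3
W_xh = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
W_hh = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
b = [0.0] * hidden_size

short_sentence = [one_hot(i, input_size) for i in (0, 1)]
long_sentence = [one_hot(i, input_size) for i in (0, 1, 2, 3, 1, 0)]

h_short = run_rnn(short_sentence, W_xh, W_hh, b)
h_long = run_rnn(long_sentence, W_xh, W_hh, b)
num_parameters = hidden_size * input_size + hidden_size * hidden_size + hidden_size
print(len(h_short), len(h_long), num_parameters)
```

Both the two-word and the six-word sequence are processed with the same set of parameters, and both end in a hidden state of the same size.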

However, when RNNs are trained on long input sequences, the gradients used to update the model parameters (so as to minimise the difference between predicted and actual outputs) diminish exponentially as they are propagated back through the time steps. This is known as the ‘vanishing gradient’ problem and causes the network to struggle to learn meaningful patterns from distant time steps, hindering performance on long sequences.
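The exponential shrinkage can be seen in a toy calculation. The sketch below assumes a deliberately simplified RNN with a single scalar hidden state, h_t = tanh(w * h_(t-1) + x_t), and a made-up weight and input. By the chain rule, the gradient of the final state with respect to the initial state is a product of one factor per time step, and each factor here is smaller than 1.

```python
import math

def gradient_through_time(w, inputs):
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = 0.0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
    return abs(grad)

# Hypothetical weight and constant input, chosen only to show the trend.
for steps in (5, 20, 50):
    print(steps, gradient_through_time(w=0.9, inputs=[0.5] * steps))
```

The printed gradient shrinks rapidly as the sequence gets longer, which is why the earliest words in a long sequence contribute almost nothing to the parameter updates.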

To mitigate the vanishing gradient problem, alternative architectures such as Long Short-Term Memory (LSTM) networks were proposed in 1997. LSTMs incorporate specialised memory cells with gating mechanisms that allow them to selectively remember or forget information over time, ensuring that relevant information is retained while irrelevant information is discarded. This enables LSTMs to capture long-range dependencies in text, such as the relationship between a pronoun and a far-off antecedent, more effectively than traditional RNNs. However, both LSTMs and RNNs process inputs sequentially, one word at a time, which prevents parallelisation and leads to prolonged training times. Moreover, they must compress the entire preceding context into a fixed-size internal state, limiting their efficacy in intricate language tasks.
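The gating idea can be sketched with a single scalar LSTM cell using the standard forget, input and output gates. The weights below are hand-picked rather than learned, purely to illustrate the behaviour: the cell writes to its memory when the input is 1 and then holds on to that memory while the input stays at 0.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; `p` holds (w_x, w_h, bias) for each gate."""
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias
    f = sigmoid(pre("forget"))        # how much of the old memory to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the memory to expose
    g = math.tanh(pre("candidate"))   # proposed new content
    c = f * c_prev + i * g            # updated memory cell
    h = o * math.tanh(c)              # updated hidden state
    return h, c

# Hand-picked (not learned) weights: remember almost everything, write only when x = 1.
params = {
    "forget": (0.0, 0.0, 6.0),
    "input": (12.0, 0.0, -6.0),
    "output": (0.0, 0.0, 6.0),
    "candidate": (4.0, 0.0, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0] + [0.0] * 20:  # one informative input followed by 20 empty steps
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

With these settings the memory cell still holds most of what was written at the first step even after twenty further steps, because the forget gate stays close to 1 and the input gate stays close to 0.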

The Transformer Era: Attention Is All You Need!

The compute boom, driven by the rise of GPUs and large high performance clusters, coupled with the data explosion, ushered in a transformative era in the world of language modelling: the era of transformers. The transformer architecture was introduced by researchers at Google in their breakthrough 2017 paper, “Attention Is All You Need”.

Transformers are deep learning models that work quite differently from traditional models like RNNs or LSTMs. Instead of processing words in order, like reading a book from start to finish, transformers look at all the words at once, like taking a snapshot of the entire book and understanding it all at the same time. They use a mechanism called self-attention to figure out which words are important and how they relate to each other. For example, in the sentence "The big brown dog barks loudly in the park," the transformer might assign higher attention to the word "dog" when understanding the word "barks" because "dog" is typically the subject of the action "barks." Similarly, it might focus on "park" when processing "in" to understand the location of the action.

Unlike RNNs, which squeeze the whole history into a single running state, transformers create and maintain a separate internal representation for each word, computed by attending directly to every other word in the sentence. This ability to consider the entire context of the sentence simultaneously allows transformers to capture complex relationships and longer dependencies more effectively compared to traditional models. Further, because the representation of one word does not have to wait for the previous word to be processed, transformers can process sequences of words in parallel.
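Self-attention can be sketched in plain Python as scaled dot-product attention. To keep it short, this sketch assumes that queries, keys and values are all the raw word vectors themselves. A real transformer applies learned projection matrices and multiple attention heads. The three-dimensional vectors for "dog", "barks" and "park" are invented for illustration.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned projections)."""
    d = len(X[0])
    weights = []
    for q in X:
        # Score this word against every word in the sentence, itself included.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights.append(softmax(scores))
    # Each output vector is an attention-weighted mix of all the word vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, X)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights

words = ["dog", "barks", "park"]
X = [[1.0, 0.0, 1.0],   # dog   (made-up vector)
     [1.0, 0.2, 0.8],   # barks (made-up vector)
     [0.0, 1.0, 0.0]]   # park  (made-up vector)
outputs, weights = self_attention(X)
for word, row in zip(words, weights):
    print(word, [round(w, 3) for w in row])
```

Each row of weights is computed independently of the others, which is what allows all words to be processed in parallel. With these made-up vectors, "barks" places more attention on "dog" than on "park".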

From Language Models to Large Language Models and Generative AI

Language models have undergone a remarkable transformation, evolving from traditional statistical models to large-scale deep neural network transformer models that have revolutionised the field of artificial intelligence. “Scale” in the context of Large Language Models (LLMs) encompasses two dimensions: 1) the size of the model, measured in terms of the parameters it learns, and 2) the extensive volume of data it trains on. As LLMs are trained on vast amounts of text data using sophisticated deep learning techniques, they can learn intricate patterns, relationships, and nuances inherent in a language, leading to what is known as “emergence”. “Emergence” refers to the phenomenon where these models exhibit complex behaviours or capabilities that were not explicitly programmed into them. For example, when ChatGPT is given the prompt “write a short story about a magical adventure in a mysterious forest”, it can churn out a compelling story featuring magical creatures and enchanted landscapes, even without being explicitly trained for it. The emergence of these new capabilities is a result of the model's training process, where it learns to generalise from the data and make predictions based on learned patterns, ultimately demonstrating behaviours that may appear intelligent or creative.

So, What’s Next?

Looking forward, the horizon of Large Language Models (LLMs) and generative AI brims with promise and potential for further advancement. The primary focus will be on refining existing LLMs to efficiently handle larger datasets and more complex and specialised tasks. For instance, there have been significant advances in the open source community focused on fine-tuning base LLMs for specific domains and quantizing models for efficient operation on smaller compute platforms. This will involve exploring architectural enhancements, optimising training algorithms, and leveraging hardware innovations to enhance scalability and performance. Concurrently, efforts will persist in improving the robustness, interpretability, and ethical considerations surrounding LLMs and generative AI.

The integration of LLMs and generative AI across various domains is set to expand, revolutionising content creation, conversational interfaces, and personalised services. LLMs can serve as the foundational component in the development of Large Action Models or LAMs, which are AI models that translate human intention into action, such as the one that powers the Rabbit R1 device. Furthermore, synergies with other AI paradigms like reinforcement learning and unsupervised learning promise novel applications and breakthroughs. As research, collaboration, and responsible development practices advance, the trajectory of LLMs and generative AI is poised to redefine industries, empower individuals, and usher in a new era of creativity and innovation.
