Large Language Models (LLMs) are advanced deep learning models capable of performing a wide range of natural language processing (NLP) tasks. Language models themselves, however, have been around for a while.
Short History
Language models have existed in various forms for several decades, evolving significantly with advances in machine learning and computational power. The first were statistical language models: purely statistical in nature, they focused on the probabilities of word occurrences and co-occurrences. Models of this kind, such as the n-gram model, became very popular in the 1980s and 1990s and were used in various natural language processing (NLP) tasks, including speech recognition and, particularly, machine translation. The first versions of Google Translate and Bing Translator were based on this technology.
With the advent of neural networks, language modeling shifted towards neural language models; the first neural probabilistic language models appeared in the early 2000s. These models leveraged the power of neural networks to learn complex patterns in text data, leading to better performance in many NLP tasks, and most machine translation companies eventually moved their frameworks to the neural paradigm, although that industry-wide switch did not arrive until the mid-2010s.
The rise of deep learning further propelled advancements in language modeling in the early 2010s. Models like Word2Vec (2013) and GloVe (2014), which used neural networks to learn word embeddings, were among the early adopters of these techniques in language modeling.
Ever faster GPUs also cleared the way for Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) models, which are adept at handling sequential data and became popular for language modeling tasks during this period. They were better at capturing long-range dependencies in text than their predecessors.
**********
With an RNN or LSTM, the input data is processed individually and sequentially rather than as a whole corpus. This means that when an LSTM is trained, the window of context is fixed, extending only a few steps beyond an individual input in the sequence. This limits the complexity of the relationships between words, and the meanings, that can be derived.
Language sequencing has evolved significantly with the advent of novel modeling architectures. This article outlines the transition from Long Short-Term Memory (LSTM) techniques to the Generative Pre-trained Transformer (GPT) models introduced by OpenAI, highlighting the pivotal role of the self-attention mechanism in enhancing language processing and interpretation.
1. Introduction:
Language sequencing is paramount for understanding and generating text. LSTM models have long been a standard technique for this task, albeit with limitations in how they weigh surrounding context and in their strictly sequential processing of input data.
2. Limitations of LSTM:
• Context Weighting: LSTM models lack the ability to weight surrounding words based on their relevance, which may result in inaccurate interpretation of context.
• Sequential Processing: By processing input data sequentially, LSTM models have a fixed context window, restricting the complexity of relationships between words that can be derived.
3. Emergence of Transformers:
In 2017, the Google Brain team introduced transformers, which, unlike LSTM models, can process all input data simultaneously. This shift allowed for better interpretation of context and processing significantly larger datasets through a self-attention mechanism.
4. Development of GPT and Self-Attention:
• GPT Evolution: Since the launch of GPT-1 in 2018, GPT models have continually evolved, benefiting from advancements in computational efficiency that enable training on larger datasets.
• Self-Attention Mechanism: Central to GPT models, the self-attention mechanism differentially weights parts of the input sequence to infer meaning and context. The process involves creating query, key, and value vectors and generating normalized weights that represent the importance of each token within the sequence (a minimal numerical sketch follows this outline).
5. Expansion of Self-Attention in GPT:
The ‘multi-head’ attention mechanism extends self-attention by running the basic process several times in parallel, enabling the model to grasp sub-meanings and more complex relationships within the input data.
6. Incorporation of Human Feedback in ChatGPT:
ChatGPT, a spinoff of InstructGPT, introduces a methodology for incorporating human feedback into the training process to better align model outputs with user intent. Supervised fine-tuning, reward modeling, reinforcement learning, and evaluation are the essential steps for refining the model and creating ChatGPT.
The evolution from LSTM to GPT evidences significant advancements in the field of language sequencing, showcasing how new architectures and mechanisms such as self-attention can overcome previous limitations and provide more accurate, contextual interpretation of language.
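To make the query, key, and value description above more concrete, here is a minimal NumPy sketch of a single self-attention step. The dimensions and the random projection matrices are arbitrary stand-ins for what a real model learns during training; this is an illustration of the mechanism, not an excerpt from any particular GPT implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy embeddings for a 4-token input sequence (seq_len=4, d_model=8).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))

# Projection matrices (random here; learned in a real model).
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v          # query, key, value vectors
scores = Q @ K.T / np.sqrt(K.shape[-1])      # similarity of each token to every other
weights = softmax(scores, axis=-1)           # normalized importance of each token
output = weights @ V                         # context-aware representation

print(weights.round(2))   # each row sums to 1: how much each token attends to the others
```

Each row of `weights` sums to 1 and indicates how strongly one token attends to every other token; multi-head attention simply repeats this computation several times in parallel with different projections and concatenates the results.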
Transformers changed everything
The introduction of the Transformer architecture in 2017 marked a significant milestone in the field of language modeling. The architecture, proposed in the paper “Attention Is All You Need” by Vaswani et al., showed exceptional performance in handling sequential data and led to the development of models like GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers).
Finally, in recent years, Large-scale Pretrained Language Models like GPT-3, released by OpenAI in 2020, and its successors have become the trend. These models are trained on massive datasets and have achieved state-of-the-art performance on a wide array of NLP tasks.
What’s changed with LLMs since 2022
The difference we have all noticed since late 2022 or early 2023 lies in the size and amount of training data, as well as the scaling into billions of parameters (although parameter count is not necessarily an indication of quality). The new models, built on the Transformer architecture (currently the most popular), are trained on such vast datasets that they “learn” impressive abilities to recognize, summarize, translate, predict and generate text that is often indistinguishable from human writing. If we add a chatbot layer for interaction, as OpenAI did with ChatGPT, Meta with Llama 2 or Google with Bard, the cognitive experience is such that the human psyche assumes it is having an intelligent conversation with a machine. That is why we have so much fun and get “hooked” on GPT models: just as with VR, we know we are not really seeing reality or talking to a human, but the experience is so realistic that our brain is fooled into believing what it is presented with.
The Transformer architecture and what “large” means
The term “large” refers to the number of values (parameters) that the model can change on its own during the learning process. Some of the most successful LLMs have hundreds of billions of parameters.
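As a concrete illustration of what “parameters” means, the snippet below (assuming the Hugging Face `transformers` library and PyTorch are installed) loads the small, publicly available GPT-2 model and counts its trainable values; frontier LLMs have several orders of magnitude more.

```python
from transformers import AutoModelForCausalLM

# GPT-2 "small" is tiny by modern standards (~124 million parameters),
# but counting works the same way for any PyTorch-based model.
model = AutoModelForCausalLM.from_pretrained("gpt2")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")   # roughly 124,000,000
```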
The heart of an LLM is usually a Transformer. A Transformer is composed of an encoder and a decoder (many modern LLMs, such as the GPT family, use only the decoder stack) and is known for its ability to handle long-range dependencies through what are known as self-attention mechanisms. As the name implies, self-attention, and in particular multi-head attention, allows the model to consider multiple parts of the text simultaneously, providing a more holistic and richer understanding of the content.
Key components of LLMs
Within these models, we find various layers of neural networks working together (a minimal sketch combining them follows this list):
• Embedding Layer: Transforms the input text into vectors, capturing its semantic and syntactic essence.
• Feedforward Layer: Comprises fully connected networks that process the embeddings, aiding in understanding the intent behind an input.
• Recurrent Layer: Traditionally, these layers interpret words in a sequence, establishing relationships among them.
• Attention Mechanism: Zooms in on specific parts of the text that are relevant to the task at hand, enhancing the accuracy of predictions.
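As a rough illustration of how these layers fit together, the following PyTorch sketch combines an embedding layer, multi-head self-attention, and a feedforward layer into a single block. The module name and all hyperparameters are arbitrary choices for the example, and the recurrent layer listed above, characteristic of pre-Transformer architectures, is omitted here.

```python
import torch
import torch.nn as nn

class MiniTransformerBlock(nn.Module):
    """A minimal, illustrative Transformer-style block (not any production model)."""
    def __init__(self, vocab_size=10000, d_model=128, n_heads=4, d_ff=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)           # embedding layer
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(                                  # feedforward layer
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, token_ids):
        x = self.embed(token_ids)                                 # (batch, seq, d_model)
        attn_out, _ = self.attn(x, x, x)                          # multi-head self-attention
        x = self.norm1(x + attn_out)                              # residual connection + norm
        return self.norm2(x + self.ff(x))                         # residual connection + norm

block = MiniTransformerBlock()
tokens = torch.randint(0, 10000, (1, 12))                         # a fake 12-token input
print(block(tokens).shape)                                        # torch.Size([1, 12, 128])
```

Real LLMs stack dozens of such blocks and add positional information and an output projection over the vocabulary, but the division of labor among the layers is the same as in this sketch.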
Types of LLMs and General Applications
Owing to their natural expression and ability to scale the production of language-based products, solutions predicated on LLMs have garnered substantial funding. Numerous companies, irrespective of their size, are investing in the customization of Large Language Models. These models hold the promise of addressing large-scale challenges across a multitude of industries. For instance, in the healthcare sector, they can aid in diagnostic processes, while in the marketing domain, sentiment analysis rendered by these models can be pivotal.
The most popular models nowadays are the “dialogue models”, which are crafted to simulate conversations, akin to chatbots or AI-driven assistants. Earlier and simpler approaches, such as generic models (which predict the next word, as in popular email autocomplete features) or instruction-trained models (for tasks like sentiment analysis or code generation), are now considered too narrow or specialized, although they remain a perfectly affordable solution when only those specific tasks are required.
Do LLMs hallucinate?
In a way, LLMs do “hallucinate”: they are trained on large volumes of text data that may contain incorrect or biased information, and when they generate text they may reproduce it in their responses. The result is output that is not real or grounded in reality, presented in a manner that can mislead the user into believing it is a correct answer.
LLMs can hallucinate because they are trained on vast amounts of text and code data that may contain incorrect or biased information. Most LLMs rely on Common Crawl and various other internet sources as foundational training material. Despite cleaning processes and bias mitigation efforts, it is impossible to verify all the information when dealing with terabytes of text. An LLM therefore has a “knowledge cutoff date,” although efforts are ongoing to enhance responses with more up-to-date information, including web search results.
For example, an LLM might be trained on a dataset containing outdated or incorrect information about the weather, stating that the average temperature in a country is 20°C. When asked about that country’s climate, the LLM might repeat that the average temperature is 20°C. This would be a hallucination if the actual average temperature (taking Spain as an example) is around 17°C.
LLMs can also hallucinate because they are designed to be creative and “generative.” All other capabilities, like coding or translating, emerged unintentionally as a result of recognizing linguistic patterns across vast text datasets.
When presented with a new question, an LLM can generate a novel and intriguing response, which, however, might not be accurate or consistent with the real world. Early critiques of ChatGPT, for instance, centered around it being a “stochastic parrot.”
For another example, an LLM trained on a dataset containing information about Spain’s history might state that Spain was founded by a group of people who came from Africa, based on the data it was trained on. This would be a hallucination since the real history of Spain is more complex.
Furthermore, LLMs might be prone to generating creative or imaginative responses. This is because they are trained to generate text similar to that in their training dataset. If the dataset contains creative or imaginative text, LLMs might generate similar text, possibly creating the impression of hallucination, as they are generating non-real information.
However, it’s crucial to understand that LLMs are not conscious beings. They cannot experience reality in the same way humans do. The information LLMs generate is merely a function of the data on which they have been trained.
Additional Information:
• Bias and Ethical Considerations: The propensity of LLMs to hallucinate, especially in a manner that may reflect societal biases present in the training data, raises serious ethical concerns. Efforts are ongoing within the AI community to address these issues, develop bias mitigation strategies, and ensure the responsible use and deployment of LLMs.
• Update and Re-training: Continual training or periodic updating of LLMs with new data can help mitigate the hallucination issue to some extent, ensuring the models stay current with factual information and societal norms.
Auto-Regressive LLMs and Non-Autoregressive LLMs
We may have grown accustomed to dealing with just one type of LLM (auto-regressive), but there are a number of LLMs that are not auto-regressive. These models are still under development, but they have the potential to overcome some of the limitations of auto-regressive LLMs. One example is the masked language model (MLM). Rather than predicting the next word in a sequence one token at a time, MLMs are trained to predict missing words in a masked sequence, using context from both sides. This allows MLMs to learn the context of words and phrases, which can then be used for tasks such as text generation, translation, and summarization.
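As a quick illustration, the Hugging Face `transformers` library exposes masked language models such as BERT through a `fill-mask` pipeline. The snippet below (assuming `transformers` and a backend such as PyTorch are installed) asks the model to fill in a masked word using context from both sides of the blank.

```python
# Requires: pip install transformers torch
from transformers import pipeline

# bert-base-uncased is a masked language model: it predicts the hidden token
# from the surrounding context, rather than generating text left to right.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for candidate in unmasker("Madrid is the [MASK] of Spain."):
    print(f"{candidate['token_str']:>10}  score={candidate['score']:.3f}")
```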
Another example of a non-autoregressive LLM is the denoising autoencoder (DAE). DAEs are trained to reconstruct a corrupted input sequence. This process forces DAEs to learn the underlying patterns in the data, which can then be used for tasks such as text generation, translation, and summarization.
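The corruption step at the heart of this objective can be sketched in a few lines of plain Python. The function below is a deliberately simplified, hypothetical illustration (loosely inspired by the token masking and deletion used to train denoising models such as BART), not the actual noising code of any library.

```python
import random

def corrupt(tokens, mask_token="<mask>", drop_prob=0.15, mask_prob=0.15):
    """Create a noisy version of a token sequence: some tokens are deleted,
    others replaced by a mask token. The (corrupted, original) pair is what a
    denoising autoencoder learns to reverse."""
    noisy = []
    for tok in tokens:
        r = random.random()
        if r < drop_prob:
            continue                      # token deletion
        elif r < drop_prob + mask_prob:
            noisy.append(mask_token)      # token masking
        else:
            noisy.append(tok)             # token kept as-is
    return noisy

original = "the quick brown fox jumps over the lazy dog".split()
print(corrupt(original))   # e.g. ['the', '<mask>', 'brown', 'jumps', 'over', 'the', 'lazy', 'dog']
```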
Although still under development, non-autoregressive LLMs may offer advantages of their own: for example, it has been argued that they are less likely to generate biased text and that they can be trained on smaller datasets.
Large-scale models such as PaLM (developed by Google AI, with 540 billion parameters) and Megatron-Turing NLG (developed by NVIDIA and Microsoft, with 530 billion parameters) are sometimes mentioned in this discussion because of the massive datasets of text and code they were trained on, and both can generate text, translate languages, and answer questions in a comprehensive and informative way. However, both are in fact decoder-only, auto-regressive models rather than examples of the non-autoregressive approach.
Some examples of non-autoregressive models are:
- BART
- RoBERTa
- XLNet
- ALBERT
- T5
- BERT
These models are trained on different datasets using different techniques, but they all share the common goal of learning the underlying patterns in language.
Because they do not have to produce output one token at a time, non-autoregressive LLMs may also be more efficient at generating long sequences of text.
Here is a table that summarizes the key differences between auto-regressive and non-autoregressive LLMs:
| Feature | Auto-regressive LLMs | Non-autoregressive LLMs |
| --- | --- | --- |
| How they generate text | Generate text one token at a time, starting from an initial set of tokens | Generate text all at once |
| Strengths | Can generate text that is indistinguishable from human-written text | May be less likely to generate biased text; may be more efficient at generating long sequences of text |
| Weaknesses | Can be biased, depending on the data they are trained on | Still under development; may not be as accurate as auto-regressive LLMs at generating short sequences of text |
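To see what “one token at a time” means in practice, here is a minimal greedy decoding loop using the publicly available GPT-2 model from the Hugging Face `transformers` library (assuming `transformers` and `torch` are installed). A non-autoregressive model would instead emit its output positions in parallel rather than appending to the sequence step by step.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("The Transformer architecture", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):                                  # generate 20 tokens, one at a time
        logits = model(ids).logits                       # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick of next token
        ids = torch.cat([ids, next_id], dim=-1)          # append and feed back in

print(tok.decode(ids[0]))
```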
Overall, non-autoregressive LLMs are a promising area of research, and they could enable new and innovative applications in the future.