An Introduction to the Transformer Model: The Brains Behind Large Language Models

Tyler Au
8 minutes
May 1st, 2024

What is a Transformer Model?

Just last week, Meta launched Meta Llama 3, the newest iteration of their Llama large language model (LLM). Llama 3 marks a drastic improvement over Llama 2: it was trained on 7x more tokens, is 3x more efficient, and goes head to head with other popular LLMs. The free, open source LLM has also garnered tons of community support and adoption. Excelling in areas such as “language nuances, contextual understanding and [...] translation and dialogue generation,” Llama 3 has made Meta a formidable competitor in the open source AI space, but why is their newest LLM so powerful?

Large language models (LLMs) are machine learning (ML) models capable of understanding and generating human language by processing immense amounts of text. In particular, LLMs are deep learning models: ML models built from many-layered neural networks that can learn complex associations, loosely analogous to the way the human brain does. These associations are based on probability. Given enough data, an LLM learns to predict likely outcomes, such as the next word in an incomplete sentence. This foundation allows LLMs to carry out a wide range of tasks, making them the go-to basis for things like generative AI, copywriting, code generation, and so on.
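
To make "associations based on probability" concrete, here is a toy sketch in Python. It is far simpler than a real LLM: it only counts which word tends to follow which, then predicts the most frequent follower. The tiny corpus and the function names (train_bigrams, predict_next) are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return (most likely next word, estimated probability), or None if the word is unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    best, freq = followers.most_common(1)[0]
    return best, freq / sum(followers.values())

# Illustrative corpus only; a real LLM learns from billions of tokens.
corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigrams(corpus)
print(predict_next(model, "the"))  # ('cat', 0.5)
```

In this corpus "the" is followed by "cat" twice, and by "mat" and "fish" once each, so the sketch predicts "cat" with probability 0.5. An LLM does something similar in spirit, but conditions on the whole preceding context rather than a single word.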

In order to support their full predictive capabilities, LLMs are built upon a neural network architecture known as the transformer model. Introduced by Google researchers in 2017, the transformer model has since shaken up the natural language processing (NLP) field, cementing a place for itself in many artificial intelligence applications today. With Llama 3 and some of the most powerful LLMs being built upon a transformer architecture, it’s no wonder that some of the biggest tech companies around the world have been implementing transformers in their own solutions. But what makes transformers so critical to LLMs?

The Transformer Architecture and What It Means for NLP

Transformer Architecture

Transformer models can be thought of as a human brain in many aspects. For one, the composition of a transformer model loosely resembles that of a brain: complex networks of many nodes and interdependent layers work together to take in information and produce output.

Processing information is the brain’s biggest capability: information reaches our brains and we react, or put out a response, to it. Transformer architecture is based on this idea.

In its original form, a transformer model follows an encoder-decoder architecture, with the encoder and decoder each being built of several components. The encoder is responsible for extracting meaning from an input sequence, which is then sent to the decoder to generate an output sequence based on the extracted meaning. Meaning is derived from tokens, which in NLP are the individual chunks of a sequence (such as words or pieces of words) that are ready for analysis. The encoder parses the tokens in a sentence for underlying meaning and produces a new sequence of representations that captures it. Within the encoder and decoder are stacked transformer blocks, each made up of layers that serve their own purpose. Examples of these layers include the attention layer, the feed-forward layer, and normalization layers.
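
As a rough illustration of what "tokens" means in practice, the Python sketch below splits sentences on whitespace and maps each word to an integer ID. This is a simplification: production LLMs use subword tokenizers rather than whole words, and the helper names (build_vocab, encode) and example sentences are hypothetical.

```python
def build_vocab(sentences):
    """Assign an integer ID to each distinct lowercase word, in order of first appearance."""
    vocab = {"<unk>": 0}  # ID 0 is reserved for words never seen before
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Turn a sentence into a list of token IDs, using <unk> for unknown words."""
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.lower().split()]

vocab = build_vocab(["the encoder reads the input", "the decoder writes the output"])
print(encode("the decoder reads the novel", vocab))  # [1, 5, 3, 1, 0]
```

The model never sees raw text, only sequences of IDs like these, which are then turned into vectors that the attention and feed-forward layers operate on.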

Image courtesy of "Attention Is All You Need"

Like a brain, in order to improve its response and performance, a transformer model must learn. 

Transformer models rely on a mechanism called self-attention. Similar to noticing body language in a conversation, self-attention is used to detect how elements in a sequence relate to one another, even when they are not next to each other. These elements are then weighted against the overall context, with some pieces of information being more relevant than others. Through this, a transformer model is able to understand context and draw on the information that carries the most weight within a sequence, generating an output that is most relevant to the input sequence.

What makes the integration of self-attention into transformer models so important is that it allows these models to focus on all previous tokens within a sequence, greatly expanding on the reference capabilities of earlier NLP models like the Recurrent Neural Network (RNN) and the Long Short-Term Memory (LSTM) network. This GIF by Michael Phi does a great job of illustrating this idea:
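
The weighting described above can be sketched with scaled dot-product attention, the form of attention used in the original transformer. The Python below is a minimal illustration, not a production implementation: it uses plain lists instead of a tensor library, and it feeds the same toy vectors in as queries, keys, and values, whereas a real model first passes each token through learned projection matrices.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention: each output is a weighted mix of all value vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token against every token in the sequence, including itself.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional token vectors (illustrative values only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token in the sequence. The first token puts more weight on itself and on the third token, which points in a similar direction, than on the second token, which is the "some pieces of information being more relevant than others" idea in numeric form.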

GIF Courtesy of Michael Phi in "Illustrated Guide to Transformers- Step by Step Explanation"

Self-attention is vital to LLMs because it enables their transformer foundations to relate tokens across long input sequences, making them more efficient and able to be trained on larger data sets. In particular, self-attention lets LLMs excel in things like sentence completion, question answering, translation, and so on.

Embracing self-attention mechanisms and creating multifaceted ways of encoding and decoding information through the various layers have allowed transformer models to become a mainstay in the NLP field, marking a shift within the space.

The New Kid on the Block

Before transformer models were introduced in 2017, natural language processing relied on two main architectures: Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks.

An RNN is a deep learning model that processes input and output data sequentially, one step at a time, and it laid the foundation for the sequence modeling mechanisms that followed. RNNs are simple and dependable for sequences whose information arrives in time order. Basic RNNs, however, struggle to retain information from far back in a sequence, so important early context is often forgotten.

LSTMs were created to address this shortcoming, allowing information from earlier steps within a sequence to be retained, at the cost of greater complexity than basic RNNs.

The transformer model addresses the shortcomings of both RNNs and LSTMs through the multi-head self-attention mechanism.
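
Both properties of an RNN, the strict step-by-step order and the fading of early information, can be seen in a toy one-unit RNN. The Python sketch below is an illustration under assumed values: the weights w_x and w_h are arbitrary rather than trained, and a real RNN uses vectors and matrices instead of single numbers.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.5, h0=0.0):
    """Run a one-unit RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h = h0
    states = []
    for x in inputs:
        # Each step needs the previous hidden state, so steps cannot run in parallel.
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

# Two sequences that differ only in their very first element.
with_signal = rnn_forward([1.0] + [0.0] * 10)
without_signal = rnn_forward([0.0] + [0.0] * 10)

# The difference is clear right after the first step, and nearly gone ten steps later.
print(abs(with_signal[0] - without_signal[0]))
print(abs(with_signal[-1] - without_signal[-1]))
```

The second printed number is far smaller than the first: by the end of the sequence the hidden state carries almost no trace of the first input.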

Self-attention allows transformer models to draw on information from anywhere earlier in a sequence, capturing long-range dependencies directly. This mechanism also offers transformers greater interpretability, more accurate results, and efficient performance. Multi-head attention runs several attention computations side by side, and because no position has to wait for the one before it, a whole input sequence can be processed in parallel, making processing faster and reducing overall training time.

Despite the benefits the transformer model offers in comparison to its predecessors, this architecture is also far more complex. With RNNs, LSTMs, and transformer models offering NLP capabilities through three different approaches, each architecture excels in different use cases, so none of them is necessarily obsolete, even with transformer models taking center stage in today’s AI-enthusiastic landscape.
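
A minimal sketch of the multi-head idea in Python: each token vector is split into equal slices, attention is run independently on each slice (one "head"), and the results are joined back together. This is a simplification under stated assumptions. Real multi-head attention applies learned linear projections for each head and a final output projection, all of which are omitted here, and the function names and toy vectors are illustrative only.

```python
import math

def attention(q_seq, k_seq, v_seq):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(k_seq[0])
    out = []
    for q in q_seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in k_seq]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[i] for w, v in zip(weights, v_seq))
                    for i in range(len(v_seq[0]))])
    return out

def multi_head_attention(seq, num_heads):
    """Split each vector into num_heads slices, attend per slice, then concatenate."""
    d_model = len(seq[0])
    if d_model % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        start, end = h * d_head, (h + 1) * d_head
        sliced = [vec[start:end] for vec in seq]
        # Heads do not depend on each other, so they could run in parallel.
        head_outputs.append(attention(sliced, sliced, sliced))
    # Join the heads' outputs back into one vector per token.
    return [sum((head[t] for head in head_outputs), []) for t in range(len(seq))]

seq = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 1.0]]
out = multi_head_attention(seq, num_heads=2)
print(len(out), len(out[0]))  # 3 4
```

Because the heads never read each other's results, and no token waits on the previous token's output, this computation maps naturally onto parallel hardware. The loops here run one after another only because plain Python is used for readability.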

The LLM Transformer Relationship - Why ALL LLMs Use Transformers

With LLMs typically being trained on billions, or even trillions, of tokens, the transformer foundations they’re built on must be resilient and efficient, amongst other things. The demands that LLMs must meet are intense, and the benefits that transformer models offer are well matched to them.

As mentioned before, LLMs are built on top of transformer models largely because of the self-attention mechanism. With this mechanism, transformers are able to learn context and adjust their output sequences accordingly, while interpreting complex forms of language. Transformers are also able to capture long-range dependencies more effectively and make better predictions with self-attention, making them a strong candidate for LLM usage. Transformers can also be trained without labeled data, learning patterns directly from raw text, an approach often called self-supervised learning.

Beyond self-attention, transformer models provide the parallelization that LLMs need for efficient training. Processing input sequences in parallel is a huge improvement over RNNs and LSTMs, which are only capable of sequential processing. This allows not only for better capture of long-range dependencies, but also for more efficient training, especially given the amount of data that transformers and LLMs work with. This parallelization also allows for better scale, enabling data processing and sequence generation that far surpasses the capabilities of the transformer’s predecessors.

Transformer models can also be pre-trained on massive datasets for general use, allowing for transfer learning: reusing a pre-trained model on a new problem. Transfer learning is important in computer vision and NLP because it provides a starting point for new problems, avoiding the time and resources that training a deep learning model from scratch can take. Transfer learning can also improve performance, since the new model starts from what the pre-trained model has already learned.

Above all, transformer models have generative capabilities of their own, making them the perfect candidate for LLMs that need things such as text completion and generation, speech translation, and so on.

Large Language Models that Optimize Transformer Usage

Although it’s seemingly the LLM that gets the limelight in the transformer-LLM relationship, transformers themselves shouldn’t be overlooked. Here are some of the most popular transformer models today:

BERT (Bidirectional Encoder Representations from Transformers)

Developed in 2018 by Google, BERT is an LLM used to understand the context of words based on both preceding and succeeding words. The original BERT was trained on roughly 3.3 billion words of English text, making it extremely capable in context understanding. Since its launch, BERT models have been widely adopted and loved by many, excelling in targeted search queries, so much so that Google itself uses BERT to understand text importance and identify sequence nuances.

BERT in Search: Esthetician Example
Image courtesy of Google

GPT-4 (Generative Pre-trained Transformer 4)

Perhaps the most well known of the bunch, GPT-4 is OpenAI’s latest iteration of their GPT LLM. GPT-4 marks an impressive improvement over its predecessors. Best known for its chat capabilities, especially in ChatGPT, GPT-4 builds upon the foundation laid by earlier versions, making the newest LLM more capable of solving problems than ever before. From faster coding to stronger text analysis, GPT-4’s uses are limited only by your imagination!

Image courtesy of OpenAI

Meta Llama 3

And of course, back to where we started. 

Meta Llama 3 is Meta’s newest iteration of their open source Llama LLM. Trained on over 15 trillion tokens, Llama 3 uses a training dataset 7x larger than the one used for Llama 2, including 4x more code. Llama 3 is an important addition to the LLM field because it’s open source: anyone can use Llama 3 today and keep building on it. Especially adept in fields such as writing and translation, Llama 3 is a must for most generative AI projects.

Image courtesy of Meta

LLMs at Lyrid?

The newly released Meta Llama 3 is one of the many large language models making waves within the AI space, improving our understanding of language and speech contextualization as a result. A building is only as good as its foundation, and for many of these powerful LLMs, that foundation consists of an equally powerful transformer model, amongst other things.

Transformers come from a long line of natural language processing techniques aimed at pulling meaning from sequences in hopes of providing better contextual understanding to the machines that use these models. From recurrent neural networks to long short-term memory networks to, eventually, transformer models, NLP techniques have improved how our LLMs interact with huge datasets and how they perform based on our sequences.

With the enthusiasm and innovations around LLMs, AI, and transformers showing no signs of slowing down, support systems for these complex solutions must be put in place. We’re excited to announce that we’re building our own LLM solution. Our solution is aimed at supporting the infrastructure of LLMs while allowing for greater scalability and growth! In fact, we’ve built a GPT solution using Meta’s Llama 3.

Looking to learn more? Book a call with us to learn more about what we have in store for LLMs!


