
Understanding Tokenization in Large Language Models

An in-depth exploration of tokenization, tokens, and their crucial role in modern LLMs

Yash Worlikar Mon Sep 30 2024 6 min read

Have you ever wondered how LLMs understand and process text? Or why the costs of using Large Language Models (LLMs) are calculated based on something called tokens? The answer lies in the fundamental concept of tokenization and the inner workings of these models.

In this blog, we will try to get a better understanding of tokenization and its role in modern LLMs, along with some examples.

So what is Tokenization?

Tokenization is a foundational concept in Natural Language Processing (NLP) that involves breaking text down into smaller units called tokens. Tokens are the smallest units of data a model can process and interpret.

This is done by converting the raw text into a format the models can work with. Each token can represent a word, a part of a word, or even a single character or punctuation mark. The number of tokens used in a given interaction directly correlates with the computational resources required to process that text.

Below are some examples of the same sentence broken down using different tokenization methods.

# Word Tokens
["Back", "to", "the", "drawing", "board"]

# Subword Tokens
["Back", "to", "the", "draw", "##ing", "bo", "##ard"]

# Character Tokens
["B", "a", "c", "k", " ", "t", "o", " ", "t", "h", "e", " ", "d", "r", "a", "w", "i", "n", "g", " ", "b", "o", "a", "r", "d"]

As we make our tokens more granular, the cost of computation increases. Just from the above example, the same sentence takes 5 word tokens but 25 character tokens, a 5x increase in the computation required for processing the same input.

Now let’s have a closer look at the different tokenization techniques mentioned above.

Word-Level Tokenization

Word-level tokenization splits text into words based on delimiters such as spaces and punctuation.

This method is straightforward and often the first step in text-processing tasks. It works well for languages where spaces separate words, like English.

However, it can struggle with contractions, compound words, and languages that do not use spaces as word boundaries.

Example:

string text = "Hello, world! How are you?";
string[] tokens = text.Split(new char[] { ' ', ',', '!', '?' }, StringSplitOptions.RemoveEmptyEntries);
// Result: ["Hello", "world", "How", "are", "you"]

Subword-Level Tokenization

Subword tokenization breaks words into smaller units, which helps manage rare words and accommodates morphologically rich languages.

This technique is particularly useful in machine translation and language models where understanding the structure of words can enhance performance.

Byte Pair Encoding (BPE)

BPE is a popular subword tokenization algorithm used in natural language processing and machine learning. It builds a vocabulary of subword units, which helps handle rare words and morphological variants efficiently, using the following steps (a code sketch follows the list).

  1. Initialization: Start with a vocabulary of individual characters (character-level tokens).
  2. Frequency Analysis: Count the frequency of adjacent token pairs in the training corpus.
  3. Merge Operation:
  • Identify the most frequent pair of adjacent tokens.
  • Create a new token by merging this pair.
  • Add this new token to the vocabulary.
  4. Iteration: Repeat steps 2-3 for a predetermined number of merges or until the desired vocabulary size is reached.
  5. Tokenization: Use the final vocabulary to tokenize new text by applying the learned merges.
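
To make steps 2 and 3 concrete, here is a minimal sketch of a single training iteration in C#. The toy corpus and every name in it are assumptions for illustration; a real implementation would loop this until the vocabulary reaches its target size.

// Illustrative sketch of one BPE training iteration (steps 2-3),
// assuming a toy corpus already split into character tokens.
var corpus = new List<List<string>>
{
    new List<string> { "l", "o", "w" },
    new List<string> { "l", "o", "w", "e", "r" },
    new List<string> { "l", "o", "w", "e", "s", "t" },
};

// Step 2: count the frequency of adjacent token pairs.
var pairCounts = new Dictionary<(string, string), int>();
foreach (var word in corpus)
    for (int i = 0; i < word.Count - 1; i++)
    {
        var pair = (word[i], word[i + 1]);
        pairCounts[pair] = pairCounts.GetValueOrDefault(pair) + 1;
    }

// Step 3: merge the most frequent pair into a single new token.
var best = pairCounts.MaxBy(kv => kv.Value).Key;   // e.g. ("l", "o") with count 3
string mergedToken = best.Item1 + best.Item2;      // new vocabulary entry "lo"
foreach (var word in corpus)
    for (int i = 0; i < word.Count - 1; i++)
        if ((word[i], word[i + 1]) == best)
        {
            word[i] = mergedToken;                 // replace the pair in place
            word.RemoveAt(i + 1);
        }
// Step 4 repeats this loop until the target vocabulary size is reached.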

Let’s say that after training, the merges produced the subword tokens “low” and “er”. So when we input the word “lowered”, it gets broken down as follows.

string text = "lowered";
string[] tokens = SubWordEncoding(text);
// Result: ["low", "er", "e", "d"]

string[] SubWordEncoding(string text) {
 // BPE inference implementation
}
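
The implementation details don't matter for this discussion, but for intuition, here is one possible minimal sketch: a greedy match against the learned subwords, falling back to single characters. The hard-coded subword list is an assumption for this example; real BPE inference replays the learned merge rules in order instead.

// Illustrative sketch only: greedily match entries of the learned
// subword list, falling back to single characters. Real BPE inference
// replays the learned merge rules in order instead.
string[] SubWordEncoding(string input)
{
    string[] subwords = { "low", "er" }; // assumed learned vocabulary
    var tokens = new List<string>();
    int i = 0;
    while (i < input.Length)
    {
        string match = subwords.FirstOrDefault(s => input.Substring(i).StartsWith(s));
        tokens.Add(match ?? input[i].ToString()); // single-character fallback
        i += match?.Length ?? 1;
    }
    return tokens.ToArray();
}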

Character-Level Tokenization

Character-level tokenization treats each character as a token, which can be useful in certain contexts, particularly when dealing with languages that have complex scripts or when analyzing character-level patterns, such as in spelling correction or handwriting recognition.

This technique is particularly useful for languages with no clear word boundaries, such as Chinese or Japanese, where each character can represent a concept or idea.

Example:

string text = "Hello";
char[] tokens = text.ToCharArray();
// Result: ['H', 'e', 'l', 'l', 'o']

Vocabulary: The Heart of Tokenization

The vocabulary of an LLM refers to the set of unique tokens that the model was trained on. The size and composition of this vocabulary play an important role in the model’s performance and efficiency.

Whenever we ask an LLM a question, it first gets converted into tokens using this vocabulary. The model then uses these tokens to predict the next tokens in the sentence one by one until the end of the sequence.
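
Conceptually, that loop looks something like the sketch below. Note that predictNextToken and endOfSequenceId are hypothetical stand-ins for the model's forward pass and stop token, and Tokenize/Detokenize are the vocabulary helpers introduced later in this section.

// Conceptual sketch only; predictNextToken is a hypothetical placeholder
// standing in for the model's forward pass.
const int endOfSequenceId = -1;
Func<List<int>, int> predictNextToken = ids => endOfSequenceId;

List<int> tokenIds = Tokenize("Why is the sky blue?");
while (true)
{
    int next = predictNextToken(tokenIds); // predict one token id at a time
    if (next == endOfSequenceId) break;    // stop token: the response is complete
    tokenIds.Add(next);                    // append it and predict again
}
string response = Detokenize(tokenIds);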

A well-chosen vocabulary significantly enhances a model’s ability to understand and generate text accurately. If the vocabulary is too small, the model may struggle with rare or complex words. Conversely, an excessively large vocabulary can complicate the model and increase its computational demands.

Let’s look at the vocabulary sizes of some popular LLMs:

  • BERT: Approximately 30,000 tokens
  • GPT-4: Around 100,000 tokens
  • LLaMA 3: 128,000 tokens

These larger vocabularies allow models to handle a diverse range of inputs, from everyday phrases to specialized terminology, enhancing their ability to generalize across various contexts and languages.

Let’s see an example.

// Our dummy vocabulary
private Dictionary<string, int> vocab = new Dictionary<string, int>()
{
    {"Black", 0}, {" hole", 1}, {" is", 2}, {" a", 3}, {" reg", 4}, {"ion", 5},
    {" of", 6}, {" spa", 7}, {"ce", 8}, {" where", 9}, {" grav", 10}, {"ity", 11},
    {" so", 12}, {" strong", 13}, {" nothing", 14}, {" can", 15}, {" es", 16}, {"cape", 17}, {" it", 18}, {".", 19}
};

public List<int> Tokenize(string input){}
public string Detokenize(List<int> tokenIds){}

Here we have a dummy vocabulary along with a Tokenize and Detokenize function. We won’t worry about their implementation details here.
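
Still, for intuition, a minimal greedy version might look like the sketch below: longest-match lookups against the dummy vocabulary, with no fallback for text the vocabulary can't cover. This is an assumption for illustration, not how production tokenizers work.

// Illustrative sketch only: greedy longest-match against the dummy vocabulary.
public List<int> Tokenize(string input)
{
    var tokenIds = new List<int>();
    int i = 0;
    while (i < input.Length)
    {
        // Pick the longest vocabulary entry matching at position i.
        string match = vocab.Keys
            .Where(k => input.Substring(i).StartsWith(k))
            .OrderByDescending(k => k.Length)
            .First(); // throws if no entry matches (no out-of-vocabulary handling)
        tokenIds.Add(vocab[match]);
        i += match.Length;
    }
    return tokenIds;
}

public string Detokenize(List<int> tokenIds)
{
    // Invert the vocabulary and concatenate the token strings.
    var inverse = vocab.ToDictionary(kv => kv.Value, kv => kv.Key);
    return string.Concat(tokenIds.Select(id => inverse[id]));
}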

// Repeated words and complex words example
string inputText = "Black hole is a region of space where gravity is strong.Black hole is strong.";

List<int> tokenizedOutput = Tokenize(inputText);
// Tokenize Result: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 2, 13, 19, 0, 1, 2, 13, 19

string detokenizedText = Detokenize(tokenizedOutput);
// Detokenize Result: Black hole is a region of space where gravity is strong.Black hole is strong.

When we tokenize a sentence, we effectively compress it by representing subwords as numbers: here we went from 77 characters (including spaces) down to 20 tokens. Notice also that the repeated phrase "Black hole is" reuses the same token IDs (0, 1, 2).

To get a visual representation of how your inputs are tokenized by different LLMs, you can use tools like Tiktokenizer.

Context Length

Context length is another critical factor that significantly influences the performance and capabilities of LLMs. It represents the maximum number of tokens an LLM can process at a time.

While the original GPT-3.5 had a context length of around 4,096 tokens, newer models have dramatically expanded this capability. For instance, some newer models now boast context lengths of 128,000 or even a million tokens.

However, there’s a trade-off between context length and response time. The more tokens an LLM needs to process, the slower its response typically becomes, since computational cost grows with the number of tokens (quadratically, in the case of standard self-attention).

Wrapping up

Understanding tokenization is key to grasping the inner workings of large language models. Tokens are the fundamental building blocks that LLMs process, and how text is tokenized impacts both model performance and computational cost.

The size of a model’s vocabulary and its context length also play vital roles in determining how efficiently the model can understand and generate responses. These concepts shape the user experience and cost-effectiveness of using LLMs in real-world applications.
