Positional Encodings in Transformer Models


Natural language processing (NLP) has evolved significantly with transformer-based models. A key innovation in these models is positional encodings, which help capture the sequential nature of language. In this post, you will learn about:

  • Why positional encodings are necessary in transformer models
  • Different types of positional encodings and their characteristics
  • How to implement various positional encoding schemes
  • How positional encodings are used in modern language models

Let’s get started!


Positional Encodings in Language Models
Photo by Svetlana Gumerova. Some rights reserved.

Overview

This post is divided into five parts; they are:

  • Understanding Positional Encodings
  • Sinusoidal Positional Encodings
  • Learned Positional Encodings
  • Rotary Positional Encodings (RoPE)
  • Relative Positional Encodings

Understanding Positional Encodings

Consider these two sentences: “The fox jumps over the dog” and “The dog jumps over the fox”. They contain the same words but in different orders. In recurrent neural networks, the model processes words sequentially, naturally capturing this difference. However, transformer models process all words in parallel, making them unable to distinguish between these sentences without additional information.

Positional encodings solve this problem by providing information about each token’s position in the sequence. Each token is converted into a vector through the model’s embedding layer, with the vector size called the “hidden dimension”. Positional encoding adds position information by creating a vector of the same hidden dimension.

The positional encodings are combined with the input, either by adding them to the token embeddings before the attention layers or by modifying the queries and keys inside the attention module. During the dot-product operation, they make the attention scores position-aware, for example by emphasizing relationships between nearby tokens. This allows the model to distinguish between sentences with the same words in different orders.
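As a minimal sketch of this additive combination (the tensor names and sizes below are illustrative, not taken from any particular model):

```python
import torch
import torch.nn as nn

# Illustrative sizes: any values work as long as the hidden dimension matches
vocab_size, seq_len, hidden_dim = 100, 6, 16

token_embedding = nn.Embedding(vocab_size, hidden_dim)
tokens = torch.randint(0, vocab_size, (1, seq_len))   # (batch_size, seq_len)

# Stand-in positional encoding: one vector per position, same hidden dimension
pos_encoding = torch.randn(seq_len, hidden_dim)

# Because the shapes agree, the two are simply added together
hidden = token_embedding(tokens) + pos_encoding       # (1, seq_len, hidden_dim)
print(hidden.shape)                                   # torch.Size([1, 6, 16])
```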

The most common types of positional encodings are:

  1. Sinusoidal Positional Encodings (used in the original Transformer): Uses constant vectors built with sine and cosine functions
  2. Learned Positional Encodings (used in BERT and GPT): Vectors are learned during training
  3. Rotary Positional Encodings (RoPE, used in Llama models): Uses constant vectors built with rotational matrices
  4. Relative Positional Encodings (used in T5): Based on distances between tokens rather than absolute positions
  5. Attention with Linear Bias (ALiBi, used in MPT and BLOOM models): A bias term added to attention scores based on token distances

Each type has unique advantages and limitations, which we’ll explore in detail.

Sinusoidal Positional Encodings

The original Transformer paper introduced sinusoidal positional encodings. Deterministic functions are used to generate unique patterns for each position, as shown in the following formulas:

$$
\begin{aligned}
PE(p, 2i) &= \sin\left(\frac{p}{10000^{2i/d}}\right) \\
PE(p, 2i+1) &= \cos\left(\frac{p}{10000^{2i/d}}\right)
\end{aligned}
$$

Where $d$ is the hidden dimension (must be even), and $i$ ranges from $0$ to $d/2-1$. The positional encoding $PE(p, k)$ is the $k$-th element of the vector for position $p$. The constant 10000 was suggested by the original Transformer paper; it should be larger than the maximum sequence length.

Here’s the PyTorch implementation:
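A minimal sketch of such a module is shown below. The variable names (position, div_term, pe) match the description that follows; the class name SinusoidalPositionalEncoding is illustrative.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, seq_len, dim):
        super().__init__()
        # position: the index p of each token, shape (seq_len, 1)
        position = torch.arange(seq_len).unsqueeze(1).float()
        # div_term: 1 / 10000^(2i/d) for i = 0 .. d/2 - 1, shape (dim // 2,)
        div_term = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
        pe = torch.zeros(seq_len, dim)
        pe[:, 0::2] = torch.sin(position * div_term)   # even elements of each vector
        pe[:, 1::2] = torch.cos(position * div_term)   # odd elements of each vector
        self.register_buffer("pe", pe)                 # constant, not a trained parameter

    def forward(self, x):
        # x: (batch_size, seq_len, dim); add one encoding vector per position
        return x + self.pe[: x.size(1)]
```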

In this implementation, div_term computes $1/10000^{2i/d}$ for $i=0$ to $d/2-1$, and the position matrix has shape (seq_len, 1). Multiplying them (with broadcasting) and applying the sine and cosine functions produces two matrices of shape (seq_len, dim//2), whose columns are interleaved into the output matrix pe of shape (seq_len, dim).

Sinusoidal encodings have two key advantages: they’re deterministic and can extrapolate to sequences longer than those seen during training. In addition, thanks to the properties of sinusoidal functions, the dot product of two positional encoding vectors depends only on the distance between the two positions, which gives the model a handle on relative positions.
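As a quick illustration of the last point, using the sketch above, the dot product of two encoding vectors a fixed number of positions apart does not depend on where the pair starts:

```python
enc = SinusoidalPositionalEncoding(seq_len=128, dim=64)
pe = enc.pe

# Both pairs are 10 positions apart, so the dot products are (nearly) identical
print(torch.dot(pe[3], pe[13]))
print(torch.dot(pe[50], pe[60]))
```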

However, these encodings don’t adapt to data characteristics and may be less effective for very long sequences.

Learned Positional Encodings

Models like GPT-2 use learned positional encodings. Here’s the PyTorch implementation:
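A minimal sketch of such a module might look like this (the class name LearnedPositionalEncoding and the max_seq_len argument are illustrative):

```python
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, max_seq_len, dim):
        super().__init__()
        # Lookup table: one trainable vector of size dim per position index
        self.embedding = nn.Embedding(max_seq_len, dim)

    def forward(self, x):
        # x: (batch_size, seq_len, dim)
        batch_size, seq_len, _ = x.shape
        positions = torch.arange(seq_len, device=x.device).expand(batch_size, seq_len)
        # Embed the position indices and add them to the token embeddings
        return x + self.embedding(positions)
```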

The nn.Embedding layer acts as a lookup table mapping integer indices to vectors of dimension dim. In the forward() function, the position indices are embedded into a tensor of shape (batch_size, seq_len, dim), matching the input x. The positional encoding is added to x before the attention operation.

Learned positional encodings adapt to data characteristics through training, potentially offering better performance when trained properly. However, they can’t extrapolate to longer sequences and may overfit. They also increase model size since they’re part of the model parameters.

Rotary Positional Encodings (RoPE)

Most modern large language models use rotary positional encodings (RoPE). These encode positions by rotating pairs of embedding dimensions: the rotation angle grows linearly with the position index, and the frequencies form a geometric progression across dimension pairs. The formulas are:

$$
\begin{aligned}
\hat{x}_m^{(i)} &= x_m^{(i)} \cos(m\theta_i) - x_m^{(d/2+i)} \sin(m\theta_i) \\
\hat{x}_m^{(d/2+i)} &= x_m^{(i)} \sin(m\theta_i) + x_m^{(d/2+i)} \cos(m\theta_i)
\end{aligned}
$$

where $\theta_i = 10000^{-2i/d}$, with $d$ as the embedding dimension, $m$ the position index, and $i$ ranging from 0 to $d/2-1$. In matrix form, it is:

$$
\mathbf{\hat{x}}_m = \mathbf{R}_m\mathbf{x}_m = \begin{bmatrix}
\cos(m\theta_i) & -\sin(m\theta_i) \\
\sin(m\theta_i) & \cos(m\theta_i)
\end{bmatrix} \mathbf{x}_m
$$

where $\mathbf{x}_m$ is the pair of elements $(x_m^{(i)}, x_m^{(d/2+i)})$ of the vector at position $m$.

Here’s the PyTorch implementation:
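Below is a minimal sketch in the style of common open-source implementations (the class name and the choice to duplicate the sine and cosine tables across the two halves of the dimension are my own; production code such as Llama’s differs in detail). In practice, the rotation is applied to the query and key vectors inside each attention layer.

```python
import torch
import torch.nn as nn

class RotaryPositionalEncoding(nn.Module):
    def __init__(self, dim, max_seq_len=2048):
        super().__init__()
        # inv_freq: theta_i = 10000^(-2i/d) for i = 0 .. d/2 - 1
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        # position: m = 0 .. max_seq_len - 1
        position = torch.arange(max_seq_len).float()
        # sinusoid_inp[m, i] = m * theta_i, shape (max_seq_len, dim // 2)
        sinusoid_inp = torch.outer(position, inv_freq)
        # Duplicate across both halves so sin/cos cover the full dimension, then cache
        self.register_buffer("sin", torch.cat([sinusoid_inp.sin(), sinusoid_inp.sin()], dim=-1))
        self.register_buffer("cos", torch.cat([sinusoid_inp.cos(), sinusoid_inp.cos()], dim=-1))

    def forward(self, seq_len):
        return self.cos[:seq_len], self.sin[:seq_len]

def rotate_half(x):
    # (x_1, ..., x_{d/2}, x_{d/2+1}, ..., x_d) -> (-x_{d/2+1}, ..., -x_d, x_1, ..., x_{d/2})
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rotary_pos_emb(x, cos, sin):
    # x: (batch_size, seq_len, dim); cos, sin: (seq_len, dim)
    # Implements x_hat = x * cos(m * theta) + rotate_half(x) * sin(m * theta)
    return x * cos + rotate_half(x) * sin
```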

The register_buffer() calls cache the sine and cosine values for efficiency. The inv_freq variable computes $\theta_i$ for all $i$, position represents $m$ (indices from 0 to max_seq_len-1), and sinusoid_inp holds $m\theta_i$ in a matrix of shape (max_seq_len, dim//2). The rotate_half() function converts a vector $(x_1, \dots, x_{d/2}, x_{d/2+1}, \dots, x_d)$ into $(-x_{d/2+1}, \dots, -x_d, x_1, \dots, x_{d/2})$. Then, apply_rotary_pos_emb() combines the pieces as x * cos + rotate_half(x) * sin, which applies the rotation above to the input.

RoPE offers several advantages:

  • The rotation matrix $\mathbf{R}_m$ geometrically rotates the 2D input vector by an angle $m\theta_i$
  • The transpose $\mathbf{R}_m^\top = \mathbf{R}_m^{-1}$ represents the reverse rotation. Hence relative positions can be easily recovered, since $\mathbf{R}_m\mathbf{R}_n^\top = \mathbf{R}_{m-n}$ (see the short numerical check after this list)
  • It can extrapolate to longer sequences due to the geometric progression of angles
  • Since $\cos^2t+\sin^2t=1$, RoPE preserves vector norms of $\mathbf{x}_m$, aiding training stability
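As a quick numerical check of the rotation properties above (a self-contained sketch using a single 2×2 rotation matrix):

```python
import math
import torch

def rotation(angle):
    # The 2x2 rotation matrix R used for one pair of dimensions
    c, s = math.cos(angle), math.sin(angle)
    return torch.tensor([[c, -s], [s, c]])

theta, m, n = 0.1, 7, 3
R_m, R_n, R_rel = rotation(m * theta), rotation(n * theta), rotation((m - n) * theta)

print(torch.allclose(R_m @ R_n.T, R_rel, atol=1e-6))           # True: R_m R_n^T = R_{m-n}
x = torch.randn(2)
print(torch.allclose((R_m @ x).norm(), x.norm(), atol=1e-6))   # True: the norm is preserved
```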

Relative Positional Encodings

While the previous schemes use absolute token positions, what usually matters is the relative position between tokens. Here's a simplified implementation of relative positional encodings:
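A minimal sketch in the spirit of T5’s relative attention bias is shown below. The variable names (context_position, memory_position, relative_position, relative_position_bucket, relative_attention_bias) follow the description that comes after the code; note that T5 itself additionally buckets large distances logarithmically and produces one scalar bias per attention head rather than a full vector.

```python
import torch
import torch.nn as nn

class RelativePositionalEncoding(nn.Module):
    def __init__(self, max_distance, dim):
        super().__init__()
        self.max_distance = max_distance
        # One trainable vector per relative distance in [-max_distance, max_distance]
        self.relative_attention_bias = nn.Embedding(2 * max_distance + 1, dim)

    def forward(self, length):
        context_position = torch.arange(length).unsqueeze(1)    # (length, 1): query positions i
        memory_position = torch.arange(length).unsqueeze(0)     # (1, length): key positions j
        relative_position = memory_position - context_position  # (length, length): j - i
        # Clip to the supported range, then shift so all indices are non-negative
        relative_position = relative_position.clamp(-self.max_distance, self.max_distance)
        relative_position_bucket = relative_position + self.max_distance
        # Look up one encoding vector per (i, j) pair: (length, length, dim)
        return self.relative_attention_bias(relative_position_bucket)
```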

The relative_position matrix has shape (length, length), with each element representing the relative position between tokens $i$ and $j$. This is computed by subtracting an $N\times 1$ matrix context_position from a $1\times N$ matrix memory_position.

The relative_position_bucket shifts values to be non-negative, and position encoding vectors are looked up from the relative_attention_bias tensor.

Relative positional encodings naturally handle variable-length sequences and work well for tasks like translation, making them the choice for models like T5.

Attention with Linear Bias (ALiBi) is a related approach that adds a bias matrix to the attention scores instead of manipulating the input sequence. In the code above, relative_position_bucket is used to look up a sequence of vectors as the positional encoding, which is then added to the input in the attention module. In ALiBi, the input sequence is used directly when computing the attention scores; afterwards, a matrix of relative distances is scaled and added to the attention score matrix before the softmax operation. The scaling factor in ALiBi is computed as $m_h=1/2^{8h/H}$, where $h$ is the head index and $H$ is the total number of attention heads.
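A minimal sketch of the ALiBi bias computation, assuming the slope formula above (real implementations typically restrict the bias to the causal, lower-triangular part of the score matrix):

```python
import torch

def alibi_bias(seq_len, num_heads):
    # Slope per head: m_h = 1 / 2^(8h / H) for h = 1 .. H
    slopes = torch.tensor([1.0 / 2 ** (8 * h / num_heads) for h in range(1, num_heads + 1)])
    positions = torch.arange(seq_len)
    # Distance between query position i and key position j
    distance = (positions.unsqueeze(0) - positions.unsqueeze(1)).abs().float()  # (seq_len, seq_len)
    # Larger distances get a more negative bias; one scaled copy per head
    return -slopes.view(num_heads, 1, 1) * distance   # (num_heads, seq_len, seq_len)

# The bias is added to the attention scores before softmax, e.g.:
# scores = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim) + alibi_bias(seq_len, num_heads)
```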

Further Readings

Below are some further readings on the topic:

  • Ashish Vaswani et al., “Attention Is All You Need”
  • Peter Shaw et al., “Self-Attention with Relative Position Representations”
  • Colin Raffel et al., “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”
  • Jianlin Su et al., “RoFormer: Enhanced Transformer with Rotary Position Embedding”
  • Ofir Press et al., “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation”

Summary

In this article, you learned about positional encodings and their importance in transformer models. In particular, you learned:

  • Positional encodings are necessary because transformers process tokens in parallel
  • Different types of positional encodings have different advantages and limitations
  • Sinusoidal encodings are deterministic and can extrapolate to longer sequences
  • Learned encodings are simple but can’t extrapolate
  • RoPE provides better performance on long sequences
  • Relative positional encodings focus on inter-token distances

