Natural language processing (NLP) has evolved significantly with transformer-based models. A key innovation in these models is positional encodings, which help capture the sequential nature of language. In this post, you will learn about:
- Why positional encodings are necessary in transformer models
- Different types of positional encodings and their characteristics
- How to implement various positional encoding schemes
- How positional encodings are used in modern language models
Let’s get started!
Overview
This post is divided into five parts; they are:
- Understanding Positional Encodings
- Sinusoidal Positional Encodings
- Learned Positional Encodings
- Rotary Positional Encodings (RoPE)
- Relative Positional Encodings
Understanding Positional Encodings
Consider these two sentences: “The fox jumps over the dog” and “The dog jumps over the fox”. They contain the same words but in different orders. In recurrent neural networks, the model processes words sequentially, naturally capturing this difference. However, transformer models process all words in parallel, making them unable to distinguish between these sentences without additional information.
Positional encodings solve this problem by providing information about each token’s position in the sequence. Each token is converted into a vector through the model’s embedding layer, with the vector size called the “hidden dimension”. Positional encoding adds position information by creating a vector of the same hidden dimension.
The positional encodings are combined with the token embeddings, either by adding a position vector to each embedding or, in some schemes, by transforming the queries and keys inside the attention module. During the dot-product attention operation, this position information lets the model weigh relationships between tokens according to where they appear, which allows it to distinguish between sentences with the same words in different orders.
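A minimal sketch of the additive case (the tensors here are random placeholders with illustrative sizes):

```python
import torch

seq_len, hidden_dim = 6, 16
token_embeddings = torch.randn(seq_len, hidden_dim)       # from the model's embedding layer
positional_encodings = torch.randn(seq_len, hidden_dim)   # one vector per position, same hidden dim
model_input = token_embeddings + positional_encodings     # fed into the transformer layers
```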
The most common types of positional encodings are:
- Sinusoidal Positional Encodings (used in the original Transformer): Uses constant vectors built with sine and cosine functions
- Learned Positional Encodings (used in BERT and GPT): Vectors are learned during training
- Rotary Positional Encodings (RoPE, used in Llama models): Uses constant vectors built with rotational matrices
- Relative Positional Encodings (used in T5 and MPT): Based on distances between tokens rather than absolute positions
- Attention with Linear Bias (ALiBi, used in Falcon models): A bias term added to attention scores based on token distances
Each type has unique advantages and limitations, which we’ll explore in detail.
Sinusoidal Positional Encodings
The original Transformer paper introduced sinusoidal positional encodings. Deterministic functions are used to generate unique patterns for each position, as shown in the following formulas:
$$
\begin{aligned}
PE(p, 2i) &= \sin\left(\frac{p}{10000^{2i/d}}\right) \\
PE(p, 2i+1) &= \cos\left(\frac{p}{10000^{2i/d}}\right)
\end{aligned}
$$
Where $d$ is the hidden dimension (must be even), and $i$ ranges from 0 to $d/2-1$. The positional encoding $PE(p, k)$ represents the $k$-th element in the vector for position $p$. The constant 10000 was suggested by the original Transformer paper. It should be larger than the maximum sequence length.
Here’s the PyTorch implementation:
```python
import torch
import numpy as np

def create_sinusoidal_encodings(seq_len, dim):
    N = 10000
    i = torch.arange(0, dim // 2)
    div_term = torch.exp(-np.log(N) * (2 * i / dim))
    position = torch.arange(seq_len).unsqueeze(1)

    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Example usage
seq_len = 512
dim = 768
positional_encodings = create_sinusoidal_encodings(seq_len, dim)
sequence = torch.randn(seq_len, dim)   # placeholder for the token embeddings
sequence = sequence + positional_encodings
```
In this implementation, `div_term` computes $1/N^{2i/d}$ for $i=0$ to $d/2-1$, and the `position` matrix has shape `(seq_len, 1)`. Their product inside the sine and cosine functions is a matrix of shape `(seq_len, dim//2)`, and the results are interleaved into the output matrix `pe` of shape `(seq_len, dim)`.
Sinusoidal encodings have two key advantages: they’re deterministic and can extrapolate to longer sequences than seen during training. The relative position between tokens can be easily computed from the dot product of their positional encoding vectors, thanks to the properties of sinusoidal functions.
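As a quick numerical check of this property, reusing `create_sinusoidal_encodings()` from above (the positions and offsets here are arbitrary):

```python
pe = create_sinusoidal_encodings(seq_len=128, dim=64)
# sin(mw)sin(nw) + cos(mw)cos(nw) = cos((m - n)w), so the dot product
# depends only on the offset m - n, not on the absolute positions
a = torch.dot(pe[10], pe[14])            # offset 4
b = torch.dot(pe[90], pe[94])            # offset 4
print(torch.allclose(a, b, atol=1e-4))   # True
```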
However, these encodings don’t adapt to data characteristics and may be less effective for very long sequences.
Learned Positional Encodings
Models like GPT-2 use learned positional encodings. Here’s the PyTorch implementation:
```python
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, max_seq_len, dim):
        super().__init__()
        self.position_embeddings = nn.Embedding(max_seq_len, dim)

    def forward(self, x):
        positions = torch.arange(x.size(1), device=x.device).expand(x.size(0), -1)
        position_embeddings = self.position_embeddings(positions)
        return x + position_embeddings

# Example usage
model = LearnedPositionalEncoding(max_seq_len=512, dim=768)
```
The `nn.Embedding` layer acts as a lookup table mapping integer indices to vectors of dimension `dim`. In the `forward()` function, the `positions` tensor has shape `(batch_size, seq_len)`; looking it up in the embedding layer produces a tensor of shape `(batch_size, seq_len, dim)` that matches the input `x`. The positional encoding is added to `x` before the attention operation.
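A short usage sketch, with an arbitrary batch size and sequence length:

```python
model = LearnedPositionalEncoding(max_seq_len=512, dim=768)
x = torch.randn(2, 128, 768)   # a batch of 2 sequences of 128 token embeddings
out = model(x)                 # looks up positions 0..127 and adds them to x
print(out.shape)               # torch.Size([2, 128, 768])
```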
Learned positional encodings adapt to data characteristics through training, potentially offering better performance when trained properly. However, they can’t extrapolate to longer sequences and may overfit. They also increase model size since they’re part of the model parameters.
Rotary Positional Encodings (RoPE)
Most modern large language models use rotary positional encodings (RoPE). RoPE encodes relative positions through rotation matrices, with rotation angles that follow a geometric progression across the embedding dimensions. The formulas are:
$$
\begin{aligned}
\hat{x}_m^{(i)} &= x_m^{(i)} \cos(m\theta_i) - x_m^{(d/2+i)} \sin(m\theta_i) \\
\hat{x}_m^{(d/2+i)} &= x_m^{(d/2+i)} \cos(m\theta_i) + x_m^{(i)} \sin(m\theta_i)
\end{aligned}
$$
where $\theta_i = 10000^{-2i/d}$, with $d$ as the embedding dimension, $m$ the position index, and $i$ ranging from 0 to $d/2-1$. In matrix form, it is:
$$
\mathbf{\hat{x}}_m = \mathbf{R}_m\mathbf{x}_m = \begin{bmatrix}
\cos(m\theta_i) & -\sin(m\theta_i) \\
\sin(m\theta_i) & \cos(m\theta_i)
\end{bmatrix} \mathbf{x}_m
$$
where $\mathbf{x}_m$ denotes the pair of elements $(x_m^{(i)}, x_m^{(d/2+i)})$ of the vector at position $m$, and $\mathbf{\hat{x}}_m$ is the rotated pair.
Here’s the PyTorch implementation:
```python
import torch
import torch.nn as nn

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(x, cos, sin):
    return (x * cos) + (rotate_half(x) * sin)

class RotaryPositionalEncoding(nn.Module):
    def __init__(self, dim, max_seq_len=512):
        super().__init__()
        N = 10000
        inv_freq = 1. / (N ** (torch.arange(0, dim, 2).float() / dim))
        position = torch.arange(max_seq_len).float()
        sinusoid_inp = torch.outer(position, inv_freq)
        # Duplicate along the last dimension so the cached cos/sin match the
        # full dimension `dim` after rotate_half() splits the vector in two
        emb = torch.cat((sinusoid_inp, sinusoid_inp), dim=-1)
        self.register_buffer("cos", emb.cos())
        self.register_buffer("sin", emb.sin())

    def forward(self, x, seq_len=None):
        if seq_len is None:
            seq_len = x.size(1)
        cos = self.cos[:seq_len].view(1, seq_len, 1, -1)
        sin = self.sin[:seq_len].view(1, seq_len, 1, -1)
        return apply_rotary_pos_emb(x, cos, sin)
```
The `register_buffer()` calls cache the sine and cosine values for efficiency. The `inv_freq` variable computes $\theta_i$ for all $i$, `position` represents $m$ (indices from 0 to `max_seq_len - 1`), and `sinusoid_inp` holds $m\theta_i$ in a matrix of shape `(max_seq_len, dim//2)`, which is duplicated along the last dimension so that `cos` and `sin` cover the full vector dimension. The `rotate_half()` function converts a vector $(x_1, \dots, x_{d/2}, x_{d/2+1}, \dots, x_d)$ to $(-x_{d/2+1}, \dots, -x_d, x_1, \dots, x_{d/2})$. Then, `apply_rotary_pos_emb()` applies the rotation matrix to the input.
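As a quick usage sketch, assuming queries and keys are shaped `(batch, seq_len, num_heads, head_dim)` to match the broadcasting in `forward()`:

```python
rope = RotaryPositionalEncoding(dim=64, max_seq_len=512)
q = torch.randn(2, 128, 8, 64)   # (batch, seq_len, num_heads, head_dim)
q_rotated = rope(q)
print(q_rotated.shape)           # torch.Size([2, 128, 8, 64])
```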
RoPE offers several advantages:
- The rotation matrix $\mathbf{R}_m$ geometrically rotates the 2D input vector by an angle $m\theta_i$
- The transpose $\mathbf{R}_m^\top = \mathbf{R}_m^{-1}$ represents the reverse rotation. Hence the relative positions can be easily computed as $\mathbf{R}_{m-n} = \mathbf{R}_m\mathbf{R}_n^\top$ (see the numerical check after this list)
- It can extrapolate to longer sequences due to the geometric progression of angles
- Since $\cos^2t+\sin^2t=1$, RoPE preserves vector norms of $\mathbf{x}_m$, aiding training stability
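The relative-position property in the second point can be checked numerically with the class above. In this sketch, the same query and key vectors are placed at every position, so the dot products of the rotated vectors depend only on the offset between positions:

```python
rope = RotaryPositionalEncoding(dim=64, max_seq_len=512)
q_vec, k_vec = torch.randn(64), torch.randn(64)
# Place the same vectors at every position, then apply the rotations
q = q_vec.view(1, 1, 1, 64).expand(1, 512, 1, 64)
k = k_vec.view(1, 1, 1, 64).expand(1, 512, 1, 64)
q_rot, k_rot = rope(q), rope(k)
# The attention score between positions m and n depends only on m - n
a = (q_rot[0, 10, 0] * k_rot[0, 13, 0]).sum()     # offset 3
b = (q_rot[0, 200, 0] * k_rot[0, 203, 0]).sum()   # offset 3
print(torch.allclose(a, b, atol=1e-4))            # True
```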
Relative Positional Encodings
While the previous schemes encode absolute token positions, what usually matters is the relative position between tokens. Here's a simplified implementation of relative positional encodings:
```python
import torch
import torch.nn as nn

class RelativePositionalEncoding(nn.Module):
    def __init__(self, max_relative_position, d_model):
        super().__init__()
        self.max_relative_position = max_relative_position
        self.relative_attention_bias = nn.Parameter(
            torch.randn(2 * max_relative_position + 1, d_model)
        )

    def forward(self, length):
        context_position = torch.arange(length, dtype=torch.long)[:, None]
        memory_position = torch.arange(length, dtype=torch.long)[None, :]
        relative_position = memory_position - context_position
        relative_position_bucket = relative_position + self.max_relative_position
        return self.relative_attention_bias[relative_position_bucket]
```
The `relative_position` matrix has shape `(length, length)`, with each element representing the relative position between tokens $i$ and $j$. It is computed by subtracting the $N\times 1$ matrix `context_position` from the $1\times N$ matrix `memory_position`.
The `relative_position_bucket` shifts these values to be non-negative, and the position encoding vectors are then looked up from the `relative_attention_bias` tensor.
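A short usage sketch with arbitrary sizes; the returned tensor would then be incorporated into the attention computation:

```python
model = RelativePositionalEncoding(max_relative_position=128, d_model=64)
bias = model(10)      # relative encodings for a sequence of length 10
print(bias.shape)     # torch.Size([10, 10, 64])
```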
Relative positional encodings naturally handle variable-length sequences and work well for tasks like translation, making them a natural choice for models like T5.
Attention with Linear Bias (ALiBi) is a related approach that adds a bias matrix to the attention scores instead of manipulating the input sequence. In the code above, `relative_position_bucket` is used to look up a sequence of vectors as the positional encoding, which is added to the input sequence in the attention module. In ALiBi, the input sequence is used directly to compute the attention scores; afterwards, a matrix of token distances (analogous to `relative_position_bucket`) is scaled and added to the attention score matrix before the softmax operation. The scaling factor in ALiBi is computed as $m_h=1/2^{8h/H}$, where $h$ is the head index and $H$ is the total number of attention heads.
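Below is a minimal sketch of building such a bias, following the slope formula above; it uses absolute token distances and illustrative sizes rather than any particular model's implementation:

```python
import torch

def alibi_bias(seq_len, num_heads):
    # Distance |i - j| between query position i and key position j
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs().float()   # (seq_len, seq_len)
    # Per-head slopes m_h = 1 / 2^(8h/H) for h = 1..H
    heads = torch.arange(1, num_heads + 1).float()
    slopes = 1.0 / (2 ** (8 * heads / num_heads))             # (num_heads,)
    # Negative bias: distant tokens are penalized more strongly
    return -slopes.view(-1, 1, 1) * distance                  # (num_heads, seq_len, seq_len)

# The bias is added to the raw per-head attention scores before the softmax
scores = torch.randn(4, 16, 16)                  # (num_heads, seq_len, seq_len), illustrative
scores = scores + alibi_bias(seq_len=16, num_heads=4)
weights = torch.softmax(scores, dim=-1)
```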
Summary
In this article, you learned about positional encodings and their importance in transformer models. In particular, you learned:
- Positional encodings are necessary because transformers process tokens in parallel
- Different types of positional encodings have different advantages and limitations
- Sinusoidal encodings are deterministic and can extrapolate to longer sequences
- Learned encodings are simple but can’t extrapolate
- RoPE encodes relative positions through rotations and can extrapolate to longer sequences
- Relative positional encodings focus on inter-token distances